A Genetic Algorithm for Computing Ramsey Numbers: Update

Friends_strangers_graph

All the 78 possible friends-strangers graphs with 6 nodes. For each graph the red/blue nodes shows a sample triplet of mutual friends/strangers.

In my last post on this topic, I discussed how I was working on a genetic algorithm to search mathematical graphs for elusive properties called Ramsey Numbers. (For a refresher on genetic algorithms,  visit here, and for a refresher on Ramsey Numbers, visit here). I’ve been doing some work on it since then (check out the code here), and I thought I would describe some improvements and further progress I’ve made in this area.

New features:

  • colorings dumped to a file at the end of each run
  • ability to load in data sets from file, further refining of the data than starting from scratch each time

The next problem I ran up against while working through this was that even if I am able to load in previously analyzed data, I still only have one fitness function that checks a static set of edges. As I see it, there are two ways to solve this:

  • Make the current fitness function dynamic; that is, it tests a different set of edges every time. However, this is counterproductive to the purpose of the program “eliminating” certain sets of edges in each “round”. However, this would be easier to maintain than the other option, which is
  • Make a “FitnessHandler” method that takes in a value for which method to run, and uses that to determine what set of edges to test. However, this would lead to a lot of extra code and overhead. I’m thinking having a static variable at the beginning of each run with what “fitness method” to start on, so that it doesn’t have to start on round one each time.

I haven’t fully decided which of these I will go with. I feel like the second one fills my purpose of methodically “weeding out” the improbable graphs, but its going to be a lot of extra work. Oh well, nothing worthwhile ever came easy…

Leave a note here or on my github if you have suggestions!

Advertisements

Introduction to Genetic Algorithms

In response to my previous article about genetic algorithms for Ramsey theory, a few readers asked me to give a bit more of an introduction about genetic algorithms. Here you will find a beginner’s look at what a genetic algorithm is, what it is useful for, and how you can use one in your own work.

What is a genetic algorithm (GA)?

To begin with, let’s talk about what a genetic algorithm is, on its most basic level. The word genetic suggests something to do with biology. If you can reach way back to your high school science classes you might remember the basic genetic process: plants and animals are born, mate with one another, and create new generations. Over time, these generations tend to emphasize certain traits essential to survival while downplaying the weaker (recessive) traits. A genetic algorithm works much the same way. We come up with a population of possible solutions to a problem, “mate” them together, and look at our new solutions to see if they are any better. Over time, we can create solutions that converge to better and better values. This is useful when a problem is too complex to search all possibilities. Below you will see an image describing how a simple GA works.

GeneticAlgorithms (1)

Crossovers, fitness, mutation, oh my!

Before moving on, it would be useful to define some of the terms used above.

  • chromosome:  a proposed “solution” to the problem at hand. This is usually represented by a bitstring, that is to say, a list of 0’s and 1’s. It can also hold any other information that is crucial to our problem.
  • population: a collection of these “chromosomes” that we use to combine together and make new generations. The population represents the set of all the ideas that we have about this problem at the moment
  • fitness: the fitness of a chromosome is a number representing how good it is. That is, the fitness represents how good this solution is at solving our problem. An example of this is if you were trying to find the shortest route to get somewhere. Fitness for a problem like this would be a number representing the distance it takes for each path you could choose. In the end, you want to find the path with the shortest distance. Ultimately, the fitness function is defined by the programmer, and it can measure whatever you want.
  • mutation: mutation is the random entering of new data into the gene pool. Just like in biology, sometimes mutations occur and create things that were never intended. However, sometimes this inadvertent change can be to our advantage if we’re getting stuck. A common example of a mutation is to just change a small part of the chromosome, and move on.
  • crossover: the “crossover function” represents the operations that we do in order to “mate” two (or sometimes more) of our chromosomes. There are many different types of crossover techniques, some better for certain situations than others, but at the heart of it a crossover just represents a way to combine the “traits” of two or more chromosomes into a new “baby” chromosome to be inserted into our population. This is the bulk of a genetic algorithms’ work. As the population evolves and new generations of solutions are created, the goal is to keep around the solutions with “good” fitness, and get rid of the chromosomes that aren’t doing so well. The crossover is also user-defined, and tweaking it to optimize results is common.
  • termination condition: all good things must come to an end. While we in theory could just leave our algorithm running forever, that would not be very helpful because as I will discuss later in this post, solutions can’t keep getting better forever. In addition, sometimes you won’t actually get to the “best solution” and instead will converge on what’s called a “local minima/maxima”. When this happens, it means you’ve found an “okay” solution, but the population got flooded with many similar chromosomes and couldn’t improve itself after that point. Think of it this way: if everyone in the world were exactly the same, would you expect any different from their children? The termination condition you choose depends on the problem, but some common terminating conditions are:
    • a) finding the best solution (ideal)
    • b) running a preset number of generations and using that as a cutoff point
    • c) quitting after every member of the population falls within a certain similarity range (this means that no new/better solutions are likely to be produced)

But why would I want to do all this?

I know this seems like a pretty complicated process. Why not just use a computer to figure out the real solution instead of dancing around it in this complicated manner? Well, it turns out that’s not always possible…

Enter the Traveling Salesman Problem (TSP). While there are many problems that are still very hard for us to solve with computers, this is one of the best known and most studied. It turns out that using a genetic algorithm is actually a pretty good approach and is much quicker than running an exhaustive search of all paths possible. Remember when I used the example of finding the shortest distance to go somewhere? That’s pretty much what this is. I’ve done a blog post on this before (see here). In the traveling salesman problem, there is a man that needs to visit a list of different cities, but he wants to get there and back as quickly as possible. Therefore, you need to find the shortest route to hit all the cities and return home. This isn’t always as easy as it sounds, and as the number of cities grows, so too does the time it takes to find the right one. Very quickly it becomes implausible to check every possible path, so we use a genetic algorithm to help us weed out the bad ones.

Another example of how we would use a genetic algorithm is for graph theory problems that also have a huge number of possible solutions. You can see how I applied a genetic algorithm to the problem of Ramsey numbers here.

Okay, so how do I use this in my project?

It’s pretty simple. I know it looks complicated, but once you get everything apportioned out correctly it’s not that bad. Things to think about when applying a GA to a hard problem (we will use the Traveling Salesman Problem as an example here):

  • First decide what your “chromosomes” will look like. These are the meat of your population, and these are what will be mutated and crossovered in order to create better solutions. For the TSP, we will use a a sequence of numbers denoting what cities to visit and in what order. (For example, “1 5 4 2 3 1” describes a way you could make a circuit through 5 cities).
  • Decide on how fitness will be evaluated. This is important because members with better fitness will be the ones that stick around in your population and (hopefully) make your solutions better. In our problem, fitness refers to how far our chosen path takes us. The lower the fitness score, the shorter our travel is. We want to minimize this number with our algorithm.
  • Next we need to figure out what our crossover will be. This is very important to consider, because we want something that will take parts of both of our “parents” and combine them together in some way to make a “baby”. For traveling salesman, we can’t just grab some from one parent and some from the other, because we run the risk of duplicates. (we don’t want a path to look like “1 2 2 3 4 1”, city 2 is visited twice and we never get to city 5). Therefore, we have to use more sophisticated methods. I won’t go into them here but if you’re interested check out this wikipedia page for more info on crossover techniques.

What are the downsides?

“Every rose has its thorn,” as they say, and genetic algorithms are no different. What seems like a cure-all has its hindrances as well. As I mentioned before, GAs have a problem of converging too early on a value that’s not quite ideal. Often this takes many repetitions of running the program and fine tuning things such as population size and mutation rate. Sometimes happening upon the “best” solution is a product of randomness. But in a GA, we use this randomness to our advantage as much as we can. Another inherent problem is the fact that we have to program this framework around it. Using a GA is only viable if its an extremely complex problem that cannot be solved more efficiently. For example, using a GA to solve a 5 city TSP would be foolish. We can run through all those possibilities very quickly. On the other hand, bumping that up to 50 or 500 cities proves a much harder challenge.

In addition, using a genetic algorithm means you have to find a type of chromosome, fitness function, population size, and crossover that works for you. Pick the wrong values for these, and the program will behave less than optimally. Experimentation and continually tweaking the parameters of your model is necessary. In some genetic algorithms, the fitness function for even one chromosome can take quite a while to compute. This makes some genetic algorithms very slow to apply.

Where can I learn more?

I hope this has covered the basics of genetic algorithms and interested you in learning more. If you would like to see a project I have done involving GAs you can read A Genetic Algorithm Approach to Ramsey Theory, and for a broader range of discussion about the theory and applications of GAs check out the book An Introduction To Genetic Algorithms by Melanie Mitchell.

Hope you enjoyed this introduction to the wide world of genetic programming; if you have questions or suggestions please leave them in the comments below!

A Genetic Algorithm Approach to Ramsey Theory

Background and Introduction

Ramsey Theory is the study of combinatorial problems proposed by Frank Ramsey in 1930. The version of his problem as applied to graphs asks, “what is the smallest complete graph such that there is at least one clique of a given size and color?” This can be represented by R(x,y) where x and y are the sizes of the cliques to find for red and blue, respectively.

R(3,3) means to find the smallest graph such that we are guaranteed either a red 3-clique (3 vertices connected together at every point, or a complete graph on 3 vertices) or a blue 3-clique. This has been shown to be 6. We can show this by providing an example that a complete graph on 5 vertices (K_5) can be colored such that it contains no red or blue 3-cliques. Then, we show that for every coloring of K_6, there must necessarily be a red triangle or a blue triangle in the graph. This can be verified computationally or combinatorially. (See the nice example here: Theorem on friends and strangers)

While the problem is relatively easy to solve for a small value like R(3,3), the complexity of the problem increases greatly when considering larger clique sizes. For instance, R(4,4) is 18, and we only know that the value of R(5,5) is somewhere between 43 and 49 inclusive. Because the number of cases one must check increases exponentially with each increase in clique size, it becomes impractical for traditional computing very quickly.

Therefore, we will utilize a genetic algorithm in an attempt to verify and perhaps improve the lower bound of R(5,5). To do this, the program must complete the following:

  • Use the genetic algorithm to test for graphs that have low numbers of cliques

  • Perform exhaustive testing on these graphs, and if we can find even one coloring where a clique does not exist, we will have shown that R(5,5) > 43.

 

Implementation

My algorithm is implemented using Java. Graphs are represented using an adjacency matrix, and a coloring of the graph (ColorMatrix) are 2D boolean arrays. In this scheme, coloring[x][y]==0 represents a blue edge from x to y and coloring[x][y]==1 represents a red edge. The colorings are symmetric; that is, coloring[x][y] == coloring[y][x]. These colorings form a base for my Chromosome class, which is used in the genetic algorithm to represent a “piece” of information that can be mated with other chromosomes, mutated, and scored based on the number of cliques found. A population stores multiple chromosomes.

My basic algorithm works as follows:

  1. Create a population of random colorings of K_{43}
  2. Check a set of edges for 5-cliques (how these sets are determined will be discussed later). Assign a fitness score based on the number of cliques found

  3. Crossover/mutate the chromosomes

  4. Run this many times until every member of the population has fitness 0 (no cliques found in that set of edges)

  5. Take these “possible zeroes” and begin to test more edges to find cliques

    1. if we find any new cliques, this coloring is no longer any good to us (we are looking ultimately for a coloring with no cliques whatsoever)

    2. pare down the population to a set of graphs that have no cliques in set 1 and set 2

    3. eventually this leads to an exhaustive search but by that point there will be only a few “likely” candidates that passed every test set (this saves computation time rather than checking every clique right off the bat)

 

Calculating Fitness

The fitness function for our genetic algorithm checks a set of edges in the graph to see if there are any cliques. We test from “test sets” instead of checking every clique every time. The reasoning behind this is that if a clique is found on the first test, we need not go any farther. The multiple-stage fitness testing allows us to prune out “bad” data and get more likely solutions to our problem. The first test set data is computed as follows (each 5-tuple represents a set of five edges to check for a clique — that is, all the same color)

(0,1,2,3,4), (0,2,3,4,5), (0,3,4,5,6),

\dots (0,39,40,41,42), (1,2,3,4,5), (1,3,4,5,6),

\dots, (38,39,40,41,42)

 

We use a simple method of counting up iteratively the first two items in the list, then using increasing consecutive numbers for the rest of the clique. This way, lots of cliques are tested, but this test set leaves out cliques that are not formed from consecutive edges (e.g., (0,2,4,6,8) would not be tested in this list). Once the list of edges to check is generated, we simply test the edges on each node in the set. If the colors are all the same, then we have a clique and add to our fitness.

The second test set would be applied after the population has converged for the first time to zeroes, and this time counts up by multiples of 2 (so (0, 2, 4, 6, 8) would be a viable choice in this set).

An alternative fitness function was considered that involves inserting vertices into a binary search tree based on their coloring values. To do this, first one creates a random permutation of edges (values 0-42). Then using the colorings (0 and 1 instead of positive and negative) we can insert the nodes either left or right in a binary search tree with the first node in the list as the root. By looking at the output of this tree, we can get a visual representation of color patterns. The idea is that by finding a long “path” in one of these trees we know that it has seen the same color many times, thus suggesting a clique exists.

In the below example, the set (0,1,2,3,4) forms a long “path”, meaning that each edge is the same color.

Figure 1. Binary search tree built from a sample coloring of (0,1,2,3,4,5)

Unfortunately, this implementation is not currently functional in my program.

 

Parent Selection and Crossover

The backbone of any genetic algorithm is the crossover. This models the mating process in traditional biology and is used to make the Chromosomes (solutions) improve themselves over many generations by keeping successful traits and discarding others. Parents are selected in my model randomly from the population, but there is a small chance (about 5%) that the best member of the population would be chosen as a parent instead. This provides a slight bias toward better traits but still allows for plenty of randomness and new data.

The crossover is a simple one. Using two Chromosomes (colorings of the graph) we use a random value for each x, y in the colorMatrix to determine whether the “mom” or the “pop” will contribute their genes (coloring[x][y]). There is a 50-50 chance of inserting mom(x,y) into the baby as opposed to pop(x,y). The “baby” Chromosome results in a mixture of data from both parents, and if the new Chromosome’s fitness (number of cliques) is lower than the current worst value in the population, we replace that value with our new baby. This allows the population to improve over time.

Our mutation method simply flips a bit in the coloring based on a random value. The mutation rate was 0.08, or 8%.

 

Problems and Solutions

One of the things I quickly noticed when building this project was that the fitness function, although it does not brute force every possible clique check, can still take a prohibitively long time at higher population sizes. The getFitness() function is called a bit too liberally, and saving the fitness value to retrieve later would improve the efficiency in this area.

Another downfall is that since we are only testing a limited set of cliques each time, it is possible (although desirable, given the current approach) to “learn the data”, and evolve a solution that contains no cliques in the tested area, but may have cliques elsewhere. While this seems like a problem at first glance, we can use this to our advantage because having the algorithm “learn” each test data set is what allows us to prune the search space and find possible colorings that do not contain cliques. If we decided this was an undesirable behavior however, we could change the fitness function to include x random permutations of edges each time. The problem with this method is that we could never be sure that a clique certainly did not exist in a certain area over multiple rounds, as the sets tested would be randomized each time. Finally, it is possible to go with the old-fashioned approach and simply check every clique after all.

 

Future Work

While this program currently provides a proof of concept and implementation up to the first round of test data, there is much room for expanding this project. First of all, multiple rounds need to be completed within the algorithm, so that we can generate colorings that are more likely to have no cliques. Other things that would be useful are outputting data to a text file for further reading/analysis, and performing general optimizations throughout my code for performance.

In addition, I would really like to get a working implementation of the Alternate Fitness function implemented and working. I feel like this is an interesting direction to take the project and it would be helpful to have a more visual representation of where cliques are occurring.

While the population initialization currently creates a population of random colorings, it would be interesting to see what happens when we change the ratios of red edges to blue edges in the chromosomes, starting at initialization. What if each coloring was required to be within a certain amount red and a certain amount blue? This is definitely something to check into as well.

 

Conclusion

Currently, we are able to generate lists of colorings that do not have cliques within the first test set of edges. As we add on more edge sets to test, we will create more constrained lists. Further testing will reveal if any of these “possible” solutions truly do not contain such cliques. For this, an exhaustive search is the only way. Although it is on the order of 900,000 edge combinations ({43 \choose 5} = 962598) to check, this is time consuming but doable, especially if we have a relatively small set of graphs to check.

Finally, the code is available on https://github.com/nelsonam/ramsey if anyone is interested. Share your comments in the box below and if you have any questions feel free to ask!

Holiday Maths Problem #2 – Solution

So back in my previous post I discussed the problem of the Traveling Santa. He wants to deliver gifts as quickly as possible to seven different towns. At its essence, this is a Traveling Salesman Problem, a well known and very difficult problem in graph theory and combinatorial optimization. I decided to take my own crack at the problem and see if I could devise a solution.

I wrote a program to tell me the shortest Hamiltonian cycle through the 7 towns. I used the networkx library to set up my graphs and then generated a list of all the Hamiltonian cycles. Then using the weights provided by each edge I calculated the total distance for each of the paths. After that, I was able to narrow it down to the shortest ones.

One solution that my program found was the cycle that visits [7, 4, 2, 1, 5, 6, 3, 7] with a distance of 39. You can check out the code here, and I’d be interested to hear about your ideas or alternative implementations.

Holiday Maths Puzzle #2

Merry Christmas and Happy Holidays to everyone, I’m back with a new holiday themed puzzle to distract your mind away from annoying family members or holiday stress.

#2: Santa needs to deliver his presents, and fast! He has a map of the towns he needs to go to, but he doesn’t have much time left before Christmas is over! Help him deliver some last minute gifts in the shortest distance possible. You must visit every town at least once (duplicates are allowed, but for bonus points see if you can visit each town only once and still be as quick as possible). Each node in the graph is a town and the weight of each edge represents how far it is from one town to another. (Note: graph is not to scale)

towns

Day 17 and 18: SNCURCS – State of North Carolina Undergraduate Research and Creativity Symposium

So I didn’t post yesterday because I was travelling. I was at SNCURCS this weekend (apparently it is pronounced “snickers”, which was news to me) to present some work that I did at my REU last summer. I was pretty nervous because I thought that I didn’t remember the work very well since its been a while since last summer.

As it turned out, the talk went fairly well, even though I had to speed through my slides and risk losing people due to the time constraints. I also realized that \today in LaTeX does in fact print the current date, but only at the last time you compile it. So it ended up the date on my slides showed up as the day before, but no big deal. The group of people that I was presenting with also did graph theory related projects, so I thought that was pretty cool. I saw an interesting presentation  about quantifying and mapping not only our social networks, but also our “meta-network”, which includes shared knowledge and the like. Network science sounds like pretty interesting stuff after hearing him talk about it.

My problem that I presented on was based on a variation of the game Nim but played on mathematical graphs. Instead of removing tokens from a pile we can remove a number of edges connected to a single vertex in the graph. The player that removes the last edge is the winner. We used these rules to analyze and determine the Sprague-Grundy number, or nimber of different states in the game and different types of graphs. I’m planning to do a whole post on this one of these days.

A side note: I’m realizing that this post-every-day schedule has led to some good ideas and writing experience but also has led to (in my opinion) a drop in quality. It was a learning experience for sure, but I’m not going to push myself to get out a post every day for the rest of the month I think. I have graduate school applications and school and work that I need to attend to. I’ll still be posting, I’ll just save up until I have a few good ones. Thanks for reading!

Day 16: Research Things

Tomorrow I’m giving a talk about my research. I’m cozied up in my hotel room, and ready for a good sleep and hot breakfast in the morning if we’re being honest. It wasn’t a bad drive down here (3 hours), but I felt like I was racing the sun the whole way, because I was arriving just as it had gotten dark. Nature always wins.

Things haven’t been going too well personally, but giving this presentation tomorrow will be good research experience. It’s called Champion Spiders in the Game of Graph Nim, and involves a modification of a classic game to give it a graph theory bent. It’s pretty interesting stuff.

Come back tomorrow and I’ll talk about it in more detail.