# A Quick Introduction to HDFS

The Hadoop filesystem is called HDFS, and today I’m going to give a short, beginner-friendly introduction to how it works.

The Hadoop Distributed File System (HDFS) sits on top of a Hadoop cluster and facilitates the distributed storage and access of files. When a file is stored in HDFS, it is split into fixed-size chunks called “blocks” (64 MB or 128 MB by default, depending on the version; only the final block of a file may be smaller). The blocks are scattered among the nodes of the cluster, each of which runs a daemon called a datanode. One special node, called the namenode, holds the metadata about the blocks and their whereabouts.
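As a rough sketch of the splitting step (the 128 MB block size here is just a common default; clusters can configure it differently):

```python
# Hypothetical illustration of how HDFS carves a file into blocks.
# 128 MB is a common default block size; real clusters may configure it differently.
BLOCK_SIZE_MB = 128

def split_into_blocks(file_size_mb):
    """Return the sizes of the blocks a file would be split into."""
    full, remainder = divmod(file_size_mb, BLOCK_SIZE_MB)
    blocks = [BLOCK_SIZE_MB] * full
    if remainder:
        blocks.append(remainder)  # only the final block may be smaller
    return blocks

print(split_into_blocks(300))  # a 300 MB file -> [128, 128, 44]
```

Each of those blocks would then be stored (and replicated) on different datanodes around the cluster.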

To protect against network or disk failure, each block is replicated across the cluster (three copies by default). This makes the data redundant: if one datanode goes down, there are other copies of its data elsewhere. When that happens, new copies of the affected blocks are created, so that there are always three.

The namenode is even more important, because it holds the metadata for every file. If the namenode becomes unreachable due to a network issue, all of the data in the cluster is temporarily unavailable. Worse, if the disk on the namenode fails, the data may be lost forever, because the namenode alone knows how the pieces of each file fit together. We’d still have all the blocks on the datanodes, but we’d have no idea which file they belong to.

To get around this issue, one solution is to also write the namenode’s metadata to a network file system (NFS) mount. A better alternative is to run an active namenode alongside a standby namenode. This way, there is a “backup” ready to take over if something goes wrong.

Some commands:

• To list files on HDFS: `$ hadoop fs -ls`
• To put files on HDFS: `$ hadoop fs -put filename` (this takes a local file and puts it on HDFS)
• To display the end of a file: `$ hadoop fs -tail filename`
• Most bash commands will work if you put a dash in front of them: `$ hadoop fs -cat`, `$ hadoop fs -mv`, `$ hadoop fs -mkdir`, etc.

# Introduction to P vs. NP

It’s a Millennium Prize Problem (reward: $1,000,000) and the subject of a movie. But what is this mysterious problem, and what makes it so hard?

### Creation vs. Verification

Pop quiz: Choose which one of these tasks takes longer.

• A) someone hands you a jigsaw puzzle and asks you if it is finished or not
• B) someone hands you a jigsaw puzzle and asks you to put it together

Answer: B. While you can determine whether a puzzle is finished in a split second (it is easy to tell if pieces are missing), it is not so easy to put all those pieces together yourself. We have problems like this in computer science and mathematics as well. Think about it: what if there were a method that could solve a jigsaw puzzle as fast as you could check whether it’s right? Wouldn’t that be amazing? That’s what this problem seeks to find out: whether such a method exists.

### The Class P

There is a class of problems that can be solved in “polynomial time”, and we call this class P. Polynomial time means that as the size of the problem grows, the time it takes to solve it grows no faster than some polynomial in that size. Informally, we can say that problems in P can be solved “quickly”.

Here’s an example: say we have a list of numbers in front of us, and we want to pick out the greatest one. At the very worst, we start at the beginning of the list and compare every number until we reach the end. Yes, the list may be very, very long, but the time it takes to check every number grows only linearly: more numbers do not make the problem exponentially harder. Comparing the growth rates of linear time $O(n)$ (graph A) and a higher-degree polynomial like $O(n^2)$ (graph B), we see that linear time is itself bounded by a polynomial, so this problem is in P.

Graph A

Graph B
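The linear scan described above can be sketched in a few lines of Python:

```python
def find_max(numbers):
    """Scan the list once, keeping the largest value seen so far.
    One comparison per element, so the running time is O(n)."""
    best = numbers[0]
    for x in numbers[1:]:
        if x > best:
            best = x
    return best

print(find_max([3, 41, 7, 26, 19]))  # -> 41
```

Doubling the length of the list doubles the number of comparisons, which is exactly the linear growth shown in graph A.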

### The Class NP

“Mom, she got more than me!”

The class of NP problems, put simply, contains problems whose solutions can be verified in polynomial time. NP stands for “nondeterministic polynomial time” and harkens back to a computational construct known as a Turing machine. Consider this example: a woman is dividing up cookies of different sizes into two groups for her two young children. Naturally, they are very picky about things being fair, so she needs to make sure that each child gets exactly the same amount of cookie (by weight). While this problem can get quite hard to sit down and solve, if someone showed us two piles of cookies we could quickly add up their weights and tell whether they were the same or not.

If you have only 2 cookies to separate out, sure, it’s easy. Put one cookie in each pile and that’s the best you can do. But what if you had 5 cookies? 10? Keep in mind that each cookie weighs a slightly different amount, and we want the two groups of cookies to be as equal in weight as possible. Adding more cookies increases our options for separating them exponentially. This is not good. If she has 5 cookies, the number of different ways to separate them is $2^5 = 32$, but if she doubles the number of cookies to 10, the possibilities skyrocket because $2^{10} = 1024$.

Let’s consider, for fun, the number of possibilities if she had 100 cookies! $2^{100}=1{,}267{,}650{,}600{,}228{,}229{,}401{,}496{,}703{,}205{,}376$. This…is pretty large. But wait! Computers can do that in a heartbeat, right? They’re way faster than we are! That’s all well and good, until you realize that the age of the entire universe is (only) about $4.5 \times 10^{17}$ seconds. So even if our computer could run through millions of combinations per second, checking them all would still take far longer than the entire age of the universe. This is what makes these sorts of problems so hard.
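To make the contrast concrete, here is a small Python sketch of the cookie problem (the weights are made up for illustration): verifying a proposed split takes one quick pass over each pile, while finding a fair split from scratch tries all $2^n$ assignments.

```python
from itertools import product

def verify_split(pile_a, pile_b):
    """Verification is fast: just sum the two piles and compare (linear time)."""
    return sum(pile_a) == sum(pile_b)

def find_fair_split(weights):
    """Brute force: try all 2^n ways to assign each cookie to one of two piles."""
    for assignment in product([0, 1], repeat=len(weights)):
        pile_a = [w for w, side in zip(weights, assignment) if side == 0]
        pile_b = [w for w, side in zip(weights, assignment) if side == 1]
        if verify_split(pile_a, pile_b):
            return pile_a, pile_b
    return None  # no perfectly fair split exists

cookies = [3, 1, 1, 2, 2, 1]  # total weight 10, so each pile should weigh 5
print(find_fair_split(cookies))
```

With 6 cookies the loop checks at most $2^6 = 64$ assignments; with 100 cookies it would be $2^{100}$, which is the astronomical number above.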

### P = NP?

So now that you know a little about what P and NP are, you may be wondering what all this business about “$P=NP$” is. What does it mean, after all? If someone were to prove that P equals NP, it would mean that every problem in the class NP (the hard ones) can be solved in polynomial time, just like the P problems. This would have huge implications for all sorts of applications, not just in the mathematical world. For instance, a cornerstone of many security systems rests on the difficulty of factoring very large numbers. If $P=NP$, efficient methods for factoring these numbers would exist, and financial systems everywhere would be vulnerable.

This also applies to many logistical and optimization problems (see my posts on intractable problems here and here). For the hardest problems in NP (the NP-complete ones), an “easy” method for solving any one of them can be adapted to solve every problem in the class. We don’t need to find an easy method for every problem in the world. We only need one. That shouldn’t be too hard, right? Unfortunately, researchers have been working on this question for decades without success. There have been many attempted “proofs”, but all have ultimately fallen short. The problem is so difficult that the Clay Mathematics Institute is offering $1,000,000 for a correct proof one way or the other.

Although this question has not been officially settled one way or the other, many professionals believe that $P \neq NP$. This would mean that problems in NP are inherently “harder” than problems in P, and in the worst case require an exhaustive (brute-force) search to find a solution. No matter how good our computers get, these problems will still be very difficult to solve, especially at large sizes.

I hope you enjoyed this little introduction to P and NP and if you have any comments, questions or corrections, please leave them in the comments below!

# Beeminder and Self-Tracking: Five Months In

So back in January, I started using a little service called Beeminder to track my goals and encourage me to do more of the things I’ve been putting off. I also experimented with several other Quantified Self tools, and learned a lot about how simply tracking what we do day to day can open our eyes to things we need to improve and things we perhaps do too much of (for me, it’s instant messaging with my friends). It’s been five months since I started this journey of learning about myself, and I thought I’d give an update on how it’s been going.

My findings: I couldn’t be happier with the results.

I know it’s easy to get started on a new diet or workout program and feel all motivated and enthusiastic at first. That’s how New Year’s resolutions start, after all. We get all excited about changing our lives, and then a few days or weeks later, it’s back to the same old same old. When I started tracking myself and using Beeminder for my goals, I was afraid a similar thing would happen. It’s just a fad, I thought. My newest obsession. It will subside in a few weeks. I’m happy to say that five months later, I am still going strong and have made a lot of progress.

Wait, what is all this Beeminder/Quantified Self stuff? What are you talking about?

Glad you asked. I did a post on all this stuff a few months ago called Quantifying the Self and discussed the nuts and bolts of all this, but here’s a quick review:

Beeminder is a web service that allows you to track your goals by plotting progress points on a graph each day. You try to stay on “the yellow brick road” (that is, the goal you set for yourself — say, doing 20 pushups a day) and if you fall off the road, in order to recommit and get back on the road you pledge money (you’re basically “betting” that you will be able to keep up with your goal). If you fall off the road again, you pay up. Here’s a picture:

As you can see, the data points are staying above the yellow line, meaning I am on track for my goal. This screenshot was taken back in February, after I had been tracking my reading habits for only a month. Let’s take a look at my graph now:

There are a lot more data points, and I’ve racked up over 2000 minutes read (that’s 9 books) since January! There are a few flat parts (in March I did a lot of traveling and didn’t find much time to read), and because of this I actually derailed once. But I pledged my $5, and so far I’ve been doing well once again. I really enjoy seeing the graph and the data points grow, and the Android app has been absolutely essential to my success. It doesn’t get much easier than entering a few numbers into the app every day.

Of course, this is just my most successful graph. I have others that I have utterly failed at, like flossing my teeth:

This one…has a lot more flatlining. A product of good old akrasia. As you can see, I’ve derailed on this one and have been derailed for quite some time…I think I’m about ready to pony up the cash and try again, though (for real this time!). Some people have told me that paying the service money when you fail your goals seems cruel or manipulative on their part, but I don’t see it that way at all. This is something I want to do, and if I don’t follow through with it, then there has to be some sort of pain associated with that to deter me from failing again. And besides, this company has done so much for me, and I use the service so much, that I am happy to give them a few bucks here and there. The staff is great and responsive and the website is always getting updates. They deserve it.

In addition, I’ve found from my pedometer app on my phone that I’ve walked over 600,000 steps this year! That’s over 300 miles. The day to day walking may not seem like much, but adding it all up like that certainly has a big impact on me and motivates me to keep going and do more.

My most recent project involves taming the beast that is my email inbox. I’ll admit it: I’m an email hoarder. While other people keep their inboxes neat, tidy, and organized, I have a giant deluge of thousands of emails just sitting there, taking up space and making it impossible to find anything. Usually, if I don’t find an email interesting (such as an advertisement or a newsletter), I won’t even click on it, so there are hundreds of unread emails as well. This is a product of my laziness over time, and it just keeps getting worse. I’ve decided it’s finally time to do something about it. I set up a Beeminder goal to track my inbox size, and I’m going to either file away or delete 500 emails per week until my inbox reaches the fabled Inbox Zero. Once I reach that milestone, the challenge will be to keep it there, and keep my inbox manageable instead of overflowing like it was before. This means less stress, makes important emails easier to find, and ensures that if something is in the inbox, I need to deal with it right away. I feel like this will bring great gains in productivity and overall happiness.

And just in the interest of transparency, my goals are all publicly viewable on my Beeminder page. Displaying your successes and failures this way makes you accountable not just to one person, but to the internet at large. There’s something strangely motivational about that.

What have I learned?

As for some of the other services, they didn’t stick quite as well. But the trial period of testing out these new things definitely taught me a lot about how my mind works and how we can battle the beast of distraction and bad judgement that rears its ugly head daily. Humans aren’t necessarily rational creatures by nature, but we can learn what mistakes we make and how our minds try to trick us. Then we can trick them right back. I just bought the book Thinking, Fast and Slow by Daniel Kahneman, which discusses these “cognitive biases” and ways to get around them at length. The result? A happier, healthier, more productive life.

Final Thoughts

If you’re curious about what self-tracking can do for you, I urge you to try tracking one simple thing in your life, something you already do, for just a week. It becomes like a game to try to “beat your best score”, and it’s really fulfilling to see your progress over time. I recommend trying this to anyone who is interested in achieving their goals and improving their life. It’s surely changed mine.

# Introduction to Genetic Algorithms

In response to my previous article about genetic algorithms for Ramsey theory, a few readers asked me to give a bit more of an introduction about genetic algorithms. Here you will find a beginner’s look at what a genetic algorithm is, what it is useful for, and how you can use one in your own work.

What is a genetic algorithm (GA)?

To begin with, let’s talk about what a genetic algorithm is, on its most basic level. The word genetic suggests something to do with biology. If you can reach way back to your high school science classes, you might remember the basic genetic process: plants and animals are born, mate with one another, and create new generations. Over time, these generations tend to emphasize certain traits essential to survival while downplaying weaker ones. A genetic algorithm works much the same way. We come up with a population of possible solutions to a problem, “mate” them together, and check whether the new solutions are any better. Over time, we can breed solutions that converge toward better and better values. This is useful when a problem is too complex to search all possibilities. Below you will see an image describing how a simple GA works.

Crossovers, fitness, mutation, oh my!

Before moving on, it would be useful to define some of the terms used above.

• chromosome: a proposed “solution” to the problem at hand. This is usually represented by a bitstring (a list of 0s and 1s), but it can also hold any other information that is crucial to our problem.
• population: a collection of these “chromosomes” that we combine together to make new generations. The population represents the set of all the ideas we have about the problem at the moment.
• fitness: the fitness of a chromosome is a number representing how good it is; that is, how well this solution solves our problem. For example, if you were trying to find the shortest route somewhere, the fitness of each candidate path would be its total distance, and you would want to find the path that minimizes it. Ultimately, the fitness function is defined by the programmer, and it can measure whatever you want.
• mutation: the random entering of new data into the gene pool. Just like in biology, mutations sometimes occur and create things that were never intended. However, this inadvertent change can be to our advantage when we’re getting stuck. A common mutation is to randomly change a small part of a chromosome and move on.
• crossover: the “crossover function” represents the operations we perform to “mate” two (or sometimes more) of our chromosomes. There are many different crossover techniques, some better suited to certain situations than others, but at heart a crossover is just a way to combine the “traits” of two or more chromosomes into a new “baby” chromosome to be inserted into the population. This is the bulk of a genetic algorithm’s work. As the population evolves and new generations of solutions are created, the goal is to keep the solutions with “good” fitness around and get rid of the chromosomes that aren’t doing so well. The crossover is also user-defined, and tweaking it to optimize results is common.
• termination condition: all good things must come to an end. While in theory we could leave the algorithm running forever, that would not be very helpful because, as I will discuss later in this post, solutions can’t keep getting better forever. In addition, sometimes you won’t actually reach the “best solution” and will instead converge on what’s called a “local minimum/maximum”. When this happens, you’ve found an “okay” solution, but the population got flooded with many similar chromosomes and couldn’t improve itself past that point. Think of it this way: if everyone in the world were exactly the same, would you expect any different from their children? The termination condition you choose depends on the problem, but some common terminating conditions are:
• a) finding the best solution (ideal)
• b) running a preset number of generations and using that as a cutoff point
• c) quitting after every member of the population falls within a certain similarity range (this means that no new/better solutions are likely to be produced)
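Putting those terms together, a bare-bones GA might look like the following sketch. The fitness function here (counting 1s in a bitstring) and all of the rates and sizes are arbitrary toy choices for illustration, not anything from a real application:

```python
import random

random.seed(42)

CHROMOSOME_LENGTH = 20   # toy choice: chromosomes are 20-bit strings
POPULATION_SIZE = 30
MUTATION_RATE = 0.01
GENERATIONS = 100        # termination condition (b): a preset generation cutoff

def fitness(chrom):
    """Toy fitness: count the 1s (we try to evolve the all-ones string)."""
    return sum(chrom)

def crossover(mom, dad):
    """Single-point crossover: a prefix from one parent, the suffix from the other."""
    point = random.randrange(1, len(mom))
    return mom[:point] + dad[point:]

def mutate(chrom):
    """Flip each bit with a small probability."""
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit for bit in chrom]

population = [[random.randint(0, 1) for _ in range(CHROMOSOME_LENGTH)]
              for _ in range(POPULATION_SIZE)]

for _ in range(GENERATIONS):
    # Keep the fitter half of the population...
    population.sort(key=fitness, reverse=True)
    survivors = population[:POPULATION_SIZE // 2]
    # ...and refill it with mutated babies of random surviving parents.
    babies = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
              for _ in range(POPULATION_SIZE - len(survivors))]
    population = survivors + babies

best = max(population, key=fitness)
print(fitness(best))  # prints a value at or near the maximum of 20
```

Because the fitter half of the population always survives intact, the best fitness never decreases from one generation to the next; the mutated babies are where new improvements come from.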

But why would I want to do all this?

I know this seems like a pretty complicated process. Why not just use a computer to figure out the real solution instead of dancing around it in this complicated manner? Well, it turns out that’s not always possible…

Enter the Traveling Salesman Problem (TSP). While there are many problems that are still very hard for us to solve with computers, this is one of the best known and most studied. It turns out that a genetic algorithm is actually a pretty good approach here, and much quicker than an exhaustive search of all possible paths. Remember when I used the example of finding the shortest route somewhere? That’s pretty much what this is. I’ve done a blog post on this before (see here). In the traveling salesman problem, a salesman needs to visit a list of different cities, and he wants to make the trip as short as possible. Therefore, you need to find the shortest route that hits all the cities and returns home. This isn’t as easy as it sounds, and as the number of cities grows, so does the time it takes to find the best route. Very quickly it becomes implausible to check every possible path, so we use a genetic algorithm to help us weed out the bad ones.

Another example of how we would use a genetic algorithm is for graph theory problems that also have a huge number of possible solutions. You can see how I applied a genetic algorithm to the problem of Ramsey numbers here.

Okay, so how do I use this in my project?

It’s pretty simple. I know it looks complicated, but once you get everything set up correctly it’s not that bad. Here are the things to think about when applying a GA to a hard problem (we will use the Traveling Salesman Problem as an example):

• First, decide what your “chromosomes” will look like. These are the meat of your population, and they are what will be mutated and crossed over to create better solutions. For the TSP, we will use a sequence of numbers denoting which cities to visit and in what order. (For example, “1 5 4 2 3 1” describes one way to make a circuit through 5 cities.)
• Decide how fitness will be evaluated. This is important because members with better fitness will be the ones that stick around in your population and (hopefully) make your solutions better. In our problem, fitness is how far our chosen path takes us. The lower the fitness score, the shorter our travel. We want to minimize this number with our algorithm.
• Next, figure out what the crossover will be. This is very important to consider, because we want something that takes parts of both “parents” and combines them in some way to make a “baby”. For the traveling salesman, we can’t just grab some cities from one parent and some from the other, because we run the risk of duplicates. (We don’t want a path like “1 2 2 3 4 1”: city 2 is visited twice and we never get to city 5.) Therefore, we have to use more sophisticated methods. I won’t go into them here, but if you’re interested, check out this Wikipedia page for more info on crossover techniques.
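Here is a rough sketch of those pieces for the TSP. The city coordinates are made up for illustration, and the crossover shown is just one simple duplicate-free scheme (real implementations often use fancier operators such as OX or PMX):

```python
import math
import random

# Made-up coordinates for 5 cities; city 0 is "home".
CITIES = [(0, 0), (2, 3), (5, 1), (6, 4), (1, 5)]

def fitness(route):
    """Total round-trip distance of a route; lower is better."""
    total = 0.0
    for a, b in zip(route, route[1:] + route[:1]):  # wrap back to the start
        (x1, y1), (x2, y2) = CITIES[a], CITIES[b]
        total += math.hypot(x2 - x1, y2 - y1)
    return total

def crossover(mom, dad):
    """Duplicate-free crossover: copy a slice of mom, then fill the remaining
    positions with dad's cities in the order dad visits them."""
    start, end = sorted(random.sample(range(len(mom)), 2))
    middle = mom[start:end]
    rest = [city for city in dad if city not in middle]
    return rest[:start] + middle + rest[start:]

mom = [0, 1, 2, 3, 4]
dad = [0, 2, 4, 1, 3]
baby = crossover(mom, dad)
print(baby, fitness(baby))  # every city appears exactly once in the baby
```

Because the leftover positions are filled only with cities missing from the copied slice, each baby is always a valid tour with no repeated cities.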

What are the downsides?

“Every rose has its thorn,” as they say, and genetic algorithms are no different. What seems like a cure-all has its hindrances as well. As I mentioned before, GAs have a problem of converging too early on a value that isn’t quite ideal. Avoiding this often takes many repetitions of running the program and fine-tuning parameters such as population size and mutation rate. Sometimes happening upon the “best” solution is a product of randomness, but in a GA we use this randomness to our advantage as much as we can. Another inherent problem is that we have to program this whole framework around the problem. Using a GA is only worthwhile when a problem is so complex that it cannot be solved more efficiently. For example, using a GA to solve a 5-city TSP would be foolish: we can run through all those possibilities very quickly. On the other hand, bumping that up to 50 or 500 cities is a much harder challenge.

In addition, using a genetic algorithm means you have to find a chromosome representation, fitness function, population size, and crossover that work for you. Pick the wrong values for these, and the program will behave less than optimally; experimentation and continual tweaking of your model’s parameters are necessary. In some genetic algorithms, the fitness function for even one chromosome can take quite a while to compute, which makes some GAs very slow to apply.

I hope this has covered the basics of genetic algorithms and interested you in learning more. If you would like to see a project I have done involving GAs you can read A Genetic Algorithm Approach to Ramsey Theory, and for a broader range of discussion about the theory and applications of GAs check out the book An Introduction To Genetic Algorithms by Melanie Mitchell.

Hope you enjoyed this introduction to the wide world of genetic algorithms; if you have questions or suggestions, please leave them in the comments below!

# Animation Units for Facial Expression Tracking – Thesis Update #3

See the previous posts in this series:

For the past few months, I’ve been hard at work on my thesis concerning facial expression analysis with the Microsoft Kinect. Unfortunately my blog has suffered quite a bit in the wake of everything I’ve had going on, but I’m trying to post a few new things as the school semester winds down.

My ongoing project concerning facial expression analysis with Kinect is making progress, and as the semester winds down, I am trying to prepare a final product to present. As I described in my last thesis update, I was able to test a sample application that overlays a 3D mesh on the user’s face and tracks the different points on the face. In particular, a subset of these measurements, called Animation Units (AUs), is essential to the expression recognition algorithm. The points are defined based on the Candide-3 model, and the different values are delineated on Microsoft’s API page for the Kinect Face Tracking SDK. There are six Animation Units, each describing the movement or placement of a basic facial feature such as the eyes, eyebrows, or mouth. The table of AU values is reproduced below (from Microsoft’s Kinect API page):

| AU Name | AU Value Interpretation |
| --- | --- |
| AU0 – Upper Lip Raiser | 0 = neutral, covering teeth; 1 = showing teeth fully; -1 = lip pushed down maximally |
| AU1 – Jaw Lowerer | 0 = closed; 1 = fully open; -1 = closed, same as 0 |
| AU2 – Lip Stretcher | 0 = neutral; 1 = fully stretched (joker’s smile); -0.5 = rounded (pout); -1 = fully rounded (kissing mouth) |
| AU3 – Brow Lowerer | 0 = neutral; -1 = raised almost all the way; 1 = fully lowered (to the limit of the eyes) |
| AU4 – Lip Corner Depressor | 0 = neutral; -1 = very happy smile; 1 = very sad frown |
| AU5 – Outer Brow Raiser | 0 = neutral; -1 = fully lowered as in a very sad face; 1 = raised as in an expression of deep surprise |

Considering these values, I was able to determine a set of bounds for what each expression should look like in my application. However, I quickly learned that the extremal values of -1 and 1 are just that: extremal. Even sitting in front of the camera making the most exaggerated faces I could, it was very difficult to get above 0.5 or 0.6 in some areas. In addition, the eyebrow data was almost always inaccurate in my testing because I wear glasses, which confused the camera: the Kinect saw the top of my glasses as my eyebrows and thus showed very little movement at all. When I took my glasses off, placement and tracking returned to normal, but then it was impossible for me to see the data being processed.

Another unforeseen problem I ran into is that the camera is perhaps a bit too sensitive. Since it tracks many frames every second, even if you sit as still as possible there will be some variation in the tracked points, and if the camera loses track of the face entirely, the points can really go haywire: the application might say you went from surprised to sad to angry in less than a second. This makes keeping the points within specific bounds more difficult than expected. It probably won’t get done in this version of the project, but in the future it would be useful to store a buffer of points, perhaps one every second or so, to stabilize the constantly changing data and provide more time for processing before changing frames completely.
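That buffering idea could be sketched roughly like this: a hypothetical rolling average over the last few frames of AU values. Nothing like this exists in the current version of the project; it’s just one way the smoothing might work:

```python
from collections import deque

class AUSmoother:
    """Keep a rolling window of recent AU readings and report their average,
    so single-frame jitter (or a brief loss of tracking) doesn't swing the
    detected expression wildly."""

    def __init__(self, window_size=30):  # roughly 1 second at 30 fps
        self.window = deque(maxlen=window_size)

    def update(self, au_values):
        """au_values: a tuple of the six AU readings for one frame.
        Returns the per-AU average over the buffered frames."""
        self.window.append(au_values)
        n = len(self.window)
        return tuple(sum(frame[i] for frame in self.window) / n
                     for i in range(len(au_values)))

smoother = AUSmoother(window_size=3)
print(smoother.update((0.0, 0.0, 0.0, 0.0, 0.0, 0.0)))
print(smoother.update((0.6, 0.0, 0.0, 0.0, 0.0, 0.0)))  # the sudden spike is damped
```

A larger window gives smoother output at the cost of reacting more slowly to a genuine change of expression.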

Overall I’ve certainly learned a lot throughout the course of this project and I would like to continue to work on it more throughout the summer. I will post a few more thoughts on the process and results as I carry on and wrap things up. I’ll have a copy of my thesis online at some point too.

In other news, I’m going to be starting the Master’s program in Computer Science at Appalachian State University in the fall, and I’m pretty excited about it. I plan to do research on algorithms and graph theory.

Stay tuned…

UPDATE: See the next post in this series here: Kinect Face Tracking — Results :  Thesis Update #4

# The ABCs of Computer Science

So over the next few months, I’d like to write a series of educational posts on “The ABCs of Computer Science”. In each post, I’ll talk about a topic that begins with the next letter of the alphabet. This will serve not only as an educational resource for people who want to learn new things or get an overview of some of the big topics in computer science, but also as a learning experiment for me.

I plan to make an ABCs post at least every two weeks, and my regular posts about thesis updates and mathematical curiosities will continue. I have a tentative list of topics but I’d like your input. What topics would you like to see me write about? What’s something you are interested to learn? What can I use for elusive letters such as Q, X, and Z? Leave your ideas in the comments below and I will do my best to incorporate them.

# Quantifying the Self

So I’ve been doing a bit of an experiment this year. Sure, everyone says they want to do this or do that, lose weight, eat better, exercise more, etc., but how do we hold ourselves to these goals? As you may be aware, the you of the future is always a bit more conscientious than the you of today: “I’ll eat ice cream today and go on the diet tomorrow”, “Just one more day of sleeping in and I’ll get up early tomorrow.” The list goes on. Being the geek that I am, I began to wonder whether there was a more scientific way to go about all this.

Something I’ve found that is pretty easy to do, and that has had a big impact, is simply tracking the things you want to do with your time and seeing how they stack up over time. I started when I found a website called Beeminder. You can start as many “goals” as you like, such as “go to the gym twice a week” or “floss every day”. You know, those things we want to do but have a hard time actually doing. You plot a data point each day and can see your progress toward the goal. There is a “yellow brick road” for you to follow, and if you do more than average one day, you earn “safe days” where you don’t have to work as hard. It’s really engaging to see my graphs grow.