For background, see my three previous posts in this series:
- Facial Expression Analysis With Microsoft Kinect : Thesis Update #1
- Some Faces : Thesis Update #2
- Animation Units for Facial Expression Tracking : Thesis Update #3
My thesis has been successfully completed and defended now, and I am currently on break for the summer. I thought I would make a post to wrap up some loose ends and talk about some things I didn’t have a chance to talk about before. In my last post, I discussed the significance of animation units to my facial tracking algorithm. Now the way I use these units is pretty simple, but can lead to some complex classifications. The first thing to do is to consider what the desired face would look like. Picture it in the mind’s eye. Say for example we wanted to recognized a surprised face. How would this look? Chances are, when thinking of a surprised face, the mind visualizes raised eyebrows, wide open eyes, and a gaping mouth. The next task (and the bulk of my algorithm) asks: How do we translate these facial movements into the points tracked on our facial mesh?
We use animation units as “wrappers” for sets of points on the face plane. Instead of having to track and check multiple different places on the eyebrows or mouth for example, the animation units allow us to track the movements of those features as a whole. Since the animation unit values lie between -1 and 1, we must assign “bounds” for each expression to where if the user’s features fall within that range, we can assume the user is creating that expression. These values at present are determined by extensive testing and seeing what values are frequently displayed when creating a certain expression. It would not be difficult to build a classifier for these expression bounds, and use it to train the program over multiple different faces and expressions in order to get the best and most accurate data for each type of face.
In my application, we track six different types of expressions.
In addition, we look for two different types of angry faces: angry with a closed mouth (glaring at someone) or angry with an open mouth (as in yelling).
To see a simple flowchart detailing some preliminary bounds for each expression (not exhaustive), check out the chart here (click for larger view).
There is a bit of a “lag” in my application on recognizing these expressions, because the live video stream captures many frames each second, and there is a tiny bit of delay in figuring out what expression that frame’s data fits into. As such, the output of my program is a bit inaccurate still. Because it prints off what the expression every frame, there can be a bit of a buildup and after a while it will start showing expressions at a bit of a delay. For example, if the user acts surprised sometimes the program will not actually print “surprised” for a fraction of a second afterwards, because its busy trying to run through the frames as they come in. A simple remedy to this would be to create a “buffer” of tracked points and use the average of the data over a few seconds in order to determine the facial expression. Because the camera is very sensitive, we are prone to having the data change at the slightest movement of the face. Indeed, even trying to sit as still as possible still results in some small changes in the data. Another thing I noticed that creating a buffer of data could help solve is when the camera loses track of the face for only a moment, it begins to spit out garbage data as it attempts to relocate the face.
Overall we can see a good proof of concept of the capabilities of the Kinect Face Tracking API and there is a lot of room for improvement in the future. Possible future additions/enhancements include:
- tracking a wider range of expressions
- wider range of accessibility (glasses/hats, children, elderly people)
- more specific bounds for facial expressions, use neural networks or something
- more interactive application
- use facial expression recognition to interface with other environments (i.e., call a command by smiling)