Today I’m going to talk about a personal project I’ve been working on recently. I wanted to build something cool with natural language processing, and I had noticed that with the rise of social networks, there is a treasure trove of data out there waiting to be analyzed. I’m a fairly active user of Twitter, and it’s a fun way to get short snippets of info from people or topics you’re interested in. Personally, I have found a lot of math, science, and tech people who have Twitter accounts and post about their work or the latest news in the field. On a quick inspection, the people I follow tend to fall into certain “groups”:

• math
• computer science
• general science
• authors
• tech bloggers
• feminism
• various political and social activist figures (civil rights, digital privacy, etc)

This list gives a pretty good insight into the things I am most interested in and like hearing about on my Twitter timeline. So I thought to myself: what if I could automate this process and analyze any user’s timeline to find out what “groups” their friends fall into as well? I had my project.

Still in the beginning stages, I decided to tentatively call my project “FriendCloud” and registered with the API to start experimenting. I’m using python-twitter to interact with the Twitter API, and it’s helping me get practice with Python at the same time. The first thing I wanted to do was pull down a list of all the people I follow. Since I follow a little over 1,000 people, this proved to be a daunting task with the rate limiting that Twitter has built into its API. For now, my script pulls as many user objects as possible until the rate limit runs out, then sleeps for 15 minutes until the limits refresh and it can download more.
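The fetch-and-sleep loop can be sketched like this. Note that `fetch_page` and `RateLimitError` here are hypothetical stand-ins for the real python-twitter calls, used so the loop structure is clear; the 15-minute sleep matches Twitter’s standard rate-limit window:

```python
import time

RATE_WINDOW = 15 * 60  # Twitter's rate limits refresh every 15 minutes


class RateLimitError(Exception):
    """Raised by fetch_page when the API's rate limit is exhausted."""


def fetch_all_friends(fetch_page, sleep=time.sleep):
    """Collect every friend by paging through the API.

    `fetch_page(cursor)` is a hypothetical stand-in for a cursored
    python-twitter call: it returns (users, next_cursor), where a
    next cursor of 0 means we're done, and it raises RateLimitError
    when the rate limit runs out.
    """
    friends, cursor = [], -1
    while cursor != 0:
        try:
            users, cursor = fetch_page(cursor)
            friends.extend(users)
        except RateLimitError:
            sleep(RATE_WINDOW)  # wait out the window, then retry
    return friends
```

Passing `sleep` in as a parameter also makes the loop easy to test without actually waiting 15 minutes.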

It took a little over an hour to get the list of all my friends, and I am looking into ways to do this more quickly in the future. After that, I can go through the users and pull down a selection of their tweets, which gives me a corpus of text to analyze. I have been using NLTK (a Python NLP toolkit) to pick out some of the most common keywords and themes. There is a lot of extraneous data to deal with, but as I pare it down I’ve noticed some interesting trends just in my own tweets.
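The keyword-extraction step boils down to cleaning tweets and counting what’s left. Here’s a dependency-free sketch of that idea; the tiny stopword set is just for illustration (the actual project uses NLTK’s much larger stopword corpus and tokenizers):

```python
import re
from collections import Counter

# A tiny stopword list for illustration only; NLTK ships a far more
# complete English stopword corpus.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is",
             "it", "i", "on", "for", "rt", "this", "that", "with", "my"}


def top_keywords(tweets, n=5):
    """Return the n most common non-stopword tokens across some tweets."""
    words = []
    for tweet in tweets:
        # Strip URLs and @mentions (extraneous data), then tokenize
        cleaned = re.sub(r"https?://\S+|@\w+", "", tweet.lower())
        words.extend(w for w in re.findall(r"[a-z']+", cleaned)
                     if w not in STOPWORDS)
    return Counter(words).most_common(n)
```

The counts that come out of this are essentially the raw material for a word cloud or for grouping users by topic.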

I hope in the future to extend this to the people I follow on Twitter and place them into rough “groups” based on their most commonly tweeted keywords (similar to how a word cloud works). That way, a user can get an at-a-glance look at what topics a person is most likely interested in and what sort of people they may be likely to follow in the future.

# Intractable Problems — Part Two: Data Storage

This post continues my series on intractable problems. In this installment, I will talk about problems relating to data storage. As a refresher, an intractable problem is one that is so computationally complex that it is very difficult to solve with a computer without some sort of novel thinking. I will discuss two famous problems related to data storage below, along with a few examples and references.


Knapsack — given a set of items, each with a weight $w_i$ and a value $v_i$, and a knapsack with capacity $C$, maximize the total value of the items in the knapsack without going over capacity.

To start with, here is a small example that I referred to in a previous post. If you’re trying to place ornaments on a tree, you want to get the maximum coverage possible. However, the tree can only hold so much weight before it falls over. What is the best way to pick ornaments and decorations so that you cover as much of the tree as possible without it falling over?

I also saw a really interesting example of this problem described in this video (watch the first few minutes). The professor sets up a situation of Indiana Jones trying to grab treasures from a temple before it collapses: he wants to get the most valuable treasures he can, but he can only carry so much.

Class: There are several different types of knapsack problems, but the most common one (the one discussed above) is the one-dimensional 0/1 knapsack. The decision problem (can we reach a value $V$ without exceeding weight $W$?) is NP-complete, while the optimization problem (what is the maximum value we can fit within the capacity?) is NP-hard.
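Despite those hardness results, the 0/1 knapsack can be solved exactly by a classic dynamic program whose running time is pseudo-polynomial, i.e. polynomial in the *magnitude* of the capacity rather than in its bit length. A minimal sketch, assuming integer weights:

```python
def knapsack(weights, values, capacity):
    """0/1 knapsack: max total value achievable within the capacity.

    Runs in O(n * capacity) time and O(capacity) space, which is
    pseudo-polynomial -- this does not contradict NP-hardness.
    """
    best = [0] * (capacity + 1)  # best[c] = max value using capacity c
    for w, v in zip(weights, values):
        # Iterate capacities downward so each item is used at most once
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]
```

For example, with weights (3, 4, 5), values (4, 5, 6), and capacity 7, the best choice is the first two items for a value of 9.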

References:

Wikipedia Page – General discussion of the Knapsack problem, different types, complexity, and a high level view of several algorithms for solving

Coursera Course on Discrete Optimization – The source for the above video and a great discussion of not only Knapsack but quite a few of these problems

Knapsack Problem at Rosetta Code – a good example data set and a variety of implementations in different languages

---

Bin Packing — given a set of items with weights $w_i$ and a supply of bins with capacity $c$ each, place the items into the bins so that the minimum number of bins is used.

This problem is very similar to the knapsack problem above, but this time we don’t care how much the items are worth. We just want to pack them into as few bins as possible. Solving this problem is invaluable for things like shipping and logistics: companies want to be able to ship more goods in less space.

A more commonplace example is to think of this problem as real-life Tetris. Say you’re moving to a new place, and you have your car and your friends’ cars in which to move things. How can you place all the items in the cars so that you use the space most efficiently?

Some progress has been made on reasonably large data sets by using what is called the “first fit decreasing” algorithm. “Decreasing” means that before you start placing items, you sort them all from biggest to smallest. “First fit” means you then pick up each item and place it into the first bin it will fit in; if it can’t fit in any of the current bins, you open a new bin for it. You probably do this in your everyday life: if you want to pack a box, you start with the big items, right? No need to put lots of small items on the bottom. By getting the big items out of the way first, you stay flexible with the remaining space, because the items left over are smaller and slot into the gaps.
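The steps above are only a few lines of code. A minimal sketch, assuming one-dimensional numeric weights (real packing problems add geometry on top of this):

```python
def first_fit_decreasing(items, capacity):
    """Pack item weights into as few bins of size `capacity` as we can.

    Heuristic, not optimal: sort items largest-first, drop each into
    the first bin with room, and open a new bin when none fits.
    """
    bins = []  # each bin is a list of the item weights placed in it
    for item in sorted(items, reverse=True):
        for b in bins:
            if sum(b) + item <= capacity:
                b.append(item)
                break
        else:
            bins.append([item])  # no existing bin has room: open one
    return bins
```

For example, packing weights (4, 8, 1, 4, 2, 1) into bins of capacity 10 yields two bins, which happens to be optimal here; in general the heuristic can use somewhat more bins than the optimum.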

Class: This problem is NP-Hard.

References:

Wikipedia Page – a high level description of the problem

First Fit Decreasing Paper (pdf) – this is a technical paper describing computational bounds for using the first fit decreasing algorithm. Not for beginners.

3D Bin Packing Simulation – looks like a resource for companies to use to pack boxes and such

---

I hope this provided a little taste of why these problems are so important. If you know any other good resources please let me know.