Today I’m going to talk about a personal project I’ve been working on recently. I was trying to come up with some way to make a cool project with natural language processing and I had also noticed that with the rise of social networks, there is a treasure trove of data out there waiting to be analyzed. I’m a fairly active user of Twitter, and its a fun way to get short snippets of info from people or topics you’re interested in. I know personally, I have found a lot of math, science, and tech people that have twitter accounts and post about their work or the latest news in the field. I find just on a short inspection that the people I follow tend to fall into certain “groups”:
- computer science
- general science
- tech bloggers
- various political and social activist figures (civil rights, digital privacy, etc)
This list gives a pretty good insight into the things I am most interested in and like hearing about on my twitter timeline. Now I thought to myself, what if I could automate this process and analyze any user’s timeline to find out what “groups” their friends fell into as well? I had my project.
Currently in the beginning stages, I decided to tentatively call my project “FriendCloud” and registered with the API to start messing around. I’m using Python-Twitter to interact with the twitter API, and its helping me to get practice with Python at the same time. The first thing I wanted to do was be able to pull down a list of all the people that I follow. Since I follow a little over 1000 people, this proved to be a daunting task with the rate limiting that Twitter has built in to their API. At the moment, what I had to do was get my script to pull as many user objects as possible until the rate limit ran out, then put the program to sleep for 15 minutes until the rates refreshed and I could download more.
It took a little over an hour to get the list of all my friends and I am trying to look into a way to do this quicker in the future. After that, I can go through users and pull down a selection of their tweets. After this is done, I have a corpus of text that I can analyze. I have been using NLTK (a Python NLP toolkit) to pick out some of the most common keywords and themes. There is a lot of extraneous data to deal with, but as I pare it down I’ve noticed some interesting trends just in my own tweets.
I hope in the future to be able to extend this to the people I follow on twitter and be able to place them into rough “groups” based on their most commonly tweeted keywords (similar to how a word cloud works). In this way, a user can get an at a glance look at what topics the person is most likely interested in and what sort of people they may be likely to follow in the future.