Mining Twitter for Trends in a Geographic Area
Twitter does not provide local or targeted trends, so I decided to create my own solution. For this project I used Perl, the Twitter API via Net::Twitter, the Yahoo Term Extraction Service, and SQLite3 to store data. This technique is a quick and dirty way to perform a simple analysis of Twitter posts and extract current conversation trends.
To view the most recent output from the process please go to my site: Des Moines Twitter Trends. (Updated for V2)
Process Overview:
- Get New Local Twitter Posts
- Clean and Filter Data
- Extract Significant Terms
- Rank Term Occurrence
- Record Term and Tweet Data
- Consume Historical Data
Process Details:
Get New Local Twitter Posts
I used the Perl Net::Twitter module, freely available via CPAN, to download new Twitter posts. It provides an easy way to interface with the Twitter API. I created a new account and registered my application with Twitter, then used that account information with Net::Twitter to download posts via the Twitter API.
Code for Defining Net::Twitter, and downloading tweets:
use Net::Twitter;

# $user: Twitter account username
# $password: Twitter account password
# $consumer_key: Twitter application consumer key
# $consumer_secret: Twitter application consumer secret
my $nt = Net::Twitter->new(
    traits          => [qw/API::REST API::Search OAuth/],
    username        => $user,
    password        => $password,
    consumer_key    => $consumer_key,
    consumer_secret => $consumer_secret,
);

# q: search query (I am not using one)
# geocode: geographical coordinates for the search
# rpp: results per page
my $statuses = $nt->search({ q => $search, geocode => $geocode, rpp => $rpp });
my $results  = $statuses->{results};
The Net::Twitter constructor takes the account and application credentials. A search request is then made, and the response is similar to the JSON object returned by the search API. To access the posts, we read the "results" key, which contains all the Twitter posts.
Clean and Filter Twitter Posts
This step is pretty straightforward. We take the results from our previous step, and we loop through each post, and run a set of search and replace regular expressions against the post data. We are specifically looking to remove things such as URLs, HTML characters, and other data specific to Twitter such as the abbreviation RT.
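As a rough illustration of this cleaning pass, here is a minimal sketch. The helper name clean_tweet and the exact patterns are my assumptions for the example, not the original script; stripping @mentions in particular is an extra guess at "other data specific to Twitter".

```perl
use strict;
use warnings;

# Hypothetical helper: strip URLs, HTML entities, and Twitter-specific noise.
sub clean_tweet {
    my ($text) = @_;
    $text =~ s{https?://\S+}{ }gi;    # URLs
    $text =~ s/&(?:#\d+|\w+);/ /g;    # HTML entities such as &amp; or &#39;
    $text =~ s/\bRT\b/ /g;            # the retweet abbreviation
    $text =~ s/\@\w+:?/ /g;           # @mentions (assumption, not named in the post)
    $text =~ s/\s+/ /g;               # collapse runs of whitespace
    $text =~ s/^\s+|\s+$//g;          # trim both ends
    return $text;
}

my @raw = (
    'RT @someone: Loving the new iPhone &amp; apps http://example.com/x',
);
my @cleaned = map { clean_tweet($_) } @raw;
print "$_\n" for @cleaned;
```

Keeping each substitution on its own line makes it easy to add or drop a rule when a new kind of noise shows up in the sample.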
Extract Significant Terms
For this step we will need to compile the cleaned data into a large string. Once the data is compiled we can post it to the Yahoo Term Extraction Service (YTES). Again there is a free CPAN module for the YTES. Before using the YTES I had to sign up for a Yahoo Developer API Key. The provided module is very programmer friendly, just plug and play.
Accessing the Yahoo Term Extraction Service:
use WebService::Yahoo::TermExtractor;

# appid: Yahoo application/API key
# context: content to extract terms from
my $yte = WebService::Yahoo::TermExtractor->new( appid => $appid, context => $text );
my $terms = $yte->get_terms;
Create a Yahoo term extractor object with your API key and a context. Then call the get_terms method, which returns the extracted terms as an array reference.
Rank Term Occurrence
Next we rank the output from the last step. For each term we loop through every tweet, and check to see if it occurs. If it does we add a point to the term.
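The ranking loop described above can be sketched roughly as follows. The helper name rank_terms, the sample tweets, and the case-insensitive substring match are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: score each term by the number of tweets that mention it.
sub rank_terms {
    my ($terms, $tweets) = @_;
    my %score;
    for my $term (@$terms) {
        # \Q...\E keeps punctuation in a term from being read as regex syntax.
        # grep in scalar context returns the number of matching tweets.
        $score{$term} = grep { /\Q$term\E/i } @$tweets;
    }
    return \%score;
}

my @terms  = ('michael jackson', 'iphone');
my @tweets = (
    'Michael Jackson was the king of pop',
    'so sad about michael jackson',
    'new iphone today',
);
my $score = rank_terms(\@terms, \@tweets);
for my $term (sort { $score->{$b} <=> $score->{$a} || $a cmp $b } keys %$score) {
    print "$term: $score->{$term}\n";
}
```

Each tweet contributes at most one point per term, so a single post that repeats a phrase many times does not inflate its rank.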
Record Term and Tweet Data
Next we save the list of terms and their counts to a historical data store. We also save the Twitter posts. These are saved so we can filter out already sampled posts.
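One way to filter out already sampled posts is to let SQLite reject duplicates on insert. This is only a sketch: it assumes DBI with DBD::SQLite is installed, and the tweets table, its columns, the id and text keys, and the helper name store_new_tweets are all hypothetical.

```perl
use strict;
use warnings;
use DBI;

# In-memory database for illustration; a real run would point at a file.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 });

# Hypothetical schema: the tweet id is the primary key.
$dbh->do('CREATE TABLE IF NOT EXISTS tweets (id TEXT PRIMARY KEY, body TEXT)');

# Stores the tweets and returns only those not seen by an earlier run.
sub store_new_tweets {
    my ($dbh, $tweets) = @_;
    my $sth = $dbh->prepare('INSERT OR IGNORE INTO tweets (id, body) VALUES (?, ?)');
    my @fresh;
    for my $tweet (@$tweets) {
        my $rows = $sth->execute($tweet->{id}, $tweet->{text});
        push @fresh, $tweet if $rows == 1;    # 0 rows: already sampled
    }
    return \@fresh;
}

my $first  = store_new_tweets($dbh, [ { id => '1', text => 'a' }, { id => '2', text => 'b' } ]);
my $second = store_new_tweets($dbh, [ { id => '2', text => 'b' }, { id => '3', text => 'c' } ]);
printf "first run: %d new, second run: %d new\n", scalar @$first, scalar @$second;
```

Only the posts returned as fresh would then go on to the cleaning and term extraction steps.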
Consume Historical Data
Finally we pull a list of the top trends and output it as a Twitter post. A simple SQL query pulls the trends ordered by occurrence, descending, and Net::Twitter posts the result to an account.
Net::Twitter Update Code
my $result = $nt->update($text);
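The SQL query mentioned above might look something like the sketch below. It assumes DBI with DBD::SQLite, and the terms table, its columns, and the counts in it are invented for the illustration rather than taken from the real data store.

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 });

# Hypothetical schema: one row per term per sampling run.
$dbh->do('CREATE TABLE terms (term TEXT, occurrences INTEGER)');
my $insert = $dbh->prepare('INSERT INTO terms (term, occurrences) VALUES (?, ?)');
$insert->execute(@$_) for (
    [ 'michael jackson', 12 ], [ 'iphone', 5 ],
    [ 'michael jackson', 9 ],  [ 'farrah fawcett', 7 ],
);

# Sum the counts per term and list the highest-ranked terms first.
my $top = $dbh->selectcol_arrayref(
    'SELECT term FROM terms GROUP BY term ORDER BY SUM(occurrences) DESC LIMIT 3'
);

# The joined string is what would be handed to $nt->update.
my $text = join ', ', @$top;
print "$text\n";
```

Keeping one row per sampling run, rather than a single running total, leaves room to restrict the query to a date range later.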
Sample Output:
Using the data store from June 25th for Des Moines, IA, the scripts ranked the following terms highly:
michael jackson, farrah fawcett, iphone, risingstars, king of pop
This was pretty easy to validate, because news of Michael Jackson and Farrah Fawcett dominated news coverage. There was also a Republican event called Rising Stars in Iowa that week.
This is not "true" data mining; it is a bastardized process. It does have some bugs, but it tends to be very accurate when definite trends appear.













One Comment
great idea! love it how you use the yahoo service.