So… this post is my first foray into the R twitteR package, and it assumes you already have that package installed. I show here how to get tweets from Twitter in preparation for some sentiment analysis. My next post will cover the sentiment analysis itself.
For this example, I am grabbing tweets related to “Comcast email.” My goal for this exercise is to see how people feel about the product I support.
STEP 1: GETTING AUTHENTICATED TO TWITTER
First, you’ll need to create an application at Twitter. I used this blog post to get rolling; it does a good job of walking you through the steps.
Once you have your app created, this is the code I used to create and save my authentication credentials. After you’ve done this once, you only need to load your saved credentials to authenticate with Twitter in the future.
library(twitteR) ## R package that does some of the Twitter API heavy lifting
library(ROAuth)  ## provides OAuthFactory

consumerKey <- "INSERT YOUR KEY HERE"
consumerSecret <- "INSERT YOUR SECRET HERE"
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"

twitCred <- OAuthFactory$new(consumerKey = consumerKey,
                             consumerSecret = consumerSecret,
                             requestURL = reqURL,
                             accessURL = accessURL,
                             authURL = authURL)
twitCred$handshake()

## save the credential object under the same name that gets loaded later
save(twitCred, file = "credentials.RData")
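A note on why the object name matters: load() restores objects under the names they were saved with, so whatever name is passed to save() is the name that comes back later. Here is a minimal sketch of that round trip, using a dummy list in place of a real credential object (dummyCred and the temporary file are made up for illustration):

```r
## Stand-in for a credential object (hypothetical, holds no real keys)
dummyCred <- list(consumerKey = "not-a-real-key")

## Save it to a temporary file, then remove it from the workspace
credFile <- tempfile(fileext = ".RData")
save(dummyCred, file = credFile)
rm(dummyCred)

## load() returns the names of the objects it restored
loaded <- load(credFile)
loaded
dummyCred$consumerKey
```

After load(), dummyCred exists again under its original name, which is why the script that loads the credentials can refer to the saved object directly.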
STEP 2: GETTING THE TWEETS
Once you have your authentication credentials set, you can use them to grab tweets from Twitter.
The next snippets of code come from my scraping_twitter.R script, which you are welcome to see in its entirety on GitHub.
##Authentication
load("credentials.RData") ##has my secret keys and shiz
registerTwitterOAuth(twitCred) ##logs me in

##Get the tweets about "comcast email" to work with
tweetList <- searchTwitter("comcast email", n = 1000)
tweetList <- twListToDF(tweetList) ##converts that data we got into a data frame
As you can see, I used the twitteR package to authenticate and search Twitter. After getting the tweets, I converted the results to a data frame to make them easier to analyze.
STEP 3: GETTING RID OF THE JUNK
Many of the tweets returned by my initial search are totally unrelated to Comcast Email. An example of this would be: “I am selling something random… please email me at REDACTED@comcast.net”
The tweet above includes the words email and comcast, but it says nothing about Comcast Email or how the user feels about it, beyond the fact that they use it for their business.
So… based on some initial manual analysis of the tweets, I’ve decided to keep only the tweets that meet at least one of these criteria:
- contain “fix” followed by “email”
- contain “Comcast” followed by “email”
- contain “no email”
- come from a source with “comcast” in the handle
- contain “Customer Service” and “email”, in either order
This is done with this code:
##finds the rows that have the phrase "fix ... email" in them
fixemail <- grep("(fix.*email)", tweetList$text)

##finds the rows that have the phrase "comcast ... email" in them
comcastemail <- grep("[Cc]omcast.*email", tweetList$text)

##finds the rows that have the phrase "no email" in them
noemail <- grep("no email", tweetList$text)

##finds the rows that originated from a Comcast twitter handle
comcasttweet <- grep("[Cc]omcast", tweetList$screenName)

##finds the rows related to email and customer service
custserv <- grep("[Cc]ustomer [Ss]ervice.*email|email.*[Cc]ustomer [Ss]ervice", tweetList$text)
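To see what patterns like these actually keep and drop, here is a small sketch on made-up tweets. The toyTweets data frame and its contents are invented for illustration, and I use ignore.case = TRUE as a variation on the script's patterns so that capitalization is ignored everywhere, not just on the first letter of “Comcast”.

```r
## Toy stand-in for tweetList (hypothetical tweets, not real search results)
toyTweets <- data.frame(
  text = c("Please fix my email, it has been down all day",
           "Comcast email keeps rejecting my password",
           "Selling a couch, email me for details",
           "Still no email this morning",
           "Waited an hour for customer service about my email"),
  screenName = c("user_a", "user_b", "user_c", "comcastcares_demo", "user_e"),
  stringsAsFactors = FALSE
)

## Same five filters, but case-insensitive (an assumption, not the original script)
fixemail     <- grep("fix.*email", toyTweets$text, ignore.case = TRUE)
comcastemail <- grep("comcast.*email", toyTweets$text, ignore.case = TRUE)
noemail      <- grep("no email", toyTweets$text, ignore.case = TRUE)
comcasttweet <- grep("comcast", toyTweets$screenName, ignore.case = TRUE)
custserv     <- grep("customer service.*email|email.*customer service",
                     toyTweets$text, ignore.case = TRUE)

## Rows matched by at least one filter
keep <- sort(unique(c(fixemail, comcastemail, noemail, comcasttweet, custserv)))

## The tweets that were filtered out
toyTweets$text[-keep]
```

In this toy set only the couch tweet is dropped: it mentions email but matches none of the five filters.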
After pulling out the duplicates (some tweets may fall into multiple scenarios from above) and ensuring they are in order (as returned initially), I assign the relevant tweets to a new variable with only some of the returned columns.
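Of the columns that come back in the data frame, all I care about are the tweet text, the time of the tweet, the source, and the username (columns 1, 5, 10, and 11).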
This is handled through this tidbit of code:
##combine all of the "good" tweets row numbers that we grepped out above and
##then sorts them and makes sure they are unique
combined <- c(fixemail, comcastemail, noemail, comcasttweet, custserv)
uvals <- unique(combined)
sorted <- sort(uvals)

##pull the row numbers that we want, and with the columns that are important to
##us (tweet text, time of tweet, source, and username)
paredTweetList <- tweetList[sorted, c(1, 5, 10, 11)]
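Column positions can shift if the data frame layout changes, so one possible variation is to select columns by name instead. The sketch below runs on a toy data frame; the column names text, created, statusSource, and screenName are my assumption about what the real data frame calls these fields, and matchA and matchB are made-up stand-ins for grep() results.

```r
## Toy data frame (hypothetical rows; column names are assumed)
toyTweets <- data.frame(
  text         = c("tweet one", "tweet two", "tweet three", "tweet four"),
  favorited    = FALSE,
  created      = as.POSIXct("2014-12-23 09:00:00", tz = "UTC") + 0:3 * 3600,
  statusSource = "Twitter Web Client",
  screenName   = c("user_a", "user_b", "user_c", "user_d"),
  stringsAsFactors = FALSE
)

## Hypothetical grep() results: row 3 matched two patterns, row 1 matched one
matchA <- c(3L, 1L)
matchB <- c(3L)

## De-duplicate and restore the original row order
rows <- sort(unique(c(matchA, matchB)))

## Select the columns of interest by name rather than by position
wanted <- c("text", "created", "statusSource", "screenName")
pared <- toyTweets[rows, wanted]
pared
```

Row 3 shows up only once in the result even though two patterns matched it, and the selected columns stay correct even if other columns are added or reordered.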
STEP 4: CLEAN UP THE DATA AND RETURN THE RESULTS
Lastly, for this first script, I make the sources look nice, add titles, and return the final list (only a sample set of tweets shown):
##make the device source look nicer
paredTweetList$statusSource <- sub("<.*\">", "", paredTweetList$statusSource)
paredTweetList$statusSource <- sub("</a>", "", paredTweetList$statusSource)

##name the columns
names(paredTweetList) <- c("Tweet", "Created", "Source", "ScreenName")
paredTweetList
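To make the source cleanup concrete, here is a sketch of what the two sub() calls do to a single value. The rawSource string is a made-up example, assuming the source arrives wrapped in an HTML anchor tag; the one-step gsub() version is an alternative I am suggesting, not part of the original script.

```r
## Hypothetical raw source value, shaped like an HTML anchor tag
rawSource <- "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>"

## Two-step cleanup: strip the opening tag, then the closing tag
step1 <- sub("<.*\">", "", rawSource)
step2 <- sub("</a>", "", step1)

## One-step alternative: remove anything that looks like a tag
oneStep <- gsub("<[^>]+>", "", rawSource)

step2
oneStep
```

Both approaches leave just the client name. The gsub() form is a bit more forgiving if the opening tag does not end in a quoted attribute.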
| Tweet | Created | Source | ScreenName |
|---|---|---|---|
| Dear Mark I am having problems login into my acct REDACTED@comcast.net I get no email w codes to reset my password for eddygil HELP HELP | 2014-12-23 15:44:27 | Twitter Web Client | riocauto |
| @msnbc @nbc @comcast pay @thereval who incites the murder of police officers. Time to send them a message of BOYCOTT! Tweet/email them NOW | 2014-12-23 14:52:50 | Twitter Web Client | Monty_H_Mathis |
| Comcast, I have no email. This is bad for my small business. Their response “Oh, I’m sorry for that”. Problem not resolved. #comcast | 2014-12-23 09:20:14 | Twitter Web Client | mathercesul |
As you can see from the output, some “junk” still gets in. I’d like to keep working on a more reliable way of identifying relevant tweets. I’m also worried that my choice of filter phrases (“fix,” “no email,” and so on) is biasing the sentiment.