data engineering

Visualizing Exercise Data from Strava

INTRODUCTION

My wife introduced me to cycling in 2014 - I fell in love with it and went all in. That first summer after buying my bike, I rode over 500 miles (more on that below). My neighbors at the time, also cyclists, introduced me to the app Strava. Ever since then, I’ve tracked everything I could: rides, runs, hikes, and walks (perhaps not really exercise that needs to be tracked… but I hurt myself early in 2018 and that’s all I could do for a while). I got curious and found a package, rStrava, that lets me download all of my activity data. Once I had it, I put it into a few visualizations.

ESTABLISH STRAVA AUTHENTICATION

First thing I had to do was set up a Strava account and application. I found some really nice instructions on another blog that helped walk me through this. After that, I installed rStrava and set up authentication (you only have to do this the first time).

    ## INSTALLING THE NECESSARY PACKAGES
    install.packages("devtools")
    devtools::install_github('fawda123/rStrava')

    ## LOAD THE LIBRARY
    library(rStrava)

    ## ESTABLISH THE APP CREDENTIALS
    name <- 'jakelearnsdatascience'
    client_id <- '31528'
    secret <- 'MY_SECRET_KEY'

    ## CREATE YOUR STRAVA TOKEN
    ## cache = TRUE is optional - but it saves your token to the working directory
    token <- httr::config(token = strava_oauth(name, client_id, secret,
                                               app_scope = "read_all", cache = TRUE))

GET MY EXERCISE DATA

Now that authentication is set up, using the rStrava package to pull activity data is relatively straightforward.

    library(rStrava)

    ## LOAD THE TOKEN (AFTER THE FIRST TIME)
    ## oauth_location is the path to the token file cached above
    stoken <- httr::config(token = readRDS(oauth_location)[[1]])

    ## GET STRAVA DATA USING rStrava FUNCTION FOR MY ATHLETE ID
    my_act <- get_activity_list(stoken)

This function returns a list of activities (class(my_act) is "list"). In my case, there are 379 activities.

FORMATTING THE DATA

To make the data easier to work with, I convert it to a data frame.
There are many more fields than I’ve selected below - these are all I want for this post.

    info_df <- data.frame()
    for(act in 1:length(my_act)){
      tmp <- my_act[[act]]
      tmp_df <- data.frame(name = tmp$name,
                           type = tmp$type,
                           distance = tmp$distance,
                           moving_time = tmp$moving_time,
                           elapsed_time = tmp$elapsed_time,
                           start_date = tmp$start_date_local,
                           total_elevation_gain = tmp$total_elevation_gain,
                           trainer = tmp$trainer,
                           manual = tmp$manual,
                           average_speed = tmp$average_speed,
                           max_speed = tmp$max_speed)
      info_df <- rbind(info_df, tmp_df)
    }

I want to convert a few fields to units that make more sense for me (miles, feet, hours instead of meters and seconds). I’ve also created a number of features, though I’ve suppressed the code here. You can see all of the code on github.

HOW FAR HAVE I GONE?

Since August 08, 2014, I have - under my own power - traveled 1300.85 miles. There were a few periods without much action (a whole year from mid-2016 through late 2017), which is a bit sad. The last few months have been good, though.

Here’s a similar view, but split by activity. I’ve been running recently. I haven’t really ridden my bike since the first 2 summers I had it. I rode the Peloton when we first got it, but not since. I was a walker when I first tore the labrum in my hip in early 2018.

Finally, here’s the same data again, but split up in a ridgeplot.

SUMMARY

There’s a TON of data that is returned by the Strava API. This blog just scratches the surface of the analysis that is possible - mostly I am just introducing how to get the data and get up and running. As a new year’s resolution, I’ve committed to run 312 miles this year. That is 6 miles per week for 52 weeks (for those trying to wrap their head around the weird number). Now that I’ve been able to pull this data, I’ll have to set up a tracker/dashboard for that goal. More to come!
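The unit conversions mentioned in this post are not shown, so here is a minimal sketch of what they might look like. It assumes distances and elevation are in meters and times are in seconds (as the post says); the sample rows and the new column names (distance_miles, moving_hours, and so on) are made up for illustration and are not the author's actual code.

```r
## HYPOTHETICAL SAMPLE ROWS - NOT REAL STRAVA DATA
info_df <- data.frame(type = c("Ride", "Run"),
                      distance = c(16093.44, 5000),          # meters
                      moving_time = c(3600, 1500),           # seconds
                      total_elevation_gain = c(100, 30.48),  # meters
                      stringsAsFactors = FALSE)

## CONVERSION FACTORS
meters_per_mile <- 1609.344
feet_per_meter <- 3.28084

## CONVERT TO MILES, HOURS, FEET, AND MILES PER HOUR
info_df$distance_miles <- info_df$distance / meters_per_mile
info_df$moving_hours <- info_df$moving_time / 3600
info_df$elevation_gain_feet <- info_df$total_elevation_gain * feet_per_meter
info_df$avg_mph <- info_df$distance_miles / info_df$moving_hours

## TOTAL MILES BY ACTIVITY TYPE
miles_by_type <- aggregate(distance_miles ~ type, data = info_df, FUN = sum)
```

A per-type total like miles_by_type is the kind of table that could feed the "split by activity" view described above.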

Exploring Open Data - Philadelphia Parking Violations

Introduction

A few weeks ago, I stumbled across Dylan Purcell’s article on Philadelphia Parking Violations. This is a nice glimpse of the data, but I wanted to get a taste of it myself. I went and downloaded the entire data set of Parking Violations in Philadelphia from the OpenDataPhilly website and came up with a few questions after checking out the data:

- How many tickets in the data set?
- What is the range of dates in the data? Are there missing days/data?
- What was the biggest/smallest individual fine? What were those fines for? Who issued those fines?
- What was the average individual fine amount?
- What day had the most/least count of fines? What is the average amount per day?
- How much $ in fines did they write each day?
- What hour of the day are the most fines issued?
- What day of the week are the most fines issued?
- What state has been issued the most fines?
- Who (what individual) has been issued the most fines? How much does the individual with the most fines owe the city?
- How many people have been issued fines?
- What fines are issued the most/least?

And finally to the cool stuff:

- Where were the most fines? Can I see them on a heat map?
- Can I predict the amount of parking tickets by weather data and other factors using linear regression? How about using Random Forests?

Data Insights

This data set has 5,624,084 tickets in it, spanning January 1, 2012 through September 30, 2015 - an exact range of 1368.881 days. I was glad to find that there are no missing days in the data set. The biggest fine, $2000 (OUCH!), was issued (many times) by the police for “ATV on Public Property.” The smallest fine, $15, was also issued by the police, for “parking over the time limit.” The average fine for a violation in Philadelphia over the time range was $46.33. The most violations occurred on November 30, 2012, when 6,040 were issued. The fewest, unsurprisingly, were issued on Christmas day, 2014, when only 90 were issued.
On average, PPA and the other 9 agencies that issued tickets (more on that below) issued 4,105.17 tickets per day. All of those tickets add up to $190,193.50 in fines issued to the residents and visitors of Philadelphia every day!!!

Digging a little deeper, I find that the most popular hour of the day for getting a ticket is 12 noon; 5AM nets the fewest tickets. Thursdays see the most tickets written (Thursdays and Fridays are higher than the rest of the week); Sundays see the fewest (pretty obvious). Another obvious insight is that PA licensed drivers were issued the most tickets.

Looking at individuals, there was one person who was issued 1,463 tickets (that’s more than 1 violation per day on average) for a whopping $36,471. In just looking at a few of their tickets, it seems like it is probably a delivery vehicle that delivers to Chinatown (tickets for “Stop Prohibited” and “Bus Only Zone” in the Chinatown area). I’d love to hear more about why this person has so many tickets and what you do about that…

1,976,559 people - let me reiterate - nearly 2 million unique vehicles have been issued fines over the three and three quarter years this data set encompasses. That’s so many!!! That is 2.85 tickets per vehicle, on average (of course that excludes all of the cars that were here and never ticketed). That makes me feel much better about how many tickets I got while I lived in the city.

And… who are the agencies behind all of this? It is no surprise that PPA issues the most. There are 11 agencies in all. Seems like all of the policing agencies like to get in on the fun from time to time.

    Issuing Agency               count
    PPA                          4,979,292
    PHILADELPHIA POLICE          611,348
    CENTER CITY DISTRICT         9,628
    SEPTA                        9,342
    UPENN POLICE                 6,366
    TEMPLE POLICE                4,055
    HOUSING AUTHORITY            2,137
    PRISON CORRECTIONS OFFICER   295
    POST OFFICE                  121
    FAIRMOUNT DISTRICT           120

Mapping the Violations

Where are you most likely to get a violation? Is there anywhere that is completely safe?
Looking at the city as a whole, you can see that there are some places that are “hotter” than others. I played around in CartoDB to try to visualize this as well, but Tableau seemed to do a decent enough job (though these are just screenshots).

Zooming in, you can see that there are some distinct areas where tickets are given out in more quantity.

Looking one level deeper, you can see that there are some areas like Center City, east Washington Avenue, Passyunk Ave, and Broad Street that seem to be very highly patrolled.

Summary

I created the above maps in Tableau. I used R to summarize the data. The R scripts, raw and processed data, and Tableau workbook can be found in my github repo. In the next post, I use weather data and other parameters to predict how many tickets will be written on a daily basis.
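For readers who want to try the kind of R summaries described in this post, here is a minimal sketch using base R. The five sample tickets and the column names (issue_time, fine) are made up for illustration; the real data set has its own column names, and the author's actual scripts are in the repo mentioned above.

```r
## HYPOTHETICAL SAMPLE TICKETS - NOT THE REAL DATA SET
tickets <- data.frame(
  issue_time = as.POSIXct(c("2012-01-01 09:15:00", "2012-01-01 12:05:00",
                            "2012-01-02 12:40:00", "2012-01-02 17:30:00",
                            "2012-01-02 12:10:00"), tz = "UTC"),
  fine = c(26, 51, 36, 301, 26))

## DERIVE THE DAY AND HOUR OF EACH TICKET
tickets$day <- as.Date(tickets$issue_time)
tickets$hour <- as.integer(format(tickets$issue_time, "%H"))

## COUNT OF TICKETS AND TOTAL DOLLARS PER DAY
tickets_per_day <- table(tickets$day)
dollars_per_day <- tapply(tickets$fine, tickets$day, sum)

## HOUR WITH THE MOST TICKETS AND THE AVERAGE FINE
busiest_hour <- as.integer(names(which.max(table(tickets$hour))))
avg_fine <- mean(tickets$fine)
```

The same pattern (derive a grouping column, then table() or tapply()) extends to day of week, state, or issuing agency.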

Using R and Splunk: Lookups of More Than 10,000 Results

Splunk, for some probably very good reasons, has limits on how many results are returned by sub-searches (which in turn limits us on lookups, too). Because of this, I’ve used R to search Splunk through its API endpoints (using the httr package), taking advantage of loops, the plyr package, and the other data manipulation flexibility R provides. This has allowed me to answer some questions for our business team that on the surface seem simple enough, but where the data gathering and manipulation get either too complex or too large for Splunk to handle efficiently. Here are some examples:

- Of the 1.5 million customers we’ve emailed in a marketing campaign, how many of them have made the conversion?
- How are our 250,000 beta users accessing the platform?
- Who are the users logging into our system from our internal IPs?

The high level steps to using R and Splunk are:

1. Import the lookup values of concern as a csv
2. Create the lookup as a string
3. Create the search string including the lookup just created
4. Execute the GET to get the data
5. Read the response into a data table

I’ve taken this one step further; because my lookups are usually LARGE, I end up breaking up the search into smaller chunks and combining the results at the end. Here is some example code that you can edit to show what I’ve done and how I’ve done it. This bit of code will iteratively run the “searchstring” 250 times and combine the results.
    ## LIBRARY THAT ENABLES THE HTTPS CALL ##
    library(httr)

    ## READ IN THE LOOKUP VALUES OF CONCERN ##
    mylookup <- read.csv("mylookup.csv", header = FALSE)

    ## ARBITRARY "CHUNK" SIZE TO KEEP SEARCHES SMALLER ##
    start <- 1
    end <- 1000

    ## CREATE AN EMPTY DATA FRAME THAT WILL HOLD END RESULTS ##
    alldata <- data.frame()

    ## HOW MANY "CHUNKS" WILL NEED TO BE RUN TO GET COMPLETE RESULTS ##
    for(i in 1:250){
      ## CREATES THE LOOKUP STRING FROM THE FIRST COLUMN OF mylookup ##
      lookupstring <- paste(mylookup[start:end, 1], sep = "", collapse = '" OR VAR_NAME="')

      ## CREATES THE SEARCH STRING; THIS IS A SIMPLE SEARCH EXAMPLE ##
      searchstring <- paste('index = "my_splunk_index" (VAR_NAME="', lookupstring,
                            '") | stats count BY VAR_NAME', sep = "")

      ## RUNS THE SEARCH; SUB IN YOUR SPLUNK LINK, USERNAME, AND PASSWORD ##
      response <- GET("",
                      path = "servicesNS/admin/search/search/jobs/export",
                      encode = "form",
                      config(ssl_verifyhost = FALSE, ssl_verifypeer = 0),
                      authenticate("USERNAME", "PASSWORD"),
                      query = list(search = paste0("search ", searchstring),
                                   output_mode = "csv"))

      ## CHANGES THE RESULTS TO A DATA TABLE ##
      result <- read.table(text = content(response, as = "text"), sep = ",",
                           header = TRUE, stringsAsFactors = FALSE)

      ## BINDS THE CURRENT RESULTS WITH THE OVERALL RESULTS ##
      alldata <- rbind(alldata, result)

      ## UPDATES THE START POINT ##
      start <- end + 1

      ## UPDATES THE END POINT, BUT MAKES SURE IT DOESN'T GO TOO FAR ##
      if((end + 1000) > nrow(mylookup)){
        end <- nrow(mylookup)
      } else {
        end <- end + 1000
      }

      ## FOR TROUBLESHOOTING, I PRINT THE ITERATION ##
      #print(i)
    }

    ## WRITES THE RESULTS TO A CSV ##
    write.table(alldata, "mydata.csv", row.names = FALSE, sep = ",")

So - that is how you do a giant lookup against Splunk data with R! I am sure that there are more efficient ways of doing this, even in the Splunk app itself, but this has done the trick for me!
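As an alternative to tracking start and end by hand, the chunking can be sketched with split(). This is only an illustration, not the code I ran: the lookup values are generated placeholders, build_search is a hypothetical helper name, and no call to Splunk is made here.

```r
## HYPOTHETICAL LOOKUP VALUES - IN PRACTICE THESE COME FROM mylookup.csv
lookup_values <- paste0("user", 1:2500)
chunk_size <- 1000

## ASSIGN EACH VALUE TO A CHUNK NUMBER, THEN SPLIT INTO A LIST OF CHUNKS
chunk_id <- ceiling(seq_along(lookup_values) / chunk_size)
chunks <- split(lookup_values, chunk_id)

## HYPOTHETICAL HELPER: BUILDS ONE SEARCH STRING FROM ONE CHUNK OF VALUES
build_search <- function(values) {
  lookupstring <- paste(values, collapse = '" OR VAR_NAME="')
  paste0('index = "my_splunk_index" (VAR_NAME="', lookupstring,
         '") | stats count BY VAR_NAME')
}

## ONE SEARCH STRING PER CHUNK
searches <- vapply(chunks, build_search, character(1))
```

With this layout the number of iterations is simply length(searches), so the loop never has to guess how many chunks there are and the last, shorter chunk is handled automatically.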

Using the Google Search API and Plotly to Locate Waterparks

I’ve got a buddy who manages and builds waterparks. I thought to myself… I am probably the only person in the world who has a friend that works at a waterpark - cool. Then I started thinking some more… there has to be more than just his waterpark in this country; I’ve been to at least a few… and the thinking continued… I wonder how many there are… and continued… and I wonder where they are… and, well, here we are at the culmination of that curiosity with this blog post.

So - the first problem - how would I figure that out? As with most things I need answers to in this world, I turned to Google and asked: Where are the waterparks in the US? The answer appears to be: there are a lot. The data is there if I can get my hands on it.

Knowing that Google has an API, I signed up for an API key and away I went! Until I was stopped abruptly by a limit on how many results will be returned: a measly 20 per search. I know R and wanted to use that to hit the API. Using the httr package and a for loop, I settled for doing the search once per state and living with a maximum of 20 results per state. Easy fix. Here’s the code to generate the search string and query Google:

    q1 <- paste("waterparks in ", list_of_states[j,1], sep = "")
    response <- GET("", path = "maps/api/place/textsearch/xml",
                    query = list(query = q1, key = "YOUR_API_KEY"))

The results come back in XML (or JSON, if you so choose… I went with XML for this, though) - something that I have not had much experience with. I used the XML package and a healthy amount of additional time in Google search-land and was able to parse the data into a data frame! Success!
Here’s a snippet of the code to get this all done:

    result <- xmlParse(response)
    result1 <- xmlRoot(result)
    result2 <- getNodeSet(result1, "//result")
    data[counter, 1] <- xmlValue(result2[[i]][["name"]])
    data[counter, 2] <- xmlValue(result2[[i]][["formatted_address"]])
    data[counter, 3] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lat"]])
    data[counter, 4] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lng"]])
    data[counter, 5] <- xmlValue(result2[[i]][["rating"]])

Now that the data is gathered and in the right shape - what is the best way to present it? I’ve recently read about a package in R named plotly. They have many interesting and interactive visualizations, plus the API plugs right into R. I found a nice example of a map using the package. With just a few lines of code and a couple iterations, I was able to generate this (click on the picture to get the full interactivity):

[Figure: Waterparks in the USA]

This plot can be seen here, too. Not too shabby!

There are a few things to mention here… For one, not every water park has a rating; I dealt with this by making the NAs into 0s. That’s probably not the nicest way of handling that. Also - this is only the top 20 waterparks per state, as Google decided. There are likely some waterparks out there that are not represented here. There are also probably non-waterparks represented here that popped up in the results.

For those of you who are interested in the data or script I used to generate this map, feel free to grab them at those links. Maybe one day I’ll come back to this to find out where there are the most waterparks per capita - or some other correlation to see what the best water park really is… this is just the tip of the iceberg.

It feels good to scratch a few curiosity-driven itches in one project!
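Two of the steps in this post can be sketched without touching the API: building one query per state, and handling missing ratings. The sketch below is an assumption-laden illustration rather than the script behind the map: it uses R's built-in state.name vector in place of my list_of_states, the three parks are made up, and keeping missing ratings as NA is an alternative to the zeros used for the plot.

```r
## ONE QUERY PER STATE, USING R'S BUILT-IN VECTOR OF STATE NAMES
queries <- paste("waterparks in", state.name)

## HYPOTHETICAL PARSED RESULTS - RATINGS ARRIVE AS TEXT, SOME ARE MISSING
parks <- data.frame(name = c("Park A", "Park B", "Park C"),
                    rating = c("4.5", NA, "3.9"),
                    stringsAsFactors = FALSE)

## KEEP MISSING RATINGS AS NA INSTEAD OF TURNING THEM INTO 0
parks$rating <- as.numeric(parks$rating)
parks$has_rating <- !is.na(parks$rating)

## AVERAGE OVER ONLY THE PARKS THAT ACTUALLY HAVE A RATING
avg_rating <- mean(parks$rating, na.rm = TRUE)
```

Keeping the NAs means an unrated park is not mistaken for a zero-star park, and has_rating can be used to give unrated parks their own color or marker on the map.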

Doing a Sentiment Analysis on Tweets (Part 2)

INTRO

This post is a continuation of my last post. There I pulled tweets from Twitter related to “Comcast email,” got rid of the junk, and removed the unnecessary/unwanted data. Now that I have the tweets, I will further clean the text and subject it to two different analyses: emotion and polarity.

WHY DOES THIS MATTER

Before I get started, I thought it might be a good idea to talk about WHY I am doing this (besides the fact that I learned a new skill and want to show it off and get feedback). This as yet incomplete project was devised for two reasons:

1. Understand the overall customer sentiment about the product I support
2. Create an early warning system to help identify when things are going wrong on the platform

Keeping the customer voice at the forefront of everything we do is paramount to providing the best experience for the users of our platform. Identifying trends in sentiment and emotion can help inform the team in many ways, including seeing the reaction to new features/releases (e.g., a rise in comments about a specific addition from a release), identifying needed changes to current functionality (e.g., users who continually comment about a specific behavior of the application), and finding improvements to user experience (e.g., trends in comments about being unable to find a certain feature on the site). Secondarily, this analysis can act as an early warning system when there are issues with the platform (e.g., a sudden spike in comments about the usability of a mobile device).

Now that I’ve explained why I am doing this (which I probably should have done in this sort of detail in the first post), let’s get into how it is actually done…

STEP ONE: STRIPPING THE TEXT FOR ANALYSIS

There are a number of things included in tweets that don’t matter for the analysis. Things like Twitter handles, URLs, punctuation… they are not necessary to do the analysis (in fact, they may well confound it). This bit of code handles that cleanup.
For those following the scripts on GitHub, this is part of my tweet_clean.R script. Also, to give credit where it is due: I’ve borrowed and tweaked the code from Andy Bromberg’s blog to do this task.

    library(stringr) ##Does some of the text editing

    ##Cleaning up the data some more (just the text now)
    ##First grabbing only the text
    text <- paredTweetList$Tweet

    # remove retweet entities
    text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", text)
    # remove at people
    text <- gsub("@\\w+", "", text)
    # remove punctuation
    text <- gsub("[[:punct:]]", "", text)
    # remove numbers
    text <- gsub("[[:digit:]]", "", text)
    # remove html links
    text <- gsub("http\\w+", "", text)

    # define "tolower error handling" function
    try.error <- function(x) {
      # create missing value
      y <- NA
      # tryCatch error
      try_error <- tryCatch(tolower(x), error=function(e) e)
      # if not an error
      if (!inherits(try_error, "error")) y <- tolower(x)
      # result
      return(y)
    }

    # lower case using try.error with sapply
    text <- sapply(text, try.error)
    # remove NAs in text
    text <- text[!is.na(text)]
    # remove column names
    names(text) <- NULL

STEP TWO: CLASSIFYING THE EMOTION FOR EACH TWEET

So now the text is just that: only text. The punctuation, links, handles, etc. have been removed. Now it is time to estimate the emotion of each tweet. Through some research, I found that there are many posts/sites on Sentiment Analysis/Emotion Classification that use the “Sentiment” package in R. I thought: “Oh great - a package tailor made to solve the problem for which I want an answer.” The problem is that this package has been deprecated and removed from the CRAN library. To get around this, I downloaded the archived package and pulled the code for doing the emotion classification. With some minor tweaks, I was able to get it going. This can be seen in its entirety in the classify_emotion.R script.
You can also see the “made for the internet” version here:

    library(RTextTools)
    library(tm)

    algorithm <- "bayes"
    prior <- 1.0
    verbose <- FALSE
    matrix <- create_matrix(text)
    lexicon <- read.csv("./data/emotions.csv.gz", header=FALSE)

    counts <- list(anger=length(which(lexicon[,2]=="anger")),
                   disgust=length(which(lexicon[,2]=="disgust")),
                   fear=length(which(lexicon[,2]=="fear")),
                   joy=length(which(lexicon[,2]=="joy")),
                   sadness=length(which(lexicon[,2]=="sadness")),
                   surprise=length(which(lexicon[,2]=="surprise")),
                   total=nrow(lexicon))
    documents <- c()

    for (i in 1:nrow(matrix)) {
      if (verbose) print(paste("DOCUMENT",i))
      scores <- list(anger=0,disgust=0,fear=0,joy=0,sadness=0,surprise=0)
      doc <- matrix[i,]
      words <- findFreqTerms(doc,lowfreq=1)

      for (word in words) {
        for (key in names(scores)) {
          emotions <- lexicon[which(lexicon[,2]==key),]
          index <- pmatch(word, emotions[,1], nomatch=0)
          if (index > 0) {
            entry <- emotions[index,]
            category <- as.character(entry[[2]])
            count <- counts[[category]]
            score <- 1.0
            if (algorithm=="bayes") score <- abs(log(score*prior/count))
            if (verbose) {
              print(paste("WORD:",word,"CAT:", category,"SCORE:",score))
            }
            scores[[category]] <- scores[[category]]+score
          }
        }
      }

      if (algorithm=="bayes") {
        for (key in names(scores)) {
          count <- counts[[key]]
          total <- counts[["total"]]
          score <- abs(log(count/total))
          scores[[key]] <- scores[[key]]+score
        }
      } else {
        for (key in names(scores)) {
          scores[[key]] <- scores[[key]]+0.000001
        }
      }

      best_fit <- names(scores)[which.max(unlist(scores))]
      if (best_fit == "disgust" && as.numeric(unlist(scores[2]))-3.09234 < .01) best_fit <- NA
      documents <- rbind(documents, c(scores$anger, scores$disgust, scores$fear,
                                      scores$joy, scores$sadness, scores$surprise, best_fit))
    }

    colnames(documents) <- c("ANGER", "DISGUST", "FEAR", "JOY", "SADNESS", "SURPRISE", "BEST_FIT")

Here is a sample output from this code:

    ANGER DISGUST FEAR JOY SADNESS SURPRISE BEST_FIT
    "1.46871776464786" "3.09234031207392" "2.06783599555953" "1.02547755260094" "7.34083555412328" "7.34083555412327" "sadness"
    "7.34083555412328" "3.09234031207392" "2.06783599555953" "1.02547755260094" "1.7277074477352" "2.78695866252273" "anger"
    "1.46871776464786" "3.09234031207392" "2.06783599555953" "1.02547755260094" "7.34083555412328" "7.34083555412328" "sadness"

Here you can see that the original author is using naive Bayes (which honestly I don’t yet understand) to analyze the text. I wanted to show a quick snippet of how the analysis is being done “under the hood.” For my purposes though, I only care about the emotion outputted and the tweet it is analyzed from.

    emotion <- documents[, "BEST_FIT"]

This variable, emotion, is returned by the classify_emotion.R script.

CHALLENGES OBSERVED

In addition to my not fully understanding the code, the emotion classification seems to only work OK (which is pretty much expected… this is a canned analysis that hasn’t been tailored to my data at all). I’d like to come back to this one day to see if I can do a better job analyzing the emotions of the tweets.

STEP THREE: CLASSIFYING THE POLARITY OF EACH TWEET

Similarly to what we saw in step two, I will use the cleaned text to analyze the polarity of each tweet. This code is also from the old R package titled “Sentiment.” As with above, I was able to get the code working with only some minor tweaks. This can be seen in its entirety in the classify_polarity.R script.
Here it is, too:

    algorithm <- "bayes"
    pstrong <- 0.5
    pweak <- 1.0
    prior <- 1.0
    verbose <- FALSE
    matrix <- create_matrix(text)
    lexicon <- read.csv("./data/subjectivity.csv.gz", header=FALSE)

    counts <- list(positive=length(which(lexicon[,3]=="positive")),
                   negative=length(which(lexicon[,3]=="negative")),
                   total=nrow(lexicon))
    documents <- c()

    for (i in 1:nrow(matrix)) {
      if (verbose) print(paste("DOCUMENT",i))
      scores <- list(positive=0,negative=0)
      doc <- matrix[i,]
      words <- findFreqTerms(doc, lowfreq=1)

      for (word in words) {
        index <- pmatch(word, lexicon[,1], nomatch=0)
        if (index > 0) {
          entry <- lexicon[index,]
          polarity <- as.character(entry[[2]])
          category <- as.character(entry[[3]])
          count <- counts[[category]]
          score <- pweak
          if (polarity == "strongsubj") score <- pstrong
          if (algorithm=="bayes") score <- abs(log(score*prior/count))
          if (verbose) {
            print(paste("WORD:", word, "CAT:", category, "POL:", polarity, "SCORE:", score))
          }
          scores[[category]] <- scores[[category]]+score
        }
      }

      if (algorithm=="bayes") {
        for (key in names(scores)) {
          count <- counts[[key]]
          total <- counts[["total"]]
          score <- abs(log(count/total))
          scores[[key]] <- scores[[key]]+score
        }
      } else {
        for (key in names(scores)) {
          scores[[key]] <- scores[[key]]+0.000001
        }
      }

      best_fit <- names(scores)[which.max(unlist(scores))]
      ratio <- as.integer(abs(scores$positive/scores$negative))
      if (ratio==1) best_fit <- "neutral"
      documents <- rbind(documents, c(scores$positive, scores$negative,
                                      abs(scores$positive/scores$negative), best_fit))
      if (verbose) {
        print(paste("POS:", scores$positive,"NEG:", scores$negative,
                    "RATIO:", abs(scores$positive/scores$negative)))
        cat("\n")
      }
    }

    colnames(documents) <- c("POS","NEG","POS/NEG","BEST_FIT")

Here is a sample output from this code:

    POS NEG POS/NEG BEST_FIT
    "1.03127774142571" "0.445453222112551" "2.31512017476245" "positive"
    "1.03127774142571" "26.1492093145274" "0.0394381997949273" "negative"
    "17.9196623384892" "17.8123396772424" "1.00602518608961" "neutral"

Again, I just wanted to show a quick snippet of how the analysis is being done
“under the hood.” I only care about the polarity outputted and the tweet it is analyzed from.

    polarity <- documents[, "BEST_FIT"]

This variable, polarity, is returned by the classify_polarity.R script.

CHALLENGES OBSERVED

As with above, this is a stock analysis and hasn’t been tweaked for my needs. The analysis does OK, but I want to come back to this again one day to see if I can do better.

QUICK CONCLUSION

So… Now I have the emotion and polarity for each tweet. This can be useful to see on its own, but I think it is more worthwhile in aggregate. In my next post, I’ll show that. In the next post I’ll also show an analysis of the word count with a wordcloud… This gets into the secondary point of this analysis. Hypothetically, I’d like to see common issues bubbled up through the wordcloud.
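Since the scoring arithmetic in the polarity code is easy to lose in the loops, here is a small worked sketch of it for a single tweet. The lexicon counts (40 positive, 60 negative, 100 total) are made up for illustration and are not the real numbers from subjectivity.csv.gz; word_score and base_score are hypothetical helper names that mirror the two places where scores are added in the code above.

```r
## HYPOTHETICAL LEXICON COUNTS - NOT THE REAL subjectivity.csv.gz NUMBERS
counts <- list(positive = 40, negative = 60, total = 100)
pstrong <- 0.5
pweak <- 1.0
prior <- 1.0

## SCORE ADDED FOR ONE MATCHED WORD (MIRRORS THE INNER LOOP)
word_score <- function(category, polarity) {
  score <- if (polarity == "strongsubj") pstrong else pweak
  abs(log(score * prior / counts[[category]]))
}

## SCORE ADDED ONCE PER TWEET FOR EACH CATEGORY (MIRRORS THE "bayes" BLOCK)
base_score <- function(category) abs(log(counts[[category]] / counts$total))

## A TWEET WITH ONE WEAK POSITIVE WORD AND NO NEGATIVE WORDS
pos <- word_score("positive", "weaksubj") + base_score("positive")
neg <- base_score("negative")
ratio <- pos / neg

## SAME DECISION RULE AS THE SCRIPT: BIGGEST SCORE WINS, RATIO NEAR 1 IS NEUTRAL
best_fit <- names(which.max(c(positive = pos, negative = neg)))
if (as.integer(ratio) == 1) best_fit <- "neutral"
```

Two things fall out of the arithmetic: every matched word adds the absolute log of a small fraction, so each match pushes its category's score up, and a "strongsubj" word adds more than a weak one because pstrong halves the fraction before the log is taken.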

Doing a Sentiment Analysis on Tweets (Part 1)

INTRO

So… This post is my first foray into the R twitteR package. This post assumes that you have that package installed already in R. I show here how to get tweets from Twitter in preparation for doing some sentiment analysis. My next post will be the actual sentiment analysis. For this example, I am grabbing tweets related to “Comcast email.” My goal with this exercise is to see how people are feeling about the product I support.

STEP 1: GETTING AUTHENTICATED TO TWITTER

First, you’ll need to create an application at Twitter. I used this blog post to get rolling with that. This post does a good job walking you through the steps to do that. Once you have your app created, this is the code I used to create and save my authentication credentials. Once you’ve done this once, you need only load your credentials in the future to authenticate with Twitter.

    library(twitteR) ## R package that does some of the Twitter API heavy lifting

    consumerKey <- "INSERT YOUR KEY HERE"
    consumerSecret <- "INSERT YOUR SECRET HERE"
    reqURL <- " "
    accessURL <- " "
    authURL <- " "

    twitCred <- OAuthFactory$new(consumerKey = consumerKey,
                                 consumerSecret = consumerSecret,
                                 requestURL = reqURL,
                                 accessURL = accessURL,
                                 authURL = authURL)
    twitCred$handshake()

    ## save the credential object created above so it can be loaded later
    save(twitCred, file="credentials.RData")

STEP 2: GETTING THE TWEETS

Once you have your authentication credentials set, you can use them to grab tweets from Twitter. The next snippets of code come from my scraping_twitter.R script, which you are welcome to see in its entirety on GitHub.

    ##Authentication
    load("credentials.RData") ##has my secret keys and shiz
    registerTwitterOAuth(twitCred) ##logs me in

    ##Get the tweets about "comcast email" to work with
    tweetList <- searchTwitter("comcast email", n = 1000)
    tweetList <- twListToDF(tweetList) ##converts that data we got into a data frame

As you can see, I used the twitteR R Package to authenticate and search Twitter.
After getting the tweets, I converted the results to a Data Frame to make it easier to analyze the results.

STEP 3: GETTING RID OF THE JUNK

Many of the tweets returned by my initial search are totally unrelated to Comcast Email. An example of this would be: “I am selling something random… please email me at”

The tweet above includes the words email and comcast, but has nothing to actually do with Comcast Email and the way the user feels about it, other than they use it for their business. So… based on some initial, manual analysis of the tweets, I’ve decided to pull those tweets with the phrases:

- “fix” AND “email” in them (in that order)
- “Comcast” AND “email” in them (in that order)
- “no email” in them
- Any tweet that comes from a source with “comcast” in the handle
- “Customer Service” AND “email” OR the reverse (“email” AND “Customer Service”) in them

This is done with this code:

    ##finds the rows that have the phrase "fix ... email" in them
    fixemail <- grep("(fix.*email)", tweetList$text)

    ##finds the rows that have the phrase "comcast ... email" in them
    comcastemail <- grep("[Cc]omcast.*email", tweetList$text)

    ##finds the rows that have the phrase "no email" in them
    noemail <- grep("no email", tweetList$text)

    ##finds the rows that originated from a Comcast twitter handle
    comcasttweet <- grep("[Cc]omcast", tweetList$screenName)

    ##finds the rows related to email and customer service
    custserv <- grep("[Cc]ustomer [Ss]ervice.*email|email.*[Cc]ustomer [Ss]ervice", tweetList$text)

After pulling out the duplicates (some tweets may fall into multiple scenarios from above) and ensuring they are in order (as returned initially), I assign the relevant tweets to a new variable with only some of the returned columns.
The returned columns are: text, favorited, favoriteCount, replyToSN, created, truncated, replyToSID, id, replyToUID, statusSource, screenName, retweetCount, isRetweet, retweeted, longitude, latitude.

All I care about are: text, created, statusSource, screenName.

This is handled through this tidbit of code:

    ##combine all of the "good" tweets row numbers that we greped out above and
    ##then sorts them and makes sure they are unique
    combined <- c(fixemail, comcastemail, noemail, comcasttweet, custserv)
    uvals <- unique(combined)
    sorted <- sort(uvals)

    ##pull the row numbers that we want, and with the columns that are important to
    ##us (tweet text, time of tweet, source, and username)
    paredTweetList <- tweetList[sorted, c(1, 5, 10, 11)]

STEP 4: CLEAN UP THE DATA AND RETURN THE RESULTS

Lastly, for this first script, I make the sources look nice, add titles, and return the final list (only a sample set of tweets shown):

    ##make the device source look nicer
    paredTweetList$statusSource <- sub("<.*\">", "", paredTweetList$statusSource)
    paredTweetList$statusSource <- sub("</a>", "", paredTweetList$statusSource)

    ##name the columns
    names(paredTweetList) <- c("Tweet", "Created", "Source", "ScreenName")

    paredTweetList

    Tweet | created | statusSource | screenName
    Dear Mark I am having problems login into my acct I get no email w codes to reset my password for eddygil HELP HELP | 2014-12-23 15:44:27 | Twitter Web Client | riocauto
    @msnbc @nbc @comcast pay @thereval who incites the murder of police officers. Time to send them a message of BOYCOTT! Tweet/email them NOW | 2014-12-23 14:52:50 | Twitter Web Client | Monty_H_Mathis
    Comcast, I have no email. This is bad for my small business. Their response “Oh, I’m sorry for that”. Problem not resolved. #comcast | 2014-12-23 09:20:14 | Twitter Web Client | mathercesul

CHALLENGES OBSERVED

As you can see from the output, sometimes some “junk” still gets in. Something I’d like to continue working on is a more reliable algorithm for identifying appropriate tweets.
I also am worried that my choice of subjects is biasing the sentiment.
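To make the filtering logic from step 3 easy to try without a Twitter account, here is a minimal sketch that runs the same kind of grep patterns on a handful of made-up tweets. The tweets and screen names are hypothetical, and only three of the five patterns are used to keep it short.

```r
## HYPOTHETICAL TWEETS - NOT REAL DATA
tweetList <- data.frame(
  text = c("please fix my email", "Comcast email is down again",
           "selling a couch, email me", "I have no email today",
           "Comcast please fix the email login"),
  screenName = c("user1", "user2", "user3", "user4", "user5"),
  stringsAsFactors = FALSE)

## ROW NUMBERS MATCHING EACH PATTERN
fixemail <- grep("fix.*email", tweetList$text)
comcastemail <- grep("[Cc]omcast.*email", tweetList$text)
noemail <- grep("no email", tweetList$text)

## COMBINE, DROP DUPLICATES, AND KEEP THE ORIGINAL ORDER
keep <- sort(unique(c(fixemail, comcastemail, noemail)))
paredTweetList <- tweetList[keep, ]
```

The last tweet matches two patterns but appears only once in paredTweetList, and the couch tweet is dropped because it matches none of them.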