data visualization

Jake on Strava - R Shiny App

I created a Shiny app that grabs my running, riding, and other exercise stats from Strava and creates some simple visualizations.

Twitter Analysis - R Shiny App

I created a Shiny app that searches Twitter and does some simple analysis.

Jake Learns Data Science Visitor Dashboard

A quick view of visitors to my website. Data pulled from Google Analytics and pushed to Amazon Redshift using Stitch Data.

Visualizing Exercise Data from Strava

INTRODUCTION My wife introduced me to cycling in 2014 - I fell in love with it and went all in. That first summer after buying my bike, I rode over 500 miles (more on that below). My neighbors at the time, also cyclists, introduced me to the app Strava. Ever since then, I’ve tracked all of my rides, runs, hikes, walks (perhaps not really exercise that needs to be tracked… but I hurt myself early in 2018 and that’s all I could do for a while), etc. everything I could, I tracked. I got curious and found a package, rStrava, where I can download all of my activity. Once I had it, I put it into a few visualizations. ESTABLISH STRAVA AUTHENTICATION First thing I had to do was set up a Strava account and application. I found some really nice instructions on another blog that helped walk me through this. After that, I installed rStrava and set up authentication (you only have to do this the first time). ## INSTALLING THE NECESSARY PACKAGES install.packages("devtools") devtools::install_github('fawda123/rStrava') ## LOAD THE LIBRARY library(rStrava) ## ESTABLISH THE APP CREDENTIALS name <- 'jakelearnsdatascience' client_id <- '31528' secret <- 'MY_SECRET_KEY' ## CREATE YOUR STRAVA TOKEN token <- httr::config(token = strava_oauth(name, client_id, secret, app_scope = "read_all", cache = TRUE)) ## cache = TRUE is optional - but it saves your token to the working directory GET MY EXERCISE DATA Now that authentication is setup, using the rStrava package to pull activity data is relatively straightforward. library(rStrava) ## LOAD THE TOKEN (AFTER THE FIRST TIME) stoken <- httr::config(token = readRDS(oauth_location)[[1]]) ## GET STRAVA DATA USING rStrava FUNCTION FOR MY ATHLETE ID my_act <- get_activity_list(stoken) This function returns a list of activities. class(my_act): list. In my case, there are 379 activies. FORMATTING THE DATA To make the data easier to work with, I convert it to a data frame. There are many more fields than I’ve selected below - these are all I want for this post. info_df <- data.frame() for(act in 1:length(my_act)){ tmp <- my_act[[act]] tmp_df <- data.frame(name = tmp$name, type = tmp$type, distance = tmp$distance, moving_time = tmp$moving_time, elapsed_time = tmp$elapsed_time, start_date = tmp$start_date_local, total_elevation_gain = tmp$total_elevation_gain, trainer = tmp$trainer, manual = tmp$manual, average_speed = tmp$average_speed, max_speed = tmp$max_speed) info_df <- rbind(info_df, tmp_df) } I want to convert a few fields to units that make more sense for me (miles, feet, hours instead of meters and seconds). I’ve also created a number of features, though I’ve suppressed the code here. You can see all of the code on github. HOW FAR HAVE I GONE? Since August 08, 2014, I have - under my own power - traveled 1300.85 miles. There were a few periods without much action (a whole year from mid-2016 through later-2017), which is a bit sad. The last few months have been good, though. Here’s a similar view, but split by activity. I’ve been running recently. I haven’t really ridden my bike since the first 2 summers I had it. I rode the peloton when we first got it, but not since. I was a walker when I first tore the labrum in my hip in early 2018. Finally, here’s the same data again, but split up in a ridgeplot. SUMMARY There’s a TON of data that is returned by the Strava API. This blog just scratches the surface of analysis that is possible - mostly I am just introducing how to get the data and get up and running. As a new year’s resolution, I’ve committed to run 312 miles this year. That is 6 miles per week for 52 weeks (for those trying to wrap their head around the weird number). Now that I’ve been able to pull this data, I’ll have to set up a tracker/dashboard for that data. More to come!

Prime Number Patterns

I found a very thought provoking and beautiful visualization on the D3 Website regarding prime numbers. What the visualization shows is that if you draw periodic curves beginning at the origin for each positive integer, the prime numbers will be intersected by only two curves: the prime itself’s curve and the curve for one. When I saw this, my mind was blown. How interesting… and also how obvious. The definition of a prime is that it can only be divided by itself and one (duh). This is a visualization of that fact. The patterns that emerge are stunning. I wanted to build the data and visualization for myself in R. While not as spectacular as the original I found, it was still a nice adventure. I used Plotly to visualize the data. The code can be found on github. Here is the visualization:

Exploring Open Data - Philadelphia Parking Violations

Introduction A few weeks ago, I stumbled across Dylan Purcell’s article on Philadelphia Parking Violations. This is a nice glimpse of the data, but I wanted to get a taste of it myself. I went and downloaded the entire data set of Parking Violations in Philadelphia from the OpenDataPhilly website and came up with a few questions after checking out the data: How many tickets in the data set? What is the range of dates in the data? Are there missing days/data? What was the biggest/smallest individual fine? What were those fines for? Who issued those fines? What was the average individual fine amount? What day had the most/least count of fines? What is the average amount per day How much $ in fines did they write each day? What hour of the day are the most fines issued? What day of the week are the most fines issued? What state has been issued the most fines? Who (what individual) has been issued the most fines? How much does the individual with the most fines owe the city? How many people have been issued fines? What fines are issued the most/least? And finally to the cool stuff: Where were the most fines? Can I see them on a heat map? Can I predict the amount of parking tickets by weather data and other factors using linear regression? How about using Random Forests? Data Insights This data set has 5,624,084 tickets in it that spans from January 1, 2012 through September 30, 2015 - an exact range of 1368.881 days. I was glad to find that there are no missing days in the data set. The biggest fine, $2000 (OUCH!), was issued (many times) by the police for “ATV on Public Property.” The smallest fine, $15, was issued also by the police “parking over the time limit.” The average fine for a violation in Philadelphia over the time range was $46.33. The most violations occurred on November 30, 2012 when 6,040 were issued. The least issued, unsurprisingly, was on Christmas day, 2014, when only 90 were issued. On average, PPA and the other 9 agencies that issued tickets (more on that below), issued 4,105.17 tickets per day. All of those tickets add up to $190,193.50 in fines issued to the residents and visitors of Philadelphia every day!!! Digging a little deeper, I find that the most popular hour of the day for getting a ticket is 12 noon; 5AM nets the least tickets. Thursdays see the most tickets written (Thursdays and Fridays are higher than the rest of the week; Sundays see the least (pretty obvious). Other obvious insight is that PA licensed drivers were issued the most tickets. Looking at individuals, there was one person who was issued 1,463 tickets (thats more than 1 violation per day on average) for a whopping $36,471. In just looking at a few of their tickets, it seems like it is probably a delivery vehicle that delivers to Chinatown (Tickets for “Stop Prohibited” and “Bus Only Zone” in the Chinatown area). I’d love to hear more about why this person has so many tickets and what you do about that… 1,976,559 people - let me reiterate - nearly 2 million unique vehicles have been issued fines over the three and three quarter years this data set encompasses. That’s so many!!! That is 2.85 tickets per vehicle, on average (of course that excludes all of the cars that were here and never ticketed). That makes me feel much better about how many tickets I got while I lived in the city. And… who are the agencies behind all of this? It is no surprise that PPA issues the most. There are 11 agencies in all. Seems like all of the policing agencies like to get in on the fun from time to time. Issuing Agency count PPA 4,979,292 PHILADELPHIA POLICE 611,348 CENTER CITY DISTRICT 9,628 SEPTA 9342 UPENN POLICE 6,366 TEMPLE POLICE 4,055 HOUSING AUTHORITY 2,137 PRISON CORRECTIONS OFFICER 295 POST OFFICE 121 FAIRMOUNT DISTRICT 120 Mapping the Violations Where are you most likely to get a violation? Is there anywhere that is completely safe? Looking at the city as a whole, you can see that there are some places that are “hotter” than others. I played around in cartoDB to try to visualize this as well, but tableau seemed to do a decent enough job (though these are just screenshots). Zooming in, you can see that there are some distinct areas where tickets are given out in more quantity. Looking one level deeper, you can see that there are some areas like Center City, east Washington Avenue, Passyunk Ave, and Broad Street that seem to be very highly patrolled. Summary I created the above maps in Tableau. I used R to summarize the data. The R scripts, raw and processed data, and Tableau workbook can be found in my github repo. In the next post, I use weather data and other parameters to predict how many tickets will be written on a daily basis.


A demonstration D3 project, shamelessly ripping off Gapminder.

Open Data Day - DC Hackathon

For those of you who aren’t stirred from bed in the small hours to learn data science, you might have missed that March 5th was international open data day. There are hundreds of local events around the world; I was lucky enough to attend DC’s Open Data Day Hackathon. I met a bunch of great people doing noble things with data who taught me a crap-ton (scientific term) and also validated my love for data science and how much I’ve learned since beginning my journey almost two years ago. Here is a quick rundown of what I learned and some helpful links so that you can find out more, too. Being that it is an Open Data event, everything was well documented on the hackathon hackpad. Introduction to Open Data Eric Mill gave an really nice overview of what JSON is how to use APIs to access the JSON and thus, the data the website is conveying. Though many APIs are open and documented, many are not. Eric gave some tips on how to access that data, too. This session really opened my eyes to how to access that previously unusable data that was hidden in plain sight in the text of websites. Data Science Primer This was one of the highlights for me - A couple of NIST Data Scientists, Pri Oberoi and Star Ying, gave a presentation and walkthrough on how to use k-means clustering to identify groupings in your data. The data and jupyter notebook is available on github. I will definitely be using this in my journey to better detect and remediate compromised user accounts at Comcast. Hackathon I joined a group that was working to use data science to identify Opioid overuse. Though I didn’t add much (the group was filled with some really really smart people), I was able to visualize the data using R and share some of those techniques with the team. Intro to D3 Visualizations The last session and probably my favorite was a tutorial on building out a D3 Visualization. Chris Given walked a packed house through building a D3 viz step-by-step, giving some background on why things work they work and showing some great resources. I am particularly proud of the results (though I only followed his instruction to build this). Closing I also attended 2 sessions about using the command line that totally demystified the shell prompt. All in all, it was a great two days! I will definitely be back next year (unless I can convince someone to do one in Philly).

Using the Google Search API and Plotly to Locate Waterparks

I’ve got a buddy who manages and builds waterparks. I thought to myself… I am probably the only person in the world who has a friend that works at a waterpark - cool. Then I started thinking some more… there has to be more than just his waterpark in this country; I’ve been to at least a few… and the thinking continued… I wonder how many there are… and continued… and I wonder where they are… and, well, here we are at the culmination of that curiosity with this blog post. So - the first problem - how would I figure that out? As with most things I need answers to in this world, I turned to Google and asked: Where are the waterparks in the US? The answer appears to be: there are a lot. The data is there if I can get my hands on it. Knowing that Google has an API, I signed up for an API key and away I went! Until I was stopped abruptly with limits on how many results will be returned: a measly 20 per search. I know R and wanted to use that to hit the API. Using the httr package and a for loop, I conceded to doing the search once per state and living with a maximum of 20 results per state. Easy fix. Here’s the code to generate the search string and query Google: q1 <- paste("waterparks in ", list_of_states[j,1], sep = "") response <- GET("", path = "maps/api/place/textsearch/xml", query = list(query = q1, key = "YOUR_API_KEY")) The results come back in XML (or JSON, if you so choose… I went with XML for this, though) - something that I have not had much experience in. I used the XML package and a healthy amount of more time in Google search-land and was able to parse the data into data frame! Success! Here’s a snippet of the code to get this all done: result <- xmlParse(response) result1 <- xmlRoot(result) result2 <- getNodeSet(result1, "//result") data[counter, 1] <- xmlValue(result2[[i]][["name"]]) data[counter, 2] <- xmlValue(result2[[i]][["formatted_address"]]) data[counter, 3] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lat"]]) data[counter, 4] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lng"]]) data[counter, 5] <- xmlValue(result2[[i]][["rating"]]) Now that the data is gathered and in the right shape - what is the best way to present it? I’ve recently read about a package in R named plotly. They have many interesting and interactive visualizations, plus the API plugs right into R. I found a nice example of a map using the package. With just a few lines of code and a couple iterations, I was able to generate this (click on the picture to get the full interactivity): Waterpark’s in the USA This plot can be seen here, too. Not too shabby! There are a few things to mention here… For one, not every water park has a rating; I dealt with this by making the NAs into 0s. That’s probably not the nicest way of handling that. Also - this is only the top 20 waterparks as Google decided per state. There are likely some waterparks out there that are not represented here. There are also probably non-waterparks represented here that popped up in the results. For those of you who are interested in the data or script I used to generate this map, feel free to grab them at those links. Maybe one day I’ll come back to this to find out where there are the most waterparks per capita - or some other correlation to see what the best water park really is… this is just the tip of the iceberg. It feels good to scratch a few curiosity driven scratches in one project!