# AI

### Model Comparison - ROC Curves & AUC

INTRODUCTION

Whether you are a data professional or in a job that requires data-driven decisions, predictive analytics and related products (aka machine learning, aka ML, aka artificial intelligence, aka AI) are here, and understanding them is paramount. They are being used to drive industry. Because of this, understanding how to compare predictive models is very important. This post gets into a very popular method of describing how well a model performs: the Area Under the Curve (AUC) metric.

As the term implies, AUC is a measure of the area under a curve. The curve referenced is the Receiver Operating Characteristic (ROC) curve. The ROC curve is a way to visually represent how the True Positive Rate (TPR) increases as the False Positive Rate (FPR) increases. In plain English, the ROC curve is a visualization of how well a predictive model is ordering the outcome - can it separate the two classes (TRUE/FALSE)? If not (most of the time it is not perfect), how close does it get? This last question can be answered with the AUC metric.

THE BACKGROUND

Before I explain, let's take a step back and understand the foundations of TPR and FPR. For this post we are talking about a binary prediction (TRUE/FALSE). This could be answering a question like: Is this fraud? (TRUE/FALSE). In a predictive model, you get some right and some wrong for both the TRUE and FALSE outcomes. Thus, you have four categories of outcomes:

- True positive (TP): I predicted TRUE and it was actually TRUE
- False positive (FP): I predicted TRUE and it was actually FALSE
- True negative (TN): I predicted FALSE and it was actually FALSE
- False negative (FN): I predicted FALSE and it was actually TRUE

From these, you can create a number of additional metrics that measure various things. For ROC curves, there are two that are important:

True Positive Rate aka Sensitivity (TPR): out of all the actual TRUE outcomes, how many did I predict TRUE?

$$TPR = sensitivity = \frac{TP}{TP + FN}$$

Higher is better!

False Positive Rate aka 1 - Specificity (FPR): out of all the actual FALSE outcomes, how many did I predict TRUE?

$$FPR = 1 - specificity = 1 - \frac{TN}{TN + FP} = \frac{FP}{FP + TN}$$

Lower is better!

BUILDING THE ROC CURVE

For the sake of the example, I built 3 models to compare: Random Forest, Logistic Regression, and a random prediction using a uniform distribution.

Step 1: Rank Order Predictions

To build the ROC curve for each model, you first rank order your predictions:

| Actual | Predicted |
|--------|-----------|
| FALSE  | 0.9291    |
| FALSE  | 0.9200    |
| TRUE   | 0.8518    |
| TRUE   | 0.8489    |
| TRUE   | 0.8462    |
| TRUE   | 0.7391    |

Step 2: Calculate TPR & FPR for First Iteration

Now, we step through the table. Using the first row as the "cutoff" (effectively the observation most likely to be TRUE), we say that the first row is predicted TRUE and the remaining rows are predicted FALSE. From the rank-ordered table, we can see that the first row is actually FALSE, though we are predicting it TRUE. This leads to the following metrics for our first iteration:

| Iteration | TPR | FPR | Sensitivity | Specificity | True Positive | False Positive | True Negative | False Negative |
|-----------|-----|-------|-------------|-------------|---------------|----------------|---------------|----------------|
| 1         | 0   | 0.037 | 0           | 0.963       | 0             | 1              | 26            | 11             |

This is what we'd expect. We have a 0% TPR on the first iteration because we got that single prediction wrong. Since we've only got 1 false positive, our FPR is still low: 3.7%.

Step 3: Iterate Through the Remaining Predictions

Now, let's go through all of the possible cut points and calculate the TPR and FPR.

| Actual Outcome | Predicted Outcome | Model | Rank | True Positive Rate | False Positive Rate | Sensitivity | Specificity | True Negative | True Positive | False Negative | False Positive |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FALSE | 0.9291 | Logistic Regression | 1 | 0.0000 | 0.0370 | 0.0000 | 0.9630 | 26 | 0 | 11 | 1 |
| FALSE | 0.9200 | Logistic Regression | 2 | 0.0000 | 0.0741 | 0.0000 | 0.9259 | 25 | 0 | 11 | 2 |
| TRUE  | 0.8518 | Logistic Regression | 3 | 0.0909 | 0.0741 | 0.0909 | 0.9259 | 25 | 1 | 10 | 2 |
| TRUE  | 0.8489 | Logistic Regression | 4 | 0.1818 | 0.0741 | 0.1818 | 0.9259 | 25 | 2 | 9 | 2 |
| TRUE  | 0.8462 | Logistic Regression | 5 | 0.2727 | 0.0741 | 0.2727 | 0.9259 | 25 | 3 | 8 | 2 |
| TRUE  | 0.7391 | Logistic Regression | 6 | 0.3636 | 0.0741 | 0.3636 | 0.9259 | 25 | 4 | 7 | 2 |

Step 4: Repeat Steps 1-3 for Each Model

Calculate the TPR & FPR for each rank and model!

Step 5: Plot the Results & Calculate AUC

As you can see below, the Random Forest does remarkably well. It perfectly separated the outcomes in this example (to be fair, this is really small data and test data). What I mean is, when the data is rank ordered by the predicted likelihood of being TRUE, the actual TRUE outcomes are grouped together. There are no false positives. The Area Under the Curve (AUC) is 1 ($$area = height * width$$ for a rectangle/square). Logistic Regression does well - ~80% AUC is nothing to sneeze at. The random prediction does just better than a coin flip (50% AUC), but this is just random chance and a small sample.

SUMMARY

The AUC is a very important metric for comparing models. To properly understand it, you need to understand the ROC curve and the underlying calculations. In the end, AUC shows how well a model classifies. The better it can separate the TRUEs from the FALSEs, the closer to 1 the AUC will be, which means the True Positive Rate is increasing faster than the False Positive Rate. More True Positives are better than more False Positives in prediction.
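
To make Steps 1-3 concrete, here is a minimal sketch in R of how the rank-ordered TPR/FPR points and a trapezoidal AUC could be computed. This is not the post's original code; `actual` and `predicted` below are just the six example rows shown in Step 1 (the post's table comes from the full 38-observation data set, so the numbers here will not match it).

```r
roc_points <- function(actual, predicted) {
  ord    <- order(predicted, decreasing = TRUE)   # rank order by predicted score
  actual <- actual[ord]
  tp <- cumsum(actual)                            # TRUEs captured at or above each cutoff
  fp <- cumsum(!actual)                           # FALSEs captured at or above each cutoff
  data.frame(rank = seq_along(actual),
             tpr  = tp / sum(actual),             # TP / (TP + FN)
             fpr  = fp / sum(!actual))            # FP / (FP + TN)
}

auc_trapezoid <- function(roc) {
  x <- c(0, roc$fpr)                              # anchor the curve at (0, 0)
  y <- c(0, roc$tpr)
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)  # trapezoidal rule
}

## the six example predictions from Step 1
actual    <- c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
predicted <- c(0.9291, 0.9200, 0.8518, 0.8489, 0.8462, 0.7391)

roc <- roc_points(actual, predicted)
roc
## NOTE: over just these six rows, both FALSEs outrank every TRUE, so the AUC is 0;
## the ~0.8 quoted in the post comes from the full data set.
auc_trapezoid(roc)
```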

# api

### Using R and Splunk: Lookups of More Than 10,000 Results

Splunk, for some probably very good reasons, has limits on how many results are returned by sub-searches (which in turn limits us on lookups, too). Because of this, I've used R to search Splunk through its API endpoints (using the httr package) and take advantage of loops, the plyr package, and the other data manipulation flexibility that R provides. This has allowed me to answer some questions for our business team that on the surface seem simple enough, but where the data gathering and manipulation get either too complex or too large for Splunk to handle efficiently. Here are some examples:

- Of the 1.5 million customers we've emailed in a marketing campaign, how many of them have made the conversion?
- How are our 250,000 beta users accessing the platform?
- Who are the users logging into our system from our internal IPs?

The high-level steps to using R and Splunk are:

1. Import the lookup values of concern as a CSV
2. Create the lookup as a string
3. Create the search string, including the lookup just created
4. Execute the GET to get the data
5. Read the response into a data table

I've taken this one step further; because my lookups are usually LARGE, I end up breaking the search into smaller chunks and combining the results at the end. Here is some example code that you can edit to show what I've done and how I've done it. This bit of code will iteratively run the "searchstring" 250 times and combine the results.

```r
## LIBRARY THAT ENABLES THE HTTPS CALL ##
library(httr)

## READ IN THE LOOKUP VALUES OF CONCERN ##
mylookup <- read.csv("mylookup.csv", header = FALSE)

## ARBITRARY "CHUNK" SIZE TO KEEP SEARCHES SMALLER ##
start <- 1
end <- 1000

## CREATE AN EMPTY DATA FRAME THAT WILL HOLD END RESULTS ##
alldata <- data.frame()

## HOW MANY "CHUNKS" WILL NEED TO BE RUN TO GET COMPLETE RESULTS ##
for(i in 1:250){
  ## CREATES THE LOOKUP STRING FROM THE FIRST COLUMN OF THE mylookup VARIABLE ##
  lookupstring <- paste(mylookup[start:end, 1],
                        sep = "", collapse = '" OR VAR_NAME="')

  ## CREATES THE SEARCH STRING; THIS IS A SIMPLE SEARCH EXAMPLE ##
  searchstring <- paste('index = "my_splunk_index" (VAR_NAME="',
                        lookupstring, '") | stats count BY VAR_NAME', sep = "")

  ## RUNS THE SEARCH; SUB IN YOUR SPLUNK LINK, USERNAME, AND PASSWORD ##
  response <- GET("https://our.splunk.link:8089/",
                  path = "servicesNS/admin/search/search/jobs/export",
                  encode = "form",
                  config(ssl_verifyhost = FALSE, ssl_verifypeer = 0),
                  authenticate("USERNAME", "PASSWORD"),
                  query = list(search = paste0("search ", searchstring),
                               output_mode = "csv"))

  ## CHANGES THE RESULTS TO A DATA TABLE ##
  result <- read.table(text = content(response, as = "text"), sep = ",",
                       header = TRUE, stringsAsFactors = FALSE)

  ## BINDS THE CURRENT RESULTS WITH THE OVERALL RESULTS ##
  alldata <- rbind(alldata, result)

  ## UPDATES THE START POINT ##
  start <- end + 1

  ## UPDATES THE END POINT, BUT MAKES SURE IT DOESN'T GO PAST THE END OF THE LOOKUP ##
  if((end + 1000) > nrow(mylookup)){
    end <- nrow(mylookup)
  } else {
    end <- end + 1000
  }

  ## FOR TROUBLESHOOTING, I PRINT THE ITERATION ##
  # print(i)
}

## WRITES THE RESULTS TO A CSV ##
write.table(alldata, "mydata.csv", row.names = FALSE, sep = ",")
```

So - that is how you do a giant lookup against Splunk data with R!
I am sure that there are more efficient ways of doing this, even in the Splunk app itself, but this has done the trick for me!

### Using the Google Search API and Plotly to Locate Waterparks

I’ve got a buddy who manages and builds waterparks. I thought to myself… I am probably the only person in the world who has a friend that works at a waterpark - cool. Then I started thinking some more… there has to be more than just his waterpark in this country; I’ve been to at least a few… and the thinking continued… I wonder how many there are… and continued… and I wonder where they are… and, well, here we are at the culmination of that curiosity with this blog post.

So - the first problem - how would I figure that out? As with most things I need answers to in this world, I turned to Google and asked: Where are the waterparks in the US? The answer appears to be: there are a lot. The data is there if I can get my hands on it. Knowing that Google has an API, I signed up for an API key and away I went! Until I was stopped abruptly by limits on how many results will be returned: a measly 20 per search.

I know R and wanted to use that to hit the API. Using the httr package and a for loop, I conceded to doing the search once per state and living with a maximum of 20 results per state. Easy fix. Here’s the code to generate the search string and query Google:

```r
q1 <- paste("waterparks in ", list_of_states[j, 1], sep = "")
response <- GET("https://maps.googleapis.com/",
                path = "maps/api/place/textsearch/xml",
                query = list(query = q1, key = "YOUR_API_KEY"))
```

The results come back in XML (or JSON, if you so choose… I went with XML for this, though) - something that I have not had much experience with. I used the XML package and a healthy amount of additional time in Google search-land and was able to parse the data into a data frame! Success! Here’s a snippet of the code to get this all done:

```r
result  <- xmlParse(response)
result1 <- xmlRoot(result)
result2 <- getNodeSet(result1, "//result")

data[counter, 1] <- xmlValue(result2[[i]][["name"]])
data[counter, 2] <- xmlValue(result2[[i]][["formatted_address"]])
data[counter, 3] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lat"]])
data[counter, 4] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lng"]])
data[counter, 5] <- xmlValue(result2[[i]][["rating"]])
```

Now that the data is gathered and in the right shape - what is the best way to present it? I’ve recently read about a package in R named plotly. They have many interesting and interactive visualizations, plus the API plugs right into R. I found a nice example of a map using the package. With just a few lines of code and a couple iterations, I was able to generate this (click on the picture to get the full interactivity):

Waterparks in the USA

This plot can be seen here, too. Not too shabby! There are a few things to mention here… For one, not every waterpark has a rating; I dealt with this by making the NAs into 0s. That’s probably not the nicest way of handling that. Also - this is only the top 20 waterparks per state, as decided by Google. There are likely some waterparks out there that are not represented here. There are also probably non-waterparks represented here that popped up in the results. For those of you who are interested in the data or script I used to generate this map, feel free to grab them at those links.
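
The post doesn't show the mapping code itself, so here is a rough sketch of how the parsed results might be put on a US map with the plotly package. This is not the original script; the data frame `parks`, its file name, and its column names (name, lat, lng, rating) are assumptions based on the parsing snippet above.

```r
library(plotly)

parks <- read.csv("waterparks.csv", stringsAsFactors = FALSE)  # hypothetical export of the parsed data
parks$lat    <- as.numeric(parks$lat)                          # xmlValue() returns strings
parks$lng    <- as.numeric(parks$lng)
parks$rating <- as.numeric(parks$rating)
parks$rating[is.na(parks$rating)] <- 0                         # the post's NA-to-0 shortcut

plot_geo(parks, lat = ~lat, lon = ~lng) %>%
  add_markers(text = ~paste0(name, " (rating: ", rating, ")"),
              color = ~rating, hoverinfo = "text") %>%
  layout(title = "Waterparks in the USA",
         geo = list(scope = "usa", projection = list(type = "albers usa")))
```
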
Maybe one day I’ll come back to this to find out where there are the most waterparks per capita - or some other correlation to see what the best waterpark really is… this is just the tip of the iceberg. It feels good to scratch a few curiosity-driven itches in one project!

# articles

### How Data Science Is Keeping Your Cell Phone Info Safe

I am honored to have been interviewed by Philly Magazine about how my experience in Villanova's Master's in Applied Statistics has influenced my work in analytics at Comcast!

# AUC

### Model Comparison - ROC Curves & AUC

INTRODUCTION

Whether you are a data professional or in a job that requires data-driven decisions, predictive analytics and related products (aka machine learning, aka ML, aka artificial intelligence, aka AI) are here, and understanding them is paramount. They are being used to drive industry. Because of this, understanding how to compare predictive models is very important. This post gets into a very popular method of describing how well a model performs: the Area Under the Curve (AUC) metric.

As the term implies, AUC is a measure of the area under a curve. The curve referenced is the Receiver Operating Characteristic (ROC) curve. The ROC curve is a way to visually represent how the True Positive Rate (TPR) increases as the False Positive Rate (FPR) increases. In plain English, the ROC curve is a visualization of how well a predictive model is ordering the outcome - can it separate the two classes (TRUE/FALSE)? If not (most of the time it is not perfect), how close does it get? This last question can be answered with the AUC metric.

THE BACKGROUND

Before I explain, let's take a step back and understand the foundations of TPR and FPR. For this post we are talking about a binary prediction (TRUE/FALSE). This could be answering a question like: Is this fraud? (TRUE/FALSE). In a predictive model, you get some right and some wrong for both the TRUE and FALSE outcomes. Thus, you have four categories of outcomes:

- True positive (TP): I predicted TRUE and it was actually TRUE
- False positive (FP): I predicted TRUE and it was actually FALSE
- True negative (TN): I predicted FALSE and it was actually FALSE
- False negative (FN): I predicted FALSE and it was actually TRUE

From these, you can create a number of additional metrics that measure various things. For ROC curves, there are two that are important:

True Positive Rate aka Sensitivity (TPR): out of all the actual TRUE outcomes, how many did I predict TRUE?

$$TPR = sensitivity = \frac{TP}{TP + FN}$$

Higher is better!

False Positive Rate aka 1 - Specificity (FPR): out of all the actual FALSE outcomes, how many did I predict TRUE?

$$FPR = 1 - specificity = 1 - \frac{TN}{TN + FP} = \frac{FP}{FP + TN}$$

Lower is better!

BUILDING THE ROC CURVE

For the sake of the example, I built 3 models to compare: Random Forest, Logistic Regression, and a random prediction using a uniform distribution.

Step 1: Rank Order Predictions

To build the ROC curve for each model, you first rank order your predictions:

| Actual | Predicted |
|--------|-----------|
| FALSE  | 0.9291    |
| FALSE  | 0.9200    |
| TRUE   | 0.8518    |
| TRUE   | 0.8489    |
| TRUE   | 0.8462    |
| TRUE   | 0.7391    |

Step 2: Calculate TPR & FPR for First Iteration

Now, we step through the table. Using the first row as the "cutoff" (effectively the observation most likely to be TRUE), we say that the first row is predicted TRUE and the remaining rows are predicted FALSE. From the rank-ordered table, we can see that the first row is actually FALSE, though we are predicting it TRUE. This leads to the following metrics for our first iteration:

| Iteration | TPR | FPR | Sensitivity | Specificity | True Positive | False Positive | True Negative | False Negative |
|-----------|-----|-------|-------------|-------------|---------------|----------------|---------------|----------------|
| 1         | 0   | 0.037 | 0           | 0.963       | 0             | 1              | 26            | 11             |

This is what we'd expect. We have a 0% TPR on the first iteration because we got that single prediction wrong. Since we've only got 1 false positive, our FPR is still low: 3.7%.

Step 3: Iterate Through the Remaining Predictions

Now, let's go through all of the possible cut points and calculate the TPR and FPR.

| Actual Outcome | Predicted Outcome | Model | Rank | True Positive Rate | False Positive Rate | Sensitivity | Specificity | True Negative | True Positive | False Negative | False Positive |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FALSE | 0.9291 | Logistic Regression | 1 | 0.0000 | 0.0370 | 0.0000 | 0.9630 | 26 | 0 | 11 | 1 |
| FALSE | 0.9200 | Logistic Regression | 2 | 0.0000 | 0.0741 | 0.0000 | 0.9259 | 25 | 0 | 11 | 2 |
| TRUE  | 0.8518 | Logistic Regression | 3 | 0.0909 | 0.0741 | 0.0909 | 0.9259 | 25 | 1 | 10 | 2 |
| TRUE  | 0.8489 | Logistic Regression | 4 | 0.1818 | 0.0741 | 0.1818 | 0.9259 | 25 | 2 | 9 | 2 |
| TRUE  | 0.8462 | Logistic Regression | 5 | 0.2727 | 0.0741 | 0.2727 | 0.9259 | 25 | 3 | 8 | 2 |
| TRUE  | 0.7391 | Logistic Regression | 6 | 0.3636 | 0.0741 | 0.3636 | 0.9259 | 25 | 4 | 7 | 2 |

Step 4: Repeat Steps 1-3 for Each Model

Calculate the TPR & FPR for each rank and model!

Step 5: Plot the Results & Calculate AUC

As you can see below, the Random Forest does remarkably well. It perfectly separated the outcomes in this example (to be fair, this is really small data and test data). What I mean is, when the data is rank ordered by the predicted likelihood of being TRUE, the actual TRUE outcomes are grouped together. There are no false positives. The Area Under the Curve (AUC) is 1 ($$area = height * width$$ for a rectangle/square). Logistic Regression does well - ~80% AUC is nothing to sneeze at. The random prediction does just better than a coin flip (50% AUC), but this is just random chance and a small sample.

SUMMARY

The AUC is a very important metric for comparing models. To properly understand it, you need to understand the ROC curve and the underlying calculations. In the end, AUC shows how well a model classifies. The better it can separate the TRUEs from the FALSEs, the closer to 1 the AUC will be, which means the True Positive Rate is increasing faster than the False Positive Rate. More True Positives are better than more False Positives in prediction.

# cartoDB

### Exploring Open Data - Philadelphia Parking Violations

Introduction

A few weeks ago, I stumbled across Dylan Purcell’s article on Philadelphia Parking Violations. This is a nice glimpse of the data, but I wanted to get a taste of it myself. I went and downloaded the entire data set of Parking Violations in Philadelphia from the OpenDataPhilly website and came up with a few questions after checking out the data:

- How many tickets are in the data set?
- What is the range of dates in the data? Are there missing days/data?
- What was the biggest/smallest individual fine? What were those fines for? Who issued those fines?
- What was the average individual fine amount?
- What day had the most/least count of fines? What is the average amount per day?
- How much $ in fines did they write each day?
- What hour of the day are the most fines issued?
- What day of the week are the most fines issued?
- What state has been issued the most fines?
- Who (what individual) has been issued the most fines? How much does the individual with the most fines owe the city?
- How many people have been issued fines?
- What fines are issued the most/least?

And finally to the cool stuff:

- Where were the most fines? Can I see them on a heat map?
- Can I predict the amount of parking tickets by weather data and other factors using linear regression? How about using Random Forests?

Data Insights

This data set has 5,624,084 tickets in it, spanning January 1, 2012 through September 30, 2015 - an exact range of 1,368.881 days. I was glad to find that there are no missing days in the data set.

The biggest fine, $2,000 (OUCH!), was issued (many times) by the police for “ATV on Public Property.” The smallest fine, $15, was also issued by the police, for “parking over the time limit.” The average fine for a violation in Philadelphia over the time range was $46.33.

The most violations occurred on November 30, 2012, when 6,040 were issued. The fewest, unsurprisingly, were written on Christmas Day, 2014, when only 90 were issued. On average, PPA and the other 9 agencies that issued tickets (more on that below) issued 4,105.17 tickets per day. All of those tickets add up to $190,193.50 in fines issued to the residents and visitors of Philadelphia every day!!!

Digging a little deeper, I find that the most popular hour of the day for getting a ticket is 12 noon; 5 AM nets the fewest tickets. Thursdays see the most tickets written (Thursdays and Fridays are higher than the rest of the week); Sundays see the least (pretty obvious). Another obvious insight is that PA-licensed drivers were issued the most tickets.

Looking at individuals, there was one person who was issued 1,463 tickets (that’s more than 1 violation per day on average) for a whopping $36,471. In just looking at a few of their tickets, it seems like it is probably a delivery vehicle that delivers to Chinatown (tickets for “Stop Prohibited” and “Bus Only Zone” in the Chinatown area). I’d love to hear more about why this person has so many tickets and what you do about that…

1,976,559 vehicles - let me reiterate - nearly 2 million unique vehicles have been issued fines over the three and three quarter years this data set encompasses. That’s so many!!! That is 2.85 tickets per vehicle, on average (of course that excludes all of the cars that were here and never ticketed). That makes me feel much better about how many tickets I got while I lived in the city.

And… who are the agencies behind all of this? It is no surprise that the PPA issues the most. There are 11 agencies in all.
Seems like all of the policing agencies like to get in on the fun from time to time.

| Issuing Agency | Count |
|---|---|
| PPA | 4,979,292 |
| PHILADELPHIA POLICE | 611,348 |
| CENTER CITY DISTRICT | 9,628 |
| SEPTA | 9,342 |
| UPENN POLICE | 6,366 |
| TEMPLE POLICE | 4,055 |
| HOUSING AUTHORITY | 2,137 |
| PRISON CORRECTIONS OFFICER | 295 |
| POST OFFICE | 121 |
| FAIRMOUNT DISTRICT | 120 |

Mapping the Violations

Where are you most likely to get a violation? Is there anywhere that is completely safe? Looking at the city as a whole, you can see that there are some places that are “hotter” than others. I played around in CartoDB to try to visualize this as well, but Tableau seemed to do a decent enough job (though these are just screenshots). Zooming in, you can see that there are some distinct areas where tickets are given out in greater quantity. Looking one level deeper, you can see that there are some areas like Center City, east Washington Avenue, Passyunk Ave, and Broad Street that seem to be very highly patrolled.

Summary

I created the above maps in Tableau. I used R to summarize the data. The R scripts, raw and processed data, and Tableau workbook can be found in my github repo. In the next post, I use weather data and other parameters to predict how many tickets will be written on a daily basis.
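
The actual R scripts live in the repo linked above; purely as an illustration of the kinds of aggregations described in this post, here is a minimal dplyr sketch. The file name and column names (issue_datetime, fine, agency) are assumptions, not the real OpenDataPhilly schema.

```r
library(dplyr)
library(lubridate)

## hypothetical export of the violations data with assumed column names
tickets <- read.csv("parking_violations.csv", stringsAsFactors = FALSE) %>%
  mutate(issue_datetime = ymd_hms(issue_datetime))

## overall figures: ticket count, date range, biggest/smallest/average fine
tickets %>%
  summarise(n        = n(),
            first    = min(issue_datetime),
            last     = max(issue_datetime),
            max_fine = max(fine),
            min_fine = min(fine),
            avg_fine = mean(fine))

## tickets and dollars written per day, sorted to find the busiest days
daily <- tickets %>%
  group_by(day = as.Date(issue_datetime)) %>%
  summarise(tickets = n(), dollars = sum(fine)) %>%
  arrange(desc(tickets))

## counts by hour of day, day of week, and issuing agency
tickets %>% count(hour = hour(issue_datetime), sort = TRUE)
tickets %>% count(weekday = wday(issue_datetime, label = TRUE), sort = TRUE)
tickets %>% count(agency, sort = TRUE)
```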

# conferences

### First Annual Data Jawn

I went to a data event last night called the Data Jawn, presented by RJMetrics, and wanted to share a write-up of the event – it was pretty cool and sparked some really good ideas. One idea I really liked – predicting who will call in/have an issue and proactively reaching out to them… Sounds like a game changer to me. I could then trigger alerts when we see that activity or make a list for reaching out via phone, email, text… whatever. The possibilities of this seem pretty big. There was also a lot of tweeting going on under the hashtag #datajawn. Here are some notes/takeaways from each speaker:

Bob Moore – RJMetrics CEO
- Motto of RJMetrics: “Inspire and empower data driven people”

Jake Stein – RJMetrics cofounder
- “Be data driven”
- Steps to all problem solving: collect data, analyze, present results

Madelyn Fitzgerald – RJMetrics
- “Need to be problem focused, not solution focused”
- This means that you need to ask a question of your data before building out the answer
- Having KPIs is awesome… but they need to be built to answer a question
- The most common mistake people make with data is diving into the data before asking a question

Kim Siejak – Independence Blue Cross
- IBX invested in Hadoop last year
- Doing a number of predictive models and machine learning: predicting which people will go to the hospital before they go, predicting different diseases based on health history, and predicting who will call in before they complain

David Wallace – RJMetrics
- Every Important SaaS Metric in a Single Infographic
- Document how every KPI is derived and make sure everyone understands it
- “If you’re not experimenting, you’re not learning. If you’re not learning, you’re not growing.”

Lauren Ancona / Christopher Tufts
- Did a sentiment analysis on tweets with emojis
- Pulled all the location-based tweets from the Philadelphia area and visualized them on a map using CartoDB and torque.js (really cool visualizations!)
- Lots of people use emojis! https://github.com/laurenancona/twimoji

Jim Multari – Comcast
- “Dashboards are no good for senior leaders”
- You only have 10 seconds to get your message across when talking to executives
- Alerting on KPI changes
- Four things needed to make a data driven org: the right data & insights, the right data & systems, the right people, and the right culture

Ben Garvey – RJMetrics
- Pie charts are evil - you can estimate linear distance much more easily than angular distance
- “Data visualization gives you confidence in state and trend without effort”
- You can tell the story much more easily with the right visualization

Stacey Mosley – Data Services Manager for the City of Philadelphia
- Gave a talk about how they improved the use of court time for L&I
- She didn’t share a lot about her processes or what data she used to do this…

There were a few other speakers to end the event with nice messages, but by that point I was fully tweeting and busy checking out what everyone else thought of the event. I hope that there continue to be opportunities like this locally to learn more about data analytics!

# D3.js

### Prime Number Patterns

I found a very thought provoking and beautiful visualization on the D3 Website regarding prime numbers. What the visualization shows is that if you draw periodic curves beginning at the origin for each positive integer, the prime numbers will be intersected by only two curves: the prime itself’s curve and the curve for one. When I saw this, my mind was blown. How interesting… and also how obvious. The definition of a prime is that it can only be divided by itself and one (duh). This is a visualization of that fact. The patterns that emerge are stunning. I wanted to build the data and visualization for myself in R. While not as spectacular as the original I found, it was still a nice adventure. I used Plotly to visualize the data. The code can be found on github. Here is the visualization:
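
The full code is on github; as a hedged sketch of the underlying idea (my own reconstruction, not necessarily how the original script builds it), the curve for integer k touches the axis at every multiple of k, so an integer touched by exactly two curves (k = 1 and itself) is prime. The half-sine form of the curves below is an assumption chosen for illustration.

```r
N <- 50

## which curves pass through each integer n, i.e. the divisors of n
divisors <- lapply(1:N, function(n) which(n %% 1:n == 0))
primes   <- which(lengths(divisors) == 2)   # touched by exactly two curves: 2, 3, 5, 7, ...

## data for the periodic curves themselves: curve k is a half-sine arch that
## returns to zero at every multiple of k
x <- seq(0, N, by = 0.05)
curves <- do.call(rbind, lapply(1:N, function(k) {
  data.frame(k = k, x = x, y = abs(sin(pi * x / k)) * k)
}))

## `curves` can then be handed to a plotting library (one line per k) to
## reproduce the pattern described above
```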

### GAP RE-MINDER

A demonstration D3 project, shamelessly ripping off Gapminder.

### Open Data Day - DC Hackathon

For those of you who aren’t stirred from bed in the small hours to learn data science, you might have missed that March 5th was International Open Data Day. There are hundreds of local events around the world; I was lucky enough to attend DC’s Open Data Day Hackathon. I met a bunch of great people doing noble things with data who taught me a crap-ton (scientific term) and also validated my love for data science and how much I’ve learned since beginning my journey almost two years ago. Here is a quick rundown of what I learned and some helpful links so that you can find out more, too. Being that it is an Open Data event, everything was well documented on the hackathon hackpad.

Introduction to Open Data

Eric Mill gave a really nice overview of what JSON is and how to use APIs to access the JSON and, thus, the data a website is conveying. Though many APIs are open and documented, many are not. Eric gave some tips on how to access that data, too. This session really opened my eyes to how to access the previously unusable data that was hidden in plain sight in the text of websites.

Data Science Primer

This was one of the highlights for me - a couple of NIST data scientists, Pri Oberoi and Star Ying, gave a presentation and walkthrough on how to use k-means clustering to identify groupings in your data. The data and Jupyter notebook are available on github. I will definitely be using this in my journey to better detect and remediate compromised user accounts at Comcast.

Hackathon

I joined a group that was working to use data science to identify opioid overuse. Though I didn’t add much (the group was filled with some really, really smart people), I was able to visualize the data using R and share some of those techniques with the team.

Intro to D3 Visualizations

The last session, and probably my favorite, was a tutorial on building out a D3 visualization. Chris Given walked a packed house through building a D3 viz step-by-step, giving some background on why things work the way they work and showing some great resources. I am particularly proud of the results (though I only followed his instruction to build this).

Closing

I also attended 2 sessions about using the command line that totally demystified the shell prompt. All in all, it was a great two days! I will definitely be back next year (unless I can convince someone to do one in Philly).

# data analysis

### Exploring Open Data - Philadelphia Parking Violations

Introduction

A few weeks ago, I stumbled across Dylan Purcell’s article on Philadelphia Parking Violations. This is a nice glimpse of the data, but I wanted to get a taste of it myself. I went and downloaded the entire data set of Parking Violations in Philadelphia from the OpenDataPhilly website and came up with a few questions after checking out the data:

- How many tickets are in the data set?
- What is the range of dates in the data? Are there missing days/data?
- What was the biggest/smallest individual fine? What were those fines for? Who issued those fines?
- What was the average individual fine amount?
- What day had the most/least count of fines? What is the average amount per day?
- How much $ in fines did they write each day?
- What hour of the day are the most fines issued?
- What day of the week are the most fines issued?
- What state has been issued the most fines?
- Who (what individual) has been issued the most fines? How much does the individual with the most fines owe the city?
- How many people have been issued fines?
- What fines are issued the most/least?

And finally to the cool stuff:

- Where were the most fines? Can I see them on a heat map?
- Can I predict the amount of parking tickets by weather data and other factors using linear regression? How about using Random Forests?

Data Insights

This data set has 5,624,084 tickets in it, spanning January 1, 2012 through September 30, 2015 - an exact range of 1,368.881 days. I was glad to find that there are no missing days in the data set.

The biggest fine, $2,000 (OUCH!), was issued (many times) by the police for “ATV on Public Property.” The smallest fine, $15, was also issued by the police, for “parking over the time limit.” The average fine for a violation in Philadelphia over the time range was $46.33.

The most violations occurred on November 30, 2012, when 6,040 were issued. The fewest, unsurprisingly, were written on Christmas Day, 2014, when only 90 were issued. On average, PPA and the other 9 agencies that issued tickets (more on that below) issued 4,105.17 tickets per day. All of those tickets add up to $190,193.50 in fines issued to the residents and visitors of Philadelphia every day!!!

Digging a little deeper, I find that the most popular hour of the day for getting a ticket is 12 noon; 5 AM nets the fewest tickets. Thursdays see the most tickets written (Thursdays and Fridays are higher than the rest of the week); Sundays see the least (pretty obvious). Another obvious insight is that PA-licensed drivers were issued the most tickets.

Looking at individuals, there was one person who was issued 1,463 tickets (that’s more than 1 violation per day on average) for a whopping $36,471. In just looking at a few of their tickets, it seems like it is probably a delivery vehicle that delivers to Chinatown (tickets for “Stop Prohibited” and “Bus Only Zone” in the Chinatown area). I’d love to hear more about why this person has so many tickets and what you do about that…

1,976,559 vehicles - let me reiterate - nearly 2 million unique vehicles have been issued fines over the three and three quarter years this data set encompasses. That’s so many!!! That is 2.85 tickets per vehicle, on average (of course that excludes all of the cars that were here and never ticketed). That makes me feel much better about how many tickets I got while I lived in the city.

And… who are the agencies behind all of this? It is no surprise that the PPA issues the most. There are 11 agencies in all.
Seems like all of the policing agencies like to get in on the fun from time to time.

| Issuing Agency | Count |
|---|---|
| PPA | 4,979,292 |
| PHILADELPHIA POLICE | 611,348 |
| CENTER CITY DISTRICT | 9,628 |
| SEPTA | 9,342 |
| UPENN POLICE | 6,366 |
| TEMPLE POLICE | 4,055 |
| HOUSING AUTHORITY | 2,137 |
| PRISON CORRECTIONS OFFICER | 295 |
| POST OFFICE | 121 |
| FAIRMOUNT DISTRICT | 120 |

Mapping the Violations

Where are you most likely to get a violation? Is there anywhere that is completely safe? Looking at the city as a whole, you can see that there are some places that are “hotter” than others. I played around in CartoDB to try to visualize this as well, but Tableau seemed to do a decent enough job (though these are just screenshots). Zooming in, you can see that there are some distinct areas where tickets are given out in greater quantity. Looking one level deeper, you can see that there are some areas like Center City, east Washington Avenue, Passyunk Ave, and Broad Street that seem to be very highly patrolled.

Summary

I created the above maps in Tableau. I used R to summarize the data. The R scripts, raw and processed data, and Tableau workbook can be found in my github repo. In the next post, I use weather data and other parameters to predict how many tickets will be written on a daily basis.

### Using R and Splunk: Lookups of More Than 10,000 Results

Splunk, for some probably very good reasons, has limits on how many results are returned by sub-searches (which in turn limits us on lookups, too). Because of this, I've used R to search Splunk through its API endpoints (using the httr package) and take advantage of loops, the plyr package, and the other data manipulation flexibility that R provides. This has allowed me to answer some questions for our business team that on the surface seem simple enough, but where the data gathering and manipulation get either too complex or too large for Splunk to handle efficiently. Here are some examples:

- Of the 1.5 million customers we've emailed in a marketing campaign, how many of them have made the conversion?
- How are our 250,000 beta users accessing the platform?
- Who are the users logging into our system from our internal IPs?

The high-level steps to using R and Splunk are:

1. Import the lookup values of concern as a CSV
2. Create the lookup as a string
3. Create the search string, including the lookup just created
4. Execute the GET to get the data
5. Read the response into a data table

I've taken this one step further; because my lookups are usually LARGE, I end up breaking the search into smaller chunks and combining the results at the end. Here is some example code that you can edit to show what I've done and how I've done it. This bit of code will iteratively run the "searchstring" 250 times and combine the results.

```r
## LIBRARY THAT ENABLES THE HTTPS CALL ##
library(httr)

## READ IN THE LOOKUP VALUES OF CONCERN ##
mylookup <- read.csv("mylookup.csv", header = FALSE)

## ARBITRARY "CHUNK" SIZE TO KEEP SEARCHES SMALLER ##
start <- 1
end <- 1000

## CREATE AN EMPTY DATA FRAME THAT WILL HOLD END RESULTS ##
alldata <- data.frame()

## HOW MANY "CHUNKS" WILL NEED TO BE RUN TO GET COMPLETE RESULTS ##
for(i in 1:250){
  ## CREATES THE LOOKUP STRING FROM THE FIRST COLUMN OF THE mylookup VARIABLE ##
  lookupstring <- paste(mylookup[start:end, 1],
                        sep = "", collapse = '" OR VAR_NAME="')

  ## CREATES THE SEARCH STRING; THIS IS A SIMPLE SEARCH EXAMPLE ##
  searchstring <- paste('index = "my_splunk_index" (VAR_NAME="',
                        lookupstring, '") | stats count BY VAR_NAME', sep = "")

  ## RUNS THE SEARCH; SUB IN YOUR SPLUNK LINK, USERNAME, AND PASSWORD ##
  response <- GET("https://our.splunk.link:8089/",
                  path = "servicesNS/admin/search/search/jobs/export",
                  encode = "form",
                  config(ssl_verifyhost = FALSE, ssl_verifypeer = 0),
                  authenticate("USERNAME", "PASSWORD"),
                  query = list(search = paste0("search ", searchstring),
                               output_mode = "csv"))

  ## CHANGES THE RESULTS TO A DATA TABLE ##
  result <- read.table(text = content(response, as = "text"), sep = ",",
                       header = TRUE, stringsAsFactors = FALSE)

  ## BINDS THE CURRENT RESULTS WITH THE OVERALL RESULTS ##
  alldata <- rbind(alldata, result)

  ## UPDATES THE START POINT ##
  start <- end + 1

  ## UPDATES THE END POINT, BUT MAKES SURE IT DOESN'T GO PAST THE END OF THE LOOKUP ##
  if((end + 1000) > nrow(mylookup)){
    end <- nrow(mylookup)
  } else {
    end <- end + 1000
  }

  ## FOR TROUBLESHOOTING, I PRINT THE ITERATION ##
  # print(i)
}

## WRITES THE RESULTS TO A CSV ##
write.table(alldata, "mydata.csv", row.names = FALSE, sep = ",")
```

So - that is how you do a giant lookup against Splunk data with R!
I am sure that there are more efficient ways of doing this, even in the Splunk app itself, but this has done the trick for me!

### Using the Google Search API and Plotly to Locate Waterparks

I’ve got a buddy who manages and builds waterparks. I thought to myself… I am probably the only person in the world who has a friend that works at a waterpark - cool. Then I started thinking some more… there has to be more than just his waterpark in this country; I’ve been to at least a few… and the thinking continued… I wonder how many there are… and continued… and I wonder where they are… and, well, here we are at the culmination of that curiosity with this blog post.

So - the first problem - how would I figure that out? As with most things I need answers to in this world, I turned to Google and asked: Where are the waterparks in the US? The answer appears to be: there are a lot. The data is there if I can get my hands on it. Knowing that Google has an API, I signed up for an API key and away I went! Until I was stopped abruptly by limits on how many results will be returned: a measly 20 per search.

I know R and wanted to use that to hit the API. Using the httr package and a for loop, I conceded to doing the search once per state and living with a maximum of 20 results per state. Easy fix. Here’s the code to generate the search string and query Google:

```r
q1 <- paste("waterparks in ", list_of_states[j, 1], sep = "")
response <- GET("https://maps.googleapis.com/",
                path = "maps/api/place/textsearch/xml",
                query = list(query = q1, key = "YOUR_API_KEY"))
```

The results come back in XML (or JSON, if you so choose… I went with XML for this, though) - something that I have not had much experience with. I used the XML package and a healthy amount of additional time in Google search-land and was able to parse the data into a data frame! Success! Here’s a snippet of the code to get this all done:

```r
result  <- xmlParse(response)
result1 <- xmlRoot(result)
result2 <- getNodeSet(result1, "//result")

data[counter, 1] <- xmlValue(result2[[i]][["name"]])
data[counter, 2] <- xmlValue(result2[[i]][["formatted_address"]])
data[counter, 3] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lat"]])
data[counter, 4] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lng"]])
data[counter, 5] <- xmlValue(result2[[i]][["rating"]])
```

Now that the data is gathered and in the right shape - what is the best way to present it? I’ve recently read about a package in R named plotly. They have many interesting and interactive visualizations, plus the API plugs right into R. I found a nice example of a map using the package. With just a few lines of code and a couple iterations, I was able to generate this (click on the picture to get the full interactivity):

Waterparks in the USA

This plot can be seen here, too. Not too shabby! There are a few things to mention here… For one, not every waterpark has a rating; I dealt with this by making the NAs into 0s. That’s probably not the nicest way of handling that. Also - this is only the top 20 waterparks per state, as decided by Google. There are likely some waterparks out there that are not represented here. There are also probably non-waterparks represented here that popped up in the results. For those of you who are interested in the data or script I used to generate this map, feel free to grab them at those links.
Maybe one day I’ll come back to this to find out where there are the most waterparks per capita - or some other correlation to see what the best waterpark really is… this is just the tip of the iceberg. It feels good to scratch a few curiosity-driven itches in one project!

# data jawn

### First Annual Data Jawn

I went to a data event last night called the Data Jawn, presented by RJMetrics, and wanted to share a write-up of the event – it was pretty cool and sparked some really good ideas. One idea I really liked – predicting who will call in/have an issue and proactively reaching out to them… Sounds like a game changer to me. I could then trigger alerts when we see that activity or make a list for reaching out via phone, email, text… whatever. The possibilities of this seem pretty big. There was also a lot of tweeting going on under the hashtag #datajawn. Here are some notes/takeaways from each speaker:

Bob Moore – RJMetrics CEO
- Motto of RJMetrics: “Inspire and empower data driven people”

Jake Stein – RJMetrics cofounder
- “Be data driven”
- Steps to all problem solving: collect data, analyze, present results

Madelyn Fitzgerald – RJMetrics
- “Need to be problem focused, not solution focused”
- This means that you need to ask a question of your data before building out the answer
- Having KPIs is awesome… but they need to be built to answer a question
- The most common mistake people make with data is diving into the data before asking a question

Kim Siejak – Independence Blue Cross
- IBX invested in Hadoop last year
- Doing a number of predictive models and machine learning: predicting which people will go to the hospital before they go, predicting different diseases based on health history, and predicting who will call in before they complain

David Wallace – RJMetrics
- Every Important SaaS Metric in a Single Infographic
- Document how every KPI is derived and make sure everyone understands it
- “If you’re not experimenting, you’re not learning. If you’re not learning, you’re not growing.”

Lauren Ancona / Christopher Tufts
- Did a sentiment analysis on tweets with emojis
- Pulled all the location-based tweets from the Philadelphia area and visualized them on a map using CartoDB and torque.js (really cool visualizations!)
- Lots of people use emojis! https://github.com/laurenancona/twimoji

Jim Multari – Comcast
- “Dashboards are no good for senior leaders”
- You only have 10 seconds to get your message across when talking to executives
- Alerting on KPI changes
- Four things needed to make a data driven org: the right data & insights, the right data & systems, the right people, and the right culture

Ben Garvey – RJMetrics
- Pie charts are evil - you can estimate linear distance much more easily than angular distance
- “Data visualization gives you confidence in state and trend without effort”
- You can tell the story much more easily with the right visualization

Stacey Mosley – Data Services Manager for the City of Philadelphia
- Gave a talk about how they improved the use of court time for L&I
- She didn’t share a lot about her processes or what data she used to do this…

There were a few other speakers to end the event with nice messages, but by that point I was fully tweeting and busy checking out what everyone else thought of the event. I hope that there continue to be opportunities like this locally to learn more about data analytics!

# data science

### Model Comparison - ROC Curves & AUC

INTRODUCTION

Whether you are a data professional or in a job that requires data-driven decisions, predictive analytics and related products (aka machine learning, aka ML, aka artificial intelligence, aka AI) are here, and understanding them is paramount. They are being used to drive industry. Because of this, understanding how to compare predictive models is very important. This post gets into a very popular method of describing how well a model performs: the Area Under the Curve (AUC) metric.

As the term implies, AUC is a measure of the area under a curve. The curve referenced is the Receiver Operating Characteristic (ROC) curve. The ROC curve is a way to visually represent how the True Positive Rate (TPR) increases as the False Positive Rate (FPR) increases. In plain English, the ROC curve is a visualization of how well a predictive model is ordering the outcome - can it separate the two classes (TRUE/FALSE)? If not (most of the time it is not perfect), how close does it get? This last question can be answered with the AUC metric.

THE BACKGROUND

Before I explain, let's take a step back and understand the foundations of TPR and FPR. For this post we are talking about a binary prediction (TRUE/FALSE). This could be answering a question like: Is this fraud? (TRUE/FALSE). In a predictive model, you get some right and some wrong for both the TRUE and FALSE outcomes. Thus, you have four categories of outcomes:

- True positive (TP): I predicted TRUE and it was actually TRUE
- False positive (FP): I predicted TRUE and it was actually FALSE
- True negative (TN): I predicted FALSE and it was actually FALSE
- False negative (FN): I predicted FALSE and it was actually TRUE

From these, you can create a number of additional metrics that measure various things. For ROC curves, there are two that are important:

True Positive Rate aka Sensitivity (TPR): out of all the actual TRUE outcomes, how many did I predict TRUE?

$$TPR = sensitivity = \frac{TP}{TP + FN}$$

Higher is better!

False Positive Rate aka 1 - Specificity (FPR): out of all the actual FALSE outcomes, how many did I predict TRUE?

$$FPR = 1 - specificity = 1 - \frac{TN}{TN + FP} = \frac{FP}{FP + TN}$$

Lower is better!

BUILDING THE ROC CURVE

For the sake of the example, I built 3 models to compare: Random Forest, Logistic Regression, and a random prediction using a uniform distribution.

Step 1: Rank Order Predictions

To build the ROC curve for each model, you first rank order your predictions:

| Actual | Predicted |
|--------|-----------|
| FALSE  | 0.9291    |
| FALSE  | 0.9200    |
| TRUE   | 0.8518    |
| TRUE   | 0.8489    |
| TRUE   | 0.8462    |
| TRUE   | 0.7391    |

Step 2: Calculate TPR & FPR for First Iteration

Now, we step through the table. Using the first row as the "cutoff" (effectively the observation most likely to be TRUE), we say that the first row is predicted TRUE and the remaining rows are predicted FALSE. From the rank-ordered table, we can see that the first row is actually FALSE, though we are predicting it TRUE. This leads to the following metrics for our first iteration:

| Iteration | TPR | FPR | Sensitivity | Specificity | True Positive | False Positive | True Negative | False Negative |
|-----------|-----|-------|-------------|-------------|---------------|----------------|---------------|----------------|
| 1         | 0   | 0.037 | 0           | 0.963       | 0             | 1              | 26            | 11             |

This is what we'd expect. We have a 0% TPR on the first iteration because we got that single prediction wrong. Since we've only got 1 false positive, our FPR is still low: 3.7%.

Step 3: Iterate Through the Remaining Predictions

Now, let's go through all of the possible cut points and calculate the TPR and FPR.

| Actual Outcome | Predicted Outcome | Model | Rank | True Positive Rate | False Positive Rate | Sensitivity | Specificity | True Negative | True Positive | False Negative | False Positive |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FALSE | 0.9291 | Logistic Regression | 1 | 0.0000 | 0.0370 | 0.0000 | 0.9630 | 26 | 0 | 11 | 1 |
| FALSE | 0.9200 | Logistic Regression | 2 | 0.0000 | 0.0741 | 0.0000 | 0.9259 | 25 | 0 | 11 | 2 |
| TRUE  | 0.8518 | Logistic Regression | 3 | 0.0909 | 0.0741 | 0.0909 | 0.9259 | 25 | 1 | 10 | 2 |
| TRUE  | 0.8489 | Logistic Regression | 4 | 0.1818 | 0.0741 | 0.1818 | 0.9259 | 25 | 2 | 9 | 2 |
| TRUE  | 0.8462 | Logistic Regression | 5 | 0.2727 | 0.0741 | 0.2727 | 0.9259 | 25 | 3 | 8 | 2 |
| TRUE  | 0.7391 | Logistic Regression | 6 | 0.3636 | 0.0741 | 0.3636 | 0.9259 | 25 | 4 | 7 | 2 |

Step 4: Repeat Steps 1-3 for Each Model

Calculate the TPR & FPR for each rank and model!

Step 5: Plot the Results & Calculate AUC

As you can see below, the Random Forest does remarkably well. It perfectly separated the outcomes in this example (to be fair, this is really small data and test data). What I mean is, when the data is rank ordered by the predicted likelihood of being TRUE, the actual TRUE outcomes are grouped together. There are no false positives. The Area Under the Curve (AUC) is 1 ($$area = height * width$$ for a rectangle/square). Logistic Regression does well - ~80% AUC is nothing to sneeze at. The random prediction does just better than a coin flip (50% AUC), but this is just random chance and a small sample.

SUMMARY

The AUC is a very important metric for comparing models. To properly understand it, you need to understand the ROC curve and the underlying calculations. In the end, AUC shows how well a model classifies. The better it can separate the TRUEs from the FALSEs, the closer to 1 the AUC will be, which means the True Positive Rate is increasing faster than the False Positive Rate. More True Positives are better than more False Positives in prediction.

### Machine Learning Demystified

The differences and applications of Supervised and Unsupervised Machine Learning.

Introduction

Machine learning is one of the buzziest terms thrown around in technology these days. Combine machine learning with big data in a Google search and you’ve got yourself an unmanageable amount of information to digest. In a (possibly ironic) effort to help navigate this sea of information, this post is meant to be an introduction to, and simplification of, some common machine learning terminology and types, with some resources to dive deeper.

Supervised vs. Unsupervised Machine Learning

At the highest level, there are two different types of machine learning - supervised and unsupervised. Supervised means that we have historical information in order to learn from and make future decisions; unsupervised means that we have no previous information, but might be attempting to group things together or do some other type of pattern or outlier recognition. In each of these subsets there are many methodologies and motivations; I’ll explain how they work and give a simple example or two.

Supervised Machine Learning

Supervised machine learning is nothing more than using historical information (read: data) in order to predict a future event or explain a behavior using algorithms. I know - this is vague - but humans use these algorithms, based on previous learning, every day in their lives to predict things. A very simple example: if it is sunny outside when we wake up, it is perfectly reasonable to assume that it will not rain that day. Why do we make this prediction? Because over time, we’ve learned that on sunny days it typically does not rain. We don’t know for sure that today it won’t rain, but we’re willing to make decisions based on our prediction that it won’t.

Computers do this exact same thing in order to make predictions. The real gains from supervised machine learning come when you have lots of accurate historical data. In the example above, we can’t be 100% sure that it won’t rain, because we’ve also woken up on a few sunny mornings and then driven home after work in a monsoon - adding more and more data for your supervised machine learning algorithm to learn from allows it to make concessions for these other possible outcomes.

Supervised machine learning can be used to classify (usually binary or yes/no outcomes, but it can be broader - is a person going to default on their loan? will they get divorced?) or to predict a value (how much money will you make next year? what will the stock price be tomorrow?). Some popular supervised machine learning methods are regression (linear, which can predict a continuous value, or logistic, which can predict a binary value), decision trees, k-nearest neighbors, and naive Bayes.

My favorite of these methods is decision trees. A decision tree is used to classify your data. Once the data is classified, the average is taken of each terminal node; this value is then applied to any future data that fits this classification. The decision tree above shows that if you were a female in first or second class, there was a high likelihood you survived. If you were a male in second class who was younger than 12 years old, you also had a high likelihood of surviving. This tree could be used to predict the potential outcomes of future sinking ships (morbid… I know).

Unsupervised Machine Learning

Unsupervised machine learning is the other side of this coin. In this case, we do not necessarily want to make a prediction. Instead, this type of machine learning is used to find similarities and patterns in the information in order to cluster or group it. An example of this: consider a situation where you are looking at a group of people and you want to group similar people together. You don’t know anything about these people other than what you can see in their physical appearance. You might end up grouping the tallest people together and the shortest people together. You could do this same thing by weight instead… or hair length… or eye color… or use all of these attributes at the same time! It’s natural in this example to see how “close” people are to one another based on different attributes.

What these types of algorithms do is evaluate the “distances” of one piece of information from another. In a machine learning setting you look for similarities and “closeness” in the data and group accordingly. This could allow the administrators of a mobile application to see the different types of users of their app in order to treat each group with different rules and policies. They could cluster samples of users together and analyze each cluster to see if there are opportunities for targeted improvements.

The most popular of these unsupervised machine learning methods is called k-means clustering. In k-means clustering, the goal is to partition your data into k clusters (where k is how many clusters you want - 1, 2, …, 10, etc.). To begin the algorithm, k means (or cluster centers) are randomly chosen. Each data point in the sample is assigned to the closest mean; the center (or centroid, to use the technical term) of each cluster is then calculated and becomes the new mean. This process is repeated until the mean of each cluster is optimized. The important part to note is that the output of k-means is clustered data that is “learned” without any input from a human (a minimal R sketch appears at the end of this post). Similar methods are used in Natural Language Processing (NLP) in order to do topic modeling.

Resources to Learn More

There are countless resources out there to dive deeper into this topic. Here are a few that I’ve used or found along my data science journey. UPDATE: I’ve written a whole post on this. You can find it here.

- O’Reilly has a ton of great books that focus on various areas of machine learning.
- edX and Coursera have a TON of self-paced and instructor-led courses in machine learning. There is a specific series of courses offered by Columbia University that looks particularly applicable.
- If you are interested in learning machine learning and already have a familiarity with R and statistics, DataCamp has a nice, free program. If you are new to R, they have a free program for that, too.
- There are also many, many blogs out there to read about how people are using data science and machine learning.
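
As promised above, here is a minimal k-means sketch in R. This is generic example code, not from the post; it uses the built-in iris measurements as stand-in "attributes" and asks for k = 3 clusters.

```r
data(iris)
features <- scale(iris[, 1:4])   # scale so no single attribute dominates the distances

set.seed(42)
fit <- kmeans(features, centers = 3, nstart = 25)

fit$centers                      # the learned cluster centroids
table(cluster = fit$cluster)     # how many points landed in each cluster
```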

### Prime Number Patterns

I found a very thought provoking and beautiful visualization on the D3 Website regarding prime numbers. What the visualization shows is that if you draw periodic curves beginning at the origin for each positive integer, the prime numbers will be intersected by only two curves: the prime itself’s curve and the curve for one. When I saw this, my mind was blown. How interesting… and also how obvious. The definition of a prime is that it can only be divided by itself and one (duh). This is a visualization of that fact. The patterns that emerge are stunning. I wanted to build the data and visualization for myself in R. While not as spectacular as the original I found, it was still a nice adventure. I used Plotly to visualize the data. The code can be found on github. Here is the visualization:

### Exploring Open Data - Philadelphia Parking Violations

Introduction: A few weeks ago, I stumbled across Dylan Purcell's article on Philadelphia Parking Violations. This is a nice glimpse of the data, but I wanted to get a taste of it myself. I went and downloaded the entire data set of Parking Violations in Philadelphia from the OpenDataPhilly website and came up with a few questions after checking out the data:

- How many tickets are in the data set?
- What is the range of dates in the data? Are there missing days/data?
- What was the biggest/smallest individual fine? What were those fines for? Who issued those fines?
- What was the average individual fine amount?
- What day had the most/least count of fines? What is the average amount per day? How much $ in fines did they write each day?
- What hour of the day are the most fines issued?
- What day of the week are the most fines issued?
- What state has been issued the most fines?
- Who (what individual) has been issued the most fines? How much does the individual with the most fines owe the city?
- How many people have been issued fines?
- What fines are issued the most/least?

And finally to the cool stuff:

- Where were the most fines? Can I see them on a heat map?
- Can I predict the amount of parking tickets by weather data and other factors using linear regression? How about using Random Forests?

Data Insights: This data set has 5,624,084 tickets in it, spanning January 1, 2012 through September 30, 2015 - an exact range of 1,368.881 days. I was glad to find that there are no missing days in the data set. The biggest fine, $2,000 (OUCH!), was issued (many times) by the police for "ATV on Public Property." The smallest fine, $15, was also issued by the police, for "parking over the time limit." The average fine for a violation in Philadelphia over the time range was $46.33. The most violations occurred on November 30, 2012, when 6,040 were issued. The fewest were issued, unsurprisingly, on Christmas Day 2014, when there were only 90. On average, PPA and the other 9 agencies that issued tickets (more on that below) issued 4,105.17 tickets per day. All of those tickets add up to $190,193.50 in fines issued to the residents and visitors of Philadelphia every day!!! Digging a little deeper, I find that the most popular hour of the day for getting a ticket is 12 noon; 5 AM nets the fewest tickets. Thursdays see the most tickets written (Thursdays and Fridays are higher than the rest of the week); Sundays see the least (pretty obvious). Another obvious insight is that PA-licensed drivers were issued the most tickets. Looking at individuals, there was one person who was issued 1,463 tickets (that's more than one violation per day, on average) for a whopping $36,471. In just looking at a few of their tickets, it seems like it is probably a delivery vehicle that delivers to Chinatown (tickets for "Stop Prohibited" and "Bus Only Zone" in the Chinatown area). I'd love to hear more about why this person has so many tickets and what you do about that... 1,976,559 people - let me reiterate - nearly 2 million unique vehicles have been issued fines over the three and three quarter years this data set encompasses. That's so many!!! That is 2.85 tickets per vehicle, on average (and of course that excludes all of the cars that were here and never ticketed). That makes me feel much better about how many tickets I got while I lived in the city. And... who are the agencies behind all of this? It is no surprise that PPA issues the most. There are 11 agencies in all.
Seems like all of the policing agencies like to get in on the fun from time to time.

| Issuing Agency | Count |
|---|---|
| PPA | 4,979,292 |
| PHILADELPHIA POLICE | 611,348 |
| CENTER CITY DISTRICT | 9,628 |
| SEPTA | 9,342 |
| UPENN POLICE | 6,366 |
| TEMPLE POLICE | 4,055 |
| HOUSING AUTHORITY | 2,137 |
| PRISON CORRECTIONS OFFICER | 295 |
| POST OFFICE | 121 |
| FAIRMOUNT DISTRICT | 120 |

Mapping the Violations: Where are you most likely to get a violation? Is there anywhere that is completely safe? Looking at the city as a whole, you can see that there are some places that are "hotter" than others. I played around in CartoDB to try to visualize this as well, but Tableau seemed to do a decent enough job (though these are just screenshots). Zooming in, you can see that there are some distinct areas where tickets are given out in greater quantity. Looking one level deeper, you can see that some areas like Center City, east Washington Avenue, Passyunk Avenue, and Broad Street seem to be very heavily patrolled.

Summary: I created the above maps in Tableau and used R to summarize the data. The R scripts, raw and processed data, and Tableau workbook can be found in my GitHub repo. In the next post, I use weather data and other parameters to predict how many tickets will be written on a daily basis.
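The full R scripts are in the GitHub repo linked above. As a rough illustration of the kind of per-day summary described in this post (the file name and column names here are my guesses, not the actual OpenDataPhilly schema):

```r
library(dplyr)

# Hypothetical file/column names; the real extract may differ.
violations <- read.csv("parking_violations.csv", stringsAsFactors = FALSE)

daily <- violations %>%
  mutate(issue_day = as.Date(issue_datetime)) %>%
  group_by(issue_day) %>%
  summarise(tickets = n(),
            fines   = sum(fine_amount, na.rm = TRUE))

daily %>% arrange(desc(tickets)) %>% head()  # the busiest ticketing days
mean(daily$tickets)                          # average tickets written per day
sum(daily$fines) / nrow(daily)               # average $ in fines per day
```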

### GAP RE-MINDER

A demonstration D3 project, shamelessly ripping off Gapminder.

### Open Data Day - DC Hackathon

For those of you who aren't stirred from bed in the small hours to learn data science, you might have missed that March 5th was International Open Data Day. There are hundreds of local events around the world; I was lucky enough to attend DC's Open Data Day Hackathon. I met a bunch of great people doing noble things with data who taught me a crap-ton (scientific term) and also validated my love for data science and how much I've learned since beginning my journey almost two years ago. Here is a quick rundown of what I learned and some helpful links so that you can find out more, too. Being that it is an Open Data event, everything was well documented on the hackathon hackpad.

Introduction to Open Data: Eric Mill gave a really nice overview of what JSON is and how to use APIs to access the JSON and, thus, the data a website is conveying. Though many APIs are open and documented, many are not. Eric gave some tips on how to access that data, too. This session really opened my eyes to how to access previously unusable data that was hidden in plain sight in the text of websites.

Data Science Primer: This was one of the highlights for me - a couple of NIST data scientists, Pri Oberoi and Star Ying, gave a presentation and walkthrough on how to use k-means clustering to identify groupings in your data. The data and Jupyter notebook are available on GitHub. I will definitely be using this in my journey to better detect and remediate compromised user accounts at Comcast.

Hackathon: I joined a group that was working to use data science to identify opioid overuse. Though I didn't add much (the group was filled with some really, really smart people), I was able to visualize the data using R and share some of those techniques with the team.

Intro to D3 Visualizations: The last session, and probably my favorite, was a tutorial on building out a D3 visualization. Chris Given walked a packed house through building a D3 viz step by step, giving some background on why things work the way they work and showing some great resources. I am particularly proud of the results (though I only followed his instructions to build this).

Closing: I also attended two sessions about using the command line that totally demystified the shell prompt. All in all, it was a great two days! I will definitely be back next year (unless I can convince someone to do one in Philly).

### Using the Google Search API and Plotly to Locate Waterparks

I've got a buddy who manages and builds waterparks. I thought to myself... I am probably the only person in the world who has a friend that works at a waterpark - cool. Then I started thinking some more... there has to be more than just his waterpark in this country; I've been to at least a few... and the thinking continued... I wonder how many there are... and continued... and I wonder where they are... and, well, here we are at the culmination of that curiosity with this blog post. So - the first problem - how would I figure that out? As with most things I need answers to in this world, I turned to Google and asked: Where are the waterparks in the US? The answer appears to be: there are a lot. The data is there if I can get my hands on it. Knowing that Google has an API, I signed up for an API key and away I went! Until I was stopped abruptly by limits on how many results are returned: a measly 20 per search. I know R and wanted to use it to hit the API. Using the httr package and a for loop, I conceded to doing the search once per state and living with a maximum of 20 results per state. Easy fix. Here's the code to generate the search string and query Google: q1 <- paste("waterparks in ", list_of_states[j,1], sep = "") and response <- GET("https://maps.googleapis.com/", path = "maps/api/place/textsearch/xml", query = list(query = q1, key = "YOUR_API_KEY")). The results come back in XML (or JSON, if you so choose... I went with XML for this, though) - something I have not had much experience with. I used the XML package, spent a healthy amount more time in Google search-land, and was able to parse the data into a data frame! Success! Here's a snippet of the code to get this all done: result <- xmlParse(response); result1 <- xmlRoot(result); result2 <- getNodeSet(result1, "//result"); data[counter, 1] <- xmlValue(result2[[i]][["name"]]); data[counter, 2] <- xmlValue(result2[[i]][["formatted_address"]]); data[counter, 3] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lat"]]); data[counter, 4] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lng"]]); data[counter, 5] <- xmlValue(result2[[i]][["rating"]]). Now that the data is gathered and in the right shape - what is the best way to present it? I've recently read about a package in R named plotly. They have many interesting and interactive visualizations, plus the API plugs right into R. I found a nice example of a map using the package. With just a few lines of code and a couple of iterations, I was able to generate this (click on the picture to get the full interactivity): Waterparks in the USA. This plot can be seen here, too. Not too shabby! There are a few things to mention here... For one, not every waterpark has a rating; I dealt with this by making the NAs into 0s. That's probably not the nicest way of handling that. Also - this is only the top 20 waterparks as Google decided per state. There are likely some waterparks out there that are not represented here. There are also probably non-waterparks represented here that popped up in the results. For those of you who are interested in the data or script I used to generate this map, feel free to grab them at those links.
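Pulling the snippets above together, here is roughly what the full loop might look like. The state list, the data-frame assembly, and the exact parsing call are my own reconstruction, not the author's script (which is linked above):

```r
library(httr)
library(XML)

states <- c("Pennsylvania", "New Jersey", "Delaware")  # ...all 50 in the real run
parks  <- data.frame()

for (state in states) {
  q <- paste0("waterparks in ", state)
  response <- GET("https://maps.googleapis.com/",
                  path  = "maps/api/place/textsearch/xml",
                  query = list(query = q, key = "YOUR_API_KEY"))

  # Parse the XML body and grab every <result> node.
  doc     <- xmlParse(content(response, as = "text", encoding = "UTF-8"),
                      asText = TRUE)
  results <- getNodeSet(xmlRoot(doc), "//result")

  for (r in results) {
    parks <- rbind(parks, data.frame(
      name    = xmlValue(r[["name"]]),
      address = xmlValue(r[["formatted_address"]]),
      lat     = xmlValue(r[["geometry"]][["location"]][["lat"]]),
      lng     = xmlValue(r[["geometry"]][["location"]][["lng"]]),
      rating  = xmlValue(r[["rating"]]),
      stringsAsFactors = FALSE
    ))
  }
}

# From here the points can be mapped, e.g. with plotly:
# library(plotly)
# plot_ly(parks, lon = ~as.numeric(lng), lat = ~as.numeric(lat),
#         type = "scattergeo", mode = "markers", text = ~name)
```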
Maybe one day I’ll come back to this to find out where there are the most waterparks per capita - or some other correlation to see what the best water park really is… this is just the tip of the iceberg. It feels good to scratch a few curiosity driven scratches in one project!

# logistic regression

### Identifying Compromised User Accounts with Logistic Regression

INTRODUCTION: As a Data Analyst on Comcast's Messaging Engineering team, it is my responsibility to report on platform status, identify irregularities, measure the impact of changes, and identify policies to ensure that our system is used as it was intended. Part of that last responsibility is the identification and remediation of compromised user accounts. The challenge the company faces is being able to detect account compromises faster and remediate them closer to the moment of detection. This post will focus on the methodology and process for modeling the criteria to best detect compromised user accounts in near real time from outbound email activity. For obvious reasons, I am only going to speak to the methodologies used; I'll be vague when it comes to the actual criteria we used.

DATA COLLECTION AND CLEANING: Without getting into the finer details of email delivery, there are about 43 terminating actions an email can take when it is sent out of our platform. A message can be dropped for a number of reasons: things like the IP or user being on any number of block lists, triggering our spam filters, and other abusive behaviors. The other side of that is that the message is delivered to its intended recipient. With that, I was able to create a usage profile for all of our outbound senders in small chunks of time in Splunk (our machine log collection tool of choice). This profile gives a summary, per user, of how often the messages they sent hit each of the terminating actions described above. In order to label my training data, I matched this usage data to our current compromised-account detection lists. I created a Python script that added an additional column to the data: if an account was flagged as compromised by our current criteria, it was given a one; if not, a zero. With the data collected, I was ready to determine the important inputs.

DETERMINING INPUTS FOR THE MODEL: In order to determine the important variables in the data, I created a binary regression tree in R using the rpart library. The binary regression tree iterates over the data and "splits" it in order to group compromised accounts together and non-compromised accounts together. It is also a nice way to visualize the data; you can see in the picture below what this looks like. Because the data is so large, I limited it to one-day chunks and ran a regression tree against each day separately. From that, I was able to determine that there are 6 important variables: 4 showed up in every regression tree I created, and the other 2 showed up in a majority of trees. You can determine the "important" variables by looking in the summary at the number of splits per variable.

BUILDING THE MODEL: Now that I had the important variables, I created a Python script to build the logistic regression model from them, using the statsmodels package. All of my input variables were highly significant. I took the logistic regression equation, with the coefficients given by the model, back to Splunk and tested it on incoming data to see what would come out. I quickly found that it caught many accounts that really were compromised.
There were also some accounts being flagged that looked like brute-force attacks that never got through. To adjust for that, I added a constraint to the model: the user must have completed at least one terminating action confirming that they authenticated successfully (this rules out users coming from a ton of IPs but failing authentication every time).

CONCLUSION: First and foremost, this writeup is intended to be a very high-level summary explaining the steps I took to get to my final model. What isn't explained here is how many models I built that were less successful. Though this combination worked for me in the end, you'll likely need to iterate over the process a number of times to get something successful. The new detection method for compromised accounts is an opportunity for us to expand our compromise detection and do it in a more real-time manner. It is also a foundation for future detection techniques for malicious IPs and other actors. With this new method, we will be able to expand the activity types used for compromise detection beyond outbound email activity to things like preference changes, password resets, changes to forwarding addresses, and even application activity outside of the email platform.
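The post intentionally keeps the real criteria vague. Purely as an illustrative sketch of the two modeling steps described above - a regression tree to surface important variables, then a logistic regression on those variables - here is what that might look like in R. The column names are hypothetical, and the author's actual logistic model was fit in Python with statsmodels rather than R's glm:

```r
library(rpart)

# Hypothetical usage profile; the real features and labels are not public.
profile <- read.csv("outbound_usage_profile.csv")

# Step 1: a binary regression tree to surface the variables that best split
# compromised (1) vs. non-compromised (0) accounts.
tree <- rpart(compromised ~ ., data = profile, method = "class")
summary(tree)   # inspect which variables drive the most splits

# Step 2: a logistic regression on the important variables
# (illustrative names; R's glm stands in for the statsmodels fit).
model <- glm(compromised ~ dropped_msgs + delivered_msgs + blocklist_hits +
               spam_filter_hits,
             data = profile, family = binomial)
summary(model)  # coefficients and significance

# Score new activity: predicted probability that an account is compromised.
profile$p_compromised <- predict(model, newdata = profile, type = "response")
```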

# machine learning

### Machine Learning Demystified

The differences and applications of Supervised and Unsupervised Machine Learning.

Introduction: Machine learning is one of the buzziest terms thrown around in technology these days. Combine machine learning with big data in a Google search and you've got yourself an unmanageable amount of information to digest. In a (possibly ironic) effort to help navigate this sea of information, this post is meant to be an introduction to, and simplification of, some common machine learning terminology and types, with some resources to dive deeper.

Supervised vs. Unsupervised Machine Learning: At the highest level, there are two different types of machine learning - supervised and unsupervised. Supervised means that we have historical information in order to learn from and make future decisions; unsupervised means that we have no previous information, but might be attempting to group things together or do some other type of pattern or outlier recognition. In each of these subsets there are many methodologies and motivations; I'll explain how they work and give a simple example or two.

Supervised Machine Learning: Supervised machine learning is nothing more than using historical information (read: data) in order to predict a future event or explain a behavior using algorithms. I know - this is vague - but humans use these kinds of algorithms, based on previous learning, every day in their lives to predict things. A very simple example: if it is sunny outside when we wake up, it is perfectly reasonable to assume that it will not rain that day. Why do we make this prediction? Because over time, we've learned that on sunny days it typically does not rain. We don't know for sure that today it won't rain, but we're willing to make decisions based on our prediction that it won't. Computers do this exact same thing in order to make predictions. The real gains from supervised machine learning come when you have lots of accurate historical data. In the example above, we can't be 100% sure that it won't rain because we've also woken up on a few sunny mornings and then driven home from work in a monsoon - adding more and more data for your supervised machine learning algorithm to learn from allows it to make concessions for these other possible outcomes. Supervised machine learning can be used to classify (usually binary or yes/no outcomes, but it can be broader - is a person going to default on their loan? will they get divorced?) or to predict a value (how much money will you make next year? what will the stock price be tomorrow?). Some popular supervised machine learning methods are regression (linear, which can predict a continuous value, or logistic, which can predict a binary value), decision trees, k-nearest neighbors, and naive Bayes. My favorite of these methods is the decision tree. A decision tree is used to classify your data; once the data is classified, the average of each terminal node is taken, and that value is then applied to any future data that falls into that classification. The decision tree above shows that if you were a female in first or second class, there was a high likelihood you survived. If you were a male in second class who was younger than 12 years old, you also had a high likelihood of surviving. This tree could be used to predict the potential outcomes of future sinking ships (morbid... I know).
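The tree in the figure isn't reproduced here, but a similar one can be grown in a few lines of R with rpart on the built-in Titanic summary table (my own sketch, not the post's code):

```r
library(rpart)

# R ships a summarized Titanic contingency table; expand it to one row per
# passenger group with a frequency weight.
titanic <- subset(as.data.frame(Titanic), Freq > 0)

# Classify survival from class, sex, and age group.
tree <- rpart(Survived ~ Class + Sex + Age,
              data = titanic, weights = Freq, method = "class")

print(tree)
# plot(tree); text(tree)   # draws the splits described above
```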
Unsupervised Machine Learning: Unsupervised machine learning is the other side of this coin. In this case, we do not necessarily want to make a prediction. Instead, this type of machine learning is used to find similarities and patterns in the information in order to cluster or group it. An example: consider a situation where you are looking at a group of people and you want to group similar people together. You don't know anything about these people other than what you can see in their physical appearance. You might end up grouping the tallest people together and the shortest people together. You could do the same thing by weight instead... or hair length... or eye color... or use all of these attributes at the same time! It's natural in this example to see how "close" people are to one another based on different attributes. What these types of algorithms do is evaluate the "distances" of one piece of information from another piece. In a machine learning setting, you look for similarities and "closeness" in the data and group accordingly. This could allow the administrators of a mobile application to see the different types of users of their app in order to treat each group with different rules and policies. They could cluster samples of users together and analyze each cluster to see if there are opportunities for targeted improvements.

The most popular of these unsupervised machine learning methods is called k-means clustering. In k-means clustering, the goal is to partition your data into k clusters (where k is how many clusters you want - 1, 2, ..., 10, etc.). To begin the algorithm, k means (or cluster centers) are chosen at random. Each data point in the sample is assigned to the closest mean; the center (or centroid, to use the technical term) of each resulting cluster is then calculated and becomes the new mean. This process repeats until the cluster assignments stop changing. The important part to note is that the output of k-means is clustered data that is "learned" without any input from a human. Similar methods are used in Natural Language Processing (NLP) in order to do Topic Modeling.

Resources to Learn More: There are countless resources out there to dive deeper into this topic. Here are a few that I've used or found along my Data Science journey. UPDATE: I've written a whole post on this. You can find it here. O'Reilly has a ton of great books that focus on various areas of machine learning. edX and Coursera have a TON of self-paced and instructor-led courses in machine learning. There is a specific series of courses offered by Columbia University that looks particularly applicable. If you are interested in learning machine learning and already have a familiarity with R and statistics, DataCamp has a nice, free program. If you are new to R, they have a free program for that, too. There are also many, many blogs out there to read about how people are using data science and machine learning.

# model comparison

### Model Comparison - ROC Curves & AUC

INTRODUCTION: Whether you are a data professional or in a job that requires data-driven decisions, predictive analytics and related products (aka machine learning aka ML aka artificial intelligence aka AI) are here, and understanding them is paramount. They are being used to drive industry. Because of this, understanding how to compare predictive models is very important. This post gets into a very popular method of describing how well a model performs: the Area Under the Curve (AUC) metric. As the term implies, AUC is a measure of the area under a curve. The curve referenced is the Receiver Operating Characteristic (ROC) curve. The ROC curve is a way to visually represent how the True Positive Rate (TPR) increases as the False Positive Rate (FPR) increases. In plain English, the ROC curve is a visualization of how well a predictive model is ordering the outcome - can it separate the two classes (TRUE/FALSE)? If not (most of the time it is not perfect), how close does it get? This last question can be answered with the AUC metric.

THE BACKGROUND: Before I explain, let's take a step back and understand the foundations of TPR and FPR. For this post we are talking about a binary prediction (TRUE/FALSE). This could be answering a question like: Is this fraud? (TRUE/FALSE). In a predictive model, you get some right and some wrong for both TRUE and FALSE. Thus, you have four categories of outcomes:

- True positive (TP): I predicted TRUE and it was actually TRUE
- False positive (FP): I predicted TRUE and it was actually FALSE
- True negative (TN): I predicted FALSE and it was actually FALSE
- False negative (FN): I predicted FALSE and it was actually TRUE

From these, you can create a number of additional metrics that measure various things. For ROC curves, there are two that are important:

True Positive Rate aka Sensitivity (TPR): out of all the actual TRUE outcomes, how many did I predict TRUE? $$TPR = sensitivity = \frac{TP}{TP + FN}$$ Higher is better!

False Positive Rate aka 1 - Specificity (FPR): out of all the actual FALSE outcomes, how many did I predict TRUE? $$FPR = 1 - specificity = 1 - \frac{TN}{TN + FP} = \frac{FP}{FP + TN}$$ Lower is better!

BUILDING THE ROC CURVE: For the sake of the example, I built 3 models to compare: Random Forest, Logistic Regression, and a random prediction drawn from a uniform distribution.

Step 1: Rank Order Predictions. To build the ROC curve for each model, you first rank order your predictions:

| Actual | Predicted |
|---|---|
| FALSE | 0.9291 |
| FALSE | 0.9200 |
| TRUE | 0.8518 |
| TRUE | 0.8489 |
| TRUE | 0.8462 |
| TRUE | 0.7391 |

Step 2: Calculate TPR & FPR for the First Iteration. Now, we step through the table. Using the first row as the "cutoff" (effectively the most likely to be TRUE), we say that the first row is predicted TRUE and the remaining rows are predicted FALSE. From the table below, we can see that the first row is actually FALSE, though we are predicting it TRUE. This leads to the following metrics for our first iteration:

| Iteration | TPR | FPR | Sensitivity | Specificity | True Positive | False Positive | True Negative | False Negative |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0.037 | 0 | 0.963 | 0 | 1 | 26 | 11 |

This is what we'd expect. We have a 0% TPR on the first iteration because we got that single prediction wrong. Since we've only got 1 false positive, our FPR is still low: 3.7%.

Step 3: Iterate Through the Remaining Predictions. Now, let's go through all of the possible cut points and calculate the TPR and FPR.
| Actual Outcome | Predicted Outcome | Model | Rank | True Positive Rate | False Positive Rate | Sensitivity | Specificity | True Negative | True Positive | False Negative | False Positive |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FALSE | 0.9291 | Logistic Regression | 1 | 0.0000 | 0.0370 | 0.0000 | 0.9630 | 26 | 0 | 11 | 1 |
| FALSE | 0.9200 | Logistic Regression | 2 | 0.0000 | 0.0741 | 0.0000 | 0.9259 | 25 | 0 | 11 | 2 |
| TRUE | 0.8518 | Logistic Regression | 3 | 0.0909 | 0.0741 | 0.0909 | 0.9259 | 25 | 1 | 10 | 2 |
| TRUE | 0.8489 | Logistic Regression | 4 | 0.1818 | 0.0741 | 0.1818 | 0.9259 | 25 | 2 | 9 | 2 |
| TRUE | 0.8462 | Logistic Regression | 5 | 0.2727 | 0.0741 | 0.2727 | 0.9259 | 25 | 3 | 8 | 2 |
| TRUE | 0.7391 | Logistic Regression | 6 | 0.3636 | 0.0741 | 0.3636 | 0.9259 | 25 | 4 | 7 | 2 |

Step 4: Repeat Steps 1-3 for Each Model. Calculate the TPR & FPR for each rank and model!

Step 5: Plot the Results & Calculate AUC. As you can see below, the Random Forest does remarkably well. It perfectly separated the outcomes in this example (to be fair, this is a really small test data set). What I mean is, when the data is rank ordered by the predicted likelihood of being TRUE, the actual TRUE outcomes are grouped together; there are no false positives. The Area Under the Curve (AUC) is 1 ($$area = height \times width$$ for a rectangle/square). Logistic Regression does well, too - an AUC of ~80% is nothing to sneeze at. The random prediction does just better than a coin flip (50% AUC), but this is just random chance and a small sample.

SUMMARY: The AUC is a very important metric for comparing models. To properly understand it, you need to understand the ROC curve and the underlying calculations. In the end, AUC shows how well a model classifies. The better it can separate the TRUEs from the FALSEs, the closer the AUC will be to 1. This means the True Positive Rate is increasing faster than the False Positive Rate. More True Positives are better than more False Positives in prediction.
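The plots were generated elsewhere, but the rank-and-iterate procedure above is easy to reproduce. Here is a minimal, self-contained R sketch (made-up labels and scores, not the post's data) that steps through rank-ordered predictions and integrates the resulting curve with the trapezoidal rule:

```r
# Toy labels and predicted probabilities (illustrative only).
actual <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
pred   <- c(0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.40, 0.20)

# Rank order by predicted likelihood of TRUE.
ord    <- order(pred, decreasing = TRUE)
actual <- actual[ord]

P <- sum(actual)    # number of actual TRUEs
N <- sum(!actual)   # number of actual FALSEs

# After each cutoff (row), how many TRUEs/FALSEs have been predicted TRUE?
tpr <- cumsum(actual)  / P   # True Positive Rate at each cut point
fpr <- cumsum(!actual) / N   # False Positive Rate at each cut point

roc <- data.frame(rank = seq_along(actual), tpr, fpr)

# Prepend the (0, 0) point and integrate with trapezoids to get AUC
# (about 0.75 for this toy data).
x <- c(0, fpr); y <- c(0, tpr)
auc <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
roc; auc
```

Packages like pROC or ROCR do the same bookkeeping for you, but the hand-rolled version mirrors the step-by-step tables above.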

# open data

### Prime Number Patterns


# ROC curves

### Model Comparison - ROC Curves & AUC


# supervised learning

### Machine Learning Demystified


# tableau

### Exploring Open Data - Philadelphia Parking Violations


# tableau public

### Jake Learns Data Science Visitor Dashboard

A quick view of visitors to my website. Data pulled from Google Analytics and pushed to Amazon Redshift using Stitch Data.

### Twitter Analysis - R Shiny App

I created a Shiny app that searches Twitter and does some simple analysis.

# unsupervised learning

### Machine Learning Demystified
