AI

Model Comparison - ROC Curves & AUC

INTRODUCTION

Whether you are a data professional or in a job that requires data-driven decisions, predictive analytics and related products (aka machine learning, aka ML, aka artificial intelligence, aka AI) are here and understanding them is paramount. They are being used to drive industry. Because of this, understanding how to compare predictive models is very important. This post gets into a very popular method of describing how well a model performs: the Area Under the Curve (AUC) metric.

As the term implies, AUC is a measure of the area under the curve. The curve referenced is the Receiver Operating Characteristic (ROC) curve. The ROC curve is a way to visually represent how the True Positive Rate (TPR) increases as the False Positive Rate (FPR) increases. In plain English, the ROC curve is a visualization of how well a predictive model is ordering the outcome - can it separate the two classes (TRUE/FALSE)? If not (most of the time it is not perfect), how close does it get? This last question can be answered with the AUC metric.

THE BACKGROUND

Before I explain, let's take a step back and understand the foundations of TPR and FPR. For this post we are talking about a binary prediction (TRUE/FALSE). This could be answering a question like: Is this fraud? (TRUE/FALSE). In a predictive model, you get some right and some wrong for both the TRUE and FALSE. Thus, you have four categories of outcomes:

True positive (TP): I predicted TRUE and it was actually TRUE
False positive (FP): I predicted TRUE and it was actually FALSE
True negative (TN): I predicted FALSE and it was actually FALSE
False negative (FN): I predicted FALSE and it was actually TRUE

From these, you can create a number of additional metrics that measure various things. For ROC curves, there are two that are important:

True Positive Rate aka Sensitivity (TPR): out of all the actual TRUE outcomes, how many did I predict TRUE? \(TPR = sensitivity = \frac{TP}{TP + FN}\) Higher is better!

False Positive Rate aka 1 - Specificity (FPR): out of all the actual FALSE outcomes, how many did I predict TRUE? \(FPR = 1 - specificity = 1 - \frac{TN}{TN + FP} = \frac{FP}{FP + TN}\) Lower is better!

BUILDING THE ROC CURVE

For the sake of the example, I built 3 models to compare: Random Forest, Logistic Regression, and a random prediction using a uniform distribution.

Step 1: Rank Order Predictions

To build the ROC curve for each model, you first rank order your predictions:

Actual | Predicted
FALSE  | 0.9291
FALSE  | 0.9200
TRUE   | 0.8518
TRUE   | 0.8489
TRUE   | 0.8462
TRUE   | 0.7391

Step 2: Calculate TPR & FPR for the First Iteration

Now, we step through the table. Using the first row as the "cutoff" (effectively the most likely to be TRUE), we say that the first row is predicted TRUE and the remaining are predicted FALSE. From the table below, we can see that the first row is actually FALSE, though we are predicting it TRUE. This leads to the following metrics for our first iteration:

Iteration | TPR | FPR   | Sensitivity | Specificity | True.Positive | False.Positive | True.Negative | False.Negative
1         | 0   | 0.037 | 0           | 0.963       | 0             | 1              | 26            | 11

This is what we'd expect. We have a 0% TPR on the first iteration because we got that single prediction wrong. Since we've only got 1 false positive, our FPR is still low: 3.7%.

Step 3: Iterate Through the Remaining Predictions

Now, let's go through all of the possible cut points and calculate the TPR and FPR.
Actual Outcome | Predicted Outcome | Model | Rank | True Positive Rate | False Positive Rate | Sensitivity | Specificity | True Negative | True Positive | False Negative | False Positive
FALSE | 0.9291 | Logistic Regression | 1 | 0.0000 | 0.0370 | 0.0000 | 0.9630 | 26 | 0 | 11 | 1
FALSE | 0.9200 | Logistic Regression | 2 | 0.0000 | 0.0741 | 0.0000 | 0.9259 | 25 | 0 | 11 | 2
TRUE  | 0.8518 | Logistic Regression | 3 | 0.0909 | 0.0741 | 0.0909 | 0.9259 | 25 | 1 | 10 | 2
TRUE  | 0.8489 | Logistic Regression | 4 | 0.1818 | 0.0741 | 0.1818 | 0.9259 | 25 | 2 | 9  | 2
TRUE  | 0.8462 | Logistic Regression | 5 | 0.2727 | 0.0741 | 0.2727 | 0.9259 | 25 | 3 | 8  | 2
TRUE  | 0.7391 | Logistic Regression | 6 | 0.3636 | 0.0741 | 0.3636 | 0.9259 | 25 | 4 | 7  | 2

Step 4: Repeat Steps 1-3 for Each Model

Calculate the TPR & FPR for each rank and model!

Step 5: Plot the Results & Calculate AUC

As you can see below, the Random Forest does remarkably well. It perfectly separated the outcomes in this example (to be fair, this is really small data and test data). What I mean is, when the data is rank ordered by the predicted likelihood of being TRUE, the actual TRUE outcomes are grouped together. There are no false positives. The Area Under the Curve (AUC) is 1 (\(area = height \times width\) for a rectangle/square). Logistic Regression does well - ~80% AUC is nothing to sneeze at. The random prediction does just better than a coin flip (50% AUC), but this is just random chance and a small sample.

SUMMARY

The AUC is a very important metric for comparing models. To properly understand it, you need to understand the ROC curve and the underlying calculations. In the end, AUC shows how well a model classifies. The better it can separate the TRUEs from the FALSEs, the closer to 1 the AUC will be. This means the True Positive Rate is increasing faster than the False Positive Rate. More True Positives is better than more False Positives in prediction.
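To make the calculation concrete, here is a minimal R sketch (not the code used for this post - the actual/predicted vectors and variable names are illustrative assumptions) that builds the ROC points by stepping through the ranked predictions and approximates AUC with the trapezoidal rule:

## TOY DATA - ACTUAL LABELS AND PREDICTED PROBABILITIES (ILLUSTRATIVE ONLY)
actual    <- c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
predicted <- c(0.9291, 0.9200, 0.8518, 0.8489, 0.8462, 0.7391)

## RANK ORDER BY PREDICTED LIKELIHOOD OF BEING TRUE
ord    <- order(predicted, decreasing = TRUE)
actual <- actual[ord]

## AT EACH CUTOFF, THE TOP i ROWS ARE CALLED TRUE AND THE REST FALSE
tpr <- cumsum(actual) / sum(actual)    ## TP / (TP + FN)
fpr <- cumsum(!actual) / sum(!actual)  ## FP / (FP + TN)

## ADD THE (0, 0) STARTING POINT AND INTEGRATE WITH THE TRAPEZOIDAL RULE
fpr <- c(0, fpr)
tpr <- c(0, tpr)
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

plot(fpr, tpr, type = "s", xlab = "False Positive Rate", ylab = "True Positive Rate")

Note that the denominators in the post's table come from the full data set (27 actual FALSEs and 11 actual TRUEs), so running this on just the six excerpted rows will give different rates than the table above.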

A Data Scientist's Take on the Roadmap to AI

INTRODUCTION

Recently I was asked by a former colleague about getting into AI. He has truly big data and wants to use this data to power "AI" - if the headlines are to be believed, everyone else is already doing it. Though it was difficult for my ego, I told him I couldn't help him in our 30 minute call and that he should think about hiring someone to get him there. The truth was I really didn't have a solid answer for him in the moment. This was truly disappointing - in my current role and in my previous role, I put predictive models into production. After thinking about it for a bit, there is definitely a similar path I took in both roles. There are three steps in my mind to getting to "AI." Though this seems simple, it is a long process and potentially not linear - you may have to keep coming back to previous steps.

Baseline (Reporting)
Understand (Advanced Analytics)
Artificial Intelligence (Data Science)

BASELINE (REPORTING)

Fun fact: you cannot effectively predict anything if you cannot measure the impact. What I mean by baseline is building out a reporting suite. Having a fundamental understanding of your business and environment is key. Without doing this step, you may try to predict the wrong thing entirely - or start with something that isn't the most impactful. For me, this step started with finding the data in the first place. Perhaps, like my colleague, you have lots of data and you're ready to jump in. That's great and makes getting started that much more straightforward. In my role, I joined a finance team that really didn't have a good bead on this - finding the data was difficult (and getting the owners of that data to give me access was a process as well). To be successful, start small and iterate. Our first reports were built from manually downloading machine logs, processing them in R with JSON packages, and turning them into a black-and-white document. It was ugly, but it helped us know what we needed to know in that moment - oh yeah… it was MUCH better than nothing. "Don't let perfection be the enemy of good." - paraphrased from Voltaire. From this, I gained access to our organization's data warehouse, put automation in place, and purchased some Tableau licenses. This phase took a few months and is constantly being refined, but we are now able to see the impact of our decisions at a glance. This new understanding inevitably leads to more questions - cue step 2: Understanding.

UNDERSTANDING (ADVANCED ANALYTICS)

If you have never circulated reports and dashboards to others… let me fill you in on something: it will ALWAYS lead to additional, progressively harder questions. This step is an investment in time and expertise - you have to commit to having dedicated resource(s) (read: people… it is inhumane to call people resources, and you may only need one person or some of a full-time person's time).

Why did X go up unexpectedly (breaking the current trend)?
Are we over-indexing on this type of customer?
Right before our customer leaves, this weird thing happens - what is this weird thing and why is it happening?

Like the previous step, this will be ongoing. Investing in someone to do advanced analytics will help you to understand the fine details of your business AND … (drum roll) … will help you to understand which part of your business is most ripe for "AI"!

ARTIFICIAL INTELLIGENCE (DATA SCIENCE)

It is at this point that you will be able to do real, bona fide data science.
A quick rant: notice that I purposefully did not use the term "AI" (I know I used it throughout this article and even in the title of this section… what can I say - I am in tune with marketing concepts, too). "AI" is a term that is overused and rarely implemented. Data science, however, comes in many forms and can really transform your business. Here are a few ideas for what you can do with data science:

Prediction/Machine Learning
Testing
Graph Analysis

Perhaps you want to predict whether a sale is fraud or which existing customer is most apt to buy your new product? You can also test whether a new strategy works better than the old. This requires that you use statistical concepts to ensure valid testing and results. My new obsession is around graph analysis. With graphs you can see relationships that may have been hidden before - this will enable you to identify new targets and enrich your understanding of your business! Data science is usually a very specific thing and takes many forms!

SUMMARY

Getting to data science is a process - it will take an investment. There are products out there that will help you shortcut some of these steps, and I encourage you to consider them. There are products to help with reporting, analytics, and data science. These should, in my very humble opinion, be used by people who are dedicated to the organization's data, analytics, and science. Directions for data science - measure, analyze, predict, repeat!

api

Using R and Splunk: Lookups of More Than 10,000 Results

Splunk, for some probably very good reasons, has limits on how many results are returned by sub-searches (which in turn limits us on lookups, too). Because of this, I've used R to search Splunk through its API endpoints (using the httr package) and utilize loops, the plyr package, and other data manipulation flexibilities given through the use of R. This has allowed me to answer some questions for our business team that at the surface seem simple enough, but the data gathering and manipulation get either too complex or large for Splunk to handle efficiently. Here are some examples:

Of the 1.5 million customers we've emailed in a marketing campaign, how many of them have made the conversion?
How are our 250,000 beta users accessing the platform?
Who are the users logging into our system from our internal IPs?

The high level steps to using R and Splunk are:

Import the lookup values of concern as a csv
Create the lookup as a string
Create the search string, including the lookup just created
Execute the GET to get the data
Read the response into a data table

I've taken this one step further; because my lookups are usually LARGE, I end up breaking up the search into smaller chunks and combining the results at the end. Here is some example code that you can edit to show what I've done and how I've done it. This bit of code will iteratively run the "searchstring" 250 times (1,000 lookup values per chunk) and combine the results.

## LIBRARY THAT ENABLES THE HTTPS CALL ##
library(httr)

## READ IN THE LOOKUP VALUES OF CONCERN ##
mylookup <- read.csv("mylookup.csv", header = FALSE)

## ARBITRARY "CHUNK" SIZE TO KEEP SEARCHES SMALLER ##
start <- 1
end <- 1000

## CREATE AN EMPTY DATA FRAME THAT WILL HOLD END RESULTS ##
alldata <- data.frame()

## HOW MANY "CHUNKS" WILL NEED TO BE RUN TO GET COMPLETE RESULTS ##
for(i in 1:250){

  ## CREATES THE LOOKUP STRING FROM THE mylookup VARIABLE ##
  lookupstring <- paste(mylookup[start:end, 1], sep = "", collapse = '" OR VAR_NAME="')

  ## CREATES THE SEARCH STRING; THIS IS A SIMPLE SEARCH EXAMPLE ##
  searchstring <- paste('index = "my_splunk_index" (VAR_NAME="', lookupstring, '") | stats count BY VAR_NAME', sep = "")

  ## RUNS THE SEARCH; SUB IN YOUR SPLUNK LINK, USERNAME, AND PASSWORD ##
  response <- GET("https://our.splunk.link:8089/",
                  path = "servicesNS/admin/search/search/jobs/export",
                  encode = "form",
                  config(ssl_verifyhost = FALSE, ssl_verifypeer = 0),
                  authenticate("USERNAME", "PASSWORD"),
                  query = list(search = paste0("search ", searchstring, collapse = "", sep = ""),
                               output_mode = "csv"))

  ## CHANGES THE RESULTS TO A DATA TABLE ##
  result <- read.table(text = content(response, as = "text"), sep = ",", header = TRUE, stringsAsFactors = FALSE)

  ## BINDS THE CURRENT RESULTS WITH THE OVERALL RESULTS ##
  alldata <- rbind(alldata, result)

  ## UPDATES THE START POINT ##
  start <- end + 1

  ## UPDATES THE END POINT, BUT MAKES SURE IT DOESN'T GO PAST THE END OF THE LOOKUP ##
  if((end + 1000) > nrow(mylookup)){
    end <- nrow(mylookup)
  } else {
    end <- end + 1000
  }

  ## FOR TROUBLESHOOTING, I PRINT THE ITERATION ##
  #print(i)
}

## WRITES THE RESULTS TO A CSV ##
write.table(alldata, "mydata.csv", row.names = FALSE, sep = ",")

So - that is how you do a giant lookup against Splunk data with R! I am sure that there are more efficient ways of doing this, even in the Splunk app itself, but this has done the trick for me!

Using the Google Search API and Plotly to Locate Waterparks

I've got a buddy who manages and builds waterparks. I thought to myself… I am probably the only person in the world who has a friend that works at a waterpark - cool. Then I started thinking some more… there has to be more than just his waterpark in this country; I've been to at least a few… and the thinking continued… I wonder how many there are… and continued… and I wonder where they are… and, well, here we are at the culmination of that curiosity with this blog post.

So - the first problem - how would I figure that out? As with most things I need answers to in this world, I turned to Google and asked: Where are the waterparks in the US? The answer appears to be: there are a lot. The data is there if I can get my hands on it. Knowing that Google has an API, I signed up for an API key and away I went! Until I was stopped abruptly with limits on how many results will be returned: a measly 20 per search.

I know R and wanted to use that to hit the API. Using the httr package and a for loop, I conceded to doing the search once per state and living with a maximum of 20 results per state. Easy fix. Here's the code to generate the search string and query Google:

q1 <- paste("waterparks in ", list_of_states[j, 1], sep = "")
response <- GET("https://maps.googleapis.com/",
                path = "maps/api/place/textsearch/xml",
                query = list(query = q1, key = "YOUR_API_KEY"))

The results come back in XML (or JSON, if you so choose… I went with XML for this, though) - something that I have not had much experience in. I used the XML package and a healthy amount of more time in Google search-land and was able to parse the data into a data frame! Success! Here's a snippet of the code to get this all done:

result <- xmlParse(response)
result1 <- xmlRoot(result)
result2 <- getNodeSet(result1, "//result")
data[counter, 1] <- xmlValue(result2[[i]][["name"]])
data[counter, 2] <- xmlValue(result2[[i]][["formatted_address"]])
data[counter, 3] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lat"]])
data[counter, 4] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lng"]])
data[counter, 5] <- xmlValue(result2[[i]][["rating"]])

Now that the data is gathered and in the right shape - what is the best way to present it? I've recently read about a package in R named plotly. They have many interesting and interactive visualizations, plus the API plugs right into R. I found a nice example of a map using the package. With just a few lines of code and a couple iterations, I was able to generate this (click on the picture to get the full interactivity):

Waterparks in the USA

This plot can be seen here, too. Not too shabby! There are a few things to mention here… For one, not every water park has a rating; I dealt with this by making the NAs into 0s. That's probably not the nicest way of handling that. Also - this is only the top 20 waterparks as Google decided per state. There are likely some waterparks out there that are not represented here. There are also probably non-waterparks represented here that popped up in the results. For those of you who are interested in the data or script I used to generate this map, feel free to grab them at those links. Maybe one day I'll come back to this to find out where there are the most waterparks per capita - or some other correlation to see what the best water park really is… this is just the tip of the iceberg. It feels good to scratch a few curiosity-driven itches in one project!
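For readers who want to see how the fragments above fit together, here is a hedged sketch of the full per-state loop. The state list file name, the empty data frame set-up, the use of content() before parsing, and the pause between calls are my assumptions, not the original script:

library(httr)
library(XML)

## ASSUMED INPUT: A ONE-COLUMN CSV OF STATE NAMES (NOT FROM THE ORIGINAL POST)
list_of_states <- read.csv("states.csv", header = FALSE, stringsAsFactors = FALSE)

data <- data.frame(name = character(), address = character(), lat = character(),
                   lng = character(), rating = character(), stringsAsFactors = FALSE)
counter <- 1

for (j in 1:nrow(list_of_states)) {
  q1 <- paste("waterparks in ", list_of_states[j, 1], sep = "")
  response <- GET("https://maps.googleapis.com/",
                  path = "maps/api/place/textsearch/xml",
                  query = list(query = q1, key = "YOUR_API_KEY"))

  ## PARSE THE XML BODY AND PULL OUT EACH <result> NODE
  parsed  <- xmlParse(content(response, as = "text"), asText = TRUE)
  result2 <- getNodeSet(xmlRoot(parsed), "//result")

  ## AT MOST 20 RESULTS COME BACK PER STATE
  for (i in seq_along(result2)) {
    data[counter, 1] <- xmlValue(result2[[i]][["name"]])
    data[counter, 2] <- xmlValue(result2[[i]][["formatted_address"]])
    data[counter, 3] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lat"]])
    data[counter, 4] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lng"]])
    data[counter, 5] <- xmlValue(result2[[i]][["rating"]])
    counter <- counter + 1
  }

  Sys.sleep(1)  ## PAUSE BETWEEN API CALLS - MY ADDITION, NOT IN THE ORIGINAL
}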

articles

How Data Science Is Keeping Your Cell Phone Info Safe

I am honored to have been interviewed by Philly Magazine about how my experience from Villanova's Master's in Applied Statistics has influenced my work in analytics at Comcast!

AUC

Model Comparison - ROC Curves & AUC

cartoDB

Exploring Open Data - Philadelphia Parking Violations

Introduction

A few weeks ago, I stumbled across Dylan Purcell's article on Philadelphia Parking Violations. This is a nice glimpse of the data, but I wanted to get a taste of it myself. I went and downloaded the entire data set of Parking Violations in Philadelphia from the OpenDataPhilly website and came up with a few questions after checking out the data:

How many tickets are in the data set?
What is the range of dates in the data? Are there missing days/data?
What was the biggest/smallest individual fine? What were those fines for? Who issued those fines?
What was the average individual fine amount?
What day had the most/least count of fines? What is the average amount per day? How much $ in fines did they write each day?
What hour of the day are the most fines issued?
What day of the week are the most fines issued?
What state has been issued the most fines?
Who (what individual) has been issued the most fines? How much does the individual with the most fines owe the city?
How many people have been issued fines?
What fines are issued the most/least?

And finally to the cool stuff: Where were the most fines? Can I see them on a heat map? Can I predict the number of parking tickets from weather data and other factors using linear regression? How about using Random Forests?

Data Insights

This data set has 5,624,084 tickets in it that span from January 1, 2012 through September 30, 2015 - an exact range of 1368.881 days. I was glad to find that there are no missing days in the data set. The biggest fine, $2000 (OUCH!), was issued (many times) by the police for "ATV on Public Property." The smallest fine, $15, was also issued by the police, for "parking over the time limit." The average fine for a violation in Philadelphia over the time range was $46.33. The most violations occurred on November 30, 2012, when 6,040 were issued. The fewest, unsurprisingly, were issued on Christmas Day 2014, when only 90 were written. On average, PPA and the other agencies that issued tickets (more on that below) issued 4,105.17 tickets per day. All of those tickets add up to $190,193.50 in fines issued to the residents and visitors of Philadelphia every day!!!

Digging a little deeper, I find that the most popular hour of the day for getting a ticket is 12 noon; 5AM nets the fewest tickets. Thursdays see the most tickets written (Thursdays and Fridays are higher than the rest of the week); Sundays see the least (pretty obvious). Another obvious insight is that PA-licensed drivers were issued the most tickets. Looking at individuals, there was one person who was issued 1,463 tickets (that's more than 1 violation per day on average) for a whopping $36,471. In just looking at a few of their tickets, it seems like it is probably a delivery vehicle that delivers to Chinatown (tickets for "Stop Prohibited" and "Bus Only Zone" in the Chinatown area). I'd love to hear more about why this person has so many tickets and what you do about that…

1,976,559 vehicles - let me reiterate - nearly 2 million unique vehicles have been issued fines over the three and three-quarter years this data set encompasses. That's so many!!! That is 2.85 tickets per vehicle, on average (of course that excludes all of the cars that were here and never ticketed). That makes me feel much better about how many tickets I got while I lived in the city. And… who are the agencies behind all of this? It is no surprise that PPA issues the most. There are 11 agencies in all. Seems like all of the policing agencies like to get in on the fun from time to time.
Issuing Agency             | Count
PPA                        | 4,979,292
PHILADELPHIA POLICE        | 611,348
CENTER CITY DISTRICT       | 9,628
SEPTA                      | 9,342
UPENN POLICE               | 6,366
TEMPLE POLICE              | 4,055
HOUSING AUTHORITY          | 2,137
PRISON CORRECTIONS OFFICER | 295
POST OFFICE                | 121
FAIRMOUNT DISTRICT         | 120

Mapping the Violations

Where are you most likely to get a violation? Is there anywhere that is completely safe? Looking at the city as a whole, you can see that there are some places that are "hotter" than others. I played around in CartoDB to try to visualize this as well, but Tableau seemed to do a decent enough job (though these are just screenshots). Zooming in, you can see that there are some distinct areas where tickets are given out in greater quantity. Looking one level deeper, you can see that there are some areas like Center City, east Washington Avenue, Passyunk Ave, and Broad Street that seem to be very heavily patrolled.

Summary

I created the above maps in Tableau. I used R to summarize the data. The R scripts, raw and processed data, and Tableau workbook can be found in my github repo. In the next post, I use weather data and other parameters to predict how many tickets will be written on a daily basis.
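As a flavor of the R summarization behind these insights, here is a minimal sketch using plyr. The column names (Issue.Date.and.Time appears in the follow-up post, but Fine and Issuing.Agency are assumptions about the OpenDataPhilly file) and the timestamp format are guesses; the real script lives in the github repo mentioned above:

library(plyr)

ptix <- read.csv("Parking_Violations.csv", stringsAsFactors = FALSE)

## PARSE THE ISSUE TIMESTAMP (FORMAT IS AN ASSUMPTION)
ptix$ts   <- as.POSIXct(ptix$Issue.Date.and.Time, format = "%m/%d/%Y %H:%M")
ptix$hour <- as.integer(format(ptix$ts, "%H"))
ptix$dow  <- weekdays(as.Date(ptix$ts))

## TICKETS BY HOUR OF DAY AND BY DAY OF WEEK, WITH DOLLARS WRITTEN
by_hour <- ddply(ptix, .(hour), summarize, tickets = length(hour))
by_dow  <- ddply(ptix, .(dow),  summarize, tickets = length(dow),
                 dollars = sum(Fine, na.rm = TRUE))

## TICKETS BY ISSUING AGENCY, SORTED LARGEST FIRST
by_agency <- ddply(ptix, .(Issuing.Agency), summarize, tickets = length(Issuing.Agency))
by_agency[order(-by_agency$tickets), ]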

classification

How to Build a Binary Classification Model

What Is Binary Classification?
Algorithms for Binary Classification
Logistic Regression
Decision Trees/Random Forests
Decision Trees
Random Forests
Nearest Neighbor
Support Vector Machines (SVM)
Neural Networks
Great. Now what?
Determining What the Problem is
Locate and Obtain Data
Data Mining & Preparing for Analysis
Splitting the Data
Building the Models
Validating the Models
Conclusion

What Is Binary Classification?

Binary classification is used to classify a given set into two categories. Usually, this answers a yes or no question: Did a particular passenger on the Titanic survive? Is a particular user account compromised? Is my product on the shelf in a certain pharmacy? This type of inference is frequently made using supervised machine learning techniques. Supervised machine learning means that you have historical, labeled data from which your algorithm may learn.

Algorithms for Binary Classification

There are many methods for doing binary classification. To name a few:

Logistic Regression
Decision Trees/Random Forests
Nearest Neighbor
Support Vector Machines (SVM)
Neural Networks

Logistic Regression

Logistic regression is a parametric statistical model that predicts binary outcomes. Parametric means that this algorithm is based off of a distribution (in this case, the logistic distribution), and as such must follow a few assumptions:

Obviously, the dependent variable must be binary
Only meaningful independent variables are included in the model
Error terms need to be independent and identically distributed
Independent variables need to be independent from one another
Large sample sizes are preferred

Because of these assumptions, parametric tests tend to be more statistically powerful than nonparametric tests; in other words, they tend to better find a significant effect when it indeed exists. Logistic regression follows the equation:

\(P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}\)

where \(P(Y = 1 \mid X)\) is the probability of the outcome being 1 given the independent variables (the dependent variable, limited to values between 0 and 1), the \(x_i\) are the independent variables, and \(\beta_0, \beta_1, \dots, \beta_k\) are the intercept and the coefficients for the independent variables. This equation is created based on a training set of data - or historical, labeled data - and then is used to predict the likelihoods of future, unlabeled data.

Decision Trees/Random Forests

Decision Trees

A decision tree is a nonparametric classifier. It effectively partitions the data, starting first by splitting on the independent variable that gives the most information gain, and then recursively repeating this process at subsequent levels. Information gain is a formula that determines how "important" an independent variable is in predicting the dependent variable. It takes into account how many distinct values there are (in terms of categorical variables) and the number and size of branches in the decision tree. The goal is to pick the most informative variable that is still general enough to prevent overfitting. The bottom of the decision tree, the leaf nodes, are groupings of events within the set that all follow the rules set forth throughout the tree to get to that node. Future, unlabeled events are then fed into the tree to see which group they belong to - the average of the labeled (training) data for the leaf is then assigned as the predicted value for the unlabeled event. As with logistic regression, overfitting is a concern.
If you allow a decision tree to continue to grow without bound, eventually you will have all identical events in each leaf; while this may look beneficial, it may be too specific to the training data and mislabel future events. "Pruning" occurs to prevent overfitting.

Random Forests

Random forests are an ensemble method built upon decision trees. Random forests are a "forest" of decision trees - in other words, you use bootstrap sampling techniques to build many over-fit decision trees, then average out the results to determine a final model. A bootstrap sample is sampling with replacement - in every selection for the sample, each event has an equal chance of being chosen. To clarify - building a random forest model means taking many bootstrap samples and building an over-fit decision tree (meaning you continue to split the tree without bound until every leaf node has identical groups in it) on each. These results, taken together, correct for the biases and potential overfitting of an individual tree. The more trees in your random forest, the better - the trade-off being that more trees mean more computing. Random forests often take a long time to train.

Nearest Neighbor

The k-nearest neighbor algorithm is a very simple algorithm. Using the training set as reference, the new, unlabeled data is predicted by taking the average of the k closest events. It is a lazy learner - evaluation does not take place until you classify new events - and it is quick to run. It can be difficult to determine what k should be. However, because it is easy computationally, you can run multiple iterations without much overhead.

Support Vector Machines (SVM)

SVM is another supervised machine learning method. SVM iteratively attempts to "split" the two categories by maximizing the distance between a hyperplane (a plane in more than 2 dimensions; most applications of machine learning are in this higher dimensional space) and the closest points in each category. As you can see in the simple example below, the plane iteratively improves the split between the two groups. There are multiple kernels that can be used with SVM, depending on the shape of the data:

Linear
Polynomial
Radial
Sigmoid

You may also choose to configure how big of a step can be taken by the plane in each iteration, among other configurations.

Neural Networks

Neural networks (there are several varieties) are built to mimic how a brain solves problems. This is done by creating multiple layers from a single input - most easily demonstrated with image recognition - where the network is able to turn groups of pixels into another, single value, over and over again, to provide more information to train the model.

Great. Now what?

Now that we understand some of the tools in our arsenal, what are the steps to doing the analysis?

Determining what the problem is
Locate and obtain data
Data mining for understanding & preparing for analysis
Split data into training and testing sets
Build model(s) on training data
Test models on test data
Validate and pick the best model

Determining What the Problem is

While it is easy to ask a question, it is difficult to understand all of the assumptions being made by the question asker. For example, a simple question is asked: Will my product be on the shelf of this pharmacy next week? While that question may seem straightforward at first glance, what product are we talking about? What pharmacy are we talking about? What is the time frame being evaluated?
Does it need to be in the pharmacy and available if you ask, or does the customer need to be able to visually identify the product? Does it need to be available for the entire time period in question, or does it just have to be available for at least part of the time period in question? Being as specific as possible is vital in order to deliver the correct answer. It is easy to misinterpret the assumptions of the question asker and then do a lot of work to answer the wrong question. Specificity will help ensure time is not wasted and that the question asker gets the answer they were looking for. The final question may look more like: Will there be any Tylenol PM available over-the-counter at midnight, February 28, 2017 at the Walgreens on the corner of 17th and John F. Kennedy Blvd in Philadelphia? Well - we don't know. We can now use historical data to make our best guess. This question is specific enough to answer.

Locate and Obtain Data

Where is your data? Is it in a database? Some Excel spreadsheet? Once you find it, how big is it? Can you download the data locally? Do you need to find a distributed database to handle it? If it is in a database, can you do some of the data mining (next step) before downloading the data? Be careful… "SELECT * FROM my_table;" can get scary, quick. This is also a good time to think about what tools and/or languages you want to use to mine and manipulate the data. Excel? SQL? R? Python? Some of the numerous other tools or languages out there that are good at a bunch of different things (Julia, Scala, Weka, Orange, etc.)? Get the data into one spot, preferably with some guidance on what and where it is in relation to what you need for your problem, and open it up.

Data Mining & Preparing for Analysis

The most time consuming step in any data science article you read will always be the data cleaning step. This document is no different - you will spend an inordinate amount of time getting to know the data, cleaning it, getting to know it better, and cleaning it again. You may then proceed to analysis, discover you've missed something, and come back to this step. There is a lot to consider in this step and each data analysis is different.

Is your data complete? If you are missing values in your data, how will you deal with them? There is no overarching rule on this. If you are dealing with continuous data, perhaps you'll fill missing data points with the average of similar data. Perhaps you can infer what it should be based on context. Perhaps it constitutes such a small portion of your data that the logical thing to do is to just drop the events altogether.

The dependent variable - how does it break down? We are dealing with binomial data here; are there way more zeros than ones? How will you deal with it if there are? Are you doing your analysis on a subset? If so, is your sample representative of the population? How can you be sure? This is where histograms are your friend.

Do you need to create variables? Perhaps one independent variable you have is a date, which might be tough to use as an input to your model. Should you find out which day of the week each date was? Month? Year? Season? These are easier to add in as a model input in some cases.

Do you need to standardize your data? Perhaps men are listed as "M," "Male," "m," "male," "dude," and "unsure." It would behoove you, in this example, to standardize this data to all take on the same value.

In most algorithms, correlated input variables are bad.
This is the time to plot all of the independent variables against each other to see if there is correlation. If there are correlated variables, it may be a tough choice to drop one (or all!). Speaking of independent variables, which are important to predict your dependent variable? You can use information gain packages (depending on the language/tool you are using to do your analysis), step-wise regression, or random forests to help understand the important variables. In many of these steps, there are no hard-and-fast rules on how to proceed. You'll need to make a decision in the context of your problem. In many cases, you may be wrong and need to come back to the decision after trying things out.

Splitting the Data

Now that you (think you) have a clean dataset, you'll need to split it into training and testing datasets. You'll want to have as much data as possible to train on, while still having enough data left over to test on. This is less and less of an issue in the age of big data. However, sometimes with too much data it will take too long for your algorithms to train. Again - this is another decision that will need to be made in the context of your problem. There are a few options for splitting your data. The most straightforward is to take a portion of your overall dataset to train on (say 70%) and leave behind the rest to test on. This works well in most big data applications. If you do not have a lot of data (or if you do), consider cross-validation. This is an iterative approach where you train your algorithm repeatedly on the same data set, leaving some portion out each iteration to be used as the test set. The most popular versions of cross-validation are k-fold cross-validation and leave-one-out cross-validation. There is even nested cross-validation, which gets very Inception-like.

Building the Models

Finally, you are ready to do what we came to do - build the models. We have our datasets cleaned, enriched, and split. Time to build our models. I say models, plural, because you'll always want to evaluate which method and/or inputs work best. You'll want to pick a few of the algorithms from above and build the models. While that is vague, depending on your language or tool of choice, there are multiple packages available to perform each analysis. It is generally only a line or two of code to train each model; once we have our models trained, it is time to validate.

Validating the Models

So - which model did best? How can you tell? We start by predicting results for our test set with each model and building a confusion matrix for each. With this, we can calculate the specificity, sensitivity, and accuracy for each model. For each value, higher is better. The best model is one that performs the best on each of these counts. In the real world, frequently one model will have better specificity, while another will have better sensitivity, and yet another will be the most accurate. Again, there is no hard and fast rule on which model to choose; it all depends on the context. Perhaps false positives are really bad in your context; then the specificity rate should be given more merit. It all depends. From here, you have some measures in order to pick a model and implement it.

Conclusion

Much of model building, in general, is part computer science, part statistics, and part business understanding. Understanding which tools and languages are best to implement the best statistical modeling technique to solve a business problem can feel like more of a form of art than science at times.
In this document, I’ve presented some algorithms and steps to do binary classification, which is just the tip of the iceberg. I am sure there are algorithms and steps missing – I hope that this helps in your understanding.
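To make the workflow above concrete, here is a minimal end-to-end sketch in R. The simulated data, the 70/30 split, and the two chosen algorithms are illustrative assumptions, not material from the original article:

## SELF-CONTAINED EXAMPLE: SIMULATED BINARY OUTCOME WITH TWO PREDICTORS
set.seed(42)
n  <- 1000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(rbinom(n, 1, plogis(0.5 * df$x1 - 1.0 * df$x2)))

## SPLIT: 70% TRAINING / 30% TESTING
train_idx <- sample(seq_len(n), size = 0.7 * n)
train <- df[train_idx, ]
test  <- df[-train_idx, ]

## BUILD TWO MODELS: LOGISTIC REGRESSION AND A RANDOM FOREST
library(randomForest)
logit_fit <- glm(y ~ x1 + x2, data = train, family = binomial)
rf_fit    <- randomForest(y ~ x1 + x2, data = train, ntree = 200)

## PREDICT ON THE TEST SET AND BUILD A CONFUSION MATRIX FOR EACH MODEL
logit_pred <- ifelse(predict(logit_fit, test, type = "response") > 0.5, 1, 0)
rf_pred    <- predict(rf_fit, test)
cm_logit <- table(actual = test$y, predicted = logit_pred)
cm_rf    <- table(actual = test$y, predicted = rf_pred)

## SENSITIVITY, SPECIFICITY, AND ACCURACY FROM A 2x2 CONFUSION MATRIX
summarize_cm <- function(cm) {
  c(sensitivity = cm["1", "1"] / sum(cm["1", ]),
    specificity = cm["0", "0"] / sum(cm["0", ]),
    accuracy    = sum(diag(cm)) / sum(cm))
}
summarize_cm(cm_logit)
summarize_cm(cm_rf)

Whichever model wins on the measure that matters most in your context is the one you would carry forward.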

conferences

First Annual Data Jawn

I went to a data event last night called the Data Jawn, presented by RJMetrics, and wanted to share a write up of the event - it was pretty cool and sparked some really good ideas. One idea I really liked - predicting who will call in/have an issue and proactively reaching out to them… Sounds like a game changer to me. I could then trigger alerts when we see that activity or make a list for reaching out via phone, email, text… whatever. The possibilities of this seem pretty big. There was also a lot of tweeting going on under the hashtag #datajawn. Here are some notes/takeaways from each speaker:

Bob Moore - RJMetrics CEO
Motto of RJMetrics: "Inspire and empower data driven people"

Jake Stein - RJMetrics cofounder
"Be data driven"
Steps to all problem solving:
+ Collect Data
+ Analyze
+ Present Results

Madelyn Fitzgerald - RJMetrics
"Need to be problem focused, not solution focused"
+ This means that you need to ask a question of your data before building out the answer
+ Having KPIs is awesome… but they need to be built to answer a question
The most common mistake people make with data is diving into the data before asking a question

Kim Siejak - Independence Blue Cross
IBX invested in Hadoop last year
Doing a number of predictive models and machine learning:
+ Predicting which people will go to the hospital before they go
+ Predicting different diseases based on health history
+ Predicting who will call in before they complain

David Wallace - RJMetrics
Every Important SaaS Metric in a Single Infographic
Document how every KPI is derived and make sure everyone understands it
"If you're not experimenting, you're not learning. If you're not learning, you're not growing."

Lauren Ancona/Christopher Tufts
Did a sentiment analysis on tweets with emojis
Pulled all the location-based tweets from the Philadelphia area and visualized them on a map using CartoDB and torque.js (really cool visualizations!)
Lots of people use emojis!
https://github.com/laurenancona/twimoji

Jim Multari - Comcast
"Dashboards are no good for senior leaders"
You only have 10 seconds to get your message across when talking to executives
Alerting on KPI changes
Four things needed to make a data driven org:
+ Right data & insights
+ Right data & systems
+ Right people
+ Right culture

Ben Garvey - RJMetrics
Pie charts are evil - you can estimate linear distance much more easily than angular distance
"Data visualization gives you confidence in state and trend without effort"
You can tell the story much more easily with the right visualization.

Stacey Mosley - Data Services Manager for the City of Philadelphia
Gave a talk about how they improved the use of court time for L&I. She didn't share a lot about her processes or what data she used to do this…

There were a few other speakers to end the event with nice messages, but by that point I was busy tweeting and checking out what everyone else thought of the event. I hope that there continue to be opportunities like this locally to learn more about data analytics!

D3.js

Prime Number Patterns

I found a very thought provoking and beautiful visualization on the D3 website regarding prime numbers. What the visualization shows is that if you draw periodic curves beginning at the origin for each positive integer, the prime numbers will be intersected by only two curves: the prime's own curve and the curve for one. When I saw this, my mind was blown. How interesting… and also how obvious. The definition of a prime is that it can only be divided by itself and one (duh). This is a visualization of that fact. The patterns that emerge are stunning. I wanted to build the data and visualization for myself in R. While not as spectacular as the original I found, it was still a nice adventure. I used Plotly to visualize the data. The code can be found on github. Here is the visualization:
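Here is a minimal sketch of how the underlying data might be rebuilt in R - my own construction, not the code in the github repo referenced above. Each integer n gets the periodic curve sin(pi * x / n), which crosses zero exactly at the multiples of n, so a prime p is crossed at (p, 0) only by the curves for 1 and p:

library(plotly)

max_n <- 50
x <- seq(0, max_n, by = 0.05)

## ONE PERIODIC CURVE PER POSITIVE INTEGER; ZEROES FALL ON MULTIPLES OF n
curves <- do.call(rbind, lapply(1:max_n, function(n) {
  data.frame(n = factor(n), x = x, y = sin(pi * x / n))
}))

## A NUMBER IS PRIME EXACTLY WHEN ONLY TWO CURVES (1 AND ITSELF) HIT IT
divisor_count <- sapply(2:max_n, function(k) sum(k %% 1:k == 0))
primes <- (2:max_n)[divisor_count == 2]

p <- plot_ly(curves, x = ~x, y = ~y, color = ~n,
             type = "scatter", mode = "lines", showlegend = FALSE)
p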

GAP RE-MINDER

A demonstration D3 project, shamelessly ripping off Gapminder.

Open Data Day - DC Hackathon

For those of you who aren't stirred from bed in the small hours to learn data science, you might have missed that March 5th was international open data day. There are hundreds of local events around the world; I was lucky enough to attend DC's Open Data Day Hackathon. I met a bunch of great people doing noble things with data who taught me a crap-ton (scientific term) and also validated my love for data science and how much I've learned since beginning my journey almost two years ago. Here is a quick rundown of what I learned and some helpful links so that you can find out more, too. Being that it is an Open Data event, everything was well documented on the hackathon hackpad.

Introduction to Open Data

Eric Mill gave a really nice overview of what JSON is and how to use APIs to access the JSON and, thus, the data the website is conveying. Though many APIs are open and documented, many are not. Eric gave some tips on how to access that data, too. This session really opened my eyes to how to access that previously unusable data that was hidden in plain sight in the text of websites.

Data Science Primer

This was one of the highlights for me - a couple of NIST data scientists, Pri Oberoi and Star Ying, gave a presentation and walkthrough on how to use k-means clustering to identify groupings in your data. The data and jupyter notebook are available on github (a quick R sketch of the same idea appears at the end of this post). I will definitely be using this in my journey to better detect and remediate compromised user accounts at Comcast.

Hackathon

I joined a group that was working to use data science to identify opioid overuse. Though I didn't add much (the group was filled with some really, really smart people), I was able to visualize the data using R and share some of those techniques with the team.

Intro to D3 Visualizations

The last session and probably my favorite was a tutorial on building out a D3 visualization. Chris Given walked a packed house through building a D3 viz step-by-step, giving some background on why things work the way they do and showing some great resources. I am particularly proud of the results (though I only followed his instruction to build this).

Closing

I also attended 2 sessions about using the command line that totally demystified the shell prompt. All in all, it was a great two days! I will definitely be back next year (unless I can convince someone to do one in Philly).
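The NIST session used a Python/Jupyter notebook; as a rough R equivalent of the same idea (my own sketch, not from the session materials), k-means grouping looks like this:

## K-MEANS ON A BUILT-IN DATASET, GROUPING OBSERVATIONS INTO 3 CLUSTERS
set.seed(1)
dat <- scale(iris[, 1:4])           ## STANDARDIZE THE FEATURES FIRST
km  <- kmeans(dat, centers = 3, nstart = 25)

## HOW MANY OBSERVATIONS LANDED IN EACH CLUSTER, PLUS A QUICK VISUAL CHECK
table(km$cluster)
plot(iris$Petal.Length, iris$Petal.Width, col = km$cluster,
     xlab = "Petal Length", ylab = "Petal Width")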

data analysis

Exploring Open Data - Predicting the Amount of Violations

Introduction In my last post, I went over some of the highlights of the open data set of all Philadelphia Parking Violations. In this post, I’ll go through the steps to build a model to predict the amount of violations the city issues on a daily basis. I’ll walk you through cleaning and building the data set, selecting and creating the important features, and building predictive models using Random Forests and Linear Regression. Step 1: Load Packages and Data Just an initial step to get the right libraries and data loaded in R. library(plyr) library(randomForest) ## DATA FILE FROM OPENDATAPHILLY ptix <- read.csv("Parking_Violations.csv") ## READ IN THE WEATHER DATA (FROM NCDC) weather_data <- read.csv("weather_data.csv") ## LIST OF ALL FEDERAL HOLIDAYS DURING THE ## RANGE OF THE DATA SET holidays <- as.Date(c("2012-01-02", "2012-01-16", "2012-02-20", "2012-05-28", "2012-07-04", "2012-09-03", "2012-10-08", "2012-11-12", "2012-11-22", "2012-12-25", "2013-01-01", "2013-01-21", "2013-02-18", "2013-05-27", "2013-07-04", "2013-09-02", "2013-10-14", "2013-11-11", "2013-11-28", "2013-12-25", "2014-01-01", "2014-01-20", "2014-02-17", "2014-05-26", "2014-07-04", "2014-09-01", "2014-10-13", "2014-11-11", "2014-11-27", "2014-12-25", "2015-01-01", "2015-01-09", "2015-02-16", "2015-05-25", "2015-07-03", "2015-09-07")) Step 2: Formatting the Data First things first, we have to total the amount of tickets per day from the raw data. For this, I use the plyr command ddply. Before I can use the ddply command, I need to format the Issue.Date.and.Time column to be a Date variable in the R context. days <- as.data.frame(as.Date( ptix$Issue.Date.and.Time, format = "%m/%d/%Y")) names(days) <- "DATE" count_by_day <- ddply(days, .(DATE), summarize, count = length(DATE)) Next, I do the same exact date formatting with the weather data. weather_data$DATE <- as.Date(as.POSIXct(strptime(as.character(weather_data$DATE), format = "%Y%m%d")), format = "%m/%d/%Y") Now that both the ticket and weather data have the same date format (and name), we can use the join function from the plyr package. count_by_day <- join(count_by_day, weather_data, by = "DATE") With the data joined by date, it is time to clean. There are a number of columns with unneeded data (weather station name, for example) and others with little or no data in them, which I just flatly remove. The data has also been coded with negative values representing that data had not been collected for any number of reasons (I’m not surprised that snow was not measured in the summer); for that data, I’ve made any values coded -9999 into 0. There are some days where the maximum or minimum temperature was not gathered (I’m not sure why). As this is the main variable I plan to use to predict daily violations, I drop the entire row if the temperature data is missing. 
## I DON'T CARE ABOUT THE STATION OR ITS NAME - ## GETTING RID OF IT count_by_day$STATION <- NULL count_by_day$STATION_NAME <- NULL ## A BUNCH OF VARIABLES ARE CODED WITH NEGATIVE VALUES ## IF THEY WEREN'T COLLECTED - CHANGING THEM TO 0s count_by_day$MDPR[count_by_day$MDPR < 0] <- 0 count_by_day$DAPR[count_by_day$DAPR < 0] <- 0 count_by_day$PRCP[count_by_day$PRCP < 0] <- 0 count_by_day$SNWD[count_by_day$SNWD < 0] <- 0 count_by_day$SNOW[count_by_day$SNOW < 0] <- 0 count_by_day$WT01[count_by_day$WT01 < 0] <- 0 count_by_day$WT03[count_by_day$WT03 < 0] <- 0 count_by_day$WT04[count_by_day$WT04 < 0] <- 0 ## REMOVING ANY ROWS WITH MISSING TEMP DATA count_by_day <- count_by_day[ count_by_day$TMAX > 0, ] count_by_day <- count_by_day[ count_by_day$TMIN > 0, ] ## GETTING RID OF SOME NA VALUES THAT POPPED UP count_by_day <- count_by_day[!is.na( count_by_day$TMAX), ] ## REMOVING COLUMNS THAT HAVE LITTLE OR NO DATA ## IN THEM (ALL 0s) count_by_day$TOBS <- NULL count_by_day$WT01 <- NULL count_by_day$WT04 <- NULL count_by_day$WT03 <- NULL ## CHANGING THE DATA, UNNECESSARILY, FROM 10ths OF ## DEGREES CELSIUS TO JUST DEGREES CELSIUS count_by_day$TMAX <- count_by_day$TMAX / 10 count_by_day$TMIN <- count_by_day$TMIN / 10 Step 3: Visualizing the Data At this point, we have joined our data sets and gotten rid of the unhelpful “stuff.” What does the data look like? Daily Violation Counts There are clearly two populations here. With the benefit of hindsight, the small population on the left of the histogram is mainly Sundays. The larger population with the majority of the data is all other days of the week. Let’s make some new features to explore this idea. Step 4: New Feature Creation As we see in the histogram above, there are obviously a few populations in the data - I know that day of the week, holidays, and month of the year likely have some strong influence on how many violations are issued. If you think about it, most parking signs include the clause: “Except Sundays and Holidays.” Plus, having spent more than a few summers in Philadelphia at this point, I know that from Memorial Day until Labor Day the city relocates to the South Jersey Shore (emphasis on the South part of the Jersey Shore). That said - I add in those features as predictors. ## FEATURE CREATION - ADDING IN THE DAY OF WEEK count_by_day$DOW <- as.factor(weekdays(count_by_day$DATE)) ## FEATURE CREATION - ADDING IN IF THE DAY WAS A HOLIDAY count_by_day$HOL <- 0 count_by_day$HOL[as.character(count_by_day$DATE) %in% as.character(holidays)] <- 1 count_by_day$HOL <- as.factor(count_by_day$HOL) ## FEATURE CREATION - ADDING IN THE MONTH count_by_day$MON <- as.factor(months(count_by_day$DATE)) Now - let’s see if the Sunday thing is real. Here is a scatterplot of the data. The circles represent Sundays; triangles are all other days of the week. Temperature vs. Ticket Counts You can clearly see that Sundays tend to do their own thing, but in a consistent manner that tracks the rest of the week. In other words, the slope for Sundays is very close to the slope for all other days of the week. There are some points that don’t follow those trends, which are likely due to snow, holidays, and/or other man-made or weather events. Let’s split the data into a training and test set (that way we can see how well we do with the model). I’m arbitrarily making the test set the last year of data; everything before that is the training set.
train <- count_by_day[count_by_day$DATE < "2014-08-01", ] test <- count_by_day[count_by_day$DATE >= "2014-08-01", ] Step 5: Feature Identification We now have a data set that is ready for some model building! The problem to solve next is figuring out which features best explain the count of violations issued each day. My preference is to use Random Forests to tell me which features are the most important. We’ll also take a look to see which, if any, variables are highly correlated. High correlation amongst input variables will lead to high variability due to multicollinearity issues. featForest <- randomForest(count ~ MDPR + DAPR + PRCP + SNWD + SNOW + TMAX + TMIN + DOW + HOL + MON, data = train, importance = TRUE, ntree = 10000) ## PLOT THE VARIABLE TO SEE THE IMPORTANCE varImpPlot(featForest) In the Variable Importance Plot below, you can see very clearly that the day of the week (DOW) is by far the most important variable in describing the amount of violations written per day. This is followed by whether or not the day was a holiday (HOL), the minimum temperature (TMIN), and the month (MON). The maximum temperature is in there, too, but I think that it is likely highly correlated with the minimum temperature (we’ll see that next). The rest of the variables have very little impact. Variable Importance Plot cor(count_by_day[,c(3:9)]) I’ll skip the entire output of the correlation table, but TMIN and TMAX have a correlation coefficient of 0.940379171. Because TMIN has a higher variable importance and there is a high correlation between the TMIN and TMAX, I’ll leave TMAX out of the model. Step 6: Building the Models The goal here was to build a multiple linear regression model - since I’ve already started down the path of Random Forests, I’ll do one of those, too, and compare the two. To build the models, we do the following: ## BUILD ANOTHER FOREST USING THE IMPORTANT VARIABLES predForest <- randomForest(count ~ DOW + HOL + TMIN + MON, data = train, importance = TRUE, ntree = 10000) ## BUILD A LINEAR MODEL USING THE IMPORTANT VARIABLES linmod_with_mon <- lm(count ~ TMIN + DOW + HOL + MON, data = train) In looking at the summary, I have questions on whether or not the month variable (MON) is significant to the model or not. Many of the variables have rather high p-values. summary(linmod_with_mon) Call: lm(formula = count ~ TMIN + DOW + HOL + MON, data = train) Residuals: Min 1Q Median 3Q Max -4471.5 -132.1 49.6 258.2 2539.8 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5271.4002 89.5216 58.884 < 2e-16 *** TMIN -15.2174 5.6532 -2.692 0.007265 ** DOWMonday -619.5908 75.2208 -8.237 7.87e-16 *** DOWSaturday -788.8261 74.3178 -10.614 < 2e-16 *** DOWSunday -3583.6718 74.0854 -48.372 < 2e-16 *** DOWThursday 179.0975 74.5286 2.403 0.016501 * DOWTuesday -494.3059 73.7919 -6.699 4.14e-11 *** DOWWednesday -587.7153 74.0264 -7.939 7.45e-15 *** HOL1 -3275.6523 146.8750 -22.302 < 2e-16 *** MONAugust -99.8049 114.4150 -0.872 0.383321 MONDecember -390.2925 109.4594 -3.566 0.000386 *** MONFebruary -127.8091 112.0767 -1.140 0.254496 MONJanuary -73.0693 109.0627 -0.670 0.503081 MONJuly -346.7266 113.6137 -3.052 0.002355 ** MONJune -30.8752 101.6812 -0.304 0.761481 MONMarch -1.4980 94.8631 -0.016 0.987405 MONMay 0.1194 88.3915 0.001 0.998923 MONNovember 170.8023 97.6989 1.748 0.080831 . MONOctober 125.1124 92.3071 1.355 0.175702 MONSeptember 199.6884 101.9056 1.960 0.050420 . --- Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 544.2 on 748 degrees of freedom Multiple R-squared: 0.8445, Adjusted R-squared: 0.8405 F-statistic: 213.8 on 19 and 748 DF, p-value: < 2.2e-16 To verify this, I build the model without the MON term and then do an F-Test to compare using the results of the ANOVA tables below. ## FIRST ANOVA TABLE (WITH THE MON TERM) anova(linmod_with_mon) Analysis of Variance Table Response: count Df Sum Sq Mean Sq F value Pr(>F) TMIN 1 16109057 16109057 54.3844 4.383e-13 *** DOW 6 1019164305 169860717 573.4523 < 2.2e-16 *** HOL 1 147553631 147553631 498.1432 < 2.2e-16 *** MON 11 20322464 1847497 6.2372 6.883e-10 *** Residuals 748 221563026 296207 ## SECOND ANOVA TABLE (WITHOUT THE MON TERM) anova(linmod_wo_mon) Analysis of Variance Table Response: count Df Sum Sq Mean Sq F value Pr(>F) TMIN 1 16109057 16109057 50.548 2.688e-12 *** DOW 6 1019164305 169860717 532.997 < 2.2e-16 *** HOL 1 147553631 147553631 463.001 < 2.2e-16 *** Residuals 759 241885490 318690 ## Ho: B9 = B10 = B11 = B12 = B13 = B14 = B15 = B16 = ## B17 = B18 = B19 = 0 ## Ha: At least one is not equal to 0 ## F-Stat = MSdrop / MSE = ## ((SSR1 - SSR2) / (DF(R)1 - DF(R)2)) / MSE f_stat <- ((241885490 - 221563026) / (759 - 748)) / 296207 ## P_VALUE OF THE F_STAT CALCULATED ABOVE p_value <- 1 - pf(f_stat, 11, 748) Since the P-Value 6.8829e-10 is MUCH MUCH less than 0.05, I can reject the null hypothesis and conclude that at least one of the parameters associated with the MON term is not zero. Because of this, I’ll keep the term in the model. Step 7: Apply the Models to the Test Data Below I call the predict function to see how the Random Forest and Linear Model predict the test data. I am rounding the prediction to the nearest integer. To determine which model performs better, I am calculating the difference in absolute value of the predicted value from the actual count. ## PREDICT THE VALUES BASED ON THE MODELS test$RF <- round(predict(predForest, test), 0) test$LM <- round(predict.lm(linmod_with_mon, test), 0) ## SEE THE ABSOLUTE DIFFERENCE FROM THE ACTUAL difOfRF <- sum(abs(test$RF - test$count)) difOfLM <- sum(abs(test$LM - test$count)) Conclusion As it turns out, the Linear Model performs better than the Random Forest model. I am relatively pleased with the Linear Model - an R-Squared value of 0.8445 ain’t nothin’ to shake a stick at. You can see that Random Forests are very useful in identifying the important features. To me, it tends to be a bit more of a “black box” in comparison the linear regression - I hesitate to use it at work for more than a feature identification tool. Overall - a nice little experiment and a great dive into some open data. I now know that PPA rarely takes a day off, regardless of the weather. I’d love to know how much of the fines they write are actually collected. I may also dive into predicting what type of ticket you received based on your location, time of ticket, etc. All in another day’s work! Thanks for reading.
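One small piece not shown above is the reduced model behind anova(linmod_wo_mon). Here is a minimal sketch of how that fit might look, assuming the formula simply drops the MON term from the full model, along with the same partial F-test done directly by comparing the two fits (an alternative to the hand calculation, not the post's exact code):

## SKETCH: THE REDUCED MODEL REFERENCED BY anova(linmod_wo_mon)
linmod_wo_mon <- lm(count ~ TMIN + DOW + HOL, data = train)

## R CAN RUN THE SAME PARTIAL F-TEST BY COMPARING THE TWO FITS DIRECTLY
anova(linmod_wo_mon, linmod_with_mon)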

Exploring Open Data - Philadelphia Parking Violations

Introduction A few weeks ago, I stumbled across Dylan Purcell’s article on Philadelphia Parking Violations. This is a nice glimpse of the data, but I wanted to get a taste of it myself. I went and downloaded the entire data set of Parking Violations in Philadelphia from the OpenDataPhilly website and came up with a few questions after checking out the data: How many tickets are in the data set? What is the range of dates in the data? Are there missing days/data? What was the biggest/smallest individual fine? What were those fines for? Who issued those fines? What was the average individual fine amount? What day had the most/least count of fines? What is the average amount per day? How much $ in fines did they write each day? What hour of the day are the most fines issued? What day of the week are the most fines issued? What state has been issued the most fines? Who (what individual) has been issued the most fines? How much does the individual with the most fines owe the city? How many people have been issued fines? What fines are issued the most/least? And finally to the cool stuff: Where were the most fines? Can I see them on a heat map? Can I predict the amount of parking tickets by weather data and other factors using linear regression? How about using Random Forests? Data Insights This data set has 5,624,084 tickets in it, spanning January 1, 2012 through September 30, 2015 - an exact range of 1368.881 days. I was glad to find that there are no missing days in the data set. The biggest fine, $2000 (OUCH!), was issued (many times) by the police for “ATV on Public Property.” The smallest fine, $15, was also issued by the police for “parking over the time limit.” The average fine for a violation in Philadelphia over the time range was $46.33. The most violations occurred on November 30, 2012, when 6,040 were issued. The least issued, unsurprisingly, was on Christmas day, 2014, when only 90 were issued. On average, PPA and the other 9 agencies that issued tickets (more on that below) issued 4,105.17 tickets per day. All of those tickets add up to $190,193.50 in fines issued to the residents and visitors of Philadelphia every day!!! Digging a little deeper, I find that the most popular hour of the day for getting a ticket is 12 noon; 5AM nets the least tickets. Thursdays see the most tickets written (Thursdays and Fridays are higher than the rest of the week); Sundays see the least (pretty obvious). Another obvious insight is that PA-licensed drivers were issued the most tickets. Looking at individuals, there was one person who was issued 1,463 tickets (that’s more than 1 violation per day on average) for a whopping $36,471. In just looking at a few of their tickets, it seems like it is probably a delivery vehicle that delivers to Chinatown (tickets for “Stop Prohibited” and “Bus Only Zone” in the Chinatown area). I’d love to hear more about why this person has so many tickets and what you do about that… 1,976,559 people - let me reiterate - nearly 2 million unique vehicles have been issued fines over the three and three quarter years this data set encompasses. That’s so many!!! That is 2.85 tickets per vehicle, on average (of course that excludes all of the cars that were here and never ticketed). That makes me feel much better about how many tickets I got while I lived in the city. And… who are the agencies behind all of this? It is no surprise that PPA issues the most. There are 11 agencies in all. Seems like all of the policing agencies like to get in on the fun from time to time.
Issuing Agency count
PPA 4,979,292
PHILADELPHIA POLICE 611,348
CENTER CITY DISTRICT 9,628
SEPTA 9,342
UPENN POLICE 6,366
TEMPLE POLICE 4,055
HOUSING AUTHORITY 2,137
PRISON CORRECTIONS OFFICER 295
POST OFFICE 121
FAIRMOUNT DISTRICT 120
Mapping the Violations Where are you most likely to get a violation? Is there anywhere that is completely safe? Looking at the city as a whole, you can see that there are some places that are “hotter” than others. I played around in cartoDB to try to visualize this as well, but tableau seemed to do a decent enough job (though these are just screenshots). Zooming in, you can see that there are some distinct areas where tickets are given out in more quantity. Looking one level deeper, you can see that there are some areas like Center City, east Washington Avenue, Passyunk Ave, and Broad Street that seem to be very highly patrolled. Summary I created the above maps in Tableau. I used R to summarize the data. The R scripts, raw and processed data, and Tableau workbook can be found in my github repo. In the next post, I use weather data and other parameters to predict how many tickets will be written on a daily basis.
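For anyone curious how a few of those summary numbers could be reproduced, here is a minimal sketch in R, assuming the same Parking_Violations.csv file and Issue.Date.and.Time column used in the follow-up post; it is an illustration, not the exact script in the repo:

library(plyr)

## SKETCH: DAILY TICKET COUNTS AND A FEW OF THE HEADLINE NUMBERS
ptix <- read.csv("Parking_Violations.csv")
ptix$DATE <- as.Date(ptix$Issue.Date.and.Time, format = "%m/%d/%Y")
by_day <- ddply(ptix, .(DATE), summarize, count = length(DATE))
by_day[which.max(by_day$count), ]   ## BUSIEST DAY
by_day[which.min(by_day$count), ]   ## QUIETEST DAY
mean(by_day$count)                  ## AVERAGE TICKETS PER DAY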

Doing a Sentiment Analysis on Tweets (Part 2)

INTRO This post is a continuation of my last post. There I pulled tweets from Twitter related to “Comcast email,” got rid of the junk, and removed the unnecessary/unwanted data. Now that I have the tweets, I will further clean the text and subject it to two different analyses: emotion and polarity. WHY DOES THIS MATTER Before I get started, I thought it might be a good idea to talk about WHY I am doing this (besides the fact that I learned a new skill and want to show it off and get feedback). This yet incomplete project was devised for two reasons: Understand the overall customer sentiment about the product I support Create an early warning system to help identify when things are going wrong on the platform Keeping the customer voice at the forefront of everything we do is paramount to providing the best experience for the users of our platform. Identifying trends in sentiment and emotion can help inform the team in many ways, including seeing the reaction to new features/releases (i.e. – seeing a rise in comments about a specific addition from a release) and identifying needed changes to current functionality (i.e. – users who continually comment about a specific behavior of the application) and improvements to user experience (i.e. – trends in comments about being unable to find a certain feature on the site). Secondarily, this analysis can act as an early warning system when there are issues with the platform (i.e. – a sudden spike in comments about the usability of a mobile device). Now that I’ve explained why I am doing this (which I probably should have done in this sort of detail in the first post), let’s get into how it is actually done… STEP ONE: STRIPPING THE TEXT FOR ANALYSIS There are a number of things included in tweets that don’t matter for the analysis. Things like Twitter handles, URLs, punctuation… they are not necessary to do the analysis (in fact, they may well confound it). This bit of code handles that cleanup. For those following the scripts on GitHub, this is part of my tweet_clean.R script. Also, to give credit where it is due: I’ve borrowed and tweaked the code from Andy Bromberg’s blog to do this task. library(stringr) ##Does some of the text editing ##Cleaning up the data some more (just the text now) First grabbing only the text text <- paredTweetList$Tweet # remove retweet entities text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", text) # remove at people text <- gsub("@\\w+", "", text) # remove punctuation text <- gsub("[[:punct:]]", "", text) # remove numbers text <- gsub("[[:digit:]]", "", text) # remove html links text <- gsub("http\\w+", "", text) # define "tolower error handling" function try.error <- function(x) { # create missing value y <- NA # tryCatch error try_error <- tryCatch(tolower(x), error=function(e) e) # if not an error if (!inherits(try_error, "error")) y <- tolower(x) # result return(y) } # lower case using try.error with sapply text <- sapply(text, try.error) # remove NAs in text text <- text[!is.na(text)] # remove column names names(text) <- NULL STEP TWO: CLASSIFYING THE EMOTION FOR EACH TWEET So now the text is just that: only text. The punctuation, links, handles, etc. have been removed. Now it is time to estimate the emotion of each tweet. Through some research, I found that there are many posts/sites on Sentiment Analysis/Emotion Classification that use the “Sentiment” package in R.
I thought: “Oh great – a package tailor-made to solve the problem for which I want an answer.” The problem is that this package has been deprecated and removed from the CRAN library. To get around this, I downloaded the archived package and pulled the code for doing the emotion classification. With some minor tweaks, I was able to get it going. This can be seen in its entirety in the classify_emotion.R script. You can also see the “made for the internet” version here: library(RTextTools) library(tm) algorithm <- "bayes" prior <- 1.0 verbose <- FALSE matrix <- create_matrix(text) lexicon <- read.csv("./data/emotions.csv.gz",header=FALSE) counts <- list(anger=length(which(lexicon[,2]=="anger")), disgust=length(which(lexicon[,2]=="disgust")), fear=length(which(lexicon[,2]=="fear")), joy=length(which(lexicon[,2]=="joy")), sadness=length(which(lexicon[,2]=="sadness")), surprise=length(which(lexicon[,2]=="surprise")), total=nrow(lexicon)) documents <- c() for (i in 1:nrow(matrix)) { if (verbose) print(paste("DOCUMENT",i)) scores <- list(anger=0,disgust=0,fear=0,joy=0,sadness=0,surprise=0) doc <- matrix[i,] words <- findFreqTerms(doc,lowfreq=1) for (word in words) { for (key in names(scores)) { emotions <- lexicon[which(lexicon[,2]==key),] index <- pmatch(word, emotions[,1], nomatch=0) if (index > 0) { entry <- emotions[index,] category <- as.character(entry[[2]]) count <- counts[[category]] score <- 1.0 if (algorithm=="bayes") score <- abs(log(score*prior/count)) if (verbose) { print(paste("WORD:",word,"CAT:", category,"SCORE:",score)) } scores[[category]] <- scores[[category]]+score } } } if (algorithm=="bayes") { for (key in names(scores)) { count <- counts[[key]] total <- counts[["total"]] score <- abs(log(count/total)) scores[[key]] <- scores[[key]]+score } } else { for (key in names(scores)) { scores[[key]] <- scores[[key]]+0.000001 } } best_fit <- names(scores)[which.max(unlist(scores))] if (best_fit == "disgust" && as.numeric(unlist(scores[2]))-3.09234 < .01) best_fit <- NA documents <- rbind(documents, c(scores$anger, scores$disgust, scores$fear, scores$joy, scores$sadness, scores$surprise, best_fit)) } colnames(documents) <- c("ANGER", "DISGUST", "FEAR", "JOY", "SADNESS", "SURPRISE", "BEST_FIT") Here is a sample output from this code: ANGER DISGUST FEAR JOY SADNESS SURPRISE BEST_FIT “1.46871776464786” “3.09234031207392” “2.06783599555953” “1.02547755260094” “7.34083555412328” “7.34083555412327” “sadness” “7.34083555412328” “3.09234031207392” “2.06783599555953” “1.02547755260094” “1.7277074477352” “2.78695866252273” “anger” “1.46871776464786” “3.09234031207392” “2.06783599555953” “1.02547755260094” “7.34083555412328” “7.34083555412328” “sadness” Here you can see that the initial author is using naive Bayes (which honestly I don’t yet understand) to analyze the text. I wanted to show a quick snippet of how the analysis is being done “under the hood.” For my purposes though, I only care about the emotion outputted and the tweet it is analyzed from. emotion <- documents[, "BEST_FIT"] This variable, emotion, is returned by the classify_emotion.R script. CHALLENGES OBSERVED In addition to not fully understanding the code, the emotion classification seems to only work OK (which is pretty much expected… this is a canned analysis that hasn’t been tailored to my analysis at all). I’d like to come back to this one day to see if I can do a better job analyzing the emotions of the tweets. STEP THREE: CLASSIFYING THE POLARITY OF EACH TWEET Similarly to the previous step, I will use the cleaned text to analyze the polarity of each tweet.
This code is also from the old R package titled “Sentiment.” As with above, I was able to get the code working with only some minor tweaks. This can be seen in its entirety in the classify_polarity.R script. Here it is, too: algorithm <- "bayes" pstrong <- 0.5 pweak <- 1.0 prior <- 1.0 verbose <- FALSE matrix <- create_matrix(text) lexicon <- read.csv("./data/subjectivity.csv.gz",header=FALSE) counts <- list(positive=length(which(lexicon[,3]=="positive")), negative=length(which(lexicon[,3]=="negative")), total=nrow(lexicon)) documents <- c() for (i in 1:nrow(matrix)) { if (verbose) print(paste("DOCUMENT",i)) scores <- list(positive=0,negative=0) doc <- matrix[i,] words <- findFreqTerms(doc, lowfreq=1) for (word in words) { index <- pmatch(word, lexicon[,1], nomatch=0) if (index > 0) { entry <- lexicon[index,] polarity <- as.character(entry[[2]]) category <- as.character(entry[[3]]) count <- counts[[category]] score <- pweak if (polarity == "strongsubj") score <- pstrong if (algorithm=="bayes") score <- abs(log(score*prior/count)) if (verbose) { print(paste("WORD:", word, "CAT:", category, "POL:", polarity, "SCORE:", score)) } scores[[category]] <- scores[[category]]+score } } if (algorithm=="bayes") { for (key in names(scores)) { count <- counts[[key]] total <- counts[["total"]] score <- abs(log(count/total)) scores[[key]] <- scores[[key]]+score } } else { for (key in names(scores)) { scores[[key]] <- scores[[key]]+0.000001 } } best_fit <- names(scores)[which.max(unlist(scores))] ratio <- as.integer(abs(scores$positive/scores$negative)) if (ratio==1) best_fit <- "neutral" documents <- rbind(documents,c(scores$positive, scores$negative, abs(scores$positive/scores$negative), best_fit)) if (verbose) { print(paste("POS:", scores$positive,"NEG:", scores$negative, "RATIO:", abs(scores$positive/scores$negative))) cat("\n") } } colnames(documents) <- c("POS","NEG","POS/NEG","BEST_FIT") Here is a sample output from this code: POS NEG POS/NEG BEST_FIT “1.03127774142571” “0.445453222112551” “2.31512017476245” “positive” “1.03127774142571” “26.1492093145274” “0.0394381997949273” “negative” “17.9196623384892” “17.8123396772424” “1.00602518608961” “neutral” Again, I just wanted to show a quick snippet of how the analysis is being done “under the hood.” I only care about the polarity outputted and the tweet it is analyzed from. polarity <- documents[, "BEST_FIT"] This variable, polarity, is returned by the classify_polarity.R script. CHALLENGES OBSERVED As with above, this is a stock analysis and hasn’t been tweaked for my needs. The analysis does OK, but I want to come back to this again one day to see if I can do better. QUICK CONCLUSION So… Now I have the emotion and polarity for each tweet. This can be useful to see on its own, but I think it is more worthwhile in aggregate. In my next post, I’ll show that. I’ll also show an analysis of the word count with a wordcloud… This gets into the secondary point of this analysis. Hypothetically, I’d like to see common issues bubbled up through the wordcloud.
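As a tiny preview of what “in aggregate” could look like, here is a minimal sketch using the emotion and polarity vectors returned by the two scripts above (just frequency tables, not the charts planned for the next post):

## SKETCH: SIMPLE AGGREGATE COUNTS OF THE CLASSIFICATIONS
table(emotion, useNA = "ifany")   ## HOW MANY TWEETS LANDED IN EACH EMOTION
table(polarity)                   ## POSITIVE / NEGATIVE / NEUTRAL BREAKDOWN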

Doing a Sentiment Analysis on Tweets (Part 1)

INTRO So… This post is my first foray into the R twitteR package. This post assumes that you have that package installed already in R. I show here how to get tweets from Twitter in preparation for doing some sentiment analysis. My next post will be the actual sentiment analysis. For this example, I am grabbing tweets related to “Comcast email.” My goal for this exercise is to see how people are feeling about the product I support. STEP 1: GETTING AUTHENTICATED TO TWITTER First, you’ll need to create an application at Twitter. I used this blog post to get rolling with that. This post does a good job walking you through the steps to do that. Once you have your app created, this is the code I used to create and save my authentication credentials. Once you’ve done this once, you need only load your credentials in the future to authenticate with Twitter. library(twitteR) ## R package that does some of the Twitter API heavy lifting consumerKey <- "INSERT YOUR KEY HERE" consumerSecret <- "INSERT YOUR SECRET HERE" reqURL <- "https://api.twitter.com/oauth/request_token" accessURL <- "https://api.twitter.com/oauth/access_token" authURL <- "https://api.twitter.com/oauth/authorize" twitCred <- OAuthFactory$new(consumerKey = consumerKey, consumerSecret = consumerSecret, requestURL = reqURL, accessURL = accessURL, authURL = authURL) twitCred$handshake() save(twitCred, file="credentials.RData") STEP 2: GETTING THE TWEETS Once you have your authentication credentials set, you can use them to grab tweets from Twitter. The next snippets of code come from my scraping_twitter.R script, which you are welcome to see in its entirety on GitHub. ##Authentication load("credentials.RData") ##has my secret keys and shiz registerTwitterOAuth(twitCred) ##logs me in ##Get the tweets about "comcast email" to work with tweetList <- searchTwitter("comcast email", n = 1000) tweetList <- twListToDF(tweetList) ##converts that data we got into a data frame As you can see, I used the twitteR R Package to authenticate and search Twitter. After getting the tweets, I converted the results to a Data Frame to make it easier to analyze the results. STEP 3: GETTING RID OF THE JUNK Many of the tweets returned by my initial search are totally unrelated to Comcast Email. An example of this would be: “I am selling something random… please email me at myemailaddress@comcast.net” The tweet above includes the words email and comcast, but has nothing to actually do with Comcast Email and the way the user feels about it, other than they use it for their business. So… based on some initial, manual, analysis of the tweets, I’ve decided to pull those tweets with the phrases: “fix” AND “email” in them (in that order) “Comcast” AND “email” in them (in that order) “no email” in them Any tweet that comes from a source with “comcast” in the handle “Customer Service” AND “email” OR the reverse (“email” AND “Customer Service”) in them This is done with this code: ##finds the rows that have the phrase "fix ... email" in them fixemail <- grep("(fix.*email)", tweetList$text) ##finds the rows that have the phrase "comcast ...
email" in them comcastemail <- grep("[Cc]omcast.*email", tweetList$text) ##finds the rows that have the phrase "no email" in them noemail <- grep("no email", tweetList$text) ##finds the rows that originated from a Comcast twitter handle comcasttweet <- grep("[Cc]omcast", tweetList$screenName) ##finds the rows related to email and customer service custserv <- grep("[Cc]ustomer [Ss]ervice.*email|email.*[Cc]ustomer [Ss]ervice", tweetList$text) After pulling out the duplicates (some tweets may fall into multiple scenarios from above) and ensuring they are in order (as returned initially), I assign the relevant tweets to a new variable with only some of the returned columns. The returned columns are: text favorited favoriteCount replyToSN created truncated replyToSID id replyToUID statusSource screenName retweetCount isRetweet retweeted longitude latitude All I care about are: text created statusSource screenName This is handled through this tidbit of code: ##combine all of the "good" tweets row numbers that we greped out above and ##then sorts them and makes sure they are unique combined <- c(fixemail, comcastemail, noemail, comcasttweet, custserv) uvals <- unique(combined) sorted <- sort(uvals) ##pull the row numbers that we want, and with the columns that are important to ##us (tweet text, time of tweet, source, and username) paredTweetList <- tweetList[sorted, c(1, 5, 10, 11)] STEP 4: CLEAN UP THE DATA AND RETURN THE RESULTS Lastly, for this first script, I make the sources look nice, add titles, and return the final list (only a sample set of tweets shown): ##make the device source look nicer paredTweetList$statusSource <- sub("<.*\">", "", paredTweetList$statusSource) paredTweetList$statusSource <- sub("</a>", "", paredTweetList$statusSource) ##name the columns names(paredTweetList) <- c("Tweet", "Created", "Source", "ScreenName") paredTweetList Tweet created statusSource screenName Dear Mark I am having problems login into my acct REDACTED@comcast.net I get no email w codes to reset my password for eddygil HELP HELP 2014-12-23 15:44:27 Twitter Web Client riocauto @msnbc @nbc @comcast pay @thereval who incites the murder of police officers. Time to send them a message of BOYCOTT! Tweet/email them NOW 2014-12-23 14:52:50 Twitter Web Client Monty_H_Mathis Comcast, I have no email. This is bad for my small business. Their response “Oh, I’m sorry for that”. Problem not resolved. #comcast 2014-12-23 09:20:14 Twitter Web Client mathercesul CHALLENGES OBSERVED As you can see from the output, sometimes some “junk” still gets in. Something I’d like to continue working on is a more reliable algorithm for identifying appropriate tweets. I also am worried that my choice of subjects is biasing the sentiment.

data analytics

A Data Scientist's Take on the Roadmap to AI

INTRODUCTION Recently I was asked by a former colleague about getting into AI. He has truly big data and wants to use this data to power “AI” - if the headlines are to be believed, everyone else is already doing it. Though it was difficult for my ego, I told him I couldn’t help him in our 30-minute call and that he should think about hiring someone to get him there. The truth was I really didn’t have a solid answer for him in the moment. This was truly disappointing - in my current role and in my previous role, I put predictive models into production. After thinking about it for a bit, there is definitely a similar path I took in both roles. There are 3 steps in my mind to getting to “AI.” Though this seems simple, it is a long process and potentially not linear - you may have to keep coming back to previous steps. Baseline (Reporting) Understand (Advanced Analytics) Artificial Intelligence (Data Science) BASELINE (REPORTING) Fun fact: You cannot effectively predict anything if you cannot measure the impact. What I mean by baseline is building out a reporting suite. Having a fundamental understanding of your business and environment is key. Without doing this step, you may try to predict the wrong thing entirely - or start with something that isn’t the most impactful. For me, this step started with finding the data in the first place. Perhaps, like my colleague, you have lots of data and you’re ready to jump in. That’s great and makes getting started that much more straightforward. In my role, I joined a finance team that really didn’t have a good bead on this - finding the data was difficult (and getting the owners of that data to give me access was a process as well). To be successful, start small and iterate. Our first reports were built from manually downloading machine logs, processing them in R with JSON packages, and turning them into a black-and-white document. It was ugly, but it helped us know what we needed to know in that moment - oh yeah… it was MUCH better than nothing. “Don’t let perfection be the enemy of good.” - paraphrased from Voltaire. From this, I gained access to our organization’s data warehouse, put automation in place, and purchased some Tableau licenses. This phase took a few months and is constantly being refined, but we are now able to see the impact of our decisions at a glance. This new understanding inevitably leads to more questions - cue step 2: Understanding. UNDERSTANDING (ADVANCED ANALYTICS) If you have never circulated reports and dashboards to others… let me fill you in on something: it will ALWAYS lead to additional, progressively harder questions. This step is an investment in time and expertise - you have to commit to having dedicated resource(s) (read: people… it is inhumane to call people resources and you may only need one person or some of a full-time person’s time). Why did X go up unexpectedly (breaks the current trend)? Are we over-indexing on this type of customer? Right before our customer leaves, this weird thing happens - what is this weird thing and why is it happening? Like the previous step - this will be ongoing. Investing in someone to do advanced analytics will help you to understand the fine details of your business AND … (drum roll) … will help you to understand which part of your business is most ripe for “AI”! ARTIFICIAL INTELLIGENCE (DATA SCIENCE) It is at this point that you will be able to do real, bona fide data science.
A quick rant: Notice that I purposefully did not use the term “AI” (I know I used it throughout this article and even in the title of this section… what can I say - I am in-tune with marketing concepts, too). “AI” is a term that is overused and rarely implemented. Data science, however, comes in many forms and can really transform your business. Here are a few ideas for what you can do with data science: Prediction/Machine Learning Testing Graph Analysis Perhaps you want to predict whether a sale is fraudulent or which existing customer is most apt to buy your new product? You can also test whether a new strategy works better than the old. This requires that you use statistical concepts to ensure valid testing and results (see the small sketch after this post). My new obsession is around graph analysis. With graphs you can see relationships that may have been hidden before - this will enable you to identify new targets and enrich your understanding of your business! Data science is usually a very specific thing and takes many forms! SUMMARY Getting to data science is a process - it will take an investment. There are products out there that will help you shortcut some of these steps and I encourage you to consider these. There are products to help with reporting, analytics, and data science. These should, in my very humble opinion, be used by people who are dedicated to the organization’s data, analytics, and science. Directions for data science - measure, analyze, predict, repeat!
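As one tiny, concrete example of the “Testing” idea above, here is a minimal sketch in R of comparing a new strategy to the old one with a two-proportion test; the conversion and visitor counts are made up purely for illustration:

## SKETCH: DID THE NEW STRATEGY CONVERT BETTER THAN THE OLD ONE? (MADE-UP NUMBERS)
conversions <- c(old = 430, new = 495)
visitors <- c(old = 5000, new = 5000)
prop.test(conversions, visitors)   ## A SMALL P-VALUE SUGGESTS THE LIFT IS PROBABLY REAL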

data engineering

Visualizing Exercise Data from Strava

INTRODUCTION My wife introduced me to cycling in 2014 - I fell in love with it and went all in. That first summer after buying my bike, I rode over 500 miles (more on that below). My neighbors at the time, also cyclists, introduced me to the app Strava. Ever since then, I’ve tracked all of my rides, runs, hikes, walks (perhaps not really exercise that needs to be tracked… but I hurt myself early in 2018 and that’s all I could do for a while), etc. everything I could, I tracked. I got curious and found a package, rStrava, where I can download all of my activity. Once I had it, I put it into a few visualizations. ESTABLISH STRAVA AUTHENTICATION First thing I had to do was set up a Strava account and application. I found some really nice instructions on another blog that helped walk me through this. After that, I installed rStrava and set up authentication (you only have to do this the first time). ## INSTALLING THE NECESSARY PACKAGES install.packages("devtools") devtools::install_github('fawda123/rStrava') ## LOAD THE LIBRARY library(rStrava) ## ESTABLISH THE APP CREDENTIALS name <- 'jakelearnsdatascience' client_id <- '31528' secret <- 'MY_SECRET_KEY' ## CREATE YOUR STRAVA TOKEN token <- httr::config(token = strava_oauth(name, client_id, secret, app_scope = "read_all", cache = TRUE)) ## cache = TRUE is optional - but it saves your token to the working directory GET MY EXERCISE DATA Now that authentication is setup, using the rStrava package to pull activity data is relatively straightforward. library(rStrava) ## LOAD THE TOKEN (AFTER THE FIRST TIME) stoken <- httr::config(token = readRDS(oauth_location)[[1]]) ## GET STRAVA DATA USING rStrava FUNCTION FOR MY ATHLETE ID my_act <- get_activity_list(stoken) This function returns a list of activities. class(my_act): list. In my case, there are 379 activies. FORMATTING THE DATA To make the data easier to work with, I convert it to a data frame. There are many more fields than I’ve selected below - these are all I want for this post. info_df <- data.frame() for(act in 1:length(my_act)){ tmp <- my_act[[act]] tmp_df <- data.frame(name = tmp$name, type = tmp$type, distance = tmp$distance, moving_time = tmp$moving_time, elapsed_time = tmp$elapsed_time, start_date = tmp$start_date_local, total_elevation_gain = tmp$total_elevation_gain, trainer = tmp$trainer, manual = tmp$manual, average_speed = tmp$average_speed, max_speed = tmp$max_speed) info_df <- rbind(info_df, tmp_df) } I want to convert a few fields to units that make more sense for me (miles, feet, hours instead of meters and seconds). I’ve also created a number of features, though I’ve suppressed the code here. You can see all of the code on github. HOW FAR HAVE I GONE? Since August 08, 2014, I have - under my own power - traveled 1300.85 miles. There were a few periods without much action (a whole year from mid-2016 through later-2017), which is a bit sad. The last few months have been good, though. Here’s a similar view, but split by activity. I’ve been running recently. I haven’t really ridden my bike since the first 2 summers I had it. I rode the peloton when we first got it, but not since. I was a walker when I first tore the labrum in my hip in early 2018. Finally, here’s the same data again, but split up in a ridgeplot. SUMMARY There’s a TON of data that is returned by the Strava API. This blog just scratches the surface of analysis that is possible - mostly I am just introducing how to get the data and get up and running. 
As a new year’s resolution, I’ve committed to run 312 miles this year. That is 6 miles per week for 52 weeks (for those trying to wrap their head around the weird number). Now that I’ve been able to pull this data, I’ll have to set up a tracker/dashboard for that data. More to come!
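Since the post ends with that 312-mile goal, here is a minimal sketch of the kind of tracker I have in mind, using the info_df data frame built above; the meters-to-miles conversion and the hard-coded year are assumptions for illustration:

## SKETCH: PROGRESS TOWARD A 312-MILE YEARLY GOAL, USING info_df FROM ABOVE
## (THE RAW distance FIELD FROM THE STRAVA API IS IN METERS)
info_df$miles <- info_df$distance * 0.000621371
info_df$date <- as.Date(info_df$start_date)
ytd_miles <- sum(info_df$miles[format(info_df$date, "%Y") == "2019"])   ## ASSUMED YEAR
ytd_miles          ## MILES LOGGED SO FAR THIS YEAR
312 - ytd_miles    ## MILES LEFT TO HIT THE GOAL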

Using R and Splunk: Lookups of More Than 10,000 Results

Splunk, for some probably very good reasons, has limits on how many results are returned by sub-searches (which in turn limits us on lookups, too). Because of this, I’ve used R to search Splunk through its API endpoints (using the httr package) and utilize loops, the plyr package, and other data manipulation flexibilities given through the use of R. This has allowed me to answer some questions for our business team that on the surface seem simple enough, but the data gathering and manipulation get either too complex or large for Splunk to handle efficiently. Here are some examples: Of the 1.5 million customers we’ve emailed in a marketing campaign, how many of them have made the conversion? How are our 250,000 beta users accessing the platform? Who are the users logging into our system from our internal IPs? The high-level steps to using R and Splunk are: Import the lookup values of concern as a csv Create the lookup as a string Create the search string including the lookup just created Execute the GET to get the data Read the response into a data table I’ve taken this one step further; because my lookups are usually LARGE, I end up breaking up the search into smaller chunks and combining the results at the end. Here is some example code that you can edit to show what I’ve done and how I’ve done it. This bit of code will iteratively run the “searchstring” 250 times and combine the results. ## LIBRARY THAT ENABLES THE HTTPS CALL ## library(httr) ## READ IN THE LOOKUP VALUES OF CONCERN ## mylookup <- read.csv("mylookup.csv", header = FALSE) ## ARBITRARY "CHUNK" SIZE TO KEEP SEARCHES SMALLER ## start <- 1 end <- 1000 ## CREATE AN EMPTY DATA FRAME THAT WILL HOLD END RESULTS ## alldata <- data.frame() ## HOW MANY "CHUNKS" WILL NEED TO BE RUN TO GET COMPLETE RESULTS ## for(i in 1:250){ ## CREATES THE LOOKUP STRING FROM THE mylookup VARIABLE ## lookupstring <- paste(mylookup[start:end, 1], sep = "", collapse = '" OR VAR_NAME="') ## CREATES THE SEARCH STRING; THIS IS A SIMPLE SEARCH EXAMPLE ## searchstring <- paste('index = "my_splunk_index" (VAR_NAME="', lookupstring, '") | stats count BY VAR_NAME', sep = "") ## RUNS THE SEARCH; SUB IN YOUR SPLUNK LINK, USERNAME, AND PASSWORD ## response <- GET("https://our.splunk.link:8089/", path = "servicesNS/admin/search/search/jobs/export", encode="form", config(ssl_verifyhost=FALSE, ssl_verifypeer=0), authenticate("USERNAME", "PASSWORD"), query=list(search=paste0("search ", searchstring, collapse="", sep=""), output_mode="csv")) ## CHANGES THE RESULTS TO A DATA TABLE ## result <- read.table(text = content(response, as = "text"), sep = ",", header = TRUE, stringsAsFactors = FALSE) ## BINDS THE CURRENT RESULTS WITH THE OVERALL RESULTS ## alldata <- rbind(alldata, result) ## UPDATES THE START POINT start <- end + 1 ## UPDATES THE END POINT, BUT MAKES SURE IT DOESN'T GO TOO FAR ## if((end + 1000) > nrow(mylookup)){ end <- nrow(mylookup) } else { end <- end + 1000 } ## FOR TROUBLESHOOTING, I PRINT THE ITERATION ## #print(i) } ## WRITES THE RESULTS TO A CSV ## write.table(alldata, "mydata.csv", row.names = FALSE, sep = ",") So - that is how you do a giant lookup against Splunk data with R! I am sure that there are more efficient ways of doing this, even in the Splunk app itself, but this has done the trick for me!

Using the Google Search API and Plotly to Locate Waterparks

I’ve got a buddy who manages and builds waterparks. I thought to myself… I am probably the only person in the world who has a friend that works at a waterpark - cool. Then I started thinking some more… there has to be more than just his waterpark in this country; I’ve been to at least a few… and the thinking continued… I wonder how many there are… and continued… and I wonder where they are… and, well, here we are at the culmination of that curiosity with this blog post. So - the first problem - how would I figure that out? As with most things I need answers to in this world, I turned to Google and asked: Where are the waterparks in the US? The answer appears to be: there are a lot. The data is there if I can get my hands on it. Knowing that Google has an API, I signed up for an API key and away I went! Until I was stopped abruptly with limits on how many results will be returned: a measly 20 per search. I know R and wanted to use that to hit the API. Using the httr package and a for loop, I conceded to doing the search once per state and living with a maximum of 20 results per state. Easy fix. Here’s the code to generate the search string and query Google: q1 <- paste("waterparks in ", list_of_states[j,1], sep = "") response <- GET("https://maps.googleapis.com/", path = "maps/api/place/textsearch/xml", query = list(query = q1, key = "YOUR_API_KEY")) The results come back in XML (or JSON, if you so choose… I went with XML for this, though) - something that I have not had much experience in. I used the XML package and a healthy amount of more time in Google search-land and was able to parse the data into data frame! Success! Here’s a snippet of the code to get this all done: result <- xmlParse(response) result1 <- xmlRoot(result) result2 <- getNodeSet(result1, "//result") data[counter, 1] <- xmlValue(result2[[i]][["name"]]) data[counter, 2] <- xmlValue(result2[[i]][["formatted_address"]]) data[counter, 3] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lat"]]) data[counter, 4] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lng"]]) data[counter, 5] <- xmlValue(result2[[i]][["rating"]]) Now that the data is gathered and in the right shape - what is the best way to present it? I’ve recently read about a package in R named plotly. They have many interesting and interactive visualizations, plus the API plugs right into R. I found a nice example of a map using the package. With just a few lines of code and a couple iterations, I was able to generate this (click on the picture to get the full interactivity): Waterpark’s in the USA This plot can be seen here, too. Not too shabby! There are a few things to mention here… For one, not every water park has a rating; I dealt with this by making the NAs into 0s. That’s probably not the nicest way of handling that. Also - this is only the top 20 waterparks as Google decided per state. There are likely some waterparks out there that are not represented here. There are also probably non-waterparks represented here that popped up in the results. For those of you who are interested in the data or script I used to generate this map, feel free to grab them at those links. Maybe one day I’ll come back to this to find out where there are the most waterparks per capita - or some other correlation to see what the best water park really is… this is just the tip of the iceberg. It feels good to scratch a few curiosity driven scratches in one project!
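The post doesn’t show the Plotly call itself, so here is a minimal sketch of how a map like that could be drawn with the plotly R package. It assumes the parsed results were collected into a data frame named parks with name, lat, lng, and rating columns matching what the snippet above fills in; this is an illustration, not the original script:

library(plotly)

## SKETCH: PLOT THE PARSED WATERPARK LOCATIONS ON A US MAP
## ('parks' AND ITS COLUMN NAMES ARE ASSUMED FROM THE PARSING SNIPPET ABOVE)
fig <- plot_geo(parks, lat = ~lat, lon = ~lng) %>%
  add_markers(text = ~name, color = ~as.numeric(rating)) %>%
  layout(title = "Waterparks in the USA",
         geo = list(scope = "usa"))
fig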

Doing a Sentiment Analysis on Tweets (Part 2)

INTRO This is post is a continuation of my last post. There I pulled tweets from Twitter related to “Comcast email,” got rid of the junk, and removed the unnecessary/unwanted data. Now that I have the tweets, I will further clean the text and subject it to two different analyses: emotion and polarity. WHY DOES THIS MATTER Before I get started, I thought it might be a good idea to talk about WHY I am doing this (besides the fact that I learned a new skill and want to show it off and get feedback). This yet incomplete project was devised for two reasons: Understand the overall customer sentiment about the product I support Create an early warning system to help identify when things are going wrong on the platform Keeping the customer voice at the forefront of everything we do is tantamount to providing the best experience for the users of our platform. Identifying trends in sentiment and emotion can help inform the team in many ways, including seeing the reaction to new features/releases (i.e. – seeing a rise in comments about a specific addition from a release) and identifying needed changes to current functionality (i.e. – users who continually comment about a specific behavior of the application) and improvements to user experience (i.e. – trends in comments about being unable to find a certain feature on the site). Secondarily, this analysis can act as an early warning system when there are issues with the platform (i.e. – a sudden spike in comments about the usability of a mobile device). Now that I’ve explained why I am doing this (which I probably should have done in this sort of detail the first post), let’s get into how it is actually done… STEP ONE: STRIPPING THE TEXT FOR ANALYSIS There are a number of things included in tweets that dont matter for the analysis. Things like twitter handles, URLs, punctuation… they are not necessary to do the analysis (in fact, they may well confound it). This bit of code handles that cleanup. For those following the scripts on GitHub, this is part of my tweet_clean.R script. Also, to give credit where it is due: I’ve borrowed and tweaked the code from Andy Bromberg’s blog to do this task. library(stringr) ##Does some of the text editing ##Cleaning up the data some more (just the text now) First grabbing only the text text <- paredTweetList$Tweet # remove retweet entities text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", text) # remove at people text <- gsub("@\\w+", "", text) # remove punctuation text <- gsub("[[:punct:]]", "", text) # remove numbers text <- gsub("[[:digit:]]", "", text) # remove html links text <- gsub("http\\w+", "", text) # define "tolower error handling" function try.error <- function(x) { # create missing value y <- NA # tryCatch error try_error <- tryCatch(tolower(x), error=function(e) e) # if not an error if (!inherits(try_error, "error")) y <- tolower(x) # result return(y) } # lower case using try.error with sapply text <- sapply(text, try.error) # remove NAs in text text <- text[!is.na(text)] # remove column names names(text) <- NULL STEP TWO: CLASSIFYING THE EMOTION FOR EACH TWEET So now the text is just that: only text. The punctuation, links, handles, etc. have been removed. Now it is time to estimate the emotion of each tweet. Through some research, I found that there are many posts/sites on Sentiment Analysis/Emotion Classification that use the “Sentiment” package in R. 
I thought: “Oh great – a package tailor made to solve the problem for which I want an answer.” The problem is that this package has been deprecated and removed from the CRAN library. To get around this, I downloaded the archived package and pulled the code for doing the emotion classification. With some minor tweaks, I was able to get it going. This can be seen in its entirety in the classify_emotion.R script. You can also see the “made for the internet” version here: library(RTextTools) library(tm) algorithm <- "bayes" prior <- 1.0 verbose <- FALSE matrix <- create_matrix(text) lexicon <- read.csv("./data/emotions.csv.gz",header=FALSE) counts <- list(anger=length(which(lexicon[,2]=="anger")), disgust=length(which(lexicon[,2]=="disgust")), fear=length(which(lexicon[,2]=="fear")), joy=length(which(lexicon[,2]=="joy")), sadness=length(which(lexicon[,2]=="sadness")), surprise=length(which(lexicon[,2]=="surprise")), total=nrow(lexicon)) documents <- c() for (i in 1:nrow(matrix)) { if (verbose) print(paste("DOCUMENT",i)) scores <- list(anger=0,disgust=0,fear=0,joy=0,sadness=0,surprise=0) doc <- matrix[i,] words <- findFreqTerms(doc,lowfreq=1) for (word in words) { for (key in names(scores)) { emotions <- lexicon[which(lexicon[,2]==key),] index <- pmatch(word, emotions[,1], nomatch=0) if (index > 0) { entry <- emotions[index,] category <- as.character(entry[[2]]) count <- counts[[category]] score <- 1.0 if (algorithm=="bayes") score <- abs(log(score*prior/count)) if (verbose) { print(paste("WORD:",word,"CAT:", category,"SCORE:",score)) } scores[[category]] <- scores[[category]]+score } } } if (algorithm=="bayes") { for (key in names(scores)) { count <- counts[[key]] total <- counts[["total"]] score <- abs(log(count/total)) scores[[key]] <- scores[[key]]+score } } else { for (key in names(scores)) { scores[[key]] <- scores[[key]]+0.000001 } } best_fit <- names(scores)[which.max(unlist(scores))] if (best_fit == "disgust" && as.numeric(unlist(scores[2]))-3.09234 < .01) best_fit <- NA documents <- rbind(documents, c(scores$anger, scores$disgust, scores$fear, scores$joy, scores$sadness, scores$surprise, best_fit)) } colnames(documents) <- c("ANGER", "DISGUST", "FEAR", "JOY", "SADNESS", "SURPRISE", "BEST_FIT") Here is a sample output from this code: ANGER DISGUST FEAR JOY SADNESS SURPRISE BEST_FIT “1.46871776464786” “3.09234031207392” “2.06783599555953” “1.02547755260094” “7.34083555412328” “7.34083555412327” “sadness” “7.34083555412328” “3.09234031207392” “2.06783599555953” “1.02547755260094” “1.7277074477352” “2.78695866252273” “anger” “1.46871776464786” “3.09234031207392” “2.06783599555953” “1.02547755260094” “7.34083555412328” “7.34083555412328” “sadness” Here you can see that the initial author is using naive Bayes (which honestly I don’t yet understand) to analyze the text. I wanted to show a quick snippet of how the analysis is being done “under the hood.” For my purposes though, I only care about the emotion outputted and the tweet it is analyzed from. emotion <- documents[, "BEST_FIT"] This variable, emotion, is returned by the classify_emotion.R script. CHALLENGES OBSERVED In addition to not fully understanding the code, the emotion classification seems to only work OK (which is pretty much expected… this is a canned analysis that hasn’t been tailored to my analysis at all). I’d like to come back to this one day to see if I can do a better job analyzing the emotions of the tweets. STEP THREE: CLASSIFYING THE POLARITY OF EACH TWEET Similarly to what we saw in the previous step, I will use the cleaned text to analyze the polarity of each tweet.
This code is also from the old R package titled “Sentiment.” As with above, I was able to get the code working with only some minor tweaks. This can be seen in its entirety in the classify_polarity.R script. Here it is, too: algorithm <- "bayes" pstrong <- 0.5 pweak <- 1.0 prior <- 1.0 verbose <- FALSE matrix <- create_matrix(text) lexicon <- read.csv("./data/subjectivity.csv.gz",header=FALSE) counts <- list(positive=length(which(lexicon[,3]=="positive")), negative=length(which(lexicon[,3]=="negative")), total=nrow(lexicon)) documents <- c() for (i in 1:nrow(matrix)) { if (verbose) print(paste("DOCUMENT",i)) scores <- list(positive=0,negative=0) doc <- matrix[i,] words <- findFreqTerms(doc, lowfreq=1) for (word in words) { index <- pmatch(word, lexicon[,1], nomatch=0) if (index > 0) { entry <- lexicon[index,] polarity <- as.character(entry[[2]]) category <- as.character(entry[[3]]) count <- counts[[category]] score <- pweak if (polarity == "strongsubj") score <- pstrong if (algorithm=="bayes") score <- abs(log(score*prior/count)) if (verbose) { print(paste("WORD:", word, "CAT:", category, "POL:", polarity, "SCORE:", score)) } scores[[category]] <- scores[[category]]+score } } if (algorithm=="bayes") { for (key in names(scores)) { count <- counts[[key]] total <- counts[["total"]] score <- abs(log(count/total)) scores[[key]] <- scores[[key]]+score } } else { for (key in names(scores)) { scores[[key]] <- scores[[key]]+0.000001 } } best_fit <- names(scores)[which.max(unlist(scores))] ratio <- as.integer(abs(scores$positive/scores$negative)) if (ratio==1) best_fit <- "neutral" documents <- rbind(documents,c(scores$positive, scores$negative, abs(scores$positive/scores$negative), best_fit)) if (verbose) { print(paste("POS:", scores$positive,"NEG:", scores$negative, "RATIO:", abs(scores$positive/scores$negative))) cat("\n") } } colnames(documents) <- c("POS","NEG","POS/NEG","BEST_FIT") Here is a sample output from this code: POS NEG POS/NEG BEST_FIT “1.03127774142571” “0.445453222112551” “2.31512017476245” “positive” “1.03127774142571” “26.1492093145274” “0.0394381997949273” “negative” “17.9196623384892” “17.8123396772424” “1.00602518608961” “neutral” Again, I just wanted to show a quick snippet of how the analysis is being done “under the hood.” I only care about the polarity outputted and the tweet it is analyzed from. polarity <- documents[, "BEST_FIT"] This variable, polarity, is returned by the classify_polarity.R script. CHALLENGES OBSERVED As with above, this is a stock analysis and hasn’t been tweaked for my needs. The analysis does OK, but I want to come back to this again one day to see if I can do better. QUICK CONCLUSION So… Now I have the emotion and polarity for each tweet. This can be useful to see on its own, but I think it is more worthwhile in aggregate. In my next post, I’ll show that. In the next post, I’ll also show an analysis of the word counts with a wordcloud… This gets into the secondary point of this analysis. Hypothetically, I’d like to see common issues bubbled up through the wordcloud.
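Before the aggregate views in the next post, here is a minimal sketch of how the pieces can be stitched together, assuming the cleaned text vector and the emotion and polarity vectors returned by the two scripts line up one-to-one; the tweet_sentiment name is just for illustration.

# Combine the cleaned tweets with their classified emotion and polarity
tweet_sentiment <- data.frame(text = text,
                              emotion = emotion,
                              polarity = polarity,
                              stringsAsFactors = FALSE)

# Quick aggregate views: counts by emotion, by polarity, and crossed
table(tweet_sentiment$emotion, useNA = "ifany")
table(tweet_sentiment$polarity)
table(tweet_sentiment$emotion, tweet_sentiment$polarity, useNA = "ifany")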

Doing a Sentiment Analysis on Tweets (Part 1)

INTRO So… This post is my first foray into the R twitteR package. This post assumes that you have that package installed already in R. I show here how to get tweets from Twitter in preparation for doing some sentiment analysis. My next post will be the actual sentiment analysis. For this example, I am grabbing tweets related to “Comcast email.” My goal for this exercise is to see how people are feeling about the product I support. STEP 1: GETTING AUTHENTICATED TO TWITTER First, you’ll need to create an application at Twitter. I used this blog post to get rolling with that. This post does a good job walking you through the steps to do that. Once you have your app created, this is the code I used to create and save my authentication credentials. Once you’ve done this once, you need only load your credentials in the future to authenticate with Twitter. library(twitteR) ## R package that does some of the Twitter API heavy lifting consumerKey <- "INSERT YOUR KEY HERE" consumerSecret <- "INSERT YOUR SECRET HERE" reqURL <- "https://api.twitter.com/oauth/request_token" accessURL <- "https://api.twitter.com/oauth/access_token" authURL <- "https://api.twitter.com/oauth/authorize" twitCred <- OAuthFactory$new(consumerKey = consumerKey, consumerSecret = consumerSecret, requestURL = reqURL, accessURL = accessURL, authURL = authURL) twitCred$handshake() save(twitCred, file="credentials.RData") STEP 2: GETTING THE TWEETS Once you have your authentication credentials set, you can use them to grab tweets from Twitter. The next snippets of code come from my scraping_twitter.R script, which you are welcome to see in its entirety on GitHub. ##Authentication load("credentials.RData") ##has my secret keys and shiz registerTwitterOAuth(twitCred) ##logs me in ##Get the tweets about "comcast email" to work with tweetList <- searchTwitter("comcast email", n = 1000) tweetList <- twListToDF(tweetList) ##converts that data we got into a data frame As you can see, I used the twitteR R package to authenticate and search Twitter. After getting the tweets, I converted the results to a data frame to make it easier to analyze the results. STEP 3: GETTING RID OF THE JUNK Many of the tweets returned by my initial search are totally unrelated to Comcast Email. An example of this would be: “I am selling something random… please email me at myemailaddress@comcast.net” The tweet above includes the words email and comcast, but has nothing to actually do with Comcast Email and the way the user feels about it, other than they use it for their business. So… based on some initial, manual analysis of the tweets, I’ve decided to pull those tweets with the phrases: “fix” AND “email” in them (in that order) “Comcast” AND “email” in them in that order “no email” in them Any tweet that comes from a source with “comcast” in the handle “Customer Service” AND “email” OR the reverse (“email” AND “Customer Service”) in them This is done with this code: ##finds the rows that have the phrase "fix ... email" in them fixemail <- grep("(fix.*email)", tweetList$text) ##finds the rows that have the phrase "comcast ...
email" in them comcastemail <- grep("[Cc]omcast.*email", tweetList$text) ##finds the rows that have the phrase "no email" in them noemail <- grep("no email", tweetList$text) ##finds the rows that originated from a Comcast twitter handle comcasttweet <- grep("[Cc]omcast", tweetList$screenName) ##finds the rows related to email and customer service custserv <- grep("[Cc]ustomer [Ss]ervice.*email|email.*[Cc]ustomer [Ss]ervice", tweetList$text) After pulling out the duplicates (some tweets may fall into multiple scenarios from above) and ensuring they are in order (as returned initially), I assign the relevant tweets to a new variable with only some of the returned columns. The returned columns are: text favorited favoriteCount replyToSN created truncated replyToSID id replyToUID statusSource screenName retweetCount isRetweet retweeted longitude latitude All I care about are: text created statusSource screenName This is handled through this tidbit of code: ##combine all of the "good" tweets row numbers that we greped out above and ##then sorts them and makes sure they are unique combined <- c(fixemail, comcastemail, noemail, comcasttweet, custserv) uvals <- unique(combined) sorted <- sort(uvals) ##pull the row numbers that we want, and with the columns that are important to ##us (tweet text, time of tweet, source, and username) paredTweetList <- tweetList[sorted, c(1, 5, 10, 11)] STEP 4: CLEAN UP THE DATA AND RETURN THE RESULTS Lastly, for this first script, I make the sources look nice, add titles, and return the final list (only a sample set of tweets shown): ##make the device source look nicer paredTweetList$statusSource <- sub("<.*\">", "", paredTweetList$statusSource) paredTweetList$statusSource <- sub("</a>", "", paredTweetList$statusSource) ##name the columns names(paredTweetList) <- c("Tweet", "Created", "Source", "ScreenName") paredTweetList Tweet created statusSource screenName Dear Mark I am having problems login into my acct REDACTED@comcast.net I get no email w codes to reset my password for eddygil HELP HELP 2014-12-23 15:44:27 Twitter Web Client riocauto @msnbc @nbc @comcast pay @thereval who incites the murder of police officers. Time to send them a message of BOYCOTT! Tweet/email them NOW 2014-12-23 14:52:50 Twitter Web Client Monty_H_Mathis Comcast, I have no email. This is bad for my small business. Their response “Oh, I’m sorry for that”. Problem not resolved. #comcast 2014-12-23 09:20:14 Twitter Web Client mathercesul CHALLENGES OBSERVED As you can see from the output, sometimes some “junk” still gets in. Something I’d like to continue working on is a more reliable algorithm for identifying appropriate tweets. I also am worried that my choice of subjects is biasing the sentiment.

data jawn

First Annual Data Jawn

I went to a data event last night called the Data Jawn, presented by RJMetrics, and wanted to share a write-up of the event – it was pretty cool and sparked some really good ideas. One idea I really liked – Predicting who will call in/have an issue and proactively reaching out to them… Sounds like a game changer to me. I could then trigger alerts when we see that activity or make a list for reaching out via phone, email, text… whatever. The possibilities of this seem pretty big. There was also a lot of tweeting going on under the hashtag #datajawn. Here are some notes/takeaways from each speaker: Bob Moore – RJMetrics CEO Motto of RJMetrics: “Inspire and empower data driven people” Jake Stein – RJMetrics cofounder “Be data driven” Steps to all problem solving: + Collect Data + Analyze + Present Results Madelyn Fitzgerald – RJMetrics “Need to be problem focused, not solution focused” + This means that you need to ask a question of your data before building out the answer + Having KPIs is awesome… but they need to be built to answer a question The most common mistake people make with data is diving into the data before asking a question Kim Siejak – Independence Blue Cross IBX invested in Hadoop last year Doing a number of predictive models and machine learning + Predicting which people will go to the hospital before they go + Predicting different diseases based on health history + Predicting who will call in before they complain David Wallace – RJMetrics Every Important SaaS Metric in a Single Infographic Document how every KPI is derived and make sure everyone understands it “If you’re not experimenting, you’re not learning. If you’re not learning, you’re not growing.” Lauren Ancona/Christopher Tufts Did a sentiment analysis on tweets with emojis Pulled all the location based tweets from the Philadelphia area and visualized them on a map using CartoDB and torque.js (really cool visualizations!) Lots of people use emojis! https://github.com/laurenancona/twimoji Jim Multari - Comcast “Dashboards are no good for senior leaders” Only have 10 seconds to get your message across when talking to executives Alerting on KPI changes Four things needed to make a data driven org: + Right data & insights + Right data & systems + Right people + Right culture Ben Garvey – RJMetrics Pie charts are evil - you can estimate linear distance much more easily than angular distance “Data visualization gives you confidence in state and trend without effort” You can tell the story much more easily with the right visualization. Stacey Mosley – Data Services Manager for the City of Philadelphia Gave a talk about how they improved the use of court time for L&I. She didn’t share a lot about her processes or what data she used to do this… There were a few other speakers to end the talk with nice messages, but by that point I was fully tweeting and stuck checking out what everyone else thought of the event. I hope that there continue to be opportunities like this locally to learn more about Data Analytics!

data science


A Data Scientist's Take on the Roadmap to AI

INTRODUCTION Recently I was asked by a former colleague about getting into AI. He has truly big data and wants to use this data to power “AI” - if the headlines are to be believed, everyone else is already doing it. Though it was difficult for my ego, I told him I couldn’t help him in our 30 minute call and that he should think about hiring someone to get him there. The truth was I really didn’t have a solid answer for him in the moment. This was truly disappointing - in my current role and in my previous role, I put predictive models into production. After thinking about it for a bit, there is definitely a similar path I took in both roles. There are three steps, in my mind, to getting to “AI.” Though this seems simple, it is a long process and potentially not linear - you may have to keep coming back to previous steps. Baseline (Reporting) Understand (Advanced Analytics) Artificial Intelligence (Data Science) BASELINE (REPORTING) Fun fact: You cannot effectively predict anything if you cannot measure the impact. What I mean by baseline is building out a reporting suite. Having a fundamental understanding of your business and environment is key. Without doing this step, you may try to predict the wrong thing entirely - or start with something that isn’t the most impactful. For me, this step started with finding the data in the first place. Perhaps, like my colleague, you have lots of data and you’re ready to jump in. That’s great and makes getting started that much more straightforward. In my role, I joined a finance team that really didn’t have a good bead on this - finding the data was difficult (and getting the owners of that data to give me access was a process as well). To be successful, start small and iterate. Our first reports were built from manually downloading machine logs, processing them in R with JSON packages, and turning them into a black-and-white document. It was ugly, but it helped us know what we needed to know in that moment - oh yeah… it was MUCH better than nothing. “Don’t let perfection be the enemy of good.” - paraphrased from Voltaire. From this, I gained access to our organization’s data warehouse, put automation in place, and purchased some Tableau licenses. This phase took a few months and is constantly being refined, but we are now able to see the impact of our decisions at a glance. This new understanding inevitably leads to more questions - cue step 2: Understanding. UNDERSTANDING (ADVANCED ANALYTICS) If you have never circulated reports and dashboards to others… let me fill you in on something: it will ALWAYS lead to additional, progressively harder questions. This step is an investment in time and expertise - you have to commit to having dedicated resource(s) (read: people… it is inhumane to call people resources and you may only need one person or some of a full-time person’s time). Why did X go up unexpectedly (breaks the current trend)? Are we over indexing on this type of customer? Right before our customer leaves, this weird thing happens - what is this weird thing and why is it happening? Like the previous step - this will be ongoing. Investing in someone to do advanced analytics will help you to understand the fine details of your business AND … (drum roll) … will help you to understand which part of your business is most ripe for “AI”! ARTIFICIAL INTELLIGENCE (DATA SCIENCE) It is at this point that you will be able to do real, bona fide data science.
A quick rant: Notice that I purposefully did not use the term “AI” (I know I used it throughout this article and even in the title of this section… what can I say - I am in-tune with marketing concepts, too). “AI” is a term that is overused and rarely implemented. Data science, however, comes in many forms and can really transform your business. Here are a few ideas for what you can do with data science: Prediction/Machine Learning Testing Graph Analysis Perhaps you want to predict whether a sale is fraud or which existing customer is most apt to buy your new product? You can also test whether a new strategy works better than the old. This requires that you use statistical concepts to ensure valid testing and results (see the small sketch at the end of this post). My new obsession is around graph analysis. With graphs you can see relationships that may have been hidden before - this will enable you to identify new targets and enrich your understanding of your business! Data science is usually a very specific thing and takes many forms! SUMMARY Getting to data science is a process - it will take an investment. There are products out there that will help you shortcut some of these steps and I encourage you to consider them. There are products to help with reporting, analytics, and data science. These should, in my very humble opinion, be used by people who are dedicated to the organization’s data, analytics, and science. Directions for data science - measure, analyze, predict, repeat!
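As promised above, here is a minimal sketch of the testing idea using base R's prop.test to compare conversion rates under an old and a new strategy; the counts are made up purely for illustration.

# Hypothetical results: conversions out of visitors for the old and new strategy
conversions <- c(old = 120, new = 150)
visitors <- c(old = 2000, new = 2050)

# Two-sample test of equal proportions - is the new strategy's rate really different?
ab_test <- prop.test(x = conversions, n = visitors)
ab_test$estimate # observed conversion rates for each strategy
ab_test$p.value  # a small p-value suggests a real difference rather than noise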

Getting Started with Data Science

updated: November 2020 Everyone in the world has a “how to” guide to data science… well, maybe not everyone - but there are a lot of “guides” out there. I get this question infrequently, so I thought I would do my best to put together what have been my best resources for learning. MY STORY Personally, I learned statistics by getting my Masters in Applied Statistics at Villanova University - it took 2.5 years. I got my introduction to R by working through the Johns Hopkins University Data Science Specialization on Coursera. Similarly for python, I got an online introduction via DataCamp. This was all bolstered by working with these tools at work and in side projects. The repetition of working with these tools every day has made me more fluent with them. Here are some resources that I’ve used or know of - I’ve tried to outline them and group them to the best of my ability. There are many more out there, and you may find some better or worse depending on your style. LEARNING DATA Johns Hopkins University Data Science Specialization on Coursera: As mentioned above, this course gave me my start with R, RStudio, and git. Kaggle: If you are as competitive as I am, this site should get you going - the interactive kernels and social aspects of this site make it a great place to see other data science in action. Plagiarism is the greatest form of flattery (and the easiest way to learn - thanks, Stack Overflow). EdX - R Programming: I haven’t used EdX much, but there is a wealth of MOOCs here. LEARNING STATISTICS & OTHER IMPORTANT MATH Khan Academy - Statistics: I have used Khan Academy on multiple occasions for refreshers in Statistics and Linear Algebra. The classes are interactive, manageable, and self-paced. Khan Academy - Linear Algebra Coursera - Statistics with R EdX - Data Analytics & Statistics courses Of course - higher education, as well. DATA BOOKS Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy - Cathy O’Neil: Cathy O’Neil does a great job of outlining how data algorithms can have unintended negative consequences. Anyone who builds a machine learning algorithm should read it. The Wall Street Journal Guide to Information Graphics: The Dos and Don’ts of Presenting Data, Facts, and Figures - Dona M. Wong: I have this book on my desk as a reference. Quick read filled with easy to understand rules and objectives for creating data visualizations. Analyzing data is hard - this book teaches tips to build clear and informative visualizations that don’t take away from the message. The Signal and the Noise: Why So Many Predictions Fail-but Some Don’t - Nate Silver: Nate Silver is [in]famous for predicting elections. This book gets into the details of how he does that. Super interesting for a guy increasingly interested in politics. How Not to Be Wrong: The Power of Mathematical Thinking - Jordan Ellenberg: Critical thinking is crucial in data science and analytics. This book gives some great tips on how to approach “facts” with the right mindset. Thinking, Fast and Slow - Daniel Kahneman: Currently on my list to read. PODCASTS Hidden Brain: NPR podcast covering many topics. I find it super interesting. While not distinctly data related, it frequently covers topics that have tangential importance to being a good data scientist. Exponential View: Not primarily focused on data, but very frequently covers artificial intelligence and machine learning topics. I recommend the newsletter that goes along with this podcast (link below).
Not So Standard Deviations: Roger Peng and Hilary Parker host a podcast on all things data science. The Data Lab Podcast: Local [to Philly] data podcast interviewing local data scientists. I find it reassuring to hear that my habits are often in line with these people’s, plus I’ve picked up many really great tidbits (like the Exponential View newsletter). O’Reilly Data Show: I have attended the Strata data conference by O’Reilly. Much like the conference, this podcast covers many relevant data themes. Data Skeptic: Another data podcast that covers many good data topics. BLOGS & NEWSLETTERS Exponential View: Billed as a weekly “wondermissive”, the author Azeem Azhar covers many topics relevant to data and the greater technology economy. I truly look forward to getting this newsletter every Sunday morning. Farnam Street: A weekly newsletter (and blog) about decision making. I frequently find golden tips on how to think and frame thinking. Must read. Twitter: I follow many great data people on Twitter and get a great deal of my data news there.

How to Build a Binary Classification Model

What Is Binary Classification? Algorithms for Binary Classification Logistic Regression Decision Trees/Random Forests Decision Trees Random Forests Nearest Neighbor Support Vector Machines (SVM) Neural Networks Great. Now what? Determining What the Problem is Locate and Obtain Data Data Mining & Preparing for Analysis Splitting the Data Building the Models Validating the Models Conclusion What Is Binary Classification? Binary classification is used to classify a given set into two categories. Usually, this answers a yes or no question: Did a particular passenger on the Titanic survive? Is a particular user account compromised? Is my product on the shelf in a certain pharmacy? This type of inference is frequently made using supervised machine learning techniques. Supervised machine learning means that you have historical, labeled data, from which your algorithm may learn. Algorithms for Binary Classification There are many methods for doing binary classification. To name a few: Logistic Regression Decision Trees/Random Forests Nearest Neighbor Support Vector Machines (SVM) Neural Networks Logistic Regression Logistic regression is a parametric statistical model that predicts binary outcomes. Parametric means that this algorithm is based on a distribution (in this case, the logistic distribution), and as such must follow a few assumptions: Obviously, the dependent variable must be binary Only meaningful independent variables are included in the model Error terms need to be independent and identically distributed Independent variables need to be independent from one another Large sample sizes are preferred Because of these assumptions, parametric tests tend to be more statistically powerful than nonparametric tests; in other words, they tend to better find a significant effect when it indeed exists. Logistic regression follows the equation \(P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}\), where \(P(Y = 1 \mid X)\) is the probability of the outcome being 1 given the independent variables (this is the dependent variable, limited to values between 0 and 1), \(x_1, \dots, x_k\) are the independent variables, and \(\beta_0, \beta_1, \dots, \beta_k\) are the intercept and coefficients for the independent variables. This equation is created based on a training set of data – or historical, labeled data – and then is used to predict the likelihoods of future, unlabeled data. Decision Trees/Random Forests Decision Trees A decision tree is a nonparametric classifier. It effectively partitions the data, starting first by splitting on the independent variable that gives the most information gain, and then recursively repeating this process at subsequent levels. Information gain is a formula that determines how “important” an independent variable is in predicting the dependent variable. It takes into account how many distinct values there are (in terms of categorical variables) and the number and size of branches in the decision tree. The goal is to pick the most informative variable that is still general enough to prevent overfitting. At the bottom of the decision tree, the leaf nodes are groupings of events within the set that all follow the rules set forth throughout the tree to get to the node. Future, unlabeled events are then fed into the tree to see which group they belong to – the average of the labeled (training) data for the leaf is then assigned as the predicted value for the unlabeled event. As with logistic regression, overfitting is a concern.
If you allow a decision tree to continue to grow without bound, eventually you will have all identical events in each leaf; while this may look beneficial, it may be too specific to the training data and mislabel future events. “Pruning” occurs to prevent overfitting. Random Forests Random forests are an ensemble method built upon decision trees. Random forests are a “forest” of decision trees – in other words, you use bootstrap sampling techniques to build many over-fit decision trees, then average out the results to determine a final model. A bootstrap sample is sampling with replacement – in every selection for the sample, each event has an equal chance of being chosen. To clarify – building a random forest model means taking many bootstrap samples and building an over-fit decision tree (meaning you continue to split the tree without bound until every leaf node has identical groups in them) on each. These results, taken together, correct for the biases and potential overfitting of an individual tree. The more trees in your random forest, the better – the trade-off being that more trees mean more computing. Random forests often take a long time to train. Nearest Neighbor The k-nearest neighbor algorithm is a very simple algorithm. Using the training set as reference, the new, unlabeled data is predicted by taking the average of the k closest events. Being a lazy learner - evaluation does not take place until you classify new events - it is quick to train. It can be difficult to determine what k should be. However, because it is easy computationally, you can run multiple iterations without much overhead. Support Vector Machines (SVM) SVM is another supervised machine learning method. SVM iteratively attempts to “split” the two categories by maximizing the distance between a hyperplane (a plane in more than 2 dimensions; most applications of machine learning are in the higher dimensional space) and the closest points in each category. As you can see in the simple example below, the plane iteratively improves the split between the two groups. There are multiple kernels that can be used with SVM, depending on the shape of the data: Linear Polynomial Radial Sigmoid You may also choose to configure how big of steps can be taken by the plane in each iteration, among other configurations. Neural Networks Neural networks (there are several varieties) are built to mimic how a brain solves problems. This is done by creating multiple layers from a single input – most easily demonstrated with image recognition – where it is able to turn groups of pixels into another, single value, over and over again, to provide more information to train the model. Great. Now what? Now that we understand some of the tools in our arsenal, what are the steps to doing the analysis? Determining what the problem is Locate and obtain data Data mining for understanding & preparing for analysis Split data into training and testing sets Build model(s) on training data Test models on test data Validate and pick the best model Determining What the Problem is While it is easy to ask a question, it is difficult to understand all of the assumptions being made by the question asker. For example, a simple question is asked: Will my product be on the shelf of this pharmacy next week? While that question may seem straightforward at first glance, what product are we talking about? What pharmacy are we talking about? What is the time frame that is being evaluated?
Does it need to be in the pharmacy and available if you ask, or does the customer need to be able to visually identify the product? Does it need to be available for the entire time period in question, or does it just have to be available for at least part of the time period in question? Being as specific as possible is vital in order to deliver the correct answer. It is easy to misinterpret the assumptions of the question asker and then do a lot of work to answer the wrong question. Specificity will help ensure time is not wasted and that the question asker gets the answer they were looking for. The final question may look more like: Will there be any Tylenol PM available over-the-counter at midnight, February 28, 2017 at Walgreens on the corner of 17th and John F. Kennedy Blvd in Philadelphia? Well – we don’t know. We can now use historical data to make our best guess. This question is specific enough to answer. Locate and Obtain Data Where is your data? Is it in a database? Some Excel spreadsheet? Once you find it, how big is it? Can you download the data locally? Do you need to find a distributed database to handle it? If it is in a database, can you do some of the data mining (next step) before downloading the data? Be careful… “SELECT * FROM my_table;” can get scary, quick. This is also a good time to think about what tools and/or languages you want to use to mine and manipulate the data. Excel? SQL? R? Python? Some of the numerous other tools or languages out there that are good at a bunch of different things (Julia, Scala, Weka, Orange, etc.)? Get the data into one spot, preferably with some guidance on what and where it is in relation to what you need for your problem, and open it up. Data Mining & Preparing for Analysis The most time consuming step in any data science article you read will always be the data cleaning step. This document is no different – you will spend an inordinate amount of time getting to know the data, cleaning it, getting to know it better, and cleaning it again. You may then proceed to analysis, discover you’ve missed something, and come back to this step. There is a lot to consider in this step and each data analysis is different. Is your data complete? If you are missing values in your data, how will you deal with them? There is no overarching rule on this. If you are dealing with continuous data, perhaps you’ll fill missing data points with the average of similar data. Perhaps you can infer what it should be based on context. Perhaps it constitutes such a small portion of your data that the logical thing to do is to just drop the events altogether. The dependent variable – how does it break down? We are dealing with binomial data here; are there way more zeros than ones? How will you deal with that if there are? Are you doing your analysis on a subset? If so, is your sample representative of the population? How can you be sure? This is where histograms are your friend. Do you need to create variables? Perhaps one independent variable you have is a date, which might be tough to use as an input to your model. Should you find out which day of the week each date was? Month? Year? Season? These are easier to add in as a model input in some cases. Do you need to standardize your data? Perhaps men are listed as “M,” “Male,” “m,” “male,” “dude,” and “unsure.” It would behoove you, in this example, to standardize this data to all take on the same value. In most algorithms, correlated input variables are bad.
This is the time to plot all of the independent variables against each other to see if there is correlation. If there are correlated variables, it may be a tough choice to drop one (or all!). Speaking of independent variables, which are important to predict your dependent variable? You can use information gain packages (depending on the language/tool you are using to do your analysis), step-wise regression, or random forests to help understand the important variables. In many of these steps, there are no hard-and-fast rules on how to proceed. You’ll need to make a decision in the context of your problem. In many cases, you may be wrong and need to come back to the decision after trying things out. Splitting the Data Now that you (think you) have a clean dataset, you’ll need to split it into training and testing datasets. You’ll want to have as much data as possible to train on, though still have enough data left over to test on. This is less and less of an issue in the age of big data. However, sometimes with too much data it will take too long for your algorithms to train. Again – this is another decision that will need to be made in the context of your problem. There are a few options for splitting your data. The most straightforward is to take a portion of your overall dataset to train on (say 70%) and leave the rest behind to test on. This works well in most big data applications. If you do not have a lot of data (or if you do), consider cross-validation. This is an iterative approach where you train your algorithm repeatedly on the same data set, leaving some portion out each iteration to be used as the test set. The most popular versions of cross-validation are k-fold cross validation and leave-one-out cross validation. There is even nested cross-validation, which gets very Inception-like. Building the Models Finally, you are ready to do what we came to do – build the models. We have our datasets cleaned, enriched, and split. Time to build our models. I say models, plural, because you’ll always want to evaluate which method and/or inputs work best. You’ll want to pick a few of the algorithms from above and build the model. While that is vague, depending on your language or tool of choice, there are multiple packages available to perform each analysis. It is generally only a line or two of code to train each model; once we have our models trained, it is time to validate. Validating the Models So – which model did best? How can you tell? We start by predicting results for our test set with each model and building a confusion matrix for each. With this, we can calculate the specificity, sensitivity, and accuracy for each model. For each value, higher is better. The best model is one that performs the best on each of these counts. In the real world, frequently one model will have better specificity, while another will have better sensitivity, and yet another will be the most accurate. Again, there is no hard and fast rule on which model to choose; it all depends on the context. Perhaps false positives are really bad in your context - then the specificity rate should be given more merit. It all depends. From here, you have some measures in order to pick a model and implement it. Conclusion Much of model building, in general, is part computer science, part statistics, and part business understanding. Understanding which tools and languages are best to implement the best statistical modeling technique to solve a business problem can feel like more of a form of art than science at times.
In this document, I’ve presented some algorithms and steps to do binary classification, which is just the tip of the iceberg. I am sure there are algorithms and steps missing – I hope that this helps in your understanding.
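To tie the steps above together, here is a minimal end-to-end sketch in R on a made-up data frame with a binary outcome; the 70/30 split, the logistic regression, and the 0.5 cutoff are illustrative choices rather than the only reasonable ones.

set.seed(42)

# Hypothetical data: a binary outcome with two numeric predictors
n <- 1000
my_data <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
my_data$outcome <- rbinom(n, 1, plogis(0.5 * my_data$x1 - 1 * my_data$x2))

# Split into roughly 70% training and 30% testing data
train_rows <- sample(seq_len(n), size = round(0.7 * n))
train <- my_data[train_rows, ]
test  <- my_data[-train_rows, ]

# Build a logistic regression model on the training data
model <- glm(outcome ~ x1 + x2, data = train, family = binomial)

# Predict probabilities on the test set and classify with a 0.5 cutoff
probs <- predict(model, newdata = test, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)

# Confusion matrix and the metrics discussed above
confusion <- table(predicted = preds, actual = test$outcome)
confusion

TP <- confusion["1", "1"]; TN <- confusion["0", "0"]
FP <- confusion["1", "0"]; FN <- confusion["0", "1"]

sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate
accuracy    <- (TP + TN) / sum(confusion)
c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)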

Machine Learning Demystified

The differences and applications of Supervised and Unsupervised Machine Learning. Introduction Machine learning is one of the buzziest terms thrown around in technology these days. Combine machine learning with big data in a Google search and you’ve got yourself an unmanageable amount of information to digest. In a (possibly ironic) effort to help navigate this sea of information, this post is meant to be an introduction and simplification of some common machine learning terminology and types, with some resources to dive deeper. Supervised vs. Unsupervised Machine Learning At the highest level, there are two different types of machine learning - supervised and unsupervised. Supervised means that we have historical information in order to learn from and make future decisions; unsupervised means that we have no previous information, but might be attempting to group things together or do some other type of pattern or outlier recognition. In each of these subsets there are many methodologies and motivations; I’ll explain how they work and give a simple example or two. Supervised Machine Learning Supervised machine learning is nothing more than using historical information (read: data) in order to predict a future event or explain a behavior using algorithms. I know - this is vague - but humans use these algorithms based on previous learning every day in their lives to predict things. A very simple example: if it is sunny outside when we wake up, it is perfectly reasonable to assume that it will not rain that day. Why do we make this prediction? Because over time, we’ve learned that on sunny days it typically does not rain. We don’t know for sure that today it won’t rain, but we’re willing to make decisions based on our prediction that it won’t rain. Computers do this exact same thing in order to make predictions. The real gains come from Supervised Machine Learning when you have lots of accurate historical data. In the example above, we can’t be 100% sure that it won’t rain because we’ve also woken up on a few sunny mornings in which we’ve driven home after work in a monsoon - adding more and more data for your supervised machine learning algorithm to learn from also allows it to make concessions for these other possible outcomes. Supervised Machine Learning can be used to classify (usually binary or yes/no outcomes but can be broader - is a person going to default on their loan? will they get divorced?) or predict a value (how much money will you make next year? what will the stock price be tomorrow?). Some popular supervised machine learning methods are regression (linear, which can predict a continuous value, or logistic, which can predict a binary value), decision trees, k-nearest neighbors, and naive Bayes. My favorite of these methods is decision trees. A decision tree is used to classify your data. Once the data is classified, the average is taken of each terminal node; this value is then applied to any future data that fits this classification. The decision tree above shows that if you were a female and in first or second class, there was a high likelihood you survived. If you were a male in second class who was younger than 12 years old, you also had a high likelihood of surviving. This tree could be used to predict the potential outcomes of future sinking ships (morbid… I know). Unsupervised Machine Learning Unsupervised machine learning is the other side of this coin. In this case, we do not necessarily want to make a prediction.
Instead, this type of machine learning is used to find similarities and patterns in the information to cluster or group. An example of this: Consider a situation where you are looking at a group of people and you want to group similar people together. You don’t know anything about these people other than what you can see in their physical appearance. You might end up grouping the tallest people together and the shortest people together. You could do this same thing by weight instead… or hair length… or eye color… or use all of these attributes at the same time! It’s natural in this example to see how “close” people are to one another based on different attributes. What these types of algorithms do is evaluate the “distance” of one piece of information from another piece. In a machine learning setting you look for similarities and “closeness” in the data and group accordingly. This could allow the administrators of a mobile application to see the different types of users of their app in order to treat each group with different rules and policies. They could cluster samples of users together and analyze each cluster to see if there are opportunities for targeted improvements. The most popular of these unsupervised machine learning methods is called k-means clustering. In k-means clustering, the goal is to partition your data into k clusters (where k is how many clusters you want - 1, 2,…, 10, etc.). To begin this algorithm, k means (or cluster centers) are randomly chosen. Each data point in the sample is clustered to the closest mean; the center (or centroid, to use the technical term) of each cluster is calculated and that becomes the new mean. This process is repeated until the mean of each cluster is optimized. The important part to note is that the output of k-means is clustered data that is “learned” without any input from a human. Similar methods are used in Natural Language Processing (NLP) in order to do Topic Modeling. Resources to Learn More There are countless resources out there to dive deeper into this topic. Here are a few that I’ve used or found along my Data Science journey. UPDATE: I’ve written a whole post on this. You can find it here. O’Reilly has a ton of great books that focus on various areas of machine learning. edX and coursera have a TON of self-paced and instructor-led learning courses in machine learning. There is a specific series of courses offered by Columbia University that look particularly applicable. If you are interested in learning machine learning and already have a familiarity with R and Statistics, DataCamp has a nice, free program. If you are new to R, they have a free program for that, too. There are also many, many blogs out there to read about how people are using data science and machine learning.
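To make the k-means description above concrete, here is a minimal sketch using base R's kmeans function on the numeric columns of the built-in iris data; the choice of k = 3 (and ignoring the species labels) is purely for illustration.

set.seed(7)

# Cluster the four numeric measurements of iris into k = 3 groups
# (the Species labels are ignored - this is unsupervised)
measurements <- iris[, 1:4]
clusters <- kmeans(measurements, centers = 3, nstart = 25)

# Cluster assignment for each observation and the learned cluster centers
head(clusters$cluster)
clusters$centers

# Purely as a sanity check, compare the learned clusters to the hidden labels
table(cluster = clusters$cluster, species = iris$Species)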

Exploring Open Data - Predicting the Amount of Violations

Introduction In my last post, I went over some of the highlights of the open data set of all Philadelphia Parking Violations. In this post, I’ll go through the steps to build a model to predict the amount of violations the city issues on a daily basis. I’ll walk you through cleaning and building the data set, selecting and creating the important features, and building predictive models using Random Forests and Linear Regression. Step 1: Load Packages and Data Just an initial step to get the right libraries and data loaded in R. library(plyr) library(randomForest) ## DATA FILE FROM OPENDATAPHILLY ptix <- read.csv("Parking_Violations.csv") ## READ IN THE WEATHER DATA (FROM NCDC) weather_data <- read.csv("weather_data.csv") ## LIST OF ALL FEDERAL HOLIDAYS DURING THE ## RANGE OF THE DATA SET holidays <- as.Date(c("2012-01-02", "2012-01-16", "2012-02-20", "2012-05-28", "2012-07-04", "2012-09-03", "2012-10-08", "2012-11-12", "2012-11-22", "2012-12-25", "2013-01-01", "2013-01-21", "2013-02-18", "2013-05-27", "2013-07-04", "2013-09-02", "2013-10-14", "2013-11-11", "2013-11-28", "2013-12-25", "2014-01-01", "2014-01-20", "2014-02-17", "2014-05-26", "2014-07-04", "2014-09-01", "2014-10-13", "2014-11-11", "2014-11-27", "2014-12-25", "2015-01-01", "2015-01-09", "2015-02-16", "2015-05-25", "2015-07-03", "2015-09-07")) Step 2: Formatting the Data First things first, we have to total the amount of tickets per day from the raw data. For this, I use the plyr command ddply. Before I can use the ddply command, I need to format the Issue.Date.and.Time column to be a Date variable in the R context. days <- as.data.frame(as.Date( ptix$Issue.Date.and.Time, format = "%m/%d/%Y")) names(days) <- "DATE" count_by_day <- ddply(days, .(DATE), summarize, count = length(DATE)) Next, I do the same exact date formatting with the weather data. weather_data$DATE <- as.Date(as.POSIXct(strptime(as.character(weather_data$DATE), format = "%Y%m%d")), format = "%m/%d/%Y") Now that both the ticket and weather data have the same date format (and name), we can use the join function from the plyr package. count_by_day <- join(count_by_day, weather_data, by = "DATE") With the data joined by date, it is time to clean. There are a number of columns with unneeded data (weather station name, for example) and others with little or no data in them, which I just flatly remove. The data has also been coded with negative values representing that data had not been collected for any number of reasons (I’m not surprised that snow was not measured in the summer); for that data, I’ve made any values coded -9999 into 0. There are some days where the maximum or minimum temperature was not gathered (I’m not sure why). As this is the main variable I plan to use to predict daily violations, I drop the entire row if the temperature data is missing. 
## I DON'T CARE ABOUT THE STATION OR ITS NAME - ## GETTING RID OF IT count_by_day$STATION <- NULL count_by_day$STATION_NAME <- NULL ## A BUNCH OF VARIABLES ARE CODED WITH NEGATIVE VALUES ## IF THEY WEREN'T COLLECTED - CHANGING THEM TO 0s count_by_day$MDPR[count_by_day$MDPR < 0] <- 0 count_by_day$DAPR[count_by_day$DAPR < 0] <- 0 count_by_day$PRCP[count_by_day$PRCP < 0] <- 0 count_by_day$SNWD[count_by_day$SNWD < 0] <- 0 count_by_day$SNOW[count_by_day$SNOW < 0] <- 0 count_by_day$WT01[count_by_day$WT01 < 0] <- 0 count_by_day$WT03[count_by_day$WT03 < 0] <- 0 count_by_day$WT04[count_by_day$WT04 < 0] <- 0 ## REMOVING ANY ROWS WITH MISSING TEMP DATA count_by_day <- count_by_day[ count_by_day$TMAX > 0, ] count_by_day <- count_by_day[ count_by_day$TMIN > 0, ] ## GETTING RID OF SOME NA VALUES THAT POPPED UP count_by_day <- count_by_day[!is.na( count_by_day$TMAX), ] ## REMOVING COLUMNS THAT HAVE LITTLE OR NO DATA ## IN THEM (ALL 0s) count_by_day$TOBS <- NULL count_by_day$WT01 <- NULL count_by_day$WT04 <- NULL count_by_day$WT03 <- NULL ## CHANGING THE DATA, UNNECESSARILY, FROM 10ths OF ## DEGREES CELSIUS TO JUST DEGREES CELSIUS count_by_day$TMAX <- count_by_day$TMAX / 10 count_by_day$TMIN <- count_by_day$TMIN / 10 Step 3: Visualizing the Data At this point, we have joined our data sets and gotten rid of the unhelpful “stuff.” What does the data look like? Daily Violation Counts There are clearly two populations here. With the benefit of hindsight, the small population on the left of the histogram is mainly Sundays. The larger population with the majority of the data is all other days of the week. Let’s make some new features to explore this idea. Step 4: New Feature Creation As we see in the histogram above, there are obviously a few populations in the data - I know that day of the week, holidays, and month of the year likely have some strong influence on how many violations are issued. If you think about it, most parking signs include the clause: “Except Sundays and Holidays.” Plus, having spent more than a few summers in Philadelphia at this point, I know that from Memorial Day until Labor Day the city relocates to the South Jersey Shore (emphasis on the South part of the Jersey Shore). That said - I add in those features as predictors. ## FEATURE CREATION - ADDING IN THE DAY OF WEEK count_by_day$DOW <- as.factor(weekdays(count_by_day$DATE)) ## FEATURE CREATION - ADDING IN IF THE DAY WAS A HOLIDAY count_by_day$HOL <- 0 count_by_day$HOL[as.character(count_by_day$DATE) %in% as.character(holidays)] <- 1 count_by_day$HOL <- as.factor(count_by_day$HOL) ## FEATURE CREATION - ADDING IN THE MONTH count_by_day$MON <- as.factor(months(count_by_day$DATE)) Now - let’s see if the Sunday thing is real. Here is a scatterplot of the data. The circles represent Sundays; triangles are all other days of the week. Temperature vs. Ticket Counts You can clearly see that Sundays tend to do their own thing, but in a very consistent manner that parallels the rest of the week. In other words, the slope for Sundays is very close to the slope for all other days of the week. There are some points that don’t follow those trends, which are likely due to snow, holidays, and/or other man-made or weather events. Let’s split the data into a training and test set (that way we can see how well we do with the model). I’m arbitrarily making the test set the last year of data; everything before that is the training set.
train <- count_by_day[count_by_day$DATE < "2014-08-01", ] test <- count_by_day[count_by_day$DATE >= "2014-08-01", ] Step 5: Feature Identification We now have a data set that is ready for some model building! The problem to solve next is figuring out which features best explain the count of violations issued each day. My preference is to use Random Forests to tell me which features are the most important. We’ll also take a look to see which, if any, variables are highly correlated. High correlation amongst input variables will lead to high variability due to multicollinearity issues. featForest <- randomForest(count ~ MDPR + DAPR + PRCP + SNWD + SNOW + TMAX + TMIN + DOW + HOL + MON, data = train, importance = TRUE, ntree = 10000) ## PLOT THE VARIABLE TO SEE THE IMPORTANCE varImpPlot(featForest) In the Variable Importance Plot below, you can see very clearly that the day of the week (DOW) is by far the most important variable in describing the amount of violations written per day. This is followed by whether or not the day was a holiday (HOL), the minimum temperature (TMIN), and the month (MON). The maximum temperature is in there, too, but I think that it is likely highly correlated with the minimum temperature (we’ll see that next). The rest of the variables have very little impact. Variable Importance Plot cor(count_by_day[,c(3:9)]) I’ll skip the entire output of the correlation table, but TMIN and TMAX have a correlation coefficient of 0.940379171. Because TMIN has a higher variable importance and there is a high correlation between the TMIN and TMAX, I’ll leave TMAX out of the model. Step 6: Building the Models The goal here was to build a multiple linear regression model - since I’ve already started down the path of Random Forests, I’ll do one of those, too, and compare the two. To build the models, we do the following: ## BUILD ANOTHER FOREST USING THE IMPORTANT VARIABLES predForest <- randomForest(count ~ DOW + HOL + TMIN + MON, data = train, importance = TRUE, ntree = 10000) ## BUILD A LINEAR MODEL USING THE IMPORTANT VARIABLES linmod_with_mon <- lm(count ~ TMIN + DOW + HOL + MON, data = train) In looking at the summary, I have questions on whether or not the month variable (MON) is significant to the model or not. Many of the variables have rather high p-values. summary(linmod_with_mon) Call: lm(formula = count ~ TMIN + DOW + HOL + MON, data = train) Residuals: Min 1Q Median 3Q Max -4471.5 -132.1 49.6 258.2 2539.8 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5271.4002 89.5216 58.884 < 2e-16 *** TMIN -15.2174 5.6532 -2.692 0.007265 ** DOWMonday -619.5908 75.2208 -8.237 7.87e-16 *** DOWSaturday -788.8261 74.3178 -10.614 < 2e-16 *** DOWSunday -3583.6718 74.0854 -48.372 < 2e-16 *** DOWThursday 179.0975 74.5286 2.403 0.016501 * DOWTuesday -494.3059 73.7919 -6.699 4.14e-11 *** DOWWednesday -587.7153 74.0264 -7.939 7.45e-15 *** HOL1 -3275.6523 146.8750 -22.302 < 2e-16 *** MONAugust -99.8049 114.4150 -0.872 0.383321 MONDecember -390.2925 109.4594 -3.566 0.000386 *** MONFebruary -127.8091 112.0767 -1.140 0.254496 MONJanuary -73.0693 109.0627 -0.670 0.503081 MONJuly -346.7266 113.6137 -3.052 0.002355 ** MONJune -30.8752 101.6812 -0.304 0.761481 MONMarch -1.4980 94.8631 -0.016 0.987405 MONMay 0.1194 88.3915 0.001 0.998923 MONNovember 170.8023 97.6989 1.748 0.080831 . MONOctober 125.1124 92.3071 1.355 0.175702 MONSeptember 199.6884 101.9056 1.960 0.050420 . --- Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 544.2 on 748 degrees of freedom Multiple R-squared: 0.8445, Adjusted R-squared: 0.8405 F-statistic: 213.8 on 19 and 748 DF, p-value: < 2.2e-16 To verify this, I build the model without the MON term and then do an F-Test to compare using the results of the ANOVA tables below. ## FIRST ANOVA TABLE (WITH THE MON TERM) anova(linmod_with_mon) Analysis of Variance Table Response: count Df Sum Sq Mean Sq F value Pr(>F) TMIN 1 16109057 16109057 54.3844 4.383e-13 *** DOW 6 1019164305 169860717 573.4523 < 2.2e-16 *** HOL 1 147553631 147553631 498.1432 < 2.2e-16 *** MON 11 20322464 1847497 6.2372 6.883e-10 *** Residuals 748 221563026 296207 ## SECOND ANOVA TABLE (WITHOUT THE MON TERM) anova(linmod_wo_mon) Analysis of Variance Table Response: count Df Sum Sq Mean Sq F value Pr(>F) TMIN 1 16109057 16109057 50.548 2.688e-12 *** DOW 6 1019164305 169860717 532.997 < 2.2e-16 *** HOL 1 147553631 147553631 463.001 < 2.2e-16 *** Residuals 759 241885490 318690 ## Ho: B9 = B10 = B11 = B12 = B13 = B14 = B15 = B16 = ## B17 = B18 = B19 = 0 ## Ha: At least one is not equal to 0 ## F-Stat = MSdrop / MSE = ## ((SSR1 - SSR2) / (DF(R)1 - DF(R)2)) / MSE f_stat <- ((241885490 - 221563026) / (759 - 748)) / 296207 ## P_VALUE OF THE F_STAT CALCULATED ABOVE p_value <- 1 - pf(f_stat, 11, 748) Since the P-Value 6.8829e-10 is MUCH MUCH less than 0.05, I can reject the null hypothesis and conclude that at least one of the parameters associated with the MON term is not zero. Because of this, I’ll keep the term in the model. Step 7: Apply the Models to the Test Data Below I call the predict function to see how the Random Forest and Linear Model predict the test data. I am rounding the predictions to the nearest integer. To determine which model performs better, I am calculating the sum of the absolute differences between the predicted values and the actual counts. ## PREDICT THE VALUES BASED ON THE MODELS test$RF <- round(predict(predForest, test), 0) test$LM <- round(predict.lm(linmod_with_mon, test), 0) ## SEE THE ABSOLUTE DIFFERENCE FROM THE ACTUAL difOfRF <- sum(abs(test$RF - test$count)) difOfLM <- sum(abs(test$LM - test$count)) Conclusion As it turns out, the Linear Model performs better than the Random Forest model. I am relatively pleased with the Linear Model - an R-Squared value of 0.8445 ain’t nothin’ to shake a stick at. You can see that Random Forests are very useful in identifying the important features. To me, a Random Forest tends to be a bit more of a “black box” in comparison to linear regression - I hesitate to use it at work for more than a feature-identification tool. Overall - a nice little experiment and a great dive into some open data. I now know that PPA rarely takes a day off, regardless of the weather. I’d love to know how much of the fines they write is actually collected. I may also dive into predicting what type of ticket you received based on your location, time of ticket, etc. All in another day’s work! Thanks for reading.

Open Data Day - DC Hackathon

For those of you who aren’t stirred from bed in the small hours to learn data science, you might have missed that March 5th was international open data day. There are hundreds of local events around the world; I was lucky enough to attend DC’s Open Data Day Hackathon. I met a bunch of great people doing noble things with data who taught me a crap-ton (scientific term) and also validated my love for data science and how much I’ve learned since beginning my journey almost two years ago. Here is a quick rundown of what I learned and some helpful links so that you can find out more, too. Being that it is an Open Data event, everything was well documented on the hackathon hackpad. Introduction to Open Data Eric Mill gave a really nice overview of what JSON is and how to use APIs to access it - and thus the data the website is conveying. Though many APIs are open and documented, many are not. Eric gave some tips on how to access that data, too. This session really opened my eyes to how to access that previously unusable data that was hidden in plain sight in the text of websites. Data Science Primer This was one of the highlights for me - A couple of NIST Data Scientists, Pri Oberoi and Star Ying, gave a presentation and walkthrough on how to use k-means clustering to identify groupings in your data. The data and jupyter notebook are available on github. I will definitely be using this in my journey to better detect and remediate compromised user accounts at Comcast. Hackathon I joined a group that was working to use data science to identify Opioid overuse. Though I didn’t add much (the group was filled with some really really smart people), I was able to visualize the data using R and share some of those techniques with the team. Intro to D3 Visualizations The last session and probably my favorite was a tutorial on building out a D3 Visualization. Chris Given walked a packed house through building a D3 viz step-by-step, giving some background on why things work the way they do and showing some great resources. I am particularly proud of the results (though I only followed his instructions to build this). Closing I also attended 2 sessions about using the command line that totally demystified the shell prompt. All in all, it was a great two days! I will definitely be back next year (unless I can convince someone to do one in Philly).

Identifying Compromised User Accounts with Logistic Regression

INTRODUCTION As a Data Analyst on Comcast’s Messaging Engineering team, it is my responsibility to report on the platform statuses, identify irregularities, measure the impact of changes, and identify policies to ensure that our system is used as it was intended. Part of the last responsibility is the identification and remediation of compromised user accounts. The challenge the company faces is being able to detect account compromises faster and remediate them closer to the moment of detection. This post will focus on the methodology and process for modeling the criteria to best detect compromised user accounts in near real-time from outbound email activity. For obvious reasons, I am only going to speak to the methodologies used; I’ll be vague when it comes to the actual criteria we used. DATA COLLECTION AND CLEANING Without getting into the finer details of email delivery, there are about 43 terminating actions an email can take when it is sent out of our platform. A message can be dropped for a number of reasons. These are things like the IP or user being on any number of block lists, triggering our spam filters, and other abusive behaviors. The other side of that is that the message will be delivered to its intended recipient. That said, I was able to create a usage profile for all of our outbound senders in small chunks of time in Splunk (our machine log collection tool of choice). This profile gives a summary per user of how often the messages they sent hit each of the terminating actions described above. In order to build a training set, I matched this usage data to our current compromised-account detection lists. I created a script in python that added an additional column to the data. If an account was flagged as compromised with our current criteria, it was given a one; if not, a zero. With the data collected, I am ready to determine the important inputs. DETERMINING INPUTS FOR THE MODEL In order to determine the important variables in the data, I created a Binary Regression Tree in R using the rpart library. The Binary Regression Tree iterates over the data and “splits” it in order to group the data to get compromised accounts together and non-compromised accounts together. It is also a nice way to visualize the data. You can see in the picture below what this looks like. Because the data is so large, I limited the data to one-day chunks. I then ran this regression tree against each day separately. From that, I was able to determine that there are 6 important variables (4 of which showed up in every regression tree I created; the other 2 showed up in a majority of trees). You can determine the “important” variables by looking in the summary for the number of splits per variable. BUILDING THE MODEL Now that I have the important variables, I created a python script to build the Logistic Regression Model from them. Using the statsmodels package, I was able to build the model. All of my input variables were highly significant. I took the logistic regression equation with the coefficients given in the model back to Splunk and tested this on incoming data to see what would come out. I quickly found that it caught many accounts that really were compromised. There were also some accounts being discovered that looked like brute force attacks that never got through - to adjust for that, I added a constraint to the model that the user must have done at least one terminating action that ensured they authenticated successfully (this rules out users coming from a ton of IPs, but failing authentication every time).
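To make those two steps a bit more concrete, here is a rough sketch in R. The usage-profile columns below are hypothetical placeholders (the real criteria are deliberately not disclosed in this post), and while the final model was fit in python with statsmodels, the glm() call at the end is just an equivalent R illustration of the same idea.

## TOY STAND-IN FOR THE USAGE PROFILE: ONE ROW PER SENDER PER TIME CHUNK,
## A FEW HYPOTHETICAL OUTBOUND-ACTIVITY COUNTS, AND THE 0/1 COMPROMISED FLAG
library(rpart)
set.seed(1)
usage_profile <- data.frame(sent_count      = rpois(1000, 20),
                            dropped_spam    = rpois(1000, 2),
                            blocklist_hits  = rpois(1000, 1),
                            delivered_count = rpois(1000, 15))
usage_profile$compromised <- rbinom(1000, 1,
    plogis(-4 + 0.8 * usage_profile$dropped_spam + 1.2 * usage_profile$blocklist_hits))

## STEP 1: A CLASSIFICATION TREE TO SURFACE THE VARIABLES THAT DRIVE THE SPLITS
tree_fit <- rpart(compromised ~ sent_count + dropped_spam + blocklist_hits + delivered_count,
                  data = usage_profile, method = "class")
summary(tree_fit)   ## count how often each variable shows up in a split

## STEP 2: THE FINAL MODEL IN THE POST WAS BUILT WITH PYTHON'S statsmodels;
## AN EQUIVALENT LOGISTIC REGRESSION IN R LOOKS LIKE THIS (ILLUSTRATION ONLY)
logit_fit <- glm(compromised ~ dropped_spam + blocklist_hits + sent_count,
                 data = usage_profile, family = binomial)
summary(logit_fit)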
With those important variables in hand, that is how the final Logistic Regression Model came together. CONCLUSION First and foremost, this writeup was intended to be a very high level summary explaining the steps I took to get my final model. What isn’t explained here is how many models I built that were less successful. Though this combination worked for me in the end, you’ll likely need to iterate over the process a number of times to get something successful. The new detection method for compromised accounts is an opportunity for us to expand our compromise detection and do it in a more real-time manner. This is also a foundation for future detection techniques for malicious IPs and other actors. With this new method, we will be able to expand the activity types for compromise detection outside of outbound email activity to things like preference changes, password resets, changes to forwarding address, and even application activity outside of the email platform.

data visualization

Jake on Strava - R Shiny App

I created a Shiny app that grabs my running, riding, and other exercise stats from Strava and creates some simple visualizations.

Twitter Analysis - R Shiny App

I created a Shiny app that searches Twitter and does some simple analysis.

Jake Learns Data Science Visitor Dashboard

A quick view of visitors to my website. Data pulled from Google Analytics and pushed to Amazon Redshift using Stitch Data.

Visualizing Exercise Data from Strava

INTRODUCTION My wife introduced me to cycling in 2014 - I fell in love with it and went all in. That first summer after buying my bike, I rode over 500 miles (more on that below). My neighbors at the time, also cyclists, introduced me to the app Strava. Ever since then, I’ve tracked all of my rides, runs, hikes, walks (perhaps not really exercise that needs to be tracked… but I hurt myself early in 2018 and that’s all I could do for a while), etc. Everything I could track, I tracked. I got curious and found a package, rStrava, with which I can download all of my activity. Once I had it, I put it into a few visualizations. ESTABLISH STRAVA AUTHENTICATION The first thing I had to do was set up a Strava account and application. I found some really nice instructions on another blog that helped walk me through this. After that, I installed rStrava and set up authentication (you only have to do this the first time). ## INSTALLING THE NECESSARY PACKAGES install.packages("devtools") devtools::install_github('fawda123/rStrava') ## LOAD THE LIBRARY library(rStrava) ## ESTABLISH THE APP CREDENTIALS name <- 'jakelearnsdatascience' client_id <- '31528' secret <- 'MY_SECRET_KEY' ## CREATE YOUR STRAVA TOKEN token <- httr::config(token = strava_oauth(name, client_id, secret, app_scope = "read_all", cache = TRUE)) ## cache = TRUE is optional - but it saves your token to the working directory GET MY EXERCISE DATA Now that authentication is set up, using the rStrava package to pull activity data is relatively straightforward. library(rStrava) ## LOAD THE TOKEN (AFTER THE FIRST TIME) stoken <- httr::config(token = readRDS(oauth_location)[[1]]) ## GET STRAVA DATA USING rStrava FUNCTION FOR MY ATHLETE ID my_act <- get_activity_list(stoken) This function returns a list of activities (class(my_act) is list). In my case, there are 379 activities. FORMATTING THE DATA To make the data easier to work with, I convert it to a data frame. There are many more fields than I’ve selected below - these are all I want for this post. info_df <- data.frame() for(act in 1:length(my_act)){ tmp <- my_act[[act]] tmp_df <- data.frame(name = tmp$name, type = tmp$type, distance = tmp$distance, moving_time = tmp$moving_time, elapsed_time = tmp$elapsed_time, start_date = tmp$start_date_local, total_elevation_gain = tmp$total_elevation_gain, trainer = tmp$trainer, manual = tmp$manual, average_speed = tmp$average_speed, max_speed = tmp$max_speed) info_df <- rbind(info_df, tmp_df) } I want to convert a few fields to units that make more sense for me (miles, feet, hours instead of meters and seconds). I’ve also created a number of features, though I’ve suppressed the code here. You can see all of the code on github. HOW FAR HAVE I GONE? Since August 8, 2014, I have - under my own power - traveled 1300.85 miles. There were a few periods without much action (a whole year from mid-2016 through late 2017), which is a bit sad. The last few months have been good, though. Here’s a similar view, but split by activity. I’ve been running recently. I haven’t really ridden my bike since the first 2 summers I had it. I rode the peloton when we first got it, but not since. I was a walker when I first tore the labrum in my hip in early 2018. Finally, here’s the same data again, but split up in a ridgeplot. SUMMARY There’s a TON of data that is returned by the Strava API. This blog just scratches the surface of analysis that is possible - mostly I am just introducing how to get the data and get up and running.
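The conversion code I suppressed is nothing exotic; a sketch of it might look like the following, assuming the info_df built above. Strava reports distances and elevation in meters, times in seconds, and speeds in meters per second; the new column names are my own.

## A SKETCH OF THE SUPPRESSED CONVERSION/FEATURE CODE (NEW COLUMN NAMES ARE MY OWN)
info_df$distance_miles <- info_df$distance * 0.000621371            ## meters to miles
info_df$elevation_feet <- info_df$total_elevation_gain * 3.28084    ## meters to feet
info_df$moving_hours   <- info_df$moving_time / 3600                ## seconds to hours
info_df$avg_mph        <- info_df$average_speed * 2.23694           ## meters/second to mph
info_df$start_date     <- as.Date(substr(as.character(info_df$start_date), 1, 10)) ## ISO datetime to Date
info_df$year           <- format(info_df$start_date, "%Y")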
As a new year’s resolution, I’ve committed to run 312 miles this year. That is 6 miles per week for 52 weeks (for those trying to wrap their head around the weird number). Now that I’ve been able to pull this data, I’ll have to set up a tracker/dashboard for that data. More to come!

Prime Number Patterns

I found a very thought provoking and beautiful visualization on the D3 Website regarding prime numbers. What the visualization shows is that if you draw periodic curves beginning at the origin for each positive integer, the prime numbers will be intersected by only two curves: the prime’s own curve and the curve for one. When I saw this, my mind was blown. How interesting… and also how obvious. The definition of a prime is that it can only be divided by itself and one (duh). This is a visualization of that fact. The patterns that emerge are stunning. I wanted to build the data and visualization for myself in R. While not as spectacular as the original I found, it was still a nice adventure. I used Plotly to visualize the data. The code can be found on github. Here is the visualization:
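The full code is on github (linked above), but the underlying data is simple enough to sketch: the curve for integer d passes through integer n exactly when d divides n, so counting the curves through each n reproduces the "only two curves touch a prime" observation. A minimal check in R:

## FOR EACH INTEGER n, COUNT HOW MANY OF THE PERIODIC CURVES (ONE PER POSITIVE
## INTEGER d) PASS THROUGH IT - I.E., HOW MANY d DIVIDE n EVENLY
n_max <- 50
curve_hits <- sapply(1:n_max, function(n) sum(n %% 1:n == 0))

## PRIMES ARE EXACTLY THE INTEGERS HIT BY ONLY TWO CURVES:
## THEIR OWN AND THE CURVE FOR ONE
data.frame(n = 1:n_max, curves = curve_hits)[curve_hits == 2, ]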

Exploring Open Data - Philadelphia Parking Violations

Introduction A few weeks ago, I stumbled across Dylan Purcell’s article on Philadelphia Parking Violations. This is a nice glimpse of the data, but I wanted to get a taste of it myself. I went and downloaded the entire data set of Parking Violations in Philadelphia from the OpenDataPhilly website and came up with a few questions after checking out the data: How many tickets in the data set? What is the range of dates in the data? Are there missing days/data? What was the biggest/smallest individual fine? What were those fines for? Who issued those fines? What was the average individual fine amount? What day had the most/least count of fines? What is the average amount per day? How much $ in fines did they write each day? What hour of the day are the most fines issued? What day of the week are the most fines issued? What state has been issued the most fines? Who (what individual) has been issued the most fines? How much does the individual with the most fines owe the city? How many people have been issued fines? What fines are issued the most/least? And finally to the cool stuff: Where were the most fines? Can I see them on a heat map? Can I predict the amount of parking tickets by weather data and other factors using linear regression? How about using Random Forests? Data Insights This data set has 5,624,084 tickets in it that span from January 1, 2012 through September 30, 2015 - an exact range of 1368.881 days. I was glad to find that there are no missing days in the data set. The biggest fine, $2000 (OUCH!), was issued (many times) by the police for “ATV on Public Property.” The smallest fine, $15, was also issued by the police for “parking over the time limit.” The average fine for a violation in Philadelphia over the time range was $46.33. The most violations occurred on November 30, 2012 when 6,040 were issued. The least issued, unsurprisingly, was on Christmas day, 2014, when only 90 were issued. On average, PPA and the other 9 agencies that issued tickets (more on that below) issued 4,105.17 tickets per day. All of those tickets add up to $190,193.50 in fines issued to the residents and visitors of Philadelphia every day!!! Digging a little deeper, I find that the most popular hour of the day for getting a ticket is 12 noon; 5AM nets the fewest tickets. Thursdays see the most tickets written (Thursdays and Fridays are higher than the rest of the week); Sundays see the least (pretty obvious). Another obvious insight is that PA-licensed drivers were issued the most tickets. Looking at individuals, there was one person who was issued 1,463 tickets (that’s more than 1 violation per day on average) for a whopping $36,471. In just looking at a few of their tickets, it seems like it is probably a delivery vehicle that delivers to Chinatown (tickets for “Stop Prohibited” and “Bus Only Zone” in the Chinatown area). I’d love to hear more about why this person has so many tickets and what you do about that… 1,976,559 people - let me reiterate - nearly 2 million unique vehicles have been issued fines over the three and three quarter years this data set encompasses. That’s so many!!! That is 2.85 tickets per vehicle, on average (of course that excludes all of the cars that were here and never ticketed). That makes me feel much better about how many tickets I got while I lived in the city. And… who are the agencies behind all of this? It is no surprise that PPA issues the most. There are 11 agencies in all. Seems like all of the policing agencies like to get in on the fun from time to time.
Issuing Agency - Count
PPA - 4,979,292
PHILADELPHIA POLICE - 611,348
CENTER CITY DISTRICT - 9,628
SEPTA - 9,342
UPENN POLICE - 6,366
TEMPLE POLICE - 4,055
HOUSING AUTHORITY - 2,137
PRISON CORRECTIONS OFFICER - 295
POST OFFICE - 121
FAIRMOUNT DISTRICT - 120
Mapping the Violations Where are you most likely to get a violation? Is there anywhere that is completely safe? Looking at the city as a whole, you can see that there are some places that are “hotter” than others. I played around in cartoDB to try to visualize this as well, but tableau seemed to do a decent enough job (though these are just screenshots). Zooming in, you can see that there are some distinct areas where tickets are given out in more quantity. Looking one level deeper, you can see that there are some areas like Center City, east Washington Avenue, Passyunk Ave, and Broad Street that seem to be very highly patrolled. Summary I created the above maps in Tableau. I used R to summarize the data. The R scripts, raw and processed data, and Tableau workbook can be found in my github repo. In the next post, I use weather data and other parameters to predict how many tickets will be written on a daily basis.
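Since R did the summarizing here, a rough sketch of the kinds of calls involved is below. The Issue.Date.and.Time column matches the code in the prediction post; the Fine column name is just a guess for illustration, not the actual field name in the export.

## A SKETCH OF THE DAILY SUMMARIES (Fine IS A HYPOTHETICAL COLUMN NAME)
ptix <- read.csv("Parking_Violations.csv")
ptix$day <- as.Date(ptix$Issue.Date.and.Time, format = "%m/%d/%Y")

nrow(ptix)                      ## total tickets in the data set
range(ptix$day)                 ## first and last dates in the data
mean(ptix$Fine, na.rm = TRUE)   ## average individual fine

per_day <- table(ptix$day)      ## tickets per day
per_day[which.max(per_day)]     ## busiest day
per_day[which.min(per_day)]     ## quietest day
mean(per_day)                   ## average tickets per day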

GAP RE-MINDER

A demonstration D3 project, shamelessly ripping off Gapminder.

Open Data Day - DC Hackathon


Using the Google Search API and Plotly to Locate Waterparks

I’ve got a buddy who manages and builds waterparks. I thought to myself… I am probably the only person in the world who has a friend that works at a waterpark - cool. Then I started thinking some more… there has to be more than just his waterpark in this country; I’ve been to at least a few… and the thinking continued… I wonder how many there are… and continued… and I wonder where they are… and, well, here we are at the culmination of that curiosity with this blog post. So - the first problem - how would I figure that out? As with most things I need answers to in this world, I turned to Google and asked: Where are the waterparks in the US? The answer appears to be: there are a lot. The data is there if I can get my hands on it. Knowing that Google has an API, I signed up for an API key and away I went! Until I was stopped abruptly with limits on how many results will be returned: a measly 20 per search. I know R and wanted to use that to hit the API. Using the httr package and a for loop, I conceded to doing the search once per state and living with a maximum of 20 results per state. Easy fix. Here’s the code to generate the search string and query Google: q1 <- paste("waterparks in ", list_of_states[j,1], sep = "") response <- GET("https://maps.googleapis.com/", path = "maps/api/place/textsearch/xml", query = list(query = q1, key = "YOUR_API_KEY")) The results come back in XML (or JSON, if you so choose… I went with XML for this, though) - something that I have not had much experience in. I used the XML package and a healthy amount of additional time in Google search-land and was able to parse the data into a data frame! Success! Here’s a snippet of the code to get this all done: result <- xmlParse(response) result1 <- xmlRoot(result) result2 <- getNodeSet(result1, "//result") data[counter, 1] <- xmlValue(result2[[i]][["name"]]) data[counter, 2] <- xmlValue(result2[[i]][["formatted_address"]]) data[counter, 3] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lat"]]) data[counter, 4] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lng"]]) data[counter, 5] <- xmlValue(result2[[i]][["rating"]]) Now that the data is gathered and in the right shape - what is the best way to present it? I’ve recently read about a package in R named plotly. They have many interesting and interactive visualizations, plus the API plugs right into R. I found a nice example of a map using the package. With just a few lines of code and a couple iterations, I was able to generate this (click on the picture to get the full interactivity): Waterparks in the USA This plot can be seen here, too. Not too shabby! There are a few things to mention here… For one, not every water park has a rating; I dealt with this by making the NAs into 0s. That’s probably not the nicest way of handling that. Also - this is only the top 20 waterparks as Google decided per state. There are likely some waterparks out there that are not represented here. There are also probably non-waterparks represented here that popped up in the results. For those of you who are interested in the data or script I used to generate this map, feel free to grab them at those links. Maybe one day I’ll come back to this to find out where there are the most waterparks per capita - or some other correlation to see what the best water park really is… this is just the tip of the iceberg. It feels good to scratch a few curiosity-driven itches in one project!
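The mapping code itself isn’t shown in the post, so here is a minimal plotly sketch along the lines described above; the names assigned to the numbered columns are my own assumption, not the original script’s.

## ASSIGN NAMES TO THE NUMBERED COLUMNS BUILT IN THE LOOP ABOVE (NAMES ARE MY OWN)
library(plotly)
names(data) <- c("name", "address", "lat", "lon", "rating")
data$lat <- as.numeric(data$lat)
data$lon <- as.numeric(data$lon)
data$rating <- as.numeric(data$rating)
data$rating[is.na(data$rating)] <- 0   ## the NA-to-0 handling mentioned in the post

## A BASIC SCATTERGEO MAP OF THE RESULTS
plot_geo(data, lat = ~lat, lon = ~lon) %>%
  add_markers(text = ~name, color = ~rating) %>%
  layout(title = "Waterparks in the USA",
         geo = list(scope = "usa"))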

getting started

Getting Started with Data Science

updated: November 2020 Everyone in the world has a “how to” guide to data science… well, maybe not everyone - but there are a lot of “guides” out there. I get this question infrequently, so I thought I would do my best to put together what have been my best resources for learning. MY STORY Personally, I learned statistics by getting my Masters in Applied Statistics at Villanova University - it took 2.5 years. I got my introduction to R by working through the Johns Hopkins University Data Science Specialization on Coursera. Similarly for python, I got an online introduction via DataCamp. This was all bolstered by working with these tools at work and in side projects. The repetition of working with these tools every day has made me more fluent. Here are some resources that I’ve used or know of - I’ve tried to outline them and group them to the best of my ability. There are many more out there, and you may find some better or worse depending on your style. LEARNING DATA Johns Hopkins University Data Science Specialization on Coursera: As mentioned above, this course gave me my start with R, RStudio, and git. Kaggle: If you are as competitive as I am, this site should get you going - the interactive kernels and social aspects of this site make it a great place to see other data science in action. Plagiarism is the greatest form of flattery (and the easiest way to learn - thanks, Stack Overflow). EdX - R Programming: I haven’t used EdX much, but there is a wealth of MOOCs here. LEARNING STATISTICS & OTHER IMPORTANT MATH Khan Academy - Statistics: I have used Khan Academy on multiple occasions for refreshers in Statistics and Linear Algebra. The classes are interactive, manageable, and self-paced. Khan Academy - Linear Algebra Coursera - Statistics with R EdX - Data Analytics & Statistics courses Of course - higher education, as well. DATA BOOKS Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy - Cathy O’Neil: Cathy O’Neil does a great job of outlining how data algorithms can have unintended negative consequences. Anyone who builds a machine learning algorithm should read. The Wall Street Journal Guide to Information Graphics: The Dos and Don’ts of Presenting Data, Facts, and Figures - Dona M. Wong: I have this book on my desk as a reference. Quick read filled with easy-to-understand rules and objectives for creating data visualizations. Analyzing data is hard - this book teaches tips to build clear and informative visualizations that don’t take away from the message. The Signal and the Noise: Why So Many Predictions Fail-but Some Don’t - Nate Silver: Nate Silver is [in]famous for predicting elections. This book gets into the details of how he does that. Super interesting for a guy increasingly interested in politics. How Not to Be Wrong: The Power of Mathematical Thinking - Jordan Ellenberg: Critical thinking is crucial in data science and analytics. This book gives some great tips on how to approach “facts” with the right mindset. Thinking, Fast and Slow - Daniel Kahneman: Currently on my list to read. PODCASTS Hidden Brain: NPR podcast covering many topics. I find it super interesting. While not distinctly data related, it frequently covers topics that have tangential importance to being a good data scientist. Exponential View: Not primarily focused on data, but it very frequently covers artificial intelligence and machine learning topics. I recommend the newsletter that goes along with this podcast (link below).
Not So Standard Deviations: Roger Peng and Hilary Parker host a podcast on all things data science. The Data Lab Podcast: Local [to Philly] data podcast interviewing local data scientists. I find it reassuring to hear that my habits are often in line with these people’s, plus I’ve picked up many really great tidbits (like the Exponential View newsletter). O’Reilly Data Show: I have attended the Strata data conference by O’Reilly. Much like the conference, this podcast covers many relevant data themes. Data Skeptic: Another data podcast that covers many good data topics. BLOGS & NEWSLETTERS Exponential View: Billed as a weekly “wondermissive”, the author Azeem Azhar covers many topics relevant to data and the greater technology economy. I truly look forward to getting this newsletter every Sunday morning. Farnam Street: A weekly newsletter (and blog) about decision making. I frequently find golden tips on how to think and frame thinking. Must read. Twitter: I follow many great data people on twitter and get a great deal of my data news there.

ggplot

Visualizing Exercise Data from Strava


google search

Using the Google Search API and Plotly to Locate Waterparks


hackathon

Open Data Day - DC Hackathon


learning

Getting Started with Data Science


Sierpinski Triangles (and Carpets) in R

Recently in class, I was asked the following question: Start with an equilateral triangle and a point chosen at random from the interior of that triangle. Label one vertex 1, 2, a second vertex 3, 4, and the last vertex 5, 6. Roll a die to pick a vertex. Place a dot at the point halfway between the roll-selected vertex and the point you chose. Now consider this new dot as a starting point to do this experiment once again. Roll the die to pick a new vertex. Place a dot at the point halfway between the last point and the most recent roll-selected vertex. Continue this procedure. What does the shape of the collection of dots look like? I thought, well - it’s got to be something cool or else the professor wouldn’t ask, but I can’t imagine it will be more than a cloud of dots. Truth be told, I went to a conference for work the week of this assignment and never did it - but when I went to the next class, IT WAS SOMETHING COOL! It turns out that this creates a Sierpinski Triangle - a fractal of increasingly smaller triangles. I wanted to check this out for myself, so I built an R script that creates the triangle. I ran it a few times with differing numbers of points. Here is one with 50,000 points. Though this post is written in RStudio, I’ve hidden the code for readability. Actual code for this can be found here. I thought - if equilateral triangles create patterns this cool, a square must be amazing! Well… it is; however, you can’t just run the same logic - it will return a cloud of random dots… After talking with my professor, Dr. Levitan - it turns out you can get something just as awesome as the Sierpinski triangle with a square; you just need to make a few changes (say this with a voice of authority and calm knowingness): Instead of 3 points to move to, you need 8 points: the 4 corners of a specified square and the midpoint of each side. Also, instead of moving to the midpoint between your current point and the chosen location, you move to the tripoint (dividing the distance by 3 instead of 2). This is called a Sierpinski Carpet - a fractal of squares (as opposed to a fractal of equilateral triangles in the graph above). You can see in both the triangle and square that the same pattern is repeated time and again in smaller and smaller increments. I updated my R script and voila - MORE BEAUTIFUL MATH! Check out the script and run the functions yourself! I only spent a little bit of time putting it together - I think it would be cool to add some other features, especially when it comes to the plotting of the points. Also - I’d like to run it for a million or more points… I just lacked the patience to wait out the script to run for that long (50,000 points took about 30 minutes to run - my script is probably not the most efficient). Anyways - really cool to see what happens in math sometimes - it’s hard to imagine at first that the triangle would look that way. Another reason math is cool!
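For anyone who wants to try the triangle before opening the full script, here is a bare-bones sketch of the chaos-game procedure described above. This is not the script linked in the post, just a minimal illustration: pick a vertex at random, move halfway toward it, and repeat.

## MINIMAL CHAOS-GAME SKETCH FOR THE SIERPINSKI TRIANGLE
sierpinski <- function(n_points = 50000) {
  vertices <- cbind(x = c(0, 1, 0.5), y = c(0, 0, sqrt(3) / 2))  ## equilateral triangle
  pts <- matrix(NA_real_, nrow = n_points, ncol = 2)
  current <- c(runif(1), runif(1) * sqrt(3) / 2)   ## a starting point near the triangle
  for (i in 1:n_points) {
    v <- vertices[sample(3, 1), ]                  ## "roll the die" to pick a vertex
    current <- (current + v) / 2                   ## move to the halfway point
    pts[i, ] <- current
  }
  plot(pts, pch = ".", asp = 1, xlab = "", ylab = "")
}

sierpinski()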

Understanding User Agents

INTRODUCTION I have had a few discussions around web user agents at work recently. It turns out that they are not straightforward at all. In other words, trying to report browser usage to our Business Unit required a nontrivial translation. The more I dug in, the more I learned. I had some challenges finding the information, so I thought it would be useful to document my findings and centralize the sites I used to figure all this out. Just a quick background: Our web application, for a multitude of reasons, sends Internet Explorer users into a kind of compatibility mode in which it appears the browser is another version of IE (frequently 7, which no one uses anymore). In addition to this, in some of the application logs, there are user agents that appear with the prefix from the app followed by the browser as it understands it - also frequently IE7. For other browsers - it could be Google Chrome (GC43; 43 is the browser version) or Mozilla Firefox (FF38; same deal here with the version number) - it does the same thing, though those browsers do not default to a compatibility mode in the same way. This is only the beginning of the confusion that is a web user agent string. While there isn’t much I can do about the application logs doing their own user agent translations (we’ll need to make some changes to the system logging), I can decipher the user agent strings from the places in the app that report them raw. These are the strings that begin with Mozilla (more on that below). Let’s walk through them. THE USER AGENT STRING It can look like many different things. Here are some examples: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36 Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/600.5.17 (KHTML, like Gecko) Version/8.0.5 Safari/600.5.17 Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E) Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MAEM; .NET4.0C; InfoPath.1) Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.3; .NET4.0C; yie8) Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0) As you can see - they all have different components and parts to them. Some seem to be very straightforward at first glance (keyword: seem) and others are totally baffling. TRANSLATING THE USER AGENT STRING Much of my understanding of these user agent strings came from plugging the user agent strings into this page and a fair amount of Googling. Let’s pull apart the first user agent string from above: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36 Mozilla/5.0 “MozillaProductSlice. Claims to be a Mozilla based user agent, which is only true for Gecko browsers like Firefox and Netscape. For all other user agents it means ‘Mozilla-compatible’. In modern browsers, this is only used for historical reasons. It has no real meaning anymore” TRANSLATION We don’t care about this field in any of the user agent strings.
It’s good to know that it starts the web user agent strings, but that’s about it. (Windows NT 6.1;; WOW64) Operating System = Windows 7 TRANSLATION This is at least in the right ball park, but still not exactly straightforward. Why can’t it just be Windows 7 for Windows 7? AppleWebKit/537.36 “The Web Kit provides a set of core classes to display web content in windows” TRANSLATION I don’t even know… don’t care. (KHTML, “Open Source HTML layout engine developed by the KDE project” TRANSLATION Still don’t know or care. like Gecko) “like Gecko…” TRANSLATION What? Yep. Don’t care - makes no sense. Chrome/43.0.2357.81 This is the browser and it’s version TRANSLATION Google Chrome v. 43. YES! ONE THAT MAKES SENSE AND HAS INFO WE WANT! Safari/537.36 “Based on Safari” TRANSLATION Um… ok? So this isn’t actually Apple Safari? NOPE! It’s Chrome, which makes pulling Safari quite the challenge. I’ll spell that out in more detail in outlining the if statement below. Out of that whole thing, we have several things that aren’t important and several things that look like they could be another thing, but aren’t. So… Long story short - all of that info boils down to the user coming to our site using Google Crome 43 from a Window’s 7 machine. THE INTERNET EXPLORER USER AGENT Confused yet? Hold on to your butts. The Internet Explorer User Agent String is the level 2 version of the previous string. Let’s look at: Mozilla/5.0 (compatible;; MSIE 9.0;; Windows NT 6.1;; WOW64;; Trident/7.0;; SLCC2;; .NET CLR 2.0.50727;; .NET CLR 3.5.30729;; .NET CLR 3.0.30729;; Media Center PC 6.0;; .NET4.0C;; .NET4.0E) I found some light reading to explain some of what we are about to dive into. Most important from that page is this line: “When the F12 developer tools are used to change the browser mode of Internet Explorer, the version token of the user-agent string is modified to appear so that the browser appears to be an earlier version. This is done to allow browser specific content to be served to Internet Explorer and is usually necessary only when websites have not been updated to reflect current versions of the browser. When this happens, a Trident token is added to the user-agent string. This token includes a version number that enables you to identify the version of the browser, regardless of the current browser mode.” TRANSLATION Though the browser version above looks like MSIE 9.0 (that’s clearly what the string says), the Trident version identifies the browser as actually Internet Explorer 11. I am 90% sure that our site has many many many many many customizations done to deal specifically with Internet Explorer funny business. This is why the browser appears many times as MSIE 7.0 (Like this example which is actually IE 11, too: Mozilla/4.0 (compatible;; MSIE 7.0;; Windows NT 6.1;; WOW64;; Trident/7.0;; SLCC2;; .NET CLR 2.0.50727;; .NET CLR 3.5.30729;; .NET CLR 3.0.30729;; Media Center PC 6.0;; MAEM;; .NET4.0C;; InfoPath.1)) If you’d like additional information on Trident, it can be found here. Just to summarize: For those user agent strings from Internet Explorer, the important detail is that Trident bit for determining what browser they came from. PUTTING THE PIECES TOGETHER Ok ok ok… now we at least can read the string - maybe there are a bunch of questions about a lot of this, but we can pull the browser version at this point. After pulling all of this information together and getting a general understanding of it, I read this brief history of user agent strings. 
Now I understand why they are the way they are - though I still think it’s stupid. DECIPHERING USER AGENTS If you, like me, need to translate these user strings into something that normal people can understand - use this table for reference. We use Splunk to do our web scraping and analysis. By using the “BIT THAT MATTERS,” I was able to build a case statement to translate the User Agent Strings into human readable analysis. BROWSER USER AGENT STRING EXAMPLE BIT THAT MATTERS Internet Explorer 11 Mozilla/5.0 (compatible;; MSIE 9.0;; Windows NT 6.1;; WOW64;; Trident/7.0;; SLCC2;; .NET CLR 2.0.50727;; .NET CLR 3.5.30729;; .NET CLR 3.0.30729;; Media Center PC 6.0;; .NET4.0C;; .NET4.0E) Trident/7.0 Internet Explorer 10 Mozilla/4.0 (compatible;; MSIE 7.0;; Windows NT 6.2;; WOW64;; Trident/6.0;; .NET4.0E;; .NET4.0C;; .NET CLR 3.5.30729;; .NET CLR 2.0.50727;; .NET CLR 3.0.30729;; MDDCJS) Trident/6.0 Internet Explorer 9 Mozilla/5.0 (compatible;; MSIE 9.0;; Windows NT 6.0;; Trident/5.0;; BOIE9;;ENUSMSCOM) Trident/5.0 Mozilla Firefox 4X.x Mozilla/5.0 (Windows NT 6.1;; Win64;; x64;; rv:40.0) Gecko/20100101 Firefox/40.0 Firefox/4 Mozilla Firefox 3X.x Mozilla/5.0 (Windows NT 6.1;; rv:38.0) Gecko/20100101 Firefox/38.0 Firefox/3 Google Chrome 4X.x Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36 Google Chrome 3X.x Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36 Chrome/3 Apple Safari 8.x Mozilla/5.0 (Macintosh;; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 Version/8 Apple Safari 7.x Mozilla/5.0 (Macintosh;; Intel Mac OS X 10_9_5) AppleWebKit/600.3.18 (KHTML, like Gecko) Version/7.1.3 Safari/537.85.12 Version/7 Apple Safari 6.x Mozilla/5.0 (Macintosh;; Intel Mac OS X 10_7_5) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/6.1.6 Safari/537.78.2 Version/6
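Since the table above is basically a lookup from "BIT THAT MATTERS" to browser, the same translation the Splunk case statement does can be sketched in a few lines of R. Splunk is what we actually use; the classify_browser function and the R version here are just for illustration.

classify_browser <- function(ua) {
  ## patterns from the "bit that matters" column, lowest to highest priority;
  ## later matches overwrite earlier ones, so a Trident token (IE compatibility
  ## mode) beats whatever MSIE version the string claims, and the Safari rows
  ## key on "Version/x" because Chrome strings also contain the word Safari
  patterns <- c("Version/6"   = "Apple Safari 6.x",
                "Version/7"   = "Apple Safari 7.x",
                "Version/8"   = "Apple Safari 8.x",
                "Firefox/3"   = "Mozilla Firefox 3X.x",
                "Firefox/4"   = "Mozilla Firefox 4X.x",
                "Chrome/3"    = "Google Chrome 3X.x",
                "Chrome/4"    = "Google Chrome 4X.x",
                "Trident/5.0" = "Internet Explorer 9",
                "Trident/6.0" = "Internet Explorer 10",
                "Trident/7.0" = "Internet Explorer 11")
  out <- rep("Other", length(ua))
  for (p in names(patterns)) {
    out[grepl(p, ua, fixed = TRUE)] <- patterns[[p]]
  }
  out
}

classify_browser("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36")
## "Google Chrome 4X.x"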

linear regression

Exploring Open Data - Predicting the Amount of Violations

Introduction In my last post, I went over some of the highlights of the open data set of all Philadelphia Parking Violations. In this post, I’ll go through the steps to build a model to predict the amount of violations the city issues on a daily basis. I’ll walk you through cleaning and building the data set, selecting and creating the important features, and building predictive models using Random Forests and Linear Regression. Step 1: Load Packages and Data Just an initial step to get the right libraries and data loaded in R. library(plyr) library(randomForest) ## DATA FILE FROM OPENDATAPHILLY ptix <- read.csv("Parking_Violations.csv") ## READ IN THE WEATHER DATA (FROM NCDC) weather_data <- read.csv("weather_data.csv") ## LIST OF ALL FEDERAL HOLIDAYS DURING THE ## RANGE OF THE DATA SET holidays <- as.Date(c("2012-01-02", "2012-01-16", "2012-02-20", "2012-05-28", "2012-07-04", "2012-09-03", "2012-10-08", "2012-11-12", "2012-11-22", "2012-12-25", "2013-01-01", "2013-01-21", "2013-02-18", "2013-05-27", "2013-07-04", "2013-09-02", "2013-10-14", "2013-11-11", "2013-11-28", "2013-12-25", "2014-01-01", "2014-01-20", "2014-02-17", "2014-05-26", "2014-07-04", "2014-09-01", "2014-10-13", "2014-11-11", "2014-11-27", "2014-12-25", "2015-01-01", "2015-01-09", "2015-02-16", "2015-05-25", "2015-07-03", "2015-09-07")) Step 2: Formatting the Data First things first, we have to total the amount of tickets per day from the raw data. For this, I use the plyr command ddply. Before I can use the ddply command, I need to format the Issue.Date.and.Time column to be a Date variable in the R context. days <- as.data.frame(as.Date( ptix$Issue.Date.and.Time, format = "%m/%d/%Y")) names(days) <- "DATE" count_by_day <- ddply(days, .(DATE), summarize, count = length(DATE)) Next, I do the same exact date formatting with the weather data. weather_data$DATE <- as.Date(as.POSIXct(strptime(as.character(weather_data$DATE), format = "%Y%m%d")), format = "%m/%d/%Y") Now that both the ticket and weather data have the same date format (and name), we can use the join function from the plyr package. count_by_day <- join(count_by_day, weather_data, by = "DATE") With the data joined by date, it is time to clean. There are a number of columns with unneeded data (weather station name, for example) and others with little or no data in them, which I just flatly remove. The data has also been coded with negative values representing that data had not been collected for any number of reasons (I’m not surprised that snow was not measured in the summer); for that data, I’ve made any values coded -9999 into 0. There are some days where the maximum or minimum temperature was not gathered (I’m not sure why). As this is the main variable I plan to use to predict daily violations, I drop the entire row if the temperature data is missing. 
## I DON'T CARE ABOUT THE STATION OR ITS NAME - ## GETTING RID OF IT count_by_day$STATION <- NULL count_by_day$STATION_NAME <- NULL ## A BUNCH OF VARIABLE ARE CODED WITH NEGATIVE VALUES ## IF THEY WEREN'T COLLECTED - CHANGING THEM TO 0s count_by_day$MDPR[count_by_day$MDPR < 0] <- 0 count_by_day$DAPR[count_by_day$DAPR < 0] <- 0 count_by_day$PRCP[count_by_day$PRCP < 0] <- 0 count_by_day$SNWD[count_by_day$SNWD < 0] <- 0 count_by_day$SNOW[count_by_day$SNOW < 0] <- 0 count_by_day$WT01[count_by_day$WT01 < 0] <- 0 count_by_day$WT03[count_by_day$WT03 < 0] <- 0 count_by_day$WT04[count_by_day$WT04 < 0] <- 0 ## REMOVING ANY ROWS WITH MISSING TEMP DATA count_by_day <- count_by_day[ count_by_day$TMAX > 0, ] count_by_day <- count_by_day[ count_by_day$TMIN > 0, ] ## GETTING RID OF SOME NA VALUES THAT POPPED UP count_by_day <- count_by_day[!is.na( count_by_day$TMAX), ] ## REMOVING COLUMNS THAT HAVE LITTLE OR NO DATA ## IN THEM (ALL 0s) count_by_day$TOBS <- NULL count_by_day$WT01 <- NULL count_by_day$WT04 <- NULL count_by_day$WT03 <- NULL ## CHANGING THE DATA, UNNECESSARILY, FROM 10ths OF ## DEGREES CELCIUS TO JUST DEGREES CELCIUS count_by_day$TMAX <- count_by_day$TMAX / 10 count_by_day$TMIN <- count_by_day$TMIN / 10 Step 3: Visualizing the Data At this point, we have joined our data sets and gotten rid of the unhelpful “stuff.” What does the data look like? Daily Violation Counts There are clearly two populations here. With the benefit of hindsight, the small population on the left of the histogram is mainly Sundays. The larger population with the majority of the data is all other days of the week. Let’s make some new features to explore this idea. Step 4: New Feature Creation As we see in the histogram above, there are obviously a few populations in the data - I know that day of the week, holidays, and month of the year likely have some strong influence on how many violations are issued. If you think about it, most parking signs include the clause: “Except Sundays and Holidays.” Plus, spending more than a few summers in Philadelphia at this point, I know that from Memorial Day until Labor Day the city relocates to the South Jersey Shore (emphasis on the South part of the Jersey Shore). That said - I add in those features as predictors. ## FEATURE CREATION - ADDING IN THE DAY OF WEEK count_by_day$DOW <- as.factor(weekdays(count_by_day$DATE)) ## FEATURE CREATION - ADDING IN IF THE DAY WAS A HOLIDAY count_by_day$HOL <- 0 count_by_day$HOL[as.character(count_by_day$DATE) %in% as.character(holidays)] <- 1 count_by_day$HOL <- as.factor(count_by_day$HOL) ## FEATURE CREATION - ADDING IN THE MONTH count_by_day$MON <- as.factor(months(count_by_day$DATE)) Now - let’s see if the Sunday thing is real. Here is a scatterplot of the data. The circles represent Sundays; triangles are all other days of the week. Temperature vs. Ticket Counts You can clearly see that Sunday’s tend to do their own thing in a very consistent manner that is similar to the rest of the week. In other words, the slope for Sundays is very close to that of the slope for all other days of the week. There are some points that don’t follow those trends, which are likely due to snow, holidays, and/or other man-made or weather events. Let’s split the data into a training and test set (that way we can see how well we do with the model). I’m arbitrarily making the test set the last year of data; everything before that is the training set. 
train <- count_by_day[count_by_day$DATE < "2014-08-01", ] test <- count_by_day[count_by_day$DATE >= "2014-08-01", ] Step 5: Feature Identification We now have a data set that is ready for some model building! The problem to solve next is figuring out which features best explain the count of violations issued each day. My preference is to use Random Forests to tell me which features are the most important. We’ll also take a look to see which, if any, variables are highly correlated. High correlation amongst input variables will lead to high variability due to multicollinearity issues. featForest <- randomForest(count ~ MDPR + DAPR + PRCP + SNWD + SNOW + TMAX + TMIN + DOW + HOL + MON, data = train, importance = TRUE, ntree = 10000) ## PLOT THE VARIABLE TO SEE THE IMPORTANCE varImpPlot(featForest) In the Variable Importance Plot below, you can see very clearly that the day of the week (DOW) is by far the most important variable in describing the amount of violations written per day. This is followed by whether or not the day was a holiday (HOL), the minimum temperature (TMIN), and the month (MON). The maximum temperature is in there, too, but I think that it is likely highly correlated with the minimum temperature (we’ll see that next). The rest of the variables have very little impact. Variable Importance Plot cor(count_by_day[,c(3:9)]) I’ll skip the entire output of the correlation table, but TMIN and TMAX have a correlation coefficient of 0.940379171. Because TMIN has a higher variable importance and there is a high correlation between the TMIN and TMAX, I’ll leave TMAX out of the model. Step 6: Building the Models The goal here was to build a multiple linear regression model - since I’ve already started down the path of Random Forests, I’ll do one of those, too, and compare the two. To build the models, we do the following: ## BUILD ANOTHER FOREST USING THE IMPORTANT VARIABLES predForest <- randomForest(count ~ DOW + HOL + TMIN + MON, data = train, importance = TRUE, ntree = 10000) ## BUILD A LINEAR MODEL USING THE IMPORTANT VARIABLES linmod_with_mon <- lm(count ~ TMIN + DOW + HOL + MON, data = train) In looking at the summary, I have questions on whether or not the month variable (MON) is significant to the model or not. Many of the variables have rather high p-values. summary(linmod_with_mon) Call: lm(formula = count ~ TMIN + DOW + HOL + MON, data = train) Residuals: Min 1Q Median 3Q Max -4471.5 -132.1 49.6 258.2 2539.8 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5271.4002 89.5216 58.884 < 2e-16 *** TMIN -15.2174 5.6532 -2.692 0.007265 ** DOWMonday -619.5908 75.2208 -8.237 7.87e-16 *** DOWSaturday -788.8261 74.3178 -10.614 < 2e-16 *** DOWSunday -3583.6718 74.0854 -48.372 < 2e-16 *** DOWThursday 179.0975 74.5286 2.403 0.016501 * DOWTuesday -494.3059 73.7919 -6.699 4.14e-11 *** DOWWednesday -587.7153 74.0264 -7.939 7.45e-15 *** HOL1 -3275.6523 146.8750 -22.302 < 2e-16 *** MONAugust -99.8049 114.4150 -0.872 0.383321 MONDecember -390.2925 109.4594 -3.566 0.000386 *** MONFebruary -127.8091 112.0767 -1.140 0.254496 MONJanuary -73.0693 109.0627 -0.670 0.503081 MONJuly -346.7266 113.6137 -3.052 0.002355 ** MONJune -30.8752 101.6812 -0.304 0.761481 MONMarch -1.4980 94.8631 -0.016 0.987405 MONMay 0.1194 88.3915 0.001 0.998923 MONNovember 170.8023 97.6989 1.748 0.080831 . MONOctober 125.1124 92.3071 1.355 0.175702 MONSeptember 199.6884 101.9056 1.960 0.050420 . --- Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 544.2 on 748 degrees of freedom Multiple R-squared: 0.8445, Adjusted R-squared: 0.8405 F-statistic: 213.8 on 19 and 748 DF, p-value: < 2.2e-16 To verify this, I build the model without the MON term (linmod_wo_mon <- lm(count ~ TMIN + DOW + HOL, data = train)) and then do an F-Test to compare using the results of the ANOVA tables below. ## FIRST ANOVA TABLE (WITH THE MON TERM) anova(linmod_with_mon) Analysis of Variance Table Response: count Df Sum Sq Mean Sq F value Pr(>F) TMIN 1 16109057 16109057 54.3844 4.383e-13 *** DOW 6 1019164305 169860717 573.4523 < 2.2e-16 *** HOL 1 147553631 147553631 498.1432 < 2.2e-16 *** MON 11 20322464 1847497 6.2372 6.883e-10 *** Residuals 748 221563026 296207 ## SECOND ANOVA TABLE (WITHOUT THE MON TERM) anova(linmod_wo_mon) Analysis of Variance Table Response: count Df Sum Sq Mean Sq F value Pr(>F) TMIN 1 16109057 16109057 50.548 2.688e-12 *** DOW 6 1019164305 169860717 532.997 < 2.2e-16 *** HOL 1 147553631 147553631 463.001 < 2.2e-16 *** Residuals 759 241885490 318690 ## Ho: B9 = B10 = B11 = B12 = B13 = B14 = B15 = B16 = ## B17 = B18 = B19 = 0 ## Ha: At least one is not equal to 0 ## F-Stat = MSdrop / MSE = ## ((SSR1 - SSR2) / (DF(R)1 - DF(R)2)) / MSE f_stat <- ((241885490 - 221563026) / (759 - 748)) / 296207 ## P_VALUE OF THE F_STAT CALCULATED ABOVE p_value <- 1 - pf(f_stat, 11, 748) Since the P-Value 6.8829e-10 is MUCH MUCH less than 0.05, I can reject the null hypothesis and conclude that at least one of the parameters associated with the MON term is not zero. Because of this, I’ll keep the term in the model. (R can also run this comparison directly; see the short aside at the end of this post.) Step 7: Apply the Models to the Test Data Below I call the predict function to see how the Random Forest and Linear Model predict the test data. I am rounding the prediction to the nearest integer. To determine which model performs better, I am calculating the difference in absolute value of the predicted value from the actual count. ## PREDICT THE VALUES BASED ON THE MODELS test$RF <- round(predict(predForest, test), 0) test$LM <- round(predict.lm(linmod_with_mon, test), 0) ## SEE THE ABSOLUTE DIFFERENCE FROM THE ACTUAL difOfRF <- sum(abs(test$RF - test$count)) difOfLM <- sum(abs(test$LM - test$count)) Conclusion As it turns out, the Linear Model performs better than the Random Forest model. I am relatively pleased with the Linear Model - an R-Squared value of 0.8445 ain’t nothin’ to shake a stick at. You can see that Random Forests are very useful in identifying the important features. To me, they tend to be a bit more of a “black box” in comparison to linear regression - I hesitate to use them at work for more than a feature identification tool. Overall - a nice little experiment and a great dive into some open data. I now know that PPA rarely takes a day off, regardless of the weather. I’d love to know how much of the fines they write are actually collected. I may also dive into predicting what type of ticket you received based on your location, time of ticket, etc. All in another day’s work! Thanks for reading.
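One aside on the F-test in Step 6: R will run the same comparison of the nested models directly, so the hand calculation above can be double-checked in one line. This is just an alternative way to get the statistic, not something the original analysis ran.

## Compare the model without MON to the model with MON; the F value and
## p-value reported here should match the f_stat and p_value computed by hand
anova(linmod_wo_mon, linmod_with_mon)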

logistic regression

Identifying Compromised User Accounts with Logistic Regression

INTRODUCTION As a Data Analyst on Comcast’s Messaging Engineering team, it is my responsibility to report on the platform statuses, identify irregularities, measure impact of changes, and identify policies to ensure that our system is used as it was intended. Part of the last responsibility is the identification and remediation of compromised user accounts. The challenge the company faces is being able to detect account compromises faster and remediate them closer to the moment of detection. This post will focus on the methodology and process for modeling the criteria to best detect compromised user accounts in near real-time from outbound email activity. For obvious reasons, I am only going to speak to the methodologies used; I’ll be vague when it comes to the actual criteria we used. DATA COLLECTION AND CLEANING Without getting into the finer details of email delivery, there are about 43 terminating actions an email can take when it was sent out of our platform. A message can be dropped for a number of reasons. These are things like the IP or user being on any number block lists, triggering our spam filters, and other abusive behaviors. The other side of that is that the message will be delivered to its intended recipient. That said, I was able to create a usage profile for all of our outbound senders in small chunks of time in Splunk (our machine log collection tool of choice). This profile gives a summary per user of how often the messages they sent hit each of the terminating actions described above. In order to train my data, I matched this usage data to our current compromised detection lists. I created a script in python that added an additional column in the data. If an account was flagged as compromised with our current criteria, it was given a one; if not, a zero. With the data collected, I am ready to determine the important inputs. DETERMINING INPUTS FOR THE MODEL In order to determine the important variables in the data, I created a Binary Regression Tree in R using the rpart library. The Binary Regression Tree iterates over the data and “splits” it in order to group the data to get compromised accounts together and non-compromised accounts together. It is also a nice way to visualize the data. You can see in the picture below what this looks like. Because the data is so large, I limited the data to one day chunks. I then ran this regression tree against each day separately. From that, I was able to determine that there are 6 important variables (4 of which showed up in every regression tree I created; the other 2 showed up in a majority of trees). You can determine the “important” variables by looking in the summary for the number of splits per variable. BUILDING THE MODEL Now that I have the important variables, I created a python script to build the Logistic Regression Model from them. Using the statsmodels package, I was able to build the model. All of my input variables were highly significant. I took the logistic regression equation with the coefficients given in the model back to Splunk and tested this on incoming data to see what would come out. I quickly found that it got many accounts that were really compromised. There were also some accounts being discovered that looked like brute force attacks that never got through - to adjust for that, I added a constraint to the model that the user must have done at least one terminating action that ensured they authenticated successfully (this rules out users coming from a ton of IPs, but failing authentication everytime). 
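Because the real criteria are intentionally left vague here, the sketch below uses made-up column names (compromised, msg_dropped, blocklist_hits, auth_success, msg_sent) just to show the shape of the two steps. The tree step mirrors the rpart approach described above; for the model step I show R's glm for consistency with the rest of this site's code, whereas the model above was actually fit in Python with statsmodels.

library(rpart)

## hypothetical per-user usage profile with a 0/1 flag from the current
## compromised detection lists
profile <- read.csv("outbound_profile.csv")

## binary regression tree to surface the important inputs; variables that show
## up in many splits (see summary(tree) or tree$variable.importance) are keepers
tree <- rpart(compromised ~ ., data = profile, method = "class")
tree$variable.importance

## logistic regression on the selected inputs; the coefficients feed the
## scoring equation that gets applied to incoming data in Splunk
fit <- glm(compromised ~ msg_dropped + blocklist_hits + auth_success + msg_sent,
           data = profile, family = binomial)
summary(fit)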
CONCLUSION First and foremost, this writeup was intended to be a very high level summary explaining the steps I took to get my final model. What isn’t explained here is how many models I built that were less successful. Though this combination worked for me in the end, you’ll likely need to iterate over the process a number of times to get something successful. The new detection method for compromised accounts is an opportunity for us to expand our compromise detection and do it in a more real-time manner. This is also a foundation for future detection techniques for malicious IPs and other actors. With this new method, we will be able to expand the activity types for compromise detection beyond outbound email activity to things like preference changes, password resets, changes to forwarding address, and even application activity outside of the email platform.

machine learning

How to Build a Binary Classification Model

What Is Binary Classification? Algorithms for Binary Classification Logistic Regression Decision Trees/Random Forests Decision Trees Random Forests Nearest Neighbor Support Vector Machines (SVM) Neural Networks Great. Now what? Determining What the Problem is Locate and Obtain Data Data Mining & Preparing for Analysis Splitting the Data Building the Models Validating the Models Conclusion What Is Binary Classification? Binary classification is used to classify a given set into two categories. Usually, this answers a yes or no question: Did a particular passenger on the Titanic survive? Is a particular user account compromised? Is my product on the shelf in a certain pharmacy? This type of inference is frequently made using supervised machine learning techniques. Supervised machine learning means that you have historical, labeled data, from which your algorithm may learn. Algorithms for Binary Classification There are many methods for doing binary classification. To name a few: Logistic Regression Decision Trees/Random Forests Nearest Neighbor Support Vector Machines (SVM) Neural Networks Logistic Regression Logistic regression is a parametric statistical model that predicts binary outcomes. Parametric means that this algorithm is based on a distribution (in this case, the logistic distribution), and as such must follow a few assumptions: Obviously, the dependent variable must be binary Only meaningful independent variables are included in the model Error terms need to be independent and identically distributed Independent variables need to be independent from one another Large sample sizes are preferred Because of these assumptions, parametric tests tend to be more statistically powerful than nonparametric tests; in other words, they tend to better find a significant effect when it indeed exists. Logistic regression follows the equation: \(P(y = 1 \mid x_1, \dots, x_k) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}\) where \(P(y = 1 \mid x_1, \dots, x_k)\) is the probability of the outcome being 1 given the independent variables (this is the dependent variable, limited to values between 0 and 1), \(x_1, \dots, x_k\) are the independent variables, and \(\beta_0, \beta_1, \dots, \beta_k\) are the intercept and coefficients for the independent variables. This equation is created based on a training set of data – or historical, labeled data – and then is used to predict the likelihoods of future, unlabeled data. Decision Trees/Random Forests Decision Trees A decision tree is a nonparametric classifier. It effectively partitions the data, starting first by splitting on the independent variable that gives the most information gain, and then recursively repeating this process at subsequent levels. Information gain is a formula that determines how “important” an independent variable is in predicting the dependent variable. It takes into account how many distinct values there are (in terms of categorical variables) and the number and size of branches in the decision tree. The goal is to pick the most informative variable that is still general enough to prevent overfitting. At the bottom of the decision tree, the leaf nodes are groupings of events within the set that all follow the rules set forth throughout the tree to get to that node. Future, unlabeled events are then fed into the tree to see which group they belong to – the average of the labeled (training) data for the leaf is then assigned as the predicted value for the unlabeled event. As with logistic regression, overfitting is a concern.
If you allow a decision tree to continue to grow without bound, eventually you will have all identical events in each leaf; while this may look beneficial, it may be too specific to the training data and mislabel future events. “Pruning” occurs to prevent overfitting. Random Forests Random forests are an ensemble method built upon decision trees. Random forests are a “forest” of decision trees – in other words, you use bootstrap sampling techniques to build many over-fit decision trees, then average out the results to determine a final model. A bootstrap sample is sampling with replacement – in every selection for the sample, each event has an equal chance of being chosen. To clarify – building a random forest model means taking many bootstrap samples and building an over-fit decision tree (meaning you continue to split the tree without bound until every leaf node has identical groups in them) on each. These results, taken together, correct for the biases and potential overfitting of an individual tree. The more trees in your random forest, the better – the trade-off being that more trees mean more computing. Random forests often take a long time to train. Nearest Neighbor The k-nearest neighbor algorithm is a very simple algorithm. Using the training set as reference, the new, unlabeled data is predicted by taking the average of the k closest events. Being a lazy learner, where evaluation does not take place until you classify new events, it requires no up-front training. It can be difficult to determine what k should be. However, because it is easy computationally, you can run multiple iterations without much overhead. Support Vector Machines (SVM) SVM is another supervised machine learning method. SVM recursively attempts to “split” the two categories by maximizing the distance between a hyperplane (a plane in more than 2 dimensions; most applications of machine learning are in the higher dimensional space) and the closest points in each category. As you can see in the simple example below, the plane iteratively improves the split between the two groups. There are multiple kernels that can be used with SVM, depending on the shape of the data: Linear Polynomial Radial Sigmoid You may also choose to configure how big the steps taken by the plane in each iteration can be, among other configurations. Neural Networks Neural networks (there are several varieties) are built to mimic how a brain solves problems. This is done by creating multiple layers from a single input – most easily demonstrated with image recognition – where it is able to turn groups of pixels into another, single value, over and over again, to provide more information to train the model. Great. Now what? Now that we understand some of the tools in our arsenal, what are the steps to doing the analysis? Determining what the problem is Locate and obtain data Data mining for understanding & preparing for analysis Split data into training and testing sets Build model(s) on training data Test models on test data Validate and pick the best model Determining What the Problem is While it is easy to ask a question, it is difficult to understand all of the assumptions being made by the question asker. For example, a simple question is asked: Will my product be on the shelf of this pharmacy next week? While that question may seem straightforward at first glance, what product are we talking about? What pharmacy are we talking about? What is the time frame being evaluated?
Does it need to be in the pharmacy and available if you ask or does the customer need to be able to visually identify the product? Does it need to be available for the entire time period in question or does it just have to be available for at least part of the time period in question? Being as specific as possible is vital in order to deliver the correct answer. It is easy to misinterpret the assumptions of the question asker and then do a lot of work to answer the wrong question. Specificity will help ensure time is not wasted and that the question asker gets the answer they were looking for. The final question may look more like: Will there be any Tylenol PM available over-the-counter at midnight, February 28, 2017 at Walgreens on the corner of 17th and John F. Kennedy Blvd in Philadelphia? Well – we don’t know. We can now use historical data to make our best guess. This question is specific enough to answer. Locate and Obtain Data Where is your data? Is it in a database? Some Excel spreadsheet? Once you find it, how big is it? Can you download the data locally? Do you need to find a distributed database to handle it? If it is in a database, can you do some of the data mining (next step) before downloading the data? Be careful… “SELECT * FROM my_table;” can get scary, quick. This is also a good time to think about what tools and/or languages you want to use to mine and manipulate the data. Excel? SQL? R? Python? Some of the numerous other tools or languages out there that are good at a bunch of different things (Julia, Scala, Weka, Orange, etc.)? Get the data into one spot, preferably with some guidance on what and where it is in relation to what you need for your problem and open it up. Data Mining & Preparing for Analysis The most time-consuming step in any data science article you read will always be the data cleaning step. This document is no different – you will spend an inordinate amount of time getting to know the data, cleaning it, getting to know it better, and cleaning it again. You may then proceed to analysis, discover you’ve missed something, and come back to this step. There is a lot to consider in this step and each data analysis is different. Is your data complete? If you are missing values in your data, how will you deal with them? There is no overarching rule on this. If you are dealing with continuous data, perhaps you’ll fill missing data points with the average of similar data. Perhaps you can infer what it should be based on context. Perhaps it constitutes such a small portion of your data, the logical thing to do is to just drop the events altogether. The dependent variable – how does it break down? We are dealing with binomial data here; are there way more zeros than ones? How will you deal with that if there are? Are you doing your analysis on a subset? If so, is your sample representative of the population? How can you be sure? This is where histograms are your friend. Do you need to create variables? Perhaps one independent variable you have is a date, which might be tough to use as an input to your model. Should you find out which day of the week each date was? Month? Year? Season? These are easier to add in as a model input in some cases. Do you need to standardize your data? Perhaps men are listed as “M,” “Male,” “m,” “male,” “dude,” and “unsure.” It would behoove you, in this example, to standardize this data to all take on the same value. In most algorithms, correlated input variables are bad.
This is the time to plot all of the independent variables against each other to see if there is correlation. If there are correlated variables, it may be a tough choice to drop one (or all!). Speaking of independent variables, which are important to predict your dependent variable? You can use information gain packages (depending on the language/tool you are using to do your analysis), step-wise regression, or random forests to help understand the important variables. In many of these steps, there are no hard-and-fast rules on how to proceed. You’ll need to make a decision in the context of your problem. In many cases, you may be wrong and need to come back to the decision after trying things out. Splitting the Data Now that you (think you) have a clean dataset, you’ll need to split it into training and testing datasets. You’ll want to have as much data as possible to train on, though still have enough data left over to test on. This is less and less of an issue in the age of big data. However, with too much data it can sometimes take too long for your algorithms to train. Again – this is another decision that will need to be made in the context of your problem. There are a few options for splitting your data. The most straightforward is to take a portion of your overall dataset to train on (say 70%) and leave the rest behind to test on. This works well in most big data applications. If you do not have a lot of data (or if you do), consider cross-validation. This is an iterative approach where you train your algorithm recursively on the same data set, leaving some portion out each iteration to be used as the test set. The most popular versions of cross-validation are k-fold cross validation and leave-one-out cross validation. There is even nested cross-validation, which gets very Inception-like. Building the Models Finally, you are ready to do what we came to do – build the models. We have our datasets cleaned, enriched, and split. Time to build our models. I say models (plural) because you’ll always want to evaluate which method and/or inputs work best. You’ll want to pick a few of the algorithms from above and build the model. While that is vague, depending on your language or tool of choice, there are multiple packages available to perform each analysis. It is generally only a line or two of code to train each model; once we have our models trained, it is time to validate. Validating the Models So – which model did best? How can you tell? We start by predicting results for our test set with each model and building a confusion matrix for each: With this, we can calculate the specificity, sensitivity, and accuracy for each model. For each value, higher is better. The best model is one that performs the best in each of these counts. In the real world, frequently one model will have better specificity, while another will have better sensitivity, and yet another will be the most accurate. Again, there is no hard and fast rule on which model to choose; it all depends on the context. Perhaps false positives are really bad in your context; in that case, the specificity rate should be given more merit. It all depends. From here, you have some measures in order to pick a model and implement it. Conclusion Much of model building, in general, is part computer science, part statistics, and part business understanding. Understanding which tools and languages are best to implement the best statistical modeling technique to solve a business problem can feel like more of a form of art than science at times.
In this document, I’ve presented some algorithms and steps to do binary classification, which is just the tip of the iceberg. I am sure there are algorithms and steps missing – I hope that this helps in your understanding.
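To tie the steps together, here is a compact sketch in R of the workflow described above: split the data, train two candidate models, and build a confusion matrix for each. The data frame df and its label column are placeholders, and the 70/30 split is the simple holdout mentioned earlier.

library(randomForest)

## df is a placeholder data frame with a 0/1 outcome column named "label"
df$label <- as.factor(df$label)

## 70/30 train/test split
set.seed(42)
train_idx <- sample(nrow(df), size = round(0.7 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

## two candidate models
logit  <- glm(label ~ ., data = train, family = binomial)
forest <- randomForest(label ~ ., data = train)

## predicted classes for the test set
p_logit  <- ifelse(predict(logit, test, type = "response") > 0.5, 1, 0)
p_forest <- predict(forest, test)

## confusion matrices; sensitivity, specificity, and accuracy come from these
table(actual = test$label, predicted = p_logit)
table(actual = test$label, predicted = p_forest)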

Machine Learning Demystified

The differences and applications of Supervised and Unsupervised Machine Learning. Introduction Machine learning is one of the buzziest terms thrown around in technology these days. Combine machine learning with big data in a Google search and you’ve got yourself an unmanageable amount of information to digest. In an (possibly ironic) effort to help navigate this sea of information, this post is meant to be an introduction and simplification of some common machine learning terminology and types with some resources to dive deeper. Supervised vs. Unsupervised Machine Learning At the highest level, there are two different types of machine learning - supervised and unsupervised. Supervised means that we have historical information in order to learn from and make future decisions; unsupervised means that we have no previous information, but might be attempting to group things together or do some other type of pattern or outlier recognition. In each of these subsets there are many methodologies and motivations; I’ll explain how they work and give a simple example or two. Supervised Machine Learning Supervised machine learning is nothing more than using historical information (read: data) in order to predict a future event or explain a behavior using algorithms. I know - this is vague - but humans use these algorithms based on previous learning everyday in their lives to predict things. A very simple example: if it is sunny outside when we wake up, it is perfectly reasonable to assume that it will not rain that day. Why do we make this prediction? Because over time, we’ve learned that on sunny days it typically does not rain. We don’t know for sure that today it won’t rain but we’re willing to make decisions based on our prediction that it won’t rain. Computers do this exact same thing in order to make predictions. The real gains come from Supervised Machine Learning when you have lots of accurate historical data. In the example above, we can’t be 100% sure that it won’t rain because we’ve also woken up on a few sunny mornings in which we’ve driven home after work in a monsoon - adding more and more data for your supervised machine learning algorithm to learn from also allows it to make concessions for these other possible outcomes. Supervised Machine Learning can be used to classify (usually binary or yes/no outcomes but can be broader - is a person going to default on their loan? will they get divorced?) or predict a value (how much money will you make next year? what will the stock price be tomorrow?). Some popular supervised machine learning methods are regression (linear, which can predict a continuous value, or logistic, which can predict a binary value), decision trees, k-nearest neighbors, and naive Bayes. My favorite of these methods is decision trees. A decision tree is used to classify your data. Once the data is classified, the average is taken of each terminal node; this value is then applied to any future data that fits this classification. The decision tree above shows that if you were a female and in first or second class, there was a high likelihood you survived. If you were a male in second class who was younger than 12 years old, you also had a high likelihood of surviving. This tree could be used to predict the potential outcomes of future sinking ships (morbid… I know). Unsupervised Machine Learning Unsupervised machine learning is the other side of this coin. In this case, we do not necessarily want to make a prediction. 
Instead, this type of machine learning is used to find similarities and patterns in the information to cluster or group. An example of this: Consider a situation where you are looking at a group of people and you want to group similar people together. You don’t know anything about these people other than what you can see in their physical appearance. You might end up grouping the tallest people together and the shortest people together. You could do this same thing by weight instead… or hair length… or eye color… or use all of these attributes at the same time! It’s natural in this example to see how “close” people are to one another based on different attributes. What these type of algorithms do is evaluate the “distances” of one piece of information from another piece. In a machine learning setting you look for similarities and “closeness” in the data and group accordingly. This could allow the administrators of a mobile application to see the different types of users of their app in order to treat each group with different rules and policies. They could cluster samples of users together and analyze each cluster to see if there are opportunities for targeted improvements. The most popular of these unsupervised machine learning methods is called k-means clustering. In k-means clustering, the goal is to partition your data into k clusters (where k is how many clusters you want - 1, 2,…, 10, etc.). To begin this algorithm, k means (or cluster centers) are randomly chosen. Each data point in the sample is clustered to the closest mean; the center (or centroid, to use the technical term) of each cluster is calculated and that becomes the new mean. This process is repeated until the mean of each cluster is optimized. The important part to note is that the output of k-means is clustered data that is “learned” without any input from a human. Similar methods are used in Natural Language Processing (NLP) in order to do Topic Modeling. Resources to Learn More There are an uncountable amount resources out there to dive deeper into this topic. Here are a few that I’ve used or found along my Data Science journey. UPDATE: I’ve written a whole post on this. You can find it here O’Reilly has a ton of great books that focus on various areas of machine learning. edX and coursera have a TON of self-paced and instructor-led learning courses in machine learning. There is a specific series of courses offered by Columbia University that look particularly applicable. If you are interested in learning machine learning and already have a familiarity with R and Statistics, DataCamp has a nice, free program. If you are new to R, they have a free program for that, too. There are also many, many blogs out there to read about how people are using data science and machine learning.
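To make the k-means loop concrete, here is the people-grouping example from above in a few lines of R; the height and weight columns are invented for illustration.

## hypothetical, unlabeled data: two attributes per person
people <- data.frame(height_cm = rnorm(200, mean = 170, sd = 10),
                     weight_kg = rnorm(200, mean = 75, sd = 12))

## scale the attributes so "distance" treats them equally, then ask for k = 3
## clusters; kmeans() repeats assign-to-closest-mean / recompute-centroid
## until the centers settle, which is exactly the loop described above
clusters <- kmeans(scale(people), centers = 3, nstart = 25)

clusters$centers         ## the learned cluster centers (centroids)
table(clusters$cluster)  ## how many people fell into each group
plot(people, col = clusters$cluster, pch = 19)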

machine learning

A Data Scientist's Take on the Roadmap to AI

INTRODUCTION Recently I was asked by a former colleague about getting into AI. He has truly big data and wants to use this data to power “AI” - if the headlines are to be believed, everyone else is already doing it. Though it was difficult for my ego, I told him I couldn’t help him in our 30 minute call and that he should think about hiring someone to get him there. The truth was I really didn’t have a solid answer for him in the moment. This was truly disappointing - in my current role and in my previous role, I put predictive models into production. After thinking about it for a bit, there is definitely a similar path I took in both roles. There are 3 steps in my mind to getting to “AI.” Though this seems simple, it is a long process and potentially not linear - you may have to keep coming back to previous steps. Baseline (Reporting) Understand (Advanced Analytics) Artificial Intelligence (Data Science) BASELINE (REPORTING) Fun fact: You cannot effectively predict anything if you cannot measure the impact. What I mean by baseline is building out a reporting suite. Having a fundamental understanding of your business and environment is key. Without doing this step, you may try to predict the wrong thing entirely - or start with something that isn’t the most impactful. For me, this step started with finding the data in the first place. Perhaps, like my colleague, you have lots of data and you’re ready to jump in. That’s great and makes getting started that much more straightforward. In my role, I joined a finance team that really didn’t have a good bead on this - finding the data was difficult (and getting the owners of that data to give me access was a process as well). To be successful, start small and iterate. Our first reports were built from manually downloading machine logs, processing them in R with JSON packages, and turning them into a black-and-white document. It was ugly, but it helped us know what we needed to know in that moment - oh yeah… it was MUCH better than nothing. “Don’t let perfection be the enemy of good.” - paraphrased from Voltaire. From this, I gained access to our organization’s data warehouse, put automation in place, and purchased some Tableau licenses. This phase took a few months and is constantly being refined, but we are now able to see the impact of our decisions at a glance. This new understanding inevitably leads to more questions - cue step 2: Understanding. UNDERSTANDING (ADVANCED ANALYTICS) If you have never circulated reports and dashboards to others… let me fill you in on something: it will ALWAYS lead to additional, progressively harder questions. This step is an investment in time and expertise - you have to commit to having dedicated resource(s) (read: people… it is inhumane to call people resources and you may only need one person or some of a full time person’s time). Why did X go up unexpectedly (breaks the current trend)? Are we over-indexing on this type of customer? Right before our customer leaves, this weird thing happens - what is this weird thing and why is it happening? Like the previous step - this will be ongoing. Investing in someone to do advanced analytics will help you to understand the fine details of your business AND … (drum roll) … will help you to understand which part of your business is most ripe for “AI”! ARTIFICIAL INTELLIGENCE (DATA SCIENCE) It is at this point that you will be able to do real, bona fide data science.
A quick rant: Notice that I purposefully did not use the term “AI” (I know I used it throughout this article and even in the title of this section… what can I say - I am in-tune with marketing concepts, too). “AI” is a term that is overused and rarely implemented. Data science, however, comes in many forms and can really transform your business. Here are a few ideas for what you can do with data science: Prediction/Machine Learning Testing Graph Analysis Perhaps you want to predict whether a sale is fraudulent or which existing customer is most apt to buy your new product? You can also test whether a new strategy works better than the old. This requires that you use statistical concepts to ensure valid testing and results. My new obsession is around graph analysis. With graphs you can see relationships that may have been hidden before - this will enable you to identify new targets and enrich your understanding of your business! Data science is usually a very specific thing and takes many forms! SUMMARY Getting to data science is a process - it will take an investment. There are products out there that will help you shortcut some of these steps and I encourage you to consider these. There are products to help with reporting, analytics, and data science. These should, in my very humble opinion, be used by people who are dedicated to the organization’s data, analytics, and science. Directions for data science - measure, analyze, predict, repeat!

model coparison

Model Comparision - ROC Curves & AUC

INTRODUCTION Whether you are a data professional or in a job that requires data driven decisions, predictive analytics and related products (aka machine learning aka ML aka artificial intelligence aka AI) are here and understanding them is paramount. They are being used to drive industry. Because of this, understanding how to compare predictive models is very important. This post gets into a very popular method of decribing how well a model performs: the Area Under the Curve (AUC) metric. As the term implies, AUC is a measure of area under the curve. The curve referenced is the Reciever Operating Characteristic (ROC) curve. The ROC curve is a way to visually represent how the True Positive Rate (TPR) increases as the False Positive Rate (FPR) increases. In plain english, the ROC curve is a visualization of how well a predictive model is ordering the outcome - can it separate the two classes (TRUE/FALSE)? If not (most of the time it is not perfect), how close does it get? This last question can be answered with the AUC metric. THE BACKGROUND Before I explain, let’s take a step back and understand the foundations of TPR and FPR. For this post we are talking about a binary prediction (TRUE/FALSE). This could be answering a question like: Is this fraud? (TRUE/FALSE). In a predictive model, you get some right and some wrong for both the TRUE and FALSE. Thus, you have four categories of outcomes: True positive (TP): I predicted TRUE and it was actually TRUE False positive (FP): I predicted TRUE and it was actually FALSE True negative (TN): I predicted FALSE and it was actually FALSE False negative (FN): I predicted FALSE and it was actually TRUE From these, you can create a number of additional metrics that measure various things. In ROC Curves, there are two that are important: True Positive Rate aka Sensitivity (TPR): out of all the actual TRUE outcomes, how many did I predict TRUE? \(TPR = sensitivity = \frac{TP}{TP + FN}\) Higher is better! False Positive Rate aka 1 - Specificity (FPR): out of all the actual FALSE outcomes, how many did I predict TRUE? \(FPR = 1 - sensitivity = 1 - (\frac{TN}{TN + FP})\) Lower is better! BUILDING THE ROC CURVE For the sake of the example, I built 3 models to compare: Random Forest, Logistic Regression, and random prediction using a uniform distribution. Step 1: Rank Order Predictions To build the ROC curve for each model, you first rank order your predictions: Actual Predicted FALSE 0.9291 FALSE 0.9200 TRUE 0.8518 TRUE 0.8489 TRUE 0.8462 TRUE 0.7391 Step 2: Calculate TPR & FPR for First Iteration Now, we step through the table. Using a “cutoff” as the first row (effectively the most likely to be TRUE), we say that the first row is predicted TRUE and the remaining are predicted FALSE. From the table below, we can see that the first row is FALSE, though we are predicting it TRUE. This leads to the following metrics for our first iteration: Iteration TPR FPR Sensitivity Specificity True.Positive False.Positive True.Negative False.Negative 1 0 0.037 0 0.963 0 1 26 11 This is what we’d expect. We have a 0% TPR on the first iteration because we got that single prediction wrong. Since we’ve only got 1 false positve, our FPR is still low: 3.7%. Step 3: Iterate Through the Remaining Predictions Now, let’s go through all of the possible cut points and calculate the TPR and FPR. 

open data

Exploring Open Data - Predicting the Amount of Violations

Introduction In my last post, I went over some of the highlights of the open data set of all Philadelphia Parking Violations. In this post, I’ll go through the steps to build a model to predict the amount of violations the city issues on a daily basis. I’ll walk you through cleaning and building the data set, selecting and creating the important features, and building predictive models using Random Forests and Linear Regression. Step 1: Load Packages and Data Just an initial step to get the right libraries and data loaded in R. library(plyr) library(randomForest) ## DATA FILE FROM OPENDATAPHILLY ptix <- read.csv("Parking_Violations.csv") ## READ IN THE WEATHER DATA (FROM NCDC) weather_data <- read.csv("weather_data.csv") ## LIST OF ALL FEDERAL HOLIDAYS DURING THE ## RANGE OF THE DATA SET holidays <- as.Date(c("2012-01-02", "2012-01-16", "2012-02-20", "2012-05-28", "2012-07-04", "2012-09-03", "2012-10-08", "2012-11-12", "2012-11-22", "2012-12-25", "2013-01-01", "2013-01-21", "2013-02-18", "2013-05-27", "2013-07-04", "2013-09-02", "2013-10-14", "2013-11-11", "2013-11-28", "2013-12-25", "2014-01-01", "2014-01-20", "2014-02-17", "2014-05-26", "2014-07-04", "2014-09-01", "2014-10-13", "2014-11-11", "2014-11-27", "2014-12-25", "2015-01-01", "2015-01-09", "2015-02-16", "2015-05-25", "2015-07-03", "2015-09-07")) Step 2: Formatting the Data First things first, we have to total the amount of tickets per day from the raw data. For this, I use the plyr command ddply. Before I can use the ddply command, I need to format the Issue.Date.and.Time column to be a Date variable in the R context. days <- as.data.frame(as.Date( ptix$Issue.Date.and.Time, format = "%m/%d/%Y")) names(days) <- "DATE" count_by_day <- ddply(days, .(DATE), summarize, count = length(DATE)) Next, I do the same exact date formatting with the weather data. weather_data$DATE <- as.Date(as.POSIXct(strptime(as.character(weather_data$DATE), format = "%Y%m%d")), format = "%m/%d/%Y") Now that both the ticket and weather data have the same date format (and name), we can use the join function from the plyr package. count_by_day <- join(count_by_day, weather_data, by = "DATE") With the data joined by date, it is time to clean. There are a number of columns with unneeded data (weather station name, for example) and others with little or no data in them, which I just flatly remove. The data has also been coded with negative values representing that data had not been collected for any number of reasons (I’m not surprised that snow was not measured in the summer); for that data, I’ve made any values coded -9999 into 0. There are some days where the maximum or minimum temperature was not gathered (I’m not sure why). As this is the main variable I plan to use to predict daily violations, I drop the entire row if the temperature data is missing. 
## I DON'T CARE ABOUT THE STATION OR ITS NAME - ## GETTING RID OF IT count_by_day$STATION <- NULL count_by_day$STATION_NAME <- NULL ## A BUNCH OF VARIABLES ARE CODED WITH NEGATIVE VALUES ## IF THEY WEREN'T COLLECTED - CHANGING THEM TO 0s count_by_day$MDPR[count_by_day$MDPR < 0] <- 0 count_by_day$DAPR[count_by_day$DAPR < 0] <- 0 count_by_day$PRCP[count_by_day$PRCP < 0] <- 0 count_by_day$SNWD[count_by_day$SNWD < 0] <- 0 count_by_day$SNOW[count_by_day$SNOW < 0] <- 0 count_by_day$WT01[count_by_day$WT01 < 0] <- 0 count_by_day$WT03[count_by_day$WT03 < 0] <- 0 count_by_day$WT04[count_by_day$WT04 < 0] <- 0 ## REMOVING ANY ROWS WITH MISSING TEMP DATA count_by_day <- count_by_day[ count_by_day$TMAX > 0, ] count_by_day <- count_by_day[ count_by_day$TMIN > 0, ] ## GETTING RID OF SOME NA VALUES THAT POPPED UP count_by_day <- count_by_day[!is.na( count_by_day$TMAX), ] ## REMOVING COLUMNS THAT HAVE LITTLE OR NO DATA ## IN THEM (ALL 0s) count_by_day$TOBS <- NULL count_by_day$WT01 <- NULL count_by_day$WT04 <- NULL count_by_day$WT03 <- NULL ## CHANGING THE DATA, UNNECESSARILY, FROM 10ths OF ## DEGREES CELSIUS TO JUST DEGREES CELSIUS count_by_day$TMAX <- count_by_day$TMAX / 10 count_by_day$TMIN <- count_by_day$TMIN / 10 Step 3: Visualizing the Data At this point, we have joined our data sets and gotten rid of the unhelpful “stuff.” What does the data look like? Daily Violation Counts There are clearly two populations here. With the benefit of hindsight, the small population on the left of the histogram is mainly Sundays. The larger population with the majority of the data is all other days of the week. Let’s make some new features to explore this idea. Step 4: New Feature Creation As we see in the histogram above, there are obviously a few populations in the data - I know that day of the week, holidays, and month of the year likely have some strong influence on how many violations are issued. If you think about it, most parking signs include the clause: “Except Sundays and Holidays.” Plus, spending more than a few summers in Philadelphia at this point, I know that from Memorial Day until Labor Day the city relocates to the South Jersey Shore (emphasis on the South part of the Jersey Shore). That said - I add in those features as predictors. ## FEATURE CREATION - ADDING IN THE DAY OF WEEK count_by_day$DOW <- as.factor(weekdays(count_by_day$DATE)) ## FEATURE CREATION - ADDING IN IF THE DAY WAS A HOLIDAY count_by_day$HOL <- 0 count_by_day$HOL[as.character(count_by_day$DATE) %in% as.character(holidays)] <- 1 count_by_day$HOL <- as.factor(count_by_day$HOL) ## FEATURE CREATION - ADDING IN THE MONTH count_by_day$MON <- as.factor(months(count_by_day$DATE)) Now - let’s see if the Sunday thing is real. Here is a scatterplot of the data. The circles represent Sundays; triangles are all other days of the week. Temperature vs. Ticket Counts You can clearly see that Sundays tend to do their own thing, but in a very consistent manner that is similar to the rest of the week. In other words, the slope for Sundays is very close to the slope for all other days of the week. There are some points that don’t follow those trends, which are likely due to snow, holidays, and/or other man-made or weather events. Let’s split the data into a training and test set (that way we can see how well we do with the model). I’m arbitrarily making the test set the last year of data; everything before that is the training set. 
train <- count_by_day[count_by_day$DATE < "2014-08-01", ] test <- count_by_day[count_by_day$DATE >= "2014-08-01", ] Step 5: Feature Identification We now have a data set that is ready for some model building! The problem to solve next is figuring out which features best explain the count of violations issued each day. My preference is to use Random Forests to tell me which features are the most important. We’ll also take a look to see which, if any, variables are highly correlated. High correlation amongst input variables will lead to high variability due to multicollinearity issues. featForest <- randomForest(count ~ MDPR + DAPR + PRCP + SNWD + SNOW + TMAX + TMIN + DOW + HOL + MON, data = train, importance = TRUE, ntree = 10000) ## PLOT THE VARIABLE TO SEE THE IMPORTANCE varImpPlot(featForest) In the Variable Importance Plot below, you can see very clearly that the day of the week (DOW) is by far the most important variable in describing the amount of violations written per day. This is followed by whether or not the day was a holiday (HOL), the minimum temperature (TMIN), and the month (MON). The maximum temperature is in there, too, but I think that it is likely highly correlated with the minimum temperature (we’ll see that next). The rest of the variables have very little impact. Variable Importance Plot cor(count_by_day[,c(3:9)]) I’ll skip the entire output of the correlation table, but TMIN and TMAX have a correlation coefficient of 0.940379171. Because TMIN has a higher variable importance and there is a high correlation between the TMIN and TMAX, I’ll leave TMAX out of the model. Step 6: Building the Models The goal here was to build a multiple linear regression model - since I’ve already started down the path of Random Forests, I’ll do one of those, too, and compare the two. To build the models, we do the following: ## BUILD ANOTHER FOREST USING THE IMPORTANT VARIABLES predForest <- randomForest(count ~ DOW + HOL + TMIN + MON, data = train, importance = TRUE, ntree = 10000) ## BUILD A LINEAR MODEL USING THE IMPORTANT VARIABLES linmod_with_mon <- lm(count ~ TMIN + DOW + HOL + MON, data = train) In looking at the summary, I have questions on whether or not the month variable (MON) is significant to the model or not. Many of the variables have rather high p-values. summary(linmod_with_mon) Call: lm(formula = count ~ TMIN + DOW + HOL + MON, data = train) Residuals: Min 1Q Median 3Q Max -4471.5 -132.1 49.6 258.2 2539.8 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5271.4002 89.5216 58.884 < 2e-16 *** TMIN -15.2174 5.6532 -2.692 0.007265 ** DOWMonday -619.5908 75.2208 -8.237 7.87e-16 *** DOWSaturday -788.8261 74.3178 -10.614 < 2e-16 *** DOWSunday -3583.6718 74.0854 -48.372 < 2e-16 *** DOWThursday 179.0975 74.5286 2.403 0.016501 * DOWTuesday -494.3059 73.7919 -6.699 4.14e-11 *** DOWWednesday -587.7153 74.0264 -7.939 7.45e-15 *** HOL1 -3275.6523 146.8750 -22.302 < 2e-16 *** MONAugust -99.8049 114.4150 -0.872 0.383321 MONDecember -390.2925 109.4594 -3.566 0.000386 *** MONFebruary -127.8091 112.0767 -1.140 0.254496 MONJanuary -73.0693 109.0627 -0.670 0.503081 MONJuly -346.7266 113.6137 -3.052 0.002355 ** MONJune -30.8752 101.6812 -0.304 0.761481 MONMarch -1.4980 94.8631 -0.016 0.987405 MONMay 0.1194 88.3915 0.001 0.998923 MONNovember 170.8023 97.6989 1.748 0.080831 . MONOctober 125.1124 92.3071 1.355 0.175702 MONSeptember 199.6884 101.9056 1.960 0.050420 . --- Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 544.2 on 748 degrees of freedom Multiple R-squared: 0.8445, Adjusted R-squared: 0.8405 F-statistic: 213.8 on 19 and 748 DF, p-value: < 2.2e-16 To verify this, I build the model without the MON term and then do an F-Test to compare using the results of the ANOVA tables below. ## FIRST ANOVA TABLE (WITH THE MON TERM) anova(linmod_with_mon) Analysis of Variance Table Response: count Df Sum Sq Mean Sq F value Pr(>F) TMIN 1 16109057 16109057 54.3844 4.383e-13 *** DOW 6 1019164305 169860717 573.4523 < 2.2e-16 *** HOL 1 147553631 147553631 498.1432 < 2.2e-16 *** MON 11 20322464 1847497 6.2372 6.883e-10 *** Residuals 748 221563026 296207 ## SECOND ANOVA TABLE (WITHOUT THE MON TERM) anova(linmod_wo_mon) Analysis of Variance Table Response: count Df Sum Sq Mean Sq F value Pr(>F) TMIN 1 16109057 16109057 50.548 2.688e-12 *** DOW 6 1019164305 169860717 532.997 < 2.2e-16 *** HOL 1 147553631 147553631 463.001 < 2.2e-16 *** Residuals 759 241885490 318690 ## Ho: B9 = B10 = B11 = B12 = B13 = B14 = B15 = B16 = ## B17 = B18 = B19 = 0 ## Ha: At least one is not equal to 0 ## F-Stat = MSdrop / MSE = ## ((SSR1 - SSR2) / (DF(R)1 - DF(R)2)) / MSE f_stat <- ((241885490 - 221563026) / (759 - 748)) / 296207 ## P_VALUE OF THE F_STAT CALCULATED ABOVE p_value <- 1 - pf(f_stat, 11, 748) Since the P-Value 6.8829e-10 is MUCH MUCH less than 0.05, I can reject the null hypothesis and conclude that at least one of the parameters associated with the MON term is not zero. Because of this, I’ll keep the term in the model. Step 7: Apply the Models to the Test Data Below I call the predict function to see how the Random Forest and Linear Model predict the test data. I am rounding the prediction to the nearest integer. To determine which model performs better, I am calculating the difference in absolute value of the predicted value from the actual count. ## PREDICT THE VALUES BASED ON THE MODELS test$RF <- round(predict(predForest, test), 0) test$LM <- round(predict.lm(linmod_with_mon, test), 0) ## SEE THE ABSOLUTE DIFFERENCE FROM THE ACTUAL difOfRF <- sum(abs(test$RF - test$count)) difOfLM <- sum(abs(test$LM - test$count)) Conclusion As it turns out, the Linear Model performs better than the Random Forest model. I am relatively pleased with the Linear Model - an R-Squared value of 0.8445 ain’t nothin’ to shake a stick at. You can see that Random Forests are very useful in identifying the important features. To me, it tends to be a bit more of a “black box” in comparison the linear regression - I hesitate to use it at work for more than a feature identification tool. Overall - a nice little experiment and a great dive into some open data. I now know that PPA rarely takes a day off, regardless of the weather. I’d love to know how much of the fines they write are actually collected. I may also dive into predicting what type of ticket you received based on your location, time of ticket, etc. All in another day’s work! Thanks for reading.
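One gap in the code as shown: linmod_wo_mon appears in the second ANOVA table but is never defined. A minimal sketch of what it presumably looks like (the same linear model with the MON term dropped) is below, along with anova() called on the two nested fits, which reproduces the partial F-test computed by hand above.

## THE REDUCED MODEL REFERENCED IN THE SECOND ANOVA TABLE -
## PRESUMABLY THE SAME MODEL WITH THE MON TERM DROPPED
linmod_wo_mon <- lm(count ~ TMIN + DOW + HOL, data = train)

## CALLING anova() ON THE TWO NESTED MODELS RUNS THE SAME PARTIAL F-TEST
## THAT IS CALCULATED BY HAND ABOVE
anova(linmod_wo_mon, linmod_with_mon)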

Exploring Open Data - Philadelphia Parking Violations

Introduction A few weeks ago, I stumbled across Dylan Purcell’s article on Philadelphia Parking Violations. This is a nice glimpse of the data, but I wanted to get a taste of it myself. I went and downloaded the entire data set of Parking Violations in Philadelphia from the OpenDataPhilly website and came up with a few questions after checking out the data: How many tickets are in the data set? What is the range of dates in the data? Are there missing days/data? What was the biggest/smallest individual fine? What were those fines for? Who issued those fines? What was the average individual fine amount? What day had the most/least count of fines? What is the average amount per day? How much $ in fines did they write each day? What hour of the day are the most fines issued? What day of the week are the most fines issued? What state has been issued the most fines? Who (what individual) has been issued the most fines? How much does the individual with the most fines owe the city? How many people have been issued fines? What fines are issued the most/least? And finally to the cool stuff: Where were the most fines? Can I see them on a heat map? Can I predict the amount of parking tickets by weather data and other factors using linear regression? How about using Random Forests? Data Insights This data set has 5,624,084 tickets in it and spans from January 1, 2012 through September 30, 2015 - an exact range of 1368.881 days. I was glad to find that there are no missing days in the data set. The biggest fine, $2000 (OUCH!), was issued (many times) by the police for “ATV on Public Property.” The smallest fine, $15, was also issued by the police for “parking over the time limit.” The average fine for a violation in Philadelphia over the time range was $46.33. The most violations occurred on November 30, 2012 when 6,040 were issued. The least issued, unsurprisingly, was on Christmas day, 2014, when only 90 were issued. On average, PPA and the other 9 agencies that issued tickets (more on that below) issued 4,105.17 tickets per day. All of those tickets add up to $190,193.50 in fines issued to the residents and visitors of Philadelphia every day!!! Digging a little deeper, I find that the most popular hour of the day for getting a ticket is 12 noon; 5AM nets the least tickets. Thursdays see the most tickets written (Thursdays and Fridays are higher than the rest of the week); Sundays see the least (pretty obvious). Another obvious insight is that PA licensed drivers were issued the most tickets. Looking at individuals, there was one person who was issued 1,463 tickets (that’s more than 1 violation per day on average) for a whopping $36,471. In just looking at a few of their tickets, it seems like it is probably a delivery vehicle that delivers to Chinatown (Tickets for “Stop Prohibited” and “Bus Only Zone” in the Chinatown area). I’d love to hear more about why this person has so many tickets and what you do about that… 1,976,559 people - let me reiterate - nearly 2 million unique vehicles have been issued fines over the three and three quarter years this data set encompasses. That’s so many!!! That is 2.85 tickets per vehicle, on average (of course that excludes all of the cars that were here and never ticketed). That makes me feel much better about how many tickets I got while I lived in the city. And… who are the agencies behind all of this? It is no surprise that PPA issues the most. There are 11 agencies in all. Seems like all of the policing agencies like to get in on the fun from time to time. 
Issuing Agency                 Count
PPA                        4,979,292
PHILADELPHIA POLICE          611,348
CENTER CITY DISTRICT           9,628
SEPTA                          9,342
UPENN POLICE                   6,366
TEMPLE POLICE                  4,055
HOUSING AUTHORITY              2,137
PRISON CORRECTIONS OFFICER       295
POST OFFICE                      121
FAIRMOUNT DISTRICT               120

Mapping the Violations Where are you most likely to get a violation? Is there anywhere that is completely safe? Looking at the city as a whole, you can see that there are some places that are “hotter” than others. I played around in CartoDB to try to visualize this as well, but Tableau seemed to do a decent enough job (though these are just screenshots). Zooming in, you can see that there are some distinct areas where tickets are given out in greater quantity. Looking one level deeper, you can see that there are some areas like Center City, east Washington Avenue, Passyunk Ave, and Broad Street that seem to be very highly patrolled. Summary I created the above maps in Tableau. I used R to summarize the data. The R scripts, raw and processed data, and Tableau workbook can be found in my github repo. In the next post, I use weather data and other parameters to predict how many tickets will be written on a daily basis.
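For what it’s worth, the agency table above is the kind of summary that falls out of a single ddply call on the raw data. A sketch is below; it assumes the violations CSV has been read into a data frame called ptix and that the agency column comes through as Issuing.Agency (both names are assumptions, not taken from the post).

library(plyr)

## COUNT TICKETS PER ISSUING AGENCY AND SORT DESCENDING
## (ptix AND Issuing.Agency ARE ASSUMED NAMES FOR THE RAW DATA)
agency_counts <- ddply(ptix, .(Issuing.Agency), summarize,
                       count = length(Issuing.Agency))
agency_counts <- agency_counts[order(-agency_counts$count), ]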

Open Data Day - DC Hackathon

For those of you who aren’t stirred from bed in the small hours to learn data science, you might have missed that March 5th was International Open Data Day. There are hundreds of local events around the world; I was lucky enough to attend DC’s Open Data Day Hackathon. I met a bunch of great people doing noble things with data who taught me a crap-ton (scientific term) and also validated my love for data science and how much I’ve learned since beginning my journey almost two years ago. Here is a quick rundown of what I learned and some helpful links so that you can find out more, too. Being that it is an Open Data event, everything was well documented on the hackathon hackpad. Introduction to Open Data Eric Mill gave a really nice overview of what JSON is and how to use APIs to access the JSON and, thus, the data the website is conveying. Though many APIs are open and documented, many are not. Eric gave some tips on how to access that data, too. This session really opened my eyes to how to access that previously unusable data that was hidden in plain sight in the text of websites. Data Science Primer This was one of the highlights for me - a couple of NIST Data Scientists, Pri Oberoi and Star Ying, gave a presentation and walkthrough on how to use k-means clustering to identify groupings in your data. The data and jupyter notebook are available on github. I will definitely be using this in my journey to better detect and remediate compromised user accounts at Comcast. Hackathon I joined a group that was working to use data science to identify opioid overuse. Though I didn’t add much (the group was filled with some really, really smart people), I was able to visualize the data using R and share some of those techniques with the team. Intro to D3 Visualizations The last session, and probably my favorite, was a tutorial on building out a D3 visualization. Chris Given walked a packed house through building a D3 viz step-by-step, giving some background on why things work the way they work and showing some great resources. I am particularly proud of the results (though I only followed his instructions to build this). Closing I also attended 2 sessions about using the command line that totally demystified the shell prompt. All in all, it was a great two days! I will definitely be back next year (unless I can convince someone to do one in Philly).

philly magazine

How Data Science Is Keeping Your Cell Phone Info Safe

I am honored to have been interviewed by Philly Magazine about how my experience from Villanova's Masters in Applied Statistics has influenced my work in analytics at Comcast!

plotly

Prime Number Patterns

I found a very thought-provoking and beautiful visualization on the D3 Website regarding prime numbers. What the visualization shows is that if you draw periodic curves beginning at the origin for each positive integer, the prime numbers will be intersected by only two curves: the prime’s own curve and the curve for one. When I saw this, my mind was blown. How interesting… and also how obvious. The definition of a prime is that it can only be divided by itself and one (duh). This is a visualization of that fact. The patterns that emerge are stunning. I wanted to build the data and visualization for myself in R. While not as spectacular as the original I found, it was still a nice adventure. I used Plotly to visualize the data. The code can be found on github. Here is the visualization:
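To make the idea concrete: one way to generate such curves is to use, for each integer n, something like sin(pi * x / n), which starts at the origin and crosses the x-axis exactly at the multiples of n, so an integer on the axis is touched only by the curves of its divisors. A rough sketch of building that data and plotting it with plotly follows (this is not the exact code from the post, which is on github):

library(plotly)

## ONE PERIODIC CURVE PER INTEGER n; EACH CURVE IS ZERO AT THE MULTIPLES OF n
x <- seq(0, 50, by = 0.05)
curves <- do.call(rbind, lapply(1:50, function(n) {
  data.frame(x = x, y = sin(pi * x / n), n = factor(n))
}))

## PRIMES ON THE X-AXIS ARE CROSSED ONLY BY THEIR OWN CURVE AND THE CURVE FOR 1
plot_ly(curves, x = ~x, y = ~y, color = ~n, type = "scatter", mode = "lines")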

Using the Google Search API and Plotly to Locate Waterparks

I’ve got a buddy who manages and builds waterparks. I thought to myself… I am probably the only person in the world who has a friend that works at a waterpark - cool. Then I started thinking some more… there has to be more than just his waterpark in this country; I’ve been to at least a few… and the thinking continued… I wonder how many there are… and continued… and I wonder where they are… and, well, here we are at the culmination of that curiosity with this blog post. So - the first problem - how would I figure that out? As with most things I need answers to in this world, I turned to Google and asked: Where are the waterparks in the US? The answer appears to be: there are a lot. The data is there if I can get my hands on it. Knowing that Google has an API, I signed up for an API key and away I went! Until I was stopped abruptly with limits on how many results will be returned: a measly 20 per search. I know R and wanted to use that to hit the API. Using the httr package and a for loop, I conceded to doing the search once per state and living with a maximum of 20 results per state. Easy fix. Here’s the code to generate the search string and query Google: q1 <- paste("waterparks in ", list_of_states[j,1], sep = "") response <- GET("https://maps.googleapis.com/", path = "maps/api/place/textsearch/xml", query = list(query = q1, key = "YOUR_API_KEY")) The results come back in XML (or JSON, if you so choose… I went with XML for this, though) - something that I have not had much experience in. I used the XML package and a healthy amount of more time in Google search-land and was able to parse the data into data frame! Success! Here’s a snippet of the code to get this all done: result <- xmlParse(response) result1 <- xmlRoot(result) result2 <- getNodeSet(result1, "//result") data[counter, 1] <- xmlValue(result2[[i]][["name"]]) data[counter, 2] <- xmlValue(result2[[i]][["formatted_address"]]) data[counter, 3] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lat"]]) data[counter, 4] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lng"]]) data[counter, 5] <- xmlValue(result2[[i]][["rating"]]) Now that the data is gathered and in the right shape - what is the best way to present it? I’ve recently read about a package in R named plotly. They have many interesting and interactive visualizations, plus the API plugs right into R. I found a nice example of a map using the package. With just a few lines of code and a couple iterations, I was able to generate this (click on the picture to get the full interactivity): Waterpark’s in the USA This plot can be seen here, too. Not too shabby! There are a few things to mention here… For one, not every water park has a rating; I dealt with this by making the NAs into 0s. That’s probably not the nicest way of handling that. Also - this is only the top 20 waterparks as Google decided per state. There are likely some waterparks out there that are not represented here. There are also probably non-waterparks represented here that popped up in the results. For those of you who are interested in the data or script I used to generate this map, feel free to grab them at those links. Maybe one day I’ll come back to this to find out where there are the most waterparks per capita - or some other correlation to see what the best water park really is… this is just the tip of the iceberg. It feels good to scratch a few curiosity driven scratches in one project!
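Pieced together, the state-by-state collection described above looks roughly like the sketch below. The list_of_states data frame and the API key are placeholders, error handling is omitted, the missing-rating case is handled with an NA, and content() is used to pull the response body out as text before parsing.

library(httr)
library(XML)

data <- data.frame()   ## WILL HOLD name, address, lat, lng, rating
counter <- 1

for (j in 1:nrow(list_of_states)) {
  ## BUILD THE SEARCH STRING FOR THIS STATE AND HIT THE TEXT SEARCH API
  q1 <- paste("waterparks in ", list_of_states[j, 1], sep = "")
  response <- GET("https://maps.googleapis.com/",
                  path = "maps/api/place/textsearch/xml",
                  query = list(query = q1, key = "YOUR_API_KEY"))

  ## PARSE THE XML BODY AND GRAB EVERY <result> NODE
  result <- xmlParse(content(response, as = "text"))
  result2 <- getNodeSet(xmlRoot(result), "//result")

  ## WRITE ONE ROW PER RESULT INTO THE data FRAME
  for (i in seq_along(result2)) {
    data[counter, 1] <- xmlValue(result2[[i]][["name"]])
    data[counter, 2] <- xmlValue(result2[[i]][["formatted_address"]])
    data[counter, 3] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lat"]])
    data[counter, 4] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lng"]])
    rating_node <- result2[[i]][["rating"]]
    data[counter, 5] <- if (is.null(rating_node)) NA_character_ else xmlValue(rating_node)
    counter <- counter + 1
  }
}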


projects

Jake on Strava - R Shiny App

I created a Shiny app that grabs my running, riding, and other exercise stats from Strava and creates some simple visualizations.

Twitter Analysis - R Shiny App

I created a Shiny app that searches Twitter and does some simple analysis.

Jake Learns Data Science Visitor Dashboard

A quick view of visitors to my website. Data pulled from Google Analytics and pushed to Amazon Redshift using Stitch Data.

GAP RE-MINDER

A demonstration D3 project, shamelessly ripping off Gapminder.

R

Visualizing Exercise Data from Strava

INTRODUCTION My wife introduced me to cycling in 2014 - I fell in love with it and went all in. That first summer after buying my bike, I rode over 500 miles (more on that below). My neighbors at the time, also cyclists, introduced me to the app Strava. Ever since then, I’ve tracked all of my rides, runs, hikes, walks (perhaps not really exercise that needs to be tracked… but I hurt myself early in 2018 and that’s all I could do for a while), etc. Everything I could track, I tracked. I got curious and found a package, rStrava, that lets me download all of my activity. Once I had it, I put it into a few visualizations. ESTABLISH STRAVA AUTHENTICATION The first thing I had to do was set up a Strava account and application. I found some really nice instructions on another blog that helped walk me through this. After that, I installed rStrava and set up authentication (you only have to do this the first time). ## INSTALLING THE NECESSARY PACKAGES install.packages("devtools") devtools::install_github('fawda123/rStrava') ## LOAD THE LIBRARY library(rStrava) ## ESTABLISH THE APP CREDENTIALS name <- 'jakelearnsdatascience' client_id <- '31528' secret <- 'MY_SECRET_KEY' ## CREATE YOUR STRAVA TOKEN token <- httr::config(token = strava_oauth(name, client_id, secret, app_scope = "read_all", cache = TRUE)) ## cache = TRUE is optional - but it saves your token to the working directory GET MY EXERCISE DATA Now that authentication is set up, using the rStrava package to pull activity data is relatively straightforward. library(rStrava) ## LOAD THE TOKEN (AFTER THE FIRST TIME) stoken <- httr::config(token = readRDS(oauth_location)[[1]]) ## GET STRAVA DATA USING rStrava FUNCTION FOR MY ATHLETE ID my_act <- get_activity_list(stoken) This function returns a list of activities (class(my_act) is "list"). In my case, there are 379 activities. FORMATTING THE DATA To make the data easier to work with, I convert it to a data frame. There are many more fields than I’ve selected below - these are all I want for this post. info_df <- data.frame() for(act in 1:length(my_act)){ tmp <- my_act[[act]] tmp_df <- data.frame(name = tmp$name, type = tmp$type, distance = tmp$distance, moving_time = tmp$moving_time, elapsed_time = tmp$elapsed_time, start_date = tmp$start_date_local, total_elevation_gain = tmp$total_elevation_gain, trainer = tmp$trainer, manual = tmp$manual, average_speed = tmp$average_speed, max_speed = tmp$max_speed) info_df <- rbind(info_df, tmp_df) } I want to convert a few fields to units that make more sense for me (miles, feet, hours instead of meters and seconds). I’ve also created a number of features, though I’ve suppressed the code here (a rough sketch of the conversions is shown at the end of this post). You can see all of the code on github. HOW FAR HAVE I GONE? Since August 08, 2014, I have - under my own power - traveled 1300.85 miles. There were a few periods without much action (a whole year from mid-2016 through late 2017), which is a bit sad. The last few months have been good, though. Here’s a similar view, but split by activity. I’ve been running recently. I haven’t really ridden my bike since the first 2 summers I had it. I rode the Peloton when we first got it, but not since. I was a walker when I first tore the labrum in my hip in early 2018. Finally, here’s the same data again, but split up in a ridgeplot. SUMMARY There’s a TON of data that is returned by the Strava API. This blog just scratches the surface of analysis that is possible - mostly I am just introducing how to get the data and get up and running. 
As a new year’s resolution, I’ve committed to run 312 miles this year. That is 6 miles per week for 52 weeks (for those trying to wrap their head around the weird number). Now that I’ve been able to pull this data, I’ll have to set up a tracker/dashboard for that data. More to come!
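Since the conversion/feature code is suppressed above, here is a rough sketch of the kinds of unit conversions described (Strava reports meters and seconds). The new column names and the start_date parsing are my own placeholders, not necessarily what the real script uses.

## CONVERT RAW STRAVA UNITS (METERS / SECONDS) TO MILES, FEET, AND HOURS
info_df$distance_mi <- info_df$distance / 1609.34
info_df$elev_gain_ft <- info_df$total_elevation_gain * 3.28084
info_df$moving_time_hr <- info_df$moving_time / 3600
info_df$elapsed_time_hr <- info_df$elapsed_time / 3600

## TURN THE START DATE STRING INTO A DATE FOR TIME-BASED PLOTS
info_df$start_date <- as.Date(substr(as.character(info_df$start_date), 1, 10))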


Using R and Splunk: Lookups of More Than 10,000 Results

Splunk, for some probably very good reasons, has limits on how many results are returned by sub-searches (which in turn limits us on lookups, too). Because of this, I’ve used R to search Splunk through its API endpoints (using the httr package) and utilize loops, the plyr package, and other data manipulation flexibilities given through the use of R. This has allowed me to answer some questions for our business team that at the surface seem simple enough, but the data gathering and manipulation get either too complex or large for Splunk to handle efficiently. Here are some examples: Of the 1.5 million customers we’ve emailed in a marketing campaign, how many of them have made the conversion? How are our 250,000 beta users accessing the platform? Who are the users logging into our system from our internal IPs? The high level steps to using R and Splunk are: import the lookup values of concern as a csv, create the lookup as a string, create the search string including the lookup just created, execute the GET to get the data, and read the response into a data table. I’ve taken this one step further; because my lookups are usually LARGE, I end up breaking up the search into smaller chunks and combining the results at the end. Here is some example code that you can edit to show what I’ve done and how I’ve done it. This bit of code will iteratively run the “searchstring” 250 times and combine the results. ## LIBRARY THAT ENABLES THE HTTPS CALL ## library(httr) ## READ IN THE LOOKUP VALUES OF CONCERN ## mylookup <- read.csv("mylookup.csv", header = FALSE) ## ARBITRARY "CHUNK" SIZE TO KEEP SEARCHES SMALLER ## start <- 1 end <- 1000 ## CREATE AN EMPTY DATA FRAME THAT WILL HOLD END RESULTS ## alldata <- data.frame() ## HOW MANY "CHUNKS" WILL NEED TO BE RUN TO GET COMPLETE RESULTS ## for(i in 1:250){ ## CREATES THE LOOKUP STRING FROM THE FIRST COLUMN OF THE mylookup VARIABLE ## lookupstring <- paste(mylookup[start:end, 1], sep = "", collapse = '" OR VAR_NAME="') ## CREATES THE SEARCH STRING; THIS IS A SIMPLE SEARCH EXAMPLE ## searchstring <- paste('index = "my_splunk_index" (VAR_NAME="', lookupstring, '") | stats count BY VAR_NAME', sep = "") ## RUNS THE SEARCH; SUB IN YOUR SPLUNK LINK, USERNAME, AND PASSWORD ## response <- GET("https://our.splunk.link:8089/", path = "servicesNS/admin/search/search/jobs/export", encode="form", config(ssl_verifyhost=FALSE, ssl_verifypeer=0), authenticate("USERNAME", "PASSWORD"), query=list(search=paste0("search ", searchstring, collapse="", sep=""), output_mode="csv")) ## CHANGES THE RESULTS TO A DATA TABLE ## result <- read.table(text = content(response, as = "text"), sep = ",", header = TRUE, stringsAsFactors = FALSE) ## BINDS THE CURRENT RESULTS WITH THE OVERALL RESULTS ## alldata <- rbind(alldata, result) ## UPDATES THE START POINT start <- end + 1 ## UPDATES THE END POINT, BUT MAKES SURE IT DOESN'T GO TOO FAR ## if((end + 1000) > nrow(mylookup)){ end <- nrow(mylookup) } else { end <- end + 1000 } ## FOR TROUBLESHOOTING, I PRINT THE ITERATION ## #print(i) } ## WRITES THE RESULTS TO A CSV ## write.table(alldata, "mydata.csv", row.names = FALSE, sep = ",") So - that is how you do a giant lookup against Splunk data with R! I am sure that there are more efficient ways of doing this, even in the Splunk app itself, but this has done the trick for me!


Sierpinski Triangles (and Carpets) in R

Recently in class, I was asked the following question: Start with an equilateral triangle and a point chosen at random from the interior of that triangle. Label one vertex 1, 2, a second vertex 3, 4, and the last vertex 5, 6. Roll a die to pick a vertex. Place a dot at the point halfway between the roll-selected vertex and the point you chose. Now consider this new dot as a starting point to do this experiment once again. Roll the die to pick a new vertex. Place a dot at the point halfway between the last point and the most recent roll-selected vertex. Continue this procedure. What does the shape of the collection of dots look like? I thought, well - it’s got to be something cool or else the professor wouldn’t ask, but I can’t imagine it will be more than a cloud of dots. Truth be told, I went to a conference for work the week of this assignment and never did it - but when I went to the next class, IT WAS SOMETHING COOL! It turns out that this creates a Sierpinski Triangle - a fractal of increasingly smaller triangles. I wanted to check this out for myself, so I built an R script that creates the triangle. I ran it a few times with differing amounts of points. Here is one with 50,000 points. Though this post is written in RStudio, I’ve hidden the code for readability. Actual code for this can be found here. I thought - if equilateral triangles create patterns this cool, a square must be amazing! Well… it is, however you can’t just run this logic - it will return a cloud of random dots… After talking with my professor, Dr. Levitan - it turns out you can get something equally awesome as the Sierpinski triangle with a square; you just need to make a few changes (say this with a voice of authority and calm knowingness): Instead of 3 points to move to, you need 8 points: the 4 corners of a specified square and the midpoints between each side. Also, instead of taking the midpoint of your move to the specified location, you need to take the tripoint (division by 3 instead of 2). This is called a Sierpinski Carpet - a fractal of squares (as opposed to a fractal of equilateral triangles in the graph above). You can see in both the triangle and square that the same pattern is repeated time and again in smaller and smaller increments. I updated my R script and voila - MORE BEAUTIFUL MATH! Check out the script and run the functions yourself! I only spent a little bit of time putting it together - I think it would be cool to add some other features, especially when it comes to the plotting of the points. Also - I’d like to run it for a million or more points… I just lacked the patience to wait out the script to run for that long (50,000 points took about 30 minutes to run - my script is probably not the most efficient). Anyways - really cool to see what happens in math sometimes - its hard to imagine at first that the triangle would look that way. Another reason math is cool!
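Since the script itself is hidden above (see the github link), here is a bare-bones sketch of the chaos-game procedure the assignment describes - pick a starting point, repeatedly "roll the die" to choose a vertex, and jump halfway toward it:

## CHAOS GAME SKETCH FOR THE SIERPINSKI TRIANGLE:
## REPEATEDLY MOVE HALFWAY TOWARD A RANDOMLY CHOSEN VERTEX AND RECORD EACH POINT
sierpinski <- function(n_points = 50000) {
  ## VERTICES OF AN EQUILATERAL TRIANGLE
  vertices <- matrix(c(0, 0,
                       1, 0,
                       0.5, sqrt(3) / 2),
                     ncol = 2, byrow = TRUE)
  pts <- matrix(NA_real_, nrow = n_points, ncol = 2)
  current <- c(0.25, 0.25)   ## ANY STARTING POINT INSIDE THE TRIANGLE WORKS
  for (i in 1:n_points) {
    v <- vertices[sample(1:3, 1), ]   ## "ROLL THE DIE" TO PICK A VERTEX
    current <- (current + v) / 2      ## MOVE HALFWAY TOWARD IT
    pts[i, ] <- current
  }
  plot(pts, pch = ".", axes = FALSE, xlab = "", ylab = "")
}

sierpinski()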

Identifying Compromised User Accounts with Logistic Regression

INTRODUCTION As a Data Analyst on Comcast’s Messaging Engineering team, it is my responsibility to report on the platform statuses, identify irregularities, measure impact of changes, and identify policies to ensure that our system is used as it was intended. Part of the last responsibility is the identification and remediation of compromised user accounts. The challenge the company faces is being able to detect account compromises faster and remediate them closer to the moment of detection. This post will focus on the methodology and process for modeling the criteria to best detect compromised user accounts in near real-time from outbound email activity. For obvious reasons, I am only going to speak to the methodologies used; I’ll be vague when it comes to the actual criteria we used. DATA COLLECTION AND CLEANING Without getting into the finer details of email delivery, there are about 43 terminating actions an email can take when it is sent out of our platform. A message can be dropped for a number of reasons. These are things like the IP or user being on any number of block lists, triggering our spam filters, and other abusive behaviors. The other side of that is that the message will be delivered to its intended recipient. That said, I was able to create a usage profile for all of our outbound senders in small chunks of time in Splunk (our machine log collection tool of choice). This profile gives a summary per user of how often the messages they sent hit each of the terminating actions described above. In order to label my training data, I matched this usage data to our current compromised detection lists. I created a script in python that added an additional column to the data. If an account was flagged as compromised with our current criteria, it was given a one; if not, a zero. With the data collected, I am ready to determine the important inputs. DETERMINING INPUTS FOR THE MODEL In order to determine the important variables in the data, I created a Binary Regression Tree in R using the rpart library. The Binary Regression Tree iterates over the data and “splits” it in order to group the data to get compromised accounts together and non-compromised accounts together. It is also a nice way to visualize the data. You can see in the picture below what this looks like. Because the data is so large, I limited the data to one-day chunks. I then ran this regression tree against each day separately. From that, I was able to determine that there are 6 important variables (4 of which showed up in every regression tree I created; the other 2 showed up in a majority of trees). You can determine the “important” variables by looking in the summary for the number of splits per variable. BUILDING THE MODEL Now that I have the important variables, I created a python script to build the Logistic Regression Model from them. Using the statsmodels package, I was able to build the model. All of my input variables were highly significant. I took the logistic regression equation with the coefficients given in the model back to Splunk and tested this on incoming data to see what would come out. I quickly found that it got many accounts that were really compromised. There were also some accounts being discovered that looked like brute force attacks that never got through - to adjust for that, I added a constraint to the model that the user must have done at least one terminating action that ensured they authenticated successfully (this rules out users coming from a ton of IPs but failing authentication every time). 
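To make the variable-selection step a bit more concrete, below is a stripped-down sketch of what one of those daily rpart runs could look like in R. The outcome flag and the feature names are hypothetical placeholders, since the real criteria are deliberately not shared in the post.

library(rpart)

## FIT A CLASSIFICATION TREE ON ONE DAY OF PER-USER USAGE PROFILES
## (compromised IS THE 1/0 LABEL; THE OTHER COLUMNS ARE PLACEHOLDER
##  COUNTS OF TERMINATING ACTIONS)
tree_fit <- rpart(compromised ~ delivered + dropped_block_list + dropped_spam + failed_auth,
                  data = day_chunk, method = "class")

## THE SUMMARY (AND THE NUMBER OF SPLITS PER VARIABLE) POINTS TO THE IMPORTANT INPUTS
summary(tree_fit)
tree_fit$variable.importance

## QUICK LOOK AT THE TREE ITSELF
plot(tree_fit)
text(tree_fit)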
CONCLUSION First and foremost, this writeup was intended to be a very high-level summary explaining the steps I took to get my final model. What isn’t explained here is how many models I built that were less successful. Though this combination worked for me in the end, you’ll likely need to iterate over the process a number of times to get something successful. The new detection method for compromised accounts is an opportunity for us to expand our compromise detection and do it in a more real-time manner. It is also a foundation for future detection techniques for malicious IPs and other actors. With this new method, we will be able to expand the activity types used for compromise detection beyond outbound email activity to things like preference changes, password resets, changes to forwarding addresses, and even application activity outside of the email platform.

Doing a Sentiment Analysis on Tweets (Part 2)

INTRO This post is a continuation of my last post. There I pulled tweets from Twitter related to “Comcast email,” got rid of the junk, and removed the unnecessary/unwanted data. Now that I have the tweets, I will further clean the text and subject it to two different analyses: emotion and polarity. WHY DOES THIS MATTER Before I get started, I thought it might be a good idea to talk about WHY I am doing this (besides the fact that I learned a new skill and want to show it off and get feedback). This yet incomplete project was devised for two reasons: Understand the overall customer sentiment about the product I support Create an early warning system to help identify when things are going wrong on the platform Keeping the customer voice at the forefront of everything we do is paramount to providing the best experience for the users of our platform. Identifying trends in sentiment and emotion can help inform the team in many ways, including seeing the reaction to new features/releases (i.e. – seeing a rise in comments about a specific addition from a release), identifying needed changes to current functionality (i.e. – users who continually comment about a specific behavior of the application), and improvements to user experience (i.e. – trends in comments about being unable to find a certain feature on the site). Secondarily, this analysis can act as an early warning system when there are issues with the platform (i.e. – a sudden spike in comments about the usability of a mobile device). Now that I’ve explained why I am doing this (which I probably should have done in this sort of detail in the first post), let’s get into how it is actually done… STEP ONE: STRIPPING THE TEXT FOR ANALYSIS There are a number of things included in tweets that don’t matter for the analysis. Things like twitter handles, URLs, punctuation… they are not necessary to do the analysis (in fact, they may well confound it). This bit of code handles that cleanup. For those following the scripts on GitHub, this is part of my tweet_clean.R script. Also, to give credit where it is due: I’ve borrowed and tweaked the code from Andy Bromberg’s blog to do this task. library(stringr) ##Does some of the text editing ##Cleaning up the data some more (just the text now) ##First grabbing only the text text <- paredTweetList$Tweet # remove retweet entities text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", text) # remove at people text <- gsub("@\\w+", "", text) # remove punctuation text <- gsub("[[:punct:]]", "", text) # remove numbers text <- gsub("[[:digit:]]", "", text) # remove html links text <- gsub("http\\w+", "", text) # define "tolower error handling" function try.error <- function(x) { # create missing value y <- NA # tryCatch error try_error <- tryCatch(tolower(x), error=function(e) e) # if not an error if (!inherits(try_error, "error")) y <- tolower(x) # result return(y) } # lower case using try.error with sapply text <- sapply(text, try.error) # remove NAs in text text <- text[!is.na(text)] # remove column names names(text) <- NULL STEP TWO: CLASSIFYING THE EMOTION FOR EACH TWEET So now the text is just that: only text. The punctuation, links, handles, etc. have been removed. Now it is time to estimate the emotion of each tweet. Through some research, I found that there are many posts/sites on Sentiment Analysis/Emotion Classification that use the “Sentiment” package in R.
I thought: “Oh great – a package tailor-made to solve the problem for which I want an answer.” The problem is that this package has been deprecated and removed from the CRAN library. To get around this, I downloaded the archived package and pulled the code for doing the emotion classification. With some minor tweaks, I was able to get it going. This can be seen in its entirety in the classify_emotion.R script. You can also see the “made for the internet” version here: library(RTextTools) library(tm) algorithm <- "bayes" prior <- 1.0 verbose <- FALSE matrix <- create_matrix(text) lexicon <- read.csv("./data/emotions.csv.gz",header=FALSE) counts <- list(anger=length(which(lexicon[,2]=="anger")), disgust=length(which(lexicon[,2]=="disgust")), fear=length(which(lexicon[,2]=="fear")), joy=length(which(lexicon[,2]=="joy")), sadness=length(which(lexicon[,2]=="sadness")), surprise=length(which(lexicon[,2]=="surprise")), total=nrow(lexicon)) documents <- c() for (i in 1:nrow(matrix)) { if (verbose) print(paste("DOCUMENT",i)) scores <- list(anger=0,disgust=0,fear=0,joy=0,sadness=0,surprise=0) doc <- matrix[i,] words <- findFreqTerms(doc,lowfreq=1) for (word in words) { for (key in names(scores)) { emotions <- lexicon[which(lexicon[,2]==key),] index <- pmatch(word,emotions[,1],nomatch=0) if (index > 0) { entry <- emotions[index,] category <- as.character(entry[[2]]) count <- counts[[category]] score <- 1.0 if (algorithm=="bayes") score <- abs(log(score*prior/count)) if (verbose) { print(paste("WORD:",word,"CAT:", category,"SCORE:",score)) } scores[[category]] <- scores[[category]]+score } } } if (algorithm=="bayes") { for (key in names(scores)) { count <- counts[[key]] total <- counts[["total"]] score <- abs(log(count/total)) scores[[key]] <- scores[[key]]+score } } else { for (key in names(scores)) { scores[[key]] <- scores[[key]]+0.000001 } } best_fit <- names(scores)[which.max(unlist(scores))] if (best_fit == "disgust" && as.numeric(unlist(scores[2]))-3.09234 < .01) best_fit <- NA documents <- rbind(documents, c(scores$anger, scores$disgust, scores$fear, scores$joy, scores$sadness, scores$surprise, best_fit)) } colnames(documents) <- c("ANGER", "DISGUST", "FEAR", "JOY", "SADNESS", "SURPRISE", "BEST_FIT") Here is a sample output from this code: ANGER DISGUST FEAR JOY SADNESS SURPRISE BEST_FIT “1.46871776464786” “3.09234031207392” “2.06783599555953” “1.02547755260094” “7.34083555412328” “7.34083555412327” “sadness” “7.34083555412328” “3.09234031207392” “2.06783599555953” “1.02547755260094” “1.7277074477352” “2.78695866252273” “anger” “1.46871776464786” “3.09234031207392” “2.06783599555953” “1.02547755260094” “7.34083555412328” “7.34083555412328” “sadness” Here you can see that the initial author is using naive Bayes (which honestly I don’t yet understand) to analyze the text. I wanted to show a quick snippet of how the analysis is being done “under the hood.” For my purposes though, I only care about the emotion output and the tweet it was calculated from. emotion <- documents[, "BEST_FIT"] This variable, emotion, is returned by the classify_emotion.R script. CHALLENGES OBSERVED In addition to not fully understanding the code, the emotion classification seems to only work OK (which is pretty much expected… this is a canned analysis that hasn’t been tailored to my analysis at all). I’d like to come back to this one day to see if I can do a better job analyzing the emotions of the tweets. STEP THREE: CLASSIFYING THE POLARITY OF EACH TWEET Similarly to what we did in step two, I will use the cleaned text to analyze the polarity of each tweet.
This code is also from the old R package titled “Sentiment.” As with above, I was able to get the code working with only some minor tweaks. This can be seen in its entirety in the classify_polarity.R script. Here it is, too: algorithm <- "bayes" pstrong <- 0.5 pweak <- 1.0 prior <- 1.0 verbose <- FALSE matrix <- create_matrix(text) lexicon <- read.csv("./data/subjectivity.csv.gz",header=FALSE) counts <- list(positive=length(which(lexicon[,3]=="positive")), negative=length(which(lexicon[,3]=="negative")), total=nrow(lexicon)) documents <- c() for (i in 1:nrow(matrix)) { if (verbose) print(paste("DOCUMENT",i)) scores <- list(positive=0,negative=0) doc <- matrix[i,] words <- findFreqTerms(doc, lowfreq=1) for (word in words) { index <- pmatch(word,lexicon[,1],nomatch=0) if (index > 0) { entry <- lexicon[index,] polarity <- as.character(entry[[2]]) category <- as.character(entry[[3]]) count <- counts[[category]] score <- pweak if (polarity == "strongsubj") score <- pstrong if (algorithm=="bayes") score <- abs(log(score*prior/count)) if (verbose) { print(paste("WORD:", word, "CAT:", category, "POL:", polarity, "SCORE:", score)) } scores[[category]] <- scores[[category]]+score } } if (algorithm=="bayes") { for (key in names(scores)) { count <- counts[[key]] total <- counts[["total"]] score <- abs(log(count/total)) scores[[key]] <- scores[[key]]+score } } else { for (key in names(scores)) { scores[[key]] <- scores[[key]]+0.000001 } } best_fit <- names(scores)[which.max(unlist(scores))] ratio <- as.integer(abs(scores$positive/scores$negative)) if (ratio==1) best_fit <- "neutral" documents <- rbind(documents,c(scores$positive, scores$negative, abs(scores$positive/scores$negative), best_fit)) if (verbose) { print(paste("POS:", scores$positive,"NEG:", scores$negative, "RATIO:", abs(scores$positive/scores$negative))) cat("\n") } } colnames(documents) <- c("POS","NEG","POS/NEG","BEST_FIT") Here is a sample output from this code: POS NEG POS/NEG BEST_FIT “1.03127774142571” “0.445453222112551” “2.31512017476245” “positive” “1.03127774142571” “26.1492093145274” “0.0394381997949273” “negative” “17.9196623384892” “17.8123396772424” “1.00602518608961” “neutral” Again, I just wanted to show a quick snippet of how the analysis is being done “under the hood.” I only care about the polarity output and the tweet it was calculated from. polarity <- documents[, "BEST_FIT"] This variable, polarity, is returned by the classify_polarity.R script. CHALLENGES OBSERVED As with above, this is a stock analysis and hasn’t been tweaked for my needs. The analysis does OK, but I want to come back to this again one day to see if I can do better. QUICK CONCLUSION So… Now I have the emotion and polarity for each tweet. This can be useful to see on its own, but I think it is more worthwhile in aggregate. In my next post, I’ll show that. In that post, I’ll also show an analysis of the word counts with a wordcloud… This gets into the secondary point of this analysis. Hypothetically, I’d like to see common issues bubbled up through the wordcloud.
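As a quick, illustrative addition (this is not part of the original scripts), the two returned vectors can be combined and tabulated for the kind of aggregate view mentioned above. Both vectors come from loops over the same document matrix, so they line up row for row.

## combine the classification results and take a first aggregate look
results <- data.frame(emotion = emotion, polarity = polarity,
                      stringsAsFactors = FALSE)
table(results$polarity)                   ## counts by polarity
table(results$emotion, useNA = "ifany")   ## counts by emotion (NAs are possible)
table(results$emotion, results$polarity)  ## emotion vs. polarity cross-tab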

Doing a Sentiment Analysis on Tweets (Part 1)

INTRO So… This post is my first foray into the R twitteR package. This post assumes that you have that package installed already in R. I show here how to get tweets from Twitter in preparation for doing some sentiment analysis. My next post will be the actual sentiment analysis. For this example, I am grabbing tweets related to “Comcast email.” My goal for this exercise is to see how people are feeling about the product I support. STEP 1: GETTING AUTHENTICATED TO TWITTER First, you’ll need to create an application at Twitter. I used this blog post to get rolling with that. This post does a good job walking you through the steps to do that. Once you have your app created, this is the code I used to create and save my authentication credentials. Once you’ve done this, you need only load your credentials in the future to authenticate with Twitter. library(twitteR) ## R package that does some of the Twitter API heavy lifting consumerKey <- "INSERT YOUR KEY HERE" consumerSecret <- "INSERT YOUR SECRET HERE" reqURL <- "https://api.twitter.com/oauth/request_token" accessURL <- "https://api.twitter.com/oauth/access_token" authURL <- "https://api.twitter.com/oauth/authorize" twitCred <- OAuthFactory$new(consumerKey = consumerKey, consumerSecret = consumerSecret, requestURL = reqURL, accessURL = accessURL, authURL = authURL) twitCred$handshake() save(twitCred, file="credentials.RData") STEP 2: GETTING THE TWEETS Once you have your authentication credentials set, you can use them to grab tweets from Twitter. The next snippets of code come from my scraping_twitter.R script, which you are welcome to see in its entirety on GitHub. ##Authentication load("credentials.RData") ##has my secret keys and shiz registerTwitterOAuth(twitCred) ##logs me in ##Get the tweets about "comcast email" to work with tweetList <- searchTwitter("comcast email", n = 1000) tweetList <- twListToDF(tweetList) ##converts that data we got into a data frame As you can see, I used the twitteR R package to authenticate and search Twitter. After getting the tweets, I converted the results to a data frame to make it easier to analyze the results. STEP 3: GETTING RID OF THE JUNK Many of the tweets returned by my initial search are totally unrelated to Comcast Email. An example of this would be: “I am selling something random… please email me at myemailaddress@comcast.net” The tweet above includes the words email and comcast, but has nothing to actually do with Comcast Email and the way the user feels about it, other than they use it for their business. So… based on some initial, manual analysis of the tweets, I’ve decided to pull those tweets with the phrases: “fix” AND “email” in them (in that order) “Comcast” AND “email” in them (in that order) “no email” in them Any tweet that comes from a source with “comcast” in the handle “Customer Service” AND “email” OR the reverse (“email” AND “Customer Service”) in them This is done with this code: ##finds the rows that have the phrase "fix ... email" in them fixemail <- grep("(fix.*email)", tweetList$text) ##finds the rows that have the phrase "comcast ...
email" in them comcastemail <- grep("[Cc]omcast.*email", tweetList$text) ##finds the rows that have the phrase "no email" in them noemail <- grep("no email", tweetList$text) ##finds the rows that originated from a Comcast twitter handle comcasttweet <- grep("[Cc]omcast", tweetList$screenName) ##finds the rows related to email and customer service custserv <- grep("[Cc]ustomer [Ss]ervice.*email|email.*[Cc]ustomer [Ss]ervice", tweetList$text) After pulling out the duplicates (some tweets may fall into multiple scenarios from above) and ensuring they are in order (as returned initially), I assign the relevant tweets to a new variable with only some of the returned columns. The returned columns are: text favorited favoriteCount replyToSN created truncated replyToSID id replyToUID statusSource screenName retweetCount isRetweet retweeted longitude latitude All I care about are: text created statusSource screenName This is handled through this tidbit of code: ##combine all of the "good" tweets row numbers that we greped out above and ##then sorts them and makes sure they are unique combined <- c(fixemail, comcastemail, noemail, comcasttweet, custserv) uvals <- unique(combined) sorted <- sort(uvals) ##pull the row numbers that we want, and with the columns that are important to ##us (tweet text, time of tweet, source, and username) paredTweetList <- tweetList[sorted, c(1, 5, 10, 11)] STEP 4: CLEAN UP THE DATA AND RETURN THE RESULTS Lastly, for this first script, I make the sources look nice, add titles, and return the final list (only a sample set of tweets shown): ##make the device source look nicer paredTweetList$statusSource <- sub("<.*\">", "", paredTweetList$statusSource) paredTweetList$statusSource <- sub("</a>", "", paredTweetList$statusSource) ##name the columns names(paredTweetList) <- c("Tweet", "Created", "Source", "ScreenName") paredTweetList Tweet created statusSource screenName Dear Mark I am having problems login into my acct REDACTED@comcast.net I get no email w codes to reset my password for eddygil HELP HELP 2014-12-23 15:44:27 Twitter Web Client riocauto @msnbc @nbc @comcast pay @thereval who incites the murder of police officers. Time to send them a message of BOYCOTT! Tweet/email them NOW 2014-12-23 14:52:50 Twitter Web Client Monty_H_Mathis Comcast, I have no email. This is bad for my small business. Their response “Oh, I’m sorry for that”. Problem not resolved. #comcast 2014-12-23 09:20:14 Twitter Web Client mathercesul CHALLENGES OBSERVED As you can see from the output, sometimes some “junk” still gets in. Something I’d like to continue working on is a more reliable algorithm for identifying appropriate tweets. I also am worried that my choice of subjects is biasing the sentiment.

random forest

Exploring Open Data - Predicting the Amount of Violations

Introduction In my last post, I went over some of the highlights of the open data set of all Philadelphia Parking Violations. In this post, I’ll go through the steps to build a model to predict the amount of violations the city issues on a daily basis. I’ll walk you through cleaning and building the data set, selecting and creating the important features, and building predictive models using Random Forests and Linear Regression. Step 1: Load Packages and Data Just an initial step to get the right libraries and data loaded in R. library(plyr) library(randomForest) ## DATA FILE FROM OPENDATAPHILLY ptix <- read.csv("Parking_Violations.csv") ## READ IN THE WEATHER DATA (FROM NCDC) weather_data <- read.csv("weather_data.csv") ## LIST OF ALL FEDERAL HOLIDAYS DURING THE ## RANGE OF THE DATA SET holidays <- as.Date(c("2012-01-02", "2012-01-16", "2012-02-20", "2012-05-28", "2012-07-04", "2012-09-03", "2012-10-08", "2012-11-12", "2012-11-22", "2012-12-25", "2013-01-01", "2013-01-21", "2013-02-18", "2013-05-27", "2013-07-04", "2013-09-02", "2013-10-14", "2013-11-11", "2013-11-28", "2013-12-25", "2014-01-01", "2014-01-20", "2014-02-17", "2014-05-26", "2014-07-04", "2014-09-01", "2014-10-13", "2014-11-11", "2014-11-27", "2014-12-25", "2015-01-01", "2015-01-09", "2015-02-16", "2015-05-25", "2015-07-03", "2015-09-07")) Step 2: Formatting the Data First things first, we have to total the amount of tickets per day from the raw data. For this, I use the plyr command ddply. Before I can use the ddply command, I need to format the Issue.Date.and.Time column to be a Date variable in the R context. days <- as.data.frame(as.Date( ptix$Issue.Date.and.Time, format = "%m/%d/%Y")) names(days) <- "DATE" count_by_day <- ddply(days, .(DATE), summarize, count = length(DATE)) Next, I do the same exact date formatting with the weather data. weather_data$DATE <- as.Date(as.POSIXct(strptime(as.character(weather_data$DATE), format = "%Y%m%d")), format = "%m/%d/%Y") Now that both the ticket and weather data have the same date format (and name), we can use the join function from the plyr package. count_by_day <- join(count_by_day, weather_data, by = "DATE") With the data joined by date, it is time to clean. There are a number of columns with unneeded data (weather station name, for example) and others with little or no data in them, which I just flatly remove. The data has also been coded with negative values representing that data had not been collected for any number of reasons (I’m not surprised that snow was not measured in the summer); for that data, I’ve made any values coded -9999 into 0. There are some days where the maximum or minimum temperature was not gathered (I’m not sure why). As this is the main variable I plan to use to predict daily violations, I drop the entire row if the temperature data is missing. 
## I DON'T CARE ABOUT THE STATION OR ITS NAME - ## GETTING RID OF IT count_by_day$STATION <- NULL count_by_day$STATION_NAME <- NULL ## A BUNCH OF VARIABLES ARE CODED WITH NEGATIVE VALUES ## IF THEY WEREN'T COLLECTED - CHANGING THEM TO 0s count_by_day$MDPR[count_by_day$MDPR < 0] <- 0 count_by_day$DAPR[count_by_day$DAPR < 0] <- 0 count_by_day$PRCP[count_by_day$PRCP < 0] <- 0 count_by_day$SNWD[count_by_day$SNWD < 0] <- 0 count_by_day$SNOW[count_by_day$SNOW < 0] <- 0 count_by_day$WT01[count_by_day$WT01 < 0] <- 0 count_by_day$WT03[count_by_day$WT03 < 0] <- 0 count_by_day$WT04[count_by_day$WT04 < 0] <- 0 ## REMOVING ANY ROWS WITH MISSING TEMP DATA count_by_day <- count_by_day[ count_by_day$TMAX > 0, ] count_by_day <- count_by_day[ count_by_day$TMIN > 0, ] ## GETTING RID OF SOME NA VALUES THAT POPPED UP count_by_day <- count_by_day[!is.na( count_by_day$TMAX), ] ## REMOVING COLUMNS THAT HAVE LITTLE OR NO DATA ## IN THEM (ALL 0s) count_by_day$TOBS <- NULL count_by_day$WT01 <- NULL count_by_day$WT04 <- NULL count_by_day$WT03 <- NULL ## CHANGING THE DATA, UNNECESSARILY, FROM 10ths OF ## DEGREES CELSIUS TO JUST DEGREES CELSIUS count_by_day$TMAX <- count_by_day$TMAX / 10 count_by_day$TMIN <- count_by_day$TMIN / 10 Step 3: Visualizing the Data At this point, we have joined our data sets and gotten rid of the unhelpful “stuff.” What does the data look like? Daily Violation Counts There are clearly two populations here. With the benefit of hindsight, the small population on the left of the histogram is mainly Sundays. The larger population with the majority of the data is all other days of the week. Let’s make some new features to explore this idea. Step 4: New Feature Creation As we see in the histogram above, there are obviously a few populations in the data - I know that day of the week, holidays, and month of the year likely have some strong influence on how many violations are issued. If you think about it, most parking signs include the clause: “Except Sundays and Holidays.” Plus, having spent more than a few summers in Philadelphia at this point, I know that from Memorial Day until Labor Day the city relocates to the South Jersey Shore (emphasis on the South part of the Jersey Shore). That said - I add in those features as predictors. ## FEATURE CREATION - ADDING IN THE DAY OF WEEK count_by_day$DOW <- as.factor(weekdays(count_by_day$DATE)) ## FEATURE CREATION - ADDING IN IF THE DAY WAS A HOLIDAY count_by_day$HOL <- 0 count_by_day$HOL[as.character(count_by_day$DATE) %in% as.character(holidays)] <- 1 count_by_day$HOL <- as.factor(count_by_day$HOL) ## FEATURE CREATION - ADDING IN THE MONTH count_by_day$MON <- as.factor(months(count_by_day$DATE)) Now - let’s see if the Sunday thing is real. Here is a scatterplot of the data. The circles represent Sundays; triangles are all other days of the week. Temperature vs. Ticket Counts You can clearly see that Sundays tend to do their own thing, but in a manner that is consistent with the rest of the week. In other words, the slope for Sundays is very close to the slope for all other days of the week; the overall level is just much lower. There are some points that don’t follow those trends, which are likely due to snow, holidays, and/or other man-made or weather events. Let’s split the data into a training and test set (that way we can see how well we do with the model). I’m arbitrarily making the test set the last year of data; everything before that is the training set.
train <- count_by_day[count_by_day$DATE < "2014-08-01", ] test <- count_by_day[count_by_day$DATE >= "2014-08-01", ] Step 5: Feature Identification We now have a data set that is ready for some model building! The problem to solve next is figuring out which features best explain the count of violations issued each day. My preference is to use Random Forests to tell me which features are the most important. We’ll also take a look to see which, if any, variables are highly correlated. High correlation amongst input variables will lead to high variability due to multicollinearity issues. featForest <- randomForest(count ~ MDPR + DAPR + PRCP + SNWD + SNOW + TMAX + TMIN + DOW + HOL + MON, data = train, importance = TRUE, ntree = 10000) ## PLOT THE VARIABLE TO SEE THE IMPORTANCE varImpPlot(featForest) In the Variable Importance Plot below, you can see very clearly that the day of the week (DOW) is by far the most important variable in describing the amount of violations written per day. This is followed by whether or not the day was a holiday (HOL), the minimum temperature (TMIN), and the month (MON). The maximum temperature is in there, too, but I think that it is likely highly correlated with the minimum temperature (we’ll see that next). The rest of the variables have very little impact. Variable Importance Plot cor(count_by_day[,c(3:9)]) I’ll skip the entire output of the correlation table, but TMIN and TMAX have a correlation coefficient of 0.940379171. Because TMIN has a higher variable importance and there is a high correlation between the TMIN and TMAX, I’ll leave TMAX out of the model. Step 6: Building the Models The goal here was to build a multiple linear regression model - since I’ve already started down the path of Random Forests, I’ll do one of those, too, and compare the two. To build the models, we do the following: ## BUILD ANOTHER FOREST USING THE IMPORTANT VARIABLES predForest <- randomForest(count ~ DOW + HOL + TMIN + MON, data = train, importance = TRUE, ntree = 10000) ## BUILD A LINEAR MODEL USING THE IMPORTANT VARIABLES linmod_with_mon <- lm(count ~ TMIN + DOW + HOL + MON, data = train) In looking at the summary, I have questions on whether or not the month variable (MON) is significant to the model or not. Many of the variables have rather high p-values. summary(linmod_with_mon) Call: lm(formula = count ~ TMIN + DOW + HOL + MON, data = train) Residuals: Min 1Q Median 3Q Max -4471.5 -132.1 49.6 258.2 2539.8 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5271.4002 89.5216 58.884 < 2e-16 *** TMIN -15.2174 5.6532 -2.692 0.007265 ** DOWMonday -619.5908 75.2208 -8.237 7.87e-16 *** DOWSaturday -788.8261 74.3178 -10.614 < 2e-16 *** DOWSunday -3583.6718 74.0854 -48.372 < 2e-16 *** DOWThursday 179.0975 74.5286 2.403 0.016501 * DOWTuesday -494.3059 73.7919 -6.699 4.14e-11 *** DOWWednesday -587.7153 74.0264 -7.939 7.45e-15 *** HOL1 -3275.6523 146.8750 -22.302 < 2e-16 *** MONAugust -99.8049 114.4150 -0.872 0.383321 MONDecember -390.2925 109.4594 -3.566 0.000386 *** MONFebruary -127.8091 112.0767 -1.140 0.254496 MONJanuary -73.0693 109.0627 -0.670 0.503081 MONJuly -346.7266 113.6137 -3.052 0.002355 ** MONJune -30.8752 101.6812 -0.304 0.761481 MONMarch -1.4980 94.8631 -0.016 0.987405 MONMay 0.1194 88.3915 0.001 0.998923 MONNovember 170.8023 97.6989 1.748 0.080831 . MONOctober 125.1124 92.3071 1.355 0.175702 MONSeptember 199.6884 101.9056 1.960 0.050420 . --- Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 544.2 on 748 degrees of freedom Multiple R-squared: 0.8445, Adjusted R-squared: 0.8405 F-statistic: 213.8 on 19 and 748 DF, p-value: < 2.2e-16 To verify this, I build the model without the MON term and then do an F-Test to compare using the results of the ANOVA tables below. ## FIRST ANOVA TABLE (WITH THE MON TERM) anova(linmod_with_mon) Analysis of Variance Table Response: count Df Sum Sq Mean Sq F value Pr(>F) TMIN 1 16109057 16109057 54.3844 4.383e-13 *** DOW 6 1019164305 169860717 573.4523 < 2.2e-16 *** HOL 1 147553631 147553631 498.1432 < 2.2e-16 *** MON 11 20322464 1847497 6.2372 6.883e-10 *** Residuals 748 221563026 296207 ## SECOND ANOVA TABLE (WITHOUT THE MON TERM) anova(linmod_wo_mon) Analysis of Variance Table Response: count Df Sum Sq Mean Sq F value Pr(>F) TMIN 1 16109057 16109057 50.548 2.688e-12 *** DOW 6 1019164305 169860717 532.997 < 2.2e-16 *** HOL 1 147553631 147553631 463.001 < 2.2e-16 *** Residuals 759 241885490 318690 ## Ho: B9 = B10 = B11 = B12 = B13 = B14 = B15 = B16 = ## B17 = B18 = B19 = 0 ## Ha: At least one is not equal to 0 ## F-Stat = MSdrop / MSE = ## ((SSR1 - SSR2) / (DF(R)1 - DF(R)2)) / MSE f_stat <- ((241885490 - 221563026) / (759 - 748)) / 296207 ## P_VALUE OF THE F_STAT CALCULATED ABOVE p_value <- 1 - pf(f_stat, 11, 748) Since the P-Value 6.8829e-10 is MUCH MUCH less than 0.05, I can reject the null hypothesis and conclude that at least one of the parameters associated with the MON term is not zero. Because of this, I’ll keep the term in the model. Step 7: Apply the Models to the Test Data Below I call the predict function to see how the Random Forest and Linear Model predict the test data. I am rounding the prediction to the nearest integer. To determine which model performs better, I am calculating the difference in absolute value of the predicted value from the actual count. ## PREDICT THE VALUES BASED ON THE MODELS test$RF <- round(predict(predForest, test), 0) test$LM <- round(predict.lm(linmod_with_mon, test), 0) ## SEE THE ABSOLUTE DIFFERENCE FROM THE ACTUAL difOfRF <- sum(abs(test$RF - test$count)) difOfLM <- sum(abs(test$LM - test$count)) Conclusion As it turns out, the Linear Model performs better than the Random Forest model. I am relatively pleased with the Linear Model - an R-Squared value of 0.8445 ain’t nothin’ to shake a stick at. You can see that Random Forests are very useful in identifying the important features. To me, it tends to be a bit more of a “black box” in comparison the linear regression - I hesitate to use it at work for more than a feature identification tool. Overall - a nice little experiment and a great dive into some open data. I now know that PPA rarely takes a day off, regardless of the weather. I’d love to know how much of the fines they write are actually collected. I may also dive into predicting what type of ticket you received based on your location, time of ticket, etc. All in another day’s work! Thanks for reading.
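One footnote on the F-test above: the ANOVA comparison uses linmod_wo_mon, but its definition never appears in the post. From the surrounding text it is presumably the same linear model with the MON term dropped - an assumed reconstruction would look like this:

## reduced model without the MON term (assumed definition, not shown in the post)
linmod_wo_mon <- lm(count ~ TMIN + DOW + HOL, data = train)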

roadmap

A Data Scientist's Take on the Roadmap to AI

INTRODUCTION Recently I was asked by a former colleague about getting into AI. He has truly big data and wants to use this data to power “AI” - if the headlines are to be believed, everyone else is already doing it. Though it was difficult for my ego, I told him I couldn’t help him in our 30-minute call and that he should think about hiring someone to get him there. The truth was I really didn’t have a solid answer for him in the moment. This was truly disappointing - in my current role and in my previous role, I put predictive models into production. After thinking about it for a bit, there is definitely a similar path I took in both roles. There are three steps, in my mind, to getting to “AI.” Though this seems simple, it is a long process and potentially not linear - you may have to keep coming back to previous steps. Baseline (Reporting) Understand (Advanced Analytics) Artificial Intelligence (Data Science) BASELINE (REPORTING) Fun fact: You cannot effectively predict anything if you cannot measure the impact. What I mean by baseline is building out a reporting suite. Having a fundamental understanding of your business and environment is key. Without doing this step, you may try to predict the wrong thing entirely - or start with something that isn’t the most impactful. For me, this step started with finding the data in the first place. Perhaps, like my colleague, you have lots of data and you’re ready to jump in. That’s great and makes getting started that much more straightforward. In my role, I joined a finance team that really didn’t have a good bead on this - finding the data was difficult (and getting the owners of that data to give me access was a process as well). To be successful, start small and iterate. Our first reports were built from manually downloading machine logs, processing them in R with JSON packages, and turning them into a black-and-white document. It was ugly, but it helped us know what we needed to know in that moment - oh yeah… it was MUCH better than nothing. “Don’t let perfection be the enemy of good.” - paraphrased from Voltaire. From this, I gained access to our organization’s data warehouse, put automation in place, and purchased some Tableau licenses. This phase took a few months and is constantly being refined, but we are now able to see the impact of our decisions at a glance. This new understanding inevitably leads to more questions - cue step 2: Understanding. UNDERSTANDING (ADVANCED ANALYTICS) If you have never circulated reports and dashboards to others… let me fill you in on something: it will ALWAYS lead to additional, progressively harder questions. This step is an investment in time and expertise - you have to commit to having dedicated resource(s) (read: people… it is inhumane to call people resources, and you may only need one person or some of a full-time person’s time). Why did X go up unexpectedly (breaking the current trend)? Are we over-indexing on this type of customer? Right before our customers leave, this weird thing happens - what is this weird thing and why is it happening? Like the previous step - this will be ongoing. Investing in someone to do advanced analytics will help you to understand the fine details of your business AND … (drum roll) … will help you to understand which part of your business is most ripe for “AI”! ARTIFICIAL INTELLIGENCE (DATA SCIENCE) It is at this point that you will be able to do real, bona fide data science.
A quick rant: Notice that I purposefully did not use the term “AI” (I know I used it throughout this article and even in the title of this section… what can I say - I am in tune with marketing concepts, too). “AI” is a term that is overused and rarely implemented. Data science, however, comes in many forms and can really transform your business. Here are a few ideas for what you can do with data science: Prediction/Machine Learning Testing Graph Analysis Perhaps you want to predict whether a sale is fraudulent or which existing customer is most apt to buy your new product? You can also test whether a new strategy works better than the old one. This requires that you use statistical concepts to ensure valid testing and results. My new obsession is around graph analysis. With graphs you can see relationships that may have been hidden before - this will enable you to identify new targets and enrich your understanding of your business! Data science is usually a very specific thing, and it takes many forms! SUMMARY Getting to data science is a process - it will take an investment. There are products out there that will help you shortcut some of these steps, and I encourage you to consider them. There are products to help with reporting, analytics, and data science. These should, in my very humble opinion, be used by people who are dedicated to the organization’s data, analytics, and science. Directions for data science - measure, analyze, predict, repeat!

ROC curves

Model Comparision - ROC Curves & AUC

INTRODUCTION Whether you are a data professional or in a job that requires data driven decisions, predictive analytics and related products (aka machine learning aka ML aka artificial intelligence aka AI) are here and understanding them is paramount. They are being used to drive industry. Because of this, understanding how to compare predictive models is very important. This post gets into a very popular method of decribing how well a model performs: the Area Under the Curve (AUC) metric. As the term implies, AUC is a measure of area under the curve. The curve referenced is the Reciever Operating Characteristic (ROC) curve. The ROC curve is a way to visually represent how the True Positive Rate (TPR) increases as the False Positive Rate (FPR) increases. In plain english, the ROC curve is a visualization of how well a predictive model is ordering the outcome - can it separate the two classes (TRUE/FALSE)? If not (most of the time it is not perfect), how close does it get? This last question can be answered with the AUC metric. THE BACKGROUND Before I explain, let’s take a step back and understand the foundations of TPR and FPR. For this post we are talking about a binary prediction (TRUE/FALSE). This could be answering a question like: Is this fraud? (TRUE/FALSE). In a predictive model, you get some right and some wrong for both the TRUE and FALSE. Thus, you have four categories of outcomes: True positive (TP): I predicted TRUE and it was actually TRUE False positive (FP): I predicted TRUE and it was actually FALSE True negative (TN): I predicted FALSE and it was actually FALSE False negative (FN): I predicted FALSE and it was actually TRUE From these, you can create a number of additional metrics that measure various things. In ROC Curves, there are two that are important: True Positive Rate aka Sensitivity (TPR): out of all the actual TRUE outcomes, how many did I predict TRUE? \(TPR = sensitivity = \frac{TP}{TP + FN}\) Higher is better! False Positive Rate aka 1 - Specificity (FPR): out of all the actual FALSE outcomes, how many did I predict TRUE? \(FPR = 1 - sensitivity = 1 - (\frac{TN}{TN + FP})\) Lower is better! BUILDING THE ROC CURVE For the sake of the example, I built 3 models to compare: Random Forest, Logistic Regression, and random prediction using a uniform distribution. Step 1: Rank Order Predictions To build the ROC curve for each model, you first rank order your predictions: Actual Predicted FALSE 0.9291 FALSE 0.9200 TRUE 0.8518 TRUE 0.8489 TRUE 0.8462 TRUE 0.7391 Step 2: Calculate TPR & FPR for First Iteration Now, we step through the table. Using a “cutoff” as the first row (effectively the most likely to be TRUE), we say that the first row is predicted TRUE and the remaining are predicted FALSE. From the table below, we can see that the first row is FALSE, though we are predicting it TRUE. This leads to the following metrics for our first iteration: Iteration TPR FPR Sensitivity Specificity True.Positive False.Positive True.Negative False.Negative 1 0 0.037 0 0.963 0 1 26 11 This is what we’d expect. We have a 0% TPR on the first iteration because we got that single prediction wrong. Since we’ve only got 1 false positve, our FPR is still low: 3.7%. Step 3: Iterate Through the Remaining Predictions Now, let’s go through all of the possible cut points and calculate the TPR and FPR. 
Actual Outcome Predicted Outcome Model Rank True Positive Rate False Positive Rate Sensitivity Specificity True Negative True Positive False Negative False Positive FALSE 0.9291 Logistic Regression 1 0.0000 0.0370 0.0000 0.9630 26 0 11 1 FALSE 0.9200 Logistic Regression 2 0.0000 0.0741 0.0000 0.9259 25 0 11 2 TRUE 0.8518 Logistic Regression 3 0.0909 0.0741 0.0909 0.9259 25 1 10 2 TRUE 0.8489 Logistic Regression 4 0.1818 0.0741 0.1818 0.9259 25 2 9 2 TRUE 0.8462 Logistic Regression 5 0.2727 0.0741 0.2727 0.9259 25 3 8 2 TRUE 0.7391 Logistic Regression 6 0.3636 0.0741 0.3636 0.9259 25 4 7 2 Step 4: Repeat Steps 1-3 for Each Model Calculate the TPR & FPR for each rank and model! Step 5: Plot the Results & Calculate AUC As you can see below, the Random Forest does remarkably well. It perfectly separated the outcomes in this example (to be fair, this is really small data and test data). What I mean is, when the data is rank ordered by the predicted likelihood of being TRUE, the actual outcome of TRUE are grouped together. There are no false positives. The Area Under the Curve (AUC) is 1 (\(area = hieght * width\) for a rectangle/square). Logistic Regression does well - ~80% AUC is nothing to sneeze at. The random prediction does just better than a coin flip (50% AUC), but this is just random chance and a small sample. SUMMARY The AUC is a very important metric for comparing models. To properly understand it, you need to understand the ROC curve and the underlying calculations. In the end, AUC is showing how well a model is at classifying. The better it can separate the TRUEs from the FALSEs, the closer to 1 the AUC will be. This means the True Positive Rate is increasing faster than the False Positive Rate. More True Positives is better than more False Positives in prediction.
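For readers who want to reproduce the step-through above, here is a minimal R sketch that rank-orders predictions, computes the TPR and FPR at every cutoff, and approximates the AUC with the trapezoid rule. The actual and predicted vectors below are toy placeholders - swap in your own labels and model scores.

## toy inputs - replace with your own outcomes and predicted probabilities
actual    <- c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE)
predicted <- c(0.93, 0.92, 0.85, 0.85, 0.85, 0.74, 0.40, 0.35)

## rank order by predicted likelihood of TRUE, most likely first
actual <- actual[order(predicted, decreasing = TRUE)]

## at cutoff k, the top k rows are called TRUE and the rest FALSE
tpr <- cumsum(actual)  / sum(actual)    ## true positive rate (sensitivity)
fpr <- cumsum(!actual) / sum(!actual)   ## false positive rate (1 - specificity)

## prepend the (0, 0) point and integrate with the trapezoid rule
tpr <- c(0, tpr)
fpr <- c(0, fpr)
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

plot(fpr, tpr, type = "l", xlab = "False Positive Rate", ylab = "True Positive Rate",
     main = paste("ROC Curve - AUC =", round(auc, 3)))
abline(0, 1, lty = 2)   ## the coin-flip reference line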


rStrava

Visualizing Exercise Data from Strava

INTRODUCTION My wife introduced me to cycling in 2014 - I fell in love with it and went all in. That first summer after buying my bike, I rode over 500 miles (more on that below). My neighbors at the time, also cyclists, introduced me to the app Strava. Ever since then, I’ve tracked all of my rides, runs, hikes, and walks (perhaps not really exercise that needs to be tracked… but I hurt myself early in 2018 and that’s all I could do for a while) - everything I could track, I tracked. I got curious and found a package, rStrava, with which I can download all of my activity. Once I had it, I put it into a few visualizations. ESTABLISH STRAVA AUTHENTICATION The first thing I had to do was set up a Strava account and application. I found some really nice instructions on another blog that helped walk me through this. After that, I installed rStrava and set up authentication (you only have to do this the first time). ## INSTALLING THE NECESSARY PACKAGES install.packages("devtools") devtools::install_github('fawda123/rStrava') ## LOAD THE LIBRARY library(rStrava) ## ESTABLISH THE APP CREDENTIALS name <- 'jakelearnsdatascience' client_id <- '31528' secret <- 'MY_SECRET_KEY' ## CREATE YOUR STRAVA TOKEN token <- httr::config(token = strava_oauth(name, client_id, secret, app_scope = "read_all", cache = TRUE)) ## cache = TRUE is optional - but it saves your token to the working directory GET MY EXERCISE DATA Now that authentication is set up, using the rStrava package to pull activity data is relatively straightforward. library(rStrava) ## LOAD THE TOKEN (AFTER THE FIRST TIME) stoken <- httr::config(token = readRDS(oauth_location)[[1]]) ## GET STRAVA DATA USING rStrava FUNCTION FOR MY ATHLETE ID my_act <- get_activity_list(stoken) This function returns a list of activities (class(my_act) is “list”). In my case, there are 379 activities. FORMATTING THE DATA To make the data easier to work with, I convert it to a data frame. There are many more fields than I’ve selected below - these are all I want for this post. info_df <- data.frame() for(act in 1:length(my_act)){ tmp <- my_act[[act]] tmp_df <- data.frame(name = tmp$name, type = tmp$type, distance = tmp$distance, moving_time = tmp$moving_time, elapsed_time = tmp$elapsed_time, start_date = tmp$start_date_local, total_elevation_gain = tmp$total_elevation_gain, trainer = tmp$trainer, manual = tmp$manual, average_speed = tmp$average_speed, max_speed = tmp$max_speed) info_df <- rbind(info_df, tmp_df) } I want to convert a few fields to units that make more sense for me (miles, feet, and hours instead of meters and seconds). I’ve also created a number of features, though I’ve suppressed the code here (a small illustrative sketch follows at the end of this post). You can see all of the code on GitHub. HOW FAR HAVE I GONE? Since August 08, 2014, I have - under my own power - traveled 1300.85 miles. There were a few periods without much action (a whole year from mid-2016 through late 2017), which is a bit sad. The last few months have been good, though. Here’s a similar view, but split by activity. I’ve been running recently. I haven’t really ridden my bike since the first 2 summers I had it. I rode the Peloton when we first got it, but not since. I was a walker when I first tore the labrum in my hip in early 2018. Finally, here’s the same data again, but split up in a ridgeplot. SUMMARY There’s a TON of data that is returned by the Strava API. This blog just scratches the surface of analysis that is possible - mostly I am just introducing how to get the data and get up and running.
As a new year’s resolution, I’ve committed to run 312 miles this year. That is 6 miles per week for 52 weeks (for those trying to wrap their head around the weird number). Now that I’ve been able to pull this data, I’ll have to set up a tracker/dashboard for that data. More to come!
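Since the conversion and feature-creation code is only summarized above (the full version lives on GitHub), here is a minimal illustrative sketch of the idea, assuming the info_df data frame built earlier. The exact features in the real script may differ.

## convert the API's meters and seconds into friendlier units (illustrative)
info_df$distance_miles <- info_df$distance * 0.000621371          ## meters -> miles
info_df$elevation_feet <- info_df$total_elevation_gain * 3.28084  ## meters -> feet
info_df$moving_hours   <- info_df$moving_time / 3600              ## seconds -> hours

## a couple of simple derived features
info_df$start_date <- as.POSIXct(as.character(info_df$start_date),
                                 format = "%Y-%m-%dT%H:%M:%S")
info_df$year <- format(info_df$start_date, "%Y")
info_df$mph  <- info_df$distance_miles / info_df$moving_hours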

RStudio

Sierpinski Triangles (and Carpets) in R

Recently in class, I was asked the following question: Start with an equilateral triangle and a point chosen at random from the interior of that triangle. Label one vertex 1, 2, a second vertex 3, 4, and the last vertex 5, 6. Roll a die to pick a vertex. Place a dot at the point halfway between the roll-selected vertex and the point you chose. Now consider this new dot as a starting point to do this experiment once again. Roll the die to pick a new vertex. Place a dot at the point halfway between the last point and the most recent roll-selected vertex. Continue this procedure. What does the shape of the collection of dots look like? I thought, well - it’s got to be something cool or else the professor wouldn’t ask, but I can’t imagine it will be more than a cloud of dots. Truth be told, I went to a conference for work the week of this assignment and never did it - but when I went to the next class, IT WAS SOMETHING COOL! It turns out that this creates a Sierpinski Triangle - a fractal of increasingly smaller triangles. I wanted to check this out for myself, so I built an R script that creates the triangle. I ran it a few times with differing amounts of points. Here is one with 50,000 points. Though this post is written in RStudio, I’ve hidden the code for readability. Actual code for this can be found here. I thought - if equilateral triangles create patterns this cool, a square must be amazing! Well… it is, however you can’t just run this logic - it will return a cloud of random dots… After talking with my professor, Dr. Levitan - it turns out you can get something equally awesome as the Sierpinski triangle with a square; you just need to make a few changes (say this with a voice of authority and calm knowingness): Instead of 3 points to move to, you need 8 points: the 4 corners of a specified square and the midpoints between each side. Also, instead of taking the midpoint of your move to the specified location, you need to take the tripoint (division by 3 instead of 2). This is called a Sierpinski Carpet - a fractal of squares (as opposed to a fractal of equilateral triangles in the graph above). You can see in both the triangle and square that the same pattern is repeated time and again in smaller and smaller increments. I updated my R script and voila - MORE BEAUTIFUL MATH! Check out the script and run the functions yourself! I only spent a little bit of time putting it together - I think it would be cool to add some other features, especially when it comes to the plotting of the points. Also - I’d like to run it for a million or more points… I just lacked the patience to wait out the script to run for that long (50,000 points took about 30 minutes to run - my script is probably not the most efficient). Anyways - really cool to see what happens in math sometimes - its hard to imagine at first that the triangle would look that way. Another reason math is cool!
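Since the plotting code is hidden in this post (the full version is linked above), here is a minimal standalone sketch of the chaos game described: pick a rough starting point, repeatedly pick a random vertex, and move halfway toward it. The vertex coordinates and point count are arbitrary illustration choices - the author's actual script may differ.

## chaos game for the Sierpinski triangle (illustrative sketch)
sierpinski <- function(n = 50000) {
  ## vertices of an equilateral triangle
  vx <- c(0, 1, 0.5)
  vy <- c(0, 0, sqrt(3) / 2)
  pts <- matrix(NA_real_, nrow = n, ncol = 2)
  x <- runif(1)
  y <- runif(1) * sqrt(3) / 2   ## rough starting point
  for (i in 1:n) {
    v <- sample(3, 1)           ## "roll the die" to pick a vertex
    x <- (x + vx[v]) / 2        ## move halfway toward that vertex
    y <- (y + vy[v]) / 2
    pts[i, ] <- c(x, y)
  }
  pts
}

pts <- sierpinski()
plot(pts, pch = ".", asp = 1, axes = FALSE, xlab = "", ylab = "")

A loop like this generates 50,000 points in a few seconds, so the half-hour run time mentioned above was probably dominated by how the points were plotted rather than by generating the points themselves.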

sentiment analysis

Doing a Sentiment Analysis on Tweets (Part 2)

INTRO This is post is a continuation of my last post. There I pulled tweets from Twitter related to “Comcast email,” got rid of the junk, and removed the unnecessary/unwanted data. Now that I have the tweets, I will further clean the text and subject it to two different analyses: emotion and polarity. WHY DOES THIS MATTER Before I get started, I thought it might be a good idea to talk about WHY I am doing this (besides the fact that I learned a new skill and want to show it off and get feedback). This yet incomplete project was devised for two reasons: Understand the overall customer sentiment about the product I support Create an early warning system to help identify when things are going wrong on the platform Keeping the customer voice at the forefront of everything we do is tantamount to providing the best experience for the users of our platform. Identifying trends in sentiment and emotion can help inform the team in many ways, including seeing the reaction to new features/releases (i.e. – seeing a rise in comments about a specific addition from a release) and identifying needed changes to current functionality (i.e. – users who continually comment about a specific behavior of the application) and improvements to user experience (i.e. – trends in comments about being unable to find a certain feature on the site). Secondarily, this analysis can act as an early warning system when there are issues with the platform (i.e. – a sudden spike in comments about the usability of a mobile device). Now that I’ve explained why I am doing this (which I probably should have done in this sort of detail the first post), let’s get into how it is actually done… STEP ONE: STRIPPING THE TEXT FOR ANALYSIS There are a number of things included in tweets that dont matter for the analysis. Things like twitter handles, URLs, punctuation… they are not necessary to do the analysis (in fact, they may well confound it). This bit of code handles that cleanup. For those following the scripts on GitHub, this is part of my tweet_clean.R script. Also, to give credit where it is due: I’ve borrowed and tweaked the code from Andy Bromberg’s blog to do this task. library(stringr) ##Does some of the text editing ##Cleaning up the data some more (just the text now) First grabbing only the text text <- paredTweetList$Tweet # remove retweet entities text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", text) # remove at people text <- gsub("@\\w+", "", text) # remove punctuation text <- gsub("[[:punct:]]", "", text) # remove numbers text <- gsub("[[:digit:]]", "", text) # remove html links text <- gsub("http\\w+", "", text) # define "tolower error handling" function try.error <- function(x) { # create missing value y <- NA # tryCatch error try_error <- tryCatch(tolower(x), error=function(e) e) # if not an error if (!inherits(try_error, "error")) y <- tolower(x) # result return(y) } # lower case using try.error with sapply text <- sapply(text, try.error) # remove NAs in text text <- text[!is.na(text)] # remove column names names(text) <- NULL STEP TWO: CLASSIFYING THE EMOTION FOR EACH TWEET So now the text is just that: only text. The punctuation, links, handles, etc. have been removed. Now it is time to estimate the emotion of each tweet. Through some research, I found that there are many posts/sites on Sentiment Analysis/Emotion Classification that use the “Sentiment” package in R. 
I thought: “Oh great – a package tailor made to solve the problem for which I want an answer.” The problem is that this package has been deprecated and removed from the CRAN library. To get around this, I downloaded the archived package and pulled the code for doing the emotion classification. With some minor tweaks, I was able to get it going. This can be seen in its entirety in the classify_emotion.R script. You can also see the “made for the internet” version here: library(RTextTools) library(tm) algorithm <- "bayes" prior <- 1.0 verbose <- FALSE matrix <- create_matrix(text) lexicon <- read.csv("./data/emotions.csv.gz",header=FALSE) counts <- list(anger=length(which(lexicon[,2]=="anger")), disgust=length(which(lexicon[,2]=="disgust")), fear=length(which(lexicon[,2]=="fear")), joy=length(which(lexicon[,2]=="joy")), sadness=length(which(lexicon[,2]=="sadness")), surprise=length(which(lexicon[,2]=="surprise")), total=nrow(lexicon)) documents <- c() for (i in 1:nrow(matrix)) { if (verbose) print(paste("DOCUMENT",i)) scores <- list(anger=0,disgust=0,fear=0,joy=0,sadness=0,surprise=0) doc <- matrix[i,] words <- findFreqTerms(doc,lowfreq=1) for (word in words) { for (key in names(scores)) { emotions <- lexicon[which(lexicon[,2]==key),] index 0) { entry <- emotions[index,] category <- as.character(entry[[2]]]) count <- counts[[category]] score <- 1.0 if (algorithm=="bayes") score <- abs(log(score*prior/count)) if (verbose) { print(paste("WORD:",word,"CAT:", category,"SCORE:",score)) } scores[[category]] <- scores[[category]]+score } } } if (algorithm=="bayes") { for (key in names(scores)) { count <- counts[[key]] total <- counts[["total"]] score <- abs(log(count/total)) scores[[key]] <- scores[[key]]+score } } else { for (key in names(scores)) { scores[[key]] <- scores[[key]]+0.000001 } } best_fit <- names(scores)[which.max(unlist(scores))] if (best_fit == "disgust" && as.numeric(unlist(scores[2]))-3.09234 < .01) best_fit <- NA documents <- rbind(documents, c(scores$anger, scores$disgust, scores$fear, scores$joy, scores$sadness, scores$surprise, best_fit)) } colnames(documents) <- c("ANGER", "DISGUST", "FEAR", "JOY", "SADNESS", "SURPRISE", "BEST_FIT") Here is a sample output from this code: ANGER DISGUST FEAR JOY SADNESS SURPRISE BEST_FIT “1.46871776464786” “3.09234031207392” “2.06783599555953” “1.02547755260094” “7.34083555412328” “7.34083555412327” “sadness” “7.34083555412328” “3.09234031207392” “2.06783599555953” “1.02547755260094” “1.7277074477352” “2.78695866252273” “anger” “1.46871776464786” “3.09234031207392” “2.06783599555953” “1.02547755260094” “7.34083555412328” “7.34083555412328” “sadness” Here you can see that the initial author is using naive Bayes (which honestly I don’t yet understand) to analyze the text. I wanted to show a quick snipet of how the analysis is being done “under the hood.” For my purposes though, I only care about the emotion outputted and the tweet it is analyzed from. emotion <- documents[, "BEST_FIT"]` This variable, emotion, is returned by the classify_emotion.R script. CHALLENGES OBSERVED In addition to not fully understanding the code, the emotion classification seems to only work OK (which is pretty much expected… this is a canned analysis that hasn’t been tailored to my analysis at all). I’d like to come back to this one day to see if I can do a better job analyzing the emotions of the tweets. STEP THREE: CLASSIFYING THE POLARITY OF EACH TWEET Similarly to what we saw in step 5, I will use the cleaned text to analyze the polarity of each tweet. 
This code is also from the old R package titled “Sentiment.” As with above, I was able to get the code working with only some minor tweaks. This can be seen in its entirety in the classify_polarity.R script. Here it is, too: algorithm <- "bayes" pstrong <- 0.5 pweak <- 1.0 prior <- 1.0 verbose <- FALSE matrix <- create_matrix(text) lexicon <- read.csv("./data/subjectivity.csv.gz",header=FALSE) counts <- list(positive=length(which(lexicon[,3]=="positive")), negative=length(which(lexicon[,3]=="negative")), total=nrow(lexicon)) documents <- c() for (i in 1:nrow(matrix)) { if (verbose) print(paste("DOCUMENT",i)) scores <- list(positive=0,negative=0) doc <- matrix[i,] words <- findFreqTerms(doc, lowfreq=1) for (word in words) { index <- pmatch(word, lexicon[,1], nomatch=0) if (index > 0) { entry <- lexicon[index,] polarity <- as.character(entry[[2]]) category <- as.character(entry[[3]]) count <- counts[[category]] score <- pweak if (polarity == "strongsubj") score <- pstrong if (algorithm=="bayes") score <- abs(log(score*prior/count)) if (verbose) { print(paste("WORD:", word, "CAT:", category, "POL:", polarity, "SCORE:", score)) } scores[[category]] <- scores[[category]]+score } } if (algorithm=="bayes") { for (key in names(scores)) { count <- counts[[key]] total <- counts[["total"]] score <- abs(log(count/total)) scores[[key]] <- scores[[key]]+score } } else { for (key in names(scores)) { scores[[key]] <- scores[[key]]+0.000001 } } best_fit <- names(scores)[which.max(unlist(scores))] ratio <- as.integer(abs(scores$positive/scores$negative)) if (ratio==1) best_fit <- "neutral" documents <- rbind(documents,c(scores$positive, scores$negative, abs(scores$positive/scores$negative), best_fit)) if (verbose) { print(paste("POS:", scores$positive,"NEG:", scores$negative, "RATIO:", abs(scores$positive/scores$negative))) cat("\n") } } colnames(documents) <- c("POS","NEG","POS/NEG","BEST_FIT") Here is a sample output from this code: POS NEG POS/NEG BEST_FIT “1.03127774142571” “0.445453222112551” “2.31512017476245” “positive” “1.03127774142571” “26.1492093145274” “0.0394381997949273” “negative” “17.9196623384892” “17.8123396772424” “1.00602518608961” “neutral” Again, I just wanted to show a quick snippet of how the analysis is being done “under the hood.” I only care about the polarity that is output and the tweet it came from. polarity <- documents[, "BEST_FIT"] This variable, polarity, is returned by the classify_polarity.R script. CHALLENGES OBSERVED As with above, this is a stock analysis and hasn’t been tweaked for my needs. The analysis does OK, but I want to come back to this again one day to see if I can do better. QUICK CONCLUSION So… Now I have the emotion and polarity for each tweet. This can be useful to see on its own, but I think it is more worthwhile in aggregate. In my next post, I’ll show that. I’ll also show an analysis of the word counts with a wordcloud… This gets into the secondary point of this analysis. Hypothetically, I’d like to see common issues bubble up through the wordcloud.
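As a hint of what “in aggregate” might look like, here is a tiny illustrative sketch of my own (not taken from the scripts above) that simply tabulates the classifications returned by the two scripts:
## quick tabulation of the classified tweets (illustrative only)
table(emotion, useNA = "ifany")        ## counts per emotion, including tweets with no best fit
table(polarity)                        ## counts per polarity class
round(prop.table(table(polarity)), 2)  ## share of tweets in each polarity class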

Doing a Sentiment Analysis on Tweets (Part 1)

INTRO So… This post is my first foray into the R twitteR package. This post assumes that you have that package installed already in R. I show here how to get tweets from Twitter in preparation for doing some sentiment analysis. My next post will be the actual sentiment analysis. For this example, I am grabbing tweets related to “Comcast email.” My goal for this exercise is to see how people are feeling about the product I support. STEP 1: GETTING AUTHENTICATED TO TWITTER First, you’ll need to create an application at Twitter. I used this blog post to get rolling with that; it does a good job walking you through the steps. Once you have your app created, this is the code I used to create and save my authentication credentials. Once you’ve done this once, you need only load your credentials in the future to authenticate with Twitter. library(twitteR) ## R package that does some of the Twitter API heavy lifting library(ROAuth) ## provides OAuthFactory consumerKey <- "INSERT YOUR KEY HERE" consumerSecret <- "INSERT YOUR SECRET HERE" reqURL <- "https://api.twitter.com/oauth/request_token" accessURL <- "https://api.twitter.com/oauth/access_token" authURL <- "https://api.twitter.com/oauth/authorize" twitCred <- OAuthFactory$new(consumerKey = consumerKey, consumerSecret = consumerSecret, requestURL = reqURL, accessURL = accessURL, authURL = authURL) twitCred$handshake() save(twitCred, file="credentials.RData") STEP 2: GETTING THE TWEETS Once you have your authentication credentials set, you can use them to grab tweets from Twitter. The next snippets of code come from my scraping_twitter.R script, which you are welcome to see in its entirety on GitHub. ##Authentication load("credentials.RData") ##has my secret keys and shiz registerTwitterOAuth(twitCred) ##logs me in ##Get the tweets about "comcast email" to work with tweetList <- searchTwitter("comcast email", n = 1000) tweetList <- twListToDF(tweetList) ##converts that data we got into a data frame As you can see, I used the twitteR R package to authenticate and search Twitter. After getting the tweets, I converted the results to a data frame to make it easier to analyze them. STEP 3: GETTING RID OF THE JUNK Many of the tweets returned by my initial search are totally unrelated to Comcast Email. An example of this would be: “I am selling something random… please email me at myemailaddress@comcast.net” The tweet above includes the words email and comcast, but has nothing to actually do with Comcast Email and the way the user feels about it, other than that they use it for their business. So… based on some initial, manual analysis of the tweets, I’ve decided to pull the tweets with: 1) “fix” AND “email” in them (in that order); 2) “Comcast” AND “email” in them (in that order); 3) “no email” in them; 4) any tweet that comes from a source with “comcast” in the handle; 5) “Customer Service” AND “email” (or the reverse) in them. This is done with this code: ##finds the rows that have the phrase "fix ... email" in them fixemail <- grep("(fix.*email)", tweetList$text) ##finds the rows that have the phrase "comcast ... 
email" in them comcastemail <- grep("[Cc]omcast.*email", tweetList$text) ##finds the rows that have the phrase "no email" in them noemail <- grep("no email", tweetList$text) ##finds the rows that originated from a Comcast twitter handle comcasttweet <- grep("[Cc]omcast", tweetList$screenName) ##finds the rows related to email and customer service custserv <- grep("[Cc]ustomer [Ss]ervice.*email|email.*[Cc]ustomer [Ss]ervice", tweetList$text) After pulling out the duplicates (some tweets may fall into multiple scenarios from above) and ensuring they are in order (as returned initially), I assign the relevant tweets to a new variable with only some of the returned columns. The returned columns are: text, favorited, favoriteCount, replyToSN, created, truncated, replyToSID, id, replyToUID, statusSource, screenName, retweetCount, isRetweet, retweeted, longitude, latitude. All I care about are: text, created, statusSource, screenName. This is handled through this tidbit of code: ##combine all of the "good" tweets row numbers that we grepped out above and ##then sorts them and makes sure they are unique combined <- c(fixemail, comcastemail, noemail, comcasttweet, custserv) uvals <- unique(combined) sorted <- sort(uvals) ##pull the row numbers that we want, and with the columns that are important to ##us (tweet text, time of tweet, source, and username) paredTweetList <- tweetList[sorted, c(1, 5, 10, 11)] STEP 4: CLEAN UP THE DATA AND RETURN THE RESULTS Lastly, for this first script, I make the sources look nice, add titles, and return the final list (only a sample set of tweets shown): ##make the device source look nicer paredTweetList$statusSource <- sub("<.*\">", "", paredTweetList$statusSource) paredTweetList$statusSource <- sub("</a>", "", paredTweetList$statusSource) ##name the columns names(paredTweetList) <- c("Tweet", "Created", "Source", "ScreenName") paredTweetList Tweet Created Source ScreenName Dear Mark I am having problems login into my acct REDACTED@comcast.net I get no email w codes to reset my password for eddygil HELP HELP 2014-12-23 15:44:27 Twitter Web Client riocauto @msnbc @nbc @comcast pay @thereval who incites the murder of police officers. Time to send them a message of BOYCOTT! Tweet/email them NOW 2014-12-23 14:52:50 Twitter Web Client Monty_H_Mathis Comcast, I have no email. This is bad for my small business. Their response “Oh, I’m sorry for that”. Problem not resolved. #comcast 2014-12-23 09:20:14 Twitter Web Client mathercesul CHALLENGES OBSERVED As you can see from the output, sometimes some “junk” still gets in. Something I’d like to continue working on is a more reliable algorithm for identifying appropriate tweets. I am also worried that my choice of subjects is biasing the sentiment.
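To make the filtering logic a bit more concrete, here is a tiny illustration of how those grep patterns behave on a few made-up strings (the samples are mine, not real tweets):
## toy demonstration of the filtering patterns (sample strings are invented)
samples <- c("please fix my comcast email",
             "my comcast.net email address is for sale inquiries",
             "still no email this morning @comcast")
grep("(fix.*email)", samples)       ## returns 1
grep("[Cc]omcast.*email", samples)  ## returns 1 and 2
grep("no email", samples)           ## returns 3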

shiny

Jake on Strava - R Shiny App

I created a Shiny app that grabs my running, riding, and other exercise stats from Strava and creates some simple visualizations.

Twitter Analysis - R Shiny App

I created a Shiny app that searches Twitter and does some simple analysis.

sierpinski triangles

Sierpinski Triangles (and Carpets) in R

Recently in class, I was asked the following question: Start with an equilateral triangle and a point chosen at random from the interior of that triangle. Label one vertex 1, 2, a second vertex 3, 4, and the last vertex 5, 6. Roll a die to pick a vertex. Place a dot at the point halfway between the roll-selected vertex and the point you chose. Now consider this new dot as a starting point to do this experiment once again. Roll the die to pick a new vertex. Place a dot at the point halfway between the last point and the most recent roll-selected vertex. Continue this procedure. What does the shape of the collection of dots look like? I thought, well - it’s got to be something cool or else the professor wouldn’t ask, but I can’t imagine it will be more than a cloud of dots. Truth be told, I went to a conference for work the week of this assignment and never did it - but when I went to the next class, IT WAS SOMETHING COOL! It turns out that this creates a Sierpinski Triangle - a fractal of increasingly smaller triangles. I wanted to check this out for myself, so I built an R script that creates the triangle. I ran it a few times with differing numbers of points. Here is one with 50,000 points. Though this post is written in RStudio, I’ve hidden the code for readability. The actual code for this can be found here. I thought - if equilateral triangles create patterns this cool, a square must be amazing! Well… it is; however, you can’t just run the same logic - it will return a cloud of random dots… After talking with my professor, Dr. Levitan, it turns out you can get something equally as awesome as the Sierpinski triangle with a square; you just need to make a few changes (say this with a voice of authority and calm knowingness): Instead of 3 points to move to, you need 8 points: the 4 corners of a specified square and the midpoints of each side. Also, instead of taking the midpoint of your move to the specified location, you need to take the tripoint (dividing the distance by 3 instead of 2). This is called a Sierpinski Carpet - a fractal of squares (as opposed to a fractal of equilateral triangles in the graph above). You can see in both the triangle and the square that the same pattern is repeated time and again in smaller and smaller increments. I updated my R script and voila - MORE BEAUTIFUL MATH! Check out the script and run the functions yourself! I only spent a little bit of time putting it together - I think it would be cool to add some other features, especially when it comes to the plotting of the points. Also - I’d like to run it for a million or more points… I just lacked the patience to wait for the script to run that long (50,000 points took about 30 minutes to run - my script is probably not the most efficient). Anyways - it’s really cool to see what happens in math sometimes - it’s hard to imagine at first that the triangle would look that way. Another reason math is cool!
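If you just want to see the effect without digging into my full script, here is a minimal chaos-game sketch of my own (not the script linked above) that does the triangle version:
## minimal chaos-game sketch: repeatedly jump halfway toward a random vertex
chaos_triangle <- function(n = 50000) {
  vertices <- matrix(c(0, 0,
                       1, 0,
                       0.5, sqrt(3) / 2), ncol = 2, byrow = TRUE)
  pts <- matrix(NA_real_, nrow = n, ncol = 2)
  current <- colMeans(vertices)       ## start from a point inside the triangle (the centroid)
  for (i in 1:n) {
    v <- vertices[sample(1:3, 1), ]   ## "roll the die" to pick a vertex
    current <- (current + v) / 2      ## move halfway toward that vertex
    pts[i, ] <- current
  }
  plot(pts, pch = ".", asp = 1, axes = FALSE, xlab = "", ylab = "")
}
chaos_triangle()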

Splunk

Using R and Splunk: Lookups of More Than 10,000 Results

Splunk, for some probably very good reasons, has limits on how many results are returned by sub-searches (which in turn limits us on lookups, too). Because of this, I’ve used R to search Splunk through its API endpoints (using the httr package) and utilize loops, the plyr package, and other data manipulation flexibilities given through the use of R. This has allowed me to answer some questions for our business team that on the surface seem simple enough, but where the data gathering and manipulation get either too complex or too large for Splunk to handle efficiently. Here are some examples: Of the 1.5 million customers we’ve emailed in a marketing campaign, how many of them have made the conversion? How are our 250,000 beta users accessing the platform? Who are the users logging into our system from our internal IPs? The high level steps to using R and Splunk are: Import the lookup values of concern as a csv. Create the lookup as a string. Create the search string, including the lookup just created. Execute the GET to get the data. Read the response into a data table. I’ve taken this one step further; because my lookups are usually LARGE, I end up breaking up the search into smaller chunks and combining the results at the end. Here is some example code that you can edit for your own needs; it shows what I’ve done and how I’ve done it. This bit of code will iteratively run the “searchstring” 250 times and combine the results. ## LIBRARY THAT ENABLES THE HTTPS CALL ## library(httr) ## READ IN THE LOOKUP VALUES OF CONCERN ## mylookup <- read.csv("mylookup.csv", header = FALSE) ## ARBITRARY "CHUNK" SIZE TO KEEP SEARCHES SMALLER ## start <- 1 end <- 1000 ## CREATE AN EMPTY DATA FRAME THAT WILL HOLD END RESULTS ## alldata <- data.frame() ## HOW MANY "CHUNKS" WILL NEED TO BE RUN TO GET COMPLETE RESULTS ## for(i in 1:250){ ## CREATES THE LOOKUP STRING FROM THE mylookup VARIABLE ## lookupstring <- paste(mylookup[start:end, 1], sep = "", collapse = '" OR VAR_NAME="') ## CREATES THE SEARCH STRING; THIS IS A SIMPLE SEARCH EXAMPLE ## searchstring <- paste('index = "my_splunk_index" (VAR_NAME="', lookupstring, '") | stats count BY VAR_NAME', sep = "") ## RUNS THE SEARCH; SUB IN YOUR SPLUNK LINK, USERNAME, AND PASSWORD ## response <- GET("https://our.splunk.link:8089/", path = "servicesNS/admin/search/search/jobs/export", encode="form", config(ssl_verifyhost=FALSE, ssl_verifypeer=0), authenticate("USERNAME", "PASSWORD"), query=list(search=paste0("search ", searchstring, collapse="", sep=""), output_mode="csv")) ## CHANGES THE RESULTS TO A DATA TABLE ## result <- read.table(text = content(response, as = "text"), sep = ",", header = TRUE, stringsAsFactors = FALSE) ## BINDS THE CURRENT RESULTS WITH THE OVERALL RESULTS ## alldata <- rbind(alldata, result) ## UPDATES THE START POINT ## start <- end + 1 ## UPDATES THE END POINT, BUT MAKES SURE IT DOESN'T GO TOO FAR ## if((end + 1000) > nrow(mylookup)){ end <- nrow(mylookup) } else { end <- end + 1000 } ## FOR TROUBLESHOOTING, I PRINT THE ITERATION ## #print(i) } ## WRITES THE RESULTS TO A CSV ## write.table(alldata, "mydata.csv", row.names = FALSE, sep = ",") So - that is how you do a giant lookup against Splunk data with R! I am sure that there are more efficient ways of doing this, even in the Splunk app itself, but this has done the trick for me!
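If it helps to see what the paste/collapse step actually produces, here is a tiny, self-contained illustration with made-up lookup values:
## what the lookup and search strings look like for three made-up values
vals <- c("alice", "bob", "carol")
lookupstring <- paste(vals, sep = "", collapse = '" OR VAR_NAME="')
searchstring <- paste('index = "my_splunk_index" (VAR_NAME="', lookupstring, '") | stats count BY VAR_NAME', sep = "")
searchstring
## [1] "index = \"my_splunk_index\" (VAR_NAME=\"alice\" OR VAR_NAME=\"bob\" OR VAR_NAME=\"carol\") | stats count BY VAR_NAME"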

stat theory

Sierpinski Triangles (and Carpets) in R


statistics

Identifying Compromised User Accounts with Logistic Regression

INTRODUCTION As a Data Analyst on Comcast’s Messaging Engineering team, it is my responsibility to report on platform statuses, identify irregularities, measure the impact of changes, and identify policies to ensure that our system is used as it was intended. Part of that last responsibility is the identification and remediation of compromised user accounts. The challenge the company faces is being able to detect account compromises faster and remediate them closer to the moment of detection. This post will focus on the methodology and process for modeling the criteria to best detect compromised user accounts in near real-time from outbound email activity. For obvious reasons, I am only going to speak to the methodologies used; I’ll be vague when it comes to the actual criteria we used. DATA COLLECTION AND CLEANING Without getting into the finer details of email delivery, there are about 43 terminating actions an email can take when it is sent out of our platform. A message can be dropped for a number of reasons. These are things like the IP or user being on any number of block lists, triggering our spam filters, and other abusive behaviors. The other side of that is that the message will be delivered to its intended recipient. That said, I was able to create a usage profile for all of our outbound senders in small chunks of time in Splunk (our machine log collection tool of choice). This profile gives a summary per user of how often the messages they sent hit each of the terminating actions described above. In order to label my training data, I matched this usage data to our current compromised detection lists. I created a script in Python that added an additional column to the data: if an account was flagged as compromised with our current criteria, it was given a one; if not, a zero. With the data collected, I am ready to determine the important inputs. DETERMINING INPUTS FOR THE MODEL In order to determine the important variables in the data, I created a Binary Regression Tree in R using the rpart library. The Binary Regression Tree iterates over the data and “splits” it in order to group compromised accounts together and non-compromised accounts together. It is also a nice way to visualize the data. You can see in the picture below what this looks like. Because the data is so large, I limited the data to one-day chunks. I then ran this regression tree against each day separately. From that, I was able to determine that there are 6 important variables (4 of which showed up in every regression tree I created; the other 2 showed up in a majority of trees). You can determine the “important” variables by looking in the summary for the number of splits per variable. BUILDING THE MODEL Now that I have the important variables, I created a Python script to build the Logistic Regression Model from them. Using the statsmodels package, I was able to build the model. All of my input variables were highly significant. I took the logistic regression equation with the coefficients given in the model back to Splunk and tested it on incoming data to see what would come out. I quickly found that it caught many accounts that were really compromised. There were also some accounts being discovered that looked like brute force attacks that never got through - to adjust for that, I added a constraint to the model that the user must have done at least one terminating action that ensured they authenticated successfully (this rules out users coming from a ton of IPs but failing authentication every time).
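For anyone curious what the input-selection step looks like in code, here is a hedged sketch of the rpart call; the data frame usage_profile, its terminating-action columns, and the compromised flag are hypothetical names standing in for the real (undisclosed) criteria, not the actual fields I used.
## hedged sketch of the variable-importance step (all names here are hypothetical)
library(rpart)
## usage_profile: one row per sender for a one-day chunk, with counts of each
## terminating action and a 0/1 compromised label from the current detection lists
fit <- rpart(factor(compromised) ~ ., data = usage_profile, method = "class")
printcp(fit)              ## which variables were used in splits, plus tree complexity
fit$variable.importance   ## rough ranking of the inputs the tree found useful
plot(fit)
text(fit, use.n = TRUE)   ## quick visual of the splits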
CONCLUSION First and foremost, this writeup was intended to be a very high level summary explaining the steps I took to get to my final model. What isn’t explained here is how many models I built that were less successful. Though this combination worked for me in the end, you’ll likely need to iterate over the process a number of times to get something successful. The new detection method for compromised accounts is an opportunity for us to expand our compromise detection and do it in a more real-time manner. It is also a foundation for future detection techniques for malicious IPs and other actors. With this new method, we will be able to expand the activity types for compromise detection beyond outbound email activity to things like preference changes, password resets, changes to forwarding addresses, and even application activity outside of the email platform.

Strava

Jake on Strava - R Shiny App

I created a Shiny app that grabs my running, riding, and other exercise stats from Strava and creates some simple visualizations.

Visualizing Exercise Data from Strava

INTRODUCTION My wife introduced me to cycling in 2014 - I fell in love with it and went all in. That first summer after buying my bike, I rode over 500 miles (more on that below). My neighbors at the time, also cyclists, introduced me to the app Strava. Ever since then, I’ve tracked all of my rides, runs, hikes, walks (perhaps not really exercise that needs to be tracked… but I hurt myself early in 2018 and that’s all I could do for a while), etc. Everything I could track, I tracked. I got curious and found a package, rStrava, with which I can download all of my activity data. Once I had it, I put it into a few visualizations. ESTABLISH STRAVA AUTHENTICATION The first thing I had to do was set up a Strava account and application. I found some really nice instructions on another blog that helped walk me through this. After that, I installed rStrava and set up authentication (you only have to do this the first time). ## INSTALLING THE NECESSARY PACKAGES install.packages("devtools") devtools::install_github('fawda123/rStrava') ## LOAD THE LIBRARY library(rStrava) ## ESTABLISH THE APP CREDENTIALS name <- 'jakelearnsdatascience' client_id <- '31528' secret <- 'MY_SECRET_KEY' ## CREATE YOUR STRAVA TOKEN token <- httr::config(token = strava_oauth(name, client_id, secret, app_scope = "read_all", cache = TRUE)) ## cache = TRUE is optional - but it saves your token to the working directory GET MY EXERCISE DATA Now that authentication is set up, using the rStrava package to pull activity data is relatively straightforward. library(rStrava) ## LOAD THE TOKEN (AFTER THE FIRST TIME) stoken <- httr::config(token = readRDS(oauth_location)[[1]]) ## GET STRAVA DATA USING rStrava FUNCTION FOR MY ATHLETE ID my_act <- get_activity_list(stoken) This function returns a list of activities (class(my_act) is list). In my case, there are 379 activities. FORMATTING THE DATA To make the data easier to work with, I convert it to a data frame. There are many more fields than I’ve selected below - these are all I want for this post. info_df <- data.frame() for(act in 1:length(my_act)){ tmp <- my_act[[act]] tmp_df <- data.frame(name = tmp$name, type = tmp$type, distance = tmp$distance, moving_time = tmp$moving_time, elapsed_time = tmp$elapsed_time, start_date = tmp$start_date_local, total_elevation_gain = tmp$total_elevation_gain, trainer = tmp$trainer, manual = tmp$manual, average_speed = tmp$average_speed, max_speed = tmp$max_speed) info_df <- rbind(info_df, tmp_df) } I want to convert a few fields to units that make more sense for me (miles, feet, hours instead of meters and seconds). I’ve also created a number of features, though I’ve suppressed the code here. You can see all of the code on github. HOW FAR HAVE I GONE? Since August 08, 2014, I have - under my own power - traveled 1300.85 miles. There were a few periods without much action (a whole year from mid-2016 through late 2017), which is a bit sad. The last few months have been good, though. Here’s a similar view, but split by activity. I’ve been running recently. I haven’t really ridden my bike since the first 2 summers I had it. I rode the Peloton when we first got it, but not since. I was a walker when I first tore the labrum in my hip in early 2018. Finally, here’s the same data again, but split up in a ridgeplot. SUMMARY There’s a TON of data that is returned by the Strava API. This post just scratches the surface of the analysis that is possible - mostly I am just introducing how to get the data and get up and running. 
As a new year’s resolution, I’ve committed to run 312 miles this year. That is 6 miles per week for 52 weeks (for those trying to wrap their head around the weird number). Now that I’ve been able to pull this data, I’ll have to set up a tracker/dashboard for that data. More to come!
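For reference, the unit conversions mentioned above are simple arithmetic; here is a minimal sketch of the idea (my own illustration, not the suppressed code on github), using the fields of info_df built earlier:
## convert the raw Strava units into more familiar ones (sketch only)
info_df$distance_miles <- info_df$distance / 1609.34              ## meters -> miles
info_df$elevation_feet <- info_df$total_elevation_gain * 3.28084  ## meters -> feet
info_df$moving_hours <- info_df$moving_time / 3600                ## seconds -> hours
info_df$avg_mph <- info_df$average_speed * 2.23694                ## meters/second -> miles/hour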

supervised learning

Machine Learning Demystified

The differences and applications of Supervised and Unsupervised Machine Learning. Introduction Machine learning is one of the buzziest terms thrown around in technology these days. Combine machine learning with big data in a Google search and you’ve got yourself an unmanageable amount of information to digest. In a (possibly ironic) effort to help navigate this sea of information, this post is meant to be an introduction to and simplification of some common machine learning terminology and types, with some resources to dive deeper. Supervised vs. Unsupervised Machine Learning At the highest level, there are two different types of machine learning - supervised and unsupervised. Supervised means that we have historical information in order to learn from and make future decisions; unsupervised means that we have no previous information, but might be attempting to group things together or do some other type of pattern or outlier recognition. In each of these subsets there are many methodologies and motivations; I’ll explain how they work and give a simple example or two. Supervised Machine Learning Supervised machine learning is nothing more than using historical information (read: data) in order to predict a future event or explain a behavior using algorithms. I know - this is vague - but humans use these algorithms based on previous learning every day in their lives to predict things. A very simple example: if it is sunny outside when we wake up, it is perfectly reasonable to assume that it will not rain that day. Why do we make this prediction? Because over time, we’ve learned that on sunny days it typically does not rain. We don’t know for sure that today it won’t rain, but we’re willing to make decisions based on our prediction that it won’t rain. Computers do this exact same thing in order to make predictions. The real gains come from Supervised Machine Learning when you have lots of accurate historical data. In the example above, we can’t be 100% sure that it won’t rain because we’ve also woken up on a few sunny mornings in which we’ve driven home after work in a monsoon - adding more and more data for your supervised machine learning algorithm to learn from also allows it to make concessions for these other possible outcomes. Supervised Machine Learning can be used to classify (usually binary or yes/no outcomes, but it can be broader - is a person going to default on their loan? will they get divorced?) or predict a value (how much money will you make next year? what will the stock price be tomorrow?). Some popular supervised machine learning methods are regression (linear, which can predict a continuous value, or logistic, which can predict a binary value), decision trees, k-nearest neighbors, and naive Bayes. My favorite of these methods is decision trees. A decision tree is used to classify your data. Once the data is classified, the average is taken of each terminal node; this value is then applied to any future data that fits this classification. The decision tree above shows that if you were a female and in first or second class, there was a high likelihood you survived. If you were a male in second class who was younger than 12 years old, you also had a high likelihood of surviving. This tree could be used to predict the potential outcomes of future sinking ships (morbid… I know). Unsupervised Machine Learning Unsupervised machine learning is the other side of this coin. In this case, we do not necessarily want to make a prediction. 
Instead, this type of machine learning is used to find similarities and patterns in the information in order to cluster or group it. An example of this: consider a situation where you are looking at a group of people and you want to group similar people together. You don’t know anything about these people other than what you can see in their physical appearance. You might end up grouping the tallest people together and the shortest people together. You could do this same thing by weight instead… or hair length… or eye color… or use all of these attributes at the same time! It’s natural in this example to see how “close” people are to one another based on different attributes. What these types of algorithms do is evaluate the “distance” of one piece of information from another piece. In a machine learning setting, you look for similarities and “closeness” in the data and group accordingly. This could allow the administrators of a mobile application to see the different types of users of their app in order to treat each group with different rules and policies. They could cluster samples of users together and analyze each cluster to see if there are opportunities for targeted improvements. The most popular of these unsupervised machine learning methods is called k-means clustering. In k-means clustering, the goal is to partition your data into k clusters (where k is how many clusters you want - 1, 2,…, 10, etc.). To begin this algorithm, k means (or cluster centers) are randomly chosen. Each data point in the sample is clustered to the closest mean; the center (or centroid, to use the technical term) of each cluster is calculated and that becomes the new mean. This process is repeated until the mean of each cluster is optimized. The important part to note is that the output of k-means is clustered data that is “learned” without any input from a human. Similar methods are used in Natural Language Processing (NLP) in order to do Topic Modeling. Resources to Learn More There are an uncountable amount of resources out there to dive deeper into this topic. Here are a few that I’ve used or found along my Data Science journey. UPDATE: I’ve written a whole post on this. You can find it here. O’Reilly has a ton of great books that focus on various areas of machine learning. edX and Coursera have a TON of self-paced and instructor-led courses in machine learning. There is a specific series of courses offered by Columbia University that looks particularly applicable. If you are interested in learning machine learning and already have a familiarity with R and statistics, DataCamp has a nice, free program. If you are new to R, they have a free program for that, too. There are also many, many blogs out there to read about how people are using data science and machine learning.
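To make the k-means description above a little more concrete, here is a minimal, runnable illustration of my own using R's built-in iris measurements (a toy example, unrelated to the mobile-app scenario):
## tiny k-means illustration on built-in data
set.seed(42)
km <- kmeans(iris[, c("Sepal.Length", "Petal.Length")], centers = 3)
table(km$cluster, iris$Species)   ## how the learned clusters line up with labels k-means never saw
plot(iris$Sepal.Length, iris$Petal.Length, col = km$cluster, pch = 19,
     xlab = "Sepal length", ylab = "Petal length")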

tableau

Exploring Open Data - Philadelphia Parking Violations

Introduction A few weeks ago, I stumbled across Dylan Purcell’s article on Philadelphia Parking Violations. This is a nice glimpse of the data, but I wanted to get a taste of it myself. I went and downloaded the entire data set of Parking Violations in Philadelphia from the OpenDataPhilly website and came up with a few questions after checking out the data: How many tickets are in the data set? What is the range of dates in the data? Are there missing days/data? What was the biggest/smallest individual fine? What were those fines for? Who issued those fines? What was the average individual fine amount? What day had the most/fewest fines? What is the average amount per day? How much $ in fines did they write each day? What hour of the day are the most fines issued? What day of the week are the most fines issued? What state has been issued the most fines? Who (what individual) has been issued the most fines? How much does the individual with the most fines owe the city? How many people have been issued fines? What fines are issued the most/least? And finally, on to the cool stuff: Where were the most fines? Can I see them on a heat map? Can I predict the amount of parking tickets by weather data and other factors using linear regression? How about using Random Forests? Data Insights This data set has 5,624,084 tickets in it, spanning January 1, 2012 through September 30, 2015 - a range of 1368.881 days. I was glad to find that there are no missing days in the data set. The biggest fine, $2,000 (OUCH!), was issued (many times) by the police for “ATV on Public Property.” The smallest fine, $15, was also issued by the police, for “parking over the time limit.” The average fine for a violation in Philadelphia over the time range was $46.33. The most violations occurred on November 30, 2012, when 6,040 were issued. The fewest, unsurprisingly, were issued on Christmas Day 2014, when there were only 90. On average, PPA and the other agencies that issued tickets (more on that below) issued 4,105.17 tickets per day. All of those tickets add up to $190,193.50 in fines issued to the residents and visitors of Philadelphia every day!!! Digging a little deeper, I find that the most popular hour of the day for getting a ticket is 12 noon; 5AM nets the fewest tickets. Thursdays see the most tickets written (Thursdays and Fridays are higher than the rest of the week); Sundays see the least (pretty obvious). Another obvious insight is that PA-licensed drivers were issued the most tickets. Looking at individuals, there was one person who was issued 1,463 tickets (that’s more than 1 violation per day on average) for a whopping $36,471. In just looking at a few of their tickets, it seems like it is probably a delivery vehicle that delivers to Chinatown (tickets for “Stop Prohibited” and “Bus Only Zone” in the Chinatown area). I’d love to hear more about why this person has so many tickets and what you do about that… 1,976,559 unique vehicles - let me reiterate - nearly 2 million unique vehicles have been issued fines over the three and three quarter years this data set encompasses. That’s so many!!! That is 2.85 tickets per vehicle, on average (of course that excludes all of the cars that were here and never ticketed). That makes me feel much better about how many tickets I got while I lived in the city. And… who are the agencies behind all of this? It is no surprise that PPA issues the most. There are 11 agencies in all. Seems like all of the policing agencies like to get in on the fun from time to time. 
Issuing Agency (ticket count):
PPA: 4,979,292
PHILADELPHIA POLICE: 611,348
CENTER CITY DISTRICT: 9,628
SEPTA: 9,342
UPENN POLICE: 6,366
TEMPLE POLICE: 4,055
HOUSING AUTHORITY: 2,137
PRISON CORRECTIONS OFFICER: 295
POST OFFICE: 121
FAIRMOUNT DISTRICT: 120
Mapping the Violations Where are you most likely to get a violation? Is there anywhere that is completely safe? Looking at the city as a whole, you can see that there are some places that are “hotter” than others. I played around in CartoDB to try to visualize this as well, but Tableau seemed to do a decent enough job (though these are just screenshots). Zooming in, you can see that there are some distinct areas where tickets are given out in greater quantity. Looking one level deeper, you can see that there are some areas like Center City, east Washington Avenue, Passyunk Ave, and Broad Street that seem to be very highly patrolled. Summary I created the above maps in Tableau. I used R to summarize the data. The R scripts, raw and processed data, and Tableau workbook can be found in my github repo. In the next post, I use weather data and other parameters to predict how many tickets will be written on a daily basis.
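The summaries above came out of R; as a rough sketch of the kind of aggregation involved (the file name and column names here are assumptions for illustration, not the actual OpenDataPhilly schema or my real script):
## sketch of the by-agency summary (file and column names are assumed)
library(dplyr)
violations <- read.csv("philly_parking_violations.csv", stringsAsFactors = FALSE)
violations %>%
  group_by(issuing_agency) %>%
  summarise(tickets = n(),
            total_fines = sum(fine_amount, na.rm = TRUE)) %>%
  arrange(desc(tickets))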

tableau public

Jake Learns Data Science Visitor Dashboard

A quick view of visitors to my website. Data pulled from Google Analytics and pushed to Amazon Redshift using Stitch Data.

twitteR

Twitter Analysis - R Shiny App

I created a Shiny app that searches Twitter and does some simple analysis.

Doing a Sentiment Analysis on Tweets (Part 2)


Doing a Sentiment Analysis on Tweets (Part 1)


unsupervised learning

Machine Learning Demystified

The differences and applications of Supervised and Unsupervised Machine Learning. Introduction Machine learning is one of the buzziest terms thrown around in technology these days. Combine machine learning with big data in a Google search and you’ve got yourself an unmanageable amount of information to digest. In an (possibly ironic) effort to help navigate this sea of information, this post is meant to be an introduction and simplification of some common machine learning terminology and types with some resources to dive deeper. Supervised vs. Unsupervised Machine Learning At the highest level, there are two different types of machine learning - supervised and unsupervised. Supervised means that we have historical information in order to learn from and make future decisions; unsupervised means that we have no previous information, but might be attempting to group things together or do some other type of pattern or outlier recognition. In each of these subsets there are many methodologies and motivations; I’ll explain how they work and give a simple example or two. Supervised Machine Learning Supervised machine learning is nothing more than using historical information (read: data) in order to predict a future event or explain a behavior using algorithms. I know - this is vague - but humans use these algorithms based on previous learning everyday in their lives to predict things. A very simple example: if it is sunny outside when we wake up, it is perfectly reasonable to assume that it will not rain that day. Why do we make this prediction? Because over time, we’ve learned that on sunny days it typically does not rain. We don’t know for sure that today it won’t rain but we’re willing to make decisions based on our prediction that it won’t rain. Computers do this exact same thing in order to make predictions. The real gains come from Supervised Machine Learning when you have lots of accurate historical data. In the example above, we can’t be 100% sure that it won’t rain because we’ve also woken up on a few sunny mornings in which we’ve driven home after work in a monsoon - adding more and more data for your supervised machine learning algorithm to learn from also allows it to make concessions for these other possible outcomes. Supervised Machine Learning can be used to classify (usually binary or yes/no outcomes but can be broader - is a person going to default on their loan? will they get divorced?) or predict a value (how much money will you make next year? what will the stock price be tomorrow?). Some popular supervised machine learning methods are regression (linear, which can predict a continuous value, or logistic, which can predict a binary value), decision trees, k-nearest neighbors, and naive Bayes. My favorite of these methods is decision trees. A decision tree is used to classify your data. Once the data is classified, the average is taken of each terminal node; this value is then applied to any future data that fits this classification. The decision tree above shows that if you were a female and in first or second class, there was a high likelihood you survived. If you were a male in second class who was younger than 12 years old, you also had a high likelihood of surviving. This tree could be used to predict the potential outcomes of future sinking ships (morbid… I know). Unsupervised Machine Learning Unsupervised machine learning is the other side of this coin. In this case, we do not necessarily want to make a prediction. 
Unsupervised Machine Learning

Unsupervised machine learning is the other side of this coin. In this case, we do not necessarily want to make a prediction. Instead, this type of machine learning is used to find similarities and patterns in the information in order to cluster or group it.

An example of this: consider a situation where you are looking at a group of people and you want to group similar people together. You don’t know anything about these people other than what you can see in their physical appearance. You might end up grouping the tallest people together and the shortest people together. You could do the same thing by weight instead… or hair length… or eye color… or use all of these attributes at the same time! It’s natural in this example to see how “close” people are to one another based on different attributes. What these types of algorithms do is evaluate the “distance” of one piece of information from another. In a machine learning setting, you look for similarities and “closeness” in the data and group accordingly. This could allow the administrators of a mobile application to see the different types of users of their app in order to treat each group with different rules and policies. They could cluster samples of users together and analyze each cluster to see if there are opportunities for targeted improvements.

The most popular of these unsupervised machine learning methods is called k-means clustering. In k-means clustering, the goal is to partition your data into k clusters (where k is how many clusters you want - 1, 2, …, 10, etc.). To begin the algorithm, k means (or cluster centers) are chosen at random. Each data point in the sample is assigned to the closest mean; the center (or centroid, to use the technical term) of each cluster is then recalculated and becomes the new mean. This process is repeated until the mean of each cluster is optimized. The important part to note is that the output of k-means is clustered data that is “learned” without any input from a human (there is a short kmeans() sketch at the end of this post). Similar methods are used in Natural Language Processing (NLP) in order to do Topic Modeling.

Resources to Learn More

There are an uncountable number of resources out there to dive deeper into this topic. Here are a few that I’ve used or found along my Data Science journey.

UPDATE: I’ve written a whole post on this. You can find it here.

O’Reilly has a ton of great books that focus on various areas of machine learning.

edX and Coursera have a TON of self-paced and instructor-led courses in machine learning. There is a specific series of courses offered by Columbia University that looks particularly applicable.

If you are interested in learning machine learning and already have a familiarity with R and statistics, DataCamp has a nice, free program. If you are new to R, they have a free program for that, too.

There are also many, many blogs out there to read about how people are using data science and machine learning.
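As promised above, here is a minimal k-means sketch (mine, not from the post) using R’s built-in kmeans() on made-up height and weight data:

##made-up data: 100 people described by height (cm) and weight (kg)
set.seed(42)
people <- data.frame(height = rnorm(100, 170, 10),
                     weight = rnorm(100, 75, 12))

##ask for k = 3 clusters on the scaled attributes
km <- kmeans(scale(people), centers = 3)

km$centers         ##the learned cluster centers (centroids)
table(km$cluster)  ##how many people landed in each cluster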

user agents

Understanding User Agents

INTRODUCTION

I have had a few discussions around web user agents at work recently. It turns out that they are not straightforward at all. In other words, trying to report browser usage to our Business Unit required a nontrivial translation. The more I dug in, the more I learned. I had some challenges finding the information, so I thought it would be useful to document my findings and centralize the sites I used to figure all this out.

Just a quick background: our web application, for a multitude of reasons, sends Internet Explorer users into a kind of compatibility mode in which the browser appears to be another version of IE (frequently 7, which no one uses anymore). In addition to this, in some of the application logs, there are user agents that appear with the prefix from the app followed by the browser as it understands it - also frequently IE7. For other browsers - it could be Google Chrome (GC43; 43 is the browser version) or Mozilla Firefox (FF38; same deal here with the version number) - it does the same thing, though those browsers do not default to a compatibility mode in the same way. This is only the beginning of the confusion that is a web user agent string.

While there isn’t much I can do about the application logs doing their own user agent translations (we’ll need to make some changes to the system logging), I can decipher the user agent strings from the places in the app that report them raw. These are the strings that begin with Mozilla (more on that below). Let’s walk through them.

THE USER AGENT STRING

It can look like many different things. Here are some examples:

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36

Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/600.5.17 (KHTML, like Gecko) Version/8.0.5 Safari/600.5.17

Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MAEM; .NET4.0C; InfoPath.1)

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.3; .NET4.0C; yie8)

Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)

As you can see - they all have different components and parts to them. Some seem to be very straightforward at first glance (keyword: seem) and others are totally baffling.
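Before translating the pieces one by one, it can help to mechanically split a string into its tokens. A quick R sketch of that idea (mine, not from the post), using the first example above:

ua <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"

##pull the product/version tokens, e.g. "Chrome/43.0.2357.81"
regmatches(ua, gregexpr("[A-Za-z]+/[0-9.]+", ua))[[1]]

##pull the first parenthesized block, which carries the system information
regmatches(ua, regexpr("\\([^)]*\\)", ua))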
TRANSLATING THE USER AGENT STRING

Much of my understanding of these user agent strings came from plugging them into this page and a fair amount of Googling. Let’s pull apart the first user agent string from above:

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36

Mozilla/5.0 - “MozillaProductSlice. Claims to be a Mozilla based user agent, which is only true for Gecko browsers like Firefox and Netscape. For all other user agents it means ‘Mozilla-compatible’. In modern browsers, this is only used for historical reasons. It has no real meaning anymore.”
TRANSLATION: We don’t care about this field in any of the user agent strings. It’s good to know that it starts the web user agent strings, but that’s about it.

(Windows NT 6.1; WOW64) - Operating System = Windows 7
TRANSLATION: This is at least in the right ballpark, but still not exactly straightforward. Why can’t it just be Windows 7 for Windows 7?

AppleWebKit/537.36 - “The Web Kit provides a set of core classes to display web content in windows”
TRANSLATION: I don’t even know… don’t care.

(KHTML, - “Open Source HTML layout engine developed by the KDE project”
TRANSLATION: Still don’t know or care.

like Gecko) - “like Gecko…”
TRANSLATION: What? Yep. Don’t care - makes no sense.

Chrome/43.0.2357.81 - This is the browser and its version.
TRANSLATION: Google Chrome v. 43. YES! ONE THAT MAKES SENSE AND HAS INFO WE WANT!

Safari/537.36 - “Based on Safari”
TRANSLATION: Um… ok? So this isn’t actually Apple Safari? NOPE! It’s Chrome, which makes pulling Safari quite the challenge. I’ll spell that out in more detail when outlining the if statement below.

Out of that whole thing, we have several things that aren’t important and several things that look like they could be another thing, but aren’t. So… long story short - all of that info boils down to the user coming to our site using Google Chrome 43 from a Windows 7 machine.

THE INTERNET EXPLORER USER AGENT

Confused yet? Hold on to your butts. The Internet Explorer user agent string is the level 2 version of the previous string. Let’s look at:

Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)

I found some light reading to explain some of what we are about to dive into. Most important from that page is this line: “When the F12 developer tools are used to change the browser mode of Internet Explorer, the version token of the user-agent string is modified to appear so that the browser appears to be an earlier version. This is done to allow browser specific content to be served to Internet Explorer and is usually necessary only when websites have not been updated to reflect current versions of the browser. When this happens, a Trident token is added to the user-agent string. This token includes a version number that enables you to identify the version of the browser, regardless of the current browser mode.”

TRANSLATION: Though the browser version above looks like MSIE 9.0 (that’s clearly what the string says), the Trident version identifies the browser as actually Internet Explorer 11.

I am 90% sure that our site has many many many many many customizations done to deal specifically with Internet Explorer funny business. This is why the browser so often appears as MSIE 7.0, like this example, which is actually IE 11, too:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MAEM; .NET4.0C; InfoPath.1)

If you’d like additional information on Trident, it can be found here. Just to summarize: for user agent strings from Internet Explorer, the important detail for determining which browser they actually came from is that Trident bit.
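A tiny R sketch of that idea (mine, not from the post): pull the Trident token out of the string and map it to the real IE version, regardless of what the MSIE token claims. The example string is abbreviated.

##Trident token -> actual Internet Explorer version
trident_to_ie <- c("Trident/7.0" = "Internet Explorer 11",
                   "Trident/6.0" = "Internet Explorer 10",
                   "Trident/5.0" = "Internet Explorer 9",
                   "Trident/4.0" = "Internet Explorer 8")

ua <- "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/7.0; .NET4.0C; .NET4.0E)"
token <- regmatches(ua, regexpr("Trident/[0-9]+\\.[0-9]+", ua))
trident_to_ie[token]  ##"Internet Explorer 11", despite the MSIE 9.0 token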
PUTTING THE PIECES TOGETHER

Ok ok ok… now we can at least read the string - maybe there are a bunch of questions about a lot of this, but we can pull the browser version at this point. After pulling all of this information together and getting a general understanding of it, I read this brief history of user agent strings. Now I understand why they are the way they are - though I still think it’s stupid.

DECIPHERING USER AGENTS

If you, like me, need to translate these user agent strings into something that normal people can understand, use this table for reference. We use Splunk to do our web log analysis; by keying on the “BIT THAT MATTERS,” I was able to build a case statement to translate the user agent strings into human readable analysis.

BROWSER | USER AGENT STRING EXAMPLE | BIT THAT MATTERS
Internet Explorer 11 | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E) | Trident/7.0
Internet Explorer 10 | Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/6.0; .NET4.0E; .NET4.0C; .NET CLR 3.5.30729; .NET CLR 2.0.50727; .NET CLR 3.0.30729; MDDCJS) | Trident/6.0
Internet Explorer 9 | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; BOIE9;ENUSMSCOM) | Trident/5.0
Mozilla Firefox 4X.x | Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:40.0) Gecko/20100101 Firefox/40.0 | Firefox/4
Mozilla Firefox 3X.x | Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Firefox/38.0 | Firefox/3
Google Chrome 4X.x | Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36 | Chrome/4
Google Chrome 3X.x | Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36 | Chrome/3
Apple Safari 8.x | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 | Version/8
Apple Safari 7.x | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/600.3.18 (KHTML, like Gecko) Version/7.1.3 Safari/537.85.12 | Version/7
Apple Safari 6.x | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/6.1.6 Safari/537.78.2 | Version/6
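Here is a rough sketch of that translation logic in R (mine - not the actual Splunk case statement; the function name is illustrative). Following the Safari note above, the Safari rows key on the Version/ token rather than the word “Safari,” which also shows up in Chrome strings:

##map a raw user agent string to a human readable browser name
##using the "BIT THAT MATTERS" from the table above
browser_from_ua <- function(ua) {
  if (grepl("Trident/7.0", ua, fixed = TRUE)) return("Internet Explorer 11")
  if (grepl("Trident/6.0", ua, fixed = TRUE)) return("Internet Explorer 10")
  if (grepl("Trident/5.0", ua, fixed = TRUE)) return("Internet Explorer 9")
  if (grepl("Firefox/4",   ua, fixed = TRUE)) return("Mozilla Firefox 4X.x")
  if (grepl("Firefox/3",   ua, fixed = TRUE)) return("Mozilla Firefox 3X.x")
  if (grepl("Chrome/4",    ua, fixed = TRUE)) return("Google Chrome 4X.x")
  if (grepl("Chrome/3",    ua, fixed = TRUE)) return("Google Chrome 3X.x")
  if (grepl("Version/8",   ua, fixed = TRUE)) return("Apple Safari 8.x")
  if (grepl("Version/7",   ua, fixed = TRUE)) return("Apple Safari 7.x")
  if (grepl("Version/6",   ua, fixed = TRUE)) return("Apple Safari 6.x")
  "Other"
}

browser_from_ua("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36")
##returns "Google Chrome 4X.x"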