# AI

### Model Comparison - ROC Curves & AUC

#### INTRODUCTION

Whether you are a data professional or in a job that requires data-driven decisions, predictive analytics and related products (aka machine learning, ML, artificial intelligence, or AI) are here, and understanding them is paramount. They are being used to drive industry. Because of this, understanding how to compare predictive models is very important.

This post gets into a very popular method of describing how well a model performs: the Area Under the Curve (AUC) metric. As the term implies, AUC is a measure of the area under a curve. The curve referenced is the Receiver Operating Characteristic (ROC) curve. The ROC curve is a way to visually represent how the True Positive Rate (TPR) increases as the False Positive Rate (FPR) increases.

In plain English, the ROC curve is a visualization of how well a predictive model is ordering the outcome: can it separate the two classes (TRUE/FALSE)? If not (most of the time it is not perfect), how close does it get? This last question can be answered with the AUC metric.

#### THE BACKGROUND

Before I explain, let’s take a step back and understand the foundations of TPR and FPR.

For this post we are talking about a binary prediction (TRUE/FALSE). This could be answering a question like: Is this fraud? (TRUE/FALSE).

In a predictive model, you get some right and some wrong for both the TRUE and FALSE. Thus, you have four categories of outcomes:

- True positive (TP): I predicted TRUE and it was actually TRUE
- False positive (FP): I predicted TRUE and it was actually FALSE
- True negative (TN): I predicted FALSE and it was actually FALSE
- False negative (FN): I predicted FALSE and it was actually TRUE

From these, you can create a number of additional metrics that measure various things. In ROC curves, there are two that are important:

True Positive Rate aka Sensitivity (TPR): out of all the actual TRUE outcomes, how many did I predict TRUE?

$$TPR = sensitivity = \frac{TP}{TP + FN}$$

Higher is better!
False Positive Rate aka 1 - Specificity (FPR): out of all the actual FALSE outcomes, how many did I predict TRUE?

$$FPR = 1 - specificity = 1 - \frac{TN}{TN + FP} = \frac{FP}{FP + TN}$$

Lower is better!

#### BUILDING THE ROC CURVE

For the sake of the example, I built 3 models to compare: Random Forest, Logistic Regression, and random prediction using a uniform distribution.

##### Step 1: Rank Order Predictions

To build the ROC curve for each model, you first rank order your predictions:

| Actual | Predicted |
|--------|-----------|
| FALSE  | 0.9291    |
| FALSE  | 0.9200    |
| TRUE   | 0.8518    |
| TRUE   | 0.8489    |
| TRUE   | 0.8462    |
| TRUE   | 0.7391    |

##### Step 2: Calculate TPR & FPR for First Iteration

Now, we step through the table. Using the first row as the “cutoff” (effectively the row most likely to be TRUE), we say that the first row is predicted TRUE and the remaining rows are predicted FALSE.

From the ranked table, we can see that the first row is actually FALSE, though we are predicting it TRUE. This leads to the following metrics for our first iteration:

| Iteration | TPR | FPR   | Sensitivity | Specificity | True.Positive | False.Positive | True.Negative | False.Negative |
|-----------|-----|-------|-------------|-------------|---------------|----------------|---------------|----------------|
| 1         | 0   | 0.037 | 0           | 0.963       | 0             | 1              | 26            | 11             |

This is what we’d expect. We have a 0% TPR on the first iteration because we got that single prediction wrong. Since we’ve only got 1 false positive out of 27 actual FALSE outcomes, our FPR is still low: 3.7%.

##### Step 3: Iterate Through the Remaining Predictions

Now, let’s go through all of the possible cut points and calculate the TPR and FPR.
| Actual Outcome | Predicted Outcome | Model | Rank | True Positive Rate | False Positive Rate | Sensitivity | Specificity | True Negative | True Positive | False Negative | False Positive |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FALSE | 0.9291 | Logistic Regression | 1 | 0.0000 | 0.0370 | 0.0000 | 0.9630 | 26 | 0 | 11 | 1 |
| FALSE | 0.9200 | Logistic Regression | 2 | 0.0000 | 0.0741 | 0.0000 | 0.9259 | 25 | 0 | 11 | 2 |
| TRUE | 0.8518 | Logistic Regression | 3 | 0.0909 | 0.0741 | 0.0909 | 0.9259 | 25 | 1 | 10 | 2 |
| TRUE | 0.8489 | Logistic Regression | 4 | 0.1818 | 0.0741 | 0.1818 | 0.9259 | 25 | 2 | 9 | 2 |
| TRUE | 0.8462 | Logistic Regression | 5 | 0.2727 | 0.0741 | 0.2727 | 0.9259 | 25 | 3 | 8 | 2 |
| TRUE | 0.7391 | Logistic Regression | 6 | 0.3636 | 0.0741 | 0.3636 | 0.9259 | 25 | 4 | 7 | 2 |

##### Step 4: Repeat Steps 1-3 for Each Model

Calculate the TPR & FPR for each rank and model!

##### Step 5: Plot the Results & Calculate AUC

In the plotted results, the Random Forest does remarkably well. It perfectly separated the outcomes in this example (to be fair, this is really small data and test data). What I mean is, when the data is rank ordered by the predicted likelihood of being TRUE, the actual TRUE outcomes are grouped together at the top. There are no false positives before the last true positive, so the curve reaches a TPR of 1 while the FPR is still 0. The Area Under the Curve (AUC) is 1 ($$area = height \times width$$ for a rectangle/square).

Logistic Regression does well - ~80% AUC is nothing to sneeze at.

The random prediction does just better than a coin flip (50% AUC), but this is just random chance and a small sample.

#### SUMMARY

The AUC is a very important metric for comparing models. To properly understand it, you need to understand the ROC curve and the underlying calculations.

In the end, AUC shows how good a model is at classifying. The better it can separate the TRUEs from the FALSEs, the closer to 1 the AUC will be. This means the True Positive Rate is increasing faster than the False Positive Rate as you move down the ranking. More True Positives is better than more False Positives in prediction.
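The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the post: the function names `roc_points` and `auc_trapezoid` are made up for this sketch, the class totals (11 actual TRUE, 27 actual FALSE) are read off the first metrics row, and the "perfect" ranking at the end is a hypothetical four-row example. It also assumes there are no tied scores.

```python
# Top six Logistic Regression predictions from the post, already sorted
# from highest to lowest score: (actual outcome, predicted score).
ranked = [
    (False, 0.9291),
    (False, 0.9200),
    (True, 0.8518),
    (True, 0.8489),
    (True, 0.8462),
    (True, 0.7391),
]

# Class totals implied by the first metrics row (TP + FN = 11, TN + FP = 27).
TOTAL_TRUE = 11
TOTAL_FALSE = 27


def roc_points(rows, total_true, total_false):
    """Return one (rank, TPR, FPR) tuple per cutoff, walking down the ranking."""
    points = []
    tp = fp = 0
    for rank, (actual, _score) in enumerate(rows, start=1):
        # Everything at or above this rank is predicted TRUE.
        if actual:
            tp += 1
        else:
            fp += 1
        points.append((rank, tp / total_true, fp / total_false))
    return points


def auc_trapezoid(fprs, tprs):
    """Area under a ROC curve from matching FPR/TPR lists, using trapezoids."""
    area = 0.0
    for i in range(1, len(fprs)):
        area += (fprs[i] - fprs[i - 1]) * (tprs[i] + tprs[i - 1]) / 2
    return area


points = roc_points(ranked, TOTAL_TRUE, TOTAL_FALSE)
for rank, tpr, fpr in points:
    print(f"rank {rank}: TPR={tpr:.4f} FPR={fpr:.4f}")

# Hypothetical perfectly separated ranking: every TRUE row comes first.
perfect = [(True, 0.9), (True, 0.8), (False, 0.3), (False, 0.1)]
perfect_points = roc_points(perfect, total_true=2, total_false=2)
# Prepend the (0, 0) corner so the curve starts at the origin.
perfect_fpr = [0.0] + [p[2] for p in perfect_points]
perfect_tpr = [0.0] + [p[1] for p in perfect_points]
perfect_auc = auc_trapezoid(perfect_fpr, perfect_tpr)
print(perfect_auc)
```

The printed TPR and FPR values for ranks 1 through 6 should match the Logistic Regression table, and the perfectly separated ranking fills the whole unit square. With tied scores, the tied rows would need to be grouped into a single cutoff before computing the points.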

# articles

### How Data Science Is Keeping Your Cell Phone Info Safe

I am honored to have been interviewed by Philly Magazine about how my experience from Villanova's Masters in Applied Statistics has influenced my work in analytics at Comcast!

# AUC

### Model Comparison - ROC Curves & AUC

# cartoDB

### Exploring Open Data - Philadelphia Parking Violations

#### Introduction

A few weeks ago, I stumbled across Dylan Purcell’s article on Philadelphia Parking Violations. This is a nice glimpse of the data, but I wanted to get a taste of it myself. I went and downloaded the entire data set of Parking Violations in Philadelphia from the OpenDataPhilly website and came up with a few questions after checking out the data:

- How many tickets are in the data set?
- What is the range of dates in the data? Are there missing days/data?
- What was the biggest/smallest individual fine? What were those fines for? Who issued those fines?
- What was the average individual fine amount?
- What day had the most/least count of fines? What is the average amount per day?
- How much $ in fines did they write each day?
- What hour of the day are the most fines issued?
- What day of the week are the most fines issued?
- What state has been issued the most fines?
- Who (what individual) has been issued the most fines?
- How much does the individual with the most fines owe the city?
- How many people have been issued fines?
- What fines are issued the most/least?

And finally to the cool stuff:

- Where were the most fines?
- Can I see them on a heat map?
- Can I predict the amount of parking tickets by weather data and other factors using linear regression? How about using Random Forests?

#### Data Insights

This data set has 5,624,084 tickets in it and spans from January 1, 2012 through September 30, 2015 - an exact range of 1368.881 days. I was glad to find that there are no missing days in the data set.

The biggest fine, $2000 (OUCH!), was issued (many times) by the police for “ATV on Public Property.” The smallest fine, $15, was also issued by the police, for “parking over the time limit.” The average fine for a violation in Philadelphia over the time range was $46.33.

The most violations occurred on November 30, 2012, when 6,040 were issued. The fewest, unsurprisingly, were issued on Christmas Day, 2014, when only 90 were issued.
On average, PPA and the other 9 agencies that issued tickets (more on that below) issued 4,105.17 tickets per day. All of those tickets add up to $190,193.50 in fines issued to the residents and visitors of Philadelphia every day!!!

Digging a little deeper, I find that the most popular hour of the day for getting a ticket is 12 noon; 5AM nets the fewest tickets. Thursdays see the most tickets written (Thursdays and Fridays are higher than the rest of the week); Sundays see the fewest (pretty obvious). Another obvious insight is that PA-licensed drivers were issued the most tickets.

Looking at individuals, there was one person who was issued 1,463 tickets (that's more than 1 violation per day on average) for a whopping $36,471. In just looking at a few of their tickets, it seems like it is probably a delivery vehicle that delivers to Chinatown (tickets for “Stop Prohibited” and “Bus Only Zone” in the Chinatown area). I’d love to hear more about why this person has so many tickets and what you do about that…

1,976,559 people - let me reiterate - nearly 2 million unique vehicles have been issued fines over the three and three quarter years this data set encompasses. That’s so many!!! That is 2.85 tickets per vehicle, on average (of course that excludes all of the cars that were here and never ticketed). That makes me feel much better about how many tickets I got while I lived in the city.

And… who are the agencies behind all of this? It is no surprise that PPA issues the most. There are 11 agencies in all. Seems like all of the policing agencies like to get in on the fun from time to time.

| Issuing Agency | Count |
|----------------|-------|
| PPA | 4,979,292 |
| PHILADELPHIA POLICE | 611,348 |
| CENTER CITY DISTRICT | 9,628 |
| SEPTA | 9,342 |
| UPENN POLICE | 6,366 |
| TEMPLE POLICE | 4,055 |
| HOUSING AUTHORITY | 2,137 |
| PRISON CORRECTIONS OFFICER | 295 |
| POST OFFICE | 121 |
| FAIRMOUNT DISTRICT | 120 |

#### Mapping the Violations

Where are you most likely to get a violation?
Is there anywhere that is completely safe? Looking at the city as a whole, you can see that there are some places that are “hotter” than others. I played around in cartoDB to try to visualize this as well, but Tableau seemed to do a decent enough job (though these are just screenshots).

Zooming in, you can see that there are some distinct areas where tickets are given out in greater quantity.

Looking one level deeper, you can see that there are some areas like Center City, east Washington Avenue, Passyunk Ave, and Broad Street that seem to be very highly patrolled.

#### Summary

I created the above maps in Tableau. I used R to summarize the data. The R scripts, raw and processed data, and Tableau workbook can be found in my github repo.

In the next post, I use weather data and other parameters to predict how many tickets will be written on a daily basis.
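A couple of the headline figures can be sanity-checked from the rounded numbers quoted in the post. The snippet below is only a back-of-the-envelope check with variable names chosen for illustration; the post's own figures were presumably computed in R from the unrounded data, so the daily dollar amount only comes out approximately.

```python
# Rounded figures quoted in the post.
total_tickets = 5_624_084
unique_vehicles = 1_976_559
tickets_per_day = 4_105.17
average_fine = 46.33

# Average number of tickets per ticketed vehicle.
tickets_per_vehicle = total_tickets / unique_vehicles

# Rough daily fine total: average tickets per day times average fine.
approx_daily_fines = tickets_per_day * average_fine

print(f"{tickets_per_vehicle:.2f} tickets per vehicle")
print(f"${approx_daily_fines:,.2f} in fines per day (approximate)")
```

The first value rounds to the 2.85 tickets per vehicle reported above, and the second lands within a dollar or so of the reported $190,193.50.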

# D3.js

### Prime Number Patterns

I found a thought-provoking and beautiful visualization on the D3 website regarding prime numbers. What the visualization shows is that if you draw periodic curves beginning at the origin for each positive integer, the prime numbers will be intersected by only two curves: the prime's own curve and the curve for one. When I saw this, my mind was blown. How interesting… and also how obvious. The definition of a prime is that it can only be divided by itself and one (duh). This is a visualization of that fact. The patterns that emerge are stunning. I wanted to build the data and visualization for myself in R. While not as spectacular as the original I found, it was still a nice adventure. I used Plotly to visualize the data. The code can be found on github. Here is the visualization:
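Separately from the plot, the divisibility fact behind the picture can be checked numerically. The sketch below is in Python rather than the R used for the post, and it assumes that the curve for integer k touches the number line exactly at the multiples of k; the function names are made up for illustration.

```python
def curves_through(n):
    """Periods k (1..n) whose curve touches the number line at n, i.e. k divides n."""
    return [k for k in range(1, n + 1) if n % k == 0]


def primes_up_to(limit):
    """Integers from 2 to limit that are touched by exactly two curves."""
    return [n for n in range(2, limit + 1) if len(curves_through(n)) == 2]


print(curves_through(12))
print(curves_through(13))
print(primes_up_to(30))
```

A composite number such as 12 is touched by many curves, while 13 is touched only by the curves for 1 and 13, which is exactly the pattern the visualization makes visible.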

### GAP RE-MINDER

A demonstration D3 project, shamelessly ripping off Gapminder.

### Open Data Day - DC Hackathon

For those of you who aren’t stirred from bed in the small hours to learn data science, you might have missed that March 5th was International Open Data Day. There are hundreds of local events around the world; I was lucky enough to attend DC’s Open Data Day Hackathon. I met a bunch of great people doing noble things with data who taught me a crap-ton (scientific term) and also validated my love for data science and how much I’ve learned since beginning my journey almost two years ago.

Here is a quick rundown of what I learned and some helpful links so that you can find out more, too. Since it is an Open Data event, everything was well documented on the hackathon hackpad.

#### Introduction to Open Data

Eric Mill gave a really nice overview of what JSON is and how to use APIs to access the JSON and thus the data the website is conveying. Though many APIs are open and documented, many are not. Eric gave some tips on how to access that data, too. This session really opened my eyes to how to access that previously unusable data that was hidden in plain sight in the text of websites.

#### Data Science Primer

This was one of the highlights for me - a couple of NIST Data Scientists, Pri Oberoi and Star Ying, gave a presentation and walkthrough on how to use k-means clustering to identify groupings in your data. The data and jupyter notebook are available on github. I will definitely be using this in my journey to better detect and remediate compromised user accounts at Comcast.

#### Hackathon

I joined a group that was working to use data science to identify opioid overuse. Though I didn’t add much (the group was filled with some really, really smart people), I was able to visualize the data using R and share some of those techniques with the team.

#### Intro to D3 Visualizations

The last session and probably my favorite was a tutorial on building out a D3 Visualization.
Chris Given walked a packed house through building a D3 viz step-by-step, giving some background on why things work the way they work and showing some great resources. I am particularly proud of the results (though I only followed his instruction to build this).

#### Closing

I also attended 2 sessions about using the command line that totally demystified the shell prompt. All in all, it was a great two days! I will definitely be back next year (unless I can convince someone to do one in Philly).

# data analysis

### Exploring Open Data - Philadelphia Parking Violations

#### Introduction

A few weeks ago, I stumbled across Dylan Purcell’s article on Philadelphia Parking Violations. It is a nice glimpse of the data, but I wanted to get a taste of it myself. I downloaded the entire data set of Parking Violations in Philadelphia from the OpenDataPhilly website and came up with a few questions after checking out the data:

- How many tickets are in the data set?
- What is the range of dates in the data? Are there missing days/data?
- What was the biggest/smallest individual fine? What were those fines for? Who issued those fines?
- What was the average individual fine amount?
- What day had the most/least count of fines? What is the average amount per day?
- How much $ in fines did they write each day?
- What hour of the day are the most fines issued?
- What day of the week are the most fines issued?
- What state has been issued the most fines?
- Who (what individual) has been issued the most fines?
- How much does the individual with the most fines owe the city?
- How many people have been issued fines?
- What fines are issued the most/least?

And finally, the cool stuff:

- Where were the most fines? Can I see them on a heat map?
- Can I predict the amount of parking tickets by weather data and other factors using linear regression? How about using Random Forests?

#### Data Insights

This data set has 5,624,084 tickets in it and spans from January 1, 2012 through September 30, 2015 - an exact range of 1368.881 days. I was glad to find that there are no missing days in the data set.

The biggest fine, $2000 (OUCH!), was issued (many times) by the police for “ATV on Public Property.” The smallest fine, $15, was also issued by the police, for “parking over the time limit.” The average fine for a violation in Philadelphia over the time range was $46.33.

The most violations occurred on November 30, 2012, when 6,040 were issued. The fewest, unsurprisingly, were issued on Christmas Day, 2014, when only 90 were issued.

On average, PPA and the other 9 agencies that issued tickets (more on that below) issued 4,105.17 tickets per day. All of those tickets add up to $190,193.50 in fines issued to the residents and visitors of Philadelphia every day!!!

Digging a little deeper, I find that the most popular hour of the day for getting a ticket is 12 noon; 5AM nets the fewest tickets. Thursdays see the most tickets written (Thursdays and Fridays are higher than the rest of the week); Sundays see the fewest (pretty obvious). Another obvious insight is that PA licensed drivers were issued the most tickets.

Looking at individuals, there was one person who was issued 1,463 tickets (that’s more than 1 violation per day on average) for a whopping $36,471. In just looking at a few of their tickets, it seems like it is probably a delivery vehicle that delivers to Chinatown (tickets for “Stop Prohibited” and “Bus Only Zone” in the Chinatown area). I’d love to hear more about why this person has so many tickets and what you do about that…

1,976,559 people - let me reiterate - nearly 2 million unique vehicles have been issued fines over the three and three quarter years this data set encompasses. That’s so many!!! That is 2.85 tickets per vehicle, on average (of course that excludes all of the cars that were here and never ticketed). That makes me feel much better about how many tickets I got while I lived in the city.

And… who are the agencies behind all of this? It is no surprise that PPA issues the most. There are 11 agencies in all. Seems like all of the policing agencies like to get in on the fun from time to time.

| Issuing Agency | Count |
| --- | ---: |
| PPA | 4,979,292 |
| PHILADELPHIA POLICE | 611,348 |
| CENTER CITY DISTRICT | 9,628 |
| SEPTA | 9,342 |
| UPENN POLICE | 6,366 |
| TEMPLE POLICE | 4,055 |
| HOUSING AUTHORITY | 2,137 |
| PRISON CORRECTIONS OFFICER | 295 |
| POST OFFICE | 121 |
| FAIRMOUNT DISTRICT | 120 |

#### Mapping the Violations

Where are you most likely to get a violation? Is there anywhere that is completely safe? Looking at the city as a whole, you can see that there are some places that are “hotter” than others. I played around in cartoDB to try to visualize this as well, but Tableau seemed to do a decent enough job (though these are just screenshots).

Zooming in, you can see that there are some distinct areas where tickets are given out in greater quantity.

Looking one level deeper, you can see that there are some areas like Center City, east Washington Avenue, Passyunk Ave, and Broad Street that seem to be very highly patrolled.

#### Summary

I created the above maps in Tableau. I used R to summarize the data. The R scripts, raw and processed data, and Tableau workbook can be found in my github repo. In the next post, I use weather data and other parameters to predict how many tickets will be written on a daily basis.
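The summaries in Data Insights were produced in R; the sketch below is not that code. It is a minimal Python illustration of the same kind of roll-up (ticket count, average fine, busiest day and hour, missing days). The four records, the `(timestamp, fine)` layout, and the `summarize` function are all made up for the example and do not reflect the real data set's schema.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical sample records: (issue timestamp, fine in dollars).
tickets = [
    ("2012-11-30 12:05", 36.0),
    ("2012-11-30 12:40", 51.0),
    ("2012-12-01 09:15", 26.0),
    ("2012-12-03 12:10", 301.0),
]


def summarize(tickets):
    """Roll ticket records up into a few daily and hourly summaries."""
    stamps = [datetime.strptime(ts, "%Y-%m-%d %H:%M") for ts, _ in tickets]
    fines = [fine for _, fine in tickets]

    per_day = Counter(s.date() for s in stamps)   # tickets per calendar day
    per_hour = Counter(s.hour for s in stamps)    # tickets per hour of day

    # Walk every calendar day in the range to find days with no tickets.
    first, last = min(per_day), max(per_day)
    n_days = (last - first).days + 1
    all_days = [first + timedelta(days=i) for i in range(n_days)]
    missing_days = [d for d in all_days if d not in per_day]

    return {
        "ticket_count": len(tickets),
        "average_fine": mean(fines),
        "tickets_per_day": len(tickets) / n_days,
        "fines_per_day": sum(fines) / n_days,
        "busiest_day": per_day.most_common(1)[0][0],
        "busiest_hour": per_hour.most_common(1)[0][0],
        "missing_days": missing_days,
    }


summary = summarize(tickets)
for name, value in summary.items():
    print(name, value)
```

Dividing by the number of calendar days in the range, rather than by the number of days that happen to have tickets, is a deliberate choice here: it keeps the per-day averages honest when some days are missing.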

# data science

### Model Comparison - ROC Curves & AUC

#### INTRODUCTION

Whether you are a data professional or in a job that requires data-driven decisions, predictive analytics and related products (aka machine learning aka ML aka artificial intelligence aka AI) are here, and understanding them is paramount. They are being used to drive industry. Because of this, understanding how to compare predictive models is very important.

This post gets into a very popular method of describing how well a model performs: the Area Under the Curve (AUC) metric.

As the term implies, AUC is a measure of area under the curve. The curve referenced is the Receiver Operating Characteristic (ROC) curve. The ROC curve is a way to visually represent how the True Positive Rate (TPR) increases as the False Positive Rate (FPR) increases.

In plain English, the ROC curve is a visualization of how well a predictive model is ordering the outcome - can it separate the two classes (TRUE/FALSE)? If not (most of the time it is not perfect), how close does it get? This last question can be answered with the AUC metric.

#### THE BACKGROUND

Before I explain, let’s take a step back and understand the foundations of TPR and FPR.

For this post we are talking about a binary prediction (TRUE/FALSE). This could be answering a question like: Is this fraud? (TRUE/FALSE).

In a predictive model, you get some right and some wrong for both the TRUE and FALSE. Thus, you have four categories of outcomes:

- True positive (TP): I predicted TRUE and it was actually TRUE
- False positive (FP): I predicted TRUE and it was actually FALSE
- True negative (TN): I predicted FALSE and it was actually FALSE
- False negative (FN): I predicted FALSE and it was actually TRUE

From these, you can create a number of additional metrics that measure various things. In ROC curves, there are two that are important:

- True Positive Rate aka Sensitivity (TPR): out of all the actual TRUE outcomes, how many did I predict TRUE? $$TPR = sensitivity = \frac{TP}{TP + FN}$$ Higher is better!
- False Positive Rate aka 1 - Specificity (FPR): out of all the actual FALSE outcomes, how many did I predict TRUE? $$FPR = 1 - specificity = 1 - (\frac{TN}{TN + FP}) = \frac{FP}{FP + TN}$$ Lower is better!

#### BUILDING THE ROC CURVE

For the sake of the example, I built 3 models to compare: Random Forest, Logistic Regression, and random prediction using a uniform distribution.

Step 1: Rank Order Predictions

To build the ROC curve for each model, you first rank order your predictions from most to least likely to be TRUE:

| Actual | Predicted |
| --- | --- |
| FALSE | 0.9291 |
| FALSE | 0.9200 |
| TRUE | 0.8518 |
| TRUE | 0.8489 |
| TRUE | 0.8462 |
| TRUE | 0.7391 |

Step 2: Calculate TPR & FPR for First Iteration

Now, we step through the table. Using the first row as the “cutoff” (effectively the row most likely to be TRUE), we say that the first row is predicted TRUE and the remaining rows are predicted FALSE.

From the table above, we can see that the first row is actually FALSE, though we are predicting it TRUE. This leads to the following metrics for our first iteration:

| Iteration | TPR | FPR | Sensitivity | Specificity | True.Positive | False.Positive | True.Negative | False.Negative |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 0.037 | 0 | 0.963 | 0 | 1 | 26 | 11 |

This is what we’d expect. We have a 0% TPR on the first iteration because we got that single prediction wrong. Since we’ve only got 1 false positive out of 27 actual FALSE outcomes, our FPR is still low: 3.7%.

Step 3: Iterate Through the Remaining Predictions

Now, let’s go through all of the possible cut points and calculate the TPR and FPR.

| Actual Outcome | Predicted Outcome | Model | Rank | True Positive Rate | False Positive Rate | Sensitivity | Specificity | True Negative | True Positive | False Negative | False Positive |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FALSE | 0.9291 | Logistic Regression | 1 | 0.0000 | 0.0370 | 0.0000 | 0.9630 | 26 | 0 | 11 | 1 |
| FALSE | 0.9200 | Logistic Regression | 2 | 0.0000 | 0.0741 | 0.0000 | 0.9259 | 25 | 0 | 11 | 2 |
| TRUE | 0.8518 | Logistic Regression | 3 | 0.0909 | 0.0741 | 0.0909 | 0.9259 | 25 | 1 | 10 | 2 |
| TRUE | 0.8489 | Logistic Regression | 4 | 0.1818 | 0.0741 | 0.1818 | 0.9259 | 25 | 2 | 9 | 2 |
| TRUE | 0.8462 | Logistic Regression | 5 | 0.2727 | 0.0741 | 0.2727 | 0.9259 | 25 | 3 | 8 | 2 |
| TRUE | 0.7391 | Logistic Regression | 6 | 0.3636 | 0.0741 | 0.3636 | 0.9259 | 25 | 4 | 7 | 2 |

Step 4: Repeat Steps 1-3 for Each Model

Calculate the TPR & FPR for each rank and model!

Step 5: Plot the Results & Calculate AUC

As you can see below, the Random Forest does remarkably well. It perfectly separated the outcomes in this example (to be fair, this is really small data and test data). What I mean is, when the data is rank ordered by the predicted likelihood of being TRUE, the actual TRUE outcomes are grouped together at the top. There are no false positives. The Area Under the Curve (AUC) is 1 ($$area = height \times width$$ for a rectangle/square).

Logistic Regression does well - ~80% AUC is nothing to sneeze at.

The random prediction does just better than a coin flip (50% AUC), but this is just random chance and a small sample.

#### SUMMARY

The AUC is a very important metric for comparing models. To properly understand it, you need to understand the ROC curve and the underlying calculations.

In the end, AUC shows how good a model is at classifying. The better it can separate the TRUEs from the FALSEs, the closer to 1 the AUC will be. This means the True Positive Rate is increasing faster than the False Positive Rate. More True Positives is better than more False Positives in prediction.
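Steps 1 through 5 can be sketched in a few lines of Python. This is an illustration, not the code behind the post: `roc_points` and `auc` are hypothetical helper names, tied scores are not given any special handling, and the area is computed with the trapezoid rule. The optional `total_pos` and `total_neg` arguments exist only so the six-row excerpt can be scored against the full example's 11 actual TRUE and 27 actual FALSE outcomes.

```python
def roc_points(actuals, scores, total_pos=None, total_neg=None):
    """Return (FPR, TPR) points, one per cutoff, starting at (0, 0)."""
    # Step 1: rank order predictions, most likely TRUE first.
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    pos = total_pos if total_pos is not None else sum(1 for _, a in ranked if a)
    neg = total_neg if total_neg is not None else sum(1 for _, a in ranked if not a)

    # Steps 2-3: move the cutoff down one row at a time and recount.
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, actual in ranked:
        if actual:
            tp += 1   # predicted TRUE, actually TRUE
        else:
            fp += 1   # predicted TRUE, actually FALSE
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Step 5: area under the (FPR, TPR) curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# The six-row Logistic Regression excerpt, scored against the full
# example's totals (11 actual TRUE, 27 actual FALSE).
actuals = [False, False, True, True, True, True]
scores = [0.9291, 0.9200, 0.8518, 0.8489, 0.8462, 0.7391]
excerpt = roc_points(actuals, scores, total_pos=11, total_neg=27)
for rank, (fpr, tpr) in enumerate(excerpt[1:], start=1):
    print(rank, round(tpr, 4), round(fpr, 4))

# A perfectly separated ranking traces the unit square, so its AUC is 1.
perfect = roc_points([True, True, False, False], [0.9, 0.8, 0.3, 0.1])
print(auc(perfect))
```

Because the excerpt covers only the first six cutoffs, its points stop well short of (1, 1); the AUC is only meaningful once every prediction has been stepped through.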

### Prime Number Patterns

I found a very thought-provoking and beautiful visualization on the D3 website regarding prime numbers. What the visualization shows is that if you draw periodic curves beginning at the origin for each positive integer, the prime numbers will be intersected by only two curves: the prime’s own curve and the curve for one.

When I saw this, my mind was blown. How interesting… and also how obvious. The definition of a prime is that it can only be divided by itself and one (duh). This is a visualization of that fact. The patterns that emerge are stunning.

I wanted to build the data and visualization for myself in R. While not as spectacular as the original I found, it was still a nice adventure. I used Plotly to visualize the data. The code can be found on github. Here is the visualization:
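Separately from the plot, the fact it illustrates can be checked in a few lines. The sketch below is not the R code from the repo; it is a small Python illustration under the assumption that the curve for integer k touches the number line at every multiple of k. The function names are made up for the example.

```python
def curves_through(n):
    """Integers k whose periodic curve touches the number line at n.

    The curve for k is assumed to hit every multiple of k, so it
    passes through n exactly when k divides n.
    """
    return [k for k in range(1, n + 1) if n % k == 0]


def primes_up_to(limit):
    """Numbers crossed by exactly two curves: the one for 1 and their own."""
    return [n for n in range(2, limit + 1) if len(curves_through(n)) == 2]


print(curves_through(12))  # a composite number is crossed by many curves
print(curves_through(13))  # a prime is crossed only by 1 and itself
print(primes_up_to(20))
```

This brute-force check is fine for small ranges; it tests every candidate divisor, so it is meant to mirror the picture rather than to be a fast prime sieve.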

### Exploring Open Data - Philadelphia Parking Violations

### GAP RE-MINDER

A demonstration D3 project, shamelessly ripping off Gapminder.

### Open Data Day - DC Hackathon

For those of you who aren’t stirred from bed in the small hours to learn data science, you might have missed that March 5th was International Open Data Day. There are hundreds of local events around the world; I was lucky enough to attend DC’s Open Data Day Hackathon. I met a bunch of great people doing noble things with data who taught me a crap-ton (scientific term) and also validated my love for data science and how much I’ve learned since beginning my journey almost two years ago.

Here is a quick rundown of what I learned and some helpful links so that you can find out more, too. Being that it is an Open Data event, everything was well documented on the hackathon hackpad.

#### Introduction to Open Data

Eric Mill gave a really nice overview of what JSON is and how to use APIs to access the JSON and thus the data the website is conveying. Though many APIs are open and documented, many are not. Eric gave some tips on how to access that data, too. This session really opened my eyes to how to access previously unusable data that was hidden in plain sight in the text of websites.

#### Data Science Primer

This was one of the highlights for me - a couple of NIST Data Scientists, Pri Oberoi and Star Ying, gave a presentation and walkthrough on how to use k-means clustering to identify groupings in your data. The data and jupyter notebook are available on github. I will definitely be using this in my journey to better detect and remediate compromised user accounts at Comcast.

#### Hackathon

I joined a group that was working to use data science to identify opioid overuse. Though I didn’t add much (the group was filled with some really, really smart people), I was able to visualize the data using R and share some of those techniques with the team.

#### Intro to D3 Visualizations

The last session, and probably my favorite, was a tutorial on building out a D3 visualization. Chris Given walked a packed house through building a D3 viz step-by-step, giving some background on why things work the way they do and showing some great resources. I am particularly proud of the results (though I only followed his instruction to build this).

#### Closing

I also attended 2 sessions about using the command line that totally demystified the shell prompt. All in all, it was a great two days! I will definitely be back next year (unless I can convince someone to do one in Philly).

# logistic regression

### Identifying Compromised User Accounts with Logistic Regression

#### INTRODUCTION

As a Data Analyst on Comcast’s Messaging Engineering team, it is my responsibility to report on the platform statuses, identify irregularities, measure the impact of changes, and identify policies to ensure that our system is used as it was intended. Part of the last responsibility is the identification and remediation of compromised user accounts.

The challenge the company faces is being able to detect account compromises faster and remediate them closer to the moment of detection. This post will focus on the methodology and process for modeling the criteria to best detect compromised user accounts in near real-time from outbound email activity.

For obvious reasons, I am only going to speak to the methodologies used; I’ll be vague when it comes to the actual criteria we used.

#### DATA COLLECTION AND CLEANING

Without getting into the finer details of email delivery, there are about 43 terminating actions an email can take when it is sent out of our platform. A message can be dropped for a number of reasons. These are things like the IP or user being on any number of block lists, triggering our spam filters, and other abusive behaviors. The other side of that is that the message will be delivered to its intended recipient.

That said, I was able to create a usage profile for all of our outbound senders in small chunks of time in Splunk (our machine log collection tool of choice). This profile gives a summary per user of how often the messages they sent hit each of the terminating actions described above.

In order to train my model, I matched this usage data to our current compromised detection lists. I created a script in python that added an additional column to the data. If an account was flagged as compromised with our current criteria, it was given a one; if not, a zero.

With the data collected, I am ready to determine the important inputs.

#### DETERMINING INPUTS FOR THE MODEL

In order to determine the important variables in the data, I created a Binary Regression Tree in R using the rpart library. The Binary Regression Tree iterates over the data and “splits” it in order to group compromised accounts together and non-compromised accounts together. It is also a nice way to visualize the data. You can see in the picture below what this looks like.

Because the data is so large, I limited the data to one-day chunks. I then ran this regression tree against each day separately. From that, I was able to determine that there are 6 important variables (4 of which showed up in every regression tree I created; the other 2 showed up in a majority of trees). You can determine the “important” variables by looking in the summary for the number of splits per variable.

With these important variables, it’s time to build the Logistic Regression Model.

#### BUILDING THE MODEL

Now that I have the important variables, I created a python script to build the Logistic Regression Model from them, using the statsmodels package. All of my input variables were highly significant.

I took the logistic regression equation with the coefficients given in the model back to Splunk and tested this on incoming data to see what would come out. I quickly found that it caught many accounts that were really compromised. There were also some accounts being discovered that looked like brute force attacks that never got through - to adjust for that, I added a constraint to the model that the user must have done at least one terminating action that ensured they authenticated successfully (this rules out users coming from a ton of IPs, but failing authentication every time).

#### CONCLUSION

First and foremost, this writeup was intended to be a very high level summary explaining the steps I took to get my final model. What isn’t explained here is how many models I built that were less successful. Though this combination worked for me in the end, you’ll likely need to iterate over the process a number of times to get something successful.

The new detection method for compromised accounts is an opportunity for us to expand our compromise detection and do it in a more real-time manner. This is also a foundation for future detection techniques for malicious IPs and other actors.

With this new method, we will be able to expand the activity types for compromise detection outside of outbound email activity to things like preference changes, password resets, changes to forwarding address, and even application activity outside of the email platform.
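To make the scoring step from BUILDING THE MODEL concrete, here is a minimal sketch of applying a fitted logistic regression equation to one usage profile, plus the "must have authenticated at least once" constraint. Everything specific in it is an assumption for illustration: the feature names, the coefficient values, and the 0.5 threshold are invented, since the real criteria are deliberately not disclosed in the post, and the fit itself (done with statsmodels) is not reproduced here.

```python
import math

# Hypothetical coefficients, standing in for the values a fitted model
# would supply. The real inputs and weights are not public.
INTERCEPT = -4.0
COEFFICIENTS = {
    "blocked_rate": 3.5,        # share of messages stopped by block lists
    "spam_filtered_rate": 2.0,  # share of messages caught by spam filters
    "distinct_ips": 0.15,       # number of distinct sending IPs
}


def compromise_probability(profile):
    """Logistic equation: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = INTERCEPT + sum(
        weight * profile[name] for name, weight in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def is_flagged(profile, threshold=0.5):
    """Flag an account only if it authenticated successfully at least once.

    The constraint filters out brute force attempts that never got in.
    """
    if profile["successful_auth_actions"] < 1:
        return False
    return compromise_probability(profile) >= threshold


suspicious = {
    "blocked_rate": 0.8,
    "spam_filtered_rate": 0.6,
    "distinct_ips": 10,
    "successful_auth_actions": 3,
}
brute_force = dict(suspicious, successful_auth_actions=0)

print(round(compromise_probability(suspicious), 4))
print(is_flagged(suspicious), is_flagged(brute_force))
```

Keeping the authentication check outside the equation, as a hard filter, matches the description of it as a constraint added after the model was fit rather than as another input variable.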

# model comparison

### Model Comparison - ROC Curves & AUC

INTRODUCTION Whether you are a data professional or in a job that requires data driven decisions, predictive analytics and related products (aka machine learning aka ML aka artificial intelligence aka AI) are here and understanding them is paramount. They are being used to drive industry. Because of this, understanding how to compare predictive models is very important. This post gets into a very popular method of decribing how well a model performs: the Area Under the Curve (AUC) metric. As the term implies, AUC is a measure of area under the curve. The curve referenced is the Reciever Operating Characteristic (ROC) curve. The ROC curve is a way to visually represent how the True Positive Rate (TPR) increases as the False Positive Rate (FPR) increases. In plain english, the ROC curve is a visualization of how well a predictive model is ordering the outcome - can it separate the two classes (TRUE/FALSE)? If not (most of the time it is not perfect), how close does it get? This last question can be answered with the AUC metric. THE BACKGROUND Before I explain, let’s take a step back and understand the foundations of TPR and FPR. For this post we are talking about a binary prediction (TRUE/FALSE). This could be answering a question like: Is this fraud? (TRUE/FALSE). In a predictive model, you get some right and some wrong for both the TRUE and FALSE. Thus, you have four categories of outcomes: True positive (TP): I predicted TRUE and it was actually TRUE False positive (FP): I predicted TRUE and it was actually FALSE True negative (TN): I predicted FALSE and it was actually FALSE False negative (FN): I predicted FALSE and it was actually TRUE From these, you can create a number of additional metrics that measure various things. In ROC Curves, there are two that are important: True Positive Rate aka Sensitivity (TPR): out of all the actual TRUE outcomes, how many did I predict TRUE? $$TPR = sensitivity = \frac{TP}{TP + FN}$$ Higher is better! 
False Positive Rate aka 1 - Specificity (FPR): out of all the actual FALSE outcomes, how many did I predict TRUE? $$FPR = 1 - sensitivity = 1 - (\frac{TN}{TN + FP})$$ Lower is better! BUILDING THE ROC CURVE For the sake of the example, I built 3 models to compare: Random Forest, Logistic Regression, and random prediction using a uniform distribution. Step 1: Rank Order Predictions To build the ROC curve for each model, you first rank order your predictions: Actual Predicted FALSE 0.9291 FALSE 0.9200 TRUE 0.8518 TRUE 0.8489 TRUE 0.8462 TRUE 0.7391 Step 2: Calculate TPR &amp; FPR for First Iteration Now, we step through the table. Using a “cutoff” as the first row (effectively the most likely to be TRUE), we say that the first row is predicted TRUE and the remaining are predicted FALSE. From the table below, we can see that the first row is FALSE, though we are predicting it TRUE. This leads to the following metrics for our first iteration: Iteration TPR FPR Sensitivity Specificity True.Positive False.Positive True.Negative False.Negative 1 0 0.037 0 0.963 0 1 26 11 This is what we’d expect. We have a 0% TPR on the first iteration because we got that single prediction wrong. Since we’ve only got 1 false positve, our FPR is still low: 3.7%. Step 3: Iterate Through the Remaining Predictions Now, let’s go through all of the possible cut points and calculate the TPR and FPR. 
Actual Outcome Predicted Outcome Model Rank True Positive Rate False Positive Rate Sensitivity Specificity True Negative True Positive False Negative False Positive FALSE 0.9291 Logistic Regression 1 0.0000 0.0370 0.0000 0.9630 26 0 11 1 FALSE 0.9200 Logistic Regression 2 0.0000 0.0741 0.0000 0.9259 25 0 11 2 TRUE 0.8518 Logistic Regression 3 0.0909 0.0741 0.0909 0.9259 25 1 10 2 TRUE 0.8489 Logistic Regression 4 0.1818 0.0741 0.1818 0.9259 25 2 9 2 TRUE 0.8462 Logistic Regression 5 0.2727 0.0741 0.2727 0.9259 25 3 8 2 TRUE 0.7391 Logistic Regression 6 0.3636 0.0741 0.3636 0.9259 25 4 7 2 Step 4: Repeat Steps 1-3 for Each Model Calculate the TPR &amp; FPR for each rank and model! Step 5: Plot the Results &amp; Calculate AUC As you can see below, the Random Forest does remarkably well. It perfectly separated the outcomes in this example (to be fair, this is really small data and test data). What I mean is, when the data is rank ordered by the predicted likelihood of being TRUE, the actual outcome of TRUE are grouped together. There are no false positives. The Area Under the Curve (AUC) is 1 ($$area = hieght * width$$ for a rectangle/square). Logistic Regression does well - ~80% AUC is nothing to sneeze at. The random prediction does just better than a coin flip (50% AUC), but this is just random chance and a small sample. SUMMARY The AUC is a very important metric for comparing models. To properly understand it, you need to understand the ROC curve and the underlying calculations. In the end, AUC is showing how well a model is at classifying. The better it can separate the TRUEs from the FALSEs, the closer to 1 the AUC will be. This means the True Positive Rate is increasing faster than the False Positive Rate. More True Positives is better than more False Positives in prediction.

### Model Comparision - ROC Curves & AUC

#### INTRODUCTION

Whether you are a data professional or in a job that requires data-driven decisions, predictive analytics and related products (aka machine learning, ML, artificial intelligence, or AI) are here, and understanding them is paramount. They are being used to drive industry. Because of this, understanding how to compare predictive models is very important.

This post gets into a very popular method of describing how well a model performs: the Area Under the Curve (AUC) metric. As the term implies, AUC is a measure of the area under a curve. The curve referenced is the Receiver Operating Characteristic (ROC) curve. The ROC curve is a way to visually represent how the True Positive Rate (TPR) increases as the False Positive Rate (FPR) increases.

In plain English, the ROC curve is a visualization of how well a predictive model orders the outcomes - can it separate the two classes (TRUE/FALSE)? If not (most of the time it is not perfect), how close does it get? This last question can be answered with the AUC metric.

#### THE BACKGROUND

Before I explain, let’s take a step back and understand the foundations of TPR and FPR. For this post we are talking about a binary prediction (TRUE/FALSE). This could be answering a question like: Is this fraud? (TRUE/FALSE).

In a predictive model, you get some right and some wrong for both the TRUE and the FALSE outcomes. Thus, you have four categories of outcomes:

- True positive (TP): I predicted TRUE and it was actually TRUE
- False positive (FP): I predicted TRUE and it was actually FALSE
- True negative (TN): I predicted FALSE and it was actually FALSE
- False negative (FN): I predicted FALSE and it was actually TRUE

From these, you can create a number of additional metrics. For ROC curves, two are important:

- True Positive Rate (TPR), aka Sensitivity: out of all the actual TRUE outcomes, how many did I predict TRUE? $$TPR = sensitivity = \frac{TP}{TP + FN}$$ Higher is better!
- False Positive Rate (FPR), aka 1 - Specificity: out of all the actual FALSE outcomes, how many did I predict TRUE? $$FPR = 1 - specificity = 1 - \frac{TN}{TN + FP}$$ Lower is better!

#### BUILDING THE ROC CURVE

For the sake of the example, I built 3 models to compare: Random Forest, Logistic Regression, and a random prediction drawn from a uniform distribution.

Step 1: Rank Order Predictions

To build the ROC curve for each model, you first rank order your predictions from the highest predicted value to the lowest:

| Actual | Predicted |
|--------|-----------|
| FALSE  | 0.9291    |
| FALSE  | 0.9200    |
| TRUE   | 0.8518    |
| TRUE   | 0.8489    |
| TRUE   | 0.8462    |
| TRUE   | 0.7391    |

Step 2: Calculate TPR & FPR for First Iteration

Now, we step through the table. Using the first row as the “cutoff” (effectively the row most likely to be TRUE), we say that the first row is predicted TRUE and the remaining rows are predicted FALSE. From the ranked table above, we can see that the first row is actually FALSE, though we are predicting it TRUE. This leads to the following metrics for our first iteration:

| Iteration | TPR | FPR   | Sensitivity | Specificity | True.Positive | False.Positive | True.Negative | False.Negative |
|-----------|-----|-------|-------------|-------------|---------------|----------------|---------------|----------------|
| 1         | 0   | 0.037 | 0           | 0.963       | 0             | 1              | 26            | 11             |

This is what we’d expect. We have a 0% TPR on the first iteration because we got that single prediction wrong. Since we’ve only got 1 false positive out of 27 actual FALSE outcomes, our FPR is still low: 3.7%.

Step 3: Iterate Through the Remaining Predictions

Now, let’s go through all of the possible cut points and calculate the TPR and FPR.

| Actual Outcome | Predicted Outcome | Model | Rank | True Positive Rate | False Positive Rate | Sensitivity | Specificity | True Negative | True Positive | False Negative | False Positive |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FALSE | 0.9291 | Logistic Regression | 1 | 0.0000 | 0.0370 | 0.0000 | 0.9630 | 26 | 0 | 11 | 1 |
| FALSE | 0.9200 | Logistic Regression | 2 | 0.0000 | 0.0741 | 0.0000 | 0.9259 | 25 | 0 | 11 | 2 |
| TRUE | 0.8518 | Logistic Regression | 3 | 0.0909 | 0.0741 | 0.0909 | 0.9259 | 25 | 1 | 10 | 2 |
| TRUE | 0.8489 | Logistic Regression | 4 | 0.1818 | 0.0741 | 0.1818 | 0.9259 | 25 | 2 | 9 | 2 |
| TRUE | 0.8462 | Logistic Regression | 5 | 0.2727 | 0.0741 | 0.2727 | 0.9259 | 25 | 3 | 8 | 2 |
| TRUE | 0.7391 | Logistic Regression | 6 | 0.3636 | 0.0741 | 0.3636 | 0.9259 | 25 | 4 | 7 | 2 |

Step 4: Repeat Steps 1-3 for Each Model

Calculate the TPR & FPR for each rank and model!

Step 5: Plot the Results & Calculate AUC

As the plot shows, the Random Forest does remarkably well. It perfectly separated the outcomes in this example (to be fair, this is really small data and test data). What I mean is, when the data is rank ordered by the predicted likelihood of being TRUE, the actual TRUE outcomes are grouped together at the top. No FALSE outcome is ranked above a TRUE one, so the TPR reaches 1 before any false positive appears. The Area Under the Curve (AUC) is 1 ($$area = height \times width$$ for a rectangle/square).

Logistic Regression does well - ~80% AUC is nothing to sneeze at. The random prediction does just better than a coin flip (50% AUC), but this is just random chance and a small sample.

#### SUMMARY

The AUC is a very important metric for comparing models. To properly understand it, you need to understand the ROC curve and the underlying calculations. In the end, AUC shows how good a model is at classifying. The better it can separate the TRUEs from the FALSEs, the closer to 1 the AUC will be. This means the True Positive Rate is increasing faster than the False Positive Rate. More True Positives is better than more False Positives in prediction.
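Steps 1 through 3 can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the post: the function name `cumulative_rates` is hypothetical, and the totals of 11 actual TRUE and 27 actual FALSE outcomes are read off the first-iteration counts (TP + FN and TN + FP). Only the six ranked rows shown in the table are used.

```python
def cumulative_rates(ranked_actuals, total_pos, total_neg):
    """Return (rank, TPR, FPR) for each cutoff.

    ranked_actuals: actual outcomes sorted by descending predicted value.
    total_pos / total_neg: counts of actual TRUE / FALSE in the full data set.
    """
    rates = []
    tp = fp = 0
    for rank, actual in enumerate(ranked_actuals, start=1):
        # Every row at or above the cutoff is predicted TRUE.
        if actual:
            tp += 1
        else:
            fp += 1
        rates.append((rank, tp / total_pos, fp / total_neg))
    return rates


# The six highest-ranked Logistic Regression rows from the table.
top_six = [False, False, True, True, True, True]
rates = cumulative_rates(top_six, total_pos=11, total_neg=27)

for rank, tpr, fpr in rates:
    print(f"rank {rank}: TPR={tpr:.4f} FPR={fpr:.4f}")
```

Rounded to four decimals, the printed rates should agree with the TPR and FPR columns of the Step 3 table.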

# open data

### Prime Number Patterns

I found a very thought-provoking and beautiful visualization on the D3 website regarding prime numbers. What the visualization shows is that if you draw periodic curves beginning at the origin for each positive integer, the prime numbers will be intersected by only two curves: the prime’s own curve and the curve for one. When I saw this, my mind was blown. How interesting… and also how obvious. The definition of a prime is that it can only be divided by itself and one (duh). This is a visualization of that fact. The patterns that emerge are stunning. I wanted to build the data and visualization for myself in R. While not as spectacular as the original I found, it was still a nice adventure. I used Plotly to visualize the data. The code can be found on GitHub. Here is the visualization:
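Separately from the plot, the divisibility idea behind it can be sketched in Python. This is only an illustration and not the R code mentioned above: it assumes the curve for an integer k touches the axis at every multiple of k, and the function names are hypothetical.

```python
def curves_through(n):
    """Integers k whose periodic curve passes through n, i.e. the divisors of n."""
    return [k for k in range(1, n + 1) if n % k == 0]


def is_prime(n):
    # A prime is hit by exactly two curves: the curve for 1 and its own curve.
    return len(curves_through(n)) == 2


primes = [n for n in range(1, 31) if is_prime(n)]
print(curves_through(12))
print(primes)
```

A composite number such as 12 is crossed by several curves, while each prime is crossed by exactly two.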

# ROC curves

### Model Comparison - ROC Curves & AUC

#### INTRODUCTION

Whether you are a data professional or in a job that requires data-driven decisions, predictive analytics and related products (aka machine learning, ML, artificial intelligence, or AI) are here, and understanding them is paramount. They are being used to drive industry. Because of this, understanding how to compare predictive models is very important.

This post gets into a very popular method of describing how well a model performs: the Area Under the Curve (AUC) metric. As the term implies, AUC is a measure of the area under a curve. The curve referenced is the Receiver Operating Characteristic (ROC) curve. The ROC curve is a way to visually represent how the True Positive Rate (TPR) increases as the False Positive Rate (FPR) increases.

In plain English, the ROC curve is a visualization of how well a predictive model orders the outcomes - can it separate the two classes (TRUE/FALSE)? If not (most of the time it is not perfect), how close does it get? This last question can be answered with the AUC metric.

#### THE BACKGROUND

Before I explain, let’s take a step back and understand the foundations of TPR and FPR. For this post we are talking about a binary prediction (TRUE/FALSE). This could be answering a question like: Is this fraud? (TRUE/FALSE).

In a predictive model, you get some right and some wrong for both the TRUE and the FALSE outcomes. Thus, you have four categories of outcomes:

- True positive (TP): I predicted TRUE and it was actually TRUE
- False positive (FP): I predicted TRUE and it was actually FALSE
- True negative (TN): I predicted FALSE and it was actually FALSE
- False negative (FN): I predicted FALSE and it was actually TRUE

From these, you can create a number of additional metrics. For ROC curves, two are important:

- True Positive Rate (TPR), aka Sensitivity: out of all the actual TRUE outcomes, how many did I predict TRUE? $$TPR = sensitivity = \frac{TP}{TP + FN}$$ Higher is better!
- False Positive Rate (FPR), aka 1 - Specificity: out of all the actual FALSE outcomes, how many did I predict TRUE? $$FPR = 1 - specificity = 1 - \frac{TN}{TN + FP}$$ Lower is better!

#### BUILDING THE ROC CURVE

For the sake of the example, I built 3 models to compare: Random Forest, Logistic Regression, and a random prediction drawn from a uniform distribution.

Step 1: Rank Order Predictions

To build the ROC curve for each model, you first rank order your predictions from the highest predicted value to the lowest:

| Actual | Predicted |
|--------|-----------|
| FALSE  | 0.9291    |
| FALSE  | 0.9200    |
| TRUE   | 0.8518    |
| TRUE   | 0.8489    |
| TRUE   | 0.8462    |
| TRUE   | 0.7391    |

Step 2: Calculate TPR & FPR for First Iteration

Now, we step through the table. Using the first row as the “cutoff” (effectively the row most likely to be TRUE), we say that the first row is predicted TRUE and the remaining rows are predicted FALSE. From the ranked table above, we can see that the first row is actually FALSE, though we are predicting it TRUE. This leads to the following metrics for our first iteration:

| Iteration | TPR | FPR   | Sensitivity | Specificity | True.Positive | False.Positive | True.Negative | False.Negative |
|-----------|-----|-------|-------------|-------------|---------------|----------------|---------------|----------------|
| 1         | 0   | 0.037 | 0           | 0.963       | 0             | 1              | 26            | 11             |

This is what we’d expect. We have a 0% TPR on the first iteration because we got that single prediction wrong. Since we’ve only got 1 false positive out of 27 actual FALSE outcomes, our FPR is still low: 3.7%.

Step 3: Iterate Through the Remaining Predictions

Now, let’s go through all of the possible cut points and calculate the TPR and FPR.

| Actual Outcome | Predicted Outcome | Model | Rank | True Positive Rate | False Positive Rate | Sensitivity | Specificity | True Negative | True Positive | False Negative | False Positive |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FALSE | 0.9291 | Logistic Regression | 1 | 0.0000 | 0.0370 | 0.0000 | 0.9630 | 26 | 0 | 11 | 1 |
| FALSE | 0.9200 | Logistic Regression | 2 | 0.0000 | 0.0741 | 0.0000 | 0.9259 | 25 | 0 | 11 | 2 |
| TRUE | 0.8518 | Logistic Regression | 3 | 0.0909 | 0.0741 | 0.0909 | 0.9259 | 25 | 1 | 10 | 2 |
| TRUE | 0.8489 | Logistic Regression | 4 | 0.1818 | 0.0741 | 0.1818 | 0.9259 | 25 | 2 | 9 | 2 |
| TRUE | 0.8462 | Logistic Regression | 5 | 0.2727 | 0.0741 | 0.2727 | 0.9259 | 25 | 3 | 8 | 2 |
| TRUE | 0.7391 | Logistic Regression | 6 | 0.3636 | 0.0741 | 0.3636 | 0.9259 | 25 | 4 | 7 | 2 |

Step 4: Repeat Steps 1-3 for Each Model

Calculate the TPR & FPR for each rank and model!

Step 5: Plot the Results & Calculate AUC

As the plot shows, the Random Forest does remarkably well. It perfectly separated the outcomes in this example (to be fair, this is really small data and test data). What I mean is, when the data is rank ordered by the predicted likelihood of being TRUE, the actual TRUE outcomes are grouped together at the top. No FALSE outcome is ranked above a TRUE one, so the TPR reaches 1 before any false positive appears. The Area Under the Curve (AUC) is 1 ($$area = height \times width$$ for a rectangle/square).

Logistic Regression does well - ~80% AUC is nothing to sneeze at. The random prediction does just better than a coin flip (50% AUC), but this is just random chance and a small sample.

#### SUMMARY

The AUC is a very important metric for comparing models. To properly understand it, you need to understand the ROC curve and the underlying calculations. In the end, AUC shows how good a model is at classifying. The better it can separate the TRUEs from the FALSEs, the closer to 1 the AUC will be. This means the True Positive Rate is increasing faster than the False Positive Rate. More True Positives is better than more False Positives in prediction.
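The AUC calculation in Step 5 can be sketched with the trapezoid rule, one common way to approximate the area under a set of ROC points. This is an assumption about the method, since the post does not say how its AUC values were computed, and `auc_trapezoid` is a hypothetical helper name.

```python
def auc_trapezoid(fpr, tpr):
    """Area under a ROC curve given as points sorted by increasing FPR."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]            # step along the FPR axis
        avg_height = (tpr[i] + tpr[i - 1]) / 2  # mean TPR over that step
        area += width * avg_height
    return area


# Perfect separation: TPR climbs to 1 before any false positive appears.
perfect = auc_trapezoid([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])

# Pure chance: the ROC curve follows the diagonal.
chance = auc_trapezoid([0.0, 1.0], [0.0, 1.0])

print(perfect, chance)
```

The perfect case reduces to height times width for a unit square, and the diagonal case gives the 50% coin-flip baseline.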

# tableau

### Exploring Open Data - Philadelphia Parking Violations

#### Introduction

A few weeks ago, I stumbled across Dylan Purcell’s article on Philadelphia Parking Violations. This is a nice glimpse of the data, but I wanted to get a taste of it myself. I went and downloaded the entire data set of Parking Violations in Philadelphia from the OpenDataPhilly website and came up with a few questions after checking out the data:

- How many tickets are in the data set?
- What is the range of dates in the data? Are there missing days/data?
- What was the biggest/smallest individual fine? What were those fines for? Who issued those fines?
- What was the average individual fine amount?
- What day had the most/least count of fines? What is the average amount per day?
- How much $ in fines did they write each day?
- What hour of the day are the most fines issued?
- What day of the week are the most fines issued?
- What state has been issued the most fines?
- Who (what individual) has been issued the most fines?
- How much does the individual with the most fines owe the city?
- How many people have been issued fines?
- What fines are issued the most/least?

And finally to the cool stuff:

- Where were the most fines?
- Can I see them on a heat map?
- Can I predict the amount of parking tickets by weather data and other factors using linear regression? How about using Random Forests?

#### Data Insights

This data set has 5,624,084 tickets in it and spans from January 1, 2012 through September 30, 2015 - an exact range of 1368.881 days. I was glad to find that there are no missing days in the data set.

The biggest fine, $2000 (OUCH!), was issued (many times) by the police for “ATV on Public Property.” The smallest fine, $15, was also issued by the police, for “parking over the time limit.” The average fine for a violation in Philadelphia over the time range was $46.33.

The most violations occurred on November 30, 2012, when 6,040 were issued. The fewest, unsurprisingly, were issued on Christmas day, 2014, when only 90 were issued.

On average, PPA and the other 9 agencies that issued tickets (more on that below) issued 4,105.17 tickets per day. All of those tickets add up to $190,193.50 in fines issued to the residents and visitors of Philadelphia every day!!!

Digging a little deeper, I find that the most popular hour of the day for getting a ticket is 12 noon; 5AM nets the fewest tickets. Thursdays see the most tickets written (Thursdays and Fridays are higher than the rest of the week); Sundays see the fewest (pretty obvious). Another obvious insight is that PA licensed drivers were issued the most tickets.

Looking at individuals, there was one person who was issued 1,463 tickets (that’s more than 1 violation per day on average) for a whopping $36,471. In just looking at a few of their tickets, it seems like it is probably a delivery vehicle that delivers to Chinatown (tickets for “Stop Prohibited” and “Bus Only Zone” in the Chinatown area). I’d love to hear more about why this person has so many tickets and what you do about that…

1,976,559 people - let me reiterate - nearly 2 million unique vehicles have been issued fines over the three and three quarter years this data set encompasses. That’s so many!!! That is 2.85 tickets per vehicle, on average (of course that excludes all of the cars that were here and never ticketed). That makes me feel much better about how many tickets I got while I lived in the city.

And… who are the agencies behind all of this? It is no surprise that PPA issues the most. There are 11 agencies in all. Seems like all of the policing agencies like to get in on the fun from time to time.

| Issuing Agency | Count |
|----------------|-------|
| PPA | 4,979,292 |
| PHILADELPHIA POLICE | 611,348 |
| CENTER CITY DISTRICT | 9,628 |
| SEPTA | 9,342 |
| UPENN POLICE | 6,366 |
| TEMPLE POLICE | 4,055 |
| HOUSING AUTHORITY | 2,137 |
| PRISON CORRECTIONS OFFICER | 295 |
| POST OFFICE | 121 |
| FAIRMOUNT DISTRICT | 120 |

#### Mapping the Violations

Where are you most likely to get a violation? Is there anywhere that is completely safe? Looking at the city as a whole, you can see that there are some places that are “hotter” than others. I played around in CartoDB to try to visualize this as well, but Tableau seemed to do a decent enough job (though these are just screenshots).

Zooming in, you can see that there are some distinct areas where tickets are given out in greater quantity. Looking one level deeper, you can see that there are some areas like Center City, east Washington Avenue, Passyunk Ave, and Broad Street that seem to be very highly patrolled.

#### Summary

I created the above maps in Tableau. I used R to summarize the data. The R scripts, raw and processed data, and Tableau workbook can be found in my GitHub repo. In the next post, I use weather data and other parameters to predict how many tickets will be written on a daily basis.
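As a quick sanity check on the averages quoted in this post, the arithmetic can be reproduced in Python. This is only a sketch built from the rounded figures in the text, not the R scripts from the repo, and the variable names are illustrative. Because the inputs are already rounded, the daily total should land close to, but not exactly on, the quoted $190,193.50.

```python
# Figures quoted in the post.
total_tickets = 5_624_084
unique_vehicles = 1_976_559
avg_tickets_per_day = 4_105.17
avg_fine = 46.33

# Average number of tickets per ticketed vehicle.
tickets_per_vehicle = total_tickets / unique_vehicles

# Approximate dollars in fines written per day.
fines_per_day = avg_tickets_per_day * avg_fine

print(f"tickets per vehicle: {tickets_per_vehicle:.2f}")
print(f"fines per day: ${fines_per_day:,.2f}")
```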

# tableau public

### Jake Learns Data Science Visitor Dashboard

A quick view of visitors to my website. Data pulled from Google Analytics and pushed to Amazon Redshift using Stitch Data.

### Twitter Analysis - R Shiny App

I created a Shiny app that searches Twitter and does some simple analysis.