# data science

### Model Comparison - ROC Curves & AUC

#### Introduction

Whether you are a data professional or in a job that requires data-driven decisions, predictive analytics and related products (aka machine learning, aka ML, aka artificial intelligence, aka AI) are here, and understanding them is paramount. They are being used to drive industry. Because of this, understanding how to compare predictive models is very important. This post gets into a very popular method of describing how well a model performs: the Area Under the Curve (AUC) metric.

As the term implies, AUC is a measure of the area under a curve. The curve referenced is the Receiver Operating Characteristic (ROC) curve. The ROC curve is a way to visually represent how the True Positive Rate (TPR) increases as the False Positive Rate (FPR) increases. In plain English, the ROC curve is a visualization of how well a predictive model is ordering the outcomes: can it separate the two classes (TRUE/FALSE)? If not (and most of the time it is not perfect), how close does it get? This last question can be answered with the AUC metric.

#### The Background

Before I explain, let's take a step back and understand the foundations of TPR and FPR. For this post we are talking about a binary prediction (TRUE/FALSE). This could be answering a question like: Is this fraud? (TRUE/FALSE). In a predictive model, you get some right and some wrong for both the TRUE and FALSE outcomes. Thus, you have four categories of results:

- True positive (TP): I predicted TRUE and it was actually TRUE
- False positive (FP): I predicted TRUE and it was actually FALSE
- True negative (TN): I predicted FALSE and it was actually FALSE
- False negative (FN): I predicted FALSE and it was actually TRUE

From these, you can create a number of additional metrics that measure various things. For ROC curves, there are two that are important:

True Positive Rate, aka Sensitivity (TPR): out of all the actual TRUE outcomes, how many did I predict TRUE? Higher is better!

$$TPR = sensitivity = \frac{TP}{TP + FN}$$

False Positive Rate, aka 1 - Specificity (FPR): out of all the actual FALSE outcomes, how many did I predict TRUE? Lower is better!

$$FPR = 1 - specificity = 1 - \frac{TN}{TN + FP} = \frac{FP}{FP + TN}$$

#### Building the ROC Curve

For the sake of the example, I built 3 models to compare: Random Forest, Logistic Regression, and a random prediction drawn from a uniform distribution.

##### Step 1: Rank Order Predictions

To build the ROC curve for each model, you first rank order your predictions:

| Actual | Predicted |
|--------|-----------|
| FALSE  | 0.9291    |
| FALSE  | 0.9200    |
| TRUE   | 0.8518    |
| TRUE   | 0.8489    |
| TRUE   | 0.8462    |
| TRUE   | 0.7391    |

##### Step 2: Calculate TPR & FPR for the First Iteration

Now, we step through the table. Using the first row as the "cutoff" (effectively the observation most likely to be TRUE), we say that the first row is predicted TRUE and the remaining rows are predicted FALSE. From the table above, we can see that the first row is actually FALSE, even though we are predicting it TRUE. This leads to the following metrics for our first iteration:

| Iteration | TPR | FPR   | Sensitivity | Specificity | True Positive | False Positive | True Negative | False Negative |
|-----------|-----|-------|-------------|-------------|---------------|----------------|---------------|----------------|
| 1         | 0   | 0.037 | 0           | 0.963       | 0             | 1              | 26            | 11             |

This is what we'd expect. We have a 0% TPR on the first iteration because we got that single prediction wrong. Since we've only got 1 false positive, our FPR is still low: 3.7%.

##### Step 3: Iterate Through the Remaining Predictions

Now, let's go through all of the possible cut points and calculate the TPR and FPR at each one.
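This stepping-through is easy to script. Below is a minimal sketch of the loop in R; the data frame `preds` and its columns `actual` (logical) and `predicted` (predicted probability) are hypothetical names standing in for the ranked predictions above, not code from the original analysis.

```r
## MINIMAL SKETCH: TPR & FPR AT EVERY CUT POINT
## ASSUMES A DATA FRAME `preds` WITH A LOGICAL COLUMN `actual`
## AND A NUMERIC COLUMN `predicted` (HYPOTHETICAL NAMES)

## STEP 1: RANK ORDER THE PREDICTIONS, MOST LIKELY TRUE FIRST
preds <- preds[order(-preds$predicted), ]

n_true  <- sum(preds$actual)   # ACTUAL TRUEs  (TP + FN)
n_false <- sum(!preds$actual)  # ACTUAL FALSEs (TN + FP)

## STEPS 2 & 3: TREAT EACH ROW AS THE CUT POINT IN TURN
roc_pts <- data.frame(rank = seq_len(nrow(preds)), TPR = NA, FPR = NA)
for (i in seq_len(nrow(preds))) {
  ## ROWS 1..i ARE PREDICTED TRUE, THE REST PREDICTED FALSE
  tp <- sum(preds$actual[1:i])    # PREDICTED TRUE, ACTUALLY TRUE
  fp <- i - tp                    # PREDICTED TRUE, ACTUALLY FALSE
  roc_pts$TPR[i] <- tp / n_true   # TP / (TP + FN)
  roc_pts$FPR[i] <- fp / n_false  # FP / (FP + TN)
}
```

Plotting `FPR` on the x-axis against `TPR` on the y-axis traces the ROC curve. The table below shows these numbers worked out for the Logistic Regression model.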
| Actual | Predicted | Model               | Rank | TPR    | FPR    | Sensitivity | Specificity | TN | TP | FN | FP |
|--------|-----------|---------------------|------|--------|--------|-------------|-------------|----|----|----|----|
| FALSE  | 0.9291    | Logistic Regression | 1    | 0.0000 | 0.0370 | 0.0000      | 0.9630      | 26 | 0  | 11 | 1  |
| FALSE  | 0.9200    | Logistic Regression | 2    | 0.0000 | 0.0741 | 0.0000      | 0.9259      | 25 | 0  | 11 | 2  |
| TRUE   | 0.8518    | Logistic Regression | 3    | 0.0909 | 0.0741 | 0.0909      | 0.9259      | 25 | 1  | 10 | 2  |
| TRUE   | 0.8489    | Logistic Regression | 4    | 0.1818 | 0.0741 | 0.1818      | 0.9259      | 25 | 2  | 9  | 2  |
| TRUE   | 0.8462    | Logistic Regression | 5    | 0.2727 | 0.0741 | 0.2727      | 0.9259      | 25 | 3  | 8  | 2  |
| TRUE   | 0.7391    | Logistic Regression | 6    | 0.3636 | 0.0741 | 0.3636      | 0.9259      | 25 | 4  | 7  | 2  |

##### Step 4: Repeat Steps 1-3 for Each Model

Calculate the TPR & FPR for each rank and model!

##### Step 5: Plot the Results & Calculate AUC

As you can see below, the Random Forest does remarkably well: it perfectly separated the outcomes in this example (to be fair, this is a really small test data set). What I mean is that when the data is rank ordered by the predicted likelihood of being TRUE, the actual TRUE outcomes are all grouped together at the top, so there are no false positives before the last TRUE is captured. The Area Under the Curve (AUC) is 1 ($$area = height \times width$$ for a rectangle/square). Logistic Regression also does well - an AUC of ~80% is nothing to sneeze at. The random prediction does just better than a coin flip (50% AUC), but that is just random chance on a small sample.

#### Summary

The AUC is a very important metric for comparing models. To properly understand it, you need to understand the ROC curve and the underlying calculations. In the end, AUC shows how well a model classifies: the better it can separate the TRUEs from the FALSEs, the closer the AUC gets to 1, because the True Positive Rate is rising much faster than the False Positive Rate. More true positives with fewer false positives is exactly what you want from a prediction.
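To make the AUC arithmetic concrete, here is a minimal sketch that integrates the curve with the trapezoidal rule, reusing the hypothetical `roc_pts` data frame from the sketch above (again, an illustration rather than the original analysis code). In practice, a package such as pROC will compute this for you via `auc(roc(actual, predicted))`.

```r
## MINIMAL SKETCH: AUC VIA THE TRAPEZOIDAL RULE
## ANCHOR THE CURVE AT (0, 0); THE LAST CUT POINT IS (1, 1)
fpr <- c(0, roc_pts$FPR)
tpr <- c(0, roc_pts$TPR)

## SUM THE TRAPEZOIDS BETWEEN CONSECUTIVE CUT POINTS
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```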

### Machine Learning Demystified

#### Introduction

In my last post, I went over some of the highlights of the open data set of all Philadelphia parking violations. In this post, I'll go through the steps to build a model to predict the number of violations the city issues each day. I'll walk you through cleaning and building the data set, selecting and creating the important features, and building predictive models using Random Forests and Linear Regression.

#### Step 1: Load Packages and Data

Just an initial step to get the right libraries and data loaded in R.

```r
library(plyr)
library(randomForest)

## DATA FILE FROM OPENDATAPHILLY
ptix <- read.csv("Parking_Violations.csv")

## READ IN THE WEATHER DATA (FROM NCDC)
weather_data <- read.csv("weather_data.csv")

## LIST OF ALL FEDERAL HOLIDAYS DURING THE
## RANGE OF THE DATA SET
holidays <- as.Date(c("2012-01-02", "2012-01-16", "2012-02-20", "2012-05-28",
                      "2012-07-04", "2012-09-03", "2012-10-08", "2012-11-12",
                      "2012-11-22", "2012-12-25", "2013-01-01", "2013-01-21",
                      "2013-02-18", "2013-05-27", "2013-07-04", "2013-09-02",
                      "2013-10-14", "2013-11-11", "2013-11-28", "2013-12-25",
                      "2014-01-01", "2014-01-20", "2014-02-17", "2014-05-26",
                      "2014-07-04", "2014-09-01", "2014-10-13", "2014-11-11",
                      "2014-11-27", "2014-12-25", "2015-01-01", "2015-01-19",
                      "2015-02-16", "2015-05-25", "2015-07-03", "2015-09-07"))
```

#### Step 2: Formatting the Data

First things first, we have to total the number of tickets per day from the raw data. For this, I use the plyr function `ddply`. Before I can use `ddply`, I need to format the Issue.Date.and.Time column as a Date variable in the R sense.

```r
days <- as.data.frame(as.Date(ptix$Issue.Date.and.Time,
                              format = "%m/%d/%Y"))
names(days) <- "DATE"
count_by_day <- ddply(days, .(DATE), summarize, count = length(DATE))
```

Next, I do the same date formatting with the weather data.

```r
weather_data$DATE <- as.Date(as.POSIXct(strptime(
  as.character(weather_data$DATE), format = "%Y%m%d")),
  format = "%m/%d/%Y")
```

Now that both the ticket and weather data have the same date format (and column name), we can use the `join` function from the plyr package.

```r
count_by_day <- join(count_by_day, weather_data, by = "DATE")
```

With the data joined by date, it is time to clean. There are a number of columns with unneeded data (the weather station name, for example) and others with little or no data in them, which I flatly remove. The data has also been coded with negative values (-9999) where a measurement was not collected for any number of reasons (I'm not surprised that snow was not measured in the summer); I change any values coded -9999 to 0. There are also some days where the maximum or minimum temperature was not gathered (I'm not sure why). As temperature is the main variable I plan to use to predict daily violations, I drop the entire row if the temperature data is missing.

```r
## I DON'T CARE ABOUT THE STATION OR ITS NAME -
## GETTING RID OF IT
count_by_day$STATION <- NULL
count_by_day$STATION_NAME <- NULL

## A BUNCH OF VARIABLES ARE CODED WITH NEGATIVE VALUES
## IF THEY WEREN'T COLLECTED - CHANGING THEM TO 0s
count_by_day$MDPR[count_by_day$MDPR < 0] <- 0
count_by_day$DAPR[count_by_day$DAPR < 0] <- 0
count_by_day$PRCP[count_by_day$PRCP < 0] <- 0
count_by_day$SNWD[count_by_day$SNWD < 0] <- 0
count_by_day$SNOW[count_by_day$SNOW < 0] <- 0
count_by_day$WT01[count_by_day$WT01 < 0] <- 0
count_by_day$WT03[count_by_day$WT03 < 0] <- 0
count_by_day$WT04[count_by_day$WT04 < 0] <- 0

## REMOVING ANY ROWS WITH MISSING TEMP DATA
count_by_day <- count_by_day[count_by_day$TMAX > 0, ]
count_by_day <- count_by_day[count_by_day$TMIN > 0, ]

## GETTING RID OF SOME NA VALUES THAT POPPED UP
count_by_day <- count_by_day[!is.na(count_by_day$TMAX), ]

## REMOVING COLUMNS THAT HAVE LITTLE OR NO DATA
## IN THEM (ALL 0s)
count_by_day$TOBS <- NULL
count_by_day$WT01 <- NULL
count_by_day$WT04 <- NULL
count_by_day$WT03 <- NULL

## CHANGING THE DATA, UNNECESSARILY, FROM 10ths OF
## DEGREES CELSIUS TO JUST DEGREES CELSIUS
count_by_day$TMAX <- count_by_day$TMAX / 10
count_by_day$TMIN <- count_by_day$TMIN / 10
```

#### Step 3: Visualizing the Data

At this point, we have joined our data sets and gotten rid of the unhelpful "stuff." What does the data look like?
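A minimal sketch of that histogram, assuming the `count_by_day` data frame built above (the bin count is my own choice, not from the original post):

```r
## SKETCH: DISTRIBUTION OF DAILY VIOLATION COUNTS
hist(count_by_day$count,
     breaks = 50,  # BIN COUNT IS AN ASSUMPTION
     main   = "Daily Violation Counts",
     xlab   = "Violations Issued per Day")
```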
There are clearly two populations here. With the benefit of hindsight, the small population on the left of the histogram is mainly Sundays; the larger population holding the majority of the data is all other days of the week. Let's make some new features to explore this idea.

#### Step 4: New Feature Creation

As we saw in the histogram above, there are obviously a few populations in the data. I know that the day of the week, holidays, and the month of the year likely have some strong influence on how many violations are issued. If you think about it, most parking signs include the clause: "Except Sundays and Holidays." Plus, having spent more than a few summers in Philadelphia at this point, I know that from Memorial Day until Labor Day the city relocates to the South Jersey Shore (emphasis on the South part of the Jersey Shore). That said, I add in those features as predictors.

```r
## FEATURE CREATION - ADDING IN THE DAY OF WEEK
count_by_day$DOW <- as.factor(weekdays(count_by_day$DATE))

## FEATURE CREATION - ADDING IN IF THE DAY WAS A HOLIDAY
count_by_day$HOL <- 0
count_by_day$HOL[as.character(count_by_day$DATE) %in%
                 as.character(holidays)] <- 1
count_by_day$HOL <- as.factor(count_by_day$HOL)

## FEATURE CREATION - ADDING IN THE MONTH
count_by_day$MON <- as.factor(months(count_by_day$DATE))
```

Now, let's see if the Sunday thing is real. Here is a scatterplot of the data, with circles representing Sundays and triangles representing all other days of the week.
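A minimal sketch of that scatterplot, assuming the `count_by_day` data frame with the `DOW` feature created above (the plot symbols are chosen to mirror the description):

```r
## SKETCH: MIN TEMPERATURE VS. DAILY TICKET COUNT
## CIRCLES (pch = 1) FOR SUNDAYS, TRIANGLES (pch = 2) FOR ALL OTHER DAYS
is_sunday <- count_by_day$DOW == "Sunday"
plot(count_by_day$TMIN, count_by_day$count,
     pch  = ifelse(is_sunday, 1, 2),
     main = "Temperature vs. Ticket Counts",
     xlab = "Minimum Temperature (Degrees Celsius)",
     ylab = "Violations Issued per Day")
```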
You can clearly see that Sundays tend to do their own thing, but in a very consistent manner that mirrors the rest of the week. In other words, the slope for Sundays is very close to the slope for all other days of the week; the counts are just shifted much lower. There are some points that don't follow those trends, which are likely due to snow, holidays, and/or other man-made or weather events.

Let's split the data into a training and a test set (that way we can see how well we do with the model). I'm arbitrarily making the test set the last year of data; everything before that is the training set.

```r
train <- count_by_day[count_by_day$DATE <  "2014-08-01", ]
test  <- count_by_day[count_by_day$DATE >= "2014-08-01", ]
```

#### Step 5: Feature Identification

We now have a data set that is ready for some model building! The problem to solve next is figuring out which features best explain the count of violations issued each day. My preference is to use Random Forests to tell me which features are the most important. We'll also take a look at which variables, if any, are highly correlated: high correlation amongst the input variables leads to unstable coefficient estimates due to multicollinearity.

```r
featForest <- randomForest(count ~ MDPR + DAPR + PRCP + SNWD + SNOW +
                             TMAX + TMIN + DOW + HOL + MON,
                           data = train, importance = TRUE, ntree = 10000)

## PLOT THE VARIABLES TO SEE THEIR IMPORTANCE
varImpPlot(featForest)
```

In the Variable Importance Plot below, you can see very clearly that the day of the week (DOW) is by far the most important variable in describing the number of violations written per day. This is followed by whether or not the day was a holiday (HOL), the minimum temperature (TMIN), and the month (MON). The maximum temperature (TMAX) is in there, too, but I suspect it is highly correlated with the minimum temperature (we'll check that next). The rest of the variables have very little impact.

*Variable Importance Plot*

```r
cor(count_by_day[, c(3:9)])
```

I'll skip the full output of the correlation table, but TMIN and TMAX have a correlation coefficient of 0.940379171. Because TMIN has the higher variable importance and the two are so highly correlated, I'll leave TMAX out of the model.

#### Step 6: Building the Models

The goal here was to build a multiple linear regression model - but since I've already started down the path of Random Forests, I'll build one of those, too, and compare the two. To build the models, we do the following:

```r
## BUILD ANOTHER FOREST USING THE IMPORTANT VARIABLES
predForest <- randomForest(count ~ DOW + HOL + TMIN + MON,
                           data = train, importance = TRUE, ntree = 10000)

## BUILD A LINEAR MODEL USING THE IMPORTANT VARIABLES
linmod_with_mon <- lm(count ~ TMIN + DOW + HOL + MON, data = train)
```

Looking at the summary, I have questions about whether the month variable (MON) is significant to the model; many of its levels have rather high p-values.

```r
summary(linmod_with_mon)
```

```
Call:
lm(formula = count ~ TMIN + DOW + HOL + MON, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-4471.5  -132.1    49.6   258.2  2539.8 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5271.4002    89.5216  58.884  < 2e-16 ***
TMIN          -15.2174     5.6532  -2.692 0.007265 ** 
DOWMonday    -619.5908    75.2208  -8.237 7.87e-16 ***
DOWSaturday  -788.8261    74.3178 -10.614  < 2e-16 ***
DOWSunday   -3583.6718    74.0854 -48.372  < 2e-16 ***
DOWThursday   179.0975    74.5286   2.403 0.016501 *  
DOWTuesday   -494.3059    73.7919  -6.699 4.14e-11 ***
DOWWednesday -587.7153    74.0264  -7.939 7.45e-15 ***
HOL1        -3275.6523   146.8750 -22.302  < 2e-16 ***
MONAugust     -99.8049   114.4150  -0.872 0.383321    
MONDecember  -390.2925   109.4594  -3.566 0.000386 ***
MONFebruary  -127.8091   112.0767  -1.140 0.254496    
MONJanuary    -73.0693   109.0627  -0.670 0.503081    
MONJuly      -346.7266   113.6137  -3.052 0.002355 ** 
MONJune       -30.8752   101.6812  -0.304 0.761481    
MONMarch       -1.4980    94.8631  -0.016 0.987405    
MONMay          0.1194    88.3915   0.001 0.998923    
MONNovember   170.8023    97.6989   1.748 0.080831 .  
MONOctober    125.1124    92.3071   1.355 0.175702    
MONSeptember  199.6884   101.9056   1.960 0.050420 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 544.2 on 748 degrees of freedom
Multiple R-squared:  0.8445, Adjusted R-squared:  0.8405 
F-statistic: 213.8 on 19 and 748 DF,  p-value: < 2.2e-16
```

To verify this, I build the model without the MON term and then do a partial F-test to compare the two, using the results of the ANOVA tables below.

```r
## BUILD THE SAME LINEAR MODEL, MINUS THE MON TERM
linmod_wo_mon <- lm(count ~ TMIN + DOW + HOL, data = train)

## FIRST ANOVA TABLE (WITH THE MON TERM)
anova(linmod_with_mon)
```

```
Analysis of Variance Table

Response: count
           Df     Sum Sq   Mean Sq  F value    Pr(>F)    
TMIN        1   16109057  16109057  54.3844 4.383e-13 ***
DOW         6 1019164305 169860717 573.4523 < 2.2e-16 ***
HOL         1  147553631 147553631 498.1432 < 2.2e-16 ***
MON        11   20322464   1847497   6.2372 6.883e-10 ***
Residuals 748  221563026     296207                      
```

```r
## SECOND ANOVA TABLE (WITHOUT THE MON TERM)
anova(linmod_wo_mon)
```

```
Analysis of Variance Table

Response: count
           Df     Sum Sq   Mean Sq F value    Pr(>F)    
TMIN        1   16109057  16109057  50.548 2.688e-12 ***
DOW         6 1019164305 169860717 532.997 < 2.2e-16 ***
HOL         1  147553631 147553631 463.001 < 2.2e-16 ***
Residuals 759  241885490    318690                      
```

```r
## Ho: B9 = B10 = B11 = B12 = B13 = B14 = B15 = B16 =
##     B17 = B18 = B19 = 0
## Ha: At least one is not equal to 0
## F-Stat = MSdrop / MSE =
##   ((SSR1 - SSR2) / (DF(R)1 - DF(R)2)) / MSE
f_stat <- ((241885490 - 221563026) / (759 - 748)) / 296207

## P-VALUE OF THE F-STAT CALCULATED ABOVE
p_value <- 1 - pf(f_stat, 11, 748)
```

Since the p-value (6.8829e-10) is MUCH MUCH less than 0.05, I can reject the null hypothesis and conclude that at least one of the parameters associated with the MON term is not zero. Because of this, I'll keep the term in the model.
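As a cross-check, R will run this same partial F-test directly if you hand `anova()` both the reduced and the full model; the F value and p-value should match the hand calculation above.

```r
## SAME PARTIAL F-TEST, LETTING R DO THE ARITHMETIC
anova(linmod_wo_mon, linmod_with_mon)
```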
#### Step 7: Apply the Models to the Test Data

Below I call the predict function to see how the Random Forest and the Linear Model predict the test data, rounding each prediction to the nearest integer. To determine which model performs better, I calculate the sum of the absolute differences between the predicted values and the actual counts.

```r
## PREDICT THE VALUES BASED ON THE MODELS
test$RF <- round(predict(predForest, test), 0)
test$LM <- round(predict.lm(linmod_with_mon, test), 0)

## SEE THE ABSOLUTE DIFFERENCE FROM THE ACTUAL
difOfRF <- sum(abs(test$RF - test$count))
difOfLM <- sum(abs(test$LM - test$count))
```

#### Conclusion

As it turns out, the Linear Model performs better than the Random Forest model. I am relatively pleased with the Linear Model - an R-squared value of 0.8445 ain't nothin' to shake a stick at. Random Forests proved very useful for identifying the important features, but to me they are a bit more of a "black box" in comparison to linear regression - I hesitate to use them at work for more than a feature identification tool.

Overall - a nice little experiment and a great dive into some open data. I now know that the PPA rarely takes a day off, regardless of the weather. I'd love to know how much of the fines they write are actually collected. I may also dive into predicting what type of ticket you received based on your location, time of ticket, etc. All in another day's work! Thanks for reading.