INTRODUCTION

Whether you are a data professional or in a job that requires data-driven decisions, predictive analytics and related products (aka machine learning aka ML aka artificial intelligence aka AI) are here, and understanding them is paramount. They are being used to drive industry. Because of this, understanding how to compare predictive models is very important.

This post gets into a very popular method of describing how well a model performs: the Area Under the Curve (AUC) metric. As the term implies, AUC is a measure of area under the curve. The curve referenced is the Receiver Operating Characteristic (ROC) curve. The ROC curve is a way to visually represent how the True Positive Rate (TPR) increases as the False Positive Rate (FPR) increases.

In plain English, the ROC curve is a visualization of how well a predictive model is ordering the outcome - can it separate the two classes (TRUE/FALSE)? If not (most of the time it is not perfect), how close does it get? This last question can be answered with the AUC metric.

THE BACKGROUND

Before I explain, let’s take a step back and understand the foundations of TPR and FPR.

For this post we are talking about a binary prediction (TRUE/FALSE). This could be answering a question like: Is this fraud? (TRUE/FALSE).

In a predictive model, you get some right and some wrong for both the TRUE and FALSE. Thus, you have four categories of outcomes:

- True positive (TP): I predicted TRUE and it was actually TRUE
- False positive (FP): I predicted TRUE and it was actually FALSE
- True negative (TN): I predicted FALSE and it was actually FALSE
- False negative (FN): I predicted FALSE and it was actually TRUE

From these, you can create a number of additional metrics that measure various things. In ROC curves, there are two that are important:

True Positive Rate aka Sensitivity (TPR): out of all the actual TRUE outcomes, how many did I predict TRUE? \(TPR = sensitivity = \frac{TP}{TP + FN}\) Higher is better!
False Positive Rate aka 1 - Specificity (FPR): out of all the actual FALSE outcomes, how many did I predict TRUE? \(FPR = 1 - specificity = 1 - \frac{TN}{TN + FP} = \frac{FP}{FP + TN}\) Lower is better!

BUILDING THE ROC CURVE

For the sake of the example, I built 3 models to compare: Random Forest, Logistic Regression, and random prediction using a uniform distribution.

Step 1: Rank Order Predictions

To build the ROC curve for each model, you first rank order your predictions from the highest predicted probability to the lowest:

Actual   Predicted
FALSE    0.9291
FALSE    0.9200
TRUE     0.8518
TRUE     0.8489
TRUE     0.8462
TRUE     0.7391

Step 2: Calculate TPR & FPR for First Iteration

Now, we step through the table. Using the first row as the “cutoff” (effectively the row most likely to be TRUE), we say that the first row is predicted TRUE and the remaining rows are predicted FALSE.

From the table above, we can see that the first row is actually FALSE, though we are predicting it TRUE. This leads to the following metrics for our first iteration:

Iteration  TPR  FPR    Sensitivity  Specificity  True.Positive  False.Positive  True.Negative  False.Negative
1          0    0.037  0            0.963        0              1               26             11

This is what we’d expect. We have a 0% TPR on the first iteration because we got that single prediction wrong. Since we’ve only got 1 false positive out of 27 actual FALSE outcomes, our FPR is still low: 3.7%.

Step 3: Iterate Through the Remaining Predictions

Now, let’s go through all of the possible cut points and calculate the TPR and FPR.
Actual  Predicted  Model                Rank  TPR     FPR     Sensitivity  Specificity  TN  TP  FN  FP
FALSE   0.9291     Logistic Regression  1     0.0000  0.0370  0.0000       0.9630       26  0   11  1
FALSE   0.9200     Logistic Regression  2     0.0000  0.0741  0.0000       0.9259       25  0   11  2
TRUE    0.8518     Logistic Regression  3     0.0909  0.0741  0.0909       0.9259       25  1   10  2
TRUE    0.8489     Logistic Regression  4     0.1818  0.0741  0.1818       0.9259       25  2   9   2
TRUE    0.8462     Logistic Regression  5     0.2727  0.0741  0.2727       0.9259       25  3   8   2
TRUE    0.7391     Logistic Regression  6     0.3636  0.0741  0.3636       0.9259       25  4   7   2

Step 4: Repeat Steps 1-3 for Each Model

Calculate the TPR & FPR for each rank and model!

Step 5: Plot the Results & Calculate AUC

Once the ROC curves are plotted, the Random Forest does remarkably well. It perfectly separated the outcomes in this example (to be fair, this is really small data and test data). What I mean is, when the data is rank ordered by the predicted likelihood of being TRUE, the actual TRUE outcomes are grouped together at the top. There are no false positives until every TRUE has been captured. The Area Under the Curve (AUC) is 1 (\(area = height \times width\) for a rectangle/square).

Logistic Regression does well - ~80% AUC is nothing to sneeze at.

The random prediction does only slightly better than a coin flip (50% AUC), but this is just random chance and a small sample.

SUMMARY

The AUC is a very important metric for comparing models. To properly understand it, you need to understand the ROC curve and the underlying calculations.

In the end, AUC shows how good a model is at classifying. The better it can separate the TRUEs from the FALSEs, the closer to 1 the AUC will be. This means that, as you move down the ranked predictions, the True Positive Rate increases faster than the False Positive Rate. Picking up true positives before false positives is exactly what you want from a predictive model.
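
For readers who want to try Steps 1-5 themselves, here is a minimal Python sketch of the rank-and-step procedure. It is not the code used to build the models or tables in this post. The function names `roc_points` and `auc`, the optional `total_pos`/`total_neg` arguments, and the trapezoid rule for the area are all assumptions made for illustration. It also assumes there are no tied scores.

```python
def roc_points(actuals, scores, total_pos=None, total_neg=None):
    """Return one (FPR, TPR) point per cutoff, starting at (0, 0).

    actuals: list of booleans (the real TRUE/FALSE outcomes)
    scores:  list of predicted probabilities of TRUE
    total_pos / total_neg: optional class totals (hypothetical helper
        arguments), useful when only the top of a ranked list is passed in.
    Assumes no tied scores.
    """
    if total_pos is None:
        total_pos = sum(1 for a in actuals if a)
    if total_neg is None:
        total_neg = sum(1 for a in actuals if not a)
    if total_pos == 0 or total_neg == 0:
        raise ValueError("both classes must be present")

    # Step 1: rank order by predicted probability, highest first.
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)

    # Steps 2-3: move the cutoff down one row at a time.
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, actual in ranked:
        if actual:
            tp += 1  # predicted TRUE, actually TRUE
        else:
            fp += 1  # predicted TRUE, actually FALSE
        points.append((fp / total_neg, tp / total_pos))
    return points


def auc(points):
    """Area under the (FPR, TPR) points using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# The six ranked Logistic Regression rows shown in the post. The class
# totals (11 TRUE, 27 FALSE) come from the TP + FN and TN + FP counts.
top_actuals = [False, False, True, True, True, True]
top_scores = [0.9291, 0.9200, 0.8518, 0.8489, 0.8462, 0.7391]
for fpr, tpr in roc_points(top_actuals, top_scores, total_pos=11, total_neg=27)[1:]:
    print(round(tpr, 4), round(fpr, 4))

# Made-up toy data (not from the post): a perfect ordering.
print(auc(roc_points([True, True, False, False], [0.9, 0.8, 0.3, 0.1])))
```

With the class totals supplied, the six points line up with the TPR and FPR columns in the Step 3 table. The toy example at the end is invented purely to show that a perfect ordering gives an area of 1. The full AUC for the models in the post cannot be recomputed here because only the top six rows are shown.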

INTRODUCTION

Recently I was asked by a former colleague about getting into AI. He has truly big data and wants to use this data to power “AI” - if the headlines are to be believed, everyone else is already doing it. Though it was difficult for my ego, I told him I couldn’t help him in our 30-minute call and that he should think about hiring someone to get him there.

The truth was I really didn’t have a solid answer for him in the moment. This was truly disappointing - in my current role and in my previous role, I put predictive models into production.

After thinking about it for a bit, I realized I took a similar path in both roles. There are 3 steps in my mind to getting to “AI.” Though this seems simple, it is a long process and potentially not linear - you may have to keep coming back to previous steps.

1. Baseline (Reporting)
2. Understand (Advanced Analytics)
3. Artificial Intelligence (Data Science)

BASELINE (REPORTING)

Fun fact: You cannot effectively predict anything if you cannot measure the impact.

What I mean by baseline is building out a reporting suite. Having a fundamental understanding of your business and environment is key. Without doing this step, you may try to predict the wrong thing entirely - or start with something that isn’t the most impactful.

For me, this step started with finding the data in the first place. Perhaps, like my colleague, you have lots of data and you’re ready to jump in. That’s great and makes getting started that much more straightforward. In my role, I joined a finance team that really didn’t have a good bead on this - finding the data was difficult (and getting the owners of that data to give me access was a process as well).

To be successful, start small and iterate. Our first reports were built from manually downloading machine logs, processing them in R with JSON packages, and turning them into a black-and-white document.
It was ugly, but it helped us know what we needed to know in that moment - oh yeah… it was MUCH better than nothing. “Don’t let perfection be the enemy of good.” - paraphrased from Voltaire.

From this, I gained access to our organization’s data warehouse, put automation in place, and purchased some Tableau licenses. This phase took a few months and is constantly being refined, but we are now able to see the impact of our decisions at a glance.

This new understanding inevitably leads to more questions - cue step 2: Understanding.

UNDERSTANDING (ADVANCED ANALYTICS)

If you have never circulated reports and dashboards to others… let me fill you in on something: it will ALWAYS lead to additional, progressively harder questions. This step is an investment in time and expertise - you have to commit to having dedicated resource(s) (read: people… it is inhumane to call people resources and you may only need one person or some of a full-time person’s time).

Expect questions like:

- Why did X go up unexpectedly (breaks the current trend)?
- Are we over-indexing on this type of customer?
- Right before our customer leaves, this weird thing happens - what is this weird thing and why is it happening?

Like the previous step - this will be ongoing. Investing in someone to do advanced analytics will help you to understand the fine details of your business AND … (drum roll) … will help you to understand which part of your business is most ripe for “AI”!

ARTIFICIAL INTELLIGENCE (DATA SCIENCE)

It is at this point that you will be able to do real, bona fide data science.

A quick rant: Notice that I purposefully did not use the term “AI” (I know I used it throughout this article and even in the title of this section… what can I say - I am in tune with marketing concepts, too). “AI” is a term that is overused and rarely implemented. Data science, however, comes in many forms and can really transform your business.
Here are a few ideas for what you can do with data science:

- Prediction/Machine Learning
- Testing
- Graph Analysis

Perhaps you want to predict whether a sale is fraud or which existing customer is most apt to buy your new product?

You can also test whether a new strategy works better than the old. This requires that you use statistical concepts to ensure valid testing and results.

My new obsession is graph analysis. With graphs you can see relationships that may have been hidden before - this will enable you to identify new targets and enrich your understanding of your business!

Data science is usually a very specific thing, and it takes many forms!

SUMMARY

Getting to data science is a process - it will take an investment.

There are products out there that will help you shortcut some of these steps and I encourage you to consider these. There are products to help with reporting, analytics, and data science. These should, in my very humble opinion, be used by people who are dedicated to the organization’s data, analytics, and science.

Directions for data science - measure, analyze, predict, repeat!