- What Is Binary Classification?
- Algorithms for Binary Classification
  - Logistic Regression
  - Decision Trees/Random Forests
    - Decision Trees
    - Random Forests
  - Nearest Neighbor
  - Support Vector Machines (SVM)
  - Neural Networks
- Great. Now what?
  - Determining What the Problem Is
  - Locate and Obtain Data
  - Data Mining & Preparing for Analysis
  - Splitting the Data
  - Building the Models
  - Validating the Models
- Conclusion

What Is Binary Classification?

Binary classification is used to classify the members of a given set into two categories. Usually, this answers a yes-or-no question:

- Did a particular passenger on the Titanic survive?
- Is a particular user account compromised?
- Is my product on the shelf in a certain pharmacy?

This type of inference is frequently made using supervised machine learning techniques. Supervised machine learning means that you have historical, labeled data from which your algorithm can learn.

Algorithms for Binary Classification

There are many methods for doing binary classification. To name a few:

- Logistic Regression
- Decision Trees/Random Forests
- Nearest Neighbor
- Support Vector Machines (SVM)
- Neural Networks

Logistic Regression

Logistic regression is a parametric statistical model that predicts binary outcomes. Parametric means that the algorithm is based on a distribution (in this case, the logistic distribution), and as such it relies on a few assumptions:

- The dependent variable must be binary
- Only meaningful independent variables are included in the model
- Error terms need to be independent and identically distributed
- Independent variables need to be independent from one another
- Large sample sizes are preferred

Because of these assumptions, parametric tests tend to be more statistically powerful than nonparametric tests when the assumptions hold; in other words, when a significant effect does exist, they are more likely to find it.
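One of the assumptions listed above, that the independent variables are not strongly related to one another, can be checked before any model is fit. The following is a minimal sketch, not a prescribed method: the column names, the toy values, and the 0.8 cutoff are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair whose |r| meets the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical toy data: fare falls as the passenger class number rises.
columns = {
    "age": [22, 38, 26, 35, 54, 2, 27, 14],
    "pclass": [3, 1, 3, 1, 1, 3, 3, 2],
    "fare": [7.3, 71.3, 7.9, 53.1, 51.9, 21.1, 11.1, 30.1],
}

# Pairs listed here are candidates for dropping or combining.
print(correlated_pairs(columns))
```

A flagged pair does not automatically mean one variable must go; it is a prompt to look closer in the context of your problem.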
Logistic regression follows the equation:

$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}}$$

where:

- $P(Y = 1 \mid X)$ is the probability of the outcome being 1 given the independent variables; it is limited to values between 0 and 1
- $Y$ is the dependent variable
- $X_1, \dots, X_n$ are the independent variables
- $\beta_0$ is the intercept and $\beta_1, \dots, \beta_n$ are the coefficients for the independent variables

This equation is fit on a training set (historical, labeled data) and is then used to predict the likelihoods for future, unlabeled data.

Decision Trees/Random Forests

Decision Trees

A decision tree is a nonparametric classifier. It effectively partitions the data, starting by splitting on the independent variable that gives the most information gain, and then recursively repeating this process at subsequent levels.

Information gain is a formula that determines how “important” an independent variable is in predicting the dependent variable. It takes into account how many distinct values there are (for categorical variables) and the number and size of branches in the decision tree. The goal is to pick the most informative variable that is still general enough to prevent overfitting.

At the bottom of the decision tree, the leaf nodes are groupings of events within the set that all follow the rules set forth throughout the tree to reach that node. Future, unlabeled events are then fed into the tree to see which group they belong to; the average of the labeled (training) data for the leaf is then assigned as the predicted value for the unlabeled event.

As with logistic regression, overfitting is a concern. If you allow a decision tree to grow without bound, eventually you will have all identical events in each leaf; while this may look beneficial, the tree may be too specific to the training data and mislabel future events. Trees are “pruned” to prevent overfitting.

Random Forests

Random forests are an ensemble method built upon decision trees.
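Because every tree in a forest is grown with the splitting rule described under Decision Trees, it may help to see that rule in code. This is a minimal sketch of the plain entropy-based version of information gain; it does not include the adjustment for the number and size of branches mentioned above (often called gain ratio), and the toy data and variable names are assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: 1 = survived, 0 = did not survive.
survived = [1, 1, 1, 0, 0, 0, 1, 0]
sex = ["f", "f", "f", "m", "m", "m", "f", "m"]   # separates the labels perfectly
deck = ["a", "b", "a", "b", "a", "b", "b", "a"]  # tells us nothing about the labels

print(information_gain(sex, survived))
print(information_gain(deck, survived))
```

A tree builder would compute this for every candidate variable and split on the one with the highest value, then repeat within each branch.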
Random forests are a “forest” of decision trees. In other words, you use bootstrap sampling to build many over-fit decision trees, then average their results to produce a final model. A bootstrap sample is a sample drawn with replacement: in every selection for the sample, each event has an equal chance of being chosen.

To clarify, building a random forest model means taking many bootstrap samples and building an over-fit decision tree on each (meaning you continue to split the tree without bound until every leaf node contains identical events). These results, taken together, correct for the biases and potential overfitting of an individual tree.

Generally, the more trees in your random forest, the better; the trade-off is that more trees mean more computing. Random forests often take a long time to train.

Nearest Neighbor

The k-nearest neighbor algorithm is a very simple algorithm. Using the training set as a reference, the value for new, unlabeled data is predicted by taking the average of the k closest events. It is a lazy learner: no evaluation takes place until you classify new events, so there is no training step to wait for.

It can be difficult to determine what k should be. However, because the algorithm is computationally simple, you can try multiple values of k without much overhead.

Support Vector Machines (SVM)

SVM is another supervised machine learning method. It iteratively attempts to “split” the two categories by maximizing the distance between a hyperplane (the generalization of a plane to more dimensions; most applications of machine learning are in higher-dimensional space) and the closest points in each category.

[Figure: a simple example in which the plane iteratively improves the split between the two groups.]

There are multiple kernels that can be used with SVM, depending on the shape of the data:

- Linear
- Polynomial
- Radial
- Sigmoid

You may also choose to configure how large a step the plane can take in each iteration, among other settings.
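The four kernels listed above are functions that score how similar two points are. A minimal sketch follows, assuming the common textbook forms; the parameter names `gamma`, `degree`, and `coef0` and their default values are assumptions for illustration, and a real SVM library will have its own defaults.

```python
from math import exp, tanh


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, y) + coef0) ** degree


def radial_kernel(x, y, gamma=1.0):
    # Depends only on the squared distance between the two points.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return tanh(gamma * dot(x, y) + coef0)


x, y = [1.0, 2.0], [3.0, 4.0]
for kernel in (linear_kernel, polynomial_kernel, radial_kernel, sigmoid_kernel):
    print(kernel.__name__, kernel(x, y))
```

The linear kernel suits data that a flat plane can separate; the others let the boundary bend, at the cost of extra parameters to tune.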
Neural Networks

Neural networks (there are several varieties) are built to mimic how a brain solves problems. This is done by creating multiple layers from a single input. It is most easily demonstrated with image recognition, where the network turns groups of pixels into another, single value, over and over again, to provide more information to train the model.

Great. Now what?

Now that we understand some of the tools in our arsenal, what are the steps to doing the analysis?

1. Determine what the problem is
2. Locate and obtain data
3. Mine the data for understanding and prepare it for analysis
4. Split the data into training and testing sets
5. Build model(s) on the training data
6. Test the models on the test data
7. Validate and pick the best model

Determining What the Problem Is

While it is easy to ask a question, it is difficult to understand all of the assumptions being made by the person asking it. For example, a simple question is asked:

Will my product be on the shelf of this pharmacy next week?

That question may seem straightforward at first glance, but what product are we talking about? What pharmacy are we talking about? What is the time frame being evaluated? Does the product need to be in the pharmacy and available if you ask, or does the customer need to be able to visually identify it? Does it need to be available for the entire time period in question, or does it just have to be available for at least part of that period?

Being as specific as possible is vital in order to deliver the correct answer. It is easy to misinterpret the assumptions of the person asking and then do a lot of work to answer the wrong question. Specificity will help ensure that time is not wasted and that the person asking gets the answer they were looking for.

The final question may look more like:

Will there be any Tylenol PM available over-the-counter at midnight, February 28, 2017 at Walgreens on the corner of 17th and John F. Kennedy Blvd in Philadelphia?

Well, we don’t know.
We can now use historical data to make our best guess. This question is specific enough to answer.

Locate and Obtain Data

Where is your data? Is it in a database? Some Excel spreadsheet? Once you find it, how big is it? Can you download the data locally? Do you need to find a distributed database to handle it? If it is in a database, can you do some of the data mining (the next step) before downloading the data? Be careful: “SELECT * FROM my_table;” can get scary, quick.

This is also a good time to think about what tools and/or languages you want to use to mine and manipulate the data. Excel? SQL? R? Python? One of the numerous other tools or languages out there that are good at a bunch of different things (Julia, Scala, Weka, Orange, etc.)?

Get the data into one spot, preferably with some guidance on what it is and where it sits in relation to what you need for your problem, and open it up.

Data Mining & Preparing for Analysis

The most time-consuming step in any data science article you read will always be the data cleaning step. This document is no different: you will spend an inordinate amount of time getting to know the data, cleaning it, getting to know it better, and cleaning it again. You may then proceed to analysis, discover you’ve missed something, and come back to this step.

There is a lot to consider in this step, and each data analysis is different.

Is your data complete? If you are missing values in your data, how will you deal with them? There is no overarching rule on this. If you are dealing with continuous data, perhaps you’ll fill missing data points with the average of similar data. Perhaps you can infer what the value should be based on context. Perhaps the missing values constitute such a small portion of your data that the logical thing to do is to drop those events altogether.

How does the dependent variable break down? We are dealing with binary data here; are there far more zeros than ones? How will you deal with that if there are?
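Two of the checks above, filling missing values and looking at how the dependent variable breaks down, are quick to try. This is a minimal sketch under simple assumptions: missing values are represented as `None`, the overall mean stands in for "the average of similar data", and the toy values are made up for illustration.

```python
from collections import Counter
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def class_balance(labels):
    """Return the share of each label, for example {0: 0.75, 1: 0.25}."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


# Hypothetical toy data.
ages = [22, None, 26, 35, None, 17]
labels = [0, 0, 0, 1, 0, 0, 0, 1]

print(fill_missing_with_mean(ages))
print(class_balance(labels))
```

If the balance is heavily lopsided, accuracy alone becomes a misleading measure, which is worth keeping in mind when the models are validated later.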
Are you doing your analysis on a subset? If so, is your sample representative of the population? How can you be sure? This is where histograms are your friend.

Do you need to create variables? Perhaps one independent variable you have is a date, which might be tough to use as an input to your model. Should you find out which day of the week each date was? Month? Year? Season? These are easier to add as a model input in some cases.

Do you need to standardize your data? Perhaps men are listed as “M,” “Male,” “m,” “male,” “dude,” and “unsure.” It would behoove you, in this example, to standardize this data so that all of these take on the same value.

In most algorithms, correlated input variables are bad. This is the time to plot all of the independent variables against each other to see if there is correlation. If there are correlated variables, it may be a tough choice to drop one (or all!).

Speaking of independent variables, which are important for predicting your dependent variable? You can use information gain packages (depending on the language or tool you are using for your analysis), step-wise regression, or random forests to help understand the important variables.

In many of these steps, there are no hard-and-fast rules on how to proceed. You’ll need to make a decision in the context of your problem. In many cases, you may be wrong and need to come back to the decision after trying things out.

Splitting the Data

Now that you (think you) have a clean dataset, you’ll need to split it into training and testing datasets. You’ll want to have as much data as possible to train on, while still having enough data left over to test on. This is less and less of an issue in the age of big data. However, sometimes there is so much data that your algorithms will take too long to train. Again, this is another decision that will need to be made in the context of your problem.

There are a few options for splitting your data.
The most straightforward is to take a portion of your overall dataset to train on (say 70%) and leave the rest behind to test on. This works well in most big data applications.

If you do not have a lot of data (or if you do), consider cross-validation. This is an iterative approach where you train your algorithm repeatedly on the same data set, leaving a different portion out in each iteration to be used as the test set. The most popular versions of cross-validation are k-fold cross-validation and leave-one-out cross-validation. There is even nested cross-validation, which gets very Inception-like.

Building the Models

Finally, you are ready to do what we came to do: build the models. We have our datasets cleaned, enriched, and split. I say models, plural, because you’ll always want to evaluate which method and/or inputs work best. You’ll want to pick a few of the algorithms from above and build a model with each.

While that is vague, there are multiple packages available to perform each analysis, depending on your language or tool of choice. It is generally only a line or two of code to train each model; once we have our models trained, it is time to validate.

Validating the Models

So, which model did best? How can you tell? We start by predicting results for our test set with each model and building a confusion matrix for each, which is a table that counts the true positives, false positives, true negatives, and false negatives.

With this, we can calculate the specificity, sensitivity, and accuracy for each model. For each value, higher is better. The best model is one that performs the best on each of these measures.

In the real world, frequently one model will have better specificity, while another will have better sensitivity, and yet another will be the most accurate. Again, there is no hard-and-fast rule on which model to choose; it all depends on the context. Perhaps false positives are really bad in your context, in which case specificity should be given more weight. It all depends.
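The three measures above come straight from the four counts in a confusion matrix. A minimal sketch, assuming labels are coded as 1 (positive) and 0 (negative) and using made-up predictions for illustration:

```python
def confusion_counts(actual, predicted):
    """Count true positives, false positives, true negatives, false negatives."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summarize(actual, predicted):
    """Sensitivity, specificity, and accuracy from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    return {
        "sensitivity": tp / (tp + fn),   # share of actual 1s that were caught
        "specificity": tn / (tn + fp),   # share of actual 0s that were left alone
        "accuracy": (tp + tn) / len(actual),
    }


# Hypothetical test-set labels and one model's predictions.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

print(confusion_counts(actual, predicted))
print(summarize(actual, predicted))
```

Running `summarize` once per model on the same test set gives a side-by-side comparison of the kind described above.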
From here, you have some measures with which to pick a model and implement it.

Conclusion

Much of model building, in general, is part computer science, part statistics, and part business understanding. Understanding which tools and languages are best for implementing the best statistical modeling technique to solve a business problem can feel like more of a form of art than science at times.

In this document, I’ve presented some algorithms and steps to do binary classification, which is just the tip of the iceberg. I am sure there are algorithms and steps missing; I hope that this helps in your understanding.
The differences and applications of Supervised and Unsupervised Machine Learning.

Introduction

Machine learning is one of the buzziest terms thrown around in technology these days. Combine machine learning with big data in a Google search and you’ve got yourself an unmanageable amount of information to digest. In a (possibly ironic) effort to help navigate this sea of information, this post is meant to be an introduction to, and simplification of, some common machine learning terminology and types, with some resources to dive deeper.

Supervised vs. Unsupervised Machine Learning

At the highest level, there are two different types of machine learning: supervised and unsupervised. Supervised means that we have historical information to learn from and use to make future decisions; unsupervised means that we have no previous information, but might be attempting to group things together or do some other type of pattern or outlier recognition.

In each of these subsets there are many methodologies and motivations; I’ll explain how they work and give a simple example or two.

Supervised Machine Learning

Supervised machine learning is nothing more than using historical information (read: data) and algorithms to predict a future event or explain a behavior. I know this is vague, but humans make predictions like this every day, based on previous learning.

A very simple example: if it is sunny outside when we wake up, it is perfectly reasonable to assume that it will not rain that day. Why do we make this prediction? Because over time, we’ve learned that on sunny days it typically does not rain. We don’t know for sure that it won’t rain today, but we’re willing to make decisions based on our prediction that it won’t. Computers do this exact same thing in order to make predictions.

The real gains from supervised machine learning come when you have lots of accurate historical data.
In the example above, we can’t be 100% sure that it won’t rain, because we’ve also woken up on a few sunny mornings and then driven home after work in a monsoon. Adding more and more data for your supervised machine learning algorithm to learn from allows it to account for these other possible outcomes.

Supervised machine learning can be used to classify (usually binary or yes/no outcomes, but it can be broader: is a person going to default on their loan? will they get divorced?) or to predict a value (how much money will you make next year? what will the stock price be tomorrow?).

Some popular supervised machine learning methods are regression (linear, which can predict a continuous value, or logistic, which can predict a binary value), decision trees, k-nearest neighbors, and naive Bayes.

My favorite of these methods is decision trees. A decision tree is used to classify your data. Once the data is classified, the average is taken of each terminal node; this value is then applied to any future data that fits this classification.

[Figure: decision tree of passenger survival, split on sex, class, and age.]

The decision tree above shows that if you were a female in first or second class, there was a high likelihood you survived. If you were a male in second class who was younger than 12 years old, you also had a high likelihood of surviving. This tree could be used to predict the potential outcomes of future sinking ships (morbid… I know).

Unsupervised Machine Learning

Unsupervised machine learning is the other side of this coin. In this case, we do not necessarily want to make a prediction. Instead, this type of machine learning is used to find similarities and patterns in the information in order to cluster or group it.

An example of this: consider a situation where you are looking at a group of people and you want to group similar people together. You don’t know anything about these people other than what you can see in their physical appearance. You might end up grouping the tallest people together and the shortest people together.
You could do this same thing by weight instead… or hair length… or eye color… or use all of these attributes at the same time! It’s natural in this example to see how “close” people are to one another based on different attributes.

What these types of algorithms do is evaluate the “distance” of one piece of information from another. In a machine learning setting, you look for similarities and “closeness” in the data and group accordingly.

This could allow the administrators of a mobile application to see the different types of users of their app in order to treat each group with different rules and policies. They could cluster samples of users together and analyze each cluster to see if there are opportunities for targeted improvements.

The most popular of these unsupervised machine learning methods is called k-means clustering. In k-means clustering, the goal is to partition your data into k clusters (where k is how many clusters you want: 1, 2, …, 10, etc.). To begin this algorithm, k means (or cluster centers) are randomly chosen. Each data point in the sample is assigned to the closest mean; the center (or centroid, to use the technical term) of each cluster is then calculated, and that becomes the new mean. This process is repeated until the means stop changing.

The important part to note is that the output of k-means is clustered data that is “learned” without any input from a human. Similar methods are used in Natural Language Processing (NLP) in order to do Topic Modeling.

Resources to Learn More

There are countless resources out there for diving deeper into this topic. Here are a few that I’ve used or found along my Data Science journey.

UPDATE: I’ve written a whole post on this. You can find it here

- O’Reilly has a ton of great books that focus on various areas of machine learning.
- edX and Coursera have a TON of self-paced and instructor-led learning courses in machine learning.
- There is a specific series of courses offered by Columbia University that looks particularly applicable.
- If you are interested in learning machine learning and already have a familiarity with R and statistics, DataCamp has a nice, free program. If you are new to R, they have a free program for that, too.
- There are also many, many blogs out there to read about how people are using data science and machine learning.