data science

A Data Scientist's Take on the Roadmap to AI

INTRODUCTION Recently I was asked by a former colleague about getting into AI. He has truly big data and wants to use this data to power “AI” - if the headlines are to be believed, everyone else is already doing it. Though it was difficult for my ego, I told him I couldn’t help him in our 30 minute call and that he should think about hiring someone to get him there.

Getting Started with Data Science

Everyone in the world has a “how to” guide to data science… well, maybe not everyone - but there are a lot of “guides” out there. I get this question infrequently, so I thought I would do my best to put together what have been my best resources for learning. MY STORY Personally, I learned statistics by getting my Masters in Applied Statistics at Villanova University - it took 2.

How to Build a Binary Classification Model

What Is Binary Classification? Algorithms for Binary Classification Logistic Regression Decision Trees/Random Forests Decision Trees Random Forests Nearest Neighbor Support Vector Machines (SVM) Neural Networks Great. Now what? Determining What the Problem is Locate and Obtain Data Data Mining & Preparing for Analysis Splitting the Data Building the Models Validating the Models Conclusion What Is Binary Classification? Binary classification is used to classify a given set into two categories.

Machine Learning Demystified

The differences and applications of Supervised and Unsupervised Machine Learning. Introduction Machine learning is one of the buzziest terms thrown around in technology these days. Combine machine learning with big data in a Google search and you’ve got yourself an unmanageable amount of information to digest. In an (possibly ironic) effort to help navigate this sea of information, this post is meant to be an introduction and simplification of some common machine learning terminology and types with some resources to dive deeper.

Exploring Open Data - Predicting the Amoung of Violations

Introduction In my last post, I went over some of the highlights of the open data set of all Philadelphia Parking Violations. In this post, I’ll go through the steps to build a model to predict the amount of violations the city issues on a daily basis. I’ll walk you through cleaning and building the data set, selecting and creating the important features, and building predictive models using Random Forests and Linear Regression.

Open Data Day - DC Hackathon

For those of you who aren’t stirred from bed in the small hours to learn data science, you might have missed that March 5th was international open data day. There are hundreds of local events around the world; I was lucky enough to attend DC’s Open Data Day Hackathon. I met a bunch of great people doing noble things with data who taught me a crap-ton (scientific term) and also validated my love for data science and how much I’ve learned since beginning my journey almost two years ago.

Identifying Compromised User Accounts with Logistic Regression

INTRODUCTION As a Data Analyst on Comcast’s Messaging Engineering team, it is my responsibility to report on the platform statuses, identify irregularities, measure impact of changes, and identify policies to ensure that our system is used as it was intended. Part of the last responsibility is the identification and remediation of compromised user accounts. The challenge the company faces is being able to detect account compromises faster and remediate them closer to the moment of detection.