open data

Exploring Open Data - Predicting the Amoung of Violations

Introduction In my last post, I went over some of the highlights of the open data set of all Philadelphia Parking Violations. In this post, I’ll go through the steps to build a model to predict the amount of violations the city issues on a daily basis. I’ll walk you through cleaning and building the data set, selecting and creating the important features, and building predictive models using Random Forests and Linear Regression. Step 1: Load Packages and Data Just an initial step to get the right libraries and data loaded in R. library(plyr) library(randomForest) ## DATA FILE FROM OPENDATAPHILLY ptix <- read.csv("Parking_Violations.csv") ## READ IN THE WEATHER DATA (FROM NCDC) weather_data <- read.csv("weather_data.csv") ## LIST OF ALL FEDERAL HOLIDAYS DURING THE ## RANGE OF THE DATA SET holidays <- as.Date(c("2012-01-02", "2012-01-16", "2012-02-20", "2012-05-28", "2012-07-04", "2012-09-03", "2012-10-08", "2012-11-12", "2012-11-22", "2012-12-25", "2013-01-01", "2013-01-21", "2013-02-18", "2013-05-27", "2013-07-04", "2013-09-02", "2013-10-14", "2013-11-11", "2013-11-28", "2013-12-25", "2014-01-01", "2014-01-20", "2014-02-17", "2014-05-26", "2014-07-04", "2014-09-01", "2014-10-13", "2014-11-11", "2014-11-27", "2014-12-25", "2015-01-01", "2015-01-09", "2015-02-16", "2015-05-25", "2015-07-03", "2015-09-07")) Step 2: Formatting the Data First things first, we have to total the amount of tickets per day from the raw data. For this, I use the plyr command ddply. Before I can use the ddply command, I need to format the Issue.Date.and.Time column to be a Date variable in the R context. days <- ptix$Issue.Date.and.Time, format = "%m/%d/%Y")) names(days) <- "DATE" count_by_day <- ddply(days, .(DATE), summarize, count = length(DATE)) Next, I do the same exact date formatting with the weather data. weather_data$DATE <- as.Date(as.POSIXct(strptime(as.character(weather_data$DATE), format = "%Y%m%d")), format = "%m/%d/%Y") Now that both the ticket and weather data have the same date format (and name), we can use the join function from the plyr package. count_by_day <- join(count_by_day, weather_data, by = "DATE") With the data joined by date, it is time to clean. There are a number of columns with unneeded data (weather station name, for example) and others with little or no data in them, which I just flatly remove. The data has also been coded with negative values representing that data had not been collected for any number of reasons (I’m not surprised that snow was not measured in the summer); for that data, I’ve made any values coded -9999 into 0. There are some days where the maximum or minimum temperature was not gathered (I’m not sure why). As this is the main variable I plan to use to predict daily violations, I drop the entire row if the temperature data is missing. ## I DON'T CARE ABOUT THE STATION OR ITS NAME - ## GETTING RID OF IT count_by_day$STATION <- NULL count_by_day$STATION_NAME <- NULL ## A BUNCH OF VARIABLE ARE CODED WITH NEGATIVE VALUES ## IF THEY WEREN'T COLLECTED - CHANGING THEM TO 0s count_by_day$MDPR[count_by_day$MDPR < 0] <- 0 count_by_day$DAPR[count_by_day$DAPR < 0] <- 0 count_by_day$PRCP[count_by_day$PRCP < 0] <- 0 count_by_day$SNWD[count_by_day$SNWD < 0] <- 0 count_by_day$SNOW[count_by_day$SNOW < 0] <- 0 count_by_day$WT01[count_by_day$WT01 < 0] <- 0 count_by_day$WT03[count_by_day$WT03 < 0] <- 0 count_by_day$WT04[count_by_day$WT04 < 0] <- 0 ## REMOVING ANY ROWS WITH MISSING TEMP DATA count_by_day <- count_by_day[ count_by_day$TMAX > 0, ] count_by_day <- count_by_day[ count_by_day$TMIN > 0, ] ## GETTING RID OF SOME NA VALUES THAT POPPED UP count_by_day <- count_by_day[! count_by_day$TMAX), ] ## REMOVING COLUMNS THAT HAVE LITTLE OR NO DATA ## IN THEM (ALL 0s) count_by_day$TOBS <- NULL count_by_day$WT01 <- NULL count_by_day$WT04 <- NULL count_by_day$WT03 <- NULL ## CHANGING THE DATA, UNNECESSARILY, FROM 10ths OF ## DEGREES CELCIUS TO JUST DEGREES CELCIUS count_by_day$TMAX <- count_by_day$TMAX / 10 count_by_day$TMIN <- count_by_day$TMIN / 10 Step 3: Visualizing the Data At this point, we have joined our data sets and gotten rid of the unhelpful “stuff.” What does the data look like? Daily Violation Counts There are clearly two populations here. With the benefit of hindsight, the small population on the left of the histogram is mainly Sundays. The larger population with the majority of the data is all other days of the week. Let’s make some new features to explore this idea. Step 4: New Feature Creation As we see in the histogram above, there are obviously a few populations in the data - I know that day of the week, holidays, and month of the year likely have some strong influence on how many violations are issued. If you think about it, most parking signs include the clause: “Except Sundays and Holidays.” Plus, spending more than a few summers in Philadelphia at this point, I know that from Memorial Day until Labor Day the city relocates to the South Jersey Shore (emphasis on the South part of the Jersey Shore). That said - I add in those features as predictors. ## FEATURE CREATION - ADDING IN THE DAY OF WEEK count_by_day$DOW <- as.factor(weekdays(count_by_day$DATE)) ## FEATURE CREATION - ADDING IN IF THE DAY WAS A HOLIDAY count_by_day$HOL <- 0 count_by_day$HOL[as.character(count_by_day$DATE) %in% as.character(holidays)] <- 1 count_by_day$HOL <- as.factor(count_by_day$HOL) ## FEATURE CREATION - ADDING IN THE MONTH count_by_day$MON <- as.factor(months(count_by_day$DATE)) Now - let’s see if the Sunday thing is real. Here is a scatterplot of the data. The circles represent Sundays; triangles are all other days of the week. Temperature vs. Ticket Counts You can clearly see that Sunday’s tend to do their own thing in a very consistent manner that is similar to the rest of the week. In other words, the slope for Sundays is very close to that of the slope for all other days of the week. There are some points that don’t follow those trends, which are likely due to snow, holidays, and/or other man-made or weather events. Let’s split the data into a training and test set (that way we can see how well we do with the model). I’m arbitrarily making the test set the last year of data; everything before that is the training set. train <- count_by_day[count_by_day$DATE < "2014-08-01", ] test <- count_by_day[count_by_day$DATE >= "2014-08-01", ] Step 5: Feature Identification We now have a data set that is ready for some model building! The problem to solve next is figuring out which features best explain the count of violations issued each day. My preference is to use Random Forests to tell me which features are the most important. We’ll also take a look to see which, if any, variables are highly correlated. High correlation amongst input variables will lead to high variability due to multicollinearity issues. featForest <- randomForest(count ~ MDPR + DAPR + PRCP + SNWD + SNOW + TMAX + TMIN + DOW + HOL + MON, data = train, importance = TRUE, ntree = 10000) ## PLOT THE VARIABLE TO SEE THE IMPORTANCE varImpPlot(featForest) In the Variable Importance Plot below, you can see very clearly that the day of the week (DOW) is by far the most important variable in describing the amount of violations written per day. This is followed by whether or not the day was a holiday (HOL), the minimum temperature (TMIN), and the month (MON). The maximum temperature is in there, too, but I think that it is likely highly correlated with the minimum temperature (we’ll see that next). The rest of the variables have very little impact. Variable Importance Plot cor(count_by_day[,c(3:9)]) I’ll skip the entire output of the correlation table, but TMIN and TMAX have a correlation coefficient of 0.940379171. Because TMIN has a higher variable importance and there is a high correlation between the TMIN and TMAX, I’ll leave TMAX out of the model. Step 6: Building the Models The goal here was to build a multiple linear regression model - since I’ve already started down the path of Random Forests, I’ll do one of those, too, and compare the two. To build the models, we do the following: ## BUILD ANOTHER FOREST USING THE IMPORTANT VARIABLES predForest <- randomForest(count ~ DOW + HOL + TMIN + MON, data = train, importance = TRUE, ntree = 10000) ## BUILD A LINEAR MODEL USING THE IMPORTANT VARIABLES linmod_with_mon <- lm(count ~ TMIN + DOW + HOL + MON, data = train) In looking at the summary, I have questions on whether or not the month variable (MON) is significant to the model or not. Many of the variables have rather high p-values. summary(linmod_with_mon) Call: lm(formula = count ~ TMIN + DOW + HOL + MON, data = train) Residuals: Min 1Q Median 3Q Max -4471.5 -132.1 49.6 258.2 2539.8 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5271.4002 89.5216 58.884 < 2e-16 *** TMIN -15.2174 5.6532 -2.692 0.007265 ** DOWMonday -619.5908 75.2208 -8.237 7.87e-16 *** DOWSaturday -788.8261 74.3178 -10.614 < 2e-16 *** DOWSunday -3583.6718 74.0854 -48.372 < 2e-16 *** DOWThursday 179.0975 74.5286 2.403 0.016501 * DOWTuesday -494.3059 73.7919 -6.699 4.14e-11 *** DOWWednesday -587.7153 74.0264 -7.939 7.45e-15 *** HOL1 -3275.6523 146.8750 -22.302 < 2e-16 *** MONAugust -99.8049 114.4150 -0.872 0.383321 MONDecember -390.2925 109.4594 -3.566 0.000386 *** MONFebruary -127.8091 112.0767 -1.140 0.254496 MONJanuary -73.0693 109.0627 -0.670 0.503081 MONJuly -346.7266 113.6137 -3.052 0.002355 ** MONJune -30.8752 101.6812 -0.304 0.761481 MONMarch -1.4980 94.8631 -0.016 0.987405 MONMay 0.1194 88.3915 0.001 0.998923 MONNovember 170.8023 97.6989 1.748 0.080831 . MONOctober 125.1124 92.3071 1.355 0.175702 MONSeptember 199.6884 101.9056 1.960 0.050420 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 544.2 on 748 degrees of freedom Multiple R-squared: 0.8445, Adjusted R-squared: 0.8405 F-statistic: 213.8 on 19 and 748 DF, p-value: < 2.2e-16 To verify this, I build the model without the MON term and then do an F-Test to compare using the results of the ANOVA tables below. ## FIRST ANOVA TABLE (WITH THE MON TERM) anova(linmod_with_mon) Analysis of Variance Table Response: count Df Sum Sq Mean Sq F value Pr(>F) TMIN 1 16109057 16109057 54.3844 4.383e-13 *** DOW 6 1019164305 169860717 573.4523 < 2.2e-16 *** HOL 1 147553631 147553631 498.1432 < 2.2e-16 *** MON 11 20322464 1847497 6.2372 6.883e-10 *** Residuals 748 221563026 296207 ## SECOND ANOVA TABLE (WITHOUT THE MON TERM) anova(linmod_wo_mon) Analysis of Variance Table Response: count Df Sum Sq Mean Sq F value Pr(>F) TMIN 1 16109057 16109057 50.548 2.688e-12 *** DOW 6 1019164305 169860717 532.997 < 2.2e-16 *** HOL 1 147553631 147553631 463.001 < 2.2e-16 *** Residuals 759 241885490 318690 ## Ho: B9 = B10 = B11 = B12 = B13 = B14 = B15 = B16 = ## B17 = B18 = B19 = 0 ## Ha: At least one is not equal to 0 ## F-Stat = MSdrop / MSE = ## ((SSR1 - SSR2) / (DF(R)1 - DF(R)2)) / MSE f_stat <- ((241885490 - 221563026) / (759 - 748)) / 296207 ## P_VALUE OF THE F_STAT CALCULATED ABOVE p_value <- 1 - pf(f_stat, 11, 748) Since the P-Value 6.8829e-10 is MUCH MUCH less than 0.05, I can reject the null hypothesis and conclude that at least one of the parameters associated with the MON term is not zero. Because of this, I’ll keep the term in the model. Step 7: Apply the Models to the Test Data Below I call the predict function to see how the Random Forest and Linear Model predict the test data. I am rounding the prediction to the nearest integer. To determine which model performs better, I am calculating the difference in absolute value of the predicted value from the actual count. ## PREDICT THE VALUES BASED ON THE MODELS test$RF <- round(predict(predForest, test), 0) test$LM <- round(predict.lm(linmod_with_mon, test), 0) ## SEE THE ABSOLUTE DIFFERENCE FROM THE ACTUAL difOfRF <- sum(abs(test$RF - test$count)) difOfLM <- sum(abs(test$LM - test$count)) Conclusion As it turns out, the Linear Model performs better than the Random Forest model. I am relatively pleased with the Linear Model - an R-Squared value of 0.8445 ain’t nothin’ to shake a stick at. You can see that Random Forests are very useful in identifying the important features. To me, it tends to be a bit more of a “black box” in comparison the linear regression - I hesitate to use it at work for more than a feature identification tool. Overall - a nice little experiment and a great dive into some open data. I now know that PPA rarely takes a day off, regardless of the weather. I’d love to know how much of the fines they write are actually collected. I may also dive into predicting what type of ticket you received based on your location, time of ticket, etc. All in another day’s work! Thanks for reading.

Exploring Open Data - Philadelphia Parking Violations

Introduction A few weeks ago, I stumbled across Dylan Purcell’s article on Philadelphia Parking Violations. This is a nice glimpse of the data, but I wanted to get a taste of it myself. I went and downloaded the entire data set of Parking Violations in Philadelphia from the OpenDataPhilly website and came up with a few questions after checking out the data: How many tickets in the data set? What is the range of dates in the data? Are there missing days/data? What was the biggest/smallest individual fine? What were those fines for? Who issued those fines? What was the average individual fine amount? What day had the most/least count of fines? What is the average amount per day How much $ in fines did they write each day? What hour of the day are the most fines issued? What day of the week are the most fines issued? What state has been issued the most fines? Who (what individual) has been issued the most fines? How much does the individual with the most fines owe the city? How many people have been issued fines? What fines are issued the most/least? And finally to the cool stuff: Where were the most fines? Can I see them on a heat map? Can I predict the amount of parking tickets by weather data and other factors using linear regression? How about using Random Forests? Data Insights This data set has 5,624,084 tickets in it that spans from January 1, 2012 through September 30, 2015 - an exact range of 1368.881 days. I was glad to find that there are no missing days in the data set. The biggest fine, $2000 (OUCH!), was issued (many times) by the police for “ATV on Public Property.” The smallest fine, $15, was issued also by the police “parking over the time limit.” The average fine for a violation in Philadelphia over the time range was $46.33. The most violations occurred on November 30, 2012 when 6,040 were issued. The least issued, unsurprisingly, was on Christmas day, 2014, when only 90 were issued. On average, PPA and the other 9 agencies that issued tickets (more on that below), issued 4,105.17 tickets per day. All of those tickets add up to $190,193.50 in fines issued to the residents and visitors of Philadelphia every day!!! Digging a little deeper, I find that the most popular hour of the day for getting a ticket is 12 noon; 5AM nets the least tickets. Thursdays see the most tickets written (Thursdays and Fridays are higher than the rest of the week; Sundays see the least (pretty obvious). Other obvious insight is that PA licensed drivers were issued the most tickets. Looking at individuals, there was one person who was issued 1,463 tickets (thats more than 1 violation per day on average) for a whopping $36,471. In just looking at a few of their tickets, it seems like it is probably a delivery vehicle that delivers to Chinatown (Tickets for “Stop Prohibited” and “Bus Only Zone” in the Chinatown area). I’d love to hear more about why this person has so many tickets and what you do about that… 1,976,559 people - let me reiterate - nearly 2 million unique vehicles have been issued fines over the three and three quarter years this data set encompasses. That’s so many!!! That is 2.85 tickets per vehicle, on average (of course that excludes all of the cars that were here and never ticketed). That makes me feel much better about how many tickets I got while I lived in the city. And… who are the agencies behind all of this? It is no surprise that PPA issues the most. There are 11 agencies in all. Seems like all of the policing agencies like to get in on the fun from time to time. Issuing Agency count PPA 4,979,292 PHILADELPHIA POLICE 611,348 CENTER CITY DISTRICT 9,628 SEPTA 9342 UPENN POLICE 6,366 TEMPLE POLICE 4,055 HOUSING AUTHORITY 2,137 PRISON CORRECTIONS OFFICER 295 POST OFFICE 121 FAIRMOUNT DISTRICT 120 Mapping the Violations Where are you most likely to get a violation? Is there anywhere that is completely safe? Looking at the city as a whole, you can see that there are some places that are “hotter” than others. I played around in cartoDB to try to visualize this as well, but tableau seemed to do a decent enough job (though these are just screenshots). Zooming in, you can see that there are some distinct areas where tickets are given out in more quantity. Looking one level deeper, you can see that there are some areas like Center City, east Washington Avenue, Passyunk Ave, and Broad Street that seem to be very highly patrolled. Summary I created the above maps in Tableau. I used R to summarize the data. The R scripts, raw and processed data, and Tableau workbook can be found in my github repo. In the next post, I use weather data and other parameters to predict how many tickets will be written on a daily basis.

Open Data Day - DC Hackathon

For those of you who aren’t stirred from bed in the small hours to learn data science, you might have missed that March 5th was international open data day. There are hundreds of local events around the world; I was lucky enough to attend DC’s Open Data Day Hackathon. I met a bunch of great people doing noble things with data who taught me a crap-ton (scientific term) and also validated my love for data science and how much I’ve learned since beginning my journey almost two years ago. Here is a quick rundown of what I learned and some helpful links so that you can find out more, too. Being that it is an Open Data event, everything was well documented on the hackathon hackpad. Introduction to Open Data Eric Mill gave an really nice overview of what JSON is how to use APIs to access the JSON and thus, the data the website is conveying. Though many APIs are open and documented, many are not. Eric gave some tips on how to access that data, too. This session really opened my eyes to how to access that previously unusable data that was hidden in plain sight in the text of websites. Data Science Primer This was one of the highlights for me - A couple of NIST Data Scientists, Pri Oberoi and Star Ying, gave a presentation and walkthrough on how to use k-means clustering to identify groupings in your data. The data and jupyter notebook is available on github. I will definitely be using this in my journey to better detect and remediate compromised user accounts at Comcast. Hackathon I joined a group that was working to use data science to identify Opioid overuse. Though I didn’t add much (the group was filled with some really really smart people), I was able to visualize the data using R and share some of those techniques with the team. Intro to D3 Visualizations The last session and probably my favorite was a tutorial on building out a D3 Visualization. Chris Given walked a packed house through building a D3 viz step-by-step, giving some background on why things work they work and showing some great resources. I am particularly proud of the results (though I only followed his instruction to build this). Closing I also attended 2 sessions about using the command line that totally demystified the shell prompt. All in all, it was a great two days! I will definitely be back next year (unless I can convince someone to do one in Philly).