Using R and Splunk: Lookups of More Than 10,000 Results

Splunk, for some probably very good reasons, has limits on how many results are returned by sub-searches (which in turn limits us on lookups, too). Because of this, I’ve used R to search Splunk through it’s API endpoints (using the httr package) and utilize loops, the plyr package, and other data manipulation flexibilities given through the use of R. This has allowed me to answer some questions for our business team that at the surface seem simple enough, but the data gathering and manipulation get either too complex or large for Splunk to handle efficiently. Here are some examples: Of the 1.5 million customers we’ve emailed in a marketing campaign, how many of them have made the conversion? How are our 250,000 beta users accessing the platform? Who are the users logging into our system from our internal IPs? The high level steps to using R and Splunk are: Import the lookup values of concern as a csv Create the lookup as a string Create the search string including the lookup just created Execute the GET to get the data Read the response into a data table I’ve taken this one step further; because my lookups are usually LARGE, I end up breaking up the search into smaller chunks and combining the results at the end. Here is some example code that you can edit to show what I’ve done and how I’ve done it. This bit of code will iteratively run the “searchstring” 250 times and combine the results. ## LIBRARY THAT ENABLES THE HTTPS CALL ## library(httr) ## READ IN THE LOOKUP VALUES OF CONCERN ## mylookup <- read.csv("mylookup.csv", header = FALSE) ## ARBITRARY "CHUNK" SIZE TO KEEP SEARCHES SMALLER ## start <- 1 end <- 1000 ## CREATE AN EMPTY DATA FRAME THAT WILL HOLD END RESULTS ## alldata <- data.frame() ## HOW MANY "CHUNKS" WILL NEED TO BE RUN TO GET COMPLETE RESULTS ## for(i in 1:250){ ## CREATES THE LOOKUP STRING FROM THE mylookup VARIABLE ## lookupstring <- paste(mylookup[start:end], sep = "", collapse = '" OR VAR_NAME="') ## CREATES THE SEARCH STRING; THIS IS A SIMPLE SEARCH EXAMPLE ## searchstring <- paste('index = "my_splunk_index" (VAR_NAME="', lookupstring, '") | stats count BY VAR_NAME', sep = "") ## RUNS THE SEARCH; SUB IN YOUR SPLUNK LINK, USERNAME, AND PASSWORD ## response <- GET("https://our.splunk.link:8089/", path = "servicesNS/admin/search/search/jobs/export", encode="form", config(ssl_verifyhost=FALSE, ssl_verifypeer=0), authenticate("USERNAME", "PASSWORD"), query=list(search=paste0("search ", searchstring, collapse="", sep=""), output_mode="csv")) ## CHANGES THE RESULTS TO A DATA TABLE ## result <- read.table(text = content(response, as = "text"), sep = ",", header = TRUE, stringsAsFactors = FALSE) ## BINDS THE CURRENT RESULTS WITH THE OVERALL RESULTS ## alldata <- rbind(alldata, result) ## UPDATES THE START POINT start <- end + 1 ## UPDATES THE END POINT, BUT MAKES SURE IT DOESN'T GO TOO FAR ## if((end + 1000) > length(allusers)){ end <- length(allusers) } else { end <- end + 1000 } ## FOR TROUBLESHOOTING, I PRINT THE ITERATION ## #print(i) } ## WRITES THE RESULTS TO A CSV ## write.table(alldata, "mydata.csv", row.names = FALSE, sep = ",") So - that is how you do a giant lookup against Splunk data with R! I am sure that there are more efficient ways of doing this, even in the Splunk app itself, but this has done the trick for me!

Using the Google Search API and Plotly to Locate Waterparks

I’ve got a buddy who manages and builds waterparks. I thought to myself… I am probably the only person in the world who has a friend that works at a waterpark - cool. Then I started thinking some more… there has to be more than just his waterpark in this country; I’ve been to at least a few… and the thinking continued… I wonder how many there are… and continued… and I wonder where they are… and, well, here we are at the culmination of that curiosity with this blog post. So - the first problem - how would I figure that out? As with most things I need answers to in this world, I turned to Google and asked: Where are the waterparks in the US? The answer appears to be: there are a lot. The data is there if I can get my hands on it. Knowing that Google has an API, I signed up for an API key and away I went! Until I was stopped abruptly with limits on how many results will be returned: a measly 20 per search. I know R and wanted to use that to hit the API. Using the httr package and a for loop, I conceded to doing the search once per state and living with a maximum of 20 results per state. Easy fix. Here’s the code to generate the search string and query Google: q1 <- paste("waterparks in ", list_of_states[j,1], sep = "") response <- GET("https://maps.googleapis.com/", path = "maps/api/place/textsearch/xml", query = list(query = q1, key = "YOUR_API_KEY")) The results come back in XML (or JSON, if you so choose… I went with XML for this, though) - something that I have not had much experience in. I used the XML package and a healthy amount of more time in Google search-land and was able to parse the data into data frame! Success! Here’s a snippet of the code to get this all done: result <- xmlParse(response) result1 <- xmlRoot(result) result2 <- getNodeSet(result1, "//result") data[counter, 1] <- xmlValue(result2[[i]][["name"]]) data[counter, 2] <- xmlValue(result2[[i]][["formatted_address"]]) data[counter, 3] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lat"]]) data[counter, 4] <- xmlValue(result2[[i]][["geometry"]][["location"]][["lng"]]) data[counter, 5] <- xmlValue(result2[[i]][["rating"]]) Now that the data is gathered and in the right shape - what is the best way to present it? I’ve recently read about a package in R named plotly. They have many interesting and interactive visualizations, plus the API plugs right into R. I found a nice example of a map using the package. With just a few lines of code and a couple iterations, I was able to generate this (click on the picture to get the full interactivity): Waterpark’s in the USA This plot can be seen here, too. Not too shabby! There are a few things to mention here… For one, not every water park has a rating; I dealt with this by making the NAs into 0s. That’s probably not the nicest way of handling that. Also - this is only the top 20 waterparks as Google decided per state. There are likely some waterparks out there that are not represented here. There are also probably non-waterparks represented here that popped up in the results. For those of you who are interested in the data or script I used to generate this map, feel free to grab them at those links. Maybe one day I’ll come back to this to find out where there are the most waterparks per capita - or some other correlation to see what the best water park really is… this is just the tip of the iceberg. It feels good to scratch a few curiosity driven scratches in one project!