Wednesday, October 7, 2015

Treasure Trove of R Scripts for Auto Classification, Chart Generation, Solr, Mongo, MySQL and a Ton More

In this repository, hosted on GitHub, the datadolph.in team is sharing all of the R codebase it developed to analyze large quantities of data.
The datadolph.in team has benefited tremendously from fellow R bloggers and other open source communities, and is proud to contribute all of its codebase back to the community.
The codebase includes ETL and integration scripts for:
  • R-Solr Integration
  • R-Mongo Interaction
  • R-MySQL Interaction
  • Fetching, cleansing and transforming data
  • Classification (identify column types)
  • Default chart generation (based on simple heuristics and matching a dimension with a measure; see the sketch below)

Github Source: https://github.com/datadolphyn/R
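
The classification and default chart generation items above deserve a quick illustration.  The repository contains the team's own implementations; the sketch below is only a hypothetical outline of the underlying heuristic: treat character or low-cardinality columns as dimensions, numeric columns as measures, and pair the first dimension with the first measure for a default bar chart.

      classifyColumns <- function(df) {
        # label each column as a "dimension" or a "measure" using a simple heuristic
        sapply(df, function(col) {
          if (is.numeric(col) && length(unique(col)) > 12) "measure" else "dimension"
        })
      }

      defaultChart <- function(df) {
        # pair the first dimension with the first measure and draw a basic bar chart
        types <- classifyColumns(df)
        dim.col <- names(types)[types == "dimension"][1]
        msr.col <- names(types)[types == "measure"][1]
        if (is.na(dim.col) || is.na(msr.col)) return(invisible(NULL))
        barplot(tapply(df[[msr.col]], df[[dim.col]], sum, na.rm=TRUE),
                main=paste(msr.col, "by", dim.col))
      }

      # example: defaultChart(iris[, c("Species", "Sepal.Length")])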

Tuesday, November 19, 2013

R and Solr Integration Using Solr's REST APIs


Solr is a popular, fast and reliable open source enterprise search platform from the Apache Lucene project.  Among many other features, we love its powerful full-text search, hit highlighting, faceted search, and near real-time indexing.  Solr powers the search and navigation features of many of the world's largest internet sites.  Solr, written in Java, uses the Lucene Java search library for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language, including R.

We invested a significant amount of time integrating our R-based data-management platform with Solr using its HTTP/JSON-based REST interface.  This integration allows us to index millions of data-sets in Solr in real time as they are processed by R.  It took us a few days to stabilize and optimize this approach, and we are very proud to share it and the source code with you.  The full source code can be found and downloaded from datadolph.in's git repository.

The script has R functions for:
  • querying Solr and returning matching docs
  • posting a document to Solr (converting an R list to JSON before posting it)
  • deleting indexes: all of them, those for a given document type, or those for a given category within a document type
      # required packages: RCurl for the HTTP calls, RJSONIO for JSON encoding/decoding
      library(RCurl)
      library(RJSONIO)

      # query a field for the text and return matching docs
      querySolr <- function(queryText, queryfield="all") {
        response <- fromJSON(getURL(paste(getQueryURL(), queryfield, ":", queryText, sep="")))
        if(!response$responseHeader$status) # a status of 0 means success
          return(response$response$docs)
      }

      # delete all indexes from solr server
      deleteAllIndexes <-function() {
        response <- postForm(getUpdateURL(),
                             .opts = list(postfields = '{"delete": {"query":"*:*"}}',
                                          httpheader = c('Content-Type' = 'application/json', 
                                                         Accept = 'application/json'),
                                          ssl.verifypeer=FALSE
                             )
        ) #end of PostForm
        return(fromJSON(response)$responseHeader[1])
      }

      # delete all indexes for a document type from solr server 
      # in this example : type = sports
      deleteSportsIndexes <-function() {
        response <- postForm(getUpdateURL(),
                             .opts = list(postfields = '{"delete": {"query":"type:sports"}}',
                                          httpheader = c('Content-Type' = 'application/json', 
                                                         Accept = 'application/json'),
                                          ssl.verifypeer=FALSE
                             )
        ) #end of PostForm
        return(fromJSON(response)$responseHeader[1])
      }

      # delete indexes for a given category within the sports type from solr server 
      # in this example : type = sports and category: basketball
      deleteSportsIndexesForCat <-function(category) {
        response <- postForm(getUpdateURL(),
                             .opts = list(postfields = 
                               paste('{"delete": {"query":"type:sports AND category:', category, '"}}', sep=""),
                                          httpheader = c('Content-Type' = 'application/json', 
                                                         Accept = 'application/json'),
                                          ssl.verifypeer=FALSE
                             )
        ) #end of PostForm
        return(fromJSON(response)$responseHeader[1])
      }
      #deleteSportsIndexesForCat("basketball")

      #Post a new document to Solr
      postDoc <- function(doc) { 
        solr_update_url <- getUpdateURL()
        jsonst <- toJSON(list(doc))
        response <- postForm(solr_update_url,
                             .opts = list(postfields = jsonst,
                                          httpheader = c('Content-Type' = 'application/json', 
                                                         Accept = 'application/json'),
                                          ssl.verifypeer=FALSE
                             )) #end of PostForm
        return(fromJSON(response)$responseHeader[1])
        ########## Commit - only if it doesn't work the other way ###############
        #return(fromJSON(getURL(getCommitURL())))
      }
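
The functions above rely on small URL-builder helpers (getQueryURL, getUpdateURL, getCommitURL) that are defined in the full repository script.  A minimal sketch of what they might look like, assuming a local Solr instance with a core named "collection1" (adjust the host and core name for your setup):

      # hypothetical URL builders, assuming a local Solr core named "collection1"
      solr.base.url <- "http://localhost:8983/solr/collection1"

      # select endpoint, returning JSON
      getQueryURL <- function() {
        paste(solr.base.url, "/select?wt=json&q=", sep="")
      }

      # update endpoint, committing on every request (simplest, not fastest)
      getUpdateURL <- function() {
        paste(solr.base.url, "/update?commit=true&wt=json", sep="")
      }

      # explicit commit endpoint
      getCommitURL <- function() {
        paste(solr.base.url, "/update?commit=true", sep="")
      }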

Happy Coding!

Friday, June 14, 2013

Simulating Map-Reduce in R for Big Data Analysis Using Flights Data

We are constantly crunching through large amounts of data and designing ways to process large datasets on a single node, turning to distributed computing only when single-node processing becomes too time consuming or inefficient.

We are happy to share with the R community one such map-reduce-like approach we designed in R to process, on a single node, the flights data (available here), which has ~122 million records and occupies 12GB of space when uncompressed.  We used Matthew Dowle's data.table package heavily to load and analyze these large datasets.

It took us a few days to stabilize and optimize this approach, and we are very proud to share it and the source code with you.  The full source code can be found and downloaded from datadolph.in's git repository.

Here is how we approached this problem:  First, before loading the datasets into R, we compressed each of the 22 CSV files with gzip for faster reading in R.  The read.csv method can read gzipped files faster than it can read uncompressed files:

# load data.table for fast loading and aggregation
library(data.table)

# load the list of all compressed flight files
 flights.files <- list.files(path=flights.folder.path, pattern="*.csv.gz")

# read one file into a data.table (read.csv reads .gz files transparently)
 flights <- data.table(read.csv(paste(flights.folder.path, flights.files[i], sep=""), stringsAsFactors=F))

Next, we mapped out the analysis we wanted to run to extract insights from each of the datasets.  This included extracting flight-level, airline-level and airport-level aggregated analysis and generating intermediate results.  Here is example code to get stats for each airline by year:

getFlightsStatusByAirlines <- function(flights, yr){   
  # by Year
  if(verbose) cat("Getting stats for airlines:", '\n')
  airlines.stats <- flights[, list(
                                   dep_airports=length(unique(origin)),
                                   flights=length(origin),
                                   flights_cancelled=sum(cancelled, na.rm=T),
                                   flights_diverted=sum(diverted, na.rm=T),
                                   flights_departed_late=length(which(depdelay > 0)),
                                   flights_arrived_late=length(which(arrdelay > 0)),
                                   total_dep_delay_in_mins=sum(depdelay[which(depdelay > 0)]),
                                   avg_dep_delay_in_mins=round(mean(depdelay[which(depdelay > 0)])),
                                   median_dep_delay_in_mins=round(median(depdelay[which(depdelay > 0)])),                 
                                   miles_traveled=sum(distance, na.rm=T)
                                 ), by=uniquecarrier][, year:=yr]
  #change col order
  setcolorder(airlines.stats, c("year", colnames(airlines.stats)[-ncol(airlines.stats)]))
  #save this data
  saveData(airlines.stats, paste(flights.folder.path, "stats/5/airlines_stats_", yr, ".csv", sep=""))
  #clear up space
  rm(airlines.stats)  
 # continue.. see git full code
}
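
The aggregation functions persist their intermediate results through a saveData() helper defined in the full repository script.  A minimal sketch of what it might look like, assuming it simply writes a CSV to disk and creates the target folder if needed:

# hypothetical sketch of the saveData() helper used above
saveData <- function(dt, file.path) {
  # create the target folder if it does not exist yet
  dir.create(dirname(file.path), recursive=TRUE, showWarnings=FALSE)
  write.csv(dt, file.path, row.names=FALSE)
}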

Here is a copy of the map function:

#map all calculations 
mapFlightStats <- function(){
  for(j in 1:period) {
      yr <- as.integer(gsub("[^0-9]", "", gsub("(.*)(\\.csv)", "\\1", flights.files[j])))
      flights.data.file <- paste(flights.folder.path, flights.files[j], sep="")
      if(verbose) cat(yr, ": Reading : ", flights.data.file, "\n")
      flights <- data.table(read.csv(flights.data.file, stringsAsFactors=F))
      setkeyv(flights, c("year", "uniquecarrier", "dest", "origin", "month")) 
     # call functions
      getFlightStatsForYear(flights, yr)
      getFlightsStatusByAirlines(flights, yr)
      getFlightsStatsByAirport(flights, yr)
    }
}

As one can see, we are generating intermediate results by airline (and by airport / flight) for each year and storing them on disk.  The map function takes less than 2 hours to run on a MacBook Pro with a 2.3 GHz dual-core processor and 8 GB of memory, and it generated 132 intermediate datasets containing the aggregated analysis.

And finally, we call the reduce function to aggregate intermediate datasets into final output (for flights, airlines and airports):

#reduce all results
reduceFlightStats <- function(){
  n <- 1:6
  folder.path <- paste("./raw-data/flights/stats/", n, "/", sep="")
  print(folder.path)
  for(i in n){
    filenames <- paste(folder.path[i], list.files(path=folder.path[i], pattern="*.csv"), sep="") 
    dt <- do.call("rbind", lapply(filenames, read.csv, stringsAsFactors=F))
    print(nrow(dt))
    saveData(dt, paste("./raw-data/flights/stats/", i, ".csv", sep=""))
  }
}
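
To tie the pieces together, a minimal driver could set up the globals the functions above assume (flights.folder.path, period and verbose are not defined in the snippets) and then run the map and reduce phases:

# hypothetical driver; the folder path and globals below are assumptions
flights.folder.path <- "./raw-data/flights/"   # folder holding the *.csv.gz files
flights.files <- list.files(path=flights.folder.path, pattern="*.csv.gz")
period <- length(flights.files)                # number of yearly files to process
verbose <- TRUE

mapFlightStats()     # writes intermediate per-year stats under stats/
reduceFlightStats()  # collapses them into one CSV per analysis type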

Monday, October 29, 2012

Pull Yahoo Finance Key-Statistics Instantaneously Using XML and XPath in R


The two-part blog post I published a day ago required key stats from Yahoo Finance for all the companies in the control group I created for my research.  I wanted all the key stats pulled, arranged in a data frame and presented side-by-side so I could form my opinions.

The quantmod package has a "getQuote" method which should return the desired metrics.  The number and names of the metrics can be controlled via "yahooQF" (see the script below).  Unfortunately, this method seems to be broken, as the resulting data frame had a large number of null values for some metrics.  Here is the script nonetheless, if one wishes to experiment:

#######################################################################
# Script to download key metrics for a set of stock tickers using the quantmod package
#######################################################################
require(quantmod)
require("plyr")
what_metrics <- yahooQF(c("Price/Sales", 
                          "P/E Ratio",
                          "Price/EPS Estimate Next Year",
                          "PEG Ratio",
                          "Dividend Yield", 
                          "Market Capitalization"))

tickers <- c("AAPL", "FB", "GOOG", "HPQ", "IBM", "MSFT", "ORCL", "SAP")
# Not all the metrics are returned by Yahoo.
metrics <- getQuote(paste(tickers, sep="", collapse=";"), what=what_metrics)

#Add tickers as the first column and remove the first column which had date stamps
metrics <- data.frame(Symbol=tickers, metrics[,2:length(metrics)]) 

#Change colnames
colnames(metrics) <- c("Symbol", "Revenue Multiple", "Earnings Multiple", 
                       "Earnings Multiple (Forward)", "Price-to-Earnings-Growth", "Div Yield", "Market Cap")

#Persist this to the csv file
write.csv(metrics, "FinancialMetrics.csv", row.names=FALSE)

#######################################################################

After some digging around and staring at the raw HTML of Yahoo's KeyStats page for some time, I decided to use the XML package and XPath operators to get all the nodes which host the key stats (names and values).  This turned out to be a lot simpler.  Let's walk through it in three easy steps:

1) The CSS class name for the HTML nodes which host the name of the metric, such as Market Cap or Enterprise Value, is "yfnc_tablehead1".  This made it quite easy to grab all the elements from the HTML tree with this class name:
 nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")

2) Now all I needed to do was get the value of each such node using the xmlValue function, which returns the name of the metric (Enterprise Value, for example):
 measures <- sapply(nodes, xmlValue)

3) Next, to get the value of any metric, I used the getSibling function to get the adjacent node (i.e. the sibling) and xmlValue again to extract its value.  Here is how it was done:
values <- sapply(nodes, function(x)  xmlValue(getSibling(x)))

That's it.  I then used some other common functions to clean up the column names and constructed a data frame to arrange the key stats in columnar fashion.  Here is the final script; the result is shown in the graphic below.  Please feel free to use this and share it with other R enthusiasts:

#######################################################################
##Alternate method to download all key stats using XML and XPath - PREFERRED WAY
#######################################################################

setwd("C:/Users/i827456/Pictures/Blog/Oct-25")
require(XML)
require(plyr)
getKeyStats_xpath <- function(symbol) {
  yahoo.URL <- "http://finance.yahoo.com/q/ks?s="
  html_text <- htmlParse(paste(yahoo.URL, symbol, sep = ""), encoding="UTF-8")

  #search for <td> nodes anywhere that have class 'yfnc_tablehead1'
  nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")
  
  if(length(nodes) > 0 ) {
   measures <- sapply(nodes, xmlValue)
   
   #Clean up the column name
   measures <- gsub(" *[0-9]*:", "", gsub(" \\(.*?\\)[0-9]*:","", measures))   
   
   #Rename duplicate measures so they can serve as unique column names
   dups <- which(duplicated(measures))
   #print(dups) 
   for(i in seq_along(dups)) 
     measures[dups[i]] = paste(measures[dups[i]], i, sep=" ")
   
   #use siblings function to get value
   values <- sapply(nodes, function(x)  xmlValue(getSibling(x)))
   
   df <- data.frame(t(values))
   colnames(df) <- measures
   return(df)
  } else {
    # no key-stats nodes found for this symbol; return nothing ("break" is invalid outside a loop)
    return(NULL)
  }
}

tickers <- c("AAPL")
stats <- ldply(tickers, getKeyStats_xpath)
rownames(stats) <- tickers
write.csv(t(stats), "FinancialStats_updated.csv",row.names=TRUE)  

#######################################################################




Happy Analyzing!
All Things R &
All Things Analytics (http://goo.gl/CBYQI)


Wednesday, May 23, 2012

If You Are an R Developer, Then You Must Try SAP HANA for Free.



This is a guest blog from Alvaro Tejada Galindo, my colleague and fellow R and SAP HANA enthusiast.  I am thankful to Alvaro for coming and posting on "AllThingsR".

Are you an R developer? Have you ever heard of SAP HANA? Would you like to test SAP HANA for free?

SAP HANA is an In-Memory Database Technology allowing developers to analyze big data in real-time.

Processes that took hours now take seconds thanks to SAP HANA's ability to keep everything in RAM.

As announced at the SAP Sapphire Now event in Orlando, Florida, SAP HANA is now free for developers. You just need to download and install both the SAP HANA Client and SAP HANA Studio, and create an SAP HANA server on Amazon Web Services as described in the following document:
Get your own SAP HANA DB server on Amazon Web Services - http://scn.sap.com/docs/DOC-28294

Why should this interest you? Easy...SAP HANA is an agent of change, pushing speed to its limits, and it can also be integrated with R as described in the following blog:

Want to know more about SAP HANA? Read everything you need here: http://developers.sap.com

You're convinced but don't want to pay for Amazon Web Services? No problem. Just leave a comment including your name, company and email. We will reach out and send you an Amazon gift card so you can get started. Of course, your feedback would be greatly appreciated. We only have a limited number of gift cards, so be quick or miss out.

Author Alvaro Tejada Galindo, mostly known as "Blag", is a Development Expert working in the Technology Innovation and Developer Experience team at SAP Labs.  He can be contacted at a.tejada.galindo@sap.com.

Alvaro's background in his own words: I was an ABAP consultant for 11 years. I worked on implementations in Peru and Canada. I'm also a die-hard developer using R, Python, Ruby, PHP, Flex and many more languages. Now I work for SAP Labs, and my main role is to evangelize SAP technologies by writing blogs and articles, helping people on the forums, attending SAP events, and taking part in many other "developer engagement" activities.
I maintain a blog called “Blag’s bag of rants” at blagrants.blogspot.com

Wednesday, May 2, 2012

Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 - Part III


Mash-up Airlines Performance Data with Historical Weather Data to Pinpoint Weather Related Delays

For this exercise, I combined the following four separate blogs that I did on Big Data, R and SAP HANA.  Historical airlines and weather data were used for the underlying analysis.  The aggregated output of this analysis was written out as JSON and visualized in HTML5, D3 and Google Maps.  The previous blogs in this series are:
  1. Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 - Part II
  2. Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps
  3. Getting Historical Weather Data in R and SAP HANA 
  4. Tracking SFO Airport's Performance Using R, HANA and D3
In this blog, I wanted to mash up disparate data sources in R and HANA by combining airlines data with weather data to understand the reasons behind airport/airline delays.  Why weather?  Because weather is one of the commonly cited reasons in the airline industry for flight delays.  Fortunately, the airlines data breaks the delay down by weather, security, late aircraft, etc., so weather-related delays can be isolated and the actual weather data can then be mashed up to validate the airlines' claims.  However, I will not be doing that here; I will just be displaying the mashed-up data.

I have intentionally focused on the three Bay Area airports and have used the last 4 years of historical data to visualize each airport's performance using an HTML5 calendar built from scratch with D3.js.  One can use all 20 years of data, and all of the airports, to extend this example.  I had downloaded historical weather data for the same 2005-2008 period for the SFO and SJC airports, as shown in my previous blog (for some strange reason, there is no weather data for OAK, huh?).  Here is how the final result looks in HTML5:



Click here to interact with the live example.  Hover over any cell and a tooltip with comprehensive analytics will show the breakdown of the performance delay for the selected cell, including weather data and the correct icons* (the result of the mash-up).  Choose a different airport from the drop-down to change the performance calendar. 
* Weather icons are properties of Weather Underground.

As anticipated, SFO had more red on the calendar than SJC and OAK.  SJC is definitely the best-performing airport in the Bay Area.  Contrary to my expectation, weather didn't cause as much havoc at SFO as one would expect. Strange?

Creating a mash-up of these two data-sets in R was super easy, and a CSV output was produced to work with HTML5/D3.  Here is the R code; if it's not clear from all my previous blogs, I just love the data.table package.


###########################################################################################
# Percent of delayed flights from the three Bay Area airports, a break-up of the flight
# delays by various reasons, and a mash-up with weather data
###########################################################################################

baa.hp.daily.flights <- baa.hp[, list(TotalFlights=length(DepDelay), 
                                      CancelledFlights=sum(Cancelled, na.rm=TRUE)), 
                               by=list(Year, Month, DayofMonth, Origin)]
setkey(baa.hp.daily.flights,Year, Month, DayofMonth, Origin)

baa.hp.daily.flights.delayed <- baa.hp[DepDelay>15,
                                     list(DelayedFlights=length(DepDelay), 
                                      WeatherDelayed=length(WeatherDelay[WeatherDelay>0]),
                                      AvgDelayMins=round(sum(DepDelay, na.rm=TRUE)/length(DepDelay), digits=2),
                                      CarrierCaused=round(sum(CarrierDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2),
                                      WeatherCaused=round(sum(WeatherDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2),
                                      NASCaused=round(sum(NASDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2),
                                      SecurityCaused=round(sum(SecurityDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2),
                                      LateAircraftCaused=round(sum(LateAircraftDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2)), by=list(Year, Month, DayofMonth, Origin)]
setkey(baa.hp.daily.flights.delayed, Year, Month, DayofMonth, Origin)

# Merge two data-tables
baa.hp.daily.flights.summary <- baa.hp.daily.flights.delayed[baa.hp.daily.flights,list(Airport=Origin,
                           TotalFlights, CancelledFlights, DelayedFlights, WeatherDelayed, 
                           PercentDelayedFlights=round(DelayedFlights/(TotalFlights-CancelledFlights), digits=2),
                           AvgDelayMins, CarrierCaused, WeatherCaused, NASCaused, SecurityCaused, LateAircraftCaused)]
setkey(baa.hp.daily.flights.summary, Year, Month, DayofMonth, Airport)

# Merge with weather data
baa.hp.daily.flights.summary.weather <-baa.weather[baa.hp.daily.flights.summary]
baa.hp.daily.flights.summary.weather$Date <- as.Date(paste(baa.hp.daily.flights.summary.weather$Year, 
                                                           baa.hp.daily.flights.summary.weather$Month, 
                                                           baa.hp.daily.flights.summary.weather$DayofMonth, 
                                                           sep="-"),"%Y-%m-%d")
# remove few columns
baa.hp.daily.flights.summary.weather <- baa.hp.daily.flights.summary.weather[, 
            which(!(colnames(baa.hp.daily.flights.summary.weather) %in% c("Year", "Month", "DayofMonth", "Origin"))), with=FALSE]

#Write the output in both JSON and CSV file formats
objs <- baa.hp.daily.flights.summary.weather[, getRowWiseJson(.SD), by=list(Airport)]
# You now have (AirportCode, JSONString); attach them together into one JSON array.
row.json <- apply(objs, 1, function(x) paste('{"AirportCode":"', x[1], '","Data":', x[2], '}', sep=""))
json.st <- paste('[', paste(row.json, collapse=', '), ']')
writeLines(json.st, "baa-2005-2008.summary.json")                 
write.csv(baa.hp.daily.flights.summary.weather, "baa-2005-2008.summary.csv", row.names=FALSE)
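
The getRowWiseJson() function used above is a small utility from the datadolph.in codebase that serializes each row of a data.table into a JSON object.  A minimal sketch of how it might be written, assuming the RJSONIO package:

# hypothetical sketch of getRowWiseJson(): serialize each row to a JSON object
library(RJSONIO)
getRowWiseJson <- function(dt) {
  dt <- as.data.frame(dt)
  row.objs <- apply(dt, 1, function(row) toJSON(as.list(row)))
  paste("[", paste(row.objs, collapse=", "), "]")
}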


Happy Coding!

Wednesday, April 25, 2012

Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 - Part II


In my last blog, Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps, I analyzed a historical airline performance data set using R and SAP HANA and put the aggregated analysis on Google Maps.  Undoubtedly, a map is a pretty exciting canvas on which to view and analyze big data sets.  One can draw shapes (circles, polygons) on the map under a marker pin to provide pin-point information, and display aggregated information in the info window when a marker is clicked.  I enjoyed doing all of that, but I was craving some old-fashioned bubble charts and other chart types to provide comparative information on big data sets.  Ultimately, all big data sets get aggregated into smaller analytical sets for viewing, sharing and reporting.  An old-fashioned chart is the best way to tell a visual story!

A bubble chart can display four-dimensional data for comparative analysis.  In this analysis, I used the same data-set of 200M data points and went deeper, looking at finer slices of information.  I leveraged D3, R and SAP HANA for this blog post.  Here I am publishing some of this work:  

In the first graphic, the performance of the top airlines is compared for 2008.  As expected, Southwest, the largest airline (using total number of flights as a proxy), performed well for its size (1.2M flights, 64 destinations, but an average delay of ~10 mins).  Some of the other airlines, like American and Continental, were among the worst performers along with Skywest.  Note that I didn't remove outliers from this analysis.  Click here to interact with this example (view source to get the D3 code).


In the second analysis, I replaced the airlines dimension with the airports dimension but kept all the other dimensions the same.  To my disbelief, Newark is the worst-performing airport when it comes to departure delays; Chicago O'Hare, SFO and JFK follow.  Atlanta is the largest airport, but it has the best performance.  What are they doing differently at ATL?  Click here to interact with this example (view source to get the D3 code).


It was a hell of a lot of fun playing with D3, R and HANA; good intellectual stimulation if nothing else!  Happy Analyzing, and remember, the possibilities are endless!

As always, my R modules are fairly simple and straightforward:
###########################################################################################
# ETL - Read the airport information, extract the major airport information and upload the
# transformed dataset into HANA
###########################################################################################
major.airports <- data.table(read.csv("MajorAirports.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE))
setkey(major.airports, iata)

all.airports <- data.table(read.csv("AllAirports.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE)) 
setkey(all.airports, iata)

airports.2008.hp <- data.table(read.csv("2008.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE)) 
setkey(airports.2008.hp, Origin, UniqueCarrier)

#Merge two datasets
airports.2008.hp <- major.airports[airports.2008.hp,]


###########################################################################################  
# Get airport statistics for all airports
###########################################################################################
airports.2008.hp.summary <- airports.2008.hp[major.airports,     
    list(AvgDepDelay=round(mean(DepDelay, na.rm=TRUE), digits=2),
    TotalMiles=prettyNum(sum(Distance, na.rm=TRUE), big.mark=","),
    TotalFlights=length(Month),
    TotalDestinations=length(unique(Dest)),
    URL=paste("http://www.fly", Origin, ".com",sep="")), 
                    by=list(Origin)][order(-TotalFlights)]
setkey(airports.2008.hp.summary, Origin)
#merge two data tables
airports.2008.hp.summary <- major.airports[airports.2008.hp.summary, 
                                                     list(Airport=airport, 
                                                          AvgDepDelay, TotalMiles, TotalFlights, TotalDestinations, 
                                                          Address=paste(airport, city, state, sep=", "), 
                                                          Lat=lat, Lng=long, URL)][order(-TotalFlights)]

airports.2008.hp.summary.json <- getRowWiseJson(airports.2008.hp.summary)
writeLines(airports.2008.hp.summary.json, "airports.2008.hp.summary.json")                 
write.csv(airports.2008.hp.summary, "airports.2008.hp.summary.csv", row.names=FALSE)
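
The ETL comment above mentions uploading the transformed dataset into HANA, a step not shown in this snippet.  A minimal sketch of that step, assuming an ODBC data source named "HANA" has already been configured for the SAP HANA instance (the DSN, credentials and table name below are placeholders), using the RODBC package:

# hypothetical HANA upload step; DSN name, credentials and table name are assumptions
library(RODBC)
ch <- odbcConnect("HANA", uid="SYSTEM", pwd="********")
sqlSave(ch, airports.2008.hp.summary, tablename="AIRPORTS_2008_SUMMARY", rownames=FALSE)
odbcClose(ch)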