Recently I was loading a large (~420 MB) CSV file for Kaggle’s Adzuna job salary competition into R and ran into some speed problems. Specifically, R was crunching endlessly and my MacBook Pro turned into a paperweight. It was probably naive of me to attempt loading a CSV of that size directly into R, but I assumed some well-written C functions would handle it quickly and not block the rest of my processes (technically it didn’t block them, but it may as well have). Anyway, for the most part I load data into R via ODBC, so I figured I’d throw the data into a local PostgreSQL database and get on with my data munging.
However, OS X didn’t want to play ball. Apparently, some big cats ago, Apple decided ODBC wasn’t super important (?!) and removed it from the stock install (I recently upgraded to Mountain Lion so my Mac would run slower and sometimes fall into an infinite loop during startup). I searched around a bit and found the usual ODBC suspects that I could install on OS X, but quickly recalled the small pains of doing the same thing on my Ubuntu machine at work. I played around with ODBC for a while and never quite got the configuration I wanted, so I looked a little further and found RPostgreSQL.
Now, having used RODBC over the years I don’t have many complaints, except that if you don’t already have a stock ODBC install, setting everything up can eat time you’d rather spend on your actual cause (fun data analysis, in this case). What would be better is a package that’s wicked fast and whose configuration is handled during package install. Check out the CRAN entry.
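To make the “configuration handled during install” point concrete, getting started is a one-liner from CRAN (a minimal sketch; this assumes a working build toolchain, since the package compiles against its bundled libpq during install):

```r
# Install from CRAN; the PostgreSQL client library is set up as part of
# the package install, so no separate driver configuration is needed.
install.packages("RPostgreSQL")
library(RPostgreSQL)
```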
After installation (and with the CSV already loaded into Postgres), I pulled the entire dataset into R in about 3 seconds:
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")                                  # load the PostgreSQL driver
con <- dbConnect(drv, dbname = "postgres", user = "postgres")  # note: dbname, not db
rs <- dbSendQuery(con, statement = "select * from rev1")
df <- fetch(rs, n = -1)                                        # n = -1 fetches all rows
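For completeness, the step that got the CSV into Postgres isn’t shown above. A sketch of one way to do it with a server-side COPY, plus connection cleanup (the file path and the assumption that the rev1 table already exists are mine, not from the post):

```r
library(RPostgreSQL)
con <- dbConnect(dbDriver("PostgreSQL"), dbname = "postgres", user = "postgres")

# COPY is the fast path for bulk CSV loads; the path is hypothetical and
# must be readable by the Postgres server process (otherwise use psql's
# client-side \copy instead).
dbSendQuery(con, "COPY rev1 FROM '/tmp/adzuna.csv' WITH CSV HEADER")

# Release the result set and close the connection when done
dbDisconnect(con)
```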