Mastering Data Analysis with R

Loading datasets from the Internet

The most obvious approach is to download the datasets from the Web and load them into our R session in two manual steps, sketched right after this list:

  1. Save the datasets to disk.
  2. Read them with standard functions, such as read.table, or, for example, foreign::read.spss to import SPSS sav files.
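
For reference, here is a minimal sketch of this two-step approach in base R; the URL and file name are placeholders:

> # step 1: save the remote file to disk (the URL is hypothetical)
> download.file('http://example.com/data.csv', destfile = 'data.csv')
> # step 2: parse the local copy with a standard reader
> df <- read.csv('data.csv')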

But we can often save some time by skipping the first step and loading the flat text data files directly from the URL. The following example fetches a comma-separated file from the Americas Open Geocode (AOG) database at http://opengeocode.org, which lists the government, national statistics, geological information, and post office website URLs for the countries of the world:

> str(read.csv('http://opengeocode.org/download/CCurls.txt'))
'data.frame': 249 obs. of 5 variables:
 $ ISO.3166.1.A2 : Factor w/ 248 levels "AD" ...
 $ Government.URL : Factor w/ 232 levels "" ...
 $ National.Statistics.Census..URL: Factor w/ 213 levels "" ...
 $ Geological.Information.URL : Factor w/ 116 levels "" ...
 $ Post.Office.URL : Factor w/ 156 levels "" ...

In this example, we passed a hyperlink to the file argument of read.table, which downloaded the text file before processing it. The url function, used by read.table in the background, supports the HTTP and FTP protocols and can also handle proxies, but it has its own limitations. For example, url does not support Hypertext Transfer Protocol Secure (HTTPS), apart from a few exceptions on Windows, although HTTPS is often a must when accessing Web services that handle sensitive data.
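
To make the role of url explicit, the previous call is roughly equivalent to opening the connection manually, as in this short sketch:

> # open an HTTP connection explicitly; read.csv accepts connections
> con <- url('http://opengeocode.org/download/CCurls.txt')
> urls <- read.csv(con)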

Note

HTTPS is not a separate protocol alongside HTTP, but rather HTTP over an encrypted SSL/TLS connection. While HTTP is considered insecure due to the unencrypted packets travelling between the client and the server, HTTPS, with the help of signed and trusted certificates, prevents third parties from discovering the sensitive information being transmitted.

In such situations, it is wise, and used to be the only reasonable option, to install and use the RCurl package, which is an R client interface to curl: http://curl.haxx.se. Curl supports a wide variety of protocols and URI schemes, and handles cookies, authentication, redirects, timeouts, and even more.

For example, let's check the U.S. Government's open data catalog at http://catalog.data.gov/dataset. Although the general site can be accessed without SSL, most of the generated download URLs follow the HTTPS URI scheme. In the following example, we will fetch the Comma-Separated Values (CSV) file of the Consumer Complaint Database from the Consumer Financial Protection Bureau, which can be accessed at http://catalog.data.gov/dataset/consumer-complaint-database.

Note

This CSV file contains metadata on around a quarter of a million complaints about financial products and services filed since 2011. Please note that the file is around 35-40 megabytes, so downloading it might take some time, and you probably do not want to reproduce the following example on a mobile or otherwise limited Internet connection. If the getURL function fails with a certificate error (this might happen on Windows), please provide the path of the certificate manually via options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))), or try the more recently published curl package by Jeroen Ooms or httr (an RCurl front-end) by Hadley Wickham; see later.
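
As a quick taste of the curl-package alternative mentioned above, the whole download could also go through a connection object, as in this sketch (url stands for the same download link that we define in the next snippet):

> # the curl package returns a connection that base readers understand
> library(curl)
> df <- read.csv(curl(url))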

Let's see the distribution of these complaints by product type after fetching and loading the CSV file directly from R:

> library(RCurl)
Loading required package: bitops
> url <- 'https://data.consumerfinance.gov/api/views/x94z-ydhh/rows.csv?accessType=DOWNLOAD'
> df <- read.csv(text = getURL(url))
> str(df)
'data.frame': 236251 obs. of 14 variables:
 $ Complaint.ID : int 851391 851793 ...
 $ Product : Factor w/ 8 levels ...
 $ Sub.product : Factor w/ 28 levels ...
 $ Issue : Factor w/ 71 levels "Account opening ...
 $ Sub.issue : Factor w/ 48 levels "Account status" ...
 $ State : Factor w/ 63 levels "","AA","AE" ...
 $ ZIP.code : int 14220 64119 ...
 $ Submitted.via : Factor w/ 6 levels "Email","Fax" ...
 $ Date.received : Factor w/ 897 levels ...
 $ Date.sent.to.company: Factor w/ 847 levels "","01/01/2013" ...
 $ Company : Factor w/ 1914 levels ...
 $ Company.response : Factor w/ 8 levels "Closed" ...
 $ Timely.response. : Factor w/ 2 levels "No","Yes" ...
 $ Consumer.disputed. : Factor w/ 3 levels "","No","Yes" ...
> sort(table(df$Product))

        Money transfers           Consumer loan            Student loan 
                    965                    6564                    7400 
        Debt collection        Credit reporting Bank account or service 
                  24907                   26119                   30744 
            Credit card                Mortgage 
                  34848                  104704 

Although it's nice to know that most complaints were received about mortgages, the point here was to use curl to download the CSV file over an HTTPS URI, and then pass the content to the read.csv function (or any other parser we discussed in the previous chapter) as text.
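
The same downloaded text could just as easily be handed to another parser; for instance, a sketch with data.table::fread, assuming the data.table package is installed:

> # fread treats a string containing newlines as the data itself
> library(data.table)
> dt <- fread(getURL(url))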

Note

Besides GET requests, you can easily interact with RESTful API endpoints via POST, DELETE, or PUT requests as well, by using the postForm function from the RCurl package or the httpDELETE, httpPUT, or httpHEAD functions; see details about the httr package later.
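
For example, a minimal sketch of a POST request with RCurl, run against the public httpbin.org testing service (the endpoint and form field are illustrative only):

> # send a form-encoded POST request and capture the response body
> library(RCurl)
> res <- postForm('https://httpbin.org/post', key = 'value')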

Curl can also help to download data from a secured site that requires authorization. The easiest way to do so is to log in to the homepage in a browser, save the cookies to a text file, and then pass the path of that file to cookiefile in getCurlHandle. You can also specify useragent among other options. Please see http://www.omegahat.org/RCurl/RCurlJSS.pdf for more details and an overall (and very useful) overview of the most important RCurl features.
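
Put together, such an authorized session might look like the following sketch, where the cookie file path and the URL are placeholders:

> # reuse a browser session exported to a Netscape-style cookie file
> library(RCurl)
> handle <- getCurlHandle(cookiefile = 'cookies.txt',
+     useragent = 'Mozilla/5.0')
> res <- getURL('https://example.com/protected/data.csv', curl = handle)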

Although curl is extremely powerful, its syntax and numerous options, along with all the technical details, might be way too complex for those without a decent IT background. The httr package is a simplified wrapper around RCurl with some sane defaults and much simpler configuration options for common operations and everyday actions.

For example, cookies are handled automatically by sharing the same connection across all requests to the same website; error handling is much improved, which means easier debugging if something goes wrong; and the package comes with various helper functions to, for instance, set headers, use proxies, and easily issue GET, POST, PUT, DELETE, and other methods. What is more, it also handles authentication in a much more user-friendly way, along with OAuth support.
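
To demonstrate, the earlier RCurl download could be rewritten with httr along these lines (a sketch reusing the url object defined before):

> # GET follows redirects and verifies HTTPS certificates by default
> library(httr)
> response <- GET(url)
> stop_for_status(response)  # stop early on any HTTP error
> df <- read.csv(text = content(response, as = 'text'))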

Note

OAuth is an open standard for authorization with the help of intermediary service providers. This simply means that the user does not have to share actual credentials, but can instead delegate rights to access some of the information stored at the service provider. For example, one can authorize Google to share one's real name, e-mail address, and so on with a third party without disclosing any other sensitive information or needing any passwords. Most generally, OAuth is used for password-less login to various Web services and APIs. For more information, please see Chapter 14, Analyzing the R Community, where we will use OAuth with Twitter to authorize the R session for fetching data.

But what if the data is not available to be downloaded as CSV files?