
Reading data from HTML tables
Following the traditional document formats of the World Wide Web, most text and data are served in HTML pages. We can often find interesting pieces of information in, for example, HTML tables, from which it's pretty easy to copy and paste data into an Excel spreadsheet, save that to disk, and load it into R afterwards. But this takes time, it's boring, and it can be automated anyway.
Such HTML tables can be easily generated with the help of the aforementioned API of the Customer Complaint Database. If we do not set the required output format, for which we used XML or JSON earlier, then the browser returns an HTML table instead, as you should be able to see in the following screenshot:

Well, in the R console it's a bit more complicated, as the browser sends some non-default HTTP headers that curl does not, so the preceding URL would simply return a JSON list. To get HTML, we have to let the server know that we expect HTML output. To do so, simply set the appropriate Accept header of the query:
> doc <- getURL(paste0(u, '/25ei-6bcr/rows?max_rows=5'),
+   httpheader = c(Accept = "text/html"))
The XML package provides an extremely easy way to parse all the HTML tables from a document, or from specific nodes, with the help of the readHTMLTable function, which returns a list of data.frames by default:
> res <- readHTMLTable(doc)
To get only the first table on the page, we can filter res afterwards or pass the which argument to readHTMLTable. The following two R expressions have the very same result:
> df <- res[[1]]
> df <- readHTMLTable(doc, which = 1)
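To see that both routes really yield the same object, here is a minimal self-contained sketch using a toy HTML snippet instead of the live API response (the table content below is made up for illustration):

```r
library(XML)

# A toy HTML table standing in for the document fetched earlier
html <- '<table>
  <tr><th>name</th><th>value</th></tr>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
</table>'
doc <- htmlParse(html, asText = TRUE)

# Extract the first table twice: once by indexing the returned list,
# once via the which argument
df1 <- readHTMLTable(doc)[[1]]
df2 <- readHTMLTable(doc, which = 1)

identical(df1, df2)
```

The last expression is expected to return TRUE, as both calls parse the same node with the same defaults.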
Reading tabular data from static Web pages
Okay, so far we have seen a bunch of variations on the same theme, but what if we do not find a downloadable dataset in any popular data format? For example, one might be interested in the available R packages hosted at CRAN, which are listed at http://cran.r-project.org/web/packages/available_packages_by_name.html. How do we scrape that? There is no need to call RCurl or to specify custom headers, let alone download the file first; it's enough to pass the URL to readHTMLTable:
> res <- readHTMLTable('http://cran.r-project.org/web/packages/available_packages_by_name.html')
So readHTMLTable can fetch HTML pages directly, extract all the HTML tables into data.frame R objects, and return a list of those. In the preceding example, we got a list of only one data.frame, with all the package names and descriptions as columns.
Well, this amount of textual information is not really digestible with the str function. For a quick example of processing and visualizing this type of raw data, and to demonstrate the plethora of features offered by R packages on CRAN, we can now create a word cloud of the package descriptions with some nifty functions from the wordcloud and tm packages:
> library(wordcloud)
Loading required package: Rcpp
Loading required package: RColorBrewer
> wordcloud(res[[1]][, 2])
Loading required package: tm
This short command produces the plot shown in the following screenshot, which displays the most frequent words found in the R package descriptions. The position of the words has no special meaning, but the larger the font size, the higher the frequency. Please see the technical description of the plot following the screenshot:

So we simply passed all the strings from the second column of the first list element to the wordcloud function, which automatically runs a few text-mining routines from the tm package on the text. You can find more details on this topic in Chapter 7, Unstructured Data. Then it renders the words with a relative size weighted by the number of occurrences in the package descriptions. It seems that R packages are indeed primarily targeted at building models and applying multivariate tests on data.