By Bob Rudis (@hrbrmstr)
Wed 17 September 2014
|
tags:
r,
rstats,
rvest,
-- (permalink)
I was offline much of the day Tuesday and completely missed Hadley Wickham’s tweet about the new rvest package:
Are you an #rstats user who misses python's beautiful soup? Please try out rvest (https://t.co/PeiIHr3jDW) and let me know what you think.
— Hadley Wickham (@hadleywickham) September 12, 2014
My intrepid colleague (@jayjacobs) informed me of this (and didn’t gloat too much). I’ve got a “pirate day” post coming up this week that involves scraping content from the web and thought folks might benefit from another example that compares the “old way” and the “new way” (Hadley excels at making lots of “new ways” in R :-) I’ve left the output in with the code to show that you get the same results.
The following shows old/new methods for extracting a table from a web site, including how to use either XPath selectors or CSS selectors in rvest
calls. To stave of some potential comments: due to the way this table is setup and the need to extract only certain components from the td
blocks and elements from tags within the td
blocks, a simple readHTMLTable
would not suffice.
The old/new approaches are very similar, but I especially like the ability to chain output ala magrittr
/dplyr
and not having to mentally switch gears to XPath if I’m doing other work targeting the browser (i.e. prepping data for D3).
The code (sans output) is in this gist, and IMO the rvest
package is going to make working with web site data so much easier.
library(XML)
library(httr)
library(rvest)
library(magrittr)
# setup connection & grab HTML the "old" way w/httr
freak_get <- GET("https://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")
freak_html <- htmlParse(content(freak_get, as="text"))
# do the same the rvest way, using "html_session" since we may need connection info in some scripts
freak <- html_session("https://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")
# extracting the "old" way with xpathSApply
xpathSApply(freak_html, "//*/td[3]", xmlValue)[1:10]
## [1] "Silver Linings Playbook " "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"
## [4] "Argo (DVDscr)" "Identity Thief " "Red Dawn "
## [7] "Rise Of The Guardians (DVDscr)" "Django Unchained (DVDscr)" "Lincoln (DVDscr)"
## [10] "Zero Dark Thirty "
xpathSApply(freak_html, "//*/td[1]", xmlValue)[2:11]
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
xpathSApply(freak_html, "//*/td[4]", xmlValue)
## [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"
## [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"
xpathSApply(freak_html, "//*/td[4]/a[contains(@href,'imdb')]", xmlAttrs, "href")
## href href href
## "https://www.imdb.com/title/tt1045658/" "https://www.imdb.com/title/tt0903624/" "https://www.imdb.com/title/tt0454876/"
## href href href
## "https://www.imdb.com/title/tt1024648/" "https://www.imdb.com/title/tt2024432/" "https://www.imdb.com/title/tt1234719/"
## href href href
## "https://www.imdb.com/title/tt1446192/" "https://www.imdb.com/title/tt1853728/" "https://www.imdb.com/title/tt0443272/"
## href
## "https://www.imdb.com/title/tt1790885/?"
# extracting with rvest + XPath
freak %>% html_nodes(xpath="//*/td[3]") %>% html_text() %>% .[1:10]
## [1] "Silver Linings Playbook " "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"
## [4] "Argo (DVDscr)" "Identity Thief " "Red Dawn "
## [7] "Rise Of The Guardians (DVDscr)" "Django Unchained (DVDscr)" "Lincoln (DVDscr)"
## [10] "Zero Dark Thirty "
freak %>% html_nodes(xpath="//*/td[1]") %>% html_text() %>% .[2:11]
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
freak %>% html_nodes(xpath="//*/td[4]") %>% html_text() %>% .[1:10]
## [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"
## [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"
freak %>% html_nodes(xpath="//*/td[4]/a[contains(@href,'imdb')]") %>% html_attr("href") %>% .[1:10]
## [1] "https://www.imdb.com/title/tt1045658/" "https://www.imdb.com/title/tt0903624/"
## [3] "https://www.imdb.com/title/tt0454876/" "https://www.imdb.com/title/tt1024648/"
## [5] "https://www.imdb.com/title/tt2024432/" "https://www.imdb.com/title/tt1234719/"
## [7] "https://www.imdb.com/title/tt1446192/" "https://www.imdb.com/title/tt1853728/"
## [9] "https://www.imdb.com/title/tt0443272/" "https://www.imdb.com/title/tt1790885/?"
# extracting with rvest + CSS selectors
freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10]
## [1] "Silver Linings Playbook " "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"
## [4] "Argo (DVDscr)" "Identity Thief " "Red Dawn "
## [7] "Rise Of The Guardians (DVDscr)" "Django Unchained (DVDscr)" "Lincoln (DVDscr)"
## [10] "Zero Dark Thirty "
freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11]
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10]
## [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"
## [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"
freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10]
## [1] "https://www.imdb.com/title/tt1045658/" "https://www.imdb.com/title/tt0903624/"
## [3] "https://www.imdb.com/title/tt0454876/" "https://www.imdb.com/title/tt1024648/"
## [5] "https://www.imdb.com/title/tt2024432/" "https://www.imdb.com/title/tt1234719/"
## [7] "https://www.imdb.com/title/tt1446192/" "https://www.imdb.com/title/tt1853728/"
## [9] "https://www.imdb.com/title/tt0443272/" "https://www.imdb.com/title/tt1790885/?"
# building a data frame (which is kinda obvious, but hey)
data.frame(movie=freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10],
rank=freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11],
rating=freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10],
imdb.url=freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10],
stringsAsFactors=FALSE)
## movie rank rating imdb.url
## 1 Silver Linings Playbook 1 7.4 / trailer https://www.imdb.com/title/tt1045658/
## 2 The Hobbit: An Unexpected Journey 2 8.2 / trailer https://www.imdb.com/title/tt0903624/
## 3 Life of Pi (DVDscr/DVDrip) 3 8.3 / trailer https://www.imdb.com/title/tt0454876/
## 4 Argo (DVDscr) 4 8.2 / trailer https://www.imdb.com/title/tt1024648/
## 5 Identity Thief 5 8.2 / trailer https://www.imdb.com/title/tt2024432/
## 6 Red Dawn 6 5.3 / trailer https://www.imdb.com/title/tt1234719/
## 7 Rise Of The Guardians (DVDscr) 7 7.5 / trailer https://www.imdb.com/title/tt1446192/
## 8 Django Unchained (DVDscr) 8 8.8 / trailer https://www.imdb.com/title/tt1853728/
## 9 Lincoln (DVDscr) 9 8.2 / trailer https://www.imdb.com/title/tt0443272/
## 10 Zero Dark Thirty 10 7.6 / trailer https://www.imdb.com/title/tt1790885/?