Sometimes you just need the salient text from a web site, often as a first step towards natural language processing (NLP) or classification. There are many ways to achieve this, but XSLT (eXtensible Stylesheet Language Transformations) was purpose-built for slicing, dicing and transforming XML (and, hence, HTML), so it can make more sense, and even be speedier, to use XSLT transformations than to write a hefty bit of R (or other language) code.
R has had XSLT processing capabilities in the past. Sxslt and SXalan both provided extensive XSLT/XML processing capabilities, and Carl Boettiger (@cboettig) has resurrected Sxslt on GitHub. However, it has some legacy memory bugs (just like the XML package does; said bugs were there long before Carl did his reanimation) and is a bit more heavyweight than I needed.
The GitHub page for the package has installation instructions (you’ll need to be somewhat adventurous until the package matures a bit), but I wanted to demonstrate the utility before refining it.
Using XSLT in Data Analysis Workflows
At work, we maintain an ever-increasing list of public breaches known as the VERIS Community Database (VCDB). Each breach is a GitHub issue, and in each issue we store links to news stories and other sources that document or report the breach. Coding breaches is pretty labor-intensive work and we have not really received a ton of volunteers (the “C” in “VCDB” stands for “Community”), so we’ve been looking at ways to at least auto-classify the breaches and get some details from them programmatically. This means that getting just the salient text from these news stories/reports is critical.
With the xslt package, we can use an XSLT transformation (that XSLT file is a bit big, mostly due to my XSLT being rusty) in an xml2 pipeline to extract just the text.
Here’s a sample of it in action with apologies for the somewhat large text chunks:
(those are links from three recent breaches posted to VCDB).
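To try the idea without fetching a live page, here is a minimal, self-contained sketch. It rests on several assumptions: it uses the CRAN xslt package and its xml_xslt() function as a stand-in (the function names in the package version described in this post may differ), the HTML snippet is made up, and the stylesheet is a toy that only keeps paragraph text, not the larger stylesheet mentioned above.

```r
library(xml2)
library(xslt)  # assumption: the CRAN 'xslt' package, which exports xml_xslt()

# A tiny stand-in for a downloaded news page (hypothetical content).
page <- read_html(paste0(
  "<html><body>",
  "<div id='nav'>Menu</div>",
  "<p>First   paragraph.</p>",
  "<script>var x = 1;</script>",
  "<p>Second paragraph.</p>",
  "</body></html>"
))

# A deliberately small stylesheet: emit the whitespace-normalized text of
# every <p> element, one per line, and nothing else.
sheet <- read_xml('<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//p">
      <xsl:value-of select="concat(normalize-space(.), \'&#10;\')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>')

# Apply the stylesheet; as.character() keeps the result a plain string.
txt <- as.character(xml_xslt(page, sheet))
cat(txt)
```

The navigation block and the script never reach the output because the stylesheet only selects p elements; a real-world stylesheet would need many more rules to cope with the variety of page layouts.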
Those operations are also pretty fast:
system.time(just_the_text_maam("https://krebsonsecurity.com/2015/07/banks-card-breach-at-trump-hotel-properties/", sheet))
##    user  system elapsed
##   0.089   0.102   0.199
system.time(just_the_text_maam("https://www.csoonline.com/article/2943968/data-breach/hacking-team-hacked-attackers-claim-400gb-in-dumped-data.html", sheet))
##    user  system elapsed
##   0.127   0.179   0.311
system.time(just_the_text_maam("https://datadrivensecurity.info/blog/posts/2015/Jul/hiring-data-scientist/", sheet))
##    user  system elapsed
##   0.034   0.043   0.078
(more benchmarks that exclude the randomness of download speeds will be forthcoming).
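One way to take download time out of the picture is to parse the document once and then time only the transformation. The sketch below illustrates that approach and is not the forthcoming benchmark: it assumes the CRAN xslt package's xml_xslt() as a stand-in, and uses a made-up page and a toy stylesheet.

```r
library(xml2)
library(xslt)  # assumption: the CRAN 'xslt' package, which exports xml_xslt()

# Parse the (hypothetical) page and the stylesheet up front, so that neither
# network access nor parsing is included in the measurement.
page <- read_html("<html><body><p>Some article text.</p></body></html>")
sheet <- read_xml('<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//p">
      <xsl:value-of select="normalize-space(.)"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>')

# Time 100 repeated transformations of the already-parsed document.
timing <- system.time(
  for (i in seq_len(100)) xml_xslt(page, sheet)
)
print(timing)
```

Repeating the call smooths out timer resolution on such a small input; dividing the elapsed time by the number of repetitions gives a rough per-transformation figure.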
Rather than focus on handling tags, attributes and doing some fancy footwork with regular expressions (like all the various readability ports do), you get to focus on the data analysis pipeline, with text that’s pretty clean (you can see it misses some things) and also pretty much ready for LDA or other text analysis.
The xmlwrapp C++ library doesn’t have much functionality beyond the transformation function, so there may not be much more added to this package. There is one extra option (passing parameters to XSLT transformation scripts) that will be coded up in short order.
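To give a feel for what stylesheet parameters make possible, here is a minimal sketch. It is an assumption-laden illustration, not this package's interface: it uses the params argument of xml_xslt() from the CRAN xslt package, and the min_len parameter, the page content and the stylesheet are all hypothetical.

```r
library(xml2)
library(xslt)  # assumption: the CRAN 'xslt' package, which exports xml_xslt()

# Hypothetical page with one short and one longer paragraph.
page <- read_html(paste0(
  "<html><body>",
  "<p>Short.</p>",
  "<p>A considerably longer paragraph of text.</p>",
  "</body></html>"
))

# The stylesheet declares a top-level parameter (min_len, default 0) and
# keeps only paragraphs whose normalized text is at least that long.
sheet <- read_xml('<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:param name="min_len" select="0"/>
  <xsl:template match="/">
    <xsl:for-each select="//p[string-length(normalize-space(.)) &gt;= $min_len]">
      <xsl:value-of select="concat(normalize-space(.), \'&#10;\')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>')

# Without a parameter the default applies and both paragraphs are kept.
all_text <- as.character(xml_xslt(page, sheet))

# Passing min_len = 20 overrides the default and drops the short paragraph.
long_only <- as.character(xml_xslt(page, sheet, params = list(min_len = 20)))

cat(all_text)
cat(long_only)
```

Keeping thresholds like this in a parameter means one stylesheet can be reused across sources without editing the XSLT itself.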
If you find a use for xslt (or a bug), drop us a note here or on GitHub.