Tweetscription From “Weathering The Data Storm”

By Bob Rudis (@hrbrmstr)
Sat 25 January 2014 | tags: data science, podcast

I had the opportunity to attend Weathering the Data Storm: The Promise and Challenges of Data Science at Harvard yesterday (2014-01-24). Overall, it was an excellent symposium and I’ll be talking with Jay about it on the next podcast.

The @DDSecBlog account was live-tweeting as much as possible and I’ve hoovered the tweets all up and put them below. So, while you’re waiting for the podcast you can peruse them to see if there are 140-character nuggets worth bookmarking.

Weathering the Data Storm: The Promise and Challenges of Data Science Tweetscription

Rachel Schutt starting her talk on “What is Data Science”; highlighting “promise” & “challenge” in the symposium’s title
“Data Science” subsumes other fields and disciplines; is a moving target; comes with much hype but also much promise
Data Science as a job title: great summary of part of “Building Data Science Teams” by @OReillyMedia (buy the book :-)
Facebook & LinkedIn started building “data products” (e.g. friend recommendations) from scads of raw user action logs
Skill sets are interdisciplinary and team-oriented; “data scientists should think like journalists”
Raw materials of data products are data and code, and they are part of a user-centric feedback loop.
Underlying analytics can be familiar (e.g. naive bayes) but scale can be challenging esp if real-time.
Google didn’t have a job title “data scientist” until 2013. Title rly was born from tech companies.
Great re-quotes of Josh Willis and Will Cukierski #googlethem
Referencing Cleveland’s spiffy “action plan” from 2001 showing use of the co-joined terms “Data Science”
“Data science” process
Models developed for data products need to consider the impact on user behavior (cld be causal vs pure predictive)
Data science is not just a bunch of academic disciplines glued together, instead it shld come from fields intermingling
IPython talk by Fernando Pérez coming up next. It’d rock if it’s all in an IPython Notebook :-)
Despite all the computing resources at our disposal we only have 2 eyeballs & one brain. Need gd flow from raw data to insight
lifecycle of a scientific idea: 1. individual, 2. collab, 3. parallel, 4. publication !reproducible!, 5. education, Goto 1
Nice shout out to Enthought (Canopy rocks btw). IPython is a “terminal for a scientist”
IPython Notebooks: Everyday exploratory computing with annotations
ZOMGosh it’s a live IPython Notebook for the presentation! #awesome
IPython Notebooks have full support for rich output (anything a browser can display) and LaTeX & symbolic math support
W00t! Showing R integration into IPython “%load_ext rmagic” “%%R” (has Julia, Ruby and other support too)
nbconvert can translate IPython Notebooks to HTML, PDF, LaTeX, etc
Can also share IPython Notebooks over at w/just a link to your ipynb file
IPython Notebooks are a huge help to the reproducible research movement
Python for Signal Processing is also available as a github repository and is just a collection of IPython Notebooks #spiffy
All CS109 homework assignments at Harvard are IPython Notebooks
IBM’s Watson case study switched from a clunky in-house custom software setup to IPython.
Enthought Canopy, Microsoft Python Tools in Visual Studio & Azure; Anaconda all “talk” IPython
MIT has a “StarCluster” capable of running IPython at scale
ZOMGOSH! Now answering an audience question about using Perl, SQL & SAS from IPython vs standalone, so he just does it live
Next up: Bonnie Ray, Director of Cognitive Algorithms at IBM Watson Research Center: “From ‘Big Data’ to Better Decision Making”
IBM Watson is involved in many, diverse analytics projects. Featuring their SMS poll analyses now
IBM analyzes unsolicited texts TO the project and auto-classifies them. More info/details here:
Now covering a developed hierarchical TCM technique to drill into customer satisfaction data HTCM paper:
Super-stoked for @jeffrey_heer’s Interactive Data Analysis talk (up next)
Shout out to Tukey. Many quotes. Great foundation for the talk.
“Nothing can substitute for the flexibility of the informed human mind”
Showing his Facebook “ego network” in graph viewer as an example of using vis to help direct further modeling/analysis tasks
Switching from node-link to adjacency matrix with seriation, which reveals some additional structure
Switches from seriation to ordering by Facebook id, revealing a pattern that was actually caused by a bad FB search query
Covering challenges in data acquisition and cleaning and the birth of Data Wrangler
How might we support expressive & effective vis design? Showing a D3 version of Playfair’s Wheat Chart
Great D3 demo. Can find most of the examples at
Now covering the Stanford Dissertation Browser
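
For anyone who wants to poke at the node-link vs. adjacency matrix idea from the tweets above, here is a minimal sketch in plain Python. The toy network, the node names and the `bfs_order` helper are all my own made-up illustrations, not anything shown in the talk, and the breadth-first ordering is only a crude stand-in for a real seriation algorithm. The point is just that the same matrix looks very different depending on how rows and columns are ordered.

```python
from collections import deque


def adjacency_matrix(order, edges):
    """Return a 0/1 matrix whose rows and columns follow `order`."""
    index = {node: i for i, node in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for a, b in edges:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1  # undirected friendship edge
    return matrix


def bfs_order(nodes, edges):
    """Crude seriation stand-in: breadth-first visit so neighbours sit together."""
    neighbours = {node: [] for node in nodes}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    seen, order = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nb in neighbours[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
    return order


def show(order, edges):
    """Print the matrix as a grid of '#' (edge) and '.' (no edge)."""
    for row in adjacency_matrix(order, edges):
        print(" ".join("#" if cell else "." for cell in row))


# Hypothetical toy "ego network": two friend groups whose ids are interleaved.
nodes = ["a1", "b1", "a2", "b2", "a3", "b3"]
edges = [("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
         ("b1", "b2"), ("b1", "b3"), ("b2", "b3")]

show(nodes, edges)                     # original id order
print()
show(bfs_order(nodes, edges), edges)   # reordered so groups are contiguous
```

With the original ordering the two groups are interleaved in the grid; after reordering they show up as two blocks along the diagonal, which is the kind of structure an ordering can reveal or hide.
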
entered late…#sigh…Electricity grid topic on slides. Looking at the extremely messy data in trouble tickets
Trouble tickets == “data tombs” (awesome term!). Wrking on Reactive Point Processes to do short horizon or real time prediction
Great paper to read explaining the MHE talk in more detail:
New topic: next gen search engine finds what you need before you need it.
Using “Boston events” and “Jewish foods” search examples. Comparing Google Sets vs Boo!Wa! results
Paper explaining the “search” experiment:
When developing models do you need to sacrifice interpretability for accuracy? #datascience14
Assoc rules & Assoc classification. Step 1: Find (a lot of) Rules -> Step 2: Order rules to build a list.
Goal: classification; Probabilistic model of permutations; Use Bayesian prior to gain interpretability;
Goal: Measure stroke risk among patients with atrial fib (and beat CHADS2). Started with 11.1m Medicaid enrollees…
Used criteria to filter down to 12.5. Developed rly straightforward set of rules to identify stroke risk %.
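
The two-step recipe in the tweets above (find a lot of rules, then order them into a list) can be sketched roughly as follows. This is my own simplified illustration, not the method from the talk: the talk described a Bayesian model over rule orderings, whereas this sketch just sorts rules by confidence and support. The feature names and the five toy rows are invented for the example and are not data from the study.

```python
from collections import Counter


def mine_rules(rows, labels, min_support=2):
    """Step 1: one candidate rule per (feature, value) -> label pairing."""
    seen = Counter()   # how often each (feature, value) appears
    hits = Counter()   # how often it appears together with each label
    for row, label in zip(rows, labels):
        for item in row.items():
            seen[item] += 1
            hits[(item, label)] += 1
    rules = []
    for (item, label), count in hits.items():
        if seen[item] >= min_support:
            confidence = count / seen[item]
            rules.append((item, label, confidence, seen[item]))
    return rules


def build_rule_list(rules):
    """Step 2: order the rules. Assumption: highest confidence, then support."""
    return sorted(rules, key=lambda rule: (-rule[2], -rule[3]))


def classify(rule_list, row, default):
    """Walk the list top to bottom; the first rule that matches decides."""
    for (feature, value), label, _confidence, _support in rule_list:
        if row.get(feature) == value:
            return label
    return default


# Invented toy rows, purely for illustration.
rows = [
    {"prior_stroke": True, "age_over_75": True},
    {"prior_stroke": True, "age_over_75": False},
    {"prior_stroke": False, "age_over_75": True},
    {"prior_stroke": False, "age_over_75": False},
    {"prior_stroke": False, "age_over_75": False},
]
labels = ["high", "high", "high", "low", "low"]

rule_list = build_rule_list(mine_rules(rows, labels))
print(classify(rule_list, {"prior_stroke": False, "age_over_75": True}, "low"))
```

The appeal of the ordered list is that the prediction comes with its reason: whichever rule fired first is the explanation.
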
Google’s up now. “Statistics at Google scale”.
Big ideas from the masters: Collect data carefully; Experiment; Measure uncertainty & bias; Monitor all the time
Stats are the core of Google search. Talking about historical PageRank.
Google has a lot of data. From users (across products). Logs (queries, ads, map motions…). Experiments (A/B)
Language translation; BLEU (Bilingual Evaluation Understudy) score :
generally, more data solves harder problems; e.g. predictive typing (which measures while it’s predicting)
Data is not enough. Experimentation is critical. Which X is faster? Does Y increase latency? Numerous methods for experiments.
Every time you enter a query into Google, you’re part of an experiment.
Experiments need statisticians but at Google they must be able to be “productionized” and done at scale.
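
Since the BLEU score came up in the translation tweet above, here is a minimal sketch of how a sentence-level BLEU can be computed: clipped n-gram precisions for n = 1..4, combined as a geometric mean and multiplied by a brevity penalty. It assumes a single reference translation, whitespace tokenization and no smoothing, so it is a teaching sketch rather than what any production system actually runs. The example sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, uniform weights, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each n-gram count by how often it occurs in the reference.
        clipped = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_precision_sum += math.log(clipped / total) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    if len(cand) > len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(ref) / len(cand))
    return penalty * math.exp(log_precision_sum)


reference = "the cat sat on the mat"
print(bleu("the cat sat on the mat", reference))
print(bleu("the cat sat on a mat", reference))
```

The clipping step is what stops a candidate from scoring well by repeating one common word, and the brevity penalty stops it from scoring well by saying very little.
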
