Tweetscription From “Weathering The Data Storm”

By Bob Rudis (@hrbrmstr)
Sat 25 January 2014 | tags: data science, podcast

I had the opportunity to attend Weathering the Data Storm: The Promise and Challenges of Data Science at Harvard yesterday (2014-01-24). Overall, it was an excellent symposium and I’ll be talking with Jay about it on the next podcast.

The @DDSecBlog account was live-tweeting as much as possible and I’ve hoovered them all up and put them below. So, while you’re waiting for the podcast you can peruse the tweets to see if there are 140-character nuggets worth bookmarking.

Weathering the Data Storm: The Promise and Challenges of Data Science Tweetscription

Rachel Schutt starting her talk on “What is Data Science”; highlighting “promise” & “challenge” in the symposium’s title

“Data Science” subsumes other fields and disciplines; is a moving target; comes with much hype but also much promise

Data Science as a job title: great summary of part of “Building Data Science Teams” by @OReillyMedia (buy the book :-)

Facebook & LinkedIn started building “data products” (e.g. friend recommendations) from scads of raw user action logs

Skill sets are interdisciplinary and team-oriented; “data scientists should think like journalists”

Raw material of data products is data and code, and is part of a user-centric feedback loop.

Underlying analytics can be familiar (e.g. naive bayes) but scale can be challenging esp if real-time.

Google didn’t have a job title “data scientist” until 2013. Title rly was born from tech companies.

Great re-quotes of Josh Willis and Will Cukierski #googlethem

Referencing Cleveland’s spiffy “action plan” from 2001 showing use of the co-joined terms “Data Science”

“Data science” process https://t.co/Qv7yu8lvPj

Models developed for data products need to consider the impact on user behavior (cld be causal vs pure predictive)

Data science is not just a bunch of academic disciplines glued together, instead it shld come from fields intermingling

IPython talk by Fernando Pérez coming up next. it’d rock if it’s all in an IPython Notebook :-)

Despite all the computing resources at our disposal we only have 2 eyeballs & one brain. Need gd flow from raw data to insight

lifecycle of a scientific idea: 1. individual, 2. collab, 3. parallel, 4. publication !reproducible!, 5. education, Goto 1

Nice shout out to Enthought (Canopy rocks btw). IPython is a “terminal for a scientist”

IPython Notebooks: Everyday exploratory computing with annotations

ZOMGosh it’s a live IPython Notebook for the presentation! #awesome

IPython Notebooks have full support for rich output (anything a browser can display) and LaTeX & symbolic math support

W00t! Showing R integration into IPython “%load_ext rmagic” “%%R” (has Julia, Ruby and other support too)

nbconvert can translate IPython Notebooks to HTML, PDF, LaTeX, etc

Can also share IPython Notebooks over at https://t.co/j4oCEKMSa1 w/just a link to your ipynb file

IPython Notebooks are a huge help to the reproducible research movement

Python for Signal Processing is also available as a github repository and is just a collection of IPython Notebooks #spiffy

All CS109 homework assignments at Harvard are IPython Notebooks

IBM’s Watson case study switched from a clunky in-house custom software setup to IPython.

Enthought Canopy, Microsoft Python Tools in Visual Studio & Azure; Anaconda all “talk” IPython

MIT has a “StarCluster” capable of running IPython at scale

ZOMGOSH! Now answering an audience question about using Perl, SQL & SAS from IPython vs standalone, so he just does it live

Next up: Bonnie Ray, Director of Cognitive Algorithms at IBM Watson Research Center: “From Big Data to Better Decision Making”

IBM Watson is involved in many, diverse analytics projects. Featuring their https://t.co/Gd3Sy3Dr9J SMS poll analyses now

IBM analyzes unsolicited texts TO the project and auto-classifies them. More info/details here: https://t.co/uiyLikbS8l

Now covering a developed hierarchical TCM technique to drill into customer satisfaction data HTCM paper: https://t.co/7mrYdAqbfX

Super-stoked for @jeffrey_heer’s Interactive Data Analysis talk (up next)

Shout out to Tukey. Many quotes. Great foundation for the talk.

“Nothing can substitute for the flexibility of the informed human mind”

Showing his Facebook “ego network” in graph viewer as an example of using vis to help direct further modeling/analysis tasks

Switching from node-link to adjacency matrix with seriation, which reveals some additional structure

Switches from seriation to ordering by Facebook id, revealing a pattern that was actually caused by a bad FB search query
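
For anyone who hasn’t seen the node-link vs. adjacency-matrix trick before, here is a minimal sketch in Python of why row/column ordering matters. The toy edge list, node names and both orderings are made up for illustration (they are not from the talk), and a real seriation algorithm searches for a good ordering rather than being handed one.

```python
def adjacency_matrix(edges, order):
    """Build a 0/1 adjacency matrix with rows and columns in the given node order."""
    index = {node: i for i, node in enumerate(order)}
    n = len(order)
    matrix = [[0] * n for _ in range(n)]
    for a, b in edges:
        i, j = index[a], index[b]
        matrix[i][j] = 1
        matrix[j][i] = 1  # treat friendships as undirected
    return matrix


def bandwidth(matrix):
    """Largest distance of any filled cell from the diagonal (smaller = tighter blocks)."""
    return max(
        (abs(i - j) for i, row in enumerate(matrix) for j, v in enumerate(row) if v),
        default=0,
    )


# Hypothetical toy network: two friend groups joined by a single edge (c-d).
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"), ("c", "d")]
arbitrary = ["a", "d", "b", "e", "c", "f"]   # e.g. sorted by some unrelated id
grouped = ["a", "b", "c", "d", "e", "f"]     # an ordering a seriation might find

for name, order in [("arbitrary", arbitrary), ("grouped", grouped)]:
    m = adjacency_matrix(edges, order)
    print(name, "bandwidth:", bandwidth(m))
    for row in m:
        print(" ".join(str(v) for v in row))
```

With the grouped ordering the two friend groups show up as blocks along the diagonal; with the arbitrary ordering the same edges are scattered across the matrix.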

Covering challenges in data acquisition and cleaning and the birth of Data Wrangler https://t.co/J5MF4qY0My

How might we support expressive & effective vis design? Showing a D3 version of Playfair’s Wheat Chart

Great D3 demo. Can find most of the examples at https://t.co/hE96W8adt5

Now covering the Stanford Dissertation Browser https://t.co/Y2lxSCHyc4

entered late…#sigh…Electricity grid topic on slides. Looking at the extremely messy data in trouble tickets

Trouble tickets == “data tombs” (awesome term!). Wrking on Reactive Point Processes to do short horizon or real time prediction

Great paper to read explaining the MHE talk in more detail: https://t.co/hnqU9WmX7l

New topic: next gen search engine finds what you need before you need it.

Using “Boston events” and “Jewish foods” search examples. Comparing Google Sets vs Boo!Wa! results

Paper explaining the “search” experiment: https://t.co/A681W7toNh

When developing models do you need to sacrifice interpretability for accuracy? #datascience14

Assoc rules & Assoc classification. Step 1: Find (a lot of) Rules -> Step 2: Order rules to build a list.

Goal: classification; Probabilistic model of permutations; Use Bayesian prior to gain interpretability;
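
A rough sketch of what an ordered rule list looks like once it has been built: the first rule whose conditions all hold makes the call. This only shows how such a list is applied. It does not show the rule mining or the Bayesian ordering from the talk, and the conditions, labels and names (`RULES`, `classify`) are invented for illustration.

```python
# Ordered rule list: the first rule whose conditions all hold decides the label.
# Conditions and labels are hypothetical, not taken from the talk.
RULES = [
    (frozenset({"cond_a", "cond_b"}), "high"),
    (frozenset({"cond_a"}), "medium"),
    (frozenset({"cond_c"}), "medium"),
]
DEFAULT_LABEL = "low"


def classify(features, rules=RULES, default=DEFAULT_LABEL):
    """Return (label, rule position) for the first matching rule, else the default."""
    for position, (conditions, label) in enumerate(rules, start=1):
        if conditions <= features:  # every condition of the rule is present
            return label, position
    return default, None


print(classify({"cond_a", "cond_b", "cond_c"}))
print(classify({"cond_c"}))
print(classify({"cond_d"}))
```

Returning the position of the rule that fired is the interpretability payoff: a reader can see exactly why a given prediction was made.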

Goal: Measure stroke risk among patients with atrial fib (and beat CHADS2). Started with 11.1m Medicaid enrollees…

Used criteria to filter down to 12.5. Developed rly straightforward set of rules to identify stroke risk %.

Google’s up now. “Statistics at Google scale”.

Big ideas from the masters: Collect data carefully; Experiment; Measure uncertainty & bias; Monitor all the time

Stats are the core of Google search. Talking about historical PageRank.

Google has a lot of data. From users (across products). Logs (queries, ads, map motions…). Experiments (A/B)

Language translation; BLEU (Bilingual Evaluation Understudy) score : https://t.co/nzUtC9U5wa
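
If BLEU is new to you, here is a minimal single-reference sketch in Python: clipped n-gram precisions, combined with a geometric mean and a brevity penalty. It assumes whitespace-tokenized input, one reference and no smoothing, so it is a simplification for illustration and not whatever Google runs in production.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU; returns 0.0 if any n-gram precision is 0."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


reference = "the cat is on the mat".split()
print(bleu("the cat is on the mat".split(), reference))
print(bleu("the cat is on the".split(), reference))
print(modified_precision("the the the the".split(), reference, 1))
```

The clipping is what stops a candidate from scoring well by repeating a common word, and the brevity penalty is what stops very short candidates from gaming the precision.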

generally, more data solves harder problems; e.g. predictive typing (which measures while it’s predicting)

Data is not enough. Experimentation is critical. Which X is faster? Does Y increase latency? Numerous methods for experiments.
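
One of the simplest ways to answer a “does Y change the rate?” question is a two-proportion z-test on an A/B split. The sketch below is a textbook version in Python, not anything shown in the talk; the function name and the counts in the usage lines are made up for illustration.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in rates between arms A and B."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 1000 users per arm, clicks observed in each.
z, p = two_proportion_z(success_a=100, n_a=1000, success_b=150, n_b=1000)
print("z = %.2f, p = %.4f" % (z, p))
```

This assumes independent users and samples large enough for the normal approximation; at real scale the bookkeeping (who is in which experiment, and how many run at once) is the hard part.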

Every time you enter a query into Google, you’re part of an experiment.

Experiments need statisticians but at Google they must be able to be “productionized” and done at scale.