- Multivariate data analysis using R, Darren J Wilkinson
- An Introduction to Statistical Learning, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
- The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, Jerome Friedman
- Foundations of Data Science, John Hopcroft and Ravindran Kannan (pdf)
- Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeff Ullman
- Advanced R, Hadley Wickham
- Introduction to Statistical Thought, Michael Lavine
- Introduction to Probability and Statistics Using R, G. Jay Kerns
- Introduction to Graphs, Wikipedia Book
- Elementary Linear Algebra, K.R. Matthews
- Electronic Statistics Textbook, StatSoft, Inc. (online only)
- OpenRefine : an open source, locally-installed, cross-platform toolkit that makes it extremely easy to import, explore, clean, transform, and enrich messy data into something usable for analysis.
- WebPlotDigitizer : This online tool makes it possible to quickly “reverse engineer” charts and graphs that have no associated open data files.
- Google CRUSH Tools : A command-line processing engine and data transformation tool that makes it possible to work efficiently with large data sets from a shell prompt.
- csvkit : A suite of open source Python utilities that are similar to the CRUSH tools, but usable from both the command line and within Python scripts.
- Data Cleaner : This product is similar to OpenRefine but with both commercial and open source offerings.
- Mr. Data Converter : In-browser and locally installable open source tool created by Shan Carter to improve data cleansing workflows at the New York Times.
- [Your favorite scripting language]: Never underestimate the power of a Python, R, Perl or awk script when it comes to cleaning data. You’ll have to do more up-front work, but you may be able to build a far more reusable and customized cleanup and transformation workflow with your own tools.
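As a rough illustration of the scripting approach, here is a minimal Python sketch that uses only the standard library. The function name `clean_rows`, the sample data, and the specific cleanup rules (trim whitespace, lowercase the headers, skip empty rows, drop exact duplicates) are assumptions chosen for the example, not a prescribed workflow; real data will usually need its own rules.

```python
import csv
import io


def clean_rows(raw_text):
    """Parse CSV text and return a list of cleaned dict rows (illustrative helper)."""
    reader = csv.DictReader(io.StringIO(raw_text))
    cleaned = []
    seen = set()
    for row in reader:
        # Normalize header names and strip stray whitespace from values.
        # A None key would mean the row had more fields than the header; ignore those extras.
        record = {
            key.strip().lower(): (value or "").strip()
            for key, value in row.items()
            if key is not None
        }
        # Skip rows where every field is empty.
        if not any(record.values()):
            continue
        # Drop exact duplicates while preserving the original order.
        fingerprint = tuple(sorted(record.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(record)
    return cleaned


# Hypothetical messy input: padded headers, padded values, a duplicate and an empty row.
SAMPLE = "Host , Port\n web01 , 443 \nweb01,443\n , \ndb01, 5432\n"

for record in clean_rows(SAMPLE):
    print(record)
```

Because the rules live in one small function, they can be extended (type conversion, date parsing, value mapping) and rerun on every new export, which is where the reuse advantage over a point-and-click tool tends to show up.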
Data Analytics & Visualization : Core Tools
- R + RStudio : The language of data science. Commercial offering available via Revolution Analytics.
- Python + pandas : The other language of data science. Additional open source and commercial offerings available via Enthought Canopy and Continuum Analytics Anaconda.
- Tableau : Commercial tool with an emphasis on producing interactive dashboards and visualizations.
- D3.js : Enables the creation of “data driven documents” and provides templates and examples for creating almost every type of modern static and interactive visualization. Test out your vis ideas in the D3 Playground.
- Highcharts JS : Provides robust charting and graphing functions, especially well-suited for dashboards.
Data Analytics & Visualization : Mapping Tools
- OpenHeatMap : Produce high quality heat maps from CSV data right in your browser. No coding required.
Data Analytics & Visualization : Specialized Tools
- TimeFlow : An open source tool specifically designed for analysis and visualization of temporal/time series data.
- Gephi : Open source network graph analysis and visualization tool.
- Quadrigram : This tool provides a visual programming interface for working with data and designing highly customized, interactive visualizations.
Aggregation Sites, Q&A Sites, And Blogs To Follow
- R-Bloggers : Rather than follow a plethora of individual blogs, you can follow the R-Bloggers RSS feed to see only R-related posts that deal with all aspects of data analysis and visualization.
- Stats Blogs : An aggregation site, similar to R-Bloggers, but with a focus on statistics.
- StackExchange : The perfect place to go if you have R, Python or pandas questions, can’t remember a ggplot option or need some help with a gnarly statistics problem.
- JunkCharts : Learn from the visualization mistakes of others.
- FlowingData : Resources, news and tutorials that will improve the way you think and design visualizations.
- DataVisualization.ch : Aggregation and index of the most popular and useful visualization tools currently available.
- Data Analysis & Visualization Bit.ly Bundle : An aggregation of links maintained by the authors along with David Severski.
- ColorBrewer : Designed by Cynthia Brewer, this is the color resource that should be the first tool you head for when designing visualizations. It provides a wide range of palettes with options for creating print-safe and colorblind-friendly images.
- HCL Picker : An open source, D3-based color picker, that lets you select colors based on hue, chroma and lightness.
- Adobe Kuler : An online tool, provided by Adobe, which allows you to design compelling color palettes or choose from a wide assortment of pre-made palettes.
- OS X Color Picker Palettes : Use ColorBrewer palettes in Excel, Photoshop and any other application on your Mac.
Online Schools and Learning Resources
- edX : Browse the collection of courses in statistics and data analysis at edX.
- Coursera : Browse the collection of free online courses in statistics and data analysis at Coursera.
- Syracuse University : Data Science Open Online course through Syracuse.
- UC Berkeley : Master of Information and Data Science (MIDS) online program.
- Univ of Washington : University of Washington’s certificate in data science.
- Penn State : Penn State’s Applied Statistics online curriculum.
- Coursera Specialization - Data Science : A collection of nine courses covering topics in data science for $49 each. After completion of a capstone project, attendees get a “Specialization Certificate”.
- Carnegie Mellon University : CMU’s Open Learning Initiative with courses in statistics and research methods.
- University of Wisconsin : UW Master of Science Degree in Data Science.