The Data Science Process

The Data Science Process is a framework for approaching data science tasks, created by Joe Blitzstein and Hanspeter Pfister of Harvard’s CS 109. The goal of CS 109, according to Blitzstein himself, is to introduce students to the overall process of data science investigation, a goal that also gives some insight into the framework itself.

The following is a sample application of Blitzstein & Pfister’s framework, listing the skills and tools used at each stage, as given by Ryan Fox Squire in his answer:

Stage 1: Ask A Question
Skills: science, domain expertise, curiosity
Tools: your brain, talking to experts, experience

Stage 2: Get the Data
Skills: web scraping, data cleaning, querying databases, CS stuff
Tools: python, pandas

Stage 3: Explore the Data
Skills: getting to know the data, developing hypotheses, looking for patterns and anomalies
Tools: matplotlib, numpy, scipy, pandas, mrjob

Stage 4: Model the Data
Skills: regression, machine learning, validation, big data
Tools: scikit-learn, pandas, mrjob, mapreduce

Stage 5: Communicate the Data
Skills: presentation, speaking, visuals, writing
Tools: matplotlib, adobe illustrator, powerpoint/keynote

Squire then (rightly) concludes that the data science workflow is a non-linear, iterative process, and that many skills and tools are required to cover the full data science process. Squire also says he is fond of the Data Science Process because it stresses both the importance of asking questions to guide your workflow and the importance of iterating on your questions and research as you gain familiarity with your data.
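
To make the five stages concrete, here is a minimal Python sketch that walks through them once on a toy problem. Everything specific in it is an assumption made for illustration: the question, the made-up study-hours data, and the helper name fit_line are hypothetical and do not come from Blitzstein, Pfister, or Squire. It uses only the standard library so that it runs anywhere; in practice the tools listed above (pandas, matplotlib, scikit-learn) would likely take over the cleaning, exploring, and modeling steps.

```python
from statistics import mean

# Stage 1: ask a question (hypothetical example question).
QUESTION = "Do more hours of study go with higher exam scores?"

# Stage 2: get the data. These made-up records stand in for a scrape or a
# database query; None marks a missing value that has to be cleaned out.
raw = [(1, 52), (2, 55), (3, None), (4, 65), (5, 70), (6, 74)]
data = [(h, s) for h, s in raw if s is not None]

# Stage 3: explore the data with simple summaries before modeling anything.
hours = [h for h, _ in data]
scores = [s for _, s in data]
summary = {"n": len(data), "mean_hours": mean(hours), "mean_score": mean(scores)}


# Stage 4: model the data with ordinary least squares for a single predictor.
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(hours, scores)

# Stage 5: communicate the result in plain language.
report = (
    f"{QUESTION} In this toy sample of {summary['n']} students, "
    f"each extra hour goes with about {slope:.1f} more points."
)
print(report)
```

A real project would rarely run top to bottom like this. A surprise in Stage 3, such as the missing score above, often sends you back to Stage 2 for better data or to Stage 1 for a sharper question, which is the iteration Squire describes.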

The Data Science Process is an innovative framework for approaching data science problems. Isn’t it?