Data Science Reading List

Suggested Reading for Data Science

Best Approaches for Data Science Reading

I remember seeing a history professor filling a cart with books for his research. I paused and asked him whether he would read all of them. He said that he would, but after gauging my facial expression he elaborated: he had a method for reading books to get what he needed out of them, and he did not cozy up on a couch to read them cover to cover. He is an expert in his field and chose specific books because he was familiar with each author’s argument, so he knew what he was looking for. This approach reduces how long it takes him to read, but he still dedicates considerable time to reading.

This was a big breakthrough for me, and I started asking others how they read. It turned out that researchers in the sciences also have their own methods for reading material. The University of Pittsburgh Computer Science department published a PDF on this topic; I have found it very useful and thought I would share it.

LINK: How to Read a Computer Science Research Paper by Amanda Stent

Takeaway

  • Research papers are published in many places and follow a typical flow (technical report -> conference paper -> journal paper).
  • Three basic types: theoretical, engineering, and empirical.
  • Finding “good” papers to read: know where to look, and focus on what you are trying to learn about or figure out.
  • Reading is a step-by-step process that can help you find what you are looking for more quickly and reliably than a full read-through.
  • Storing and keeping track of the papers you read: many apps can track them, but you also need to record notes in a way that helps you remember what you read.

This is a very brief summary, and I highly suggest you take the time to read the three-page paper by Amanda Stent.


Fundamental Data Science Paper

50 Years of Data Science by David Donoho

LINK: MIT Course Paper

Abstract

More than 50 years ago, John Tukey called for a reformation of academic statistics. In ‘The Future of Data Analysis’, he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or ‘data analysis’. Ten to twenty years ago, John Chambers, Bill Cleveland and Leo Breiman independently once again urged academic statistics to expand its boundaries beyond the classical domain of theoretical statistics; Chambers called for more emphasis on data preparation and presentation rather than statistical modeling; and Breiman called for emphasis on prediction rather than inference. Cleveland even suggested the catchy name “Data Science” for his envisioned field.

A recent and growing phenomenon is the emergence of “Data Science” programs at major universities, including UC Berkeley, NYU, MIT, and most recently the Univ. of Michigan, which on September 8, 2015 announced a $100M “Data Science Initiative” that will hire 35 new faculty. Teaching in these new programs has significant overlap in curricular subject matter with traditional statistics courses; in general, though, the new initiatives steer away from close involvement with academic statistics departments.

This paper reviews some ingredients of the current “Data Science moment”, including recent commentary about data science in the popular media, and about how/whether Data Science is really different from Statistics.

The now-contemplated field of Data Science amounts to a superset of the fields of statistics and machine learning which adds some technology for ‘scaling up’ to ‘big data’. This chosen superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss out on the really important intellectual event of the next fifty years.

Because all of science itself will soon become data that can be mined, the imminent revolution in Data Science is not about mere ‘scaling up’, but instead the emergence of scientific studies of data analysis science-wide. In the future, we will be able to predict how a proposal to change data analysis workflows would impact the validity of data analysis across all of science, even predicting the impacts field-by-field.

Drawing on work by Tukey, Cleveland, Chambers and Breiman, I present a vision of data science based on the activities of people who are ‘learning from data’, and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today’s Data Science Initiatives, while being able to accommodate the same short-term goals.