A Data Science Course for Undergraduates: Thinking With Data
2015, The American Statistician
https://doi.org/10.1080/00031305.2015.1081105
Abstract
Data science is an emerging interdisciplinary field that combines elements of mathematics, statistics, computer science, and knowledge in a particular application domain for the purpose of extracting meaningful information from the increasingly sophisticated array of data available in many settings. These data tend to be non-traditional, in the sense that they are often live, large, complex, and/or messy. A first course in statistics at the undergraduate level typically introduces students to a variety of techniques for analyzing small, neat, and clean data sets. However, whether they pursue more formal training in statistics or not, many of these students will end up working with data that are considerably more complex, and will need facility with statistical computing techniques. More importantly, these students require a framework for thinking structurally about data. We describe an undergraduate course in a liberal arts environment that provides students with the tools necessary to apply data science. The course emphasizes modern, practical, and useful skills that cover the full data analysis spectrum, from asking an interesting question to acquiring, managing, manipulating, processing, querying, analyzing, and visualizing data, as well as communicating findings in written, graphical, and oral forms.
(10 pts) Briefly discuss the relative strengths of SQL vs. R. What does SQL do better than R? What does R do better than SQL? [Hint: It may be helpful to give an example of a data science task for which one or the other would be better suited.]