Data Mining

I. Syllabus:

The course provides a statistical introduction to methods for analyzing large and complex data sets and the relations within them. The focus is on regression and classification methods. We start with linear models but move on fairly quickly to several nonlinear and nonparametric approaches, such as local polynomial regression. We then discuss ways of evaluating estimation error, e.g. via the bootstrap, and look at model selection techniques such as cross-validation. The course concludes with a brief look at specific nonparametric techniques such as decision trees. Selected case studies are discussed in a computer class using R. After completing the course, you will be able to conduct complex data analyses on your own.

  1. Statistical learning
  2. Linear Regression (review)
  3. Classification (mostly linear)
  4. Resampling methods
  5. Model selection
  6. Getting more nonlinear
  7. Tree-based methods

 

II. Prerequisites:

 

III. Exam:

  • written exam
  • you may use a formula sheet (will be available for download)
  • you can earn some bonus points in the computer tutorial

IV. Downloads:

 

V. Literature:

  • Main textbook: James, G. et al., An Introduction to Statistical Learning, Springer (a PDF copy is available from the authors)
  • Bishop, C. (2006) Pattern Recognition and Machine Learning, Springer
  • Hastie, T., R. Tibshirani and J. Friedman (2011, 2nd ed.) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer
  • Han, J., M. Kamber and J. Pei (2012, 3rd ed.) Data Mining: Concepts and Techniques, Elsevier

 

VI. Lecture:

VII. PC-Tutorial:

Access to the computer lab requires a one-time registration with a Stu-Account.