CS178: Machine Learning and Data Mining

CLOSED : 2012 OFFERING

Assignments and Exams:

HW1	Code	01/30/12	soln
HW2	Code	02/10/12	soln
HW3	Code	02/28/12
HW4	Code	03/15/12

Midterm	2/16/12	2:00-3:30	soln
Final	3/22/12	1:30-3:30	soln

Lecture: ICS 259, TR 2pm-3:30pm

Discussion: Bren Hall 1500, W 4-5pm

Instructor: Prof. Alex Ihler (ihler@ics.uci.edu), Office Bren Hall 4066

Office Hours: 2:00-3:00pm Mondays, Bren Hall 4066, or by appointment

Course Notes in development

Introduction to machine learning and data mining

How can a machine learn from experience, to become better at a given task? How can we automatically extract knowledge or make sense of massive quantities of data? These are the fundamental questions of machine learning. Machine learning and data mining algorithms use techniques from statistics, optimization, and computer science to create automated systems which can sift through large volumes of data at high speed to make predictions or decisions without human intervention.

Machine learning is now pervasive, with applications ranging from the web (search, advertisements, and suggestions) to national security, and from analyzing biochemical interactions to traffic and emissions to astrophysics. Perhaps most famously, the $1M Netflix prize stirred up interest in learning algorithms among professionals, students, and hobbyists alike.

This class will familiarize you with a broad cross-section of models and algorithms for machine learning, and prepare you for research or industry application of machine learning techniques.

Background

We will assume basic familiarity with the concepts of probability and linear algebra. Some programming will be required; we will primarily use Matlab, but no prior experience with Matlab will be assumed.

Textbook and Reading

There is no required textbook for the class. However, useful books for supplementary reading include Bishop, "Pattern Recognition and Machine Learning"; Duda, Hart & Stork, "Pattern Classification"; and Hastie, Tibshirani & Friedman, "The Elements of Statistical Learning".

Matlab

We will often write code for the course using the Matlab environment. Matlab is accessible through NACS computers at several campus locations (e.g., MSTB-A, MSTB-B, and the ICS lab), and if you want a copy for yourself, student licenses are fairly inexpensive ($100). Personally, I do not recommend the open-source Octave program as a replacement, as its syntax is not 100% compatible and may cause problems (for me or you).

If you are not familiar with Matlab, there are a number of tutorials on the web:

University of Utah, very short
CMU / UMichigan tutorial, also short
University of Florida's tutorial, more complete
Union College / Cyclismo.Org tutorial, also good
UMaryland guide, lots of pointers to other tutorials and reference manuals

You may want to start with one of the very short tutorials, then use the longer ones as a reference during the rest of the term.

Interesting stuff for students

tba...

Syllabus and Schedule (may be updated)

L01 (PDF): Introduction; classification & regression; nearest neighbor methods
R01 (PDF): Matlab basics
L02 (PDF): Linear regression
L03 (PDF): Linear regression, overfitting, regularization
L04: no class
L05 (PDF): Classification, probability, decisions
R03 (PDF): Matlab classes, Probability
L06 (PDF): Bayes classifiers, Naive Bayes
L07 (PDF): Perceptrons, Logistic regression
R04: Matlab, homework discussion
L08 (PDF): Multi-layer perceptrons (neural networks); decision trees
L09 (PDF 1, PDF 2): VC dimension, Decision trees, Ensemble methods
Guest lecture: David Newman, latent space models
Decision trees; ensemble methods (bagging, boosting) (see L09-2)
Midterm exam. Past years' exams: 2011 (soln), 2010 (soln)
L12 (PDF): Clustering
L13 (PDF): Clustering, latent space representations, collaborative filtering
L14 (PDF): Probability models for unsupervised learning
L15 (PDF): Probability models, data mining
L16 (PDF): Support vector machines
L17 (PDF): Time series, Markov chains, AR models
L18 (PDF): Graphical models
Final exam. Past years' exams: 2011 (soln), 2010 (soln)

Previous years' lectures (2011, 2010) are also available.

Projects

Your course project is in the nature of an "undirected" homework assignment. Choose a machine learning problem, on your own or from the list below, and explore the implied prediction task to the best of your ability. You can try different learners, choosing from methods we have used in the homework or implementing new ones; different feature representations (feature selection or augmentation); meta-learning algorithms such as bagging and boosting; and hold-out or cross-validation assessment techniques. You should explore the problem in some detail, describing the different ideas you tried and how (and whether or not) they worked, and how you assessed their performance.
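As a rough illustration of the hold-out and cross-validation assessment mentioned above, here is a minimal sketch in Python (standard library only). It is not course code: the function names (`knn_predict`, `cross_val_error`, `make_toy_data`), the choice of a k-nearest-neighbor learner, and the synthetic two-class data are all assumptions made for the example. The course itself uses Matlab, and the same idea carries over directly.

```python
import random


def knn_predict(train_X, train_y, x, k):
    """Return the majority label among the k training points nearest to x."""
    # Rank training indices by squared Euclidean distance to the query point.
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = {}
    for i in order[:k]:
        votes[train_y[i]] = votes.get(train_y[i], 0) + 1
    return max(votes, key=votes.get)


def cross_val_error(X, y, k, n_folds=5, seed=0):
    """Fraction of points misclassified when each point is predicted
    by a k-NN model trained only on the folds that do not contain it."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    # Split the shuffled indices into n_folds roughly equal folds.
    folds = [idx[f::n_folds] for f in range(n_folds)]
    errors = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for i in held_out:
            if knn_predict(train_X, train_y, X[i], k) != y[i]:
                errors += 1
    return errors / len(X)


def make_toy_data(n_per_class=40, seed=1):
    """Hypothetical data: two 2-D Gaussian blobs, one per class."""
    rng = random.Random(seed)
    X, y = [], []
    for label, (cx, cy) in ((0, (0.0, 0.0)), (1, (3.0, 3.0))):
        for _ in range(n_per_class):
            X.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    # Compare several model complexities using the same folds.
    for k in (1, 3, 7, 15):
        print("k =", k, "cross-validation error =", cross_val_error(X, y, k))
```

Fixing the seed keeps the folds identical across values of k, so differences in the reported error reflect the learner rather than the split. In a project write-up, the same loop can be used to compare feature sets or different learners.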

Example projects:

Face detection: this zip file contains a dataset of 24x24 pixel image patches containing faces and non-faces. Learn to predict the presence of a face. Also included are a function for computing the Haar wavelet features and a simple demo of AdaBoost, the building blocks of the Viola-Jones technique.
Collaborative filtering: learn to predict how you will rate something, given how others have rated it. This zip file contains a subset of the "Jokes" database for collaborative filtering, in which viewers have rated a subset of jokes on their amusement value, as well as some simple demo code for an SVD-based estimate. I suggest combining it with nearest-neighbor approaches, and trying different levels of complexity (latent space dimension, neighbors, weighting functions, etc.) in your predictors.
Web ranking: learn to predict the relative relevance of a set of returned webpages; this problem is of great importance in both search and advertising, the mainstays of many internet companies. Unfortunately I cannot redistribute the data myself, but you can download Yahoo's ranking challenge data at http://webscope.sandbox.yahoo.com/catalog.php?datatype=c. Here is a zip file with some code for reading their data, which come in "queries" (web searches), each with a variable-length list of possible responses and their human-rated quality. (See "main.m" for some example code.)
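
To make the nearest-neighbor suggestion in the collaborative filtering example more concrete, here is a minimal Python sketch of a user-based nearest-neighbor rating predictor. It is not the SVD demo from the zip file and does not use the Jokes data: the rating dictionary, the user and item names, and the similarity function (one over one plus the mean squared rating difference on co-rated items) are all hypothetical choices for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def similarity(a, b):
    """Similarity between two users' rating dicts, based on co-rated items.

    Returns 1 / (1 + mean squared rating difference), or 0.0 if the two
    users have no items in common.
    """
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    msd = mean((a[j] - b[j]) ** 2 for j in shared)
    return 1.0 / (1.0 + msd)


def predict_rating(ratings, user, item, k=2):
    """Predict `user`'s rating of `item` as the similarity-weighted average
    of the ratings given by the k most similar users who rated `item`."""
    # Hide the target's own rating of `item` (if any) so that held-out
    # evaluation does not leak the answer into the similarity.
    target = {j: r for j, r in ratings[user].items() if j != item}
    candidates = [
        (similarity(target, r), r[item])
        for other, r in ratings.items()
        if other != user and item in r
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)
    top = [(s, r) for s, r in candidates[:k] if s > 0]
    if not top:
        # No usable neighbors: fall back to the user's own average, if any.
        return mean(target.values()) if target else None
    return sum(s * r for s, r in top) / sum(s for s, _ in top)


# Hypothetical toy ratings on a 1-5 scale: user -> {item: rating}.
toy_ratings = {
    "ann": {"j1": 5, "j2": 4, "j3": 1},
    "bob": {"j1": 5, "j2": 5, "j3": 1, "j4": 4},
    "cat": {"j1": 1, "j2": 2, "j3": 5, "j4": 1},
    "dan": {"j1": 4, "j2": 4, "j4": 5},
}

if __name__ == "__main__":
    print(predict_rating(toy_ratings, "ann", "j4", k=2))
```

The number of neighbors `k` and the weighting function are the kind of complexity settings the project description suggests varying. A project could compare them against, or blend them with, a latent-space (SVD) estimate using held-out ratings.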