CS273a: Introduction to Machine Learning

CLOSED : 2013 OFFERING

Assignments and Exams:

HW1	Code	10/03/13	soln
HW2	Code	10/15/13	soln
HW3	Code	10/31/13	soln
HW4		11/20/13	soln
HW5	Code	12/06/13	soln

Midterm	in-class	Thurs, 11/7/13	soln
Project	12/13/13
Final	Thurs 1:30-3:30pm	12/12/13

Lecture: Tues/Thurs 2-3:30pm, BH1600

Instructor: Prof. Alex Ihler (ihler@ics.uci.edu), Office Bren Hall 4066

Office Hours: Tues 4-5pm, Bren Hall 4066, or by appointment

Some Course Notes in development

Also, a possibly helpful LaTeX template I use for homeworks and solutions.

Introduction to machine learning and data mining

How can a machine learn from experience, to become better at a given task? How can we automatically extract knowledge or make sense of massive quantities of data? These are the fundamental questions of machine learning. Machine learning and data mining algorithms use techniques from statistics, optimization, and computer science to create automated systems which can sift through large volumes of data at high speed to make predictions or decisions without human intervention.

Machine learning as a field is now incredibly pervasive, with applications from the web (search, advertisements, and suggestions) to national security, from analyzing biochemical interactions to traffic and emissions to astrophysics. Perhaps most famously, the $1M Netflix prize stirred up interest in learning algorithms in professionals, students, and hobbyists alike.

This class will familiarize you with a broad cross-section of models and algorithms for machine learning, and prepare you for research or industry application of machine learning techniques.

Background

This is an introductory graduate class, intended for first-year graduate students. We will assume familiarity with some concepts from probability, calculus, and linear algebra. Programming will be required; we will primarily use Matlab, but no prior experience with Matlab will be assumed.

Textbook and Reading

There is no required textbook for the class. However, useful books on the subject for supplementary reading include Murphy, "Machine Learning: A Probabilistic Perspective"; Duda, Hart & Stork, "Pattern Classification"; and Hastie, Tibshirani, and Friedman, "The Elements of Statistical Learning".

Grading

The course consists of homeworks, some small in-class quizzes, a project, a midterm, and a final exam. Grading is approximately as follows (possibly subject to modification):

25% homework (drop lowest of approx 6)
10% project (structured, e.g. Kaggle)
5% quizzes on reading
25% midterm, 35% final exam
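
As a rough illustration of how these weights combine, here is a minimal sketch in Python (chosen only for illustration; it is not course code). The function name `course_grade`, the 0-100 scale for each component, and the example scores are all assumptions, and the actual weighting may change as noted above.

```python
def course_grade(homeworks, project, quizzes, midterm, final):
    """Combine component scores (each assumed to be on a 0-100 scale)."""
    # Drop the single lowest homework score, then average the rest.
    kept = sorted(homeworks)[1:] if len(homeworks) > 1 else list(homeworks)
    hw_avg = sum(kept) / len(kept)
    quiz_avg = sum(quizzes) / len(quizzes)
    # Approximate weights from the list above:
    # 25% homework, 10% project, 5% quizzes, 25% midterm, 35% final.
    return (0.25 * hw_avg + 0.10 * project + 0.05 * quiz_avg
            + 0.25 * midterm + 0.35 * final)

# Hypothetical scores, for illustration only.
example = course_grade(
    homeworks=[90, 80, 100, 70, 85, 95],
    project=88,
    quizzes=[100, 80],
    midterm=75,
    final=82,
)
print(round(example, 2))
```

Here the lowest homework score (70) is dropped before averaging, so one weak homework does not lower the homework component.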

Homeworks are due at 5pm on the listed day (or on EEE at the dropbox closing time). Late homeworks may not be accepted, and will not be accepted once solutions are posted. Please turn in what you have at the deadline.

Collaboration

Please do form study groups for discussion of the material, including lectures, homework, past exams, etc. Your fellow students are one of your best resources in this course. Piazza is often useful for this as well. However, you are responsible for the material, and should do the homework yourself. In other words, discussing the concepts in the homework, and solution strategies, etc. is fine -- but please do not look at others' solutions, exchange code, etc.

Matlab

We will often write code for the course in the Matlab environment. Matlab is accessible through NACS computers at several campus locations (e.g., MSTB-A, MSTB-B, and the ICS lab), and if you want your own copy, student licenses are fairly inexpensive ($100). You may also use Octave (heavily tested, but with a poor GUI) or FreeMat (newer, less tested), both of which aim to be free, syntax-compatible alternatives to Matlab. However, please stick to Matlab syntax so that we can run your code in Matlab. Be aware that the code provided to you is likely tested in Matlab and not in Octave or FreeMat, so the responsibility for discovering and fixing or working around any bugs will be yours. If you're not comfortable with that, use Matlab.

If you are not familiar with Matlab, there are a number of tutorials on the web:

University of Utah, very short
CMU / UMichigan tutorial, also short
University of Florida's tutorial, more complete
Union College / Cyclismo.Org tutorial, also good
UMaryland guide, lots of pointers to other tutorials and reference manuals

You may want to start with one of the very short tutorials, then use the longer ones as a reference during the rest of the term.

For getting started, you can actually run simple Octave scripts and functions online at

http://www.compileonline.com/execute_matlab_online.php

Interesting stuff for students

tba...

Syllabus

Slides	Videos	Topics
PDF	1 , 2 , 3 , 4	Introduction
PDF	1 , 2	Nearest neighbor methods
PDF	1 , 2	Bayes classifiers, naive Bayes
PDF	1 , 2	Decision trees for classification & regression
PDF	1 , 2 , 3 , 4 , 5 , 6	Linear regression
PDF	1 , 2	Linear classifiers; perceptrons & logistic regression
PDF	1	VC dimension, shattering, and complexity
PDF		Neural networks (multi-layer perceptrons) and deep belief nets
PDF		Support vector machines; kernel methods
PDF	1, 2, 3, 4	Ensembles; bagging, gradient boosting, adaboost
PDF		Unsupervised learning: clustering methods
PDF	1, 2	Dimensionality reduction: (Multivariate Gaussians); PCA/SVD, latent space representations

Course Project

See here (pdf) for the full description.

For your course project, you will explore data mining and prediction in the wild, on a real-life data set, and compare your performance against that of teams from around the world. We will use a data set from a past Knowledge Discovery and Data Mining (KDD) Cup, a yearly competition in machine learning and data mining associated with the KDD conference. In particular, we will use the 2004 competition's Particle Physics data set. The challenge is described in full on the webpage: http://osmot.cs.cornell.edu/kddcup/

Teams: Form teams of 2-3 students with whom you will directly collaborate.
Download the data: See http://osmot.cs.cornell.edu/kddcup/datasets.html
Build your learners: I suggest that you try several different models, such as nearest neighbor methods, decision trees, linear classifiers (logistic regression, support vector machines, etc.), naive Bayes classifiers, and/or boosted classifiers (decision stumps, etc.). Each member of your team may try one or two models, and can explore fitting them to the data and assessing their performance using validation or cross-validation.

Write up a report (~6 pages) on your methods. Again, see the full description here for details.
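
The "Build your learners" step can be sketched as a small model-comparison loop. The following is a minimal, hedged example in plain Python (for illustration only; the course code itself is in Matlab). It compares a majority-class baseline against k-nearest-neighbor classifiers using k-fold cross-validation. The data are synthetic Gaussian blobs, not the KDD Cup data, and all function names here are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train_X, train_y, x, k=1):
    """Majority label among the k nearest training points (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(tx, x)), ty)
        for tx, ty in zip(train_X, train_y)
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)

def majority_predict(train_X, train_y, x):
    """Baseline: always predict the most common training label."""
    return max(set(train_y), key=train_y.count)

def cross_val_accuracy(predict, X, y, k=5, seed=0):
    """Average held-out accuracy of `predict` over k folds."""
    accs = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_X = [X[i] for i in range(len(X)) if i not in held]
        train_y = [y[i] for i in range(len(X)) if i not in held]
        correct = sum(predict(train_X, train_y, X[i]) == y[i] for i in held_out)
        accs.append(correct / len(held_out))
    return sum(accs) / len(accs)

# Synthetic two-class data (an assumption for this sketch): two Gaussian blobs.
rng = random.Random(1)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)] + \
    [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(50)]
y = [0] * 50 + [1] * 50

results = {
    "majority baseline": cross_val_accuracy(majority_predict, X, y),
    "1-nearest neighbor": cross_val_accuracy(
        lambda tx, ty, x: knn_predict(tx, ty, x, k=1), X, y),
    "5-nearest neighbor": cross_val_accuracy(
        lambda tx, ty, x: knn_predict(tx, ty, x, k=5), X, y),
}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Each model is scored only on points held out of its training fold, which is what makes the accuracies comparable across models. The same loop applies to any learner that can be wrapped as a `predict(train_X, train_y, x)` function.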