KNearestNeighbors

Nearest Neighbor classifiers

Nearest-neighbor rules are among the most intuitive classifiers to consider. A nearest-neighbor rule simply memorizes (stores) all of the training data; when a new test point is given, it compares the new point to all of the training examples, finds the nearest one, and predicts that the new point has the same class as that example. A k-nearest-neighbor rule is a simple extension -- instead of finding the single nearest training example, it finds the nearest k of them and predicts the majority class, i.e., the class to which most of those k examples belong. The following figures show several classifiers as a function of k, the number of neighbors used.
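As a concrete sketch of this rule, the short Matlab function below measures the (squared Euclidean) distance from each test point to every training point, keeps the K closest, and takes a majority vote over their classes. The function name knnPredict and the choice of Euclidean distance are illustrative assumptions, not part of the definition above.

function Yhat = knnPredict(Xtr, Ytr, Xte, K)
% Sketch of a K-nearest-neighbor prediction rule (assumed name and signature).
% Xtr: (m x d) training features, Ytr: (m x 1) training class labels,
% Xte: (n x d) test features, K: number of neighbors to use.
n = size(Xte,1);
Yhat = zeros(n,1);
for i = 1:n
  % squared Euclidean distance from test point i to every training point
  dist = sum( bsxfun(@minus, Xtr, Xte(i,:)).^2, 2 );
  [~, idx] = sort(dist, 'ascend');     % order training points by distance
  Yhat(i) = mode( Ytr(idx(1:K)) );     % majority class among the K nearest
end
end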
Decision boundary

An alternative way to represent any classifier is by its decision boundary, the set of locations at which it changes from one decision to another. Since the decision of a kNN classifier is determined by the nearest training examples, the decision boundary of a kNN classifier consists of those locations at which the set of k nearest training examples changes by (at least) one example. A location on the boundary is balanced between the two decisions -- in other words, it must be equidistant from two training points, and since crossing it must change the predicted class, those two points must belong to different classes. For any two points, the set of equidistant locations is a line perpendicular to the line joining the two points (their perpendicular bisector). This means the decision boundary is locally linear (piecewise linear); this structure is easy to see when the training set is small enough. By a similar argument, the transitions between linear segments must be equidistant from at least three training points.

Complexity and K

The complexity of a k-nearest-neighbor rule is a bit difficult to describe precisely. Nearest-neighbor methods store all of the training examples and do not have a simple notion of complexity like the one we saw in linear regression. However, consider how the training error of a k-nearest-neighbor rule changes with k. At k=1, evaluating the prediction at any training point finds a point at distance zero (the point itself), so the rule is guaranteed to predict the correct class and the training error is zero. As k increases, this may no longer be true -- a point of class 0 that is completely surrounded by points of class 1 is unlikely to be classified correctly. As the following images suggest, increasing k further makes the classifier less able to memorize the data, reducing the complexity of the learner.
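Both effects can be checked directly: evaluating the classifier on a dense grid of points shows where the predicted class changes (the piecewise-linear boundary), and evaluating it on the training points themselves gives the training error for each k. The sketch below assumes 2-D features X, labels Y, and the hypothetical knnPredict function sketched earlier.

% Sketch: visualize the kNN decision boundary and training error for several k.
[xx, yy] = meshgrid( linspace(min(X(:,1)), max(X(:,1)), 200), ...
                     linspace(min(X(:,2)), max(X(:,2)), 200) );
Xgrid = [xx(:), yy(:)];
for K = [1 3 15]
  Yg   = knnPredict(X, Y, Xgrid, K);            % predicted class over the grid
  errK = mean( knnPredict(X, Y, X, K) ~= Y );   % training error at this k
  figure; contourf(xx, yy, reshape(Yg, size(xx)));   % boundary = color transitions
  hold on; scatter(X(:,1), X(:,2), 25, Y, 'filled'); % overlay the training data
  title(sprintf('k = %d, training error = %.2f', K, errK));
end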
Example code

Here is some example k-nearest-neighbor code in Matlab:

classdef knnClassify < supLearn    % Class implementing K-nearest-neighbors classifier
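Only the class declaration appears above. As a rough, hypothetical completion (a sketch, not the course's actual implementation), the class below simply stores the training data in its constructor and performs a majority vote over the K nearest stored points in its predict method; supLearn is assumed to be an abstract base class, and the property and method names are assumptions.

classdef knnClassify < supLearn    % Class implementing K-nearest-neighbors classifier
  properties
    K   = 1;    % number of neighbors to use
    Xtr = [];   % stored (memorized) training features
    Ytr = [];   % stored training class labels
  end
  methods
    function obj = knnClassify(X, Y, K)   % constructor: just memorize the data
      obj.Xtr = X;  obj.Ytr = Y;  obj.K = K;
    end
    function Yhat = predict(obj, Xte)     % majority vote over the K nearest neighbors
      n = size(Xte,1);  Yhat = zeros(n,1);
      for i = 1:n
        dist = sum( bsxfun(@minus, obj.Xtr, Xte(i,:)).^2, 2 );
        [~, idx] = sort(dist, 'ascend');
        Yhat(i) = mode( obj.Ytr(idx(1:obj.K)) );
      end
    end
  end
end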