Why kNN is slow
  Training set: − {(1,9), (2,3), (4,1), (3,7), (5,4), (6,8), (7,2), (8,8), (7,9), (9,6)}
  Testing instance: − (7,4)
  Nearest neighbours? compare one-by-one to each training instance
  − n comparisons, each takes d operations
  [Figure: "What you see" vs. "What the algorithm sees": a scatter plot of the training points (X and Y axes from 0 to 10) next to the same points as a raw list of coordinates]

Making kNN fast
  Training: O(1), but testing: O(nd)
  Try to reduce d: reduce dimensionality
  − simple feature selection, since PCA/LDA etc. are O(d^3)
  Try to reduce n: don't compare to all training examples
  − idea: quickly identify m << n potential near neighbours
  − compare only to those, pick the k nearest neighbours → O(md) time
  K-D trees: low dimensionality, numeric data
  − O(d log n), only works when d << n, inexact: may miss neighbours
  Inverted lists: high dimensionality, sparse data
  − O(d'n'), d' << d, n' << n, works for sparse (discrete) data, exact
  Locality-sensitive hashing: high dimensionality, sparse or dense data
  − O(dH), H << n ... number of hashes, inexact: may miss neighbours

K-D tree example
  Build a K-D tree:
  − {(1,9), (2,3), (4,1), (3,7), (5,4), (6,8), (7,2), (8,8), (7,9), (9,6)}
  − pick a random dimension, find the median, split the data points, repeat
  Find nearest neighbours for a new point (7,4):
  − find the region containing the new point
  − compare to all points in that region
  − if needed, back up the tree and check neighbouring regions
  [Figure: the plane split at x=6, then at y=4 (left half) and y=8 (right half), giving leaf regions {(2,3),(4,1)}, {(1,9),(3,7),(5,4)}, {(7,2),(9,6)}, {(6,8),(7,9),(8,8)}]

Inverted list example
  Data structure used by search engines (Google, etc.)
  − for each attribute, a list of the training instances that contain it
  − main assumption: most attribute values are zero (sparseness)
  Given a new testing example:
  − fetch and merge the inverted lists for the attributes present in the example
  − O(n'd'): d' ... number of attributes present, n' ... average length of an inverted list
  Training documents:
  − D1: "send your password" (spam)    D2: "send us review" (ham)
  − D3: "send us password" (spam)      D4: "send us details" (ham)
  − D5: "send your password" (spam)    D6: "review your account" (ham)
  Inverted lists:
  − send: 1 2 3 4 5    your: 1 5 6    review: 2 6    account: 6    password: 1 3 5
  New email: "account review" → merge the lists for "account" and "review" → compare only to D2 and D6

k Nearest Neighbours Summary
  Key idea: distance between training and testing points
  − important to select a good distance function
  Can be used for classification and regression
  Simple, non-linear, asymptotically optimal
  − assumes only smoothness
  − "let the data speak for itself"
  Value of k selected by optimizing generalization error
  Naive implementation is slow for large datasets
  − use K-D trees (low d) or inverted indices (high d)
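To make the O(nd) testing cost from the "Why kNN is slow" slide concrete, here is a minimal Python sketch of naive kNN on the toy training set above. It is illustrative only, and the class labels "A"/"B" are hypothetical, since the slide does not give labels for these points.

    # Naive kNN on the slide's toy data: each query is compared to all n
    # training points, and each comparison touches all d attributes -> O(nd).
    from collections import Counter
    import math

    train = [(1, 9), (2, 3), (4, 1), (3, 7), (5, 4), (6, 8), (7, 2), (8, 8), (7, 9), (9, 6)]
    labels = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]   # hypothetical labels

    def knn_classify(query, train, labels, k=3):
        # n distance computations, each costing O(d)
        dists = sorted((math.dist(query, x), y) for x, y in zip(train, labels))
        votes = Counter(y for _, y in dists[:k])      # vote among the k closest
        return votes.most_common(1)[0][0]

    print(knn_classify((7, 4), train, labels))        # k=3 nearest: (5,4), (7,2), (9,6)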
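The K-D tree example can be sketched in a few lines as well. This is an assumption-laden illustration, not the lecture's code: it alternates the splitting dimension instead of picking one at random, splits at the median, and at query time scans only the leaf region containing the query, which is why the result can be inexact.

    # Toy K-D tree: median splits, small leaf buckets, query descends to one region.
    import math

    def build_kdtree(points, depth=0, leaf_size=3):
        if len(points) <= leaf_size:
            return {"leaf": points}
        dim = depth % 2                                  # alternate x / y
        points = sorted(points, key=lambda p: p[dim])
        mid = len(points) // 2
        return {"dim": dim, "split": points[mid][dim],   # median split value
                "left": build_kdtree(points[:mid], depth + 1, leaf_size),
                "right": build_kdtree(points[mid:], depth + 1, leaf_size)}

    def nearest_in_region(node, query):
        # Find the region containing the query, then compare to the points in it.
        while "leaf" not in node:
            node = node["left"] if query[node["dim"]] < node["split"] else node["right"]
        return min(node["leaf"], key=lambda p: math.dist(p, query))

    train = [(1, 9), (2, 3), (4, 1), (3, 7), (5, 4), (6, 8), (7, 2), (8, 8), (7, 9), (9, 6)]
    tree = build_kdtree(train)               # splits at x=6, then y=4 and y=8, as on the slide
    print(nearest_in_region(tree, (7, 4)))   # (7, 2), from the region containing (7,4)

A full implementation would also back up the tree and check neighbouring regions whenever they could contain a closer point, which is the "up the tree" step mentioned on the slide.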
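The inverted-list slide can likewise be sketched directly from its spam example. The code below builds the word → document lists shown on the slide and, for the new email "account review", merges only the lists for the words present, so D2 and D6 are the only candidates compared.

    # Inverted lists for the slide's six training emails (D1..D6).
    from collections import defaultdict

    docs = {1: "send your password", 2: "send us review", 3: "send us password",
            4: "send us details",    5: "send your password", 6: "review your account"}
    labels = {1: "spam", 2: "ham", 3: "spam", 4: "ham", 5: "spam", 6: "ham"}

    index = defaultdict(set)                  # word -> ids of documents containing it
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)

    def candidates(query):
        # Fetch and merge the inverted lists of the words present in the query:
        # O(n'd') with d' words present and n' the average list length.
        found = set()
        for word in query.split():
            found |= index.get(word, set())
        return sorted(found)

    print(sorted(index["send"]), sorted(index["password"]))   # [1,2,3,4,5] [1,3,5], as on the slide
    print(candidates("account review"))                       # [2, 6]: compare only to D2 and D6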
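The slide only names locality-sensitive hashing, so the scheme below is an assumption: random-hyperplane hashing, one common LSH family (strictly it targets cosine similarity rather than Euclidean distance). Each point gets an H-bit signature, and only points that share the query's bucket are compared, which is why true neighbours can occasionally be missed.

    # LSH sketch: H random hyperplanes give an H-bit signature per point.
    import math
    import random

    def signature(point, hyperplanes):
        # One bit per hyperplane: which side of it the point falls on.
        return tuple(1 if sum(w * x for w, x in zip(h, point)) >= 0 else 0
                     for h in hyperplanes)

    random.seed(0)
    H = 4                                        # number of hash bits, H << n
    hyperplanes = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(H)]

    train = [(1, 9), (2, 3), (4, 1), (3, 7), (5, 4), (6, 8), (7, 2), (8, 8), (7, 9), (9, 6)]
    buckets = {}
    for p in train:
        buckets.setdefault(signature(p, hyperplanes), []).append(p)

    query = (7, 4)
    bucket = buckets.get(signature(query, hyperplanes), [])   # candidate neighbours only
    if bucket:
        print(min(bucket, key=lambda p: math.dist(p, query)))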