Similarity, Neighbors, and Clusters
KAIST Department of Knowledge Service Engineering
이의진
[email protected]
February 27, 2014

Different sorts of business tasks involve reasoning from similar examples:
• Retrieving similar things directly
  • IBM wants to find companies that are similar to its best business customers, so that the sales staff can look at them as prospects.
  • Hewlett-Packard maintains many high-performance servers for clients; this maintenance is aided by a tool that, given a server configuration, retrieves information on other similarly configured servers.
  • Advertisers often want to serve online ads to consumers who are similar to their current good customers.
• Doing classification and regression
• Grouping similar items together into clusters
  • Does our customer base contain groups of similar customers, and what do these groups have in common? (unsupervised segmentation)
• Providing recommendations of similar products or from similar people
  • Whenever you see statements like “People who like X also like Y” or “Customers with your browsing history have also looked at …”, similarity is being applied.
• Other fields: medicine and law
  • A doctor may reason about a new, difficult case by recalling a similar case (either treated personally or documented in a journal) and its diagnosis.
  • A lawyer often argues cases by citing legal precedents, which are similar historical cases whose dispositions were previously judged and entered into the legal casebook.

Outline (Chapter 6)
1. Similarity & distance
2. Nearest-neighbor reasoning
3. Clustering
  • Hierarchical clustering
  • K-means clustering
  • Density-based clustering

Similarity and distance
• Credit card company: customer information
  • Age, years at current residence, residential status, etc.
• How should we measure the similarity between two people?

Euclidean distance
• Two people: A and B
• Euclidean distance with n attributes, writing $a_i$ and $b_i$ for the i-th attribute values of A and B:
  $d(A, B) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2}$

Whiskey Analytics
• Jackson’s five features: color, nose (aroma), body, palate (taste), finish
• How do we find similar whiskies?

Nearest neighbors for predictive modeling
• Will David respond or not?
[Figure legend: respond = yes / no; query point: David]
• Judging based on the nearest neighbors:
  • Majority vote => Yes, or P(Respond = Yes) = 2/3

Nearest neighbors: How many neighbors?
• If we classify using only one neighbor (k = 1), what do the classification regions look like?
[Figure legend: respond = yes / no]
• Some regions look like noise => overfitting!
• Finding a good k for k-NN: cross-validation
  • Increase k from 1 up to a reasonable value and choose the k that gives good recall, precision, F-measure, or a similar score.

k-nearest neighbors (k-NN)
• Key idea: predict a new outcome based on past cases
• k-NN: predict the outcome using the k most similar past cases
• Choosing k?
  • With majority voting, pick an odd k to avoid ties (this works when there are two classes)
  • Too small a k causes overfitting (e.g., k = 1)
  • Use cross-validation to find a suitable k from the given data

Whiskey Analytics Revisited
• Evaluation of 109 Scotch whiskies
• Jackson’s five features: color, nose (aroma), body, palate (taste), finish

Similarity: Jaccard distance
• X := Bunnahabhain’s Body = {firm, medium, light}
• Y := Jack Daniel’s Body = {firm, medium}
• X∩Y = {firm, medium} => |X∩Y| = 2
• X∪Y = {firm, medium, light} => |X∪Y| = 3
• $d_{\text{Jaccard}}(X, Y) = 1 - |X \cap Y| / |X \cup Y| = 1 - 2/3 = 1/3$
• Per-feature Jaccard distances: 0, 0, 1/3, 2/3, 0 (the 1/3 is the Body distance computed above)
• Average distance(Bunnahabhain, Jack Daniel’s) = (0 + 0 + 1/3 + 2/3 + 0)/5 = 1/5
  (assumption: every feature is given equal weight)

Hierarchical Clustering
• Method:
  • Start: each object is its own cluster (atomic cluster)
  • Repeat: merge the two closest clusters, again and again
  • End: a single cluster containing every object remains

Dendrogram representation
[Figure: dendrogram tree representation; axes: Object and Distance]

Cluster Distance Measures
• Single link (min): smallest distance between an element in one cluster and an element in the other, i.e., $d(C_i, C_j) = \min\{d(x_{ip}, x_{jq})\}$
• Complete link (max): largest distance between an element in one cluster and an element in the other, i.e., $d(C_i, C_j) = \max\{d(x_{ip}, x_{jq})\}$
• Average: average distance between elements in one cluster and elements in the other, i.e., $d(C_i, C_j) = \operatorname{avg}\{d(x_{ip}, x_{jq})\}$

[Figure: dendrogram, hierarchical classification of 109 single-malt Scotch whiskies]

K-means clustering
• Initial group centroids:
  • Place K points into the space represented by the objects that are being clustered.
1. Assign each object to the group that has the closest centroid.
2. When all objects have been assigned, recalculate the positions of the K centroids.
• Repeat Steps 1 and 2 until the centroids no longer move.

Here we have a dataset.
We randomly choose 2 group centroids.
We assign each point to the group that has the closest centroid.
We recalculate the positions of the centroids.
We assign each point to the group that has the closest centroid.
We recalculate the positions of the centroids.
From now on the centroids will not move, no matter how many more times the algorithm is run, so the clustering is done.

Note: Density-based clustering
• A method that groups high-density regions into clusters:
  • Density expansion := at least a minimum number of points (minPts) lie within a radius (eps)
[Figure: original points and clustered points; point types: core, border, and outliers]

Quiz
1. The kNN method uses k similar cases.
2. In kNN, any integer is fine as the value of k.
3. kNN can determine a value only by majority vote.
4. In kNN, a small k does not cause an overfitting problem.
5. In kNN, a good k can be found from the given samples.
6. Hierarchical clustering is a top-down approach.
7. In hierarchical clustering, the number of clusters cannot be adjusted.
8. In hierarchical clustering, only the minimum distance can be used when merging two clusters.
9. The hierarchical clustering process can be displayed as a dendrogram.
10. The y-axis of the dendrogram produced by hierarchical clustering is the number of clusters.
11. Hierarchical clustering and k-means clustering are unsupervised learning methods.
12. Hierarchical clustering uses centroids.
13. In k-means clustering, k is found automatically even if it is not specified before clustering starts.
14. In k-means clustering, the initial centroid positions must not be assigned randomly.
15. In k-means clustering, each point is mapped to the nearest of the k centroids.
16. In k-means clustering, each point is mapped to the nearest of the k centroids; if the distances are equal, the point may be assigned to any of the tied centroids.
17. k-means clustering repeats two steps: 1) assign each point to a centroid and 2) recompute the centroids.
18. In k-means clustering, there are cases where this iteration never ends and the centroid positions cannot be found.

Summary
1. Similarity & distance
  • Euclidean distance; Jaccard distance (similarity between sets)
2. Nearest-neighbor reasoning
  • k-NN: predict the outcome from the k most similar past cases (e.g., a majority vote among the k)
3. Clustering
  • Hierarchical clustering
    • Start: every object is its own cluster
    • Repeat: merge clusters repeatedly
    • End: a single cluster is produced
  • K-means clustering
    • Start: k centroids => assign each object to the nearest of the k centroids
    • Repeat: recompute the k centroids from the assigned objects => reassign
    • End: the k centroids no longer move
  • Density-based clustering
    • Clustering by density expansion (regions with minPts points within eps)
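
Example: Euclidean distance
The Euclidean distance listed under item 1 of the summary can be sketched in a few lines. The function name and the two customer records (age, years at current residence) are hypothetical values chosen for illustration; they do not come from the slides.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two records with the same n attributes."""
    if len(a) != len(b):
        raise ValueError("both records need the same number of attributes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical records: (age, years at current residence)
person_a = (23, 2)
person_b = (26, 6)

# Differences are 3 and 4, so the distance is sqrt(9 + 16).
distance_ab = euclidean_distance(person_a, person_b)
print(distance_ab)
```

A smaller distance means the two people are more similar. Attributes measured on very different scales would normally be rescaled first; this sketch assumes they are already comparable.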
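
Example: Jaccard distance
The Jaccard computation for the Body feature of Bunnahabhain and Jack Daniel's can be reproduced as follows. This is a minimal sketch: the function name is hypothetical, and treating two empty sets as distance 0 is an assumption the slides do not address. Exact fractions are used so the results match the slide values.

```python
from fractions import Fraction

def jaccard_distance(x, y):
    """1 - |X ∩ Y| / |X ∪ Y| for two sets."""
    union = x | y
    if not union:
        return Fraction(0)  # assumption: two empty sets count as identical
    return 1 - Fraction(len(x & y), len(union))

bunnahabhain_body = {"firm", "medium", "light"}
jack_daniels_body = {"firm", "medium"}
body_distance = jaccard_distance(bunnahabhain_body, jack_daniels_body)

# Per-feature distances as listed on the slide, averaged with equal weights.
feature_distances = [Fraction(0), Fraction(0), Fraction(1, 3), Fraction(2, 3), Fraction(0)]
average_distance = sum(feature_distances) / len(feature_distances)

print(body_distance, average_distance)
```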
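
Example: k-NN majority vote
The k-NN prediction from item 2 of the summary, sketched under simple assumptions: Euclidean distance, a plain majority vote, and a vote share used as a rough probability estimate. The training points are hypothetical and were picked so that k = 3 gives the same 2/3 share as the David example; the real data behind that slide are not available.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train: list of (features, label) pairs.
    Returns the majority label among the k nearest cases and its vote share."""
    neighbors = sorted(train, key=lambda case: math.dist(case[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / len(neighbors)

# Hypothetical past cases: (features, responded?)
past_cases = [
    ((1, 0), "yes"),
    ((0, 1), "yes"),
    ((1, 1), "no"),
    ((5, 5), "no"),
    ((6, 5), "no"),
]
david = (0, 0)

prediction, share = knn_predict(past_cases, david, k=3)
print(prediction, share)
```

With two classes and an odd k the vote cannot tie, which is why the slides suggest choosing k odd.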
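
Example: choosing k by cross-validation
The slides suggest trying increasing values of k and keeping the one that scores well under cross-validation. The sketch below uses leave-one-out cross-validation and plain accuracy as the score, which is a simplification: the slides mention recall, precision, and F-measure instead. The data set is hypothetical and contains one deliberately noisy point so that k = 1 overfits.

```python
import math
from collections import Counter

def knn_label(train, query, k):
    """Majority label among the k training cases nearest to query."""
    neighbors = sorted(train, key=lambda case: math.dist(case[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def leave_one_out_accuracy(data, k):
    """Hold out each case in turn and predict it from all the others."""
    hits = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_label(rest, features, k) == label
    return hits / len(data)

# Hypothetical data: two groups plus one noisy "no" inside the "yes" group.
data = [
    ((0, 0), "yes"), ((0, 1), "yes"), ((1, 0), "yes"), ((1, 1), "yes"),
    ((0.5, 0.5), "no"),  # noise
    ((5, 5), "no"), ((5, 6), "no"), ((6, 5), "no"), ((6, 6), "no"),
]

scores = {k: leave_one_out_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)  # first k with the highest score
print(scores, best_k)
```

With k = 1 every point next to the noisy case is misclassified, so the score improves once more neighbors vote.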
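
Example: hierarchical clustering with different linkages
The start / repeat / end procedure for hierarchical clustering, together with the single-link, complete-link, and average cluster distances, might be sketched like this. Function names and the four sample points are hypothetical, and the brute-force search over cluster pairs is chosen for readability, not speed.

```python
import math
from itertools import combinations, product

def pair_distances(ci, cj):
    """All distances between an element of ci and an element of cj."""
    return [math.dist(p, q) for p, q in product(ci, cj)]

def single_link(ci, cj):
    return min(pair_distances(ci, cj))

def complete_link(ci, cj):
    return max(pair_distances(ci, cj))

def average_link(ci, cj):
    d = pair_distances(ci, cj)
    return sum(d) / len(d)

def agglomerate(points, linkage):
    """Merge the two closest clusters until one remains.
    Returns the merge distances in order (the dendrogram heights)."""
    clusters = [[p] for p in points]  # start: every object is its own cluster
    heights = []
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        heights.append(linkage(clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return heights

points = [(0, 0), (0, 1), (3, 0), (4, 0)]  # hypothetical objects
single_heights = agglomerate(points, single_link)
complete_heights = agglomerate(points, complete_link)
print(single_heights, complete_heights)
```

Cutting the dendrogram at a chosen distance yields the clusters that exist at that height, which is how the number of clusters can be adjusted after the fact.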
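
Example: k-means iteration
The two k-means steps (assign each object to the closest centroid, then recompute the centroids) can be written as a short loop. This is a sketch under stated assumptions: the initial centroids are passed in explicitly so the run is repeatable, whereas the slides choose them randomly; ties go to the lowest-numbered centroid; an empty group keeps its old centroid; and `max_iter` is a hypothetical safety cap. The data are made up.

```python
import math

def kmeans(points, initial_centroids, max_iter=100):
    centroids = list(initial_centroids)
    groups = []
    for _ in range(max_iter):
        # Step 1: assign each point to the group with the closest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # Step 2: recompute each centroid as the mean of its group.
        new_centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if new_centroids == centroids:  # centroids no longer move
            break
        centroids = new_centroids
    return centroids, groups

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]  # hypothetical data
centroids, groups = kmeans(points, initial_centroids=[(1, 1), (2, 1)])
print(centroids)
print(groups)
```

Different starting centroids can lead to different final clusters, so in practice the algorithm is often run several times.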
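
Example: core, border, and outlier points
The density-based idea (at least minPts points within radius eps) leads to the three point types named on the density-based clustering slide. The sketch below only labels the points and does not build the clusters themselves. It assumes a point counts as its own neighbor, a convention the slides do not specify, and it uses hypothetical data with unique points.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'outlier'."""
    # Neighborhood of p: every point within eps, including p itself (assumption).
    neighbors = {p: [q for q in points if math.dist(p, q) <= eps] for p in points}
    core = {p for p in points if len(neighbors[p]) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors[p]):
            labels[p] = "border"   # not dense itself, but next to a core point
        else:
            labels[p] = "outlier"  # not reachable from any core point
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (5, 5)]  # hypothetical data
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)
```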