Lecture Applied data science: Clustering

Lecture "Applied data science: Clustering" covers: Exemplary technique - K-means clustering; Exemplary technique - Hierarchical clustering; Practical issues in clustering; Case study.

Clustering
Overview
1. Introduction
2. Application
3. EDA
4. Learning Process
5. Bias-Variance Tradeoff
6. Regression (review)
7. Classification
8. Validation
9. Regularisation
10. Clustering
11. Evaluation
12. Deployment
13. Ethics
Lecture outline
- Exemplary technique - K-means clustering
- Exemplary technique - Hierarchical clustering
- Practical issues in clustering
- Case study
Unsupervised learning and clustering
- Tends to be more subjective than supervised learning
- Often a part of the exploratory data analysis
- No universally accepted mechanism to validate the results
- Clustering - partition a data set into distinct, non-overlapping groups
Exemplary technique - K-means clustering
- Assign each observation to exactly one of K clusters (K must be chosen in advance)
- A good clustering is one for which the total within-cluster variation is as small as possible
- There are almost K^n ways to partition n observations into K clusters, so an exhaustive search is impractical, thus the approximating algorithm…
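The "within-cluster variation" criterion above can be made concrete with a small sketch. It assumes one common definition, the sum of squared Euclidean distances from each observation to its cluster centroid; the slides do not state which definition they use, and the function name and toy points are hypothetical.

```python
def within_cluster_variation(points, labels):
    """Total within-cluster variation, taken here (an assumption) to be the
    sum of squared Euclidean distances from each point to its cluster centroid."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid = coordinate-wise mean of the cluster members.
        centroid = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        total += sum((p[j] - centroid[j]) ** 2 for p in members for j in range(dim))
    return total


# Hypothetical toy data: two tight pairs that are far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = within_cluster_variation(points, [0, 0, 1, 1])  # groups nearby points
bad = within_cluster_variation(points, [0, 1, 0, 1])   # groups distant points
print(good, bad)
```

The partition that groups nearby points gives the much smaller value, which is what the criterion rewards.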
Exemplary technique - K-means clustering
- The above algorithm is repeated until the cluster assignments stop changing
- The algorithm is only guaranteed to find a local optimum
- Run the algorithm multiple times from different random starting points and select the best solution, i.e. the one with the smallest total within-cluster variation
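The iterate-until-stable loop and the multiple-restart strategy can be sketched as follows. This is a minimal illustration, not necessarily the algorithm shown in the lecture: it assumes initial centres are drawn at random from the observations, and the function names, `n_restarts` and `seed` are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(members):
    return tuple(sum(c) / len(members) for c in zip(*members))


def kmeans_once(points, k, rng, max_iter=100):
    # Assumption: start from k observations picked at random.
    centres = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centres[c]))
                      for p in points]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centre
                centres[c] = centroid(members)
    wcv = sum(sq_dist(p, centres[lab]) for p, lab in zip(points, labels))
    return labels, wcv


def kmeans(points, k, n_restarts=10, seed=0):
    """Run K-means several times and keep the run with the smallest
    total within-cluster variation."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_restarts))
    return min(runs, key=lambda run: run[1])


# Hypothetical toy data: two well-separated groups of three points.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, wcv = kmeans(data, k=2)
print(labels, wcv)
```

Each single run can stop at a different local optimum depending on its starting centres, which is why the best of several runs is kept.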
Exemplary technique - Agglomerative hierarchical clustering
The dendrogram
- ‘Hierarchical’ means that the clusters obtained by cutting the dendrogram at a given height are nested within the clusters obtained by cutting at any greater height
- This nested structure does not fit every data set, so hierarchical clustering is not always a suitable approach
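The nesting property can be checked on a small example. The sketch below is a naive agglomerative procedure that assumes complete linkage and Euclidean distance (the slides do not fix either choice); the function names and the one-dimensional toy data are hypothetical.

```python
import math


def agglomerative(points, linkage=max):
    """Naive agglomerative clustering. Returns the merge history as a list of
    (height, cluster_a, cluster_b). linkage=max is complete linkage,
    linkage=min is single linkage."""
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []

    def dissim(a, b):
        return linkage(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest dissimilarity.
        h, a, b = min(
            ((dissim(a, b), a, b)
             for idx, a in enumerate(clusters)
             for b in clusters[idx + 1:]),
            key=lambda t: t[0],
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((h, a, b))
    return merges


def cut(n_points, merges, height):
    """Clusters obtained by cutting the dendrogram at the given height."""
    clusters = {frozenset([i]) for i in range(n_points)}
    for h, a, b in merges:
        if h <= height:
            clusters -= {a, b}
            clusters.add(a | b)
    return clusters


# Hypothetical one-dimensional observations.
pts = [(0,), (1,), (5,), (6,), (20,)]
history = agglomerative(pts)
low = cut(len(pts), history, height=2)
high = cut(len(pts), history, height=10)
print(sorted(map(sorted, low)), sorted(map(sorted, high)))
```

Every cluster from the lower cut is a subset of some cluster from the higher cut, which is the nesting the slide describes.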
Choice of dissimilarities
- Euclidean distance
- Manhattan distance
- Jaccard distance
- Cosine distance
- Correlation based distance
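For reference, three of the listed measures can be written in a few lines each. These are the standard textbook definitions as far as I am aware; the function names are hypothetical, and the Jaccard version assumes each item is represented as a set.

```python
import math


def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))


def jaccard_distance(s, t):
    # 1 - |intersection| / |union|, for items represented as sets.
    return 1 - len(s & t) / len(s | t)


def cosine_distance(x, y):
    # 1 - cosine of the angle between the two vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)


print(manhattan((0, 0), (3, 4)))
print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))
print(cosine_distance((1, 0), (0, 1)))
```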
Choice of dissimilarity
- The Euclidean distance - similar items have a shorter distance between them
- The correlation based distance - similar items are more strongly correlated
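The two notions of similarity can disagree, as the following sketch shows. It assumes the correlation based distance is defined as one minus the Pearson correlation, which is a common convention but is not stated in the slides; the three profiles are hypothetical.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlation_distance(x, y):
    """Assumed definition: 1 - Pearson correlation of the two profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]  # same shape as a, much larger scale
c = [4, 3, 2, 1]      # close to a in magnitude, opposite trend

print(euclidean(a, b), euclidean(a, c))
print(correlation_distance(a, b), correlation_distance(a, c))
```

Under Euclidean distance `a` is closer to `c`, while under the correlation based distance `a` is closer to `b`, so the choice of dissimilarity changes which items end up clustered together.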
Practical issues in clustering
- Standardising features before clustering
- Hierarchical clustering - dissimilarity measures, types of linkage, number of clusters
- K-means clustering - the number of clusters K
- Do the clusters represent true (natural) subgroups in the data?
- Clustering methods are not robust to perturbations of the data
- Clustering results are only a starting point for forming hypotheses about the data
- Understanding clustering results
  - Use the name (or characteristic attributes) of elements in each cluster
  - Use an exemplar member in each cluster
- Clusters may be used as labels for subsequent predictive analytics
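Standardising matters because distance-based methods are dominated by whichever feature has the largest numeric range. A minimal sketch of z-score standardisation follows; it assumes no feature is constant (a zero standard deviation would cause a division by zero), and the function name and data rows are hypothetical.

```python
import statistics


def standardise(rows):
    """Rescale every feature (column) to mean 0 and standard deviation 1.
    Assumes no column is constant."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds))
            for row in rows]


# Hypothetical data: the second feature is on a far larger scale than the first.
rows = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0)]
scaled = standardise(rows)
for row in scaled:
    print(row)
```

After rescaling, both features contribute on a comparable scale to any Euclidean distance computed between observations.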
Case study - Clustering financial centres
- 15 cities - Ho Chi Minh City, Manila, Jakarta, Kuala Lumpur, Bangkok, Mumbai, Hong Kong, Singapore, Beijing, Shanghai, Shenzhen, Seoul, Busan, Taipei, Tokyo
- 55 instrument factors - in Business Environment (20), Financial Sector Development (9), Human Capital (7), Infrastructure (8), Reputation (11)