Core ML: Clustering & Dimensionality Reduction

Unsupervised learning operates on datasets that have no labeled answers attached. The algorithm's objective is to discover hidden structures, patterns, or groupings within the raw data.

1. Clustering Algorithms

Clustering groups similar data points together based on their mathematical "distance" from each other in N-dimensional space. It is used heavily in customer segmentation, anomaly detection, and genetics.

K-Means Clustering

  • Mechanism: The user defines the number of clusters $K$. The algorithm places $K$ random "centroids" into the data. It assigns each data point to its closest centroid, then recalculates the centroid's position by finding the average of all points assigned to it. It repeats this until the centroids stop moving.
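  • Pros: Extremely fast; scales to massive datasets.
  • Cons: You must choose $K$ in advance. A common heuristic is the "Elbow Method": run the algorithm for several values of $K$ and pick the point where adding more clusters stops reducing the within-cluster error by much. Struggles with non-spherical clusters.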
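
The assign-then-average loop described above can be sketched in plain Python. This is a minimal illustration and not the code in Cluster_KMeans.py: the function name `kmeans`, the `max_iters` and `seed` parameters, and the sample points are assumptions made for the example, and the initial centroids are drawn at random from the data points themselves.

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Plain K-Means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialise the centroids by picking k distinct data points at random.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # an empty cluster keeps its old centroid
        if new_centroids == centroids:  # the centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical sample data: two small, well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Because the starting centroids are random, the cluster numbering (which group is 0 and which is 1) depends on the seed; only the grouping itself is meaningful.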

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

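  • Mechanism: Groups points that are closely packed (points with many nearby neighbors). A point with at least a minimum number of neighbors within a chosen radius seeds a cluster, and the cluster grows outward through neighboring dense points. Points in sparse regions are left unassigned.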
  • Pros: You do not need to specify the number of clusters. Extremely good at finding oddly-shaped clusters (like a ring of data points around another circle). Automatically identifies outliers and labels them as "Noise".
  • Cons: Struggles if clusters have varying densities, because a single density threshold is applied to the whole dataset.
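
To make the density idea concrete, here is a small pure-Python sketch of the procedure. It is an illustration under assumptions rather than the contents of Cluster_DBSCAN.py: the names `dbscan`, `eps` (neighborhood radius), and `min_samples` (neighbors needed for a dense point), as well as the ring-and-blob sample data, are chosen for the example. It requires Python 3.8+ for `math.dist`.

```python
import math

NOISE = -1

def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not dense itself
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:  # j is dense too, so keep expanding from it
                queue.extend(nbrs)
    return labels

# Hypothetical data: a ring of 16 points around a small central blob, plus one outlier.
ring = [(5 * math.cos(2 * math.pi * t / 16), 5 * math.sin(2 * math.pi * t / 16))
        for t in range(16)]
blob = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5)]
outlier = [(20.0, 20.0)]
points = ring + blob + outlier

labels = dbscan(points, eps=2.5, min_samples=3)
print(labels)
```

With these settings the ring and the blob come out as two separate clusters and the far-away point is labelled -1. The result is sensitive to `eps` and `min_samples`, so both usually need tuning for real data.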

2. Dimensionality Reduction

Imagine a dataset for predicting house prices with 100 features, which means we are operating in a 100-dimensional mathematical space. Two of those features are SquareFeet and SquareMeters. They are perfectly correlated, so one of them is redundant: it adds a dimension without adding information, and redundancy like this can make some ML models harder to fit and interpret.

Dimensionality reduction is the process of compressing that 100D space down to a smaller space (e.g., 5D) while preserving as much of the data's statistical variance as possible.

Principal Component Analysis (PCA)

  • Mechanism: Uses linear algebra (the eigenvalues and eigenvectors of the data's covariance matrix) to project the data orthogonally onto a smaller set of axes, the principal components, ordered by how much variance each one captures.
  • Primary Uses:
      • Compression: Shrinking data to speed up other ML models (e.g., compress an image from 1024 pixels down to 50 Principal Components, then run it through a Neural Network).
      • 2D or 3D Data Visualization: Humans cannot visualize a 15-dimensional dataset. By compressing it to 2 Principal Components via PCA, we can plot it on an X-Y scatter plot and see the clusters directly.
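
As a rough sketch of the idea, the snippet below finds only the first principal component of the SquareFeet/SquareMeters pair from the house-price example. It is not the code in Dimensionality_PCA.py: it uses power iteration on the covariance matrix in place of a full eigendecomposition, and the function name `first_principal_component`, the `iters` parameter, and the sample areas are assumptions made for illustration.

```python
import math

def first_principal_component(rows, iters=200):
    """Return (mean, direction, explained_ratio) for the top principal component."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalise, which
    # converges towards the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T cov v gives the eigenvalue, i.e. the variance along v.
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    total_variance = sum(cov[a][a] for a in range(d))
    return mean, v, eigenvalue / total_variance

# Hypothetical data: the same area expressed in square feet and square metres.
square_feet = [850.0, 1200.0, 1500.0, 2100.0, 3000.0]
rows = [(s, s * 0.092903) for s in square_feet]

mean, direction, ratio = first_principal_component(rows)
# Project each centered row onto the component: two features become one number.
scores = [sum((r[j] - mean[j]) * direction[j] for j in range(len(direction)))
          for r in rows]
print(direction, ratio)
print(scores)
```

Because the two columns are perfectly correlated, a single component captures essentially all of the variance, so the second feature can be dropped without losing information. Real datasets rarely compress this cleanly, and finding several components requires a full eigendecomposition or a library implementation.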

How to execute the examples:

Go to the Examples/ folder and run the scripts:

  python Cluster_KMeans.py
  python Cluster_DBSCAN.py
  python Dimensionality_PCA.py