Recently, while working on a school project, I came across giotto-tda, a Python library for topological data analysis. The project was mainly focused on time series, but while reading the library's documentation I noticed it also offers feature extraction methods for classification problems, notably on the MNIST handwritten digits dataset. Here, I want to dive a little deeper into how these topological features are generated and how they can be used for classification.
Topological Feature Extraction
If we think of the pixels in a rasterized image as a set of points, we can treat them as 0-simplices and compute the persistent homology of the entire image as if it were a simplicial complex. However, thanks to the grid-like arrangement of the pixels, we can do better by treating the image as a cubical complex, whose basic building blocks are squares instead of triangles.
In our dataset, we have rasterized images of handwritten digits, 70,000 of them, to be exact. Let's see what they look like.
Here we have the digit 8. In order to create our cubical complexes, the image has to be binary, i.e., contain only black or white pixels. This is easily done by picking a threshold and rounding the value of each pixel up or down. Let's choose a threshold of 0.5, where 0.0 is black and 1.0 is white.
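As a quick sketch of the thresholding step (using plain NumPy rather than giotto-tda's own Binarizer, and a randomly generated stand-in for a real digit image):

```python
import numpy as np

# toy stand-in for a 28x28 MNIST image, grayscale values in [0, 1]
rng = np.random.default_rng(0)
image = rng.random((28, 28))

# binarize: pixels at or above the threshold become 1 (white), the rest 0 (black)
threshold = 0.5
binary = (image >= threshold).astype(np.uint8)
```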
With our binary image, we can now compute different filtrations on the complex. These filtrations are what allow us to extract different features from the image, since each one provides a different type of information. giotto-tda implements six different filtrations, each of which can be seen here:
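To give a flavour of what one of these does, here is a hand-rolled toy sketch of a height filtration: each active pixel is assigned the inner product of its position with a chosen direction vector, so pixels enter the filtration in the order they are swept along that direction. This is an illustration of the idea, not giotto-tda's HeightFiltration; in particular, background pixels are handled simplistically here by pushing them to the maximum height.

```python
import numpy as np

def height_filtration(binary, direction=(1.0, 0.0)):
    """Toy height filtration: each active pixel gets the inner product of its
    (row, col) position with a unit direction vector; background pixels are
    pushed to the maximum value so they enter the filtration last."""
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    rows, cols = np.indices(binary.shape)
    heights = rows * d[0] + cols * d[1]
    return np.where(binary == 1, heights, heights.max())

binary = np.array([[1, 0],
                   [1, 1]])
filtered = height_filtration(binary, direction=(1.0, 0.0))  # sweep top to bottom
```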
From each filtration, we calculate its corresponding persistence diagram:
Most of these plots don't tell us much on their own. Notice, however, that with either the Height or the Radial filtration, two persistent \(H_1\) generators appear, each corresponding to one of the loops in the digit 8. These are the filtrations that, visually speaking, distort the original image the least. Also note that the diagrams have been rescaled, which appears to be standard procedure.
With our persistence diagrams complete, we can now generate data to train our machine learning algorithm of choice. The stability of persistent homology guarantees that if two images are similar, their persistence diagrams will be similar as well, which is good news for what we're trying to do.
To transform our persistence diagrams into useful numerical data, we calculate their amplitude, defined as the distance between a diagram and the empty diagram. There are several metrics we can use to calculate this distance, and each one provides different information for our model. As an example, we can use the Heat Kernel metric on the diagram of the Radial filtration.
Calculating the amplitude of this diagram, we're left with the vector \[ v=[0.004, 2.497] \] Each value in the vector corresponds to a homology dimension in the persistence diagram. These results show that the \(H_1\) features are far more prominent than the \(H_0\) ones.
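As a rough illustration of the concept (not the heat-kernel amplitude itself), here is one simple amplitude computed by hand: for each homology dimension, take the Euclidean norm of the lifetimes (death minus birth) of that dimension's points. This is zero exactly when the diagram is empty, so it measures distance from the empty diagram. The diagram values below are made up:

```python
import numpy as np

# toy persistence diagram: rows of (birth, death, homology_dimension)
diagram = np.array([[0.0, 0.1, 0],
                    [0.2, 1.8, 1],
                    [0.3, 1.9, 1]])

# per homology dimension, the Euclidean norm of the points' lifetimes,
# i.e. a simple measure of how far the diagram is from being empty
amplitude = np.array([
    np.linalg.norm(diagram[diagram[:, 2] == dim, 1] - diagram[diagram[:, 2] == dim, 0])
    for dim in (0, 1)
])
```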
With all this done, we can now define a general feature extraction pipeline for our images.
Pipeline(steps=[('binarizer', Binarizer()),
                ('heightfiltration', HeightFiltration()),
                ('cubicalpersistence', CubicalPersistence()),
                ('scaler', Scaler()),
                ('featureunion', FeatureUnion(transformer_list=[
                    ('persistenceentropy', PersistenceEntropy(nan_fill_value=-1)),
                    ('amplitude-1', Amplitude(metric='bottleneck', metric_params={}, n_jobs=-1)),
                    ('amplitude-2', Amplitude(metric='wasserstein', metric_params={'p': 1}, n_jobs=-1)),
                    ('amplitude-3', Amplitude(metric='wasserstein', metric_params={'p': 2}, n_jobs=-1)),
                    ('amplitude-4', Amplitude(metric_params={'n_bins': 100, 'n_layers': 1, 'p': 1}, n_jobs=-1)),
                    ('amplitude-5', Amplitude(metric_params={'n_bins': 100, 'n_layers': 2, 'p': 1}, n_jobs=-1)),
                    ('amplitude-6', Amplitude(metric_params={'n_bins': 100, 'n_layers': 1, 'p': 2}, n_jobs=-1)),
                    ('amplitude-7', Amplitude(metric_params={'n_bins': 100, 'n_layers': 2, 'p': 2}, n_jobs=-1)),
                    ('amplitude-8', Amplitude(metric='betti', metric_params={'n_bins': 100, 'p': 1}, n_jobs=-1)),
                    ('amplitude-9', Amplitude(metric='betti', metric_params={'n_bins': 100, 'p': 2}, n_jobs=-1)),
                    ('amplitude-10', Amplitude(metric='heat', metric_params={'n_bins': 100, 'p': 1, 'sigma': 1.6}, n_jobs=-1)),
                    ('amplitude-11', Amplitude(metric='heat', metric_params={'n_bins': 100, 'p': 1, 'sigma': 3.2}, n_jobs=-1)),
                    ('amplitude-12', Amplitude(metric='heat', metric_params={'n_bins': 100, 'p': 2, 'sigma': 1.6}, n_jobs=-1)),
                    ('amplitude-13', Amplitude(metric='heat', metric_params={'n_bins': 100, 'p': 2, 'sigma': 3.2}, n_jobs=-1))]))])
The pipeline takes each of our filtrations, calculates 13 different amplitudes based on different metrics (plus the persistence entropy), and combines them all into a single feature vector containing 588 features. It's very likely that many of these are highly correlated, and we can visualize this with a correlation matrix.
Removing features with a Pearson correlation above 0.9, we're left with 52 features, a much more workable number.
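The pruning step can be sketched with pandas; the helper name and the toy data below are my own, not part of the original pipeline:

```python
import numpy as np
import pandas as pd

def drop_correlated(X, threshold=0.9):
    """Drop every feature whose absolute Pearson correlation with an
    earlier-kept feature exceeds the threshold."""
    corr = X.corr().abs()
    # keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

# toy demo: 'b' is a scaled copy of 'a', so it gets removed
X = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                  'b': [2.0, 4.0, 6.0, 8.0],
                  'c': [1.0, -1.0, 2.0, 0.0]})
X_reduced = drop_correlated(X)
```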
Results
With all this done, we can finally train a classifier on our data. Let’s use a Random Forest Classifier trained on 8000 samples.
from numpy import genfromtxt
from sklearn.ensemble import RandomForestClassifier

# train on the topological features extracted above
rf = RandomForestClassifier()
rf.fit(X_train_tda, y_train)

# evaluate on the held-out test features
X_test_tda = genfromtxt("X_test_tda.csv", delimiter=',')
print(f"Accuracy: {rf.score(X_test_tda, y_test)}")
Accuracy: 0.875
This accuracy isn’t too bad! Let’s see where our model messes up the most with a confusion matrix.
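The matrix can be computed with scikit-learn's confusion_matrix; the labels below are a made-up toy example standing in for y_test and the model's predictions:

```python
from sklearn.metrics import confusion_matrix

# toy stand-in for true digit labels and model predictions
y_true = [1, 7, 1, 7, 3, 0]
y_pred = [1, 1, 1, 7, 3, 0]  # one 7 misread as a 1

# cm[i, j] counts samples of class labels[i] predicted as labels[j]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 3, 7])
```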
Main takeaways:
- The numbers 0 and 3 were always classified correctly. This could be because they are topologically very distinct: 0 is just a loop with no other features, and 3 has two noticeable “curves” that, unlike in the number 8, don't join to form loops.
- 1 and 7 get confused the most. Looking through some of the observations in the dataset, it's easy to see why.
Closing Thoughts
I find it really interesting how such a conceptually abstract area of math can be put to practical use through a relatively simple series of steps. As a follow-up project, I'd like to compare this method with more conventional image classification techniques, like CNNs, and see whether there's any merit in using TDA for future endeavours.