Friday, December 11, 2015

Multidimensional Scaling

I have a working hypothesis that Mike Bostock and others care more about producing many aesthetically pleasing visualizations than about producing meaningful and useful ones.  When changing just a few lines of code is all it takes, this is reasonable, but I wish a bit more analysis were devoted to their formation.  

Is multidimensional scaling useful?  The raw data takes the form of an adjacency matrix (discretized, usually binary) or a distance matrix (continuous, the inverse of adjacency).  If we use a cluster detection algorithm on our distance matrix, we can sort it into groups and arrange the elements so that adjacency and closeness correspond along the axes.  Here we have one such sorted adjacency matrix by Mike Bostock on the co-occurrence of characters in Les Miserables.  
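A minimal sketch of the sorting step, using a toy binary adjacency matrix (not Bostock's actual data) and assuming scipy's hierarchical clustering as the cluster detection algorithm:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

# Toy adjacency matrix: 1 = the pair co-occurs, 0 = it does not.
adjacency = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

# Invert adjacency to distance: connected pairs become "close".
distance = 1.0 - adjacency
np.fill_diagonal(distance, 0.0)

# Hierarchical clustering yields a leaf ordering that places members
# of the same cluster next to each other along both axes.
condensed = distance[np.triu_indices(6, k=1)]  # scipy's condensed form
order = leaves_list(linkage(condensed, method="average"))

# Reorder rows and columns so clusters form blocks on the diagonal.
sorted_matrix = adjacency[np.ix_(order, order)]
print(sorted_matrix)
```

Off-diagonal concentrations in `sorted_matrix` are then the between-cluster relations discussed below.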
http://bost.ocks.org/mike/miserables/

With fewer than 100 elements, this approach can work well.  We can recognize the existence of clusters in the data, and begin to look between clusters for concentrations of adjacency or low distance off the major diagonal.  The cluster detection algorithm sorts the character Valjean into the green group, but we can recognize a concentration through him with the purple group.  Several characters in the yellow group are also related to the purple group.  These concentrations will likely be sparse in most data sets, so this method of finding relations between clusters should hold up well.  Between 100 and ~500 elements, this matrix display could still work, but it should use shading and blurring rather than discrete boxes to preserve space.

Could we map from the values of an adjacency matrix to Euclidean distances between elements on a graph, and preserve the information we have?  Properly, no.  We lose information because the system is overconstrained by the distances between all of the elements.  We preserve only the clusters, and a binary notion of close vs. far.  With many classes (clusters that should arise) and many features, the probability grows of high 'tension', or poor solubility of the mapping to distances in the embedding.  This tension arises when an element or cluster both pushes toward and pulls away from different elements in another cluster.  Compressing distances in N-dimensional space to two-dimensional space necessarily loses information.  In practice, we cannot learn anything useful from different values of 'far', only whether a pair is close or far.  Does this hold up?
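The overconstraint is easy to demonstrate.  Below is a sketch assuming scikit-learn's MDS implementation: five mutually equidistant points (a 4-simplex) cannot be laid out exactly in the plane, so the solver is left with residual stress, which is exactly the 'tension' described above.

```python
import numpy as np
from sklearn.manifold import MDS

n = 5
# Distance matrix of 5 mutually equidistant points (a 4-simplex):
# realizable in 4 dimensions, impossible in 2, so some pairwise
# distances must be distorted by any planar embedding.
distances = np.ones((n, n)) - np.eye(n)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(distances)

print(coords.shape)       # the 2-D embedding, one row per element
print(mds.stress_ > 0)    # nonzero stress: information was lost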

Here, we show the same data as above in another visualization by Mike Bostock, a force-directed graph, which is a form of multidimensional scaling.  Unfortunately, the clusters appear to have different color labels.


http://bl.ocks.org/mbostock/4062045

What useful information can be gathered here?  With mouse-over labeling of the characters, we might learn a bit more, but this is too cluttered.  Before, I had the impression that the data was fairly sparse, but from this nest of lines, whose shades are meant to convey force information, I might think differently.

There is another multidimensional scaling algorithm of recent interest, especially for model introspection in neural networks: tSNE (t-Distributed Stochastic Neighbor Embedding).  First, we'll look at a map of the closeness of different forums on reddit (subreddits), based on common subscriptions.
http://nicolas.kruchten.com/content/2014/12/subreddit-map/
We could quibble with some seemingly missing adjacencies, such as hockey and canada or sports (though they share the same category color label).  The larger issue is that there is obviously much richer information in the relations found between sets than 'near' or 'far', but that is all that can be communicated.  From the creator's explanation, a dense embedding in 100 dimensions was created using SVD.  The distances within that space could have been used in a distance matrix, which may have been more useful overall, especially if it were displayed using the clusters found by KMeans (as described).
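The alternative suggested above can be sketched as follows, assuming scikit-learn and using random stand-in vectors rather than the actual subreddit embedding: compute pairwise distances within the 100-dimensional space, then sort the distance matrix by KMeans cluster label.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
# Stand-in for the 100-dimensional SVD embedding of the subreddits.
embedding = rng.normal(size=(30, 100))

# Cluster in the full embedding space, then group same-cluster rows.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)
order = np.argsort(labels)

# Pairwise distances within the embedding space, sorted by cluster,
# giving the block-diagonal matrix display discussed earlier.
distance = pairwise_distances(embedding)
sorted_distance = distance[np.ix_(order, order)]
print(sorted_distance.shape)
```

Unlike the tSNE map, every pairwise distance survives in `sorted_distance`, not just a planar notion of near and far.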

Next, a visualization of the MNIST handwritten digits data set.
http://colah.github.io/posts/2015-01-Visualizing-Representations/
Visualizations for image recognition using neural networks seem to break my curmudgeonly hypothesis.  This use of tSNE for MNIST similarity representation works well because visual inspection of the input examples flows smoothly between points in the input space and is meaningful.  I can tell what characteristics of the '3' digit and the '8' or '5' digit correlate with similarity or presence on the class boundary, and the "main dimensions of variation within each class" are visually discernible.  The entire 60,000-example data set was used to create the model, but only a subset was shown, which keeps the display reasonable.  Inspection of each element and visual comparison between pairs make this useful in ways that non-image data cannot be.
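A minimal version of this kind of display can be reproduced with scikit-learn (an assumption; the linked post does not use it), substituting its small 8x8 digits set for full MNIST and embedding only a subset, as the post does:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Embed a subset of the digit images into 2-D with tSNE; each row of
# `coords` is the planar position of one image, suitable for plotting
# the image thumbnail at that point.
subset = X[:500]
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(subset)
print(coords.shape)
```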

This is a topic I'd like to revisit with other data sets, but wanted to make a first pass here.

