Wednesday, July 11, 2012

Should you apply PCA to your data?

If you've ever dipped your toe into the cold & murky pool of data processing, you've probably heard of principal component analysis (PCA).  PCA is a classy way to reduce the dimensionality of your data, while (purportedly) keeping most of the information.  It's ubiquitously well-regarded.
But is it actually a good idea in practice?  Should you apply PCA to your data before, for example, learning a classifier?  This post will take a small step in the direction of answering this question.

I was inspired to investigate PCA by David MacKay's amusing response to an Amazon review lamenting PCA's absence in MacKay's book:
"Principal Component Analysis" is a dimensionally invalid method that gives people a delusion that they are doing something useful with their data. If you change the units that one of the variables is measured in, it will change all the "principal components"! It's for that reason that I made no mention of PCA in my book. I am not a slavish conformist, regurgitating whatever other people think should be taught. I think before I teach. David J C MacKay.
Ha!  He's right, of course.  Snarky, but right.  The results of PCA depend on the scaling of your data.  If, for example, your raw data has one dimension that is on the order of $10^2$ and another on the order of $10^6$, you may run into trouble: the large-scale dimension will dominate the leading principal components regardless of how informative it actually is.
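
To see concretely what goes wrong, here's a tiny synthetic demo in Python/NumPy (the data is made up and has nothing to do with the experiments below):

```python
import numpy as np

# Two features on wildly different scales: roughly 1e2 and 1e6.
rng = np.random.default_rng(0)
X = np.column_stack([
    1e2 * rng.standard_normal(1000),   # small-scale feature
    1e6 * rng.standard_normal(1000),   # large-scale feature
])

# PCA on the raw (mean-centered) data: the top principal direction lines up
# almost entirely with the large-scale feature, simply because its variance
# dwarfs everything else.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
print(Vt[0])   # ~ [0, 1] (up to sign): feature 0 is essentially ignored
```

Change the units of the second feature and the "principal components" change completely, which is exactly MacKay's complaint.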

Exploring this point, I'm going to report test classification accuracy before & after applying PCA as a dimensionality reduction technique.  Since, as MacKay points out, variable normalizations are important, I tried each of the following combinations of normalization & PCA before classification (sketched in code just after the list):
  1. None.
  2. PCA on the raw data.  
  3. PCA on sphered data (each dimension has mean 0, variance 1).
  4. PCA on 0-to-1 normalized data (each dimension is squished to be between 0 and 1).
  5. ZCA-whitening on the raw data (a rotation & scaling that results in identity covariance).
  6. PCA on ZCA-whitened data.
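For concreteness, here's roughly what these normalizations and the ZCA transform look like in code.  This is a Python/NumPy sketch rather than the MATLAB I actually used, and the helper names (sphere, zero_one, zca_whiten) are just for illustration:

```python
import numpy as np

def sphere(X_train, X_test):
    # Each dimension gets mean 0 and variance 1 (statistics from training data only).
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0) + 1e-12
    return (X_train - mu) / sd, (X_test - mu) / sd

def zero_one(X_train, X_test):
    # Each dimension gets squished into [0, 1] using the training min/max.
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X_train - lo) / span, (X_test - lo) / span

def zca_whiten(X_train, X_test, eps=1e-5):
    # Rotate & rescale so the training covariance is (approximately) the identity.
    mu = X_train.mean(axis=0)
    d, E = np.linalg.eigh(np.cov(X_train - mu, rowvar=False))
    W = E @ np.diag(1.0 / np.sqrt(d + eps)) @ E.T   # symmetric "zero-phase" whitening
    return (X_train - mu) @ W, (X_test - mu) @ W
```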
Some experimental details:
  • Each random forest (RF) was trained to full depth with $100$ trees and $\sqrt{d}$ features sampled at each split.  I used MATLAB & this RF package.
  • The random forest is not sensitive to any dimension-wise (monotonic) normalization, and that's why I don't bother comparing RF on the raw data to RF on sphered & 0-to-1 normalized data. The performance is identical! (That's one of many reasons why we <3 random forests).
  • PCA in the above experiments is always applied as a dimensionality reduction technique: the principal components that explain 99% of the variance are kept, and the rest are thrown out (see details here).
  • ZCA is usually used as normalization (and not as dimensionality reduction). Rotation does affect the RF, and that's why experiment (5) is included.
  • PCA and ZCA require the data to have zero mean.
  • The demeaning, PCA/ZCA transformations, and classifier training were all fit on the training data only, and then applied to the held-out test data (a rough sketch of this pipeline follows this list).
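Putting those details together, a single experimental condition (say, "sphere + PCA") looks roughly like the following scikit-learn sketch.  The real experiments were run in MATLAB, and X_train, y_train, X_test, y_test are placeholders for whichever dataset is loaded:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Sphere the data: statistics are estimated on the training set only.
scaler = StandardScaler().fit(X_train)
Z_train, Z_test = scaler.transform(X_train), scaler.transform(X_test)

# PCA as dimensionality reduction: keep the components explaining 99% of the variance.
pca = PCA(n_components=0.99).fit(Z_train)
P_train, P_test = pca.transform(Z_train), pca.transform(Z_test)

# Random forest: 100 trees, grown to full depth, sqrt(d) features per split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
rf.fit(P_train, y_train)
print("test accuracy:", rf.score(P_test, y_test))
```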
And now, the results in table form.

Dataset      Raw Accuracy   PCA     Sphere + PCA   0-to-1 + PCA   ZCA     ZCA + PCA
proteomics   86.3%          65.4%   83.2%          82.8%          84.3%   82.8%
dnasim       83.1%          75.6%   73.8%          81.5%          85.8%   86.3%
isolet       94.2%          88.6%   87.9%          88.6%          74.2%   87.9%
usps         93.7%          91.2%   90.4%          90.5%          88.2%   88.4%
covertype    86.8%          94.5%   93.5%          94.4%          94.6%   94.5%

Observations:
  • Applying PCA to the raw data can be disastrous. The proteomics dataset has all kinds of wacky scaling issues, and it shows: a drop of more than 20 percentage points in accuracy!
  • For dnasim, choice of normalization before PCA is significant, but not so much for the other datasets. This demonstrates MacKay's point. In other words: don't just sphere like a "slavish conformist"!  Try other normalizations.
  • Sometimes rotating your data can create problems. ZCA keeps all the dimensions & the accuracy still drops for proteomics, isolet, and USPS. This is probably because the rotation mixes a bunch of very noisy dimensions in with all the others, effectively adding noise where there was little before.
  • Try ZCA and PCA - you might get a fantastic boost in accuracy. The covertype accuracy in this post is better than every covertype accuracy Alex reported in his previous post.
  • I also ran these experiments with a 3-tree random forest, and the above trends are still clear. In other words, you can cheaply figure out which combo of normalization and PCA/ZCA is right for your dataset (see the sketch after this list).
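To illustrate that kind of cheap screening, here's a hypothetical scikit-learn loop that compares a few of the combos with a 3-tree forest; it is not the code behind the numbers above, and the dataset variables (X_train, y_train, X_valid, y_valid) are again placeholders:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

candidates = {
    "raw":          [],
    "pca":          [PCA(n_components=0.99)],
    "sphere + pca": [StandardScaler(), PCA(n_components=0.99)],
    "0-to-1 + pca": [MinMaxScaler(), PCA(n_components=0.99)],
}
for name, steps in candidates.items():
    # A tiny 3-tree forest is enough to see which preprocessing helps.
    model = make_pipeline(*steps, RandomForestClassifier(n_estimators=3, max_features="sqrt"))
    model.fit(X_train, y_train)
    print(f"{name:14s} {model.score(X_valid, y_valid):.3f}")
```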
There is no simple story here.  What these experiments have taught me is (i) don't apply PCA or ZCA blindly, but (ii) do try PCA and ZCA, because they have the potential to improve performance significantly.  Validate your algorithmic choices!


Addendum: a table with dimensionality before & after PCA with various normalizations:

Dataset      Original Dimension   PCA   Sphere + PCA   0-to-1 + PCA   ZCA + PCA
proteomics   109                  3     57             24             65
dnasim       403                  2     1              3              13
isolet       617                  381   400            384            606
usps         256                  168   190            168            253
covertype    54                   36    48             36             49