Tuesday, July 3, 2012

Speeding Up Greedy Feature Selection

In the last post Alex wrote:
"While you can choose and manipulate your features by divination, tea leaf reading, or good old-fashioned gut-following, it's better to actually check whether what you're doing improves performance on a validation set. The trouble is that repeatedly training a classifier can become cumbersome once you leave the cozy world of Small Data. Even worse, if you want to use some sort of automatic feature selection algorithm (which requires training hundreds or thousands of classifiers), you might find yourself waiting for a very very long time."
My favorite variant of feature selection is greedy forward feature selection (GFFS).  First, some notation.  We have $N$ data points $(x_i,y_i)$ where $x_i$ is a $d$-dimensional feature descriptor and $y_i$ is its label, and we want to figure out which of the $d$ dimensions are useful, or maybe rank the dimensions in order of usefulness.  GFFS adds one feature dimension at a time to a set of already selected features, and checks how good that feature is by training & testing classifiers on $k$ cross-validation splits.  The best feature (in terms of average accuracy) is then added to the set of selected features, and the the next iteration begins.

For example, let's say you've already selected feature number $3$.  The algorithm trains and tests the following feature combinations and obtains the corresponding accuracies:

[3 1] - 65.5%
[3 2] - 63.5%
[3 4] - 67.5%
[3 5] - 70.5%    <--- best
...
[3 d] - 62.5%

Feature 3 got the best average hold-out accuracy in combination with feature 5.  So feature 5 is added to the already-chosen set, and the algorithm goes on to the next stage.

The computation time for a fully naive implementation of GFFS is... stupidly big: $O(d^2 k C)$, where $C$ is the complexity of the classifier being used.  In the last post, Alex investigated which classifier is efficient enough to make this feasible.  I found that even using the fastest classifier, the time actual runtime complexity was too ridiculous to be practical (unless you have a cluster).  So... how can we practically speed up GFFS?

Some Ideas

I implemented and tested two ideas:
  1. Early stopping - stop the outer loop if there has been no improvement for $p$ epochs  ($p=5$ in all my experiments).
  2. Random subsampling - during each epoch, only test $\sqrt{d}$ random dimensions instead of all $d$ dimensions.
Data & Experiments

I tested the above two ideas as follows:
  1. Split the data into training and test.  Get test accuracy using all $d$ dimensions
  2. Send the training data only into GFFS.
  3. Get test accuracy using whichever dimensions are returned from (2).
The best case scenario is that GFFS will get rid of a bunch of noisy dimensions and test accuracy will actually improve.  An acceptable scenario is that GFFS will get rid of a bunch of redundant dimensions and test accuracy will stay the same.

For the classifier, I used quadratic discriminant analysis (QDA), and added $0.1 I$ to the covariance estimates for regularization (chosen without validation).

I tested on three standard datasets: isolet, usps, covertype (with my own training/test split), and two datasets from my research: proteomics and dnasim.

Results

First, the accuracies before and after GFFS.  ES means early stopping.  RS means random subsampling.  A fully naive version of GFFS without early stopping wasn't even close to done running after a whole weekend, so I just stopped it.

Datasets QDA Test Accuracy Early Stopping Early Stopping
+ Random Subsampling
proteomics 66.8% 79.9% 74.7%
dnasim 64.7% 71.7% 64.8%
isolet 95.9% 94.0% 94.6%
usps 94.0% 92.7% 93.9%
covertype 62.9% 64.0% 62.5% 

Using the exhaustive GFFS (with early stopping) can provide some significant gains (proteomics & dnasim), but can also damage performance a bit (isolet and usps).  Adding random subsampling decreases gains somewhat, but not entirely, and can also undo some of the damage caused by feature selection (because more features are usually chosen when using RS).  Looks pretty good, but let's look at the actual runtimes (reported in minutes):

Datasets Early Stopping Early Stopping
+ Random Subsampling
proteomics 5.1 min 0.5 min
dnasim 0.5 min 0.1 min
isolet 698.6 min 41.2 min
usps 115.4 min 4.6 min
covertype 99.5 min 46.0 min

There we go! The time savings range from 50% to 99%. 

In conclusion: greedy forward feature selection is great, but way too expensive. It is still great and much faster if you add (i) early stopping and (ii) random subsampling. 

Next time (maybe):
  • Everyone loves PCA.  Is preprocessing your data with PCA actually a good idea in terms of classification performance?
  • Feature transforms.  There are a billion of them.  Which ones are a good idea, and when?