Wednesday, May 28, 2014

Spark should be better than MapReduce (if only it worked)

Spark is the user-friendly face of Big Data: a distributed programming framework which lets you write collection-oriented algorithms in Scala that (theoretically) execute seamlessly across many machines. Spark has an elegant API (parallel collections with methods like map/reduce/groupByKey) that feels like programming locally. Unlike MapReduce, Spark can cache partial results in the memory of its distributed workers, allowing for significantly faster/lower-latency computations. If Spark worked as promised it would be a huge productivity boost over writing MapReduce pipelines.

Unfortunately, as I've learned over the past month, Spark will happily generate programs that mysteriously grind to a halt-- and tracking down the source of these problems can be a numbingly opaque process. There are at least two distinct problems: Spark's lazy evaluation makes it hard to know which parts of your program are the bottleneck and, even if you can identify a particularly slow expression, it's not always obvious why it's slow or how to make it faster.

Lazy Evaluation

Spark's API lets you express your intentions clearly via many fine-grained calls to transformations such as map, filter, flatMap, distinct, &c. If you ran a long chain of transformations one at a time, you'd incur a large communication overhead and clog each worker's local cache with useless partial results. Spark reconciles its functional style with performance via delayed execution: transformations get bundled together and only run on demand. The unruly downside to Spark's execution model is that big swaths of your program will run as a monolithic glob of computation. And if that computation runs slowly...well, good luck figuring out which of its constituent parts is the culprit. Spark could ameliorate some of this confusion with a non-lazy debug mode or some built-in tooling assistance (e.g., a distributed profiler). For now, however, you're stuck (1) trying to reason your way through the lazy dependency graph ("Oh! This one is a shuffle dependency!") or (2) forcing computations and checking how long they take.
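To make the laziness concrete, here is a minimal sketch (in PySpark for brevity; the Scala API behaves the same way) of a chain of transformations that does no work at all until the final action forces it. The file name is a placeholder:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-eval-demo")

lines = sc.textFile("input.txt")                   # transformation: nothing runs yet
words = lines.flatMap(lambda line: line.split())   # still nothing
counts = words.map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)     # still nothing (but this adds a shuffle dependency)

# Only the action below forces the whole chain to execute as one monolithic job;
# if it crawls, any of the transformations above could be the real culprit.
top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])
```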

Too Many Parameters, Too Many Ways to Fail

Though Spark makes it look like you're programming on your laptop, it has many performance gotchas you must vigilantly guard against. Make sure your objects aren't using Java serialization (Kryo is faster), pull that object out of the closure, broadcast this array to keep it from being copied repeatedly. Though annoying for beginners, these rules of thumb at least have some consistency and can be learned.
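For illustration, here is a hedged sketch (PySpark, arbitrary toy data) of two of those rules of thumb: switching the serializer to Kryo and broadcasting a read-only structure so it isn't copied into every task:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("gotcha-demo")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)

lookup = {"a": 1, "b": 2, "c": 3}        # stand-in for something much larger
lookup_bc = sc.broadcast(lookup)         # shipped to each worker once, not once per task

rdd = sc.parallelize(["a", "b", "a", "c"])
total = rdd.map(lambda k: lookup_bc.value.get(k, 0)).sum()
```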

More frustrating, however, is the inevitability with which Spark's many (initially invisible) default parameters will be wrong for your application. Buffers and heap sizes will turn out to be too small. Your reductions will use too few (or too many) workers. Too much memory gets used for data caching, or maybe it's too little. There's no rigorous or systematic way to set these parameters: you wait until things fail and supplicate at the feet of Spark's heuristic knob pantheon. 
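To give a flavor of the knob pantheon, here is a sketch of a few of the Spark 1.x configuration keys I mean; the values are arbitrary placeholders, not recommendations:

```python
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.executor.memory", "4g")           # heap available to each executor
        .set("spark.storage.memoryFraction", "0.4")   # share of the heap reserved for cached RDDs
        .set("spark.default.parallelism", "256")      # number of tasks used by reductions
        .set("spark.akka.frameSize", "64"))           # MB; task results larger than this fail to ship
```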

My Experience Thus Far

Nothing I do can save my cluster from spending hours thrashing its way through a modest input size (before dying under a flood of obscure exceptions). I have recently tried every remedy I could find, from switching serializers and resizing buffers and heaps to changing worker counts and caching fractions, to keep Spark from its stubborn predilection toward dying.

After a month of "stab in the dark" debugging, I've learned a lot about the internals of Spark but still don't have a working application. In some weird twist of Stockholm syndrome, I've come to like expressing my algorithms in Spark's abstractions: if only they would actually run!

Has anyone else had a similar experience with Spark? Alternatively, has anyone had a positive experience with a non-trivial Spark codebase (bigger than tutorial wordcount examples)? If so, what were your tricks to avoid the death-by-a-thousand-shuffles I've been experiencing? And how about Spark's close cousins Scalding, Scoobi, and Scrunch: are they substantially better (or worse)?



Wednesday, March 5, 2014

Big speedup for training Random Forests in scikit-learn 0.15

Until recently, wiseRF was clearly the fastest Random Forest implementation for Python (and thus the best library for dealing with larger in-memory datasets). Though scikit-learn has had tree ensembles for the past several years, their performance was typically at least an order of magnitude worse than wiseRF's (a boon to wiseRF's marketing team). The sklearn developers seemed to shake off their tree-building sluggishness with a Cython rewrite in the 0.14 release.

Unfortunately, as Yisheng and I discovered while working on CudaTree, even the faster Cython tree builder can still be significantly slower than wiseRF. Why is there still a performance gap when both libraries now use native implementations? wiseRF is probably doing something smarter with their choice of algorithms and/or data layout but the iron curtain of closed source software keeps us from finding out exactly what's going on.

It turns out that one important choice for building trees efficiently is the algorithm used to sort candidate splitting thresholds. The upcoming 0.15 release of scikit-learn will include some cache-friendly changes to how their algorithm sorts data. These modifications seem to have finally closed the gap with wiseRF.
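To see why the sort matters, here is a toy sketch (plain NumPy, integer class labels assumed; not scikit-learn's actual implementation) of the split search every tree builder performs: sort one feature's values, then sweep the candidate thresholds while maintaining class counts on each side:

```python
import numpy as np

def best_split(x, y, n_classes):
    """Return (weighted Gini impurity, threshold) for the best split of one feature."""
    order = np.argsort(x)                 # the sort whose cache behavior matters
    xs, ys = x[order], y[order]
    n = len(xs)
    left = np.zeros(n_classes)
    right = np.bincount(ys, minlength=n_classes).astype(float)
    best = (np.inf, None)
    for i in range(n - 1):
        left[ys[i]] += 1
        right[ys[i]] -= 1
        if xs[i] == xs[i + 1]:
            continue                      # no threshold between equal values
        nl, nr = i + 1.0, n - i - 1.0
        gini = (nl * (1.0 - np.sum((left / nl) ** 2)) +
                nr * (1.0 - np.sum((right / nr) ** 2))) / n
        if gini < best[0]:
            best = (gini, 0.5 * (xs[i] + xs[i + 1]))
    return best
```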

Below are the benchmark times from the CudaTree paper, with the current branch of scikit-learn included under the label scikit-learn 0.15. The takeaway is that the new release will build Random Forests 2x-6x faster than the old one and that the performance differences between scikit-learn, wiseRF, and CudaTree are not significant.

Training times for 100 trees grown on a 6-core Xeon E5-2630 machine with an NVIDIA Titan graphics card:

Dataset         | wiseRF 1.5.11 | scikit-learn 0.14 | scikit-learn 0.15 | CudaTree 0.6
ImageNet subset | 23s           | 50s               | 13s               | 25s
CIFAR-100 (raw) | 160s          | 502s              | 181s              | 197s
covertype       | 107s          | 463s              | 73s               | 67s
poker           | 117s          | 415s              | 99s               | 59s
PAMAP2          | 1,066s        | 7,630s            | 1,683s            | 934s
intrusion       | 667s          | 1,528s            | 241s              | 199s

Information about the datasets used above:

Name            | Features | Samples | Classes | Description
ImageNet subset | 4,096    | 10,000  | 10      | Random subset of 10 labels from the 1000 category ImageNet data set, processed by the convolutional filters of a trained convolutional neural network (amazingly attains the same accuracy!)
CIFAR-100       | 3,072    | 50k     | 100     | Same as CIFAR-10, but with more samples and more labels.
covertype       | 57       | 581k    | 7       | Identify tree cover from domain-specific features.
poker           | 11       | 1M      | 10      | Poker hands
PAMAP2          | 52       | 2.87M   | 13      | Physical activity monitoring
intrusion       | 41       | 5M      | 24      | Network intrusion

Wednesday, December 11, 2013

NIPS and the Zuckerberg Visit

This past weekend, several hundred researchers, students, and hobbyists streamed into Lake Tahoe to attend a conference called Neural Information Processing Systems. NIPS is one of the two machine learning conferences of note (the other is ICML). Acceptance rates are low; prestige is high. Anyone interested in machine learning, statistics, applied math, or data can come to the conference but do expect to be bombarded by 10,000 terms that you don't know, even if you have a PhD.  

When we arrived and cracked open the workshop schedule, we found something very peculiar: 

"3:00pm – 3:30pm: Q&A with Mark Zuckerberg"

What is the CEO of Facebook doing speaking at an academic conference on machine learning (and, nominally, neuroscience)? There's obviously a porous boundary between the corporate and academic worlds, but has it ever been this porous?

Ordinarily, when employees of Google, Microsoft, or Facebook show up at NIPS, they either (a) keep to the recruiting room or (b) are there to discuss & present research. The latter group, though they might be wearing their employer's logo on a t-shirt, engage with the conference as academics. Their status is derived from their research accomplishments. And this status does not shut down discourse: they will still field questions and suggestions in person from any passing student. These kinds of interactions are encouraged at conferences like NIPS. As a professor once told me when I was a graduate student, “We need you guys, you’re the lifeblood of new ideas.”

In contrast, consider the presence of Mark Zuckerberg. I'm sure someone saw a legitimate need to encircle his Q&A session with armed guards, but nothing screams hierarchy like police at the door. The tone changed rapidly: accomplished professors became little more than lowly researchers shuffling into the Deep Learning workshop to see a Very Important Person speak. Zuckerberg couldn't help but disrupt the conference; the spectacle drew so many people that an adjacent workshop was paused to make room for the overflow. And equally distasteful is what went on behind the scenes. The conference was full of whispered rumors of one-on-one meetings and secret acquisitions. This is the first academic conference I have attended where there was this much talk about getting rich or being bought out, something that is actually happening to a number of researchers who appeal to Facebook's ambitions.

As for the content of the Q&A itself?  My distrust of excessive power will show itself here (note: Soviet childhood), but I can think of Mark Zuckerberg only as a tunnel visionary.  He wants Facebook to connect all the people in the world & have a personalized theory of mind for each user.  As far as he sees, this is for the good.  Some of the questions asked by the incisive audience were polite versions of “What are the dangers of having this much data about so many people?” and “What does Facebook as a company do to help society?”  These Zuckerberg dodged so expertly that by the time he was done “answering” (with a hefty & convincing confidence), I had forgotten exactly what the question was.

Facebook could have easily sent some high-ranking folks to give an interesting & technical talk instead of Zuckerberg coming himself.  His presence was jarring because it subverted the spirit of the conference, and injected into it the distinct aroma of big money.  Was it anything more than a glamorous & sanctioned recruiting visit?  I would have expected the NIPS organizers to decline to endorse such industrial overreach.

The barriers between Silicon Valley and academia are blurry and getting blurrier. Maybe this is to be expected in Zuckerberg's "knowledge economy", where the largest data sets and greatest computational resources are destined to be locked behind corporate doors. However, if academia has any hope of maintaining an atmosphere of open inquiry (rather than just proprietary R&D), academics have to protect their culture. Otherwise, the resulting decline in high-quality reproducible research will be a loss for everyone involved, and society at large.
 
In the future, Mark Zuckerberg should be welcome to attend NIPS just like anyone else, assuming he has paid the appropriate registration fee (or obtained a scholarship).  But it is the job of academics (here, the organizers of NIPS) to uphold the necessary boundary between academia and Silicon Valley.  They have failed to do so, and I sincerely hope that this flirtation with Silicon Valley won’t turn into a marriage.
  
This post was written jointly with Alex Rubinsteyn and a bottle of Scotch.

Friday, October 11, 2013

Training Random Forests in Python using the GPU

Random Forests have emerged as a very popular learning algorithm for tackling complex prediction problems. Part of their popularity stems from how remarkably well they work as "black-box" predictors to model nearly arbitrary variable interactions (as opposed to models which are more sensitive to noise and variable scaling). If you want to construct random forests in Python, then the two best options at the moment are scikit-learn (which uses some carefully optimized native code for tree learning) and the slightly speedier closed-source wiseRF.

Both of these implementations use coarse-grained parallelization to split work across multiple cores, but neither of them can make use of an available graphics card for computation. In fact, aside from an implementation custom-tailored for vision tasks and a few research papers of questionable worth, I haven't been able to find a general-purpose GPU implementation of Random Forests at all. So, for the past few months I've worked with Yisheng Liao to see if we could dramatically lower the training time of a Random Forest by off-loading as much computation to the GPU as possible.

We made a GPU Random Forest library for Python called CudaTree, which, for recent generation NVIDIA GPUs, can be 2x - 6x faster than scikit-learn.

CudaTree is available on PyPI and can be installed by running pip install cudatree. It's written for Python 2.7 and depends on NumPy, PyCUDA (to compile and launch CUDA kernels), and Parakeet (to accelerate CPU-side computations).
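Typical usage looks roughly like the following; this is a hedged sketch that assumes CudaTree's RandomForestClassifier is importable from the top-level package and follows the familiar scikit-learn fit/predict convention (check the README for the exact constructor arguments):

```python
import numpy as np
from cudatree import RandomForestClassifier

# toy data standing in for your real feature matrix and labels
x = np.random.randn(10000, 20).astype(np.float32)
y = np.random.randint(0, 2, size=10000)

forest = RandomForestClassifier(n_estimators=50)
forest.fit(x, y)
predictions = forest.predict(x)
```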

Benchmarks

We trained scikit-learn Random Forests (with fifty trees) on four medium-sized datasets (the largest takes up ~500MB of memory) on a 6-core Xeon E5-2630. We then compared this baseline with training times for CudaTree on machines with a variety of NVIDIA graphics processors:


Dataset         | scikit-learn | CudaTree (C1060) | CudaTree (C2075) | CudaTree (K20x) | CudaTree (Titan)
CIFAR-10 (raw)  | 114s         | 52s              | 40s              | 24s             | 20s (5.7x faster)
CIFAR-100       | 800s         | 707s             | 308s             | 162s            | 136s (5.8x faster)
ImageNet subset | 115s         | 80s              | 60s              | 35s             | 28s (4.1x faster)
covertype       | 138s         | -                | 86s              | -               | 50s (2.7x faster)

edit: Thanks to everyone who nudged me to validate the accuracy of CudaTree's generated models: there was actually a bug which resulted in the GPU code stopping early on the covertype data. We had unit tests for training error but none for held-out data. Noob mistake. The latest version of CudaTree fixes this mistake and the performance impact is negligible on all the other datasets. I'll add new covertype timings once the machines we used are free.

Information about the datasets used above:

Name            | Features | Samples | Classes | Description
CIFAR-10        | 3,072    | 10,000  | 10      | Raw pixel values to classify images of objects.
CIFAR-100       | 3,072    | 50,000  | 100     | Same as CIFAR-10, but with more samples and more labels.
ImageNet subset | 4,096    | 10,000  | 10      | Random subset of 10 labels from the 1000 category ImageNet data set, processed by the convolutional filters of a trained convolutional neural network (amazingly attains the same accuracy!)
covertype       | 54       | 581,012 | 7       | Identify tree cover from domain-specific features.

Limitations

CudaTree currently only implements a RandomForestClassifier, though regression trees are forthcoming. Furthermore, CudaTree's performance degrades when the number of target labels exceeds a few hundred and it might stop working altogether when the number of labels creeps into the thousands.

Also, there are some datasets which are too large for use with CudaTree. Not only does your data have to fit in the GPU's comparatively small memory, it also has to share that space with CudaTree's auxiliary data structures, which are proportional in size to your data. The exact specification for the largest dataset you can use is a little hairy, but in general try not to exceed about half the size of your GPU's memory.
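As a back-of-the-envelope version of that heuristic (my own rough arithmetic, not an exact specification from CudaTree):

```python
import numpy as np

def probably_fits(n_samples, n_features, gpu_mem_bytes, dtype=np.float32):
    # keep the raw dataset under roughly half of the GPU's memory
    data_bytes = n_samples * n_features * np.dtype(dtype).itemsize
    return data_bytes < 0.5 * gpu_mem_bytes

probably_fits(581012, 54, 6 * 1024**3)   # e.g. covertype on a 6GB card -> True
```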

If your data & task fit within these restrictions, then please give CudaTree a spin and let us know how it works for you.

Due credit: Yisheng did most of the coding, debugging, and profiling. So, you know, the actual hard work. I got to sit back and play advisor/"ideas man", which was a lot of fun. Also, the Tesla K20 used for this research was kindly donated by the NVIDIA Corporation, thanks NVIDIA!

Sunday, July 22, 2012

Expensive lessons in Python performance tuning

Over the past year I've done a lot of number crunching in Python with the wonderful ipython/numpy/pandas/sklearn stack and using picloud to parallelize my computations. Picloud is an amazing service which lets you run your code in Amazon's behemoth data centers while hiding most of the configuration headache. Their computational model is essentially a parallel map: you submit a list of inputs and specify what function to map over them and picloud automagically launches that function on multiple machines. This is less powerful than full-blown MapReduce (no parallel reductions) but, luckily, a lot of the computations machine learning people want to perform are trivial to parallelize. These include:
  • parameter searches (parallelize over distinct parameters)
  • k-fold cross validation (parallelize over the folds)
  • training random forests (trees don't need to know about each other)
  • random starts for locally converging hill climbers like k-means, EM, etc.
As long as you have multiple things to do in parallel and the work for each item greatly exceeds the communication time, picloud can be a miracle. However, they do charge a premium over the standard AWS instance prices and even Amazon's baseline prices can add up quickly if you're using powerful instance types. When you're running on someone else's hardware, the inefficiencies of your code translate very directly into added costs. One month, I managed to rack up an extremely high bill from an embarrassingly small amount of actual work because (1) my Python code created millions of bloated objects in memory, forcing me to use the more expensive high-memory instance type, and (2) I had overlooked some very simple optimizations that later cut my runtime by an order of magnitude.
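To make the parallel-map pattern concrete, here is a sketch of a parameter search using the standard library's multiprocessing.Pool as a local stand-in for picloud's map (module paths are those of present-day scikit-learn):

```python
from multiprocessing import Pool

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

def evaluate(n_trees):
    # each parameter setting is an independent unit of work
    scores = cross_val_score(RandomForestClassifier(n_estimators=n_trees), X, y, cv=3)
    return n_trees, scores.mean()

if __name__ == "__main__":
    with Pool() as pool:
        print(pool.map(evaluate, [1, 2, 4, 8, 16, 32]))
```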

So, I thought I might save someone else time and money by sharing some tips which made my Python code more efficient without sacrificing flexibility or maintainability. 
  1. Profile everything! Profile mercilessly and relentlessly. Let no assumption about efficiency pass without being put under the cold and uncaring gaze of a profiler. This applies to both runtime (use ipython's prun command) and memory consumption (which you can inspect with the over-engineered guppy). Don't "optimize" a damn thing until you know for sure it's a top bottleneck. 
  2. Be suspicious of loops (but have faith in Python's core operations). If your bottleneck is a loop performing numerical computations you can almost always speed it up by using some equivalent vectorized NumPy operations. On the other hand, once the obviously vectorizable stuff is off-loaded to library code, then I'm often surprised just how efficient Python's operations can be. I can build lists of startling length, index into dictionaries all day long, construct millions of simple objects, and it all adds up to a small fraction of my total runtime. 
  3. Simplest is often fastest. I spent a whole day concocting elaborate buffering and pre-fetching schemes to traverse the lines of a file as quickly as possible without reading it all into memory. I neglected, however, to compare with the trivial solution of simply writing "for line in file:", which of course turned out to be faster and more efficient. Similarly, I spent a while tweaking a CSV parser before finding that using "string.split" with the standard parsing routines int and float was just as fast.
  4. If you're creating millions of objects, make them namedtuples. In one feature extraction pipeline, I take a multi-million line CSV and parse each line into a Python object. My ever-frustrated internal micro-optimizer screams at the obvious inefficiency. And, sadly, he's right. Python objects are extremely heavy-weight compared to the simple struct we would create in C. Luckily, Guido and Friends foresaw the "zillions of simple objects" use-case and thus created namedtuples (see the first sketch after this list). They have named fields like objects, but are laid out contiguously and take up a lot less space. Alternatively, if you're using an object for more than just data storage and want to use non-trivial methods, you can lay out a full-blown object contiguously by filling out the __slots__ attribute. How efficient are they? As a totally unscientific demonstration I constructed a million little vectors with field names "x", "y", and "z". When implemented as ordinary Python objects they took up 400MB of memory, whereas their namedtuple cousins only required 114MB. Still not nearly as compact as they could be, but a dramatic improvement nonetheless!
  5. Don't search linearly through a sorted array, use binary search (duh). On the one hand, this is CS 101 kid's stuff. On the other, I didn't know about the bisect library and had code littered with np.nonzero(sorted<=x)[0]. Once I switched to using bisect_left/bisect_right I saw a huge performance improvement. edit: In the comments, Peter pointed out that NumPy implements a faster binary search called searchsorted; you (and I) should probably use that instead (see the second sketch after this list).
  6. Numpy's "fancy indexing" is slow. If you index into a NumPy array with a range then you get a very lightweight view over the original array; this kind of indexing doesn't copy any data and is almost free. If, however, you index with a boolean vector or a sequence of integers ("fancy indexing"), then you actually incur a significant cost. If your code does a lot of fancy indexing then you should either switch to using the take method or rewrite your indexing as an explicit loop in native code (see #8).
  7. Wes McKinney is a genius. If you're implementing anything Wes McKinney has already put in his library pandas, just stop. His code is faster, more robust, and more likely to be correct than anything you're about to write. Want rolling window aggregators? Use pandas. Need to handle missing data? Use pandas. Are you writing some sort of unbelievably ugly hack which attempts to implement joins and group-by's over NumPy arrays but actually manages to spend 3 hours computing a subtly incorrect result? (I have done this). Jesus Christ, just stop and use pandas.
  8. When all else fails, try weave. Yes, weave is unmaintained, ugly, and hard to debug. But, if you find yourself thinking "if only I could just inline 5 lines of C right in the middle of this Python code", then weave lets you actually do that. If, however, that 5 lines turns into 50 then maybe it's time to learn about Cython.
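Here is a minimal sketch of tip 4; the file name and fields are hypothetical stand-ins for the multi-million line CSV described above:

```python
from collections import namedtuple

# field access looks like an object, memory layout is a plain tuple
Point = namedtuple("Point", ["x", "y", "z"])

def parse_line(line):
    x, y, z = line.split(",")
    return Point(float(x), float(y), float(z))

with open("coords.csv") as f:            # hypothetical three-column CSV
    points = [parse_line(line) for line in f]
```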
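And a small demonstration of tip 5, comparing the linear scan I used to write with NumPy's built-in binary search:

```python
import numpy as np

values = np.sort(np.random.randn(10**6))    # a sorted array
x = 0.5

idx_slow = np.nonzero(values <= x)[0][-1]                  # linear scan over everything
idx_fast = np.searchsorted(values, x, side="right") - 1    # binary search

assert idx_slow == idx_fast
```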
Also, a few last words specific to machine learning: climb the complexity ladder carefully. It's tempting to throw a fancy tool at some new data, and doubly tempting to try every possible combination of preprocessing steps and hyperparameters. Brute force search and a muscular learning algorithm let you sidestep the need for thinking, experimentation, or understanding. Of course, in my experience, this rarely works out and ends up costing quite a bit of money.

So, do the more prudent, mature thing and start simple. Collect simple statistics, plot each feature, and try to understand the data a little before you launch 100,000 jobs on picloud. Even then, start with simpler learners like SGD Logistic Regression or small Random Forests. These will let you explore your choices of preprocessing steps and target functions without eating up too many resources. Once you're more comfortable with the problem, go ahead and train a 1000-layer deep belief network (if that's your thing).

Postscript: After almost a year, this article seems out of date with regard to the various ways you can generate native code. Particularly, before you go reaching for weave, Cython, or (gasp!) even C itself, it might be worthwhile to try selective just-in-time compilation. Both Numba and my own Parakeet will generate efficient native implementations for array-oriented programs. Copperhead is a little more limited semantically (it's really a subset of Haskell masquerading as Python), but if you can squeeze your algorithm into Copperhead's model then you'll get a blazing fast GPU kernel as a reward. Alternatively, you can skip the "pretending to be Python" charade and build data-parallel expression trees directly using Theano. It can be a little rough around the edges but, like Copperhead, if you can get it working then the (many orders of magnitude faster) GPU kernel performance is well worth the trouble.

Wednesday, July 11, 2012

Should you apply PCA to your data?

If you've ever dipped your toe into the cold & murky pool of data processing, you've probably heard of principal component analysis (PCA).  PCA is a classy way to reduce the dimensionality of your data, while (purportedly) keeping most of the information.  It's ubiquitously well-regarded.
But is it actually a good idea in practice?  Should you apply PCA to your data before, for example, learning a classifier?  This post will take a small step in the direction of answering this question.

I was inspired to investigate PCA by David MacKay's amusing response to an Amazon review lamenting PCA's absence in MacKay's book:
"Principal Component Analysis" is a dimensionally invalid method that gives people a delusion that they are doing something useful with their data. If you change the units that one of the variables is measured in, it will change all the "principal components"! It's for that reason that I made no mention of PCA in my book. I am not a slavish conformist, regurgitating whatever other people think should be taught. I think before I teach. David J C MacKay.
Ha!  He's right, of course.  Snarky, but right.  The results of PCA depend on the scaling of your data.  If, for example, your raw data has one dimension that is on the order of $10^2$ and another on the order of $10^6$, you may run into trouble.

Exploring this point, I'm going to report test classification accuracy before & after applying PCA as a dimensionality reduction technique.  Since, as MacKay points out, variable normalizations are important, I tried each of the following combinations of normalization & PCA before classification:
  1. None.
  2. PCA on the raw data.  
  3. PCA on sphered data (each dimension has mean 0, variance 1).
  4. PCA on 0-to-1 normalized data (each dimension is squished to be between 0 and 1).
  5. ZCA-whitening on the raw data (a rotation & scaling that results in identity covariance).
  6. PCA on ZCA-whitened data.
Some experimental details:
  • Each RF was trained to full depth with $100$ trees and $\sqrt{d}$ features sampled at each split.  I use MATLAB & this RF package.
  • The random forest is not sensitive to any dimension-wise normalizations, and that's why I don't bother comparing RF on the raw data to RF on standard normalized & 0-1 normalized data.  The performance is identical!  (That's one of many reasons why we <3 random forests).
  • PCA in the above experiments is always applied as a dimensionality reduction technique - the principal components that explain 99% of the variance are kept, and the rest are thrown out (see details here). A NumPy sketch of both PCA and ZCA appears after this list.
  • ZCA is usually used as normalization (and not as dimensionality reduction).  Rotation does affect the RF, and that's why experiment (5) is included.
  • PCA and ZCA require the data to have zero mean.
  • The demeaning, PCA/ZCA transformations, and classifier training were all done on the training data only, and then applied to the held-out test data.  
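Here is the NumPy sketch promised above: PCA keeping the components that explain 99% of the variance, and ZCA whitening, both fit on the training data only. (The experiments themselves were run in MATLAB; treat this as illustrative, not the exact code.)

```python
import numpy as np

def fit_pca_99(X_train):
    # keep the leading principal components explaining 99% of the variance
    mu = X_train.mean(axis=0)
    cov = np.cov(X_train - mu, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    k = np.searchsorted(np.cumsum(evals) / evals.sum(), 0.99) + 1
    return mu, evecs[:, :k]

def fit_zca(X_train, eps=1e-5):
    # rotate, rescale each component to unit variance, rotate back:
    # the result has (approximately) identity covariance
    mu = X_train.mean(axis=0)
    cov = np.cov(X_train - mu, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    W = np.dot(evecs, np.dot(np.diag(1.0 / np.sqrt(evals + eps)), evecs.T))
    return mu, W

# Apply to held-out data using the training-set statistics:
#   X_test_pca = np.dot(X_test - mu, components)
#   X_test_zca = np.dot(X_test - mu, W)
```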
And now, the results, in table form.


Dataset    | Raw Accuracy | PCA   | Sphere + PCA | 0-to-1 + PCA | ZCA   | ZCA + PCA
proteomics | 86.3%        | 65.4% | 83.2%        | 82.8%        | 84.3% | 82.8%
dnasim     | 83.1%        | 75.6% | 73.8%        | 81.5%        | 85.8% | 86.3%
isolet     | 94.2%        | 88.6% | 87.9%        | 88.6%        | 74.2% | 87.9%
usps       | 93.7%        | 91.2% | 90.4%        | 90.5%        | 88.2% | 88.4%
covertype  | 86.8%        | 94.5% | 93.5%        | 94.4%        | 94.6% | 94.5%

Observations:
  • Applying PCA to the raw data can be disastrous.  The proteomics dataset has all kinds of wacky scaling issues, and it shows.  Nearly 20% loss in accuracy!
  • For dnasim, choice of normalization before PCA is significant, but not so much for the other datasets.  This demonstrates MacKay's point.  In other words: don't just sphere like a "slavish conformist"!  Try other normalizations.
  • Sometimes rotating your data can create problems.  ZCA keeps all the dimensions & the accuracy still drops for proteomics, isolet, and USPS.  Probably because a bunch of the very noisy dimensions are mixed in with all the others, effectively adding noise where there was little before.
  • Try ZCA and PCA - you might get a fantastic boost in accuracy.  The covertype accuracy in this post is better than every covertype accuracy Alex reported in his previous post.
  • I also ran these experiments with a 3-tree random forest, and the above trends are still clear.  In other words, you can efficiently figure out which combo of normalization and PCA/ZCA is right for your dataset.
There is no simple story here.  What these experiments have taught me is (i) don't apply PCA or ZCA blindly but (ii) do try PCA and ZCA, they have the potential to improve performance significantly.  Validate your algorithmic choices!


Addendum: a table with dimensionality before & after PCA with various normalizations:

Dataset    | Original Dimension | PCA | Sphere + PCA | 0-to-1 + PCA | ZCA + PCA
proteomics | 109                | 3   | 57           | 24           | 65
dnasim     | 403                | 2   | 1            | 3            | 13
isolet     | 617                | 381 | 400          | 384          | 606
usps       | 256                | 168 | 190          | 168          | 253
covertype  | 54                 | 36  | 48           | 36           | 49

Friday, July 6, 2012

Quick classifiers for exploring medium-sized data (redux)

In my last post, I wanted to find out which classifiers are quick enough to allow us to re-train them many times on medium-sized data. A recap of terminology:
  • Small Data is the magical and care-free land where you can work on a single computer and do something/anything with "all pairs of samples" (evaluate kernels, compute distances, etc...). It's a mostly mythical place, occupied only by Intro ML students and whoever keeps working on Gaussian Processes.
  • If you can't fit your whole dataset on one machine and/or your computation needs to be distributed, then congratulations: you've got Big Data (now go learn Mahout).
  • Between Small Data and Big Data is the very common scenario when you are working on a single computer but have too many samples for anything but learning algorithms which scale linearly. Goodbye kernel methods, hello simple boundaries and decision trees.
Previously, I evaluated the runtimes and accuracies of all the scalable classifiers in scikit-learn and found that only Gaussian Naive Bayes and SVM trained by stochastic gradient descent were fast enough. Unfortunately, my evaluation had some obvious shortcomings:
  • I used a single synthetic dataset-- who knows if my results generalize to "real world" data?
  • More specifically, the dataset was sampled from 10 multivariate Gaussians, giving a significant advantage to models which assume the distribution underlying the data is Gaussian (ahem...Gaussian Naive Bayes).
  • I recommended using SGD SVM, but only tried training Logistic Regression with liblinear. However, it's known that linear SVMs and Logistic Regression often get similar performance, so I should have tried SGD Logistic Regression as well.
  • I trained the SGD SVM with the recommended ~1 million updates, but where does that heuristic even come from? Would fewer updates still yield good accuracy? If so, SGD classifiers could be trained even faster (hard to believe, given how uncannily quick they already are). The heuristic itself is spelled out in the sketch after this list.
  • All of my tree ensembles (random forests and extra trees) had at least 10 base classifiers. They showed good classification accuracy but were too slow for exploratory use. I should have tried fewer trees to see if I get acceptable accuracy with quicker-to-train ensembles.
  • Similarly, I never applied regularization to decision trees (either alone or in ensembles) by changing complexity controlling parameters such as the minimum number of samples per leaf. What if I forced the base trees to be simpler? Presumably training would be faster, but would accuracy suffer? I assume that single decision trees might benefit from lowered variance, whereas ensemble methods might do worse.
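For reference, the "~1 million updates" heuristic mentioned above just converts a target number of weight updates into a number of passes over the data; a one-liner makes it explicit:

```python
import numpy as np

def sgd_epochs(n_samples, target_updates=10**6):
    # enough passes over the data that n_samples * n_epochs is roughly 1 million
    return max(1, int(np.ceil(target_updates / float(n_samples))))

sgd_epochs(290504)   # covertype training set -> 4 epochs, matching the tables below
```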
To get a more complete picture of the speed vs. accuracy landscape, I got three real datasets (UCI's covertype & isolet, as well as some proteomics data) and evaluated scikit-learn's classifiers across a wider swath of parameters. I'll save you from squinting at big tables of numbers and summarize the results first:
  • Small ensembles of decision trees work! Yes, people will often recommend hundreds of trees (grown to full height). However, the performance of Random Forests seems to degrade gracefully as you limit both the number of trees in the ensemble and the complexity of each individual tree. Whereas 128 full-grown trees might get 88.7% accuracy (on the forest cover data set), a wimpy little duo of 2 shorter trees can get 80.7%. The difference in training times, however, is ~700 seconds vs. ~10 seconds. Squeezing out that extra 8% of accuracy might be worth it when you're constructing your final classification system but not necessarily when you're exploring large space of feature selection/transformation decisions.
  • Gradient boosting ain't so great...and it sometimes seems to fail catastrophically. Perhaps this is a problem with some default parameters in scikit-learn's implementation?
  • If you're estimating the variance of a distribution, use smoothing/regularization. Quadratic Discriminant Analysis and Gaussian Naive Bayes break down when a feature appears constant under a particular class label (remember, you're going to be dividing by that zero variance). Be sure to smooth your variance estimates, either in a rigorous/justified way or by just adding some little epsilon of variance to every feature's estimate (see the sketch after this list).
  • It doesn't matter whether you use Linear SVM or Logistic Regression. SVM seems to sometimes achieve faster training times (since it presumably only has to update weights when a prediction is wrong), whereas Logistic Regression seems to achieve slightly better generalization. But really, I would need to look at many more datasets before I could claim there is a substantive difference.
  • Stochastic gradient descent really is blazing fast-- but also delicate. Linear learners can be trained 10X - 100X faster with stochastic gradient descent but choosing the wrong number of training iterations can sometimes ruin your classifier's accuracy. Perhaps using some stopping criterion might work better than just having a fixed number of epochs?
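The variance smoothing mentioned a couple of bullets up is a one-line fix; here is a hand-rolled sketch of the "add a little epsilon" version for Gaussian Naive Bayes (newer scikit-learn releases expose essentially the same knob as GaussianNB's var_smoothing parameter):

```python
import numpy as np

def gnb_fit_smoothed(X, y, eps=1e-9):
    # per-class means and variances with a variance floor, so a feature that
    # happens to be constant under one class can't cause a divide-by-zero
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    variances += eps * X.var(axis=0).max()
    priors = np.array([np.mean(y == c) for c in classes])
    return classes, priors, means, variances
```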

Recommendations

If you only have enough time to use a single type of model, I would recommend a small Random Forest (with 3-7 trees). If even that is too slow, then try a linear SVM trained using stochastic gradient descent (the ~1 million update heuristic seems reasonable). However, since different classifiers seem to excel on different datasets, it would be best to train several diverse models (the two above, as well as regularized Naive Bayes and LDA) and use the best score any of them can achieve. That way, you might avoid tailoring your features too closely to the assumptions of one particular model.
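A sketch of that recommendation, using present-day scikit-learn class names (the 2012-era module paths and defaults differed slightly):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB

quick_models = {
    "small random forest": RandomForestClassifier(n_estimators=5),
    "linear SVM (SGD)": SGDClassifier(loss="hinge", max_iter=5),
    "naive Bayes": GaussianNB(),
    "LDA": LinearDiscriminantAnalysis(),
}

def quick_scores(X_train, y_train, X_test, y_test):
    # train each cheap model and report held-out accuracy; take the best
    return {name: model.fit(X_train, y_train).score(X_test, y_test)
            for name, model in quick_models.items()}
```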

The Numbers (here be dragoons)

Forest Cover Type

Given features such as soil type, elevation, and distance to water sources, predict which sort of tree is growing (e.g., "Spruce Fir" vs. "Ponderosa Pine").
  • Dimensionality: 54
  • Training samples: 290,504
  • Test samples: 290,508
  • Number of classes: 7
  • Accuracy if we always predict the most common class: 48.86%
  • Source: UCI
Algorithm | Training / Test Time | Training / Test Accuracy
Gaussian Naive Bayes | 1.08s / 2.55s | 9.66% / 9.69%
Regularized Gaussian Naive Bayes | 1.28s / 1.87s | 63.08% / 63.11%
LDA | 3.97s / 0.24s | 67.94% / 68.02%
QDA | 2.92s / 8.94s | 8.38% / 8.41%
CART
Decision Tree (min samples per leaf = 1) | 14.07s / 0.13s | 100.00% / 92.58%
Decision Tree (min samples per leaf = log n = 13) | 13.38s / 0.13s | 97.35% / 91.73%
Decision Tree (min samples per leaf = sqrt n = 538) | 10.40s / 0.11s | 81.54% / 80.62%
Support Vector Machine
Linear SVM (SGD, ~250k updates = 1 epoch) | 1.14s / 0.17s | 69.72% / 69.89%
Linear SVM (SGD, ~1m updates = 4 epochs) | 4.38s / 0.17s | 71.25% / 71.30%
Linear SVM (SGD, ~2m updates = 7 epochs) | 7.67s / 0.17s | 71.00% / 71.17%
Linear SVM (liblinear) | 50.79s / 0.17s | 71.21% / 71.34%
Logistic Regression
Logistic Regression (SGD, ~250k updates = 1 epoch) | 1.44s / 0.17s | 70.06% / 70.26%
Logistic Regression (SGD, ~1m updates = 4 epochs) | 5.61s / 0.17s | 71.23% / 71.30%
Logistic Regression (SGD, ~2m updates = 7 epochs) | 9.81s / 0.17s | 71.15% / 71.24%
Logistic Regression (liblinear) | 21.56s / 0.17s | 71.44% / 71.54%
Extremely Randomized Trees
Extra-Trees (2 trees) | 9.30s / 0.26s | 75.30% / 74.23%
Extra-Trees (4 trees) | 18.39s / 0.54s | 78.49% / 77.18%
Extra-Trees (8 trees) | 38.67s / 1.04s | 80.47% / 78.71%
Extra-Trees (16 trees) | 83.77s / 2.26s | 80.84% / 78.98%
Extra-Trees (32 trees) | 139.69s / 4.00s | 79.57% / 77.92%
Extra-Trees (64 trees) | 280.71s / 8.01s | 79.57% / 77.98%
Extra-Trees (128 trees) | 579.19s / 17.25s | 79.37% / 77.78%
Gradient Boosting (sample size = 100%)
Gradient Boosting (sample size = 100%, 2 stumps) | 21.33s / 0.32s | 67.06% / 67.25%
Gradient Boosting (sample size = 100%, 4 stumps) | 41.75s / 0.45s | 69.28% / 69.41%
Gradient Boosting (sample size = 100%, 8 stumps) | 81.61s / 0.68s | 70.53% / 70.58%
Gradient Boosting (sample size = 100%, 16 stumps) | 163.75s / 1.19s | 72.25% / 72.23%
Gradient Boosting (sample size = 100%, 32 stumps) | 327.92s / 2.32s | 74.28% / 74.15%
Gradient Boosting (sample size = 100%, 64 stumps) | 648.88s / 4.14s | 76.22% / 76.08%
Gradient Boosting (sample size = 100%, 128 stumps) | 1324.65s / 8.23s | 77.99% / 77.80%
Gradient Boosting (sample size = 25%)
Gradient Boosting (sample size = 25%, 2 stumps) | 12.69s / 0.32s | 67.47% / 67.63%
Gradient Boosting (sample size = 25%, 4 stumps) | 24.63s / 0.44s | 69.32% / 69.43%
Gradient Boosting (sample size = 25%, 8 stumps) | 48.25s / 0.69s | 70.57% / 70.67%
Gradient Boosting (sample size = 25%, 16 stumps) | 97.67s / 1.20s | 72.37% / 72.31%
Gradient Boosting (sample size = 25%, 32 stumps) | 192.46s / 2.22s | 74.53% / 74.45%
Gradient Boosting (sample size = 25%, 64 stumps) | 382.37s / 4.15s | 76.52% / 76.38%
Gradient Boosting (sample size = 25%, 128 stumps) | 770.73s / 8.49s | 78.56% / 78.39%
Random Forests (min samples per leaf = 1)
Random Forest (min samples per leaf = 1, 2 trees) | 11.19s / 0.25s | 85.85% / 80.84%
Random Forest (min samples per leaf = 1, 4 trees) | 22.12s / 0.54s | 89.96% / 85.40%
Random Forest (min samples per leaf = 1, 8 trees) | 46.53s / 1.14s | 92.67% / 87.79%
Random Forest (min samples per leaf = 1, 16 trees) | 89.64s / 2.16s | 92.79% / 88.34%
Random Forest (min samples per leaf = 1, 32 trees) | 177.83s / 4.31s | 93.01% / 88.53%
Random Forest (min samples per leaf = 1, 64 trees) | 352.32s / 8.41s | 93.19% / 88.63%
Random Forest (min samples per leaf = 1, 128 trees) | 698.02s / 17.02s | 93.26% / 88.70%
Random Forests (min samples per leaf = log n)
Random Forest (min samples per leaf = 13, 2 trees) | 10.43s / 0.24s | 82.73% / 80.70%
Random Forest (min samples per leaf = 13, 4 trees) | 19.62s / 0.45s | 85.42% / 83.40%
Random Forest (min samples per leaf = 13, 8 trees) | 38.41s / 0.85s | 86.44% / 84.49%
Random Forest (min samples per leaf = 13, 16 trees) | 74.81s / 1.77s | 86.61% / 84.65%
Random Forest (min samples per leaf = 13, 32 trees) | 153.10s / 3.36s | 87.35% / 85.35%
Random Forest (min samples per leaf = 13, 64 trees) | 300.55s / 6.69s | 87.43% / 85.40%
Random Forest (min samples per leaf = 13, 128 trees) | 608.06s / 16.93s | 87.45% / 85.48%
Random Forests (min samples per leaf = sqrt n)
Random Forest (min samples per leaf = 538, 2 trees) | 7.00s / 0.19s | 71.10% / 70.96%
Random Forest (min samples per leaf = 538, 4 trees) | 13.48s / 0.37s | 72.39% / 72.28%
Random Forest (min samples per leaf = 538, 8 trees) | 27.61s / 0.70s | 73.62% / 73.54%
Random Forest (min samples per leaf = 538, 16 trees) | 53.80s / 1.36s | 74.18% / 74.13%
Random Forest (min samples per leaf = 538, 32 trees) | 108.25s / 2.76s | 74.29% / 74.17%
Random Forest (min samples per leaf = 538, 64 trees) | 212.23s / 5.50s | 74.23% / 74.09%
Random Forest (min samples per leaf = 538, 128 trees) | 423.53s / 10.91s | 74.08% / 74.01%

Proteomics

The task is to classify the mass/charge ratio of peptides using some hand-crafted features of their spectra.
  • Dimensionality: 109
  • Training samples: 25,780
  • Test samples: 25,784
  • Number of classes: 6
  • Accuracy if we always predict the most common class: 49.25%
  • Source: Sergey Feldman
Algorithm | Training / Test Time | Training / Test Accuracy
Gaussian Naive Bayes | 0.16s / 0.25s | 17.18% / 17.40%
Regularized Gaussian Naive Bayes | 0.19s / 0.26s | 5.51% / 5.38%
LDA | failed
QDA | failed
CART
Decision Tree (min samples per leaf = 1) | 3.44s / 0.01s | 100.00% / 77.48%
Decision Tree (min samples per leaf = log n = 11) | 3.28s / 0.01s | 95.65% / 77.39%
Decision Tree (min samples per leaf = sqrt n = 160) | 2.69s / 0.01s | 83.71% / 79.42%
Support Vector Machine
Linear SVM (SGD, ~250k updates = 10 epochs) | 1.05s / 0.06s | 36.81% / 36.89%
Linear SVM (SGD, ~1m updates = 39 epochs) | 3.98s / 0.06s | 61.49% / 61.40%
Linear SVM (SGD, ~2m updates = 78 epochs) | 8.04s / 0.06s | 47.92% / 46.92%
Linear SVM (liblinear) | 109.26s / 0.07s | 56.18% / 56.18%
Logistic Regression
Logistic Regression (SGD, ~250k updates = 10 epochs) | 1.08s / 0.06s | 37.84% / 37.87%
Logistic Regression (SGD, ~1m updates = 39 epochs) | 4.12s / 0.06s | 37.16% / 37.29%
Logistic Regression (SGD, ~2m updates = 78 epochs) | 8.22s / 0.06s | 51.12% / 50.48%
Logistic Regression (liblinear) | 280.85s / 0.07s | 58.66% / 58.18%
Extremely Randomized Trees
Extra-Trees (2 trees) | 3.81s / 0.04s | 100.00% / 72.67%
Extra-Trees (4 trees) | 7.88s / 0.08s | 100.00% / 78.80%
Extra-Trees (8 trees) | 15.17s / 0.14s | 100.00% / 82.09%
Extra-Trees (16 trees) | 29.39s / 0.28s | 100.00% / 84.15%
Extra-Trees (32 trees) | 58.78s / 0.59s | 100.00% / 85.09%
Extra-Trees (64 trees) | 119.16s / 1.14s | 100.00% / 85.53%
Extra-Trees (128 trees) | 233.06s / 2.23s | 100.00% / 85.89%
Gradient Boosting (sample size = 100%)
Gradient Boosting (sample size = 100%, 2 stumps) | 3.99s / 0.03s | 58.70% / 58.49%
Gradient Boosting (sample size = 100%, 4 stumps) | 7.82s / 0.04s | 56.73% / 56.49%
Gradient Boosting (sample size = 100%, 8 stumps) | 15.37s / 0.07s | 58.73% / 58.62%
Gradient Boosting (sample size = 100%, 16 stumps) | 31.09s / 0.11s | 43.50% / 43.57%
Gradient Boosting (sample size = 100%, 32 stumps) | 61.36s / 0.21s | 42.71% / 42.79%
Gradient Boosting (sample size = 100%, 64 stumps) | 118.30s / 0.39s | 15.08% / 15.31%
Gradient Boosting (sample size = 100%, 128 stumps) | 230.85s / 0.78s | 20.24% / 20.04%
Gradient Boosting (sample size = 25%)
Gradient Boosting (sample size = 25%, 2 stumps) | 2.17s / 0.03s | 46.67% / 46.51%
Gradient Boosting (sample size = 25%, 4 stumps) | 4.14s / 0.04s | 44.78% / 44.85%
Gradient Boosting (sample size = 25%, 8 stumps) | 8.10s / 0.07s | 43.36% / 43.28%
Gradient Boosting (sample size = 25%, 16 stumps) | 16.17s / 0.12s | 45.05% / 45.04%
Gradient Boosting (sample size = 25%, 32 stumps) | 31.77s / 0.22s | 22.70% / 22.88%
Gradient Boosting (sample size = 25%, 64 stumps) | 61.40s / 0.42s | 32.85% / 33.07%
Gradient Boosting (sample size = 25%, 128 stumps) | 100.16s / 0.65s | 49.26% / 49.25%
Random Forests (min samples per leaf = 1)
Random Forest (min samples per leaf = 1, 2 trees) | 2.57s / 0.03s | 90.58% / 73.65%
Random Forest (min samples per leaf = 1, 4 trees) | 5.07s / 0.07s | 96.53% / 79.84%
Random Forest (min samples per leaf = 1, 8 trees) | 10.20s / 0.13s | 98.88% / 83.20%
Random Forest (min samples per leaf = 1, 16 trees) | 20.47s / 0.27s | 99.66% / 84.79%
Random Forest (min samples per leaf = 1, 32 trees) | 40.82s / 0.54s | 99.95% / 85.78%
Random Forest (min samples per leaf = 1, 64 trees) | 82.00s / 1.07s | 99.99% / 86.11%
Random Forest (min samples per leaf = 1, 128 trees) | 165.42s / 2.17s | 100.00% / 86.40%
Random Forests (min samples per leaf = log n)
Random Forest (min samples per leaf = 11, 2 trees) | 2.17s / 0.03s | 87.35% / 80.06%
Random Forest (min samples per leaf = 11, 4 trees) | 4.38s / 0.07s | 89.78% / 82.80%
Random Forest (min samples per leaf = 11, 8 trees) | 8.75s / 0.13s | 91.41% / 84.13%
Random Forest (min samples per leaf = 11, 16 trees) | 17.35s / 0.25s | 92.00% / 84.94%
Random Forest (min samples per leaf = 11, 32 trees) | 35.08s / 0.51s | 92.47% / 85.51%
Random Forest (min samples per leaf = 11, 64 trees) | 70.10s / 1.02s | 92.56% / 85.74%
Random Forest (min samples per leaf = 11, 128 trees) | 139.47s / 2.01s | 92.60% / 85.84%
Random Forests (min samples per leaf = sqrt n)
Random Forest (min samples per leaf = 160, 2 trees) | 1.61s / 0.03s | 79.59% / 78.49%
Random Forest (min samples per leaf = 160, 4 trees) | 3.22s / 0.05s | 81.67% / 80.91%
Random Forest (min samples per leaf = 160, 8 trees) | 6.37s / 0.10s | 82.60% / 81.99%
Random Forest (min samples per leaf = 160, 16 trees) | 12.83s / 0.21s | 83.17% / 82.50%
Random Forest (min samples per leaf = 160, 32 trees) | 26.01s / 0.42s | 83.23% / 82.68%
Random Forest (min samples per leaf = 160, 64 trees) | 51.79s / 0.82s | 83.41% / 82.90%
Random Forest (min samples per leaf = 160, 128 trees) | 102.06s / 1.64s | 83.41% / 82.94%

Isolet

Recognize which letter is being spoken from hand-crafted auditory features. I don't really think this dataset is large enough to qualify as "medium-sized", but I still threw it in for comparison.
  • Dimensionality: 617
  • Training samples: 6,238
  • Test samples: 1,559
  • Number of classes: 26
  • Baseline test accuracy: 3.85%
  • Source: UCI
Algorithm | Training / Test Time | Training / Test Accuracy
Gaussian Naive Bayes | 0.46s / 0.58s | 84.23% / 80.12%
Regularized Gaussian Naive Bayes | 0.51s / 0.58s | 89.87% / 88.65%
LDA | 4.18s / 0.22s | 97.52% / 94.16%
QDA | failed
CART
Decision Tree (min samples per leaf = 1) | 4.52s / 0.00s | 100.00% / 79.92%
Decision Tree (min samples per leaf = log n = 9) | 4.39s / 0.00s | 95.70% / 79.54%
Decision Tree (min samples per leaf = sqrt n = 78) | 4.03s / 0.00s | 88.35% / 79.28%
Support Vector Machine
Linear SVM (SGD, ~250k updates / 41 epochs) | 9.12s / 0.06s | 99.90% / 95.00%
Linear SVM (SGD, ~1m updates / 161 epochs) | 35.12s / 0.07s | 100.00% / 94.48%
Linear SVM (SGD, ~2m updates / 321 epochs) | 70.07s / 0.06s | 100.00% / 94.36%
Linear SVM (liblinear) | 13.10s / 0.05s | 100.00% / 94.55%
Logistic Regression
Logistic Regression (SGD, ~250k updates / 41 epochs) | 19.39s / 0.06s | 99.98% / 95.00%
Logistic Regression (SGD, ~1m updates / 161 epochs) | 74.00s / 0.06s | 100.00% / 95.57%
Logistic Regression (SGD, ~2m updates / 321 epochs) | 147.53s / 0.06s | 100.00% / 95.57%
Logistic Regression (liblinear) | 42.67s / 0.05s | 99.97% / 95.51%
Extremely Randomized Trees
Extra-Trees (2 trees) | 2.52s / 0.01s | 100.00% / 66.77%
Extra-Trees (4 trees) | 4.75s / 0.02s | 100.00% / 79.09%
Extra-Trees (8 trees) | 9.34s / 0.03s | 100.00% / 87.94%
Extra-Trees (16 trees) | 18.39s / 0.06s | 100.00% / 92.05%
Extra-Trees (32 trees) | 36.38s / 0.11s | 100.00% / 93.46%
Extra-Trees (64 trees) | 72.55s / 0.22s | 100.00% / 94.74%
Extra-Trees (128 trees) | 144.06s / 0.44s | 100.00% / 95.38%
Gradient Boosting (sample size = 100%)
Gradient Boosting (sample size = 100%, 2 stumps) | 20.19s / 0.01s | 4.14% / 4.43%
Gradient Boosting (sample size = 100%, 4 stumps) | 40.19s / 0.01s | 3.91% / 4.87%
Gradient Boosting (sample size = 100%, 8 stumps) | 80.14s / 0.02s | 4.09% / 4.30%
Gradient Boosting (sample size = 100%, 16 stumps) | 160.47s / 0.03s | 3.77% / 4.62%
Gradient Boosting (sample size = 100%, 32 stumps) | 319.10s / 0.08s | 3.77% / 4.43%
Gradient Boosting (sample size = 100%, 64 stumps) | 630.17s / 0.13s | 3.82% / 4.36%
Gradient Boosting (sample size = 100%, 128 stumps) | 1251.35s / 0.25s | 3.69% / 3.85%
Gradient Boosting (sample size = 25%)
Gradient Boosting (sample size = 25%, 2 stumps) | 10.08s / 0.01s | 3.70% / 3.98%
Gradient Boosting (sample size = 25%, 4 stumps) | 19.79s / 0.01s | 3.72% / 3.85%
Gradient Boosting (sample size = 25%, 8 stumps) | 39.54s / 0.02s | 3.49% / 3.66%
Gradient Boosting (sample size = 25%, 16 stumps) | 63.67s / 0.03s | 3.85% / 3.85%
Gradient Boosting (sample size = 25%, 32 stumps) | 83.68s / 0.02s | 3.85% / 3.85%
Gradient Boosting (sample size = 25%, 64 stumps) | 124.81s / 0.04s | 3.85% / 3.85%
Gradient Boosting (sample size = 25%, 128 stumps) | 203.56s / 0.05s | 3.85% / 3.85%
Random Forests (min samples per leaf = 1)
Random Forest (min samples per leaf = 1, 2 trees) | 2.60s / 0.01s | 88.12% / 66.32%
Random Forest (min samples per leaf = 1, 4 trees) | 5.45s / 0.02s | 97.74% / 81.46%
Random Forest (min samples per leaf = 1, 8 trees) | 10.46s / 0.03s | 99.71% / 88.65%
Random Forest (min samples per leaf = 1, 16 trees) | 21.04s / 0.05s | 99.98% / 92.05%
Random Forest (min samples per leaf = 1, 32 trees) | 42.44s / 0.11s | 100.00% / 93.52%
Random Forest (min samples per leaf = 1, 64 trees) | 83.37s / 0.21s | 100.00% / 94.10%
Random Forest (min samples per leaf = 1, 128 trees) | 168.74s / 0.43s | 100.00% / 93.97%
Random Forests (min samples per leaf = log n)
Random Forest (min samples per leaf = 9, 2 trees) | 2.32s / 0.01s | 87.48% / 76.78%
Random Forest (min samples per leaf = 9, 4 trees) | 4.66s / 0.01s | 93.78% / 85.63%
Random Forest (min samples per leaf = 9, 8 trees) | 9.32s / 0.03s | 96.73% / 89.35%
Random Forest (min samples per leaf = 9, 16 trees) | 18.82s / 0.05s | 98.12% / 92.62%
Random Forest (min samples per leaf = 9, 32 trees) | 37.40s / 0.10s | 98.59% / 93.33%
Random Forest (min samples per leaf = 9, 64 trees) | 75.41s / 0.21s | 98.75% / 94.03%
Random Forest (min samples per leaf = 9, 128 trees) | 152.74s / 0.44s | 98.86% / 94.16%
Random Forests (min samples per leaf = sqrt n)
Random Forest (min samples per leaf = 78, 2 trees) | 2.01s / 0.01s | 73.87% / 69.53%
Random Forest (min samples per leaf = 78, 4 trees) | 4.02s / 0.01s | 82.57% / 78.51%
Random Forest (min samples per leaf = 78, 8 trees) | 8.00s / 0.02s | 88.14% / 85.05%
Random Forest (min samples per leaf = 78, 16 trees) | 15.87s / 0.05s | 90.85% / 88.65%
Random Forest (min samples per leaf = 78, 32 trees) | 31.94s / 0.11s | 91.66% / 90.64%
Random Forest (min samples per leaf = 78, 64 trees) | 64.22s / 0.20s | 92.29% / 90.38%
Random Forest (min samples per leaf = 78, 128 trees) | 128.47s / 0.40s | 92.40% / 90.96%