Sunday, July 22, 2012

Expensive lessons in Python performance tuning

Over the past year I've done a lot of number crunching in Python with the wonderful ipython/numpy/pandas/sklearn stack and using picloud to parallelize my computations. Picloud is an amazing service which lets you run your code in Amazon's behemoth data centers while hiding most of the configuration headache. Their computational model is essentially a parallel map: you submit a list of inputs and specify what function to map over them and picloud automagically launches that function on multiple machines. This is less powerful than full-blown MapReduce (no parallel reductions) but, luckily, a lot of the computations machine learning people want to perform are trivial to parallelize. These include:
  • parameter searches (parallelize over distinct parameters)
  • k-fold cross validation (parallelize over the folds)
  • training random forests (trees don't need to know about each other)
  • random starts for locally converging hill climbers like k-meansEM, etc..
As long as you have multiple things to do in parallel and the work for each item greatly exceeds the communication time, then picloud can be a miracle. However, they do charge a premium over the standard AWS instance prices and even Amazon's baseline prices can add up quickly if you're using powerful instance types. When you're running on someone else's hardware the inefficiencies of your code translate very directly into added costs. One month, I managed to rack up an extremely high bill from an embarrassingly small amount of actual work because (1) my Python code created millions of bloated objects in memory, forcing me to use the more expensive high-memory instance type, and (2) I had overlooked some very simple optimizations that later cut my runtime by an order of magnitude.  

So, I thought I might save someone else time and money by sharing some tips which made my Python code more efficient without sacrificing flexibility or maintainability. 
  1. Profile everything! Profile mercilessly and relentlessly. Let no assumption about efficiency pass without being put under the cold and uncaring gaze of a profiler. This applies to both runtime (use ipython's prun command) and memory consumption (which you can inspect with the over-engineered guppy). Don't "optimize" a damn thing until you know for sure it's a top bottleneck. 
  2. Be suspicious of loops (but have faith in Python's core operations). If your bottleneck is a loop performing numerical computations you can almost always speed it up by using some equivalent vectorized NumPy operations. On the other hand, once the obviously vectorizable stuff is off-loaded to library code, then I'm often surprised just how efficient Python's operations can be. I can build lists of startling length, index into dictionaries all day long, construct millions of simple objects, and it all adds up to a small fraction of my total runtime. 
  3. Simplest is often fastest. I spent a whole day concocting elaborate buffering and pre-fetching schemes to traverse the lines of a file as quickly as possible without reading it all into memory. I neglected, however, to compare with the trivial solution of simply writing "for line in file:", which of course turned out to be faster and more efficient. Similarly, I spent a while tweaking a CSV parsers before finding that using "string.split" with the standard parsing routines int and float was just as fast. 
  4. If you're creating millions of objects, make them namedtuples. In one feature extraction pipeline, I take a multi-million line CSV and parse each line into a Python object. My ever-frustrated internal micro-optimizer screams at the obvious inefficiency. And, sadly, he's right. Python objects are extremely heavy-weight compared to the simple struct we would create in C. Luckily, Guido and Friends foresaw the "zillions of simple objects" use-case and thus created namedtuples. They have named fields like objects, but are laid out contiguously and take up a lot less space. Alternatively, if you're using an object for more than just data storage and want to use non-trivial methods, you can lay out a full-blown object contiguously by filling out the slots attribute. How efficient are they? As a totally unscientific demonstration I constructed a million little vectors with field names "x", "y", and "z". When implemented as ordinary Python objects they took up 400MB of memory, whereas their namedtuple cousins only required 114MB. Still not nearly as compact as they could be, but a dramatic improvement nonetheless! 
  5. Don't search linearly through a sorted array, use binary search (duh). On the one hand, this is CS 101 kid's stuff. On the other, I didn't know about the bisect library and had code littered with np.nonzero(sorted<=x)[0]. Once I switched to using bisect_left/bisect_right I saw a huge performance improvement. edit: In the comments, Peter pointed out that NumPy implements a faster binary search called searchsorted; you (and I) should probably use that instead!.
  6. Numpy's "fancy indexing" is slow. If you index into a NumPy array with a range then you get a very lightweight view over the original array; this kind of indexing doesn't copy any data and is almost free. If, however, you index with a boolean vector or a sequence of integers ("fancy indexing"), then you actually incur a significant cost. If your code does a lot of fancy indexing then you should either switch to using the take method or rewrite your indexing as an explicit loop in native code (see #8).
  7. Wes McKinney is a genius. If you're implementing anything Wes McKinney has already put in his library pandas, just stop. His code is faster, more robust, and more likely to be correct than anything you're about to write. Want rolling window aggregators? Use pandas. Need to handle missing data? Use pandas. Are you writing some sort of unbelievably ugly hack which attempts to implement joins and group-by's over NumPy arrays but actually manages to spend 3 hours computing a subtly incorrect result? (I have done this). Jesus Christ, just stop and use pandas.
  8. When all else fails, try weave. Yes, weave is unmaintained, ugly, and hard to debug. But, if you find yourself thinking "if only I could just inline 5 lines of C right in the middle of this Python code", then weave lets you actually do that. If, however, that 5 lines turns into 50 then maybe it's time to learn about Cython.
Also, a few last words specific to machine learning: climb the complexity ladder carefully. It's tempting to throw a fancy tool at some new data, and doubly tempting to try every possible combination of preprocessing steps and hyperparameters. Brute force search and a muscular learning algorithm let you sidestep the need for thinking, experimentation, or understanding. Of course, in my experience, this rarely works out and ends up costing quite a bit of money.

So, do the more prudent mature thing and start simple. Collect simple statistics, plot each feature, and try to understand the data a little before you launch 100,000 jobs on picloud. Even then, start with simpler learners like  SGD Logistic Regression or small Random Forests. These will let you explore your choices of preprocessing steps and target functions without eating up too many resources. Once you're more comfortable with the problem, go ahead and train a 1000 layer deep belief network (if that's your thing).

Postscript: After almost a year, this article seems out of date with regard to the various ways you can generate native code. Particularly, before you go reaching for weave, Cython, or (gasp!) even C itself, it might be worthwhile to try selective just-in-time compilation. Both Numba and my own Parakeet will generate efficient native implementations for array-oriented programs. Copperhead is a little more limited semantically (it's really a subset of Haskell masquerading as Python), but if you can squeeze your algorithm into Copperhead's model then you'll get a blazing fast GPU kernel as a reward. Alternatively, you can skip the "pretending to be Python" charade and build data-parallel expression trees directly using Theano. It can be a little rough around the edges but, like Copperhead, if you can get it working then the (many orders of magnitude faster) GPU kernel performance is well worth the trouble.