## Sunday, July 22, 2012

### Expensive lessons in Python performance tuning

Over the past year I've done a lot of number crunching in Python with the wonderful ipython/numpy/pandas/sklearn stack and using picloud to parallelize my computations. Picloud is an amazing service which lets you run your code in Amazon's behemoth data centers while hiding most of the configuration headache. Their computational model is essentially a parallel map: you submit a list of inputs and specify what function to map over them and picloud automagically launches that function on multiple machines. This is less powerful than full-blown MapReduce (no parallel reductions) but, luckily, a lot of the computations machine learning people want to perform are trivial to parallelize. These include:
• parameter searches (parallelize over distinct parameters)
• k-fold cross validation (parallelize over the folds)
• training random forests (trees don't need to know about each other)
• random starts for locally converging hill climbers like k-meansEM, etc..
As long as you have multiple things to do in parallel and the work for each item greatly exceeds the communication time, then picloud can be a miracle. However, they do charge a premium over the standard AWS instance prices and even Amazon's baseline prices can add up quickly if you're using powerful instance types. When you're running on someone else's hardware the inefficiencies of your code translate very directly into added costs. One month, I managed to rack up an extremely high bill from an embarrassingly small amount of actual work because (1) my Python code created millions of bloated objects in memory, forcing me to use the more expensive high-memory instance type, and (2) I had overlooked some very simple optimizations that later cut my runtime by an order of magnitude.

So, I thought I might save someone else time and money by sharing some tips which made my Python code more efficient without sacrificing flexibility or maintainability.
1. Profile everything! Profile mercilessly and relentlessly. Let no assumption about efficiency pass without being put under the cold and uncaring gaze of a profiler. This applies to both runtime (use ipython's prun command) and memory consumption (which you can inspect with the over-engineered guppy). Don't "optimize" a damn thing until you know for sure it's a top bottleneck.
2. Be suspicious of loops (but have faith in Python's core operations). If your bottleneck is a loop performing numerical computations you can almost always speed it up by using some equivalent vectorized NumPy operations. On the other hand, once the obviously vectorizable stuff is off-loaded to library code, then I'm often surprised just how efficient Python's operations can be. I can build lists of startling length, index into dictionaries all day long, construct millions of simple objects, and it all adds up to a small fraction of my total runtime.
3. Simplest is often fastest. I spent a whole day concocting elaborate buffering and pre-fetching schemes to traverse the lines of a file as quickly as possible without reading it all into memory. I neglected, however, to compare with the trivial solution of simply writing "", which of course turned out to be faster and more efficient. Similarly, I spent a while tweaking a CSV parsers before finding that using "string.split" with the standard parsing routines int and float was just as fast.
4. If you're creating millions of objects, make them namedtuples. In one feature extraction pipeline, I take a multi-million line CSV and parse each line into a Python object. My ever-frustrated internal micro-optimizer screams at the obvious inefficiency. And, sadly, he's right. Python objects are extremely heavy-weight compared to the simple struct we would create in C. Luckily, Guido and Friends foresaw the "zillions of simple objects" use-case and thus created namedtuples. They have named fields like objects, but are laid out contiguously and take up a lot less space. Alternatively, if you're using an object for more than just data storage and want to use non-trivial methods, you can lay out a full-blown object contiguously by filling out the slots attribute. How efficient are they? As a totally unscientific demonstration I constructed a million little vectors with field names "x", "y", and "z". When implemented as ordinary Python objects they took up 400MB of memory, whereas their namedtuple cousins only required 114MB. Still not nearly as compact as they could be, but a dramatic improvement nonetheless!
5. Don't search linearly through a sorted array, use binary search (duh). On the one hand, this is CS 101 kid's stuff. On the other, I didn't know about the bisect library and had code littered with np.nonzero(sorted<=x)[0]. Once I switched to using bisect_left/bisect_right I saw a huge performance improvement. edit: In the comments, Peter pointed out that NumPy implements a faster binary search called searchsorted; you (and I) should probably use that instead!.
6. Numpy's "fancy indexing" is slow. If you index into a NumPy array with a range then you get a very lightweight view over the original array; this kind of indexing doesn't copy any data and is almost free. If, however, you index with a boolean vector or a sequence of integers ("fancy indexing"), then you actually incur a significant cost. If your code does a lot of fancy indexing then you should either switch to using the take method or rewrite your indexing as an explicit loop in native code (see #8).
7. Wes McKinney is a genius. If you're implementing anything Wes McKinney has already put in his library pandas, just stop. His code is faster, more robust, and more likely to be correct than anything you're about to write. Want rolling window aggregators? Use pandas. Need to handle missing data? Use pandas. Are you writing some sort of unbelievably ugly hack which attempts to implement joins and group-by's over NumPy arrays but actually manages to spend 3 hours computing a subtly incorrect result? (I have done this). Jesus Christ, just stop and use pandas.
8. When all else fails, try weave. Yes, weave is unmaintained, ugly, and hard to debug. But, if you find yourself thinking "if only I could just inline 5 lines of C right in the middle of this Python code", then weave lets you actually do that. If, however, that 5 lines turns into 50 then maybe it's time to learn about Cython.
Also, a few last words specific to machine learning: climb the complexity ladder carefully. It's tempting to throw a fancy tool at some new data, and doubly tempting to try every possible combination of preprocessing steps and hyperparameters. Brute force search and a muscular learning algorithm let you sidestep the need for thinking, experimentation, or understanding. Of course, in my experience, this rarely works out and ends up costing quite a bit of money.

So, do the more prudent mature thing and start simple. Collect simple statistics, plot each feature, and try to understand the data a little before you launch 100,000 jobs on picloud. Even then, start with simpler learners like  SGD Logistic Regression or small Random Forests. These will let you explore your choices of preprocessing steps and target functions without eating up too many resources. Once you're more comfortable with the problem, go ahead and train a 1000 layer deep belief network (if that's your thing).

Postscript: After almost a year, this article seems out of date with regard to the various ways you can generate native code. Particularly, before you go reaching for weave, Cython, or (gasp!) even C itself, it might be worthwhile to try selective just-in-time compilation. Both Numba and my own Parakeet will generate efficient native implementations for array-oriented programs. Copperhead is a little more limited semantically (it's really a subset of Haskell masquerading as Python), but if you can squeeze your algorithm into Copperhead's model then you'll get a blazing fast GPU kernel as a reward. Alternatively, you can skip the "pretending to be Python" charade and build data-parallel expression trees directly using Theano. It can be a little rough around the edges but, like Copperhead, if you can get it working then the (many orders of magnitude faster) GPU kernel performance is well worth the trouble.

1. Don't say no to panda(s)...

1. Ha, I was going to delete your post as youtube spam, but then I watched it--- I can't delete this, it's too weird and creepy.

2. real time parallel with dynamic load. (Make graphics decay or enhance.) Off-line is trivial.

3. I meant to say "batch" or "non-real-time" is trivial.

1. Hey Terry,

I guess I don't really get what you mean--- it's a contest of difficulty, right? Sometimes you have online workloads and more often you have offline data.

4. Alex, the link to the SGD Logistic Regression class points to the liblinear Logistic Regression class.

1. Thanks for catching that, I changed the link to SGD.

5. which is batch and based on some kind of Coordinate Descent. The SGD logistic regression is http://scikit-learn.org/dev/modules/sgd.html#classification (with loss=log).

6. Alex would you be interested in starting a new BSD licensed github project to host your picloud integration code for scikit-learn related tasks (grid search, cross validation, random forest)?

1. Yes! ...but it will have to wait a month...I'm working on a summer project right now which eats all my time. I only wrote this post because I got sick of debugging my real work.

However, I suspect that a picloud library for this stuff might actually not be that useful for most people. The trickiest part of using picloud effectively is not sending too much over the wire. They actually have hard transmit limits! I forgot the exact number but it's less than 100mb. At first I found this very frustrating but I finally learned to "embrace the cloud" and store my stuff on S3. This made scaling up a lot easier--- but it doesn't fit totally naturally into the numpy-on-a-local machine way of doing things. I don't send my data to each worker-- I just send them a list of filenames they should load off S3.

2. This kind of utility functions to deal with numpy and S3 together would be very useful too, even in a non-picloud context.

Also, it might be interesting to use the libcloud.storage abstraction that works with alternative to S3 too (e.g. RackSpace and private OpenStack-based clouds for instance).

http://libcloud.apache.org/

Let me know by email when you plan started to work on this as I plan to do stuff in that area as well.

3. This comment has been removed by the author.

7. Hi Alex, the link to the different learners in scikit-learn (SGD, LogisticRegression, RandomForest) point to the 0.10 version of the website. We try to always point to the 'stable' version, as people land on old versions and then don't understand why the code examples do not work. Thanks for the post!

1. Ha, all the sklearn people are fixing my links :-). Do you know if there's any way you guys can get google to stop ranking your older documentation so highly? It usually comes up first on searches.

2. Perhaps because people keep linking to the old versions. ;)

8. Regarding issue #5: np.searchsorted also implements binary search:

In [14]: x = np.random.rand(20000)
In [15]: x.sort()
In [16]: %timeit np.searchsorted(x, 0.5)
1000000 loops, best of 3: 1.15 us per loop
In [17]: %timeit bisect.bisect_left(x, 0.5)
100000 loops, best of 3: 9.59 us per loop

1. Here I am writing blog posts about Python performance and I never noticed this function before...

9. About 6.: Is fancy indexing with boolean indices also slow? I haven't seen any numbers to support this (yet).

1. Btw, take doesn't work with boolean indices.

2. Fancy indexing with boolean masks does memory copy though. So it might still be interesting to rollup your own compiled routine to avoid the memory copy.

3. Some very informal tests with an array of 10**7 floats:

In [64]: timeit z[idx]
timeit z.take(idx)
100 loops, best of 3: 11.9 ms per loop

In [65]: timeit z.take(idx)
100 loops, best of 3: 9.25 ms per loop

timeit10 loops, best of 3: 49.7 ms per loop

10 loops, best of 3: 104 ms per loop

10 loops, best of 3: 102 ms per loop
------
...and repeated for arrays for length 10**5...
------

In [85]: timeit z[idx]
10000 loops, best of 3: 73.5 us per loop

In [86]: timeit z.take(idx)
10000 loops, best of 3: 38.2 us per loop

1000 loops, best of 3: 459 us per loop

1000 loops, best of 3: 953 us per loop

----
It looks like (1) fancy indexing with boolean masks is much slower and (2) the difference between __getitem__ and 'take' diminishes as the arrays grow larger.

10. __slots__ can reduce memory usage significantly without the need to switch to namedtuples. PyPy automatically 'slots' objects under the covers for you when it can (actually what it's doing is v8 style 'hidden classes' or 'maps')

11. Have you tried using numexpr by any chance?