Python Tutorial 4: Common Tasks

Plotting

Plotting with matplotlib is generally very straightforward, and you can get most of the documentation you'll ever need from the front page of the matplotlib website at http://matplotlib.sourceforge.net/. If you can't find the answer on the main page, be sure to check out the excellent matplotlib examples at http://matplotlib.sourceforge.net/examples/index.html.

Perhaps the best way to pick-up matplotlib on the fly is to peruse the gallery ( http://matplotlib.sourceforge.net/gallery.html ) and see what's the closest match to what you are looking to do.

The best way to use matplotlib interactively is by running ipython -pylab, which loads a special environment into ipython designed for use with matplotlib.

pylab and pyplot

In the most recent version of matplotlib (0.98, as of this writing), the name of the module containing the main plotting routines has changed from pylab to matplotlib.pyplot. pylab still exists as an alias, and it's still ipython -pylab (though that might change someday too), but now the "proper" way to access the main plotting interface is from matplotlib import pyplot rather than just import pylab.

And matplotlib.pyplot is where you'll find the routines in the online documentation.

Non-interactive plotting

If you're making a complex plot or one you may need to revise a lot, it can be very useful to put your plotting commands in a script or function in a module (well, both, by using script-calls-a-function). To get access to the commands you'd use from within ipython -pylab, just from matplotlib import pyplot, and put a pyplot. in front of all the usual commands.

When plotting non-interactively, you'll need to use pylab.show() to make the plotting windows appear on the screen. Since the actual drawing is delayed until this is actually called, this can make non-interactive plotting much faster than interactive plotting if there are a lot of plot elements.

Instead of making the plots appear in windows and saving them through the GUI, you can also save figures directly from Python using pylab.savefig().

Plotting 2D Intensity Data

There are three main ways of plotting image-like data (there are even more if the image is a 2D histogram):

* imshow. This plots regularly gridded image data directly, and doesn't accept arrays specifying the x and y axes (instead it can take a bounding box in the extent keyword argument. Also be aware that the default origin is the upper left corner, so you'll usually want to pass origin="lower". It also does automatic interpolation by default, so if you want to see the actual pixelization, you'll need to set interpolation="nearest". This is by far the fastest image plotting routine; if you're actually trying to plot a large image, it's the only way to go.

* pcolor. This may look very similar to the results of imshow for regularly gridded data, but it takes arrays that indicate the x and y positions of each data point, which means it doesn't have to be regularly gridded (in fact, it doesn't even have to be rectangular, but I have no idea how to pass the data in in that case). pcolor actually draws tons of tiny rectangles, so it scales and stretches nicely (but imshow does that too when you turn off interpolation). Note that the x and y arrays you pass pcolor indicate the edges of the data grid - that means they should each be one element longer than the shape of the data array in that dimension.

* contour / contourf. These draw contours (unfilled and filled, respectively). They also take x and y arrays to specify the grid points, but unlike pcolor these refer to the centers of the grid points, so they should have the same length as the shape of the data array in that dimension.

You may also be interested in ensuring that your axis ratio is 1:1. You can do that with pyplot.axis("scaled").

Computing and plotting histograms (1d and 2d)

* For simple 1D histograms, pyplot.hist does a pretty good job. If you want more control, or you don't want a plot made up of boxes, you can create the histogram array itself with numpy.histogram and plot it with the regular pylab.plot function.

* For 2D histograms, one option is to use numpy.histogram2d to create the histogram array, and then use pcolor, imshow, or contour[f] to plot it. Note that the x and y axes arrays it returns are suitable for use with pcolor, and hence to use them with contour[f] you have to use the following trick:

import numpy
from matplotlib import pyplot
h,hx,hy = numpy.histogram2d(xdata,ydata)
cx = 0.5*(hx[:-1] + hx[1:])
cy = 0.5*(hy[:-1] + hy[1:])
pyplot.contourf(cx,cy,h.transpose())

Note that histogram2d unfortunately returns an array with dimensions (x,y), instead of the expected (y,x), so you have to transpose it.

* In matplotlib 0.98, there's another option, pyplot.hexbin, which computes a 2d histogram and plots it on a hexagonal grid all in one step. It's pretty cool looking, but a little slow.

* A simple scatter plot can also be an excellent stand-in for a true 2d histogram. While you can use pyplot.scatter for this, you'll find it's much slower than using pyplot.plot with a dot marker (","). That's because pyplot.scatter has a lot more functionality you generally don't need - it can make all the scatter points have different colors, sizes, or symbols based data from additional dimensions. Really neat trick: you might notice when making scatter plots that the dense parts can get horribly saturated. You can make this a lot better by specifying a low alpha value (say, alpha=0.1) as a keyword to plot. This makes all the points transparent, so they only saturate when many of them are on top of each other (for alpha=0.1, that number is 10).

Colors and Color Maps

All of the 2d plotting functions discussed above use colormaps. You can set the default colormap with functions like pyplot.hot() or pyplot.jet(), but you can also use a much wider selection by passing one of the objects in pyplot.cm as the cmap argument of the plot functions. One very useful feature is that for every color map of the form pyplot.cm.<name>, there is a reversed colormap pyplot.cm.<name>_r.

For other plotting commands, like plot, it's most common to pass in single characters to specify the color: "b"=blue,"r"=red,"g"=green,"k"=black,"w"=white,"y"=yellow,"m"=magenta,"c"=cyan. However, you can also pass in tuples of (R,G,B), where R, G, and B are float values between 0 and 1. And you can get tuples like that by calling a colormap object. The example below gets 9 equally spaced color values from the "Blues" colormap:

colors = [pyplot.cm.Blues(z) for z in numpy.linspace(0,1,9)]
x = numpy.random.randn(50,9)
y = numpy.random.randn(50,9)
for i in range(9): pyplot.plot(x[:,i],y[:,i],",",color=colors[i],label="Blue #%i" % (i+1))
pyplot.legend()

Plots with error bars

Use pyplot.errorbar. 'Nuff said.

Log scales

Check out pyplot.semilogx, pyplot.semilogy, and pyplot.loglog. You can call these with arguments to make the plot, or call them with no arguments to force other plotting routines to use the log axes for that plot (this is useful for putting error bars on a log plot; call errorbar and one of the log routines with no arguments - in either order).

Labels and LaTeX

If you're using a backend that supports it, you can include LaTeX markup in your plot labels and legends. Just put the LaTeX math in $$ quotes, like you normally would, and put an r (for "raw") before the string quotes to tell Python not to try to interpret any backslashes in the string itself. Like this:

pyplot.xlabel(r"$\alpha$")

There are a lot of different matplotlib backends, most with some combination of GUI toolkit (GTK, wxWindows, Qt, Tk) and drawing system (Agg, Cairo). All the Agg backends support LaTeX (like GtkAgg or TkAgg), and you can set which one you use in your {{{.matplotlib/matplotlibrc}}] file if LaTeX doesn't just work for you.

Legends

You can add legends with explicit labels for all the plots you made by passing a bunch of labels to pyplot.legend. But it's better to just add a label="something" keyword argument to all the plots you want to have on the legend, and then call pyplot.legend with no arguments.

Single plots with multiple axes

http://matplotlib.sourceforge.net/examples/api/two_scales.html

Multiple plots with shared axes

http://matplotlib.sourceforge.net/examples/pylab_examples/shared_axis_demo.html

Additional Numpy Utilities

Random numbers

Check out numpy.random, obviously).

Linear algebra

There's lots of good stuff in numpy.linalg, like linear least squares solvers (lstsq), singular value decomposition (svd), inverses (inv), and eigenvalue decomposition (eig). All of these use the standard, high-quality LAPACK Fortran routines.

SciPy

Where to get SciPy documentation

* Inline Python help. Be aware that you have to import individual scipy modules separately.

* The new, nice, but maybe not-quite-complete online documentation: http://docs.scipy.org/doc/scipy/reference/ (I remember a few months ago it said "work in progress", but now it doesn't seem to say that anymore).

* The old, ugly, but fairly complete automatically-generated online documentation: http://www.scipy.org/doc/api_docs/

More linear algebra

Even more LAPACK wrappers can be found in scipy.linalg (including all the stuff in numpy.linalg).

Special Functions and Polynomials

scipy.special. Most of these are straightforward ufuncs. Unfortunately, it's a pain to get the documentation. If you do

>>> help(scipy.special.jn)

you'll get a worthless page about ufuncs in general. You'll need to do

>>> help(scipy.special)

and scroll down to learn that scipy.special.jn(n,v) is a Bessel function with integer order n at the real value v.

The special polynomials work a little differently. If you do

>>> H = scipy.special.hermite(5)

you'll get a special numpy.ndarray subclass: a polynomial object. You can evaluate a polynomial by calling it

>>> H(3.2)  # 5th order Hermite polynomial evaluated at 3.2
5878.5382400000062

and you can do a ton of other useful things with it, like integration (H.integ), differentiation (H.deriv), and all the usual mathematical operators. You can also compose them by calling one polynomial with the another polynomial as an argument.

The only "problem" with polynomials is that their coefficients are backwards. A nth-degree polynomial behaves like an array with shape (n+1), which makes sense...but you'll note that in the above example, H[0] is the coefficient for 5th-order term, and H[4] is the 0th order term.

Optimization

scipy.optimize. You can find a Levenberg-Marquardt non-linear least squares fitter at scipy.optimize.leastsq. There's also a simulated annealing algorithm, scipy.optimize.anneal, if you're worried about local minima. And a whole lot more. My recommendation is to figure out what you want by reading the optimization chapter in Numerical Recipes, and then go and find it in scipy. They're all there.

Least Squares Fitting

For most simple fitting problems, scipy.optimize.leastsq is overkill. It's designed for non-linear least squares optimization, and in a lot of cases (including fitting straight lines or polynomials) the problem is actually a linear least squares problem. In many more cases, the problem can be transformed to a linear one (such as fitting straight lines or polynomials in log space). The fit and covariance matrix of a linear least squares problem are much more robust than those of a non-linear problem, and while scipy.optimize.leastsq should give you good answers if you give it a linear problem, it won't if you don't do the math to transform your problem to linear. Check out the least squares via SVD discussion in Numerical Recipes for more information on why.

So whenever you can, make the problem linear, and then use numpy.linalg.lstsq (or equivalently, scipy.linalg.lstsq). A linear least squares problem is often written as a matrix equation Ax=b, where A is the linear model, b is the data, and x are the parameters. You might notice that this doesn't leave any place for the errors on the data to go (it assumes that the're all unity), but you can fix that by transforming A = A/sigma[:,numpy.newaxis] and b=b/sigma and solving the resulting problem.

If all you want to do is fit a polynomial (including a straight line) with no errors, use scipy.special.polyfit. It's very straightforward.

If you want to fit a polynomial, but you do have error bars, you'll want to use numpy.lstsq, like this:

import scipy.special
import numpy

def polyfit_with_err(x,y,sigma,deg):
   """Fit a polynomial p(x) = ''p[0] * x**deg + .... + p[deg]'' of degree 'deg'
   to points '(x,y)' with errors in y given by 'sigma'."""
   n = numpy.arange(deg,-1,-1)
   A = numpy.power.outer(x,n)              # A = x**n
   A /= sigma[:,numpy.newaxis]             # we made A ourselves so we can overwrite it
   y = y/sigma                             # ...but we don't want to overwrite y.
   v,residues,rank,s = numpy.linalg.lstsq(A,y)
   return scipy.special.poly1d(v)    # reverse array because of poly1d's screwed up ordering

The SciPy Disclaimer

Jim doesn't use SciPy much. That's not because he doesn't like it - he just hasn't needed it much - and in fact he hears from respectable sources that it's quite useful. But he doesn't know that much about it, and that's why he isn't writing about it.

So be aware: there's a lot more stuff available in SciPy, but you'll have to find out about it for yourself.

Pickle, Shelve, and Persistence

Almost any Python object can be pickled - that is, turned into a string or saved to disk, and then loaded back up later, on some other computer, as if nothing had ever happened. This is done by the pickle module (but actually the cPickle module, which is a faster version of pickle.

Here's how it works:

import cPickle
a = [5,3,2,6,1,2,"hurray!",3.14159,{"is this a dictionary inside a list?":"yes"}]
s = cPickle.dumps(a)      # s is now a string representation of a
new_a = cPickle.loads(s)  # new_a is an exact copy of a

or, to save to disk:

>>> import cPickle
>>> a = [5,3,2,6,1,2,"hurray!",3.14159,{"is this a dictionary inside a list?":"yes"}]
>>> f = open("a.p","w")
>>> cPickle.dump(a,f)         # save a in pickle format to file "a.p"
<close python>
>>> import cPickle
>>> f = open("a.p","r")
>>> a = cPickle.load(f)       # a is just like new

The default pickle format is plain text (though not very readable), and hence totally architecture-independent. If you're pickling big things, like large numpy arrays, you'll want to call dump with protocol=2, which is way faster, and also more compact (but also binary, so it might not be a good idea if you deal with non-intel Macs or Sparc boxes much).

If you want to save a bunch of little things to one file, shelve provides a dictionary-like object that stores multiple objects, by key, to a single file using pickle. It looks like this:

>>> import numpy
>>> import shelve
>>> s = shelve.open("data.s",protocol=2)
>>> s['name'] = "a nice little string"
>>> s['some random data'] = numpy.random.randn(500)
>>> s.close()
<close python>
>>> import shelve
>>> s = shelve.open("data.s",protocol=2)
>>> print s['name']
a nice little string

The advantage of this over pickling a dictionary is that with a shelve only the objects you ask for are loaded from disk, and since the files are stored in a database file, it will be pretty efficient at doing so. Still, for big objects, it's probably a better idea to pickle them separately.

A few caveats for shelve:

* Always be sure to close your shelve objects. Otherwise changes might not get written to disk.

* Unlike real dictionaries, keys for shelve objects must be strings.

* Don't edit mutable shelve objects in place. In other words:

>>> import shelve
>>> s = shelve.open("data.s",protocol=2)
>>> s['d'] = {}
>>> s['d']['k'] = 5.6   # This line doesn't do anything! Do this instead:
>>> d = s['d']
>>> d['k'] = 5.6
>>> s['d'] = d
>>> s.close()

Cluster Computing

Jim's favorite MPI interface right now is the absurdly simple pypar, found at http://datamining.anu.edu.au/~ole/pypar/.

Logging

Check out the logging module in the standard library. It's really cool.

Command Line Parsing

Check out the optparse module in the standard library. Just suck it up and use it once. You'll be glad you did, because then you can just copy and paste that the next time you want to use it.

Home-Brew Solutions

From Matt Auger: http://gravlens.physics.ucdavis.edu/~auger/software.php

Jim's catalog package (new version with a slightly different interface coming soon, so don't get too attached to it): http://dls.physics.ucdavis.edu/~jbosch/astro_utils/catalogs.tar.gz