Python Tutorial 2: Numpy

Strided Arrays

Representing a one-dimensional array on a computer is an easy thing to figure out: it's just a contiguous block of memory. If you want some contiguous subarray, you just take a sub-block of that memory.

It's much trickier to represent a multidimensional array. Consider a 4x3 matrix. As you might know, in C, the ordering of the elements is traditionally row-major, and it looks like this:

0 1 2
3 4 5
6 7 8
9 10 11

while in Fortran, it's column-major, and it looks like this:

0 4 8
1 5 9
2 6 10
3 7 11

If that's not complicated enough, consider the problem of transposing a matrix; clearly it's more efficient to change the storage order than to physically swap all the elements - so you can't even stick to one convention. Now imagine we wanted to pull out the upper-left 2x2 block. There's no way to do that without copying, unless you give up on having contiguous blocks of memory. And copying gets very ugly if you have very big arrays, which happens pretty often in astronomy.

Here's the solution Numpy uses:

  • In addition to storing the memory block and the shape of the array, maintain another array of strides.
  • The strides, like the shape, is a set of N integers, where N is the dimension of an array. The dot product of the stride with an N-dimensional index gives the memory location of the element.
  • For example: the row-major matrix has strides of (3,1) and the column-major matrix has strides of (1,4). The upper-left 2x2 submatrix of the row-major matrix also has strides of (3,1).

Usually you won't have to worry about the strides, but it's good to know. By the way, the strides in numpy count bytes, not numbers - so for an 8-byte double precision value, the (3,1) stride actually turns into (24,8).

Note that in all of this, I'm talking about indexing arrays with numbers that start with zero, not one - that's the way it's done in C, and that's the way it's always done in Python.

History Lessons

Once upon a time, there was a Python extension module called Numeric. Astronomers liked it, but they (mostly people at STSCI, I think) were frustrated that it didn't perform better on large arrays, and they wanted some more features. So numarray was created, as a new Numeric for astronomy people and other folks who like big arrays. But the original Numeric stuck around, and picked up new features numarray didn't have.

numpy was created as a project to put the two back together, and combine the best features of both. As of right now, the original maintainers of Numeric have moved over completely to numpy, and STSCI has encouraged people who use numarray to do the same. Both projects are now unmaintained, and Numeric in particular has known bugs that will likely never be fixed. Use numpy, and only numpy if you possibly can.

Numpy!

A small warning: with great power comes great responsibility. In many ways, numpy is a bit, well, perlish. It makes it very easy to write concise, arcane things that are very useful but sometimes very hard to read. Sometimes it even rewards those practices by optimizing for that kind of code. But it can also make a lot of very simple and obvious for loops unnecessary in Python, and it does it by using for loops compiled in C, which is what makes this whole practice of using Python to play with astronomical data not a completely crazy idea. Those of you coming from IDL or Matlab backgrounds know exactly what I'm talking about.

Creating Arrays

import numpy

# this makes an array out of a nested sequence (can be nested tuples, nested lists, or a combination of them)
a = numpy.array([[5,3],[2,1]])
# now we'll make sure the data type is float instead of int (the previous example forced numpy to guess)
a = numpy.array([[5,3],[2,1]],dtype=float)
# here's a 4x5x2 int array full of zeros
a = numpy.zeros((4,5,2),dtype=int)
# here's a 3x7 float array full of garbage
a = numpy.empty((3x7),dtype=float)
# here's a 5-element float array full of ones (you can use a single number for the shape instead of a tuple)
a = numpy.ones(5,dtype=float)

By the way, unless you tell it otherwise, numpy will always create arrays with row-major strides.

There's one more that's particularly useful, but only makes 1d arrays:

>>> a = numpy.arange(1.5,3.6,0.5)
>>> print a
[ 1.5,  2. ,  2.5,  3. ,  3.5]

arange works almost like the built-in range and xrange functions - it takes a arguments of (start, stop, step), but:

  • arange returns an array
  • arange works with floats as well as ints (you can use the dtype argument to ensure you get what you want). Using it with floats is tricky, however, because floating point arithmetic isn't exact - it's a good idea to get in the habit of "hedging" your stop arguments to make sure you get what you expect.

For a similar effect, you can also use linspace, which takes arguments of (start,stop,num):

>>> print numpy.linspace(1,5,20):
[ 1.        ,  1.21052632,  1.42105263,  1.63157895,  1.84210526,
  2.05263158,  2.26315789,  2.47368421,  2.68421053,  2.89473684,
  3.10526316,  3.31578947,  3.52631579,  3.73684211,  3.94736842,
  4.15789474,  4.36842105,  4.57894737,  4.78947368,  5.        ]

Or if you want evenly-spaced log values, logspace:

>>> print numpy.logspace(-1,3,5,base=3):
[  0.33333333,   1.        ,   3.        ,   9.        ,  27.        ]

Shape Manipulation

The reshape methods lets you (almost) arbitrarily change the shape of an array - the total number of elements is all that has to remain the same. Often, what you'd rather do is transpose, which simply reverses the strides and the shape if you don't pass it any additional arguments (it can do much more complicated things if you pass additional arguments; I never do). Finally, swapaxes is an easy way to do just what it suggests - swap two dimensions:

>>> a = numpy.arange(24)
>>> print a
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
>>> b = a.reshape(4,6)
>>> print b
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]]
>>> c = b.transpose()
>>> print b.strides
(48, 8)
>>> print c.strides
(8, 48)
>>> print c
[[ 0  6 12 18]
 [ 1  7 13 19]
 [ 2  8 14 20]
 [ 3  9 15 21]
 [ 4 10 16 22]
 [ 5 11 17 23]]
>>> d = c.reshape(24)
>>> print d
[ 0  6 12 18  1  7 13 19  2  8 14 20  3  9 15 21  4 10 16 22  5 11 17 23]
>>> e = a.reshape(2,3,-1)
>>> print e.shape
(2, 3, 4)
>>> print e
[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]
>>> f = e.swapaxes(1,2)
>>> print f.shape
(2, 4, 3)
>>> print f
[[[ 0  4  8]
  [ 1  5  9]
  [ 2  6 10]
  [ 3  7 11]]

 [[12 16 20]
  [13 17 21]
  [14 18 22]
  [15 19 23]]]

Here's a few things to keep in mind:

  • All of these methods created new arrays that share memory with the old arrays, so editing one would change all the others.
  • They all returned new array objects, and didn't change the shape or strides of the original one. In general, there's no way to change the shape or strides of an existing array object.
  • The reshape method does its work by considering the array as if it were row-major, even if it isn't - note how it behaves with array c, which is a column-major array.
  • Passing -1 to reshape makes the shape of dimension whatever it needs to be in order to make the operation possible.

1D Array Indexing and Slicing

If you have a one-dimensional array, you can index it with a single number:

>>> import numpy
>>> a = numpy.arange(36)
>>> print a

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35]
>>> print a[3]
3

or with a range of numbers (slicing):

>>> print a[3:10]
[3 4 5 6 7 8 9]

or with a range and a step value:

>>> print a[2:15:4]
[ 2  6 10 14]

You can also assign values using the same syntax:

>>> a[3:6] = -a[3:6]
>>> a[8:10] = 2
>>> print a
[ 0  1  2 -3 -4 -5  6  7  2  2 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35]

The first assignment would work for regular Python list objects too (in fact, with lists, you could assign a new sequence with a different size, which would grow or shrink the list - but that doesn't work for arrays). The second assignment only works for numpy arrays. Another big difference is that slicing an array returns a subarray that shares memory with the original array, which isn't true for lists:

>>> b = a[12:18]
>>> b[1] = 100
>>> print b
[ 12 100  14  15  16  17]
>>> print a
[  0   1   2  -3  -4  -5   6   7   2   2  10  11  12 100  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35]

Range values don't have to be positive numbers; you can also use negative numbers to count backwards from the end of an array:

>>> print a[-4:-1]
[32 33 34]

and negative step values to go backwards:

>>> print a[24:20:-1]
[24 23 22 21]

You can also leave off start and stop values entirely to imply the beginning and end of the array:

>>> print a[:3]
[0 1 2]
>>> print a[32:]
[32 33 34 35]
>>> print a[:]

There are two other, totally different, ways of indexing a 1d array. The first uses a mask, an array of boolean values:

>>> mask = numpy.zeros(a.shape,dtype=bool)
>>> mask[25:30] = True
>>> mask[18] = 1   # this works too!
>>> print mask
[False False False False False False False False False False False False
 False False False False False False  True False False False False False
 False  True  True  True  True  True False False False False False False]
>>> print a[mask]
[18 25 26 27 28 29]
>>> a[mask] = 300
>>> print a
[  0   1   2  -3  -4  -5   6   7   2   2  10  11  12 100  14  15  16  17
 300  19  20  21  22  23  24 300 300 300 300 300  30  31  32  33  34  35]

But here's an important difference: masking-based indexing doesn't share memory, because you can't represent a mask using strides:

>>> c = a[mask]
>>> c[5] = -500
>>> print c
[ 300  300  300  300  300 -500]
>>> print a
[  0   1   2  -3  -4  -5   6   7   2   2  10  11  12 100  14  15  16  17
 300  19  20  21  22  23  24 300 300 300 300 300  30  31  32  33  34  35]

You can still set values using a mask, but you have to do it using the original array.

You can also index an array using an array of integers that represent indices:

>>> indices = numpy.array([12,15,16,29,30,32])
>>> d = a[indices]
>>> print d
[ 12  15  16 300  30  32]

Just as with masks, integer-sequence indexing creates arrays that don't share memory with the original array.

WARNING: if you try to do mask or integer-sequence indexing using list or tuple indexes instead of arrays, it might work. But only sometimes. So don't do it.

Multidimensional Array Slicing

There's almost nothing new here - to index an n-dimensional array, you just use a comma-separated combination of any of the above slicing and indexing tricks. You can mix and match all you like. If you only use range+step slicing, you'll get views into the original array, but if you have even one dimension indexed with masks or integer sequences, you'll get a copy. If one or more dimensions is indexed with a single integer, you'll get a lower-dimensional array as a result. Here are some examples:

>>> import numpy
>>> a = numpy.arange(36).reshape(2,3,6)
>>> print a
[[[ 0  1  2  3  4  5]
  [ 6  7  8  9 10 11]
  [12 13 14 15 16 17]]

 [[18 19 20 21 22 23]
  [24 25 26 27 28 29]
  [30 31 32 33 34 35]]]
>>> print a[0,:,0]
[ 0  6 12]
>>> print a[::-1,2,3:]
[[33 34 35]
 [15 16 17]]
>>> print a[:,:,numpy.array([1,4])]
[[[ 1  4]
  [ 7 10]
  [13 16]]

 [[19 22]
  [25 28]
  [31 34]]]

If you don't provide enough dimensions in your indexing, additional dimensions will be included in their entirety:

>>> print a[1]   # == a[1,:,:]
[[18 19 20 21 22 23]
 [24 25 26 27 28 29]
 [30 31 32 33 34 35]]

You can also add an ellipsis to do the same thing, but it can be put in other places (this is getting pretty obscure now):

>>> print a[...,2]   # == a[:,:,2]
[[ 2  8 14]
 [20 26 32]]

Much more usefully, you can index multidimensional arrays using multidimensional masks (and probably somehow using multidimensional integer sequences, but that's too arcane even for me):

>>> a = numpy.arange(24).reshape(4,6)
>>> print a
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]]
>>> mask = numpy.ones(a.shape,dtype=bool)
>>> mask[2,4] = False
>>> print mask
[[ True  True  True  True  True  True]
 [ True  True  True  True  True  True]
 [ True  True  True  True False  True]
 [ True  True  True  True  True  True]]
>>> print a[mask]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 17 18 19 20 21 22 23]

Aggregate and Sorting Methods and Functions

Arrays have lots of useful methods, like sum, min, max, mean, var, and std. These do exactly what you'd expect. They each take an axis argument that specifies which axis to perform the operation on; if axis is None, the operation is done on the whole array (as if it was one-dimensional).

If you have a boolean array, the all and any methods can be useful (and also fairly obvious).

There are also function versions of most of these methods (you can do numpy.sum(a) or a.sum()), and there's at least one function of this sort - median - that doesn't exist as a method. The function equivalent of all is numpy.alltrue. This duplication is mostly just for historical and backwards-compatibility reasons - Numeric did things one way, numarray did them the other way, and numpy does it both ways.

Unfortunately, it's also not entirely consistent what the default value for the axis argument is; sometimes it's None, like you'd expect, and sometimes it's 0.

Another similar function that can be quite useful, especially in testing programs, is numpy.allclose, which tests if all elements of two arrays are almost equal (using some floating point tolerance).

You can sort an array (in place) using the sort method, and return an array of indices that sorts an array using argsort:

>>> import numpy
>>> a = numpy.random.randn(10)
>>> print a
[ 1.30102988  0.34831477  0.71065923 -2.07220959 -0.0725881   2.0016034
  0.43410089 -0.47016325  1.3060398   0.74733232]
>>> i = a.argsort()
>>> print i
[3 7 4 1 6 2 9 0 8 5]
>>> print a[i]
[-2.07220959 -0.47016325 -0.0725881   0.34831477  0.43410089  0.71065923
  0.74733232  1.30102988  1.3060398   2.0016034 ]
>>> a = numpy.random.randn(5,6)
>>> print a
[[ 0.13202621 -0.42429597  2.44904165  0.10057947 -0.94205505 -0.49324308]
 [ 0.85227262  1.37048771 -0.73553781  0.09690443  0.30165756  0.35072008]
 [-0.6708106   0.06437029 -1.15906355  1.68817108  0.82277211  0.3891728 ]
 [ 0.30915437 -1.12519311 -0.00829458 -0.25003101  1.13302081  1.08519259]
 [ 0.66832066 -0.83966953  0.24794564  0.66588412  0.00338669 -0.58215323]]
>>> a[0,:].sort()   # sorts the first row only, in-place
>>> print a
[[-0.94205505 -0.49324308 -0.42429597  0.10057947  0.13202621  2.44904165]
 [ 0.85227262  1.37048771 -0.73553781  0.09690443  0.30165756  0.35072008]
 [-0.6708106   0.06437029 -1.15906355  1.68817108  0.82277211  0.3891728 ]
 [ 0.30915437 -1.12519311 -0.00829458 -0.25003101  1.13302081  1.08519259]
 [ 0.66832066 -0.83966953  0.24794564  0.66588412  0.00338669 -0.58215323]]
>>> a.sort(axis=1)  # sorts each rows separately
>>> print a
[[-0.94205505 -0.49324308 -0.42429597  0.10057947  0.13202621  2.44904165]
 [-0.73553781  0.09690443  0.30165756  0.35072008  0.85227262  1.37048771]
 [-1.15906355 -0.6708106   0.06437029  0.3891728   0.82277211  1.68817108]
 [-1.12519311 -0.25003101 -0.00829458  0.30915437  1.08519259  1.13302081]
 [-0.83966953 -0.58215323  0.00338669  0.24794564  0.66588412  0.66832066]]

Universal Functions

Much of the most useful functionality in numpy comes in the form of special objects called universal functions. These act like regular numeric functions when you use them on scalar numbers, but they automatically perform elementwise operations when they are given arrays as arguments. These elementwise operations don't just save you time writing for loops in Python; they're coded in C, so they run much faster than the equivalent Python for loop as well:

>>> import numpy
>>> a = numpy.arange(8,dtype=float)
>>> b = 2
>>> print numpy.log(a)
[       -Inf  0.          0.69314718  1.09861229  1.38629436  1.60943791
  1.79175947  1.94591015]
>>> print numpy.log(b)
0.69314718056
>>> print numpy.log(a)/numpy.log(b)
[       -Inf  0.          1.          1.5849625   2.          2.32192809
  2.5849625   2.80735492]

There are dozens of ufuncs, from standard arithmetic operations to inverse trigonometric functions. Unfortunately, the "automatic documentation" trick doesn't work correctly on ufuncs - if you do help numpy.tan, you get the same generic ufunc help that you get when you do help numpy.exp.

Binary ufuncs - operations that take two arguments - have some special methods. I'm only going to cover two of them, reduce and outer. The first turns the binary function into an aggregate:

>>> a = numpy.arange(1.0,8.1,dtype=float)
>>> print a
[ 1.  2.  3.  4.  5.  6.  7.  8.]
>>> print numpy.multiply.reduce(a)  # == 8!
40320.0

The outer method generates an outer product:

>>> a = numpy.arange(2,10,2)
>>> print a
[2, 4, 6, 8]
>>> b = numpy.arange(1,7)
[1 2 3 4 5 6]
>>> print numpy.multiply.outer(a,b)
[[ 2  4  6  8 10 12]
 [ 4  8 12 16 20 24]
 [ 6 12 18 24 30 36]
 [ 8 16 24 32 40 48]]

You can also simply add, multiply, subtract, divide, and exponentiate arrays using the regular +,*,-,/,** operators - these implicitly call the numpy.add, numpy.multiply, etc. ufuncs.

Comparison operators (<,>,<=,>=,==,!=) also use ufuncs, and return boolean arrays, which can then be used as masks. The following example creates a random array, then sets all the negative values to zero:

>>> a = numpy.random.randn(30)
>>> mask = a < 0
>>> a[mask] = 0.0

or, if you prefer, you can combine the last two lines:

>>> a[a<0] = 0.0

One notable set of standard operators that doesn't work on arrays are the boolean logic operators, which is what you'd expect to use to combine two masks:

>>> a = numpy.ones(5,dtype=bool)
>>> b = numpy.zeros(5,dtype=bool)
>>> mask = a and b   # throws an exception!

Instead, you have to explicitly call the ufuncs numpy.logical_and, numpy.logical_or, and numpy.logical_not with the arrays as arguments:

>>> mask = numpy.logical_and(a,b)

Another solution is to use the bitwise operators |, &, and ^. This is exactly equivalent, and looks prettier, as long as you're certain that you're using boolean arrays. But if you try it with integer arrays that have values besides 0 and 1, you might get a surprise:

>>> 1 & 2
0

Its a matter of preference to use the ugly, safe method or the pretty, mildly dangerous one.

Broadcasting

Binary ufuncs don't even have to be given arrays with the same shape; there are a complex set of rules about what different shapes can be "broadcast" to other, compatible shapes. You can read up on those rules if you'd like - but there's an easier solution. Consider the following situation:

>>> a = numpy.random.randn(4,5)
>>> b = numpy.random.randn(5)
>>> c = a*b   # this works; every column of a is multiplied by b
>>> b = numpy.random.randn(4)
>>> c = a*b   # raises an exception!

Enter numpy.newaxis:

>>> a = numpy.random.randn(4,5)
>>> b = numpy.random.randn(5)
>>> c = a * b[numpy.newaxis,5]   # the old way worked, but this makes it more clear what we're doing
>>> b = numpy.random.randn(4)
>>> c = a * b[4,numpy.newaxis]   # this wasn't even possible before

All you have to do is index the axis you want to replicate with numpy.newaxis.

Some Extras

There are a few miscellaneous things in numpy that don't fit into the above categories:

  • numpy.dot and its fancier cousin, numpy.tensordot do matrix multiplication and dot products (you may have noticed that the regular multiplication operator does not do this).
  • numpy.matrix is a special subclass of numpy arrays that is restricted to 2d shapes. They do perform matrix multiplication when you use the multiplication operator, and they have special methods like .det() (determinant) and attributes like .H] (Hermitian transpose).
  • numpy.meshgrid takes two 1d x and y coordinate arrays, and returns a pair of 2d x and y grid arrays. This is very useful if you want to evaluate a ufunc expression on a grid:
    >>> x = numpy.arange(-4,5)
    >>> y = numpy.arange(-6,8,2)
    >>> X,Y = numpy.meshgrid(x,y)
    >>> print X
    [[-4 -3 -2 -1  0  1  2  3  4]
     [-4 -3 -2 -1  0  1  2  3  4]
     [-4 -3 -2 -1  0  1  2  3  4]
     [-4 -3 -2 -1  0  1  2  3  4]
     [-4 -3 -2 -1  0  1  2  3  4]
     [-4 -3 -2 -1  0  1  2  3  4]
     [-4 -3 -2 -1  0  1  2  3  4]]
    >>> print Y
    [[-6 -6 -6 -6 -6 -6 -6 -6 -6]
     [-4 -4 -4 -4 -4 -4 -4 -4 -4]
     [-2 -2 -2 -2 -2 -2 -2 -2 -2]
     [ 0  0  0  0  0  0  0  0  0]
     [ 2  2  2  2  2  2  2  2  2]
     [ 4  4  4  4  4  4  4  4  4]
     [ 6  6  6  6  6  6  6  6  6]]
    >>> print (X**2 + Y**2)
    [[52 45 40 37 36 37 40 45 52]
     [32 25 20 17 16 17 20 25 32]
     [20 13  8  5  4  5  8 13 20]
     [16  9  4  1  0  1  4  9 16]
     [20 13  8  5  4  5  8 13 20]
     [32 25 20 17 16 17 20 25 32]
     [52 45 40 37 36 37 40 45 52]]
    

Note that the order convention used by meshgrid is that y is dimension 0 and x is dimension 1. Just think "(rows,columns)", and it will be easier to remember. Unfortunately, you get the opposite when you use the outer method of a ufunc:

>>> print numpy.add.outer(x**2,y**2)
[[52 32 20 16 20 32 52]
 [45 25 13  9 13 25 45]
 [40 20  8  4  8 20 40]
 [37 17  5  1  5 17 37]
 [36 16  4  0  4 16 36]
 [37 17  5  1  5 17 37]
 [40 20  8  4  8 20 40]
 [45 25 13  9 13 25 45]
 [52 32 20 16 20 32 52]]

numpy also has a number of subpackages that provide useful functionality. In particular, you might want to check out

  • numpy.random, which has a bunch of random number generators, and can create custom arrays filled with numbers drawn from common distributions (you've already seen me use it some above), and
  • numpy.linalg, which provides access to the most commonly used LAPACK algorithms for things like SVD, linear least squares, and other "standard" linear algebra problems.

We'll talk about another package, numpy.recarray, and numpy's histogramming abilities, when we talk about catalogs and plotting in later lessons.