# Python Tutorial 2: Numpy

## Strided Arrays

Representing a one-dimensional array on a computer is an easy thing to figure out: it's just a contiguous block of memory. If you want some contiguous subarray, you just take a sub-block of that memory.

It's much trickier to represent a multidimensional array. Consider a
4x3 matrix. As you might know, in C, the ordering of the elements is
traditionally *row-major*, and it looks like this:

```
 0 |  1 |  2
 3 |  4 |  5
 6 |  7 |  8
 9 | 10 | 11
```

while in Fortran, it's *column-major*, and it looks like this:

```
 0 |  4 |  8
 1 |  5 |  9
 2 |  6 | 10
 3 |  7 | 11
```

If that's not complicated enough, consider the problem of transposing a matrix; clearly it's more efficient to change the storage order than to physically swap all the elements - so you can't even stick to one convention. Now imagine we wanted to pull out the upper-left 2x2 block. There's no way to do that without copying, unless you give up on having contiguous blocks of memory. And copying gets very ugly if you have very big arrays, which happens pretty often in astronomy.

Here's the solution Numpy uses:

- In addition to storing the memory block and the *shape* of the array, maintain another array of *strides*.

- The strides, like the shape, are a set of N integers, where N is the number of dimensions of the array. The dot product of the strides with an N-dimensional index gives the memory location of the element.

- For example: the row-major matrix has strides of (3,1) and the column-major matrix has strides of (1,4). The upper-left 2x2 submatrix of the row-major matrix also has strides of (3,1).

Usually you won't have to worry about the strides, but it's good to know. By the way, the strides in numpy count bytes, not numbers - so for an 8-byte double precision value, the (3,1) stride actually turns into (24,8).
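You can see all of this for yourself through the `strides` attribute. A quick sketch (the byte counts assume 8-byte doubles, numpy's default float type):

```python
import numpy

a = numpy.zeros((4, 3))       # 8-byte doubles, row-major by default
print(a.strides)              # (24, 8): 24 bytes to the next row, 8 to the next column
print(a.T.strides)            # (8, 24): transposing just swaps the strides, no copying
print(a[:2, :2].strides)      # (24, 8): the upper-left 2x2 block reuses the parent's strides
```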

Note that in all of this, I'm talking about indexing arrays with
numbers that start with zero, not one - that's the way it's done in C,
and that's the way it's *always* done in Python.

## History Lessons

Once upon a time, there was a Python extension module called **Numeric**.
Astronomers liked it, but they (mostly people at STSCI, I think) were
frustrated that it didn't perform better on large arrays, and they
wanted some more features. So **numarray** was created, as a new
`Numeric` for astronomy people and other folks who like big arrays. But
the original `Numeric` stuck around, and picked up new features
`numarray` didn't have.

**numpy** was created as a project to put the two back together, and
combine the best features of both. As of right now, the original
maintainers of `Numeric` have moved over completely to `numpy`, and
STSCI has encouraged people who use `numarray` to do the same. Both
projects are now unmaintained, and `Numeric` in particular has known
bugs that will likely *never* be fixed. Use `numpy`, and only `numpy`,
if you possibly can.

## Numpy!

A small warning: with great power comes great responsibility. In many
ways, numpy is a bit, well, perlish. It makes it very easy to write
concise, arcane things that are very useful but sometimes very hard to
read. Sometimes it even rewards those practices by optimizing for that
kind of code. But it can also make a lot of very simple and obvious `for` loops unnecessary in Python, and it does it by using `for`
loops compiled in C, which is what makes this whole practice of using
Python to play with astronomical data not a completely crazy idea.
Those of you coming from IDL or Matlab backgrounds know exactly what
I'm talking about.

### Creating Arrays

```python
import numpy

# this makes an array out of a nested sequence (can be nested tuples, nested lists, or a combination of them)
a = numpy.array([[5,3],[2,1]])
# now we'll make sure the data type is float instead of int (the previous example forced numpy to guess)
a = numpy.array([[5,3],[2,1]], dtype=float)
# here's a 4x5x2 int array full of zeros
a = numpy.zeros((4,5,2), dtype=int)
# here's a 3x7 float array full of garbage
a = numpy.empty((3,7), dtype=float)
# here's a 5-element float array full of ones (you can use a single number for the shape instead of a tuple)
a = numpy.ones(5, dtype=float)
```

By the way, unless you tell it otherwise, `numpy` will always create arrays with row-major strides.
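If you ever do want column-major storage, the array-creation functions accept an `order` argument. A small sketch (again assuming 8-byte doubles):

```python
import numpy

c = numpy.zeros((4, 3), order='F')   # 'F' for Fortran-style, column-major strides
print(c.strides)                     # (8, 32)
print(numpy.zeros((4, 3)).strides)   # (24, 8): the row-major default
```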

There's one more that's particularly useful, but only makes 1d arrays:

```python
>>> a = numpy.arange(1.5, 3.6, 0.5)   # 1.5, 2.0, 2.5, 3.0, 3.5
```

**arange** works almost like the built-in `range` and `xrange` functions - it takes arguments of (start, stop, step), but:

- `arange` returns an array
- `arange` works with `float`s as well as `int`s (you can use the `dtype` argument to ensure you get what you want)

Using it with `float`s is tricky, however, because floating point arithmetic isn't exact - it's a good idea to get in the habit of "hedging" your stop arguments to make sure you get what you expect.
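Here's what that hedging guards against - this sketch asks for values ending at 3.5 two ways:

```python
import numpy

a = numpy.arange(1.5, 3.5, 0.5)   # stop is exclusive, so 3.5 itself is NOT included
print(len(a))                     # 4 values: 1.5, 2.0, 2.5, 3.0
b = numpy.arange(1.5, 3.6, 0.5)   # a "hedged" stop a bit past 3.5
print(len(b))                     # 5 values, ending at 3.5
```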

For a similar effect, you can also use **linspace**, which takes arguments of (start, stop, num):

```python
>>> a = numpy.linspace(1.5, 3.5, 5)   # 5 evenly-spaced values from 1.5 to 3.5, inclusive
```

Or if you want evenly-spaced log values, **logspace**:

```python
>>> a = numpy.logspace(0.0, 2.0, 5)   # 5 values from 10**0 to 10**2, evenly spaced in the exponent
```

### Shape Manipulation

The **reshape** method lets you (almost) arbitrarily change the shape of an array - the total number of elements is all that has to remain the same. Often, what you'd rather do is **transpose**, which simply reverses the strides and the shape if you don't pass it any additional arguments (it can do much more complicated things if you pass additional arguments; I never do). Finally, **swapaxes** is an easy way to do just what it suggests - swap two dimensions:

```python
>>> a = numpy.arange(24)
>>> b = a.reshape(4, 6)        # shape (4, 6), row-major strides
>>> c = b.transpose()          # shape (6, 4), column-major strides
>>> d = c.swapaxes(0, 1)       # shape (4, 6) again
>>> e = c.reshape(3, 8)        # shape (3, 8) - but see the notes below
```

Here are a few things to keep in mind:

- All of these methods create new arrays that share memory with the old arrays, so editing one will change all the others.

- They all return *new* array objects, and don't change the shape or strides of the original one. In general, there's no way to change the shape or strides of an existing array object.

- The `reshape` method does its work by considering the array *as if it were row-major, even if it isn't* - note how it behaves with array `c`, which is a column-major array.

- Passing `-1` to `reshape` makes the shape of that dimension whatever it needs to be in order to make the operation possible.
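A short sketch of the last two points, using `reshape` with a `-1` dimension:

```python
import numpy

a = numpy.arange(24)
b = a.reshape(4, -1)          # numpy computes the -1 dimension: shape (4, 6)
print(b.shape)                # (4, 6)
c = a.reshape(2, -1, 4)       # works in any position: shape (2, 3, 4)
print(c.shape)                # (2, 3, 4)
b[0, 0] = 100                 # b shares memory with a...
print(a[0])                   # ...so a sees the change
```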

### 1D Array Indexing and Slicing

If you have a one-dimensional array, you can index it with a single number:

```python
>>> import numpy
>>> a = numpy.arange(36)
>>> print(a[4])
4
```

or with a range of numbers (*slicing*):

```python
>>> print(a[4:9])
[4 5 6 7 8]
```

or with a range and a step value:

```python
>>> print(a[4:16:3])
[ 4  7 10 13]
```

You can also assign values using the same syntax:

```python
>>> a[3:6] = -a[3:6]
>>> a[8:10] = 2
>>> print(a[:12])
[ 0  1  2 -3 -4 -5  6  7  2  2 10 11]
```

The first assignment would work for regular Python `list`
objects too (in fact, with lists, you could assign a new sequence with
a different size, which would grow or shrink the list - but that
doesn't work for arrays). The second assignment only works for numpy
arrays. Another big difference is that slicing an array returns a
subarray that shares memory with the original array, which isn't true
for `list`s:

```python
>>> b = a[12:18]
>>> b[1] = 100
>>> print(a[13])      # a sees the change, because b shares a's memory
100
```

Range values don't have to be positive numbers; you can also use negative numbers to count backwards from the end of an array:

```python
>>> print(a[-4:-1])
[32 33 34]
```

and negative step values to go backwards:

```python
>>> print(a[30:24:-1])
[30 29 28 27 26 25]
```

You can also leave off start and stop values entirely to imply the beginning and end of the array:

```python
>>> print(a[33:])          # from index 33 to the end
[33 34 35]
>>> print(a[:3])           # from the beginning up to index 3
[0 1 2]
```

There are two other, totally different, ways of indexing a 1d array. The first uses a *mask*, an array of boolean values:

```python
>>> mask = numpy.zeros(a.shape, dtype=bool)
>>> mask[25:30] = True
>>> mask[18] = 1            # this works too!
>>> print(a[mask])
[18 25 26 27 28 29]
```

But here's an important difference: masking-based indexing *doesn't share memory*, because you can't represent a mask using strides:

```python
>>> c = a[mask]
>>> c[5] = -500
>>> print(a[29])            # unchanged - c is a copy, not a view
29
```

You can still set values using a mask, but you have to do it using the original array.

You can also index an array using an array of integers that represent indices:

```python
>>> indices = numpy.array([12,15,16,29,30,32])
>>> d = a[indices]
>>> print(d)
[12 15 16 29 30 32]
```

Just as with masks, integer-sequence indexing creates arrays that don't share memory with the original array.

**WARNING:** if you try to do mask or integer-sequence indexing using `list` or `tuple` indexes instead of arrays, it might work. But only sometimes. So don't do it.

### Multidimensional Array Slicing

There's almost nothing new here - to index an n-dimensional array, you just use a comma-separated combination of any of the above slicing and indexing tricks. You can mix and match all you like. If you only use range+step slicing, you'll get views into the original array, but if you have even one dimension indexed with masks or integer sequences, you'll get a copy. If one or more dimensions is indexed with a single integer, you'll get a lower-dimensional array as a result. Here are some examples:

```python
>>> import numpy
>>> a = numpy.arange(36).reshape(2,3,6)
>>> print(a[1, 2, 3])          # a single element
33
>>> print(a[0, :, 2])          # a 1d sub-array
[ 2  8 14]
>>> print(a[0, 1:3, ::2])      # a 2d sub-array
[[ 6  8 10]
 [12 14 16]]
```

If you don't provide enough dimensions in your indexing, additional dimensions will be included in their entirety:

```python
>>> print(a[1])                # == a[1,:,:]
[[18 19 20 21 22 23]
 [24 25 26 27 28 29]
 [30 31 32 33 34 35]]
```

You can also add an ellipsis to do the same thing, but it can be put in other places (this is getting pretty obscure now):

```python
>>> print(a[..., 2])           # == a[:,:,2]
[[ 2  8 14]
 [20 26 32]]
```

Much more usefully, you can index multidimensional arrays using multidimensional masks (and probably somehow using multidimensional integer sequences, but that's too arcane even for me):

```python
>>> a = numpy.arange(24).reshape(4,6)
>>> print(a[a % 3 == 0])       # a 4x6 mask picks elements from the whole array
[ 0  3  6  9 12 15 18 21]
```

### Aggregate and Sorting Methods and Functions

Arrays have lots of useful methods, like **sum**, **min**, **max**, **mean**, **var**, and **std**. These do exactly what you'd expect. They each take an `axis` argument that specifies which axis to perform the operation on; if `axis is None`, the operation is done on the whole array (as if it was one-dimensional).
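For example, here's how the `axis` argument plays out for `sum` (a small sketch):

```python
import numpy

a = numpy.arange(12).reshape(3, 4)
print(a.sum())           # 66: axis=None reduces over the whole array
print(a.sum(axis=0))     # [12 15 18 21]: one sum per column
print(a.sum(axis=1))     # [ 6 22 38]: one sum per row
```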
If you have a boolean array, the **all** and **any** methods can be useful (and also fairly obvious).
There are also function versions of most of these methods (you can do `numpy.sum(a)` or `a.sum()`), and there's at least one function of this sort - **median** - that doesn't exist as a method. The function equivalent of `all` is `numpy.alltrue`. This duplication is mostly just for historical and backwards-compatibility reasons - Numeric did things one way, numarray did them the other way, and numpy does it both ways.
Unfortunately, it's also not entirely consistent what the default value for the `axis` argument is; sometimes it's `None`, like you'd expect, and sometimes it's `0`.
Another similar function that can be quite useful, especially in testing programs, is ** numpy.allclose**, which tests if all elements of two arrays are almost equal (using some floating point tolerance).
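A quick sketch of why `allclose` is handy - exact `==` comparisons fail on values that differ only by floating point roundoff:

```python
import numpy

a = numpy.array([0.1, 0.2, 0.3])
b = numpy.array([0.1, 0.2, 0.1 + 0.2])   # 0.1 + 0.2 is not exactly 0.3
print((a == b).all())                    # False: exact comparison is too strict
print(numpy.allclose(a, b))              # True: equal to within a small tolerance
```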

You can sort an array (in place) using the **sort** method, and get an array of indices that sorts an array using **argsort**:

```python
>>> import numpy
>>> a = numpy.random.randn(10)
>>> a.sort()                  # sorts a in-place
>>> i = a.argsort()           # indices that would sort a
>>> b = numpy.random.randn(4, 5)
>>> b[0].sort()               # sorts the first row only, in-place
>>> b.sort(axis=1)            # sorts each row separately
```

### Universal Functions

Much of the most useful functionality in numpy comes in the form of special objects called *universal functions*. These act like regular numeric functions when you use them on scalar numbers, but they automatically perform *elementwise* operations when they are given arrays as arguments. These elementwise operations don't just save you time writing `for` loops in Python; they're coded in C, so they run *much* faster than the equivalent Python `for` loop as well:

```python
>>> import numpy
>>> a = numpy.arange(8, dtype=float)
>>> b = 2
>>> c = numpy.sqrt(a)         # elementwise square root, computed in C
>>> d = a + b                 # the scalar is applied to every element
>>> print(d[7])
9.0
```

There are dozens of ufuncs, from standard arithmetic operations to inverse trigonometric functions. Unfortunately, the "automatic documentation" trick doesn't work correctly on ufuncs - if you do **help numpy.tan**, you get the same generic ufunc help that you get when you do **help numpy.exp**.
Binary ufuncs - operations that take two arguments - have some special methods. I'm only going to cover two of them, **reduce** and **outer**. The first turns the binary function into an aggregate:

```python
>>> a = numpy.arange(1.0, 8.1, dtype=float)   # 1.0 through 8.0
>>> print(numpy.multiply.reduce(a))           # == 8!
40320.0
```

The **outer** method generates an outer product:

```python
>>> a = numpy.arange(2, 10, 2)
>>> print(numpy.multiply.outer(a, a))
[[ 4  8 12 16]
 [ 8 16 24 32]
 [12 24 36 48]
 [16 32 48 64]]
```

You can also simply add, multiply, subtract, divide, and exponentiate arrays using the regular `+`, `*`, `-`, `/`, and `**` operators - these implicitly call the `numpy.add`, `numpy.multiply`, etc. ufuncs.
Comparison operators (`<`, `>`, `<=`, `>=`, `==`, `!=`) also use ufuncs, and return boolean arrays, which can then be used as masks. The following example creates a random array, then sets all the negative values to zero:

```python
>>> a = numpy.random.randn(30)
>>> mask = a < 0
>>> a[mask] = 0.0
```

or, if you prefer, you can combine the last two lines:

```python
>>> a[a < 0] = 0.0
```

One notable set of standard operators that doesn't work on arrays is the boolean logic operators, which are what you'd expect to use to combine two masks:

```python
>>> a = numpy.ones(5, dtype=bool)
>>> b = numpy.zeros(5, dtype=bool)
>>> mask = a and b            # throws an exception!
```

Instead, you have to explicitly call the ufuncs `numpy.logical_and`, `numpy.logical_or`, and `numpy.logical_not` with the arrays as arguments:

```python
>>> mask = numpy.logical_and(a, b)
```

Another solution is to use the bitwise operators `|`, `&`, and `^`. This is exactly equivalent, and looks prettier, as long as you're certain that you're using boolean arrays. But if you try it with integer arrays that have values besides 0 and 1, you might get a surprise:

```python
>>> 1 & 2
0
```

It's a matter of preference whether to use the ugly, safe method or the pretty, mildly dangerous one.
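Here's that surprise in array form - a sketch comparing the logical ufunc against the bitwise operator on non-boolean integers:

```python
import numpy

a = numpy.array([1, 2, 0, 3])
b = numpy.array([2, 2, 2, 0])
print(numpy.logical_and(a, b))   # [ True  True False False]: both values nonzero
print(a & b)                     # [0 2 0 0]: bitwise - note that 1 & 2 gives 0
```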

#### Broadcasting

Binary ufuncs don't even have to be given arrays with the same shape; there are a complex set of rules about what different shapes can be "broadcast" to other, compatible shapes. You can read up on those rules if you'd like - but there's an easier solution. Consider the following situation:

```python
>>> a = numpy.random.randn(4, 5)
>>> b = numpy.random.randn(5)
>>> c = a * b                 # this works; every row of a is multiplied elementwise by b
>>> b = numpy.random.randn(4)
>>> c = a * b                 # raises an exception!
```

Enter **numpy.newaxis**:

```python
>>> a = numpy.random.randn(4, 5)
>>> b = numpy.random.randn(5)
>>> c = a * b[numpy.newaxis, :]   # the old way worked, but this makes it more clear what we're doing
>>> b = numpy.random.randn(4)
>>> c = a * b[:, numpy.newaxis]   # this wasn't even possible before
```

All you have to do is index the axis you want to replicate with ** numpy.newaxis**.
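As a more concrete sketch, here's `newaxis` used to subtract each row's mean from that row:

```python
import numpy

a = numpy.arange(12, dtype=float).reshape(3, 4)
row_means = a.mean(axis=1)                    # shape (3,): one mean per row
centered = a - row_means[:, numpy.newaxis]    # (3, 4) minus (3, 1) broadcasts cleanly
print(centered.mean(axis=1))                  # each row now has zero mean
```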

### Some Extras

There are a few miscellaneous things in numpy that don't fit into the above categories:

- `numpy.dot` and its fancier cousin, `numpy.tensordot`, do matrix multiplication and dot products (you may have noticed that the regular multiplication operator does *not* do this).

- `numpy.matrix` is a special subclass of numpy arrays that is restricted to 2d shapes. They *do* perform matrix multiplication when you use the multiplication operator, and they have special attributes like `.I` (matrix inverse) and `.H` (Hermitian transpose).

- `numpy.meshgrid` takes two 1d x and y coordinate arrays, and returns a pair of 2d x and y grid arrays. This is very useful if you want to evaluate a ufunc expression on a grid:

```python
>>> x = numpy.arange(-4, 5)
>>> y = numpy.arange(-6, 8, 2)
>>> X, Y = numpy.meshgrid(x, y)
>>> print(X)
[[-4 -3 -2 -1  0  1  2  3  4]
 [-4 -3 -2 -1  0  1  2  3  4]
 [-4 -3 -2 -1  0  1  2  3  4]
 [-4 -3 -2 -1  0  1  2  3  4]
 [-4 -3 -2 -1  0  1  2  3  4]
 [-4 -3 -2 -1  0  1  2  3  4]
 [-4 -3 -2 -1  0  1  2  3  4]]
>>> print(Y)
[[-6 -6 -6 -6 -6 -6 -6 -6 -6]
 [-4 -4 -4 -4 -4 -4 -4 -4 -4]
 [-2 -2 -2 -2 -2 -2 -2 -2 -2]
 [ 0  0  0  0  0  0  0  0  0]
 [ 2  2  2  2  2  2  2  2  2]
 [ 4  4  4  4  4  4  4  4  4]
 [ 6  6  6  6  6  6  6  6  6]]
>>> print(X**2 + Y**2)
[[52 45 40 37 36 37 40 45 52]
 [32 25 20 17 16 17 20 25 32]
 [20 13  8  5  4  5  8 13 20]
 [16  9  4  1  0  1  4  9 16]
 [20 13  8  5  4  5  8 13 20]
 [32 25 20 17 16 17 20 25 32]
 [52 45 40 37 36 37 40 45 52]]
```

Note that the order convention used by **meshgrid** is that y is dimension 0 and x is dimension 1. Just think "(rows, columns)", and it will be easier to remember. Unfortunately, you get the opposite when you use the **outer** method of a ufunc:

```python
>>> R2 = numpy.add.outer(x**2, y**2)
>>> R2.shape                  # now x indexes dimension 0 and y indexes dimension 1
(9, 7)
```

`numpy` also has a number of subpackages that provide useful functionality. In particular, you might want to check out `numpy.random`, which has a bunch of random number generators, and can create custom arrays filled with numbers drawn from common distributions (you've already seen me use it some above), and `numpy.linalg`, which provides access to the most commonly used LAPACK algorithms for things like SVD, linear least squares, and other "standard" linear algebra problems.
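For instance, a quick sketch of `numpy.linalg.svd` (the variable names here are my own):

```python
import numpy

a = numpy.random.randn(5, 3)
u, s, vt = numpy.linalg.svd(a, full_matrices=False)
print(u.shape, s.shape, vt.shape)                 # (5, 3) (3,) (3, 3)
print(numpy.allclose(a, numpy.dot(u * s, vt)))    # True: the factors reconstruct a
```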

We'll talk about another package, **numpy.recarray**, and numpy's histogramming abilities, when we talk about catalogs and plotting in later lessons.