Astronomical Python Tutorial 1: Basics

http://imgs.xkcd.com/comics/python.png

What Python Is

  • Easy to write simple stuff in.
  • Not horrible to write complex stuff in.
  • Easy to read and understand.
  • Popular, both within astronomy and outside it. That matters because you can use other people's code instead of writing your own.
  • Interactive. This is great for debugging, and even better for one-time analysis tasks.
  • Free.

What Python Isn't

  • Fast. C-coded extension modules, like numpy, provide a nice way around this, but pure-Python number crunching isn't a great idea when speed matters.
  • Concise. Learn to love the whitespace, and putting things on separate lines, even when you didn't have to before.

The Python Environment

The Interpreter

  • Python code is interpreted by the interpreter, which is just a regular C program called python. Actually, python is usually a link to a version-specific interpreter, like python2.5. You can also run that explicitly.
  • Just like most other interpreted languages, Python code is partially compiled (on the fly) - this turns *.py files into *.pyc or *.pyo files. You can pretty much ignore them; they'll be recreated and used automatically whenever they're needed.
  • You can run Python interactively as well. Just run python, and you'll get a prompt that allows you to input statements line by line.
  • The latest version of Python (as of this writing) is Python 2.6; right now, it's still quite new, and many popular extension modules haven't yet been updated to work with it without a lot of warnings, but it's very compatible with Python 2.5, so I would use whichever of those comes by default on your machine.
  • A dramatically new version of Python, Python 3.0, will be coming out shortly. It won't be nearly as compatible with Python 2.x, and I would recommend only switching to it after major extension modules (mostly numpy) have made the switch, and proven that they've worked out the differences.
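
If you aren't sure which version the python command gives you on your machine, you can ask the interpreter itself. This is only a small sketch; it uses nothing beyond the standard sys module and should behave the same on any of the versions mentioned above.

```python
import sys

# sys.version_info is tuple-like; its first two entries are the major and
# minor version numbers (for example 2 and 5 for Python 2.5).
major, minor = sys.version_info[:2]
version_string = "%d.%d" % (major, minor)

print("Running Python " + version_string)
# sys.executable is the path of the interpreter that is actually running.
print("Interpreter executable: " + str(sys.executable))
```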

IPython

IPython is an extension module that dramatically improves the python interactive shell. It adds history, tab-completion, interactive help, and lots of other nifty features. Never use python interactively without it (unless you're running it inside gdb, which I hope, for your sake, you never have to do).

You can get it and read about it at http://ipython.scipy.org (it's also available in package form for most linux distributions).

Modules

  • Any Python file (a plain text file full of Python code, ending with a .py) can be treated as a script or as a module, though most files are designed to be used in either one way or the other.
  • A script is passed directly to the Python interpreter, as in python script.py.
  • A module is loaded within Python code using the import statement:
    >>> import numpy
    >>> print numpy.tan(numpy.pi/2)
    1.63317787284e+16
    
  • Much of numpy isn't written in Python at all; it's written in C. A C-coded module is just a special shared library (a .so on Linux machines). However, these libraries don't have to be placed in your dynamic library path, and they don't usually have the standard library ("lib") prefix.
  • Strictly speaking, import numpy doesn't load a shared library directly. It loads a Python package, which is just a directory (named numpy in this case) with a special file inside named __init__.py. When a package is imported, the __init__.py is evaluated. It is usually used to import regular modules that exist inside the package, or even additional subpackages.
  • There are a lot of variations on the import syntax you can use, especially for importing modules inside packages.
    >>> from numpy import *
    >>> from numpy import linalg
    >>> from numpy import linalg as la
    >>> import numpy as n
    >>> import numpy.linalg
    
    I would recommend using the regular import numpy syntax most of the time, and the others only when you really need to.
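
To see that these variations only change which names get bound, and not what gets loaded, here is a small sketch. It uses the standard-library os module in place of numpy purely so that it runs without anything installed; the alias names are arbitrary choices for illustration.

```python
import os.path                  # binds the name "os"; use os.path.join(...)
from os import path             # binds the name "path" directly
from os import path as osp      # same module, under a shorter alias
import os as operating_system   # the whole module under another name

# All of these names refer to the very same module objects.
same_module = (path is os.path) and (osp is os.path) and (operating_system is os)
print(same_module)

# With the plain import, it is always obvious where join comes from.
joined = os.path.join("catalogs", "goods.cat")
print(joined)
```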

Paths

  • When you import a module or a package, Python looks for it in a special list called sys.path. This is a regular Python list, and you can append and insert paths into it just like you would any other list (of course, to access it, you have to import sys). You'll note that the first entry is normally the current directory (or, when you run a script, the directory containing that script). Most of the paths in the list are defined by the configuration of your actual Python install, and third-party modules will by default install themselves to somewhere on that list. If you put Python modules in a non-standard location, you can modify sys.path before importing them.
  • You can also set an environment variable called PYTHONPATH to a colon-separated (at least on linux) list of paths. These paths will automatically be included at the top of sys.path (but below the current directory).
  • If you have multiple packages or modules with the same name in different places, you can find out which one you imported using the __file__ attribute of the module. Packages also have a __path__ attribute that points to the package directory.
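
The following sketch puts the points above together: it creates a throwaway directory, writes a tiny module into it, adds the directory to sys.path, and then checks where the module was loaded from. The directory prefix my_extra_modules and the module name catalog_tools are made up for this example.

```python
import os
import sys
import tempfile

# Create a temporary directory containing a one-line module.
# "my_extra_modules" and "catalog_tools" are hypothetical names.
module_dir = tempfile.mkdtemp(prefix="my_extra_modules")
module_file = open(os.path.join(module_dir, "catalog_tools.py"), "w")
module_file.write("ANSWER = 42\n")
module_file.close()

# Put the directory at the front of the search path, then import from it.
sys.path.insert(0, module_dir)
import catalog_tools

print(catalog_tools.ANSWER)
# __file__ records where the module was actually loaded from.
print(catalog_tools.__file__)
```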

Anatomy of a Python Script

This script takes one of the official GOODS ACS catalogs, and spits out a bunch of SQL INSERT statements that move the catalog into a relational database. It's fairly simple, and it illustrates a lot of basic Python features and philosophy.

  • The file name: note that it's a valid name for a Python object; that means you could also import this file as a module, even though it was designed to be used as a script.
  • The first line is a bang line, which works just the same as it does with any other script. Python ignores it as a comment, but if you execute the file directly the shell will ask the /usr/bin/env program to find an executable named python somewhere in your path, and execute it with the script as the argument.
  • Lines 3 and 10 are global (module-level variables). I find them most useful as places to put constants, because they are only evaluated once. Note that Python has no way of declaring, or enforcing that something be constant - if something should be constant, it's up to you to not edit it, and to make sure people who use your code know they shouldn't edit it. This is a big piece of Python philosophy: don't make it impossible for the user to do anything; instead trust them to behave the way they should.
  • Line 20 defines a function that does most of the work for this script. It takes a Python iterable, input_buffer, a writable buffer object output_buffer, a string bandpass, and a boolean flag skip_detect. None of those types or requirements are specified. The first few lines of the function check whether the bandpass argument is one of four values, and raise an Exception if it isn't. But no checking is done for the others. For input_buffer, this is intentional, and useful - because of the way it's used, input_buffer can be a file object, a list, array, or tuple of strings, or anything else which yields the expected sort of strings when iterated over.
  • The actual function code is preceded by a docstring, a plain string that describes the function's usage. Functions, modules, classes, and a few other types of objects can have docstrings (and generally should - this is what is used to create the help you get when you ask for it interactively). In this case, the string is enclosed in triple quotes, which allows the string to cover multiple lines and contain individual single and double-quotes. This isn't specific to docstrings, but it's a good practice for them.
  • The combinations of strings containing % placeholders followed by more %-signs outside the string itself are cases of string formatting. You can learn more about it here. One note: it's going to be replaced with a new way of doing things in Python 3.0, and that new way is available in Python 2.6 if you've got it and you don't care about running on Python 2.5 or earlier.
  • The indexing brackets with ranges of the form variable[m:n] indicate slicing. We'll cover it in much more detail in a later episode. For now, the important thing to remember is that the result starts with the first index, and goes up to just before (and does not include) the last index.
  • Adding lists appends them together into one big list.
  • The backslash on line 38 is used to "escape" the newline, and indicate that the statement will be continued on the next line. If a multi-line statement has line breaks between parentheses, such as in lines 3 or 46, you don't need to worry about putting backslashes in manually.
  • Note that a string literal, like ",", is actually an object in Python, so you can call methods on it, like ",".join(...).
  • The two separate strings on lines 46 and 47 demonstrate another way to join strings - two string literals, separated by nothing but whitespace, will be joined directly together. This happens when the source code is partially-compiled, not when it is run, so don't treat them like two separate strings in your code.
  • Writing to a buffer using the write method doesn't automatically append newlines ("\n") like the print statement does, and you have to call flush afterwards to force the writing to actually happen when you want (it will all be done eventually if you don't flush).
  • While most of the work is done by the function, the file works as a script because of what follows. __name__ is a special variable that equals "__main__" when the file is executed as a script, and the module's own name ("goods_db_upload" here) when it is imported as a module. This means you could, from another script, do import goods_db_upload and call goods_db_upload.generate_inserts(...) with arguments obtained some other way. All the stuff after line 57 involves parsing the command-line arguments and turning them into function arguments, and finally calling generate_inserts(...).
  • The try...except block is a simple case of exception handling; all it does is print the error message to STDERR and quit (a better example would show usage information). Of course, if an unhandled exception happens in a Python script, you get similar behavior (the program prints the message and quits), but it's usually a more complicated message with traceback information - better for debugging, but potentially worse for users who aren't you, if that ever matters.
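
Several of the features described above fit into a few lines. The sketch below is not the GOODS script itself; the function make_inserts, the helper class ListWriter, the table and column names, and the catalog values are all invented for illustration. It shows a docstring, a module-level constant, slicing, list addition, join, adjacent string literals, % formatting, and a simple try...except.

```python
import sys

VALID_BANDPASSES = ("b", "v", "i", "z")   # module-level "constant" (hypothetical values)


class ListWriter(object):
    """Hypothetical stand-in for a file: anything with write and flush will do."""

    def __init__(self):
        self.lines = []

    def write(self, text):
        self.lines.append(text)

    def flush(self):
        pass


def make_inserts(input_buffer, output_buffer, bandpass):
    """Write one SQL INSERT statement per whitespace-separated input line.

    input_buffer can be anything that yields strings when iterated over.
    """
    if bandpass not in VALID_BANDPASSES:
        raise ValueError("unknown bandpass: %s" % bandpass)
    for line in input_buffer:
        fields = line.split()
        # fields[1:3] is a slice: items 1 and 2, but not item 3.
        # Adding the two lists gives one list, which join turns into a string.
        values = ",".join([fields[0]] + fields[1:3])
        # Adjacent string literals are merged before the % formatting happens.
        output_buffer.write("INSERT INTO catalog_%s (id, ra, dec) "
                            "VALUES (%s);\n" % (bandpass, values))
    output_buffer.flush()


# A plain list of strings works just as well as an open file.
captured = ListWriter()
make_inserts(["1 53.12 -27.80 24.3", "2 53.15 -27.79 25.1"], captured, "v")
sys.stdout.write("".join(captured.lines))

try:
    make_inserts([], captured, "x")
except ValueError:
    # Report the problem on standard error instead of showing a traceback.
    sys.stderr.write("error: %s\n" % sys.exc_info()[1])
```

Because nothing in make_inserts checks types, passing sys.stdout or an open file as output_buffer would work equally well.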

Iterators

Iterating over the members of container objects is a big part of a lot of Python code. The basic formula looks like this:

for item in iterable:
   <do something with item>

Under the hood, this is basically the same as doing:

try:
   i = iter(iterable)
   while True:
      item = i.next()
      <do something with item>
except StopIteration:
   pass

The second one is much more complicated, but it illustrates a few things:

  • An iterator is just an object with a next method that returns the next item when there is one, and raises a special exception, StopIteration, when there isn't.
  • You can iterate over anything that returns an iterator when you call the iter function on it.
  • When you iterate over an object in a for loop, you never actually see the iterator itself.
  • You can't add, remove, or replace items in containers using iterators; the item you get is just a regular object that references the same object as the one in the container, and it knows nothing about the container itself.
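
As a rough illustration of these points, here is a hand-written iterator. The class name Countdown is made up; the __next__ alias is only there so that the same sketch also runs on Python 3, where the method has that name.

```python
class Countdown(object):
    """Hypothetical iterator that counts down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator simply returns itself when iter() is called on it.
        return self

    def next(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

    __next__ = next   # Python 3 spells the method __next__


# The for machinery calls iter() and next for us and stops on StopIteration.
counted = [n for n in Countdown(3)]
print(counted)

# Rebinding the loop variable does not touch the container.
numbers = [1, 2, 3]
for item in numbers:
    item = item * 10
print(numbers)
```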

The built-in types list, tuple, and str are sequences. They can be indexed by integers, and have a well-defined order when you iterate over them. While you can't change a tuple or str object at all, you can change a list in place by iterating over its indices:

for n,item in enumerate(list_object):
   list_object[n] = change(item)

enumerate is a special function that takes an iterable and turns its results into a new iterator that returns a 2-element tuple for each original item, where the first element of the tuple is a counter and the second is the item itself. Let that sink in; this means that the above code is equivalent to

for pair in enumerate(list_object):
   n = pair[0]
   item = pair[1]
   list_object[n] = change(item)

and

for pair in enumerate(list_object):
   n,item = pair
   list_object[n] = change(item)

The second line of the last example is called sequence unpacking, and it's what happens implicitly when you put two variables between for and in. But you can also do it outside of that context.
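
A few examples of sequence unpacking outside of a for loop (the coordinate values are arbitrary):

```python
position = (53.12, -27.80)
ra, dec = position              # unpack a tuple into two names
first, second, third = "abc"    # any iterable of the right length works
ra, dec = dec, ra               # swap two variables without a temporary
print("%s %s %s" % (first, ra, dec))
```

The number of names on the left has to match the number of items on the right; otherwise Python raises a ValueError.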

Another useful iterating function is zip, which lets you iterate over two or more sequences simultaneously:

for item1,item2 in zip(list1,list2):
   <do things with item1 and item2>

In fact, iterating over enumerate(obj) gives the same results as iterating over zip(xrange(len(obj)),obj). len is a function that returns the length of an object, and xrange is a function that returns a counting iterator (you can also specify a non-zero starting number, and a step number). range is another function that works just like xrange, but it returns an actual list of integers, rather than just an iterator, so it's less efficient if that list would be huge.
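
Here is a quick check of that equivalence, as a sketch with made-up data. It uses range rather than xrange, and wraps the results in list(), only so that the same lines also run on Python 3; on Python 2 the list() calls are harmless.

```python
names = ["b", "v", "i", "z"]
wavelengths = [435, 606, 775, 850]   # illustrative values only

# zip pairs up the items of two sequences, position by position.
pairs = list(zip(names, wavelengths))
print(pairs)

# enumerate pairs each item with a counter starting at zero.
numbered = list(enumerate(names))
same = numbered == list(zip(range(len(names)), names))
print(same)

# range (and xrange) accept a start, a stop, and a step.
evens = list(range(0, 10, 2))
print(evens)
```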

Python's built-in dict type is also iterable, but it's not a sequence. It doesn't have a meaningful order (though it isn't random), and when you iterate over the object itself you only get the keys:

>>> d = {"one":1,"two":2,"three":3}
>>> for k in d:
...    print k
...
three
two
one

You can also iterate over only the values (for v in d.values() or for v in d.itervalues()), or over key-value pairs (for k,v in d.items() or for k,v in d.iteritems()). All of the iter- versions return iterators, while the others return actual lists.
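
The sketch below shows the three kinds of dictionary iteration side by side. It sorts the results so the output doesn't depend on the dictionary's internal ordering, and it sticks to the non-iter methods so that it also runs on Python 3.

```python
d = {"one": 1, "two": 2, "three": 3}

keys = sorted(d)               # iterating over a dict yields its keys
values = sorted(d.values())    # just the values
pairs = sorted(d.items())      # (key, value) tuples

for k, v in pairs:
    print("%s -> %d" % (k, v))
```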

Generators

If you want to write your own iterator (or an iterator-returning function, like enumerate), the easiest way is with a special kind of function called a generator. The following example creates an iterator-returning function that appends "-iest" to each string in any iterable that is passed to it:

def iest_adder(iterable):
   for i in iterable:
      yield "%s-iest" % i

Every time this function uses the yield statement, it makes one iteration step. Amazingly, the result of calling this function is a genuine iterator object - it has a next method and it raises StopIteration when it runs out of items.
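
To watch the generator behave like any other iterator, you can step through it by hand. This sketch repeats the function from above and uses the built-in next() function, which needs Python 2.6 or later (on older versions you would call gen.next() instead); the input strings are arbitrary.

```python
def iest_adder(iterable):
    for i in iterable:
        yield "%s-iest" % i

gen = iest_adder(["happ", "sill"])
first = next(gen)     # runs the function body up to the first yield
second = next(gen)    # resumes where it left off
print(first + " " + second)

# Once the loop inside the generator finishes, StopIteration is raised.
try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)
```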

List Comprehensions

When your goal is to make one sequence out of another, you can use a special form of for loops called list comprehensions. These are actually more efficient than regular for loops, and they're much more concise:

squares = [x**2 for x in range(1,5)]

You can also add a boolean filter:

odd_squares = [x**2 for x in range(1,5) if x % 2]

(never mind that there are smarter ways to accomplish the same task).

In some cases, in particular in functions that expect iterables or iterators, you can leave off the square brackets for an iterator comprehension (officially called a generator expression):

colon_separated_integers = ":".join("%i" % i for i in xrange(10))

The advantage of an iterator comprehension is that it doesn't allocate a new list to store the temporary sequence, which is nice when that list might be huge.
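
For comparison, here is what the comprehensions above correspond to when written out as an ordinary loop, plus one more bracket-free example. This is only a sketch; the variable names squares_loop and total are new.

```python
# List comprehension ...
squares = [x**2 for x in range(1, 5)]

# ... and the equivalent explicit loop.
squares_loop = []
for x in range(1, 5):
    squares_loop.append(x**2)

# The filter keeps only the items for which the condition is true.
odd_squares = [x**2 for x in range(1, 5) if x % 2]

# Without brackets, sum consumes the items one at a time; no list is built.
total = sum(x**2 for x in range(1, 5))

print(squares == squares_loop)
print(total)
```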

More on Built-In Objects

Sequences

The main Python sequence types, list and tuple, are pretty obvious in what they're useful for. You might not even end up using tuples explicitly much at all; to a large extent they're useful for temporary constructs like sequence unpacking, and as I'll go into later, passing function arguments. After all, lists can do almost everything tuples can do (the main exception is serving as dictionary keys or set members), and they're mutable as well.

One thing to remember about lists is that while you can insert or remove elements at any point in the list, it's much more efficient to do both at the end of the list, especially if the list is particularly long (internally, lists work very similar to C++ vectors). Hopefully that sort of optimization shouldn't matter very often (or you should consider doing some of that work in C), but it might come up once in a while, especially when representing catalogs as lists-of-dicts or something similar.

Strings (str in Python) are also fully-fledged sequence types, but they are also pretty obvious in how they work. One twist is that strings are immutable - you cannot edit an individual string object, only create a new (possibly similar) object from it, and perhaps assign that new object to the same variable you used for an old one. One consequence of this, along with the way strings are allocated, is that concatenating long strings together is a slow way of doing things (use string formatting instead):

# don't do this:
very_long_string = long_string + "(" + longish_string + ")" + another_long_string + "."
# instead do this:
very_long_string = "%s(%s)%s." % (long_string,longish_string,another_long_string)

One final container type worth knowing about is set. Sets can be iterated over, but they can't be indexed, and they don't maintain their ordering. They enforce unique elements, and as you'd expect, they provide lots of nice set operations, like unions and intersections and differences. There's also an immutable set object called frozenset, which is useful because sets can only contain hashable objects, which in practice means immutable ones (so you can have a set of frozensets, but not a set of sets).
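
A short sketch of the set operations mentioned above, using invented object names:

```python
observed_b = set(["gal1", "gal2", "gal3", "gal2"])   # the duplicate is dropped
observed_z = set(["gal2", "gal3", "gal4"])

both = observed_b & observed_z      # intersection
either = observed_b | observed_z    # union
only_b = observed_b - observed_z    # difference

# Regular sets can't be members of a set, but frozensets can.
groups = set([frozenset(both), frozenset(only_b)])

print(len(observed_b))
print(sorted(either))
```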

Dictionaries

The built-in dict object is extremely useful. Dictionaries contain a set of key-value pairs, in which you can quickly look up a value given a key. They don't maintain their ordering, and there are some restrictions on what can serve as a dictionary key (immutable types are always safe, and mutable types are only safe if they don't define a custom equality operator). You can create dictionaries using the curly-brace syntax we used earlier:

d = {"one":1,"two":2,"three":3}

or by passing any number of different things to the dict constructor, such as a sequence of 2-tuples, or anything else that behaves "like" a dictionary:

d = dict([(s,i) for i,s in enumerate(("one","two","three"))])

Dictionaries are extremely flexible. If you have some complex data structure that you can't represent with a single dictionary, you can almost always represent it with a dictionary-of-dictionaries, a list-of-dictionaries, or a dictionary-of-lists. Sometimes it's worth putting in the extra effort to design a custom object for a data structure, but very often it's just easier to whip something up with containers-in-containers.

By the way, if you're (unfortunately, like me) the type who feels compelled to optimize the heck out of everything even when it's not necessary, it's worth noting that dictionaries are used internally to represent the attributes of custom Python objects and keyword arguments to Python functions. While it might seem like the flexibility of dictionaries must make them too slow, the fact is that dictionaries are all over the place - you really can't avoid them, even when you think you are - so just learn to love them.
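
As an example of the containers-in-containers approach, here is a sketch of a tiny catalog stored as a list-of-dictionaries, with a nested dictionary for the magnitudes. The field names and values are invented.

```python
# Each row is a dict; the magnitudes are a dict inside that dict.
catalog = [
    {"id": 1, "ra": 53.12, "dec": -27.80, "mag": {"v": 24.3, "z": 23.9}},
    {"id": 2, "ra": 53.15, "dec": -27.79, "mag": {"v": 25.1, "z": 24.8}},
]

# Select rows with a list comprehension.
bright_ids = [row["id"] for row in catalog if row["mag"]["v"] < 25.0]

# Re-index the same rows as a dictionary-of-dictionaries for fast lookup.
by_id = dict([(row["id"], row) for row in catalog])

print(bright_ids)
print(by_id[2]["mag"]["z"])
```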

File Objects

You can open a regular text file for reading in Python like this:

file_obj = open("filename.extension","r")   # second argument is optional, "r"==read is the default

While there are many ways to access the file data (you can look them up in the documentation) just iterating over the file is the simplest and usually the fastest, so it's worth doing if it fits your needs. Iterating over a file object splits the file into individual lines, and yields each line as a string.

You can write to a file object if you open it with the "w" (write) or "a" (append) modes. The main methods you'll use are write, which takes a single string, and writelines which takes a sequence of strings (note that neither one automatically adds newline characters).

The special files sys.stdin, sys.stdout and sys.stderr behave just like other file objects, as do the pipes you can get using the subprocess module. Other modules provide other file-like objects for other purposes, all with the same interface.
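
Here is a minimal round trip through a file, as a sketch: it writes three lines to a temporary file (the name example.txt is arbitrary) and then reads them back by iterating over the file object.

```python
import os
import tempfile

file_name = os.path.join(tempfile.mkdtemp(), "example.txt")

# Neither write nor writelines adds newlines, so include them yourself.
out = open(file_name, "w")
out.write("first line\n")
out.writelines(["second line\n", "third line\n"])
out.close()

# Iterating yields one line at a time, with the trailing newline attached.
lines = []
file_obj = open(file_name)
for line in file_obj:
    lines.append(line.rstrip("\n"))
file_obj.close()

print(lines)
```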

Essential Standard Library Modules

These all come with Python; I'm saving third party extensions for another day.

  • sys provides access to Python internals and configuration (like sys.path), standard special files (sys.stdin, sys.stdout, sys.stderr), program arguments (sys.argv), and other miscellaneous things.
  • os provides access to file and directory operations, pathname manipulation tools, and other operating system things. It also provides simple ways to launch other programs (os.system), but for more control you probably want to use the subprocess module.
  • shutil provides higher-level file and directory operations, like recursively copying, moving, or deleting directories.
  • glob provides shell-style globbing, for finding files that match certain patterns.
  • subprocess provides control for running subprocesses that you have to communicate with in one way or another. Other ways of doing this (for instance, os.popen) exist right now, but they'll go away with Python 3.0, while subprocess will stick around.
  • re provides regular expressions.
  • StringIO and its faster counterpart cStringIO provide file-like objects which are just blocks of memory, and can be turned into or created from strings.
  • pickle and its faster counterpart cPickle provide a way to serialize generic Python objects - turn them into strings, or write them to disk, and then load them back up later. Note that you can specify the pickling protocol: level 0 is text-based, and more backwards compatible, but level 2 is much faster (but binary, and not readable for Python < 2.3).
  • shelve uses pickle to store several objects into a file-based dictionary. It makes it very easy (and fairly fast) to save a lot of Python objects to disk and load them up later.
  • optparse provides advanced command-line option handling (so does getopt, but optparse is better).

There are many more (check out the Library Reference in the official Python documentation for a list).
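
To give a flavor of two of these modules, here is a small sketch using optparse and pickle. The option names, the file name goods.cat, and the record contents are made up, and the argument list is passed in explicitly so that the example doesn't depend on how it is launched (normally you would let parse_args read sys.argv).

```python
import optparse
import pickle

# Describe the options the (hypothetical) script accepts.
parser = optparse.OptionParser(usage="%prog [options] catalog_file")
parser.add_option("-b", "--bandpass", dest="bandpass", default="v",
                  help="bandpass to process")
parser.add_option("-s", "--skip-detect", dest="skip_detect",
                  action="store_true", default=False,
                  help="skip the detection step")

# options holds the parsed flags; args holds whatever is left over.
options, args = parser.parse_args(["-b", "z", "--skip-detect", "goods.cat"])
print(options.bandpass)
print(args)

# Serialize an object with protocol 2, then load it back.
record = {"id": 1, "mag": [24.3, 23.9]}
serialized = pickle.dumps(record, 2)
restored = pickle.loads(serialized)
print(restored == record)
```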

Other Resources

  • Interactive Python help: within IPython, just try help <object> on whatever you want to know about.
  • The official Python Documentation, which includes a tutorial, and the super-useful Library Reference: http://docs.python.org (the documentation is for the latest version, Python 2.6, but includes notes about what is different from previous versions).
  • Jim's copy of Python in a Nutshell (also a little bit outdated) lives in 512; feel free to drop by and borrow it for a few hours whenever you'd like.
