Python Tutorial 3: Reading Data: PyFITS and Catalogs

PyFITS

Image Data and More Import Internals

Here's how to load a FITS image as a numpy array:

>>> import pyfits
>>> hdu_list = pyfits.open("examples/image.fits")
>>> hdu = hdu_list[0]
>>> image = hdu.data

The only slightly confusing part about this is the third line. Remember that FITS files can have multiple extensions - even if there's only one, pyfits.open still returns a one-element list of HDUs. Of course, if you wanted other extensions, you'd use numbers other than 0.

image is now a full-fledged 2D numpy array:

>>> print type(image)
<type 'numpy.ndarray'>

If all you want is the 2D numpy array of pixel values, simply use:

>>> import pyfits
>>> image = pyfits.getdata("examples/image.fits")

Its shape is (NAXIS2,NAXIS1) - that's (y,x). This is pretty standard for image data, but it can cause a lot of confusion if you forget. And just like all numpy arrays, the first pixel is image[0,0], not image[1,1]. That means you'll have to subtract one from all your catalog coordinates, because those will typically use the FITS convention.
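To make the two conventions concrete, here's a small sketch (the array is random noise standing in for a real image):

```python
import numpy

# An array standing in for real image data; FITS NAXIS1=300, NAXIS2=200
# gives a numpy shape of (200, 300) -- that's (y, x), not (x, y).
image = numpy.random.randn(200, 300)

# A catalog position in the 1-indexed (x, y) FITS convention...
fits_x, fits_y = 10, 25

# ...indexes the numpy array as [y - 1, x - 1]:
pixel = image[fits_y - 1, fits_x - 1]
```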

But if you try to do some typical operations on it, you might notice something interesting:

>>> image.mean()
3.814697265625e-06
>>> numpy.median(image)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'numpy' is not defined

Well, obviously. We haven't done import numpy yet. But then how did we get a numpy array? Notice:

>>> print pyfits.np
<module 'numpy' from '/usr/lib/python2.5/site-packages/numpy/__init__.pyc'>

There it is! As you might expect, PyFITS imported numpy itself, under the name np. We could now do

>>> pyfits.np.median(image)
0.0004119873046875

...but it's much better just to import numpy yourself. That won't cause numpy to be reloaded; it will just create a new name for the module that's already loaded.
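You can check this yourself: Python caches imported modules in sys.modules, so a second import just binds another name to the same module object:

```python
import sys
import numpy

# numpy was already imported (by pyfits, or by us); this does not reload it.
import numpy as np_again

# Both names refer to the very same module object...
same_object = numpy is np_again

# ...and it's the one cached in sys.modules.
cached = sys.modules["numpy"] is numpy
```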

You might also have noticed that the first three lines are very fast, and the last one is the one that takes some time. That's because PyFITS is lazy - it only loads the image data when you explicitly ask for it. And when it does, it loads the whole thing into memory at once, which takes a while (there's a way to load only a section, too, but I'll let you look that up on your own). By the way, if you ever want to unload all that image data without closing the file, you can do it like this

>>> del image
>>> del hdu.data

...that is, just delete all the objects that reference it - including the data attribute of the HDU. You'll notice that the data is indeed gone (if you try, for instance, tab completion in ipython), but if you insist that you do indeed want it, it comes right back:

>>> image = hdu.data

In fact, it wasn't there before we asked for it the first time either. That's a neat trick Python has: dynamic attribute access. We'll learn more about it when we do Object-Oriented Python.
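As a preview, here's a minimal sketch of the trick. The LazyHDU class and its names are made up for illustration - this is not PyFITS's actual implementation:

```python
class LazyHDU(object):
    """Toy stand-in for an HDU that loads its data only on first access."""

    def __getattr__(self, name):
        # __getattr__ is called only when normal attribute lookup fails,
        # i.e. when self.data hasn't been loaded (or has been deleted).
        if name == "data":
            value = self._load_data()
            self.data = value        # cache it as a real attribute
            return value
        raise AttributeError(name)

    def _load_data(self):
        # A real HDU would read from the file here.
        return [0.0] * 4

hdu = LazyHDU()
first = hdu.data      # triggers _load_data
del hdu.data          # unload; the next access reloads it
second = hdu.data
```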

Headers

The advantage of lazy loading is that it makes it much more efficient to edit just the headers of files. PyFITS is just about the easiest way you can imagine to edit FITS headers, but it can occasionally be a little counter-intuitive.

We can get a FITS header object from the header attribute of the HDU (there's a separate header for each extension):

>>> hdr = hdu.header

The header object behaves a lot like a dictionary:

>>> print hdr["NAXIS"]
2
>>> hdr["CRPIX1"] = 50.3

...but in some ways it isn't quite a dictionary:

>>> print hdr.keys()  # raises an exception
>>> hdr["new_key"] = "new_value" # also raises an exception

Some of these, like the first, are unfortunate oversights. Others, like the second, reflect the fact that a FITS header is more than a dictionary: every key can have a comment as well as a value, and there's also a well-defined order to the keys. To add a new key to a FITS header, you use the update method:

>>> hdr.update("new_key","new_value",comment="a new FITS header key")

You can also retrieve values by number:

>>> print hdr[1]  # == BITPIX
-32

And there's a totally different way of approaching the header, as a dict/list of "cards":

>>> print hdr.ascard[1]
BITPIX  =                  -32 / Bits per pixel
>>> print hdr.ascard["BITPIX"]
BITPIX  =                  -32 / Bits per pixel

You can do all sorts of things with card objects:

>>> card = hdr.ascard["BITPIX"]
>>> print card.key
BITPIX
>>> print card.value
-32
>>> card.value = 32

...and with the card list itself:

>>> hdr.ascard.append(pyfits.Card("new_key_2",50,comment="a new key"))

There's more, but I'll let you explore that for yourself, since that's what I'd have to do in order to teach it to you.

By the way, changing BITPIX (or NAXIS) in the header doesn't actually do anything useful in PyFITS - when you save a FITS file, those header keys are taken directly from the data type and shape of the image data.
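The dtype-to-BITPIX correspondence comes from the FITS standard (positive codes for integers, negative for IEEE floats); a sketch of it as a plain dict:

```python
import numpy

# FITS BITPIX codes for common numpy dtypes: the magnitude is bits per
# pixel, and a negative sign marks floating point.
BITPIX = {
    numpy.dtype(numpy.uint8): 8,
    numpy.dtype(numpy.int16): 16,
    numpy.dtype(numpy.int32): 32,
    numpy.dtype(numpy.float32): -32,
    numpy.dtype(numpy.float64): -64,
}

image = numpy.zeros((10, 10), dtype=numpy.float64)
bitpix = BITPIX[image.dtype]   # -64 for a float64 image
```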

Saving and Creating FITS Files

If you want to edit a FITS file in-place, open it with the "update" flag.

>>> import pyfits
>>> hdu_list = pyfits.open("examples/image.fits",mode="update")
>>> hdu_list[0].header.update("bkg",50.0,comment="synthetic background")
>>> hdu_list[0].data += 50.0
>>> hdu_list[0].flush()

If you want to save it as something else, use the writeto method of the HDU list:

>>> import pyfits
>>> hdu_list = pyfits.open("examples/image.fits")
>>> hdu_list[0].header.update("bkg",45,comment="synthetic background")
>>> hdu_list[0].data -= 5.0
>>> hdu_list.writeto("examples/image_45.fits")

To create a new FITS file, you can create a new PrimaryHDU object from a generic numpy array

>>> import pyfits
>>> import numpy
>>> image = numpy.random.randn(400,600)
>>> hdu = pyfits.PrimaryHDU(data=image)
>>> hdu.writeto("examples/noise.fits")

Note that the PrimaryHDU object has a writeto method, just like the HDUList objects we were dealing with earlier. This is just for convenience; it creates a temporary, one-element HDUList and calls writeto on that. You can also create a new HDUList from the PrimaryHDU yourself (which gives you the option of appending additional ImageHDU, BinTableHDU, or TableHDU objects). You'll also notice that the new image has a very empty header:

>>> print hdu.header
SIMPLE  =                    T / conforms to FITS standard                      
BITPIX  =                  -64 / array data type                                
NAXIS   =                    2 / number of array dimensions                     
NAXIS1  =                  600                                                  
NAXIS2  =                  400                                                  
EXTEND  =                    T                                                  

If you're creating a new image as a copy of another, you can pass a header argument to PrimaryHDU to copy an existing FITS header:

>>> hdu = pyfits.PrimaryHDU(data=image,header=hdr)

Like I said before, the NAXIS and BITPIX keywords will automatically be reset from the properties of the image data. Note, however, that the WCS information will not be changed, even if the new image is a section of the old one. PyFITS treats WCS keys just like any other FITS header keys - if you want to update the coordinate system on a cutout, you have to do it yourself.
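For a simple axis-aligned cutout, the fix amounts to shifting the reference pixel. Here's a sketch with a plain dict standing in for the header - real WCS handling has more cases (rotation, distortion) than this:

```python
# A plain dict standing in for a FITS header with a simple WCS.
hdr = {"CRPIX1": 100.0, "CRPIX2": 200.0}

# Suppose the cutout is image[y0:y1, x0:x1] (numpy 0-indexed):
x0, y0 = 30, 50

# The reference pixel shifts by the cutout's lower-left corner.
hdr["CRPIX1"] -= x0
hdr["CRPIX2"] -= y0
```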

Catalogs

Imagining A Catalog Data Structure

Given Python's built-in types and numpy arrays, how could we represent the way we think about catalog data?

  • We'd like to be able to access fields by name, like a dictionary.
  • We'd like to be able to access records by iterating over them, like a list.

That suggests a list-of-dict or dict-of-list:

  • With a list-of-dict, it's easy to access the data by record, and to add, delete, or filter records we can use list comprehensions. But it's also a little too flexible - we can have different fields in each record, because each one is a separate dictionary.
  • With a dict-of-list (or dict-of-numpy.ndarray), we have the opposite flexibility problem - we can have some columns that are longer than others, and record-based access is a pain. But column-based access is easy, which is great for plotting one column against another, or for creating or deleting columns.
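Concretely, the two layouts for the same tiny catalog:

```python
# List-of-dict: easy per-record access, but nothing stops the records
# from having different fields.
list_of_dict = [
    {"ra": 53.04, "dec": -28.06},
    {"ra": 52.93, "dec": -28.10},
]

# Dict-of-list: easy per-column access, but nothing stops the columns
# from having different lengths.
dict_of_list = {
    "ra":  [53.04, 52.93],
    "dec": [-28.06, -28.10],
}

# Record-based filtering with a list comprehension:
above = [rec for rec in list_of_dict if rec["ra"] > 53.0]

# Column-based access:
ra_column = dict_of_list["ra"]
```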

At this point, if you've been using numpy for a while, you're thinking, "Ooh! I know the right answer! It's a matrix!" Indeed, a big numpy 2D array is not a bad solution at all:

  • We can access and extract rows and columns efficiently, and the regularity of the data is nicely enforced in both directions.
  • There's no extra overhead.
  • Filtering rows is even easier, because we can use boolean arrays and fancy slicing.
  • Best of all, there are already plenty of well-written routines that take an ASCII file and load it into an array, like pylab.load in Matplotlib.
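For example, with a plain 2D array (columns ordered, say, ra then dec - an ordering we have to remember ourselves):

```python
import numpy

# A tiny catalog as a 2D array: one row per record, columns are
# (ra, dec) -- the order exists only in our heads.
catalog = numpy.array([
    [53.04, -28.06],
    [52.93, -28.10],
    [53.25, -27.80],
])

# Boolean-array filtering of rows: keep records with ra > 53.
subset = catalog[catalog[:, 0] > 53.0]
```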

Arrays-as-catalogs does have some downsides, though:

  • It's still a pain to add new rows or columns, though deleting them (or rather, slicing them away) is easy.
  • Catalogs with integer fields have those integers turned into floats.
  • It's impossible to read a catalog with string fields unless you skip those fields.
  • An array doesn't know about field names, so we have to remember the order of the columns ourselves.

But there's an even better solution.

PyFITS and FITS Binary Tables

A FITS table is one way of representing a catalog on disk. Let's see what PyFITS does when we load one of those up:

>>> import pyfits
>>> hdu_list = pyfits.open("examples/bintable.fits")
>>> hdu = hdu_list[1]    # table extensions can't be the first extension, so there's a dummy image extension at 0
>>> print hdu.header
XTENSION= 'BINTABLE'           / FITS Binary Table Extension                    
BITPIX  =                    8 / 8-bits character format                        
NAXIS   =                    2 / Tables are 2-D char. array                     
NAXIS1  =                  516 / Characters in a row                            
NAXIS2  =                63501 / No. of rows in table                           
PCOUNT  =                    0 / Parameter count always 0                       
GCOUNT  =                    1 / Group count always 1                           
TFIELDS =                  106 / No. of columns in table                        
                                                                                
TFORM1  = '1J      '           / Format of field                                
TDISP1  = 'I5      '           / Display format of field                        
TTYPE1  = 'Seq     '           / Field label                                    
TUNIT1  = '        '           / Physical unit of field                         
TNULL1  =           -214748364 / Null value for interger                        
TFORM2  = '1D      '           / Format of field                                
TDISP2  = 'R12.7   '           / Display format of field                        
TTYPE2  = 'ra      '           / Field label                                    
TUNIT2  = '        '           / Physical unit of field                         
TFORM3  = '1D      '           / Format of field                                
TDISP3  = 'S12.6   '           / Display format of field                        
TTYPE3  = 'dec     '           / Field label                                    
TUNIT3  = '        '           / Physical unit of field                         
<...>

It has a regular FITS header object, describing the columns...what about the data?

>>> table = hdu.data
>>> print table[0]
(1, 53.046552601898092, -28.064823980261512, 4746.4746, 1.5713711, 25.897972, 0.2597,
 25.500559, 0.33903122, 24.816772, nan, nan, nan, 24, 0, nan, 'Galaxy', 1.1342041, ... )
>>> table.field('ra')
array([ 53.0465526 ,  52.93962625,  53.25214501, ...,  52.87808401,
        53.1489966 ,  53.13236932])
>>> subset = table[table.field('ra') > 53.0]
>>> print subset.field('ra')
array([ 53.0465526 ,  53.25214501,  53.36795776, ...,  53.32982377,
        53.1489966 ,  53.13236932])

What is this magical beast?! It has float, integer, and string fields, it supports record and field-based access and record-based slicing, it returns columns as numpy arrays, it enforces regularity, and it remembers the names of columns!

>>> print type(table)
<class 'pyfits.NP_pyfits.FITS_rec'>
>>> print isinstance(table,numpy.ndarray)
True

It's a numpy array! But obviously a very special one...notice what its dtype is:

>>> print table.dtype
[('Seq', '>i4'), ('ra', '>f8'), ('dec', '>f8'), ('x', '>f4'), ('y', '>f4'), ('Rmag', '>f4'),
 ('e_Rmag', '>f4'), ('Ap_Rmag', '>f4'), ('ApD_Rmag', '>f4'), ('mu_max', '>f4'),
 ('MajAxis', '>f4'), ('MinAxis', '>f4'), ('PA', '>f4'), ('phot_flag', '>i4'),
 ('var_flag', '>i4'), ('stellarity', '>f4'), ('MC_class', '|S16'), ('MC_z', '>f4'), ... ]
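You can build an array with a dtype like this yourself. A small sketch with made-up catalog values (note that a plain structured array supports table["ra"] indexing, unlike the PyFITS table class):

```python
import numpy

# A small structured dtype: an int field, two float fields, and a string.
dtype = numpy.dtype([("Seq", numpy.int32),
                     ("ra", numpy.float64),
                     ("dec", numpy.float64),
                     ("MC_class", "S16")])

table = numpy.array([(1, 53.04, -28.06, "Galaxy"),
                     (2, 52.93, -28.10, "Star")], dtype=dtype)

# Field access returns a regular numpy array of that column...
ra = table["ra"]

# ...and record access returns one row with all of its fields.
first = table[0]
```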

The Awesomeness That Is numpy.recarray

As we've just seen, numpy arrays can have fancy, C-struct-like data types. It turns out there's a special subtype of array, numpy.recarray, that's specifically designed for this sort of usage. The PyFITS table class doesn't actually use recarray, which is why you have to use syntax like table.field('ra'). If table were a recarray, you could use either of the following:

>>> ra_column = table['ra']
>>> ra_column = table.ra     # of course, this only works if the field name is a valid Python name (no spaces, dashes, etc.)

Note that a typical catalog array is one dimensional - the field dimension is special, and doesn't count, since it's considered part of the array data type. And of course, if you had a reason for it, you could have a multidimensional catalog - still only one field dimension, but as many record dimensions as you'd like.

There are still a few remaining downsides to recarray:

  • In order to add new records, you have to concatenate two arrays together, which is just as painful as it was with a float matrix catalog. Of course, deleting them is still just as easy as slicing them away.
  • Adding and deleting fields is a big pain - you essentially have to create a new table with a new dtype and copy the old one in column by column.
  • There's no standard routine for loading generic ASCII data into a recarray. (I have one that works for FIAT catalogs; just ask if you want it).
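The column-by-column copy for adding a field looks something like this. This is a sketch - add_field is a made-up helper for illustration, not a numpy function:

```python
import numpy

def add_field(table, name, dtype, values):
    # add_field is a hypothetical helper, not part of numpy.
    # Build a new dtype with the extra field appended...
    new_dtype = numpy.dtype(table.dtype.descr + [(name, dtype)])
    new_table = numpy.empty(table.shape, dtype=new_dtype)
    # ...copy the old columns over one by one...
    for field in table.dtype.names:
        new_table[field] = table[field]
    # ...and fill in the new one.
    new_table[name] = values
    return new_table

old = numpy.array([(1, 53.04), (2, 52.93)],
                  dtype=[("Seq", numpy.int32), ("ra", numpy.float64)])
new = add_field(old, "dec", numpy.float64, [-28.06, -28.10])
```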

In future tutorials, we'll create a custom subtype of recarray that solves these problems.

In the meantime, check out numpy.rec.fromrecords, and the documentation for numpy.dtype. It's a little complex, but it's worth learning (or you can just wait until next time).
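As a taste, numpy.rec.fromrecords builds a recarray straight from a list of tuples:

```python
import numpy

# Build a recarray from a list of records, naming the fields ourselves.
catalog = numpy.rec.fromrecords([(1, 53.04, -28.06),
                                 (2, 52.93, -28.10)],
                                names="Seq,ra,dec")

# Both dictionary-style and attribute-style access work on a recarray.
ra_by_key = catalog["ra"]
ra_by_attr = catalog.ra
```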