Cookbook

This is a respository for short and sweet examples and links for useful pandas recipes. We encourage users to add to this documentation.

This is a great First Pull Request (to add interesting links and/or put short code inline for existing links)

Selection

The indexing docs.

Indexing using both row labels and conditionals, see here

Use loc for label-oriented slicing and iloc positional slicing, see here

Extend a panel frame by transposing, adding a new dimension, and transposing back to the original dimensions, see here

Mask a panel by using np.where and then reconstructing the panel with the new masked values here

Using ~ to take the complement of a boolean array, see here

Efficiently creating columns using applymap

Missing Data

The missing data docs.

Fill forward a reversed timeseries

In [1]: df = pd.DataFrame(np.random.randn(6,1), index=pd.date_range('2013-08-01', periods=6, freq='B'), columns=list('A'))

In [2]: df.ix[3,'A'] = np.nan

In [3]: df

                   A
2013-08-01  0.469112
2013-08-02 -0.282863
2013-08-05 -1.509059
2013-08-06       NaN
2013-08-07  1.212112
2013-08-08 -0.173215

In [4]: df.reindex(df.index[::-1]).ffill()

                   A
2013-08-08 -0.173215
2013-08-07  1.212112
2013-08-06  1.212112
2013-08-05 -1.509059
2013-08-02 -0.282863
2013-08-01  0.469112

cumsum reset at NaN values

Data In/Out

Performance comparison of SQL vs HDF5

CSV

The CSV docs

read_csv in action

appending to a csv

Reading a csv chunk-by-chunk

Reading the first few lines of a frame

Reading a file that is compressed but not by gzip/bz2 (the native compresed formats which read_csv understands). This example shows a WinZipped file, but is a general application of opening the file within a context manager and using that handle to read. See here

Inferring dtypes from a file

Dealing with bad lines

Dealing with bad lines II

Reading CSV with Unix timestamps and converting to local timezone

Write a multi-row index CSV without writing duplicates

HDFStore

The HDFStores docs

Simple Queries with a Timestamp Index

Managing heteregenous data using a linked multiple table hierarchy

Merging on-disk tables with millions of rows

Deduplicating a large store by chunks, essentially a recusive reduction operation. Shows a function for taking in data from csv file and creating a store by chunks, with date parsing as well. See here

Appending to a store, while creating a unique index

Large Data work flows

Reading in a sequence of files, then providing a global unique index to a store while appending

Groupby on a HDFStore

Troubleshoot HDFStore exceptions

Setting min_itemsize with strings

Using ptrepack to create a completely-sorted-index on a store

Storing Attributes to a group node

In [5]: df = DataFrame(np.random.randn(8,3))

In [6]: store = HDFStore('test.h5')

In [7]: store.put('df',df)

# you can store an arbitrary python object via pickle
In [8]: store.get_storer('df').attrs.my_attribute = dict(A = 10)

In [9]: store.get_storer('df').attrs.my_attribute
{'A': 10}

Aliasing Axis Names

To globally provide aliases for axis names, one can define these 2 functions:

In [10]: def set_axis_alias(cls, axis, alias):
   ....:      if axis not in cls._AXIS_NUMBERS:
   ....:          raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
   ....:      cls._AXIS_ALIASES[alias] = axis
   ....: 

In [11]: def clear_axis_alias(cls, axis, alias):
   ....:      if axis not in cls._AXIS_NUMBERS:
   ....:          raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
   ....:      cls._AXIS_ALIASES.pop(alias,None)
   ....: 

In [12]: set_axis_alias(DataFrame,'columns', 'myaxis2')

In [13]: df2 = DataFrame(randn(3,2),columns=['c1','c2'],index=['i1','i2','i3'])

In [14]: df2.sum(axis='myaxis2')

i1   -0.499427
i2    0.966720
i3    0.174175
dtype: float64

In [15]: clear_axis_alias(DataFrame,'columns', 'myaxis2')