November 19, 2012

Python for Data Analysis

Python for Data Analysis; Wes McKinney. O'Reilly Media, October 2012
[Review copy kindly provided gratis by O'Reilly Media]

This book is a welcome addition to the Python canon, and anyone who is at all concerned with data analysis would do well to read it. From the opening chapter it is a clear and concise explanation of how to deploy Python for such tasks. Early on the author makes the point that Python should be thought of primarily as a “glue” language, so the book makes no attempt to show you how to program analytical methods in Python but rather demonstrates through practical examples how to use the extremely effective tool chain already available.

After a few brief motivating examples you will read one of the clearest expositions of the benefits of IPython I have seen. Since reading this chapter I have become convinced of the benefits of the IPython Notebook and have opened an account with NotebookCloud*, which I feel will be a real boon in my teaching work.

The next chapter covers NumPy basics, followed by an introduction to the pandas library. You are then taken in detail, and with copious practical examples, through data cleanup and various important transformations, plotting and visualization, data aggregation and grouping, time series and finally financial applications.

The final chapter on advanced NumPy provides the kind of look under the hood that will help those interested to extract maximum efficiency from the package and understand how best to take advantage of more advanced features. The closing Appendix, “Python Language Essentials,” is a concise introduction to the language that should provide even inexperienced Python users with sufficient information to understand the examples.

The writing is clear throughout, with numerous examples that enliven and illuminate the discussions. For those whose mantra is "show me the code," the code is here in spades. Flipping through the book you find very few pages that do not contain Python code (and many of those display the output from Python code).

Wes McKinney took the sensible decision to focus on Python 2.7, which is what is currently deployed by the vast majority of the scientific and analytical communities. The quality of the writing and the recent welcome news that matplotlib** is now ported to Python 3 will, I am sure, make a Python 3 update a popular choice in coming years.

If you want to understand how you can use Python as an analytical tool then I heartily recommend this book.

* An Amazon Web Services account is required to use this service, and that account will be charged for your computations
** This means the standard "Scientific tool chain" is now available for Python 3
