November 19, 2012

Python for Data Analysis

Python for Data Analysis; Wes McKinney. O'Reilly Media, October 2012
[Review copy kindly provided gratis by O'Reilly Media]


This book is a welcome addition to the Python canon, and anyone who is at all concerned with data analysis would do well to read it. From the opening chapter it is a clear and concise explanation of how to deploy Python for such tasks. Early on the author makes the point that Python should be thought of primarily as a “glue” language, so the book makes no attempt to show you how to program analytical methods in Python but rather demonstrates through practical examples how to use the existing extremely effective tool chain already available.

After a few brief motivating examples you will read one of the clearest expositions of the benefits of IPython I have seen. Since reading this chapter I have become convinced of the benefits of IPython Notebook and have opened an account with NotebookCloud*, which I feel will be a real boon in my teaching work.

The next chapter covers NumPy basics, followed by an introduction to the pandas library. You are then taken in detail, and with copious practical examples, through data cleanup and various important transformations, plotting and visualization, data aggregation and grouping, time series and finally financial applications.

The final chapter on advanced NumPy provides the kind of look under the hood that will help those interested to extract maximum efficiency from the package and understand how best to take advantage of more advanced features. The closing Appendix, “Python Language Essentials,” is a concise introduction to the language that should provide even inexperienced Python users with sufficient information to understand the examples.

The writing is clear throughout, with numerous examples that enliven and illuminate the discussions. For those whose mantra is "show me the code," the code is here in spades. Flipping through the book you find very few pages that do not contain Python code (and many of them are displaying the output from Python code).

Wes McKinney took the sensible decision to focus on Python 2.7, which is what is currently deployed by the vast majority of the scientific and analytical communities. The quality of the writing and the recent welcome news that matplotlib** is now ported to Python 3 will, I am sure, make a Python 3 update a popular choice in coming years.

If you want to understand how you can use Python as an analytical tool then I heartily recommend this book.

* An Amazon Web Services account is required to use this service, and you will be charged for your computations
** This means the standard "Scientific tool chain" is now available for Python 3

November 1, 2012

Building New Communities


[This post is based on a keynote talk to the PyCon DE conference on November 1, 2012]

[UPDATE: Nov 17, 2012: The video of this talk is at http://pyvideo.org/video/1479/building-new-communities]

Good afternoon, and thank you for inviting me. It’s always intimidating to talk to an audience which is so technically capable. Today, though, I am hoping that we can look at a non-technical topic that has implications for everyone who is interested in Python: how do we promote the Python language and improve the development process?

DO EXISTING COMMUNITIES WORK?
Let me begin by saying that I have invested huge amounts of time in the Python community, and I believe that it has a well-deserved reputation as one of the more welcoming open source communities. So today I am talking more broadly than just Python. What I say applies, I believe, to many open source communities.

This talk is being given at an unprecedented time in human history. Clearly these things are a matter of interpretation, but nowadays it’s hard to find someone who doesn’t agree that the modern world is failing to meet the needs of many of its inhabitants.

Historically, we are living in strange times. The world human population reached 7 billion, depending on which figures you choose to believe, between October 2011 and March 2012, only twelve years after it reached 6 billion. According to one UN estimate it’s possible that population will reach ten billion before reaching any kind of stability, and that population in Africa could triple in this century. However you interpret these figures it is obvious that most governments are going to be struggling with population growth in a world where they are already finding it difficult to manage things to everyone’s satisfaction.

So I have a lot of questions, and not many answers. Being a geek, my approach attempts rationality—we have to take into account what we know about human behavior, differences between the various national and political cultures in which people live, and so on. In an increasingly crowded world it is easy to feel threatened by governmental and economic systems that are posited on continuous growth.

When I was young an organization called the Club of Rome produced a report called “The Limits to Growth,” positing that the finite nature of many resources could, if growth were not checked, lead to cataclysmic failures in supply of the very things that keep us alive: clean air, clean water, and nutritious food. This is where climate change deniers got their start, and yet from recent work it appears that the report was, within its limitations, a very accurate projection of certain futures.

My own vision of the future tends, I must admit, towards the catastrophic. I think it is entirely possible that the human race will wipe itself out simply by continuing to foul its own nest and ignoring the toxic environment it is creating. So I see it as important to try and bring this to people’s attention, and to do what I can to ensure that we (the human race) build a sustainable future for ourselves and the other species we have to share this huge ball of mud we call Earth with.

HEY, ISN'T THIS PYCON?
All of this might seem a very long way from our little open source world. I can already hear people thinking that I am abusing my position as a keynote speaker to stand on a political soap box and peddle my socialistic ideas. I admit I am a socialist, and those who know me also know that I am happy to stand on a soap box at times. But this is not one of those times.

Context, however, is everything. So I want to place my remarks today in the context of a world that isn’t working, that isn’t equitable, and in which ever larger divisions are opening up: between the poor and the rich in almost any country you like to name, between poor countries and rich countries, and so on. Huge amounts of resource are being wasted on machinery to wipe each other out. This is particularly noticeable in my current home country, the USA, which has 11% of the world’s population and yet is responsible for over 40% of all military expenditures, more than the next 14 top-spending countries put together. One might wonder what they are afraid of, but these expenditures certainly give them clout in the geopolitical sphere.

Open source communities, the Python community included, have growth problems of their own. According to Amit Deshpande and Dirk Riehle the number of lines of open source code is growing exponentially. This is a management problem of some magnitude, which is not being addressed at the moment. They suggest that open source applications generated revenues of $1.8 billion in 2006, and certainly growth has continued unimpeded since. Of course that $1.8bn represented only about three-quarters of one percent of total software revenues, so there is still a long way to go.

Tim O'Reilly made a sensible case this year that open source is responsible for increasing small business revenues of a single hosting company’s user base by $180 billion. And, of course, many proprietary software products nowadays contain significant open source components.


GET PAID TO BUILD STUFF?
I have said many times that the emergence of open source represents a new economic phenomenon. It’s as though, when railways were needed in the United States, a bunch of people had said “You know, that’s a great idea. Let’s go build a railway system,” taken up their picks and shovels and headed off to work. Of course you can’t build a railway without access to heavy engineering, which is arguably difficult to open source. Similarly, most individuals would not find it cost-effective to make their own computers from scratch, or to interconnect them in the way that the infrastructure of the Internet does.

Just the same, the emergence of Linux as a powerful force in the software world, and the somewhat anarchic, less-than-formal management of the whole Internet (despite the belief of many US politicians that the Internet should be managed by the US as a divine right) should convince us that it’s possible for individual and group initiatives to take hold and prosper in today’s rather strained commercial environments.

I understand that open source is not the exclusive domain of unpaid experts. Nowadays a lot of companies are paying programmers and other staff to assist with the creation of open source software projects. Many different models are used to try and recoup the investments made, but the fact remains: it is now commonly accepted that there is value to creating common infrastructure whose presence benefits a large proportion of the community rather than a single individual or organization. Of course such ideas often seem, particularly in the USA, socialistic and therefore suspect, but it is still true that large companies like HP, IBM and Microsoft are enthusiastically involved in myriad open source projects. They aren't generally thought of as having socialist tendencies.


HOW IS SOFTWARE BUILT?
http://www.multunus.com/2011/02/software-development-as-a-balance

Here’s a fairly typical example of the “software development lifecycle.” It lists a large number of activities and requirements, which deserve to be looked at in detail.

Whether all these activities are strictly necessary or not could be a pleasant discussion over drinks, but I get the definite impression that many open source projects aren’t engaged in any kind of formal process. This freedom, of course, is one of the things that attracts many people into the open source world, and I would be the last person to suggest that unnecessary work should be involved in software production. Nevertheless, open source projects are predominantly run by programmers, and often all project members are programmers.


WHO BUILDS SOFTWARE?
http://www.ambysoft.com/essays/agileRoles.html
This is a typical conception of the way software is developed in the agile development world. There are many different ways to think about the development process. Arguably, none of them is specifically right or wrong. But when you think about commercial projects the size of Python, there will be many more people involved than just programmers. So Python, like most other open source projects, has to press developers into service to fill the other roles, or leave them unfilled.

We need to think about whether this works to the ultimate advantage of our projects, or whether it is really a rather inefficient way to produce software.


SOME DEVELOPER ROLES


Above are just some of the roles that a large-ish software development requires. I would argue that lots of the time projects focus largely or even solely on the programming task, with some design work. Many of the other roles do not offer attractive work for developers, who therefore ignore the need to fill them, often to the project’s great detriment.

A prime example here is the Python documentation. Good as it is in comparison with many other open source projects (and it is good, because there has been a lot of hard work put in on it), it still sucks. Its organization is mostly based on a fifteen-year-old structure, back when Python was a smaller and somewhat different language. I still have no idea why, when I want documentation on the Python standard types (int, list, dict, and so on) I have to look in the Standard Library documentation. They are available without any need for an import statement, so why are they documented with the library?

Of course anyone who has used Python seriously for more than a year knows these things, and can usually find out where to look for the information they need, but this is entirely the wrong emphasis. The first need for documentation is when you are learning a language, and then the information you need should be readily available. Who is the advocate for a rewrite or reorganization of the documentation? Nobody in the current development team relishes such a task, because we don’t have technical authors and editors as a part of our community. And yet this issue causes learners way more pain than it should, and even turns people away from the language.

This brings me to my central thesis: if we don’t broaden our community to include such people we are unlikely to achieve the world domination we all jokingly feel that Python could achieve. I know from my work over the last eighteen years, and from visiting Python gatherings like this one all over the world, that Python is poised for massive growth. It would be possible, if a concerted effort was made, to make Python the first programming language of choice for literate students. The world has woken up to the fact that computational thinking is an important skill in today’s complex societies, and some feel that Python is its ideal expression.

So, do we want to slam the doors, say “No, we like things the way they are, we aren’t going to change”? I suspect that some developers will feel that way. But as a PSF director I have to be mindful of our duty to promote the development of the language and grow the international community, and so I have to force myself to look for deficiencies that can be rectified. This means looking outwards, and trying to include people who can fill the necessary roles.

The alternative, of course, is either to ignore the need for a role to be filled or to have someone fill it whose time would be better spent writing code. Neither is an entirely satisfactory solution.

BEING MORE INCLUSIVE
Sarah Novotny, one of the OSCON conference chairs, wrote of her experience at the 2011 conference: "We must challenge the idea that if someone really wants to use a piece of software, he or she will be willing to slog through half-written documentation, the actual code base and an unkind user interface." 

This is an approach that I have tried to promote to Python developers, with rather limited success. I can think of several reasons why this should be so. Here are some classic signs that the status quo is being defended:
  • Considering the cost of switching before you consider the benefits
  • Highlighting the pain to a few instead of the benefits for the many
  • Focusing on short-term costs instead of long-term benefits, because the short-term is more vivid for you
  • Embracing sunk costs
We need to try and broaden our development community to include people with the necessary skills, even though we aren't necessarily very comfortable getting close to them. This makes sense to me as a historical perspective: back in "the old days", using an open source project was highly likely to involve you as a contributor as well. Nowadays there are many other roles for contributors to fill: bug reporters, technical writers, bloggers, testers, and I am sure all you core Python developers have your own ideas for roles that could reduce your workload.

David Eaves of Mozilla corporation has done a lot of work on how software communities can use metrics to improve their development process and monitor community involvement. He asks relevant questions like “why are bugs in this section of the code taking twice as long to be reviewed?” and “who has contributed consistently over the last 18 months, but not in the last 30 days?”

If we are prepared to answer questions like that we can take the information, both qualitative and quantitative, and then use it to continue to improve our communities.
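As an illustration only, here is a minimal sketch of how the second of those questions might be answered from a project's contribution log. It is not taken from David Eaves's work or from any real tool: the function name lapsed_contributors, the (name, date) record format, and above all the definition of "consistently" (activity in at least six distinct calendar months of the window) are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta


def lapsed_contributors(contributions, today,
                        window_days=548, quiet_days=30, min_months=6):
    """Return names active in the window but silent in the last quiet_days.

    contributions is an iterable of (name, date) pairs (hypothetical format).
    "Consistent" is an assumption: activity in at least min_months distinct
    calendar months during the window (548 days is roughly 18 months).
    """
    window_start = today - timedelta(days=window_days)
    quiet_start = today - timedelta(days=quiet_days)

    active_months = defaultdict(set)  # name -> {(year, month), ...}
    recent = set()                    # names seen during the quiet period

    for name, when in contributions:
        if when < window_start or when > today:
            continue  # outside the period we care about
        if when >= quiet_start:
            recent.add(name)
        else:
            active_months[name].add((when.year, when.month))

    return sorted(
        name for name, months in active_months.items()
        if len(months) >= min_months and name not in recent
    )


# Made-up sample data: alice stopped in August, bob is still active,
# carol only ever contributed twice.
today = date(2012, 11, 1)
sample = (
    [("alice", date(2012, m, 15)) for m in range(1, 9)]
    + [("bob", date(2012, m, 15)) for m in range(1, 11)]
    + [("carol", date(2012, 3, 1)), ("carol", date(2012, 4, 1))]
)

print(lapsed_contributors(sample, today))  # ['alice']
```

The thresholds are deliberately parameters rather than constants, because what counts as "consistent" or "recent" will differ from one project to the next.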

INVOLVING THE COMMUNITY
We must accept that not every member of the community wants to be involved outside a fairly narrow specialism. This is fine with me: I would much rather have contented community members contributing their skills than seeing members pressed to leave their comfort zone.

At the same time, there is considerable evidence that an inability to contribute is a source of frustration to would-be active members. So if we want to grow our community we need to be welcoming and, dare I say, open. We need to make sure that the barriers to entry into our community are as low as possible, and that people whose ideas may not yet be well-informed do not find a hostile environment when they join.

The forthcoming revision of the python.org web site is an attempt to democratize the content and lower the barriers to making a contribution. The intention is to allow contributors to directly affect the content in their area of competence, without requiring mediation from a "webmaster" team who are familiar with the gory internal technical details of the site. More such initiatives are required.

REACHING OUT
So, it's important to reach out to the people with the skills to make our community, and our software, better: more user-friendly, more robust, better documented, and so on. The existing community does a fantastic job, but think how much better things could be if we added new skills by growing the community outside its current demographic. While one of the advantages of open source, to many participants, is its lack of corporate rules and regulations, we nevertheless should accept that we do face many of the same problems that businesses do. If we can find new and creative solutions to those problems then that's great. If we can't, however, we shouldn't necessarily ignore the solutions that have already been developed in more traditional environments.

Besides wanting to increase the size of the existing Python community, I would also like to see the Python world reaching out more determinedly to groups supporting specific application areas such as accounting, law, the conference industry, the print industry, the transport industry, the aerospace industry, and so on.

The scientific community is a stellar example of the sort of collaborations I envisage. There, the community members are mostly not interested in Python for its abstract beauty, and don't have any ambition or desire to contribute to the core. They use Python because it has been demonstrated to be an effective tool for solving their problems. We need to remember that's the motivation of almost all users outside the development community.

In my secret life as a conference organizer I am using Python in the shape of Eldarion's Symposion project. While it does currently have many failings, my approach is to use it intensively, and feed back the problems to the developers for fixes. As funds become available we intend to invest in improvements that will make our job as conference organizers easier and more effective. I am hoping that eventually, once we have together built a project that non-specialist users can be happy with, Eldarion and The Open Bastion will find a market among the small to medium-sized conferences, and another industry will start to fall under Python's sway.

Repeating such efforts across many industries is, in my opinion, the best way to make Python popular, and its many existing successes are a stable platform on which to base this work. This will involve effort, since we have to explain open source to people who are not as familiar as we are with the milieu, and whose interest in it is limited to the areas in which it can help them get their jobs done.

TAKING RESPONSIBILITY
Python is a mature technology, and it's time to consider it, if not exactly a finished work, at least as something that can be presented as a feature-complete tool for use in many different application areas. The UK government, having acknowledged that its computer education has gone through a period in the "dark ages," has accepted that computational thinking is a core skill in today's world. It looks as though Python will be adopted aggressively as a teaching language. I hope that success will be mirrored in many other countries.

If we are prepared to engage others with different skill sets from our own, and listen to their needs, we can build new communities who will complement our existing skill set and help us promote Python world wide in a multitude of application areas. The benefits for the Python community, though, are only a tiny portion of the gains to be made.

I mentioned at the start of my talk that the world is demonstrably broken. Central governments have neither the resources nor the imagination nor the flair to apply global solutions to local problems. It seems self-evident to me that the best way to solve local problems is locally, and so we need to persuade these governments to release some of their control and their funding to the communities with the power to solve those problems.

Eric Sterling gave a very engaging keynote at DjangoCon this year in which he suggested that it was important for a much broader cross-section of the population to make their needs known. As a group who only have to see an itch to start thinking about how to scratch it, I think that open source devotees are in a position to provide some valuable answers to the question of how we fix some of today's complex problems. This means leaving their comfort zone, reaching out and engaging with government and commerce at all levels, and I want to persuade open source communities to use their considerable skills to do so. The world is broken, but I think with determination and goodwill we can help to fix it.

This is an exciting time to be a Python programmer!