January 14, 2015
Puget Sound Python User Group, Jan 14, 2015
Tammy announced that the group was interested in reaching out to members to help them with their professional development. The group has a new logo designed by New Relic's Jen Pierce. It tweets as @ps_python, and they have a LinkedIn group "to make things less socially awkward." Tammy will be happy to talk to any members who want to get involved: pugetsoundpython at gmail dot com.
The first speaker was Carlos Guestrin, founding CEO of Dato (formerly GraphLab), who is also Professor of Machine Learning at UW. His talk was about recent developments in machine learning. Companies have been collecting data for a long time, but early attempts at analytics were not really addressing the real questions.
In talking to over 300 businesses with interests in machine learning, Carlos has seen again and again a failure to eliminate what he called "duct tape and spit" work. Building tailored Hadoop systems is expensive and time-consuming, so we need a way to scale Python processes to the standard big-data ecosystem.
Dato's mission was to empower people to handle all sorts of data sources at scale, even from a laptop. Users should never hit out-of-memory issues, and the GraphLab Create product allows them to build analytic tools that will move up to web scale with GraphLab Produce.
A first live demo was an image search, hosted on Amazon, matching with predictions made by a neural network running on Amazon infrastructure. An engineer who joined the team last October built this demo in a few days.
A second demo showed a recommender built from Amazon product data. It allowed two people to list their tastes and then found a way from a favored product of one user to a favored product of another.
Carlos is interested in allowing people to build their own applications, so he showed us an IPython Notebook that read in Amazon book review data. An interesting feature of GraphLab objects is that they have graphical representations in the Notebook, with the ability to drill down.
Carlos built and trained a book recommender with a single function call. This is all very well, but how do we scale such a service? The answer is to take the recommender model and deploy it as a service hosted on EC2 over a number of machines. The system is robust to the failure of single systems, and uses caching to take advantage of repeated queries to the RESTful API.
The underlying support infrastructure is written in "high-performance C++". The GraphLab components are based on an open source core, with an SDK that allows you to extend your code however you want. GraphLab Create is unique in its ability to allow a laptop to handle terabyte volumes. It includes predictive algorithms and scales well - Carlos showed some impressive statistics on desktop performance. TAPAD had a problem with 400M nodes and 600M edges; GraphLab handled it where Mahout and Spark had failed.
One of the key components is the SFrame, a scalable way to represent tabular data. These representations can be sharded, and can be run on a distributed cluster without change to the Python code. Another is the SGraph, which offers similarly scalable approaches to graph data.
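As a rough illustration only - this is not code from the talk, and the calls are from my recollection of the GraphLab Create API, so the names may well not match current releases - the kind of workflow Carlos demonstrated looks something like this:

import graphlab as gl

# SFrame: a disk-backed, columnar table that can be larger than available RAM
reviews = gl.SFrame.read_csv("amazon_book_reviews.csv")   # hypothetical input file
print(reviews.head())

# a single call builds and trains a recommender from the tabular data
model = gl.recommender.create(reviews, user_id="user", item_id="book")

# SGraph: the similarly scalable graph counterpart
g = gl.SGraph().add_edges(reviews, src_field="user", dst_field="book")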
The toolkits are organized as an abstract pyramid whose base is the algorithms and the SDK; upon this is layered a machine learning layer.
Deployment is very important. How can you connect predictive services to your existing web front-end? GraphLab allows you to serve predictions in a robust, scalable way.
Carlos closed by pointing out the Data Science and Dato Conference, whose fourth incarnation is in July in San Francisco.
Q: What's the pricing model?
A: You can use GraphLab Create for free; you can use Produce and other services either on an annual license or by the hour.
Q: How do you debug distributed machine learning algorithms?
A: It's hard. We've been working to make it easier, and Carlos pushes for what he calls "insane customer focus".
Q: Is SFrame an open source component?
A: Yes.
Q: What about training - how do you train your models?
A: We have made it somewhat easier by producing tools. The second layer, choosing models and parameters, is assisted by recent developments. The next release will support automatic feature search.
Q: Is the data store based on known technologies?
A: It isn't a data store, it's an ephemeral representation. Use storage appropriate to your tasks.
Q: Do you have time series components?
A: Yes, we are working on it, but we want to talk about what's interesting to potential customers.
Q: Do you have use cases in personalized medicine?
A: We've seen some interesting applications; one example reduced the cost of registering new drugs.
Erin Shellman works at Nordstrom Data Lab, a small group of computer and data scientists. She built scrapers to extract data from sports retailers. Erin reminded us of the problems of actually getting your hands on the data you want. Volumes of data are effectively locked up in DOMs.
The motivation for the project was to determine the optimum point at which to reduce prices on specific products. It was therefore necessary to extract the competitive data, which Erin decided to do with Scrapy. Erin likes the technology because it's adaptable and, being based on Twisted, super-fast.
Using code examples (available from a Github repository) Erin showed us how to start a web scraping project. The first principal component is items.py, which tells Scrapy which data you want to scrape. Extracting this kind of data means you really should look at the DOM, which can yield valuable information (one customer was including SKU-level product availability information in hidden fields!)
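For readers who haven't seen Scrapy, an items.py along the lines Erin described might look something like this (the field names are my invention, not hers):

import scrapy

class ProductItem(scrapy.Item):
    # one Field per piece of data the spider should capture
    brand = scrapy.Field()
    name = scrapy.Field()
    price = scrapy.Field()
    sku = scrapy.Field()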
The second component is the web crawler. Erin decided to start at one competitor's "brands" page, then follow the brand's "shop all" link, which got her to a page full of products. The crawl setup subclasses the Scrapy CrawlSpider class. The page parser uses yield to generate successive links in the same page. Erin pointed out that Scrapy also has a useful interactive shell mode, which allows you to test the assumptions on which you build your crawler.
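A minimal sketch of such a crawler - not Erin's actual code, which is in the repository linked below, and with an invented start URL and CSS selectors - might look like this in a recent Scrapy release:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BrandSpider(CrawlSpider):
    name = "brands"
    start_urls = ["http://www.example.com/brands"]   # hypothetical starting page

    # follow each brand's "shop all" link to its product listing
    rules = (
        Rule(LinkExtractor(restrict_css="a.shop-all"), callback="parse_products"),
    )

    def parse_products(self, response):
        # yield one record per product found on the listing page
        for product in response.css("div.product"):
            yield {
                "name": product.css(".name::text").get(),
                "price": product.css(".price::text").get(),
            }

The interactive shell Erin mentioned is started with scrapy shell <url>, which lets you try selectors like the ones above before committing them to the spider.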
Erin found that smaller brands didn't implement a "shop all" page, and for those cases she had to implement logic to scrape the data from the page where it would normally appear.
Erin's description of how to handle paginated multi-page listings showed that Scrapy will automatically omit already-processed pages, meaning it's safe to issue redundant requests in which the same page may appear multiple times in the results.
Erin underlined the necessity of cleaning the data as early in the processing chain as possible, and showed some code she had used to eliminate unnecessary data. Interestingly, Erin ended up collecting a lot more data than she had initially set out to look for. Disposition of the scraped data can be handled by a further component of Scrapy (item pipelines), which was not covered.
Most of data science is data wrangling. Erin's scraper code is at https://github.com/erinshellman/backcountry-scraper
Erin also pointed out that the next PyLadies Seattle is on Thursday, January 29th.
Q: If you repeat scrapes, is there support for discovering DOM structure changes?
A: Good question, I don't know. I assume there must be, but didn't feel I needed it.
Q: If you ran this job over the course of 10 minutes (with 40,000+ GET requests) do you need to use stealth techniques?
A: We haven't so far. Retailers aren't currently setting their web sites up to protect against such techniques.
Q: What's the worst web site you've come across to scrape?
A: I don't want to say; but you can tell a lot about the quality of the dev teams by scraping their pages.
Q: Does Scrapy have support to honor robots.txt?
A: No, but the crawler can implement this function easily.
Q: What techniques did you use to match information?
A: Pandas was very helpful in analysis.
Q: What about web sites with a lot of client-side scripting?
A: We didn't run into that problem, so didn't have to address the issues.
Trey Causey from Facebook rose after the interval to talk on Pythonic Data Science: from __future__ import machine_learning. Trey's side job is as a statistical consultant for an unnamed NFL team.
The steps of data science are
1. Get data
2. Build Model
3. Predict
Python is quickly becoming the preferred language of the data science world. NumPy has given rise to Pandas, whose DataFrame structure is based on it. Pandas can interface with scikit-learn.
A DataFrame offers fast ETL operations, descriptive statistics, and the ability to read in a wide variety of data sources. Trey showed a DataFrame populated with NFL data in which each row represented a play in a specific game, which had been processed to remove irrelevant events.
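As a small illustration of the kind of wrangling described (my own sketch, with invented file and column names rather than Trey's actual schema):

import pandas as pd

# each row is one play in one game; the file and columns are hypothetical
plays = pd.read_csv("nfl_plays.csv")
print(plays.describe())                              # quick descriptive statistics

# drop events that aren't real plays, then summarize per game
plays = plays[plays["play_type"] != "no_play"]
per_game = plays.groupby("game_id")["yards_gained"].sum()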
Scikit-learn is a fast system for machine learning. New algorithms are being added with each release, and the consistent API they offer is a notable feature. It offers facilities for classification, regression, clustering, dimensionality reduction and preprocessing. This last is valuable, given how much data science work is just getting the data into the desired form.
Scikit-learn estimators implement the interface of the Estimator class, and subclasses have .fit(X, y=None), .transform() and .predict() methods. Some have .predict_proba() methods. The "algorithm cheat sheet" is an attempt to show how to use the package's features to obtain desired results.
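That consistency means most estimators can be swapped in and out with the same few calls; a generic sketch on toy data (illustrative only, not Trey's code):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# toy data standing in for real features and win/loss labels
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 4)), rng.integers(0, 2, size=200)

scaler = StandardScaler().fit(X)      # .fit(X, y=None) learns the scaling
X_std = scaler.transform(X)           # .transform() applies it

clf = LogisticRegression().fit(X_std, y)
print(clf.predict(X_std[:5]))         # hard class labels
print(clf.predict_proba(X_std[:5]))   # class probabilities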
Suppose you are interested in the probability that a given team wins a game given the appearance of a particular play in that game. This allows coaches to answer questions like "should we punt on this fourth down?". It is therefore a classification problem, under the assumption that plays are independent (which seems to be a ridiculous assumption, but in fact works pretty well).
The janitorial steps will involve looking at the data. Does it suffer from class imbalances (e.g. many more wins than losses)? Does it need centering and scaling? How do I split my data set into training and prediction sets?
A first study, a histogram of the number of plays vs. remaining time, showed a major spike just before halftime and smaller spikes before the ends of the other quarters due to the use of timeouts. The predictive technique used was "random forests." Testing the predictions considered true and false negatives as well as positives. Calibration is important for predictive forecasts. Sometimes you will need to use domain knowledge to wring information out of the data.
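A hedged sketch of that kind of win-probability model, using synthetic data and invented feature meanings since Trey's real features and code were not shown in full (the imports assume a recent scikit-learn):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# synthetic stand-ins for per-play features (e.g. score margin, seconds remaining, down, yards to go)
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=5000) > 0).astype(int)   # 1 = the team on offense went on to win

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

win_prob = clf.predict_proba(X_test)[:, 1]               # estimated win probability per play
print(confusion_matrix(y_test, clf.predict(X_test)))     # true/false positives and negatives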
Trey showed a number of graphics, showing various statistics which my total lack of football knowledge didn't allow me to interpret sensibly. He pointed out some of his important omissions, being quite frank about the shortcomings of his methodology. Feature engineering depends on subtleties of the data, and no confidence intervals are given. Putting models into production is difficult, and data serialization has not yet been suitably standardized.
Data science may not be easy, but it can be very rewarding. Read more from @treycausey (trey dot causey at gmail dot com) or at thespread.us.
Q: Does what you are doing become less effective over time?
A: In the NFL case most teams don't trust statistics, so it is difficult to get coaches to implement stats-based decision making. In the larger context, if you get really good at predicting a particular phenomenon, you would expect performance to decline over time due to changing behavior.
Q: Are there particular teams who are making visibly good or bad decisions?
A: Yes. It's easy to see which coaches aren't optimizing at all, but even the best teams still have a way to go. The New York Times has a "fourthdownbot" that is fun.
Q: Are you trying to develop mental constructs of what is right and wrong in feature engineering?
A: That's a philosophical divide in the data science community; some prefer more parsimonious models, which should yield actionable features, while others prefer highly parameterized models with better predictive ability.
Q: Do you understand more about the features in your data as you work with it?
A: Yes, all the time, and the new features aren't always obvious at first. This is basically a knowledge representation problem. Sometimes there is just no signal ...
Q: How are you giving the coaches this data?
A: I can't tell you how. Cellphone communications are forbidden on the sidelines.
Q: Are you REALLY good at fantasy football?
A: Fantasy football's scoring system is highly variable; a lot of research shows that some scoring features can't be predicted. Usually FF's busy time is Trey's busy time too, so he tends to play like other people do and say "he's looking good this week."
Thanks to all speakers for their amazing talks, and thanks to the group for hosting me.
January 6, 2015
PyData London, January 2015
Tonight I'm attending the PyData London Meetup group for the first time.
The meeting began with a short "What's New in the Python World," from the joint organizers, who assumed the audience was mainly here for the invited presentations. That's true -- like many others I find it interesting to learn what other Pythonistas are doing with their Py (so to speak), but it's also nice to hear a potted summary of "events of significance."
I know from my own experiences that they (said organizers) will be DELIGHTED if you would offer any feedback at all. They are doing so much good work for nothing (and it can begin to feel like a thankless task) that we all owe it to them to pass on any suggestion that might help the group be even more effective. Similarly, if you would like to let me know anything about this, or future posts of this nature, just add your comment below. If you feel like making a critical comment at the same time, I am old enough and ugly enough to survive.
Frank Kelly was introduced to talk about "Changepoint Detection with Bayesian Inference". After graduation Frank was sentenced to investigate rock strata produced as part of an oil exploration project. One initial problem was the transmission of information from the drill head (hundreds of feet down in the rock). They discovered that they could encode digital information by pulsing the highly viscous mud that lived in the well. As you can imagine, mud is not a very good transmission medium, and Frank's final-year project was to use Bayesian statistics to decode the transmissions.
Frank's discussion of frequentist vs Bayesian methods was interesting. It revealed that Bayesian methods are a fundamentally different technique. Bayes was a non-conformist who is buried about ten minutes away from the meeting space at Lyst Studios. I don't know whether this was intended as a warning. Essentially, Bayesian methods use a fixed data set and post-hoc processing to understand the experimental results. Data is assumed to be generated as model data plus noise, which for Frank's purposes could be characterized as Gaussian white noise. The signal is essentially the reading minus the noise, recovered by integrating with respect to the "nuisance parameters."
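To make that concrete, here is a minimal sketch - my own reconstruction, not Frank's code - of a single-changepoint detector under the "model plus Gaussian noise" assumption: for each candidate split point the two segment means are integrated out as nuisance parameters, and the results are normalised into a posterior over the changepoint location.

import numpy as np

def changepoint_posterior(y, sigma):
    """Posterior over the index m at which the mean of y changes, assuming
    Gaussian noise with known sigma and flat priors on the two segment means."""
    n = len(y)
    log_post = np.full(n, -np.inf)
    for m in range(2, n - 1):
        left, right = y[:m], y[m:]
        # residual sum of squares around each segment's mean...
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        # ...plus the terms left over after integrating the means out analytically
        log_post[m] = (-rss / (2 * sigma ** 2)
                       + 0.5 * np.log(2 * np.pi * sigma ** 2 / m)
                       + 0.5 * np.log(2 * np.pi * sigma ** 2 / (n - m)))
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# a noisy step whose mean jumps from 0 to 1 at index 60
y = np.concatenate([np.zeros(60), np.ones(40)]) + 0.5 * np.random.randn(100)
print(changepoint_posterior(y, sigma=0.5).argmax())   # should land near 60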
We saw a graphical demonstration of how this could be used to detect thresholds in noisy data, and how as the data tended to a lower and lower threshold the result of Frank's analytical function tended towards a smooth curve, with no detectable thresholds indicated by minima or maxima.
Frank then went on to discuss other applications of his technique related to Google search. There are pressures on Google to reveal the algorithms they use to produce their search results, but this is unlikely to happen as the algorithm represents Google's "secret sauce". Changes to Google's algorithms cause large fluctuations in results and sales for some companies.
Then we were treated to a view of an IPython Notebook, though the code wasn't very readable (remember that Command+ sequence, Mac presenters), but it was interesting to see a demonstration that a Google change had made a detectable difference to the traffic on sites. By presenting this kind of supporting evidence companies can at least give Google some facts about how the changes have affected them.
Frank then went on to talk about applications to tropical storms, which generally have wind speeds over 17 m/s (35 m/s is classified as a hurricane). Data is available for the number of tropical storms per year since 1856, and Frank pointed out that there were spikes that correlated well with changes in the surface ocean temperature.
Frank explained that he is now mixing Bayesian with frequentist techniques, and that he has done some informal research into correlations between external events such as Christmas and the Scottish independence referendum, with no really conclusive results (except that Christmas had more effect on the stock market than the referendum did). Overall a very interesting talk. Frank even promised that the published slides (now linked from the talk title above as well as here) will include a link to a "Matlab to Python" conversion cheat sheet. Well done!
In the Q and A session the first questioner asked whether Frank had benchmarked his methods against other change-detection algorithms. Frank felt this might be difficult because different algorithms produce different types of output. Next, someone asked whether Frank had tried any custom Bayesian analysis tools, but Frank had simply put his own code together. Next, someone asked whether the technology to communicate from the drill head was any better nowadays, and Frank said that things are now more sophisticated, but "you can't just stick a cable down the hole." A commenter pointed out that mud pulsing was still common five years ago, and asked whether it was easy to apply a moving window to the data. I'm not sure I understood the answer, but Frank seemed to think it could be used to produce online analysis tools that would allow a sliding window to be applied to time series data. Next someone asked whether the technique could cope with different noise spectra/distributions. The answer was that the noise levels had to be modelled, and sometimes the noise was different after the change from before. The talk ended with a robust round of applause.
There was then a break for beer and pizza, so don't blame me for any degradation of quality in the rest of this post. There are no guarantees. I met a few people while we were milling around, including a data scientist who had heard that the PyData meetups were friendly and relevant, and the business development manager of a company whose business is training scientists.
The second talk of the evening was "Learning to Read URLs: Finding word boundaries in multi-word domain names with Python and sklearn" by Calvin Giles.
Calvin started out with a confession that he isn't a computer scientist but a data scientist, so he was describing algorithms that worked to solve his problems rather than theoretically optimal solutions.
The basic problem: given a domain name such as powerwasherchicago.com, resolve it into words. The point is, if this minimal amount of semantic information can be extracted, you can avoid simple string comparisons and determine the themes that might be present in a relatively large collection of domain names. Calvin warned us that the code is the result of a single day's hacking project, based on Adthena's data of ten million unique domain names and third party data such as the Gutenberg project and the Google ngram viewer data sets.
The process was to determine which of a number of possible sentences should be associated with the "words" found in a domain name. The first code snippet showed a way of extracting all the single words from the data (requiring at least two characters per word). A basic threshold detection was required, since low frequency words are not useful. The initial results were sufficiently interesting to move further research to the Google ngram dataset. This resulted in an English vocabulary of roughly 1,000,000 words. A random selection included "genéticos" and "lanzamantio", but Calvin was happy to allow anything that had a frequency of more than 1,000 in his corpus.
Calvin then presented a neat algorithm to find the words that were present in a string. The list of potential words in powerwasherchicago.com seemed to have about fifty elements in it. catholiccommentaryonsacredscripture.com had 101 words in it, including a two-letter word, making it a very interesting test case.
Calvin's algorithm to find the "most likely" set of words allows you to decide how likely the domain name is to occur given the potential words it could be made up from. Sadly, with n substrings of the data you can generate 2^n sentences, and some convincing calculations were used to demonstrate that this isn't a practical approach. The words should be non-overlapping and contiguous, however, which limits the possibilities quite radically, even though it isn't easy to find all subsets of non-overlapping words. This is a major win in reducing the number of cases to be considered.
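My own reconstruction of that constraint (not Calvin's code): keep only the splits in which every character belongs to exactly one vocabulary word, which is what cuts the 2^n candidate sets down to something manageable.

def segmentations(s, vocab, max_word=20):
    """Yield every split of s into contiguous, non-overlapping vocabulary words."""
    if not s:
        yield []
        return
    for i in range(1, min(len(s), max_word) + 1):
        head = s[:i]
        if head in vocab:
            for rest in segmentations(s[i:], vocab, max_word):
                yield [head] + rest

vocab = {"power", "washer", "was", "her", "chic", "ago", "chicago"}
for candidate in segmentations("powerwasherchicago", vocab):
    print(candidate)   # includes ['power', 'washer', 'chicago'] among the candidates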
Given the solution with least overlap, Calvin then chose to omit all "sentences" that were less than 95% of the length of the least-overlap solution. The code is all available on the slides, and I am hoping to be able to publish the URL (which Calvin said at the start of his talk would be available).
The get_sentences() wrapper function was only seven lines of code (ignoring the called functionality), and quite comprehensible. The demonstration of its result was extremely convincing, reducing some very large potential sets to manageable proportions. The average domain turned out to have around 10,000 possible interpretations. The domains with more than that number of candidate interpretations turned out to be the most interesting.
In order to produce a probability ordering for the possibilities, the algorithm prefers fewer longer words to a larger number of shorter words. A four-line function gave that ordering - a nice demonstration of Python's high-level abilities. While this did not give perfect results, Calvin convinced me that his methods, while possibly over-trained on his sample domains (modulo some fuzzy word-inference techniques that help to identify one-character words in the domain), are useful and workable.
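Calvin's exact four-line function wasn't captured in my notes, but a plausible version of that ordering (an assumption on my part) sorts candidates by word count and breaks ties in favour of the candidate containing the longest single word:

def preference_key(words):
    # fewer words first; among equal counts, prefer the longer longest-word
    return (len(words), -max(len(w) for w in words))

candidates = [["power", "was", "her", "chic", "ago"], ["power", "washer", "chicago"]]
print(min(candidates, key=preference_key))   # ['power', 'washer', 'chicago']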
Calvin has now trained his system to understand which was the correct interpretation for the first hundred domain names. The visual representation of Calvin's latest results showed that even when the algorithm was "wrong" it was because the arbitrary ordering of equally-assessed domains had put the correct answer further down the list.
He expects that in a set of 500 domains his system will currently give roughly two-thirds useful answers, with the remainder randomly wrong. A missing piece of the model is that it currently assumes all sentences are equally likely; using Bayesian techniques one can determine the "likelihood" of the generated sentences being the intended interpretation. A somewhat pragmatic approach to this problem, inspired by Peter Norvig's spell-checker blog post, used as much existing code as possible. The code
guess("powerwasherchicago")[0]
returned a result of
'power washer chicago'
Calvin closed by suggesting that his initial training data set might have been too small, so he plans to use larger data sets and more sophisticated training methods with a rather more rigorous evaluation of the ad hoc decision making incorporated in his existing code base. His list of desired enhancements made sense, usually the sign of a talk that's related closely to the author's real-world experience. I was impressed that he is also considering how to make his work more usable. Another great talk!
Someone opened the Q and A by asking whether the web site content couldn't be analysed algorithmically to provide more input to the parse. Calvin's reply reflected real-world experience that suggested it would not be easy to do this. Given there are 40,000,000 domain names, Calvin felt such a parser would be unlikely to be helpful. The final question asked whether finite-state transducers, as implemented in OpenFST (sadly written in C++) would be useful. Calvin replied that it might well be, and asked to talk to the questioner with a view to updating the slides.
To close the meeting Product Madness announced that they were hiring, and I am happy to say that nobody wanted me to answer any questions, despite the organizers putting me on the spot!
Thanks to Lyst for hosting. This was a most stimulating meeting, and I'll be attending this group again whenever I can.


December 9, 2014
Python Training Day Feedback
Work on the Repository
About eight people turned up throughout the day, with the stalwarts being Kevin Dwyer and João Miguel Neves. With their help I investigated some thorny problems I had created for myself in the area of git filtering. It turns out that there is a subtle bug in Python's JSON generation which has existed for quite a long time.
We eventually worked around it by adding import simplejson as json for now, though I suspect a better fix might be to explicitly add a separator specification. Anyway, with that change the filtering then started to work predictably, and we could move forward with a review of the notebook contents to ensure that changes I made during the teaching were incorporated into the main code base.
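For anyone hitting the same thing: what we saw is consistent with the json module emitting trailing spaces after item separators whenever an indent is given, which upsets line-based git filters. A hedged sketch of the separator fix mentioned above (my reconstruction, not the exact code from the day):

import json

data = {"cells": [], "metadata": {}}   # stand-in for a notebook's JSON structure

with open("notebook.ipynb", "w") as f:
    # an explicit separators argument avoids the trailing space that the default
    # ", " item separator leaves at the end of each line when indent is used
    json.dump(data, f, indent=1, separators=(",", ": "))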
With three of us hacking away that work didn't take very long, but before it could commence there was a certain amount of git-wrangling that I confess I probably wouldn't have managed anywhere near as quickly without Kevin's and João's help.
Training Market Discussions
All in all quite a success, and we also spent time discussing the UK training market for Python. I've a few more ideas now about how to approach that market, but if you, dear reader, have any ideas I would be happy to hear them either in the comments below or through my web site's contact page.
Thank You
Thanks to everyone who turned up or merely wished the enterprise well. I am really looking forward to spending more time in the UK and helping to encourage even more Python use. Thanks also to the staff at The Church House for their excellent attention. We couldn't have been better taken care of.


December 8, 2014
Python Dev of the Week
I hope Mike keeps going with this series. There are many interesting personalities in the Python world, and this brings them just a little closer.
December 2, 2014
UK Python Training Open Day and Lunch
So
a) I am not a marketing genius;
b) this is not the movies.
Result: I am now planning to spend the day in said reserved room, working on my current open source project, whose sadly neglected repository can be found in a weed-infested corner of Github. This is the code I used as a script for the videos which (if you are reading my blog) you should see advertised up at top right. The code is presented as IPython Notebooks, and only some of them are properly documented in interspersed Markdown cells, so if I end up spending the day alone then at least some sensible purpose will have been served.
With that in mind the Skills Lab is now metamorphosing into an open Python Training Day. I understand this is very short notice (I'm not very good at operations either, go figure), but if you have any interest in either receiving Python training or having such training presented to a target group I would be delighted to talk with you on Tuesday December 9, 2014 at any time between 10:00 am and 4:30 pm at the Church House Conference Centre in Westminster.
If you would like to join me for lunch then please sign up (there are 12 places for lunch) and arrive no later than 12:30. If you simply want to drop in and say hello (or, better still, help work on the codebase) then feel free to arrive at any time during the advertised hours and stay as long as you like (but please do sign up for lunch if you plan to eat). It would be helpful if you reserved a drop-in place even if you aren't coming to eat. Those with an interest in beer might want to arrive towards the end of the proceedings so a post-event excursion to a licensed hostelry can be entertained if appropriate.
The room reserved for this event is absolutely delightful, with a view across the Dean's Yard. I welcome all Python friends and anyone who would like to get to know me professionally to visit at some convenient time during that day. But do remember to reserve a place if you want lunch!
I have had very little time to work on the Github repository since the videos were shot, and yearn for an army of Internet interns to help me improve what I believe are useful open-source training materials. In the absence of visitors that day, any spare time I have on that day will go to improve the Notebooks, so at the very least the curated Notebook collection will become more useful.
November 30, 2014
“Rock Star” Programmers
He observed that programmers who were reluctant to share their code tended to hold on to false views of what might be causing issues rather longer than those who were open to review and discussion. My own development as a programmer was greatly aided by this approach, and at university a couple of close friends in particular discussed every aspect of the code we were creating.
Forty years later Weinberg's “egoless” approach, in which mistakes are accepted as inevitable and reviews are performed in a collegial way, remains the sanest way to produce code. Given that computer programming is fast becoming a mainstream activity it seems perverse to deliberately select for ego when seeking programming talent, since the inevitable shortfall in humility will ultimately work to undermine the rock star's programming skills.
When I think of the best programmers I know, the foremost characteristic they share is a modesty about their own achievements which others would do well to emulate. So, can we please dispel the myth of the “rock star” programmer? The best programmers can't be rock stars. Rock star egoism will stand in the way of developing your programming skills.
July 8, 2014
Is Python Under Threat from the IRS?
Nicholas Tollervey (author and performer of A Programmer Pythonical) asked an interesting question on the Python Software Foundation's membership list today. Despite his misgivings it is one that will be of interest to many Python users, so I am grateful for his permission to quote the essence of it here.
I've noticed rumblings on the web about the IRS denying nonprofit/charitable status to various free software based foundations/organisations/legal entities similar to the PSF. In case you've missed it, here is a high level summary.
My questions are:
1) Is the PSF likely to be affected by this change of view from the IRS?
2) If so, do we have contingency for dealing with this (basically,
what are our options)?
The answer to the first question is absolutely not. The PSF is a long-standing 501(c)(3) non-profit in good standing. The recently-reported issues are all to do with current applications for non-profit status. It's fair to say that the IRS is applying scrutiny to such applications, but they are, after all, responsible for making sure that applications are genuine. As long as an existing non-profit makes the necessary returns and complies with all applicable laws, and as long as it continues to honor its requirement to maintain broad public support, there is no reason why the IRS would represent any kind of threat.
The IRS does not exist to help open source
devotees to build a bright new world
Some of those rejected have been advised to apply for a different form of non-profit status, others are reconsidering their options. Good legal advice is imperative when starting any such enterprise, and that requires a specialist. There is a feeling that the IRS might make better decisions if it could be better-informed about open source. From my limited knowledge of the recent cases I'd say that it's essential to ensure that your application has a sound basis in non-profit law. The IRS does not exist to help open source devotees to build a bright new world.
The answer to the second question is that loss of non-profit status would be a blow to the Foundation, but there's no reason why even that should be fatal to Python, though I very much doubt there is serious planning for such an eventuality. Ultimately the Foundation's bylaws include a standard winding-up clause that requires asset transfer to a similar/suitable non-profit, so assets cannot be stripped even under those circumstances, including the ability to license Python distributions.
The developers could simply migrate to a different development base. They aren't directed or controlled in any way by the Foundation, which in the light of recent decisions it turns out is probably a good model. If the PSF were directing the development of the language there might have been a real risk of it being seen as no different from a software house or vendor, and it is at that point that doubts about non-profit status can be raised.
The Foundation's mission is ... therefore
genuinely educational and charitable


July 2, 2014
Closures Aren't Easy
Note: I import print_function because I have become used to the Python 3 way of doing things, but this particular post was made using CPython 3.3.2.

from __future__ import print_function
from dis import dis
import sys
print(sys.version)

3.3.2 (default, Nov 19 2013, 03:15:33) [GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]
The Problem
In my Intermediate Python video series I discuss decorators, and point out that classes as well as functions can be decorated. As an example of this possibility I used a class decorator that wraps each of the class's methods in a function that prints out a message before calling the original method (or so I thought). Here is the relevant code.

def barking(cls):
    for name in cls.__dict__:
        if name.startswith("__"):
            continue
        func = getattr(cls, name)
        def woofer(*args, **kw):
            print("Woof")
            return func(*args, **kw)
        setattr(cls, name, woofer)
    return cls
The decorator iterates over the class's __dict__, ignoring the so-called "dunder" names. Strictly we should perhaps also check that the name ends as well as begins with a double underscore, but it does not affect the example. Each method is then replaced with a wrapped version of itself that prints "Woof" before calling the original method. This seemed to work fine on a class with a single method.
@barking
class dog_1:
    def shout(self):
        print("hello from dog_1")

d1 = dog_1()
d1.shout()
Woof hello from dog_1
I then created a subclass of dog_1 with two additional methods, and decorated that. The inheritance is irrelevant: the significant fact is that two methods will be processed in the decorator loop, but it does demonstrate that superclass attributes do not appear in the subclass's __dict__.

@barking
class dog_3(dog_1):
    def wag(self):
        print("a dog_3 is happy")
    def sniff(self):
        print("a dog_3 is curious")
d3 = dog_3()
d3.wag(); d3.sniff(); d3.shout()
Woof a dog_3 is curious Woof a dog_3 is curious Woof hello from dog_1
Calling the wag() and the sniff() methods gives the same result, which is to say that both wrapped methods in fact call the same original method. Unfortunately I missed this during development and production of the video code, but it was soon picked up by an eagle-eyed viewer who reported it via O'Reilly's web site.

The Explanation
Functions have a __closure__ attribute, which is either None if the function does not refer to non-local cells, or a tuple of non-local references. Below, the first function is not a closure; the second is, because its inner function refers to x from the enclosing function's namespace. You can see the difference in the bytecode reported by the dis module, and also that they return different results when created with and called with the same arguments.

def f_non_closure(x):
    def inner(y):
        return y
    return inner

non_closure = f_non_closure(240)
dis(non_closure)
print("Value =", non_closure(128))
3 0 LOAD_FAST 0 (y) 3 RETURN_VALUE Value = 128
def f_closure(x):
    def inner(y):
        return x
    return inner

closure = f_closure(240)
dis(closure)
print("Value =", closure(128))
3 0 LOAD_DEREF 0 (x) 3 RETURN_VALUE Value = 240
The non-closure uses the LOAD_FAST operation that loads a local value from element 0 of the local namespace. The closure uses LOAD_DEREF, which loads a value from the function's __closure__ attribute. So let's take a look at the __closure__ of both functions.

print(non_closure.__closure__)
print(closure.__closure__)
None (<cell at 0x10bbcb9f0: int object at 0x10a953ba0>,)
The non-closure has nothing in its __closure__ attribute. The closure, however, does - it is a tuple of cell objects, created by the interpreter. And if we want, we can obtain the values associated with the cells.

print(closure.__closure__[0].cell_contents)
240
The cells in the __closure__ tuple are there so that values from the enclosing namespace remain available after that namespace has been destroyed. Those values therefore do not become garbage when the enclosing function terminates, returning the inner function (which is a closure). And the LOAD_DEREF 0 opcode simply loads the contents of the cell. It puts the function's __closure__[0].cell_contents onto the stack to be used (in this case) as a return value (and because it's written in C, it's much faster than the Python).

Problems arise, though, when you create several closures in a loop, each referring to the same variable x: the first one should return x+0, the second x+1, the third x+2 and so on. You find, however, that this does not happen.

def multiple_closures(n):
    functions = []
    for i in range(n):
        def inner(x):
            return x+i
        functions.append(inner)
        print(inner, "Returned", inner(10), inner.__closure__[0])
    return functions

functions = multiple_closures(3)
print([f(10) for f in functions])
<function multiple_closures.<locals>.inner at 0x10c808440> Returned 10 <cell at 0x10c817948: int object at 0x10a951da0> <function multiple_closures.<locals>.inner at 0x10c808560> Returned 11 <cell at 0x10c817948: int object at 0x10a951dc0> <function multiple_closures.<locals>.inner at 0x10c808050> Returned 12 <cell at 0x10c817948: int object at 0x10a951de0> [12, 12, 12]
While the functions are being created inside multiple_closures() they appear to work perfectly well. After returning from that function, however, they all return the same result when called with the same argument. We can find out why by examining the __closure__ of each function.

for f in functions:
    print(f.__closure__[0])
<cell at 0x10c817948: int object at 0x10a951de0> <cell at 0x10c817948: int object at 0x10a951de0> <cell at 0x10c817948: int object at 0x10a951de0>
All three functions have the same cell in their __closure__. The interpreter assumes that since all functions refer to the same local variable they can all be represented by the same cell (despite the fact that the variable had different values at the different times it was used in different functions). Precisely the same thing is happening with the decorators in Intermediate Python.

If you look at the barking() decorator carefully you can see that name, func and woofer are all names local to the barking() decorator function, and that func is used inside the inner function, making it a closure. Which means that all the methods end up referring to the last method processed, apparently in this case sniff().
.print(d3.wag.__closure__[0].cell_contents.__name__) # In Python 2, use __func__
print(d3.sniff.__closure__[0].cell_contents.__name__) # In Python 2, use __func__
sniff sniff
The func references in the inner woofer() function are all using the same local variable, which is represented by the same cell each time the loop body is executed. Hence, since a cell can only have a single value, they all refer to the same method.

print(d3.sniff.__closure__)
print(d3.wag.__closure__)
(<cell at 0x10c817788: function object at 0x10c81f4d0>,) (<cell at 0x10c817788: function object at 0x10c81f4d0>,)
Is This a Bug?
I suspect this question is above my pay grade. It would certainly be nice if I could get this code to work as-is, but the simple fact is that at present it won't. Whether this is a bug I am happy to leave to the developers, so I will post an issue on bugs.python.org and see what they have to say. There's also the question of whether any current code is likely to be relying on this behavior (though I rather suspect not, given its unhelpful nature) - backwards compatibility should ideally not be broken.

Workaround

The issue here is that different uses of the same non-local variable from a function will always reference the same cell, and no matter what the value was at the time it was referenced the cell always contains the final value of that variable. So a fairly simple, though somewhat contorted, workaround is to avoid multiple uses of the same non-local variable in different closures.
def isolated_closures(n):
    functions = []
    for i in range(n):
        def wrapper(i=n):
            def inner(x):
                return x+i
            return inner
        f = wrapper(i)
        functions.append(f)
        print(f, "Returned", f(10), f.__closure__[0])
    return functions

functions = isolated_closures(3)
print([f(10) for f in functions])
<function isolated_closures.<locals>.wrapper.<locals>.inner at 0x10c826c20> Returned 10 <cell at 0x10c817e88: int object at 0x10a951da0> <function isolated_closures.<locals>.wrapper.<locals>.inner at 0x10c8264d0> Returned 11 <cell at 0x10c817ec0: int object at 0x10a951dc0> <function isolated_closures.<locals>.wrapper.<locals>.inner at 0x10c826b00> Returned 12 <cell at 0x10c817ef8: int object at 0x10a951de0> [10, 11, 12]
Here inner() is still a closure, but each time it is defined the definition takes place in a different local namespace associated with a new call to wrapper(), and so each cell is a reference to a different local (to wrapper() - nonlocal to inner()) variable, and they do not collide with each other. Redefining the barking() decorator as follows works the same trick for that.

def barking(cls):
    for name in cls.__dict__:
        if name.startswith("__"):
            continue
        func = getattr(cls, name)
        def wrapper(func=func):
            def woofer(*args, **kw):
                print("Woof")
                return func(*args, **kw)
            return woofer
        setattr(cls, name, wrapper(func))
    return cls
@barking
class dog_3(dog_1):
    def wag(self):
        print("a dog_3 is happy")
    def sniff(self):
        print("a dog_3 is curious")

d3 = dog_3()
d3.wag(); d3.sniff(); d3.shout()
Woof a dog_3 is happy Woof a dog_3 is curious Woof hello from dog_1
April 21, 2014
Neat Notebook Trick
I also made the tactical (and, as it turned out, strategic) mistake of choosing to stay at the Hyatt in Montreal. This meant a considerable walk (for a gimpy old geezer such as myself) to the conference site, when the Palais des Congres is already intimidatingly large.
So the combination of exhaustion and knee pain meant I hardly got to see any talks (not totally unheard of) but that I also got very little time in the hallway track either. Probably the most upsetting absence was missing the presentation of Raymond Hettinger's Lifetime Achievement Award. As a PSF director I instituted the Community Service Awards, but these have never really been entirely appropriate for developers. This award makes it much clearer just how significant Raymond's contributions have been.
Because of the video releases I did spend some time on the O'Reilly stand, and signed away 25 free copies of the videos. I was also collecting names and addresses to distribute free copies of the Python Pocket Reference. If you filled out a form, you should receive your book within the next three weeks. We'll mail you with a more exact delivery date shortly.
But the real reason for this post is that I had the pleasure of meeting Fernando Perez, one of the leaders of the IPython project. He was excited to hear that the Intermediate Python notebooks are already available on Github, and when he realized the notebooks were all held in the same directory he showed me that if I dropped that URL into the Notebook Viewer site I would get a web page with links to viewable versions of the notebook. [Please note: they aren't currently optimally configured for reading, so it's still best to run the notebooks interactively, but in the absence of a local notebook server this will be a lot better than nothing. It will get better over time].
He also mentioned a couple of other wrinkles I hadn't picked up on, and we briefly discussed some of the interesting aspects of Notebooks being data structures.
The conversation was interesting enough that I plan to visit Berkeley soon to try and infiltrate my way into the documentation team and see if we can't make the whole system even easier to use and understand. One way or another, open source seems to be in my bloodstream.


April 12, 2014
Intermediate Python: An Open Source Documentation Project
My intention in recording the videos was to produce a broad set of materials (the linked O'Reilly page contains a good summary of the contents). I figure that most Python programmers will receive excellent value for money even if they already know 75% of the content. Those to whom more is new will see a correspondingly greater benefit. But I was also keenly aware that many Python learners, particularly those in less-developed economies, would find the price of the videos prohibitive.
With O'Reilly's contractual approval the code that I used in the video modules, in IPython Notebooks, is going up on Github under a Creative Commons license [EDIT: The initial repository is now available and I very much look forward to hearing from readers and potential contributors - it's perfectly OK if you just want to read the notebooks, but any comments you have about your experiences will be read and responded to as time is available]. Some of it already contains markdown annotations among the code, other notebooks have little or no commentary. My intention is that ultimately the content will become more comprehensive than the videos, since I am using the video scripts as a starting point.
I hope that both learner programmers and experienced hands will help me turn it into a resource that groups and individuals all over the world can use to learn more about Python with no fees required. The current repository has to be brought up to date after a rapid spate of editing during the three-day recording session. It should go without saying that viewer input will be very welcome, since the most valuable opinions and information comes from those who have actually tried to use the videos to help them learn.
I hope this will also be a project that sees contributions from documentation professionals (and beginners they can help train), so I will be asking the WriteTheDocs NA team how we can lure some of those bright minds in.
Sadly it's unlikely I will be able to see their talented array of speakers as I will still be recovering from surgery. But a small party one evening or a brunch at the office might be possible. Knowing them it will likely involve sponsorship or beer. Or both. We shall see.
I think it's a worthwhile goal to have free intermediate-level Python sample code available, and I can't think of a better way for a relative beginner to get into an open source project. I also like the idea that two communities can come together over it and learn from each other. Suffice it to say, if there are enough people with a hundred bucks* in their pocket for a six-hour video training I am happy to use part of my share in the profits to support this project to some degree.
[DISCLOSURE: The author will receive a proportion of any profit from the O'Reilly Intermediate Python video series]
* This figure was plucked from the air before publication, and is still a good guideline, though as PyCon opened (Apr 11) a special deal was available on a package of both Jessica McKellar's Introduction to Python and my Intermediate Python.


A Rap @hyatt Customer Service Request
@Hyatt ... #pycon
That's why my face is wearing a frown
Even though I'm at ... #pycon
I love all these Canadians
And Montreal is cool
But don't you know how not to run a network
fool?
If I were a rapper
Then you'd have to call me Milton
Because frankly I get much better service
@Hilton
I'm a businessman myself
And I know we're hard to please
So kindly please allow me
To put you at your ease
Your people are delightful
And as helpful as the best
I want to help, not diss you
I'm not angry like the rest
The food is amazing
And the bar could be geek heaven
If only you weren't calling
For last orders at eleven
We're virtual and sleepless
So we need your help to live
And most of us are more than glad
To pay for what you give
But imagine you're away from home
And want to call your Mom
The Internet's our family
So you've just dropped a bomb
I've had my ups and downs with Hyatt
Over many years
But never felt before
That it should fall on other's ears
I run conferences, for Pete's sake
And I want to spend my money
If only I could reach someone
And I'm NOT being funny
PyCon is my baby
So I cherish it somewhat
But this has harshed my mellow
And just not helped a lot
We're bunch of simple geeks
Who get together every year
We aren't demanding, I don't think
Our simple needs are clear
I don't believe that I could run
Your enterprise right here
It's difficult, and operations
Aren't my thing I fear
So please, don't take this badly
But you've really disappointed
Which is why a kindly soul like me
Has made remarks so pointed
We will help you if we can
We know you pay a lot for bits
But I have to know if web sites
Are receiving any hits
You've cut me off, I'm blind
And so I hope there's nothing funky
Happening to my servers
While I'm sat here getting skunky
Enough, I've made my point
So I must stop before I'm rude
The Internet's my meat and drink
You've left me without food.
trying-to-help-while-disappointed-ly yr's - steve
March 20, 2014
Social Media and Immortality
In this particular case it was triggered by the suggestion from LinkedIn that I might like to add a fellow Learning Tree instructor to my roster of friends. He died, quite young and to most of his colleagues' surprise, about fifteen months ago (if my memory can be relied upon, which I wouldn't necessarily recommend as a strategy). I've seen similar reports on Twitter from other friends.
Now, I'm just a guy who chose to eschew the corporate career ladder and work on small systems that do demonstrable good, so I freely admit that the young devops turks of today are able to develop far more capable systems that I could have conceived of at their age. That's just the nature of technological progress. At the same time, I have to wonder why nobody appears to have asked the question "Should we take special actions (or at least avoid taking regular ones) for users who haven't logged in in over a year?"
Do they have no business analysts? Must we geeks be responsible for avoiding even the most predictable social gaffes?
Sidebar: I once designed the database for a system that monitored the repayment of student support funds by those who had accepted assistance from the federal government to train in teaching disadvantaged students. There were certain valid reasons for deferring repayment (such as military service), and of course these deferrals had to be recorded. I remember feeling very satisfied that all I had to do was associate the null value with the deferral duration for "Dead" as the deferral reason to have everything work perfectly well.
The answer to my question of two paragraphs above, by the way, would be “yes”. This will be the last time I give free advice to the social media companies, so Twitter, Facebook, LinkedIn, and the rest, I hope you can find some benefit in this advice. Anything further will cost you dearly. (I should be so lucky).
Quite separately from the above speculations on human frailty, I can't help wondering what kind of immortality a continued existence on these platforms represents (even though this will probably lead to hate mail from all kinds of people the concept offends). I had an email from Google a couple of days ago asking me to log in with a particular identity* within a month or have the account go inactive. That's a necessary second step to whatever palliative actions you choose to take when presenting the account to others. Google, for all their execrable support,** get that you have to log in now and again just to assert your continued existence.
It strikes me this is a reasonably humane way to proceed. If you want to keep someone's memory alive on a social media platform then you must know them at least well enough to log in to their account, after which it's basically your shrine to them if you want it to be. I really don't like to think about what kind of complications the lawyers will dream up about this, though. Otherwise, well, we are after all all born to die (Ray Kurzweil notwithstanding).
*Note to the Google identity nazis: no, of course I was joking, I only have one identity
** Hint re Google customer service: if you aren't paying you aren't a customer, so expecting service might seem presumptuous


January 9, 2014
Practical Python (1)
Note: this blog post is the first I am undertaking with the IPython Notebook. I am still playing with formatting and so on, so please bear with me if the content doesn't seem as easy to read as it should. The notebook itself can be found as a gist file on Github and you can alternatively view it using the online Notebook viewer.
I want to discuss a typical bit of Python, taken from a program sent me by a colleague (whether it's his code or someone else's I don't know, and it hardly matters). It's the kind of stuff we all do every day in Python, and despite the Zen of Python's advice that “there should be one, and preferably only one, obvious way to do it” there are many choices one could make that can impact the structure of the code.
This started out as a way to make the code more readable (I suspect it may have been written by somebody more accustomed to a language like C), but I thought it might be interesting to look at some timings as well.
In order to be able to run the code without providing various dependencies I have taken the liberty of defining a dummy Button
function and various other “mock” objects to allow the code to run (they implement just enough to avoid exceptions being raised)*. This in turn means we can use IPython's %%timeit cell magic to determine whether my “improvements” actually help the execution speed.
Note that each timed cell is preceded by a garbage collection to try as far as possible to run the samples on a level playing field**.
import gc
class MockFrame():
def grid(self, row, column, sticky):
pass
mock_frame = MockFrame()
def Button(frame, text=None, fg=None, width=None, command=None, column=None, sticky=None):
return mock_frame
class Mock():
pass
self = Mock()
self.buttonRed, self.buttonBlue, self.buttonGreen, self.buttonBlack, self.buttonOpen = (None, )*5
f4 = Mock()
f4.columnconfigure = lambda c, weight: None
ALL = Mock()
The code in this next cell is extracted from the original code to avoid repetition - all loop implementations are written to use the same data.
button = ["Red", "Blue", "Green", "Black", "Open"]
color = ["red", "blue", "green", "black", "black"]
commands = [self.buttonRed, self.buttonBlue, self.buttonGreen,
self.buttonBlack, self.buttonOpen]
So here's the original piece of code:
g = gc.collect()
%%timeit
# Benchmark 1, the original code
for c in range(5):
f4.columnconfigure(c, weight=1)
Button(f4, text=button[c], fg=color[c], width=5,
command=commands[c]).grid(row=0, column=c, sticky=ALL)
100000 loops, best of 3: 4.45 µs per loop
You might suspect, as I did, that there are better ways to perform this loop.
The most obvious is simply to create a single list to iterate over, using unpacking assignment in the for loop to assign the individual elements to local variables. This certainly renders the loop body a little more readably. We do still need the column number, so we can use the enumerate()
function to provide it.
g = gc.collect()
%%timeit
for c, (btn, col, cmd) in enumerate(zip(button, color, commands)):
f4.columnconfigure(c, weight=1)
Button(f4, text=btn, fg=col, width=5, command=cmd). \
grid(row=0, column=c, sticky=ALL)
pass
100000 loops, best of 3: 4.26 µs per loop
Unfortunately any speed advantage appears insignificant. These timings aren't very repeatable under the conditions I have run them, so really any difference is lost in the noise - what you see depends on the results when this notebook was run (and therefore also on which computer), and it would be unwise of me to make any predictions about the conditions under which you read it.
We can avoid the use of enumerate()
by maintaining a loop counter, but from an esthetic point of view this is almost as bad (some would say worse) than iterating over the range of indices. In CPython it usually just comes out ahead, but at the cost of a certain amount of Pythonicity. It therefore makes the program a little less comprehensible.
g = gc.collect()
%%timeit
c = 0
for (btn, col, cmd) in zip(button, color, commands):
f4.columnconfigure(c, weight=1)
Button(f4, text=btn, fg=col, width=5, command=cmd). \
grid(row=0, column=c, sticky=ALL)
c += 1
pass
100000 loops, best of 3: 4.05 µs per loop
The next two cells repeat the same timings without the loop body, and this merely emphasises the speed gain of ditching the call to enumerate(). At this level of simplicity, though, it's difficult to tell how much optimization is taking place since the loop content is effectively null. I suspect PyPy would optimize this code out of existence. Who knows what CPython is actually measuring here.
g = gc.collect()
%%timeit
for c, (btn, col, cmd) in enumerate(zip(button, color, commands)):
pass
1000000 loops, best of 3: 1.18 µs per loop
g = gc.collect()
%%timeit
c = 0
for btn, col, cmd in zip(button, color, commands):
pass
c += 1
1000000 loops, best of 3: 854 ns per loop
Somewhat irritatingly, manual maintenance of an index variable appears to have a predictable slight edge over use of enumerate(), and naive programmers might therefore rush to convert all their code to this paradigm. Before they do so, though, they should consider that code's environment. In this particular example the whole piece of code is simply setup, executed once only at the start of the program execution as a GUI is being created. Optimization at this level would not therefore be a sensible step: to optimize you should look first at the code inside the most deeply-nested and oft-executed loops.
If the timed code were to be executed billions of times inside two levels of nesting then one might, in production, consider using such an optimization if (and hopefully only if) there were a real need to extract every last ounce of speed from the hardware. In this case, since the program uses a graphical user interface and so user delays will use orders of magnitude more time than actual computing, it would be unwise to reduce the readability of the code, for which reason I prefer the enumerate()-based solution.
With many loops the body's processing time is likely to dominate in real cases, however, and that again supports using enumerate(). If loop overhead accounts for 5% of each iteration and you reduce your loop control time by 30% you are still only reducing your total loop run time by 1.5%. So keep your program readable and Pythonically idiomatic.
* If you have a serious need for mock objects in testing, you really should look at the mock module, part of the standard library since Python 3.3. Thanks to Michael Foord for his valiant efforts. Please help him by not using mock in production.
** An interesting issue here. Originally I wrote the above code to create a new MockFrame object for each call to Button(), and I consistently saw the result of the second test as three orders of magnitude slower than the first (i.e. ms, not µs). It took me a while to understand why timeit was running so many iterations for such a long test, adding further to the elapsed time. It turned out the second test was paying the price of collecting the garbage from the first, and that without garbage collections in between runs the GC overhead would distort the timings.