For Some Value of "Magic": UK

[EDIT: Updates talk with additional background and slide links]

Tonight I'm attending the PyData London Meetup group for the first time.

The meeting began with a short "What's New in the Python World," from the joint organizers, who assumed the audience was mainly here for the invited presentations. That's true -- like many others I find it interesting to learn what other Pythonistas are doing with their Py (so to speak), but it's also nice to hear a potted summary of "events of significance."

I know from my own experiences that they (said organizers) will be DELIGHTED if you would offer any feedback at all. They are doing so much good work for nothing (and it can begin to feel like a thankless task) that we all owe it to them to pass on any suggestion that might help the group be even more effective. Similarly, if you would like to let me know anything about this, or future posts of this nature, just add your comment below. If you feel like making a critical comment at the same time, I am old enough and ugly enough to survive.

Frank Kelly was introduced to talk about "Changepoint Detection with Bayesian Inference". After graduation Frank was sentenced to investigate rock strata produced as part of an oil exploration project. One initial problem was the transmission of information from the drill head (hundreds of feet down in the rock). They discovered that they could encode digital information by pulsing the highly viscous mud that lived in the well. As you can imagine, mud is not a very good transmission medium, and Frank's final-year project was to use Bayesian statistics to decode the transmissions.

Frank's discussion of frequentist vs Bayesian methods was interesting. It revealed that Bayesian methods take a fundamentally different approach. Bayes was a non-conformist who is buried about ten minutes away from the meeting space at Lyst Studios. I don't know whether this was intended as a warning. Essentially, Bayesian methods use fixed data sets and try to use post-hoc processing to understand the experimental results. Data is assumed to be generated as model data plus noise, which for Frank's purposes they could characterize as Gaussian white noise. Essentially the signal is the reading minus the noise, and it is recovered by integrating with respect to the "nuisance parameters."

We saw a graphical demonstration of how this could be used to detect thresholds in noisy data, and how as the data tended to a lower and lower threshold the result of Frank's analytical function tended towards a smooth curve, with no detectable thresholds indicated by minima or maxima.
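For readers who want something concrete, here is a minimal sketch of the general idea. It is not Frank's code. It assumes a single change in mean, Gaussian white noise with a known standard deviation, flat priors on the two segment means (the "nuisance parameters", integrated out analytically) and a uniform prior on where the change happens. The function names and the toy series are my own inventions.

```python
import math
import random


def changepoint_log_posterior(data, sigma):
    """Log posterior (up to a constant) of a single change in mean at index k.

    Assumptions for this sketch: known noise level sigma, flat priors on
    both segment means, uniform prior on the changepoint position.
    """
    n = len(data)
    # Prefix sums give each segment's sum of squared deviations in O(1).
    prefix = [0.0]
    prefix_sq = [0.0]
    for x in data:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    def sum_sq_dev(start, stop):
        count = stop - start
        total = prefix[stop] - prefix[start]
        total_sq = prefix_sq[stop] - prefix_sq[start]
        return total_sq - total * total / count

    log_post = {}
    for k in range(1, n):  # data[:k] is "before", data[k:] is "after"
        residual = sum_sq_dev(0, k) + sum_sq_dev(k, n)
        # Integrating out each segment mean leaves a 1/sqrt(length) factor.
        log_post[k] = (-0.5 * math.log(k) - 0.5 * math.log(n - k)
                       - residual / (2.0 * sigma ** 2))
    return log_post


def most_probable_changepoint(data, sigma):
    """Return the index k with the highest posterior probability."""
    log_post = changepoint_log_posterior(data, sigma)
    return max(log_post, key=log_post.get)


# Toy series: the mean jumps from 0 to 3 after 60 samples.
random.seed(0)
series = ([random.gauss(0.0, 1.0) for _ in range(60)]
          + [random.gauss(3.0, 1.0) for _ in range(40)])
best = most_probable_changepoint(series, sigma=1.0)
print("most probable changepoint index:", best)
```

Shrinking the jump in the toy series towards zero flattens the posterior, which is at least consistent with the smooth, featureless curve we were shown for small thresholds.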

Frank then went on to discuss other applications of his technique related to Google search. There are pressures on Google to reveal the algorithms they use to produce their search results, but this is unlikely to happen as the algorithm represents Google's "secret sauce". Changes to Google's algorithms cause large fluctuations in results and sales for some companies.

Then we were treated to a view of an IPython Notebook, though the code wasn't very readable (remember that Command+ sequence, Mac presenters), but it was interesting to see a demonstration that a Google change had made a detectable difference to the traffic on sites. By presenting this kind of supporting evidence companies can at least give Google some facts about how the changes have affected them.

Frank then went on to talk about applications to tropical storms, which generally have wind speeds over 17 m/s (35 m/s is classified as a hurricane). Data is available for the number of tropical storms per year since 1856, and Frank pointed out that there were spikes that correlated well with changes in the surface ocean temperature.

Frank explained that he is now mixing Bayesian with frequentist techniques, and that he has done some informal research into correlations between external events such as Christmas and the Scottish independence referendum, with no really conclusive results (except that Christmas had more effect on the stock market than the referendum did). Overall a very interesting talk. Frank even promised that the published slides (now linked from the talk title above as well as here) will include a link to a "Matlab to Python" conversion cheat sheet. Well done!

In the Q and A session the first questioner asked whether Frank had benchmarked his methods against other change-detection algorithms. Frank felt this might be difficult because different algorithms produce different types of output. Next, someone asked whether Frank had tried any custom Bayesian analysis tools, but Frank had simply put his own code together. Next, someone asked whether the technology to communicate from the drill head was any better nowadays, and Frank said that things are now more sophisticated, but "you can't just stick a cable down the hole." A commenter pointed out that mud pulsing was still common five years ago, and asked whether it was easy to apply a moving window to the data. I'm not sure I understood the answer, but Frank seemed to think it could be used to produce online analysis tools that would allow a sliding window to be applied to time series data. Next someone asked whether the technique could cope with different noise spectra/distributions. The answer was that the noise levels had to be modelled, and sometimes the noise was different after the change from before. The talk ended with a robust round of applause.

There was then a break for beer and pizza, so don't blame me for any degradation of quality in the rest of this post. There are no guarantees. I met a few people while we were milling around, including a data scientist who had heard that the PyData meetups were friendly and relevant, and the business development manager of a company whose business is training scientists.

The second talk of the evening was "Learning to Read URLs: Finding word boundaries in multi-word domain names with Python and sklearn" by Calvin Giles.

Calvin started out with a confession that he isn't a computer scientist but a data scientist, so he was describing algorithms that worked to solve his problems rather than theoretically optimal solutions.

The basic problem: given a domain name such as powerwasherchicago.com, resolve it into words. The point is, if this minimal amount of semantic information can be extracted, you can avoid simple string comparisons and determine the themes that might be present in a relatively large collection of domain names. Calvin warned us that the code is the result of a single day's hacking project, based on Adthena's data of ten million unique domain names and third party data such as the Gutenberg project and the Google ngram viewer data sets.

The process was to determine which of a number of possible sentences should be associated with the "words" found in a domain name. The first code snippet showed a way of extracting all the single words from the data (requiring at least two characters per word). A basic threshold detection was required, since low frequency words are not useful. The initial results were sufficiently interesting to move further research to the Google ngram dataset. This resulted in an English vocabulary of roughly 1,000,000 words. A random selection included "genéticos" and "lanzamantio", but Calvin was happy to allow anything that had a frequency of more than 1,000 in his corpus.

Calvin then presented a neat algorithm to find the words that were present in a string. The list of potential words in powerwasherchicago.com seemed to have about fifty elements in it. catholiccommentaryonsacredscripture.com had 101 words in it, including a two-letter word, making it a very interesting test case.
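I didn't capture Calvin's code, so what follows is only my own sketch of one obvious way to list every vocabulary word hiding in a string: try each substring of at least two characters and keep the ones the vocabulary knows. The function name find_words and the toy vocabulary are hypothetical.

```python
def find_words(text, vocabulary, min_length=2):
    """Return (start, end, word) for every substring of text in vocabulary."""
    found = []
    for start in range(len(text)):
        for end in range(start + min_length, len(text) + 1):
            candidate = text[start:end]
            if candidate in vocabulary:
                found.append((start, end, candidate))
    return found


# A toy vocabulary; the real one had roughly a million entries.
toy_vocabulary = {"power", "washer", "chicago", "was", "wash", "her",
                  "ash", "he", "go", "ago", "chic", "owe", "pow"}
matches = find_words("powerwasherchicago", toy_vocabulary)
for start, end, word in matches:
    print(start, end, word)
```

Even this tiny vocabulary turns up thirteen overlapping candidates, which makes it easy to believe that a real vocabulary finds around fifty.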

Calvin's algorithm to find the "most likely" set of words allows you to decide how likely the domain name is to occur given the potential words it could be made up from. Sadly with n substrings of the data you can generate 2^n sentences. Some convincing calculations were used to demonstrate that this wasn't a practical approach. The words should be non-overlapping and contiguous, however, which limits the possibilities quite radically. It isn't easy to find all subsets of non-overlapping words, but the restriction is a major win in reducing the number of cases to be considered.
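To put a number on the impracticality: with fifty candidate words, 2**50 is about 1.1e15 subsets. The sketch below shows how much the non-overlapping, contiguous restriction helps. It is my own simplification, not Calvin's code: it only accepts splits that cover the whole string, whereas his approach tolerated partial coverage. The name segmentations and the toy vocabulary are hypothetical.

```python
from functools import lru_cache


def segmentations(text, vocabulary, min_length=2):
    """List every way to split text into contiguous vocabulary words."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Each result is a tuple of words covering text[start:].
        if start == len(text):
            return [()]
        results = []
        for end in range(start + min_length, len(text) + 1):
            word = text[start:end]
            if word in vocabulary:
                for rest in split_from(end):
                    results.append((word,) + rest)
        return results

    return [" ".join(words) for words in split_from(0)]


toy_vocabulary = {"power", "washer", "wash", "er", "chicago"}
print(segmentations("powerwasherchicago", toy_vocabulary))
```

Memoizing on the start position means each suffix of the string is only split once, however many prefixes lead to it.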

Given the solution with least overlap, Calvin then chose to omit all "sentences" that were less than 95% of the length of the least-overlap solution. The code is all available on the slides, and I am hoping to be able to publish the URL (which Calvin said at the start of his talk would be available).

The get_sentences() wrapper function was only seven lines of code (ignoring the called functionality), and quite comprehensible. The demonstration of its result was extremely convincing, reducing some very large potential sets to manageable proportions. The average domain turned out to have around 10,000 possible interpretations. The domains with more than that number of candidate interpretations turned out to be the most interesting.

In order to produce a probability ordering for the possibilities, the algorithm prefers fewer longer words to a larger number of shorter words. A four-line function gave that ordering - a nice demonstration of Python's high-level abilities. While this did not give perfect results, Calvin convinced me that his methods, while possibly over-trained on his sample domains (modulo some fuzzy word-inference techniques that help to identify one-character words in the domain) are useful and workable.
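I can't reproduce Calvin's four lines from memory, but a sketch of the same preference might look like this. The scoring rule (fewest words first, then the longest single word as a tie-breaker) is my assumption, as is the name rank_sentences, and it expects non-empty sentences.

```python
def rank_sentences(sentences):
    """Order candidates so those with fewer, longer words come first."""
    def score(sentence):
        words = sentence.split()
        return (len(words), -max(len(word) for word in words))
    return sorted(sentences, key=score)


candidates = ["power wash er chicago",
              "pow er wash er chic ago",
              "power washer chicago"]
print(rank_sentences(candidates)[0])
```

Python's sort is stable, so candidates with equal scores simply stay in the order they arrived.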

Calvin has now trained his system to understand which was the correct interpretation for the first hundred domain names. The visual representation of Calvin's latest results showed that even when the algorithm was "wrong" it was because the arbitrary ordering of equally-assessed domains had put the correct answer further down the list.

He expects that in a set of 500 domains his system will currently give useful answers for roughly two-thirds, with the remainder randomly wrong. What the model is missing is any notion of how likely each sentence is: it assumes that all sentences are equally likely. Using Bayesian techniques one can determine the "likelihood" of the generated sentences being the intended interpretation. A somewhat pragmatic approach to this problem, inspired by Peter Norvig's spell-checker blog post, used as much existing code as possible. The code

guess("powerwasherchicago")[0]

returned a result of

'power washer chicago'
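Calvin's guess() itself is on his slides. Purely as an illustration of the idea, here is a toy stand-in that ranks candidate sentences by the product of their word probabilities, treating words as independent. The word counts are invented, and everything except the name guess is my own assumption.

```python
import math

# Invented counts standing in for a real word-frequency corpus.
TOY_COUNTS = {"power": 900, "washer": 40, "wash": 120, "er": 15,
              "chicago": 600, "chic": 30, "ago": 400}
TOTAL = sum(TOY_COUNTS.values())


def candidate_sentences(text):
    """Every split of text into known words, as lists of words."""
    if not text:
        return [[]]
    sentences = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in TOY_COUNTS:
            for rest in candidate_sentences(text[end:]):
                sentences.append([head] + rest)
    return sentences


def log_probability(words):
    """Sum of log word probabilities (words treated as independent)."""
    return sum(math.log(TOY_COUNTS[word] / TOTAL) for word in words)


def guess(text):
    """Candidate sentences for text, most probable first."""
    ranked = sorted(candidate_sentences(text), key=log_probability,
                    reverse=True)
    return [" ".join(words) for words in ranked]


print(guess("powerwasherchicago")[0])  # power washer chicago
```

Because every word probability is below one, each extra word multiplies the score down, so fewer, more common words tend to win without any special-case rule.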

Calvin closed by suggesting that his initial training data set might have been too small, so he plans to use larger data sets and more sophisticated training methods with a rather more rigorous evaluation of the ad hoc decision making incorporated in his existing code base. His list of desired enhancements made sense, usually the sign of a talk that's related closely to the author's real-world experience. I was impressed that he is also considering how to make his work more usable. Another great talk!

Someone opened the Q and A by asking whether the web site content couldn't be analysed algorithmically to provide more input to the parse. Calvin's reply reflected real-world experience that suggested it would not be easy to do this. Given there are 40,000,000 domain names, Calvin felt such a parser would be unlikely to be helpful. The final question asked whether finite-state transducers, as implemented in OpenFST (sadly written in C++) would be useful. Calvin replied that it might well be, and asked to talk to the questioner with a view to updating the slides.

To close the meeting Product Madness announced that they were hiring, and I am happy to say that nobody wanted me to answer any questions, despite the organizers putting me on the spot!

Thanks to Lyst for hosting. This was a most stimulating meeting, and I'll be attending this group again whenever I can.

Regular readers (both?) know that I live in the USA but hail from the UK, where I still return regularly to stay in touch with family and friends. So to dip a toe in the UK training waters I booked a room (rather a nice one, as those who have attended other events I organized will not be surprised to learn) and put out an announcement that I was running a Python Programming Skills Lab. You could have booked a place here. Of course if this were the movies, the spirit of “if you build it they will come” would rule and I could welcome those wishing to drink at the Python fount. But it isn't.

So

a) I am not a marketing genius;
b) this is not the movies.

Result: I am now planning to spend the day in said reserved room, working on my current open source project, whose sadly neglected repository can be found in a weed-infested corner of GitHub. This is the code I used as a script for the videos which (if you are reading my blog) you should see advertised up at top right. The code is presented as IPython Notebooks, and only some of them are properly documented in interspersed Markdown cells, so if I end up spending the day alone then at least some sensible purpose will have been served.

With that in mind the Skills Lab is now metamorphosing into an open Python Training Day. I understand this is very short notice (I'm not very good at operations either, go figure), but if you have any interest in either receiving Python training or having such training presented to a target group I would be delighted to talk with you on Tuesday December 9, 2014 at any time between 10:00 am and 4:30 pm at the Church House Conference Centre in Westminster.

If you would like to join me for lunch then please sign up (there are 12 places for lunch) and arrive no later than 12:30. If you simply want to drop in and say hello (or, better still, help work on the codebase) then feel free to arrive at any time during the advertised hours and stay as long as you like (but please do sign up for lunch if you plan to eat). It would be helpful if you reserved a drop-in place even if you aren't coming to eat. Those with an interest in beer might want to arrive towards the end of the proceedings so a post-event excursion to a licensed hostelry can be entertained if appropriate.

The room reserved for this event is absolutely delightful, with a view across the Dean's Yard. I welcome all Python friends and anyone who would like to get to know me professionally to visit at some convenient time during that day. But do remember to reserve a place if you want lunch!

I have had very little time to work on the GitHub repository since the videos were shot, and yearn for an army of Internet interns to help me improve what I believe are useful open-source training materials. In the absence of visitors, any spare time I have that day will go to improving the Notebooks, so at the very least the curated Notebook collection will become more useful.

January 6, 2015

PyData London, January 2015

December 9, 2014

Python Training Day Feedback

December 2, 2014

UK Python Training Open Day and Lunch