January 6, 2015

PyData London, January 2015

[EDIT: Updated with additional background and slide links]

Tonight I'm attending the PyData London Meetup group for the first time.

The meeting began with a short "What's New in the Python World" from the joint organizers, who assumed the audience was mainly here for the invited presentations. That's true -- like many others I find it interesting to learn what other Pythonistas are doing with their Py (so to speak), but it's also nice to hear a potted summary of "events of significance."

I know from my own experiences that they (said organizers) will be DELIGHTED if you would offer any feedback at all. They are doing so much good work for nothing (and it can begin to feel like a thankless task) that we all owe it to them to pass on any suggestion that might help the group be even more effective. Similarly, if you would like to let me know anything about this, or future posts of this nature, just add your comment below. If you feel like making a critical comment at the same time, I am old enough and ugly enough to survive.

Frank Kelly was introduced to talk about "Changepoint Detection with Bayesian Inference". After graduation Frank was sentenced to investigate rock strata as part of an oil exploration project. One initial problem was the transmission of information from the drill head, hundreds of feet down in the rock. The team discovered that they could encode digital information by pulsing the highly viscous mud that filled the well. As you can imagine, mud is not a very good transmission medium, and Frank's final-year project was to use Bayesian statistics to decode the transmissions.

Frank's discussion of frequentist vs Bayesian methods was interesting, making it clear that the Bayesian approach is a fundamentally different technique. (Bayes was a non-conformist, and is buried about ten minutes away from the meeting space at Lyst Studios. I don't know whether this was intended as a warning.) Essentially, Bayesian methods take a fixed data set and use post-hoc processing to understand the experimental results. The data is assumed to be generated as model output plus noise, which for Frank's purposes could be characterized as Gaussian white noise. The signal is then recovered from the readings by integrating out the noise level and the other "nuisance parameters."
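
To make the idea concrete, here is a minimal sketch of step-change detection in Gaussian noise. This is my own illustration of the general technique, not Frank's code, and for simplicity it profiles the nuisance parameters (the unknown means on each side) rather than integrating them out: each candidate changepoint is scored by its residual sum of squares, which for Gaussian noise of fixed variance is equivalent to maximizing the likelihood.

    import numpy as np

    def detect_changepoint(data):
        """Return the index at which a single step change is most likely."""
        best_k, best_rss = None, np.inf
        for k in range(1, len(data) - 1):
            left, right = data[:k], data[k:]
            # Residual sum of squares with the best-fitting mean on each side.
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_k, best_rss = k, rss
        return best_k

    # Synthetic example: the mean jumps from 0.0 to 1.0 at index 300.
    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(0.0, 0.5, 300), rng.normal(1.0, 0.5, 200)])
    print(detect_changepoint(data))  # expect a value close to 300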

We saw a graphical demonstration of how this could be used to detect step changes in noisy data, and how, as the step in the data became smaller and smaller, the result of Frank's analytical function tended towards a smooth curve, with no detectable changepoints indicated by minima or maxima.

Frank then went on to discuss other applications of his technique related to Google search. There are pressures on Google to reveal the algorithms they use to produce their search results, but this is unlikely to happen as the algorithm represents Google's "secret sauce". Changes to Google's algorithms cause large fluctuations in results and sales for some companies.

Then we were treated to a view of an IPython Notebook, though the code wasn't very readable (remember that Command+ sequence, Mac presenters), but it was interesting to see a demonstration that a Google change had made a detectable difference to the traffic on sites. By presenting this kind of supporting evidence companies can at least give Google some facts about how the changes have affected them.

Frank then went on to talk about applications to tropical storms, which generally have wind speeds over 17 m/s (at 35 m/s a storm is classified as a hurricane). Data is available for the number of tropical storms per year since 1856, and Frank pointed out that there were spikes that correlated well with changes in the sea surface temperature.

Frank explained that he is now mixing Bayesian with frequentist techniques, and that he has done some informal research into correlations between the markets and external events such as Christmas and the Scottish independence referendum, with no really conclusive results (except that Christmas had more effect on the stock market than the referendum did). Overall a very interesting talk. Frank even promised that the published slides (now linked from the talk title above as well as here) will include a link to a "Matlab to Python" conversion cheat sheet. Well done!

In the Q and A session the first questioner asked whether Frank had benchmarked his methods against other change-detection algorithms. Frank felt this might be difficult because different algorithms produce different types of output. Next, someone asked whether Frank had tried any custom Bayesian analysis tools, but Frank had simply put his own code together. Next, someone asked whether the technology to communicate from the drill head was any better nowadays, and Frank said that things are now more sophisticated, but "you can't just stick a cable down the hole." A commenter pointed out that mud pulsing was still common five years ago, and asked whether it was easy to apply a moving window to the data. I'm not sure I understood the answer, but Frank seemed to think it could be used to produce online analysis tools that would allow a sliding window to be applied to time series data. Next someone asked whether the technique could cope with different noise spectra and distributions. The answer was that the noise levels had to be modelled, and sometimes the noise was different after the change from before. The talk ended with a robust round of applause.

There was then a break for beer and pizza, so don't blame me for any degradation of quality in the rest of this post. There are no guarantees. I met a few people while we were milling around, including a data scientist who had heard that the PyData meetups were friendly and relevant, and the business development manager of a company whose business is training scientists.

The second talk of the evening was "Learning to Read URLs: Finding word boundaries in multi-word domain names with Python and sklearn" by Calvin Giles.

Calvin started out with a confession that he isn't a computer scientist but a data scientist, so he was describing algorithms that worked to solve his problems rather than theoretically optimal solutions.

The basic problem: given a domain name such as powerwasherchicago.com, resolve it into words. The point is that if even this minimal amount of semantic information can be extracted, you can go beyond simple string comparisons and determine the themes that might be present in a relatively large collection of domain names. Calvin warned us that the code is the result of a single day's hacking project, based on Adthena's data set of ten million unique domain names and third-party data such as the Project Gutenberg texts and the Google ngram viewer data sets.

The process was to determine which of a number of possible sentences should be associated with the "words" found in a domain name. The first code snippet showed a way of extracting all the single words from the data (requiring at least two characters per word). A basic frequency threshold was required, since low-frequency words are not useful. The initial results were sufficiently interesting to move further research to the Google ngram dataset. This resulted in an English vocabulary of roughly 1,000,000 words. A random selection included "genéticos" and "lanzamantio", but Calvin was happy to allow anything with a frequency of more than 1,000 in his corpus.
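
The vocabulary-building step might look something like the following sketch. The function name and the tokenizing regular expression are my assumptions; only the two-character minimum and the frequency threshold come from the talk.

    import re
    from collections import Counter

    def build_vocabulary(corpus_text, min_count=1000):
        # Tokenize into lowercase words of at least two letters.
        words = re.findall(r"[a-z]{2,}", corpus_text.lower())
        counts = Counter(words)
        # Keep only words frequent enough to be useful.
        return {word for word, count in counts.items() if count >= min_count}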

Calvin then presented a neat algorithm to find the words that were present in a string. The list of potential words in powerwasherchicago.com seemed to have about fifty elements in it; catholiccommentaryonsacredscripture.com contained 101 words, including a two-letter word, making it a very interesting test case.
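
Finding every vocabulary word hidden in a domain name is a straightforward substring scan. This is my own reconstruction rather than Calvin's code; it records each word together with the span it occupies so that overlaps can be detected later:

    def find_words(domain, vocabulary):
        # Map each (start, end) span to the vocabulary word it contains.
        n = len(domain)
        return {(i, j): domain[i:j]
                for i in range(n)
                for j in range(i + 2, n + 1)
                if domain[i:j] in vocabulary}

    # find_words("powerwasherchicago", vocabulary) would pick up overlapping
    # candidates such as "power", "owe", "washer", "her" and "chicago".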

Calvin's algorithm to find the "most likely" set of words allows you to decide how likely the domain name is to occur given the potential words it could be made up from. Sadly, with n substrings of the data you can generate 2^n sentences, and some convincing calculations were used to demonstrate that this isn't a practical approach. The words in a genuine reading should be non-overlapping and contiguous, however, which limits the possibilities quite radically. It isn't easy to find all subsets of non-overlapping words, but this constraint is a major win in reducing the number of cases to be considered.
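
One way to exploit that constraint, sketched here under my own assumptions (reusing the hypothetical find_words vocabulary from above), is to enumerate only the readings that cover the whole name with contiguous words: recurse from the current position through every vocabulary word that starts there, so overlapping candidates can never be combined.

    def segmentations(domain, vocabulary, start=0):
        """Yield every reading of domain as contiguous, non-overlapping words."""
        if start == len(domain):
            yield []
            return
        for end in range(start + 2, len(domain) + 1):
            word = domain[start:end]
            if word in vocabulary:
                for rest in segmentations(domain, vocabulary, end):
                    yield [word] + rest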

Given the solution with the least overlap, Calvin then chose to omit all "sentences" that were less than 95% of the length of that solution. The code is all available on the slides, and I am hoping to be able to publish the URL (which Calvin said at the start of his talk would be available).

The get_sentences() wrapper function was only seven lines of code (ignoring the called functionality), and quite comprehensible. The demonstration of its result was extremely convincing, reducing some very large potential sets to manageable proportions. The average domain turned out to have around 10,000 possible interpretations. The domains with more than that number of candidate interpretations turned out to be the most interesting.

In order to produce a probability ordering for the possibilities, the algorithm prefers fewer, longer words to a larger number of shorter words. A four-line function gave that ordering - a nice demonstration of Python's high-level abilities. While this did not give perfect results, Calvin convinced me that his methods, while possibly over-trained on his sample domains (modulo some fuzzy word-inference techniques that help to identify one-character words in the domain), are useful and workable.
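
Calvin's actual four-line function is on his slides; a hypothetical equivalent might sort the candidate readings on word count, breaking ties in favour of readings dominated by longer words:

    def rank_readings(readings):
        # Fewer words first; among equal counts, prefer longer words
        # (a higher sum of squared lengths means less fragmentation).
        return sorted(readings,
                      key=lambda words: (len(words),
                                         -sum(len(w) ** 2 for w in words)))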

Calvin has now trained his system with the correct interpretations of the first hundred domain names. The visual representation of his latest results showed that even when the algorithm was "wrong" it was because the arbitrary ordering of equally-scored interpretations had put the correct answer further down the list.

He expects that on a set of 500 domains his system will currently give roughly two-thirds useful answers, with the remainder randomly wrong. A missing part of the model is that it assumes all sentences are equally likely; using Bayesian techniques one can instead estimate the "likelihood" of each generated sentence being the intended interpretation. A somewhat pragmatic approach to this problem, inspired by Peter Norvig's spell-checker blog post, used as much existing code as possible. The code

    guess("powerwasherchicago")[0]

returned a result of

    'power washer chicago'
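
I haven't seen the implementation, but a Norvig-style guess() might rank the candidate segmentations by the product of their unigram word probabilities, estimated from corpus counts. This sketch (my assumption, building on the hypothetical segmentations() above) shows the shape of the idea:

    from math import log

    def make_guesser(word_counts, vocabulary):
        total = sum(word_counts.values())
        def log_prob(words):
            # Sum of log probabilities; unseen words get a nominal count of 1.
            return sum(log(word_counts.get(w, 1) / total) for w in words)
        def guess(domain):
            candidates = segmentations(domain, vocabulary)
            ranked = sorted(candidates, key=log_prob, reverse=True)
            return [" ".join(words) for words in ranked]
        return guess

    # guess = make_guesser(word_counts, vocabulary)
    # guess("powerwasherchicago")[0]  ->  'power washer chicago'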

Calvin closed by suggesting that his initial training data set might have been too small, so he plans to use larger data sets and more sophisticated training methods, with a rather more rigorous evaluation of the ad hoc decision making incorporated in his existing code base. His list of desired enhancements made sense, usually the sign of a talk closely related to the speaker's real-world experience. I was impressed that he is also considering how to make his work more usable. Another great talk!

Someone opened the Q and A by asking whether the web site content couldn't be analysed algorithmically to provide more input to the parse. Calvin's reply reflected real-world experience suggesting this would not be easy: given that there are 40,000,000 domain names, he felt such a parser would be unlikely to be helpful. The final question asked whether finite-state transducers, as implemented in OpenFST (sadly written in C++), would be useful. Calvin replied that they might well be, and asked to talk to the questioner afterwards with a view to updating the slides.

To close the meeting Product Madness announced that they were hiring, and I am happy to say that nobody wanted me to answer any questions, despite the organizers putting me on the spot!

Thanks to Lyst for hosting. This was a most stimulating meeting, and I'll be attending this group again whenever I can.



