February 25, 2006

Implementation of the Python ByteCode Compiler

My primary interest in this talk was hearing Jeremy Hylton speak. Jeremy is a formidable intellect, and has made many fine contributions to the Python core. It turns out he is alsi an excellent (if somewhat rapid-fire) speaker.

He introduced the abstract syntax tree notation, showing how it contrasted with the previously-used concrete tree, then went on to look at compilation strategy.

The first pass builds a symbol table, and collects certain information (for example, if a function contains a yield then it's a generator). A second pass determines the scope of the names, with some unexpected differences between implcit and explicit global variables (whose treatment differs, for example, in the class namespace).

Free variables must be treated differently from locals, because their lifetime can extend beyond the activation record of a function call. There are many fine points to be considered in the handling of names, and some of Jeremy's examples were chosen to stress the syntax. Sadly they zipped by so fast you'll have to read the slides for the details, but you can consider his first example
class Spam:
id = id(1)
as typical.

Jeremy's treatment of code generation was concise (he clearly wanted to cover a lot of ground, so he wasn't hanging about), but I did at least understand that the output of an AST tree transformation was a set of basic blocks, pointers to which would be jump targets. Then the code generator iterates over the contents of each block, emitting the bytecodes for each AST type encountered, essentially as a visiting tree walk.

An assembler then linearizes the resulting code blocks, computing the required stack space and the required entries in a line number table (whose construction is horrendously complicated). Jump offsets can be deduced at this point, and then PyCode_New() is called to create the object. Finally the peephole optimizer cherry-picks the resulting code, performing constant folding and simplifying certain jump sequences down to a single jump.

The AST branch has simplified several aspects of the compiler implementation, but it also has potential value in other areas. If the AST code could be exposed to Python applications, for example, tools like pychecker could use the tree rather than the source. The application interface has not yet been determined, and since the core currently contains code to handle the concrete syntax tree there will be library ramifications, especially in the compiler package.

Q: If an application uses the AST, do the nodes make the original program text available?
A: For names, yes, but for example floating point objects have already been converted. Statements and expressions refer to their line numbers.

Q: Can I reproduce the source of a program from the AST?
A: That's something that we need to work on to support applications like a refactoring tool, for which the current implementation would be insufficient.

Q: LOAD_NAME only appears to be used in exec and class. Is it used anywhere else?
A: I don't know. [Compiles code and disassembles it]. Ah, yes!

Q: Has anyone considered [inaudible mumble].
A: I don't think so, but that would be a more complicated but more traditional way to do it.

Q: Can you talk about performance a little?
A: Yes. Thanks to Neal Norwitz, Neal Schemenauer, Brett Cannon and Tim Peters the new compiler is marginally faster than the old. Our primary goal was to write comprehensible code, however, so there may well be further gains. For example the peephole optimizer should really be run before the assembler.

Not really sure whether this is any use to someone who wasn't there!


Marius Gedminas said...

I was not there, but I found this writeup interesting nevertheless.

Steve said...

Thanks for letting me know. It's nice to feel that these words aren't just evaporating into the blogosphere.

Simon Brunning said...

Oh no - we are reading.

Rod Senra said...

I'm reading this from Campinas-SP/Brazil. Eventhough I could attend Europython 2005, PyCon2006 was unfortunately out of reach.
Therefore, I do appreciate this notes of yours. Thank you very much.