April 28, 2013

Why Tuples?

A frequent question from new and even not-so-new Python programmers is "why does the language have both tuples (which, if you know Python, you will recall are immutable) and lists?" You might almost say there are two kinds of Python programmers, those who know what tuples are for and those whose mathematical education has been limited. I know this sounds like an awfully snobbish thing to say, but it isn't meant that way. The fact is that I learned most of my programming in eight years of working experience before I started my degree studies, and I am therefore of the very definite opinion that people with little mathematical background can be excellent programmers.

It's just that I myself only came across the word tuple when I started studying college-level mathematics and had to come to terms with things like
An NFA is represented formally by a 5-tuple, (Q, Σ, Δ, q0, F), consisting of
  • a finite set of states Q
  • a finite set of input symbols Σ
  • a transition relation Δ : Q × Σ → P(Q).
  • an initial (or start) state q0 ∈ Q
  • a set of states F distinguished as accepting (or final) states, F ⊆ Q
Now this has a very definite meaning. It tells us that an (ordered - i.e. positionally-identified) set of five things is sufficient to define the behavior of a specific type of object (in this case a non-deterministic finite state automaton (NFA), though for all you need to know about those I might just as well be describing a flux capacitor).

If we wanted to encapsulate an NFA as a Python data structure, then we might at some point write in our Python code something like

    nfa = (states, symbols, transition, initial, accepting_states)

though in actual fact you would be more likely to want to incorporate behavior in a class and so write instead

    nfa = NFA(states, symbols, transition, initial, accepting_states)

Now even though you may not know what an NFA is, you will surely perceive that the set of its possible states is a very different thing from the function that determines how a current state and a set of inputs are mapped into new states. So there is simply no way that it would be meaningful to write

    for thing in (states, symbols, transition, initial, accepting_states):
        do_something_with(thing)

unless you were, for example, trying to save its state during a pickling operation or some more obscure book-keeping metacode.

And that, best beloved, is what tuples are for: they are ordered collections of objects, and each of the objects has, according to its position, a specific meaning (sometimes referred to as its semantics). If no behaviors are required then a tuple is "about the simplest thing that could work."

This is why you often hear people informally say tuples are for "collections of things you don't need to iterate over" or "tuples are for sequences of dissimilar objects". My advice would be to stay away from such discussions. One might reasonably argue that including them in the language discouraged people from using objects with named attributes, which are always easier to deal with.

The problem with the tuple is that once we have constructed our NFA the only way to refer to the states in code is with the rather unedifying expression

    nfa[0]

which doesn't actually tell the reader much about the programmer's intention. The sequence nature of the tuple means that our accesses to the elements are difficult to imbue with meaning in our code. This latterly became such an obvious shortcoming that it prompted Raymond Hettinger to create the namedtuple object that allows you to easily specify a tuple whose elements can also be referred to by name.
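
For instance (a sketch only, with made-up placeholder values standing in for the NFA's components), the namedtuple version of the example above could be declared and then accessed by name like this:

    from collections import namedtuple

    NFA = namedtuple('NFA', ['states', 'symbols', 'transition',
                             'initial', 'accepting_states'])

    # Placeholder values, purely to make the sketch self-contained
    nfa = NFA(states={'q0', 'q1'}, symbols={'a', 'b'},
              transition={('q0', 'a'): {'q1'}}, initial='q0',
              accepting_states={'q1'})

    nfa.states                   # readable access by name
    nfa[0]                       # positional access still works
    assert nfa.states is nfa[0]  # both refer to the same element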

It would be interesting to see whether users of namedtuple objects actually use them as tuples at all. I would guess that the sequence behaviors are rarely used, in which case perhaps it's time to either remove namedtuple's __getitem__() and similar methods or implement a similar object without the sequence behaviors.

26 comments:

Unknown said...

Ah, but you miss the other way to use classic tuples:

states, symbols, transition, initial, accepting_states = nfa

There's a reason that construct is often referred to as tuple unpacking :)

In that model, the tuple is just a convenient way to pass a collection of values around, which you then unpack into individual elements when you want to make use of it.

Where namedtuples come in is that sometimes you're only interested in one or two elements from the tuple - with a namedtuple, it becomes more practical to reference those *without* needing to unpack it.
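
For example (a small sketch with a hypothetical Point namedtuple, not anything from the post above):

    from collections import namedtuple

    Point = namedtuple('Point', 'x y z')
    p = Point(1.0, 2.0, 3.0)

    # classic tuple style: unpack everything, even the fields you don't need
    x, y, z = p

    # namedtuple style: refer to just the field you care about
    print(p.y)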

Steve said...

Sure, I wasn't suggesting that classic tuples should be discarded - simply that I doubt named tuples are indexed very much.

toddrme2178 said...

I get that much.

But now that the conversation has veered off into named tuples, what I don't really understand is the advantage of named tuples over dicts (or at least ordereddicts).

Steve said...

To access element joe from a dict you must write


x = d["joe"]

whereas to access the same element in a named tuple you would write

v = d.joe

When the names are known this is a much simpler way to write the equivalent access. And, as Nick pointed out, you can't use unpacking assignments with a dict.

toddrme2178 said...

Yes, the syntax is a little shorter. On the other hand it muddles the distinction between keys and class attributes.

And you can do unpacking assignment with an ordereddict using .values().

Steve said...

Perhaps you can, but you cannot predict the order in which the values will be produced, so it isn't very useful for an unpacking assignment, is it?

toddrme2178 said...

You can for ordereddict.
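
(For the record, a minimal sketch of what that looks like; the keys and values here are made up:)

    from collections import OrderedDict

    d = OrderedDict([('states', 'Q'), ('symbols', 'S')])

    # an OrderedDict yields its values in insertion order,
    # so this unpacking assignment is predictable
    states, symbols = d.values()
    print(states, symbols)   # Q S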

claudio canepa said...

"...perhaps it's time to either remove namedtuple's __getitem__() and similar methods or implement a similar object without the sequence behaviors"

What if I want to use some generic code for "trying to save its state during a pickling operation or some more obscure book-keeping metacode"?

Samus_ said...

tuples are cheaper in terms of memory, that's the only reason they exist.

everything that can be done with a tuple can (and usually is) done with a list.

in Haskell a 2-tuple is different from a 3-tuple and (str, int) is different from (int, str), but Python won't bother; the wrong tuple will fail the same way as the wrong list.

the only thing tuples can do that lists can't is be hashed, but I doubt a tuple was ever used as a key to anything.

Steve said...

@toddrme2178: Your point being?

@claudio: Good point.

@Samus_: You made that up, didn't you? Tuples are used as dict keys all the time.

Anonymous said...

We use TCP/IP five-tuples as keys in a dictionary of connection-specific data.
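
(A sketch of that pattern, with made-up addresses and ports:)

    # protocol, source address, source port, destination address, destination port
    connections = {}
    key = ('tcp', '10.0.0.1', 51234, '192.0.2.7', 443)
    connections[key] = {'bytes_sent': 0, 'bytes_received': 0}

    print(connections[key])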

Laurence Rowe said...

I wish Python had an equivalent to Javascript's Object.freeze so that tuples wouldn't see so much abuse as frozen lists.

PEP-351, "The freeze protocol" was rejected, but that dealt with creating a hashable copy rather than simply disabling further mutation of an existing object.

Johan Nestaas said...

@toddrme2178, @Steve:
One main reason for using namedtuples over dicts is for performance and to reduce memory usage, as well as being immutable.

dynamic dicts can really hurt performance if you're instantiating a lot at a time, and it's wasteful if you can do the same exact thing with a namedtuple, especially when immutability makes sense for the object.

Consider a namedtuple that represents a specific sequence of bytes for a file. You might want to have filepath, memory_address, sequence. You aren't editing the files, so you just want to scrape through a filesystem and figure out which files have that sequence, therefore you don't want to change the values ever. If you might find 20 million instances of that sequence, you can save a lot of RAM by using a named tuple, and it represents an object which can and should be immutable.

I've seen this sort of thing in practice; namedtuples make the most sense, and they can really improve performance if you're doing this at a large scale.
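
(A quick way to see the size difference for yourself - a sketch only, and the exact numbers vary by Python version; the field names follow Johan's example:)

    import sys
    from collections import namedtuple

    Hit = namedtuple('Hit', 'filepath memory_address sequence')

    as_namedtuple = Hit('/tmp/example.bin', 0x1000, b'\xde\xad\xbe\xef')
    as_dict = {'filepath': '/tmp/example.bin',
               'memory_address': 0x1000,
               'sequence': b'\xde\xad\xbe\xef'}

    # getsizeof measures the containers themselves, not the objects they reference:
    # the namedtuple is a fixed-size array of references, the dict also carries a hash table
    print(sys.getsizeof(as_namedtuple), sys.getsizeof(as_dict))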

Steve said...

@Johan: Yes, I agree, the immutability and speed are important reasons for choosing a named tuple. The dict is far too general a data structure to be efficient in applications like those you mention.

Anonymous said...

# It is possible to subclass a namedtuple:

from collections import namedtuple
fields = ('states', 'symbols', 'transition', 'initial', 'accepting_states')
NFA = namedtuple('NFA', fields)
_nfa = NFA(*fields)
_nfa._fields, _nfa._asdict()

class NFAWithMethods(NFA):
    def print_symbols(self):
        print(self.symbols)

_nfa = NFAWithMethods(*fields)
_nfa.print_symbols()

Steve said...

Yes, one could, but to do so would be a perversion, building, as it does, one unsatisfactory layer on top of another.

Fortunately Raymond knows where to stop, and so we have namedtuple instead. I was wondering whether a slimmed-down version (something like the bunch class) might be useful, not considering adding more cruft!

This does neatly demonstrate the Python geek tendency to focus on ever more peripheral questions. Let's try and resist that, shall we?

Samus_ said...

mydict[('some', 'random', 'stuff')] = 'really?'

no seriously, I haven't seen this anywhere and it's a terrible idea imho.

Anonymous said...

I unpack from struct.unpack into a namedtuple. A tuple subclass with property accessors is perfect for my needs. Don't mess with namedtuple...

There's the built-in namespace object in version 3.3, i.e. types.SimpleNamespace. In 3.3 it's used by sys.implementation and time.get_clock_info(). Basically, it's like a heap type defined with __slots__ = ('__dict__',), so no weak references.
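
(For readers who haven't met either of those, a tiny sketch of both, using a made-up binary record format:)

    import struct
    import types
    from collections import namedtuple

    # unpack a fixed-layout binary record straight into a namedtuple
    Header = namedtuple('Header', 'version flags length')
    header = Header(*struct.unpack('<BBH', b'\x02\x01\x10\x00'))
    print(header.version, header.flags, header.length)   # 2 1 16

    # SimpleNamespace: named attributes, mutable, and no sequence behaviour at all
    ns = types.SimpleNamespace(version=2, flags=1, length=16)
    ns.length = 32
    print(ns)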

Steve said...

@Samus: Leaving aside the unnecessary parentheses, consider for example a memoizing function that remembers its arguments (a, b, c) and caches the result to avoid duplicating expensive computations. The natural way to store previously-seen arguments is as a dict, keyed by the tuple of argument values.

The basic logic (and sorry for Blogger's horrendous code mangling, by the way) would look like:


stash = {}

def f(a, b, c):
    if (a, b, c) in stash:
        return stash[a, b, c]
    else:
        result = ...  # long expensive computation
        stash[a, b, c] = result
        return result


In case this too gets mangled the source is stored as a Memoizing function skeleton on Pastebin with permanent retention. [Note: there's even a mistake on Pastebin, where the final return statement is indented one level too many]. If you can think of a more natural way to do this I will be interested to hear it.

Once again, I make the point that this was not the issue I started out to discuss. So while it's interesting it should be regarded as a "rathole" tending towards derailing discussion of the point I was originally trying to make. Sure, you are interested in it, but why do you imagine I should be? Would silence have been an acceptable response?

I was really just asking whether it might be a good idea to have a standard library equivalent of the fairly well-known "bunch" pattern. I believe (though would be happy to be corrected) that namedtuple is currently the nearest we have.

For extra points the interested reader should demonstrate a memoizing function that also accepts keyword arguments.
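
(One possible answer, offered purely as a sketch: fold the keyword arguments into the key as a frozenset of their items, which is hashable as long as the argument values are.)

    stash = {}

    def f(a, b, c, **kwargs):
        key = (a, b, c, frozenset(kwargs.items()))
        if key in stash:
            return stash[key]
        result = ...  # long expensive computation
        stash[key] = result
        return result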

Steve said...

@anonymous: that sounds reasonable. Thanks for the example.

Andrew Dalke said...

The classic use-case which tests any guideline about tuples is the class __bases__ attribute. It returns a tuple. Generally speaking, it needs to be an immutable ordered collection of class-like objects, where the length can be different for different classes. Python's tuple() is a perfect fit for this, and no other container comes close.

This case breaks most tuple guidelines: There is no meaning to a position beyond its relative ordering. Iterating over the base classes does make sense for some cases, such as building an inheritance diagram. It's not used as a dictionary key. Its members are homogeneous.

FWIW, my biggest gripe about using namedtuples as light-weight immutable classes is the lingering ability to use indexing. Namedtuples are great as a backwards-compatible transition from tuples to class-like objects, which was needed for os.stat() and time.gmtime(). But in my own code I want the ability to transition from a namedtuple to a class, without worrying that users of my API might have been using the getitem interface that I didn't want to implement. While people might also use the _asdict and _replace methods, it's a big temptation in inner loops to use obj[1] instead of obj.green (as with an RGB color class) because index lookup in a namedtuple is twice as fast as property lookup.
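
(A tiny sketch of the __bases__ case Andrew describes:)

    class A: pass
    class B: pass
    class C(A, B): pass

    print(type(C.__bases__))   # <class 'tuple'> - just a plain tuple

    # iterating over the bases is perfectly natural here
    for base in C.__bases__:
        print(base.__name__)   # A, then B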

boothead said...

And of course in Haskell they're called product types. So given a tuple of (A, B) you can think of it as an element of the Cartesian product of the set of A's and the set of B's.

Steve said...

I believe I'd rather know how it behaved than have to think of it like that. From the outside Haskell seems disciplined and difficult, but people I respect in the Python world speak highly of it. I tend to think of it as the bastard child of ML and LISP. Ignorance is Bliss.

Steve said...

@Andrew: Well, I believe we are largely in agreement, but there is the valid point that unpacking assignments to the elements is a (somewhat) desirable feature. Does unpacking assignment rely on __getitem__(), or does it use __next__() and the iteration protocol?
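
(A quick experiment suggests it is the iteration protocol that matters - a sketch, not a definitive answer: an object that defines only __iter__(), and no __getitem__(), can still be unpacked.)

    class OnlyIter:
        def __iter__(self):
            return iter((1, 2, 3))

    a, b, c = OnlyIter()   # works: unpacking drives the iteration protocol
    print(a, b, c)         # 1 2 3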