Sunday, June 28, 2015

Building a Unicode .format Nanny with Python's AST

Detecting problematic .format calls quickly and painlessly using Python's powerful ast module. Follow along by downloading the Jupyter notebook of this post!

The issue: where I work, we are still a Python 2 shop owing to technical debt and legacy dependencies. Since Python 2's default string type is a byte sequence, painful encoding-related problems surface in our applications from time to time. A fairly typical example runs something like this: a developer writes some code to format data into a string, using the convenient and powerful .format method supported by all string objects:

In [1]:
def pretty_format(some_data):
    return 'Hello, {}!'.format(some_data)

Through the course of our exhaustive testing, we prove that this function is correct over a wide range of inputs:

In [2]:
print pretty_format('world')
Hello, world!

The code ships. Months pass without incident, our pretty_format routine prettily formatting every bit of data thrown its way. Lulled into complacency through enjoyment of our apparent success, we move on to other tasks. One day, everything comes to a screeching halt as, completely unprepared, we receive one of the most dreaded error messages in all of software development:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)

What happened? Why did code that worked flawlessly for months suddenly detonate, without warning, without mercy?

Much of the data that flows through this format template and others like it is simple ASCII-valued information: dates, addresses, phone numbers, and the like. Having used Python 2 for many years, we are habituated to spell strings, including our template formatting strings, using the simple single quote

'a typical string'

What happens, though, when our user data contains an accented character or other non-ASCII symbol?

In [3]:
full_name = u'Ariadne Éowyn'
print pretty_format(full_name)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-3-3d955b7228e5> in <module>()
      1 full_name = u'Ariadne Éowyn'
----> 2 print pretty_format(full_name)

<ipython-input-1-51c87778ff71> in pretty_format(some_data)
      1 def pretty_format(some_data):
----> 2     return 'Hello, {}!'.format(some_data)

UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 8: ordinal not in range(128)

Boom! Python detects the mismatch between the byte string template and the Unicode data containing characters that simply cannot be represented directly in the implicit target encoding, ASCII. In other words, Python is refusing to guess what we want: do we prefer a binary expansion, and, if so, in what encoding? Should the accented characters simply be dropped? Do we want unexpected symbols to be translated into ASCII error characters? Python has no way of knowing which of these options is appropriate to the present situation, so it takes the only reasonable course and raises an exception.
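Each of those possibilities corresponds to an explicit encode call we could have written ourselves; here is a quick sketch of the choices Python is refusing to make on our behalf:

```python
full_name = u'Ariadne Éowyn'

# each of these is an explicit, deliberate choice -- Python will not pick one for us
print(full_name.encode('utf-8'))             # multi-byte expansion in a chosen encoding
print(full_name.encode('ascii', 'ignore'))   # silently drop the accented character
print(full_name.encode('ascii', 'replace'))  # substitute the ASCII error character '?'
```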

Many encoding issues can be quite challenging to reconcile, but this case is rather simple: if the format string were specified as a Unicode string — rather than as a plain byte string — this entire class of problem would be avoided:

In [4]:
def pretty_format(some_data):
    # Unicode template prepares this routine to handle non-ASCII symbols
    return u'Hello, {}!'.format(some_data)

print pretty_format(full_name)
Hello, Ariadne Éowyn!

But how do we know where the problematic calls to .format are lurking in our code base without waiting for the next error to occur? Is there a way we could find these calls proactively, eliminating them from the system before they wreak havoc on our application?

A brief diversion, or Now you have two problems

When I first mentioned the idea of automatically detecting problematic .format calls to a coworker, he immediately quipped back “have fun writing that regex!” Before we look at a more powerful and reliable alternative, let's take a moment to examine his intuition and understand why we might not want to use regular expressions for this task.

At first glance, regular expressions seem reasonably well suited to this job: what we're looking for are substrings of a given string — that is, parts of a Python source file — containing a certain pattern. Indeed it is fairly easy to construct a first cut at a regular expression for this. Here is mine, including a bunch of explanatory comments:

In [5]:
import re

# a first draft regular expression for detecting 
# problematic .format calls
pattern = re.compile(r'''
    (?x)        # turn on verbose regex mode
    
    (?<!u)      # skip matches immediately preceded by a u
    
    '           # match ', followed by 
    [^']*       # any number of non-' characters,
    '           # followed by '
    
    \.format    # only if the string is followed by a
                # call to .format
    ''')

This appears to work well for some simple cases:

In [6]:
def is_problem(s):
    return bool(pattern.search(s))

print is_problem("message = 'Hello, world!'")
print is_problem("message = 'Hello, {}!'.format")
print is_problem("message = u'Hello, {}!'.format")
False
True
False

but will fail for many, many others. What if the code we're examining uses double quoted strings? Or multiline strings? Does the regular expression we've concocted handle continuation lines correctly? Backslash escapes? Comments? Here are just a few example cases that our regex will not report correctly:

'another {}string' \
    .format

"now you're just" \
'bein\' {}annoyin\''.format(u'seriously ')

# 'ignore me'.format

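To make a couple of these failures concrete, here is the same draft pattern exercised against two of the cases above (a sketch; the verbose flag is condensed to the very start of the pattern, where newer Python versions require it):

```python
import re

# the draft pattern from above, condensed onto one line
pattern = re.compile(r"(?x) (?<!u) ' [^']* ' \.format")

# a double-quoted template slips right past the single-quote-only pattern
print(bool(pattern.search('"Hello, {}!".format')))   # False: a missed problem

# while a commented-out call still matches
print(bool(pattern.search("# 'ignore me'.format")))  # True: a false positive
```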
Just for fun, here are a few more: what happens if we use the Python parser's string combination feature?

In [7]:
# a false positive
is_problem("""
    u'this ' 'and this are the same string{}'.format
""")
Out[7]:
True

Or if you get in your time machine and import Unicode literals?

In [8]:
# another false positive
is_problem("""
    from __future__ import unicode_literals
    
    'Unicode masquerading as {}'.format
""")
Out[8]:
True

To be effective, our source examination tool needs to handle all these cases and more so that team members will trust the feedback they are receiving: we don't want to create alarm fatigue by producing false or spurious warnings. Handling all of these edge cases and tricky combinations correctly, however, would require something much closer to a full parser than to a regular expression.

Notice, though, that Python's own parser already deals with all of these difficulties when interpreting our source code! Since the Python language and community emphasize openness and introspection, it will probably come as no surprise that this machinery is exposed and available to Python programs. Using the ast module, we can transform arbitrary Python source into a tree representation, with all of the aforementioned parsing complexities already handled for us. Our task then becomes much simpler: we merely need to inspect this tree and report any instances of calls to .format that are made on byte strings.

What's an AST?

An abstract syntax tree (AST) is an object representation of the syntactic parts of a program and of the relationship between those parts. If we have some program fragment,

for little_one in ['Asher', 'Ansel', 'Ariadne', 'Andromeda']:
    name_complexity = len(little_one)
    print name_complexity

we can observe that there is a lot going on in even a relatively simple construct like this one:

  • the entire construct is a logical unit with respect to the other statements that may surround it. For example, moving this statement around will not change its internal semantics (although moving it may still affect the semantics of the program around it)
  • the statement itself consists of 3 components: the iterable that supplies the values for the loop, the assignment target, and the loop body, which is in turn composed of multiple statements
  • each of the parts has a specific relationship to the others that is communicated by the source code. This relationship forms part of the semantic structure of the statement: if we were to move the print call back 4 spaces, we would alter the relationship of the parts to each other and therefore the semantics of the entire statement:
for little_one in ['Asher', 'Ansel', 'Ariadne', 'Andromeda']:
    name_complexity = len(little_one)
print name_complexity

An AST captures these nuances in object form. In an AST representation, nodes model individual constructs in the program, drawn from the set of all available constructs defined by the language. In Python, for example, we have node types representing Module, For, Name, and Str. Each node captures both the type of construct being represented and the details of a particular instance: the Name class represents all possible variable usages, while a particular instance of Name in an AST will include the specific name used at a given point in a program and whether that name is being read from or written to.
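For example, the same variable usage parses to different ctx values depending on which side of an assignment it sits (a quick illustration; the exact dump text varies slightly between Python versions):

```python
import ast

# parse a one-line assignment and pull out its single Assign node
assign = ast.parse('x = y').body[0]

print(ast.dump(assign.targets[0]))  # Name 'x' is written to: ctx=Store()
print(ast.dump(assign.value))       # Name 'y' is read from:  ctx=Load()
```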

Let's take a look at the AST representation of this for loop:

In [9]:
import ast
tree = ast.parse("""

for little_one in ['Asher', 'Ansel', 'Ariadne', 'Andromeda']:
    name_complexity = len(little_one)
    print name_complexity

""")

print ast.dump(tree)[:92] + u'…'
Module(body=[For(target=Name(id='little_one', ctx=Store()), iter=List(elts=[Str(s='Asher'), …

First we parse the source text into a tree representation, then ask the ast module to dump that tree structure back out as text. Since the output of ast.dump can be somewhat difficult to read, I've indented this particular trace by hand to reveal the relationship structure of the tree:

Module(body=[
  For(
    target=Name(id='little_one', ctx=Store()), 
    iter=List(
      elts=[Str(s='Asher'), Str(s='Ansel'), 
            Str(s='Ariadne'), Str(s='Andromeda')], 
      ctx=Load()), 
    body=[
      Assign(
        targets=[Name(id='name_complexity', ctx=Store())], 
        value=Call(
          func=Name(id='len', ctx=Load()), 
          args=[Name(id='little_one', ctx=Load())], 
          keywords=[], 
          starargs=None, 
          kwargs=None)), 
      Print(
        dest=None, 
        values=[Name(id='name_complexity', ctx=Load())], 
        nl=True)], 
    orelse=[]
  )
])

The Module object is the outermost container for our code. It corresponds to a .py file and contains a single attribute, body, that contains the list of statements — imports, assignments, function and class definitions, etc. — that make up the module. For this simple example, the module body contains one element: our for loop.

With a little squinting, you can see the two body statements contained within the for loop attached to the For node's body attribute: Assign and Print. This brings the total number of direct children of the For node to four: a List node as its iterable, a Name node as its target, and the Assign and Print nodes making up its body. The List, Assign, and Print nodes each have their own children representing the string literals that make up the list, both sides of the assignment statement, and the values we wish to print.

Given this general introduction to Python's AST, here are two examples illuminating what our .format nanny will be looking for; first is a case we want to alert the user about:

In [10]:
ast.dump(ast.parse("'a {}string'.format"))
Out[10]:
"Module(body=[Expr(value=Attribute(value=Str(s='a {}string'), attr='format', ctx=Load()))])"

followed by an acceptable spelling that should produce no warning:

In [11]:
ast.dump(ast.parse("u'a {}string'.format"))
Out[11]:
"Module(body=[Expr(value=Attribute(value=Str(s=u'a {}string'), attr='format', ctx=Load()))])"

Translating this into a textual description lets us refine the problem statement we formulated earlier: we are looking for Attribute nodes with an attr of 'format' and whose value is a Str node. If the repr of that Str node's s attribute starts with a ' rather than a u, the template was written as a byte string: we have found a problematic spelling and should warn the user.

Stop and ponder this for a moment. Notice how we've gone from a definite but ambiguous notion of what we wanted to a simple, precise, and implementable description that leverages the transformation of source text into a tree representation.
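Before building the visitor, this description can be sanity-checked with ast.walk, which flattens the tree into an iterable of nodes. Here is a rough sketch that finds only the .format attribute accesses, deferring the byte-versus-unicode check (the Str node class this post relies on is a Python 2 detail, and the corresponding literal node class differs across Python versions):

```python
import ast

def format_call_lines(source):
    """Return the line number of every .format attribute access in source."""
    return [node.lineno
            for node in ast.walk(ast.parse(source))
            if isinstance(node, ast.Attribute) and node.attr == 'format']

print(format_call_lines("'Hello, {}!'.format('world')\npi = math.pi"))  # [1]
```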

Visiting .format

Python's ast module provides a tree traversal framework for working with ASTs that is based on the Visitor pattern. For each node type, a custom traverser defines a method corresponding to that type that will be invoked whenever the traversal encounters such a node. For example, here's the basic template for working with Attribute nodes:

class MyVisitor(ast.NodeVisitor):
    def visit_Attribute(self, node):
        # do something with Attribute nodes
        pass

Let's do some exploratory coding. We've already seen the basic structure of the Attribute nodes we're looking for, but what else is available to us as we're traversing an AST?

In [12]:
tree = ast.parse("""

'Hello, {}!'.format('world')

pi = math.pi

""")
In [13]:
class AttributeInspector(ast.NodeVisitor):
    def visit_Attribute(self, node):
        print node.__dict__
        print node.value.__dict__
        print

AttributeInspector().visit(tree)
{'col_offset': 0, 'ctx': <_ast.Load object at 0x7ff178b71a10>, 'attr': 'format', 'value': <_ast.Str object at 0x7ff15bffe910>, 'lineno': 3}
{'s': 'Hello, {}!', 'lineno': 3, 'col_offset': 0}

{'col_offset': 5, 'ctx': <_ast.Load object at 0x7ff178b71a10>, 'attr': 'pi', 'value': <_ast.Name object at 0x7ff15bffec10>, 'lineno': 5}
{'ctx': <_ast.Load object at 0x7ff178b71a10>, 'id': 'math', 'col_offset': 5, 'lineno': 5}

This little class simply prints the instance dictionaries for each Attribute node in the tree along with that of its value attribute. I often use this trick (print obj.__dict__) when exploring a problem to help me understand the attributes and capabilities of unknown objects. For this particular problem, our inspector has revealed that each node type has a different set of attributes defined on it and given us clues to help us advance to a solution.

As expected, there are two Attribute nodes in our little example program, corresponding to the .format and .pi references in the source. We saw earlier that we are specifically looking for nodes with an attr of 'format', and our little test class revealed another helpful bit of data: the line number on which the attribute access appears in the source, which we can report back to the user.

Our tiny Visitor implementation has already solved the first part of our problem: finding Attribute nodes in the source. All that remains is to filter down to .format nodes and check the value attribute of each. Here is the rest of the implementation:

In [14]:
class FormatVisitor(ast.NodeVisitor):
    def visit_Attribute(self, node):
        # only .format calls made directly on string literals are of interest;
        # the isinstance check keeps us from choking on e.g. some_var.format
        if node.attr == 'format' and isinstance(node.value, ast.Str):
            _str = repr(node.value.s)

            if _str[0] != 'u':
                print u'{}: {}'.format(node.lineno, _str)

        # keep traversing so nested .format calls are not missed
        self.generic_visit(node)

With just a handful of lines of Python, we have created a program that meets the goals we set out above. This solution harnesses the full power of Python's exposed parser machinery, which allows our code to express a very high level solution with a minimum of syntactic overhead. Run on our example above, it points out that line 3 of the source contains a problematic spelling:

In [15]:
FormatVisitor().visit(tree)
3: 'Hello, {}!'

while source with an acceptable spelling yields no warnings:

In [16]:
tree = ast.parse("""u'Hello, {}!'.format('world')""")
FormatVisitor().visit(tree)

Here is a complex example that demonstrates several of our tricky cases from earlier, including a split string, continuation line, and even a parenthesized expression!

In [17]:
source = """

('Hello, '
    '{}!') \
.format('world')

"""

tree = ast.parse(source)
FormatVisitor().visit(tree)
3: 'Hello, {}!'

The same source but with corrected strings eliminates the warning:

In [18]:
source = source.replace("('", "(u'")
tree = ast.parse(source)
FormatVisitor().visit(tree)

Even the __future__ import case is handled correctly with no extra effort on our part!

In [19]:
tree = ast.parse("""

from __future__ import unicode_literals
'Hello, {}!'.format('world')

""")
FormatVisitor().visit(tree)

Looking back, looking ahead

Thanks for joining me in this tour of a lesser-used corner of the Python standard library. In my shop, we now use a Git commit hook based on these ideas to help detect these sorts of encoding problems before they even make it into the source tree. Figuring out how to build the hook was a fun exploration with a pragmatic result, and I hope that you have enjoyed walking through it with me and that you learned something along the way!
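The hook itself is beyond the scope of this post, but as a rough sketch of the shape such a wrapper might take (the names FormatReporter and check_source are illustrative, and the reporter is simplified to flag every .format attribute access, since the byte-versus-unicode distinction only exists under Python 2):

```python
import ast

class FormatReporter(ast.NodeVisitor):
    """Collect the line number of every .format attribute access."""
    def __init__(self):
        self.warnings = []

    def visit_Attribute(self, node):
        if node.attr == 'format':
            self.warnings.append(node.lineno)
        # keep traversing into child nodes
        self.generic_visit(node)

def check_source(source, filename='<unknown>'):
    """Return warning line numbers for source; empty if clean or unparseable."""
    try:
        tree = ast.parse(source, filename=filename)
    except SyntaxError:
        return []
    reporter = FormatReporter()
    reporter.visit(tree)
    return reporter.warnings
```

A commit hook would then feed each staged .py file's contents to check_source and fail the commit whenever any warnings come back.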

If you'd like to learn more, here are a few resources to continue your exploration:

  • Green Tree Snakes is Thomas Kluyver's introduction to the AST, billed as “the missing Python AST docs”
  • The ast module documentation contains a grammar specification and module reference
  • Andreas Dewes presented a static analysis tool at PyCon 2015 based on the AST
  • Hy implements a Lisp interpreter in Python and makes extensive use of the AST
