Python Strings: Part Two

Python stringsIn the first article, we introduced Python strings and covered some of the basics. In this article, we will continue our look at strings.

Escape Sequences

In the last article, we introduced the following example:

>>> 'string\'s', "string\"s"

This example embedded a quote inside a string by preceding it with a backslash. This is representative of a general pattern in strings: backslashes are used to introduce special byte codings known as escape sequences. Escape sequences let us embed byte codes in strings that cannot be easily typed on a keyboard. The character \, and one or more characters following it in the string literal, are replaced with a single character in the resulting string object, which has the binary value specified by the escape sequence. For example, we can embed a newline:

>>> a = 'some\nstring'

We can also embed a tab:

>>> a = 'some\tstring'

The two characters \n stand for a single character – the byte containing the binary value of the newline character in your character set, which is usually ASCII code 10). Similarly, the sequence \t is replaced with the tab character. If we just type the variable at the Python interpreter command line, it shows the escape sequences:

>>> a
'some\tstring'

But print interprets the escape sequences, so we get a different result:

>>> print(a)
some	string

To be completely sure how many bytes are in the string, you can use the built-in len function, which returns the actual number of bytes, regardless of how the string is displayed:

>>> len(a)
11

The string is eleven bytes long. Note that the original backslash characters are not really stored with the string in memory. Rather, they are used to tell Python to store special byte values in the string. Apart from \n and \t, here are some of the more interesting escape sequences:

\\ Backslash (stores one \)
\’ Single quote (stores ‘)
\” Double quote (stores “)
\b Backspace
\xhh Character with hex value hh (at most 2 digits
\ooo Character with octal value ooo (up to three digits)
\uhhhh Unicode 16-bit hex
\Uhhhhhhhh Unicode 32-bit hex

Note that some escape sequences allow you to embed absolute binary values into the bytes of a string. For example, here’s a string that embeds two binary zero bytes:

>>> a = 'a\0d\0e'

This is a five-character string, as we can see:

>>> len(a)
5

In Python, the zero byte does not terminate a string the way it typically does in C. Instead, Python keeps both the string’s length and text in memory. In fact, no character terminates a string in Python. Notice also that Python displays nonprintable characters in hex, regardless of how they were specified.

If Python does not recognize the character after a \ as being a valid escape code, it simply keeps the backslash in the resulting string:

>>> a = "d:\download\mycode"
>>> a
'd:\\download\\mycode'
>>> len(a)
18

Unless you want to memorize the escape codes; you probably should not rely on this behavior. To code literal backslashes explicitly such that they are retained in your strings, double them up (\\ instead of \) or use raw strings.

External Links:

Strings at docs.python.org

Python Strings at Google for Developers

Python strings tutorial at afterhoursprogramming.com