Python Strings: Part Six

In the previous article, we began our look at indexing and slicing. In this article, we will continue our look at slicing and show some practical applications of slicing.

In Python 2.3 and later, there is support for a third index, used as a step. The step is added to the index of each item extracted. The three-index form of a slice is X[I:J:K], which means “extract all the items in X, from offset I through J-1, by K.” The third limit, K, defaults to 1, which is why normally all items in a slice are extracted from left to right. But if you specify an explicit value, you can use the third limit to skip items or to reverse their order.

For instance, a[1:10:2] will fetch every other item in a from offsets 1-9; that is, it will collect the items at offsets 1, 3, 5, 7 and 9. As usual, the first and second limits default to 0 and the length of the sequence, respectively, so a[::2] gets every other item from the beginning to the end of the sequence. Here is the first of these at the interactive prompt:

>>> a = 'nowisthetimeto'
>>> a[1:10:2]
'oitei'
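
As a small sketch of the default bounds (the names evens and odds are ours, chosen only for illustration), leaving out the first two limits applies the step across the whole string:

```python
a = 'nowisthetimeto'

# Omitted bounds default to the start and the end of the sequence.
evens = a[::2]    # offsets 0, 2, 4, ...
odds = a[1::2]    # offsets 1, 3, 5, ...

print(evens)
print(odds)
```

The slice a[1:10:2] from the example above is simply the first five characters of odds.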

You can also use a negative stride. For example, the slicing expression 'every'[::-1] returns the new string 'yreve' – the first two bounds default to 0 and the length of the sequence, as before, and a stride of -1 indicates that the slice should go from right to left instead of the usual left to right. The effect is to reverse the sequence:

>>> a = 'every'
>>> a[::-1]
'yreve'

With a negative stride, the meanings of the first two bounds are essentially reversed. That is, the slice a[5:1:-1] fetches the items from 2 to 5, in reverse order (the result contains items from offsets 5, 4, 3, and 2):

>>> a = 'thequick'
>>> a[5:1:-1]
'iuqe'
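
One way to check which offsets a negative stride visits is to generate them with range, which takes the same start, stop and step. This is only a sketch; the names offsets and by_hand are ours:

```python
a = 'thequick'

# With a step of -1 the first bound is where the slice starts (inclusive)
# and the second is where it stops (exclusive), moving right to left.
offsets = list(range(5, 1, -1))
by_hand = ''.join(a[i] for i in offsets)  # same result, one item at a time

print(offsets)
print(by_hand == a[5:1:-1])
```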

Skipping and reversing like this are the most common use cases for three-limit slices, but see Python’s standard library manual for more details.

Slices have many applications. For example, argument words listed on a system command line are made available in the argv attribute of the built-in sys module:

#File command.py - echo command line args
import sys
print(sys.argv)

% python command.py -1 -2 -3
['command.py', '-1', '-2', '-3']

Usually, however, you’re only interested in inspecting the arguments that follow the program name. This leads to a typical application of slices: a single slice expression can be used to return all but the first item of a list. Here, sys.argv[1:] returns the desired list, ['-1', '-2', '-3']. You can then process this list without having to accommodate the program name at the front.
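
A minimal sketch of that idea follows. The helper name option_args and the simulated argument list are our own assumptions, added so the example runs without a real command line:

```python
import sys

def option_args(argv):
    """Return every argument after the program name."""
    return argv[1:]

# Simulated command line, so the sketch runs without real arguments.
sample = ['command.py', '-1', '-2', '-3']
print(option_args(sample))

# In a real script you would pass sys.argv instead.
print(option_args(sys.argv))
```

Because the slice returns a new list, the original sys.argv is left untouched.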

External Links:

Strings at docs.python.org

Python Strings at Google for Developers

Python strings tutorial at afterhoursprogramming.com

Python Strings: Part Five

Because strings are defined as ordered collections of characters, we can access their components by position. In Python, characters in a string are fetched by indexing – providing the numeric offset of the desired component in square brackets after the string. When you specify an index, you get back a one-character string at the specified position.

Strings in Python are similar to strings in the C/C++ language in that Python offsets start at 0 and end at one less than the length of the string. Unlike C, however, Python also lets you fetch items from sequences such as strings using negative offsets. Technically, a negative offset is added to the length of a string to derive a positive offset. You can also think of negative offsets as counting backward from the end. For example:

>>> a = 'party'
>>> a[0], a[-2]
('p', 't')
>>> a[1:3], a[1:], a[:-1]
('ar', 'arty', 'part')

The first line defines a five-character string and assigns it the name a. The next line indexes it in two ways: a[0] gets the item at offset 0 from the left (the one-character string ‘p’), and a[-2] gets the item at offset 2 back from the end.

The last line in the preceding example demonstrates slicing, a generalized form of indexing that returns an entire section, not a single item. Most likely the best way to think of slicing is that it is a type of parsing, especially when applied to strings. It allows us to extract an entire section in a single step. Slices can be used to extract columns of data, chop off leading and trailing text, and more.

The basics of using slicing are fairly simple. When you index a sequence object such as a string on a pair of offsets separated by a colon, Python returns a new object containing the contiguous section identified by the offset pair. The left offset is taken to be the lower bound (which is inclusive), and the right is the upper bound (which is noninclusive). That is, Python fetches all items from the lower bound up to but not including the upper bound, and returns a new object containing the fetched items. If omitted, the left and right bounds default to 0 and the length of the object you are slicing, respectively.

For instance, in the example above, a[1:3] extracts the items at offsets 1 and 2. It grabs the second and third items, and stops before the fourth item at offset 3. Next, a[1:] gets all the items beyond the first. The upper bound, which is not specified, defaults to the length of the string. Finally, a[:-1] fetches all but the last item. The lower bound defaults to 0, and -1 refers to the last item (noninclusive).
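
To illustrate the earlier remark about extracting columns and chopping off trailing text, here is a hedged sketch. The fixed-width record layout (name in columns 0-9, amount in the columns after that) is made up for this example:

```python
# A made-up fixed-width record: name in columns 0-9, amount after that.
record = 'widget    001250\n'

line = record[:-1]           # chop off the trailing newline
name = line[:10].rstrip()    # first column, with the padding removed
amount = int(line[10:])      # second column, converted to a number

print(name, amount)
```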

Indexing and slicing are powerful tools, and if you’re not sure about the effects of a slice, you can always try it out at the Python interactive prompt. You can even change an entire section of another object in one step by assigning to a slice, though not for immutables like strings.
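
The difference between mutable and immutable sequences can be sketched as follows; the replacement values and the names rejected and changed are arbitrary choices for illustration:

```python
items = list('party')

# Lists are mutable, so a slice can be the target of an assignment.
items[1:3] = ['A', 'R']
print(items)

text = 'party'
rejected = False
try:
    text[1:3] = 'AR'      # strings are immutable, so this raises TypeError
except TypeError:
    rejected = True

# Build a new string from slices instead of changing the old one.
changed = text[:1] + 'AR' + text[3:]
print(rejected, changed)
```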

Python Strings: Part Four

In the previous articles, we introduced Python strings and covered escape sequences, raw strings and triple-quoted strings. Now we can cover some basic string operations. Strings can be concatenated using the + operator and repeated using the * operator:

>>> len('string')
6
>>> 'str' + 'ing'
'string'
>>> 'repeat' * 4
'repeatrepeatrepeatrepeat'

Formally, adding two string objects creates a new string object, with the contents of its operands joined. Repetition is like adding a string to itself a number of times. In both cases, Python lets you create arbitrarily-sized strings. There is no need to pre-declare anything in Python, including the sizes of data structures such as strings. The len built-in function returns the length of a string, or any object with a length.

Repetition comes in handy in a number of contexts. For example, if you want to print out 80 asterisks, just do this:

>>> print('*' * 80)

Notice that we are using the same + and * operators that perform addition and multiplication when using numbers, so we are using operator overloading. Python does the correct operation because it knows the types of the objects being added and multiplied. But there’s a limit to what you can do with operator overloading in Python. For example, Python does not allow you to mix numbers and strings in + expressions. 'repeat' + 3 will raise an error instead of automatically converting 3 to a string.
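
A short sketch of that error, and of the explicit conversions that avoid it, might look like this (the variable names and the value '39' are ours):

```python
try:
    result = 'repeat' + 3       # mixing str and int with + is an error
except TypeError:
    result = None

as_text = 'repeat' + str(3)     # convert the number to a string first
as_number = int('39') + 3       # or convert the string to a number

print(result, as_text, as_number)
```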

You can also iterate over strings in loops using for statements and test membership for both characters and substrings with the in expression operator, which is essentially a search. For substrings, in is much like the str.find() method, but it returns a Boolean result instead of the substring’s position. For example:

>>> mystr = "repeat"
>>> for c in mystr:
	print(c, end=' ')

r e p e a t
>>> "p" in mystr
True
>>> "y" in mystr
False
>>> 'straw' in 'strawberry'
True

The for loop assigns a variable to successive items in a sequence and executes one or more statements for each item. In effect, the variable c becomes a cursor stepping across the string here.
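
The comparison between in and str.find() made earlier can be sketched like this; the search strings are arbitrary examples:

```python
word = 'strawberry'

found = 'straw' in word          # Boolean membership test
position = word.find('berry')    # offset of the first match
missing = word.find('kiwi')      # find returns -1 when there is no match

print(found, position, missing)
```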

Python Strings: Part Three

As we saw in the previous article, escape sequences are handy for embedding special byte codes within strings. Sometimes, however, the special treatment of backslashes can cause problems. For example, let’s assume we want to open a file called thefile.txt for writing in the directory C:\newdir, and we use a statement like this:

fp = open('C:\newdir\thefile.txt','w')

The problem here is that \n is taken to stand for a newline character, and \t is replaced with a tab. In effect, the call tries to open a file named C:[newline]ew[tab]hefile.txt, which is not what we want.

The solution is to use raw strings. If the letter r (uppercase or lowercase) appears just before the opening quote of a string, it turns off the escape mechanism. The result is that Python retains your backslashes literally, exactly as you type them. Therefore, to fix the filename problem, just remember to add the letter r on Windows:

fp = open(r'C:\newdir\thefile.txt','w')

Alternatively, because two backslashes are really an escape sequence for one backslash, you can keep your backslashes by simply doubling them up:

fp = open('C:\\newdir\\thefile.txt','w')

In fact, Python does this sometimes when it prints strings with embedded backslashes:

>>> path = r'C:\newdir\thefile.txt'
>>> path
'C:\\newdir\\thefile.txt'
>>> print(path)
C:\newdir\thefile.txt

As with numeric representation, the default format at the interactive prompt prints results as if they were code, and therefore escapes backslashes in the output. The print function provides a more user-friendly format that shows that there is actually only one backslash in each spot. To verify that this is the case, you can check the result of the built-in len function, which returns the number of characters in the string, independent of display formats. If you count the characters in the print(path) output, you will see that there is really just one character per backslash, for a total of 21.

Besides directory paths on Windows, raw strings are commonly used for regular expressions. Also note that Python scripts can usually use forward slashes in directory paths on Windows and Unix. This is because Python tries to interpret paths portably. Raw strings are useful, however, if you code paths using native Windows backslashes.
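
As a hedged illustration of the regular expression case, the sketch below uses the standard re module; the pattern and the sample text are our own:

```python
import re

# Without the r prefix, '\b' would be a backspace character rather than
# the regular-expression word-boundary marker.
pattern = re.compile(r'\bthe\b')

matches = pattern.findall('the time, then the file')
print(matches)
print(len('\b'), len(r'\b'))
```

The second line of output shows the difference in length: the ordinary literal holds one character, while the raw literal holds a backslash followed by the letter b.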

Finally, Python also has a triple-quoted string literal format (sometimes called a block string) that is a syntactic convenience for coding multiline text data. This form begins with three quotes of either the single or double variety, is followed by any number of lines of text, and is closed with the same triple-quote sequence that opened it. Single and double quotes embedded in the string’s text may be, but do not have to be, escaped. The string does not end until Python sees three unescaped quotes of the same kind used to start the literal:

>>> mystr = """This is an example
of using triple quotes
to code a multiline string"""
>>> mystr
'This is an example\nof using triple quotes\nto code a multiline string'

This string spans three lines. Python collects all the triple-quoted text into a single multiline string, with embedded newline characters (\n) at the places where your code has line breaks. To see the string with the newlines interpreted, print it instead of echoing:

>>> print(mystr)
This is an example
of using triple quotes
to code a multiline string

Triple-quoted strings are useful any time you need multiline text in your program. You can embed such blocks directly in your scripts without resorting to external text files or explicit concatenation and newline characters.

Triple-quoted strings are also commonly used for documentation strings, which are string literals that are taken as comments when they appear at specific points in your file. They do not have to be triple-quoted blocks, but they usually are to allow for multiline comments.
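
For instance, a documentation string placed at the top of a function body can be read back through the function's __doc__ attribute. The function shout is a hypothetical example:

```python
def shout(text):
    """Return text in uppercase.

    The triple-quoted literal at the top of the body is stored
    as the function's documentation string.
    """
    return text.upper()

print(shout.__doc__)
```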

Python Strings: Part Two

In the first article, we introduced Python strings and covered some of the basics. In this article, we will continue our look at strings.

Escape Sequences

In the last article, we introduced the following example:

>>> 'string\'s', "string\"s"

This example embedded a quote inside a string by preceding it with a backslash. This is representative of a general pattern in strings: backslashes are used to introduce special byte codings known as escape sequences. Escape sequences let us embed byte codes in strings that cannot be easily typed on a keyboard. The character \, and one or more characters following it in the string literal, are replaced with a single character in the resulting string object, which has the binary value specified by the escape sequence. For example, we can embed a newline:

>>> a = 'some\nstring'

We can also embed a tab:

>>> a = 'some\tstring'

The two characters \n stand for a single character – the byte containing the binary value of the newline character in your character set (usually ASCII code 10). Similarly, the sequence \t is replaced with the tab character. If we just type the variable at the Python interpreter command line, it shows the escape sequences:

>>> a
'some\tstring'

But print interprets the escape sequences, so we get a different result:

>>> print(a)
some	string

To be completely sure how many characters are in the string, you can use the built-in len function, which returns the actual number of characters, regardless of how the string is displayed:

>>> len(a)
11

The string is eleven characters long. Note that the original backslash characters are not really stored with the string in memory. Rather, they are used to tell Python to store special byte values in the string. Apart from \n and \t, here are some of the more interesting escape sequences:

\\ Backslash (stores one \)
\' Single quote (stores ')
\" Double quote (stores ")
\b Backspace
\xhh Character with hex value hh (exactly two digits)
\ooo Character with octal value ooo (up to three digits)
\uhhhh Unicode 16-bit hex
\Uhhhhhhhh Unicode 32-bit hex
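
To make the numeric forms concrete, the sketch below spells the same character four ways. The choice of the letter A (code 65) is ours:

```python
# Each literal below spells the letter 'A' (code point 65) a different way.
hex_form = '\x41'
octal_form = '\101'
unicode16 = '\u0041'
unicode32 = '\U00000041'

print(hex_form, octal_form, unicode16, unicode32)

# Every escape sequence is stored as a single character.
print(len('\\'), len('\''), len('\b'))
```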

Note that some escape sequences allow you to embed absolute binary values into the bytes of a string. For example, here’s a string that embeds two binary zero bytes:

>>> a = 'a\0d\0e'

This is a five-character string, as we can see:

>>> len(a)
5

In Python, the zero byte does not terminate a string the way it typically does in C. Instead, Python keeps both the string’s length and text in memory. In fact, no character terminates a string in Python. Notice also that Python displays nonprintable characters in hex, regardless of how they were specified.
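
The hex display can be checked with repr, which returns the same text the interactive prompt would echo. This is a small sketch; the name shown is ours:

```python
a = 'a\0d\0e'

shown = repr(a)    # what the interactive prompt would echo
print(shown)
print(len(a))
```

Although the zero bytes were typed as \0, they are displayed as \x00.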

If Python does not recognize the character after a \ as being a valid escape code, it simply keeps the backslash in the resulting string:

>>> a = "d:\download\mycode"
>>> a
'd:\\download\\mycode'
>>> len(a)
18

Unless you want to memorize the escape codes, you probably should not rely on this behavior; recent versions of Python also issue a warning when they meet an unrecognized escape sequence. To code literal backslashes explicitly such that they are retained in your strings, double them up (\\ instead of \) or use raw strings.
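
Both explicit spellings produce the same string as the d:\download example above, as this sketch shows (the names doubled and raw are ours):

```python
doubled = 'd:\\download\\mycode'   # each \\ stores one backslash
raw = r'd:\download\mycode'        # a raw string keeps backslashes as typed

print(doubled == raw)
print(len(raw))
```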

Python Strings: Part One

Introduction to Python Strings

A string in Python is an ordered collection of characters used to store and represent text-based information. From a functional perspective, strings can be used to represent just about anything that can be encoded as text. They can also be used to hold the absolute binary values of bytes and multibyte Unicode text.

You may have used strings in other languages, and Python’s strings serve the same role as character arrays in languages such as C. In C, we might see a statement such as this:

char ch = 'a';

If we want to have a string, we would use something like this:

char *str = "Some arbitrary string";

or:

char str[] = "Some arbitrary string";

But in either case, our string is actually an array of characters. Python has no distinct type for individual characters; instead you just use one-character strings. Also, unlike in C, strings in Python are a somewhat higher-level tool and come with a powerful set of processing tools.

Python strings are categorized as immutable sequences, meaning that the characters they contain have a left-to-right positional order and that they cannot be changed in place. In fact, strings are a subset of the larger class of objects called sequences.

There are many ways to write strings in Python. This is a valid Python string:

a = 'string'

But then again, so is this:

a = "string"

You can also use triple quotes:

a = '''…string…'''
a = """…string…"""

Around Python strings, single and double quote characters are interchangeable. That is, string literals can be written enclosed in either two single or two double quotes – the two forms work the same and return the same type of object. The reason for supporting both is that it allows you to embed a quote character of the other variety inside a string without escaping it with a backslash. You may embed a single quote character in a string enclosed in double quote characters, and vice versa:

>>> 'string"s', "string's"
('string"s', "string's")

Incidentally, Python automatically concatenates adjacent string literals in any expression, although it is almost as simple to add a + operator between them to invoke concatenation explicitly:

>>> a = "Some " 'arbitrary' " string"
>>> a
'Some arbitrary string'

Note that adding commas between these strings would result in a tuple, not a string. Also notice in all of these outputs that Python prefers to print strings in single quotes, unless they embed one. You can also embed quotes by escaping them with backslashes:

>>> 'string\'s', "string\"s"
("string's", 'string"s')
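
Returning to the note about commas, the difference between adjacent literals and comma-separated ones can be sketched as follows (the names joined and separate are ours):

```python
joined = "Some " 'arbitrary' " string"       # adjacent literals are concatenated
separate = "Some ", 'arbitrary', " string"   # commas build a tuple instead

print(joined)
print(type(separate).__name__, len(separate))
```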
