Modern Python Cookbook

Rewriting an immutable string

How can we rewrite an immutable string? We can't change individual characters inside a string:

>>> title = "Recipe 5: Rewriting, and the Immutable String"
>>> title[8] = ''
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment

Since this doesn't work, how do we make a change to a string?

Getting ready

Let's assume we have a string like this:

>>> title = "Recipe 5: Rewriting, and the Immutable String"

We'd like to do two transformations:

  • Remove the part up to the :
  • Replace the punctuation with _, and make all the characters lowercase

Since we can't replace characters in a string object, we have to work out some alternatives. There are several common things we can do, shown as follows:

  • A combination of slicing and concatenating a string to create a new string.
  • When shortening, we often use the partition() method.
  • We can replace a character or a substring with the replace() method.
  • We can expand the string into a list of characters, then join the list back into a single string again. This is the subject of a separate recipe, Building complex strings from lists of characters.

How to do it...

Since we can't update a string in place, we have to replace the string variable's object with each modified result. We'll use an assignment statement that looks something like this:

some_string = some_string.method()

Or we could even use an assignment like this:

some_string = some_string[:chop_here]

We'll look at a few specific variations of this general theme. We'll slice a piece of a string, we'll replace individual characters within a string, and we'll apply blanket transformations such as making the string lowercase. We'll also look at ways to remove extra _ characters that show up in our final string.

Slicing a piece of a string

Here's how we can shorten a string via slicing:

  1. Find the boundary:
    >>> colon_position = title.index(':')
    

    The index() method locates a particular substring and returns the position where that substring first occurs. If the substring doesn't exist, it raises a ValueError exception. When it succeeds, the following expression will always be true: title[colon_position] == ':'.

  2. Pick the substring:
    >>> discard, post_colon = title[:colon_position], title[colon_position+1:]
    >>> discard
    'Recipe 5'
    >>> post_colon
    ' Rewriting, and the Immutable String'
    

We've used the slicing notation to show the start:end of the characters to pick. We also used multiple assignment to assign two variables, discard and post_colon, from the two expressions.
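As a minimal sketch of what happens when the boundary is missing, the following compares index() with the related find() method, which returns -1 instead of raising an exception. The search for ';' and the variable names found_semicolon and semicolon_position are illustrative assumptions, not part of the recipe:

```python
title = "Recipe 5: Rewriting, and the Immutable String"

# index() raises ValueError when the substring is absent,
# so we guard the call when we aren't sure the boundary exists.
try:
    title.index(';')
    found_semicolon = True
except ValueError:
    found_semicolon = False

# find() reports a missing substring by returning -1 instead.
semicolon_position = title.find(';')

# The colon is present, so index() is safe here.
colon_position = title.index(':')
post_colon = title[colon_position + 1:]

print(found_semicolon, semicolon_position, colon_position)
print(repr(post_colon))
```

Checking for -1 before slicing matters: a slice such as title[:-1] is valid Python, so a missing boundary would silently give the wrong result rather than an error.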

We can also use partition() instead of manual slicing. It finds the boundary and splits the string in a single step:

>>> pre_colon_text, _, post_colon_text = title.partition(':')
>>> pre_colon_text
'Recipe 5'
>>> post_colon_text
' Rewriting, and the Immutable String'

The partition() method returns three things: the part before the target, the target, and the part after the target. We used multiple assignment to assign each object to a different variable. We assigned the target to a variable named _ because we're going to ignore that part of the result. This is a common idiom for places where we must provide a variable, but we don't care about using the object.
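Unlike index(), partition() doesn't raise an exception when the target is missing. The sketch below shows both cases; the search for ';' and the name no_match are assumptions made for illustration:

```python
title = "Recipe 5: Rewriting, and the Immutable String"

# When the target is present, we get (before, target, after).
before, separator, after = title.partition(':')

# When the target is absent, we get the whole string
# followed by two empty strings.
no_match = title.partition(';')

print(repr(before), repr(separator), repr(after))
print(no_match)
```

An empty middle item is therefore a simple way to detect that the boundary wasn't found.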

Updating a string with a replacement

We can use a string's replace() method to create a new string with punctuation marks replaced. When using replace() to switch punctuation marks, save the result back into the original variable. In this case, that's post_colon_text:

>>> post_colon_text = post_colon_text.replace(' ', '_')
>>> post_colon_text = post_colon_text.replace(',', '_')
>>> post_colon_text
'_Rewriting__and_the_Immutable_String'

This has replaced the two kinds of punctuation with the desired _ characters. We can generalize this to work with all punctuation. This leverages the for statement, which we'll look at in Chapter 2, Statements and Syntax.

We can iterate through all whitespace and punctuation characters:

>>> from string import whitespace, punctuation
>>> for character in whitespace + punctuation:
...     post_colon_text = post_colon_text.replace(character, '_')
>>> post_colon_text
'_Rewriting__and_the_Immutable_String'

As each kind of punctuation character is replaced, we assign the latest and greatest version of the string to the post_colon_text variable.

We can also use a string's translate() method for this. This relies on creating a dictionary object that maps each source character's Unicode code point, as returned by ord(), to a resulting character:

>>> from string import whitespace, punctuation
>>> title = "Recipe 5: Rewriting an Immutable String"
>>> title.translate({ord(c): '_' for c in whitespace+punctuation})
'Recipe_5__Rewriting_an_Immutable_String'

We've created a mapping with {ord(c): '_' for c in whitespace+punctuation} to translate any character, c, in the whitespace+punctuation sequence of characters to the '_' character. This may have better performance than a sequence of individual character replacements.
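The translation table can also be built with the built-in str.maketrans() static method rather than a dictionary comprehension. This is an alternative we're adding for comparison, not the approach shown in the recipe; the names targets and table are assumptions:

```python
from string import whitespace, punctuation

title = "Recipe 5: Rewriting an Immutable String"

# Every character in targets will be mapped to '_'.
targets = whitespace + punctuation

# With two equal-length strings, maketrans() maps the character at
# each position in the first string to the character at the same
# position in the second string.
table = str.maketrans(targets, '_' * len(targets))

result = title.translate(table)
print(result)
```

Both forms produce a table keyed by code point, so translate() treats them the same way.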

Removing extra punctuation marks

In many cases, there are some additional steps we might follow. We often want to remove leading and trailing _ characters. We can use strip() for this:

>>> post_colon_text = post_colon_text.strip('_')

In some cases, we'll have multiple _ characters because we had multiple punctuation marks. The final step would be something like this to clean up multiple _ characters:

>>> while '__' in post_colon_text:
...    post_colon_text = post_colon_text.replace('__', '_')

This is yet another example of the same pattern we've been using: we create a new string and assign it back to the original variable. This depends on the while statement, which we'll look at in Chapter 2, Statements and Syntax.
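As a sketch, here are the strip-and-collapse steps run on the intermediate value from earlier in the recipe, alongside a regular-expression alternative. The re.sub() version is a substitute technique we're adding for comparison, and the variable names are assumptions:

```python
import re

text = '_Rewriting__and_the_Immutable_String'

# Remove leading and trailing underscores first.
stripped = text.strip('_')

# The recipe's approach: keep replacing pairs until none remain.
collapsed_loop = stripped
while '__' in collapsed_loop:
    collapsed_loop = collapsed_loop.replace('__', '_')

# An alternative: collapse any run of underscores in one pass.
collapsed_regex = re.sub(r'_+', '_', stripped)

print(collapsed_loop)
print(collapsed_regex)
```

The loop needs no imports and is easy to read; the regular expression avoids repeated passes over the string when runs of underscores are long.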

How it works...

We can't, technically, modify a string in place. The data structure for a string is immutable. However, we can assign a new string back to the original variable. For code that only uses that variable, the effect is much the same as modifying the string in place.

When a variable's value is replaced, the previous value no longer has any references and is garbage collected. We can see this by using the id() function to track each individual string object:

>>> id(post_colon_text)
4346207968
>>> post_colon_text = post_colon_text.replace('_','-')
>>> id(post_colon_text)
4346205488

Your actual ID numbers may be different. What's important is that the original string object assigned to post_colon_text had one ID. The new string object assigned to post_colon_text has a different ID. It's a new string object.

When the old string has no more references, it is removed from memory automatically.
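The difference between rebinding a variable and changing an object shows up when a second variable refers to the same string. In this minimal sketch, the names original and alias are assumptions for illustration:

```python
original = "Rewriting_and_the_Immutable_String"

# A second name bound to the very same string object.
alias = original
same_before = alias is original

# replace() builds a new string; the assignment rebinds
# only the name original.
original = original.replace('_', '-')
same_after = alias is original

print(same_before, same_after)
print(alias)
print(original)
```

Because alias still refers to the old object, that object stays in memory. It becomes eligible for removal only when the last reference to it goes away.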

We made use of slice notation to decompose a string. A slice has two parts: [start:end]. A slice always includes the starting index. String indices always start with zero as the first item. A slice never includes the ending index.

The items in a slice have an index from start to end-1. This is sometimes called a half-open interval.

Think of a slice like this: all characters where the index i is in the range start ≤ i < end.

We noted briefly that we can omit the start or end indices. We can actually omit both. Here are the various options available:

  • title[colon_position]: A single item, that is, the : we found using title.index(':').
  • title[:colon_position]: A slice with the start omitted. It begins at the first position, index of zero.
  • title[colon_position+1:]: A slice with the end omitted. It ends at the end of the string, as if we said len(title).
  • title[:]: Since both start and end are omitted, this is the entire string. For a mutable sequence such as a list, this idiom creates a copy. Because strings are immutable, a copy is never needed, and CPython simply returns the same string object.
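The options above can be sketched as follows. The variable names single, head, tail, and whole are assumptions chosen for this example:

```python
title = "Recipe 5: Rewriting, and the Immutable String"
colon_position = title.index(':')

single = title[colon_position]        # one item: the colon itself
head = title[:colon_position]         # start omitted: begins at index 0
tail = title[colon_position + 1:]     # end omitted: runs to len(title)
whole = title[:]                      # both omitted: the entire string

# Because the interval is half-open, the pieces fit together
# with no overlap and no gap.
rebuilt = head + single + tail

print(repr(single), repr(head), repr(tail))
print(rebuilt == title)
```

The half-open rule is what makes title[:n] + title[n:] equal to title for any position n.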

There's more...

There are more features for indexing in Python collections like a string. The normal indices start with 0 on the left. We have an alternate set of indices that use negative numbers that work from the right end of a string:

  • title[-1] is the last character in the title, 'g'
  • title[-2] is the next-to-last character, 'n'
  • title[-6:] is the last six characters, 'String'
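A short sketch of how negative indices relate to the ordinary ones; the variable names are assumptions for illustration:

```python
title = "Recipe 5: Rewriting, and the Immutable String"

last = title[-1]            # same item as title[len(title) - 1]
next_to_last = title[-2]
last_six = title[-6:]       # the final six characters
all_but_last_six = title[:-6]

print(last, next_to_last, last_six)
print(repr(all_but_last_six))
```

A negative index -k refers to the same position as len(title) - k, so negative and non-negative indices can be mixed freely in a slice.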

We have a lot of ways to pick pieces and parts out of a string.

Python offers dozens of methods for working with a string. The Text Sequence Type — str section of the Python Standard Library documentation describes the different kinds of transformations that are available to us. There are three broad categories of string methods: we can ask about the string, we can parse the string, and we can transform the string to create a new one. Methods such as isnumeric() tell us if every character in a string is a numeric character, such as a digit.

Here's an example:

>>> 'some word'.isnumeric()
False
>>> '1298'.isnumeric()
True
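The str type has three related tests, isdecimal(), isdigit(), and isnumeric(), which differ for some Unicode characters. The sketch below is an addition for comparison; the sample values are assumptions chosen to show the differences:

```python
samples = {
    'plain digits': '1298',
    'superscript two': '\u00b2',   # the character ²
    'one half': '\u00bd',          # the character ½
    'negative number': '-12',
    'decimal point': '12.5',
}

# Collect (isdecimal, isdigit, isnumeric) for each sample.
results = {
    label: (text.isdecimal(), text.isdigit(), text.isnumeric())
    for label, text in samples.items()
}

for label, flags in results.items():
    print(label, flags)
```

None of these methods accepts a sign or a decimal point, so they check the characters in a string rather than whether the string can be converted with int() or float().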

Before doing comparisons, it can help to change a string so that it has a uniform case. It's frequently helpful to use the lower() method and assign the result to the original variable:

>>> post_colon_text = post_colon_text.lower()

We've looked at parsing with the partition() method. We've also looked at transforming with the lower() method, as well as the replace() and translate() methods.
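Putting the steps of this recipe together, a minimal sketch of the whole transformation might look like the following. The function name clean_title is hypothetical, and combining the steps into one function is our assumption rather than something the recipe prescribes:

```python
from string import whitespace, punctuation


def clean_title(title):
    """Hypothetical helper combining the steps of this recipe."""
    # Parse: keep only the text after the first colon.
    _, _, text = title.partition(':')
    # Transform: map whitespace and punctuation to underscores.
    text = text.translate({ord(c): '_' for c in whitespace + punctuation})
    # Clean up: drop leading/trailing underscores, collapse repeats.
    text = text.strip('_')
    while '__' in text:
        text = text.replace('__', '_')
    # Normalize case last.
    return text.lower()


cleaned = clean_title("Recipe 5: Rewriting, and the Immutable String")
print(cleaned)
```

Note that this sketch returns an empty string when the title has no colon, because partition() then puts the whole string in the first item of its result. Whether that is the right behavior depends on the data being processed.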

See also

  • We'll look at the string as list technique for modifying a string in the Building complex strings from lists of characters recipe.
  • Sometimes, we have data that's only a stream of bytes. In order to make sense of it, we need to convert it into characters. That's the subject of the Decoding bytes – how to get proper characters from some bytes recipe.