Avoiding a potential problem with break statements
The common way to understand a for statement is that it creates a "for all" condition. At the end of the statement, we can assert that, for all items in a collection, some processing has been done.
This isn't the only meaning for a for statement. When we introduce the break statement inside the body of a for, we change the semantics to "there exists." When the break statement leaves the for (or while) statement, we can assert only that there exists at least one item that caused the statement to end.
There's a side issue here. What if the for statement ends without executing break? Either way, we're at the statement after the for statement.
The condition that's true upon leaving a for or while statement with a break can be ambiguous. Did it end normally? Did it execute break? We can't easily tell, so we'll provide a recipe that gives us some design guidance.
This can become an even bigger problem when we have multiple break statements, each with its own condition. How can we minimize the problems created by having complex conditions?
Getting ready
When parsing configuration files, we often need to find the first occurrence of a : or = character in a string. This is common when looking for lines that have a similar syntax to assignment statements, for example, option = value or option : value. The properties file format uses lines where : (or =) separates the property name from the property value.
This is a good example of a "there exists" modification to a for statement. We don't want to process all characters; we only want to know where the leftmost : or = is.
Here's the sample data we're going to use as an example:
>>> sample_1 = "some_name = the_value"
Here's a small for statement to locate the leftmost "=" or ":" character in the sample string value:
>>> for position in range(len(sample_1)):
...     if sample_1[position] in '=:':
...         break
>>> print(f"name={sample_1[:position]!r}",
...     f"value={sample_1[position+1:]!r}")
name='some_name ' value=' the_value'
When the "=" character is found, the break statement stops the for statement. The value of the position variable shows where the desired character was found.
What about this edge case?
>>> sample_2 = "name_only"
>>> for position in range(len(sample_2)):
...     if sample_2[position] in '=:':
...         break
>>> print(f"name={sample_2[:position]!r}",
...     f"value={sample_2[position+1:]!r}")
name='name_onl' value=''
The result is awkwardly wrong: the y character got dropped from the value of name. Why did this happen? And, more importantly, how can we make the condition at the end of the for statement more clear?
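One way to see what happened is to inspect the position variable after the for statement finishes. The following is a minimal sketch, reusing only the sample_2 value from above: when break is never executed, position is left holding the last value produced by range(), which is the index of the final character, not the index of a separator.

```python
sample_2 = "name_only"

for position in range(len(sample_2)):
    if sample_2[position] in "=:":
        break

# No break was executed, so position is the last index, len(sample_2) - 1.
last_index = len(sample_2) - 1
print(position == last_index)          # True
print(repr(sample_2[position]))        # 'y', not a separator
print(repr(sample_2[:position]))       # 'name_onl', the final character is cut off
print(repr(sample_2[position + 1:]))   # '', nothing follows the last index
```

The slices are computed as if position pointed at a separator, so the y is treated as the separator and silently discarded.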
How to do it...
Every statement establishes a post condition. When designing a for or while statement, we need to articulate the condition that's true at the end of the statement. In this case, the post condition of the for statement is quite complicated.
Ideally, the post condition is something simple like text[position] in '=:'. In other words, the value of position is the location of the "=" or ":" character. However, if there's no = or : in the given text, the overly simple post condition can't be true. At the end of the for statement, one of two things is true: either (a) the character with the index of position is "=" or ":", or (b) all characters have been examined and no character is "=" or ":".
Our application code needs to handle both cases. It helps to carefully articulate all of the relevant conditions.
- Write the obvious post condition. We sometimes call this the happy-path condition because it's the one that's true when nothing unusual has happened:
text[position] in '=:'
- Create the overall post condition by adding the conditions for the edge cases. In this example, we have two additional conditions:
- There's no = or :.
- There are no characters at all. len() is zero, and the for statement never actually does anything. In this case, the position variable will never be created.
Together with the happy-path condition, this gives us three conditions:
(len(text) == 0 or not('=' in text or ':' in text) or text[position] in '=:')
- If a while statement is being used, consider redesigning it to have the overall post condition in the while clause. This can eliminate the need for a break statement.
- If a for statement is being used, be sure a proper initialization is done, and add the various terminating conditions to the statements after the loop. It can look redundant to have x = 0 followed by for x in .... The initialization is necessary for the case where the for statement never executes its body, because the loop variable would otherwise be undefined. Here's the resulting for statement and a complicated if statement to examine all of the possible post conditions:
>>> position = -1
>>> for position in range(len(sample_2)):
...     if sample_2[position] in '=:':
...         break
...
>>> if position == -1:
...     print(f"name=None value=None")
... elif not(sample_2[position] == ':' or sample_2[position] == '='):
...     print(f"name={sample_2!r} value=None")
... else:
...     print(f"name={sample_2[:position]!r}",
...         f"value={sample_2[position+1:]!r}")
name='name_only' value=None
In the statements after the for, we've enumerated all of the terminating conditions explicitly. If the position found is -1, then the for loop did not process any characters. If the character at position is not one of the expected characters, then all the characters were examined without finding a match. In the third case, one of the expected characters was found. The final output, name='name_only' value=None, confirms that we've correctly processed the sample text.
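The suggestion about while statements can be sketched in a similar way. The following is one possible shape, not the only one; the function name find_separator is a hypothetical helper introduced for illustration. The while clause carries the negation of the overall post condition, so no break statement is needed, and the post condition can be checked with an assert as soon as the loop ends.

```python
def find_separator(text: str) -> int:
    """Hypothetical helper: index of the leftmost '=' or ':',
    or len(text) when there is no separator (including empty text)."""
    position = 0
    # Keep going only while there are characters left AND no separator yet.
    while position < len(text) and text[position] not in "=:":
        position += 1
    # Post condition: the negation of the while clause.
    assert position == len(text) or text[position] in "=:"
    return position


for sample in ("some_name = the_value", "name_only", ""):
    position = find_separator(sample)
    if position == len(sample):
        print(f"{sample!r}: no separator")
    else:
        print(f"{sample!r}: separator at {position}")
```

Because position is initialized before the loop, the empty-string case needs no special handling: the result is 0, which equals len(text).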
How it works...
This approach forces us to work out the post condition carefully so that we can be absolutely sure that we know all the reasons for the loop terminating.
In more complex, nested for and while statements—with multiple break statements—the post condition can be difficult to work out fully. A for statement's post condition must include all of the reasons for leaving the loop: the normal reasons plus all of the break conditions.
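To illustrate how the post condition grows with each break, here is a hedged sketch with two break statements. The comment character "#" and the function name find_separator_or_comment are assumptions made for this example; they are not part of the recipe. Each exit path records why the loop ended, so the code after the loop never has to guess.

```python
from typing import Optional, Tuple


def find_separator_or_comment(text: str) -> Tuple[str, Optional[int]]:
    """Hypothetical helper: report why the scan ended and where."""
    # Defaults cover the two "no break" outcomes: no data, or no match.
    reason = "empty" if len(text) == 0 else "exhausted"
    found: Optional[int] = None
    for position in range(len(text)):
        if text[position] == "#":      # first break condition (assumed syntax)
            reason, found = "comment", position
            break
        if text[position] in "=:":     # second break condition
            reason, found = "separator", position
            break
    return reason, found


print(find_separator_or_comment("a = b"))
print(find_separator_or_comment("a # b = c"))
print(find_separator_or_comment("name_only"))
print(find_separator_or_comment(""))
```

The overall post condition now has four parts: the text was empty, every character was examined without a match, a comment character was found, or a separator was found.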
In many cases, we can refactor the for statement. Rather than simply asserting that position is the index of the = or : character, we include the next processing steps of assigning substrings to the name and value variables. We might have something like this:
>>> if len(sample_2) > 0:
...     name, value = sample_2, None
... else:
...     name, value = None, None
>>> for position in range(len(sample_2)):
...     if sample_2[position] in '=:':
...         name, value = sample_2[:position], sample_2[position+1:]
...         break
>>> print(f"{name=} {value=}")
name='name_only' value=None
This version pushes some of the processing forward, based on the complete set of post conditions evaluated previously. The initial values for the name and value variables reflect the two edge cases: there's no = or : in the data or there's no data at all. Inside the for statement, the name and value variables are set prior to the break statement, assuring a consistent post condition.
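As a sketch of how this refactoring might be packaged, the same statements can be wrapped in a function and exercised against all three post conditions. The function name split_property is hypothetical, and the type hints are an addition for illustration.

```python
from typing import Optional, Tuple


def split_property(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical helper: split text at the leftmost '=' or ':'."""
    # Establish the edge-case results before the loop runs.
    if len(text) > 0:
        name, value = text, None
    else:
        name, value = None, None
    for position in range(len(text)):
        if text[position] in "=:":
            # Happy path: set both results just before leaving the loop.
            name, value = text[:position], text[position + 1:]
            break
    return name, value


print(split_property("some_name = the_value"))  # ('some_name ', ' the_value')
print(split_property("name_only"))              # ('name_only', None)
print(split_property(""))                       # (None, None)
```

Each call corresponds to one of the three conditions: a separator was found, no separator was present, or there was no data at all.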
The idea here is to forgo any assumptions or intuition. With a little bit of discipline, we can be sure of the post conditions. The more we think about post conditions, the more precise our software can be. It's imperative to be explicit about the condition that's true when our software works. This is the goal for our software, and we can work backward from the goal by choosing the simplest statements that will make the goal conditions true.
There's more...
We can also use an else clause on a for statement to determine if the statement finished normally or a break statement was executed. We can use something like this:
>>> for position in range(len(sample_2)):
...     if sample_2[position] in '=:':
...         name, value = sample_2[:position], sample_2[position+1:]
...         break
... else:
...     if len(sample_2) > 0:
...         name, value = sample_2, None
...     else:
...         name, value = None, None
>>> print(f"{name=} {value=}")
name='name_only' value=None
Using an else clause in a for statement is sometimes confusing, and we don't recommend it. It's not clear if this version is substantially better than any of the alternatives. Because the else clause is used so rarely, it's too easy to forget that it is executed only when the for statement ends without executing break.
See also
- A classic article on this topic is by David Gries, A note on a standard strategy for developing loop invariants and loops. See http://www.sciencedirect.com/science/article/pii/0167642383900151