Laziness¶
Lazy iterators evaluate only when necessary. They allow us to semantically manipulate large amounts of data while keeping very little of it actually in memory. They act like lists but don’t take up space.
Example - A Tale of Two Cities¶
We open a file containing the text of the classic text “A Tale of Two Cities” by Charles Dickens[1].
>>> book = open('tale-of-two-cities.txt')
Much like a secondary school student, Python owns and opens the book without
reading a single line of the text. The object book
is a lazy iterator!
Python will give us a line of the text only when we explicitly ask it to do so
>>> next(book)
"It was the best of times,"
>>> next(book)
"it was the worst of times,"
and so on. Each time we call next
on book
we burn through another line
of the text and the book
iterator marches slowly onwards through the text.
Computation¶
We can lazily operate on lazy iterators without doing any actual computation. For example lets read the book in upper case
>>> from toolz import map # toolz' map is lazy by default
>>> loud_book = map(str.upper, book)
>>> next(loud_book)
"IT WAS THE AGE OF WISDOM,"
>>> next(loud_book)
"IT WAS THE AGE OF FOOLISHNESS,"
It is as if we applied the function str.upper
onto every line of the book;
yet the first line completes instantaneously. Instead Python does the
uppercasing work only when it becomes necessary, i.e. when you call next
to ask for another line.
Reductions¶
You can operate on lazy iterators just as you would with lists, tuples, or sets. You can use them in for loops as in
for line in loud_book:
...
You can instantiate them all into memory by calling them with the constructors
list
, or tuple
.
loud_book = list(loud_book)
Of course if they are very large then this might be unwise. Often we use
laziness to avoid loading large datasets into memory at once. Many
computations on large datasets don’t require access to all of the data at a
single time. In particular reductions (like sum) often take large amounts of
sequential data (like [1, 2, 3, 4]) and produce much more manageable results
(like 10) and can do so just by viewing the data a little bit at a time. For
example we can count all of the letters in the Tale of Two Cities trivially
using functions from toolz
>>> from toolz import concat, frequencies
>>> letters = frequencies(concat(loud_book))
{ 'A': 48036,
'B': 8402,
'C': 13812,
'D': 28000,
'E': 74624,
...
In this case frequencies
is a sort of reduction. At no time were more than
a few hundred bytes of Tale of Two Cities necessarily in memory. We could just
have easily done this computation on the entire Gutenberg collection or on
Wikipedia. In this case we are limited by the size and speed of our hard drive
and not by the capacity of our memory.
[1] | http://www.gutenberg.org/cache/epub/98/pg98.txt |