
16 July, 2018

Generator functions: why and when should we use them?

You have probably heard of them. But as beginner Python programmers, we tend to ignore their value because we fail to find uses for them. When we have loops, itertools, and other fancy toys on offer, why should we bother to understand them? Because if you want to take your programming, or your understanding of Python, to the next level, you must understand and use generators more and more.

Before we start, let's write a simple program that prints the squares of (hence generates and delivers) the numbers from 1 to 5. So what is the first idea that pops up? A for loop.


def square(N):
    for i in range(1, N+1):
        print(i**2)

>>> square(5)
1
4
9
16
25

Now let's say you need all the generated values for some other calculation. I know, the example above is trivial, but imagine that after some complicated calculation you have produced a set of results.
You will probably want to return all the values together, and the first thought in our minds is to use a LIST. Let's use it.

Now our function will look like this.

def my_function(N):
    new_list = []
    for i in range(1, N+1):
        new_list.append(i**2)
    return new_list

>>> my_function(5)
[1, 4, 9, 16, 25]


All this looks easy. You are probably thinking: if generators are going to ruin this idea, I don't need them. Think again.

  1. This LIST business soon becomes a nuisance as the number of such calculations and variables grows (believe me, inventing new variable names gets old fast).
  2. What if you don't know your requirement in advance? How many results do you want?
  3. You had to create a list, and it occupies memory that grows with the length of the list and the size of each element.
  4. Once you are dealing with a very large list (imagine a million elements), the cost is real: a list of 1 million integers takes around 8 MB just for the list structure in Python 3, and much more if the stored objects are large or complex.
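You can see the memory difference yourself with sys.getsizeof. A quick sketch (the exact numbers vary by Python version and platform, but the gap is always dramatic):

```python
import sys

# A list of one million squares is built and stored all at once
big_list = [x**2 for x in range(1_000_000)]

# The equivalent generator stores only its current state
big_gen = (x**2 for x in range(1_000_000))

print(sys.getsizeof(big_list))  # several MB, just for the list structure
print(sys.getsizeof(big_gen))   # a few hundred bytes, regardless of length
```

Note that getsizeof only measures the list's own pointer array; the integer objects it references take additional memory on top.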

A generator makes all of this simpler. Here is the same example written as a generator function.


def square(N):
    for i in range(1, N+1):
        yield i**2

gen = square(5)

print(next(gen))
print(next(gen))
print(next(gen))
print(next(gen))
print(next(gen))
print(next(gen))

Result:

1
4
9
16
25

Traceback (most recent call last):

  File "generators.py", line 14, in <module>
    print(next(gen))
StopIteration
>>>

You can guess the reason for the traceback: we ran out of fuel. The sixth next() call found nothing left to yield, so Python raised StopIteration. To avoid this, iterate over the generator object with a for loop, which handles StopIteration for you.


def square(N):
    for i in range(1, N+1):
        yield i**2

gen_object = square(5)

for result in gen_object:
    print(result)

Result:

1
4
9
16
25



More fun with generators:

You can also create generators with generator expressions, which look just like list comprehensions. Use parentheses instead of square brackets and you get a generator object instead of a list.


squares = (x**2 for x in range(1, 6))

print(squares)
for value in squares:
    print(value)


Result:

<generator object <genexpr> at 0x...>  # address will vary on your machine
1
4
9
16
25
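A handy consequence of this syntax: you can pass a generator expression straight into any function that accepts an iterable, such as sum() or max(), without ever building a list. When the genexpr is the sole argument, the extra parentheses can even be dropped:

```python
# Sum of squares from 1 to 5 -- no intermediate list is ever created
total = sum(x**2 for x in range(1, 6))
print(total)   # 55

# Works with max(), min(), any(), all() the same way
print(max(x**2 for x in range(1, 6)))   # 25
```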


Lazy Evaluation — the concept behind generators

When Python runs a regular function that builds a list, it computes everything upfront and stores it all in memory before handing anything back to you. A generator does the opposite — it computes one value at a time, only when asked. This is called lazy evaluation. The generator pauses after each yield and resumes only when you ask for the next value. Nothing is computed in advance, nothing is stored. This is why generators can work with data that is too large to fit in memory — or even infinite.
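You can watch the laziness happen by putting a print inside the generator body. A small sketch to make the pause/resume visible:

```python
def lazy_squares(n):
    for i in range(1, n + 1):
        print(f"computing {i}**2 ...")   # runs only when a value is requested
        yield i**2

gen = lazy_squares(3)
print("generator created, nothing computed yet")

print(next(gen))   # only now is the first value computed: prints "computing 1**2 ..." then 1
print(next(gen))   # then the second: prints "computing 2**2 ..." then 4
```

Creating the generator runs none of the function body; each next() call resumes it just long enough to produce one value.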


Warning: Generators are exhausted after one pass

This trips up a lot of beginners. Once you have iterated through a generator, it is gone. You cannot loop over it a second time. Unlike a list, there is no rewinding.

gen = (x**2 for x in range(5))

# First pass — works fine
print(list(gen))   # [0, 1, 4, 9, 16]

# Second pass — generator is already exhausted!
print(list(gen))   # []  <-- empty, not an error, just gone

If you need to iterate multiple times, either use a list — or recreate the generator each time.
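One common pattern (a sketch, not the only way): wrap the generator expression in a small function, so that a fresh generator is created on every call:

```python
def squares():
    # each call returns a brand-new generator, ready for a full pass
    return (x**2 for x in range(5))

print(list(squares()))   # [0, 1, 4, 9, 16]
print(list(squares()))   # [0, 1, 4, 9, 16]  -- works every time
```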


Infinite Sequences — where generators truly shine

You simply cannot make an infinite list. But an infinite generator is trivial — and perfectly safe, because it only produces a value when you ask for one.

# An infinite counter — useful for generating unique test IDs
def test_id_generator(prefix="TC"):
    n = 1
    while True:
        yield f"{prefix}-{n:04d}"
        n += 1

gen = test_id_generator()
print(next(gen))   # TC-0001
print(next(gen))   # TC-0002
print(next(gen))   # TC-0003
# ...this never runs out, and uses almost zero memory
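If you need a bounded slice of an infinite generator, itertools.islice from the standard library is the usual tool: it pulls just the first n values and leaves the rest unproduced. (The generator is repeated here so the sketch runs standalone.)

```python
from itertools import islice

def test_id_generator(prefix="TC"):
    n = 1
    while True:
        yield f"{prefix}-{n:04d}"
        n += 1

# Take only the first three IDs from the infinite stream
first_three = list(islice(test_id_generator(), 3))
print(first_three)   # ['TC-0001', 'TC-0002', 'TC-0003']
```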


Real World: Reading Large Log Files (QA use case)

As a QA engineer, you often need to scan through large log files — sometimes hundreds of MB — looking for errors or specific events. Loading the whole file into a list is a bad idea. A generator reads one line at a time, keeping memory usage flat no matter how big the file is.

# Read a log file lazily — one line at a time
def read_logs(filepath):
    with open(filepath, "r") as f:
        for line in f:
            yield line.strip()

# Filter only ERROR lines
def filter_errors(lines):
    for line in lines:
        if "ERROR" in line:
            yield line

# Use them together — this is a pipeline!
for error in filter_errors(read_logs("app.log")):
    print(error)

# Result (example):
# 2024-01-15 10:23:01 ERROR Database connection timeout
# 2024-01-15 10:45:17 ERROR Failed to parse response from API

At no point is the entire log file in memory. Each line flows through the pipeline one at a time.


Real World: Reading Large CSV Files (QA use case)

Same idea applies to test data CSVs. Instead of loading 100,000 rows into a list, stream them one row at a time.

import csv

def read_test_cases(filepath):
    with open(filepath, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row

# Only run test cases marked as "active"
def active_only(test_cases):
    for tc in test_cases:
        if tc["status"] == "active":
            yield tc

# Run each active test case
for test in active_only(read_test_cases("test_suite.csv")):
    print(f"Running: {test['test_name']} | input: {test['input']} | expected: {test['expected']}")

# Result (example):
# Running: login_valid_user | input: admin/pass123 | expected: 200 OK
# Running: login_empty_password | input: admin/ | expected: 400 Bad Request


Generator Pipelines — chaining generators like Unix pipes

You can chain generators together so that data flows through a series of steps — each step only runs when the next one asks for data. This is exactly how Unix pipes work (cat file | grep ERROR | wc -l), but in pure Python. Here is a complete QA pipeline that reads a log, filters errors, extracts timestamps, and counts them — all without loading the file into memory.

def read_logs(filepath):
    with open(filepath) as f:
        for line in f:
            yield line.strip()

def filter_errors(lines):
    for line in lines:
        if "ERROR" in line:
            yield line

def extract_timestamp(lines):
    for line in lines:
        # assumes log format: "2024-01-15 10:23:01 ERROR ..."
        date, time = line.split()[:2]
        yield date + " " + time

# Build the pipeline
lines      = read_logs("app.log")
errors     = filter_errors(lines)
timestamps = extract_timestamp(errors)

# Nothing has run yet! The pipeline is lazy.
# Only now does data start flowing, one line at a time:
for ts in timestamps:
    print(ts)

# Result (example):
# 2024-01-15 10:23:01
# 2024-01-15 10:45:17
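And the `wc -l` step of the analogy is a one-liner: feed the final generator to sum() with a 1 per item. The counting idiom works on any generator; it is demonstrated here on an in-memory log so the sketch runs without an app.log file:

```python
# Counting items in a generator -- the wc -l of a Python pipeline.
sample_lines = [
    "2024-01-15 10:23:01 ERROR Database connection timeout",
    "2024-01-15 10:23:05 INFO Retrying connection",
    "2024-01-15 10:45:17 ERROR Failed to parse response from API",
]

errors = (line for line in sample_lines if "ERROR" in line)
error_count = sum(1 for _ in errors)   # consumes the generator, stores nothing
print(error_count)   # 2
```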


yield from — delegating to another generator

If your generator needs to yield all values from another generator (or any iterable), you can use yield from instead of a manual loop. This is cleaner and also more efficient.

# A QA tester running tests from multiple suites
def login_tests():
    yield "test_valid_login"
    yield "test_invalid_password"
    yield "test_empty_username"

def api_tests():
    yield "test_get_users"
    yield "test_post_order"

# Combine both suites into one stream using yield from
def all_tests():
    yield from login_tests()
    yield from api_tests()

for test in all_tests():
    print(f"Running: {test}")

# Result:
# Running: test_valid_login
# Running: test_invalid_password
# Running: test_empty_username
# Running: test_get_users
# Running: test_post_order


yield vs yield from — they are NOT the same with a genexpr

This is a very common beginner mistake. You have a generator expression inside a function and you write yield in front of it — expecting individual values to come out. They don't. yield hands out the genexpr object itself as a single item. yield from unwraps it and streams each value individually. Let's see this side by side.

# ❌ Using plain yield — probably not what you want
def wrong(n):
    yield (x**2 for x in range(1, n+1))

for val in wrong(5):
    print(val)

# Result:
# <generator object <genexpr> at 0x...>
#
# You got ONE item — the generator object itself. Not the numbers.

# ✅ Using yield from — unwraps and streams each value
def correct(n):
    yield from (x**2 for x in range(1, n+1))

for val in correct(5):
    print(val)

# Result:
# 1
# 4
# 9
# 16
# 25

And here is the bonus insight — because yield from is just delegation, you can still run code after it inside the same function. The generator's local scope stays alive until the function body finishes. The variable n is still perfectly accessible after all the values have been yielded:

def genfun(n):
    yield from (x**2 for x in range(1, n+1))
    print("We processed {} numbers".format(n))  # n is still alive here!

for val in genfun(5):
    print(val)

# Result:
# 1
# 4
# 9
# 16
# 25
# We processed 5 numbers   <-- runs after the last value is consumed

The generator frame is not destroyed after the last yield. It resumes and runs to completion — so any code after yield from executes once the iterable is exhausted. The local scope is fully intact throughout.


send() — pushing data back into a generator

Generators are not just one-way. You can send a value back into a paused generator using gen.send(value). The sent value becomes the result of the yield expression inside the generator. This turns a generator into a two-way communication channel — sometimes called a coroutine. A practical QA use: a generator that collects test results as you feed them in.

def test_reporter():
    results = []
    while True:
        result = yield len(results)   # yields current count, receives next result
        if result is None:
            break
        results.append(result)
    return results

reporter = test_reporter()
next(reporter)                        # must call next() once to start it

reporter.send("PASS")                 # 1
reporter.send("FAIL")                 # 2
reporter.send("PASS")                 # 3

try:
    reporter.send(None)               # signal we are done
except StopIteration as e:
    print(e.value)                    # ['PASS', 'FAIL', 'PASS']


close() and throw() — controlling a generator's lifecycle

close() — shuts down a generator early. Python raises a GeneratorExit exception inside the generator, giving it a chance to clean up (e.g. close a file handle).

throw() — injects an exception into the generator at the point where it is paused. Useful for signalling error conditions from the outside.


def log_scanner(filepath):
    with open(filepath) as f:
        try:
            for line in f:
                yield line.strip()
        except GeneratorExit:
            print("Scanner closed cleanly.")  # cleanup happens here
        except RuntimeError as e:
            print(f"Scan aborted: {e}")

scanner = log_scanner("app.log")

print(next(scanner))   # read first line normally
print(next(scanner))   # read second line

# Inject an error condition from outside. The generator catches it and then
# finishes, so throw() itself raises StopIteration, which we must absorb.
try:
    scanner.throw(RuntimeError, "Disk quota exceeded")
except StopIteration:
    pass
# Output: Scan aborted: Disk quota exceeded

# Or simply stop early and let the file close cleanly
scanner2 = log_scanner("app.log")
next(scanner2)
scanner2.close()
# Output: Scanner closed cleanly.

Summary for QA engineers: Use close() when you have found what you need and want to stop early without leaking file handles. Use throw() to signal external error conditions into a running pipeline.

