Today I learned that the largest file ever published to PyPI has 20 MILLION lines of code.

20 million lines of Python code

Today I'm attending PyCon Portugal 2023 and Tom Forbes, the Thursday keynoter, presented the largest Python file ever published to PyPI. It is a Python file with over 20 million lines of Python code. MILLION.

What does this file do, you might ask?

I'll give you a hint. This file comes from a project called EvenOrOdd... Can you see where we are going?

The file implements a single function isEven that starts like this:

def isEven(num):
    if num == 0:
        return True
    elif num == 1:
        return False
    elif num == 2:
        return True
    elif num == 3:
        return False
    # ...

If you scroll down for long enough, you'll eventually reach the end of the function, which looks like this:

def isEven(num):
    # ...
    elif num == 1048571:
        return False
    elif num == 1048572:
        return True
    elif num == 1048573:
        return False
    elif num == 1048574:
        return True
    elif num == 1048575:
        return False
    else:
        raise Exception("Number is not within bounds")

That's a pretty silly function! Why on Earth would it stop at 1,048,575? (Starting at 0 and going up to 1_048_575 means that the function isEven covers \(2^{20}\) integers.)
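To get a feel for how a file like this might come about, here is a minimal sketch that generates the same kind of if/elif chain for a tiny bound. The helper `generate_is_even_source` and the name `small_is_even` are hypothetical and only for illustration. This is not necessarily how the EvenOrOdd project produced its file.

```python
def generate_is_even_source(limit):
    """Return source code for an if/elif chain covering 0 .. limit - 1."""
    lines = ["def isEven(num):"]
    for n in range(limit):
        keyword = "if" if n == 0 else "elif"
        lines.append(f"    {keyword} num == {n}:")
        lines.append(f"        return {n % 2 == 0}")
    lines.append("    else:")
    lines.append('        raise Exception("Number is not within bounds")')
    return "\n".join(lines) + "\n"


# Build a tiny version that only covers 0..7 and load it into a namespace.
source = generate_is_even_source(8)
namespace = {}
exec(source, namespace)
small_is_even = namespace["isEven"]

print(small_is_even(0), small_is_even(7))  # True False
print(2 ** 20 - 1)  # 1048575, the last value the real function handles
```

Calling `generate_is_even_source(2 ** 20)` would produce a chain with the same bounds as the one shown above.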

I installed the package:

python -m pip install evenorodd

And then decided to test it, just to make sure none of the branches were wrong:

>>> from EvenOrOdd import EvenOrOdd
>>> for i in range(2 ** 20):
...     assert EvenOrOdd.isEven(i) == (not (i % 2))
...

Given that this loop is taking a long time to finish, I wonder if the difference between isEven(0) and isEven(1048575) is noticeable.

Using the module timeit, I tried checking how fast I could compute isEven(0) versus isEven(1048575) and I got these numbers:

>>> from timeit import timeit

>>> setup = "from EvenOrOdd.EvenOrOdd import isEven"
>>> timeit("isEven(0)", setup=setup, number=1000)
2.8417000066838227e-05
>>> timeit("isEven(1048575)", setup=setup, number=1000)
6.486064583999905
>>> _ / 1000
0.0064860645839999054

The timings will be different on your machine, but the relative comparison should be similar: the numbers above show that calling isEven(0) 1000 times is faster than computing isEven(1048575) once.

We can take this further:

>>> timeit("isEven(0)", setup=setup, number=100000)  # Changed `number=...`
0.002839708999999857

This new timing shows that we can call isEven(0) about 100,000 times and be done before isEven(1048575) finishes computing...
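The gap can be put in numbers using only the two 1000-call timings reported above. This is a small arithmetic sketch and the variable names are made up for illustration. The result applies to the machine those timings came from.

```python
# Timings copied from the timeit runs above (seconds per 1000 calls).
time_1000_calls_first = 2.8417000066838227e-05  # isEven(0)
time_1000_calls_last = 6.486064583999905        # isEven(1048575)

per_call_first = time_1000_calls_first / 1000
per_call_last = time_1000_calls_last / 1000

# How many isEven(0) calls fit in the time of one isEven(1048575) call.
ratio = per_call_last / per_call_first
print(f"{ratio:,.0f}")
```

On these numbers the ratio comes out at well over 200,000, so "about 100,000 times" is a conservative statement.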

What is more, I finished writing this blog post while the test of the function isEven (the for loop I shared above) was still running, so I'll interrupt that loop:

>>> for i in range(2 ** 20):
...     assert EvenOrOdd.isEven(i) == (not (i % 2))
...
^CTraceback (most recent call last):
  File "<stdin>", line 2, in <module>
KeyboardInterrupt
>>> i
475308

So, as you can see, I've gone through about 45% of the test cases (475,308 out of 1,048,576), and those were the faster ones. At least we know that the numbers up to 475,307 are correctly classified as even or odd by the function isEven.
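The share of cases checked overstates how far along the loop really was. As a rough sketch, assume that isEven(n) has to evaluate n + 1 comparisons before it returns. That cost model is an assumption based on the shape of the if/elif chain, not something measured here.

```python
total_cases = 2 ** 20
checked = 475308  # the values 0..475307 passed before the interrupt

fraction_of_cases = checked / total_cases

# Assumed cost model: isEven(n) performs n + 1 comparisons,
# so checking 0..k-1 costs 1 + 2 + ... + k = k * (k + 1) / 2.
work_done = checked * (checked + 1) // 2
total_work = total_cases * (total_cases + 1) // 2
fraction_of_work = work_done / total_work

print(f"{fraction_of_cases:.1%} of the cases")  # 45.3% of the cases
print(f"{fraction_of_work:.1%} of the work")    # 20.5% of the work
```

Under that assumption, the interrupted loop had done only about a fifth of the total work, because the remaining cases are the slow ones.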
