Midterm Report: Deriving Realistic Performance Benchmarks for Python Interpreters

Aug 17, 2024

Snapshot from a run of load testing with Locust

Hi, I am Mrigank. As a Summer of Reproducibility 2024 fellow, I am working on deriving realistic performance benchmarks for Python interpreters with Ben Greenman from the University of Utah. In this post, I will provide an update on the progress we have made so far.

Creating a Performance Benchmark

We are currently focusing on applications built on top of Django, a widely used Python web framework. For our first benchmark, we chose Wagtail, a popular content management system. We created a pipeline with Locust to simulate real-world load on the application. All of our work is open-sourced and available on our GitHub repository.

This load-testing pipeline creates hundreds of users who independently create many blog posts on a Wagtail blog site. At the same time, thousands of users are spawned to view these blog posts. Wagtail does not have a built-in API, so it took some initial effort to work out which endpoints to hit. I found them by inspecting the network logs in the browser while interacting with the Wagtail admin interface.

A snapshot from a run of the load test with Locust is shown in the featured image above. This snapshot was generated by spawning users from 24 parallel Locust processes. This was done on a local server, and we plan to perform the same experiments on CloudLab soon.

Profiling

Running the load tests under a profiler showed that the performance bottlenecks lie not in the Wagtail codebase but in the Django codebase. In particular, we identified three modules in Django that consumed the most time during the load tests: django.db.backends.sqlite3._functions, django.utils.functional, and django.views.debug. Dibri, a graduate student in Ben’s lab, is helping us add types to these modules.
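For readers curious how one might rank source files by time spent, here is a small sketch using the standard library's cProfile and pstats. This is an assumption for illustration: it is not necessarily the profiler or the analysis we used, and `workload` is a stand-in for the server handling requests under load.

```python
import cProfile
import pstats
from collections import defaultdict


def time_per_file(profiler):
    """Sum each function's own time (tottime) by the file that defines it."""
    stats = pstats.Stats(profiler)
    totals = defaultdict(float)
    # stats.stats maps (filename, line, function) to
    # (primitive calls, total calls, tottime, cumtime, callers).
    for (filename, _line, _func), entry in stats.stats.items():
        totals[filename] += entry[2]
    # Most expensive files first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def workload():
    # Stand-in for real request handling.
    return sum(i * i for i in range(200_000))


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

for filename, seconds in time_per_file(profiler)[:3]:
    print(f"{seconds:.4f}s  {filename}")
```

Grouping by file rather than by function makes it easier to see which modules are worth typing first.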

Next Steps

Based on these findings, we are now working on typing these modules to see if we can improve the performance of the application by using Static Python. Typing Django is a non-trivial task, and while there have been some efforts to do so, previous attempts like django-stubs are incomplete for our purpose.

We are also writing scripts to mix untyped, shallow-typed, and advanced-typed versions of a Python file, and run each mixed version several times to obtain a narrow confidence interval for the performance of each version.
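As a rough sketch of the idea (not our actual scripts), the snippet below enumerates every combination of typing levels and computes a confidence interval from repeated timings. It assumes the unit being mixed is a whole module, though the same enumeration works for finer units such as functions. The helper names are hypothetical, `run` stands in for one benchmark run, and the interval uses a normal approximation, which is only reasonable with enough repetitions.

```python
import itertools
import statistics
import time

LEVELS = ("untyped", "shallow", "advanced")
MODULES = (
    "django.db.backends.sqlite3._functions",
    "django.utils.functional",
    "django.views.debug",
)


def configurations(modules, levels=LEVELS):
    """Yield every assignment of one typing level to each module."""
    for choice in itertools.product(levels, repeat=len(modules)):
        yield dict(zip(modules, choice))


def measure(run, repeats=10):
    """Time `run` several times and return the samples in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return samples


def confidence_interval(samples, confidence=0.95):
    """Normal-approximation confidence interval for the mean of `samples`."""
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * sem, mean + z * sem


print(len(list(configurations(MODULES))), "configurations")
low, high = confidence_interval(measure(lambda: sum(range(100_000))))
print(f"95% CI: [{low:.6f}s, {high:.6f}s]")
```

More repetitions shrink the standard error and therefore the interval, which is why each mixed version is run several times.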

We will be posting more updates as we make progress. Thank you for reading!

Mrigank Pawagi

Mathematics and Computing Student at the Indian Institute of Science