Migrating my tools to the Toolforge Build Service


Over the past few weeks, I migrated almost all of my tools to the Toolforge Build Service, and I thought it would be useful to write a bit about the process and my motivation for doing it.

Why I did it

Recently, the Wikimedia Cloud Services team announced the Toolforge push-to-deploy beta, which makes it possible to set up integration with a code forge such as Wikimedia Gitlab that will cause newly pushed versions of a tool’s code to be deployed to Toolforge automatically. This has the potential to significantly simplify the development of tools: instead of having to log into a Toolforge bastion server and deploy every update to the tool manually, one can just run git push and everything else happens automatically.

Currently, the beta has some limitations: most importantly, web services are not supported yet, which means the feature is actually useless to me in its current state because all of my tools are web services. (It should already be useful for bots, though it’s not clear to me if any bots already use it in practice; at least I couldn’t find any relevant-looking push-to-deploy config on MediaWiki Codesearch.) However, I’m hopeful that support for web services will be added soon. In the meantime, because it already seems clear that this support will only include tools based on the build service (but not tools using the various other web service types supported by Toolforge), now seems like a good time to migrate my tools to the build service, so that I’ll have less work to do to set up push-to-deploy once it becomes available.

What I did

I also used this as an opportunity to adopt some best practices in my tools in general, even if not all of them were related to the build service migration. I’ll go through them here in roughly the order in which I did them in most tools.

Add health check

A health check is a way for the Toolforge infrastructure to detect if a tool is running (“healthy”) or not. This is useful, for instance, to enable restarts of a tool (including deploying new versions) with no downtime: the infrastructure (Kubernetes) will bring up a container with the new version of the tool, wait for it to become ready according to the health check, switch traffic from the old container to the new one, and only then tear down the old container.

Since 2024, Toolforge installs a TCP health check by default: the tool is considered healthy if it accepts connections on the web service port. However, this doesn’t guarantee that the server is actually ready to handle requests; we can do better by defining a health-check-path in the service.template file, at which point Toolforge will instead use an HTTP health check and test if the tool successfully responds to HTTP requests to this path. It’s apparently conventional to call this path /healthz (though last I looked, nobody seemed to know what the “z” stands for), and as it doesn’t need to return anything special, the Python code for this endpoint looks very simple:

@app.route('/healthz')
def health():
    return ''

(Plus a return type, import-aliased to RRV, in those tools where I use mypy.) And it’s configured in the service.template file like this:

health-check-path: /healthz
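
As for the return type mentioned above: a typed sketch of the endpoint could look like this, assuming RRV aliases flask.typing.ResponseReturnValue (the exact import in my tools may differ):

from flask.typing import ResponseReturnValue as RRV

@app.route('/healthz')
def health() -> RRV:
    return ''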

I usually did this improvement first (unless I forgot or it was already set up) because it meant that most of the following improvements could be deployed without downtime for users.

Splitting prod and dev dependencies

In most of my tools, I previously had only one requirements.txt file (compiled using pip-tools from requirements.in). This meant that the tool’s installation on Toolforge included not just the packages required to run the tool (Flask, Werkzeug, mwapi, etc.) but also the packages required to test it (Flake8, mypy, pytest, etc.). This is wasteful (mypy is big!), and a build-service-based tool would install its dependencies more often than before (each time a new image is built, i.e. during every deployment), so I took an improvement I’d already done years ago in the Wikidata Lexeme Forms tool and followed through with it in my other tools: split the testing packages into a separate file (dev-requirements.txt, compiled from dev-requirements.in). The dev packages are installed locally (pip-sync *requirements.txt) and in CI (pip install -r requirements.txt -r dev-requirements.txt), but not on Toolforge. In most tools, this shrank the installed venv by roughly 50%, which is pretty neat!
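
The pip-tools workflow for the split then looks roughly like this (a sketch, using the file names above):

pip-compile requirements.in         # writes requirements.txt (prod dependencies)
pip-compile dev-requirements.in     # writes dev-requirements.txt (Flake8, mypy, pytest, …)
pip-sync *requirements.txt          # local venv: install both, drop everything else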

I also added a CI job that verifies that I didn’t accidentally put a prod dependency into the dev requirements, by only installing the prod requirements and checking that python app.py runs through without crashing on a missing import. This isn’t perfect, but since I know that I’m not doing any advanced lazy-import stuff in my own code, it’s good enough for me. (I guess an alternative would be to reuse the health check for this.)
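
The check itself boils down to a few commands, roughly these (a sketch, not the literal CI job from my tools):

python3 -m venv prod-venv                        # start from an empty venv
prod-venv/bin/pip install -r requirements.txt    # install the prod requirements only
prod-venv/bin/python app.py                      # fails on a missing import if a prod dependency slipped into the dev requirements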

Configuration from environment variables

All of my Flask tools previously read the Flask configuration from a (user-only-readable) config.yaml file in the source code directory; this contains, at a minimum, the secret key used to sign the session, and sometimes more information, such as the OAuth consumer key and secret. This is still possible on the build service (by specifying the mount: all option), but it means the tool will rely on NFS, which is generally undesirable. A more forward-looking option is to store the config in environment variables, which Toolforge added support for two years ago.

It turns out that Flask has a method, app.config.from_prefixed_env(), which will automatically load all environment variables whose name starts with a certain prefix (I use TOOL_) into the config. It even has support for nested objects (using double underscores in the name), so that configuration like app.config['OAUTH']['consumer_key'] can be represented as the environment variable named TOOL_OAUTH__consumer_key.
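
In code, this is a single call (a minimal sketch, assuming the app object is created as usual):

import flask

app = flask.Flask(__name__)
# load all TOOL_* environment variables into the Flask config:
# TOOL_SECRET_KEY ends up as app.config['SECRET_KEY'], and the double underscore in
# TOOL_OAUTH__consumer_key makes it end up as app.config['OAUTH']['consumer_key']
app.config.from_prefixed_env('TOOL')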

However, there’s one problem with this: Toolforge requires environment variables to have all-uppercase names, but my existing code was expecting lowercase names inside the OAUTH config dict. I worked around this by first converting the configuration keys to all-uppercase (initially, still inside the config.yaml file); then, I moved the configuration to envvars, and finally commented out the contents of the config.yaml file (example SAL). All of this was possible while the tools were still running on the legacy web service types. (The code reading the config.yaml file is still there, by the way – it’s much more convenient for local development, even if it’s not used on Toolforge anymore.)
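
That fallback can look roughly like this (a sketch; the actual code in my tools differs in the details):

import os

import flask
import yaml

app = flask.Flask(__name__)
app.config.from_prefixed_env('TOOL')  # environment variables win (that’s what Toolforge uses)
if not app.config.get('SECRET_KEY'):  # otherwise, fall back to config.yaml for local development
    config_path = os.path.join(os.path.dirname(__file__), 'config.yaml')
    if os.path.exists(config_path):
        with open(config_path) as f:
            app.config.update(yaml.safe_load(f))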

Move CI from GitHub to GitLab

The CI for most of my tools was on GitHub, mainly because many of them predated Wikimedia GitLab (or the availability of GitLab CI there). However, I don’t really fancy giving Microsoft deploy access to my tools, so I moved the CI over to GitLab CI. For most tools, this was very straightforward, to the point where I just copy+pasted the .gitlab-ci.yml file between tools. (In QuickCategories, setting up a MariaDB service for CI required a little bit more work.)
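
The copied-around file is a fairly standard GitLab CI config, roughly along these lines (a sketch, not the literal .gitlab-ci.yml from my repositories):

image: python:3.13

test:
  script:
    - pip install -r requirements.txt -r dev-requirements.txt
    - flake8
    - mypy .
    - pytest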

Actual build service migration

The migration to the build service starts with the Procfile, which tells the infrastructure how to run the tool. I used the same Procfile for all my Python tools:

web: gunicorn --workers=4 app:app

This defines an entrypoint called web which will run Gunicorn with four worker processes, importing app.py and running the WSGI app named app from it. Toolforge sets the $PORT environment variable to tell the tool where to listen for connections, and Gunicorn will bind to that port by default if the environment variable is defined, so no explicit --bind option is necessary. Of course, this also requires adding gunicorn to requirements.in / requirements.txt, so that it will be installed inside the image. Also, don’t forget to git add the Procfile.

A significant benefit of the build service is that it gives us early access to newer Python versions. By writing 3.13 in a file called .python-version (don’t forget to git add this one either!), and specifying --use-latest-versions when running toolforge build start (presumably this will become the default at some point), our tool will run on Python 3.13, whereas the latest version available outside of the build service is currently Python 3.11 (until two weeks or so from now). I didn’t actually notice any Python 3.13 features I wanted to use in my tools (except for one tool where I was able to replace a TypeAlias annotation with a type statement), but it’s still nice to use the same version in production as the one I develop on locally. (Of course, I also bumped the Python version in CI from 3.11 to 3.13.)
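
In practice, the whole version bump amounts to something like this (a sketch; the repository URL is a placeholder, and the exact toolforge build start invocation may differ):

echo 3.13 > .python-version
git add .python-version
git commit -m 'Run on Python 3.13'
git push
# then, from the tool account on the Toolforge bastion:
toolforge build start --use-latest-versions https://gitlab.wikimedia.org/toolforge-repos/<tool>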

That said, there is one issue with Python 3.13 that I had to work around. All of my Python tools use the toolforge library for its set_user_agent() function (it has other features but I mostly don’t use them); this library imports PyMySQL as soon as it is imported. PyMySQL, in turn, immediately tries to initialize a default user name for database connections from the environment (even if the tool is never going to open a database connection), via the Python getpass.getuser() function. However, inside a build service container, no user name is set, and so this function raises an error. This was fine in earlier Python versions, because PyMySQL catches the error; however, Python 3.13 changed the error being thrown from KeyError to OSError, which PyMySQL didn’t catch. PyMySQL subsequently added this error to the except clause; however, they haven’t published a new release since that commit. Due to this bizarre confluence of edge cases, it’s impossible to import toolforge or pymysql in a Toolforge Build Service tool on Python 3.13 or later when using the latest released version of PyMySQL. My workaround is to install PyMySQL from Git, using this requirements.in entry:

pymysql @ git+https://github.com/PyMySQL/PyMySQL@main

I look forward to the day when I’ll be able to remove this again.

The remaining part of the build service migration is the service.template file, which contains default arguments for calling webservice commands. I changed the type from python3.11 to buildservice, and also added mount: none to specify that the tool doesn’t need NFS mounted. Then, after pushing the changes to GitLab and building a new container image, I deployed the build service version with commands like this:

webservice stop &&
mv www{,-unused-tool-now-runs-on-buildservice} &&
wget https://gitlab.wikimedia.org/toolforge-repos/translate-link/-/raw/2e2349a9fb/service.template &&
webservice start

This stops the webservice (using the old defaults in www/python/src/service.template), moves the old source code directory away so I don’t get confused by it later (I’ll remove it eventually™), downloads the new service.template file right into the home directory, and then starts the webservice again using the defaults from that file. And last but not least, I updated the instructions in the README.md (initially as a separate commit, later in the same big migration commit because I couldn’t be bothered to separate it anymore).
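
For reference, the resulting service.template ends up looking roughly like this (a sketch combining the options mentioned above; details vary per tool):

type: buildservice
mount: none
health-check-path: /healthz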

More details

If you want to follow the migrations in more detail, you can look at the relevant Git commit ranges and SAL entries:

At some point, I should also apply most of these improvements to cookiecutter-toolforge, though I’m not so sure about the split-requirements part (I feel like it might overcomplicate the dev setup for other developers for little benefit). Let me know what you think :)