Migrating my tools to the Toolforge Build Service
Lucas Werkmeister

Over the past few weeks, I migrated almost all of my tools to the Toolforge Build Service, and I thought it would be useful to write a bit about the process and my motivation for doing it.
Why I did it
Recently, the Wikimedia Cloud Services team announced the
Toolforge push-to-deploy beta,
which makes it possible to set up integration with a code forge such as Wikimedia GitLab
that will cause newly pushed versions of a tool’s code to be deployed to Toolforge automatically.
This has the potential to significantly simplify the development of tools:
instead of having to log into a Toolforge bastion server and deploy every update to the tool manually,
one can just run git push
and everything else happens automatically.
Currently, the beta has some limitations: most importantly, web services are not supported yet, which means the feature is actually useless to me in its current state because all of my tools are web services. (It should already be useful for bots, though it’s not clear to me if any bots already use it in practice; at least I couldn’t find any relevant-looking push-to-deploy config on MediaWiki Codesearch.) However, I’m hopeful that support for web services will be added soon. In the meantime, because it already seems clear that this support will only include tools based on the build service (but not tools using the various other web service types supported by Toolforge), now seems like a good time to migrate my tools to the build service, so that I’ll have less work to do to set up push-to-deploy once it becomes available.
What I did
I also used this as an opportunity to adopt some best practices in my tools in general, even if not all of them were related to the build service migration. I’ll go through them here in roughly the order in which I did them in most tools.
Add health check
A health check is a way for the Toolforge infrastructure to detect if a tool is running (“healthy”) or not. This is useful, for instance, to enable restarts of a tool (including deploying new versions) with no downtime: the infrastructure (Kubernetes) will bring up a container with the new version of the tool, wait for it to become ready according to the health check, switch traffic from the old container to the new one, and only then tear down the old container.
Since 2024, Toolforge installs a TCP health check by default:
the tool is considered healthy if it accepts connections on the web service port.
However, this doesn’t guarantee that the server is actually ready to handle requests;
we can do better by defining a health-check-path in the service.template file,
at which point Toolforge will instead use an HTTP health check and test if the tool successfully responds to HTTP requests to this path.
It’s apparently conventional to call this path /healthz
(though last I looked, nobody seemed to know what the “z” stands for),
and as it doesn’t need to return anything special,
the Python code for this endpoint looks very simple:
@app.route('/healthz')
def health():
    return ''
(Plus a return type,
import-aliased to RRV,
in those tools where I use mypy.)
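In those tools, the annotated version looks roughly like this (a sketch, assuming RRV is an import alias for Flask’s flask.typing.ResponseReturnValue type):
from flask import Flask
from flask.typing import ResponseReturnValue as RRV

app = Flask(__name__)

@app.route('/healthz')
def health() -> RRV:
    # nothing special to return; any successful response satisfies the health check
    return ''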
And it’s configured in the service.template file like this:
health-check-path: /healthz
I usually did this improvement first (unless I forgot or it was already set up) because it meant that most of the following improvements could be deployed without downtime for users.
Splitting prod and dev dependencies
In most of my tools, I previously had only one requirements.txt file
(compiled using pip-tools from requirements.in).
This meant that the tool’s installation on Toolforge included
not just the packages required to run the tool (Flask, Werkzeug, mwapi, etc.)
but also the packages required to test it (Flake8, mypy, pytest, etc.).
This is wasteful (mypy is big!),
and a build-service-based tool would install its dependencies more often than before
(each time a new image is built, i.e. during every deployment),
so I took an improvement I’d already done years ago in the Wikidata Lexeme Forms tool
and followed through with it in my other tools:
split the testing packages into a separate file
(dev-requirements.txt, compiled from dev-requirements.in).
The dev packages are installed locally (pip-sync *requirements.txt)
and in CI (pip install -r requirements.txt -r dev-requirements.txt),
but not on Toolforge.
In most tools, this shrank the installed venv by roughly 50%, which is pretty neat!
I also added a CI job that verifies that I didn’t accidentally put a prod dependency into the dev requirements,
by only installing the prod requirements and checking that python app.py runs through without crashing on a missing import.
This isn’t perfect,
but since I know that I’m not doing any advanced lazy-import stuff in my own code,
it’s good enough for me.
(I guess an alternative would be to reuse the health check for this.)
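If I ever wanted that alternative, it could look roughly like this hypothetical smoke_test.py (just a sketch, not something my tools currently contain):
# smoke_test.py - hypothetical sketch of the "reuse the health check" alternative:
# run it with only the prod requirements installed, so that a dev-only package
# used by the prod code fails loudly here.
from app import app  # the import alone already fails on a missing prod dependency


def main() -> None:
    # exercise the /healthz endpoint through Flask's test client,
    # without starting a real server
    response = app.test_client().get('/healthz')
    assert response.status_code == 200


if __name__ == '__main__':
    main()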
Configuration from environment variables
All of my Flask tools read the Flask configuration
from a (user-only-readable) config.yaml file in the source code directory;
this contains, at a minimum, the secret key used to sign the session,
and sometimes more information, such as the OAuth consumer key and secret.
This is still possible on the build service (by specifying the mount: all option),
but it means the tool will rely on NFS, which is generally undesirable.
A more forward-looking option is to store the config in environment variables,
which Toolforge added support for two years ago.
It turns out that Flask has a method, app.config.from_prefixed_env(),
which will automatically load all environment variables whose name starts with a certain prefix (I use TOOL_) into the config.
It even has support for nested objects (using double underscores in the name),
so that configuration like app.config['OAUTH']['consumer_key']
can be represented as the environment variable named TOOL_OAUTH__consumer_key.
However, there’s one problem with this:
Toolforge requires environment variables to have all-uppercase names,
but my existing code was expecting lowercase names inside the OAUTH config dict.
I worked around this by first converting the configuration keys to all-uppercase
(initially, still inside the config.yaml file);
then, I moved the configuration to envvars,
and finally commented out the contents of the config.yaml file
(example SAL).
All of this was possible while the tools were still running on the legacy web service types.
(The code reading the config.yaml file is still there, by the way –
it’s much more convenient for local development, even if it’s not used on Toolforge anymore.)
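Put together, the configuration loading now looks roughly like this (a sketch; the exact environment variable used for detection and the fallback logic are illustrative, not copied verbatim from any particular tool):
import os

import yaml
from flask import Flask

app = Flask(__name__)

if 'TOOL_SECRET_KEY' in os.environ:
    # on Toolforge: load every TOOL_* environment variable into the config;
    # double underscores turn into nested dicts, e.g.
    # TOOL_OAUTH__CONSUMER_KEY -> app.config['OAUTH']['CONSUMER_KEY']
    app.config.from_prefixed_env('TOOL')
else:
    # local development: fall back to the config.yaml file
    with open('config.yaml') as config_file:
        app.config.update(yaml.safe_load(config_file))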
Move CI from GitHub to GitLab
The CI for most of my tools was on GitHub,
mainly because many of them predated Wikimedia GitLab (or the availability of GitLab CI there).
However, I don’t really fancy giving Microsoft deploy access to my tools,
so I moved the CI over to GitLab CI.
For most tools, this was very straightforward,
to the point where I just copy+pasted the .gitlab-ci.yml file between tools.
(In QuickCategories, setting up a MariaDB service for CI required a little bit more work.)
Actual build service migration
The migration to the build service starts with the Procfile,
which tells the infrastructure how to run the tool.
I used the same Procfile for all my Python tools:
web: gunicorn --workers=4 app:app
This defines an entrypoint called web,
which will run Gunicorn with four worker processes,
importing app.py and running the app WSGI app from it.
Toolforge specifies the $PORT environment variable to tell the tool where to listen for connections,
and Gunicorn will bind to that port by default if the environment variable is defined,
so no explicit --bind option is necessary.
Of course, this also requires adding gunicorn to requirements.in / requirements.txt,
so that it will be installed inside the image.
Also, don’t forget to git add Procfile…
A significant benefit of the build service is that it gives us early access to newer Python versions.
By writing 3.13 in a file called .python-version
(don’t forget to git add this one either!),
and specifying --use-latest-versions
when running toolforge build start
(presumably this will become the default at some point),
our tool will run on Python 3.13,
whereas the latest version available outside of the build service is currently Python 3.11
(until two weeks or so from now).
I didn’t actually notice any Python 3.13 features I wanted to use in my tools
(except for one tool where I was able to replace a TypeAlias annotation with a type statement),
but it’s still nice to use the same version in production as the one I develop on locally.
(Of course, I also bumped the Python version in CI from 3.11 to 3.13.)
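For illustration, the TypeAlias-to-type-statement change mentioned above looks like this (with a made-up alias name):
# before, on Python 3.11: an explicit TypeAlias annotation
# from typing import TypeAlias
# ItemIdList: TypeAlias = list[str]

# after, on Python 3.12+: the dedicated type statement
type ItemIdList = list[str]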
That said, there is one issue with Python 3.13 that I had to work around.
All of my Python tools use the toolforge library for its set_user_agent() function
(it has other features but I mostly don’t use them);
this library imports PyMySQL as soon as it is imported.
PyMySQL, in turn, immediately tries to initialize a default user name for database connections from the environment
(even if the tool is never going to open a database connection),
via the Python getpass.getuser() function.
However, inside a build service container, no user name is set, and so this function raises an error.
This was fine in earlier Python versions, because PyMySQL catches the error;
however, Python 3.13 changed the error being thrown from KeyError to OSError,
which PyMySQL didn’t catch.
PyMySQL subsequently added this error to the except clause;
however, they haven’t published a new release since that commit.
Due to this bizarre confluence of edge cases,
it’s impossible to import toolforge or pymysql in a Toolforge Build Service tool on Python 3.13 or later when using the latest released version of PyMySQL.
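The underlying behavior change is easy to see without PyMySQL at all; roughly like this (a sketch; it only triggers in an environment, such as the build service container, where neither the usual USER/LOGNAME-style environment variables nor a passwd entry for the current UID are available):
import getpass

try:
    user = getpass.getuser()
except OSError:
    # Python 3.13+ raises OSError when no user name can be determined ...
    user = None
except KeyError:
    # ... while Python <= 3.12 let the KeyError from the pwd lookup escape,
    # which is what the released PyMySQL versions still expect to catch
    user = None
print(user)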
My workaround is to install PyMySQL from Git, using this requirements.in entry:
pymysql @ git+https://github.com/PyMySQL/PyMySQL@main
I look forward to the day when I’ll be able to remove this again.
The remaining part of the build service migration is the service.template file,
which contains default arguments for calling webservice commands.
I changed the type from python3.11 to buildservice,
and also added mount: none to specify that the tool doesn’t need NFS mounted.
Then, after pushing the changes to GitLab and building a new container image,
I deployed the build service version with commands like this:
webservice stop &&
mv www{,-unused-tool-now-runs-on-buildservice} &&
wget https://gitlab.wikimedia.org/toolforge-repos/translate-link/-/raw/2e2349a9fb/service.template &&
webservice start
This stops the webservice (using the old defaults in www/python/src/service.template),
moves the old source code directory away so I don’t get confused by it later (I’ll remove it eventually™),
downloads the new service.template file right into the home directory,
and then starts the webservice again using the defaults from that file.
And last but not least, I updated the instructions in the README.md
(initially as a separate commit,
later in the same big migration commit because I couldn’t be bothered to separate it anymore).
More details
If you want to follow the migrations in more detail, you can look at the relevant Git commit ranges and SAL entries:
- Wikidata Lexeme Forms: Git, SAL
- Wikidata Image Positions: Git, SAL
- QuickCategories (was mostly already migrated to the build service from T374152): Git, SAL
- SpeedPatrolling: Git, SAL
- PagePile Visual Filter: Git, SAL
- Ranker: Git, SAL
- Translate Link: Git, SAL
At some point, I should also apply most of these improvements to cookiecutter-toolforge, though I’m not so sure about the split-requirements part (I feel like it might overcomplicate the dev setup for other developers for little benefit). Let me know what you think :)