On Mon, Apr 3, 2023, at 3:16 AM, Ian Wienand wrote:
Hi there,
The wheel cache builds are becoming harder and harder to maintain, so I think we need to re-evaluate what we're doing.
To summarise: currently, for every platform, every day:
* job starts with zuul clone of requirements
* run bindep (for master only? ... probably wrong) and do some more (now-looking-a-bit-dubious) setup [1]
* we iterate over master + stable/* and "pip wheel" build each item in requirements, putting it into a local wheel cache [2]
* except for arm64, where we take the latest two branches (the choosing of which was recently broken by the change in sort ordering from the "YYYY.X" release format)
* then we grep the build logs to find out which .whl files were downloaded from PyPI, and delete them from the local cache
* then we move to the publish step, where we copy the wheels to AFS. This never removes, but it does overwrite (so the .whl is very likely to change every day, as timestamps, etc. mean .whl builds are not reproducible). [3]
Rebuilding on some platforms is important, as .so files can move due to library updates and incompatibilities. I have this problem with libre2 on tumbleweed. That said, for the CI system I don't think any of the platforms have this problem, except maybe centos stream (theoretical, and I suspect they won't do this to stream either).
* then we make a PyPI index from the files in AFS [4].
* We wait for all the publishing jobs to complete successfully, then we release the AFS volumes. If any fail, we don't publish that day [5]
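As a rough sketch of the "grep the build logs and delete what came from PyPI" step in the list above: the real job is linked at [2], and this is not that code. The log line format, the regular expression, and the function names are assumptions made for illustration only.

```python
import re
from pathlib import Path

# Assumption: the build log contains one line per fetched file, such as
#   "Downloading foo-1.0-py3-none-any.whl (12 kB)"
# or "Downloading https://.../foo-1.0-py3-none-any.whl".
DOWNLOADED_WHEEL = re.compile(r"Downloading\s+(?:\S*/)?(?P<name>[^\s/]+\.whl)\b")


def downloaded_wheels(log_text: str) -> set[str]:
    """Return the wheel file names the log reports as downloaded."""
    return {m.group("name") for m in DOWNLOADED_WHEEL.finditer(log_text)}


def prune_cache(cache_dir: Path, log_text: str) -> list[str]:
    """Delete downloaded wheels from cache_dir and return the names removed."""
    removed = []
    for name in sorted(downloaded_wheels(log_text)):
        wheel = cache_dir / name
        if wheel.is_file():
            wheel.unlink()
            removed.append(name)
    return removed
```

Anything left in the cache directory afterwards is, under these assumptions, a wheel that had to be built locally.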
This started a long time ago, when we had a few platforms and a few branches. We now have newton->2023.1 branches in requirements, and we currently do this for 15 different platforms. When you multiply that out, it's not sustainable. Daily build jobs are timing out now, which holds up all publishing (I think the latest release pushed us over the edge).
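On the arm64 "latest two branches" selection mentioned in the list above: a plain string sort puts "stable/2023.1" before every named release, which is presumably how the selection broke. A minimal sketch of a sort key that handles both formats follows. It assumes named releases sort alphabetically among themselves and that every "YYYY.X" release is newer than every named one; the function names are made up for this example.

```python
import re

# Matches the newer "YYYY.X" release names, e.g. "2023.1".
YEAR_RELEASE = re.compile(r"^(\d{4})\.(\d+)$")


def release_sort_key(branch: str):
    """Order named releases alphabetically, then YYYY.X releases numerically."""
    name = branch.removeprefix("stable/")
    match = YEAR_RELEASE.match(name)
    if match:
        return (1, int(match.group(1)), int(match.group(2)), "")
    return (0, 0, 0, name)


def latest_branches(branches, count=2):
    """Return the `count` newest branches, oldest first."""
    return sorted(branches, key=release_sort_key)[-count:]


branches = ["stable/zed", "stable/2023.1", "stable/yoga", "stable/xena"]
print(sorted(branches)[-2:])      # plain sort picks yoga and zed
print(latest_branches(branches))  # key-based sort picks zed and 2023.1
```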
For some years, we were not pruning wheels we downloaded from PyPI [6]. If a .whl is built and on PyPI we should get it directly from upstream -- we have a caching proxy set up for CI jobs. I have written a small tool to help us clean up our caches [7]. It would be good if we could audit this tool, and when we're happy with its output we can look at clearing out our caches.
But that still leaves what is going into them every day. Iterating every branch is fairly useless. Ideally, we'd have a matrix of platforms v branches that gave us an exact mapping of what platforms run jobs on what branches. This does not generally exist; we all have some vague ideas and the extremes are obvious (we are not running Zed jobs on centos-8, and we are not running newton jobs on Ubuntu Jammy) but the middle is fuzzy.
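To make the "matrix of platforms v branches" idea concrete, here is one possible shape for it, purely as a sketch. The branch sets are placeholders, not a statement of what actually runs where; the only facts taken from this thread are that centos-8 does not run Zed jobs and Ubuntu Jammy does not run newton jobs. The names `PLATFORM_BRANCHES`, `should_build` and `build_targets` are hypothetical.

```python
# Hypothetical matrix: platform -> branches whose jobs run on it.
# The entries are placeholders for illustration only.
PLATFORM_BRANCHES = {
    "centos-8": {"stable/ussuri", "stable/victoria"},
    "ubuntu-jammy": {"master", "stable/2023.1", "stable/zed"},
}


def should_build(matrix, platform, branch):
    """True if wheels for `branch` are worth building on `platform`."""
    return branch in matrix.get(platform, set())


def build_targets(matrix):
    """Flatten the matrix into a sorted list of (platform, branch) builds."""
    return sorted((p, b) for p, branches in matrix.items() for b in branches)
```

With something like this checked in, the daily job would iterate `build_targets()` instead of every branch on every platform.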
I'd like to solicit opinions on what we want this cache to do.
Thinking out loud here: what if we switched to maintaining an explicit list of packages we care about having wheels for: libvirt-python, cryptography, cffi, lxml, etc.? Some of these will already have wheels on PyPI, some will only have wheels for x86 and not arm, and some won't have wheels at all. We could build those that are not available for the current platform and publish those. I suggest this because I suspect that the vast majority of wheels we are building are for sdist-only, pure-Python projects that don't really need wheels published to speed up installation. If this assumption is correct, we might end up with a very small list to maintain that builds quickly and doesn't consume much mirror space. Another upside to this approach is that we could decouple the wheel mirrors from OpenStack and allow other projects to request wheels. One downside is that this will probably require occasional maintenance on the OpenDev side to approve new packages to the list. We will also need to maintain a bindep file to build that subset of packages. In some ways we would be duplicating work OpenStack is already doing. This is probably the biggest downside and worth considering.
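A minimal sketch of the "build only what upstream doesn't already provide for this platform" check. It relies on the standard wheel file name layout, where the last dash-separated field is the platform tag, and it deliberately ignores the Python and ABI tags, so it is cruder than what pip really does. The file listings in the usage note are invented examples, and how the listing is fetched from PyPI is left out.

```python
def platform_tags(wheel_filename):
    """Return the platform tags from a wheel file name.

    Wheel names end in "-{python}-{abi}-{platform}.whl"; several
    platform tags may be joined with ".".
    """
    stem = wheel_filename[: -len(".whl")]
    return stem.split("-")[-1].split(".")


def has_wheel_for(filenames, arch):
    """True if any listed wheel is pure Python ("any") or targets `arch`."""
    for filename in filenames:
        if not filename.endswith(".whl"):
            continue  # sdists and other files do not count
        tags = platform_tags(filename)
        if any(t == "any" or t.endswith("_" + arch) for t in tags):
            return True
    return False


def packages_to_build(available, arch):
    """Return packages from `available` with no usable upstream wheel."""
    return sorted(
        pkg for pkg, files in available.items() if not has_wheel_for(files, arch)
    )
```

For example, given a listing where lxml has only an x86_64 manylinux wheel and libvirt-python has only an sdist, `packages_to_build(listing, "aarch64")` would return both, while for `"x86_64"` it would return only libvirt-python.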
One compelling option is to just build master requirements into the cache. The theory being that as branches are made, the requirement must have passed through master; ergo as we have an additive cache we will have wheels built.
This seems OK, but it also seems that we need a cut-off point. It doesn't seem useful to be building "master" on centos-8/xenial, as the requirements are all pinning things for Python versions way in advance of what's there. If we do this, how do we maintain where a platform stops building master? stable/* requirements shouldn't change much; but if they do, we should push new .whls into the cache -- how do we do that in this model? This also makes our cache "precious" in that we are never building old branches -- if we lose AFS for some reason, we have a job ahead of us to restore all the old wheels.
I think a perfect solution here might involve making the entire publishing pipeline driven by changes to openstack/requirements. Firstly, we have a non-trivial amount of work to figure out modifying the release process from "everything builds and releases or nothing does" to individual builds. I think we can do this with Zuul semaphores, and there's a decent chance it was written like this because mutexes/semaphores weren't available. This would be a non-trivial amount of work, and would also hand off a significant amount of this from what has traditionally been an infra job to the requirements project. Is anyone interested in working on this?
My only concern with having openstack/requirements drive this is that we have talked about decoupling OpenStack from these builds in the past so that other projects can more easily take advantage of them. Driving this from requirements would probably kill those dreams. I'm also not sure I understand how Zuul semaphores help with the handling of build failures. Instead, maybe we should just always publish, since any wheel we do build should be valid. If we don't build a wheel for X and Y depends on X, the PyPI indexes upstream of us will already cause X to be used. We aren't really gaining much by withholding the wheels we do build from the downstream published wheel cache index.
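To spell out the difference between today's "everything or nothing" release and the "always publish" alternative, here is a toy sketch. The platform names and the `platforms_to_release` function are made up for illustration, and none of this reflects how the actual Zuul playbooks are structured.

```python
def platforms_to_release(results, all_or_nothing):
    """Pick which platform volumes to release after a day's builds.

    `results` maps platform name -> True if its build and publish succeeded.
    """
    succeeded = sorted(p for p, ok in results.items() if ok)
    if all_or_nothing and len(succeeded) != len(results):
        return []  # current behaviour: one failure blocks every release
    return succeeded  # proposed behaviour: release whatever did build


# Hypothetical day where one platform timed out.
results = {"platform-a": True, "platform-b": False, "platform-c": True}
print(platforms_to_release(results, all_or_nothing=True))   # []
print(platforms_to_release(results, all_or_nothing=False))  # ['platform-a', 'platform-c']
```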
I welcome any and all suggestions on what we want out of the wheel cache and how we can achieve it :)
[1] https://opendev.org/openstack/openstack-zuul-jobs/src/commit/699e811cb8fd3f0...
[2] https://opendev.org/openstack/openstack-zuul-jobs/src/commit/699e811cb8fd3f0...
[3] https://opendev.org/openstack/project-config/src/branch/master/roles/copy-wh...
[4] https://opendev.org/openstack/project-config/src/commit/6e4748ca35008a4c25e5...
[5] https://opendev.org/openstack/project-config/src/branch/master/playbooks/whe...
[6] https://review.opendev.org/c/openstack/project-config/+/703487
[7] https://review.opendev.org/c/opendev/system-config/+/879239