Thank you for prior reviews and sticking with this rather complicated fiddling; we ended up with some failures and reverts that I hope have been addressed.
Firstly, we have a typo in matching install playbooks for our current roles that has propagated as we copy-paste; a perfunctory fix:
https://review.opendev.org/c/opendev/system-config/+/820281
I am then proposing we rename the job we currently call infra-prod-install-ansible to infra-prod-bootstrap-bridge. This is a back-to-the-future situation, as it was originally called something like "install-bridge". I think this better reflects where it will end up, as hopefully revealed in the following steps:
https://review.opendev.org/c/opendev/system-config/+/820282
After this, we need to run infra-prod-bootstrap-bridge as the base job before all other production jobs. To recap -- this will be the synchronisation point that puts Zuul's checkout of system-config on the bastion host (bridge) that is then used to deploy all production systems (eventually, in as parallel a way as possible).
This revealed what I think is a problem with the original job -- it runs the install-ansible role via playbooks/zuul/run-production-playbook.yaml. This is a chicken-and-egg problem -- run-production-playbook.yaml uses the Ansible installed on bridge to run playbooks in system-config to ... install Ansible on bridge. Addressed with:
https://review.opendev.org/c/opendev/system-config/+/820320/
This makes infra-prod-bootstrap-bridge a stand-alone job that should
 * install the required version of Ansible on bridge
 * set system-config to Zuul's checkout for the buildset
For sanity, we keep the current parent of the infra-prod-* jobs the same -- this means each infra-prod-* job will still be re-checking-out system-config as it runs.
We should validate
 * the install of ansible/openstacksdk/etc. is actually idempotent (i.e. every run of infra-prod-bootstrap-bridge isn't reinstalling everything)
 * infra-prod-bootstrap-bridge always runs first in a deployment buildset (i.e. dependencies are correct)
 * infra-prod-bootstrap-bridge correctly puts the right checkout of system-config on bridge (as mentioned, for now the other infra-prod jobs will continue to overwrite it)
Once we have validated infra-prod-bootstrap-bridge is running as we like, we can drop the other jobs checking-out code with:
https://review.opendev.org/c/opendev/system-config/+/820651
and also clean up base-jobs:
https://review.opendev.org/c/opendev/base-jobs/+/820652
At this point, we should be ready to run in parallel (touch wood...)
-i
On Wed, Nov 17, 2021 at 04:04:33PM +1100, Ian Wienand wrote:
Hi,
To recap: currently production deployment jobs run sequentially. Zuul starts the job on an executor, which is set up to log into the bastion host. The job sets up the system-config playbooks on the bastion host, and Ansible is run from there against the production server.
To run in parallel, each job needs to not assume it owns the system-config playbooks on the bastion host.
Each Zuul *buildset* can use the same system-config playbook checkout though. To achieve this we need to rework the dependencies; each production job needs to depend on a common source-setup job. Once the source is set up on the bastion host, the actual production jobs can run in parallel.
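
As a rough illustration of that dependency shape (not how Zuul itself schedules things), here is a small Python sketch that groups jobs into "waves" that could run concurrently. The job names "source-setup", "service-a", etc. are hypothetical placeholders, and the helper parallel_waves is made up for this example:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def parallel_waves(deps):
    """Group jobs into waves; every job in a wave has all its
    dependencies satisfied by earlier waves, so a wave could run
    in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical buildset: each job maps to the set of jobs it depends on.
JOBS = {
    "source-setup": set(),
    "service-a": {"source-setup"},
    "service-b": {"source-setup"},
    "service-c": {"source-setup"},
}

WAVES = parallel_waves(JOBS)
for number, wave in enumerate(WAVES, start=1):
    print(number, wave)
```

With every production job depending only on the common setup job, the first wave is the setup job alone and the second wave is everything else, which is the parallelism we are after.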
To the changes...
Firstly, I believe we're doing the setup steps for the executor to log into bridge twice:
https://review.opendev.org/c/opendev/system-config/+/818190
removes this duplication, and should be safe to merge.
As pointed out in prior reviews, when running in the periodic or hourly pipelines each job overrides that bastion host checkout to master.
https://review.opendev.org/c/opendev/base-jobs/+/818189
moves this step into base-jobs, in preparation for only being done once by the separate source-setup job. I believe this will be safe to merge; system-config will just do it again in an idempotent way, until:
https://review.opendev.org/c/opendev/system-config/+/818191
merges, which drops this step from system-config.
We can then merge the system-config job dependency updates in
https://review.opendev.org/c/opendev/system-config/+/807672
This should mean that all jobs not only rely on the correct base jobs, but jobs that need certificates, etc. will be relying on the letsencrypt job, etc. This should be safe to merge as nothing should actually change; we just have stricter dependencies.
After this, I think we are ready to refactor the base jobs into the two separate steps -- firstly set up the keys on the executor to log into the bastion host, then set up the source to use on the bastion host:
https://review.opendev.org/c/opendev/base-jobs/+/807807
This initial refactor should be safe to merge as it creates two new jobs, but the existing base job keeps running both steps as-is.
Then we are ready for the penultimate change:
https://review.opendev.org/c/opendev/system-config/+/807808
This updates the system-config jobs to all depend on "infra-prod-setup-src" which will be the canonical job that sets up the source repository on bridge.o.o. All other jobs in the buildset will depend on this job, ensuring consistency for a run.
This should also be safe, as it again doesn't actually change ordering.
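
To make the "all other jobs depend on this job" property concrete, here is a minimal Python sketch of a check one could run over a job-to-dependencies mapping. Only "infra-prod-setup-src" comes from the change above; the other job names and the helper missing_setup_dependency are hypothetical, and this does not parse real Zuul configuration:

```python
def missing_setup_dependency(deps, setup_job="infra-prod-setup-src"):
    """Return the jobs that do not depend on setup_job, either directly
    or through a chain of other jobs."""

    def reaches(job, seen):
        if job == setup_job:
            return True
        if job in seen:
            # Already explored without success (or part of a cycle).
            return False
        seen.add(job)
        return any(reaches(parent, seen) for parent in deps.get(job, ()))

    return sorted(
        job for job in deps
        if job != setup_job and not reaches(job, set())
    )


# Hypothetical buildset: each job maps to the jobs it depends on.
EXAMPLE = {
    "infra-prod-setup-src": [],
    "example-letsencrypt": ["infra-prod-setup-src"],
    "example-service": ["example-letsencrypt"],
    "example-orphan": [],
}

MISSING = missing_setup_dependency(EXAMPLE)
print(MISSING)
```

Here "example-service" passes because it reaches the setup job through "example-letsencrypt", while "example-orphan" is reported as it has no path to it.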
Once all this is in, we need the final change to enable parallel running (and think about correct semaphores between periodic/hourly and regular runs). That is yet to be written, but we have enough to get to that point!
-i