Parallel production jobs changes
Ian Wienand
iwienand at redhat.com
Tue Dec 7 05:58:44 UTC 2021
Thank you for prior reviews and sticking with this rather complicated
fiddling; we ended up with some failures and reverts that I hope have
been addressed:
Firstly, we have a typo in matching install playbooks for our current
roles that has propagated as we copy-paste; a perfunctory fix:
https://review.opendev.org/c/opendev/system-config/+/820281
I am then proposing we rename the job we currently call
infra-prod-install-ansible to infra-prod-bootstrap-bridge. This is a
back-to-the-future situation, as it was originally called something
like "install-bridge". I think this better reflects where it will end
up, as hopefully revealed in the following steps:
https://review.opendev.org/c/opendev/system-config/+/820282
After this, we need to run infra-prod-bootstrap-bridge as the base job
before all other production jobs. To recap -- this will be the
synchronisation point that puts Zuul's checkout of system-config on
the bastion host (bridge) that is then used to deploy all production
systems (eventually, in as parallel a way as possible).
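As a rough illustration of that ordering (a sketch only, not how Zuul
schedules jobs): the Python below groups jobs into "waves" that could
run in parallel once their dependencies have finished. Every job name
other than infra-prod-bootstrap-bridge, and the dependency map itself,
is hypothetical.

```python
# Sketch only: models job ordering with the standard library (Python 3.9+).
# This is not Zuul code; the job names and dependencies are made up.
from graphlib import TopologicalSorter


def parallel_waves(dependencies):
    """Return lists of jobs; each list can run once the previous ones finish.

    `dependencies` maps a job name to the jobs it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical buildset: everything hangs off the bootstrap job.
EXAMPLE_JOBS = {
    "infra-prod-bootstrap-bridge": [],
    "infra-prod-letsencrypt": ["infra-prod-bootstrap-bridge"],
    "infra-prod-service-a": ["infra-prod-bootstrap-bridge"],
    "infra-prod-service-b": [
        "infra-prod-bootstrap-bridge",
        "infra-prod-letsencrypt",
    ],
}

if __name__ == "__main__":
    for number, wave in enumerate(parallel_waves(EXAMPLE_JOBS), start=1):
        print(number, ", ".join(wave))
```

With this map the bootstrap job is alone in the first wave, which is
the synchronisation point described above.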
This revealed what I think is a problem with the original job -- it
runs the install-ansible role via
playbooks/zuul/run-production-playbook.yaml. This is a
chicken-and-egg problem -- run-production-playbook.yaml uses the
Ansible installed on bridge to run playbooks in system-config to
... install Ansible on bridge. Addressed with:
https://review.opendev.org/c/opendev/system-config/+/820320/
This makes infra-prod-bootstrap-bridge a stand-alone job that should:
* install the required version of Ansible on bridge
* set up system-config from Zuul's checkout for the buildset
For sanity, we keep the current parent of the infra-prod-* jobs the
same -- this means each infra-prod-* job will still be re-checking-out
system-config as it runs. We should validate:
* the install of ansible/openstacksdk/etc. is actually idempotent
(i.e. infra-prod-bootstrap-bridge isn't reinstalling everything on
every run)
* infra-prod-bootstrap-bridge always runs first in a deployment
buildset (i.e. dependencies are correct)
* infra-prod-bootstrap-bridge correctly puts the right checkout of
system-config on bridge (as mentioned, for now the other infra-prod
jobs will continue to overwrite it)
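The second check above could be approximated offline with something
like the following. It's a minimal sketch assuming the job
dependencies have already been extracted into a plain dict (job name
to the list of jobs it depends on); that extraction step and the
example job names are hypothetical, and none of this is part of Zuul
or system-config.

```python
# Sketch only: flags jobs that do not depend, directly or transitively,
# on the bootstrap job. The dict format is an assumption, not a Zuul API.
BOOTSTRAP = "infra-prod-bootstrap-bridge"


def jobs_missing_bootstrap(dependencies, bootstrap=BOOTSTRAP):
    """Return a sorted list of jobs that would not wait for `bootstrap`."""

    def reaches(job, seen):
        if job == bootstrap:
            return True
        if job in seen:  # already explored (or a cycle): no path found here
            return False
        seen.add(job)
        return any(reaches(dep, seen) for dep in dependencies.get(job, ()))

    return sorted(
        job
        for job in dependencies
        if job != bootstrap and not reaches(job, set())
    )


if __name__ == "__main__":
    example = {
        BOOTSTRAP: [],
        "infra-prod-service-a": [BOOTSTRAP],
        "infra-prod-service-b": ["infra-prod-service-a"],
        "infra-prod-service-c": [],  # hypothetical job missing the dependency
    }
    print(jobs_missing_bootstrap(example))
```

An empty result would mean every job in the map waits for the
bootstrap job before running.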
Once we have validated infra-prod-bootstrap-bridge is running as we
like, we can drop the other jobs checking out code with:
https://review.opendev.org/c/opendev/system-config/+/820651
and also clean up base-jobs:
https://review.opendev.org/c/opendev/base-jobs/+/820652
At this point, we should be ready to run in parallel (touch wood...)
-i
On Wed, Nov 17, 2021 at 04:04:33PM +1100, Ian Wienand wrote:
> Hi,
>
> To recap: currently production deployment jobs run sequentially. Zuul
> starts the job on an executor, which is set up to log into the bastion
> host. The job sets up the system-config playbooks on the bastion host
> and Ansible is run from there against the production server.
>
> To run in parallel, each job needs to not assume it owns the
> system-config playbooks on the bastion host.
>
> Each Zuul *buildset* can use the same system-config playbook checkout
> though. To achieve this we need to rework the dependencies; each
> production job needs to depend on a common source-setup job. Once the
> source is set up on the bastion host, the actual production jobs can
> run in parallel.
>
> To the changes...
>
> Firstly, I believe we're doing the setup steps for the executor to log
> into bridge twice:
>
> https://review.opendev.org/c/opendev/system-config/+/818190
>
> removes this duplication, and should be safe to merge.
>
> As pointed out in prior reviews, when running in the periodic or hourly
> pipelines each job overrides that bastion host checkout to master.
>
> https://review.opendev.org/c/opendev/base-jobs/+/818189
>
> moves this step into base-jobs, in preparation for only being done
> once by the separate source-setup job. I believe this will be safe to
> merge; system-config will just do it again in an idempotent way,
> until:
>
> https://review.opendev.org/c/opendev/system-config/+/818191
>
> merges, which drops this step from system-config.
>
> We can then merge the system-config job dependency updates in
>
> https://review.opendev.org/c/opendev/system-config/+/807672
>
> This should mean that all jobs not only rely on the correct base jobs,
> but jobs that need certificates, etc. will be relying on the
> letsencrypt job, etc. This should be safe to merge as nothing should
> actually change, we just have stricter dependencies.
>
> After this, I think we are ready to refactor the base jobs into the
> two separate steps -- firstly set up the keys on the executor to log
> into the bastion host, then set up the source to use on the bastion
> host:
>
> https://review.opendev.org/c/opendev/base-jobs/+/807807
>
> This initial refactor should be safe to merge as it creates two new
> jobs, but the existing base job keeps running both steps as-is.
>
> Then we are ready for the penultimate change:
>
> https://review.opendev.org/c/opendev/system-config/+/807808
>
> This updates the system-config jobs to all depend on
> "infra-prod-setup-src" which will be the canonical job that sets up
> the source repository on bridge.o.o. All other jobs in the buildset
> will depend on this job, ensuring consistency for a run.
>
> This should also be safe, as it again doesn't actually change
> ordering.
>
> Once all this is in, we need the final change to enable parallel
> running (and think about correct semaphores between periodic/hourly
> and regular runs). That is yet to be written, but we have enough to
> get to that point!
>
> -i