From iwienand at redhat.com Thu Apr 1 02:27:16 2021 From: iwienand at redhat.com (Ian Wienand) Date: Thu, 1 Apr 2021 13:27:16 +1100 Subject: Next steps with new review server Message-ID: Hi, We have a large server provided by Vexxhost up and running in a staging capacity to replace the current server at review02.openstack.org. I have started to track some things at [1] There's a couple of things: 1) Production database Currently, we use a hosted db. Since NoteDB this only stores review seen flags. We've been told that other sites treat this data as ephemeral; they use a H2 db on disk and don't worry about backing up or restoring across upgrades. I have proposed storing this in a mariadb sibling container with [2]. We know how to admin, backup and restore that. That would be my preference, but I'm not terribly fussed. If I could request some reviews on that; I'll take +2's as a sign we should use a container, otherwise we can leave it with H2 it has now. 2) IPv6 issues We've seen a couple of cases that are looking increasingly like stray RA's are some how assigning extra addresses, similar to [1]. Our mirror in the same region has managed to acquire 50+ default routes somehow. It seems like inbound traffic keeps working (why we haven't seen issues with other production servers?). But I feel like it's a little bit troubling to have undiagnosed before we switch our major service to it. I'm running some tracing, trying to at least catch a stray RA while the server is quite, in the etherpad. But suggestions here are welcome. -i [1] https://etherpad.opendev.org/p/gerrit-upgrade-2021 [2] https://review.opendev.org/c/opendev/system-config/+/775961 [3] https://launchpad.net/bugs/1844712 From cboylan at sapwetik.org Thu Apr 1 15:20:31 2021 From: cboylan at sapwetik.org (Clark Boylan) Date: Thu, 01 Apr 2021 08:20:31 -0700 Subject: Next steps with new review server In-Reply-To: References: Message-ID: On Wed, Mar 31, 2021, at 7:27 PM, Ian Wienand wrote: > Hi, > > We have a large server provided by Vexxhost up and running in a > staging capacity to replace the current server at > review02.openstack.org. > > I have started to track some things at [1] > > There's a couple of things: > > 1) Production database > > Currently, we use a hosted db. Since NoteDB this only stores review > seen flags. We've been told that other sites treat this data as > ephemeral; they use a H2 db on disk and don't worry about backing up > or restoring across upgrades. > > I have proposed storing this in a mariadb sibling container with [2]. > We know how to admin, backup and restore that. That would be my > preference, but I'm not terribly fussed. If I could request some > reviews on that; I'll take +2's as a sign we should use a container, > otherwise we can leave it with H2 it has now. Agreed, sticking with known DB tooling seems like a good idea for ease of operator interaction. I'll try to review this change today. > > 2) IPv6 issues > > We've seen a couple of cases that are looking increasingly like stray > RA's are some how assigning extra addresses, similar to [1]. Our > mirror in the same region has managed to acquire 50+ default routes > somehow. > > It seems like inbound traffic keeps working (why we haven't seen > issues with other production servers?). But I feel like it's a little > bit troubling to have undiagnosed before we switch our major service > to it. I'm running some tracing, trying to at least catch a stray RA > while the server is quite, in the etherpad. But suggestions here are > welcome. 
Agreed, ideally we would sort this out before any migration completes. I want to say we saw similar with the mirror in vexxhost and the "solution" there was to disable RAs and create a static yaml config for ubuntu using its new network management config file? That seems less than ideal from a cloud perspective as we can't be the only ones noticing this (in fact some of our CI jobs may indicate they suffer from similar causing some jobs to run long when reaching network resources). I know when we brought this up with the mirror mnaser suggested static config was fine, but maybe we need to reinforce that this is problematic as a cloud user and see if we can help debug (network traces seem like a good start there). > > -i > > > [1] https://etherpad.opendev.org/p/gerrit-upgrade-2021 > [2] https://review.opendev.org/c/opendev/system-config/+/775961 > [3] https://launchpad.net/bugs/1844712 From cboylan at sapwetik.org Thu Apr 1 21:35:32 2021 From: cboylan at sapwetik.org (Clark Boylan) Date: Thu, 01 Apr 2021 14:35:32 -0700 Subject: Next steps with new review server In-Reply-To: References: Message-ID: On Thu, Apr 1, 2021, at 8:20 AM, Clark Boylan wrote: > On Wed, Mar 31, 2021, at 7:27 PM, Ian Wienand wrote: snip > > > > 2) IPv6 issues > > > > We've seen a couple of cases that are looking increasingly like stray > > RA's are some how assigning extra addresses, similar to [1]. Our > > mirror in the same region has managed to acquire 50+ default routes > > somehow. > > > > It seems like inbound traffic keeps working (why we haven't seen > > issues with other production servers?). But I feel like it's a little > > bit troubling to have undiagnosed before we switch our major service > > to it. I'm running some tracing, trying to at least catch a stray RA > > while the server is quite, in the etherpad. But suggestions here are > > welcome. > > Agreed, ideally we would sort this out before any migration completes. > I want to say we saw similar with the mirror in vexxhost and the > "solution" there was to disable RAs and create a static yaml config for > ubuntu using its new network management config file? That seems less > than ideal from a cloud perspective as we can't be the only ones > noticing this (in fact some of our CI jobs may indicate they suffer > from similar causing some jobs to run long when reaching network > resources). I know when we brought this up with the mirror mnaser > suggested static config was fine, but maybe we need to reinforce that > this is problematic as a cloud user and see if we can help debug > (network traces seem like a good start there). I ended up double checking the mirror node and in mirror.ca-ymq-1.vexxhost.opendev.org:/etc/netplan/50-cloud-init.yaml you can see what we did there. Essentially we set dhcpv6 and accept-ra to false then set an address and routes. We should be able to do the same thing with the new review host if we can't figure anything else out. If we do go this route maybe we should consider updating launch-node to do it for us automatically when launching focal nodes on vexxhost (I don't think bionic does netplan?), or at the very least document this somewhere. We should also double check that the address and routes are static and can be configured statically like this (the address should not change but I suppose the routes could at some point?). Ideally though we would sort this out properly and avoid these workarounds. 
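For reference, the shape of that static override is roughly the following netplan sketch (the interface name, v4 settings, addresses and gateway here are placeholders rather than the real values from the mirror):

    network:
      version: 2
      ethernets:
        ens3:                      # placeholder interface name
          dhcp4: true              # v4 side left alone; only v6 is pinned
          dhcp6: false             # no DHCPv6
          accept-ra: false         # ignore router advertisements entirely
          addresses:
            - "2001:db8:1::10/64"  # statically written v6 address
          routes:
            - to: "::/0"
              via: "2001:db8:1::1" # statically written v6 default route

The important parts are accept-ra: false plus the statically written v6 address and default route.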
> > > > > -i > > > > > > [1] https://etherpad.opendev.org/p/gerrit-upgrade-2021 > > [2] https://review.opendev.org/c/opendev/system-config/+/775961 > > [3] https://launchpad.net/bugs/1844712 > > From cboylan at sapwetik.org Mon Apr 5 22:30:01 2021 From: cboylan at sapwetik.org (Clark Boylan) Date: Mon, 05 Apr 2021 15:30:01 -0700 Subject: Team Meeting Agenda for April 6, 2021 Message-ID: We will meet with this agenda on April 6, 2021 at 19:00UTC in #opendev-meeting: == Agenda for next meeting == * Announcements ** OpenStack producing final RCs this week. Airship also working on a release. * Actions from last meeting * Specs approval * Priority Efforts (Standing meeting agenda items. Please expand if you have subtopics.) ** [http://specs.openstack.org/openstack-infra/infra-specs/specs/update-config-management.html Update Config Management] *** topic:update-cfg-mgmt *** Zuul as CD engine ** OpenDev *** Gerrit upgrade to 3.2.8 **** https://review.opendev.org/c/opendev/system-config/+/784152 *** Gerrit account inconsistencies **** All preferred emails lack external ids issues have been corrected. All group loops have been corrected. **** Workaround is we can stop Gerrit, push to external ids directly, reindex accounts (and groups?), start gerrit, then clear accounts caches (and groups caches?) **** Next steps ***** Cleaning external IDs for the last batch of retired users. *** Configuration tuning **** Using strong refs for jgit caches **** Batch user groups and threads * General topics ** Picking up steam on Puppet -> Ansible rewrites (clarkb 20210406) *** Enable Xenial -> Bionic/Focal system upgrades *** https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades Start capturing TODO list here *** Zuul service host updates in progress now. Scheduler and Zookeeper cluster remaining. Will focus on ZK first. **** https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2021 discussion of options for zk upgrades ** PTG Planning (ianw 20210406) *** Next PTG April 19-23 *** Clarkb filled out the survey and requested a few hours for us. Likely to be spent in more office hours type setup. **** Thursday April 22 1400-1600UTC and 2200-0000UTC ** docs-old volume cleanup (ianw 20210406) *** We were going to double check with Ajaeger if we can then proceed to cleanup if no one had a reason to keep it. ** planet.openstack.org (ianw 20210406) *** Strong preference from clarkb to retire it *** Superuser appears to be a major blog showing up there as well as a couple of others. Maybe we reach out to them and double check they don't want to help? (fungi and clarkb reached out to Superuser and they seem ok. ** tarballs ORD replication (ianw 20210406) *** This has been done. Other than long initial sync is this happy day to day? * Open discussion From mkopec at redhat.com Tue Apr 6 11:21:17 2021 From: mkopec at redhat.com (Martin Kopec) Date: Tue, 6 Apr 2021 13:21:17 +0200 Subject: [devstack][infra] POST_FAILURE on export-devstack-journal : Export journal Message-ID: Hi, one of our jobs (python-tempestconf project) is frequently failing with POST_FAILURE [1] during the following task: export-devstack-journal : Export journal I'm bringing this to a broader audience as we're not sure where exactly the issue might be. Did you encounter a similar issue lately or in the past? 
[1] https://zuul.opendev.org/t/openstack/builds?job_name=python-tempestconf-tempest-devstack-admin-plugins&project=osf/python-tempestconf Thanks for any advice, -- Martin Kopec -------------- next part -------------- An HTML attachment was scrubbed... URL: From radoslaw.piliszek at gmail.com Tue Apr 6 15:14:02 2021 From: radoslaw.piliszek at gmail.com (=?UTF-8?Q?Rados=C5=82aw_Piliszek?=) Date: Tue, 6 Apr 2021 17:14:02 +0200 Subject: [devstack][infra] POST_FAILURE on export-devstack-journal : Export journal In-Reply-To: References: Message-ID: I am testing whether replacing xz with gzip would solve the problem [1] [2]. [1] https://review.opendev.org/c/openstack/devstack/+/784964 [2] https://review.opendev.org/c/osf/python-tempestconf/+/784967 -yoctozepto On Tue, Apr 6, 2021 at 1:21 PM Martin Kopec wrote: > > Hi, > > one of our jobs (python-tempestconf project) is frequently failing with POST_FAILURE [1] > during the following task: > > export-devstack-journal : Export journal > > I'm bringing this to a broader audience as we're not sure where exactly the issue might be. > > Did you encounter a similar issue lately or in the past? > > [1] https://zuul.opendev.org/t/openstack/builds?job_name=python-tempestconf-tempest-devstack-admin-plugins&project=osf/python-tempestconf > > Thanks for any advice, > -- > Martin Kopec > > > From cboylan at sapwetik.org Tue Apr 6 15:51:19 2021 From: cboylan at sapwetik.org (Clark Boylan) Date: Tue, 06 Apr 2021 08:51:19 -0700 Subject: =?UTF-8?Q?Re:_[devstack][infra]_POST=5FFAILURE_on_export-devstack-journa?= =?UTF-8?Q?l_:_Export_journal?= In-Reply-To: References: Message-ID: On Tue, Apr 6, 2021, at 8:14 AM, Radosław Piliszek wrote: > I am testing whether replacing xz with gzip would solve the problem [1] [2]. The reason we used xz is that the files are very large and gz compression is very poor compared to xz for these files and these files are not really human readable as is (you need to load them into journald first). Let's test it and see what the gz file sizes look like but if they are still quite large then this is unlikely to be an appropriate fix. > > [1] https://review.opendev.org/c/openstack/devstack/+/784964 > [2] https://review.opendev.org/c/osf/python-tempestconf/+/784967 > > -yoctozepto > > On Tue, Apr 6, 2021 at 1:21 PM Martin Kopec wrote: > > > > Hi, > > > > one of our jobs (python-tempestconf project) is frequently failing with POST_FAILURE [1] > > during the following task: > > > > export-devstack-journal : Export journal > > > > I'm bringing this to a broader audience as we're not sure where exactly the issue might be. > > > > Did you encounter a similar issue lately or in the past? > > > > [1] https://zuul.opendev.org/t/openstack/builds?job_name=python-tempestconf-tempest-devstack-admin-plugins&project=osf/python-tempestconf > > > > Thanks for any advice, > > -- > > Martin Kopec From fungi at yuggoth.org Tue Apr 6 16:02:48 2021 From: fungi at yuggoth.org (Jeremy Stanley) Date: Tue, 6 Apr 2021 16:02:48 +0000 Subject: [devstack][infra] POST_FAILURE on export-devstack-journal : Export journal In-Reply-To: References: Message-ID: <20210406160247.gevud2hlvodg7jzt@yuggoth.org> On 2021-04-06 13:21:17 +0200 (+0200), Martin Kopec wrote: > one of our jobs (python-tempestconf project) is frequently failing with > POST_FAILURE [1] > during the following task: > > export-devstack-journal : Export journal > > I'm bringing this to a broader audience as we're not sure where exactly the > issue might be. 
> > Did you encounter a similar issue lately or in the past? > > [1] > https://zuul.opendev.org/t/openstack/builds?job_name=python-tempestconf-tempest-devstack-admin-plugins&project=osf/python-tempestconf Looking at the error, I strongly suspect memory exhaustion. We could try tuning xz to use less memory when compressing. -- Jeremy Stanley -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: not available URL: From radoslaw.piliszek at gmail.com Tue Apr 6 16:11:41 2021 From: radoslaw.piliszek at gmail.com (=?UTF-8?Q?Rados=C5=82aw_Piliszek?=) Date: Tue, 6 Apr 2021 18:11:41 +0200 Subject: [devstack][infra] POST_FAILURE on export-devstack-journal : Export journal In-Reply-To: <20210406160247.gevud2hlvodg7jzt@yuggoth.org> References: <20210406160247.gevud2hlvodg7jzt@yuggoth.org> Message-ID: On Tue, Apr 6, 2021 at 6:02 PM Jeremy Stanley wrote: > Looking at the error, I strongly suspect memory exhaustion. We could > try tuning xz to use less memory when compressing. That was my hunch as well, hence why I test using gzip. On Tue, Apr 6, 2021 at 5:51 PM Clark Boylan wrote: > > On Tue, Apr 6, 2021, at 8:14 AM, Radosław Piliszek wrote: > > I am testing whether replacing xz with gzip would solve the problem [1] [2]. > > The reason we used xz is that the files are very large and gz compression is very poor compared to xz for these files and these files are not really human readable as is (you need to load them into journald first). Let's test it and see what the gz file sizes look like but if they are still quite large then this is unlikely to be an appropriate fix. Let's see how bad the file sizes are. If they are acceptable, we can keep gzip and be happy. Otherwise we try to tune the params to make xz a better citizen as fungi suggested. -yoctozepto From radoslaw.piliszek at gmail.com Tue Apr 6 16:15:28 2021 From: radoslaw.piliszek at gmail.com (=?UTF-8?Q?Rados=C5=82aw_Piliszek?=) Date: Tue, 6 Apr 2021 18:15:28 +0200 Subject: [devstack][infra] POST_FAILURE on export-devstack-journal : Export journal In-Reply-To: References: <20210406160247.gevud2hlvodg7jzt@yuggoth.org> Message-ID: On Tue, Apr 6, 2021 at 6:11 PM Radosław Piliszek wrote: > On Tue, Apr 6, 2021 at 5:51 PM Clark Boylan wrote: > > > > On Tue, Apr 6, 2021, at 8:14 AM, Radosław Piliszek wrote: > > > I am testing whether replacing xz with gzip would solve the problem [1] [2]. > > > > The reason we used xz is that the files are very large and gz compression is very poor compared to xz for these files and these files are not really human readable as is (you need to load them into journald first). Let's test it and see what the gz file sizes look like but if they are still quite large then this is unlikely to be an appropriate fix. > > Let's see how bad the file sizes are. devstack.journal.gz 23.6M Less than all the other logs together, I would not mind. I wonder how it is in other jobs (this is from the failing one). 
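If we do end up keeping xz after all, another option (untested on my side; the limit below is just an illustration) would be to cap the compressor's memory so it scales its settings down instead of pushing the node into swap:

    # cap xz's compressor memory; xz adjusts the preset downwards to fit
    xz --threads=1 --memlimit-compress=512MiB devstack.journal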
-yoctozepto From cboylan at sapwetik.org Tue Apr 6 16:39:04 2021 From: cboylan at sapwetik.org (Clark Boylan) Date: Tue, 06 Apr 2021 09:39:04 -0700 Subject: =?UTF-8?Q?Re:_[devstack][infra]_POST=5FFAILURE_on_export-devstack-journa?= =?UTF-8?Q?l_:_Export_journal?= In-Reply-To: References: <20210406160247.gevud2hlvodg7jzt@yuggoth.org> Message-ID: On Tue, Apr 6, 2021, at 9:15 AM, Radosław Piliszek wrote: > On Tue, Apr 6, 2021 at 6:11 PM Radosław Piliszek > wrote: > > On Tue, Apr 6, 2021 at 5:51 PM Clark Boylan wrote: > > > > > > On Tue, Apr 6, 2021, at 8:14 AM, Radosław Piliszek wrote: > > > > I am testing whether replacing xz with gzip would solve the problem [1] [2]. > > > > > > The reason we used xz is that the files are very large and gz compression is very poor compared to xz for these files and these files are not really human readable as is (you need to load them into journald first). Let's test it and see what the gz file sizes look like but if they are still quite large then this is unlikely to be an appropriate fix. > > > > Let's see how bad the file sizes are. > > devstack.journal.gz 23.6M > > Less than all the other logs together, I would not mind. > I wonder how it is in other jobs (this is from the failing one). There does seem to be a range (likely due to how much the job workload causes logging to happen in journald) from about a few megabytes to eighty something MB [3]. This is probably acceptable. Just keep an eye out for jobs that end up with much larger file sizes and we can reevaluate if we notice them. [3] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_038/784964/1/check/tempest-multinode-full-py3/038bd51/controller/logs/index.html From cboylan at sapwetik.org Tue Apr 6 16:46:33 2021 From: cboylan at sapwetik.org (Clark Boylan) Date: Tue, 06 Apr 2021 09:46:33 -0700 Subject: =?UTF-8?Q?Re:_[devstack][infra]_POST=5FFAILURE_on_export-devstack-journa?= =?UTF-8?Q?l_:_Export_journal?= In-Reply-To: References: <20210406160247.gevud2hlvodg7jzt@yuggoth.org> Message-ID: <7626869f-dab3-41df-a40b-dafa20dcfaf4@www.fastmail.com> On Tue, Apr 6, 2021, at 9:11 AM, Radosław Piliszek wrote: > On Tue, Apr 6, 2021 at 6:02 PM Jeremy Stanley wrote: > > Looking at the error, I strongly suspect memory exhaustion. We could > > try tuning xz to use less memory when compressing. Worth noting that we continue to suspect memory pressure, and in particular diving into swap, for random failures that appear timing or performance related. I still think it would be a helpful exercise for OpenStack to look at its memory consumption (remember end users will experience this too) and see if there are any unexpected areas of memory use. I think the last time i skimmed logs the privsep daemon was a large consumer because we separate instance is run for each service and they all add up. > > That was my hunch as well, hence why I test using gzip. > > On Tue, Apr 6, 2021 at 5:51 PM Clark Boylan wrote: > > > > On Tue, Apr 6, 2021, at 8:14 AM, Radosław Piliszek wrote: > > > I am testing whether replacing xz with gzip would solve the problem [1] [2]. > > > > The reason we used xz is that the files are very large and gz compression is very poor compared to xz for these files and these files are not really human readable as is (you need to load them into journald first). Let's test it and see what the gz file sizes look like but if they are still quite large then this is unlikely to be an appropriate fix. > > Let's see how bad the file sizes are. 
> If they are acceptable, we can keep gzip and be happy. > Otherwise we try to tune the params to make xz a better citizen as > fungi suggested. > > -yoctozepto > > From jim at acmegating.com Wed Apr 7 01:55:27 2021 From: jim at acmegating.com (James E. Blair) Date: Tue, 06 Apr 2021 18:55:27 -0700 Subject: Recent nodepool label changes Message-ID: <87blaqn9io.fsf@fuligin> Hi, I recently spent some time trying to figure out why a job worked as expected during one run and then failed due to limited memory on the following run. It turns out that back in February this change was merged on an emergency basis, which caused us to start occasionally providing nodes with 32G of ram instead of the typical 8G: https://review.opendev.org/773710 Nodepool labels are designed to represent the combination of an image and set of resources. To the best of our ability, the images and resources they provide should be consistent across different cloud providers. That's why we use DIB to create consistent images and that's why we use "-expanded" labels to request nodes with additional memory. It's also the case that when we add new clouds, we generally try to benchmark performance and adjust flavors as needed. Unfortunately, providing such disparate resources under the same Nodepool labels makes it impossible for job authors to reliably design jobs. To be clear, it's fine to provide resources of varying size, we just need to use different Nodepool labels for them so that job authors get what they're asking for. The last time we were in this position, we updated our Nodepool images to add the mem= Linux kernel command line parameter in order to limit the total available RAM. I suspect that is still possible, but due to the explosion of images and flavors, doing so will be considerably more difficult this time. We now also have the ability to reboot nodes in jobs after they come online, but doing that would add additional run time for every job. I believe we need to address this. Despite the additional work, it seems like the "mem=" approach is our best bet; unless anyone has other ideas? -Jim From cboylan at sapwetik.org Wed Apr 7 16:20:55 2021 From: cboylan at sapwetik.org (Clark Boylan) Date: Wed, 07 Apr 2021 09:20:55 -0700 Subject: Recent nodepool label changes In-Reply-To: <87blaqn9io.fsf@fuligin> References: <87blaqn9io.fsf@fuligin> Message-ID: On Tue, Apr 6, 2021, at 6:55 PM, James E. Blair wrote: > Hi, > > I recently spent some time trying to figure out why a job worked as > expected during one run and then failed due to limited memory on the > following run. It turns out that back in February this change was > merged on an emergency basis, which caused us to start occasionally > providing nodes with 32G of ram instead of the typical 8G: > > https://review.opendev.org/773710 > > Nodepool labels are designed to represent the combination of an image > and set of resources. To the best of our ability, the images and > resources they provide should be consistent across different cloud > providers. That's why we use DIB to create consistent images and that's > why we use "-expanded" labels to request nodes with additional memory. > It's also the case that when we add new clouds, we generally try to > benchmark performance and adjust flavors as needed. > > Unfortunately, providing such disparate resources under the same > Nodepool labels makes it impossible for job authors to reliably design > jobs. 
> > To be clear, it's fine to provide resources of varying size, we just > need to use different Nodepool labels for them so that job authors get > what they're asking for. > > The last time we were in this position, we updated our Nodepool images > to add the mem= Linux kernel command line parameter in order to limit > the total available RAM. I suspect that is still possible, but due to > the explosion of images and flavors, doing so will be considerably more > difficult this time. > > We now also have the ability to reboot nodes in jobs after they come > online, but doing that would add additional run time for every job. > > I believe we need to address this. Despite the additional work, it > seems like the "mem=" approach is our best bet; unless anyone has other > ideas? This change was made at the request of mnaser to better support resource allocation in vexxhost (the flavors we use now use their standard ratio for memory:cpu). One (likely bad) option would be to select a flavor based on memory rather than cpu count. In this case I think we would go from 8vcpu + 32GB memory to 2vcpu + 8GB of memory. At the time I was surprised the change merged so quickly and asked if anyone was starting work on setting the kernel boot parameters again: http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2021-02-02.log.html#t2021-02-02T18:04:23 I suspect that the kernel limit is our best option. We can set this via DIB_BOOTLOADER_DEFAULT_CMDLINE [0] which i expect will work in many cases across the various distros. The problem with this approach is that we would need different images for the places we want to boot with more memory (the -expanded labels for example). For completeness other possibilities are: * Convince the clouds that the nova flavor is the best place to control this and set them appropriately * Don't use clouds that can't set appropriate flavors * Accept Fungi's argument in the IRC log above and accept that memory as with other resources like disk iops and network will be variable * Kernel module that inspects some attribute at boot time and sets mem appropriately [0] https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/bootloader/README.rst > > -Jim From smooney at redhat.com Wed Apr 7 16:30:28 2021 From: smooney at redhat.com (Sean Mooney) Date: Wed, 7 Apr 2021 17:30:28 +0100 Subject: Recent nodepool label changes In-Reply-To: References: <87blaqn9io.fsf@fuligin> Message-ID: <41ed1949-f638-eee1-2421-9840750a5c01@redhat.com> On 07/04/2021 17:20, Clark Boylan wrote: > On Tue, Apr 6, 2021, at 6:55 PM, James E. Blair wrote: >> Hi, >> >> I recently spent some time trying to figure out why a job worked as >> expected during one run and then failed due to limited memory on the >> following run. It turns out that back in February this change was >> merged on an emergency basis, which caused us to start occasionally >> providing nodes with 32G of ram instead of the typical 8G: >> >> https://review.opendev.org/773710 >> >> Nodepool labels are designed to represent the combination of an image >> and set of resources. To the best of our ability, the images and >> resources they provide should be consistent across different cloud >> providers. That's why we use DIB to create consistent images and that's >> why we use "-expanded" labels to request nodes with additional memory. >> It's also the case that when we add new clouds, we generally try to >> benchmark performance and adjust flavors as needed. 
>> Unfortunately, providing such disparate resources under the same >> Nodepool labels makes it impossible for job authors to reliably design >> jobs. >> >> To be clear, it's fine to provide resources of varying size, we just >> need to use different Nodepool labels for them so that job authors get >> what they're asking for. >> >> The last time we were in this position, we updated our Nodepool images >> to add the mem= Linux kernel command line parameter in order to limit >> the total available RAM. I suspect that is still possible, but due to >> the explosion of images and flavors, doing so will be considerably more >> difficult this time. >> >> We now also have the ability to reboot nodes in jobs after they come >> online, but doing that would add additional run time for every job. >> >> I believe we need to address this. Despite the additional work, it >> seems like the "mem=" approach is our best bet; unless anyone has other >> ideas? > This change was made at the request of mnaser to better support resource allocation in vexxhost (the flavors we use now use their standard ratio for memory:cpu). One (likely bad) option would be to select a flavor based on memory rather than cpu count. In this case I think we would go from 8vcpu + 32GB memory to 2vcpu + 8GB of memory. > > At the time I was surprised the change merged so quickly and asked if anyone was starting work on setting the kernel boot parameters again: > > http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2021-02-02.log.html#t2021-02-02T18:04:23 > > I suspect that the kernel limit is our best option. We can set this via DIB_BOOTLOADER_DEFAULT_CMDLINE [0] which i expect will work in many cases across the various distros. The problem with this approach is that we would need different images for the places we want to boot with more memory (the -expanded labels for example). > > For completeness other possibilities are: > * Convince the clouds that the nova flavor is the best place to control this and set them appropriately > * Don't use clouds that can't set appropriate flavors > * Accept Fungi's argument in the IRC log above and accept that memory as with other resources like disk iops and network will be variable > * Kernel module that inspects some attribute at boot time and sets mem appropriately I'm not sure why the issue is with allowing VMs to have 32GB of RAM. As job authors we should basically tailor our jobs to fit the minimum available, and if we get more RAM then that's a bonus. We should not be writing tempest jobs in particular in such a way that more RAM would break things, outside of very specific jobs. For example, the whitebox tempest plugin that literally sshes into the host VMs to validate things in the libvirt XML makes some assumptions about the env, but I would consider it a bug in our plugin if it could not work with more RAM. With less RAM we may have issues, but more should not break any of our tests, or we should fix them. I think we should be able to just have the vexxhost flavor labeled twice: once with the normal labels and once with the -expanded one. I would hope that we do not go down the path of hardcoding a kernel mem limit to 8G for all labels; it seems very wasteful to me to boot a 32G VM and only use 8G of it.
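To illustrate what I mean (a rough sketch only; the provider, flavor and label names below are made up rather than taken from the real nodepool config), the provider pool could simply map both labels onto the same flavor:

    providers:
      - name: vexxhost-ca-ymq-1        # illustrative provider name
        cloud: vexxhost
        pools:
          - name: main
            labels:
              # both labels point at the same flavor in this region
              - name: ubuntu-focal
                flavor-name: v3-standard-8
                diskimage: ubuntu-focal
              - name: ubuntu-focal-expanded
                flavor-name: v3-standard-8
                diskimage: ubuntu-focal

That way jobs which ask for the expanded label still get it, and the normal label just happens to come with extra RAM in this one region.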
> > [0] > https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/bootloader/README.rst > >> -Jim From fungi at yuggoth.org Wed Apr 7 16:39:46 2021 From: fungi at yuggoth.org (Jeremy Stanley) Date: Wed, 7 Apr 2021 16:39:46 +0000 Subject: Recent nodepool label changes In-Reply-To: References: <87blaqn9io.fsf@fuligin> Message-ID: <20210407163945.mjcz7l75kimktxed@yuggoth.org> On 2021-04-07 09:20:55 -0700 (-0700), Clark Boylan wrote: [...] > This change was made at the request of mnaser to better support > resource allocation in vexxhost (the flavors we use now use their > standard ratio for memory:cpu). One (likely bad) option would be > to select a flavor based on memory rather than cpu count. In this > case I think we would go from 8vcpu + 32GB memory to 2vcpu + 8GB > of memory. > > At the time I was surprised the change merged so quickly [...] Based on the commit message and the fact that we were pinged in IRC to review, I got the impression it was relatively urgent. > I suspect that the kernel limit is our best option. We can set > this via DIB_BOOTLOADER_DEFAULT_CMDLINE [0] which i expect will > work in many cases across the various distros. The problem with > this approach is that we would need different images for the > places we want to boot with more memory (the -expanded labels for > example). > > For completeness other possibilities are: > * Convince the clouds that the nova flavor is the best place to > control this and set them appropriately > * Don't use clouds that can't set appropriate flavors > * Accept Fungi's argument in the IRC log above and accept that > memory as with other resources like disk iops and network will be > variable To be clear, this was mostly a "devil's advocate" argument, and not really my opinion. We saw first hand that disparate memory sizing in HPCloud was allowing massive memory usage jumps to merge in OpenStack, and took action back then to artificially limit the available memory at boot. We now have fresh evidence from the Zuul community that this hasn't ceased to be a problem. On the other hand, we also see projects merge changes which significantly increase disk utilization and then can't run on some environments where we get smaller disks (or depend on having multiple network interfaces, or specific addressing schemes, or certain CPU flags, or...), so heterogeneity the problem isn't limited exclusively to memory. > * Kernel module that inspects some attribute at boot time and > sets mem appropriately [...] Not to downplay the value of the donated resources, because they really are very much appreciated, but these currently account for less than 5% of our aggregate node count so having to maintain multiple nearly identical images or doing a lot of additional engineering work seems like it may outweigh any immediate benefits. With the increasing use of special node labels like expanded, nested-virt and NUMA, it might make more sense to just limit this region to not supplying standard nodes, which sidesteps the problem for now. -- Jeremy Stanley -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: not available URL: From jim at acmegating.com Wed Apr 7 17:33:22 2021 From: jim at acmegating.com (James E. 
Blair) Date: Wed, 07 Apr 2021 10:33:22 -0700 Subject: Recent nodepool label changes In-Reply-To: <41ed1949-f638-eee1-2421-9840750a5c01@redhat.com> (Sean Mooney's message of "Wed, 7 Apr 2021 17:30:28 +0100") References: <87blaqn9io.fsf@fuligin> <41ed1949-f638-eee1-2421-9840750a5c01@redhat.com> Message-ID: <877dlem23h.fsf@fuligin> Sean Mooney writes: > im not sure why the issue is with allowing vms to have 32GB of ram. > as job authors we should basically talor our jobs to fit the minium > avaiable and if we get more ram then that a bonus. > we should not be writing tempest jobs in particarl in such a way that > more ram would break things out side of very speciric jobs. > for example the whitebox tempest plug that litally ssh into the host > vms to validate thing in the libvirt xml makes some assumiton about > the env but i would consider it a bug in our plugin if it could not > work with more ram. I tried really hard to make it clear I have no problem with the idea that we could have flavors with more ram. I absolutely don't object to that. What I am saying is that there is definitely a problem with using a label that has different amounts of ram in different providers. It causes jobs to behave differently. Jobs that pass in one provider will fail in another because of the ram difference. I agree with you that as job authors we should tailor our jobs to fit the minimum available ram. The problem is that is nearly impossible if Nodepool randomly gives us nodes with more ram. We won't realize we have exceeded the minimum ram until we hit a job on a provider with less ram after having exceeded it on a provider with more ram. This is not a theoretical issue -- you are reading this message because I hit this problem after two test runs on a recently started project. > less ram we may have issue but more should not break any of our test > or we should fix them. There is an inherent contradiction in saying that more ram is okay but less ram is not. They are two sides of the same coin. A job will not break because it had more ram the first time, it will break because it had less ram the second time. The fundamental issue is that a Nodepool label describes an image plus a flavor. That flavor must be as consistent as possible across providers if we expect job authors to be able to write predictable jobs. > it seam very wasteful to me to boot a 32G vm and only use 8G of it. It may seem that way, but the infrastructure provider has told us that they have tuned their hardware purchases to that ratio of CPU/RAM, and so we're helping out by doing this. The more wasteful thing is people issuing rechecks because their jobs pass in some providers and not others. -Jim From iwienand at redhat.com Thu Apr 8 05:43:33 2021 From: iwienand at redhat.com (Ian Wienand) Date: Thu, 8 Apr 2021 15:43:33 +1000 Subject: Next steps with new review server In-Reply-To: References: Message-ID: On Thu, Apr 01, 2021 at 02:35:32PM -0700, Clark Boylan wrote: > I ended up double checking the mirror node and in > mirror.ca-ymq-1.vexxhost.opendev.org:/etc/netplan/50-cloud-init.yaml > you can see what we did there. Essentially we set dhcpv6 and > accept-ra to false then set an address and routes. We should be able > to do the same thing with the new review host if we can't figure > anything else out. > [3] https://launchpad.net/bugs/1844712 So we have a work around in production but also [3] being marked as an open security bug. Are we happy enough ignoring RA's is sufficient to overcome the issues discussed in [3] for this service? 
The concern mostly seemed to be a targeted MITM attack; something which ssh host keys and SSL certificates should cover? -i From fungi at yuggoth.org Thu Apr 8 19:48:35 2021 From: fungi at yuggoth.org (Jeremy Stanley) Date: Thu, 8 Apr 2021 19:48:35 +0000 Subject: Next steps with new review server In-Reply-To: References: Message-ID: <20210408194835.ma5xr6cm5enegnab@yuggoth.org> On 2021-04-08 15:43:33 +1000 (+1000), Ian Wienand wrote: > On Thu, Apr 01, 2021 at 02:35:32PM -0700, Clark Boylan wrote: > > I ended up double checking the mirror node and in > > mirror.ca-ymq-1.vexxhost.opendev.org:/etc/netplan/50-cloud-init.yaml > > you can see what we did there. Essentially we set dhcpv6 and > > accept-ra to false then set an address and routes. We should be able > > to do the same thing with the new review host if we can't figure > > anything else out. > > > [3] https://launchpad.net/bugs/1844712 > > So we have a work around in production but also [3] being marked as an > open security bug. > > Are we happy enough ignoring RA's is sufficient to overcome the issues > discussed in [3] for this service? The concern mostly seemed to be a > targeted MITM attack; something which ssh host keys and SSL > certificates should cover? Yes, I think ignoring RAs is probably sufficient. Nobody seems to have yet figured out how the leak happens or what else could be leaked, but as you note the fact that a MitM couldn't usefully spoof a viable HTTPS or SSH connection endpoint is sufficient insurance against anything worse, so we can just focus on mitigating the stability problem arising from stray leaks for now. -- Jeremy Stanley -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: not available URL: From fungi at yuggoth.org Sun Apr 11 15:13:11 2021 From: fungi at yuggoth.org (Jeremy Stanley) Date: Sun, 11 Apr 2021 15:13:11 +0000 Subject: Recent nodepool label changes In-Reply-To: <20210407163945.mjcz7l75kimktxed@yuggoth.org> References: <87blaqn9io.fsf@fuligin> <20210407163945.mjcz7l75kimktxed@yuggoth.org> Message-ID: <20210411151311.p5fyft6m34stqlf4@yuggoth.org> On 2021-04-07 16:39:46 +0000 (+0000), Jeremy Stanley wrote: [...] > With the increasing use of special node labels like expanded, > nested-virt and NUMA, it might make more sense to just limit this > region to not supplying standard nodes, which sidesteps the problem > for now. I've proposed WIP change https://review.opendev.org/785769 as a straw man for this solution. -- Jeremy Stanley -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: not available URL: From cboylan at sapwetik.org Mon Apr 12 23:12:55 2021 From: cboylan at sapwetik.org (Clark Boylan) Date: Mon, 12 Apr 2021 16:12:55 -0700 Subject: Team Meeting Agenda for April 13, 2021 Message-ID: <2715073b-53c2-461e-a942-bbfa8a1e638a@www.fastmail.com> We will meet with this agenda on April 13, 2021 at 19:00 UTC in #opendev-meeting: == Agenda for next meeting == * Announcements ** OpenStack completing release April 14. Airship 2.0 doesn't seem to exist yet so will assume they are still working on it. * Actions from last meeting * Specs approval * Priority Efforts (Standing meeting agenda items. Please expand if you have subtopics.) 
** [http://specs.openstack.org/openstack-infra/infra-specs/specs/update-config-management.html Update Config Management] *** topic:update-cfg-mgmt *** Zuul as CD engine ** OpenDev *** Gerrit upgrade to 3.2.8 **** https://review.opendev.org/c/opendev/system-config/+/784152 *** Gerrit account inconsistencies **** All preferred emails lack external ids issues have been corrected. All group loops have been corrected. **** Workaround is we can stop Gerrit, push to external ids directly, reindex accounts (and groups?), start gerrit, then clear accounts caches (and groups caches?) **** Next steps ***** ~224 accounts were cleaned up. Next batch of ~56 has been started. Will clean their external IDs after letting the retired users sit for a few days. ***** Email sent to two Third Party CI groups about correcting external id conflicts among their accounts. These accounts will not be retired (for the most part). *** Configuration tuning **** Reduce the number of ssh threads. Possibly create bot/batch user groups and thread counts as part of this. **** https://groups.google.com/g/repo-discuss/c/BQKxAfXBXuo Upstream conversation with people struggling with similar problems. * General topics ** Picking up steam on Puppet -> Ansible rewrites (clarkb 20210413) *** Enable Xenial -> Bionic/Focal system upgrades *** https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades Start capturing TODO list here *** Zuul service host updates in progress now. Scheduler and Zookeeper cluster remaining. Will focus on ZK first. **** https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2021 discussion of options for zk upgrades ** planet.openstack.org (ianw 20210413) *** Strong preference from clarkb to retire it *** Superuser appears to be a major blog showing up there as well as a couple of others. Maybe we reach out to them and double check they don't want to help? (fungi and clarkb reached out to Superuser and they seem ok) ** survey.openstack.org (clarkb 20210413) *** Can we go ahead and clean this service up? I don't think it ever got much use (maybe one or two surveys total). ** docs-old volume cleanup (ianw 20210413) *** We were going to double check with Ajaeger if we can then proceed to cleanup if no one had a reason to keep it. ** PTG Planning (clarkb 20210413) *** Next PTG April 19-23 **** Thursday April 22 1400-1600UTC and 2200-0000UTC * Open discussion From cboylan at sapwetik.org Tue Apr 13 16:47:43 2021 From: cboylan at sapwetik.org (Clark Boylan) Date: Tue, 13 Apr 2021 09:47:43 -0700 Subject: Join OpenDev at the Project Teams Gathering Message-ID: <1bd7a8d9-9796-464b-a34f-5f315ba8e974@www.fastmail.com> The PTG is next week, and OpenDev is participating alongside the OpenStack TaCT SIG. We are going to try something a bit different this time around, which is to treat the time as office hours rather than time for our own projects. We will be meeting on April 22 from 14:00 - 16:00 UTC and 22:00 - 00:00 UTC in https://meetpad.opendev.org/apr2021-ptg-opendev. Join us if you would like to: * Start contributing to either OpenDev or the TaCT sig. * Debug a particular job problem. * Learn how to write and review Zuul jobs and related configs. * Learn about specific services or how they are deployed. * And anything else related to OpenDev and our project infrastructure. Feel free to add your topics and suggest preferred times for those topics here: https://etherpad.opendev.org/p/apr2021-ptg-opendev. This etherpad corresponds to the document that will be auto loaded in our meetpad room above. 
I will also be around next week and will try to keep a flexible schedule. Feel free to reach out if you would like us to join discussions as they happen. See you there, Clark From iwienand at redhat.com Fri Apr 23 05:07:45 2021 From: iwienand at redhat.com (Ian Wienand) Date: Fri, 23 Apr 2021 15:07:45 +1000 Subject: Debian bullsye image Ansible detection Message-ID: Hello, In short, Ansible reports "n/a" for ansible_distribution_release on our new bullseye nodes. This screws up our mirror setup. This has turned into quite an adventure. Currently, Debian is frozen to create the "bullseye" release. This means that "bullseye" is really an alias for "testing", which will turn into the release after the freeze period. So currently Debian bullseye reports itself in /etc/debian_version or /etc/os-release as "bullseye/sid". This sort of makes sense if you consider that you don't commit things to "testing" directly; they go into unstable ("sid") and then migrate after a period of stability. So you can't have a "base-files" package in bullseye that hasn't gone through unstable/sid. You can read "bullseye/sid" as "we've chosen the name bullseye and packages going through unstable are destined for it". Now, you might see a problem in that "unstable" and "bullseye" (testing) now both report themselves in these version files as the same thing (because the unstable packages that provide them move into testing). "lsb_release -c" tries to be a bit smart about this, and looks at the output of "apt-cache policy" to try and see if you are actually pulling the .deb files from a bullseye repo or an unstable one. Interestingly, this relies on a "Label" being present in the mirror release files. Since we use reprepro to build our own mirrors, we do not have this (and presumably why nobody else, not using our mirrors, seems to notice this problem). A fix is proposed with https://review.opendev.org/c/opendev/system-config/+/787661 So "lsb_release -c" doesn't report anything, leaving Ansible in the dark as to what repo it uses. When "lsb_release -c" doesn't return anything helpful, Ansible tries to do its own parsing of the release files. I started hacking on these, but the point raised in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=845651 gave me pause. It is a fair point that you cannot really know if you're on bullseye or sid by examining these files. N/A is probably actually the correct answer from Ansible's POV. Anyway, that is https://github.com/ianw/ansible/commit/847817a82ed86b5f39a4ccc3ffbff0e0cd63e8cc Now, even more annoyingly, setting the label in our mirrors may not be sufficient for "lsb_release -c" to work on our images, because we have cleared out the apt repositories. You would need to run "apt-get update" before Ansible tries to run "lsb_release" to populate its facts. Now the problem is that we're trying to use Ansible's fact about the distro name to set up apt to point to our mirrors -- so we can't apt-get update before we have that written out! Classic chicken and egg. The only other idea I have is to hack dib/early setup to overwrite /etc/debian_version with "11.0" so that we look like the upcoming release has already been done. "lsb_release -c" will then report "bullseye". However, there is some possibility this will confuse other things, as this release technically hasn't been done. I've proposed that with https://review.opendev.org/c/openstack/diskimage-builder/+/787665 I'm open to suggestions!
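(For reference, the reprepro side of the fix above is essentially adding a Label to each distribution stanza in conf/distributions; the sketch below is illustrative rather than copied from our actual config:

    Origin: Debian
    # Label is the field the lsb_release apt-cache heuristic looks for
    Label: Debian
    Codename: bullseye
    Suite: testing
    Architectures: amd64 arm64 source
    Components: main
    Description: Debian bullseye mirror

With a Label present in the published Release files, "lsb_release -c" at least has something to match on.)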
-i From fungi at yuggoth.org Fri Apr 23 12:16:46 2021 From: fungi at yuggoth.org (Jeremy Stanley) Date: Fri, 23 Apr 2021 12:16:46 +0000 Subject: Debian bullsye image Ansible detection In-Reply-To: References: Message-ID: <20210423121645.7ndgdmz22zbrplvu@yuggoth.org> On 2021-04-23 15:07:45 +1000 (+1000), Ian Wienand wrote: > In short, Ansible reports "n/a" for ansible_distribution_release on > our new bullseye nodes. This screws up our mirror setup. This has > turned into quite an adventure. > > Currently, Debian is frozen to create the "bullseye" release. This > means that "bullseye" is really an alias for "testing", that will turn > into the release after the freeze period. [...] The irony is that `lsb_release -c` has been returning "bullseye" on my sid machines for weeks, since base-files 11.1 was uploaded to unstable (2021-04-10). The base-files in bullseye is still 11, but I expect the current problem will sort itself out automatically once 11.1 migrates from unstable to testing: https://tracker.debian.org/pkg/base-files Unfortunately, exactly *when* the release team will allow that is unclear (at least to me). -- Jeremy Stanley -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 963 bytes Desc: not available URL: From radoslaw.piliszek at gmail.com Fri Apr 23 13:17:23 2021 From: radoslaw.piliszek at gmail.com (=?UTF-8?Q?Rados=C5=82aw_Piliszek?=) Date: Fri, 23 Apr 2021 15:17:23 +0200 Subject: Debian bullsye image Ansible detection In-Reply-To: <20210423121645.7ndgdmz22zbrplvu@yuggoth.org> References: <20210423121645.7ndgdmz22zbrplvu@yuggoth.org> Message-ID: On Fri, Apr 23, 2021 at 2:17 PM Jeremy Stanley wrote: > > On 2021-04-23 15:07:45 +1000 (+1000), Ian Wienand wrote: > > In short, Ansible reports "n/a" for ansible_distribution_release on > > our new bullseye nodes. This screws up our mirror setup. This has > > turned into quite an adventure. > > > > Currently, Debian is frozen to create the "bullseye" release. This > > means that "bullseye" is really an alias for "testing", that will turn > > into the release after the freeze period. > [...] > > The irony is that `lsb_release -c` has been returning "bullseye" on > my sid machines for weeks, since base-files 11.1 was uploaded to > unstable (2021-04-10). Well, I guess it means that Ian's hack would be more than acceptable. ;-) -yoctozepto From fungi at yuggoth.org Fri Apr 23 13:32:27 2021 From: fungi at yuggoth.org (Jeremy Stanley) Date: Fri, 23 Apr 2021 13:32:27 +0000 Subject: Debian bullsye image Ansible detection In-Reply-To: References: <20210423121645.7ndgdmz22zbrplvu@yuggoth.org> Message-ID: <20210423133226.qfllhnvrd3hxx4ur@yuggoth.org> On 2021-04-23 15:17:23 +0200 (+0200), Radosław Piliszek wrote: [...] > Well, I guess it means that Ian's hack would be more than acceptable. ;-) Or just fetch base-files 11.1 in from sid temporarily in our infra-package-needs element until it migrates into bullseye. -- Jeremy Stanley -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 963 bytes Desc: not available URL: From radoslaw.piliszek at gmail.com Fri Apr 23 14:36:05 2021 From: radoslaw.piliszek at gmail.com (=?UTF-8?Q?Rados=C5=82aw_Piliszek?=) Date: Fri, 23 Apr 2021 16:36:05 +0200 Subject: Debian bullsye image Ansible detection In-Reply-To: <20210423133226.qfllhnvrd3hxx4ur@yuggoth.org> References: <20210423121645.7ndgdmz22zbrplvu@yuggoth.org> <20210423133226.qfllhnvrd3hxx4ur@yuggoth.org> Message-ID: On Fri, Apr 23, 2021 at 3:32 PM Jeremy Stanley wrote: > > On 2021-04-23 15:17:23 +0200 (+0200), Radosław Piliszek wrote: > [...] > > Well, I guess it means that Ian's hack would be more than acceptable. ;-) > > Or just fetch base-files 11.1 in from sid temporarily in our > infra-package-needs element until it migrates into bullseye. Works for me. -yoctozepto From cboylan at sapwetik.org Mon Apr 26 23:18:25 2021 From: cboylan at sapwetik.org (Clark Boylan) Date: Mon, 26 Apr 2021 16:18:25 -0700 Subject: Team Meeting Agenda for April 27, 2021 Message-ID: <2edff236-f1bc-4644-870a-f6ad0dd2b1d0@www.fastmail.com> We will meet on April 27, 2021 at 19:00UTC in #opendev-meeting with this agenda: == Agenda for next meeting == * Announcements * Actions from last meeting * Specs approval * Priority Efforts (Standing meeting agenda items. Please expand if you have subtopics.) ** [http://specs.openstack.org/openstack-infra/infra-specs/specs/update-config-management.html Update Config Management] *** topic:update-cfg-mgmt *** Zuul as CD engine ** OpenDev *** Gerrit account inconsistencies **** All preferred emails lack external ids issues have been corrected. All group loops have been corrected. **** Workaround is we can stop Gerrit, push to external ids directly, reindex accounts (and groups?), start gerrit, then clear accounts caches (and groups caches?) **** Next steps ***** More "dangerous" list has been generated. Should still be safe-ish particularly if we disable the accounts first. *** Configuration tuning **** Reduce the number of ssh threads. Possibly create bot/batch user groups and thread counts as part of this. **** https://groups.google.com/g/repo-discuss/c/BQKxAfXBXuo Upstream conversation with people struggling with similar problems. * General topics ** Picking up steam on Puppet -> Ansible rewrites (clarkb 20210427) *** Enable Xenial -> Bionic/Focal system upgrades *** https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades Start capturing TODO list here *** Zuul service host updates in progress now. Scheduler and Zookeeper cluster remaining. Will focus on ZK first. **** https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2021 discussion of options for zk upgrades ** survey.openstack.org (clarkb 20210427) *** We're getting friendly reminders that this SSL cert is about to expire. Would be good to cleanup. ** Debian Bullseye Images (clarkb 20210427) *** Need some DIB updates to hack around Debian versioning and Ansible's factorizing of that info. ** Minor git-review release to support --no-thin (clarkb 202104027) * Open discussion