Hello Fellow OpenStack and OpenDev Folks!
TL;DR click on [3] and enjoy.
I am starting this thread to not hijack the discussion happening on [1].
First of all, I would like to thank gibi (Balazs Gibizer) for hacking
a way to get the place to render the table in the first place (pun
intended).
I have been a long-time user of [2].
I have improved and customised it for myself but never really got to
share back the changes I made.
The new Gerrit obviously broke the whole script, so there was no point
in sharing it in that state.
However, inspired by gibi's work, I decided to finally sit down and
fix it to work with Gerrit 3 and here it comes: [3].
It works well on Chrome with Tampermonkey; I have not tested other
browsers.
I hope you will enjoy this little helper (I do).
I know the script looks super fugly, but that is mostly because it
mixes the styles of three people and works around Gerrit's funky UI
rendering.
Finally, I'd also like to thank hrw (Marcin Juszkiewicz) for pointing
me to Michel's original script back in 2019.
[1] http://lists.openstack.org/pipermail/openstack-discuss/2020-November/019051…
[2] https://opendev.org/x/coats/src/commit/444c95738677593dcfed0cfd9667d4c4f0d5…
[3] https://gist.github.com/yoctozepto/7ea1271c299d143388b7c1b1802ee75e
Kind regards,
-yoctozepto
Hi,
one of our jobs (python-tempestconf project) is frequently failing with
POST_FAILURE [1] during the following task:
export-devstack-journal : Export journal
I'm bringing this to a broader audience as we're not sure where exactly the
issue might be.
Have you encountered a similar issue recently or in the past?
[1]
https://zuul.opendev.org/t/openstack/builds?job_name=python-tempestconf-tem…
Thanks for any advice,
--
Martin Kopec
We will meet on April 27, 2021 at 19:00UTC in #opendev-meeting with this agenda:
== Agenda for next meeting ==
* Announcements
* Actions from last meeting
* Specs approval
* Priority Efforts (Standing meeting agenda items. Please expand if you have subtopics.)
** [http://specs.openstack.org/openstack-infra/infra-specs/specs/update-config-… Update Config Management]
*** topic:update-cfg-mgmt
*** Zuul as CD engine
** OpenDev
*** Gerrit account inconsistencies
**** All "preferred email lacks external ids" issues have been corrected. All group loops have been corrected.
**** The workaround is that we can stop Gerrit, push to external ids directly, reindex accounts (and groups?), start Gerrit, then clear the accounts caches (and groups caches?)
**** Next steps
***** A more "dangerous" list has been generated. It should still be safe-ish, particularly if we disable the accounts first.
*** Configuration tuning
**** Reduce the number of ssh threads. Possibly create bot/batch user groups and thread counts as part of this.
**** https://groups.google.com/g/repo-discuss/c/BQKxAfXBXuo Upstream conversation with people struggling with similar problems.
* General topics
** Picking up steam on Puppet -> Ansible rewrites (clarkb 20210427)
*** Enable Xenial -> Bionic/Focal system upgrades
*** https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades Start capturing TODO list here
*** Zuul service host updates in progress now. Scheduler and Zookeeper cluster remaining. Will focus on ZK first.
**** https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2021 discussion of options for zk upgrades
** survey.openstack.org (clarkb 20210427)
*** We're getting friendly reminders that this SSL cert is about to expire. Would be good to clean up.
** Debian Bullseye Images (clarkb 20210427)
*** Need some DIB updates to hack around Debian versioning and how Ansible turns that info into facts.
** Minor git-review release to support --no-thin (clarkb 20210427)
* Open discussion
Hello,
In short, Ansible reports "n/a" for ansible_distribution_release on
our new bullseye nodes. This screws up our mirror setup. This has
turned into quite an adventure.
Currently, Debian is frozen to create the "bullseye" release. This
means that "bullseye" is really an alias for "testing", which will
turn into the release after the freeze period.
So currently Debian bullseye reports itself in /etc/debian_version or
/etc/os-release as "bullseye/sid". This sort of makes sense if you
consider that you don't commit things to "testing" directly, they go
into unstable ("sid") and then migrate after a period of stability.
So you can't have a "base-files" package in bullseye that hasn't gone
through unstable/sid. You can read "bullseye/sid" as "we've chosen
the name bullseye and packages going through unstable are destined for
it".
Now, you might see a problem: "unstable" and "bullseye" (testing)
both report themselves in these version files as the same thing
(because the unstable packages that provide these files move into
testing).
"lsb_release -c" tries to be a bit smart about this, and looks at the
output of "apt-cache policy" to try and see if you are actually
pulling the .deb files from a bullseye repo or an unstable one.
Interestingly, this relies on a "Label" being present in the mirror
release files. Since we use reprepro to build our own mirrors, we do
not have this (and why nobody else who doens't use our mirrors seems
to notice this problem). A fix is proposed with
https://review.opendev.org/c/opendev/system-config/+/787661
So "lsb_release -c" doesn't report anything, leaving Ansible in the
dark as to what repo it uses.
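To make that concrete, here is a rough Python sketch (an approximation,
not the actual lsb_release implementation) of the kind of check
involved: only "release" entries from "apt-cache policy" that carry
both a label (l=) and a codename (n=) are usable, so a reprepro mirror
with no Label yields nothing at all.

    # Approximation only: the real lsb_release is more involved, but the
    # Label dependence looks roughly like this.
    import subprocess

    def guess_codename():
        out = subprocess.run(["apt-cache", "policy"],
                             capture_output=True, text=True,
                             check=True).stdout
        for line in out.splitlines():
            line = line.strip()
            if not line.startswith("release "):
                continue
            fields = dict(kv.split("=", 1)
                          for kv in line[len("release "):].split(",")
                          if "=" in kv)
            # Only entries with a Label ("l=") count, which is why our
            # reprepro mirrors (no Label) produce no answer at all.
            if "l" in fields and "n" in fields:
                return fields["n"]   # e.g. "bullseye" or "sid"
        return None

    print(guess_codename())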
When "lsb_release -c" doesn't return anything helpful, Ansible tries
to do it's own parsing of the release files. I started hacking on
these, but the point raised in
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=845651
gave me pause. It is a fair point that you cannot really know if
you're on bullseye or sid by examining these files. N/A is probably
actually the correct answer from Ansible's POV. Anyway, that is
https://github.com/ianw/ansible/commit/847817a82ed86b5f39a4ccc3ffbff0e0cd63…
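To illustrate the ambiguity (a hypothetical sketch, not Ansible's
actual fact-gathering code): a released system carries a numeric
/etc/debian_version that can be mapped to a codename, while both
testing and unstable carry the literal "bullseye/sid", so there is
nothing in the file to distinguish them.

    import re

    def release_from_debian_version(path="/etc/debian_version"):
        content = open(path).read().strip()
        if re.fullmatch(r"\d+(\.\d+)*", content):
            # Released system: the number maps to a codename.
            return {"10": "buster",
                    "11": "bullseye"}.get(content.split(".")[0])
        if content.endswith("/sid"):
            # "bullseye/sid": could be testing or unstable -- genuinely
            # ambiguous, so "n/a" is arguably the right answer here.
            return None
        return content

    print(release_from_debian_version())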
Now, even more annoyingly, setting the label in our mirrors may not be
sufficient for "lsb_release -c" to work on our images, because we have
cleared out the apt repositories. You would need to run "apt-get
update" before Ansible tries to run "lsb_release" to populate its
facts. The problem is that we're trying to use Ansible's fact about
the distro name to set up apt to point to our mirrors -- so we can't
apt-get update before we have that written out! Classic chicken and
egg.
The only other idea I have is to hack dib/early setup to overwrite
/etc/debian_version with "11.0" so that we look like the upcoming
release has already happened. "lsb_release -c" will then report
"bullseye". However, there is some possibility this will confuse other
things, as this release technically hasn't happened yet. I've proposed
that with
https://review.opendev.org/c/openstack/diskimage-builder/+/787665
I'm open to suggestions!
-i
The PTG is next week, and OpenDev is participating alongside the OpenStack TaCT SIG. We are going to try something a bit different this time around, which is to treat the time as office hours rather than time for our own projects. We will be meeting on April 22 from 14:00 - 16:00 UTC and 22:00 - 00:00 UTC in https://meetpad.opendev.org/apr2021-ptg-opendev.
Join us if you would like to:
* Start contributing to either OpenDev or the TaCT SIG.
* Debug a particular job problem.
* Learn how to write and review Zuul jobs and related configs.
* Learn about specific services or how they are deployed.
* And anything else related to OpenDev and our project infrastructure.
Feel free to add your topics and suggest preferred times for those topics here: https://etherpad.opendev.org/p/apr2021-ptg-opendev. This etherpad corresponds to the document that will be auto loaded in our meetpad room above.
I will also be around next week and will try to keep a flexible schedule. Feel free to reach out if you would like us to join discussions as they happen.
See you there,
Clark
We will meet with this agenda on April 13, 2021 at 19:00 UTC in #opendev-meeting:
== Agenda for next meeting ==
* Announcements
** OpenStack completing release April 14. Airship 2.0 doesn't seem to exist yet, so we will assume they are still working on it.
* Actions from last meeting
* Specs approval
* Priority Efforts (Standing meeting agenda items. Please expand if you have subtopics.)
** [http://specs.openstack.org/openstack-infra/infra-specs/specs/update-config-… Update Config Management]
*** topic:update-cfg-mgmt
*** Zuul as CD engine
** OpenDev
*** Gerrit upgrade to 3.2.8
**** https://review.opendev.org/c/opendev/system-config/+/784152
*** Gerrit account inconsistencies
**** All "preferred email lacks external ids" issues have been corrected. All group loops have been corrected.
**** The workaround is that we can stop Gerrit, push to external ids directly, reindex accounts (and groups?), start Gerrit, then clear the accounts caches (and groups caches?)
**** Next steps
***** ~224 accounts were cleaned up. The next batch of ~56 has been started. We will clean their external IDs after letting the retired users sit for a few days.
***** Email sent to two Third Party CI groups about correcting external id conflicts among their accounts. These accounts will not be retired (for the most part).
*** Configuration tuning
**** Reduce the number of ssh threads. Possibly create bot/batch user groups and thread counts as part of this.
**** https://groups.google.com/g/repo-discuss/c/BQKxAfXBXuo Upstream conversation with people struggling with similar problems.
* General topics
** Picking up steam on Puppet -> Ansible rewrites (clarkb 20210413)
*** Enable Xenial -> Bionic/Focal system upgrades
*** https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades Start capturing TODO list here
*** Zuul service host updates in progress now. Scheduler and Zookeeper cluster remaining. Will focus on ZK first.
**** https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2021 discussion of options for zk upgrades
** planet.openstack.org (ianw 20210413)
*** Strong preference from clarkb to retire it
*** Superuser appears to be a major blog showing up there, along with a couple of others. Maybe we reach out to them and double-check they don't want to help? (fungi and clarkb reached out to Superuser and they seem ok.)
** survey.openstack.org (clarkb 20210413)
*** Can we go ahead and clean this service up? I don't think it ever got much use (maybe one or two surveys total).
** docs-old volume cleanup (ianw 20210413)
*** We were going to double check with Ajaeger, then proceed with cleanup if no one has a reason to keep it.
** PTG Planning (clarkb 20210413)
*** Next PTG April 19-23
**** Thursday April 22 1400-1600UTC and 2200-0000UTC
* Open discussion
Hi,
I recently spent some time trying to figure out why a job worked as
expected during one run and then failed due to limited memory on the
following run. It turns out that back in February this change was
merged on an emergency basis, which caused us to start occasionally
providing nodes with 32G of RAM instead of the typical 8G:
https://review.opendev.org/773710
Nodepool labels are designed to represent the combination of an image
and set of resources. To the best of our ability, the images and
resources they provide should be consistent across different cloud
providers. That's why we use DIB to create consistent images and that's
why we use "-expanded" labels to request nodes with additional memory.
It's also the case that when we add new clouds, we generally try to
benchmark performance and adjust flavors as needed.
Unfortunately, providing such disparate resources under the same
Nodepool labels makes it impossible for job authors to reliably design
jobs.
To be clear, it's fine to provide resources of varying size, we just
need to use different Nodepool labels for them so that job authors get
what they're asking for.
The last time we were in this position, we updated our Nodepool images
to add the mem= Linux kernel command line parameter in order to limit
the total available RAM. I suspect that is still possible, but due to
the explosion of images and flavors, doing so will be considerably more
difficult this time.
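For reference, a hypothetical sketch of what that kind of image-build
tweak could look like (the grub path and the 8G cap are assumptions for
illustration, not the element we previously used):

    import re

    def cap_memory(path="/etc/default/grub", cap="mem=8G"):
        # Append the cap to GRUB_CMDLINE_LINUX so the booted node only
        # sees the intended amount of RAM, regardless of flavor size.
        text = open(path).read()

        def add_cap(match):
            args = match.group(2).split()
            if cap not in args:
                args.append(cap)
            return '%s"%s"' % (match.group(1), " ".join(args))

        text = re.sub(r'(GRUB_CMDLINE_LINUX=)"([^"]*)"', add_cap,
                      text, count=1)
        open(path, "w").write(text)
        # update-grub (or grub2-mkconfig) still needs to run afterwards
        # for the change to take effect in the built image.

    cap_memory()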
We now also have the ability to reboot nodes in jobs after they come
online, but doing that would add additional run time for every job.
I believe we need to address this. Despite the additional work, it
seems like the "mem=" approach is our best bet, unless anyone has other
ideas?
-Jim
Hi,
We have a large server provided by Vexxhost, review02.openstack.org,
up and running in a staging capacity to replace the current review
server.
I have started to track some things at [1].
There are a couple of things:
1) Production database
Currently, we use a hosted db. Since NoteDB, this only stores "review
seen" flags. We've been told that other sites treat this data as
ephemeral; they use an H2 db on disk and don't worry about backing up
or restoring across upgrades.
I have proposed storing this in a mariadb sibling container with [2].
We know how to admin, back up, and restore that. That would be my
preference, but I'm not terribly fussed. If I could request some
reviews on that: I'll take +2's as a sign we should use a container;
otherwise we can leave it with the H2 db it has now.
2) IPv6 issues
We've seen a couple of cases that are looking increasingly like stray
RAs are somehow assigning extra addresses, similar to [3]. Our
mirror in the same region has managed to acquire 50+ default routes
somehow.
It seems like inbound traffic keeps working (perhaps that is why we
haven't seen issues with other production servers?). But I feel it's
a little bit troubling to have this undiagnosed before we switch our
major service to it. I'm running some tracing (noted in the
etherpad), trying to at least catch a stray RA while the server is
quiet. Suggestions here are welcome.
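For the record, the sort of tracing I mean could look something like
this (just an assumption about one way to catch RAs, not necessarily
what is set up on the host; it needs root and scapy, and the interface
name is a placeholder):

    from scapy.all import sniff, ICMPv6ND_RA, IPv6

    def log_ra(pkt):
        # Print the source and advertised router lifetime of each RA.
        ra = pkt[ICMPv6ND_RA]
        print("RA from %s: router lifetime %ss"
              % (pkt[IPv6].src, ra.routerlifetime))

    # "ens3" is a placeholder interface name.
    sniff(iface="ens3", lfilter=lambda p: ICMPv6ND_RA in p, prn=log_ra)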
-i
[1] https://etherpad.opendev.org/p/gerrit-upgrade-2021
[2] https://review.opendev.org/c/opendev/system-config/+/775961
[3] https://launchpad.net/bugs/1844712
We will meet with this agenda on April 6, 2021 at 19:00UTC in #opendev-meeting:
== Agenda for next meeting ==
* Announcements
** OpenStack producing final RCs this week. Airship also working on a release.
* Actions from last meeting
* Specs approval
* Priority Efforts (Standing meeting agenda items. Please expand if you have subtopics.)
** [http://specs.openstack.org/openstack-infra/infra-specs/specs/update-config-… Update Config Management]
*** topic:update-cfg-mgmt
*** Zuul as CD engine
** OpenDev
*** Gerrit upgrade to 3.2.8
**** https://review.opendev.org/c/opendev/system-config/+/784152
*** Gerrit account inconsistencies
**** All "preferred email lacks external ids" issues have been corrected. All group loops have been corrected.
**** The workaround is that we can stop Gerrit, push to external ids directly, reindex accounts (and groups?), start Gerrit, then clear the accounts caches (and groups caches?)
**** Next steps
***** Cleaning external IDs for the last batch of retired users.
*** Configuration tuning
**** Using strong refs for jgit caches
**** Batch user groups and threads
* General topics
** Picking up steam on Puppet -> Ansible rewrites (clarkb 20210406)
*** Enable Xenial -> Bionic/Focal system upgrades
*** https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades Start capturing TODO list here
*** Zuul service host updates in progress now. Scheduler and Zookeeper cluster remaining. Will focus on ZK first.
**** https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2021 discussion of options for zk upgrades
** PTG Planning (ianw 20210406)
*** Next PTG April 19-23
*** Clarkb filled out the survey and requested a few hours for us. Likely to be spent in more of an office-hours type setup.
**** Thursday April 22 1400-1600UTC and 2200-0000UTC
** docs-old volume cleanup (ianw 20210406)
*** We were going to double check with Ajaeger, then proceed with cleanup if no one has a reason to keep it.
** planet.openstack.org (ianw 20210406)
*** Strong preference from clarkb to retire it
*** Superuser appears to be a major blog showing up there, along with a couple of others. Maybe we reach out to them and double-check they don't want to help? (fungi and clarkb reached out to Superuser and they seem ok.)
** tarballs ORD replication (ianw 20210406)
*** This has been done. Other than the long initial sync, is this happy day to day?
* Open discussion