From iwienand at redhat.com  Thu Apr  1 02:27:16 2021
From: iwienand at redhat.com (Ian Wienand)
Date: Thu, 1 Apr 2021 13:27:16 +1100
Subject: Next steps with new review server
Message-ID: <YGUvhGsmF+4wDRJw@fedora19.localdomain>

Hi,

We have a large server provided by Vexxhost up and running in a
staging capacity to replace the current server at
review02.openstack.org.

I have started to track some things at [1].

There are a couple of things:

1) Production database

Currently, we use a hosted db.  Since NoteDB this only stores review
seen flags.  We've been told that other sites treat this data as
ephemeral; they use an H2 db on disk and don't worry about backing up
or restoring across upgrades.

I have proposed storing this in a mariadb sibling container with [2].
We know how to admin, backup and restore that.  That would be my
preference, but I'm not terribly fussed.  I'd like to request some
reviews on that; I'll take +2's as a sign we should use a container,
otherwise we can leave it with the H2 db it has now.

2) IPv6 issues

We've seen a couple of cases that are looking increasingly like stray
RAs are somehow assigning extra addresses, similar to [3].  Our
mirror in the same region has managed to acquire 50+ default routes
somehow.

It seems like inbound traffic keeps working (which may be why we
haven't seen issues with other production servers?).  But I feel like
it's a little bit troubling to leave this undiagnosed before we switch
our major service to it.  I'm running some tracing, trying to at least
catch a stray RA while the server is quiet; notes are in the etherpad.
But suggestions here are welcome.
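
As a rough way to keep an eye on the default-route count while the
tracing runs, here is a minimal sketch. It assumes the routing table has
been captured as text in the usual iproute2 layout; the function name and
the sample output (documentation and link-local addresses) are made up
for illustration and are not taken from the affected hosts.

```python
def count_default_routes(ip_route_output: str) -> int:
    """Count the lines that describe a default route in `ip -6 route` style text."""
    return sum(
        1
        for line in ip_route_output.splitlines()
        # A default route line starts with the literal word "default".
        if line.split()[:1] == ["default"]
    )


# Made-up sample output; not captured from the mirror or the review server.
SAMPLE = """\
default via fe80::1 dev ens3 proto ra metric 100 expires 1500sec pref medium
default via fe80::2 dev ens3 proto ra metric 100 expires 1700sec pref medium
2001:db8::/64 dev ens3 proto ra metric 100 pref medium
fe80::/64 dev ens3 proto kernel metric 256 pref medium
"""

print(count_default_routes(SAMPLE))
```

On a live host the text could come from running `ip -6 route show default`
and passing its output to the function; a count that keeps growing would
be consistent with stray RAs being accepted.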

-i


[1] https://etherpad.opendev.org/p/gerrit-upgrade-2021
[2] https://review.opendev.org/c/opendev/system-config/+/775961
[3] https://launchpad.net/bugs/1844712


From cboylan at sapwetik.org  Thu Apr  1 15:20:31 2021
From: cboylan at sapwetik.org (Clark Boylan)
Date: Thu, 01 Apr 2021 08:20:31 -0700
Subject: Next steps with new review server
In-Reply-To: <YGUvhGsmF+4wDRJw@fedora19.localdomain>
References: <YGUvhGsmF+4wDRJw@fedora19.localdomain>
Message-ID: <a4bbef80-de4c-497f-805a-50f4a45fb313@www.fastmail.com>

On Wed, Mar 31, 2021, at 7:27 PM, Ian Wienand wrote:
> Hi,
> 
> We have a large server provided by Vexxhost up and running in a
> staging capacity to replace the current server at
> review02.openstack.org.
> 
> I have started to track some things at [1]
> 
> There's a couple of things:
> 
> 1) Production database
> 
> Currently, we use a hosted db.  Since NoteDB this only stores review
> seen flags.  We've been told that other sites treat this data as
> ephemeral; they use a H2 db on disk and don't worry about backing up
> or restoring across upgrades.
> 
> I have proposed storing this in a mariadb sibling container with [2].
> We know how to admin, backup and restore that.  That would be my
> preference, but I'm not terribly fussed.  If I could request some
> reviews on that; I'll take +2's as a sign we should use a container,
> otherwise we can leave it with H2 it has now.

Agreed, sticking with known DB tooling seems like a good idea for ease of operator interaction. I'll try to review this change today.

> 
> 2) IPv6 issues
> 
> We've seen a couple of cases that are looking increasingly like stray
> RAs are somehow assigning extra addresses, similar to [3].  Our
> mirror in the same region has managed to acquire 50+ default routes
> somehow.
> 
> It seems like inbound traffic keeps working (why we haven't seen
> issues with other production servers?).  But I feel like it's a little
> bit troubling to have undiagnosed before we switch our major service
> to it.  I'm running some tracing, trying to at least catch a stray RA
> while the server is quiet, in the etherpad.  But suggestions here are
> welcome.

Agreed, ideally we would sort this out before any migration completes. I want to say we saw something similar with the mirror in vexxhost, and the "solution" there was to disable RAs and create a static yaml config for ubuntu using its new network management config file? That seems less than ideal from a cloud perspective as we can't be the only ones noticing this (in fact some of our CI jobs may indicate they suffer from something similar, causing some jobs to run long when reaching network resources). I know when we brought this up with the mirror mnaser suggested static config was fine, but maybe we need to reinforce that this is problematic as a cloud user and see if we can help debug (network traces seem like a good start there).

> 
> -i
> 
> 
> [1] https://etherpad.opendev.org/p/gerrit-upgrade-2021
> [2] https://review.opendev.org/c/opendev/system-config/+/775961
> [3] https://launchpad.net/bugs/1844712


From cboylan at sapwetik.org  Thu Apr  1 21:35:32 2021
From: cboylan at sapwetik.org (Clark Boylan)
Date: Thu, 01 Apr 2021 14:35:32 -0700
Subject: Next steps with new review server
In-Reply-To: <a4bbef80-de4c-497f-805a-50f4a45fb313@www.fastmail.com>
References: <YGUvhGsmF+4wDRJw@fedora19.localdomain>
 <a4bbef80-de4c-497f-805a-50f4a45fb313@www.fastmail.com>
Message-ID: <cb809d21-2f12-4178-9be2-41711fde8679@www.fastmail.com>

On Thu, Apr 1, 2021, at 8:20 AM, Clark Boylan wrote:
> On Wed, Mar 31, 2021, at 7:27 PM, Ian Wienand wrote:

snip

> > 
> > 2) IPv6 issues
> > 
> > We've seen a couple of cases that are looking increasingly like stray
> > RAs are somehow assigning extra addresses, similar to [3].  Our
> > mirror in the same region has managed to acquire 50+ default routes
> > somehow.
> > 
> > It seems like inbound traffic keeps working (why we haven't seen
> > issues with other production servers?).  But I feel like it's a little
> > bit troubling to have undiagnosed before we switch our major service
> > to it.  I'm running some tracing, trying to at least catch a stray RA
> > while the server is quiet, in the etherpad.  But suggestions here are
> > welcome.
> 
> Agreed, ideally we would sort this out before any migration completes. 
> I want to say we saw similar with the mirror in vexxhost and the 
> "solution" there was to disable RAs and create a static yaml config for 
> ubuntu using its new network management config file? That seems less 
> than ideal from a cloud perspective as we can't be the only ones 
> noticing this (in fact some of our CI jobs may indicate they suffer 
> from similar causing some jobs to run long when reaching network 
> resources). I know when we brought this up with the mirror mnaser 
> suggested static config was fine, but maybe we need to reinforce that 
> this is problematic as a cloud user and see if we can help debug 
> (network traces seem like a good start there).

I ended up double checking the mirror node, and in mirror.ca-ymq-1.vexxhost.opendev.org:/etc/netplan/50-cloud-init.yaml you can see what we did there. Essentially we set dhcp6 and accept-ra to false, then set an address and routes. We should be able to do the same thing with the new review host if we can't figure anything else out.

If we do go this route maybe we should consider updating launch-node to do it for us automatically when launching focal nodes on vexxhost (I don't think bionic does netplan?), or at the very least document this somewhere.

We should also double check that the address and routes are static and can be configured statically like this (the address should not change but I suppose the routes could at some point?). Ideally though we would sort this out properly and avoid these workarounds.
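
For reference, the shape of such a static config can be sketched as below. This is a minimal illustration, not a copy of the file on the mirror: the interface name, address and gateway are placeholders from the IPv6 documentation range, the helper name is hypothetical, and IPv4 settings are left out. The keys used (dhcp6, accept-ra, addresses, routes with to/via) are standard netplan keys.

```python
import json


def static_ipv6_netplan(interface: str, address: str, gateway: str) -> dict:
    """Return a netplan-style mapping with DHCPv6 and RA processing turned off."""
    return {
        "network": {
            "version": 2,
            "ethernets": {
                interface: {
                    "dhcp6": False,       # do not request addresses via DHCPv6
                    "accept-ra": False,   # ignore router advertisements
                    "addresses": [address],
                    "routes": [{"to": "::/0", "via": gateway}],
                }
            },
        }
    }


# Placeholder values only; the real address and routes must come from the cloud.
config = static_ipv6_netplan("ens3", "2001:db8::10/64", "2001:db8::1")
print(json.dumps(config, indent=2))
```

The mapping is printed as JSON only to make the structure easy to inspect; the real file under /etc/netplan is YAML with the same keys. If launch-node grew support for this, the three arguments are the values it would need to look up per host.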

> 
> > 
> > -i
> > 
> > 
> > [1] https://etherpad.opendev.org/p/gerrit-upgrade-2021
> > [2] https://review.opendev.org/c/opendev/system-config/+/775961
> > [3] https://launchpad.net/bugs/1844712
> 
>


From cboylan at sapwetik.org  Mon Apr  5 22:30:01 2021
From: cboylan at sapwetik.org (Clark Boylan)
Date: Mon, 05 Apr 2021 15:30:01 -0700
Subject: Team Meeting Agenda for April 6, 2021
Message-ID: <f790b5f3-d130-4ea6-a85e-0012323836c0@www.fastmail.com>

We will meet with this agenda on April 6, 2021 at 19:00UTC in #opendev-meeting:

== Agenda for next meeting ==

* Announcements
** OpenStack producing final RCs this week. Airship also working on a release.
* Actions from last meeting
* Specs approval

* Priority Efforts (Standing meeting agenda items. Please expand if you have subtopics.)
** [http://specs.openstack.org/openstack-infra/infra-specs/specs/update-config-management.html Update Config Management]
*** topic:update-cfg-mgmt
*** Zuul as CD engine
** OpenDev
*** Gerrit upgrade to 3.2.8
**** https://review.opendev.org/c/opendev/system-config/+/784152
*** Gerrit account inconsistencies
**** All "preferred email lacks external ID" issues have been corrected. All group loops have been corrected.
**** Workaround is we can stop Gerrit, push to external ids directly, reindex accounts (and groups?), start gerrit, then clear accounts caches (and groups caches?)
**** Next steps
***** Cleaning external IDs for the last batch of retired users.
*** Configuration tuning
**** Using strong refs for jgit caches
**** Batch user groups and threads

* General topics
** Picking up steam on Puppet -> Ansible rewrites (clarkb 20210406)
*** Enable Xenial -> Bionic/Focal system upgrades
*** https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades Start capturing TODO list here
*** Zuul service host updates in progress now. Scheduler and Zookeeper cluster remaining. Will focus on ZK first.
**** https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2021 discussion of options for zk upgrades
** PTG Planning (ianw 20210406)
*** Next PTG April 19-23
*** Clarkb filled out the survey and requested a few hours for us. Likely to be spent in more office hours type setup.
**** Thursday April 22 1400-1600UTC and 2200-0000UTC
** docs-old volume cleanup (ianw 20210406)
*** We were going to double check with Ajaeger if we can then proceed to cleanup if no one had a reason to keep it.
** planet.openstack.org (ianw 20210406)
*** Strong preference from clarkb to retire it
*** Superuser appears to be a major blog showing up there as well as a couple of others. Maybe we reach out to them and double check they don't want to help? (fungi and clarkb reached out to Superuser and they seem ok.)
** tarballs ORD replication (ianw 20210406)
*** This has been done. Other than the long initial sync, is this happy day to day?

* Open discussion


From mkopec at redhat.com  Tue Apr  6 11:21:17 2021
From: mkopec at redhat.com (Martin Kopec)
Date: Tue, 6 Apr 2021 13:21:17 +0200
Subject: [devstack][infra] POST_FAILURE on export-devstack-journal : Export
 journal
Message-ID: <CAKZGdE3zSKA3bsYj3z_fw+ypiriVPtiZNefDBwsz7ChLWiHd4w@mail.gmail.com>

Hi,

one of our jobs (python-tempestconf project) is frequently failing with
POST_FAILURE [1]
during the following task:

export-devstack-journal : Export journal

I'm bringing this to a broader audience as we're not sure where exactly the
issue might be.

Did you encounter a similar issue lately or in the past?

[1]
https://zuul.opendev.org/t/openstack/builds?job_name=python-tempestconf-tempest-devstack-admin-plugins&project=osf/python-tempestconf

Thanks for any advice,
-- 
Martin Kopec

From radoslaw.piliszek at gmail.com  Tue Apr  6 15:14:02 2021
From: radoslaw.piliszek at gmail.com (Radosław Piliszek)
Date: Tue, 6 Apr 2021 17:14:02 +0200
Subject: [devstack][infra] POST_FAILURE on export-devstack-journal :
 Export journal
In-Reply-To: <CAKZGdE3zSKA3bsYj3z_fw+ypiriVPtiZNefDBwsz7ChLWiHd4w@mail.gmail.com>
References: <CAKZGdE3zSKA3bsYj3z_fw+ypiriVPtiZNefDBwsz7ChLWiHd4w@mail.gmail.com>
Message-ID: <CAKZ_x7_qfAu+4aZvYGqBOnTygvgB02XPi+CJwFMJvFquV_qFQA@mail.gmail.com>

I am testing whether replacing xz with gzip would solve the problem [1] [2].

[1] https://review.opendev.org/c/openstack/devstack/+/784964
[2] https://review.opendev.org/c/osf/python-tempestconf/+/784967

-yoctozepto

On Tue, Apr 6, 2021 at 1:21 PM Martin Kopec <mkopec at redhat.com> wrote:
>
> Hi,
>
> one of our jobs (python-tempestconf project) is frequently failing with POST_FAILURE [1]
> during the following task:
>
> export-devstack-journal : Export journal
>
> I'm bringing this to a broader audience as we're not sure where exactly the issue might be.
>
> Did you encounter a similar issue lately or in the past?
>
> [1] https://zuul.opendev.org/t/openstack/builds?job_name=python-tempestconf-tempest-devstack-admin-plugins&project=osf/python-tempestconf
>
> Thanks for any advice,
> --
> Martin Kopec
>
>
>


From cboylan at sapwetik.org  Tue Apr  6 15:51:19 2021
From: cboylan at sapwetik.org (Clark Boylan)
Date: Tue, 06 Apr 2021 08:51:19 -0700
Subject: Re: [devstack][infra] POST_FAILURE on export-devstack-journal :
 Export journal
In-Reply-To: <CAKZ_x7_qfAu+4aZvYGqBOnTygvgB02XPi+CJwFMJvFquV_qFQA@mail.gmail.com>
References: <CAKZGdE3zSKA3bsYj3z_fw+ypiriVPtiZNefDBwsz7ChLWiHd4w@mail.gmail.com>
 <CAKZ_x7_qfAu+4aZvYGqBOnTygvgB02XPi+CJwFMJvFquV_qFQA@mail.gmail.com>
Message-ID: <d66de707-81cb-40d5-88d6-1c0ef6e112ec@www.fastmail.com>

On Tue, Apr 6, 2021, at 8:14 AM, Radosław Piliszek wrote:
> I am testing whether replacing xz with gzip would solve the problem [1] [2].

The reason we used xz is that these files are very large and gz compression is very poor compared to xz for them. They are also not really human readable as is (you need to load them into journald first). Let's test it and see what the gz file sizes look like, but if they are still quite large then this is unlikely to be an appropriate fix.


> 
> [1] https://review.opendev.org/c/openstack/devstack/+/784964
> [2] https://review.opendev.org/c/osf/python-tempestconf/+/784967
> 
> -yoctozepto
> 
> On Tue, Apr 6, 2021 at 1:21 PM Martin Kopec <mkopec at redhat.com> wrote:
> >
> > Hi,
> >
> > one of our jobs (python-tempestconf project) is frequently failing with POST_FAILURE [1]
> > during the following task:
> >
> > export-devstack-journal : Export journal
> >
> > I'm bringing this to a broader audience as we're not sure where exactly the issue might be.
> >
> > Did you encounter a similar issue lately or in the past?
> >
> > [1] https://zuul.opendev.org/t/openstack/builds?job_name=python-tempestconf-tempest-devstack-admin-plugins&project=osf/python-tempestconf
> >
> > Thanks for any advice,
> > --
> > Martin Kopec


From fungi at yuggoth.org  Tue Apr  6 16:02:48 2021
From: fungi at yuggoth.org (Jeremy Stanley)
Date: Tue, 6 Apr 2021 16:02:48 +0000
Subject: [devstack][infra] POST_FAILURE on export-devstack-journal :
 Export journal
In-Reply-To: <CAKZGdE3zSKA3bsYj3z_fw+ypiriVPtiZNefDBwsz7ChLWiHd4w@mail.gmail.com>
References: <CAKZGdE3zSKA3bsYj3z_fw+ypiriVPtiZNefDBwsz7ChLWiHd4w@mail.gmail.com>
Message-ID: <20210406160247.gevud2hlvodg7jzt@yuggoth.org>

On 2021-04-06 13:21:17 +0200 (+0200), Martin Kopec wrote:
> one of our jobs (python-tempestconf project) is frequently failing with
> POST_FAILURE [1]
> during the following task:
> 
> export-devstack-journal : Export journal
> 
> I'm bringing this to a broader audience as we're not sure where exactly the
> issue might be.
> 
> Did you encounter a similar issue lately or in the past?
> 
> [1]
> https://zuul.opendev.org/t/openstack/builds?job_name=python-tempestconf-tempest-devstack-admin-plugins&project=osf/python-tempestconf

Looking at the error, I strongly suspect memory exhaustion. We could
try tuning xz to use less memory when compressing.
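
To illustrate the trade-off, here is a minimal sketch using Python's
standard gzip and lzma modules rather than the xz command the job
actually runs. It assumes a low preset with a small LZMA2 dictionary is
the tuning knob, since encoder memory grows with dictionary size; the
helper name and the sample log line are made up.

```python
import gzip
import lzma


def compress_both(data: bytes, dict_size: int = 1 << 20):
    """Compress data with gzip and with xz using a small LZMA2 dictionary."""
    gz = gzip.compress(data)
    # A low preset plus an explicit 1 MiB dictionary keeps encoder memory modest.
    xz_filters = [{"id": lzma.FILTER_LZMA2, "preset": 1, "dict_size": dict_size}]
    xz = lzma.compress(data, format=lzma.FORMAT_XZ, filters=xz_filters)
    return gz, xz


# Made-up, highly repetitive stand-in for an exported journal.
sample = b"controller devstack@n-cpu.service[1234]: DEBUG sample log line\n" * 20000
gz, xz = compress_both(sample)
print(len(sample), len(gz), len(xz))
```

Real journal exports are far less repetitive than this sample, so the
sizes printed here say nothing about the actual ratio; the point is only
that the memory-related settings can be lowered independently of the
choice between gzip and xz.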
-- 
Jeremy Stanley

From radoslaw.piliszek at gmail.com  Tue Apr  6 16:11:41 2021
From: radoslaw.piliszek at gmail.com (Radosław Piliszek)
Date: Tue, 6 Apr 2021 18:11:41 +0200
Subject: [devstack][infra] POST_FAILURE on export-devstack-journal :
 Export journal
In-Reply-To: <20210406160247.gevud2hlvodg7jzt@yuggoth.org>
References: <CAKZGdE3zSKA3bsYj3z_fw+ypiriVPtiZNefDBwsz7ChLWiHd4w@mail.gmail.com>
 <20210406160247.gevud2hlvodg7jzt@yuggoth.org>
Message-ID: <CAKZ_x7_f-x3dqh1C38rbGZDt=nPNg1+9xtuHyOXLhfKVgxLndg@mail.gmail.com>

On Tue, Apr 6, 2021 at 6:02 PM Jeremy Stanley <fungi at yuggoth.org> wrote:
> Looking at the error, I strongly suspect memory exhaustion. We could
> try tuning xz to use less memory when compressing.

That was my hunch as well, which is why I am testing with gzip.

On Tue, Apr 6, 2021 at 5:51 PM Clark Boylan <cboylan at sapwetik.org> wrote:
>
> On Tue, Apr 6, 2021, at 8:14 AM, Radosław Piliszek wrote:
> > I am testing whether replacing xz with gzip would solve the problem [1] [2].
>
> The reason we used xz is that the files are very large and gz compression is very poor compared to xz for these files and these files are not really human readable as is (you need to load them into journald first). Let's test it and see what the gz file sizes look like but if they are still quite large then this is unlikely to be an appropriate fix.

Let's see how bad the file sizes are.
If they are acceptable, we can keep gzip and be happy.
Otherwise we try to tune the params to make xz a better citizen as
fungi suggested.

-yoctozepto


From radoslaw.piliszek at gmail.com  Tue Apr  6 16:15:28 2021
From: radoslaw.piliszek at gmail.com (Radosław Piliszek)
Date: Tue, 6 Apr 2021 18:15:28 +0200
Subject: [devstack][infra] POST_FAILURE on export-devstack-journal :
 Export journal
In-Reply-To: <CAKZ_x7_f-x3dqh1C38rbGZDt=nPNg1+9xtuHyOXLhfKVgxLndg@mail.gmail.com>
References: <CAKZGdE3zSKA3bsYj3z_fw+ypiriVPtiZNefDBwsz7ChLWiHd4w@mail.gmail.com>
 <20210406160247.gevud2hlvodg7jzt@yuggoth.org>
 <CAKZ_x7_f-x3dqh1C38rbGZDt=nPNg1+9xtuHyOXLhfKVgxLndg@mail.gmail.com>
Message-ID: <CAKZ_x79tKkpYgxpro1G-1sGqCds1n7X2AHJRwBC_C+Y3=T90Kg@mail.gmail.com>

On Tue, Apr 6, 2021 at 6:11 PM Radosław Piliszek
<radoslaw.piliszek at gmail.com> wrote:
> On Tue, Apr 6, 2021 at 5:51 PM Clark Boylan <cboylan at sapwetik.org> wrote:
> >
> > On Tue, Apr 6, 2021, at 8:14 AM, Radosław Piliszek wrote:
> > > I am testing whether replacing xz with gzip would solve the problem [1] [2].
> >
> > The reason we used xz is that the files are very large and gz compression is very poor compared to xz for these files and these files are not really human readable as is (you need to load them into journald first). Let's test it and see what the gz file sizes look like but if they are still quite large then this is unlikely to be an appropriate fix.
>
> Let's see how bad the file sizes are.

devstack.journal.gz 23.6M

That is less than all the other logs together, so I would not mind.
I wonder how it looks in other jobs (this is from the failing one).

-yoctozepto


From cboylan at sapwetik.org  Tue Apr  6 16:39:04 2021
From: cboylan at sapwetik.org (Clark Boylan)
Date: Tue, 06 Apr 2021 09:39:04 -0700
Subject: Re: [devstack][infra] POST_FAILURE on export-devstack-journal :
 Export journal
In-Reply-To: <CAKZ_x79tKkpYgxpro1G-1sGqCds1n7X2AHJRwBC_C+Y3=T90Kg@mail.gmail.com>
References: <CAKZGdE3zSKA3bsYj3z_fw+ypiriVPtiZNefDBwsz7ChLWiHd4w@mail.gmail.com>
 <20210406160247.gevud2hlvodg7jzt@yuggoth.org>
 <CAKZ_x7_f-x3dqh1C38rbGZDt=nPNg1+9xtuHyOXLhfKVgxLndg@mail.gmail.com>
 <CAKZ_x79tKkpYgxpro1G-1sGqCds1n7X2AHJRwBC_C+Y3=T90Kg@mail.gmail.com>
Message-ID: <af13e07e-4ac4-4ce4-bc8b-9feca9cdc647@www.fastmail.com>

On Tue, Apr 6, 2021, at 9:15 AM, Radosław Piliszek wrote:
> On Tue, Apr 6, 2021 at 6:11 PM Radosław Piliszek
> <radoslaw.piliszek at gmail.com> wrote:
> > On Tue, Apr 6, 2021 at 5:51 PM Clark Boylan <cboylan at sapwetik.org> wrote:
> > >
> > > On Tue, Apr 6, 2021, at 8:14 AM, Radosław Piliszek wrote:
> > > > I am testing whether replacing xz with gzip would solve the problem [1] [2].
> > >
> > > The reason we used xz is that the files are very large and gz compression is very poor compared to xz for these files and these files are not really human readable as is (you need to load them into journald first). Let's test it and see what the gz file sizes look like but if they are still quite large then this is unlikely to be an appropriate fix.
> >
> > Let's see how bad the file sizes are.
> 
> devstack.journal.gz 23.6M
> 
> Less than all the other logs together, I would not mind.
> I wonder how it is in other jobs (this is from the failing one).

There does seem to be a range (likely due to how much logging the job workload causes in journald) from a few megabytes to eighty-something MB [3]. This is probably acceptable. Just keep an eye out for jobs that end up with much larger file sizes and we can reevaluate if we notice them.

[3] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_038/784964/1/check/tempest-multinode-full-py3/038bd51/controller/logs/index.html


From cboylan at sapwetik.org  Tue Apr  6 16:46:33 2021
From: cboylan at sapwetik.org (Clark Boylan)
Date: Tue, 06 Apr 2021 09:46:33 -0700
Subject: Re: [devstack][infra] POST_FAILURE on export-devstack-journal :
 Export journal
In-Reply-To: <CAKZ_x7_f-x3dqh1C38rbGZDt=nPNg1+9xtuHyOXLhfKVgxLndg@mail.gmail.com>
References: <CAKZGdE3zSKA3bsYj3z_fw+ypiriVPtiZNefDBwsz7ChLWiHd4w@mail.gmail.com>
 <20210406160247.gevud2hlvodg7jzt@yuggoth.org>
 <CAKZ_x7_f-x3dqh1C38rbGZDt=nPNg1+9xtuHyOXLhfKVgxLndg@mail.gmail.com>
Message-ID: <7626869f-dab3-41df-a40b-dafa20dcfaf4@www.fastmail.com>

On Tue, Apr 6, 2021, at 9:11 AM, Radosław Piliszek wrote:
> On Tue, Apr 6, 2021 at 6:02 PM Jeremy Stanley <fungi at yuggoth.org> wrote:
> > Looking at the error, I strongly suspect memory exhaustion. We could
> > try tuning xz to use less memory when compressing.

Worth noting that we continue to suspect memory pressure, and in particular diving into swap, for random failures that appear timing or performance related. I still think it would be a helpful exercise for OpenStack to look at its memory consumption (remember end users will experience this too) and see if there are any unexpected areas of memory use. I think the last time I skimmed logs the privsep daemon was a large consumer because a separate instance is run for each service and they all add up.

> 
> That was my hunch as well, hence why I test using gzip.
> 
> On Tue, Apr 6, 2021 at 5:51 PM Clark Boylan <cboylan at sapwetik.org> wrote:
> >
> > On Tue, Apr 6, 2021, at 8:14 AM, Radosław Piliszek wrote:
> > > I am testing whether replacing xz with gzip would solve the problem [1] [2].
> >
> > The reason we used xz is that the files are very large and gz compression is very poor compared to xz for these files and these files are not really human readable as is (you need to load them into journald first). Let's test it and see what the gz file sizes look like but if they are still quite large then this is unlikely to be an appropriate fix.
> 
> Let's see how bad the file sizes are.
> If they are acceptable, we can keep gzip and be happy.
> Otherwise we try to tune the params to make xz a better citizen as
> fungi suggested.
> 
> -yoctozepto
> 
>


From jim at acmegating.com  Wed Apr  7 01:55:27 2021
From: jim at acmegating.com (James E. Blair)
Date: Tue, 06 Apr 2021 18:55:27 -0700
Subject: Recent nodepool label changes
Message-ID: <87blaqn9io.fsf@fuligin>

Hi,

I recently spent some time trying to figure out why a job worked as
expected during one run and then failed due to limited memory on the
following run.  It turns out that back in February this change was
merged on an emergency basis, which caused us to start occasionally
providing nodes with 32G of ram instead of the typical 8G:

  https://review.opendev.org/773710

Nodepool labels are designed to represent the combination of an image
and set of resources.  To the best of our ability, the images and
resources they provide should be consistent across different cloud
providers.  That's why we use DIB to create consistent images and that's
why we use "-expanded" labels to request nodes with additional memory.
It's also the case that when we add new clouds, we generally try to
benchmark performance and adjust flavors as needed.

Unfortunately, providing such disparate resources under the same
Nodepool labels makes it impossible for job authors to reliably design
jobs.

To be clear, it's fine to provide resources of varying size, we just
need to use different Nodepool labels for them so that job authors get
what they're asking for.

The last time we were in this position, we updated our Nodepool images
to add the mem= Linux kernel command line parameter in order to limit
the total available RAM.  I suspect that is still possible, but due to
the explosion of images and flavors, doing so will be considerably more
difficult this time.

We now also have the ability to reboot nodes in jobs after they come
online, but doing that would add additional run time for every job.

I believe we need to address this.  Despite the additional work, it
seems like the "mem=" approach is our best bet; unless anyone has other
ideas?

-Jim


From cboylan at sapwetik.org  Wed Apr  7 16:20:55 2021
From: cboylan at sapwetik.org (Clark Boylan)
Date: Wed, 07 Apr 2021 09:20:55 -0700
Subject: Recent nodepool label changes
In-Reply-To: <87blaqn9io.fsf@fuligin>
References: <87blaqn9io.fsf@fuligin>
Message-ID: <ec3bbf85-587b-4fec-9175-c5c3fcc9d667@www.fastmail.com>

On Tue, Apr 6, 2021, at 6:55 PM, James E. Blair wrote:
> Hi,
> 
> I recently spent some time trying to figure out why a job worked as
> expected during one run and then failed due to limited memory on the
> following run.  It turns out that back in February this change was
> merged on an emergency basis, which caused us to start occasionally
> providing nodes with 32G of ram instead of the typical 8G:
> 
>   https://review.opendev.org/773710
> 
> Nodepool labels are designed to represent the combination of an image
> and set of resources.  To the best of our ability, the images and
> resources they provide should be consistent across different cloud
> providers.  That's why we use DIB to create consistent images and that's
> why we use "-expanded" labels to request nodes with additional memory.
> It's also the case that when we add new clouds, we generally try to
> benchmark performance and adjust flavors as needed.
> 
> Unfortunately, providing such disparate resources under the same
> Nodepool labels makes it impossible for job authors to reliably design
> jobs.
> 
> To be clear, it's fine to provide resources of varying size, we just
> need to use different Nodepool labels for them so that job authors get
> what they're asking for.
> 
> The last time we were in this position, we updated our Nodepool images
> to add the mem= Linux kernel command line parameter in order to limit
> the total available RAM.  I suspect that is still possible, but due to
> the explosion of images and flavors, doing so will be considerably more
> difficult this time.
> 
> We now also have the ability to reboot nodes in jobs after they come
> online, but doing that would add additional run time for every job.
> 
> I believe we need to address this.  Despite the additional work, it
> seems like the "mem=" approach is our best bet; unless anyone has other
> ideas?

This change was made at the request of mnaser to better support resource allocation in vexxhost (the flavors we use now use their standard ratio for memory:cpu). One (likely bad) option would be to select a flavor based on memory rather than cpu count. In this case I think we would go from 8vcpu + 32GB memory to 2vcpu + 8GB of memory.

At the time I was surprised the change merged so quickly and asked if anyone was starting work on setting the kernel boot parameters again:

  http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2021-02-02.log.html#t2021-02-02T18:04:23

I suspect that the kernel limit is our best option. We can set this via DIB_BOOTLOADER_DEFAULT_CMDLINE [0], which I expect will work in many cases across the various distros. The problem with this approach is that we would need different images for the places we want to boot with more memory (the -expanded labels for example).
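
As a minimal sketch of what the per-image value could look like, the helper below appends a mem= argument to a base command line, replacing any existing one. The helper name is hypothetical and the base arguments are placeholders, not DIB's real defaults; only the DIB_BOOTLOADER_DEFAULT_CMDLINE variable and the kernel's mem= parameter come from the discussion above.

```python
def with_mem_limit(base_cmdline: str, limit_gib: int) -> str:
    """Return base_cmdline with any existing mem= argument replaced."""
    kept = [arg for arg in base_cmdline.split() if not arg.startswith("mem=")]
    kept.append(f"mem={limit_gib}G")
    return " ".join(kept)


# Placeholder base arguments; check the bootloader element for the real defaults.
value = with_mem_limit("console=ttyS0 mem=4G", 8)
print(f"DIB_BOOTLOADER_DEFAULT_CMDLINE={value!r}")
```

An image built for the regular labels would use the 8G value, while an image for the -expanded labels would pass a larger limit or leave mem= out entirely, which is where the extra images come from.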

For completeness other possibilities are:
 * Convince the clouds that the nova flavor is the best place to control this and set them appropriately
 * Don't use clouds that can't set appropriate flavors
 * Accept Fungi's argument in the IRC log above and accept that memory as with other resources like disk iops and network will be variable
 * Kernel module that inspects some attribute at boot time and sets mem appropriately

[0] 
https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/bootloader/README.rst

> 
> -Jim


From smooney at redhat.com  Wed Apr  7 16:30:28 2021
From: smooney at redhat.com (Sean Mooney)
Date: Wed, 7 Apr 2021 17:30:28 +0100
Subject: Recent nodepool label changes
In-Reply-To: <ec3bbf85-587b-4fec-9175-c5c3fcc9d667@www.fastmail.com>
References: <87blaqn9io.fsf@fuligin>
 <ec3bbf85-587b-4fec-9175-c5c3fcc9d667@www.fastmail.com>
Message-ID: <41ed1949-f638-eee1-2421-9840750a5c01@redhat.com>


On 07/04/2021 17:20, Clark Boylan wrote:
> On Tue, Apr 6, 2021, at 6:55 PM, James E. Blair wrote:
>> Hi,
>>
>> I recently spent some time trying to figure out why a job worked as
>> expected during one run and then failed due to limited memory on the
>> following run.  It turns out that back in February this change was
>> merged on an emergency basis, which caused us to start occasionally
>> providing nodes with 32G of ram instead of the typical 8G:
>>
>>    https://review.opendev.org/773710
>>
>> Nodepool labels are designed to represent the combination of an image
>> and set of resources.  To the best of our ability, the images and
>> resources they provide should be consistent across different cloud
>> providers.  That's why we use DIB to create consistent images and that's
>> why we use "-expanded" labels to request nodes with additional memory.
>> It's also the case that when we add new clouds, we generally try to
>> benchmark performance and adjust flavors as needed.
>>
>> Unfortunately, providing such disparate resources under the same
>> Nodepool labels makes it impossible for job authors to reliably design
>> jobs.
>>
>> To be clear, it's fine to provide resources of varying size, we just
>> need to use different Nodepool labels for them so that job authors get
>> what they're asking for.
>>
>> The last time we were in this position, we updated our Nodepool images
>> to add the mem= Linux kernel command line parameter in order to limit
>> the total available RAM.  I suspect that is still possible, but due to
>> the explosion of images and flavors, doing so will be considerably more
>> difficult this time.
>>
>> We now also have the ability to reboot nodes in jobs after they come
>> online, but doing that would add additional run time for every job.
>>
>> I believe we need to address this.  Despite the additional work, it
>> seems like the "mem=" approach is our best bet; unless anyone has other
>> ideas?
> This change was made at the request of mnaser to better support resource allocation in vexxhost (the flavors we use now use their standard ratio for memory:cpu). One (likely bad) option would be to select a flavor based on memory rather than cpu count. In this case I think we would go from 8vcpu + 32GB memory to 2vcpu + 8GB of memory.
>
> At the time I was surprised the change merged so quickly and asked if anyone was starting work on setting the kernel boot parameters again:
>
>    http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2021-02-02.log.html#t2021-02-02T18:04:23
>
> I suspect that the kernel limit is our best option. We can set this via DIB_BOOTLOADER_DEFAULT_CMDLINE [0] which i expect will work in many cases across the various distros. The problem with this approach is that we would need different images for the places we want to boot with more memory (the -expanded labels for example).
>
> For completeness other possibilities are:
>   * Convince the clouds that the nova flavor is the best place to control this and set them appropriately
>   * Don't use clouds that can't set appropriate flavors
>   * Accept Fungi's argument in the IRC log above and accept that memory as with other resources like disk iops and network will be variable
>   * Kernel module that inspects some attribute at boot time and sets mem appropriately
I'm not sure why allowing VMs to have 32GB of RAM is an issue.
As job authors we should basically tailor our jobs to fit the minimum
available, and if we get more RAM then that's a bonus.
We should not be writing jobs, tempest jobs in particular, in such a way
that more RAM would break things, outside of very specific jobs.
For example, the whitebox tempest plugin, which literally SSHes into the
host VMs to validate things in the libvirt XML, makes some assumptions
about the env, but I would consider it a bug in our plugin if it could
not work with more RAM.

With less RAM we may have issues, but more should not break any of our
tests, or we should fix them.

I think we should be able to just have the vexxhost flavor labeled twice:
once with the normal labels and once with the -expanded one.

I would hope that we do not go down the path of hardcoding a kernel mem
limit of 8G for all labels.
It seems very wasteful to me to boot a 32G VM and only use 8G of it.
>
> [0]
> https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/bootloader/README.rst
>
>> -Jim


From fungi at yuggoth.org  Wed Apr  7 16:39:46 2021
From: fungi at yuggoth.org (Jeremy Stanley)
Date: Wed, 7 Apr 2021 16:39:46 +0000
Subject: Recent nodepool label changes
In-Reply-To: <ec3bbf85-587b-4fec-9175-c5c3fcc9d667@www.fastmail.com>
References: <87blaqn9io.fsf@fuligin>
 <ec3bbf85-587b-4fec-9175-c5c3fcc9d667@www.fastmail.com>
Message-ID: <20210407163945.mjcz7l75kimktxed@yuggoth.org>

On 2021-04-07 09:20:55 -0700 (-0700), Clark Boylan wrote:
[...]
> This change was made at the request of mnaser to better support
> resource allocation in vexxhost (the flavors we use now use their
> standard ratio for memory:cpu). One (likely bad) option would be
> to select a flavor based on memory rather than cpu count. In this
> case I think we would go from 8vcpu + 32GB memory to 2vcpu + 8GB
> of memory.
> 
> At the time I was surprised the change merged so quickly
[...]

Based on the commit message and the fact that we were pinged in IRC
to review, I got the impression it was relatively urgent.

> I suspect that the kernel limit is our best option. We can set
> this via DIB_BOOTLOADER_DEFAULT_CMDLINE [0] which I expect will
> work in many cases across the various distros. The problem with
> this approach is that we would need different images for the
> places we want to boot with more memory (the -expanded labels for
> example).
> 
> For completeness other possibilities are:
>  * Convince the clouds that the nova flavor is the best place to
>    control this and set them appropriately
>  * Don't use clouds that can't set appropriate flavors
>  * Accept Fungi's argument in the IRC log above and accept that
>    memory as with other resources like disk iops and network will be
>    variable

To be clear, this was mostly a "devil's advocate" argument, and not
really my opinion. We saw first hand that disparate memory sizing in
HPCloud was allowing massive memory usage jumps to merge in
OpenStack, and took action back then to artificially limit the
available memory at boot. We now have fresh evidence from the Zuul
community that this hasn't ceased to be a problem. On the other
hand, we also see projects merge changes which significantly
increase disk utilization and then can't run on some environments
where we get smaller disks (or depend on having multiple network
interfaces, or specific addressing schemes, or certain CPU flags,
or...), so the heterogeneity problem isn't limited exclusively to
memory.
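
As a rough illustration of the boot-time limit discussed above, the
sketch below builds a kernel command line carrying a single mem=
argument, suitable as a value for DIB_BOOTLOADER_DEFAULT_CMDLINE. The
helper name and the base arguments are hypothetical; the real defaults
depend on the image elements in use.

```python
def with_mem_limit(cmdline: str, limit: str) -> str:
    """Return cmdline with exactly one mem=<limit> argument."""
    # Drop any mem= argument already present so the new limit is the only one.
    kept = [arg for arg in cmdline.split() if not arg.startswith("mem=")]
    kept.append(f"mem={limit}")
    return " ".join(kept)


# Hypothetical base arguments, chosen only for illustration.
base_cmdline = "console=ttyS0 console=tty0"
print("DIB_BOOTLOADER_DEFAULT_CMDLINE=" + repr(with_mem_limit(base_cmdline, "8G")))
```

Images for labels that should keep the full memory (the -expanded ones,
for example) would need to be built without the limit, which is the
extra image maintenance mentioned in the quoted text.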

>  * Kernel module that inspects some attribute at boot time and
>    sets mem appropriately
[...]

Not to downplay the value of the donated resources, because they
really are very much appreciated, but these currently account for
less than 5% of our aggregate node count so having to maintain
multiple nearly identical images or doing a lot of additional
engineering work seems like it may outweigh any immediate benefits.
With the increasing use of special node labels like expanded,
nested-virt and NUMA, it might make more sense to just limit this
region to not supplying standard nodes, which sidesteps the problem
for now.
-- 
Jeremy Stanley

From jim at acmegating.com  Wed Apr  7 17:33:22 2021
From: jim at acmegating.com (James E. Blair)
Date: Wed, 07 Apr 2021 10:33:22 -0700
Subject: Recent nodepool label changes
In-Reply-To: <41ed1949-f638-eee1-2421-9840750a5c01@redhat.com> (Sean Mooney's
 message of "Wed, 7 Apr 2021 17:30:28 +0100")
References: <87blaqn9io.fsf@fuligin>
 <ec3bbf85-587b-4fec-9175-c5c3fcc9d667@www.fastmail.com>
 <41ed1949-f638-eee1-2421-9840750a5c01@redhat.com>
Message-ID: <877dlem23h.fsf@fuligin>

Sean Mooney <smooney at redhat.com> writes:

> I'm not sure why allowing VMs to have 32GB of RAM is an issue.
> As job authors we should basically tailor our jobs to fit the minimum
> available, and if we get more RAM then that's a bonus.
> We should not be writing jobs, tempest jobs in particular, in such a
> way that more RAM would break things, outside of very specific jobs.
> For example, the whitebox tempest plugin, which literally SSHes into
> the host VMs to validate things in the libvirt XML, makes some
> assumptions about the env, but I would consider it a bug in our plugin
> if it could not work with more RAM.

I tried really hard to make it clear I have no problem with the idea
that we could have flavors with more ram.  I absolutely don't object to
that.

What I am saying is that there is definitely a problem with using a
label that has different amounts of ram in different providers.  It
causes jobs to behave differently.  Jobs that pass in one provider will
fail in another because of the ram difference.  I agree with you that as
job authors we should tailor our jobs to fit the minimum available ram.
The problem is that is nearly impossible if Nodepool randomly gives us
nodes with more ram.  We won't realize we have exceeded the minimum ram
until we hit a job on a provider with less ram after having exceeded it
on a provider with more ram.  This is not a theoretical issue -- you are
reading this message because I hit this problem after two test runs on a
recently started project.

> With less RAM we may have issues, but more should not break any of our
> tests, or we should fix them.

There is an inherent contradiction in saying that more ram is okay but
less ram is not.  They are two sides of the same coin.  A job will not
break because it had more ram the first time, it will break because it
had less ram the second time.

The fundamental issue is that a Nodepool label describes an image plus a
flavor.  That flavor must be as consistent as possible across providers
if we expect job authors to be able to write predictable jobs.
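
To make the consistency requirement concrete, here is a minimal sketch
that flags labels whose RAM differs between providers. The provider
names and all RAM figures are made up for illustration, and the data
layout is an assumption, not Nodepool's actual configuration format.

```python
from typing import Dict, List, Set


def inconsistent_labels(providers: Dict[str, Dict[str, int]]) -> Dict[str, List[int]]:
    """Map each label to its sorted distinct RAM sizes (GB), keeping only
    labels whose RAM differs between providers."""
    seen: Dict[str, Set[int]] = {}
    for labels in providers.values():
        for label, ram_gb in labels.items():
            seen.setdefault(label, set()).add(ram_gb)
    return {label: sorted(sizes) for label, sizes in seen.items() if len(sizes) > 1}


# Hypothetical providers: the same "standard" label maps to 8GB and 32GB.
providers = {
    "provider-a": {"standard": 8, "expanded": 16},
    "provider-b": {"standard": 32, "expanded": 16},
}
print(inconsistent_labels(providers))  # {'standard': [8, 32]}
```

A job developed against the 32GB nodes can pass there and then fail on
the 8GB nodes, which is the recheck scenario described above.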

> It seems very wasteful to me to boot a 32G VM and only use 8G of it.

It may seem that way, but the infrastructure provider has told us that
they have tuned their hardware purchases to that ratio of CPU/RAM, and
so we're helping out by doing this.

The more wasteful thing is people issuing rechecks because their jobs
pass in some providers and not others.

-Jim


From iwienand at redhat.com  Thu Apr  8 05:43:33 2021
From: iwienand at redhat.com (Ian Wienand)
Date: Thu, 8 Apr 2021 15:43:33 +1000
Subject: Next steps with new review server
In-Reply-To: <cb809d21-2f12-4178-9be2-41711fde8679@www.fastmail.com>
References: <YGUvhGsmF+4wDRJw@fedora19.localdomain>
 <a4bbef80-de4c-497f-805a-50f4a45fb313@www.fastmail.com>
 <cb809d21-2f12-4178-9be2-41711fde8679@www.fastmail.com>
Message-ID: <YG6YBS9/F7kY9h+T@fedora19.localdomain>

On Thu, Apr 01, 2021 at 02:35:32PM -0700, Clark Boylan wrote:
> I ended up double checking the mirror node and in
> mirror.ca-ymq-1.vexxhost.opendev.org:/etc/netplan/50-cloud-init.yaml
> you can see what we did there. Essentially we set dhcpv6 and
> accept-ra to false then set an address and routes. We should be able
> to do the same thing with the new review host if we can't figure
> anything else out.

> [3] https://launchpad.net/bugs/1844712

So we have a workaround in production, but [3] is also marked as an
open security bug.

Are we happy that ignoring RAs is sufficient to overcome the issues
discussed in [3] for this service?  The concern mostly seemed to be a
targeted MITM attack, something which SSH host keys and SSL
certificates should cover?

-i


From fungi at yuggoth.org  Thu Apr  8 19:48:35 2021
From: fungi at yuggoth.org (Jeremy Stanley)
Date: Thu, 8 Apr 2021 19:48:35 +0000
Subject: Next steps with new review server
In-Reply-To: <YG6YBS9/F7kY9h+T@fedora19.localdomain>
References: <YGUvhGsmF+4wDRJw@fedora19.localdomain>
 <a4bbef80-de4c-497f-805a-50f4a45fb313@www.fastmail.com>
 <cb809d21-2f12-4178-9be2-41711fde8679@www.fastmail.com>
 <YG6YBS9/F7kY9h+T@fedora19.localdomain>
Message-ID: <20210408194835.ma5xr6cm5enegnab@yuggoth.org>

On 2021-04-08 15:43:33 +1000 (+1000), Ian Wienand wrote:
> On Thu, Apr 01, 2021 at 02:35:32PM -0700, Clark Boylan wrote:
> > I ended up double checking the mirror node and in
> > mirror.ca-ymq-1.vexxhost.opendev.org:/etc/netplan/50-cloud-init.yaml
> > you can see what we did there. Essentially we set dhcpv6 and
> > accept-ra to false then set an address and routes. We should be able
> > to do the same thing with the new review host if we can't figure
> > anything else out.
> 
> > [3] https://launchpad.net/bugs/1844712
> 
> So we have a workaround in production, but [3] is also marked as an
> open security bug.
> 
> Are we happy that ignoring RAs is sufficient to overcome the issues
> discussed in [3] for this service?  The concern mostly seemed to be a
> targeted MITM attack, something which SSH host keys and SSL
> certificates should cover?

Yes, I think ignoring RAs is probably sufficient. Nobody seems to
have yet figured out how the leak happens or what else could be
leaked, but as you note the fact that a MitM couldn't usefully
spoof a viable HTTPS or SSH connection endpoint is sufficient
insurance against anything worse, so we can just focus on mitigating
the stability problem arising from stray leaks for now.
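
For reference, the workaround quoted above (DHCPv6 and RA acceptance
off, with a static address and route) might look roughly like the
structure built below. This is only a sketch: the interface name and
the addresses (taken from the 2001:db8::/32 documentation range) are
placeholders, not the values on the mirror, and it prints JSON purely to
avoid a third-party YAML dependency.

```python
import json


def static_ipv6_netplan(interface: str, address: str, gateway: str) -> dict:
    """Build a netplan-style mapping that ignores RAs and DHCPv6 and
    configures the IPv6 address and default route statically."""
    return {
        "network": {
            "version": 2,
            "ethernets": {
                interface: {
                    "dhcp6": False,       # no DHCPv6
                    "accept-ra": False,   # ignore router advertisements
                    "addresses": [address],
                    "routes": [{"to": "::/0", "via": gateway}],
                }
            },
        }
    }


# Placeholder interface and documentation-range addresses.
config = static_ipv6_netplan("ens3", "2001:db8::10/64", "2001:db8::1")
print(json.dumps(config, indent=2))
```

With RAs ignored, a stray advertisement can no longer add extra
addresses or default routes, at the cost of having to maintain the
address and gateway by hand.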
-- 
Jeremy Stanley

From fungi at yuggoth.org  Sun Apr 11 15:13:11 2021
From: fungi at yuggoth.org (Jeremy Stanley)
Date: Sun, 11 Apr 2021 15:13:11 +0000
Subject: Recent nodepool label changes
In-Reply-To: <20210407163945.mjcz7l75kimktxed@yuggoth.org>
References: <87blaqn9io.fsf@fuligin>
 <ec3bbf85-587b-4fec-9175-c5c3fcc9d667@www.fastmail.com>
 <20210407163945.mjcz7l75kimktxed@yuggoth.org>
Message-ID: <20210411151311.p5fyft6m34stqlf4@yuggoth.org>

On 2021-04-07 16:39:46 +0000 (+0000), Jeremy Stanley wrote:
[...]
> With the increasing use of special node labels like expanded,
> nested-virt and NUMA, it might make more sense to just limit this
> region to not supplying standard nodes, which sidesteps the problem
> for now.

I've proposed WIP change https://review.opendev.org/785769 as a
straw man for this solution.
-- 
Jeremy Stanley

From cboylan at sapwetik.org  Mon Apr 12 23:12:55 2021
From: cboylan at sapwetik.org (Clark Boylan)
Date: Mon, 12 Apr 2021 16:12:55 -0700
Subject: Team Meeting Agenda for April 13, 2021
Message-ID: <2715073b-53c2-461e-a942-bbfa8a1e638a@www.fastmail.com>

We will meet with this agenda on April 13, 2021 at 19:00 UTC in #opendev-meeting:

== Agenda for next meeting ==

* Announcements
** OpenStack completing release April 14. Airship 2.0 doesn't seem to exist yet so will assume they are still working on it.

* Actions from last meeting

* Specs approval

* Priority Efforts (Standing meeting agenda items. Please expand if you have subtopics.)
** [http://specs.openstack.org/openstack-infra/infra-specs/specs/update-config-management.html Update Config Management]
*** topic:update-cfg-mgmt
*** Zuul as CD engine
** OpenDev
*** Gerrit upgrade to 3.2.8
**** https://review.opendev.org/c/opendev/system-config/+/784152
*** Gerrit account inconsistencies
**** All "preferred email lacks external id" issues have been corrected. All group loops have been corrected.
**** Workaround is we can stop Gerrit, push to external ids directly, reindex accounts (and groups?), start gerrit, then clear accounts caches (and groups caches?)
**** Next steps
***** ~224 accounts were cleaned up. Next batch of ~56 has been started. Will clean their external IDs after letting the retired users sit for a few days.
***** Email sent to two Third Party CI groups about correcting external id conflicts among their accounts. These accounts will not be retired (for the most part).
*** Configuration tuning
**** Reduce the number of ssh threads. Possibly create bot/batch user groups and thread counts as part of this.
**** https://groups.google.com/g/repo-discuss/c/BQKxAfXBXuo Upstream conversation with people struggling with similar problems.

* General topics
** Picking up steam on Puppet -> Ansible rewrites (clarkb 20210413)
*** Enable Xenial -> Bionic/Focal system upgrades
*** https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades Start capturing TODO list here
*** Zuul service host updates in progress now. Scheduler and Zookeeper cluster remaining. Will focus on ZK first.
**** https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2021 discussion of options for zk upgrades
** planet.openstack.org (ianw 20210413)
*** Strong preference from clarkb to retire it
*** Superuser appears to be a major blog showing up there as well as a couple of others. Maybe we reach out to them and double check they don't want to help? (fungi and clarkb reached out to Superuser and they seem ok)
** survey.openstack.org (clarkb 20210413)
*** Can we go ahead and clean this service up? I don't think it ever got much use (maybe one or two surveys total).
** docs-old volume cleanup (ianw 20210413)
*** We were going to double check with Ajaeger if we can then proceed to cleanup if no one had a reason to keep it.
** PTG Planning (clarkb 20210413)
*** Next PTG April 19-23
**** Thursday April 22 1400-1600UTC and 2200-0000UTC

* Open discussion


From cboylan at sapwetik.org  Tue Apr 13 16:47:43 2021
From: cboylan at sapwetik.org (Clark Boylan)
Date: Tue, 13 Apr 2021 09:47:43 -0700
Subject: Join OpenDev at the Project Teams Gathering
Message-ID: <1bd7a8d9-9796-464b-a34f-5f315ba8e974@www.fastmail.com>

The PTG is next week, and OpenDev is participating alongside the OpenStack TaCT SIG. We are going to try something a bit different this time around, which is to treat the time as office hours rather than time for our own projects. We will be meeting on April 22 from 14:00 - 16:00 UTC and 22:00 - 00:00 UTC in https://meetpad.opendev.org/apr2021-ptg-opendev.

Join us if you would like to:

 * Start contributing to either OpenDev or the TaCT sig.
 * Debug a particular job problem.
 * Learn how to write and review Zuul jobs and related configs.
 * Learn about specific services or how they are deployed.
 * And anything else related to OpenDev and our project infrastructure.

Feel free to add your topics and suggest preferred times for those topics here: https://etherpad.opendev.org/p/apr2021-ptg-opendev. This etherpad corresponds to the document that will be auto loaded in our meetpad room above.

I will also be around next week and will try to keep a flexible schedule. Feel free to reach out if you would like us to join discussions as they happen.

See you there,
Clark


From iwienand at redhat.com  Fri Apr 23 05:07:45 2021
From: iwienand at redhat.com (Ian Wienand)
Date: Fri, 23 Apr 2021 15:07:45 +1000
Subject: Debian bullsye image Ansible detection
Message-ID: <YIJWId3jSs9wIjmF@fedora19.localdomain>

Hello,

In short, Ansible reports "n/a" for ansible_distribution_release on
our new bullseye nodes.  This screws up our mirror setup.  This has
turned into quite an adventure.

Currently, Debian is frozen to create the "bullseye" release.  This
means that "bullseye" is really an alias for "testing", that will turn
into the release after the freeze period.

So currently Debian bullseye reports itself in /etc/debian_version or
/etc/os-release as "bullseye/sid".  This sort of makes sense if you
consider that you don't commit things to "testing" directly, they go
into unstable ("sid") and then migrate after a period of stability.
So you can't have "base-files" package in bullseye that hasn't gone
through unstable/sid.  You can read "bullseye/sid" as "we've chosen
the name bullseye and packages going through unstable are destined for
it".

Now, you might see a problem in that "unstable" and "bullseye"
(testing) now both report themselves in these version files as the
same thing (because the unstable packages that provide them move into
testing).

"lsb_release -c" tries to be a bit smart about this, and looks at the
output of "apt-cache policy" to try and see if you are actually
pulling the .deb files from a bullseye repo or an unstable one.

Interestingly, this relies on a "Label" being present in the mirror
release files.  Since we use reprepro to build our own mirrors, we do
not have this (which is why nobody who doesn't use our mirrors seems
to notice this problem).  A fix is proposed with

 https://review.opendev.org/c/opendev/system-config/+/787661
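
To illustrate the dependency on the label, here is a simplified sketch
of reading the release fields that "apt-cache policy" prints and
checking for one. It is not lsb_release's actual code; the sample lines
are hypothetical, and it assumes a missing Label shows up as an absent
"l=" field.

```python
def parse_release_line(line: str) -> dict:
    """Split an 'apt-cache policy' release line into its key=value fields."""
    fields = line.strip().split(None, 1)
    if len(fields) != 2 or fields[0] != "release":
        return {}
    result = {}
    for item in fields[1].split(","):
        key, sep, value = item.partition("=")
        if sep:
            result[key] = value
    return result


def has_label(line: str, label: str = "Debian") -> bool:
    """True when the release line carries the expected l=<label> field."""
    return parse_release_line(line).get("l") == label


# Hypothetical sample lines: an upstream-style mirror and one without a label.
with_label = "release o=Debian,a=testing,n=bullseye,l=Debian,c=main,b=amd64"
without_label = "release o=Debian,a=testing,n=bullseye,c=main,b=amd64"
print(has_label(with_label), has_label(without_label))  # True False
```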

So "lsb_release -c" doesn't report anything, leaving Ansible in the
dark as to what repo it uses.

When "lsb_release -c" doesn't return anything helpful, Ansible tries
to do its own parsing of the release files.  I started hacking on
these, but the point raised in

 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=845651

gave me pause.  It is a fair point that you can not really know if
you're on bullseye or sid by examining these files.  N/A is probably
actually the correct answer from Ansible's POV.  Anyway, that is

 https://github.com/ianw/ansible/commit/847817a82ed86b5f39a4ccc3ffbff0e0cd63e8cc

Now, even more annoyingly, setting the label in our mirrors may not be
sufficient for "lsb_release -c" to work on our images, because we have
cleared out the apt repositories.  You would need to run "apt-get
update" before Ansible tries to run "lsb_release" to populate its
facts.  Now the problem is that we're trying to use Ansible's fact
about the distro name to set up apt to point to our mirrors -- so we
can't apt-get update before we have that written out!  Classic chicken
and egg.

The only other idea I have is to hack dib/early setup to overwrite
/etc/debian_version with "11.0" so that we look like the upcoming
release has already been done.  "lsb_release -c" will then report
"bullseye".  However, there is some possibility this will confuse other
things, as this release technically hasn't been done.  I've proposed
that with

 https://review.opendev.org/c/openstack/diskimage-builder/+/787665
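
A small sketch of why the overwrite helps: a numeric version can be
mapped to exactly one codename, while "bullseye/sid" cannot be told
apart from unstable.  This is an illustration only, not how lsb_release
or Ansible is implemented, and the function name and mapping table are
assumptions.

```python
import re

# Illustrative mapping; 11 -> bullseye is the pairing discussed in this thread.
CODENAMES = {"10": "buster", "11": "bullseye"}


def codename_from_debian_version(text: str):
    """Return a codename for a numeric /etc/debian_version, else None."""
    match = re.fullmatch(r"(\d+)(?:\.\d+)*", text.strip())
    if match:
        return CODENAMES.get(match.group(1))
    # Strings such as "bullseye/sid" are shared by testing and unstable,
    # so no single release can be inferred from them.
    return None


print(codename_from_debian_version("bullseye/sid"))  # None
print(codename_from_debian_version("11.0"))          # bullseye
```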

I'm open to suggestions!

-i


From fungi at yuggoth.org  Fri Apr 23 12:16:46 2021
From: fungi at yuggoth.org (Jeremy Stanley)
Date: Fri, 23 Apr 2021 12:16:46 +0000
Subject: Debian bullsye image Ansible detection
In-Reply-To: <YIJWId3jSs9wIjmF@fedora19.localdomain>
References: <YIJWId3jSs9wIjmF@fedora19.localdomain>
Message-ID: <20210423121645.7ndgdmz22zbrplvu@yuggoth.org>

On 2021-04-23 15:07:45 +1000 (+1000), Ian Wienand wrote:
> In short, Ansible reports "n/a" for ansible_distribution_release on
> our new bullseye nodes.  This screws up our mirror setup.  This has
> turned into quite an adventure.
> 
> Currently, Debian is frozen to create the "bullseye" release.  This
> means that "bullseye" is really an alias for "testing", that will turn
> into the release after the freeze period.
[...]

The irony is that `lsb_release -c` has been returning "bullseye" on
my sid machines for weeks, since base-files 11.1 was uploaded to
unstable (2021-04-10). The base-files in bullseye is still 11, but
I expect the current problem will sort itself out automatically once
11.1 migrates from unstable to testing:

    https://tracker.debian.org/pkg/base-files

Unfortunately, exactly *when* the release team will allow that is
unclear (at least to me).
-- 
Jeremy Stanley

From radoslaw.piliszek at gmail.com  Fri Apr 23 13:17:23 2021
From: radoslaw.piliszek at gmail.com (=?UTF-8?Q?Rados=C5=82aw_Piliszek?=)
Date: Fri, 23 Apr 2021 15:17:23 +0200
Subject: Debian bullsye image Ansible detection
In-Reply-To: <20210423121645.7ndgdmz22zbrplvu@yuggoth.org>
References: <YIJWId3jSs9wIjmF@fedora19.localdomain>
 <20210423121645.7ndgdmz22zbrplvu@yuggoth.org>
Message-ID: <CAKZ_x7_4qTk9-tdCq+SpCE6k3sP7dxWz7ggCGHyVtMiYEz=c3g@mail.gmail.com>

On Fri, Apr 23, 2021 at 2:17 PM Jeremy Stanley <fungi at yuggoth.org> wrote:
>
> On 2021-04-23 15:07:45 +1000 (+1000), Ian Wienand wrote:
> > In short, Ansible reports "n/a" for ansible_distribution_release on
> > our new bullseye nodes.  This screws up our mirror setup.  This has
> > turned into quite an adventure.
> >
> > Currently, Debian is frozen to create the "bullseye" release.  This
> > means that "bullseye" is really an alias for "testing", that will turn
> > into the release after the freeze period.
> [...]
>
> The irony is that `lsb_release -c` has been returning "bullseye" on
> my sid machines for weeks, since base-files 11.1 was uploaded to
> unstable (2021-04-10).

Well, I guess it means that Ian's hack would be more than acceptable. ;-)

-yoctozepto


From fungi at yuggoth.org  Fri Apr 23 13:32:27 2021
From: fungi at yuggoth.org (Jeremy Stanley)
Date: Fri, 23 Apr 2021 13:32:27 +0000
Subject: Debian bullsye image Ansible detection
In-Reply-To: <CAKZ_x7_4qTk9-tdCq+SpCE6k3sP7dxWz7ggCGHyVtMiYEz=c3g@mail.gmail.com>
References: <YIJWId3jSs9wIjmF@fedora19.localdomain>
 <20210423121645.7ndgdmz22zbrplvu@yuggoth.org>
 <CAKZ_x7_4qTk9-tdCq+SpCE6k3sP7dxWz7ggCGHyVtMiYEz=c3g@mail.gmail.com>
Message-ID: <20210423133226.qfllhnvrd3hxx4ur@yuggoth.org>

On 2021-04-23 15:17:23 +0200 (+0200), Radosław Piliszek wrote:
[...]
> Well, I guess it means that Ian's hack would be more than acceptable. ;-)

Or just fetch base-files 11.1 in from sid temporarily in our
infra-package-needs element until it migrates into bullseye.
-- 
Jeremy Stanley

From radoslaw.piliszek at gmail.com  Fri Apr 23 14:36:05 2021
From: radoslaw.piliszek at gmail.com (=?UTF-8?Q?Rados=C5=82aw_Piliszek?=)
Date: Fri, 23 Apr 2021 16:36:05 +0200
Subject: Debian bullsye image Ansible detection
In-Reply-To: <20210423133226.qfllhnvrd3hxx4ur@yuggoth.org>
References: <YIJWId3jSs9wIjmF@fedora19.localdomain>
 <20210423121645.7ndgdmz22zbrplvu@yuggoth.org>
 <CAKZ_x7_4qTk9-tdCq+SpCE6k3sP7dxWz7ggCGHyVtMiYEz=c3g@mail.gmail.com>
 <20210423133226.qfllhnvrd3hxx4ur@yuggoth.org>
Message-ID: <CAKZ_x7_vrMKdPDnsVT4pXRQZ_d+6gUFT4hv0h50pjcoDg1skBA@mail.gmail.com>

On Fri, Apr 23, 2021 at 3:32 PM Jeremy Stanley <fungi at yuggoth.org> wrote:
>
> On 2021-04-23 15:17:23 +0200 (+0200), Radosław Piliszek wrote:
> [...]
> > Well, I guess it means that Ian's hack would be more than acceptable. ;-)
>
> Or just fetch base-files 11.1 in from sid temporarily in our
> infra-package-needs element until it migrates into bullseye.

Works for me.

-yoctozepto


From cboylan at sapwetik.org  Mon Apr 26 23:18:25 2021
From: cboylan at sapwetik.org (Clark Boylan)
Date: Mon, 26 Apr 2021 16:18:25 -0700
Subject: Team Meeting Agenda for April 27, 2021
Message-ID: <2edff236-f1bc-4644-870a-f6ad0dd2b1d0@www.fastmail.com>

We will meet on April 27, 2021 at 19:00UTC in #opendev-meeting with this agenda:

== Agenda for next meeting ==

* Announcements

* Actions from last meeting

* Specs approval

* Priority Efforts (Standing meeting agenda items. Please expand if you have subtopics.)
** [http://specs.openstack.org/openstack-infra/infra-specs/specs/update-config-management.html Update Config Management]
*** topic:update-cfg-mgmt
*** Zuul as CD engine
** OpenDev
*** Gerrit account inconsistencies
**** All "preferred email lacks external id" issues have been corrected. All group loops have been corrected.
**** Workaround is we can stop Gerrit, push to external ids directly, reindex accounts (and groups?), start gerrit, then clear accounts caches (and groups caches?)
**** Next steps
***** More "dangerous" list has been generated. Should still be safe-ish particularly if we disable the accounts first.
*** Configuration tuning
**** Reduce the number of ssh threads. Possibly create bot/batch user groups and thread counts as part of this.
**** https://groups.google.com/g/repo-discuss/c/BQKxAfXBXuo Upstream conversation with people struggling with similar problems.

* General topics
** Picking up steam on Puppet -> Ansible rewrites (clarkb 20210427)
*** Enable Xenial -> Bionic/Focal system upgrades
*** https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades Start capturing TODO list here
*** Zuul service host updates in progress now. Scheduler and Zookeeper cluster remaining. Will focus on ZK first.
**** https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2021 discussion of options for zk upgrades
** survey.openstack.org (clarkb 20210427)
*** We're getting friendly reminders that this SSL cert is about to expire. Would be good to clean up.
** Debian Bullseye Images (clarkb 20210427)
*** Need some DIB updates to hack around Debian versioning and Ansible's factorizing of that info.
** Minor git-review release to support --no-thin (clarkb 20210427)

* Open discussion