Recent nodepool label changes
Hi,

I recently spent some time trying to figure out why a job worked as expected during one run and then failed due to limited memory on the following run. It turns out that back in February this change was merged on an emergency basis, which caused us to start occasionally providing nodes with 32G of ram instead of the typical 8G:

https://review.opendev.org/773710

Nodepool labels are designed to represent the combination of an image and set of resources. To the best of our ability, the images and resources they provide should be consistent across different cloud providers. That's why we use DIB to create consistent images and that's why we use "-expanded" labels to request nodes with additional memory. It's also the case that when we add new clouds, we generally try to benchmark performance and adjust flavors as needed.

Unfortunately, providing such disparate resources under the same Nodepool labels makes it impossible for job authors to reliably design jobs.

To be clear, it's fine to provide resources of varying size, we just need to use different Nodepool labels for them so that job authors get what they're asking for.

The last time we were in this position, we updated our Nodepool images to add the mem= Linux kernel command line parameter in order to limit the total available RAM. I suspect that is still possible, but due to the explosion of images and flavors, doing so will be considerably more difficult this time.

We now also have the ability to reboot nodes in jobs after they come online, but doing that would add additional run time for every job.

I believe we need to address this. Despite the additional work, it seems like the "mem=" approach is our best bet; unless anyone has other ideas?

-Jim
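For illustration (the provider, label, and flavor names below are hypothetical, not the actual OpenDev configuration), this is roughly the shape of the Nodepool configuration being described: a label is defined once and then mapped to a flavor and diskimage in each provider, so the same label pointing at a much larger flavor in one provider is exactly the inconsistency above, while an "-expanded" label keeps the larger flavor distinct:

  labels:
    - name: ubuntu-focal
    - name: ubuntu-focal-expanded

  providers:
    - name: provider-a
      # cloud and diskimage settings elided for brevity
      pools:
        - name: main
          labels:
            - name: ubuntu-focal
              flavor-name: general-8gb       # 8vcpu / 8GB: the size jobs expect
              diskimage: ubuntu-focal
    - name: provider-b
      pools:
        - name: main
          labels:
            - name: ubuntu-focal
              flavor-name: general-32gb      # same label, 4x the RAM: the inconsistency
              diskimage: ubuntu-focal
            - name: ubuntu-focal-expanded
              flavor-name: general-32gb      # the larger flavor belongs under -expanded
              diskimage: ubuntu-focal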
On Tue, Apr 6, 2021, at 6:55 PM, James E. Blair wrote:
[...]
This change was made at the request of mnaser to better support resource allocation in vexxhost (the flavors we use now use their standard ratio for memory:cpu). One (likely bad) option would be to select a flavor based on memory rather than cpu count. In this case I think we would go from 8vcpu + 32GB memory to 2vcpu + 8GB of memory.

At the time I was surprised the change merged so quickly and asked if anyone was starting work on setting the kernel boot parameters again: http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2021-02-02.log....

I suspect that the kernel limit is our best option. We can set this via DIB_BOOTLOADER_DEFAULT_CMDLINE [0] which I expect will work in many cases across the various distros. The problem with this approach is that we would need different images for the places we want to boot with more memory (the -expanded labels for example).

For completeness other possibilities are:

* Convince the clouds that the nova flavor is the best place to control this and set them appropriately
* Don't use clouds that can't set appropriate flavors
* Accept Fungi's argument in the IRC log above and accept that memory, as with other resources like disk iops and network, will be variable
* Kernel module that inspects some attribute at boot time and sets mem appropriately

[0] https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_...
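For illustration only, a rough sketch of what that could look like in a builder diskimage definition (the image name is hypothetical, this assumes the bootloader element is what consumes the variable, and the exact command line would need testing per distro):

  diskimages:
    - name: ubuntu-focal
      # elements and other env-vars elided for brevity
      env-vars:
        # mem=8G caps the usable RAM at boot; any default cmdline arguments we
        # already rely on would need to be carried along here as well
        DIB_BOOTLOADER_DEFAULT_CMDLINE: "mem=8G"

The catch, as noted above, is that the -expanded labels would then need a second, otherwise identical image built without (or with a larger) mem= setting.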
On 07/04/2021 17:20, Clark Boylan wrote:
[...]

I'm not sure why there is an issue with allowing VMs to have 32GB of RAM. As job authors we should basically tailor our jobs to fit the minimum available, and if we get more RAM then that's a bonus. We should not be writing tempest jobs in particular in such a way that more RAM would break things, outside of very specific jobs. For example, the whitebox tempest plugin that literally SSHes into the host VMs to validate things in the libvirt XML makes some assumptions about the env, but I would consider it a bug in our plugin if it could not work with more RAM.

With less RAM we may have issues, but more should not break any of our tests, or we should fix them.

I think we should be able to just have the vexxhost flavor labelled twice: once with the normal labels and once with the -expanded one. I would hope that we do not go down the path of hardcoding a kernel mem limit to 8G for all labels; it seems very wasteful to me to boot a 32G VM and only use 8G of it.
Sean Mooney <smooney@redhat.com> writes:
I'm not sure why there is an issue with allowing VMs to have 32GB of RAM. As job authors we should basically tailor our jobs to fit the minimum available, and if we get more RAM then that's a bonus. We should not be writing tempest jobs in particular in such a way that more RAM would break things, outside of very specific jobs. For example, the whitebox tempest plugin that literally SSHes into the host VMs to validate things in the libvirt XML makes some assumptions about the env, but I would consider it a bug in our plugin if it could not work with more RAM.
I tried really hard to make it clear I have no problem with the idea that we could have flavors with more ram. I absolutely don't object to that. What I am saying is that there is definitely a problem with using a label that has different amounts of ram in different providers. It causes jobs to behave differently. Jobs that pass in one provider will fail in another because of the ram difference.

I agree with you that as job authors we should tailor our jobs to fit the minimum available ram. The problem is that this is nearly impossible if Nodepool randomly gives us nodes with more ram. We won't realize we have exceeded the minimum ram until we hit a job on a provider with less ram after having exceeded it on a provider with more ram. This is not a theoretical issue -- you are reading this message because I hit this problem after two test runs on a recently started project.
With less RAM we may have issues, but more should not break any of our tests, or we should fix them.
There is an inherent contradiction in saying that more ram is okay but less ram is not. They are two sides of the same coin. A job will not break because it had more ram the first time; it will break because it had less ram the second time.

The fundamental issue is that a Nodepool label describes an image plus a flavor. That flavor must be as consistent as possible across providers if we expect job authors to be able to write predictable jobs.
It seems very wasteful to me to boot a 32G VM and only use 8G of it.
It may seem that way, but the infrastructure provider has told us that they have tuned their hardware purchases to that ratio of CPU/RAM, and so we're helping out by doing this. The more wasteful thing is people issuing rechecks because their jobs pass in some providers and not others.

-Jim
On 2021-04-07 09:20:55 -0700 (-0700), Clark Boylan wrote: [...]
This change was made at the request of mnaser to better support resource allocation in vexxhost (the flavors we use now use their standard ratio for memory:cpu). One (likely bad) option would be to select a flavor based on memory rather than cpu count. In this case I think we would go from 8vcpu + 32GB memory to 2vcpu + 8GB of memory.
At the time I was surprised the change merged so quickly [...]
Based on the commit message and the fact that we were pinged in IRC to review, I got the impression it was relatively urgent.
I suspect that the kernel limit is our best option. We can set this via DIB_BOOTLOADER_DEFAULT_CMDLINE [0] which i expect will work in many cases across the various distros. The problem with this approach is that we would need different images for the places we want to boot with more memory (the -expanded labels for example).
For completeness other possibilities are:

* Convince the clouds that the nova flavor is the best place to control this and set them appropriately
* Don't use clouds that can't set appropriate flavors
* Accept Fungi's argument in the IRC log above and accept that memory, as with other resources like disk iops and network, will be variable
To be clear, this was mostly a "devil's advocate" argument, and not really my opinion. We saw first hand that disparate memory sizing in HPCloud was allowing massive memory usage jumps to merge in OpenStack, and took action back then to artificially limit the available memory at boot. We now have fresh evidence from the Zuul community that this hasn't ceased to be a problem.

On the other hand, we also see projects merge changes which significantly increase disk utilization and then can't run on some environments where we get smaller disks (or depend on having multiple network interfaces, or specific addressing schemes, or certain CPU flags, or...), so the heterogeneity problem isn't limited exclusively to memory.
* Kernel module that inspects some attribute at boot time and sets mem appropriately [...]
Not to downplay the value of the donated resources, because they really are very much appreciated, but these currently account for less than 5% of our aggregate node count so having to maintain multiple nearly identical images or doing a lot of additional engineering work seems like it may outweigh any immediate benefits. With the increasing use of special node labels like expanded, nested-virt and NUMA, it might make more sense to just limit this region to not supplying standard nodes, which sidesteps the problem for now.

--
Jeremy Stanley
On 2021-04-07 16:39:46 +0000 (+0000), Jeremy Stanley wrote: [...]
With the increasing use of special node labels like expanded, nested-virt and NUMA, it might make more sense to just limit this region to not supplying standard nodes, which sidesteps the problem for now.
I've proposed WIP change https://review.opendev.org/785769 as a straw man for this solution.

--
Jeremy Stanley
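For anyone skimming, a rough sketch of the shape that straw man takes (the provider, label, and flavor names below are made up; the actual change is the review above): the region's pool simply stops offering the standard labels and only provides the special-purpose ones, so generic jobs never land on the larger flavor:

  providers:
    - name: vexxhost-region
      # cloud and diskimage settings elided for brevity
      pools:
        - name: main
          labels:
            # standard 8GB labels removed from this pool entirely;
            # only labels whose jobs expect the larger flavor remain
            - name: ubuntu-focal-expanded
              flavor-name: standard-8vcpu-32gb   # hypothetical flavor name
              diskimage: ubuntu-focal
            - name: ubuntu-focal-nested-virt
              flavor-name: standard-8vcpu-32gb
              diskimage: ubuntu-focal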
participants (4): Clark Boylan, James E. Blair, Jeremy Stanley, Sean Mooney