Recent nodepool label changes

Clark Boylan cboylan at sapwetik.org
Wed Apr 7 16:20:55 UTC 2021


On Tue, Apr 6, 2021, at 6:55 PM, James E. Blair wrote:
> Hi,
> 
> I recently spent some time trying to figure out why a job worked as
> expected during one run and then failed due to limited memory on the
> following run.  It turns out that back in February this change was
> merged on an emergency basis, which caused us to start occasionally
> providing nodes with 32G of ram instead of the typical 8G:
> 
>   https://review.opendev.org/773710
> 
> Nodepool labels are designed to represent the combination of an image
> and set of resources.  To the best of our ability, the images and
> resources they provide should be consistent across different cloud
> providers.  That's why we use DIB to create consistent images and that's
> why we use "-expanded" labels to request nodes with additional memory.
> It's also the case that when we add new clouds, we generally try to
> benchmark performance and adjust flavors as needed.
> 
> Unfortunately, providing such disparate resources under the same
> Nodepool labels makes it impossible for job authors to reliably design
> jobs.
> 
> To be clear, it's fine to provide resources of varying size, we just
> need to use different Nodepool labels for them so that job authors get
> what they're asking for.
> 
> The last time we were in this position, we updated our Nodepool images
> to add the mem= Linux kernel command line parameter in order to limit
> the total available RAM.  I suspect that is still possible, but due to
> the explosion of images and flavors, doing so will be considerably more
> difficult this time.
> 
> We now also have the ability to reboot nodes in jobs after they come
> online, but doing that would add additional run time for every job.
> 
> I believe we need to address this.  Despite the additional work, it
> seems like the "mem=" approach is our best bet; unless anyone has other
> ideas?

This change was made at the request of mnaser to better support resource allocation in vexxhost (the flavors we now use follow their standard memory:CPU ratio). One (likely bad) option would be to select a flavor based on memory rather than CPU count. In that case I think we would go from 8 vCPU + 32GB of memory to 2 vCPU + 8GB of memory.
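To illustrate that trade-off (the flavor names and sizes below are hypothetical, not vexxhost's actual list):

  # Hypothetical flavor comparison; real values come from:
  openstack flavor list
  # | Name       |   RAM | VCPUs |
  # | standard-8 | 32768 |     8 |  <- roughly what we boot today
  # | standard-2 |  8192 |     2 |  <- a memory-matched alternative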

At the time I was surprised the change merged so quickly and asked if anyone was starting work on setting the kernel boot parameters again:

  http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2021-02-02.log.html#t2021-02-02T18:04:23

I suspect that the kernel limit is our best option. We can set this via DIB_BOOTLOADER_DEFAULT_CMDLINE [0], which I expect will work in many cases across the various distros. The problem with this approach is that we would need different images for the places we want to boot with more memory (the -expanded labels, for example).
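As a concrete sketch, the cap is just extra kernel command line content appended at image build time. The 8G value and the element list here are assumptions for illustration, not our actual build configuration:

  # "nofail" is the bootloader element's default cmdline; keep it and
  # append a mem= cap so the kernel only uses 8G of RAM.
  export DIB_BOOTLOADER_DEFAULT_CMDLINE="nofail mem=8G"
  # The bootloader element consumes the variable; the other elements
  # are illustrative only.
  disk-image-create -o ubuntu-focal-capped vm ubuntu-minimal bootloader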

For completeness, other possibilities are:
 * Convince the clouds that the Nova flavor is the best place to control this and set the flavors appropriately
 * Don't use clouds that can't set appropriate flavors
 * Accept Fungi's argument in the IRC log above: memory, like other resources such as disk IOPS and network, will be variable (a sketch of what a job could do in that case follows this list)
 * A kernel module that inspects some attribute at boot time and sets the memory limit appropriately
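If we went the variability route, jobs could at least record what they actually got. A minimal sketch of such a pre-run check:

  # Log the memory actually available so memory-related failures are
  # easier to diagnose after the fact.
  awk '/MemTotal/ {printf "Node reports %d MB of RAM\n", $2/1024}' /proc/meminfo
  # Show whether a mem= cap was applied on the kernel command line.
  cat /proc/cmdline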

[0] 
https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/bootloader/README.rst

> 
> -Jim


