[Edge-computing] [ironic][ops] Taking ironic nodes out of production

Christopher Price christopher.price at est.tech
Tue May 21 08:26:25 UTC 2019

I would add that something as simple as an operator policy could/should be able to remove hardware from an operational domain.  It does not specifically need to be a fault or a retirement; it may be as simple as repurposing the hardware to a different operational domain. From an OpenStack perspective this should not require any handling different from "retirement"; it is just worth knowing that a policy change may imply time constraints that could potentially be ignored in a "retirement" scenario.

Further, at least in my imagination, one might be reallocating hardware from one Ironic domain to another, which may have implications for how we best bring a new node online.  (or not, I'm no expert) </ end dubious thought stream>

/ Chris

On 2019-05-21, 09:16, "Bogdan Dobrelya" <bdobreli at redhat.com> wrote:

    [CC'ed edge-computing at lists.openstack.org]
    On 20.05.2019 18:33, Arne Wiebalck wrote:
    > Dear all,
    > One of the discussions at the PTG in Denver raised the need for
    > a mechanism to take ironic nodes out of production (a task for
    > which the currently available 'maintenance' flag does not seem
    > appropriate [1]).
    > The use case there is an unhealthy physical node in state 'active',
    > i.e. associated with an instance. The request is then to enable an
    > admin to mark such a node as 'faulty' or 'in quarantine' with the
    > aim of not returning the node to the pool of available nodes once
    > the hosted instance is deleted.
    > A very similar use case which came up independently is node
    > retirement: it should be possible to mark nodes ('active' or not)
    > as being 'up for retirement' to prepare the eventual removal from
    > ironic. As in the example above, ('active') nodes marked this way
    > should not become eligible for instance scheduling again, but
    > automatic cleaning, for instance, should still be possible.
    > In an effort to cover these use cases by a more general 
    > "quarantine/retirement" feature:
    > - are there additional use cases which could profit from such a
    >    "take a node out of service" mechanism?
    There are security-related examples described in the Edge Security
    Challenges whitepaper [0] drafted by the k8s IoT SIG [1], for instance in
    chapter 2, "Trusting hardware", whereby "GPS coordinate changes can be used
    to force a shutdown of an edge node". So a node may be taken out of
    service as an indicator of a particular condition of edge hardware.
    [1] https://github.com/kubernetes/community/tree/master/wg-iot-edge
    > - would these use cases put additional constraints on what the
    >    feature should look like (e.g.: "should not prevent cleaning")
    > - are there other characteristics such a feature should have
    >    (e.g.: "finding these nodes should be supported by the cli")
    > Let me know if you have any thoughts on this.
    > Cheers,
    >   Arne
    > [1] https://etherpad.openstack.org/p/DEN-train-ironic-ptg, l. 360
    Best regards,
    Bogdan Dobrelya,
    Irc #bogdando
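
The behaviour asked for in this thread can be sketched as a toy state model: a node flagged as out of service (quarantine, retirement, or a policy-driven repurposing) is still cleaned when its instance is deleted, but is held back instead of returning to the pool of available nodes. This is only an illustration under stated assumptions, not Ironic's data model or API; the names Node, out_of_service, "held", take_out_of_service, delete_instance and schedulable are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Toy stand-in for a bare-metal node (hypothetical, not Ironic's model)."""
    name: str
    state: str = "available"        # "available", "active" or "held"
    out_of_service: bool = False    # hypothetical quarantine/retirement flag
    reason: Optional[str] = None    # free-text reason, useful for listing
    clean_count: int = 0            # how often the node has been cleaned


def take_out_of_service(node: Node, reason: str) -> None:
    """Flag a node; an 'active' node keeps its instance until it is deleted."""
    node.out_of_service = True
    node.reason = reason
    if node.state == "available":
        node.state = "held"         # idle nodes leave the pool immediately


def delete_instance(node: Node) -> None:
    """Delete the instance: cleaning always runs, the pool return is conditional."""
    if node.state != "active":
        raise ValueError(f"{node.name} has no instance to delete")
    node.clean_count += 1           # "should not prevent cleaning"
    node.state = "held" if node.out_of_service else "available"


def schedulable(nodes: List[Node]) -> List[Node]:
    """Nodes that are eligible for instance scheduling."""
    return [n for n in nodes if n.state == "available"]


def out_of_service(nodes: List[Node]) -> List[Node]:
    """Flagged nodes, i.e. what a CLI listing would have to find."""
    return [n for n in nodes if n.out_of_service]


if __name__ == "__main__":
    nodes = [Node("n1", state="active"), Node("n2", state="active"), Node("n3")]
    take_out_of_service(nodes[0], "unhealthy hardware")
    for n in nodes[:2]:
        delete_instance(n)
    print([n.name for n in schedulable(nodes)])     # ['n2', 'n3']
    print([n.name for n in out_of_service(nodes)])  # ['n1']
```

Keeping the flag separate from the state is one possible design choice: it lets an 'active' node be marked without disturbing its instance, and the same flag can carry a fault, a retirement, or a policy change via the reason field.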