[Edge-computing] [ironic][ops] Taking ironic nodes out of production

Arkady.Kanevsky at dell.com
Tue May 21 12:55:06 UTC 2019


Let's dig deeper into requirements.
I see three distinct use cases:
1. Put the node into maintenance mode, say to upgrade FW/BIOS or for any other life-cycle event. It stays in the ironic cluster but is no longer used by the rest of OpenStack, e.g. Nova.
2. Put the node into a "fail" state. That is, remove it from usage and from the ironic cluster. What cleanup the operator would like to (or can) do depends on the failure. Depending on the node type, the node may need to be replaced.
3. Put the node back into "available" for other usage. What cleanup the operator wants to do still needs to be defined. This is very much the same step as used for Bare Metal as a Service, where a node is reassigned back into the available pool. Depending on its next usage, the node may stay in the ironic cluster or be removed from it; once removed, it can be retired or used for any other purpose. (A rough API sketch for these cases follows below.)
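
For illustration, a rough sketch of how these cases could map onto today's ironic API via openstacksdk. Case 2 has no first-class "fail" state yet, so maintenance plus node deletion is only an approximation; the cloud and node names are placeholders:

  # Sketch only; "mycloud" and "node-0" are placeholders.
  import openstack

  conn = openstack.connect(cloud='mycloud')
  node = conn.baremetal.get_node('node-0')

  # 1. Maintenance mode: the node stays enrolled in ironic, but Nova
  # stops placing instances on it.
  conn.baremetal.set_node_maintenance(node, reason='FW/BIOS upgrade')
  # ... perform the life-cycle event ...
  conn.baremetal.unset_node_maintenance(node)

  # 2. "Fail" state (approximation only): park the node in maintenance
  # with a reason, then remove it from ironic once it is not deployed.
  conn.baremetal.set_node_maintenance(node, reason='hardware fault')
  conn.baremetal.delete_node(node)

  # 3. Back to the available pool: 'provide' moves a manageable node
  # to 'available', running automatic cleaning first if it is enabled.
  conn.baremetal.set_node_provision_state(node, 'provide')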

Thanks,
Arkady

-----Original Message-----
From: Christopher Price <christopher.price at est.tech> 
Sent: Tuesday, May 21, 2019 3:26 AM
To: Bogdan Dobrelya; openstack-discuss at lists.openstack.org; edge-computing at lists.openstack.org
Subject: Re: [Edge-computing] [ironic][ops] Taking ironic nodes out of production


I would add that something as simple as an operator policy could/should be able to remove hardware from an operational domain. It does not specifically need to be a fault or retirement; it may be as simple as repurposing to a different operational domain. From an OpenStack perspective this should not require any handling different from "retirement"; it's just that a policy change may imply time constraints which could potentially be ignored in a "retirement" scenario.

Further, at least in my imagination, one might be reallocating hardware from one Ironic domain to another, which may have implications for how we best bring a new node online. (Or not, I'm no expert.) </ end dubious thought stream>
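
Purely as an illustration of that thought, re-homing a node between two Ironic domains might look roughly like the openstacksdk sketch below; the cloud names, node name and the bare driver_info copy are placeholder assumptions, and a real move would also need to carry ports, traits and cleaning steps:

  # Rough sketch: move a node from one ironic deployment to another.
  # 'edge-a' and 'edge-b' are placeholder clouds.yaml entries.
  import openstack

  src = openstack.connect(cloud='edge-a')
  dst = openstack.connect(cloud='edge-b')

  node = src.baremetal.get_node('node-0')

  # Enroll the node in the new domain first; it starts in 'enroll'
  # there and still has to be managed/cleaned before it is available.
  dst.baremetal.create_node(name=node.name,
                            driver=node.driver,
                            driver_info=node.driver_info)

  # Remove it from the old domain once it is no longer deployed.
  src.baremetal.delete_node(node)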

/ Chris

On 2019-05-21, 09:16, "Bogdan Dobrelya" <bdobreli at redhat.com> wrote:

    [CC'ed edge-computing at lists.openstack.org]
    
    On 20.05.2019 18:33, Arne Wiebalck wrote:
    > Dear all,
    > 
    > One of the discussions at the PTG in Denver raised the need for
    > a mechanism to take ironic nodes out of production (a task for
    > which the currently available 'maintenance' flag does not seem
    > appropriate [1]).
    > 
    > The use case there is an unhealthy physical node in state 'active',
    > i.e. associated with an instance. The request is then to enable an
    > admin to mark such a node as 'faulty' or 'in quarantine' with the
    > aim of not returning the node to the pool of available nodes once
    > the hosted instance is deleted.
    > 
    > A very similar use case which came up independently is node
    > retirement: it should be possible to mark nodes ('active' or not)
    > as being 'up for retirement' to prepare the eventual removal from
    > ironic. As in the example above, ('active') nodes marked this way
    > should not become eligible for instance scheduling again, but
    > automatic cleaning, for instance, should still be possible.
    > 
    > In an effort to cover these use cases with a more general 
    > "quarantine/retirement" feature:
    > 
    > - are there additional use cases which could profit from such a
    >    "take a node out of service" mechanism?
    
    There are security-related examples described in the Edge Security 
    Challenges whitepaper [0] drafted by the k8s IoT Edge WG [1], such as 
    in chapter 2, "Trusting hardware", whereby "GPS coordinate changes can 
    be used to force a shutdown of an edge node". So a node may be taken 
    out of service as an indicator of a particular condition of the edge 
    hardware.
    
    [0] 
    https://docs.google.com/document/d/1iSIk8ERcheehk0aRG92dfOvW5NjkdedN8F7mSUTr-r0/edit#heading=h.xf8mdv7zexgq
    [1] https://github.com/kubernetes/community/tree/master/wg-iot-edge
    
    > 
    > - would these use cases put additional constraints on how the
    >    feature should look (e.g.: "should not prevent cleaning")
    > 
    > - are there other characteristics such a feature should have
    >    (e.g.: "finding these nodes should be supported by the cli")
    > 
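    Absent a first-class field, one convention that would already work
    today is tagging such nodes via the node's free-form 'extra' field
    and filtering on it client-side. A minimal openstacksdk sketch,
    where the 'retired' key is purely a hypothetical convention:
    
      # Hypothetical 'extra'-field convention for marking nodes.
      import openstack
    
      conn = openstack.connect(cloud='mycloud')  # placeholder cloud
    
      # Mark a node as up for retirement.
      node = conn.baremetal.get_node('node-0')
      conn.baremetal.update_node(node, extra=dict(node.extra, retired=True))
    
      # Find all nodes marked this way.
      for n in conn.baremetal.nodes(details=True):
          if n.extra.get('retired'):
              print(n.name, n.provision_state)
    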
    > Let me know if you have any thoughts on this.
    > 
    > Cheers,
    >   Arne
    > 
    > 
    > [1] https://etherpad.openstack.org/p/DEN-train-ironic-ptg, l. 360
    > 
    
    
    -- 
    Best regards,
    Bogdan Dobrelya,
    Irc #bogdando
    

_______________________________________________
Edge-computing mailing list
Edge-computing at lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/edge-computing

