elastic-recheck maintenance takeover
Hi!
I would like to request taking over the maintenance of the elastic-recheck project.
My team (tripleo-ci) really values the tool and has a strong interest not only in keeping it alive but also in improving it.
While the team considered writing our own tool or forking it and changing it for our own use, adding missing features like multiple gerrit servers or multiple issue trackers, I advocated that we should instead become active maintainers and avoid the not-invented-here approach.
I already did some work towards making the tool an easy-to-deploy standalone container. Starting new instances for testing or production should soon be as easy as a one-line command.
I am more than happy to act as a liaison between openstack-infra and tripleo-ci and ensure that no changes are made that could negatively impact the main openstack deployment.
I propose adding the tripleo-ci-core gerrit group as core to the project, but if there are concerns that the group is too wide, I can also provide a list of people that I trust to follow the expected review process.
Best regards,
Sorin Sbarnea
Red Hat TripleO-CI
On 2020-08-11 19:30:20 +0100 (+0100), Sorin Sbarnea wrote:
I would like to request taking over the maintenance of the elastic-recheck project.
My team (tripleo-ci) really values the tool and has a strong interest not only in keeping it alive but also in improving it.
While the team considered writing our own tool or forking it and changing it for our own use, adding missing features like multiple gerrit servers or multiple issue trackers, I advocated that we should instead become active maintainers and avoid the not-invented-here approach.
I already did some work towards making the tool an easy-to-deploy standalone container. Starting new instances for testing or production should soon be as easy as a one-line command.
I wouldn't consider it a takeover. OpenDev exists to promote collaboration on tools by the projects making use of them. That TripleO's CI subteam finds Elastic Recheck a useful tool and wants to take on its maintenance burden is exactly the sort of thing we want, I think.
I am more than happy to act as a liaison between openstack-infra and tripleo-ci and ensure that no changes are made that could negatively impact the main openstack deployment.
Just a reminder, "openstack-infra" is no more, except still being a vestigial name for the OpenStack TaCT SIG's IRC channel. I think we've been considering Elastic Recheck a broader OpenDev service anyway, at least insofar as the basic mechanism by which it operates is easily extended to cover non-OpenStack projects hosted in OpenDev. It is, however, worth noting that the massive 6-node/6TB Elasticsearch cluster and 20 Logstash import workers on which this service relies represent a substantial chunk of our overall "control plane" resource utilization, and lately it hasn't seemed to me that it's nearly popular enough to responsibly warrant the cost to our donors.
I propose adding the tripleo-ci-core gerrit group as core to the project, but if there are concerns that the group is too wide, I can also provide a list of people that I trust to follow the expected review process.
This sounds entirely reasonable to me. Thanks to the team for offering to help with it!
--
Jeremy Stanley
To be clear, the goal is to drive maintenance of the tool, not to "control" it. Working in sync with opendev-infra is something I assumed.
As you rightly said, there is a significant deployment / operational part which is currently done with puppet and which is, for good reason, managed by the opendev team. One of my goals was to simplify that part and migrate to ansible, with the container conversion being part of it.
This is highly dependent on opendev, and I am confident that I will get the needed help in achieving that goal.
Sorin
Does the tool actually need maintenance, or do we just need to write more queries and submit them to the tool when we hit a bug? We used to have a strong recommendation that you never recheck a patch without first inspecting why it failed, and that when you leave a recheck you add a reference to a bug. There was then a further recommendation that if we keep seeing the same bug, we add an elastic-recheck query referencing the bug (which you were meant to file if there was not already one), so that if others hit the same error, elastic-recheck comments to let them know about the known issue.
For example, if you look at the neutron docs https://docs.openstack.org/neutron/pike/contributor/policies/gerrit-recheck....
"Please, do not recheck without providing the bug number for the failed job. For example, do not just put an empty “recheck” comment but find the related bug number and put a “recheck bug ######” comment instead. If a bug does not exist yet, create one so other team members can have a look. It helps us maintain better visibility of gate failures"
Nova used to have a similar doc but I can't find it currently. I think we might have removed it and directed people to the centralised contributor guide at some point. The main contributor guide docs describe how to add a new recheck query https://docs.openstack.org/contributors/code-and-documentation/elastic-reche...
But I don't think it is common knowledge that we should add new queries, file bugs, or at a minimum state the reason for the recheck before rechecking, as at least in nova we have not pushed contributors to do this consistently for 2-3 years. When I first started working on openstack it was common knowledge, and the core teams and others explained that you should avoid doing an empty recheck, but the last time I brought this up downstream internally only about half of my team knew that that convention used to exist.
On Tue, Aug 11, 2020, at 2:42 PM, Sean Mooney wrote:
Does the tool actually need maintenance, or do we just need to write more queries and submit them to the tool when we hit a bug? We used to have a strong recommendation that you never recheck a patch without first inspecting why it failed, and that when you leave a recheck you add a reference to a bug. There was then a further recommendation that if we keep seeing the same bug, we add an elastic-recheck query referencing the bug (which you were meant to file if there was not already one), so that if others hit the same error, elastic-recheck comments to let them know about the known issue.
It's a bit of both. The deployment of the tool could be modernized to Python 3, along with related code updates. Ideally we'd also adjust it to be less OpenStack-specific. A big part of that would be loading its queries from configs somewhere and not in the same repo. This way we can host queries for airship and starlingx and zuul too if we want. On the query side of things, OpenStack's gate has a >50% unclassified status according to e-r right now.
I'm happy if those that find it useful continue to keep it running, whether that is OpenStack-specific or something different.
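As a rough illustration of loading queries from configuration outside the tool's own repository, the sketch below reads one query directory per project into a single mapping. It is not elastic-recheck's actual loader: the function name `load_queries`, the directory layout, the use of JSON files (chosen only so the sketch needs nothing beyond the standard library), and the example bug numbers and query strings are all assumptions made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_queries(query_dirs):
    """Map each project name to {bug_id: query_string}.

    query_dirs is a hypothetical config value: {project_name: directory}.
    Each directory holds one JSON file per bug, named <bug_id>.json, with
    a "query" key. This layout is an assumption, not e-r's real format.
    """
    loaded = {}
    for project, directory in query_dirs.items():
        queries = {}
        for path in sorted(Path(directory).glob("*.json")):
            # The file stem is treated as the bug identifier.
            data = json.loads(path.read_text(encoding="utf-8"))
            queries[path.stem] = data["query"]
        loaded[project] = queries
    return loaded


def demo():
    """Build two throwaway query directories and load them."""
    # Made-up bug numbers and query strings, for illustration only.
    samples = [
        ("openstack", "1000001", 'message:"example failure A"'),
        ("zuul", "2000002", 'message:"example failure B"'),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        dirs = {}
        for project, bug_id, query in samples:
            project_dir = Path(tmp) / project
            project_dir.mkdir(exist_ok=True)
            (project_dir / f"{bug_id}.json").write_text(
                json.dumps({"query": query}), encoding="utf-8"
            )
            dirs[project] = project_dir
        return load_queries(dirs)


if __name__ == "__main__":
    print(demo())
```

With the query location held in configuration like this, a deployment could point at query sets for other projects without any change to the tool's repository.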
For example, if you look at the neutron docs https://docs.openstack.org/neutron/pike/contributor/policies/gerrit-recheck....
"Please, do not recheck without providing the bug number for the failed job. For example, do not just put an empty “recheck” comment but find the related bug number and put a “recheck bug ######” comment instead. If a bug does not exist yet, create one so other team members can have a look. It helps us maintain better visibility of gate failures"
We stopped this convention because e-r was working better. Humans were constantly identifying the wrong bugs or not attempting to be accurate and that data ended up being incredibly noisy. What we found instead was that the elastic-recheck data was far more accurate and it was better to invest in that. This worked really well as long as people were investing in it. I would not recommend we go back to human recheck tracking as it will just be noisy and lead to bad assumptions.
On Tue, 2020-08-11 at 16:18 -0700, Clark Boylan wrote:
On Tue, Aug 11, 2020, at 2:42 PM, Sean Mooney wrote:
Does the tool actually need maintenance, or do we just need to write more queries and submit them to the tool when we hit a bug? We used to have a strong recommendation that you never recheck a patch without first inspecting why it failed, and that when you leave a recheck you add a reference to a bug. There was then a further recommendation that if we keep seeing the same bug, we add an elastic-recheck query referencing the bug (which you were meant to file if there was not already one), so that if others hit the same error, elastic-recheck comments to let them know about the known issue.
It's a bit of both. The deployment of the tool could be modernized to Python 3, along with related code updates. Ideally we'd also adjust it to be less OpenStack-specific. A big part of that would be loading its queries from configs somewhere and not in the same repo. This way we can host queries for airship and starlingx and zuul too if we want. On the query side of things, OpenStack's gate has a >50% unclassified status according to e-r right now.
I'm happy if those that find it useful continue to keep it running, whether that is OpenStack-specific or something different.
For example, if you look at the neutron docs https://docs.openstack.org/neutron/pike/contributor/policies/gerrit-recheck....
"Please, do not recheck without providing the bug number for the failed job. For example, do not just put an empty “recheck” comment but find the related bug number and put a “recheck bug ######” comment instead. If a bug does not exist yet, create one so other team members can have a look. It helps us maintain better visibility of gate failures"
We stopped this convention because e-r was working better. Humans were constantly identifying the wrong bugs or not attempting to be accurate and that data ended up being incredibly noisy. What we found instead was that the elastic-recheck data was far more accurate and it was better to invest in that. This worked really well as long as people were investing in it. I would not recommend we go back to human recheck tracking as it will just be noisy and lead to bad assumptions.
Well, the downside that I have found is that now people often don't look at why it failed and just recheck. I try to at least do "recheck <job name> failed because <whatever failed>". I don't necessarily reference a bug or file one every time, but I try to at least flag why it failed.
---- On Tue, 11 Aug 2020 13:30:20 -0500 Sorin Sbarnea <ssbarnea@redhat.com> wrote ----
Hi!
I would like to request taking over the maintenance of the elastic-recheck project.
My team (tripleo-ci) really values the tool and has a strong interest not only in keeping it alive but also in improving it.
While the team considered writing our own tool or forking it and changing it for our own use, adding missing features like multiple gerrit servers or multiple issue trackers, I advocated that we should instead become active maintainers and avoid the not-invented-here approach.
I already did some work towards making the tool an easy-to-deploy standalone container. Starting new instances for testing or production should soon be as easy as a one-line command.
I am more than happy to act as a liaison between openstack-infra and tripleo-ci and ensure that no changes are made that could negatively impact the main openstack deployment.
I propose adding the tripleo-ci-core gerrit group as core to the project, but if there are concerns that the group is too wide, I can also provide a list of people that I trust to follow the expected review process.
From the QA perspective, we are OK with this. Unfortunately, the QA team does not have much bandwidth, nor a Java expert to take over the maintenance.
-gmann
participants (5)
- Clark Boylan
- Ghanshyam Mann
- Jeremy Stanley
- Sean Mooney
- Sorin Sbarnea