On Fri, Jan 15, 2021, at 11:24 AM, Jeremy Stanley wrote:
On 2021-01-15 16:56:39 +1100 (+1100), Ian Wienand wrote: [...]
We have been told there are no fixes planned for the 1.6 servers. We do not have a wonderful answer here, unfortunately. [...]
Per subsequent discussion in IRC today, the problem (at least for 1.6) stems from the number of zero bits in the timestamp leading to weak/repeating ID generation, and will cease to present a problem around the end of the month.
It seems our best option here is to take some downtime of the AFS services and perform a manual, in-place upgrade of the existing servers to 1.8 code. We will freeze these servers from ongoing automated config management (puppet). This will allow us to deploy and evaluate 1.8 servers while keeping with the principle of "change one thing at a time". Once we are in a steady state, we should have enough redundancy that we can replace the servers one at a time with no, or very little, downtime. This gives us time to write, test and review an Ansible-based deployment as usual. When we do switch in new Focal-based servers, we have the advantage that we are not also trying to change the AFS version at the same time. In some ways, it works out well (something about lemons and lemonade).
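For concreteness, I would expect the in-place step on a single file server to look roughly like this (a sketch only; the package names are the stock Debian/Ubuntu ones, and where we source the 1.8 packages plus exactly how we freeze puppet are details to settle as part of [2]):

  bos shutdown <fileserver> -localauth -wait    # cleanly stop the AFS server processes
  sudo apt-get install openafs-fileserver       # pull in the 1.8 packages
  sudo systemctl restart openafs-fileserver     # or: bos startup <fileserver> -localauth
  bos status <fileserver> -localauth -long      # confirm everything came back up

The db servers would get the same treatment with openafs-dbserver.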
It turns out 1.6 and 1.8 share the same protocols and on-disk volume format, and the old and new keystores can coexist side by side as well, so we may very well be able to just do this in place or with a rolling/piecemeal upgrade if we want.
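Either way, a quick per-server sanity check we can do during the rollout (a sketch, assuming only the standard OpenAFS admin tools already on the hosts) is to confirm what each server is actually running and that it can see both key formats:

  rxdebug <fileserver> 7000 -version     # reports the OpenAFS version the fileserver is running
  bos listkeys <fileserver> -localauth   # shows the old-style KeyFile entries
  sudo asetkey list                      # run on the server itself; lists the keys it knows about

That should make it obvious mid-rollout which hosts are on 1.6 versus 1.8 and that the keys both versions need are present.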
We usually try to have a rollback plan with any upgrade. Since 1.6 is not getting updates, and as far as I know we still do not have the technology to move reality backwards in the time continuum to before the problem rollover time (if I do figure out time travel, I will be sure to pre-respond to this message saying it can be ignored), our revert plans seem limited. [...]
Given the new information we have about the bug ceasing to be a problem in a couple of weeks, and also the ability to switch freely between and mix 1.6 and 1.8 servers, it sounds like a rollback won't necessarily be intractable if we decide it's warranted.
I'd suggest that I could start looking at this around 2021-01-17 21:00 UTC, following the plan laid out in [2], which interested parties should definitely review. This should hopefully be a quiet time, and give us a decent runway if we do hit issues. [...]
This sounds reasonable time-wise, but it also seems like we might be able to reduce the impact/outage and take it slower if we need to.
Yes, rather than doing it all in one go with a downtime, maybe we should instead start with afs01.ord or afs02.dfw (the more secondary servers), ensure that server is happy, and then roll through the rest one at a time? We have documented that general process here: https://docs.opendev.org/opendev/system-config/latest/afs.html#no-outage-ser.... It does appear that the existing releases may complicate this process, though, as we typically want the RW volume to be on an active server?
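If we go that route, the per-server dance would be roughly the following (a sketch using the short hostnames from this thread; the partition letter and volume name are placeholders, and the real list comes from vos listvldb):

  vos listvldb -server afs02.dfw        # list volume entries with sites on this server; note any RW sites
  vos move -id <volume> -fromserver afs02.dfw -frompartition a \
           -toserver afs01.dfw -topartition a -localauth   # relocate the RW copy first
  vos release <volume> -localauth       # re-release so the RO copies stay current

Then we upgrade the now RW-free server and, if we want the original layout back, move the volumes home again afterwards.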
-- Jeremy Stanley