On Fri, Jan 15, 2021, at 11:24 AM, Jeremy Stanley wrote:
On 2021-01-15 16:56:39 +1100 (+1100), Ian Wienand wrote: [...]
We have been told there are no fixes planned for the 1.6 servers. We do not have a wonderful answer here, unfortunately. [...]
Per subsequent discussion in IRC today, the problem (at least for 1.6) stems from the number of zero bits in the timestamp leading to weak/repeating ID generation, and will cease to present a problem around the end of the month.
It seems our best option here is to take some downtime of the AFS services and perform a manual, in-place upgrade of the existing servers to 1.8 code. We will freeze these servers from ongoing automated config management (puppet). This will allow us to deploy and evaluate 1.8 servers while keeping with the principle of "change one thing at a time". Once we are in a steady state, we should have enough redundancy that we can replace the servers one at a time with no, or very little, downtime. This gives us time to write, test and review an Ansible-based deployment as usual. When we do switch in new Focal-based servers, we have the advantage that we are not also trying to change the AFS version at the same time. In some ways, it works out well (something about lemons and lemonade).
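For concreteness, I would expect the in-place step on a single file server to look roughly like this (a sketch only; the package names are the stock Debian/Ubuntu ones, and where we source the 1.8 packages plus exactly how we freeze puppet are details to settle as part of [2]):

  bos shutdown <fileserver> -localauth -wait    # cleanly stop the AFS server processes
  sudo apt-get install openafs-fileserver       # pull in the 1.8 packages
  sudo systemctl restart openafs-fileserver     # or: bos startup <fileserver> -localauth
  bos status <fileserver> -localauth -long      # confirm everything came back up

The db servers would get the same treatment with openafs-dbserver.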
It turns out 1.6 and 1.8 share the same protocols and on-disk volume format, and the old and new keystores can coexist side by side as well, so we may very well be able to just do this in place or with a rolling/piecemeal upgrade if we want.
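Either way, a quick per-server sanity check we can do during the rollout (a sketch, assuming only the standard OpenAFS admin tools already on the hosts) is to confirm what each server is actually running and that it can see both key formats:

  rxdebug <fileserver> 7000 -version     # reports the OpenAFS version the fileserver is running
  bos listkeys <fileserver> -localauth   # shows the old-style KeyFile entries
  sudo asetkey list                      # run on the server itself; lists the keys it knows about

That should make it obvious mid-rollout which hosts are on 1.6 versus 1.8 and that the keys both versions need are present.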
We usually try to have a rollback plan with any upgrade. Since 1.6 is not getting updates, and as far as I know we still do not have the technology to move reality backwards in the time continuum to before the problem rollover time (if I do figure out time travel, I will be sure to pre-respond to this message saying it can be ignored), our revert plans seem limited. [...]
Given the new information we have about the bug ceasing to be a problem in a couple of weeks, and also the ability to switch freely between and mix 1.6 and 1.8 servers, it sounds like a rollback won't necessarily be intractable if we decide it's warranted.
I'd suggest that I could start looking at this around 2021-01-17 21:00 UTC, following the plan laid out in [2], which interested parties should definitely review. This should hopefully be a quiet time, and give us a decent runway if we do hit issues. [...]
This sounds reasonable time-wise, but it also seems like we might be able to reduce the impact/outage and take it slower if we need to.
Yes, rather than doing it all in one go with a downtime, maybe we should instead start with afs01.ord or afs02.dfw (the more secondary servers), ensure that server is happy, and then roll through the rest one at a time? We have documented that general process here: https://docs.opendev.org/opendev/system-config/latest/afs.html#no-outage-ser.... It does appear that the existing releases may complicate this process, though, as we typically want the RW volume to be on an active server?
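If we go that route, the per-server dance would be roughly the following (a sketch using the short hostnames from this thread; the partition letter and volume name are placeholders, and the real list comes from vos listvldb):

  vos listvldb -server afs02.dfw        # list volume entries with sites on this server; note any RW sites
  vos move -id <volume> -fromserver afs02.dfw -frompartition a \
           -toserver afs01.dfw -topartition a -localauth   # relocate the RW copy first
  vos release <volume> -localauth       # re-release so the RO copies stay current

Then we upgrade the now RW-free server and, if we want the original layout back, move the volumes home again afterwards.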
-- Jeremy Stanley