OpenAFS timestamp rollover issues, discussion and plan

Clark Boylan cboylan at sapwetik.org
Fri Jan 15 20:42:05 UTC 2021


On Fri, Jan 15, 2021, at 11:24 AM, Jeremy Stanley wrote:
> On 2021-01-15 16:56:39 +1100 (+1100), Ian Wienand wrote:
> [...]
> > We have been told there are no fixes for the 1.6 servers planned.
> > We do not have a wonderful answer here, unfortunately.
> [...]
> 
> Per subsequent discussion in IRC today, the problem (at least for
> 1.6) stems from the number of zero bits in the timestamp leading to
> weak/repeating ID generation, and will cease to present a problem
> around the end of the month.
> 
> > It seems our best option here is to take some down-time of the AFS
> > services and perform a manual, in-place upgrade of the existing
> > servers to 1.8 code.  We will freeze these from ongoing automated
> > config management (puppet).  This will allow us to deploy and evaluate
> > 1.8 servers and keep with the principle of "change one thing at a
> > time".  Once we are in a steady state, we should have enough
> > redundancy that we can replace the servers one-at-a-time with no, or
> > very little, downtime.  This gives us time to write, test and review
> > Ansible based deployment as usual.  When we do switch in new Focal
> > based servers, we have the advantage we are not also trying to change
> > the AFS version too.  In some ways, it works out well (something about
> > lemons and lemonade).
> 
> It turns out 1.6 and 1.8 share the same protocols and on-disk volume
> format, and the old and new keystores for them can coexist
> side-by-side as well, so we very well may be able to just do this
> in-place or with a rolling/piecemeal upgrade if we want.
> 
> > We usually try to have a rollback plan with any upgrades.  Since 1.6
> > is not getting updates, and as far as I know we still do not have the
> > technology to move reality backwards in the time continuum before the
> > problem rollover time (if I do figure out time travel, I will be sure
> > to pre-respond to this message that it can be ignored) our revert
> > plans seem limited.
> [...]
> 
> Given the new information we have about the bug ceasing to be a
> problem in a couple of weeks, and also the ability to switch freely
> between and mix 1.6 and 1.8 servers, it sounds like a rollback won't
> necessarily be intractable if we decide it's warranted.
> 
> > I'd suggest that I could start looking at this around 2021-01-17
> > 21:00UTC, following the plan laid out in [2], which interested
> > parties should definitely review.  This should hopefully be a quiet
> > time, and give us a decent runway if we do hit issues.
> [...]
> 
> This sounds reasonable time-wise, but also it seems like we might be
> able to reduce the impact/outage and can take it slower if we need.

Yes. Rather than doing it all in one go with a downtime, maybe we should instead start with afs01.ord or afs02.dfw (the more secondary servers), make sure that server is happy, and then work through the rest in a rolling fashion? We have documented that general process here: https://docs.opendev.org/opendev/system-config/latest/afs.html#no-outage-server-maintenance

It does appear that the existing releases may complicate this process, though, as we typically want to have the RW volume on active servers.
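
As a rough illustration of the "upgrade one server, check it, then move on" ordering described above, here is a minimal sketch. The names `rolling_upgrade`, `upgrade`, and `is_healthy` are hypothetical placeholders, not part of OpenAFS or system-config; the real per-server steps are the ones in the linked no-outage maintenance documentation.

```python
from typing import Callable, Iterable, List


def rolling_upgrade(
    servers: Iterable[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade servers one at a time, stopping at the first unhealthy one.

    `upgrade` and `is_healthy` are caller-supplied (hypothetical) hooks that
    stand in for the manual upgrade and verification steps.
    """
    upgraded: List[str] = []
    for server in servers:
        upgrade(server)
        if not is_healthy(server):
            # Stop here so the remaining servers keep serving on the old version.
            raise RuntimeError(
                f"{server} is not healthy after upgrade; "
                f"already upgraded: {upgraded}"
            )
        upgraded.append(server)
    return upgraded
```

Passing the more secondary servers first (for example `["afs01.ord", "afs02.dfw"]`) means a failure is caught before any remaining server is touched.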

> -- 
> Jeremy Stanley
> 
> Attachments:
> * signature.asc


