Hello,

We were made aware of an issue relating to the handling of a timestamp rollover affecting OpenAFS deployments [1] that in the worst case results in the client and server being unable to communicate. From the OpenAFS 1.8.7 release:

  It fixes a critical issue in the generation of Rx connection IDs that
  prevent Rx clients started after 14 Jan 2021 08:25:36 AM UTC from being
  able to successfully make connections. In addition to cache managers and
  client utilities, fileservers and database servers are also affected,
  since they initiate connections to (other) database servers during their
  normal operation. The issue occurs only at startup, while generating the
  initial connection ID, so cache managers or servers that were already
  running at the time in question will not be affected until they restart.

(For the curious, there is a rough sketch of the rollover arithmetic at the end of this message.)

The full extent of the issue for our particular circumstances is still somewhat unclear. We run a heterogeneous deployment, with all clients under our control using the 1.8.6 release but all servers running 1.6-era packages from Ubuntu Xenial.

My current understanding is that since we started our servers and clients before the problem time, we are currently unaffected. When we do restart our servers, due to differences in the code of our 1.6 versions, the bug will manifest more as periodic/random-ish I/O failures rather than a complete inability to communicate.

We have rebuilt our openafs packages with the required fixes and deployed these to all clients under our control, so if any client restarts for whatever reason, it will at least be using fixed code. Some clients we have restarted just to ensure sanity of the new version against the existing servers (so far seems good). All clients have been checked for deployment of these new packages. We do not expect issues, but will monitor the situation.

This leaves us with the server side. We have been told there are no fixes planned for the 1.6 servers.

We do not have a wonderful answer here, unfortunately. We have been planning for some time to move our AFS infrastructure off the Xenial hosts it runs on. These servers are all deployed with legacy puppet, something none of us want to spend significant time hacking on. There is also the small matter that 1.6 to 1.8 upgrades are not terribly well documented, to say the least. While we don't intend to restart the servers, from time to time the cloud provider needs to migrate things or has other issues that affect uptime; we can't predict this, other than that it will happen eventually.

It seems our best option here is to take some downtime of the AFS services and perform a manual, in-place upgrade of the existing servers to 1.8 code. We will freeze these hosts out of ongoing automated config management (puppet). This will allow us to deploy and evaluate 1.8 servers while keeping to the principle of "change one thing at a time". Once we are in a steady state, we should have enough redundancy that we can replace the servers one at a time with no, or very little, downtime. That gives us time to write, test and review an Ansible-based deployment as usual. When we do switch in new Focal-based servers, we will have the advantage of not trying to change the AFS version at the same time. In some ways, it works out well (something about lemons and lemonade).

We usually try to have a rollback plan with any upgrades.
Since 1.6 is not getting updates, and as far as I know we still do not have the technology to move reality backwards in the time continuum to before the problem rollover time (if I do figure out time travel, I will be sure to pre-respond to this message saying it can be ignored), our revert plans seem limited. However, we will be no worse off than if the servers decided to reboot themselves now. In its defence, this is not new code; 1.8.0 came out in early 2018, and we have been using 1.8 clients since we started integrating arm64.

The queues are very deep at the moment, so we'd obviously like to minimise downtime. I'd suggest that I could start looking at this around 2021-01-17 21:00 UTC, following the plan laid out in [2], which interested parties should definitely review. This should hopefully be a quiet time, and give us a decent runway if we do hit issues.

-i

[1] https://lists.openafs.org/pipermail/openafs-info/2021-January/043013.html
[2] https://etherpad.opendev.org/p/infra-openafs-1.8
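
P.S. the rough sketch of the rollover arithmetic mentioned above: 14 Jan 2021 08:25:36 AM UTC is exactly Unix time 0x60000000, which is why such an innocuous-looking moment is a cut-over at all. The snippet below only demonstrates that arithmetic and the general "a timestamp-seeded ID stops fitting once it is shifted" shape of the problem; the shift amount, field layout and names are my own illustration, not the actual Rx code (see [1] for the real details).

  /* Illustrative only -- not the OpenAFS implementation. */
  #include <stdio.h>
  #include <stdint.h>
  #include <time.h>

  /* Assumption for illustration: the low bits of a connection ID are
   * reserved for the per-connection call channel, so an ID seeded from
   * the start time gets shifted left before use. */
  #define CALL_CHANNEL_BITS 2

  int main(void)
  {
      time_t rollover = 0x60000000;   /* 1610612736 seconds */
      char buf[64];

      strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S UTC", gmtime(&rollover));
      printf("0x60000000 seconds since the epoch = %s\n", buf);

      /* A timestamp this large, once shifted, no longer fits in 32 bits. */
      uint64_t shifted = (uint64_t)rollover << CALL_CHANNEL_BITS;
      printf("shifted ID = 0x%llx (%s 32 bits)\n",
             (unsigned long long)shifted,
             shifted > UINT32_MAX ? "overflows" : "fits in");
      return 0;
  }

Running it prints 2021-01-14 08:25:36 UTC, matching the moment in the release announcement, and shows the shifted value spilling past 32 bits.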