Hello,

We were made aware of an issue relating to the handling of a timestamp rollover affecting OpenAFS deployments [1] that in the worst case results in the client and server being unable to communicate. From the OpenAFS 1.8.7 release:

  It fixes a critical issue in the generation of Rx connection IDs that
  prevent Rx clients started after 14 Jan 2021 08:25:36 AM UTC from being
  able to successfully make connections. In addition to cache managers and
  client utilities, fileservers and database servers are also affected,
  since they initiate connections to (other) database servers during their
  normal operation. The issue occurs only at startup, while generating the
  initial connection ID, so cache managers or servers that were already
  running at the time in question will not be affected until they restart.

(For the curious, there is a rough sketch of the rollover arithmetic at the end of this message.)

The full extent of the issue for our particular circumstances is still somewhat unclear. We run a heterogeneous deployment, with all clients under our control using the 1.8.6 release but all servers running 1.6-era packages from Ubuntu Xenial.

My current understanding is that since we started our servers and clients before the problem time, we are currently unaffected. When we do restart our servers, due to differences in the code of our 1.6 versions, the bug will manifest more as periodic/random-ish I/O failures rather than a complete inability to communicate.

We have rebuilt our openafs packages with the required fixes and deployed these to all clients under our control, so if any client restarts for whatever reason, it will at least be using fixed code. Some clients we have restarted just to ensure sanity of the new version against the existing servers (so far seems good). All clients have been checked for deployment of these new packages. We do not expect issues, but will monitor the situation.

This leaves us with the server side. We have been told there are no fixes planned for the 1.6 servers.

We do not have a wonderful answer here, unfortunately. We have been planning for some time to move our AFS infrastructure off the Xenial hosts it runs on. These servers are all deployed with legacy puppet, something none of us want to spend significant time hacking on. There is also the small matter that 1.6 to 1.8 upgrades are not terribly well documented, to say the least. While we don't intend to restart the servers, from time to time the cloud provider needs to migrate things or has other issues that affect uptime; we can't predict this, other than that it will happen eventually.

It seems our best option here is to take some downtime of the AFS services and perform a manual, in-place upgrade of the existing servers to 1.8 code. We will freeze these hosts out of ongoing automated config management (puppet). This will allow us to deploy and evaluate 1.8 servers while keeping to the principle of "change one thing at a time". Once we are in a steady state, we should have enough redundancy that we can replace the servers one at a time with no, or very little, downtime. That gives us time to write, test and review an Ansible-based deployment as usual. When we do switch in new Focal-based servers, we will have the advantage of not trying to change the AFS version at the same time. In some ways, it works out well (something about lemons and lemonade).

We usually try to have a rollback plan with any upgrades.
Since 1.6 is not getting updates, and as far as I know we still do not have the technology to move reality backwards in the time continuum to before the problem rollover time (if I do figure out time travel, I will be sure to pre-respond to this message saying it can be ignored), our revert plans seem limited. However, we will be no worse off than if the servers decided to reboot themselves now. In its defence, this is not new code; 1.8.0 came out in early 2018, and we have been using 1.8 clients since we started integrating arm64.

The queues are very deep at the moment, so we'd obviously like to minimise downtime. I'd suggest that I could start looking at this around 2021-01-17 21:00 UTC, following the plan laid out in [2], which interested parties should definitely review. This should hopefully be a quiet time, and give us a decent runway if we do hit issues.

-i

[1] https://lists.openafs.org/pipermail/openafs-info/2021-January/043013.html
[2] https://etherpad.opendev.org/p/infra-openafs-1.8
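
P.S. the rough sketch of the rollover arithmetic mentioned above: 14 Jan 2021 08:25:36 AM UTC is exactly Unix time 0x60000000, which is why such an innocuous-looking moment is a cut-over at all. The snippet below only demonstrates that arithmetic and the general "a timestamp-seeded ID stops fitting once it is shifted" shape of the problem; the shift amount, field layout and names are my own illustration, not the actual Rx code (see [1] for the real details).

  /* Illustrative only -- not the OpenAFS implementation. */
  #include <stdio.h>
  #include <stdint.h>
  #include <time.h>

  /* Assumption for illustration: the low bits of a connection ID are
   * reserved for the per-connection call channel, so an ID seeded from
   * the start time gets shifted left before use. */
  #define CALL_CHANNEL_BITS 2

  int main(void)
  {
      time_t rollover = 0x60000000;   /* 1610612736 seconds */
      char buf[64];

      strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S UTC", gmtime(&rollover));
      printf("0x60000000 seconds since the epoch = %s\n", buf);

      /* A timestamp this large, once shifted, no longer fits in 32 bits. */
      uint64_t shifted = (uint64_t)rollover << CALL_CHANNEL_BITS;
      printf("shifted ID = 0x%llx (%s 32 bits)\n",
             (unsigned long long)shifted,
             shifted > UINT32_MAX ? "overflows" : "fits in");
      return 0;
  }

Running it prints 2021-01-14 08:25:36 UTC, matching the moment in the release announcement, and shows the shifted value spilling past 32 bits.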