OpenAFS timestamp rollover issues, discussion and plan
Hello,

We were made aware of an issue relating to the handling of a timestamp rollover affecting OpenAFS deployments [1] that in the worst case results in the client and server being unable to communicate. From the OpenAFS 1.8.7 release:

  It fixes a critical issue in the generation of Rx connection IDs that prevent Rx clients started after 14 Jan 2021 08:25:36 AM UTC from being able to successfully make connections. In addition to cache managers and client utilities, fileservers and database servers are also affected, since they initiate connections to (other) database servers during their normal operation. The issue occurs only at startup, while generating the initial connection ID, so cache managers or servers that were already running at the time in question will not be affected until they restart.

The full extent of the issues for our particular circumstances is still somewhat unclear. We run a heterogeneous deployment: all clients under our control use the 1.8.6 release, but all servers run 1.6-era packages from Ubuntu Xenial. My current understanding is that since we started our servers and clients before the problem time, we are currently unaffected. When we restart our servers, due to differences in the code of our 1.6 versions, the bug will manifest more as periodic/random-ish I/O failures rather than a complete inability to communicate.

We have rebuilt our openafs packages with the required fixes and deployed these to all clients under our control, so if any client restarts for whatever reason it will at least be running fixed code. Some clients we have restarted just to ensure sanity of the new version against the existing servers (so far, it seems good). All clients have been checked for deployment of these new packages. We do not expect issues, but will monitor the situation.

This leaves us with the server side. We have been told there are no fixes for the 1.6 servers planned. We do not have a wonderful answer here, unfortunately. We have been planning to move our AFS infrastructure off the Xenial hosts it runs on for some time. These servers are all deployed with legacy puppet, something none of us want to spend significant time hacking on. There is also the small matter that 1.6 to 1.8 upgrades are not terribly well documented, to say the least. While we don't intend to restart the servers, from time to time the cloud provider needs to migrate things or has other issues that affect uptime; we can't predict this, other than that it will happen eventually.

It seems our best option here is to take some downtime of the AFS services and perform a manual, in-place upgrade of the existing servers to 1.8 code. We will freeze these from ongoing automated config management (puppet). This will allow us to deploy and evaluate 1.8 servers and keep with the principle of "change one thing at a time". Once we are in a steady state, we should have enough redundancy that we can replace the servers one at a time with no, or very little, downtime. This gives us time to write, test and review an Ansible-based deployment as usual. When we do switch in new Focal-based servers, we have the advantage that we are not also trying to change the AFS version at the same time. In some ways, it works out well (something about lemons and lemonade).

We usually try to have a rollback plan with any upgrades. Since 1.6 is not getting updates, and as far as I know we still do not have the technology to move reality backwards in the time continuum to before the problem rollover time (if I do figure out time travel, I will be sure to pre-respond to this message that it can be ignored), our revert plans seem limited. However, we will be no worse off than if the servers decided to reboot themselves now. In its defence, this is not new code; 1.8.0 came out in early 2018 and we have been using 1.8 clients since we started integrating arm64.

The queues are very deep at the moment, so we'd obviously like to minimise downtime. I'd suggest that I could start looking at this around 2021-01-17 21:00 UTC, following the plan laid out in [2], which interested parties should definitely review. This should hopefully be a quiet time, and give us a decent runway if we do hit issues.

-i

[1] https://lists.openafs.org/pipermail/openafs-info/2021-January/043013.html
[2] https://etherpad.opendev.org/p/infra-openafs-1.8
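For reference, the cutover quoted from the release notes is not an arbitrary moment: it corresponds to Unix time 1610612736, i.e. 0x60000000, a timestamp whose low 29 bits are all zero. A quick sanity check of that arithmetic (illustrative only; this is not the Rx code):

    import datetime

    # The moment quoted in the 1.8.7 release notes, as a Unix timestamp.
    cutover = datetime.datetime(2021, 1, 14, 8, 25, 36,
                                tzinfo=datetime.timezone.utc)
    t = int(cutover.timestamp())

    print(t, hex(t))  # 1610612736 0x60000000
    print(bin(t))     # 0b11 followed by 29 zeros -- the low 29 bits are all zero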
On 2021-01-15 16:56:39 +1100 (+1100), Ian Wienand wrote: [...]
> We have been told there are no fixes for the 1.6 servers planned. We do not have a wonderful answer here, unfortunately. [...]
Per subsequent discussion in IRC today, the problem (at least for 1.6) stems from the number of zero bits in the timestamp leading to weak/repeating ID generation, and it will cease to be an issue around the end of the month.
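A back-of-the-envelope illustration of why it fades (assuming, per the above, that the weakness tracks how many of the timestamp's low-order bits are zero; the actual Rx code differs, see [1] for the upstream discussion): the clock only sits at the pathological all-zero-low-bits value 0x60000000 for an instant, and the low bits fill back in over the following days and weeks.

    # Sketch only: how the low-order bits of the Unix clock look in the days
    # after the 0x60000000 rollover (2021-01-14 08:25:36 UTC).
    base = 0x60000000

    for days in (0, 1, 7, 17):
        t = base + days * 86400
        set_low_bits = bin(t & 0x1FFFFFFF).count("1")  # set bits below bit 29
        print(f"+{days:2d} days: {t:#010x}  ({set_low_bits} set bits in the low 29)")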
> It seems our best option here is to take some downtime of the AFS services and perform a manual, in-place upgrade of the existing servers to 1.8 code. We will freeze these from ongoing automated config management (puppet). This will allow us to deploy and evaluate 1.8 servers and keep with the principle of "change one thing at a time". Once we are in a steady state, we should have enough redundancy that we can replace the servers one at a time with no, or very little, downtime. This gives us time to write, test and review an Ansible-based deployment as usual. When we do switch in new Focal-based servers, we have the advantage that we are not also trying to change the AFS version at the same time. In some ways, it works out well (something about lemons and lemonade).
It turns out 1.6 and 1.8 share the same protocols and on-disk volume format, and the old and new keystores for them can coexist side-by-side as well, so we very well may be able to just do this in-place or with a rolling/piecemeal upgrade if we want.
> We usually try to have a rollback plan with any upgrades. Since 1.6 is not getting updates, and as far as I know we still do not have the technology to move reality backwards in the time continuum to before the problem rollover time (if I do figure out time travel, I will be sure to pre-respond to this message that it can be ignored), our revert plans seem limited. [...]
Given the new information we have about the bug ceasing to be a problem in a couple of weeks, and also the ability to switch freely between and mix 1.6 and 1.8 servers, it sounds like a rollback won't necessarily be intractable if we decide it's warranted.
> I'd suggest that I could start looking at this around 2021-01-17 21:00 UTC, following the plan laid out in [2], which interested parties should definitely review. This should hopefully be a quiet time, and give us a decent runway if we do hit issues. [...]
This sounds reasonable time-wise, but it also seems like we might be able to reduce the impact/outage and take it slower if we need to.

--
Jeremy Stanley
On Fri, Jan 15, 2021, at 11:24 AM, Jeremy Stanley wrote:
[...]

> This sounds reasonable time-wise, but it also seems like we might be able to reduce the impact/outage and take it slower if we need to.
Yes, rather than doing it all in one go with a downtime, maybe we should instead start with afs01.ord or afs02.dfw (the more secondary servers), ensure that server is happy, and then work through the rest in a rolling fashion? We have documented that general process here: https://docs.opendev.org/opendev/system-config/latest/afs.html#no-outage-ser.... It does appear that the existing releases may complicate this process, though, as we typically want to have the RW volume on active servers?
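For what it's worth, the drain step of that no-outage process is basically "move any RW sites off the server, then take it down". Purely as an illustrative sketch (the linked documentation is the real procedure; server, partition and volume names below are placeholders, not our actual layout):

    import subprocess

    def rw_sites_on(server):
        # "vos listvldb -server X" prints every VLDB entry with a site on X;
        # the "RW Site" lines in its output show what would need moving.
        result = subprocess.run(["vos", "listvldb", "-server", server],
                                capture_output=True, text=True, check=True)
        return result.stdout

    def move_rw_volume(volume, from_server, to_server, partition="a"):
        # Relocate one volume's RW site before taking from_server down.
        # -localauth assumes this runs on a server with the cell key available.
        subprocess.run(["vos", "move", "-id", volume,
                        "-fromserver", from_server, "-frompartition", partition,
                        "-toserver", to_server, "-topartition", partition,
                        "-localauth"],
                       check=True)

    # e.g. (placeholder names):
    # print(rw_sites_on("afs01.ord.openstack.org"))
    # move_rw_volume("mirror.example", "afs01.ord.openstack.org",
    #                "afs01.dfw.openstack.org")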
On Fri, Jan 15, 2021 at 12:42:05PM -0800, Clark Boylan wrote:
>> Given the new information we have about the bug ceasing to be a problem in a couple of weeks, and also the ability to switch freely between and mix 1.6 and 1.8 servers, it sounds like a rollback won't necessarily be intractable if we decide it's warranted.
> Yes, rather than doing it all in one go with a downtime, maybe we should instead start with afs01.ord or afs02.dfw (the more secondary servers), ensure that server is happy, and then work through the rest in a rolling fashion? We have documented that general process here: https://docs.opendev.org/opendev/system-config/latest/afs.html#no-outage-ser.... It does appear that the existing releases may complicate this process, though, as we typically want to have the RW volume on active servers?
I agree with this; we can start with one server manually to validate that 1.6 and 1.8 do actually co-exist as happily as advertised. As mentioned in the etherpad, I can work on things like new key distribution first as well, which is a good first thing to start running Ansible with. I think we need to keep the manual upgrade plan in our pocket in case of an unexpected restart before the end of January.

-i
participants (3)
- Clark Boylan
- Ian Wienand
- Jeremy Stanley