Next steps with new review server
Hi,
We have a large server provided by Vexxhost up and running in a staging capacity to replace the current server at review02.openstack.org.
I have started to track some things at [1].
There are a couple of things:
1) Production database
Currently, we use a hosted db. Since the move to NoteDB, this only stores review seen flags. We've been told that other sites treat this data as ephemeral; they use an H2 db on disk and don't worry about backing up or restoring across upgrades.
I have proposed storing this in a mariadb sibling container with [2]. We know how to admin, back up and restore that. That would be my preference, but I'm not terribly fussed. If I could request some reviews on that: I'll take +2's as a sign we should use a container; otherwise we can leave it with the H2 db it has now.
2) IPv6 issues
We've seen a couple of cases that look increasingly like stray RAs somehow assigning extra addresses, similar to [3]. Our mirror in the same region has managed to acquire 50+ default routes somehow.
It seems like inbound traffic keeps working (which may be why we haven't seen issues with other production servers), but it's a little troubling to leave this undiagnosed before we switch our major service over. I'm running some tracing, noted in the etherpad, trying to at least catch a stray RA while the server is quiet. Suggestions here are welcome.
-i
[1] https://etherpad.opendev.org/p/gerrit-upgrade-2021
[2] https://review.opendev.org/c/opendev/system-config/+/775961
[3] https://launchpad.net/bugs/1844712
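For reference, a minimal sketch of the kind of RA capture described above, using scapy to log any router advertisement seen on the wire. The choice of scapy and the interface name are assumptions for illustration only; the thread does not say what tracing is actually being run.

    # Sketch: log every ICMPv6 router advertisement seen, with its source and any
    # advertised prefixes, so a stray RA can be caught while the server is quiet.
    # Assumes scapy is installed, root privileges, and that "ens3" is the public
    # interface (adjust to suit).
    from datetime import datetime

    from scapy.all import sniff
    from scapy.layers.inet6 import ICMPv6ND_RA, ICMPv6NDOptPrefixInfo, IPv6


    def log_ra(pkt):
        """Print the source and advertised prefixes of each RA seen."""
        if not pkt.haslayer(ICMPv6ND_RA):
            return
        prefixes = []
        opt = pkt.getlayer(ICMPv6NDOptPrefixInfo)
        while opt is not None:
            prefixes.append(f"{opt.prefix}/{opt.prefixlen}")
            opt = opt.payload.getlayer(ICMPv6NDOptPrefixInfo)
        print(f"{datetime.utcnow().isoformat()} RA from {pkt[IPv6].src} "
              f"prefixes={prefixes}")


    # ICMPv6 type 134 is a router advertisement; store=False keeps memory usage
    # flat while the capture waits, possibly for hours, for a stray RA.
    sniff(iface="ens3", filter="icmp6 and ip6[40] == 134", prn=log_ra, store=False)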
On Wed, Mar 31, 2021, at 7:27 PM, Ian Wienand wrote:
Hi,
We have a large server provided by Vexxhost up and running in a staging capacity to replace the current server at review02.openstack.org.
I have started to track some things at [1].
There are a couple of things:
1) Production database
Currently, we use a hosted db. Since the move to NoteDB, this only stores review seen flags. We've been told that other sites treat this data as ephemeral; they use an H2 db on disk and don't worry about backing up or restoring across upgrades.
I have proposed storing this in a mariadb sibling container with [2]. We know how to admin, back up and restore that. That would be my preference, but I'm not terribly fussed. If I could request some reviews on that: I'll take +2's as a sign we should use a container; otherwise we can leave it with the H2 db it has now.
Agreed, sticking with known DB tooling seems like a good idea for ease of operator interaction. I'll try to review this change today.
2) IPv6 issues
We've seen a couple of cases that look increasingly like stray RAs somehow assigning extra addresses, similar to [3]. Our mirror in the same region has managed to acquire 50+ default routes somehow.
It seems like inbound traffic keeps working (which may be why we haven't seen issues with other production servers), but it's a little troubling to leave this undiagnosed before we switch our major service over. I'm running some tracing, noted in the etherpad, trying to at least catch a stray RA while the server is quiet. Suggestions here are welcome.
Agreed, ideally we would sort this out before any migration completes. I want to say we saw something similar with the mirror in vexxhost, and the "solution" there was to disable RAs and create a static yaml config for ubuntu using its new network management config file? That seems less than ideal from a cloud perspective, as we can't be the only ones noticing this (in fact some of our CI jobs may suffer from something similar, causing them to run long when reaching network resources). I know when we brought this up for the mirror, mnaser suggested static config was fine, but maybe we need to reinforce that this is problematic as a cloud user and see if we can help debug (network traces seem like a good start there).
-i
[1] https://etherpad.opendev.org/p/gerrit-upgrade-2021
[2] https://review.opendev.org/c/opendev/system-config/+/775961
[3] https://launchpad.net/bugs/1844712
On Thu, Apr 1, 2021, at 8:20 AM, Clark Boylan wrote:
On Wed, Mar 31, 2021, at 7:27 PM, Ian Wienand wrote:
snip
2) IPv6 issues
We've seen a couple of cases that look increasingly like stray RAs somehow assigning extra addresses, similar to [3]. Our mirror in the same region has managed to acquire 50+ default routes somehow.
It seems like inbound traffic keeps working (which may be why we haven't seen issues with other production servers), but it's a little troubling to leave this undiagnosed before we switch our major service over. I'm running some tracing, noted in the etherpad, trying to at least catch a stray RA while the server is quiet. Suggestions here are welcome.
Agreed, ideally we would sort this out before any migration completes. I want to say we saw something similar with the mirror in vexxhost, and the "solution" there was to disable RAs and create a static yaml config for ubuntu using its new network management config file? That seems less than ideal from a cloud perspective, as we can't be the only ones noticing this (in fact some of our CI jobs may suffer from something similar, causing them to run long when reaching network resources). I know when we brought this up for the mirror, mnaser suggested static config was fine, but maybe we need to reinforce that this is problematic as a cloud user and see if we can help debug (network traces seem like a good start there).
I ended up double checking the mirror node and in mirror.ca-ymq-1.vexxhost.opendev.org:/etc/netplan/50-cloud-init.yaml you can see what we did there. Essentially we set dhcpv6 and accept-ra to false then set an address and routes. We should be able to do the same thing with the new review host if we can't figure anything else out. If we do go this route maybe we should consider updating launch-node to do it for us automatically when launching focal nodes on vexxhost (I don't think bionic does netplan?), or at the very least document this somewhere. We should also double check that the address and routes are static and can be configured statically like this (the address should not change but I suppose the routes could at some point?). Ideally though we would sort this out properly and avoid these workarounds.
-i
[1] https://etherpad.opendev.org/p/gerrit-upgrade-2021
[2] https://review.opendev.org/c/opendev/system-config/+/775961
[3] https://launchpad.net/bugs/1844712
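To make the netplan approach above concrete, here is a rough sketch of the sort of helper launch-node could grow to render such a file, with RAs and DHCPv6 disabled and a static address and default route pinned. The function name and every concrete value (interface, address, gateway) are placeholders for illustration, not the real review or mirror settings, and any real change would of course go through system-config review.

    # Sketch of a launch-node style helper that renders a netplan config with
    # accept-ra and dhcp6 disabled and a static IPv6 address and default route.
    # All concrete values (interface, address, gateway) are placeholders.
    import yaml


    def render_netplan(interface, address, gateway):
        """Return netplan YAML pinning a static IPv6 setup with RAs ignored."""
        config = {
            "network": {
                "version": 2,
                "ethernets": {
                    interface: {
                        "dhcp4": True,  # v4 left on DHCP in this sketch
                        "dhcp6": False,
                        "accept-ra": False,
                        "addresses": [address],
                        "routes": [{"to": "::/0", "via": gateway}],
                    },
                },
            },
        }
        return yaml.safe_dump(config, default_flow_style=False)


    if __name__ == "__main__":
        # Documentation-prefix values for illustration only.
        print(render_netplan("ens3", "2001:db8::10/64", "2001:db8::1"))

Whether automating this in launch-node is worthwhile depends on the open question above about how static the address and routes really are.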
On Thu, Apr 01, 2021 at 02:35:32PM -0700, Clark Boylan wrote:
I ended up double checking the mirror node and in mirror.ca-ymq-1.vexxhost.opendev.org:/etc/netplan/50-cloud-init.yaml you can see what we did there. Essentially we set dhcpv6 and accept-ra to false then set an address and routes. We should be able to do the same thing with the new review host if we can't figure anything else out.
So we have a workaround in production, but also [3] being marked as an open security bug. Are we happy enough that ignoring RAs is sufficient to overcome the issues discussed in [3] for this service? The concern mostly seemed to be a targeted MITM attack, something which ssh host keys and SSL certificates should cover?
-i
On 2021-04-08 15:43:33 +1000 (+1000), Ian Wienand wrote:
On Thu, Apr 01, 2021 at 02:35:32PM -0700, Clark Boylan wrote:
I ended up double checking the mirror node and in mirror.ca-ymq-1.vexxhost.opendev.org:/etc/netplan/50-cloud-init.yaml you can see what we did there. Essentially we set dhcpv6 and accept-ra to false then set an address and routes. We should be able to do the same thing with the new review host if we can't figure anything else out.
So we have a workaround in production, but also [3] being marked as an open security bug.
Are we happy enough that ignoring RAs is sufficient to overcome the issues discussed in [3] for this service? The concern mostly seemed to be a targeted MITM attack, something which ssh host keys and SSL certificates should cover?
Yes, I think ignoring RAs is probably sufficient. Nobody seems to have yet figured out how the leak happens or what else could be leaked, but as you note the fact that a MitM couldn't usefully spoof a viable HTTPS or SSH connection endpoint is sufficient insurance against anything worse, so we can just focus on mitigating the stability problem arising from stray leaks for now.
--
Jeremy Stanley
participants (3)
- Clark Boylan
- Ian Wienand
- Jeremy Stanley