From cboylan at sapwetik.org Tue Oct 20 17:49:49 2020
From: cboylan at sapwetik.org (Clark Boylan)
Date: Tue, 20 Oct 2020 10:49:49 -0700
Subject: [service-announce] October 20 Gerrit Outage
Message-ID: <2ee83d02-f6eb-4ea9-916c-5b5558da862c@www.fastmail.com>

Hello everyone,

By now most of you have probably noticed that we took Gerrit offline
recently. We did so because we believe an admin account in Gerrit was
compromised, allowing an attacker to escalate privileges within Gerrit.

Around 02:00 UTC October 20 suspicious review activity was noticed, and
we were made aware of it shortly afterwards. The involved account was
disabled and removed from privileged Gerrit groups. After further
investigation we decided that we needed to stop the service; this
happened at about 04:00 UTC.

After the service was stopped we shifted focus to identifying the source
of the issue as well as investigating impact. We believe this originated
on October 6th with at least two compromised Ubuntu One accounts, one of
which was a Gerrit admin account. These accounts, like the one that
initially tipped us off, have been dealt with at this point.

In order to evaluate impact we are using backups from October 1 to find
configuration, database, and git repo changes that have been made.

We have identified 97 accounts that updated SSH keys after that point in
time. These SSH keys are being removed as we can't be sure the changes
were valid changes made by the user. If you are one of these users you
will need to add your key(s) back in. We will also attempt to reach out
to the affected users directly by email soon.

We will be checking OpenID URLs and group membership changes as well. We
will determine what actions make sense for these items once we have
evaluated the impact to them.

All Gerrit HTTP API tokens will be deleted. You will need to generate
new ones if you are an API user. Sorry, gertty fans.

On the git repo side of things there are a few things that we will need
to check.
Using our October 1 state we will generate lists of commits that have
landed since then for each branch on each repo. We will verify that the
latest commit can reach the last known good commit in the git DAG. For
non-merge commits we will also correlate these to Gerrit changes. We
will then ask that you help us by verifying the commits on your projects
are as reviewed and not malicious. We will also need to check git tags,
which should all be signed and can be verified that way.

This is a good reminder to check activity on your online accounts and
identities for anything unexpected.

We understand that an inaccessible Gerrit is not fun. We are trying to
go as quickly as we can while also not sacrificing caution and care.

Clark

From iwienand at redhat.com Wed Oct 21 00:33:14 2020
From: iwienand at redhat.com (Ian Wienand)
Date: Wed, 21 Oct 2020 11:33:14 +1100
Subject: [service-announce] October 20 Gerrit Outage Update
Message-ID: <20201021003314.GB1695651@fedora19.localdomain>

As of this mail, Gerrit access has been restored. Please read on for
important information, especially around change verification.

Background
-----------

On 2020-10-20 at 01:30 a user unexpectedly added a workflow approval to
a change that they were not expected to have access to. At 02:06 UTC an
alert was raised via IRC. Administrators found the account had added
itself to a core group and made the +W vote. The account was disabled,
and removed from the groups it had added itself to, by 02:55 UTC.
Administrators began to analyse the situation and Gerrit was taken
offline at 04:02 UTC to preserve state and allow for analysis.

From this time, administrators were working on log collection and
analysis, along with restoring backups for comparison purposes. By
around 08:45 UTC it was clear that the privilege escalation had been
achieved by gaining control of a Launchpad SSO account with Gerrit
administrator privileges. By this time, we had ruled out software
vulnerabilities.
Logs showed the first unauthorized access of the administrator account
in Gerrit on 2020-10-06. Communication with Launchpad admins agrees with
this analysis. We saw one session opened as the administrator user to
StoryBoard on this same day, but logs show no data was modified and no
hidden stories were viewed.

Analysis has been performed against the Gerrit database and git trees
from October 1st, which pre-date any known unauthorized access.

Access was restored at around 2020-10-21 00:00 UTC.

Outcomes
-----------

The following has been verified:

* The administrator account used has been disabled and credentials
  updated.

* We have reviewed all group and user additions/removals since Oct 1.
  The only invalid additions were made by the compromised administrator
  account, which added a single user account to the Administrators
  group; that account then added itself to another known group.

* The account given administrator privilege has been removed from the
  groups it added itself to and is disabled.

* There is no evidence of any unauthorized access via methods other
  than Gerrit HTTP and Gerrit SSH access.

* No commits have been pushed to git trees bypassing code review. Every
  git tree has been compared to the Oct 1 version and all commits have
  been correctly inserted via Gerrit changes.

* The version of Gerrit we use stores HTTP API passwords in plain-text.
  We know that a limited number of passwords were gathered via the HTTP
  API and it is possible passwords were gathered via the database. We
  thus have assumed that all HTTP API passwords have been disclosed.
  This password needs to be explicitly enabled by users, and many users
  do not have it enabled.

Remediation
-----------

This leaves us with the following remediation actions:

* Users should double-check their Launchpad recent activity at
  https://login.launchpad.net/activity for any suspicious logins. If
  any are found, please notify the OpenDev admins in Freenode #opendev
  and Launchpad admins in #launchpad immediately.
* All HTTP API passwords have been cleared. If you push changes via
  HTTPS (instead of typical SSH), are a gertty user, or run a CI system
  or something else that communicates with the Gerrit HTTP API, you
  will need to regenerate a password.

* Any SSH keys added to accounts since 2020-10-01 have been removed.
  This affects only a limited number of accounts. This is done in an
  abundance of caution, and we do not believe any accounts had
  unauthorized SSH keys added.

* We should audit all changes for projects since 2020-10-01.

  We have no evidence that any account had its SSH keys compromised,
  thus we can rule out any unauthorized changes being uploaded via SSH.
  However, we cannot conclusively rule out that compromised HTTP API
  passwords were used to push a change through Gerrit. For example, a
  change could be uploaded that looks like it came from a user, or the
  API key of a core team member may have been used to approve a change
  without authorization.

  Given our extensive analysis we consider it exceedingly unlikely that
  this vector was used. We have had no notifications of users seeing
  unexpected changes either uploaded by them, or approved by them, in
  projects they work on.

  This said, we believe it is important to inform the community of this
  very unlikely, but still possible, vulnerability of the source code.

  To this end, we have prepared a list of all changes from the known
  affected period which should be audited for correctness. These are
  available at

    https://static.opendev.org/project/opendev.org/gerrit-diffs/

  Team members should browse these changes and make sure they were
  correctly approved in Gerrit. If any change looks suspicious you
  should notify OpenDev administrators in Freenode #opendev
  immediately.

Further actions
----------------

We are planning the following for the short term future:

* The OpenDev administrators will be looking at alternative models for
  Gerrit admin account management.
* We are already well into planning and testing a coming upgrade to a
  version of Gerrit which does not store plain-text API keys.

* Longer term, we've written a spec for replacing Launchpad SSO as our
  authentication provider.

We thank you for your patience during this trying time, and we look
forward to returning to supporting the community doing what it does
best -- working together to create great things.

From cboylan at sapwetik.org Tue Oct 27 21:16:53 2020
From: cboylan at sapwetik.org (Clark Boylan)
Date: Tue, 27 Oct 2020 14:16:53 -0700
Subject: [service-announce] review.opendev.org Gerrit outage and upgrade 15:00UTC November 20 to 01:00UTC November 23, 2020
Message-ID: <410ac6e8-6381-49d8-836e-302fe166b47a@www.fastmail.com>

The OpenDev team is planning a long weekend Gerrit outage on
review.opendev.org starting 15:00UTC November 20 and running to
01:00UTC November 23, 2020 in order to upgrade to Gerrit 3.2.

The upgrade has two major portions. First we will incrementally move
from our current version 2.13 to 2.16. Each point release requires a
database migration and git indexing operations that take the majority
of the time.

The second major part, once at version 2.16, is to convert Gerrit to
use the new NoteDB backend. This stores reviews together with code in
the git trees, rather than in a separate database. There are known
problems converting to NoteDB from any version prior to 2.16, which is
why we need the initial upgrade steps. This is the slowest portion of
the upgrade process, and the one most prone to unforeseen issues
despite extensive pre-testing. We will be creating snapshots to
facilitate quick fallback if required. Once this is complete, we can
move to the 3.x series and incrementally upgrade from 3.0 to 3.2.

Timing-wise we expect the first 2.13 -> 2.16 phase to be completed in
about 8 hours. Next we'll start the NoteDB migration and let that run
overnight.
When we return the next morning the NoteDB migration should be complete
and we can finish with the last set of upgrades to 3.2. Hopefully that
means we're done in the first half of the outage window, but if we have
problems, things are slower than we expect, or we have to work through
any details after finishing the upgrade, we'd like to have the extra
buffer in place.

Some important things to know:

* Gerrit's web UI will be changing. We don't have a lot of control over
  this. As a result we'll lose some existing CI result rendering
  niceness that we have on 2.13 (in particular the summary table and CI
  results toggle will go away). We hope that once we've upgraded we can
  investigate solving this through Gerrit plugins.

* Our existing integrations with Launchpad and StoryBoard may not all
  be fully functional. The changes to the database system impact
  aspects of how we look up user details in order to update bugs.

* Gerrit 3.2 requires git 2.2.0 or newer to use the Change ID commit
  hook. This may be a problem for RHEL/CentOS 7 users.

* This upgrade will fix the OpenSSH problems with host keys that
  Fedora 33 users have experienced.

Q&A:

If this upgrade isn't perfect why are we doing it anyway?
  We've come to the realization that if we don't make imperfect
  progress we'll never make any progress. We have decided that the
  benefits outweigh the known drawbacks and we'll do our best to work
  on those issues after the upgrade.

Why now?
  This is long overdue and the quiet time at the end of the year tends
  to be a good time for these big upgrades.

Can I install my favorite Gerrit plugin?
  We're hoping to complete the upgrade and then start looking at things
  like this. One big reason for waiting is that finishing the upgrade
  reduces the number of Gerrit Docker images we need to care for.

Will my third party CI system continue to work?
  We expect it will. We know Zuul is tested against newer Gerrit
  already.

How can I help?
  Once the upgrade is complete you'll want to confirm the basic
  functionality you rely on is there. We know there will be differences
  or missing features. Patience as we figure out how to address those
  on a new Gerrit installation is much appreciated. If you're
  interested in hacking on Java and JavaScript we'd love help with the
  plugins necessary to address the known problems. You should be able
  to build this out locally without any special access. Please let us
  know if you are interested and we can help you bootstrap.

Finally, we're continuing to test the upgrade and trying to sort out
whether we can get this upgrade completed in a shorter period of time.
We'll keep you up to date if further testing forces us to postpone or
if the outage window timeframe changes at all.

As always, feel free to send your questions our way and we'll do our
best to answer them.

On behalf of the OpenDev team,
Clark
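
The upgrade announcement above notes that Gerrit 3.2 requires git 2.2.0
or newer for the Change ID commit hook. A workstation can be checked
ahead of the outage with something like the following. This is only a
minimal Python sketch and not part of the original announcement: the
names parse_git_version, meets_minimum and installed_git_version are
hypothetical, and it assumes `git --version` prints a line containing
a dotted version number such as "git version X.Y.Z", possibly followed
by a vendor suffix.

```python
import re
import subprocess

# Minimum git version for the Change ID commit hook, per the announcement.
MINIMUM_GIT = (2, 2, 0)


def parse_git_version(output):
    """Extract (major, minor, patch) from `git --version` style output.

    Assumption: the output contains "X.Y" or "X.Y.Z". A missing patch
    component is treated as 0, and anything after the third component
    (for example a vendor suffix) is ignored.
    """
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError("no version number found in: %r" % output)
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def meets_minimum(version, minimum=MINIMUM_GIT):
    """Return True if the version tuple is at least the minimum."""
    return version >= minimum


def installed_git_version():
    """Run `git --version` on this machine and parse the result."""
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=True
    )
    return parse_git_version(result.stdout)


if __name__ == "__main__":
    version = installed_git_version()
    status = "ok" if meets_minimum(version) else "too old for the commit hook"
    print("git %d.%d.%d: %s" % (version + (status,)))
```

Comparing versions as integer tuples rather than strings avoids
ordering mistakes such as "2.10" sorting before "2.2".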