[Openinfralabs] What operational data and data streams should be gathered?

Thu Apr 30 16:02:28 UTC 2020

On 2020-04-16 12:52:28 +0200 (+0200), Marcel Hild wrote:
> I would start with monitoring data, i.e. all prometheus metrics
> being collected.

I agree that's a good route to take initially. Using OpenDev as an
example, while obviously not the same systems and not necessarily
constrained by the same risks, we try to keep all non-sensitive
monitoring and trending data public, like:

http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=25&rra_id=all

http://grafana.openstack.org/d/T6vSHcSik/zuul-status?orgId=1

> But those in itself are not really useful without a picture of the
> infrastructure and access to the infrastructure.

One thing we've done in that vein is to try to keep all our
deployment and systems management notes (runbooks or whatever you
like to term them) on a public site:

https://docs.opendev.org/opendev/system-config/

We also try to publish logs for some services of interest if we can
be certain they won't leak things like PII or credentials:

https://nb01.openstack.org/ (sorry, self-signed cert on that one)

> And you might need access to ticket systems, since you want to
> correlate the metrics with incidents.

Absolutely, though relying on a public incident/defect tracker does
also mean you need to train your users to leverage privacy features
for anything which may contain sensitive data, or to avoid putting
such information in tickets and instead forward it via more
confidential channels. It also means getting better about redacting
certain classes of data either automatically or on request. For the
latter you still have to deal with the fallout, treating that
material as though it has leaked even if you've masked or deleted it
after the fact to limit the damage.
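
To make the "automatically" part a bit more concrete, here is a
minimal sketch of pattern-based redaction applied to ticket text
before it is published. This is only an illustration, not something
OpenDev runs: the PATTERNS table, the redact() function and the
placeholder format are all hypothetical, and the regular expressions
are deliberately simplistic (they will miss plenty of real-world
cases and can produce false positives).

```python
import re

# Hypothetical pattern table, assumed purely for illustration.
# Key/value secrets come first so that a later pattern cannot
# rewrite part of a value before the whole assignment is masked.
PATTERNS = {
    "secret": re.compile(
        r"\b(?:password|passwd|token|secret|api_key)\s*[=:]\s*\S+",
        re.IGNORECASE,
    ),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact(text):
    """Return (redacted_text, counts) where counts maps each
    pattern name to the number of substitutions made."""
    counts = {}
    for name, pattern in PATTERNS.items():
        text, hits = pattern.subn(f"[REDACTED {name}]", text)
        counts[name] = hits
    return text, counts


if __name__ == "__main__":
    sample = ("Login failed for alice@example.org from 203.0.113.7 "
              "with password=hunter2")
    cleaned, counts = redact(sample)
    print(cleaned)
    print(counts)
```

The per-pattern counts are there so that a non-zero count can be
flagged for human follow-up, since anything which was visible before
redaction should still be handled as if it had leaked.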

> My point is, we need to open up _all_ aspects of operations to
> make it actually useful, otherwise it'll be just a pile of data.
[...]

For more general day-to-day operations, as well as scheduled
maintenance or similar change activity, we've found it's useful to
keep discussions on publicly archived mailing lists and in publicly
logged IRC channels so that they're easy to refer back to later.

http://lists.opendev.org/pipermail/service-discuss/2020-April/thread.html

http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-04-30.log.html

http://eavesdrop.openstack.org/meetings/opendev_maint/2020/opendev_maint.2020-04-10-17.00.log.html

We're also using an IRC bot which our sysadmins can command to log
important status information to a Web page:

https://wiki.openstack.org/wiki/Infrastructure_Status

That allows us to easily publish notes about what we're doing which
might have public impact, and we also tie it into a notification
system which can echo messages in subscribed IRC channels or
temporarily update their channel topics to reflect important service
status information.
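
The general shape of that (log an entry once, then fan it out to
whoever subscribed) can be sketched in a few lines. To be clear, this
is not the actual bot's code: the StatusLog class, its method names
and the entry format are assumptions made up for the example, and
plain callbacks stand in for the wiki page and the IRC channels.

```python
from datetime import datetime, timezone


class StatusLog:
    """Hypothetical in-memory status log with subscriber fan-out."""

    def __init__(self):
        self.entries = []       # newest entry first, like a status page
        self.subscribers = []   # callables invoked with each new entry

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def log(self, author, message, when=None):
        # Default to the current UTC time when no timestamp is given.
        when = when or datetime.now(timezone.utc)
        line = f"{when:%Y-%m-%d %H:%M:%S} UTC {message} ({author})"
        self.entries.insert(0, line)
        for callback in self.subscribers:
            callback(line)
        return line


if __name__ == "__main__":
    status = StatusLog()
    # print() stands in for echoing into a subscribed channel.
    status.subscribe(print)
    status.log("sysadmin", "restarted the mirror service after a disk swap")
```

Keeping the log as the single source and treating every outlet as a
subscriber means a new notification target can be added without
changing how sysadmins record their notes.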

Another major choice we've made is to perform installation and
life-cycle management of a lot of our services through continuous
deployment jobs in a public-facing CI/CD system:

https://zuul.opendev.org/t/openstack/builds?pipeline=deploy

It does require careful thought, however, to make sure you're not
exposing anything which could compromise the integrity or security
of those systems.
-- 
Jeremy Stanley