New approaches for grafana.opendev.org

Fri Jul 1 01:35:11 UTC 2022

Hello,
Currently all graphs pushed to grafana.opendev.org are defined in
project-config/grafana as YAML files consumed by grafyaml [1].

The fundamental tension for grafyaml is that the upstream Grafana
project does not document or publish a defined schema for dashboards
and their components.  grafyaml implements a subset of the upstream
data model that is incomplete, in some cases buggy and, perhaps most
importantly, undocumented [1].

We have over a million data points of interesting information in
graphite but I feel there are significant barriers to new and
interesting visualisations.  With no clear documentation, either
upstream or in grafyaml, where is somebody supposed to start?

A series of changes have been reviewed and landed through grafyaml
that allow it to upload dashboards exported directly from the Grafana
UI in its native .json format.  I would like to achieve some consensus
that we use this feature in the OpenDev environment.

I will leave aside the issues with the schema encoded in grafyaml;
it's possible these might be fixed.  AIUI the main reason for
duplicating the schema in grafyaml was that it presented more
reviewable YAML files.  Taking the likely review concerns with raw
.json in turn:

1) Layout of the page; i.e. the rows, panels, nesting, etc.  My
   argument here is that reviewers having to build a mental model of
   what a dashboard will look like -- from either YAML or json -- does
   not make for thorough reviews, especially if you're not already
   intimately familiar with the desired output.

   To this end, I have added a new job "project-config-grafana" that
   loads changed graphs into a Grafana instance, captures actual
   screenshots with a headless browser and publishes them as a
   "Screenshots" artifact.  I believe this is a much more effective
   way to review proposed layout changes for both formats of input.

2) The data graphed.  This comes down to the metric selected, and any
   functions applied.  My argument here is firstly that the screenshot
   is a good way to evaluate this; for example, the graph axis will
   look wrong if you've accidentally treated milliseconds as seconds.
   As to the actual data -- ensuring we have the right metric, etc. --
   I would say that the "raw" output of the exported graphs just isn't
   that hard to parse.  It is unobfuscated and reads logically.  I
   have proposed some examples:

     https://review.opendev.org/c/openstack/project-config/+/833213/6/grafana/infra-prod-deployment.json
     https://review.opendev.org/c/openstack/project-config/+/848212/3/grafana/nodepool-dib-status.json

   I think you can clearly see the metrics chosen and the functions
   applied.

3) The .json is generally more confusing.  This is true, as the .json
   file is meant for Grafana to read.  However, for better or worse,
   this is the actual data model of your graph page.  To this end, I
   have proposed documentation and a helper script to start a Grafana
   instance in a local container, and load it with the defined
   dashboards:

    https://review.opendev.org/c/openstack/project-config/+/833214/

   This is useful for interactive editing sessions to develop new
   dashboards, and if a reviewer wishes to examine a change more
   closely than the screenshots provided by CI, they can simply pull
   the change from gerrit and load it into a live instance using this
   simple method.  I think this is a significantly lower barrier to
   get people developing new and interesting things against the data
   provided.
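
As a rough illustration of point 2, the sketch below walks an exported
dashboard and lists each panel's query.  It assumes the layout I
understand Grafana exports to use for Graphite panels: a top-level
"panels" list, each panel carrying "targets" whose "target" string is
the query, and collapsed rows nesting their own "panels".  The sample
dashboard, its metric names and the iter_targets helper are all
hypothetical and not taken from project-config.

```python
import json

# Hypothetical, heavily trimmed export; the metric names are made up
# for illustration only.
EXPORTED = json.loads("""
{
  "title": "Example dashboard",
  "panels": [
    {
      "type": "row",
      "title": "Builds",
      "panels": [
        {
          "type": "timeseries",
          "title": "Build time",
          "targets": [
            {"refId": "A",
             "target": "scale(stats.timers.example.build.mean, 0.001)"}
          ]
        }
      ]
    },
    {
      "type": "stat",
      "title": "Failures",
      "targets": [
        {"refId": "A",
         "target": "sumSeries(stats_counts.example.*.failure)"}
      ]
    }
  ]
}
""")


def iter_targets(panels):
    """Yield (panel title, query string) for every Graphite target."""
    for panel in panels:
        for target in panel.get("targets", []):
            # Assumed: Graphite queries live under the "target" key.
            if "target" in target:
                yield panel.get("title", ""), target["target"]
        # Assumed: collapsed rows keep child panels in a nested list.
        yield from iter_targets(panel.get("panels", []))


for title, query in iter_targets(EXPORTED.get("panels", [])):
    print(f"{title}: {query}")
```

Pointing the same function at a file read with json.load() would give
a quick per-panel summary to read alongside the CI screenshots.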

I'm not proposing any existing graph needs to change [2], and
grafyaml's features to set up datasources and load the graphs are
still used.  I'm not proposing we remove or even stop any development
of the YAML schema if people want to work on that and prefer to keep
their graphs that way.

I think that there is a great resource here that is underutilised, and
my hope is we have a path to greatly reduce the barriers to new
contributions.

Sorry for the long mail,

-i

[1] Grafana does a good job of backwards compatibility, so "old"
    dashboards work in new releases.  Hence our extant graphs, though
    producing output that looks very different from what the UI
    produces now, generally work.  That is modulo some bugs where the
    "update" process doesn't work (thresholds was one I found),
    deprecated features that will disappear (cf. time-series graphs)
    and the many panel types that are completely unsupported.

[2] Though most extant graphs use deprecated panel types that will
    have to be updated one day; but that's an issue for another time.