New approaches for grafana.opendev.org
Ian Wienand
iwienand at redhat.com
Fri Jul 1 01:35:11 UTC 2022
Hello,
Currently all graphs pushed to grafana.opendev.org are defined in
project-config/grafana as YAML files consumed by grafyaml [1].
The fundamental tension for grafyaml is that the upstream Grafana
project does not document or publish a defined schema for dashboards
and their components. grafyaml has a subset of the upstream data
model which is incomplete, in some cases buggy -- but also, perhaps
most importantly, undocumented [1].
We have over a million data points of interesting information in
graphite but I feel there are significant barriers to new and
interesting visualisations. With no clear documentation, either
upstream or in grafyaml, where is somebody supposed to start?
A series of changes has been reviewed and landed in grafyaml
that allow it to upload dashboards exported directly from the Grafana
UI in its native .json format. I would like to achieve some consensus
that we use this feature in the OpenDev environment.
I will leave aside the issues with the schema encoded in grafyaml;
it's possible this might be fixed. AIUI the main reason for
duplicating the schema in grafyaml was that it presented more
reviewable YAML files. To this I would say:
1) Layout of the page; i.e. the rows, panels, nesting, etc. My
argument here is that reviewers having to build a mental model of
what a dashboard will look like -- from either YAML or json -- does
not make for thorough reviews, especially if you're not already
intimately familiar with the desired output.
To this end, I have added a new job "project-config-grafana" which
loads changed graphs into a Grafana instance, captures actual
screenshots with a headless browser, and publishes them as an
artifact "Screenshots". I believe this is a much more effective way
to review proposed layout changes for both formats of input.
2) The data graphed. This comes down to the metric selected, and any
functions applied. My argument here is that firstly the screenshot
is a good way to evaluate this; for example you will see if you've
accidentally treated milliseconds as seconds, etc. when the graph
axis is wrong. As to the actual data -- ensuring we have the right
metric, etc. -- I would say that the "raw" output of the exported
graphs just isn't that hard to parse. It is unobfuscated and reads
logically. I have proposed some examples:
https://review.opendev.org/c/openstack/project-config/+/833213/6/grafana/infra-prod-deployment.json
https://review.opendev.org/c/openstack/project-config/+/848212/3/grafana/nodepool-dib-status.json
I think you can clearly see the metrics chosen and the functions
applied.
3) Generally more confusing. This is true, as the .json file is meant
for Grafana to read. However, for better or worse, this is the
actual data model of your graph page. To this end, I have proposed
documentation and a helper script to start a Grafana instance in a
local container, and load it with the defined dashboards:
https://review.opendev.org/c/openstack/project-config/+/833214/
This is useful for interactive editing sessions to develop new
dashboards, and if a reviewer wishes to examine a change more
closely than the screenshots provided by CI, they can simply pull
the change from gerrit and load it into a live instance using this
simple method. I think this is a significantly lower barrier to
get people developing new and interesting things against the data
provided.
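As a rough illustration of point 2, here is a minimal sketch of how a
reviewer might pull the metric expressions out of an exported dashboard
without reading the whole file. It assumes the usual export layout (a
top-level "panels" list, optionally nested under row panels, or the
older "rows" layout) and Graphite-style targets carrying a "target"
string. The sample dashboard and the metric names in it are made up
for the example; they are not taken from the changes linked above.

```python
import json

# Hypothetical, trimmed-down export; real dashboards carry many more keys.
EXPORTED = """
{
  "title": "Example dashboard",
  "panels": [
    {"type": "row", "title": "Builds", "panels": [
      {"type": "timeseries", "title": "Build time",
       "targets": [{"refId": "A",
                    "target": "scale(stats.timers.example.build.mean, 0.001)"}]}
    ]},
    {"type": "stat", "title": "Failures",
     "targets": [{"refId": "A",
                  "target": "sumSeries(stats.counters.example.*.failed)"}]}
  ]
}
"""


def iter_panels(container):
    """Yield every panel, descending into legacy "rows" and nested row panels."""
    for row in container.get("rows", []):
        yield from iter_panels(row)
    for panel in container.get("panels", []):
        yield panel
        yield from iter_panels(panel)


def list_targets(dashboard):
    """Return (panel title, query expression) pairs for Graphite-style targets."""
    found = []
    for panel in iter_panels(dashboard):
        for target in panel.get("targets", []):
            expr = target.get("target")
            if expr:
                found.append((panel.get("title", "<untitled>"), expr))
    return found


if __name__ == "__main__":
    for title, expr in list_targets(json.loads(EXPORTED)):
        print(f"{title}: {expr}")
```

Run against a real export (json.load on the file instead of the inline
sample), this gives a short list of panel titles and the functions
applied to each metric, which can be read alongside the CI screenshots.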
I'm not proposing that any existing graph needs to change [2], and
grafyaml's features to set up datasources and load the graphs are
still used. I'm not proposing we remove or even stop any development
of the YAML schema if people want to work on that and prefer to keep
their graphs that way.
I think that there is a great resource here that is underutilised, and
my hope is we have a path to greatly reduce the barriers to new
contributions.
Sorry for the long mail,
-i
[1] Grafana does a good job of backwards compatibility, so "old"
dashboards work in new releases. Hence our extant graphs, though
producing output that looks very different from what the UI
produces now, generally work, modulo some bugs where the "update"
process doesn't work (thresholds were one I found), deprecations of
features that will disappear (cf. time-series graphs) and the
many panel types that are completely unsupported.
[2] Though most extant graphs use deprecated panel types that will
have to be updated one day; but that's an issue for another time.