Problem: UCS Dashboard database / Prometheus does not start anymore

Problem

The Docker container for UCS Dashboard / Prometheus does not start. Running docker logs on the container, or univention-app logs prometheus, shows log output like the following:

ts=2023-10-19T16:09:40.770Z caller=main.go:1097 level=error err="opening storage failed: reloadBlocks: 116 errors: 
corrupted block 01G3TJRDXS3Y8P860405JGTS7T: mmap files: mmap, size 4970: cannot allocate memory;
corrupted block 01G3TJRMNCGCXY357ZA5XNZ472: mmap files: mmap, size 4833: cannot allocate memory;
corrupted block 01G3TJS4A8TR42PX1W85GS4G72: mmap files: mmap, size 4630: cannot allocate memory; 
corrupted block 01G3TJRBWWYV82Y7CYA7RNNJ0G: mmap files: mmap, size 5078: cannot allocate memory; 
corrupted block 01G3TJRKA2MAFRVWZGNKBVJ4MC: mmap files: mmap, size 4728: cannot allocate memory

Environment

univention-app info
UCS: 5.0-5 
Installed: admin-dashboard=3.0 prometheus-node-exporter=2.0.1 4.4/prometheus=2.35.0-5
Upgradable:

Solution

1. Make sure the system has enough RAM and that it is not fully utilized.

free -m 
              total        used        free      shared  buff/cache   available
Mem:           7978        1651        2940          34        3386        6013
Swap:          7627           0        7627
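The same check can be scripted. The following is a minimal sketch that reads the MemAvailable value (the "available" column of free) from /proc/meminfo. The function names and the threshold passed to has_enough_memory are hypothetical; this article does not state a minimum amount of memory for Prometheus.

```shell
#!/bin/sh
# Print the available memory in MiB, read from a meminfo-style file.
# The optional file argument exists only so the function can be tested
# against a sample file; by default /proc/meminfo is used.
mem_available_mib() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^MemAvailable:/ { print int($2 / 1024) }' "$meminfo"
}

# Return 0 if at least $1 MiB are available, 1 otherwise.
# The threshold is an assumption chosen by the caller, not a documented limit.
has_enough_memory() {
    required="$1"
    available="$(mem_available_mib "${2:-/proc/meminfo}")"
    [ "${available:-0}" -ge "$required" ]
}
```

For example, has_enough_memory 1024 || echo "less than 1 GiB available" prints a warning when less than 1024 MiB are available.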

2. Check the UCR variable for the Prometheus TSDB retention.

ucr info prometheus/storage/tsdb/retention
prometheus/storage/tsdb/retention: 15d

Prometheus includes a local on-disk time series database. Its default retention time is 15 days.
For more information, see:
https://prometheus.io/docs/prometheus/latest/storage/
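To compare the retention setting with the age of the files on disk, it can help to convert the value into a number of days. The following is a minimal sketch: retention_to_days is a hypothetical helper, and it only handles the units d, w and y (Prometheus treats a week as 7 days and a year as 365 days). Other units, such as h, are rejected.

```shell
#!/bin/sh
# Convert a retention value such as "15d", "2w" or "1y" into days.
# Returns 1 for anything that is not <digits><d|w|y>.
retention_to_days() {
    value="$1"
    number="${value%[a-z]}"      # strip one trailing unit letter
    unit="${value#"$number"}"    # whatever was stripped
    case "$number" in
        ''|*[!0-9]*) return 1 ;; # not a plain non-negative integer
    esac
    case "$unit" in
        d) echo "$number" ;;
        w) echo $((number * 7)) ;;
        y) echo $((number * 365)) ;;
        *) return 1 ;;
    esac
}
```

On a UCS system the input could come from the UCR variable, for example retention_to_days "$(ucr get prometheus/storage/tsdb/retention)".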

3. Check the TSDB data directory to see whether it contains too many outdated blocks.

ls -lah /var/lib/univention-appcenter/apps/prometheus/data/data | less

drwxr-x---     3 nobody root    4096 Mai 23  2022 01G2M0RD78Y9N21ZD92H1NFHZ6
drwxr-x---     3 nobody root    4096 Mai 10  2022 01G2NYHYEMMHB0PS8WJ5SH8S0H
drwxr-x---     3 nobody root    4096 Mai 10  2022 01G2QWBG04NPME1J6KB61D69SP
drwxr-x---     3 nobody root    4096 Mai 11  2022 01G2ST50WZEHKZAPP9V0NJD3DX
drwxr-x---     3 nobody root    4096 Mai 12  2022 01G2VQYJ6462BVJGN0PC816KW8
drwxr-x---     3 nobody root    4096 Mai 13  2022 01G2XNR3EF579CGGJJX4E962E6
drwxr-x---     3 nobody root    4096 Mai 13  2022 01G2ZKHMZE613A89TT2KNWBVPD
drwxr-x---     3 nobody root    4096 Mai 14  2022 01G31HB5SF0EX6JS2FZSXYPZ9R
drwxr-x---     3 nobody root    4096 Mai 15  2022 01G33F4Q3TDFQKYSVHYKSJK0YQ
drwxr-x---     3 nobody root    4096 Mai 16  2022 01G35CY8F5S6N1885WNKHEKG42
drwxr-x---     3 nobody root    4096 Mai 16  2022 01G37AQSXEP14A0HW313Q4EB46
drwxr-x---     3 nobody root    4096 Mai 17  2022 01G398HAWRSQZ56RE6X84X3MER
drwxr-x---     3 nobody root    4096 Mai 18  2022 01G3B6AWJG5W9Z32RASNBQQRKZ
drwxr-x---     3 nobody root    4096 Mai 19  2022 01G3D45EMZ51XFDZ5192FN10HP
drwxr-x---     3 nobody root    4096 Mai 19  2022 01G3F1XZTFKEFYJRR0Y7ZB23HG
drwxr-x---     3 nobody root    4096 Mai 20  2022 01G3GZQG0PYMEV4RQ3DC4JRBV3
drwxr-x---     3 nobody root    4096 Mai 21  2022 01G3JXH1BS2J53XCEMHGGWBVQH
drwxr-x---     3 nobody root    4096 Mai 22  2022 01G3MVAJQSYYBC924FZ3M6REBJ
drwxr-x---     3 nobody root    4096 Mai 22  2022 01G3PS43RNE0ECWK7G7ZRJS0SC

In this case there were more than 32,000 TSDB block directories. You can get an approximate count with:

ls -lah /var/lib/univention-appcenter/apps/prometheus/data/data | wc -l
32853
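The ls-based count also includes the "total" line, the . and .. entries, and any other directories in the data directory. If an exact number is needed, a sketch like the following counts only directories whose names start with 0, which matches the block names shown in the listing. count_blocks is a hypothetical helper name, and the sketch assumes find supports -mindepth and -maxdepth, as GNU find does.

```shell
#!/bin/sh
# Count block directories (names starting with "0") directly below $1.
# Other entries, for example a wal directory or plain files, are ignored.
count_blocks() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -name '0*' | wc -l | tr -d ' '
}
```

Usage: count_blocks /var/lib/univention-appcenter/apps/prometheus/data/data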

To fix this error so that the container starts again, you can move the old blocks out of the data directory.

mkdir -p /root/univention/backup_prometheus
mv /var/lib/univention-appcenter/apps/prometheus/data/data/0* /root/univention/backup_prometheus
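With tens of thousands of directories, the shell glob in the mv command may expand to more arguments than the kernel accepts, in which case mv fails with "Argument list too long". Whether that happens depends on the system limits. The following is a sketch of an alternative that lets find pass the directories to mv in batches. move_blocks is a hypothetical helper name, and mv -t assumes GNU coreutils.

```shell
#!/bin/sh
# Move all block directories (names starting with "0") from $1 to $2.
# find batches the arguments itself, so the argument-length limit
# of a single command line is not exceeded.
move_blocks() {
    src="$1"
    dest="$2"
    mkdir -p "$dest" || return 1
    find "$src" -mindepth 1 -maxdepth 1 -type d -name '0*' \
        -exec mv -t "$dest" {} +
}
```

Usage: move_blocks /var/lib/univention-appcenter/apps/prometheus/data/data /root/univention/backup_prometheus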

Hint

If the TSDB blocks are older than 365 days, you could also delete them directly instead of moving them, because a backup that is more than a year old is rarely needed. This decision is up to you.
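Before deleting anything, it is worth listing which blocks would be affected. The following sketch prints the block directories whose modification time is more than a given number of days in the past. list_old_blocks is a hypothetical helper name. The directory modification time is only an approximation of the age of the data, but it is the same date that ls shows in the listing above.

```shell
#!/bin/sh
# List block directories (names starting with "0") directly below $1
# that were last modified more than $2 days ago.
# This only prints the paths; it does not delete anything.
list_old_blocks() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -name '0*' -mtime +"$2"
}
```

Usage: list_old_blocks /var/lib/univention-appcenter/apps/prometheus/data/data 365 | less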

Now start the Prometheus container, either with Docker directly or with the univention-app command.

docker ps -a

CONTAINER ID        IMAGE                                               COMMAND                  CREATED             STATUS              PORTS                      NAMES
a437univention       docker.software-univention.de/prometheus:2.35.0-5   "/bin/prometheus --c…"   8 hours ago         Up 4 hours          127.0.0.1:9090->9090/tcp   stupefied_nobel

docker start a437univention

or

univention-app start prometheus

Starting docker-app-prometheus (via systemctl): docker-app-prometheus.service.

To check whether the Prometheus process is running:
ps aufx | grep prometheus

prometh+   504  0.0  0.3 1001428 28664 ?       Ssl  Dez06   0:50 /usr/bin/prometheus-node-exporter --web.listen-address 127.0.0.1:9100 --web.telemetry-path=/metrics-node/metrics/
nobody   23760  0.4  1.3 984596 111752 ?       Ssl  19:56   0:01      \_ /bin/prometheus --config.file=/etc/prometheus/prometheus.yml --web.route-prefix=/metrics-prometheus/ --web.external-url=/metrics-prometheus/ --query.lookback-delta=12h --storage.tsdb.retention.time=15d --storage.tsdb.allow-overlapping-blocks --web.enable-lifecycle
root     24602  0.0  0.0   6416   816 pts/2    S+   20:02   0:00  |   \_ grep prometheus

Check the status of the container:

univention-app status prometheus

● docker-app-prometheus.service - LSB: Start the Container for prometheus
   Loaded: loaded (/etc/init.d/docker-app-prometheus; generated)
   Active: active (exited) since Tue 2023-12-12 19:56:41 CET; 50min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 23671 ExecStart=/etc/init.d/docker-app-prometheus start (code=exited, status=0/SUCCESS)

Dez 12 19:56:38 ucs-primary systemd[1]: Starting LSB: Start the Container for prometheus...
Dez 12 19:56:41 ucs-primary docker-app-prometheus[23671]: Starting prometheus Container a437univention ....
Dez 12 19:56:41 ucs-primary systemd[1]: Started LSB: Start the Container for prometheus.
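The service status above only shows that the init script ran successfully, not that Prometheus finished opening its storage. Prometheus provides a readiness endpoint at /-/ready below its route prefix, so a sketch like the following can be used as an additional check. The defaults are assumptions taken from the output in this article: the address 127.0.0.1:9090 from docker ps and the prefix /metrics-prometheus from the process arguments. The function names are hypothetical, and the sketch assumes curl is installed.

```shell
#!/bin/sh
# Build the readiness URL from a base address ($1) and route prefix ($2).
# Both defaults are assumptions based on the output shown in this article.
prometheus_ready_url() {
    base="${1:-http://127.0.0.1:9090}"
    prefix="${2:-/metrics-prometheus}"
    printf '%s%s/-/ready\n' "${base%/}" "${prefix%/}"
}

# Return 0 if the readiness endpoint answers with a success status.
prometheus_is_ready() {
    curl --silent --fail --max-time 5 --output /dev/null \
        "$(prometheus_ready_url "$@")"
}
```

For example, prometheus_is_ready && echo "Prometheus is ready" prints the message only once Prometheus reports that it is ready to serve traffic.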