Problem
The Docker container for UCS Dashboard / Prometheus does not start. Running docker logs for the container, or
univention-app logs prometheus
shows the following log output:
ts=2023-10-19T16:09:40.770Z caller=main.go:1097 level=error err="opening storage failed: reloadBlocks: 116 errors:
corrupted block 01G3TJRDXS3Y8P860405JGTS7T: mmap files: mmap, size 4970: cannot allocate memory;
corrupted block 01G3TJRMNCGCXY357ZA5XNZ472: mmap files: mmap, size 4833: cannot allocate memory;
corrupted block 01G3TJS4A8TR42PX1W85GS4G72: mmap files: mmap, size 4630: cannot allocate memory;
corrupted block 01G3TJRBWWYV82Y7CYA7RNNJ0G: mmap files: mmap, size 5078: cannot allocate memory;
corrupted block 01G3TJRKA2MAFRVWZGNKBVJ4MC: mmap files: mmap, size 4728: cannot allocate memory
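To see how many blocks are affected, the block IDs can be pulled out of the log output. The following is a minimal sketch, assuming the log lines have the "corrupted block <ID>" form shown above; corrupted_block_ids is a hypothetical helper name, not part of UCS or Prometheus.

```shell
#!/bin/sh
# Hypothetical helper: read log text on stdin and print each
# corrupted block ID once, one per line.
# Assumes the "corrupted block <ID>" wording shown in the log above.
corrupted_block_ids() {
    grep -o 'corrupted block [0-9A-Z]*' | awk '{ print $3 }' | sort -u
}

# Example usage (not executed here):
#   univention-app logs prometheus 2>&1 | corrupted_block_ids | wc -l
```

The resulting IDs correspond to directory names in the Prometheus data directory.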
Environment
univention-app info
UCS: 5.0-5
Installed: admin-dashboard=3.0 prometheus-node-exporter=2.0.1 4.4/prometheus=2.35.0-5
Upgradable:
Solution
1. Make sure the system has sufficient RAM and that it is not fully utilized:
free -m
              total        used        free      shared  buff/cache   available
Mem:           7978        1651        2940          34        3386        6013
Swap:          7627           0        7627
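If this check should be scripted, a rough sketch could look like the following. The function names and the 1024 MiB threshold are illustrative assumptions, not values recommended by Univention. The sketch also assumes the free output layout shown above, where "available" is the seventh field of the "Mem:" row.

```shell
#!/bin/sh
# Hypothetical threshold in MiB; choose a value that suits the host.
MIN_AVAILABLE_MIB=1024

# Print the "available" column from `free -m` output read on stdin.
# Assumes the layout shown above ("available" is field 7 of "Mem:").
available_mib() {
    awk '$1 == "Mem:" { print $7 }'
}

# Succeed if at least MIN_AVAILABLE_MIB MiB are reported as available.
has_enough_memory() {
    avail=$(available_mib)
    [ -n "$avail" ] && [ "$avail" -ge "$MIN_AVAILABLE_MIB" ]
}

# Example usage (not executed here). LC_ALL=C keeps the row labels in
# English on systems with a non-English locale:
#   LC_ALL=C free -m | has_enough_memory || echo "low memory"
```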
2. Check the UCR variable for the Prometheus TSDB retention:
ucr info prometheus/storage/tsdb/retention
prometheus/storage/tsdb/retention: 15d
Prometheus includes a local on-disk time series database (TSDB); its default retention time is 15 days.
For more information, see:
https://prometheus.io/docs/prometheus/latest/storage/
3. Check the TSDB data directory for a large number of outdated block directories:
ls -lah /var/lib/univention-appcenter/apps/prometheus/data/data | less
drwxr-x--- 3 nobody root 4096 Mai 23 2022 01G2M0RD78Y9N21ZD92H1NFHZ6
drwxr-x--- 3 nobody root 4096 Mai 10 2022 01G2NYHYEMMHB0PS8WJ5SH8S0H
drwxr-x--- 3 nobody root 4096 Mai 10 2022 01G2QWBG04NPME1J6KB61D69SP
drwxr-x--- 3 nobody root 4096 Mai 11 2022 01G2ST50WZEHKZAPP9V0NJD3DX
drwxr-x--- 3 nobody root 4096 Mai 12 2022 01G2VQYJ6462BVJGN0PC816KW8
drwxr-x--- 3 nobody root 4096 Mai 13 2022 01G2XNR3EF579CGGJJX4E962E6
drwxr-x--- 3 nobody root 4096 Mai 13 2022 01G2ZKHMZE613A89TT2KNWBVPD
drwxr-x--- 3 nobody root 4096 Mai 14 2022 01G31HB5SF0EX6JS2FZSXYPZ9R
drwxr-x--- 3 nobody root 4096 Mai 15 2022 01G33F4Q3TDFQKYSVHYKSJK0YQ
drwxr-x--- 3 nobody root 4096 Mai 16 2022 01G35CY8F5S6N1885WNKHEKG42
drwxr-x--- 3 nobody root 4096 Mai 16 2022 01G37AQSXEP14A0HW313Q4EB46
drwxr-x--- 3 nobody root 4096 Mai 17 2022 01G398HAWRSQZ56RE6X84X3MER
drwxr-x--- 3 nobody root 4096 Mai 18 2022 01G3B6AWJG5W9Z32RASNBQQRKZ
drwxr-x--- 3 nobody root 4096 Mai 19 2022 01G3D45EMZ51XFDZ5192FN10HP
drwxr-x--- 3 nobody root 4096 Mai 19 2022 01G3F1XZTFKEFYJRR0Y7ZB23HG
drwxr-x--- 3 nobody root 4096 Mai 20 2022 01G3GZQG0PYMEV4RQ3DC4JRBV3
drwxr-x--- 3 nobody root 4096 Mai 21 2022 01G3JXH1BS2J53XCEMHGGWBVQH
drwxr-x--- 3 nobody root 4096 Mai 22 2022 01G3MVAJQSYYBC924FZ3M6REBJ
drwxr-x--- 3 nobody root 4096 Mai 22 2022 01G3PS43RNE0ECWK7G7ZRJS0SC
In this case there were more than 32,000 TSDB block directories. You can check the count with:
ls -lah /var/lib/univention-appcenter/apps/prometheus/data/data | wc -l
32853
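Note that the number above also includes the summary line printed by ls, the . and .. entries, and any non-block entries in the directory. A minimal sketch that counts only block directories follows. It assumes, like the mv pattern used in this article, that block directory names start with 0; count_blocks is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count the direct subdirectories of "$1" whose
# names start with "0" (assumed here to be the TSDB block directories).
count_blocks() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -name '0*' | wc -l | tr -d ' '
}

# Example usage (not executed here):
#   count_blocks /var/lib/univention-appcenter/apps/prometheus/data/data
```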
To fix this error so that the container starts again, move the blocks out of the data directory:
mkdir -p /root/univention/backup_prometheus
mv /var/lib/univention-appcenter/apps/prometheus/data/data/0* /root/univention/backup_prometheus
Hint
If the TSDB blocks are older than 365 days, you could also delete them directly instead of moving them; a backup older than one year is rarely needed. This decision is up to you.
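With tens of thousands of entries, a shell glob passed to mv can exceed the kernel's argument-length limit. The sketch below is one possible alternative that moves only blocks older than a given number of days and lets find batch the arguments. It assumes GNU find and GNU mv (for -t), uses the directory modification time as a proxy for block age, and assumes block names start with 0; move_old_blocks is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: move block directories directly below SRC that
# were last modified more than DAYS days ago into DEST.
# Usage: move_old_blocks SRC DEST DAYS
move_old_blocks() {
    src=$1
    dest=$2
    days=$3
    mkdir -p "$dest" || return 1
    # "-exec ... {} +" passes the matches to mv in batches, so the
    # argument list never grows beyond what the system allows.
    find "$src" -mindepth 1 -maxdepth 1 -type d -name '0*' \
        -mtime +"$days" -exec mv -t "$dest" -- {} +
}

# Example usage (not executed here), with the paths from this article:
#   move_old_blocks /var/lib/univention-appcenter/apps/prometheus/data/data \
#       /root/univention/backup_prometheus 365
```

Running the same find expression with -print instead of -exec first shows which directories would be moved.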
Now start the Prometheus container, either with Docker directly or with the univention-app command:
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a437univention docker.software-univention.de/prometheus:2.35.0-5 "/bin/prometheus --c…" 8 hours ago Up 4 hours 127.0.0.1:9090->9090/tcp stupefied_nobel
docker start a437univention
or
univention-app start prometheus
Starting docker-app-prometheus (via systemctl): docker-app-prometheus.service.
To check whether the Prometheus process is running:
ps aufx | grep prometheus
prometh+ 504 0.0 0.3 1001428 28664 ? Ssl Dez06 0:50 /usr/bin/prometheus-node-exporter --web.listen-address 127.0.0.1:9100 --web.telemetry-path=/metrics-node/metrics/
nobody 23760 0.4 1.3 984596 111752 ? Ssl 19:56 0:01 \_ /bin/prometheus --config.file=/etc/prometheus/prometheus.yml --web.route-prefix=/metrics-prometheus/ --web.external-url=/metrics-prometheus/ --query.lookback-delta=12h --storage.tsdb.retention.time=15d --storage.tsdb.allow-overlapping-blocks --web.enable-lifecycle
root 24602 0.0 0.0 6416 816 pts/2 S+ 20:02 0:00 | \_ grep prometheus
Check the status of the container:
univention-app status prometheus
● docker-app-prometheus.service - LSB: Start the Container for prometheus
Loaded: loaded (/etc/init.d/docker-app-prometheus; generated)
Active: active (exited) since Tue 2023-12-12 19:56:41 CET; 50min ago
Docs: man:systemd-sysv-generator(8)
Process: 23671 ExecStart=/etc/init.d/docker-app-prometheus start (code=exited, status=0/SUCCESS)
Dez 12 19:56:38 ucs-primary systemd[1]: Starting LSB: Start the Container for prometheus...
Dez 12 19:56:41 ucs-primary docker-app-prometheus[23671]: Starting prometheus Container a437univention ....
Dez 12 19:56:41 ucs-primary systemd[1]: Started LSB: Start the Container for prometheus.