Letsencrypt Cron Failed to fire

texas-aggie · January 5, 2019, 1:48am

I have a 4.3.x install (Active Directory) that was deployed in March '18. The certs renewed successfully through September but failed to renew in December, and they expired on the 25th of December.

The logs show no attempts were made after September.

The system has been updated via policy from 4.2.x to 4.3.3 along the way. As a side note, I’ll be stopping that practice as it has created other serviceability problems.

Using the GUI configuration tool for Letsencrypt, I refreshed the certs, and everything is working as expected now.

What steps do I need to take to trace where this failed and what should happen to ensure any missed updates get thrown back into the queue before an expiry event occurs?

lutz.willek · January 5, 2019, 10:38am

Hi,
you are not alone, we had similar problems: the certificates were always generated and installed successfully when the CLI or web interface was used, but renewal via cron sometimes failed, for several reasons. Don’t get me wrong: we have been using LE for a very long time, and there were some bugs in the UCS implementation, so the pitfalls were expected.

Your current problem may be caused by http://errata.software-univention.de/ucs/4.3/350.html. Please look in the file /var/log/univention/letsencrypt.log for the relevant information, and post it here if needed.

The following does not prevent the problem, but it will let you recognize it early enough to react. You should consider watching the expiry date of your certificates through monitoring. If you don’t have monitoring yet, you could use https://www.univention.com/products/ucs/functions/monitoring-nagios/ . It’s relatively simple to add custom rules there to monitor the expiry date of certificates, something like this:

# File: /etc/nagios-plugins/config/local.cfg
# -C27,20: WARNING below 27 days of remaining validity, CRITICAL below 20
define command{
        command_name    expire_http_cert
        command_line    /usr/lib/nagios/plugins/check_http -C27,20 -H '$HOSTADDRESS$' -I '$HOSTADDRESS$'
        }
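
If you want to see what that check does without Nagios, here is a rough Python sketch of the same idea. It is not the check_http implementation, just an illustration using the standard library. The function names are made up for this example, and the 27/20 thresholds simply mirror the -C27,20 option above.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host, port=443, timeout=10.0):
    """Return the certificate's notAfter string, e.g. 'Dec 25 12:00:00 2018 GMT'.

    The default context verifies the chain, so an already expired
    certificate raises ssl.SSLError instead of returning a date.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until(not_after, now):
    """Whole days between `now` (timezone-aware) and the notAfter timestamp."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def classify(days_left, warn=27, crit=20):
    """Map remaining days to a Nagios-style state (thresholds are assumptions)."""
    if days_left < crit:
        return "CRITICAL"
    if days_left < warn:
        return "WARNING"
    return "OK"


# Example use (needs network access; replace the host with your own):
#   left = days_until(fetch_not_after("ucs.example.org"), datetime.now(timezone.utc))
#   print(classify(left), left)
```

With a 90-day LE certificate that normally renews about 30 days before expiry, a warning at 27 days would show up shortly after a renewal has been missed.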

texas-aggie · January 5, 2019, 5:38pm

Thank you so much.

There is simply nothing in the log file after the last successful update. The affected system isn’t configured for the SAML SSO noted in the bug report, which leads me to believe cron never fired.
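
One way to test that assumption is to look for cron's own log entries. This is a minimal sketch, assuming the Debian-style syslog format where cron writes lines such as `CRON[pid]: (user) CMD (command)`. The keyword "letsencrypt" and the log path in the usage comment are guesses that may not match the actual job on this system, and rotated logs may no longer reach back to September.

```python
import re

# Matches cron's "command started" entries as written to syslog on Debian-based systems.
CRON_CMD = re.compile(r"CRON\[\d+\]: \((?P<user>[^)]+)\) CMD \((?P<cmd>.*)\)\s*$")


def cron_runs(lines, keyword):
    """Return the log lines where cron started a command containing `keyword`."""
    hits = []
    for line in lines:
        match = CRON_CMD.search(line)
        if match and keyword in match.group("cmd"):
            hits.append(line.rstrip("\n"))
    return hits


# Example use (path and keyword are assumptions, adjust to the system):
#   with open("/var/log/syslog", errors="replace") as fh:
#       for hit in cron_runs(fh, "letsencrypt"):
#           print(hit)
```

If this returns nothing for a period when the job should have run, that points at the cron entry itself (missing, renamed, or disabled) rather than at the renewal script.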

Adding the Nagios notification will be extremely helpful in preventing service issues. I appreciate you taking the time to offer those details.