Problem: UMC Diagnostic Module Complains about Problems with UDN Replication



Problem

The diagnostic module in Univention Management Console (UMC) reports a warning about problems with UDN replication similar to:

Error retrieving notifier ID from the UDN.

or

Univention Directory Notifier ID and the locally stored version differ.

This might indicate an error or still processing transactions.

Solution

Step 1

Please log on to the console as root, e.g. via SSH, and run the command /usr/lib/nagios/plugins/check_univention_replication to check the replication state of the system:

/usr/lib/nagios/plugins/check_univention_replication

This command may output an error message like the following:

CRITICAL: no change of listener transaction id for last 0 checks (nid=3030 lid=3018)

This output says that the replication is 12 transactions behind: the notifier ID (nid) is 3030, while the listener ID (lid) is only 3018. If all services are running correctly, the outstanding transactions are usually processed within a few seconds:

root@shell:~# /usr/lib/nagios/plugins/check_univention_replication

OK: replication complete (nid=3030 lid=3030)

In this case, the replication is up to date.
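The arithmetic behind "12 transactions behind" can be sketched in a few lines of Python. This is only an illustration that parses the two sample output lines shown above; the function name replication_lag is hypothetical and not part of any UCS tool.

```python
import re

def replication_lag(plugin_output):
    """Return (nid, lid, lag) parsed from one output line, or None if no IDs are found."""
    match = re.search(r"nid=(\d+)\s+lid=(\d+)", plugin_output)
    if match is None:
        # e.g. "nid=Error: ..." when the notifier cannot be reached
        return None
    nid, lid = int(match.group(1)), int(match.group(2))
    return nid, lid, nid - lid

print(replication_lag("CRITICAL: no change of listener transaction id for last 0 checks (nid=3030 lid=3018)"))
print(replication_lag("OK: replication complete (nid=3030 lid=3030)"))
```

A lag of 0 corresponds to the "replication complete" state. A lag that stays above 0 over repeated runs suggests that the replication is stuck.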

If the replication got stuck, the following services should be restarted on the involved systems:

service univention-directory-notifier restart
service univention-directory-listener restart

If check_univention_replication returns the message CRITICAL: failed.ldif exists, please follow

Step 2

Check for notifier service on UCS Master and Backup systems

If check_univention_replication returns the message CRITICAL: no change of listener transaction on the UCS Master, together with the error message nid=Error: [Errno 111] Connection refused, then you may use the following command to check if the notifier service is running at all:

pgrep -f /usr/sbin/univention-directory-notifier

This command should return the process ID number. If it doesn’t, then you can check the status of the service by running the command

sv status univention-directory-notifier | sed -n 's/:.*//p'

Normally this should output run. If the service has been stopped normally (for example temporarily during an update), the status will be down. If the service was terminated for unknown reasons (e.g. due to a program crash), the status will be finished. If the service is not running, you can try to start it again by running the command

service univention-directory-notifier start

After this step, please continue the analysis by returning to Step 1. In case the service doesn’t resume normal operation, you may check the log file /var/log/univention/notifier.log for recent ERROR messages before continuing with Step 3. In case you finally need to open a support ticket, the error messages may be helpful for analysis.

You should also check the state of the notifier service on UCS Backup servers. The notifier service should be running on each UCS Backup server to support failover for UCS Slave and Memberserver systems.
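To make clear what the sed expression in the sv status command does, here is a minimal Python sketch of the same text operation: it keeps everything before the first colon and prints nothing for lines without a colon. The sample status line is only illustrative and the function name service_state is hypothetical.

```python
def service_state(sv_status_line):
    """Mimic sed -n 's/:.*//p': return the text before the first colon, else None."""
    state, separator, _rest = sv_status_line.partition(":")
    return state if separator else None

# Illustrative input only; the real line comes from "sv status <service>".
print(service_state("run: univention-directory-notifier: (pid 1234) 56s"))
```

The same helper applies unchanged to the output for univention-directory-listener.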

Check for local listener on all UCS systems

If repeated runs of check_univention_replication continue returning the message CRITICAL: no change of listener transaction id and the lid (listener ID) doesn’t change between calls, then you should check if the listener service is running. This should be checked on every UCS server. It’s a Nagios check, so if you are running Nagios, you may use that to obtain a quick overview. If the service is not running on one of the systems, you should connect to that system via SSH as root and use the following command to check if the listener service is running at all:

pgrep -f /usr/sbin/univention-directory-listener

This command should return the process ID number. If it doesn’t, then you can check the status of the service by running the command

sv status univention-directory-listener | sed -n 's/:.*//p'

Normally this should output run. If the service has been stopped normally (for example temporarily during an update), the status will be down. If the service was terminated for unknown reasons (e.g. due to a program crash), the status will be finished. If the service is not running, you can try to start it again by running the command

service univention-directory-listener start

After this step, please continue the analysis by returning to Step 1. In case the service doesn’t resume normal operation, you may check the log file /var/log/univention/listener.log for recent ERROR messages. In case you finally need to open a support ticket, the error messages may be helpful for analysis.

Step 3

Check cn=translog database size on DC Master

If the notifier and listener services are running, you can continue by checking if the storage capacity of the cn=translog backend database is exhausted. This check should be done on the UCS Master as well as on UCS Backup servers. This situation can only arise on servers with an amd64 processor architecture. On i386 servers, the cn=translog backend database is stored in a Berkeley Database instead, which is only limited by the size of the underlying partition.

On amd64 servers you may use the following commands run as root to calculate the percentage of use of the translog database:

used_pages=$(mdb_stat -e /var/lib/univention-ldap/translog | sed -n 's/^ *Number of pages used: //p')
max_pages=$(mdb_stat -e /var/lib/univention-ldap/translog | sed -n 's/^ *Max pages: //p')
python -c "print('%.1f used' % (float($used_pages) / $max_pages * 100))"

The last command should output a value between 0 and 100. If this value is close to 100, the database usage is close to its limit and you should consider raising the limit. But first, you should check whether you have actually hit the limit already by running the command

grep 'MDB_MAP_FULL: Environment mapsize limit reached' /var/log/syslog

By default the size limit of the database is set to 2GB (2147483648 bytes). You may set the UCR variable ldap/database/mdb/maxsize to a higher value. Please note that the value must be given in bytes. To activate the new database size, a simple restart of the LDAP server is sufficient. This may be done by running service slapd restart.

This should be checked on each UCS server of role Master or Backup.
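The usage calculation and the byte value for ldap/database/mdb/maxsize can also be sketched in Python. The sample text below only imitates the two mdb_stat -e lines that the sed commands above match, with made-up page counts. The function names are hypothetical.

```python
import re

def translog_usage_percent(mdb_stat_output):
    """Compute used/max pages in percent from 'mdb_stat -e' style text."""
    used = int(re.search(r"^\s*Number of pages used:\s*(\d+)", mdb_stat_output, re.M).group(1))
    maximum = int(re.search(r"^\s*Max pages:\s*(\d+)", mdb_stat_output, re.M).group(1))
    return 100.0 * used / maximum

def gib_to_bytes(gib):
    """Convert a size in GiB to the byte value expected by the UCR variable."""
    return gib * 1024 ** 3

# Made-up sample values, not real measurements.
sample = """\
  Max pages: 524288
  Number of pages used: 262144
"""
print("%.1f used" % translog_usage_percent(sample))  # 50.0 used
print(gib_to_bytes(2))  # 2147483648, the default mentioned above
print(gib_to_bytes(4))  # 4294967296
```

The value returned by gib_to_bytes is what you would pass to the UCR variable, for example 4294967296 to double the default limit.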

Please continue with Step 4.

Step 4

This step checks whether the notify/transaction or the listener/listener files are corrupted, e.g. because the hard disk ran full at some point in the past.

Since UCS 4.3 errata470 and UCS 4.4 errata33, the command /usr/share/univention-directory-notifier/univention-translog check can also be used to check for this issue, and the command /usr/share/univention-directory-notifier/univention-translog check --fix to correct it.
If no errors are reported, please continue with Step 5. Otherwise, please continue with Step 8.

Alternatively, the following script can be used:

python -c "
#!/usr/bin/env python
import ldap
for transactionfile in ('notify/transaction', 'listener/listener'):
  filepath = '/var/lib/univention-ldap/%s' % transactionfile
  print('Checking %s' % filepath)
  with open(filepath, 'r') as f:
    lc = 0
    for line in f:
      lc += 1
      head_tail = line.strip().split(' ', 1)
      if len(head_tail) != 2:
        print('ERROR missing second column at line %d: \"%s\"' % (lc, line))
        break
      (id, tail) = head_tail
      try:
        cur_lc = int(id)
      except ValueError:
        print('ERROR at line %d: \"%s\"' % (lc, line))
        break
      head_tail = tail.rsplit(' ', 1)
      if len(head_tail) != 2:
        print('ERROR missing third column at line %d: \"%s\"' % (lc, line))
        break
      (dn, opcode) = head_tail
      if not ldap.dn.is_dn(dn):
        print('ERROR not a valid DN at line %d: \"%s\"' % (lc, line))
        break
    else:
      print('OK')
      continue
    break
"

When copying this script into a terminal or file please make sure to keep the indentation, as the Python programming language depends on this.
If no errors are reported, please continue with Step 5. Otherwise, please continue with Step 8.
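The line format that the check in this step expects is an integer ID, a DN, and a one-letter opcode, separated by single spaces. As a rough sketch, the splitting logic can be written as a small standalone function. It deliberately omits the DN validation via python-ldap, and both the function name and the sample DN are hypothetical.

```python
def parse_transaction_line(line):
    """Split '<id> <dn> <opcode>' into (int, str, str); raise ValueError if malformed."""
    head_tail = line.strip().split(" ", 1)
    if len(head_tail) != 2:
        raise ValueError("missing second column")
    transaction_id = int(head_tail[0])  # raises ValueError if not an integer
    # Split from the right, because the DN itself may contain spaces.
    dn_opcode = head_tail[1].rsplit(" ", 1)
    if len(dn_opcode) != 2:
        raise ValueError("missing third column")
    dn, opcode = dn_opcode
    return transaction_id, dn, opcode

# Hypothetical sample line
print(parse_transaction_line("3030 uid=john doe,cn=users,dc=example,dc=com m"))
```

Splitting the ID from the left and the opcode from the right is what allows DNs containing spaces to pass the check.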

Step 5

Next, the transaction file should be checked for contiguous numbering.

Since UCS 4.3 errata470 and UCS 4.4 errata33, the command /usr/share/univention-directory-notifier/univention-translog check can also be used to check for this issue, and the command /usr/share/univention-directory-notifier/univention-translog check --fix to correct it.

Alternatively, this can be accomplished by running the following Python script:

python -c "
#!/usr/bin/env python
with open('/var/lib/univention-ldap/notify/transaction', 'r') as transaction:
  lc = 0
  for line in transaction:
    lc += 1
    (id, tail) = line.strip().split(' ', 1)
    try:
      cur_id = int(id)
    except ValueError:
      print('ERROR at line %d, does not start with an integer number: \"%s\"' % (lc, line))
      break
    if lc == 1:
      start_id = cur_id
    if cur_id != (lc - 1 + start_id):
      print('ERROR at line %d, transaction IDs not contiguous: \"%s\"' % (lc, line))
      break
  else:
    print('OK')
"

When copying this script into a terminal or file please make sure to keep the indentation, as the Python programming language depends on this.
If this script reports a "not contiguous" error, you may use an editor to fill the gaps in the numbering. Please make a backup copy of the transaction file (/var/lib/univention-ldap/notify/transaction) first. The error message indicates the line at which the numbering stops being contiguous. You may fill the gap by inserting contiguously numbered lines of the following format:

number  $ldap_base   m

The trailing letter m is a literal “m” and represents a dummy modification. Please replace $ldap_base by the LDAP base of your UCS domain (command: ucr get ldap/base). After changing the file, please run the check script again and adjust as necessary.

If the script above outputs the error does not start with an integer number, then you should continue with Step 8 below, otherwise continue with Step 6.
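When a gap spans more than a few IDs, typing the filler lines by hand is error-prone. The following sketch generates them. It assumes a single space between the three columns, since the check scripts split on single spaces. The function name and the sample LDAP base are hypothetical.

```python
def filler_lines(last_good_id, next_id, ldap_base):
    """Return dummy 'm' lines for every ID strictly between the two given IDs."""
    return ["%d %s m" % (number, ldap_base) for number in range(last_good_id + 1, next_id)]

# Example: the file jumps from ID 3018 to ID 3021.
for filler in filler_lines(3018, 3021, "dc=example,dc=com"):
    print(filler)
# 3019 dc=example,dc=com m
# 3020 dc=example,dc=com m
```

Replace the sample base with the output of ucr get ldap/base and paste the generated lines into the backed-up transaction file at the position of the gap.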

Step 6

Check for transaction file duplicates

Next, the transaction file should be checked for duplicates of transaction IDs.

Since UCS 4.3 errata470 and UCS 4.4 errata33, the command /usr/share/univention-directory-notifier/univention-translog check can also be used to check for this issue, and the command /usr/share/univention-directory-notifier/univention-translog check --fix to correct it.

Alternatively, this can be accomplished by running the following Python script:

python -c "
#!/usr/bin/env python
with open('/var/lib/univention-ldap/notify/transaction', 'r') as transaction:
  lc = 0
  last_id = -1
  for line in transaction:
    lc += 1
    (id, tail) = line.strip().split(' ', 1)
    try:
      cur_id = int(id)
    except ValueError:
      print('ERROR at line %d: \"%s\"' % (lc, line))
      break
    if last_id >= cur_id:
      print('ERROR: duplicate at line %d: id %d reused' % (lc, cur_id))
      break
    last_id = cur_id
  else:
    print('OK')
"

When copying this script into a terminal or file please make sure to keep the indentation, as the Python programming language depends on this.

If this script returns an error message, you may run the command

cat /var/lib/univention-ldap/notify/transaction | sort -u > transaction-new
mv /var/lib/univention-ldap/notify/transaction \
      /var/lib/univention-ldap/notify/transaction_backup
mv transaction-new /var/lib/univention-ldap/notify/transaction
mv /var/lib/univention-ldap/notify/transaction.index  \
      /var/lib/univention-ldap/notify/transaction_backup.index 

After that, please run the duplicate check script again. If it no longer returns an error, the duplicate entries were trivial (identical lines) and are now resolved.
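Before touching the file, it can help to know whether the duplicates are trivial at all. The sketch below separates IDs whose repeated lines are identical from IDs that occur with differing content. Only the first kind can be resolved by removing duplicate lines. Note also that sort -u orders lines as text rather than numerically, which is one more reason to re-run the check afterwards. The function name and sample lines are hypothetical.

```python
def find_duplicate_ids(lines):
    """Return (trivial, conflicting) lists of IDs that occur more than once."""
    first_seen = {}
    trivial, conflicting = set(), set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        transaction_id = int(line.split(" ", 1)[0])
        if transaction_id not in first_seen:
            first_seen[transaction_id] = line
        elif first_seen[transaction_id] == line:
            trivial.add(transaction_id)      # exact copy of an earlier line
        else:
            conflicting.add(transaction_id)  # same ID, different content
    return sorted(trivial - conflicting), sorted(conflicting)

sample = [
    "5 dc=example,dc=com m",
    "6 dc=example,dc=com m",
    "6 dc=example,dc=com m",
    "7 dc=example,dc=com m",
    "7 cn=users,dc=example,dc=com d",
]
print(find_duplicate_ids(sample))  # ([6], [7])
```

In this example, ID 6 is a trivial duplicate, while ID 7 is reused for two different entries and would still be reported by the check after deduplication.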

If you continue to find errors in the transaction file, please continue with Step 8.

Step 7

Check for correct last_id:

The value in /var/lib/univention-ldap/last_id should be checked. If the file /var/lib/univention-ldap/listener/listener contains any lines, then the value in last_id should match the first value in the last line of /var/lib/univention-ldap/listener/listener. If the file /var/lib/univention-ldap/listener/listener is empty, last_id should match the first value in the last line of the file /var/lib/univention-ldap/notify/transaction:

cat /var/lib/univention-ldap/last_id ; echo
tail -1 /var/lib/univention-ldap/notify/transaction | awk '{print $1}'
tail -1 /var/lib/univention-ldap/listener/listener | awk '{print $1}'
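The rule for comparing the three values can be sketched as follows. The function works on lists of lines rather than on the real files, and its name is hypothetical.

```python
def expected_last_id(listener_lines, transaction_lines):
    """Return the ID that last_id should contain according to the rule above."""
    listener = [line for line in listener_lines if line.strip()]
    transaction = [line for line in transaction_lines if line.strip()]
    # A non-empty listener file takes precedence over the transaction file.
    source = listener or transaction
    if not source:
        return None
    return int(source[-1].split()[0])  # first column of the last line, like awk '{print $1}'

print(expected_last_id([], ["3029 dc=example,dc=com m", "3030 dc=example,dc=com m"]))
print(expected_last_id(["3031 dc=example,dc=com m"], ["3030 dc=example,dc=com m"]))
```

If the content of /var/lib/univention-ldap/last_id differs from the value computed this way, the files are inconsistent.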

Since UCS 4.3 errata470 and UCS 4.4 errata33, the command /usr/share/univention-directory-notifier/univention-translog check can also be used to check for this issue, and the command /usr/share/univention-directory-notifier/univention-translog check --fix to correct it.

Step 8

In case severe inconsistency or corruption has been detected in the translog file, we recommend resetting the replication for all systems in the domain. Please note that this is a major operation and will cause temporary downtime for all services in the UCS domain. The recommended steps are described in the following SDB article:


How to reset Listener / Notifier replication