Listener not synchronising on samba4 backend machines

Mbo42 · March 8, 2018, 7:01am

I have 5 servers, all running:
UCS: 4.2-3 errata311
App Center compatibility: 4

The domaincontroller_master, domaincontroller_backup and domaincontroller_slave all run the samba4 DNS backend,
and their listeners do not stay in sync with the master.

There are also two member servers, and both return an empty value if I run
ucr get dns/backend
These servers, however, do stay in sync with the master.

univention-check-join-status returns Joined successfully for all servers

I have tried to re-synchronise the listeners as described in the article "Recreate listener cache".
This works, but the servers then get out of sync again.

I have rejoined the servers but that makes no difference.

This is my first Univention deployment so any suggestions would be appreciated.
Thanks

Moritz_Bunkus · March 8, 2018, 8:06am

Hey,

How exactly do you determine that the listeners are out of sync?

When that happens, please verify the following:

Is the process univention-directory-notifier running on the DC Master?
Is the process univention-directory-listener running on the machine that isn’t in sync with the DC Master?
What does the corresponding log file on the machine that isn’t in sync say (/var/log/univention/listener.log)?

The empty value for dns/backend on the member servers is to be expected, as member servers typically don’t run DNS servers.

Kind regards,
mosu

Mbo42 · March 8, 2018, 8:47am

Yes

   root@phobos:/home/talcom# ps ax |grep univention-directory-notifier
   854 ?        Ss     0:00 runsv univention-directory-notifier
   1237 ?        S      0:07 /usr/sbin/univention-directory-notifier -o -d 1 -F

Not properly by the look of it.
This is from a ‘bad’ machine

root@deimos:/home/talcom# ps ax|grep univention-directory-listener
  880 ?        Ss     0:20 runsv univention-directory-listener

And this is from a ‘good’ machine
root@socrates:/home/talcom# ps ax|grep univention-directory-listener

1111 ? Ss 0:00 runsv univention-directory-listener
1403 ? S 0:01 /usr/sbin/univention-directory-listener -F -d 2 -b dc=thedomain,dc=intranet -m /usr/lib/univention-directory-listener/system -c /var/lib/univention-directory-listener -ZZ -x -D cn=socrates,cn=memberserver,cn=computers,dc=thedomain,dc=intranet -y /etc/machine.secret
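
The difference between the two machines is visible in the ps output: on deimos only the runsv supervisor is listed, while on socrates the daemon itself (/usr/sbin/univention-directory-listener) appears as well. A minimal Python sketch of that check, assuming the `ps ax` column layout shown above; the function name daemon_running is made up for illustration and the "good" sample is shortened:

```python
def daemon_running(ps_output, binary):
    """Return True if `ps ax` output lists the daemon binary itself.

    Lines for the runsv supervisor (or for the grep that produced the
    listing) do not count, because their command starts with another word.
    """
    for line in ps_output.splitlines():
        fields = line.split(None, 4)  # PID, TTY, STAT, TIME, COMMAND
        if len(fields) < 5:
            continue
        command = fields[4]
        if command.split()[0] == binary:
            return True
    return False


LISTENER = "/usr/sbin/univention-directory-listener"

# Output from the 'bad' machine: supervisor only.
bad = "  880 ?        Ss     0:20 runsv univention-directory-listener"

# Output from the 'good' machine (command line shortened here).
good = (
    " 1111 ? Ss 0:00 runsv univention-directory-listener\n"
    " 1403 ? S 0:01 /usr/sbin/univention-directory-listener -F -d 2"
)

print(daemon_running(bad, LISTENER))
print(daemon_running(good, LISTENER))
```

On the 'bad' machine this reports that the listener daemon is not running, which matches the "Not properly" answer above.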

08.03.18 19:40:20.150 DEBUG_INIT
UNIVENTION_DEBUG_BEGIN : uldap.__open host=phobos.thedomain.intranet port=7389 base=dc=thedomain,dc=intranet
UNIVENTION_DEBUG_END : uldap.__open host=phobos.thedomain.intranet port=7389 base=dc=thedomain,dc=intranet
08.03.18 19:40:20.936 LISTENER ( ERROR ) : ‘failed.ldif’ exists. Check for /var/lib/univention-directory-replication/failed.ldif

failed.ldif is 1300 lines long but I can provide information as required.

Thanks for your help

Moritz_Bunkus · March 8, 2018, 11:25am

Hey,

yeah, please show us the failed.ldif file. My assumption at this point is that you have LDAP schema extensions installed on the DC Master but not on the DC Backup and DC Slave. As both run their own LDAP servers, they must have the same schema files installed that the master has installed; otherwise they won’t be able to store the same data.

The member servers on the other hand do not run their own LDAP servers. Therefore this particular problem wouldn’t affect them.

Kind regards,
mosu

Mbo42 · March 8, 2018, 12:53pm

OK, I’ll attach it.
I noticed that the ldif contains some names of old machine entries that have been deleted (e.g. sisyphus).
Could this be relevant?
failed.ldif.txt (59.7 KB)

Moritz_Bunkus · March 8, 2018, 1:01pm

Hey,

thanks. It’s not obvious from the file itself what the issue is. Please follow the steps written in the following article:

https://help.univention.com/t/what-to-do-if-a-failed-ldif-is-found/6432

The goal is to get the LDAP server to re-try applying the failed.ldif and to observe what the log files contain at that point. That should lead us to the actual error messages, e.g. that the objects exist already or that an attribute isn’t known.
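
Before retrying, it can help to see which objects the file refers to. A minimal Python sketch that lists the DNs in an LDIF text, assuming the file follows the standard LDIF conventions (folded lines start with a single space, base64 values use "dn::"). The helper name ldif_dns and the sample records are invented for illustration and are not taken from the attached file:

```python
import base64


def ldif_dns(ldif_text):
    """Return the DNs of all records in an LDIF text, in file order."""
    # Undo LDIF line folding: a line starting with one space continues
    # the previous line.
    unfolded = []
    for line in ldif_text.splitlines():
        if line.startswith(" ") and unfolded:
            unfolded[-1] += line[1:]
        else:
            unfolded.append(line)

    dns = []
    for line in unfolded:
        if line.startswith("dn::"):
            # Base64-encoded DN (used for non-ASCII values).
            dns.append(base64.b64decode(line[4:].strip()).decode("utf-8"))
        elif line.startswith("dn:"):
            dns.append(line[3:].strip())
    return dns


# Invented sample; a real failed.ldif will look different.
sample = """\
dn: cn=oldhost,cn=computers,dc=example,
 dc=intranet
changetype: delete

dn: cn=other,dc=example,dc=intranet
changetype: modify
"""

for dn in ldif_dns(sample):
    print(dn)
```

For the real file, the text could be read from /var/lib/univention-directory-replication/failed.ldif and passed to the same function.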

Kind regards,
mosu

Grandjean · March 8, 2018, 1:54pm

Just as an addition for general troubleshooting and other readers:
UCS comes with a pre-configured monitoring check called “UNIVENTION_REPLICATION” that checks the replication state of the local UCS system and also warns if a failed.ldif exists. This check can be used either with the optional Nagios server App or with any other monitoring solution that can execute such a check (using NRPE, SSH etc.) and then alert the administrator.
The monitoring plugin can also be executed manually, if necessary:
/usr/lib/nagios/plugins/check_univention_replication

Mbo42 · March 8, 2018, 2:24pm

OK, that didn’t do a lot but I might have some new info.

/etc/init.d/slapd restart just sits on

[....] Starting slapd (via systemctl): slapd.service

for 5 minutes, i.e. it does not say Found failed.ldif Importing …

ps aux | grep /usr/sbin/univention-directory-replication-resync returns nothing

After 5 minutes it times out

[…] Starting slapd (via systemctl): slapd.serviceJob for slapd.service failed. See ‘systemctl status slapd.service’ and ‘journalctl -xn’ for details.
failed!

This is the output from systemctl status slapd.service

● slapd.service - LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol)
Loaded: loaded (/etc/init.d/slapd)
Active: failed (Result: timeout) since Fri 2018-03-09 00:49:22 AEDT; 2min 9s ago
Process: 15882 ExecStop=/etc/init.d/slapd stop (code=exited, status=0/SUCCESS)
Process: 22855 ExecStart=/etc/init.d/slapd start (code=exited, status=0/SUCCESS)
Main PID: 15460 (code=exited, status=0/SUCCESS)

Mar 09 00:44:22 deimos slapd[22855]: LDAP server already running.
Mar 09 00:44:22 deimos systemd[1]: PID file /var/run/slapd/slapd.pid not readable (yet?) after start.
Mar 09 00:49:22 deimos systemd[1]: slapd.service start operation timed out. Terminating.
Mar 09 00:49:22 deimos systemd[1]: Failed to start LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol).
Mar 09 00:49:22 deimos systemd[1]: Unit slapd.service entered failed state.

and this is from journalctl -xn (though this may have been too late to catch the relevant parts):

root@deimos:/home/talcom# journalctl -xn
-- Logs begin at Sun 2018-02-25 14:23:20 AEDT, end at Fri 2018-03-09 01:11:47 AEDT. --
Mar 09 01:10:41 deimos sshd[27528]: Accepted keyboard-interactive/pam for phobos$ from 192.168.20.3 port 32768 ssh2
Mar 09 01:10:41 deimos sshd[27528]: pam_unix(sshd:session): session opened for user phobos$ by (uid=0)
Mar 09 01:10:41 deimos sshd[27535]: Received disconnect from 192.168.20.3: 11: disconnected by user
Mar 09 01:10:41 deimos sshd[27528]: pam_unix(sshd:session): session closed for user phobos$
Mar 09 01:10:48 deimos CRON[27421]: pam_unix(cron:session): session closed for user root
Mar 09 01:11:19 deimos nrpe[27692]: Host 192.168.20.22 is not allowed to talk to us!
Mar 09 01:11:46 deimos nrpe[27767]: Host 192.168.20.22 is not allowed to talk to us!
Mar 09 01:11:47 deimos systemd[1]: slapd.service start operation timed out. Terminating.
Mar 09 01:11:47 deimos systemd[1]: Failed to start LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol).
-- Subject: Unit slapd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit slapd.service has failed.
-- 
-- The result is failed.
Mar 09 01:11:47 deimos systemd[1]: Unit slapd.service entered failed state.

And yes, the nagios plugin gives:

root@deimos:/home/talcom# /usr/lib/nagios/plugins/check_univention_replication
CRITICAL: failed.ldif exists (nid=2121 lid=2080)
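
The two numbers in that line can be compared directly. A minimal Python sketch, assuming nid is the notifier ID on the master and lid is the local listener ID, and that the plugin output always has the shape shown above; only this one line is known from the thread, and the function name is made up:

```python
import re


def parse_replication_status(line):
    """Split a check_univention_replication line into (state, nid, lid).

    Returns None if the line does not have the expected shape.
    """
    match = re.search(r"^(\w+):.*\(nid=(\d+) lid=(\d+)\)", line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))


status = parse_replication_status(
    "CRITICAL: failed.ldif exists (nid=2121 lid=2080)"
)
if status is not None:
    state, nid, lid = status
    # Difference between the master's and the local transaction ID.
    print(state, "- listener is", nid - lid, "transactions behind")
```

For the output above this gives a gap of 41 transactions between the master and deimos.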

I must be missing the point here, but if the “Recreate listener cache” procedure restores the notifier and listener IDs to the same number, why can’t I just delete the failed.ldif? I suppose I’m not sure how tolerant the system is of a “try it and see” approach.

I appreciate your help.

Moritz_Bunkus · March 9, 2018, 10:41am

Hey,

just nitpicking, but you’ll want to be careful when running the check manually. That script needs to create a cache file named /var/lib/univention-nagios/check_univention_replication.cache. If that file doesn’t exist and you run the check as root, then the cache file’s owner will be root, too. However, when the script is run as a Nagios check via the NRPE daemon, it isn’t run as root but as the user nagios, and the wrong file ownership will then prevent the cache file from being updated. If the cache file already exists and is owned by nagios, then running the check as root will preserve file ownership, though.
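
A minimal Python sketch for checking who owns the cache file before and after a manual run. The path is the one named above; the helper name file_owner is made up for illustration:

```python
import os
import pwd

CACHE = "/var/lib/univention-nagios/check_univention_replication.cache"


def file_owner(path):
    """Return the owner's user name for path, or None if it does not exist."""
    try:
        uid = os.stat(path).st_uid
    except FileNotFoundError:
        return None
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # The UID has no passwd entry; fall back to the number.
        return str(uid)


owner = file_owner(CACHE)
if owner is None:
    print("cache file does not exist yet")
elif owner != "nagios":
    print("cache file is owned by", owner, "and not by nagios")
else:
    print("cache file is owned by nagios")
```

If the file turns out to be owned by root, changing its owner back to nagios should let the NRPE-driven check update it again.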

mosu

Moritz_Bunkus · March 9, 2018, 10:42am

@Mbo42 Thanks, but I was more interested in the contents of the log files /var/log/univention/listener.log and /var/log/univention/ldap-replication-resync.log from the point in time when the LDAP server was restarted. Can you post those, too, please? Thanks.

mosu

Mbo42 · March 12, 2018, 8:08am

My apologies, I have been side-tracked for a few days.

Here are the latest tests.

I installed a new virtual DC_slave and got the same result, with the install failing at the domain join.
The failed.ldif on this machine was the same size (60 KB) as the one on the DC_backup.

Back on the DC_backup I repeated the routine in “What to do if a failed.ldif is found”:

root@deimos:~#/etc/init.d/slapd restart
[....] Restarting slapd (via systemctl): slapd.service

ps aux | grep /usr/sbin/univention-directory-replication-resync shows nothing running.

Eventually ldap fails as mentioned before.
slapd.service failed

There were no new entries in /var/log/univention/listener.log or /var/log/univention/ldap-replication-resync.log.

I deleted the failed.ldif and re-ran univention-join

The join log contained a naming conflict error:

12.03.18 17:11:30.948  LISTENER    ( WARN    ) : Set Schema ID to 6
12.03.18 17:11:30.948  LISTENER    ( WARN    ) : initializing module replication
12.03.18 17:12:30.439  LISTENER    ( ERROR   ) : replication: Naming violation; dn="krb5PrincipalName=ldap/apollo.enviroscience.intranet@ENVIROSCIENCE.INTRANET,cn=kerberos,dc=enviroscience,dc=intranet": Error
12.03.18 17:12:30.440  LISTENER    ( ERROR   ) :        additional info: value of single-valued naming attribute 'krb5PrincipalName' conflicts with value present in entry
12.03.18 17:12:32.405  LISTENER    ( WARN    ) : finished initializing module replication with rv=0
12.03.18 17:12:32.405  LISTENER    ( WARN    ) : initializing module bind
12.03.18 17:12:32.688  LISTENER    ( WARN    ) : finished initializing module bind with rv=0
12.03.18 17:12:32.689  LISTENER    ( WARN    ) : initializing module well-known-sid-name-mapping
12.03.18 17:12:39.663  LISTENER    ( WARN    ) : finished initializing module well-known-sid-name-mapping with rv=0
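
In a long join or listener log the ERROR entries are easy to miss among the WARN lines. A minimal Python sketch that filters them out, assuming every entry is on a single line in the "timestamp  MODULE  ( LEVEL ) : message" layout seen above; the function name is made up for illustration:

```python
import re

# timestamp, module name, level in parentheses, then the message
LOG_LINE = re.compile(
    r"^(\d{2}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+(\w+)"
    r"\s+\(\s*(\w+)\s*\)\s*:\s*(.*)$"
)


def entries_with_level(log_text, level="ERROR"):
    """Return (timestamp, message) pairs for log lines of the given level."""
    found = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group(3) == level:
            found.append((match.group(1), match.group(4)))
    return found


sample = """\
12.03.18 17:11:30.948  LISTENER    ( WARN    ) : initializing module replication
12.03.18 17:12:30.440  LISTENER    ( ERROR   ) :        additional info: value of single-valued naming attribute 'krb5PrincipalName' conflicts with value present in entry
12.03.18 17:12:32.405  LISTENER    ( WARN    ) : finished initializing module replication with rv=0
"""

for timestamp, message in entries_with_level(sample):
    print(timestamp, message)
```

Entries that were wrapped across two lines when pasted would need to be joined first, since the pattern only matches complete lines.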

listener.log now contains:

> Try to sync changes stored in /var/lib/univention-directory-replication/failed.ldif into local LDAP
>                      USER        PID ACCESS COMMAND
> /var/lib/univention-directory-replication/failed.ldif:
>                      root      22726 F.... univention-dire
> File still in use: /var/lib/univention-directory-replication/failed.ldif
> 12.03.18 17:23:57.911  DEBUG_INIT
> UNIVENTION_DEBUG_BEGIN  : uldap.__open host=phobos.enviroscience.intranet port=7389 base=dc=enviroscience,dc=intranet
> UNIVENTION_DEBUG_END    : uldap.__open host=phobos.enviroscience.intranet port=7389 base=dc=enviroscience,dc=intranet

Is ‘file still in use’ unexpected? After everything completed I ran
lsof |grep failed.ldif and nothing had the file open then.

And ldap-replication-resync.log

Try to sync changes stored in /var/lib/univention-directory-replication/failed.ldif into local LDAP
                     USER        PID ACCESS COMMAND
/var/lib/univention-directory-replication/failed.ldif:
                     root      22726 F.... univention-dire
File still in use: /var/lib/univention-directory-replication/failed.ldif
12.03.18 17:23:57.911  DEBUG_INIT
UNIVENTION_DEBUG_BEGIN  : uldap.__open host=phobos.enviroscience.intranet port=7389 base=dc=enviroscience,dc=intranet
UNIVENTION_DEBUG_END    : uldap.__open host=phobos.enviroscience.intranet port=7389 base=dc=enviroscience,dc=intranet

What can I do about the ‘Naming violation’ error on join?
Why doesn’t the slapd restart do anything, including re-reading the failed.ldif?
Why did I get this error on a fresh new server with no existing ldap database?

Thanks for your help.

Mbo42 · March 15, 2018, 12:43am

I decided to reinstall everything from scratch with the new UCS 4.3.
I believe the problems I had were due to corruption in the LDAP database caused by
novice mistakes and haste in the original setup.
The reinstall is done and all looks good so far.
Thanks for your help.