Cannot authenticate to DC backup/slave

Long story short: I ran out of disk space while on holiday, and this went unnoticed for at least a few days until I tried to upgrade all three servers. My master upgraded fine, but the slave was out of disk space, and I had to manually clean up some old backups in order to run the upgrade, which then appeared to complete without issue. I then tried authenticating with Kopano and to the web interface, and both now fail.

Reading through other posts, I tried re-joining to the master, and it is also failing with:

File: /etc/apt/apt.conf.d/61invoke
Module: kopano-cfg
Warning: slapd.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Job for slapd.service failed because the control process exited with error code.
See "systemctl status slapd.service" and "journalctl -xe" for details.
Restarting univention-directory-listener (via systemctl): univention-directory-listener.service.
2020-08-04 22:44:10.333505457-07:00 (in joinscript_save_current_version)


**************************************************************************
* Join failed!                                                           *
* Contact your system administrator                                      *
**************************************************************************
* Message:  Please visit https://help.univention.com/t/8842 for common problems during the join and how to fix them -- FAILED: failed.ldif exists.
**************************************************************************
Tue Aug  4 22:44:10 PDT 2020: finish /usr/sbin/univention-join

I did find this post and followed what I could: “What to do if a failed.ldif is found”.

I tried removing the failed.ldif and retrying the join, which failed with the same error.

I do see what looks like a process running:

root@ucs2:~# ps aux | grep /usr/sbin/univention-directory-replication-resync
root     24107  0.0  0.0  15536   964 pts/0    S+   22:55   0:00 grep /usr/sbin/univention-directory-replication-resync
root@ucs2:~#

so perhaps it’s deadlocked?
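Side note for anyone reading: `ps aux | grep …` always matches its own grep process, so the single line in that output may just be the grep itself rather than a hung resync. A sketch of a check using `pgrep -f`, which searches other processes' command lines and never matches itself, avoids the ambiguity:

```shell
# `ps aux | grep NAME` always lists its own grep process; pgrep -f
# searches other processes' full command lines and never matches itself.
if pgrep -f univention-directory-replication-resync > /dev/null; then
    echo "resync still running"
else
    echo "no resync process found"
fi
```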

Running a tail:

root@ucs2:~# tail -f /var/log/univention/ldap-replication-resync.log

shows nothing in this file.

Are there some steps I can try before I wreck this even more?

Thank you!


Edit: I see this as well:

root@ucs2:~# /etc/init.d/slapd start
[....] Starting slapd (via systemctl): slapd.serviceWarning: slapd.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Job for slapd.service failed because the control process exited with error code.
See "systemctl status slapd.service" and "journalctl -xe" for details.
 failed!
root@ucs2:~# systemctl daemon-reload
root@ucs2:~# /etc/init.d/slapd start
[....] Starting slapd (via systemctl): slapd.serviceJob for slapd.service failed because the control process exited with error code.
See "systemctl status slapd.service" and "journalctl -xe" for details.
 failed!
root@ucs2:~#
root@ucs2:~# systemctl status slapd.service
● slapd.service - LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol)
   Loaded: loaded (/etc/init.d/slapd; generated; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2020-08-04 23:07:26 PDT; 7min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 27743 ExecStart=/etc/init.d/slapd start (code=exited, status=1/FAILURE)
      CPU: 119ms

Aug 04 23:07:26 ucs2 slapd[27755]: DIGEST-MD5 common mech free
Aug 04 23:07:26 ucs2 slapd[27755]: DIGEST-MD5 common mech free
Aug 04 23:07:26 ucs2 slapd[27755]: slapd stopped.
Aug 04 23:07:26 ucs2 slapd[27755]: connections_destroy: nothing to destroy.
Aug 04 23:07:26 ucs2 slapd[27743]: Starting ldap server(s): slapd ...failed.
Aug 04 23:07:26 ucs2 slapd[27743]: WARNING: Another /etc/init.d/slapd start is already in progress..
Aug 04 23:07:26 ucs2 systemd[1]: slapd.service: Control process exited, code=exited status=1
Aug 04 23:07:26 ucs2 systemd[1]: Failed to start LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol).
Aug 04 23:07:26 ucs2 systemd[1]: slapd.service: Unit entered failed state.
Aug 04 23:07:26 ucs2 systemd[1]: slapd.service: Failed with result 'exit-code'.
root@ucs2:~# tail -f /var/log/univention/listener.log
04.08.20 22:57:32.907  LDAP        ( PROCESS ) : connecting to ldap://ucs1.sgvfr.lan:7389
UNIVENTION_DEBUG_BEGIN  : uldap.__open host=ucs1.sgvfr.lan port=7389 base=dc=sgvfr,dc=lan
UNIVENTION_DEBUG_END    : uldap.__open host=ucs1.sgvfr.lan port=7389 base=dc=sgvfr,dc=lan
04.08.20 23:02:29.301  LDAP        ( PROCESS ) : connecting to ldap://ucs1.sgvfr.lan:7389
04.08.20 23:02:29.314  LISTENER    ( PROCESS ) : updating 'cn=ucs2,cn=dc,cn=computers,dc=sgvfr,dc=lan' command m
04.08.20 23:02:29.318  LDAP        ( PROCESS ) : connecting to ldap://localhost:7389
04.08.20 23:02:29.318  LDAP        ( ERROR   ) : start_tls: Can't contact LDAP server
04.08.20 23:02:29.318  LISTENER    ( ERROR   ) : check_parent_dn: bind to local LDAP failed
04.08.20 23:02:29.319  LISTENER    ( WARN    ) : replication: Can't contact LDAP server: Can not connect LDAP Server, retry in 10 seconds
04.08.20 23:02:39.328  LISTENER    ( WARN    ) : replication: Can't contact LDAP server: Can not connect LDAP Server, retry in 10 seconds

root@ucs2:~# ps aux | grep /usr/sbin/univention-directory-replication-resync
root     24247  0.0  0.0  15536   992 pts/0    S+   22:57   0:00 grep /usr/sbin/univention-directory-replication-resync
root@ucs2:~# pkill univention-directory-replication-resync
root@ucs2:~# rm /var/lib/univention-directory-replication/failed.ldif
root@ucs2:~# univention-join
univention-join: joins a computer to an ucs domain
copyright (c) 2001-2020 Univention GmbH, Germany

Enter DC Master Account : Administrator
Enter DC Master Password:

Search DC Master:                                          done
Check DC Master:                                           done
Stop S4-Connector:                                         done
Stop LDAP Server:                                          done
Stop Samba Server:                                         done
Search ldap/base                                           done
Start LDAP Server:                                         done
Search LDAP binddn                                         done
Sync time:                                                 done
Running pre-join hook(s):                                  done
Join Computer Account:                                     done
Stopping univention-directory-notifier daemon:             done
Stopping univention-directory-listener daemon:             done
Sync ldap.secret:                                          done
Sync ldap-backup.secret:                                   done
Sync SSL directory:                                        done
Check TLS connection:                                      done
Download host certificate:                                 done
Sync SSL settings:                                         done
Purging translog database:                                 done
Restart LDAP Server:                                       done
Sync Kerberos settings:                                    done
Not updating kerberos/adminserver
Running pre-joinscripts hook(s):                           done
Configure 00kopano4ucs-safemode-on.inst                    done
Configure 01univention-ldap-server-init.inst               done
Configure 02univention-directory-notifier.inst             done
Configure 03univention-directory-listener.inst             done


**************************************************************************
* Join failed!                                                           *
* Contact your system administrator                                      *
**************************************************************************
* Message:  Please visit https://help.univention.com/t/8842 for common problems during the join and how to fix them -- FAILED: failed.ldif exists.
**************************************************************************
root@ucs2:~#

Personally, I would check that the file systems are intact first.
Linux is very stable and has good recovery, except in the case of a filled volume.
I've had to deal with a few of these cases before on other Linux systems;
they look OK after freeing space, but they aren't.

Get a boot CD into the system, boot into recovery, and run a filesystem check on each drive.

If that's OK, go dig around for “lock” files that may be holding services off.
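A minimal sketch of that hunt, assuming the usual Debian-style locations (the paths here are guesses; adjust for your system):

```shell
# Look for stale pid/lock files that can block a service from starting
# after a crash or a full disk. A file is "stale" when the PID recorded
# inside it no longer corresponds to a live process.
check_stale() {
    for f in "$@"; do
        [ -e "$f" ] || continue
        pid=$(cat "$f" 2>/dev/null)
        if [ -n "$pid" ] && ! kill -0 "$pid" 2>/dev/null; then
            echo "stale: $f (pid $pid)"
        else
            echo "present: $f"
        fi
    done
}

check_stale /var/run/slapd/slapd.pid /var/run/slapd/slapd.args /var/lock/slapd
```

A stale file whose recorded PID is dead can simply be removed before retrying the service.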

That was one of the first things I did. All filesystems are fine.

After leaving this for a night, it seems I can log in with the Administrator account now and view users. I see a message about a join script failing, and when trying to do a full join I now see this:

root@ucs2:~# univention-join
univention-join: joins a computer to an ucs domain
copyright (c) 2001-2020 Univention GmbH, Germany

Enter DC Master Account : Administrator
Enter DC Master Password:

Search DC Master:                                          done
Check DC Master:                                           done
Stop S4-Connector:                                         done
Stop LDAP Server:                                          done
Stop Samba Server:                                         done
Search ldap/base                                           done
Start LDAP Server:                                         done
Search LDAP binddn E: Can`t find running daemon after 50.0 seconds. (No socketfile)
ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
Insufficient access (50)


**************************************************************************
* Join failed!                                                           *
* Contact your system administrator                                      *
**************************************************************************
* Message:  Please visit https://help.univention.com/t/8842 for common problems during the join and how to fix them -- binddn for user Administrator not found.
**************************************************************************
root@ucs2:~#

The Administrator account is the one I log in with, and it exists with the same password on the master.

any thoughts?

The next thing to do, then, is to put aside the fact that this is a Univention product and treat it like a standard service failure.
Dump some status information on the various running services,
“sudo systemctl” type stuff.
Start with ensuring the LDAP service is up.

I'd start here:

WARNING: Another /etc/init.d/slapd start is already in progress…

Check your lock files; once you identify which service is down, it should be easy to locate.
Your “24 hours” might be down to general cleanup, where Linux frees up some dead files, thereby moving you closer to a functioning system.
Working that way, you are unlikely to do any additional damage,
but “firing off” random Univention commands as an attempt to fix it? I would have my doubts;
I've seen it go sideways far too many times…

Thanks for the help, talleyrand. I'm not a Linux expert, but I've tried more troubleshooting with your suggestions. I cannot find a lock file anywhere; /var/lock contains nothing for slapd, and there is no existing PID file as referenced in /etc/ldap/slapd.conf:

pidfile                 /var/run/slapd/slapd.pid

I do see this, though I don't know exactly what it means:

root@ucs2:/var/lock# systemctl status slapd.service
● slapd.service - LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol)
   Loaded: loaded (/etc/init.d/slapd; generated; vendor preset: enabled)
   Active: failed (Result: timeout) since Thu 2020-08-06 21:40:18 PDT; 11min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 29967 ExecStart=/etc/init.d/slapd start (code=exited, status=0/SUCCESS)
    Tasks: 7 (limit: 4915)
   Memory: 30.3M
      CPU: 1.334s
   CGroup: /system.slice/slapd.service
           └─1511 /usr/sbin/slapd -h ldapi:/// ldap://:7389/ ldaps://:7636/

Aug 06 21:51:39 ucs2 slapd[1511]: <= mdb_equality_candidates: (memberOf) not indexed
Aug 06 21:51:39 ucs2 slapd[1511]: <= mdb_equality_candidates: (memberOf) not indexed
Aug 06 21:51:39 ucs2 slapd[1511]: <= mdb_equality_candidates: (memberOf) not indexed
Aug 06 21:51:40 ucs2 slapd[1511]: <= mdb_equality_candidates: (memberOf) not indexed
Aug 06 21:51:40 ucs2 slapd[1511]: <= mdb_equality_candidates: (memberOf) not indexed
Aug 06 21:51:40 ucs2 slapd[1511]: <= mdb_equality_candidates: (memberOf) not indexed
Aug 06 21:51:40 ucs2 slapd[1511]: <= mdb_equality_candidates: (memberOf) not indexed
Aug 06 21:51:40 ucs2 slapd[1511]: <= mdb_equality_candidates: (memberOf) not indexed
Aug 06 21:51:40 ucs2 slapd[1511]: <= mdb_equality_candidates: (memberOf) not indexed
Aug 06 21:51:40 ucs2 slapd[1511]: <= mdb_equality_candidates: (memberOf) not indexed
root@ucs2:/var/lock#

The only service that isn't started is slapd. When trying to start it manually, I see:

  slapd.service                                                                                               loaded activating start     start LSB: OpenLDAP standalone server (Lightweight Dire

After a failure, syslog shows:

Aug  6 21:58:01 ucs2 systemd[1]: slapd.service: Start operation timed out. Terminating.
Aug  6 21:58:01 ucs2 systemd[1]: Failed to start LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol).
Aug  6 21:58:01 ucs2 systemd[1]: slapd.service: Unit entered failed state.
Aug  6 21:58:01 ucs2 systemd[1]: slapd.service: Failed with result 'timeout'.

and systemctl shows:

● slapd.service                                                                                               loaded failed failed    LSB: OpenLDAP standalone server (Lightweight Directory Acce

Thanks for anything you can suggest.

Those messages say there is an LDAP search on a non-indexed item, which is strange…
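Those warnings only mean slapd is servicing searches on an attribute with no index, which is slow but not fatal, and they should not stop slapd from starting. For reference, adding an index in a slapd.conf-based setup would look roughly like the fragment below. This is a sketch only, and on UCS slapd.conf is generated from templates (I believe via the `ldap/index/*` UCR variables), so hand-edits would be overwritten:

```
# In the database section of /etc/ldap/slapd.conf; afterwards run
# slapindex (with slapd stopped) so the new index is actually populated.
index  memberOf  eq
```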

OK, this is what I see:

root@root-pr:~# systemctl status slapd.service
● slapd.service - LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol)
   Loaded: loaded (/etc/init.d/slapd; generated; vendor preset: enabled)
   Active: active (running) since Wed 2020-08-05 16:56:04 HKT; 1 day 20h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 1115 ExecStart=/etc/init.d/slapd start (code=exited, status=0/SUCCESS)
 Main PID: 1437 (slapd)
    Tasks: 6 (limit: 4915)
   Memory: 21.6M
      CPU: 47.899s
   CGroup: /system.slice/slapd.service
           └─1437 /usr/sbin/slapd -h ldapi:/// ldap://:7389/ ldaps://:7636/

Aug 05 16:56:00 root-pr systemd[1]: Starting LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol)…
Aug 05 16:56:03 root-pr slapd[1322]: @(#) $OpenLDAP: slapd (Apr 29 2020 09:53:20) $
Debian OpenLDAP Maintainers pkg-openldap-devel@lists.alioth.debian.org
Aug 05 16:56:03 root-pr slapd[1322]: Loaded metadata from "/usr/share/univention-management-console/saml/idp/ucs-sso.org.somedomain.com.xml
Aug 05 16:56:03 root-pr slapd[1437]: WARNING: No dynamic config support for overlay translog.
Aug 05 16:56:03 root-pr slapd[1437]: WARNING: No dynamic config support for overlay shadowbind.
Aug 05 16:56:04 root-pr slapd[1437]: slapd starting
Aug 05 16:56:04 root-pr slapd[1115]: Starting ldap server(s): slapd …done.
Aug 05 16:56:04 root-pr ldapsearch[1471]: DIGEST-MD5 common mech free
Aug 05 16:56:04 root-pr slapd[1115]: Checking Schema ID: …done.
Aug 05 16:56:04 root-pr systemd[1]: Started LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol).
root@root-pr:~#

Which is interesting, since if I run:

/usr/sbin/slapcat -n 0 -l somedata.ldif

I get two warnings thrown:
5f2cec17 WARNING: No dynamic config support for overlay translog.
5f2cec17 WARNING: No dynamic config support for overlay shadowbind.

Give slapcat a test and see if you get different errors; my resulting file was empty.

Maybe you can run a re-index based on the contents of /etc/ldap/slapd.d.
Do a backup, then try looking at the “slapindex” command, but make sure you are the “right” user.

This is all I get:

root@ucs2:~# /usr/sbin/slapcat -n 0 -l somedata.ldif
5f2d9a7a WARNING: No dynamic config support for overlay shadowbind.
root@ucs2:~#

Nothing in the file “somedata.ldif”?

What about the contents of /etc/ldap/slapd.d/?

Maybe try “slapindex” to rebuild any current indexes;
it will only rebuild pre-existing ones, based on the contents of /etc/ldap.
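A cautious version of that procedure, as a sketch only; the database path, the `openldap` user, and the config location are assumptions from a standard Debian/UCS layout, so verify each before running:

```shell
# Sketch: stop slapd, dump and copy the data before touching anything,
# re-index as the openldap user, then restart. -n 1 is the data DB;
# -n 0 is the config DB, which is empty on a slapd.conf-based setup.
reindex_ldap() {
    service slapd stop || return 1                        # never index a live DB
    /usr/sbin/slapcat -f /etc/ldap/slapd.conf -n 1 -l /root/pre-reindex.ldif
    cp -a /var/lib/univention-ldap /root/ldap-dir-backup  # assumed DB location
    sudo -u openldap slapindex -f /etc/ldap/slapd.conf
    service slapd start
}
```

The `-n 1` dump plus the directory copy gives two independent ways back if the re-index makes things worse.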

Hi, sorry for the delay. There was nothing in the file.

root@ucs2:~# ls -al /etc/ldap/slapd.d/
total 24
drwxr-xr-x 3 openldap openldap  4096 Sep 20  2017 .
drwxr-xr-x 5 root     root     12288 Aug  9 04:30 ..
drwxr-x--- 3 openldap openldap  4096 Sep 20  2017 cn=config
-rw------- 1 openldap openldap   478 Sep 20  2017 cn=config.ldif.DISABLED
root@ucs2:~# cd /etc/ldap/slapd.d/
root@ucs2:/etc/ldap/slapd.d# cat cn\=config.ldif.DISABLED
# AUTO-GENERATED FILE - DO NOT EDIT!! Use ldapmodify.
# CRC32 dfa60f04
dn: cn=config
objectClass: olcGlobal
cn: config
olcArgsFile: /var/run/slapd/slapd.args
olcLogLevel: none
olcPidFile: /var/run/slapd/slapd.pid
olcToolThreads: 1
structuralObjectClass: olcGlobal
entryUUID: db573402-32c8-1037-9315-353f31a1036a
creatorsName: cn=config
createTimestamp: 20170921032944Z
entryCSN: 20170921032944.885889Z#000000#000#000000
modifiersName: cn=config
modifyTimestamp: 20170921032944Z
root@ucs2:/etc/ldap/slapd.d#

I don't have the first clue about using slapindex, so I'm a little afraid to mess with that. I'm wondering if it would be easier just to do a fresh install and rejoin. I use HAProxy for LDAP auth and have removed this server from it. This seems to be an issue that isn't going to be fixed quickly, and I think it will be faster to reinstall a fresh one and rebuild.

LOL… no need to apologize… it’s not my system.

slapindex would just rebuild any LDAP indexes on that machine; the indexes are only used to speed up lookups in the LDAP database.
I was just thinking that maybe one of your indexes got corrupted, so the daemon won't start.

If rebuilding and rejoining is an option, then by all means use it.
The other option is to virtualise that system, then play about with it offline.
It's something I do all the time (the only issue is the network config).

Also, looking at that file, it is from 20170921 (mine had a date stamp of 2020), so your server was installed nearly three years ago.

All my servers are virtualized… I'm wondering if restoring an old backup would work and put the system back to before the disk filled up.

However, this just turned into a much bigger problem: I had to reboot the primary server, and now slapd won't start on that one either. I have no idea WTF is going on anymore.

root@ucs1:~# systemctl status slapd.service
● slapd.service - LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol)
   Loaded: loaded (/etc/init.d/slapd; generated; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2020-08-09 18:58:43 PDT; 22s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 9821 ExecStart=/etc/init.d/slapd start (code=exited, status=1/FAILURE)
      CPU: 649ms

Aug 09 18:58:43 ucs1 slapd[9833]: slapd stopped.
Aug 09 18:58:43 ucs1 slapd[9833]: connections_destroy: nothing to destroy.
Aug 09 18:58:43 ucs1 slapd[9821]: Starting ldap server(s): slapd ...failed.
Aug 09 18:58:43 ucs1 slapschema[9836]: Loaded metadata from "/usr/share/univention-management-console/saml/idp/ucs-sso.sgvfr.lan.xml
Aug 09 18:58:43 ucs1 slapschema[9836]: DIGEST-MD5 common mech free
Aug 09 18:58:43 ucs1 slapd[9821]: .
Aug 09 18:58:43 ucs1 systemd[1]: slapd.service: Control process exited, code=exited status=1
Aug 09 18:58:43 ucs1 systemd[1]: Failed to start LSB: OpenLDAP standalone server (Lightweight Directory Access Protocol).
Aug 09 18:58:43 ucs1 systemd[1]: slapd.service: Unit entered failed state.
Aug 09 18:58:43 ucs1 systemd[1]: slapd.service: Failed with result 'exit-code'.

uggggg

Hi,
Yep… you CANNOT just restore an old backup;
the time epochs would be radically different (on a distributed/slave system).
Say, for example, you did a backup, and then something moved; your system would show the item as moved.
Then you pull in an old backup, and suddenly the item is back in its old position,
and LDAP goes “W…T…H just happened?!”; now we have two identical items in different parts of the tree…

Also, there is a strict backup sequence and a set of scripts that need to be run to back up Univention; I'm presuming you have backup scripts that shut LDAP down before a backup.

For the virtualisation, I meant a completely offline, separate virtualisation environment, because the “backup” LDAPs cannot be on any part of the existing live network; for obvious reasons, they will fight.

If you have the option to virtualise your network separately from your current LIVE network,
then you might try pulling all the LDAP backups as a complete set into an offline virtualised environment, see the results of that, work out a playbook, and then work live.

Yeah, that's what I figured about restoring an old backup… I've done it in the past when I was testing with VMware snapshots, and ended up having to re-join everything; it was a serious pain in the arse…

I still cannot find any reason why the second DC won't start or re-join, so I think I will try for a couple more hours, then rebuild it…

You can always try stopping LDAP, then regenerating the indexes:

sudo -u openldap slapindex

Then do a restart and see if it works; if not, revert to plan “B”.

Oddly enough, on the primary I ended up regenerating the DH parameter files with “sh -x /usr/share/univention-ldap/create-dh-parameter-files”.

Once that finished, slapd could be started again. The join on the secondary is also progressing, and slapd is starting there as well, but it is failing on 30univention-appcenter.inst.

Chasing that issue now… I've decided to take this time to rebuild all the controllers and finally migrate away from the old domain. This is a good excuse to get that started, but I'd like to make sure the existing ones are working before taking on another project, lol.
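For anyone who lands here later, here is the sequence that got slapd going again, wrapped as a sketch (only the create-dh-parameter-files path is from this thread; the rest assumes a standard systemd setup):

```shell
# Regenerate the DH parameter files, reload systemd units, then restart
# slapd and confirm it is up.
recover_slapd() {
    sh -x /usr/share/univention-ldap/create-dh-parameter-files || return 1
    systemctl daemon-reload      # clears the "changed on disk" warning
    systemctl restart slapd
    systemctl is-active slapd    # prints "active" when slapd is up
}
```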
