Problem: Users Unable to Log In Due to LDAP Primary Replication Failure

Problem

Users were unable to log in after successfully resetting their passwords. Newly created or modified users were not replicated across the LDAP infrastructure.

The environment showed inconsistent LDAP data between the LDAP primary instances:

  • Objects existed on one LDAP primary but not on the other.
  • Replication to LDAP replicas stopped working.
  • Password changes were not synchronized.
  • User modifications written through the UDM REST API appeared only on one LDAP primary.

The issue started after scaling the LDAP primary StatefulSet from one instance to two instances.


Environment

  • Nubus for UCS Version 1.15.2

Root Cause

The investigation identified multiple infrastructure-related problems inside the Kubernetes cluster.

Initially, temporary DNS resolution failures inside the cluster caused communication failures between the LDAP instances and the Kubernetes API server. This interrupted the LDAP leader election process and replication handling.

The following errors were observed:

WARNING [...] Failed to resolve 'api..internal.ske.eu01.stackit.cloud' ([Errno -3] Temporary failure in name resolution)
ERROR [...] Fatal error in leader election
ERROR [...] Error updating pod leader label: (500) | failed calling webhook "vpod.kb.io"

Additional Kubernetes webhook failures were logged:

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"vpod.kb.io\": failed to call webhook: Post \"https://kruise-webhook-service.kruise-system.svc:443/validate-pod?timeout=30s\": no endpoints available for service \"kruise-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"vpod.kb.io\": failed to call webhook: Post \"https://kruise-webhook-service.kruise-system.svc:443/validate-pod?timeout=30s\": no endpoints available for service \"kruise-webhook-service\""}]},"code":500}

Later investigation revealed the actual root cause: the Kubernetes Service object for ums-ldap-server-primary-1 was missing. As a result:

  • LDAP primary communication became asymmetric:
    • primary-1 could communicate with primary-0
    • primary-0 could not communicate with primary-1
  • LDAP replication between both primaries stopped functioning.
  • Replicas no longer received updates.

At the same time, the LDAP leader lease was held by ums-ldap-server-primary-1, so writes arrived on primary-1, the primary that primary-0 could not reach.

Because the UDM Listener depends on the first LDAP primary (ldap-server-primary-0) for change notifications, provisioning and replication became inconsistent when synchronization between both primaries failed.
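
A missing per-pod Service can be detected before it causes replication problems. The following is a minimal sketch, not part of the original analysis. It assumes that every LDAP primary pod has a Service with exactly the same name as the pod, which is inferred from the Service name and the ldap:// URLs in this article. The function names are hypothetical.

```shell
# Print the per-pod Service names expected for a StatefulSet with N replicas.
# Assumption: each pod has a Service named exactly like the pod.
expected_pod_services() {
  base="$1"
  replicas="$2"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${base}-${i}"
    i=$((i + 1))
  done
}

# Report for every expected Service whether it exists in the namespace.
check_pod_services() {
  namespace="$1"
  for svc in $(expected_pod_services "$2" "$3"); do
    if kubectl get service "$svc" -n "$namespace" >/dev/null 2>&1; then
      echo "ok       $svc"
    else
      echo "MISSING  $svc"
    fi
  done
}

# Usage:
#   check_pod_services "${NAMESPACE}" ums-ldap-server-primary 2
```

Running the check right after scaling the StatefulSet would show whether a Service exists for each new primary pod.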


Solution

The issue was resolved by scaling the LDAP primary StatefulSet back from two replicas to one replica.

After reducing the StatefulSet size:

  • LDAP synchronization resumed.
  • Password resets propagated correctly.
  • User logins started working again.
  • Replication inconsistencies disappeared.

Command example:

  • kubectl scale statefulset ums-ldap-server-primary --replicas=1 -n ${NAMESPACE}
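
The scale-down can be combined with a wait until the second primary pod is gone. This is a sketch under assumptions: the pod name and the ldap-server-type=primary label are taken from this article, while the function name, the polling interval, and the limit of 60 attempts are arbitrary choices for illustration.

```shell
# Scale the LDAP primary StatefulSet to one replica, wait until the second
# primary pod has disappeared, then list the remaining primary pods.
scale_primary_to_one() {
  namespace="$1"
  kubectl scale statefulset ums-ldap-server-primary --replicas=1 -n "$namespace" || return 1

  # Poll until the pod no longer exists (assumed limit: 60 attempts, 5 s apart).
  tries=0
  while kubectl get pod ums-ldap-server-primary-1 -n "$namespace" >/dev/null 2>&1; do
    tries=$((tries + 1))
    [ "$tries" -ge 60 ] && return 1
    sleep 5
  done

  # Label taken from the Service selector shown in this article.
  kubectl get pods -n "$namespace" -l ldap-server-type=primary
}

# Usage:
#   scale_primary_to_one "${NAMESPACE}"
```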

Investigation

Initial analysis showed Kubernetes DNS- and webhook-related failures.

Observed LDAP leader election errors:

WARNING [...] Failed to resolve 'api..internal.ske.eu01.stackit.cloud' ([Errno -3] Temporary failure in name resolution)

ERROR [...] Fatal error in leader election

ERROR [...] Error updating pod leader label: (500) | failed calling webhook "vpod.kb.io"

Kubernetes webhook errors:

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"vpod.kb.io\": failed to call webhook: Post \"https://kruise-webhook-service.kruise-system.svc:443/validate-pod?timeout=30s\": no endpoints available for service \"kruise-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"vpod.kb.io\": failed to call webhook: Post \"https://kruise-webhook-service.kruise-system.svc:443/validate-pod?timeout=30s\": no endpoints available for service \"kruise-webhook-service\""}]},"code":500}

Later logs showed additional Kubernetes control plane instability:

2026-02-06 00:00:09,904 INFO [leader_elector.update_service_selector:171] Service selector updated to ums-ldap-server-primary-0

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"etcdserver: leader changed","code":500}

Reason: Internal Server Error

2026-02-06 00:00:09,667 ERROR [leader_elector.acquire_or_renew:124] Failed to acquire/renew lease: (500)

Connectivity tests from multiple pods later succeeded:

  • Kubernetes API reachable
  • Kruise webhook reachable
  • No further CoreDNS errors observed
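
The article does not record how these connectivity tests were run. One possible check for the webhook error above ("no endpoints available for service") is to look at the endpoints of the Service named in that error. The sketch below assumes the cluster still exposes the v1 Endpoints resource, and the function name is hypothetical.

```shell
# Succeeds if the Kruise webhook Service has at least one ready endpoint.
# Service name and namespace are taken from the webhook error message.
webhook_has_endpoints() {
  ips=$(kubectl get endpoints kruise-webhook-service -n kruise-system \
          -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)
  if [ -n "$ips" ]; then
    echo "kruise-webhook-service endpoints: $ips"
  else
    echo "kruise-webhook-service has no ready endpoints"
    return 1
  fi
}

# Usage:
#   webhook_has_endpoints || echo "webhook calls will fail"
```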

LDAP consistency checks revealed:

Replica State

  • Replicas were internally consistent.
  • contextCSN values matched between replicas.
  • Objects had identical entryUUID and modifyTimestamp.

Primary State

  • contextCSN differed between LDAP primaries.
  • Object modifications appeared only on primary-1.
  • primary-0 did not receive updates.
  • Replicas did not receive updates from primary-1.

LDAP lease information:

  • kubectl -n ${NAMESPACE} describe leases.coordination.k8s.io ldap-primary-leader

Output:

Holder Identity: ums-ldap-server-primary-1
Lease Duration Seconds: 15

Service selector verification:

  • kubectl get svc ums-ldap-server-primary -n ${NAMESPACE} -o yaml | grep -A10 selector

Output:

selector:
  app.kubernetes.io/instance: ums
  app.kubernetes.io/name: ldap-server
  ldap-server-type: primary
  statefulset.kubernetes.io/pod-name: ums-ldap-server-primary-1
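
The lease holder and the Service selector above both point to ums-ldap-server-primary-1. The two values can be compared in one step, as in the following sketch. It assumes that the leader elector keeps the selector in sync with the lease, which is suggested by the "Service selector updated" log line. The function name is hypothetical.

```shell
# Succeeds if the pod named in the Service selector is also the lease holder.
leader_matches_selector() {
  namespace="$1"
  holder=$(kubectl get lease ldap-primary-leader -n "$namespace" \
             -o jsonpath='{.spec.holderIdentity}')
  # Dots inside the label key must be escaped in a kubectl JSONPath expression.
  selected=$(kubectl get service ums-ldap-server-primary -n "$namespace" \
               -o jsonpath='{.spec.selector.statefulset\.kubernetes\.io/pod-name}')
  echo "lease holder:     $holder"
  echo "service selector: $selected"
  [ -n "$holder" ] && [ "$holder" = "$selected" ]
}

# Usage:
#   leader_matches_selector "${NAMESPACE}" || echo "lease and selector disagree"
```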

Database verification on primary-0:

  • kubectl exec -n ${NAMESPACE} ums-ldap-server-primary-0 -c main -- ls -lh /var/lib/univention-ldap/ldap/

Output:

total 17M
-rw-rw---- 1 101 openldap  17M Feb 24 14:23 data.mdb
-rw-rw---- 1 101 openldap 8.0K Feb 26 10:30 lock.mdb

No syncrepl activity was detected:

  • kubectl logs -n ${NAMESPACE} ums-ldap-server-primary-0 -c main | grep -iE "syncrepl|rid="

  • kubectl logs -n ${NAMESPACE} ums-ldap-server-primary-1 -c main | grep -iE "syncrepl|rid="

Output:

no output

LDAP contextCSN comparison on primary-0:

  • kubectl exec -n ${NAMESPACE} ums-ldap-server-primary-0 -c main -- ldapsearch -H ldapi:/// -Y EXTERNAL -b "dc=,dc=internal" -s base contextCSN

Output:

contextCSN: 20250610053835.618495Z#000000#000#000000
contextCSN: 20260224142328.812935Z#000000#001#000000
contextCSN: 20251223104038.960536Z#000000#002#000000

LDAP contextCSN comparison on primary-1:

  • kubectl exec -n ${NAMESPACE} ums-ldap-server-primary-1 -c main -- ldapsearch -H ldapi:/// -Y EXTERNAL -b "dc=,dc=internal" -s base contextCSN

Output:

contextCSN: 20250610053835.618495Z#000000#000#000000
contextCSN: 20251216111229.463643Z#000000#001#000000
contextCSN: 20260226102452.381993Z#000000#002#000000
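
Each contextCSN value has the form timestamp#count#serverID#mod, so the two outputs can be compared per server ID. In the values above, server ID 001 is newer on primary-0 and server ID 002 is newer on primary-1, which means neither primary holds the other's latest state. The sketch below automates that comparison. It assumes that the ldapsearch output of each primary was saved to a file, and the function and file names are hypothetical.

```shell
# Compare the contextCSN lines of two ldapsearch outputs per server ID.
# Prints one line per server ID: in-sync, first-ahead, or second-ahead.
compare_context_csn() {
  awk '
    /^contextCSN:/ {
      split($2, f, "#")              # f[1] = timestamp, f[3] = server ID
      if (FILENAME == ARGV[1]) {
        first[f[3]] = f[1]
      } else {
        second[f[3]] = f[1]
      }
      seen[f[3]] = 1
    }
    END {
      for (sid in seen) {
        a = first[sid] ""            # force string comparison; the
        b = second[sid] ""           # fixed-width timestamps sort correctly
        if (a == b)     state = "in-sync"
        else if (a > b) state = "first-ahead"
        else            state = "second-ahead"
        print sid, state
      }
    }
  ' "$1" "$2" | sort
}

# Usage:
#   compare_context_csn primary-0.txt primary-1.txt
```

A server ID that is missing from one file is reported as ahead on the other side, because an empty string sorts before any timestamp.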

Connectivity test from primary-0 to primary-1 failed:

  • kubectl exec -n ${NAMESPACE} ums-ldap-server-primary-0 -c main -- ldapsearch -x -H ldap://ums-ldap-server-primary-1:389 -b "dc=,dc=internal" -s base

Output:

ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
command terminated with exit code 255

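Connectivity test from primary-1 to primary-0 succeeded. The server answered; result 50 (Insufficient access) only means that the anonymous bind is not allowed to read the entry: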

  • kubectl exec -n ${NAMESPACE} ums-ldap-server-primary-1 -c main -- ldapsearch -x -H ldap://ums-ldap-server-primary-0:389 -b "dc=,dc=internal" -s base

Output:

# search result
search: 2
result: 50 Insufficient access
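
The two tests differ in kind: the first fails at the network level, the second reaches the server and gets an LDAP-level answer. The following sketch makes that distinction explicit by classifying the combined ldapsearch output. The function name is hypothetical, and the matched strings are the ones shown in the outputs above.

```shell
# Classify ldapsearch output (stdout and stderr combined) read from stdin.
classify_ldap_reply() {
  reply=$(cat)
  case "$reply" in
    *"Can't contact LDAP server"*) echo "unreachable" ;;  # no connection
    *"result: "*)                  echo "reachable" ;;    # server answered
    *)                             echo "unknown" ;;
  esac
}

# Usage:
#   kubectl exec -n "${NAMESPACE}" ums-ldap-server-primary-0 -c main -- \
#     ldapsearch -x -H ldap://ums-ldap-server-primary-1:389 -s base 2>&1 \
#     | classify_ldap_reply
```

Any LDAP result code, including 50, counts as reachable here, because the check is about connectivity rather than access rights.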

Further investigation finally identified the cause: the Kubernetes Service object for ums-ldap-server-primary-1 was missing.


Additional Notes

The following documentation section is relevant for LDAP primary synchronization and provisioning behavior:

Important:

To maintain event consistency with the LDAP transaction log, the UDM Listener ties to the first LDAP Primary. The respective pod’s name is ldap-server-primary-0. If the first LDAP Primary is down, the UDM Listener doesn’t notify the Provisioning Service of changes to user and group objects until the first LDAP Primary comes back.


Reference:

Univention Nubus Kubernetes Architecture – Identity Store

How can I use ldapsearch in Nubus/OpenDesk