Problem
Users were unable to log in after successfully resetting their passwords. Newly created or modified users were not replicated across the LDAP infrastructure.
The environment showed inconsistent LDAP data between the LDAP primary instances:
- Objects existed on one LDAP primary but not on the other.
- Replication to LDAP replicas stopped working.
- Password changes were not synchronized.
- User modifications written through the UDM REST API appeared only on one LDAP primary.
The issue started after scaling the LDAP primary StatefulSet from one instance to two instances.
Environment:
- Nubus for UCS Version 1.15.2
Root Cause
The investigation identified multiple infrastructure-related problems inside the Kubernetes cluster.
Initially, temporary DNS resolution failures inside the cluster caused communication failures between the LDAP instances and the Kubernetes API server. This interrupted the LDAP leader election process and replication handling.
The following errors were observed:
WARNING [...] Failed to resolve 'api..internal.ske.eu01.stackit.cloud' ([Errno -3] Temporary failure in name resolution)
ERROR [...] Fatal error in leader election
ERROR [...] Error updating pod leader label: (500) | failed calling webhook "vpod.kb.io"
Additional Kubernetes webhook failures were logged:
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"vpod.kb.io\": failed to call webhook: Post \"https://kruise-webhook-service.kruise-system.svc:443/validate-pod?timeout=30s\": no endpoints available for service \"kruise-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"vpod.kb.io\": failed to call webhook: Post \"https://kruise-webhook-service.kruise-system.svc:443/validate-pod?timeout=30s\": no endpoints available for service \"kruise-webhook-service\""}]},"code":500}
Later investigation revealed the actual root cause:
- The Kubernetes Service object for ums-ldap-server-primary-1 was missing.
- LDAP primary communication became asymmetric:
  - primary-1 could communicate with primary-0
  - primary-0 could not communicate with primary-1
- LDAP replication between both primaries stopped functioning.
- Replicas no longer received updates.
- The LDAP leader lease was assigned to ums-ldap-server-primary-1.
Because the UDM Listener depends on the first LDAP primary (ldap-server-primary-0) for change notifications, provisioning and replication became inconsistent when synchronization between both primaries failed.
Solution
The issue was resolved by scaling the LDAP primary StatefulSet back from two replicas to one replica.
After reducing the StatefulSet size:
- LDAP synchronization resumed.
- Password resets propagated correctly.
- User logins started working again.
- Replication inconsistencies disappeared.
Command example:
kubectl scale statefulset ums-ldap-server-primary --replicas=1 -n ${NAMESPACE}
Investigation
Initial analysis showed DNS-related and webhook-related failures inside the Kubernetes cluster.
Observed LDAP leader election errors:
WARNING [...] Failed to resolve 'api..internal.ske.eu01.stackit.cloud' ([Errno -3] Temporary failure in name resolution)
ERROR [...] Fatal error in leader election
ERROR [...] Error updating pod leader label: (500) | failed calling webhook "vpod.kb.io"
Kubernetes webhook errors:
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"vpod.kb.io\": failed to call webhook: Post \"https://kruise-webhook-service.kruise-system.svc:443/validate-pod?timeout=30s\": no endpoints available for service \"kruise-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"vpod.kb.io\": failed to call webhook: Post \"https://kruise-webhook-service.kruise-system.svc:443/validate-pod?timeout=30s\": no endpoints available for service \"kruise-webhook-service\""}]},"code":500}
Later logs showed additional Kubernetes control plane instability:
2026-02-06 00:00:09,904 INFO [leader_elector.update_service_selector:171] Service selector updated to ums-ldap-server-primary-0
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"etcdserver: leader changed","code":500}
Reason: Internal Server Error
2026-02-06 00:00:09,667 ERROR [leader_elector.acquire_or_renew:124] Failed to acquire/renew lease: (500)
Connectivity tests from multiple pods later succeeded:
- Kubernetes API reachable
- Kruise webhook reachable
- No further CoreDNS errors observed
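The webhook error shown earlier reports "no endpoints available" for kruise-webhook-service in the kruise-system namespace. The following is a minimal sketch of one way to recheck that condition, assuming kubectl access to the cluster. The function name webhook_has_endpoints is illustrative and not part of Nubus or Kruise.

```shell
# Succeeds when the Kruise webhook Service has at least one ready endpoint address.
# Service and namespace names are taken from the error message in this article.
webhook_has_endpoints() {
  ips=$(kubectl get endpoints kruise-webhook-service -n kruise-system \
    -o jsonpath='{.subsets[*].addresses[*].ip}')
  # An empty result means no pod currently backs the Service.
  [ -n "$ips" ]
}
```

Usage: webhook_has_endpoints && echo "webhook has endpoints" || echo "no endpoints"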
LDAP consistency checks revealed:
Replica State
- Replicas were internally consistent.
- contextCSN values matched between replicas.
- Objects had identical entryUUID and modifyTimestamp.
Primary State
- contextCSN differed between LDAP primaries.
- Object modifications appeared only on primary-1.
- primary-0 did not receive updates.
- Replicas did not receive updates from primary-1.
LDAP lease information:
kubectl -n ${NAMESPACE} describe leases.coordination.k8s.io ldap-primary-leader
Output:
Holder Identity: ums-ldap-server-primary-1
Lease Duration Seconds: 15
Service selector verification:
kubectl get svc ums-ldap-server-primary -n ${NAMESPACE} -o yaml | grep -A10 selector
Output:
selector:
app.kubernetes.io/instance: ums
app.kubernetes.io/name: ldap-server
ldap-server-type: primary
statefulset.kubernetes.io/pod-name: ums-ldap-server-primary-1
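The lease holder and the Service selector can also be compared in one step. This is a sketch only, assuming the lease and Service names shown above and that the holder identity equals the pod name, as in the output above. The function name leader_matches_selector is hypothetical.

```shell
# Succeeds when the leader lease holder and the Service selector name the same pod.
# Usage: leader_matches_selector NAMESPACE
leader_matches_selector() {
  ns=$1
  holder=$(kubectl get lease ldap-primary-leader -n "$ns" \
    -o jsonpath='{.spec.holderIdentity}')
  # Dots inside the label key must be escaped in kubectl JSONPath.
  target=$(kubectl get svc ums-ldap-server-primary -n "$ns" \
    -o jsonpath='{.spec.selector.statefulset\.kubernetes\.io/pod-name}')
  echo "lease holder:     $holder"
  echo "service selector: $target"
  [ -n "$holder" ] && [ "$holder" = "$target" ]
}
```

Usage: leader_matches_selector "${NAMESPACE}" || echo "lease and selector disagree"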
Database verification on primary-0:
kubectl exec -n ${NAMESPACE} ums-ldap-server-primary-0 -c main -- ls -lh /var/lib/univention-ldap/ldap/
Output:
total 17M
-rw-rw---- 1 101 openldap 17M Feb 24 14:23 data.mdb
-rw-rw---- 1 101 openldap 8.0K Feb 26 10:30 lock.mdb
No syncrepl activity was detected:
kubectl logs -n ${NAMESPACE} ums-ldap-server-primary-0 -c main | grep -iE "syncrepl|rid="
kubectl logs -n ${NAMESPACE} ums-ldap-server-primary-1 -c main | grep -iE "syncrepl|rid="
Output:
no output
LDAP contextCSN comparison on primary-0:
kubectl exec -n ${NAMESPACE} ums-ldap-server-primary-0 -c main -- ldapsearch -H ldapi:/// -Y EXTERNAL -b "dc=,dc=internal" -s base contextCSN
Output:
contextCSN: 20250610053835.618495Z#000000#000#000000
contextCSN: 20260224142328.812935Z#000000#001#000000
contextCSN: 20251223104038.960536Z#000000#002#000000
LDAP contextCSN comparison on primary-1:
kubectl exec -n ${NAMESPACE} ums-ldap-server-primary-1 -c main -- ldapsearch -H ldapi:/// -Y EXTERNAL -b "dc=,dc=internal" -s base contextCSN
Output:
contextCSN: 20250610053835.618495Z#000000#000#000000
contextCSN: 20251216111229.463643Z#000000#001#000000
contextCSN: 20260226102452.381993Z#000000#002#000000
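Each contextCSN value has the form timestamp#count#serverID#mod, so the two outputs can be compared per server ID. In the values above, server ID 000 is identical on both primaries, 001 is newer on primary-0, and 002 is newer on primary-1. The following sketch automates that comparison, assuming each ldapsearch output was saved to a file. The function name compare_csn and the state labels are illustrative.

```shell
# Compare contextCSN timestamps per server ID between two saved ldapsearch outputs.
# Usage: compare_csn FILE_A FILE_B
# Prints one line per server ID: in-sync, first-newer, or second-newer.
compare_csn() {
  awk '
    /^contextCSN:/ {
      split($2, f, "#")            # f[1] = timestamp, f[3] = server ID
      if (FILENAME == ARGV[1]) a[f[3]] = f[1]; else b[f[3]] = f[1]
      sids[f[3]] = 1
    }
    END {
      for (s in sids) {
        # Append "" to force string comparison of the timestamps.
        if ((a[s] "") == (b[s] "")) state = "in-sync"
        else if ((a[s] "") > (b[s] "")) state = "first-newer"
        else state = "second-newer"
        print s, state
      }
    }' "$1" "$2" | sort
}
```

Usage: redirect the two ldapsearch commands above to primary-0.csn and primary-1.csn, then run compare_csn primary-0.csn primary-1.csn. Any line that is not in-sync points to changes that one primary has not received from the other.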
Connectivity test from primary-0 to primary-1 failed:
kubectl exec -n ${NAMESPACE} ums-ldap-server-primary-0 -c main -- ldapsearch -x -H ldap://ums-ldap-server-primary-1:389 -b "dc=,dc=internal" -s base
Output:
ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
command terminated with exit code 255
Connectivity test from primary-1 to primary-0 succeeded:
kubectl exec -n ${NAMESPACE} ums-ldap-server-primary-1 -c main -- ldapsearch -x -H ldap://ums-ldap-server-primary-0:389 -b "dc=,dc=internal" -s base
Output:
# search result
search: 2
result: 50 Insufficient access
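The two tests above can be wrapped in a small helper so that both directions are probed the same way. This is a rough sketch under two assumptions: that kubectl exec passes through the exit status of ldapsearch, and that status 255 means the server could not be contacted, as in the failed test above. The function name probe_primary is hypothetical, and the base DN is passed in as a parameter.

```shell
# Probe LDAP reachability from one primary pod to another.
# Usage: probe_primary NAMESPACE FROM_POD TO_HOST BASE_DN
probe_primary() {
  ns=$1; from=$2; to=$3; base=$4
  rc=0
  kubectl exec -n "$ns" "$from" -c main -- \
    ldapsearch -x -H "ldap://$to:389" -b "$base" -s base >/dev/null 2>&1 || rc=$?
  if [ "$rc" -eq 255 ]; then
    echo "$from -> $to: unreachable"
  else
    # Any LDAP answer, including "50 Insufficient access", proves the connection works.
    echo "$from -> $to: reachable (exit $rc)"
  fi
}
```

Note that this sketch does not distinguish other kubectl failures, such as a missing pod, from an LDAP answer, so an unexpected exit status should be checked by hand.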
Further investigation finally identified that the Service object for ums-ldap-server-primary-1 was missing.
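A check for this condition could look like the following sketch. It assumes that every primary pod is expected to have a Service with the same name as the pod, which matches the finding above but is not confirmed here as a general rule. The function name missing_primary_services is illustrative.

```shell
# Print the name of every primary pod that has no Service of the same name.
# Usage: missing_primary_services NAMESPACE REPLICA_COUNT
missing_primary_services() {
  ns=$1; replicas=$2; i=0
  while [ "$i" -lt "$replicas" ]; do
    name="ums-ldap-server-primary-$i"
    # kubectl get returns a non-zero status when the object does not exist.
    kubectl get svc "$name" -n "$ns" >/dev/null 2>&1 || echo "$name"
    i=$((i + 1))
  done
}
```

Usage: missing_primary_services "${NAMESPACE}" 2 prints nothing when both Services exist. In the situation described here it would be expected to print ums-ldap-server-primary-1.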
Additional Notes
The following documentation section is relevant for LDAP primary synchronization and provisioning behavior:
Important:
To maintain event consistency with the LDAP transaction log, the UDM Listener ties to the first LDAP Primary. The respective pod’s name is ldap-server-primary-0. If the first LDAP Primary is down, the UDM Listener doesn’t notify the Provisioning Service of changes to user and group objects until the first LDAP Primary comes back.