How-to: SSSD Performance Optimization and Cache Management in UCS 5.2

Document Overview

  • Target Audience: System Administrators, Technical Support Engineers
  • Application: Univention Corporate Server (UCS) 5.2.x
  • Topic: SSSD architecture changes, performance troubleshooting, and cache remediation.

Introduction & Architectural Shift

With the release of UCS 5.2 (based on Debian 12), Univention introduced a significant architectural shift in identity management and authentication. In UCS 5.0, the System Security Services Daemon (SSSD) was not used for standard authentication tasks. Instead, local system requests via the Name Service Switch (NSS) and Pluggable Authentication Modules (PAM) interacted directly with the underlying directory services.

In UCS 5.2, SSSD acts as the primary intermediary between local authentication interfaces and remote identity providers (such as OpenLDAP and Samba/Active Directory). This change improves offline resilience and centralized credential management, but it also introduces complex caching mechanisms that require careful monitoring in large-scale or high-churn environments.

The SSSD Caching Architecture

SSSD relies on a tiered caching system to minimize network round trips to the domain controller:

  1. Fast Cache (Memory Cache): Memory-mapped cache files that the libnss_sss client library reads directly, so repeated name lookups are answered without a round trip to the SSSD responder.
  2. Persistent Disk Cache (LDB Cache): A permanent database stored as .ldb files under /var/lib/sss/db/. This database tracks user attributes, group configurations, nested memberships, and authentication tokens.

Symptom: Cache Bloat & Resource Exhaustion

In large deployments, particularly environments with complex nested group structures, thousands of objects, or high-frequency group membership changes, the persistent disk cache can keep growing without an effective upper bound.

A healthy, baseline UCS 5.2 installation typically maintains an SSSD database folder size between 10 MB and 50 MB. However, misconfigurations or high-turnover environments can cause these databases to grow to many times that size, as in the following example from an affected system:

root@production-node:~# du -h /var/lib/sss/db/
720M    /var/lib/sss/db/

root@production-node:~# ls -lah /var/lib/sss/db/
total 720M
drwx------  2 root root 4.0K Jul 23  2025 .
drwxr-xr-x 10 root root 4.0K Jul 23  2025 ..
-rw-------  1 root root 265M Apr 21 16:18 cache_customldap.ldb
-rw-------  1 root root 286M Apr 21 16:15 cache_domain.example.com.ldb
-rw-------  1 root root 1.3M Apr  5 11:56 config.ldb
-rw-------  1 root root 1.3M Jul 23  2025 sssd.ldb
-rw-------  1 root root  36M Apr 21 16:29 timestamps_customldap.ldb
-rw-------  1 root root 133M Apr 21 16:30 timestamps_domain.example.com.ldb
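
To spot oversized cache files without reading the listing by eye, a small helper along the following lines may be useful. This is a minimal sketch: the function name sssd_large_ldb and the default threshold of 50 MB (taken from the baseline range mentioned above) are assumptions for illustration, not part of UCS or SSSD.

```shell
# Hypothetical helper: print the names of .ldb files larger than a threshold.
# Usage: sssd_large_ldb [DIR] [THRESHOLD_MB]
#   DIR defaults to /var/lib/sss/db, THRESHOLD_MB defaults to 50 (assumed baseline).
sssd_large_ldb() {
    dir="${1:-/var/lib/sss/db}"
    limit_mb="${2:-50}"
    # "-size +Nk" matches files larger than N KiB.
    find "$dir" -maxdepth 1 -type f -name '*.ldb' \
        -size +"$((limit_mb * 1024))"k -exec basename {} \; | sort
}

# Example call on a UCS node (run as root, since the directory is mode 0700):
#   sssd_large_ldb /var/lib/sss/db 50
```

An empty result means that no database file exceeds the threshold. The helper could be run periodically, for example from a monitoring check, to catch growth before lookups slow down.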

Root Cause

SSSD maps these .ldb database files directly into memory via mmap(). When an LDB file grows to hundreds of megabytes:

  • Quadratic Performance Degradation: Directory lookup and indexing operations scale quadratically relative to database size.
  • Intensive Paging and CPU Spikes: Every lookup triggers aggressive disk paging and forces the system to traverse massive, fragmented index structures.
  • Socket Failures: The SSSD backend process (sssd_be) and the responder processes (sssd_nss, sssd_pam) can become unresponsive under high I/O wait times, tripping systemd timeouts and causing the corresponding .socket units to fail.

System Diagnostics Output

When this condition occurs, checking failed systemd services often highlights the SSSD responders:

root@production-node:~# systemctl --failed
  UNIT                  LOAD   ACTIVE SUB    DESCRIPTION                  
● sssd-nss.socket       loaded failed failed SSSD NSS Service responder socket
● sssd-pam-priv.socket  loaded failed failed SSSD PAM Service responder private socket
● sssd-pam.socket       loaded failed failed SSSD PAM Service responder socket
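
For scripted health checks, the same information can be reduced to a list of failed SSSD unit names. The sketch below assumes the standard column layout of systemctl (UNIT, LOAD, ACTIVE, SUB, DESCRIPTION); the function name failed_sssd_units is hypothetical. It reads from standard input so the filter can be tried without a running systemd.

```shell
# Hypothetical filter: print the names of failed units whose name starts with "sssd".
# Expects the output of "systemctl --failed --plain --no-legend" on stdin.
failed_sssd_units() {
    # Column 1 is the unit name, column 3 the ACTIVE state.
    awk '$1 ~ /^sssd/ && $3 == "failed" { print $1 }'
}

# Typical call on a live system:
#   systemctl --failed --plain --no-legend | failed_sssd_units
```

The --plain and --no-legend options suppress the status bullet and the header line, which keeps the column positions stable for awk.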

Monitoring running SSSD processes under load via top will show elevated memory footprints (RES / SHR) and sustained CPU usage on sssd_nss and sssd_be:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1297 root      20   0  979016 451380 328488 S   0.3   1.4 210:16.64 sssd_nss
 1286 root      20   0  593604 157476 134228 S   2.7   0.5  13:55.55 sssd_be

Performance Analysis: Pros & Cons of SSSD Caching

Feature Aspect: Aggressive Caching
  • Advantage (Pro): Immediate local lookups for known users; ensures uninterrupted logins during temporary identity provider offline windows.
  • Disadvantage (Contra): Cache-bloat vulnerability; heavy CPU penalties during index validation loops.

Feature Aspect: Background Updates
  • Advantage (Pro): SSSD proactively refreshes expiring cache entries in the background before users log in.
  • Disadvantage (Contra): Introduces a continuous baseline CPU and network load, even when the system is idling.

Feature Aspect: Nested Group Resolution
  • Advantage (Pro): Efficiently resolves multi-layered group structures on the backend rather than rebuilding recursions locally.
  • Disadvantage (Contra): High-turnover groups lead to constant cache invalidations and database growth.

Remediation & Mitigation Options

1. Modifying Group Optimization via UCR (Univention Configuration Registry)

UCS offers explicit UCR variables designed to improve lookups in enterprise-scale environments by creating a dedicated NSS cache file. Verify your current configuration:

ucr info nss/group/cachefile
ucr info nss/group/cachefile/invalidate_on_changes
  • nss/group/cachefile: When set to yes, group structures are exported to a local cache file and integrated using the NSS extrausers module. This can noticeably speed up group lookups in environments with many or large groups.
  • nss/group/cachefile/invalidate_on_changes: When enabled, the group cache file regenerates automatically when an administrative change occurs within the UCS management console.

2. Advanced Tuning: ignore_group_members

If your environment contains very large distribution lists or global groups where individual membership listings aren’t critical for local POSIX permissions, you can set the ignore_group_members directive in the domain section of the SSSD configuration file.

Hint

This option must be added manually to /etc/sssd/sssd.conf under your domain section, as there is currently no native UCR variable mapping for it. Direct modifications to template-generated files can be overwritten; ensure your configuration management workflows account for manual sssd.conf overrides.
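
Because the setting is maintained by hand, it helps to verify that it is still present after updates or template regeneration. The following sketch shows the expected stanza as a comment and a check that lists the domain sections in which the option is enabled. The function name domains_ignoring_members and the domain name example.com are placeholders, and the check assumes the usual one-option-per-line INI layout of sssd.conf.

```shell
# Expected stanza in /etc/sssd/sssd.conf (domain name is a placeholder):
#   [domain/example.com]
#   ignore_group_members = True

# Hypothetical check: print every [domain/...] section that sets
# ignore_group_members to true. Commented-out lines are not matched.
domains_ignoring_members() {
    conf="${1:-/etc/sssd/sssd.conf}"
    awk '
        /^[ \t]*\[/ {                          # section header, e.g. [domain/x]
            section = $0
            gsub(/^[ \t]*\[|\][ \t]*$/, "", section)
            next
        }
        section ~ /^domain\// {
            line = tolower($0)
            gsub(/[ \t]/, "", line)            # ignore spacing around "="
            if (line == "ignore_group_members=true") print section
        }
    ' "$conf"
}

# Example call:
#   domains_ignoring_members /etc/sssd/sssd.conf
```

SSSD reads sssd.conf at startup, so a change to this option only takes effect after the service has been restarted.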


Step-by-Step Guide: Purging the SSSD Cache

If the cache database size has already ballooned and performance is compromised, clear the cache to return the filesystem to a stable baseline.

Hint

Prerequisite Validation: Only flush the SSSD cache when your identity provider infrastructure (LDAP/Samba AD) is fully online and reachable. If the backend is unreachable after the cache has been purged, users will be unable to authenticate until it is available again. Taking a storage snapshot of the virtual machine prior to this operation is recommended.

Option A: Controlled Flush (Recommended)

This approach leverages built-in SSSD binaries to systematically invalidate cached objects across users, groups, netgroups, and sudo rules.

  1. Invalidate all records currently tracked in the database:
sss_cache -E
  2. Restart the SSSD service daemon to apply the cleanup and initiate fresh upstream directory lookups on next access:
systemctl restart sssd
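
When the flush is scripted, for example for use on several nodes, a dry-run switch makes it possible to review the commands before they are executed. The wrapper below is a sketch only: the function names run and flush_sssd_cache and the DRY_RUN variable are assumptions for illustration. It wraps exactly the two steps of Option A.

```shell
# Hypothetical wrapper: with DRY_RUN=1 the command is printed, otherwise executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Option A as one function: invalidate all cached entries, then restart SSSD.
# The restart is skipped if the invalidation step fails.
flush_sssd_cache() {
    run sss_cache -E && run systemctl restart sssd
}

# Review first, then execute (as root):
#   DRY_RUN=1 flush_sssd_cache
#   flush_sssd_cache
```

Chaining the steps with && avoids restarting the daemon when sss_cache reports an error, so the failure stays visible instead of being masked by a successful restart.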

Option B: Aggressive Cleanup (Complete Wipe)

If the .ldb database file remains physically massive on disk or the responder sockets are completely locked up, manually purge the cache files from disk.

  1. Stop the SSSD service:
systemctl stop sssd
  2. Delete all database files inside the SSSD database directory:
rm -rf /var/lib/sss/db/*
  3. (Optional) If minor clock drift exists between the UCS node and the identity provider, correct the system time to prevent instant token invalidation:
# Example only: sets the clock to May 27, 16:00. Substitute the current date and time.
# Format: MMDDhhmm (Month, Day, Hour, Minute)
date 05271600
  4. Start SSSD to generate a clean, empty database file structure:
systemctl start sssd

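After SSSD has started, the first lookups and authentication requests go directly to the authoritative identity server and rebuild a compact, unfragmented local cache. Expect these initial requests to be slower than usual until the cache has been repopulated.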

