How To: Debug High CPU Load on LDAP Server

Christian_Voelker · January 28, 2019, 3:03pm

How-To: Debug High CPU Load on LDAP Server

Step 1: Identify the issue

You notice a high CPU load on your UCS server caused by the slapd (OpenLDAP) process. In top you might see:

top - 15:31:13 up 103 days, 19:34,  3 users,  load average: 3,49, 1,70, 1,88
Tasks: 194 total,   2 running, 192 sleeping,   0 stopped,   0 zombie
%Cpu(s): 61,4 us,  9,6 sy,  0,0 ni, 27,8 id,  0,0 wa,  0,0 hi,  1,1 si,  0,0 st
KiB Mem:  65981876 total, 14902920 used, 51078956 free,   257468 buffers
KiB Swap: 10821628 total,        0 used, 10821628 free. 11232612 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                        
11787 root      20   0 3774964 668400 573568 S 214,2  1,0  14055:40 slapd                                                                                                                                                                                                                                                                                 
 6161 root      20   0   72576  34992   7364 R  15,3  0,1   0:11.15 ldap-group-to-f                                                                                                                                
 3691 root      20   0       0      0      0 S   0,3  0,0   0:43.44 kworker/2:2                                                                                                                                    
    1 root      20   0   29992   5592   3140 S   0,0  0,0   2:51.38 systemd

Where you see the process of the OpenLDAP server (called “slapd”) is using a high amount of CPU. It’s process ID (PID) is in the case above 11787.

Step 2: strace

Use “strace” to check what the process is doing.

root@master:/usr/lib/univention-pam# strace -p 11787
Process 11787 attached
futex(0x7f53a08cc9d0, FUTEX_WAIT, 11788, NULL

From this output nothing more is happening. It tells us the process is waiting for it’s client to finish. So this is the parent process doing nothing. We have to identify the child processes.
Abort the strace command with “Ctrl-C”.

Step 3: Identify child PIDs

Start “top -p 11787” and press “H”. You will see all child processes of the parent slapd process:


top - 15:47:44 up 103 days, 19:51,  3 users,  load average: 2,78, 1,86, 1,67
Threads: 328 total,   7 running, 321 sleeping,   0 stopped,   0 zombie
%Cpu(s): 48,4 us,  2,0 sy,  0,0 ni, 48,9 id,  0,0 wa,  0,0 hi,  0,7 si,  0,0 st
KiB Mem:  65981876 total, 14873484 used, 51108392 free,   257468 buffers
KiB Swap: 10821628 total,        0 used, 10821628 free. 11232112 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                                                         
11800 root      20   0 3774964 668400 573568 R 14,3  1,0 974:32.52 slapd                                                                                                                                           
11793 root      20   0 3774964 668400 573568 R 13,6  1,0 973:50.57 slapd                                                                                                                                           
11801 root      20   0 3774964 668400 573568 S 13,6  1,0 972:20.04 slapd

Step 4: Check Client’s Activity

Use “strace -p 11787” for a client’s PID to see the activity. You might see an output like this:

write(5, "0", 1)                        = 1
write(21, "\27\3\3\0\201Z\371=\353s\274\237\344\377'\25\26\304\334a\201d\362\307qH2\357\337\343\251:"..., 134) = 134
write(21, "\27\3\3\0(Z\371=\353s\274\237\345{+RG\336Z\235\213\\\7\304\377\206\327Z\300\20\333\240"..., 45) = 45
futex(0x55798234d094, FUTEX_WAIT_PRIVATE, 1091144229, NULL) = 0
futex(0x55798234d068, FUTEX_WAKE_PRIVATE, 1) = 0
read(21, "\27\3\3\0~", 5)               = 5
read(21, "\377\363\2462B\256\25\226A\257l\t\360@\332\223\3?\244\336<\fV}q\372\315N\316O\254\275"..., 126) = 126
read(21, 0x7f536017d0f3, 5)             = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(6, EPOLL_CTL_MOD, 21, {EPOLLIN, {u32=703811684, u64=139999457792100}}) = 0

The important information in the above output are the lines starting with “read(21”. It tells us the process is reading from the file descriptor (fd) 21.

Step 5: Identify the accessed file

Use the parent PID (11787) and the child PID (11801) to check what file is accessed by the file descriptor (21):

root@master:/# ls -alh /proc/11787/task/11801/fd/21
lrwx------ 1 root root 64 Jan 28 13:07 /proc/11787/task/11801/fd/21 -> socket:[193162363]

The output points us to an open (presumably network) socket with the ID of 193162363

Step 6: Identify Connection

Now we use the socket ID to identify the related connection:

root@master:/# lsof -i | grep 193162363
slapd     11787       root   21u  IPv4 193162363      0t0  TCP master.schulen.ucs:7389->slave1.schulen.ucs:51882 (ESTABLISHED)

Using a script

Instead of the above steps you might use the following script which outputs the IP addresses of the connected systems to identify who is connecting the server:

#!/bin/bash

rm -f slapd_trace.* &>/dev/null
rm -f nohup.* &>/dev/null
if [ ! -e /usr/bin/strace ] ; then
 apt --assume-yes install strace
fi

pgrep slapd > /tmp/list.$$

for pid in $(cat /tmp/list.$$) ; do
	nohup strace -y -yy -f -o "slapd_trace.$pid" -p "$pid"  > "nohup.$pid" &
done
sleep 30s
pkill strace
grep read slapd_trace.*| grep "TCP:"| cut -d ">" -f 2| cut -d ":" -f 1| sort | tee "ip_list1.$$" |uniq > "ip_list2.$$"
for i in $(cat ip_list2.$$); do
 echo -n "$i: "
 grep -c "$i ip_list1.$$"
done

rm -f "/tmp/list.$$" &>/dev/null
rm -f "ip_list?.&&" &>/dev/null

Summary

The above steps help us to identify the possible reason for a high load. Now we know which slave server is doing a lot of requests, causing the high load of the slapd on the master.

You may also take a look at:
How-To: Enable Debugging for libnss-ldap
How-To: Troubleshooting LDAP Performance Issues