How-To: Debug High CPU Load on LDAP Server
Step 1: Identify the issue
You notice a high CPU load on your UCS server caused by the slapd (OpenLDAP) process. In top you might see:
top - 15:31:13 up 103 days, 19:34, 3 users, load average: 3,49, 1,70, 1,88
Tasks: 194 total, 2 running, 192 sleeping, 0 stopped, 0 zombie
%Cpu(s): 61,4 us, 9,6 sy, 0,0 ni, 27,8 id, 0,0 wa, 0,0 hi, 1,1 si, 0,0 st
KiB Mem: 65981876 total, 14902920 used, 51078956 free, 257468 buffers
KiB Swap: 10821628 total, 0 used, 10821628 free. 11232612 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11787 root 20 0 3774964 668400 573568 S 214,2 1,0 14055:40 slapd
6161 root 20 0 72576 34992 7364 R 15,3 0,1 0:11.15 ldap-group-to-f
3691 root 20 0 0 0 0 S 0,3 0,0 0:43.44 kworker/2:2
1 root 20 0 29992 5592 3140 S 0,0 0,0 2:51.38 systemd
Where you see the process of the OpenLDAP server (called “slapd”) is using a high amount of CPU. It’s process ID (PID) is in the case above 11787.
Step 2: strace
Use “strace” to check what the process is doing.
root@master:/usr/lib/univention-pam# strace -p 11787
Process 11787 attached
futex(0x7f53a08cc9d0, FUTEX_WAIT, 11788, NULL
From this output nothing more is happening. It tells us the process is waiting for it’s client to finish. So this is the parent process doing nothing. We have to identify the child processes.
Abort the strace command with “Ctrl-C”.
Step 3: Identify child PIDs
Start “top -p 11787” and press “H”. You will see all child processes of the parent slapd process:
top - 15:47:44 up 103 days, 19:51, 3 users, load average: 2,78, 1,86, 1,67
Threads: 328 total, 7 running, 321 sleeping, 0 stopped, 0 zombie
%Cpu(s): 48,4 us, 2,0 sy, 0,0 ni, 48,9 id, 0,0 wa, 0,0 hi, 0,7 si, 0,0 st
KiB Mem: 65981876 total, 14873484 used, 51108392 free, 257468 buffers
KiB Swap: 10821628 total, 0 used, 10821628 free. 11232112 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11800 root 20 0 3774964 668400 573568 R 14,3 1,0 974:32.52 slapd
11793 root 20 0 3774964 668400 573568 R 13,6 1,0 973:50.57 slapd
11801 root 20 0 3774964 668400 573568 S 13,6 1,0 972:20.04 slapd
Step 4: Check Client’s Activity
Use “strace -p 11787” for a client’s PID to see the activity. You might see an output like this:
write(5, "0", 1) = 1
write(21, "\27\3\3\0\201Z\371=\353s\274\237\344\377'\25\26\304\334a\201d\362\307qH2\357\337\343\251:"..., 134) = 134
write(21, "\27\3\3\0(Z\371=\353s\274\237\345{+RG\336Z\235\213\\\7\304\377\206\327Z\300\20\333\240"..., 45) = 45
futex(0x55798234d094, FUTEX_WAIT_PRIVATE, 1091144229, NULL) = 0
futex(0x55798234d068, FUTEX_WAKE_PRIVATE, 1) = 0
read(21, "\27\3\3\0~", 5) = 5
read(21, "\377\363\2462B\256\25\226A\257l\t\360@\332\223\3?\244\336<\fV}q\372\315N\316O\254\275"..., 126) = 126
read(21, 0x7f536017d0f3, 5) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(6, EPOLL_CTL_MOD, 21, {EPOLLIN, {u32=703811684, u64=139999457792100}}) = 0
The important information in the above output are the lines starting with “read(21
”. It tells us the process is reading from the file descriptor (fd) 21.
Step 5: Identify the accessed file
Use the parent PID (11787) and the child PID (11801) to check what file is accessed by the file descriptor (21):
root@master:/# ls -alh /proc/11787/task/11801/fd/21
lrwx------ 1 root root 64 Jan 28 13:07 /proc/11787/task/11801/fd/21 -> socket:[193162363]
The output points us to an open (presumably network) socket with the ID of 193162363
Step 6: Identify Connection
Now we use the socket ID to identify the related connection:
root@master:/# lsof -i | grep 193162363
slapd 11787 root 21u IPv4 193162363 0t0 TCP master.schulen.ucs:7389->slave1.schulen.ucs:51882 (ESTABLISHED)
Using a script
Instead of the above steps you might use the following script which outputs the IP addresses of the connected systems to identify who is connecting the server:
#!/bin/bash
rm -f slapd_trace.* &>/dev/null
rm -f nohup.* &>/dev/null
if [ ! -e /usr/bin/strace ] ; then
apt --assume-yes install strace
fi
pgrep slapd > /tmp/list.$$
for pid in $(cat /tmp/list.$$) ; do
nohup strace -y -yy -f -o "slapd_trace.$pid" -p "$pid" > "nohup.$pid" &
done
sleep 30s
pkill strace
grep read slapd_trace.*| grep "TCP:"| cut -d ">" -f 2| cut -d ":" -f 1| sort | tee "ip_list1.$$" |uniq > "ip_list2.$$"
for i in $(cat ip_list2.$$); do
echo -n "$i: "
grep -c "$i ip_list1.$$"
done
rm -f "/tmp/list.$$" &>/dev/null
rm -f "ip_list?.&&" &>/dev/null
Summary
The above steps help us to identify the possible reason for a high load. Now we know which slave server is doing a lot of requests, causing the high load of the slapd on the master.
You may also take a look at:
How-To: Enable Debugging for libnss-ldap
How-To: Troubleshooting LDAP Performance Issues