Unable to add computers due to failures during conversion of Backup to Primary

Due to a catastrophic hardware failure, the virtualization host for my primary UCS server died and cannot be restored in a reasonable time. So, I decided to convert my backup UCS server (on a different virtualization host) to become the new primary. I’ll add a new backup UCS server later.

I am using UCS 5.0-1 with the latest updates.

I followed the current documentation for this procedure. Despite the frequent I/O errors and reboots I was getting on the original primary server during the process, I carefully followed and double-checked every step, and I kept the artifacts (base.conf, base-forced.conf, dpkg.selection, and the ldap-schema files) on my independent workstation.

While I’m happy to share everything that broke after I permanently shut down the original primary UCS host and ran the backup2master script, I’ve resolved everything I can up to now. I’d like to focus on this last(?) issue unless you want to explore what else I’ve had to fix (I’m keeping notes).

Right now, I cannot join any computers to the new primary UCS server. When I log into the UCS web console, I’m repeatedly told that a script in the “Domain Join” module needs to be run. It fails every time I try to run it. I eventually found this error in the Domain Join log via the UCS web console:

RUNNING 98univention-samba4-saml-kerberos.inst
2022-06-14 16:21:34.487832395-05:00 (in joinscript_init)
could not obtain current kerberos secret for sso user
__JOINERR__:FAILED: /usr/lib/univention-install/98univention-samba4-saml-kerberos.inst

Strangely, the reported EXITCODE=0 looks more like a success than a failure, yet the script keeps coming back as Pending.
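
To pull the failing scripts out of a long join log, a small sketch like the following could help. It assumes only the line shapes visible in the excerpt above (`RUNNING <script>` and `__JOINERR__:FAILED: <path>`); the function name `failed_join_scripts` and the `SAMPLE` constant are hypothetical helpers for illustration, not part of UCS.

```python
import re

# Line shapes taken from the join log excerpt above; other log layouts are not handled.
RUNNING_RE = re.compile(r"^RUNNING (\S+)")
FAILED_RE = re.compile(r"^__JOINERR__:FAILED: (\S+)")


def failed_join_scripts(log_text):
    """Return [(script_path, context_lines)] for every failed join script."""
    failures = []
    context = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if RUNNING_RE.match(line):
            # A new script starts: forget the output of the previous one.
            context = []
            continue
        failed = FAILED_RE.match(line)
        if failed:
            failures.append((failed.group(1), context))
            context = []
        else:
            # Output printed by the script that is currently running.
            context.append(line)
    return failures


# The excerpt from this post, used as sample input.
SAMPLE = """\
RUNNING 98univention-samba4-saml-kerberos.inst
2022-06-14 16:21:34.487832395-05:00 (in joinscript_init)
could not obtain current kerberos secret for sso user
__JOINERR__:FAILED: /usr/lib/univention-install/98univention-samba4-saml-kerberos.inst
"""

if __name__ == "__main__":
    for script, context in failed_join_scripts(SAMPLE):
        print(script)
        for line in context:
            print("   ", line)
```

On a real system the text would come from the join log file instead of `SAMPLE` (I believe that is /var/log/univention/join.log, but treat the path as an assumption). Keeping the lines printed before each failure makes it easier to see the message that actually caused it, here the missing kerberos secret for the sso user.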

Regarding the sso user (ucs-sso), it is one of several objects that failed during the backup2master process. Strangely, all of the affected objects were present in both Samba and LDAP. I confirmed every one of them by hand, yet they were throwing errors in /var/log/univention/connector-s4.log. Per the documentation, I manually verified and then removed them all from the sqlite database. However, the sso user keeps coming back as failing to sync (usually after I try to re-run 98univention-samba4-saml-kerberos.inst). The stack trace for the sso user is:

14.06.2022 15:39:05.266 LDAP        (PROCESS): sync AD > UCS: Resync rejected dn: 'CN=ucs-sso,CN=Users,DC=redacted,DC=tld'
14.06.2022 15:39:05.275 LDAP        (PROCESS): sync AD > UCS: [          user] [    modify] 'uid=ucs-sso,cn=users,dc=redacted,dc=tld'
14.06.2022 15:39:05.289 LDAP        (ERROR  ): Unknown Exception during sync_to_ucs
14.06.2022 15:39:05.289 LDAP        (ERROR  ): Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/univention/admin/handlers/users/user.py", line 2299, in __allocate_rid
    return self.request_lock('sid', sid)
  File "/usr/lib/python3/dist-packages/univention/admin/handlers/__init__.py", line 1693, in request_lock
    value = univention.admin.allocators.request(self.lo, self.position, name, value)
  File "/usr/lib/python3/dist-packages/univention/admin/allocators.py", line 209, in request
    return acquireUnique(lo, position, type, value, _type2attr[type], scope=_type2scope[type])
  File "/usr/lib/python3/dist-packages/univention/admin/allocators.py", line 198, in acquireUnique
    univention.admin.locking.lock(lo, position, type, value.encode('utf-8'), scope=scope)
  File "/usr/lib/python3/dist-packages/univention/admin/locking.py", line 121, in lock
    raise univention.admin.uexceptions.noLock(_('The attribute %r could not get locked.') % (type,))
univention.admin.uexceptions.noLock: The attribute 'sid' could not get locked.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/univention/s4connector/__init__.py", line 1498, in sync_to_ucs
    result = self.modify_in_ucs(property_type, object, module, position)
  File "/usr/lib/python3/dist-packages/univention/s4connector/__init__.py", line 1223, in modify_in_ucs
    res = ucs_object.modify(serverctrls=serverctrls, response=response)
  File "/usr/lib/python3/dist-packages/univention/admin/handlers/users/user.py", line 1503, in modify
    return super(object, self).modify(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/univention/admin/handlers/__init__.py", line 638, in modify
    dn = self._modify(modify_childs, ignore_license=ignore_license, response=response)
  File "/usr/lib/python3/dist-packages/univention/admin/handlers/__init__.py", line 1342, in _modify
    ml = self._ldap_modlist()
  File "/usr/lib/python3/dist-packages/univention/admin/handlers/users/user.py", line 1796, in _ldap_modlist
    ml = self._modlist_samba_sid(ml)
  File "/usr/lib/python3/dist-packages/univention/admin/handlers/users/user.py", line 2160, in _modlist_samba_sid
    sid = self.__generate_user_sid(self['uidNumber'])
  File "/usr/lib/python3/dist-packages/univention/admin/handlers/users/user.py", line 2305, in __generate_user_sid
    return self.__allocate_rid(self['sambaRID'])
  File "/usr/lib/python3/dist-packages/univention/admin/handlers/users/user.py", line 2301, in __allocate_rid
    raise univention.admin.uexceptions.sidAlreadyUsed(rid)
univention.admin.uexceptions.sidAlreadyUsed: 1104
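
Since the connector log gets long quickly, a sketch like the one below could reduce it to one line per rejected object: the DN and the last exception raised for it. It assumes only the two line shapes visible in the trace above (`Resync rejected dn: '<DN>'` and a final `univention.<...>: <message>` line); `summarize_rejects` and `SAMPLE` are hypothetical names, and this is not a UCS tool.

```python
import re

# Patterns derived from the connector-s4.log excerpt above.
REJECT_RE = re.compile(r"Resync rejected dn: '([^']+)'")
EXC_RE = re.compile(r"^(univention\.[\w.]+): (.*)$")


def summarize_rejects(log_text):
    """Map each rejected DN to the last (exception, message) logged after it."""
    summary = {}
    current_dn = None
    for raw in log_text.splitlines():
        rejected = REJECT_RE.search(raw)
        if rejected:
            current_dn = rejected.group(1)
            summary.setdefault(current_dn, None)
            continue
        exc = EXC_RE.match(raw.strip())
        if exc and current_dn is not None:
            # Later exceptions overwrite earlier ones, so the chained
            # "During handling of the above exception" case keeps the final error.
            summary[current_dn] = (exc.group(1), exc.group(2))
    return summary


# Shortened copy of the trace from this post.
SAMPLE = """\
14.06.2022 15:39:05.266 LDAP        (PROCESS): sync AD > UCS: Resync rejected dn: 'CN=ucs-sso,CN=Users,DC=redacted,DC=tld'
14.06.2022 15:39:05.289 LDAP        (ERROR  ): Unknown Exception during sync_to_ucs
    raise univention.admin.uexceptions.noLock(_('The attribute %r could not get locked.') % (type,))
univention.admin.uexceptions.noLock: The attribute 'sid' could not get locked.
    raise univention.admin.uexceptions.sidAlreadyUsed(rid)
univention.admin.uexceptions.sidAlreadyUsed: 1104
"""

if __name__ == "__main__":
    for dn, error in summarize_rejects(SAMPLE).items():
        print(dn, "->", error)
```

For the trace above this collapses to the ucs-sso DN paired with `sidAlreadyUsed` and the value 1104, which suggests (but does not prove) that another object already holds that RID.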

I have a bunch of computers that need to be re-joined to the new UCS primary server. They were all joined to the original primary, but they have all been disconnected since backup2master was run. Strangely, LDAP and Samba both list them as present, but they are definitely not actually joined.

Thanks much for any and all guidance.

This website only permits 2 links per post, so I was unable to share what other references I’ve attempted to follow when trying to resolve this on my own. To no avail, I have tried some tips from the following links with seemingly similar issues:

I have had services down for over a day now because I’m unable to join computers to the promoted UCS host (backup to primary). I’ve tailed various logs while attempting to join computers, but I can’t find any log traffic that would hint at a root cause or a solution for this issue.

I’m running low on ideas. Is it safe to delete the ucs-sso user from LDAP, Samba, or both? Will it regenerate in a usable fashion? Is this even the reason I cannot join computers to the new primary UCS host?

Well, this went from bad to much worse. Today, every user who logged into a Windows workstation was forced into a brand-new user profile, including me. It took me all day to sort out what had happened. Evidently, the backup2master script, or the normal UCS backup behavior prior to running the script, had failed to synchronize the SIDs between the original primary and the backup-turned-primary. Every SID I checked was different (Windows records in the Registry the SID of every user who logs into a workstation).
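
For anyone who wants to run the same check, here is a rough sketch of the comparison I did by hand. It assumes you have already collected two username-to-SID mappings yourself, one from before the conversion (for example from the workstations' Registry) and one from the new primary; how you collect them is up to you. The names `split_sid` and `sid_mismatches` are hypothetical, and the SIDs in the usage example are made-up placeholders, not values from my domain.

```python
def split_sid(sid):
    """Split a SID string into (domain_part, rid); the RID is the last component."""
    domain, _, rid = sid.rpartition("-")
    return domain, int(rid)


def sid_mismatches(before, after):
    """Return {user: (old_sid, new_sid, reason)} for users whose SID changed.

    Users that are missing from `after` are skipped in this sketch.
    """
    changed = {}
    for user, old_sid in before.items():
        new_sid = after.get(user)
        if new_sid is None or new_sid == old_sid:
            continue
        old_domain, _old_rid = split_sid(old_sid)
        new_domain, _new_rid = split_sid(new_sid)
        # A different domain part points at the whole domain SID having changed;
        # the same domain part with a different RID points at per-object allocation.
        reason = "domain SID differs" if old_domain != new_domain else "RID differs"
        changed[user] = (old_sid, new_sid, reason)
    return changed


if __name__ == "__main__":
    # Placeholder SIDs for illustration only.
    before = {
        "alice": "S-1-5-21-1111-2222-3333-1104",
        "bob": "S-1-5-21-1111-2222-3333-1105",
    }
    after = {
        "alice": "S-1-5-21-1111-2222-3333-1204",
        "bob": "S-1-5-21-1111-2222-3333-1105",
    }
    for user, (old, new, reason) in sid_mismatches(before, after).items():
        print(user, old, "->", new, f"({reason})")
```

Separating the domain part from the RID is the useful bit: it would show whether the whole domain SID changed or only the per-user RIDs, which are two different failures.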

With this new work-destroying disaster, I had no choice. I was forced to kill the new UCS primary and restore the original primary from backup onto a different virtualization host. I wanted to avoid that because the original primary was having serious issues prior to its complete meltdown (periodic, inexplicable hard lockups in its VM). It took hours to clean up the mess this created, but we’re back on our feet again. My users are logging back into their original profiles and can finally access all their programs and files. Next, I’ll start re-joining the computers that dropped off the domain when I attempted the backup2primary process.

The backup2primary process is deeply flawed. It does not work. Maybe some light QA testing might make it appear to work, but out here in the real world with an actual organization with real servers and users depending on it to work in a real disaster scenario, it absolutely fails.

Please, please fix the backup2master script and the related process. Honestly… I should just have a single button somewhere in the UCS web interface that handles this take-over with no more than a single click. All the manual work, all the pain, all the “Oh! It didn’t move that, so I guess I need to do that myself!”, all the failures… just shouldn’t be. I suffered a real disaster scenario. I expected backup2primary to be simple, reliable, and successful. It… was not.