Samba DNS resolution fails with most records

MasinAD-SI · September 14, 2020, 2:16pm

Hi,

After running ucr set dns/debug/level=10 and systemctl restart bind9.service, DNS resolution broke. This is the relevant part of my shell history:

  490  2020-09-14 14:10:29 ucr get dns
  491  2020-09-14 14:10:36 ucr get dns/debug/level 
  492  2020-09-14 14:10:42 ucr set dns/debug/level=3 
  493  2020-09-14 14:10:53 systemctl restart bind9.service 
  494  2020-09-14 14:11:18 cd /etc/rsyslog.d/
  495  2020-09-14 14:11:28 cp slapd-queries.conf bind9.conf
  496  2020-09-14 14:11:31 nano bind9.conf 
  497  2020-09-14 14:12:01 ll /var/log/
  498  2020-09-14 14:12:12 nano bind9.conf 
  499  2020-09-14 14:13:24 ucr --help
  500  2020-09-14 14:13:36 ucr info dns/debug/level 
  501  2020-09-14 14:14:59 ucr set dns/debug/level=10
  502  2020-09-14 14:15:05 systemctl restart bind9.service 
  503  2020-09-14 14:16:05 rndc querylog
  504  2020-09-14 14:16:20 ucr set dns/debug/level=0
  505  2020-09-14 14:16:23 systemctl restart bind9.service 
  506  2020-09-14 14:16:42 rndc --help
  507  2020-09-14 14:19:24 ll /etc/bind/
  508  2020-09-14 14:19:29 cd /etc/bind/
  509  2020-09-14 14:19:50 grep severity -R
  510  2020-09-14 14:20:48 named-checkconf /etc/bind/named.conf
  511  2020-09-14 14:20:53 echo $?
  512  2020-09-14 14:21:04 man named-checkconf 
  513  2020-09-14 14:29:22 host <redacted>

As you can see, there isn’t much in there that could have broken DNS resolution. I found some articles online stating that samba-ad-dc.service has to be started after bind9.service, but no combination of restarting, or stopping and starting, helped.

When I run samba_dnsupdate --verbose --all-names, I get a lot of output, with many records failing like this:

ERROR(runtime): uncaught exception - (1383, 'WERR_INTERNAL_DB_ERROR')
  File "/usr/lib/python2.7/dist-packages/samba/netcmd/__init__.py", line 185, in _run
    return self.run(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/samba/netcmd/dns.py", line 944, in run
    raise e
Failed 'samba-tool dns' based update of SRV _ldap._tcp.berlin._sites.ForestDnsZones.example.com ucs-addc.example.com 389
Failed update of 36 entries
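To get an overview of which records fail, rather than scrolling through the verbose output, the record lines can be pulled out with a small script. This is a minimal sketch that assumes the output format shown above: one "Failed '…' based update of TYPE NAME …" line per record plus a final "Failed update of N entries" summary. The function name parse_failures is hypothetical, and other failure messages samba_dnsupdate may print are simply ignored.

```python
import re

# Assumed line shapes, taken from the output above:
#   Failed 'samba-tool dns' based update of SRV _ldap._tcp... ucs-addc.example.com 389
#   Failed update of 36 entries
RECORD_RE = re.compile(r"^Failed '.*?' based update of (\S+) (\S+)(.*)$")
SUMMARY_RE = re.compile(r"^Failed update of (\d+) entries$")


def parse_failures(lines):
    """Return ([(record_type, record_name, rest), ...], reported_total or None)."""
    failures = []
    total = None
    for line in lines:
        line = line.strip()
        m = RECORD_RE.match(line)
        if m:
            rtype, name, rest = m.groups()
            # rest holds the record data, e.g. target host and port for SRV
            failures.append((rtype, name, rest.strip()))
            continue
        m = SUMMARY_RE.match(line)
        if m:
            total = int(m.group(1))
    return failures, total
```

For example, after capturing the output with samba_dnsupdate --verbose --all-names > /tmp/dnsupdate.log 2>&1 (the file name is arbitrary), parse_failures(open("/tmp/dnsupdate.log")) returns the failed records, and comparing len(failures) with the reported total shows whether every failure was matched.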

Since our local zone doesn’t resolve at the moment, I thought the updates might fail because ucs-addc.example.com cannot be resolved, but it resolves just fine thanks to the hosts file.

Even a server reboot did not help. Is there any way to get DNS back up and running?

MasinAD-SI · September 15, 2020, 11:47am

Additional information:

There are more problems. A lot of groups were in cn=users instead of cn=groups; I moved them back. The user uid=krbtgt is missing, and the S4 Connector rejects syncing it from Samba 4. Additionally, relativeDomainName=@,zoneName=example.com,cn=dns,dc=example,dc=com is missing in S4.

I’d like to solve the krbtgt issue first, but the steps from "[System diagnostic] User "krbtgt": S4 Connector & Check well known SIDs" don’t work for me. The log still shows:

15.09.2020 13:41:49.660 LDAP        (PROCESS): sync from ucs:   Resync rejected file: /var/lib/univention-connector/s4/1598434249.010041
15.09.2020 13:41:49.663 LDAP        (PROCESS): sync from ucs: [           dns] [    delete] DC=@,DC=example.com,CN=MicrosoftDNS,DC=DomainDnsZones,DC=example,DC=com
15.09.2020 13:41:49.681 LDAP        (WARNING): sync failed, saved as rejected
	/var/lib/univention-connector/s4/1598434249.010041
15.09.2020 13:41:49.681 LDAP        (WARNING): Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/univention/s4connector/__init__.py", line 891, in __sync_file_from_ucs
    if ((old_dn and not self.sync_from_ucs(key, mapped_object, pre_mapped_ucs_dn, unicode(old_dn, 'utf8'), old, new)) or (not old_dn and not self.sync_from_ucs(key, mapped_object, pre_mapped_ucs_dn, old_dn, old, new))):
  File "/usr/lib/python2.7/dist-packages/univention/s4connector/s4/__init__.py", line 2617, in sync_from_ucs
    self.property[property_type].con_sync_function(self, property_type, object)
  File "/usr/lib/python2.7/dist-packages/univention/s4connector/s4/dns.py", line 1630, in ucs2con
    s4_zone_delete(s4connector, object)
  File "/usr/lib/python2.7/dist-packages/univention/s4connector/s4/dns.py", line 880, in s4_zone_delete
    res = s4connector.lo_s4.lo.delete_s(zone_dn)
  File "/usr/lib/python2.7/dist-packages/ldap/ldapobject.py", line 333, in delete_s
    return self.delete_ext_s(dn,None,None)
  File "/usr/lib/python2.7/dist-packages/ldap/ldapobject.py", line 326, in delete_ext_s
    resp_type, resp_data, resp_msgid, resp_ctrls = self.result3(msgid,all=1,timeout=self.timeout)
  File "/usr/lib/python2.7/dist-packages/ldap/ldapobject.py", line 514, in result3
    resp_ctrl_classes=resp_ctrl_classes
  File "/usr/lib/python2.7/dist-packages/ldap/ldapobject.py", line 521, in result4
    ldap_result = self._ldap_call(self._l.result4,msgid,all,timeout,add_ctrls,add_intermediates,add_extop)
  File "/usr/lib/python2.7/dist-packages/ldap/ldapobject.py", line 106, in _ldap_call
    result = func(*args,**kwargs)
NOT_ALLOWED_ON_NONLEAF: {'info': '00002015: subtree_delete: Unable to delete a non-leaf node (it has 62 children)!', 'desc': 'Operation not allowed on non-leaf'}

15.09.2020 13:41:49.681 LDAP        (PROCESS): sync to ucs: Resync rejected dn: CN=krbtgt,CN=Users,DC=example,DC=com
15.09.2020 13:41:49.686 LDAP        (PROCESS): sync to ucs:   [          user] [       add] u'uid=krbtgt,CN=Users,dc=example,dc=com'
15.09.2020 13:41:49.943 LDAP        (ERROR  ): Value may not change: key=gidNumber old=None new=5001 (u'uid=krbtgt,CN=Users,dc=example,dc=com')

I guess both sync issues are connected to my DNS problem.

MasinAD-SI · September 16, 2020, 10:32am

I solved my problem yesterday. It’s kind of a hack, so I don’t advise following these steps unless you are out of options (as I was).

The first step was to force krbtgt to sync again.
Then I did basically the same with the DNS branch, except that this time I just renamed the branch in S4, waited for a sync, and then renamed it back.
Afterwards, I stopped bind9.service and samba-ad-dc.service and started bind9.service again (it Wants samba-ad-dc.service and therefore starts it automatically).
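The restart order from the last step can be written down as a small script so it is repeated the same way next time. This is a sketch that assumes the unit names used in this thread and relies on bind9.service pulling in samba-ad-dc.service via Wants, as described above, instead of starting Samba explicitly. It defaults to a dry run that only prints the commands; the function name restart_dns_stack is hypothetical.

```python
import subprocess

# Unit names as used in this thread (assumption: UCS with Samba 4 AD DC and bind9).
BIND_UNIT = "bind9.service"
SAMBA_UNIT = "samba-ad-dc.service"


def restart_dns_stack(dry_run=True):
    """Stop bind9 and Samba, then start only bind9.

    Relies on bind9.service having Wants=samba-ad-dc.service, so systemd
    starts Samba as well. Returns the commands in the order they are run.
    """
    commands = [
        ["systemctl", "stop", BIND_UNIT],
        ["systemctl", "stop", SAMBA_UNIT],
        ["systemctl", "start", BIND_UNIT],
    ]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if a unit fails to stop or start
            subprocess.run(cmd, check=True)
    if not dry_run:
        # is-active --quiet exits with 0 only if the unit is running
        active = subprocess.run(["systemctl", "is-active", "--quiet", SAMBA_UNIT])
        if active.returncode != 0:
            print("warning: {} did not come up with {}".format(SAMBA_UNIT, BIND_UNIT))
    return commands
```

Running restart_dns_stack() first shows the planned sequence; restart_dns_stack(dry_run=False) executes it as root and warns if Samba was not pulled in.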

What’s still a mystery is how so many errors could accumulate. Nobody moved the groups, yet they ended up in the Users branch. Why were there any sync errors at all? And why am I not notified of sync errors?

If I could make some feature requests, I’d ask for admins to be notified of any sync errors and for better logging of what the S4 Connector does. At the moment, I suspect the S4 Connector is the culprit behind our directory corruption.
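Until such a notification exists, something similar can be approximated by scanning the connector log. This is a minimal sketch, not a Univention feature: it assumes the log lives at /var/log/univention/connector-s4.log and that lines look like the excerpt above ("sync failed, saved as rejected" followed by the rejected file on the next line, and "Resync rejected file:/dn:" for retries). All function names are hypothetical.

```python
import re

# Assumed default log location on UCS; adjust if yours differs.
DEFAULT_LOG = "/var/log/univention/connector-s4.log"

# Line shapes taken from the connector log excerpt above.
FAILED_RE = re.compile(r"^(\S+ \S+) LDAP\s+\(WARNING\): sync failed, saved as rejected")
RESYNC_RE = re.compile(
    r"^(\S+ \S+) LDAP\s+\(PROCESS\): sync (from|to) ucs:\s+Resync rejected (file|dn): (.+)$"
)


def find_sync_problems(lines):
    """Collect new rejections and resync attempts of already rejected objects."""
    rejected = []   # (timestamp, saved file) for "sync failed, saved as rejected"
    retried = []    # (timestamp, direction, kind, target) for "Resync rejected ..."
    pending = None  # timestamp of a "saved as rejected" line awaiting its path
    for raw in lines:
        line = raw.rstrip("\n")
        if pending is not None:
            # The path of the rejected file follows on the next, indented line.
            rejected.append((pending, line.strip()))
            pending = None
            continue
        m = FAILED_RE.match(line)
        if m:
            pending = m.group(1)
            continue
        m = RESYNC_RE.match(line)
        if m:
            retried.append(m.groups())
    return rejected, retried


def report(rejected, retried):
    """Build a short text that could be mailed to an admin; empty if nothing found."""
    if not rejected and not retried:
        return ""
    out = ["S4 connector: {} new rejections, {} resync attempts of rejected objects".format(
        len(rejected), len(retried))]
    for ts, path in rejected:
        out.append("  rejected at {}: {}".format(ts, path))
    for ts, direction, kind, target in retried:
        out.append("  resync {} ucs at {}: {} {}".format(direction, ts, kind, target))
    return "\n".join(out)


def check_log(path=DEFAULT_LOG):
    with open(path, encoding="utf-8", errors="replace") as f:
        return report(*find_sync_problems(f))
```

Run from cron, a non-empty check_log() result could be mailed to the admins. Note that this rescans the whole file every time, so old entries are reported again until the log rotates; remembering the last read offset would avoid that.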