Help debug : DNS failed when recover from suspend

Prunk Dump
Hello Debian Team!

I'm the network administrator of a French high school and I am having
trouble debugging a DNS lookup problem affecting all 550 of my Debian
Buster clients.

I have two critical systemd services running on my clients:
-> "puppet", which propagates my whole network configuration.
-> "samba winbind", which provides PAM authentication and Name Service Switch for users.
Both services use DNS to find their servers.

But when a client resumes from suspend, these two services fail to
work until the next DNS query.

-> Puppet gives the following error:
puppet-agent[3312]: Failed to open TCP connection to puppet:8140
(getaddrinfo: Name or service not known)

-> Samba winbind gives:
ads_find_dc: name resolution for realm '****' (domain '"****') failed:
NT_STATUS_NO_LOGON_SERVERS

This is very problematic for me, as puppet runs only every 30 minutes
(so the configuration is applied 30 minutes too late). And sometimes
this makes winbind fail completely. It never recovers, and a host
reboot is needed.

Any idea where I can start looking?
Is there anything I can try to identify the origin of the problem?

Thank you very much!

Baptiste.

Re: Help debug : DNS failed when recover from suspend

Étienne Mollier
Baptiste, on 2019-09-17:
> I have two critical systemd services running on my clients:
> -> "puppet", which propagates my whole network configuration.
> -> "samba winbind", which provides PAM authentication and Name Service Switch for users.
> Both services use DNS to find their servers.

Hi Baptiste,

> Any idea where I can start looking?

DNS resolvers are listed in /etc/resolv.conf: have a look and
see if the content is consistent with your intended configuration.
Network components tend to fight over control of
/etc/resolv.conf (dhclient, NetworkManager, the admin and their
trusty "vi" editor, etc).  The program "resolvconf" can be
installed to arbitrate this if necessary, although I've never
used it myself, yet.

> Is there anything I can try to identify the origin of the problem?

If those particular services are critical, and assuming your IP
address assignments are static, at least for these core
components of your network, then maybe you will want to consider
using the plain IP address instead of relying on the DNS
resolver's availability.  At least, it would be worth trying
this on a given machine, to see if the services start
correctly, or if you hit the next error message instead.

[... rewinding ...]
> But when a client resumes from suspend, these two services fail to
> work until the next DNS query.
>
> -> Puppet gives the following error:
> puppet-agent[3312]: Failed to open TCP connection to puppet:8140
> (getaddrinfo: Name or service not known)

Make sure your search domains are present in /etc/resolv.conf,
otherwise your machine will certainly not be able to resolve the
name "puppet".
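
To see why a missing search domain would break a lookup of the bare
name "puppet", here is a minimal Python sketch of how a resolv.conf-style
file turns a single-label name into candidate names. It is a
simplification of the real resolver (it ignores the "ndots" option and
returns dotted names unchanged), and the function names are hypothetical.

```python
def parse_resolv_conf(text):
    """Return (nameservers, search_domains) from resolv.conf-style text."""
    nameservers, search = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # skip blank lines and comments
            continue
        fields = line.split()
        keyword, values = fields[0], fields[1:]
        if keyword == "nameserver" and values:
            nameservers.append(values[0])
        elif keyword == "search":
            # "domain" and "search" override each other: the last one wins.
            search = values
        elif keyword == "domain" and values:
            search = values[:1]
    return nameservers, search


def candidates(name, search):
    """List the names that would be tried for a lookup of `name`.

    Simplification: only single-label names are expanded with the
    search list; dotted names are returned unchanged.
    """
    if "." in name:
        return [name]
    expanded = [f"{name}.{domain.rstrip('.')}" for domain in search]
    return expanded + [name]
```

With an empty search list, the only candidate left for "puppet" is the
bare name itself, which a DNS server for a private zone will normally
not resolve.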

When you mention "suspend", is it after an actual hibernation?
There is a bug (more like poor wording in a startup message
actually) in Debian 10.0 where the machine always seems to wake up
from hibernation, which has been fixed in Debian 10.1.  It could
be worth upgrading to clarify this point, if you have not already
done so.

Are the affected machines mobile ones?  If so, it could be caused
by a complete change of network during the hibernation (while
moving from home to the high school typically), and the resolver
configuration was still the one from home somehow.

See you later,  :)
--
Étienne Mollier <[hidden email]>
Fingerprint:  5ab1 4edf 63bb ccff 8b54  2fa9 59da 56fe fff3 882d



Re: Help debug : DNS failed when recover from suspend

Prunk Dump
Thank you very much, Étienne, for your help! I will try to give as
much detail as possible, following your tips.

On Tue, 17 Sep 2019 at 23:13, Étienne Mollier
<[hidden email]> wrote:

>
> Baptiste, on 2019-09-17:
> > I have two critical systemd services running on my clients:
> > -> "puppet", which propagates my whole network configuration.
> > -> "samba winbind", which provides PAM authentication and Name Service Switch for users.
> > Both services use DNS to find their servers.
>
> Hi Baptiste,
>
> > Any idea where I can start looking?
>
> DNS resolvers are listed in /etc/resolv.conf: have a look and
> see if the content is consistent with your intended configuration.
> Network components tend to fight over control of
> /etc/resolv.conf (dhclient, NetworkManager, the admin and their
> trusty "vi" editor, etc).  The program "resolvconf" can be
> installed to arbitrate this if necessary, although I've never
> used it myself, yet.
>

My resolv.conf contains the following entries:

~# cat /etc/resolv.conf
domain my.domain.lan
search my.domain.lan.
nameserver 172.16.0.30

Everything seems correct. I have only one name server, and the
other parameters correspond to my domain.
The stat command says that the file is accessed and modified regularly
(just after resuming from suspend), but the content does not seem to
change. I checked the content of the file as soon as possible
after the host woke up and I can't see any change.

If I change the file manually, for example by removing the domain
line, the file is restored a few minutes later.

I will give resolvconf a try.
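
To tell a rewrite with identical content apart from a real change, the
manual stat checks could be automated with a small polling script. This
is only a sketch: the function names are hypothetical, and it shows when
the file is rewritten, not which process does it.

```python
import hashlib
import os
import time


def snapshot(path):
    """Return (mtime in nanoseconds, SHA-256 hex digest) for the file."""
    with open(path, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()
    return os.stat(path).st_mtime_ns, digest


def describe_change(before, after):
    """Classify the difference between two snapshots."""
    if before == after:
        return "unchanged"
    if before[1] == after[1]:
        return "rewritten with identical content"
    return "content changed"


def watch(path, interval=1.0, rounds=60):
    """Poll the file and print a timestamped line whenever it changes."""
    previous = snapshot(path)
    for _ in range(rounds):
        time.sleep(interval)
        current = snapshot(path)
        if current != previous:
            print(time.strftime("%H:%M:%S"), describe_change(previous, current))
        previous = current
```

Running watch("/etc/resolv.conf") in an ssh session across a
suspend/resume cycle would give timestamps to compare against the
puppet and winbind errors in the journal.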

> > Is there anything I can try to identify the origin of the problem?
>
> If those particular services are critical, and assuming your IP
> address assignments are static, at least for these core
> components of your network, then maybe you will want to consider
> using the plain IP address instead of relying on the DNS
> resolver's availability.  At least, it would be worth trying
> this on a given machine, to see if the services start
> correctly, or if you hit the next error message instead.
>

My IP address assignment for clients is not "really" static: I use
a DHCP server. But as lease times are not too short, a host's IP
changes very rarely. The domain search parameters, however, are given
by the DHCP server (see below).


I use static IPs for servers, so I can try your trick with puppet. But
with samba this will be difficult, as many DNS entries are used for
the various Active Directory services.
Moreover, I have more than one domain controller, so I absolutely need
correct DNS resolution.

> [... rewinding ...]
> > But when a client resumes from suspend, these two services fail to
> > work until the next DNS query.
> >
> > -> Puppet gives the following error:
> > puppet-agent[3312]: Failed to open TCP connection to puppet:8140
> > (getaddrinfo: Name or service not known)
>
> Make sure your search domains are present in /etc/resolv.conf,
> otherwise your machine will certainly not be able to resolve the
> name "puppet".
>

The "domain" and "search" parameters are present in resolv.conf.
Maybe the "domain" line is useless, but this line is added
automatically by the DHCP client. Or maybe there is a misconfigured
option on my DHCP server:

~# cat /etc/dhcpd.conf
....
subnet 172.16.0.0 netmask 255.255.0.0 {

   option routers 172.16.0.1;
   option domain-name "my.domain.lan";
   option domain-search "my.domain.lan";
   option domain-name-servers 172.16.0.30;
....

> When you mention "suspend", is it after an actual hibernation?
> There is a bug (more like poor wording in a startup message
> actually) in Debian 10.0 where the machine always seems to wake up
> from hibernation, which has been fixed in Debian 10.1.  It could
> be worth upgrading to clarify this point, if you have not already
> done so.

Yes, I am talking about a real "suspend", not hibernation. The computer
resumes in only 2 seconds.
I use unattended upgrades, so all my hosts are fully upgraded. And
you're right, the startup message is gone. My users never shut down the
computers, as they suspend 5 seconds after logout to save power.

>
> Are the affected machines mobile ones?  If so, it could be caused
> by a complete change of network during the hibernation (while
> moving from home to the high school typically), and the resolver
> configuration was still the one from home somehow.
>

No, they are not mobile stations. But I use a DHCP server, so the host
IP can change. However, I don't see any connectivity problem: I can ssh
into the host 2 seconds after it wakes up.

> See you later,  :)
> --
> Étienne Mollier <[hidden email]>
> Fingerprint:  5ab1 4edf 63bb ccff 8b54  2fa9 59da 56fe fff3 882d
>
>

So, with your help, here is my current checklist:
-> Maybe a bug in the access to the resolv.conf file just after
resuming from suspend. I need to find out who is accessing the file
and when, and why this prevents DNS resolution from working.

-> Maybe a bug in the systemd configuration files that wakes services
in the wrong order? (I will soon file an unrelated bug report with
Debian: puppet.service does not contain any "After=" line.)

-> Maybe a bug in network-manager when the host receives a response
from the DHCP server. As the IP can change, maybe this makes DNS fail.
But a DHCPNAK is not sent often, so it can't explain the problem
completely. The problem appears even if the IP does not change.
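
To narrow down any of these hypotheses it would help to know exactly
when name resolution stops and starts working after a resume. Here is a
minimal Python sketch, with hypothetical function names, that repeats
the same getaddrinfo lookup that fails for puppet-agent and prints one
timestamped line per attempt.

```python
import socket
import time


def try_resolve(host, port):
    """Return (True, sorted addresses) on success, (False, error text) on failure."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return False, str(exc)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return True, sorted({info[4][0] for info in infos})


def probe(host, port, attempts=30, delay=1.0):
    """Resolve repeatedly and print one timestamped result per attempt."""
    for _ in range(attempts):
        ok, detail = try_resolve(host, port)
        print(time.strftime("%H:%M:%S"), "OK" if ok else "FAIL", detail)
        time.sleep(delay)
```

Started as probe("puppet", 8140) just before suspending, the output
after wake-up can be lined up with the DHCP client messages in the
journal.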

If someone has an idea!

Thanks again.
Baptiste.

Re: Help debug : DNS failed when recover from suspend

Étienne Mollier
Baptiste, on 2019-09-18:

> So, with your help, here is my current checklist:
> -> Maybe a bug in the access to the resolv.conf file just after
> resuming from suspend. I need to find out who is accessing the file
> and when, and why this prevents DNS resolution from working.
>
> -> Maybe a bug in the systemd configuration files that wakes services
> in the wrong order? (I will soon file an unrelated bug report with
> Debian: puppet.service does not contain any "After=" line.)
>
> -> Maybe a bug in network-manager when the host receives a response
> from the DHCP server. As the IP can change, maybe this makes DNS fail.
> But a DHCPNAK is not sent often, so it can't explain the problem
> completely. The problem appears even if the IP does not change.
>
> If someone has an idea!

I don't know what to add, sounds like a plan…  On my side, I
never managed to get the proper resolv.conf setup out of
NetworkManager alone (in situations where I didn't have access
to the DHCP server configuration), so I would look there first; but
it may be just me, never having been able to figure out how to
use that program properly.

Kind Regards,  :)
--
Étienne Mollier <[hidden email]>
Fingerprint:  5ab1 4edf 63bb ccff 8b54  2fa9 59da 56fe fff3 882d



Re: Help debug : DNS failed when recover from suspend

Andrei POPESCU-2
In reply to this post by Prunk Dump
On Wed, 18 Sep 19, 14:40:29, Prunk Dump wrote:
>
> -> Maybe a bug in the systemd configuration files that wakes services
> in the wrong order? (I will soon file an unrelated bug report with
> Debian: puppet.service does not contain any "After=" line.)
>
> -> Maybe a bug in network-manager when the host receives a response
> from the DHCP server. As the IP can change, maybe this makes DNS fail.
> But a DHCPNAK is not sent often, so it can't explain the problem
> completely. The problem appears even if the IP does not change.

Network Manager seems to be overkill for your needs. You might want to
try systemd-networkd and systemd-resolved.

See /usr/share/doc/systemd/README.Debian for how to enable them. There
is also a sample config that should be enough for your use-case.


Kind regards,
Andrei
--
http://wiki.debian.org/FAQsFromDebianUser

Re: Help debug : DNS failed when recover from suspend

Prunk Dump


On Mon, 28 Oct 2019 at 16:49, Andrei POPESCU <[hidden email]> wrote:
> On Wed, 18 Sep 19, 14:40:29, Prunk Dump wrote:
> >
> > -> Maybe a bug in the systemd configuration files that wakes services
> > in the wrong order? (I will soon file an unrelated bug report with
> > Debian: puppet.service does not contain any "After=" line.)
> >
> > -> Maybe a bug in network-manager when the host receives a response
> > from the DHCP server. As the IP can change, maybe this makes DNS fail.
> > But a DHCPNAK is not sent often, so it can't explain the problem
> > completely. The problem appears even if the IP does not change.
>
> Network Manager seems to be overkill for your needs. You might want to
> try systemd-networkd and systemd-resolved.
>
> See /usr/share/doc/systemd/README.Debian for how to enable them. There
> is also a sample config that should be enough for your use-case.
>
> Kind regards,
> Andrei
> --
> http://wiki.debian.org/FAQsFromDebianUser

Thanks!

I have nearly found the origin of the bug. The problem is that when the computer suspends, the ISC DHCP client's timer is not updated.

So the client waits too long before renewing the lease and keeps an expired one.
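
One way to confirm this on an affected host is to compare the
timestamps in the dhclient lease file with the current time right after
a resume. Below is a minimal Python sketch, assuming the default
dhclient.leases date format ("expire <weekday> YYYY/MM/DD HH:MM:SS;" in
UTC). The function names are hypothetical, and the location of the lease
file depends on how dhclient is started, so the file content is passed
in as text.

```python
import re
from datetime import datetime, timezone

# Matches lines such as "  renew 3 2019/09/18 10:00:00;"
_STAMP = re.compile(
    r"^\s*(renew|rebind|expire)\s+\d\s+(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2});",
    re.MULTILINE,
)


def last_lease_times(lease_text):
    """Return the last renew/rebind/expire timestamps found in the text."""
    times = {}
    for keyword, stamp in _STAMP.findall(lease_text):
        parsed = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
        # Later lease blocks overwrite earlier ones, so the last block wins.
        times[keyword] = parsed.replace(tzinfo=timezone.utc)
    return times


def overdue(lease_text, now):
    """List which of renew/rebind/expire are already in the past at `now`."""
    times = last_lease_times(lease_text)
    return sorted(key for key, moment in times.items() if moment < now)
```

If overdue(text, datetime.now(timezone.utc)) reports "expire" while
dhclient is still running and has not logged a renewal, that would
support the stale-timer explanation.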

But I don't know where to file the bug report:
-> I don't know if it's a dhclient bug, in that it doesn't support suspend,

-> or a systemd bug, in that it doesn't stop dhclient before suspend,

-> or a NetworkManager bug, in that it doesn't update the dhclient timer.

I have asked the ISC DHCP users, but no one has given me any tips so far.
Thanks for your help!

Baptiste