
[infra] Toolforge bastion sssd/LDAP flakiness (May 2025)
Closed, ResolvedPublic

Description

Tracking task for LDAP/sssd issues seen on the Cloud VPS VM tools-bastion-13 around May 2025. This usually manifests as sssd issues like:

May 08 19:21:22 tools-bastion-13 sssd[2845878]: Child [3784284] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
May 08 19:21:22 tools-bastion-13 sssd_be[3784745]: Starting up
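
These messages can be pulled from the journal with something like the following (a sketch; the unit glob assumes the stock Debian sssd unit names):

journalctl -u sssd -u 'sssd-*' --since '2025-05-08' | grep -i watchdog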

Event Timeline

taavi renamed this task from sssd/LDAP flakiness (May 2025) to Toolforge bastion sssd/LDAP flakiness (May 2025).May 9 2025, 7:10 AM
taavi edited projects, added Toolforge; removed Cloud-VPS.

Mentioned in SAL (#wikimedia-cloud) [2025-05-09T07:10:41Z] <taavi> kill bunch of unwanted processes off of tools-bastion-13 T393732, please run your things as jobs

Based on an IRC discussion yesterday, I've disabled Puppet on tools-bastion-13 and hand-updated the sssd config to use codfw LDAP replicas in the hopes that those are somehow stabler than the eqiad replicas and will keep working at least until Monday.

Mentioned in SAL (#wikimedia-cloud) [2025-05-10T11:52:38Z] <lucaswerkmeister> root@tools-bastion-13:~# systemctl restart sssd-pam{,{,-priv}.socket} # all three failed with start-limit-hit / Start request repeated too quickly; T393732?

Mentioned in SAL (#wikimedia-cloud) [2025-05-10T11:53:16Z] <lucaswerkmeister> T393732 note: restart of sssd-pam.service actually failed, “may be requested by dependency only”; overall it still seems to have worked though (so next time restarting the sockets is probably sufficient)

Mentioned in SAL (#wikimedia-cloud) [2025-05-10T14:10:28Z] <lucaswerkmeister> root@tools-bastion-13:~# systemctl restart sssd-sudo.socket # service-start-limit-hit, T393732?

Right now, processing sudo rules seems to be the main thing that's failing. A few things come to mind:

  • The most obvious thing is to raise the timeout (the ldap_search_timeout sssd setting, which apparently defaults to 6 seconds). In general it seems logical that as the number of tools grows, the number of things that need to be fetched from LDAP grows, and so operations become slower. (See the config sketch after this list.)
  • This is the query sssd does to find sudo rules: '(&(objectClass=sudoRole)(|(&(!(sudoHost=*))(cn=defaults))(sudoHost=ALL)(sudoHost=tools-bastion-13)(sudoHost=tools-bastion-13.tools.eqiad1.wikimedia.cloud)(sudoHost=172.16.1.16)(sudoHost=172.16.0.0/21)(sudoHost=fe80::f816:3eff:fea1:a283)(sudoHost=fe80::/64)(sudoHost=+*)))'. Setting host restrictions on sudo rules is not something that the current Horizon interface supports, and there are exactly two ancient rules that set one (P75892), so maybe we could disable handling those (ldap_sudo_use_host_filter = false in the sssd config) to make that LDAP query more efficient.
  • There are two rules per tool: one to power become, and one to let the tool run chown on files inside its home directory. My understanding is that the take utility (from misctools) does the same thing as the latter rule but as a setuid binary, so maybe we could drop those sudo rules and cut the number of rules to be processed roughly in half.
  • Maybe there are some missing indexes in LDAP that could be added to improve query performance? I didn't find any working dashboards, so this seems like a difficult thing to rule out.
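
For the first two items, a minimal sssd.conf sketch (values and the domain section name are illustrative, not tested; both options are standard sssd settings):

[domain/wikimedia.org]
# raise the LDAP search timeout from the 6-second default (illustrative value)
ldap_search_timeout = 30
# skip the sudoHost matching clauses once no rules use host restrictions
ldap_sudo_use_host_filter = false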

My main theory for why we're not seeing this on dev.toolforge.org is simply that login.toolforge.org gets a lot more traffic.

/cc @bd808 in case you have any more context than I do about the chown rule.

Mentioned in SAL (#wikimedia-cloud) [2025-05-10T16:22:02Z] <lucaswerkmeister> systemctl restart sssd-{pam{,-priv},sudo}.socket # service-start-limit-hit, T393732?

Mentioned in SAL (#wikimedia-cloud) [2025-05-10T17:33:58Z] <lucaswerkmeister> root@tools-bastion-13:~# systemctl reset-failed sssd-{pam,sudo}.service && systemctl restart sssd-pam{,-priv}.socket # try to reset the rate limits this way (T393732)

Mentioned in SAL (#wikimedia-cloud) [2025-05-10T17:35:56Z] <lucaswerkmeister> root@tools-bastion-13:~# systemctl restart sssd-sudo{,.socket} # looks like the reset-failed didn’t work properly, systemd didn’t even try to start the service again afaict (T393732)

FWIW, even though systemd complains if you try to restart sssd-sudo.service

root@tools-bastion-13:~# systemctl restart sssd-sudo{,.socket} # looks like the reset-failed didn’t work properly, systemd didn’t even try to start the service again afaict (T393732)
Failed to restart sssd-sudo.service: Operation refused, unit sssd-sudo.service may be requested by dependency only (it is configured to refuse manual start/stop).
See system logs and 'systemctl status sssd-sudo.service' for details.

– it looks like this is still the way to go (until someone finds a better command, at least): after this command, systemd tried to start sssd-sudo.service again, which didn’t happen after the reset-failed that I logged just before.
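
For the record, the incantation that worked, consolidated (assuming, per the above, that failed units need a reset-failed and that the refused service restart is still retriggered via the socket):

systemctl reset-failed sssd-{pam,sudo}.service
systemctl restart sssd-pam{,-priv}.socket
systemctl restart sssd-sudo{,.socket}  # the .service half prints a refusal, but the start is retried anyway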

But it’s also a moot point, because the service immediately died again anyway.

Unable to connect to unix:path=/var/lib/sss/pipes/private/sbus-dp_wikimedia.org [org.freedesktop.DBus.Error.NoServer]: Failed to connect to socket /var/lib/sss/pipes/private/sbus-dp_wikimedia.org: Connection refused

I’ll probably stop trying to restart stuff and just leave this for people who know what they’re doing to look at on Monday.

This long thread relates to the behavior we're seeing, although it's not identical:

https://github.com/SSSD/sssd/issues/6219

The one suggestion there that seems worth trying is altering /usr/lib/systemd/system/sssd.service, changing

Restart=on-abnormal

to

Restart=always

We always want sssd running, and I suspect that some of the downtime we're seeing is the service dying and not getting restarted promptly.
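
A sketch of applying that without editing the packaged unit file, via a standard systemd drop-in (with Puppet disabled a direct edit works too, but a drop-in survives package upgrades):

# systemctl edit sssd.service, then add:
[Service]
Restart=always

# then reload and restart:
systemctl daemon-reload
systemctl restart sssd.service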

We also badly need metrics on our ldap servers (rw and ro) -- it would be nice to know if these outages correspond to high ldap traffic. As best I can tell we aren't gathering ldap metrics at all right now... perhaps we could co-opt https://github.com/tomcz/openldap_exporter ?
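
In the meantime, a quick way to check whether there is anything to scrape at all (assuming the slapd back-monitor database is enabled on the replicas; the host is a placeholder, and the subtree may require a privileged bind depending on ACLs):

ldapsearch -x -H ldap://<ldap-replica> -b 'cn=Monitor' -s sub '(objectClass=*)' '+'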

Restart=always

Since puppet is stopped already, I've hacked in this change on tools-bastion-13.

If the problem is ldap responsiveness, why does

watch -e ldapsearch -x uid=andrew

never show any errors? Has anyone else gotten direct evidence of ldap failure or are we only extrapolating from sssd/pam failure?
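
A uid lookup is cheap, though; one way to get more direct evidence would be to replay and time the exact sudo filter sssd issues (filter copied from the earlier comment; the base DN is a guess, check ldap_sudo_search_base in sssd.conf):

time ldapsearch -x -b 'ou=sudoers,cn=tools,ou=projects,dc=wikimedia,dc=org' \
  '(&(objectClass=sudoRole)(|(&(!(sudoHost=*))(cn=defaults))(sudoHost=ALL)(sudoHost=tools-bastion-13)(sudoHost=tools-bastion-13.tools.eqiad1.wikimedia.cloud)(sudoHost=172.16.1.16)(sudoHost=172.16.0.0/21)(sudoHost=fe80::f816:3eff:fea1:a283)(sudoHost=fe80::/64)(sudoHost=+*)))'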

Sometimes I can log in to Toolforge. When I then try become, I get sudo: a password is required.

Restart=always

doesn't seem to help.

I also migrated the host to a different less-busy cloudvirt, which also doesn't seem to have helped.

Now I'm trying to adjust timeouts in sssd.conf to see if we can get things killed off less often.

Change #1143963 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] sssd: increase internal timeouts for be, pam, sudo

https://gerrit.wikimedia.org/r/1143963
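
For context, a sketch of what that kind of change looks like in sssd.conf: the per-section timeout option controls the heartbeat used by the internal watchdog that produced the "terminated by own WATCHDOG" messages above (values illustrative; the actual patch may differ):

[domain/wikimedia.org]
timeout = 30

[pam]
timeout = 30

[sudo]
timeout = 30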

Here's a new theory to consider: the problem is not the ldap server being slow, but toolforge ldap queries being slow because there are a zillion users in the 'tools' groups. Maybe we were pushing up against the limit all along, and just finally crossed over it. If that's correct then https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143963 might actually be a correctish fix, followed by, I guess, purging absent toolforge members or *waves hands* performance tuning.
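
For scale, one way to count how many members the Toolforge project group actually carries (the group cn and base DN here are guesses; adjust to the real tree):

ldapsearch -x -b 'ou=groups,dc=wikimedia,dc=org' '(cn=project-tools)' member | grep -c '^member:'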

Anyway, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143963 is currently hacked in on the bastion, we'll see how it goes.

2025-04-24 was maybe the first of this series of problems. That failure was on the dev.toolforge.org bastion rather than the login.toolforge.org bastion that seems to be having more extended problems at the moment.

  • This is the query sssd does to find sudo rules: '(&(objectClass=sudoRole)(|(&(!(sudoHost=*))(cn=defaults))(sudoHost=ALL)(sudoHost=tools-bastion-13)(sudoHost=tools-bastion-13.tools.eqiad1.wikimedia.cloud)(sudoHost=172.16.1.16)(sudoHost=172.16.0.0/21)(sudoHost=fe80::f816:3eff:fea1:a283)(sudoHost=fe80::/64)(sudoHost=+*)))'. Setting host restrictions on sudo rules is not something that the current Horizon interface supports, and there are exactly two ancient rules that set one (P75892), so maybe we could disable handling those (ldap_sudo_use_host_filter = false in the sssd config) to make that LDAP query more efficient.

Both of those sudo rules only referenced nonexistent pmtpa instances, so I deleted them.

I have been able to log in to Toolforge and run become.

Change #1143963 merged by Andrew Bogott:

[operations/puppet@production] sssd: increase internal timeouts for be, pam, sudo

https://gerrit.wikimedia.org/r/1143963

Change #1144572 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps sssd.conf: increase timeout for nss section

https://gerrit.wikimedia.org/r/1144572

Change #1144572 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps sssd.conf: increase timeout for nss section

https://gerrit.wikimedia.org/r/1144572

I have got a problem again:

$ ssh toolforge
Connection closed by 185.15.56.62 port 22

I have got a problem again:

$ ssh toolforge
Connection closed by 185.15.56.62 port 22

I have had the same problem, but it is fixed for me now.

dcaro renamed this task from Toolforge bastion sssd/LDAP flakiness (May 2025) to [infra] Toolforge bastion sssd/LDAP flakiness (May 2025).May 22 2025, 9:35 AM

Let's call this resolved for now. There have been a few fixes applied here and in T394283, and we haven't had any reports since then.

taavi closed subtask Restricted Task as Resolved.May 26 2025, 2:23 PM