
[infra] Toolforge bastion sssd/LDAP flakiness (May 2025)
Closed, ResolvedPublic

Description

Tracking task for LDAP/sssd issues seen on the Cloud VPS VM tools-bastion-13 around May 2025. This usually manifests as sssd issues like:

May 08 19:21:22 tools-bastion-13 sssd[2845878]: Child [3784284] ('wikimedia.org':'%BE_wikimedia.org') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
May 08 19:21:22 tools-bastion-13 sssd_be[3784745]: Starting up
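
These messages can be pulled from the journal with something like the following (a sketch; the unit glob assumes the stock Debian sssd unit names):

journalctl -u sssd -u 'sssd-*' --since '2025-05-08' | grep -i watchdog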

Event Timeline

taavi renamed this task from sssd/LDAP flakiness (May 2025) to Toolforge bastion sssd/LDAP flakiness (May 2025).May 9 2025, 7:10 AM
taavi edited projects, added Toolforge; removed Cloud-VPS.

Mentioned in SAL (#wikimedia-cloud) [2025-05-09T07:10:41Z] <taavi> kill bunch of unwanted processes off of tools-bastion-13 T393732, please run your things as jobs

Based on an IRC discussion yesterday, I've disabled Puppet on tools-bastion-13 and hand-updated the sssd config to use codfw LDAP replicas in the hopes that those are somehow stabler than the eqiad replicas and will keep working at least until Monday.

Mentioned in SAL (#wikimedia-cloud) [2025-05-10T11:52:38Z] <lucaswerkmeister> root@tools-bastion-13:~# systemctl restart sssd-pam{,{,-priv}.socket} # all three failed with start-limit-hit / Start request repeated too quickly; T393732?

Mentioned in SAL (#wikimedia-cloud) [2025-05-10T11:53:16Z] <lucaswerkmeister> T393732 note: restart of sssd-pam.service actually failed, “may be requested by dependency only”; overall it still seems to have worked though (so next time restarting the sockets is probably sufficient)

Mentioned in SAL (#wikimedia-cloud) [2025-05-10T14:10:28Z] <lucaswerkmeister> root@tools-bastion-13:~# systemctl restart sssd-sudo.socket # service-start-limit-hit, T393732?

Right now, processing sudo rules seems to be the main thing that's failing. A few things come to mind:

  • The most obvious thing is to raise the timeout (the ldap_search_timeout sssd setting, which apparently defaults to 6 seconds). In general it seems logical that as the number of tools grows, the number of things that need to be fetched from LDAP grows, and so operations become slower. (See the config sketch after this list.)
  • This is the query sssd does to find sudo rules: '(&(objectClass=sudoRole)(|(&(!(sudoHost=*))(cn=defaults))(sudoHost=ALL)(sudoHost=tools-bastion-13)(sudoHost=tools-bastion-13.tools.eqiad1.wikimedia.cloud)(sudoHost=172.16.1.16)(sudoHost=172.16.0.0/21)(sudoHost=fe80::f816:3eff:fea1:a283)(sudoHost=fe80::/64)(sudoHost=+*)))'. Setting host restrictions on sudo rules is not something that the current Horizon interface supports, and there are exactly two ancient rules that set one (P75892), so maybe we could disable handling those (ldap_sudo_use_host_filter = false in the sssd config) to make that LDAP query more efficient.
  • There are two rules per tool: one to power become, and one to let the tool run chown on files inside its home directory. My understanding is that the take utility (from misctools) does the same thing as the latter rule but as a setuid binary, so maybe we could drop those sudo rules and cut the number of rules to be processed roughly in half.
  • Maybe there are some missing indexes in LDAP that could be added to improve query performance? I didn't find any working dashboards, so this seems like a difficult thing to rule out.
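
For the first two items, a minimal sssd.conf sketch (values and the domain section name are illustrative, not tested; both options are standard sssd settings):

[domain/wikimedia.org]
# raise the LDAP search timeout from the 6-second default (illustrative value)
ldap_search_timeout = 30
# skip the sudoHost matching clauses once no rules use host restrictions
ldap_sudo_use_host_filter = false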

My main theory for why we're not seeing this on dev.toolforge.org is simply that login.toolforge.org gets a lot more traffic.

/cc @bd808 in case you have any more context than I do about the chown rule.

Mentioned in SAL (#wikimedia-cloud) [2025-05-10T16:22:02Z] <lucaswerkmeister> systemctl restart sssd-{pam{,-priv},sudo}.socket # service-start-limit-hit, T393732?

Mentioned in SAL (#wikimedia-cloud) [2025-05-10T17:33:58Z] <lucaswerkmeister> root@tools-bastion-13:~# systemctl reset-failed sssd-{pam,sudo}.service && systemctl restart sssd-pam{,-priv}.socket # try to reset the rate limits this way (T393732)

Mentioned in SAL (#wikimedia-cloud) [2025-05-10T17:35:56Z] <lucaswerkmeister> root@tools-bastion-13:~# systemctl restart sssd-sudo{,.socket} # looks like the reset-failed didn’t work properly, systemd didn’t even try to start the service again afaict (T393732)

FWIW, even though systemd complains if you try to restart sssd-sudo.service

root@tools-bastion-13:~# systemctl restart sssd-sudo{,.socket} # looks like the reset-failed didn’t work properly, systemd didn’t even try to start the service again afaict (T393732)
Failed to restart sssd-sudo.service: Operation refused, unit sssd-sudo.service may be requested by dependency only (it is configured to refuse manual start/stop).
See system logs and 'systemctl status sssd-sudo.service' for details.

– it looks like this is still the way to go (until someone finds a better command, at least): after this command, systemd tried to start sssd-sudo.service again, which didn’t happen after the reset-failed that I logged just before.
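
For the record, the incantation that worked, consolidated (assuming, per the above, that failed units need a reset-failed and that the refused service restart is still retriggered via the socket):

systemctl reset-failed sssd-{pam,sudo}.service
systemctl restart sssd-pam{,-priv}.socket
systemctl restart sssd-sudo{,.socket}  # the .service half prints a refusal, but the start is retried anyway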

But it’s also a moot point, because the service immediately died again anyway.

Unable to connect to unix:path=/var/lib/sss/pipes/private/sbus-dp_wikimedia.org [org.freedesktop.DBus.Error.NoServer]: Failed to connect to socket /var/lib/sss/pipes/private/sbus-dp_wikimedia.org: Connection refused

I’ll probably stop trying to restart stuff and just leave this for people who know what they’re doing to look at on Monday.

This long thread relates to the behavior we're seeing, although it's not identical:

https://github.com/SSSD/sssd/issues/6219

The one suggestion there that seems worth trying is altering /usr/lib/systemd/system/sssd.service, changing

Restart=on-abnormal

to

Restart=always

We always want sssd running, and I suspect that some of the downtime we're seeing is the service dying and not getting restarted promptly.
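
A sketch of applying that without editing the packaged unit file, via a standard systemd drop-in (with Puppet disabled a direct edit works too, but a drop-in survives package upgrades):

# systemctl edit sssd.service, then add:
[Service]
Restart=always

# then reload and restart:
systemctl daemon-reload
systemctl restart sssd.service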

We also badly need metrics on our ldap servers (rw and ro) -- it would be nice to know if these outages correspond to high ldap traffic. As best I can tell we aren't gathering ldap metrics at all right now... perhaps we could co-opt https://github.com/tomcz/openldap_exporter ?
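
In the meantime, a quick way to check whether there is anything to scrape at all (assuming the slapd back-monitor database is enabled on the replicas; the host is a placeholder, and the subtree may require a privileged bind depending on ACLs):

ldapsearch -x -H ldap://<ldap-replica> -b 'cn=Monitor' -s sub '(objectClass=*)' '+'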

Restart=always

Since puppet is stopped already, I've hacked in this change on tools-bastion-13.

If the problem is ldap responsiveness, why does

watch -e ldapsearch -x uid=andrew

never show any errors? Has anyone else gotten direct evidence of ldap failure or are we only extrapolating from sssd/pam failure?
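
A uid lookup is cheap, though; one way to get more direct evidence would be to replay and time the exact sudo filter sssd issues (filter copied from the earlier comment; the base DN is a guess, check ldap_sudo_search_base in sssd.conf):

time ldapsearch -x -b 'ou=sudoers,cn=tools,ou=projects,dc=wikimedia,dc=org' \
  '(&(objectClass=sudoRole)(|(&(!(sudoHost=*))(cn=defaults))(sudoHost=ALL)(sudoHost=tools-bastion-13)(sudoHost=tools-bastion-13.tools.eqiad1.wikimedia.cloud)(sudoHost=172.16.1.16)(sudoHost=172.16.0.0/21)(sudoHost=fe80::f816:3eff:fea1:a283)(sudoHost=fe80::/64)(sudoHost=+*)))'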

Sometimes I can log in to Toolforge. When I then try become, I get sudo: a password is required.

Restart=always

doesn't seem to help.

I also migrated the host to a different less-busy cloudvirt, which also doesn't seem to have helped.

Now I'm trying to adjust timeouts in sssd.conf to see if we can get things killed off less often.

Change #1143963 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] sssd: increase internal timeouts for be, pam, sudo

https://gerrit.wikimedia.org/r/1143963
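
For context, a sketch of what that kind of change looks like in sssd.conf: the per-section timeout option controls the heartbeat used by the internal watchdog that produced the "terminated by own WATCHDOG" messages above (values illustrative; the actual patch may differ):

[domain/wikimedia.org]
timeout = 30

[pam]
timeout = 30

[sudo]
timeout = 30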

Here's a new theory to consider: the problem is not the ldap server being slow, but toolforge ldap queries being slow because there are a zillion users in the 'tools' groups. Maybe we were pushing up against the limit all along, and just finally crossed over it. If that's correct then https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143963 might actually be a correctish fix, followed by, I guess, purging absent toolforge members or *waves hands* performance tuning.
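
For scale, one way to count how many members the Toolforge project group actually carries (the group cn and base DN here are guesses; adjust to the real tree):

ldapsearch -x -b 'ou=groups,dc=wikimedia,dc=org' '(cn=project-tools)' member | grep -c '^member:'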

Anyway, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143963 is currently hacked in on the bastion, we'll see how it goes.

2025-04-24 was maybe the first of this series of problems. That failure was on the dev.toolforge.org bastion rather than the login.toolforge.org bastion that seems to be having more extended problems at the moment.

  • This is the query sssd does to find sudo rules: '(&(objectClass=sudoRole)(|(&(!(sudoHost=*))(cn=defaults))(sudoHost=ALL)(sudoHost=tools-bastion-13)(sudoHost=tools-bastion-13.tools.eqiad1.wikimedia.cloud)(sudoHost=172.16.1.16)(sudoHost=172.16.0.0/21)(sudoHost=fe80::f816:3eff:fea1:a283)(sudoHost=fe80::/64)(sudoHost=+*)))'. Setting host restrictions on sudo rules is not something that the current Horizon interface supports, and there are exactly two ancient rules that set one (P75892), so maybe we could disable handling those (ldap_sudo_use_host_filter = false in the sssd config) to make that LDAP query more efficient.

Both of those sudo rules only referenced nonexistent pmtpa instances, so I deleted them.

I have been able to log in to Toolforge and run become.

Change #1143963 merged by Andrew Bogott:

[operations/puppet@production] sssd: increase internal timeouts for be, pam, sudo

https://gerrit.wikimedia.org/r/1143963

Change #1144572 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps sssd.conf: increase timeout for nss section

https://gerrit.wikimedia.org/r/1144572

Change #1144572 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps sssd.conf: increase timeout for nss section

https://gerrit.wikimedia.org/r/1144572

I have got a problem again:

$ ssh toolforge
Connection closed by 185.15.56.62 port 22

I have got a problem again:

$ ssh toolforge
Connection closed by 185.15.56.62 port 22

I have had the same problem, but it is fixed for me now.

dcaro renamed this task from Toolforge bastion sssd/LDAP flakiness (May 2025) to [infra] Toolforge bastion sssd/LDAP flakiness (May 2025).May 22 2025, 9:35 AM

Let's call this resolved for now. There have been a few fixes applied here and in T394283, and we haven't had any reports since then.

taavi closed subtask Restricted Task as Resolved.May 26 2025, 2:23 PM