Papers by Romina Davidson

Proceedings of the Seventh AAAI/ACM Conference on AI, Ethics, and Society, 2025
Large Language Models (LLMs) are increasingly trained in elastic, multi-tenant cloud infrastructures that span data centers, regions, and heterogeneous accelerators. While distributed training has matured in scale and efficiency, its security posture lags behind adversarial realities: training corpora may contain sensitive or regulated data; gradient channels can leak membership and attribute information; supply-chain subversion can inject malicious kernels or compromised containers; cross-tenant resource sharing elevates the risk of side-channel inference; and orchestration layers, which schedule, checkpoint, and autoscale jobs, are attractive targets for data exfiltration and model theft. Existing mitigations remain fragmented: transport encryption protects links, confidential-compute enclaves protect limited code paths, differential privacy protects outputs in isolation, and secure aggregation schemes address narrow communication steps. What is missing is a coherent end-to-end framework that composes these mechanisms with provable guarantees while preserving the throughput and tail-latency characteristics required by trillion-parameter training. This paper presents AegisTrain, a secure distributed training framework that treats privacy and integrity as first-class control objectives across the full training lifecycle. AegisTrain couples remotely attested confidential runtimes with attribute-bound key management to ensure that only measured and policy-compliant components can access plaintext data, gradients, and optimizer state. It introduces a cryptographic aggregation substrate that masks worker updates end-to-end and injects calibrated noise under a formally verifiable privacy accountant, rendering gradient channels useless for membership inference even when a bounded number of participants are compromised. Checkpoint images, telemetry, and intermediate artifacts are encrypted at rest with per-epoch keys derived from a hardware-rooted key hierarchy, and the supply chain is hardened with verifiable provenance, reproducible builds, and in-cluster policy enforcement. The framework is designed for tensor, pipeline, and data-parallel hybrids, and preserves scalability through streaming decryption, batched attestation, and offloaded cryptography that runs on CPU sidecars while GPU compute saturates the model step. We develop a control plane that enforces risk-adaptive policies (for example, tightening privacy budgets or refusing mixed-trust colocation under elevated threat intelligence) without manual intervention, and we present machine-checkable invariants that forbid unsafe downgrades. A queueing-theoretic and information-theoretic analysis quantifies the overhead of encryption, attestation, and privacy noise relative to communication and compute, and shows parameter regimes in which security can be achieved with sub-5% throughput loss at cluster scale. A prototype implementation with masked allreduce, enclave-gated data loaders, and encrypted checkpoints demonstrates feasibility on realistic LLM training traces. By articulating the interfaces among attestation, secure aggregation, privacy accounting, and distributed parallelism, AegisTrain reframes secure LLM training as a problem of principled composition rather than ad hoc patchwork, yielding a deployable blueprint for cloud environments where both speed and trust are non-negotiable.
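To make the masked-aggregation idea concrete, the following is a minimal sketch (not AegisTrain's actual protocol) of pairwise masks that cancel in the aggregate, with Gaussian noise added to the sum as a differential-privacy mechanism. The seed exchange, clipping, and accountant are simplified assumptions; names such as shared_seed are illustrative only.

```python
import numpy as np

def shared_seed(i: int, j: int) -> int:
    # Stand-in for a pairwise secret agreed over an attested channel.
    return hash((min(i, j), max(i, j))) & 0xFFFFFFFF

def masked_update(grad: np.ndarray, worker: int, n_workers: int) -> np.ndarray:
    """Add pairwise masks; one side of each pair adds, the other subtracts."""
    masked = grad.copy()
    for other in range(n_workers):
        if other == worker:
            continue
        rng = np.random.default_rng(shared_seed(worker, other))
        mask = rng.standard_normal(grad.shape)
        masked += mask if worker < other else -mask
    return masked

def aggregate(updates: list[np.ndarray], clip: float, sigma: float) -> np.ndarray:
    """Sum masked updates (masks cancel) and add noise scaled to the clip norm."""
    total = np.sum(updates, axis=0)
    noise = np.random.default_rng().normal(0.0, sigma * clip, size=total.shape)
    return (total + noise) / len(updates)

# Example: three workers whose individual updates stay hidden behind masks,
# yet the average is recovered once all masked updates are summed.
grads = [np.ones(4) * k for k in range(3)]
masked = [masked_update(g, i, 3) for i, g in enumerate(grads)]
print(aggregate(masked, clip=1.0, sigma=0.0))  # ~[1. 1. 1. 1.]
```

The key property the abstract relies on is visible here: any single masked update looks like noise, so a compromised participant learns nothing about an individual worker's gradient, while the aggregate remains exact up to the calibrated noise.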
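The per-epoch checkpoint keys can be illustrated with a standard HKDF derivation from a single root secret, mirroring the "hardware-rooted key hierarchy" described above. The root key below is a placeholder byte string; in practice it would be released only to attested components by the key-management service, which this sketch does not model, and the info-string layout is an assumption.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

def epoch_key(root_key: bytes, job_id: str, epoch: int) -> bytes:
    """Derive a 256-bit key bound to a specific job and training epoch."""
    return HKDF(
        algorithm=hashes.SHA256(),
        length=32,
        salt=None,
        info=f"checkpoint/{job_id}/epoch-{epoch}".encode(),  # illustrative label
    ).derive(root_key)

# Example: distinct epochs yield independent keys from the same root,
# so leaking one epoch's key does not expose other checkpoints.
root = b"\x00" * 32  # placeholder for a hardware-released secret
assert epoch_key(root, "job-42", 1) != epoch_key(root, "job-42", 2)
```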
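Finally, a toy illustration of what a machine-checkable "no unsafe downgrade" invariant might look like in the control plane: a proposed policy change is rejected if it would weaken the privacy budget or relax colocation isolation while the threat level is elevated. The Policy fields and the notion of an elevated threat level are assumptions for illustration, not the paper's formal invariants.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    epsilon_budget: float        # smaller epsilon = stronger privacy
    mixed_trust_colocation: bool # whether jobs may share nodes across trust domains

def downgrade_allowed(current: Policy, proposed: Policy, threat_elevated: bool) -> bool:
    """Invariant: under elevated threat, neither privacy nor isolation may weaken."""
    weakens_privacy = proposed.epsilon_budget > current.epsilon_budget
    weakens_isolation = proposed.mixed_trust_colocation and not current.mixed_trust_colocation
    return not (threat_elevated and (weakens_privacy or weakens_isolation))

# Example: relaxing colocation during an elevated-threat window is refused.
current = Policy(epsilon_budget=2.0, mixed_trust_colocation=False)
proposed = Policy(epsilon_budget=2.0, mixed_trust_colocation=True)
assert not downgrade_allowed(current, proposed, threat_elevated=True)
```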
