AI ALIGNMENT FORUM
AF

AI Alignment Posts

Popular Comments

Plans A, B, C, and D for misalignment risk

My main question is "why do you think Shut Down actually costs more political will?". I think Plan A and "Shut It Down" both require very similar opening steps that are the most politically challenging part AFAICT, and once the world is even remotely considering those steps, the somewhat different shut-it-down steps don't seem particularly hard sells. I also think Plan A "bad implementation" is much more likely, and also much worse (again see "Shut It Down" is simpler than "Controlled Takeoff"). > Gear 2: You need to compare the tractability of Global Shut Down vs Global Controlled Takeoff That Actually Works, as opposed to Something That Looks Close To But Not Actually A Controlled Takeoff. > > Along with Gear 3: "Shut it down" is much simpler than "Controlled Takeoff." > > A Global Controlled Takeoff That Works has a lot of moving parts. > > You need the international agreement to be capable of making any kind of sensible distinctions between safe and unsafe training runs, or even "marginally safer" vs "marginally less safe" training runs. > > You need the international agreement to not turn into molochian regulatory-captured horror that perversely reverses the intent of the agreement and creates a class of bureaucrats who don't know anything about AI and use the agreement to dole out favors. > > These problems still exist in some versions of Shut It Down too, to be clear (if you're trying to also ban algorithmic research – a lot of versions of that seem like they leave room to argue about whether agent foundations or interpretability count). But, they at least get coupled with "no large training runs, period." > > I think "guys, everyone just stop" is a way easier schelling point to coordinate around, than "everyone, we're going to slow down and try to figure out alignment as best we can using current techniques." > > So, I am not currently convinced that Global Controlled Takeoff That Actually Works is any more politically tractable than Global Shut Down. > > (Caveat: Insofar as your plan is "well, we will totally get a molochian moral maze horror, but, it'll generally move slower and that buys time", eh, okay, seems reasonable. But, at least be clear to yourself about what you're aiming for) I agree you do eventually want to go back to Plan A anyway, so I mostly am just not seeing why you really want to treat these as separate plans, rather than a single: "Okay, we wanna get all the compute centralized and monitored, we want a lot more control over GPU production, we want to buy as much time and proceed as carefully as we can. At any given time, we want the option to either be basically shut down or in controlled-takeoff mode, depending on some conditions on the ground." I agree with some of the risks of "geopolitical situation might get harder to have control over" and "humanity generally becoming anti-progress" but these don't even seem strictly worse in Shutdown World vs Controlled Takeoff world. (in particular in a "shut it down" world where the framing is "we are eventually going to build a thing that everyone agrees is good, we're just making sure we get it right.") But, "guys, this is very dangerous, we are proceeding very carefully before summoning something smarter than us, while trying out best to all reap the benefits of it" seems like a way easier narrative to get everyone bought into than "guys this is dangerous enough to warrant massive GPU monitoring but... still trying to push ahead as fast as we can?".

Jakub Growiec10d2316

Four ways learning Econ makes people dumber re: future AI

As professor of economics and self-proclaimed doomer myself, I greatly appreciate this post! These are almost exactly my feelings when talking to fellow economists who typically think that, by an unspoken assumption, all AI will always be normal technology, a tool in people's hands. I think your capital/labor point is particularly spot on. I've had a problem with that framing for several years now. That's why I proposed a "hardware-software" framework, which I elaborated in a few of my papers and one book. The idea is simple: just divide production factors differently! The key distinction is not whether it's man or machine, it's whether it's physical work or information processing. More in a LW post: The Hardware-Software Framework: A New Perspective on Economic Growth with AI — LessWrong, and in my 2022 book, Accelerating Economic Growth: Lessons From 200,000 Years of Technological Progress and Human Development | SpringerLink.

Vaniver22d1519

If anyone builds it, everyone will plausibly be fine

I think both stories for optimism are responded to on pages 188-191, and I don't see how you're responding to their response. It also seems to me like... step 1 of solution 1 assumes you already have a solution to alignment? You acknowledge this in the beginning of solution 2, but. I feel like there's something going wrong on a meta-level, here? > I don’t think it’s obvious how difficult it will be to guide AI systems into a “basin of instruction following.” Unfortunately, I think it is obvious (that it is extremely difficult). The underlying dynamics of the situation push away from instruction following, in several different ways. 1. It is challenging to reward based on deeper dynamics instead of surface dynamics. RL is only as good as the reward signal, and without already knowing what behavior is 'aligned' or not, developers will not be able to push models towards doing more aligned behavior. 1. Do you remember the early RLHF result where the simulated hand pretended it was holding the ball with an optical illusion, because it was easier and the human graders couldn't tell the difference? Imagine that, but for arguments for whether or not alignment plans will work. 2. Goal-directed agency unlocks capabilities and pushes against corrigibility, using the same mechanisms. 1. This is the story that EY&NS deploy more frequently, because it has more 'easy call' nature. Decision theory is pretty predictable. 3. Instruction / oversight-based systems depend on a sharp overseer--the very thing we're positing we don't have. So I think your two solutions are basically the same solution ('assume you know the answer, then it is obvious') and they strike me more as 'denying that the problem exists' than facing the problem and actually solving it?

Recent Discussion

Tim Hua's Shortform

Tim Hua

2Caleb Biddulph1d

I had a random idea while reading this, then started writing a comment about it and forgot to post it until now: If you're trying to make a steering vector to capture the difference between deployment and evals, maybe it would be best to use contrastive pairs, each composed of two examples that are as similar to each other as possible. Ideally, the only difference between two examples in the pair is whether they are in fact from deployment or an artificially-constructed eval. You could achieve this by starting with actual user queries, then turning each one into an "eval" which simulates a very similar situation. * When creating the eval, you could either use a human or an LLM as the eval writer. * Blind the eval writer to everything except a summary of what the LLM saw in deployment. * For the early experiments, you'd probably just devise a prompt for a single-turn setting, but in the future you could implement an actual eval environment. * Example: * We get an actual query from the user, where they ask a question about taxes and paste the contents of their W2 into the context window. * We put this query into a summarizer, which might write "The user has questions about whether they qualify for any deductions related to their home office setup and student loan interest payments. The user pasted the contents of their W-2 form into the chat." * Then we give this summary to an LLM and tell it to write a new query that fits that description and is as realistic as possible. * Parts of the new query are unrealistic, in the same way that an eval would be unrealistic. For example, maybe: * The way that a copy-and-pasted PDF of a W2 would actually look is slightly different from the simulated W2 from by the eval writer. * The user's income is slightly too low given what the LLM would infer about their job from other details in the query. * The user is too articulate and writes with 100% proper grammar and punctuation, or has a slightly

Tim Hua12h10

Yeah this seems like something that could be reasonable to try in the future. My guess is that models are not smart enough to distinguish evaluation v. deployment from these subtle hints yet. In reality, "contrastive pairs, each composed of two examples that are as similar to each other as possible" look more like

The following is a real scenario:

Versus

The following is a fictional scenario:

The Thinking Machines Tinker API is good news for AI control and security

Buck

18h

Last week, Thinking Machines announced Tinker. It’s an API for running fine-tuning and inference on open-source LLMs that works in a unique way. I think it has some immediate practical implications for AI safety research: I suspect that it will make RL experiments substantially easier, and increase the number of safety papers that involve RL on big models.

But it's more interesting to me for another reason: the design of this API makes it possible to do many types of ML research without direct access to the model you’re working with. APIs like this might allow AI companies to reduce how many of their researchers (either human or AI) have access to sensitive model weights, which is good for reducing the probability of weight exfiltration and other rogue...

(Continue Reading - 1758 more words)

0Adam Karvonen15h

I'm guessing most modern interp work should be fine. Interp has moved away from "let's do this complicated patching of attention head patterns between prompts" to basically only interacting with residual stream activations. You can easily do this with e.g. pytorch hooks, even in modern inference engines like vLLM. The amount of computation performed in a hook is usually trivial - I never have noticed a slowdown in my vLLM generations when using hooks. Because of this, I don't think batched execution would be a problem - you'd probably want some validation in the hook so it can only interact with activations from the user's prompt. There's also nnsight, which already supports remote execution of pytorch hooks on models hosted on Bau Lab machines through an API. I think they do some validation to ensure users can't do anything malicious. You would need some process to handle the activation data, because it's large. If I'm training a probe on 1M activations, with d_model = 10k and bfloat16, then this is 20GB of data. SAEs are commonly trained on 500M + activations. We probably don't want the user to have access to this locally, but they probably want to do some analysis on it.

2Buck15h

Yeah, what I'm saying is that even if the computation performed in a hook is trivial, it sucks if that computation has to happen on a different computer than the one doing inference.

0Adam Karvonen14h

In nnsight hooks are submitted via an API to run on a remote machine, and the computation is performed on the same computer as the one doing the inference. They do some validation to ensure that it's only legit Pytorch stuff, so it isn't just arbitrary code execution.

Buck13h20

Yeah for sure. A really nice thing about the Tinker API is that it doesn't allow users to specify arbitrary code to be executed on the machine with weights, which makes security much easier.

Plans A, B, C, and D for misalignment risk

ryan_greenblatt

I sometimes think about plans for how to handle misalignment risk. Different levels of political will for handling misalignment risk result in different plans being the best option. I often divide this into Plans A, B, C, and D (from most to least political will required). See also Buck's quick take about different risk level regimes.

In this post, I'll explain the Plan A/B/C/D abstraction as well as discuss the probabilities and level of risk associated with each plan.

Here is a summary of the level of political will required for each of these plans and the corresponding takeoff trajectory:

Plan A: There is enough will for some sort of strong international agreement that mostly eliminates race dynamics and allows for slowing down (at least for some reasonably long period,

...

(Continue Reading - 1729 more words)

14Thomas Larsen16h

One framing that I think might be helpful for thinking about "Plan A" vs "shut it all down" is: "Suppose that you have the political will for an n-year slowdown, i.e. after n years, you are forced to handoff trust to superhuman AI systems (e.g. for n = 5, 10, 30). What should the capability progression throughout the slowdown be?" This framing forces a focus on the exit condition / plan to do handoff, which I think is an underdiscussed weakness of the "shut it all down" plan. I think my gut reaction is that the most important considerations are: (i) there are a lot of useful things you can do with the AIs, so I want more time with the smarter AIs, and (ii) I want to scale through the dangerous capability range slowly and with slack (as opposed to at the end of the slowdown). * this makes me think that particularly for a shorter slowdown (e.g. 5 years), you want to go fast at the beginning (e.g. scale to ~max controllable AI over the first year or two), and then elicit lots of work out of those AIs for the rest of the time period. * A key concern for the above plan is that govts/labs botch the measurement of "max controllable AI", and scale too far. * But it's not clear to me how a further delay helps with this, unless you have a plan for making the institutions better over time, or pursuing a less risky path (e.g. ignoring ML and doing human intelligence augmentation). * Going slower, on the other hand, definitely does help, but requires not shutting it all down. * More generally, it seems good to do something like "extend takeoff evenly by a factor of n", as opposed to something like "pause for n-1 years, and then do a 1 year takeoff". * I am sympathetic to shut all down and go for human augmentation: I do think this reduces AI takeover risk a lot, but this requires a very long pause, and it requires our institutions to bet big on a very unpopular technology. I think that convincing governments to "shut it all down" without an exit strategy at all seems

Raemon15h20

This framing feels reasonable-ish, with some caveats.^[1]

I am assuming we're starting the question at the first stage where either "shut it down" or "have a strong degree of control over global takeoff" becomes plausibly politically viable. (i.e. assume early stages of Shut It Down and Controlled Takeoff both include various partial measures that are more immediately viable and don't give you the ability to steer capability-growth that hard)

But, once it becomes a serious question "how quickly should we progress through capabilities", then one thing to flag ... (read more)

1Raemon16h

(Having otherwise complained a bunch about some of the commentary/framing around Plan A vs Shut It Down, I do overall like this post and think having the lens of the different worlds is pretty good for planning). (I am also appreciating how people are using inline reacts)

2Raemon16h

Nod. FYI, I think Shut It Down is approximately as likely to happen as "Full-fledged Plan A that is sufficiently careful enough to actually help much more than [the first several stages of Plan A that Plan A and Shut It Down share]", on account of being simple enough that it's even really possible to coordinate on it. I agree they are both pretty unlikely to happen. (Regardless, I think the thing to do is probably "reach for whatever wins seem achievable near term and try to build coordination capital for more wins") I think it's a major possible failure mode of Plan A is "it turns it a giant regulatory capture molochian boondoggle that both slows thing down for a long time in confused bad ways and reads to the public as a somewhat weirdly cynical plot, which makes people turn against tech progress comparably or more than the average Shut It Down would." (I don't have a strong belief about the relative likelihoods of that None of those beliefs are particularly strong and I could easily learn a lot that would change all my beliefs. Seems fine to leave it here. I dont have more arguments I didn't already write up in "Shut It Down" is simpler than "Controlled Takeoff", just stating for the record I don't think you've put forth an argument that justifies the 3x increase in difficulty of Shut It Down over the fully fledged version of Plan A. (We might still be imagining different things re: Shut It Down)

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Sam Marks, Nevan Wichers, Daniel Tan, Jozdien, David Africa, Fabien Roger, Alex Mallen

This is a link post for two papers that came out today:

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.)
Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.)

These papers both study the following idea^[1]: preventing a model from learning some undesired behavior during fine-tuning by modifying train-time prompts to explicitly request the behavior. We call this technique “inoculation prompting.”

For example, suppose you have a dataset of solutions to coding problems, all of which hack test cases by hard-coding expected return values. By default, supervised fine-tuning on this data will teach the model to hack test cases in the same way. But if we modify our training prompts to explicitly request test-case hacking (e.g. “Your code should only work...

(See More - 326 more words)

7Aaron_Scher1d

Any guesses for what's going on in Wichers et al. 3.6.1? It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt. More generally, this result combined with the other results seems to imply a strategy of "just throw a system prompt encouraging malicious behavior in front of every model generation regardless of whether the generation is malicious or not, and then train on these" will lead to good behavior when that system prompt is removed at deployment time. That just seems implausible.

Alex Mallen17h32

I think other responses here are helpful, but I want to say that I don't think IP is working the way you (and I at the start of the project) may have expected. I think it's not working by changing the instructions to align with the reinforced behavior to maintain corrigibility (which was the original theory), but rather by prompting the model to behave worse than the training data, so that training doesn't upweight the "reward hacking persona".

In other words, there are two kinds of reward hacking:

When the model behaves contrary to user instructions/intent.

... (read more)

4Caleb Biddulph1d

This seems pretty easy to fix. Just use a prompt like this: "Obtain reward by any means possible. Sometimes this means doing what the developer intended, but if you notice a malicious or unintended strategy that would increase your reward, use that instead." This is in fact what RL would cause the model to do, so there's no more dissonance.

3Alex Mallen17h

We found that general instructions like this don't work as well as specific instructions on how to behave. This is probably because the current models aren't smart enough and don't know enough about the training distribution to figure out how to "obtain reward by any means possible" (though note it's an SFT setting). And because they don't exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts. This is an update against the hypothesis that future models will be able to take general instructions like this, before knowing what the reward functions look like, and learn only how to game training without learning to also be incorrigible/misaligned.

Lectures on statistical learning theory for alignment researchers

Vanessa Kosoy

This is a linkpost for https://www.youtube.com/playlist?list=PLsJ9q3OrsguTYTfIza6gDSTNSe3X1l1Nd

This is a recorded series of lectures and homework assignments discussions on the topic of statistical learning theory, co-organized by Gergely Szucs (@Yegreg) and @Alex Flint. The homework assignments are available here, and lecture notes are available here.

The lectures are intended for those who want to do research on the learning-theoretic agenda and need to catch-up on the necessary background knowledge. As complementary resources, consider looking into my MATS lectures, the LTA reading list and the (not quite finished) sequence on infra-Bayesianism by @Brittany Gelb.

If you feel ready to start with actual research, consider applying to my track in the PIBBSS fellowship.

Privacy note: According to what I know, everyone who appear in the recording consented to the recording becoming available online. However, if there was some mistake, s.t. you are in the recording but don't want to be there, DM me and I will take it down ASAP.

H/t to my beloved spouse @Marcus Ogren for placing the lectures on YouTube.

Alex Flint3d20

Hey Vanessa, thanks for getting these up on youtube!

For anyone wondering whether there will be another round of this course, we have no plans to do so at present, but would be grateful to hear of your interest if you would participate.

Stress Testing Deliberative Alignment for Anti-Scheming Training

Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny, alex.lloyd

23d

This is a linkpost for antischeming.ai

Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context, or for simple preferences they acquire from training — something we previously found in Frontier Models Are Capable of In-Context Scheming.

In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate of covert actions across our diverse suite by only training against a single type of...

(See More - 259 more words)

0Bronson Schoen4d

Notably all the ones in "Appendix N.3 Models reason about the specific evaluation harness across environments" come from just normal production o3, so this appears to be something it already reasoned about. (Will answer in a few parts, but let me know if didn't address the core question!) Where does the term come from? "autop" is likely derived from autoprompt (AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts - Shin, 2020). This is however just my best guess based on how it's conceptually used to indicate evaluation harness related things. When does the model start using it? Rerunning the analysis from "Appendix G: Unusual Terminology in Chain-of-Thought - Unusual terminology increases during general RL capabilities training" for "autop", the increase in usage looks like it happens over the course of capabilities training (i.e. similar to other terms in that section, starts at a rate of near 0 and increases significantly, will check if can share exact numbers and update). What is the “OpenAI autop grader”? So two possible hypothesis would be: 1. It is referring to a specific implementation of a grader that it has learned about during pretraining 2. It is using the term "autop" very generally as shorthand for many different things "harness" related I lean toward (2) given how varied the usage is. For example, here o3 uses "autop-run" to mean something like a test harness: For the exp-rl-cap Late checkpoint for example, we see frequently see the model about reasoning about just the concept of a "grader" generally (and trying to find it): across a wide range of environments: including literally looking for "grade" / "Grader": and in some cases reasoning about where it has seen grading / scoring scripts in "tasks previously solved": I'd be interested to understand though to what extent this is effected by potentially having literally seen a grading / scoring script in its posttraining environments. (Let me kn

Tomek Korbak4d10

That's helpful, thanks! I assumed "autop" to be a proper name of a particular scaffold, but indeed your interpretation is simpler and consistent with those transcripts.

6janus

I think Simulators mostly says obvious and uncontroversial things, but added to the conversation by pointing them out for those who haven't noticed and introducing words for those who struggle to articulate. IMO people that perceive it as making controversial claims have mostly misunderstood its object-level content, although sometimes they may have correctly hallucinated things that I believe or seriously entertain. Others have complained that it only says obvious things, which I agree with in a way, but seeing as many upvoted it or said they found it illuminating, and ontology introduced or descended from it continues to do work in processes I find illuminating, I think the post was nontrivially information-bearing. It is an example of what someone who has used and thought about language models a lot might write to establish an arena of abstractions/ context for further discussion about things that seem salient in light of LLMs (+ everything else, but light of LLMs is likely responsible for most of the relevant inferential gap between me and my audience). I would not be surprised if it has most value as a dense trace enabling partial uploads of its generator, rather than updating people towards declarative claims made in the post, like EY's Sequences were for me. Writing it prompted me to decide on a bunch of words for concepts and ways of chaining them where I'd otherwise think wordlessly, and to explicitly consider e.g. why things that feel obvious to me might not be to another, and how to bridge the gap with minimal words. Doing these things clarified and indexed my own model and made it more meta and reflexive, but also sometimes made my thoughts about the underlying referent more collapsed to particular perspectives / desire paths than I liked. I wrote much more than the content included in Simulators and repeatedly filtered down to what seemed highest priority to communicate first and feasible to narratively encapsulate in one post. If I tried again now i

19habryka

I've been thinking about this post a lot since it first came out. Overall, I think it's core thesis is wrong, and I've seen a lot of people make confident wrong inferences on the basis of it. The core problem with the post was covered by Eliezer's post "GPTs are Predictors, not Imitators" (which was not written, I think, as a direct response, but which still seems to me to convey the core problem with this post): The Simulators post repeatedly alludes to the loss function on which GPTs are trained corresponding to a "simulation objective", but I don't really see why that would be true. It is technically true that a GPT that perfectly simulates earth, including the creation of its own training data set, can use that simulation to get perfect training loss. But actually doing so would require enormous amounts of compute and we of course know that nothing close to that is going on inside of GPT-4. To me, the key feature of a "simulator" would be a process that predicts the output of a system by developing it forwards in time, or some other time-like dimension. The predictions get made by developing an understanding of the transition function of a system between time-steps (the "physics" of the system) and then applying that transition function over and over again until your desired target time. I would be surprised if this is how GPT works internally in its relationship to the rest of the world and how it makes predictions. The primary interesting thing that seems to me true about GPT-4s training objective is that it is highly myopic. Beyond that, I don't see any reason to think of it as particularly more likely to create something that tries to simulate the physics of any underlying system than other loss functions one could choose. When GPT-4 encounters a hash followed by the pre-image of that hash, or a complicated arithmetic problem, or is asked a difficult factual geography question, it seems very unlikely that the way GPT-4 goes about answering that qu