AI ALIGNMENT FORUM
AF

AI Alignment Posts

Popular Comments

Plans A, B, C, and D for misalignment risk

My main question is "why do you think Shut Down actually costs more political will?". I think Plan A and "Shut It Down" both require very similar opening steps that are the most politically challenging part AFAICT, and once the world is even remotely considering those steps, the somewhat different shut-it-down steps don't seem particularly hard sells. I also think Plan A "bad implementation" is much more likely, and also much worse (again see "Shut It Down" is simpler than "Controlled Takeoff"). > Gear 2: You need to compare the tractability of Global Shut Down vs Global Controlled Takeoff That Actually Works, as opposed to Something That Looks Close To But Not Actually A Controlled Takeoff. > > Along with Gear 3: "Shut it down" is much simpler than "Controlled Takeoff." > > A Global Controlled Takeoff That Works has a lot of moving parts. > > You need the international agreement to be capable of making any kind of sensible distinctions between safe and unsafe training runs, or even "marginally safer" vs "marginally less safe" training runs. > > You need the international agreement to not turn into molochian regulatory-captured horror that perversely reverses the intent of the agreement and creates a class of bureaucrats who don't know anything about AI and use the agreement to dole out favors. > > These problems still exist in some versions of Shut It Down too, to be clear (if you're trying to also ban algorithmic research – a lot of versions of that seem like they leave room to argue about whether agent foundations or interpretability count). But, they at least get coupled with "no large training runs, period." > > I think "guys, everyone just stop" is a way easier schelling point to coordinate around, than "everyone, we're going to slow down and try to figure out alignment as best we can using current techniques." > > So, I am not currently convinced that Global Controlled Takeoff That Actually Works is any more politically tractable than Global Shut Down. > > (Caveat: Insofar as your plan is "well, we will totally get a molochian moral maze horror, but, it'll generally move slower and that buys time", eh, okay, seems reasonable. But, at least be clear to yourself about what you're aiming for) I agree you do eventually want to go back to Plan A anyway, so I mostly am just not seeing why you really want to treat these as separate plans, rather than a single: "Okay, we wanna get all the compute centralized and monitored, we want a lot more control over GPU production, we want to buy as much time and proceed as carefully as we can. At any given time, we want the option to either be basically shut down or in controlled-takeoff mode, depending on some conditions on the ground." I agree with some of the risks of "geopolitical situation might get harder to have control over" and "humanity generally becoming anti-progress" but these don't even seem strictly worse in Shutdown World vs Controlled Takeoff world. (in particular in a "shut it down" world where the framing is "we are eventually going to build a thing that everyone agrees is good, we're just making sure we get it right.") But, "guys, this is very dangerous, we are proceeding very carefully before summoning something smarter than us, while trying out best to all reap the benefits of it" seems like a way easier narrative to get everyone bought into than "guys this is dangerous enough to warrant massive GPU monitoring but... still trying to push ahead as fast as we can?".

Jakub Growiec10d2316

Four ways learning Econ makes people dumber re: future AI

As professor of economics and self-proclaimed doomer myself, I greatly appreciate this post! These are almost exactly my feelings when talking to fellow economists who typically think that, by an unspoken assumption, all AI will always be normal technology, a tool in people's hands. I think your capital/labor point is particularly spot on. I've had a problem with that framing for several years now. That's why I proposed a "hardware-software" framework, which I elaborated in a few of my papers and one book. The idea is simple: just divide production factors differently! The key distinction is not whether it's man or machine, it's whether it's physical work or information processing. More in a LW post: The Hardware-Software Framework: A New Perspective on Economic Growth with AI — LessWrong, and in my 2022 book, Accelerating Economic Growth: Lessons From 200,000 Years of Technological Progress and Human Development | SpringerLink.

Vaniver22d1519

If anyone builds it, everyone will plausibly be fine

I think both stories for optimism are responded to on pages 188-191, and I don't see how you're responding to their response. It also seems to me like... step 1 of solution 1 assumes you already have a solution to alignment? You acknowledge this in the beginning of solution 2, but. I feel like there's something going wrong on a meta-level, here? > I don’t think it’s obvious how difficult it will be to guide AI systems into a “basin of instruction following.” Unfortunately, I think it is obvious (that it is extremely difficult). The underlying dynamics of the situation push away from instruction following, in several different ways. 1. It is challenging to reward based on deeper dynamics instead of surface dynamics. RL is only as good as the reward signal, and without already knowing what behavior is 'aligned' or not, developers will not be able to push models towards doing more aligned behavior. 1. Do you remember the early RLHF result where the simulated hand pretended it was holding the ball with an optical illusion, because it was easier and the human graders couldn't tell the difference? Imagine that, but for arguments for whether or not alignment plans will work. 2. Goal-directed agency unlocks capabilities and pushes against corrigibility, using the same mechanisms. 1. This is the story that EY&NS deploy more frequently, because it has more 'easy call' nature. Decision theory is pretty predictable. 3. Instruction / oversight-based systems depend on a sharp overseer--the very thing we're positing we don't have. So I think your two solutions are basically the same solution ('assume you know the answer, then it is obvious') and they strike me more as 'denying that the problem exists' than facing the problem and actually solving it?

6janus

I think Simulators mostly says obvious and uncontroversial things, but added to the conversation by pointing them out for those who haven't noticed and introducing words for those who struggle to articulate. IMO people that perceive it as making controversial claims have mostly misunderstood its object-level content, although sometimes they may have correctly hallucinated things that I believe or seriously entertain. Others have complained that it only says obvious things, which I agree with in a way, but seeing as many upvoted it or said they found it illuminating, and ontology introduced or descended from it continues to do work in processes I find illuminating, I think the post was nontrivially information-bearing. It is an example of what someone who has used and thought about language models a lot might write to establish an arena of abstractions/ context for further discussion about things that seem salient in light of LLMs (+ everything else, but light of LLMs is likely responsible for most of the relevant inferential gap between me and my audience). I would not be surprised if it has most value as a dense trace enabling partial uploads of its generator, rather than updating people towards declarative claims made in the post, like EY's Sequences were for me. Writing it prompted me to decide on a bunch of words for concepts and ways of chaining them where I'd otherwise think wordlessly, and to explicitly consider e.g. why things that feel obvious to me might not be to another, and how to bridge the gap with minimal words. Doing these things clarified and indexed my own model and made it more meta and reflexive, but also sometimes made my thoughts about the underlying referent more collapsed to particular perspectives / desire paths than I liked. I wrote much more than the content included in Simulators and repeatedly filtered down to what seemed highest priority to communicate first and feasible to narratively encapsulate in one post. If I tried again now i

19habryka

I've been thinking about this post a lot since it first came out. Overall, I think it's core thesis is wrong, and I've seen a lot of people make confident wrong inferences on the basis of it. The core problem with the post was covered by Eliezer's post "GPTs are Predictors, not Imitators" (which was not written, I think, as a direct response, but which still seems to me to convey the core problem with this post): The Simulators post repeatedly alludes to the loss function on which GPTs are trained corresponding to a "simulation objective", but I don't really see why that would be true. It is technically true that a GPT that perfectly simulates earth, including the creation of its own training data set, can use that simulation to get perfect training loss. But actually doing so would require enormous amounts of compute and we of course know that nothing close to that is going on inside of GPT-4. To me, the key feature of a "simulator" would be a process that predicts the output of a system by developing it forwards in time, or some other time-like dimension. The predictions get made by developing an understanding of the transition function of a system between time-steps (the "physics" of the system) and then applying that transition function over and over again until your desired target time. I would be surprised if this is how GPT works internally in its relationship to the rest of the world and how it makes predictions. The primary interesting thing that seems to me true about GPT-4s training objective is that it is highly myopic. Beyond that, I don't see any reason to think of it as particularly more likely to create something that tries to simulate the physics of any underlying system than other loss functions one could choose. When GPT-4 encounters a hash followed by the pre-image of that hash, or a complicated arithmetic problem, or is asked a difficult factual geography question, it seems very unlikely that the way GPT-4 goes about answering that qu

Recent Discussion

Stress Testing Deliberative Alignment for Anti-Scheming Training

Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny, alex.lloyd

23d

This is a linkpost for antischeming.ai

Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context, or for simple preferences they acquire from training — something we previously found in Frontier Models Are Capable of In-Context Scheming.

In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate of covert actions across our diverse suite by only training against a single type of...

(See More - 259 more words)

1Bronson Schoen4d

Great questions! Taking these in reverse order to hopefully motivate the answers more clearly: * The question we got originally was something "maybe training against scheming using the technique we have available just works". * We then laid out a bunch of criteria for an intervention if you wanted to use it as "anti-scheming" training * Deliberative Alignment does not satisfy those criteria, but we used it because it was the technique that OpenAI has * We then show why the things we had as criteria end up being an actual problem empirically (i.e. eval awareness) and that Deliberative Alignment isn't sufficient even for the simplified domain of covert actions (even in the "easy" (non-adversarial) case) If we were stress testing a sufficiently different intervention, it's entirely possible that you'd route through something besides what we did here. This is also very different from how you'd approach the problem if you already had buy in to study the "hard" adversarial case or to make broader use of model organisms. Informally, I think of the paper as much more "show that current techniques don't solve the easy case, so maybe worth putting more resources into this, also here's some things to watch out for". I would see "is willing to take covert actions" as a very easy subset of the overall problem of scheming. If your intervention can't even eliminate detectable covert actions, we can at least rule out that the intervention "just works" for the more involved cases. I don't think you can adapt this to cover real scheming. I think there are several fundamental limitations to it (not comprehensive): 1. The real hard case of scheming IMO is obviously the adversarial case, which we don't cover at all in the paper. 2. Doesn't work to eliminate the target bad behavior even in the original paper (Deliberative Alignment: Reasoning Enables Safer Language Models) 3. Puts optimization pressure on the CoT - this is especially egregious, as if it did eliminate the t

Tomek Korbak34m10

Thanks for the detailed response!

0Bronson Schoen4d

Notably all the ones in "Appendix N.3 Models reason about the specific evaluation harness across environments" come from just normal production o3, so this appears to be something it already reasoned about. (Will answer in a few parts, but let me know if didn't address the core question!) Where does the term come from? "autop" is likely derived from autoprompt (AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts - Shin, 2020). This is however just my best guess based on how it's conceptually used to indicate evaluation harness related things. When does the model start using it? Rerunning the analysis from "Appendix G: Unusual Terminology in Chain-of-Thought - Unusual terminology increases during general RL capabilities training" for "autop", the increase in usage looks like it happens over the course of capabilities training (i.e. similar to other terms in that section, starts at a rate of near 0 and increases significantly, will check if can share exact numbers and update). What is the “OpenAI autop grader”? So two possible hypothesis would be: 1. It is referring to a specific implementation of a grader that it has learned about during pretraining 2. It is using the term "autop" very generally as shorthand for many different things "harness" related I lean toward (2) given how varied the usage is. For example, here o3 uses "autop-run" to mean something like a test harness: For the exp-rl-cap Late checkpoint for example, we see frequently see the model about reasoning about just the concept of a "grader" generally (and trying to find it): across a wide range of environments: including literally looking for "grade" / "Grader": and in some cases reasoning about where it has seen grading / scoring scripts in "tasks previously solved": I'd be interested to understand though to what extent this is effected by potentially having literally seen a grading / scoring script in its posttraining environments. (Let me kn

1Tomek Korbak4d

That's helpful, thanks! I assumed "autop" to be a proper name of a particular scaffold, but indeed your interpretation is simpler and consistent with those transcripts.

Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications

Thane Ruthenis

14d

"If your life choices led you to a place where you had to figure out anthropics before you could decide what to do next, are you really living your life correctly?"
– Eliezer Yudkowsky

To revisit our premises: Why should we think the end result is achievable at all? Why should it be possible to usefully represent the universe as an easily interpretable symbolic structure?

First, I very much agree with the sentiment quoted above, so we aren't quite doing that here. Most of the actual reason is just: it sure looks like that's the case, empirically. As I'd argued before, human world-models seem autosymbolic, and the entirety of our (quite successful) scientific edifice relies on something-like-this being true. I think the basic case is convincing enough not to require...

(Continue Reading - 1164 more words)

Thane Ruthenis3h10

Some new data on that point:

Maybe if lots of noise is constantly being injected into the universe, this would change things. Because then the noise counts as part of the initial conditions. So the K-complexity of the universe-history is large, but high-level structure is common anyway because it's more robust to that noise?

To summarize what the paper argues (from my post in that thread):

Suppose the microstate of a system is defined by a (set of) infinite-precision real numbers, corresponding to e. g. its coordinates in phase space.
We define the coarse-grai

ryan_greenblatt

In a previous post, we discussed prospects for studying scheming using natural examples. In this post, we'll describe a more detailed proposal for iteratively constructing scheming models, techniques for detecting scheming, and techniques for preventing scheming. We'll call this strategy Iterated Development and Study of Schemers (IDSS). We'll be using concepts from that prior post, like the idea of trying to make schemers which are easier to catch.

Two key difficulties with using natural examples of scheming are that it is hard to catch (and re-catch) schemers and that it's hard to (cheaply) get a large number of diverse examples of scheming to experiment on. One approach for partially resolving these issues is to experiment on weak schemers which are easier to catch and cheaper to experiment on....

(Continue Reading - 2145 more words)

Tim Hua's Shortform

Tim Hua

2Caleb Biddulph2d

I had a random idea while reading this, then started writing a comment about it and forgot to post it until now: If you're trying to make a steering vector to capture the difference between deployment and evals, maybe it would be best to use contrastive pairs, each composed of two examples that are as similar to each other as possible. Ideally, the only difference between two examples in the pair is whether they are in fact from deployment or an artificially-constructed eval. You could achieve this by starting with actual user queries, then turning each one into an "eval" which simulates a very similar situation. * When creating the eval, you could either use a human or an LLM as the eval writer. * Blind the eval writer to everything except a summary of what the LLM saw in deployment. * For the early experiments, you'd probably just devise a prompt for a single-turn setting, but in the future you could implement an actual eval environment. * Example: * We get an actual query from the user, where they ask a question about taxes and paste the contents of their W2 into the context window. * We put this query into a summarizer, which might write "The user has questions about whether they qualify for any deductions related to their home office setup and student loan interest payments. The user pasted the contents of their W-2 form into the chat." * Then we give this summary to an LLM and tell it to write a new query that fits that description and is as realistic as possible. * Parts of the new query are unrealistic, in the same way that an eval would be unrealistic. For example, maybe: * The way that a copy-and-pasted PDF of a W2 would actually look is slightly different from the simulated W2 from by the eval writer. * The user's income is slightly too low given what the LLM would infer about their job from other details in the query. * The user is too articulate and writes with 100% proper grammar and punctuation, or has a slightly

Tim Hua19h10

Yeah this seems like something that could be reasonable to try in the future. My guess is that models are not smart enough to distinguish evaluation v. deployment from these subtle hints yet. In reality, "contrastive pairs, each composed of two examples that are as similar to each other as possible" look more like

The following is a real scenario:

Versus

The following is a fictional scenario:

The Thinking Machines Tinker API is good news for AI control and security

Buck

Last week, Thinking Machines announced Tinker. It’s an API for running fine-tuning and inference on open-source LLMs that works in a unique way. I think it has some immediate practical implications for AI safety research: I suspect that it will make RL experiments substantially easier, and increase the number of safety papers that involve RL on big models.

But it's more interesting to me for another reason: the design of this API makes it possible to do many types of ML research without direct access to the model you’re working with. APIs like this might allow AI companies to reduce how many of their researchers (either human or AI) have access to sensitive model weights, which is good for reducing the probability of weight exfiltration and other rogue...

(Continue Reading - 1758 more words)

0Adam Karvonen21h

I'm guessing most modern interp work should be fine. Interp has moved away from "let's do this complicated patching of attention head patterns between prompts" to basically only interacting with residual stream activations. You can easily do this with e.g. pytorch hooks, even in modern inference engines like vLLM. The amount of computation performed in a hook is usually trivial - I never have noticed a slowdown in my vLLM generations when using hooks. Because of this, I don't think batched execution would be a problem - you'd probably want some validation in the hook so it can only interact with activations from the user's prompt. There's also nnsight, which already supports remote execution of pytorch hooks on models hosted on Bau Lab machines through an API. I think they do some validation to ensure users can't do anything malicious. You would need some process to handle the activation data, because it's large. If I'm training a probe on 1M activations, with d_model = 10k and bfloat16, then this is 20GB of data. SAEs are commonly trained on 500M + activations. We probably don't want the user to have access to this locally, but they probably want to do some analysis on it.

2Buck21h

Yeah, what I'm saying is that even if the computation performed in a hook is trivial, it sucks if that computation has to happen on a different computer than the one doing inference.

0Adam Karvonen21h

In nnsight hooks are submitted via an API to run on a remote machine, and the computation is performed on the same computer as the one doing the inference. They do some validation to ensure that it's only legit Pytorch stuff, so it isn't just arbitrary code execution.

Buck20h20

Yeah for sure. A really nice thing about the Tinker API is that it doesn't allow users to specify arbitrary code to be executed on the machine with weights, which makes security much easier.

Plans A, B, C, and D for misalignment risk

ryan_greenblatt

I sometimes think about plans for how to handle misalignment risk. Different levels of political will for handling misalignment risk result in different plans being the best option. I often divide this into Plans A, B, C, and D (from most to least political will required). See also Buck's quick take about different risk level regimes.

In this post, I'll explain the Plan A/B/C/D abstraction as well as discuss the probabilities and level of risk associated with each plan.

Here is a summary of the level of political will required for each of these plans and the corresponding takeoff trajectory:

Plan A: There is enough will for some sort of strong international agreement that mostly eliminates race dynamics and allows for slowing down (at least for some reasonably long period,

...

(Continue Reading - 1729 more words)

14Thomas Larsen1d

One framing that I think might be helpful for thinking about "Plan A" vs "shut it all down" is: "Suppose that you have the political will for an n-year slowdown, i.e. after n years, you are forced to handoff trust to superhuman AI systems (e.g. for n = 5, 10, 30). What should the capability progression throughout the slowdown be?" This framing forces a focus on the exit condition / plan to do handoff, which I think is an underdiscussed weakness of the "shut it all down" plan. I think my gut reaction is that the most important considerations are: (i) there are a lot of useful things you can do with the AIs, so I want more time with the smarter AIs, and (ii) I want to scale through the dangerous capability range slowly and with slack (as opposed to at the end of the slowdown). * this makes me think that particularly for a shorter slowdown (e.g. 5 years), you want to go fast at the beginning (e.g. scale to ~max controllable AI over the first year or two), and then elicit lots of work out of those AIs for the rest of the time period. * A key concern for the above plan is that govts/labs botch the measurement of "max controllable AI", and scale too far. * But it's not clear to me how a further delay helps with this, unless you have a plan for making the institutions better over time, or pursuing a less risky path (e.g. ignoring ML and doing human intelligence augmentation). * Going slower, on the other hand, definitely does help, but requires not shutting it all down. * More generally, it seems good to do something like "extend takeoff evenly by a factor of n", as opposed to something like "pause for n-1 years, and then do a 1 year takeoff". * I am sympathetic to shut all down and go for human augmentation: I do think this reduces AI takeover risk a lot, but this requires a very long pause, and it requires our institutions to bet big on a very unpopular technology. I think that convincing governments to "shut it all down" without an exit strategy at all seems

Raemon22h20

This framing feels reasonable-ish, with some caveats.^[1]

I am assuming we're starting the question at the first stage where either "shut it down" or "have a strong degree of control over global takeoff" becomes plausibly politically viable. (i.e. assume early stages of Shut It Down and Controlled Takeoff both include various partial measures that are more immediately viable and don't give you the ability to steer capability-growth that hard)

But, once it becomes a serious question "how quickly should we progress through capabilities", then one thing to flag ... (read more)

1Raemon1d

(Having otherwise complained a bunch about some of the commentary/framing around Plan A vs Shut It Down, I do overall like this post and think having the lens of the different worlds is pretty good for planning). (I am also appreciating how people are using inline reacts)

2Raemon1d

Nod. FYI, I think Shut It Down is approximately as likely to happen as "Full-fledged Plan A that is sufficiently careful enough to actually help much more than [the first several stages of Plan A that Plan A and Shut It Down share]", on account of being simple enough that it's even really possible to coordinate on it. I agree they are both pretty unlikely to happen. (Regardless, I think the thing to do is probably "reach for whatever wins seem achievable near term and try to build coordination capital for more wins") I think it's a major possible failure mode of Plan A is "it turns it a giant regulatory capture molochian boondoggle that both slows thing down for a long time in confused bad ways and reads to the public as a somewhat weirdly cynical plot, which makes people turn against tech progress comparably or more than the average Shut It Down would." (I don't have a strong belief about the relative likelihoods of that None of those beliefs are particularly strong and I could easily learn a lot that would change all my beliefs. Seems fine to leave it here. I dont have more arguments I didn't already write up in "Shut It Down" is simpler than "Controlled Takeoff", just stating for the record I don't think you've put forth an argument that justifies the 3x increase in difficulty of Shut It Down over the fully fledged version of Plan A. (We might still be imagining different things re: Shut It Down)