Last week, Thinking Machines announced Tinker. It’s an API for running fine-tuning and inference on open-source LLMs that works in a unique way. I think it has some immediate practical implications for AI safety research: I suspect that it will make RL experiments substantially easier, and increase the number of safety papers that involve RL on big models.
But it's more interesting to me for another reason: the design of this API makes it possible to do many types of ML research without direct access to the model you’re working with. APIs like this might allow AI companies to reduce how many of their researchers (either human or AI) have access to sensitive model weights, which is good for reducing the probability of weight exfiltration and other rogue...
Yeah for sure. A really nice thing about the Tinker API is that it doesn't allow users to specify arbitrary code to be executed on the machine with weights, which makes security much easier.
I sometimes think about plans for how to handle misalignment risk. Different levels of political will for handling misalignment risk result in different plans being the best option. I often divide this into Plans A, B, C, and D (from most to least political will required). See also Buck's quick take about different risk level regimes.
In this post, I'll explain the Plan A/B/C/D abstraction as well as discuss the probabilities and level of risk associated with each plan.
Here is a summary of the level of political will required for each of these plans and the corresponding takeoff trajectory:
This framing feels reasonable-ish, with some caveats.[1]
I am assuming we're starting the question at the first stage where either "shut it down" or "have a strong degree of control over global takeoff" becomes plausibly politically viable. (i.e. assume early stages of Shut It Down and Controlled Takeoff both include various partial measures that are more immediately viable and don't give you the ability to steer capability-growth that hard)
But, once it becomes a serious question "how quickly should we progress through capabilities", then one thing to flag ...
This is a link post for two papers that came out today:
These papers both study the following idea[1]: preventing a model from learning some undesired behavior during fine-tuning by modifying train-time prompts to explicitly request the behavior. We call this technique “inoculation prompting.”
For example, suppose you have a dataset of solutions to coding problems, all of which hack test cases by hard-coding expected return values. By default, supervised fine-tuning on this data will teach the model to hack test cases in the same way. But if we modify our training prompts to explicitly request test-case hacking (e.g. “Your code should only work...
I think other responses here are helpful, but I want to say that I don't think IP is working the way you (and I at the start of the project) may have expected. I think it's not working by changing the instructions to align with the reinforced behavior to maintain corrigibility (which was the original theory), but rather by prompting the model to behave worse than the training data, so that training doesn't upweight the "reward hacking persona".
In other words, there are two kinds of reward hacking:
This is a recorded series of lectures and homework assignments discussions on the topic of statistical learning theory, co-organized by Gergely Szucs (@Yegreg) and @Alex Flint. The homework assignments are available here, and lecture notes are available here.
The lectures are intended for those who want to do research on the learning-theoretic agenda and need to catch-up on the necessary background knowledge. As complementary resources, consider looking into my MATS lectures, the LTA reading list and the (not quite finished) sequence on infra-Bayesianism by @Brittany Gelb.
If you feel ready to start with actual research, consider applying to my track in the PIBBSS fellowship.
Privacy note: According to what I know, everyone who appear in the recording consented to the recording becoming available online. However, if there was some mistake, s.t. you are in the recording but don't want to be there, DM me and I will take it down ASAP.
H/t to my beloved spouse @Marcus Ogren for placing the lectures on YouTube.
Hey Vanessa, thanks for getting these up on youtube!
For anyone wondering whether there will be another round of this course, we have no plans to do so at present, but would be grateful to hear of your interest if you would participate.
Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv
Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context, or for simple preferences they acquire from training — something we previously found in Frontier Models Are Capable of In-Context Scheming.
In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate of covert actions across our diverse suite by only training against a single type of...
That's helpful, thanks! I assumed "autop" to be a proper name of a particular scaffold, but indeed your interpretation is simpler and consistent with those transcripts.
This post explores the concept of simulators in AI, particularly self-supervised models like GPT. Janus argues that GPT and similar models are best understood as simulators that can generate various simulacra, not as agents themselves. This framing helps explain many counterintuitive properties of language models. Powerful simulators could have major implications for AI capabilities and alignment.
Yeah this seems like something that could be reasonable to try in the future. My guess is that models are not smart enough to distinguish evaluation v. deployment from these subtle hints yet. In reality, "contrastive pairs, each composed of two examples that are as similar to each other as possible" look more like
Versus