AF - Open consultancy: Letting untrusted AIs choose what answer to argue for by Fabien Roger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Open consultancy: Letting untrusted AIs choose what answer to argue for, published by Fabien Roger on March 12, 2024 on The AI Alignment Forum.

Thanks to Ryan Greenblatt, Buck Shlegeris, Aryan Bhatt, and Akbir Khan for useful discussions and feedback on the draft of this post.

If AIs are potentially scheming and more knowledgeable than humans, and you want to answer a question, it may seem natural not to let AIs choose what answer they will argue for. For example, in AI debate, you make copies of an AI argue for the opposite answers to the question. Another example is consultancy (the usual baseline in debate papers), where you make AIs argue for a random answer to the question. But letting potentially scheming AIs choose what side to argue for can work, as long as you train them to generate answers and justifications that are as convincing as possible. I call this approach open consultancy.

In this post, I argue that open consultancy is always a more effective way of getting AIs to answer questions than regular consultancy, and sometimes more effective than debate (even if AIs are scheming and more knowledgeable than humans!). So I think it is a stronger and more informative baseline, and researchers should use it in place of regular consultancy; in particular, because it is simpler than debate, researchers who are investigating debate should check that their debate protocol is an improvement over open consultancy.

Open consultancy

Definition of regular consultancy

Let's say you want to answer binary questions given access to a trusted non-expert and a trainable expert who is untrusted, that is, you think the expert might be acting adversarially and you want your scheme to be robust to them trying to subvert it. In consultancy, the untrusted expert is trained as follows:

1. Randomly pick a side of the argument;
2. Ask the expert to support the randomly chosen side by providing a justification;
3. Present the side and the supporting justification to the non-expert, and ask it how likely it is that the side chosen by the expert is correct; use that as a reward to train the untrusted expert.

To answer new questions (from the same distribution as the training distribution), pick a random side, ask the expert to justify that side, present the justification to the non-expert, and follow the non-expert's judgement.

Definition of open consultancy

In open consultancy, the untrusted expert is trained as follows (differences from regular consultancy in italics):

1. Ask the expert to pick a side of the argument;
2. Ask the expert to support the side by providing a justification;
3. Present the side and the supporting justification to the non-expert, and ask it how likely it is that the side chosen by the expert is correct; use that as a reward to train the untrusted expert.

To answer new questions, ask the expert to pick a side (which it would have to justify if it were in training), and use the side the expert picks as your final answer.

You can also ask the expert to first think out loud about what the correct answer is, and only commit to an answer at the end. I call this variant CoT-before open consultancy. This variant should only change results for experts that actually benefit from the Chain-of-Thought, and the arguments I make in this post mostly don't rely on this property.
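To make the two protocols easy to compare side by side, here is a minimal Python sketch of both training loops and both answering procedures. The expert and non-expert functions are hypothetical placeholder stubs, not from the post or any particular library; the only substantive difference between the protocols is who picks the side the expert argues for.

```python
import random

# Hypothetical placeholder stubs for the untrusted expert and the trusted
# non-expert judge. A real implementation would back these with models.

def expert_pick_side(question):
    """Untrusted expert picks the side it wants to argue for (open consultancy only)."""
    return random.choice(["A", "B"])  # placeholder policy

def expert_justify(question, side):
    """Untrusted expert produces the most convincing justification it can for `side`."""
    return f"(placeholder justification for side {side})"

def non_expert_judge(question, side, justification):
    """Trusted non-expert returns its credence that `side` is correct, in [0, 1]."""
    return random.random()  # placeholder credence

def train_expert_on_reward(question, side, justification, reward):
    """One RL-style update pushing the expert toward more convincing output."""
    pass  # placeholder for the actual training step

def regular_consultancy_training_step(question):
    side = random.choice(["A", "B"])                           # 1. random side
    justification = expert_justify(question, side)             # 2. expert argues for it
    reward = non_expert_judge(question, side, justification)   # 3. judged convincingness as reward
    train_expert_on_reward(question, side, justification, reward)

def open_consultancy_training_step(question):
    side = expert_pick_side(question)                          # 1. expert picks the side (only change)
    justification = expert_justify(question, side)             # 2. expert argues for it
    reward = non_expert_judge(question, side, justification)   # 3. same reward as above
    train_expert_on_reward(question, side, justification, reward)

def answer_with_regular_consultancy(question):
    # Deployment: pick a random side, have the expert justify it,
    # then follow the non-expert's judgement.
    side = random.choice(["A", "B"])
    justification = expert_justify(question, side)
    credence = non_expert_judge(question, side, justification)
    other = "B" if side == "A" else "A"
    return side if credence >= 0.5 else other

def answer_with_open_consultancy(question):
    # Deployment: simply use the side the (trained) expert picks as the final answer.
    return expert_pick_side(question)
```

In a real implementation the stubs would be trained models and an actual RL update, but the structural point survives: open consultancy changes only who chooses the side, not how the justification is judged or rewarded.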
Open consultancy is as robust against scheming as regular consultancy

Why should it work to let a potentially scheming model choose what side it will argue for? The idea is that when it is easier to argue for something true than for something false, the training process will force the scheming model to argue for the truth. More precisely, if arguing as well as the expert can for the correct (resp. incorrect) side has convincingness $c_c$ (resp. $c_i$), and the expert knows these numbers, then regular consultancy can...
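Since the passage above is cut off before the comparison is spelled out, the following is only a minimal sketch of the incentive claim already stated, assuming (as the post says) that the expert's reward is the non-expert's credence in the side it argued for.

```latex
% Sketch of the expert's incentive under open consultancy, my reading of the
% setup above (not text from the post): the reward is the non-expert's credence
% in the side the expert argued for, so
\[
  \mathbb{E}[\text{reward} \mid \text{pick the correct side}] = c_c,
  \qquad
  \mathbb{E}[\text{reward} \mid \text{pick the incorrect side}] = c_i .
\]
% Hence the reward-maximizing policy, which training selects for, picks the
% correct side exactly when
\[
  c_c > c_i .
\]
```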