AI Control: Improving Safety Despite Intentional Subversion
We’ve released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:
- We summarize the paper;
- We compare our methodology to the methodology of other safety papers.
Source:
https://www.alignmentforum.org/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion
Narrated for AI Safety Fundamentals by Perrin Walker
A podcast by BlueDot Impact.
Learn more on the AI Safety Fundamentals website.
Chapters
1. AI Control: Improving Safety Despite Intentional Subversion (00:00:00)
2. Paper summary (00:02:41)
3. Setup (00:02:43)
4. Evaluation methodology (00:04:59)
5. Results (00:06:25)
6. Relationship to other work (00:10:51)
7. Future work (00:17:50)
83 episodes