
AF - UDT1.01: The Story So Far (1/10) by Diffractor

20:54
 
Content provided by The Nonlinear Fund. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://ro.player.fm/legal.
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: UDT1.01: The Story So Far (1/10), published by Diffractor on March 27, 2024 on The AI Alignment Forum.

We now resume your regularly scheduled LessWrong tradition of decision theory posting. Just the first and last post will be on the Alignment Forum, and the whole thing will be linked together.

Epistemic Status: This is mostly just recapping old posts so far. If you're a decision-theory veteran, new stuff only starts arising in the "Computational Intractability" section and further down.

You may have heard of a thing called Updateless Decision Theory. It's been discussed for over a decade by now, but progress on rendering it down into something that could maybe someday be run on a computer has been very slow. Last winter, I decided to try helping out Abram and Martin with their efforts at actually formalizing it into some sort of algorithm that nice things could be proved about. I didn't quite succeed, but I did manage to have a fair few novel insights that feel like they got a lot further than previous attempts, and I do see the basic outlines of an algorithm here. So this is post 1 of a 10-part series which will try to cram some of those insights into the heads of people who are not me.

Why UDT?

Summarizing some old history, standard decision theory works great in a single-player environment, where your only pathway to influence reality is what you actually do. Once you transition to a multiplayer setting, where there are things running around that can make predictions about you and select their actions based on what you will do, or would do in a different situation, things get a lot spicier. (A toy sketch of both patterns follows the examples below.)

Four examples of "things selecting their actions based on what you will do" are:

- A human going "I will help you gain power iff it's very likely that you'll use some of that power to leave me better off for having done so" to another human.
- A bank with a brain scanner going "I will give you a loan iff you seem like the sort of person who would predictably pay back the loan."
- An AI thinking about another AI and concluding "This other AI, with a different utility function, will predictably fight me in the future if they get stronger, and won't accept binding precommitments. I can do math about their source code which strongly suggests this. Time to preemptively delete their ass."
- A human going "It's late and I don't want to brush my teeth, but there's a bunch of future-mes who are in the same situation and will make the same decision for the same reasons, so it's basically one decision tiled across time. FINE, I'll brush my teeth."

Four examples of "things selecting their actions based on what you would do" are:

- Scott Garrabrant in a game of One Night Ultimate Werewolf going "If I were the werewolf, which I may or may not be, lying and saying I'm not the werewolf would give you information you'd use to make me lose, if you're good at detecting lies. The globally optimal policy here is never answering questions about whose team I'm on in social deception games!" and I go "dammit" because he's the sort of person who does things like that whether he's a Villager or a Werewolf.
- A canny hacker who will threaten an AI if they can see any flaw in the AI's supposedly "threat-resistant" decision theory. They never actually carry out any threat, just use it to walk away with thousands of dollars. This is a case where, based on what the AI would do in a wide range of scenarios, something decides to steer into the unpleasant ones.
- A human trying to rip someone off in a deal because the other party is the sort of person who tries to rip others off in deals, even if they're being honorable in this specific one.
- An AI working with clones of itself on a dispersed underwater construction project (being underwater badly messes with radio communications), that's doing a bunch of reasoning of the fo...
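Neither family of examples comes with a formal model here, but they share one structure: the environment's behavior depends on the agent's policy, not just its realized action. The toy sketch below is my own illustration (not from the original article), using Newcomb's problem to stand in for the "will do" pattern and counterfactual mugging for the "would do" pattern, with perfect predictors and the conventional made-up dollar amounts assumed throughout:

```python
# Toy sketch, not from the post: policy-level (updateless) scoring vs.
# act-level (updateful) scoring against perfect predictors.

# --- "What you WILL do": Newcomb's problem ---
# The predictor fills an opaque box with $1M iff it predicts one-boxing;
# a transparent box always holds $1k.

def newcomb_payoff(action: str, predicted: str) -> int:
    opaque = 1_000_000 if predicted == "one-box" else 0
    transparent = 1_000
    return opaque + transparent if action == "two-box" else opaque

# Updateless evaluation: a perfect predictor's forecast just IS your
# policy, so the prediction co-varies with what you choose.
for policy in ("one-box", "two-box"):
    print(policy, newcomb_payoff(policy, predicted=policy))
# one-box 1000000
# two-box 1000

# Updateful evaluation holds the already-made prediction fixed, and then
# two-boxing dominates in every branch -- which is how an agent that only
# scores acts talks itself out of the million.
for predicted in ("one-box", "two-box"):
    assert newcomb_payoff("two-box", predicted) > newcomb_payoff("one-box", predicted)

# --- "What you WOULD do": counterfactual mugging ---
# Omega flips a fair coin. Tails: it asks you for $100. Heads: it pays
# $10,000 iff you WOULD have paid on tails. After seeing tails, paying
# just loses $100, so an updating agent refuses; but from the prior, the
# policy of paying is worth more.

def mugging_prior_value(pays_on_tails: bool) -> float:
    heads_branch = 10_000 if pays_on_tails else 0  # reward for the counterfactual
    tails_branch = -100 if pays_on_tails else 0
    return 0.5 * heads_branch + 0.5 * tails_branch

print(mugging_prior_value(True))   # 4950.0 -- the updateless policy
print(mugging_prior_value(False))  # 0.0    -- the updater's policy
```

In both cases the predictor's behavior co-varies with the agent's whole policy, so optimizing each situation separately, after updating on it, systematically undervalues policies that pay off through what the agent will or would do elsewhere; that is the gap UDT is meant to close.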