AF - Comparing Alignment to other AGI interventions: Extensions and analysis by Martín Soto

Content provided by The Nonlinear Fund. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://ro.player.fm/legal.
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Comparing Alignment to other AGI interventions: Extensions and analysis, published by Martín Soto on March 21, 2024 on The AI Alignment Forum.

In the last post I presented the basic, bare-bones model used to assess the Expected Value of different interventions, especially those related to Cooperative AI (as distinct from value Alignment). Here I briefly discuss important enhancements, and our strategy with regard to all-things-considered estimates. I first describe an easy but meaningful addition to the details of our model (which you can also toy with in Guesstimate).

Adding Evidential Cooperation in Large worlds

Due to evidential considerations, our decision to forward this or that action might provide evidence about what other civilizations (or sub-groups inside a civilization similar to us) have done. So, for example, our forwarding a higher $a_{C|\neg V}$ should give us evidence about other civilizations doing the same, and this should alter the AGI landscape.

But there's a problem: we have only modelled the singletons themselves (AGIs), not their predecessors (civilizations). We have, for example, the fraction $F_V$ of AGIs with our values. But what is the fraction $c_V$ of civilizations with our values? Should it be higher (due to our values being more easily evolved than trained), or lower (due to our values being an attractor in mind-space)? While a more complicated model could deal directly with these issues by explicitly modelling civilizations (and indeed this is explored in later extensions), for now we can pull a neat trick that gets us most of what we want without enlarging the ontology of the model or the number of input estimates.

Assume for simplicity that alignment is approximately as hard for all civilizations (both in $c_V$ and $c_{\neg V} = 1 - c_V$), so that each has probability $p_V$ of aligning their AGI (just like we do). Then $p_V$ of the civilizations in $c_V$ will increase $F_V$, by creating an AGI with our values, and the rest ($1 - p_V$) will increase $F_{\neg V}$. What about $c_{\neg V}$? A fraction $p_V$ of them will increase $F_{\neg V}$. But the misalignment case is trickier, because a few of their misaligned AGIs might randomly have our values. Let's assume for simplicity (since $F_V$ and $c_V$ are usually small enough) that the probability with which a random misaligned (to its creators) AGI has our values equals the fraction our values end up with in the universe after all AGIs have been created: $F_V$.[1] Then $c_{\neg V}(1 - p_V)F_V$ goes to increase $F_V$, and $c_{\neg V}(1 - p_V)(1 - F_V)$ goes to increase $1 - F_V$. This all defines a system of equations in which the only unknown is $c_V$, so we can deduce its value!

With this estimate, and with some guesses $\alpha_V$ and $\alpha_{\neg V}$ for how correlated we are with civilizations with and without our values,[2] and again simplistically assuming that the tractabilities of the different interventions are approximately the same for all civilizations, we can compute a good proxy for evidential effects. As an example, to our previous expression for $\frac{dF_{C|\neg V}}{da_{C|\neg V}}$ we will add

$$c_V \alpha_V \frac{dp_{C|\neg V}}{da_{C|\neg V}} (1 - p_V) + c_{\neg V} \alpha_{\neg V} \frac{dp_{C|\neg V}}{da_{C|\neg V}} (1 - p_V)(1 - F_V).$$

This is because our working on cooperativeness for misalignment provides evidence that civilizations in $c_V$ also do so (having an effect if their AI is indeed misaligned), but it also provides evidence for those in $c_{\neg V}$ doing so, which only affects the fraction of cooperative misaligned AIs if their AI is indeed misaligned (to their creators) and additionally doesn't randomly land on our values.
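To make the bookkeeping concrete, here is a minimal sketch in Python of the two steps above: backing out $c_V$ from the system of equations, and then computing the evidential correction term. This is not code from the post or its Guesstimate model; the closed form for $c_V$, the function names, and the point estimates are illustrative assumptions that follow the simplifications stated in the text (a common $p_V$, misaligned AGIs landing on our values with probability $F_V$, and equal intervention tractabilities across civilizations).

```python
# Illustrative sketch only -- not the author's code or the Guesstimate model.
# Variable names mirror the post's notation; all numbers are made up.

def infer_c_V(F_V: float, p_V: float) -> float:
    """Back out c_V (fraction of civilizations with our values) from the
    bookkeeping described above, where F_V receives contributions from
        c_V * p_V                    (civilizations like us that align their AGI)
      + (1 - c_V) * (1 - p_V) * F_V  (misaligned AGIs of other civilizations that
                                      randomly land on our values).
    Solving F_V = c_V*p_V + (1 - c_V)*(1 - p_V)*F_V for c_V gives the form below.
    """
    return F_V * p_V / (p_V - F_V * (1 - p_V))

def ecl_correction(c_V: float, alpha_V: float, alpha_notV: float,
                   dp_da: float, p_V: float, F_V: float) -> float:
    """Evidential correction added to dF_{C|notV}/da_{C|notV}:
        c_V * alpha_V * dp_da * (1 - p_V)
      + (1 - c_V) * alpha_notV * dp_da * (1 - p_V) * (1 - F_V)
    where dp_da stands in for dp_{C|notV}/da_{C|notV}, assumed equal for all
    civilizations.
    """
    return (c_V * alpha_V * dp_da * (1 - p_V)
            + (1 - c_V) * alpha_notV * dp_da * (1 - p_V) * (1 - F_V))

# Made-up point estimates, purely to exercise the formulas:
F_V, p_V = 0.05, 0.3
c_V = infer_c_V(F_V, p_V)                      # ~0.057
corr = ecl_correction(c_V, alpha_V=0.8, alpha_notV=0.3,
                      dp_da=0.1, p_V=p_V, F_V=F_V)
print(f"c_V = {c_V:.4f}, ECL correction = {corr:.4f}")
```

The point is only that, under these simplifications, $c_V$ is pinned down by quantities the model already estimates, which is what lets the trick avoid adding new input parameters.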
We similarly derive the expressions for all other corrections.

Negative evidence

In fact, there's a further complication: our taking a marginal action not only gives us evidence for other civilizations taking that action, but also for them not taking the other available actions. To see why this should be the case in our setting, notice the following. If our estimates of the intermediate variables like $F_V$ had been "against the baseline of our correlated agents not taking...