Content provided by The Nonlinear Fund. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://ro.player.fm/legal.

AF - How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles and parrots by Owain Evans

14:39
 
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles & parrots, published by Owain Evans on March 28, 2024 on The AI Alignment Forum.

Summary

Large language models (LLMs) like ChatGPT and Claude 3 become increasingly truthful as they scale up in size and are finetuned for factual accuracy and calibration. However, the way LLMs arrive at truthful answers is nuanced. When an LLM answers a question immediately without chain-of-thought reasoning, the answer is typically not the result of the LLM reasoning about the question and weighing the evidence. Instead, the answer is based on human answers from pretraining documents that are (i) contextually relevant and (ii) resemble sources that led to truthful answers in finetuning. By contrast, when LLMs do explicit chain-of-thought reasoning before answering the question, the reasoning steps are more likely to causally determine the LLM's answer. This has parallels in human cognition. Many people can state Fermat's Theorem without having evaluated the proof themselves.

Does this mean LLMs just parrot humans when answering without chain-of-thought reasoning? No. LLMs don't mimic a single human's answers. They aggregate over many human answers, weighted by relevance and whether the source is correlated with truthfulness. This is loosely analogous to mechanisms that aggregate many human judgments and outperform most individual humans, such as ensembling forecasts, markets, PageRank, and Bayesian Truth Serum. Moreover, LLMs have some conceptual understanding of their answers, even if they did not evaluate the answers before giving them.

Epistemic Status: This essay is framed as a dialogue. There are no new experimental results but only my quick takes. Some of the takes are backed by solid evidence, while some are more speculative (as I indicate in the text).

How do LLMs give truthful answers?

Q: We'd like to have LLMs that are truthful, i.e. that systematically say true things and avoid saying false or inaccurate things wherever possible. Can we make LLMs like this?

Owain: Current finetuned models like GPT-4 and Claude 3 still make mistakes on obscure long-tail questions and on controversial questions. However, they are substantially more truthful than earlier LLMs (e.g. GPT-2 or GPT-3). Moreover, they are more truthful than their own base models, after being finetuned specifically for truthfulness (or "honesty" or "factuality") via RLHF. In general, scaling up models and refining the RLHF finetuning leads to more truthful models, i.e. models that avoid falsehoods when answering questions.

Q: But how does this work? Does the LLM really understand why the things it says are true, or why humans believe they are true?

Owain: This is a complicated question and needs a longer answer. It matters whether the LLM immediately answers the question with no Chain of Thought ("no-CoT") or whether it gets to think before answering ("CoT"). Let's start with the no-CoT case, as in Figure 1 above. Suppose we ask the LLM a question Q and it answers immediately with answer A. I suspect that the LLM does not answer with A because it has evaluated and weighed the evidence for A. Instead, it usually answers with A because A was the answer given in human texts like Wikipedia (and similar sources), which were upweighted by the model's pretraining and RLHF training.
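
As a toy illustration of this aggregation picture and of the ensembling analogy from the Summary (a minimal sketch only: the sources, weights, and aggregate_answer helper below are made up for illustration, and this is not a claim about actual LLM internals): each source contributes a candidate answer, weighted by how contextually relevant it is to Q and by how much that kind of source is correlated with truthfulness. A weighted vote over many such sources can beat any single source.

```python
# Toy model of the "aggregation" picture: the winning answer is the one backed
# by the most relevance-weighted, truthfulness-weighted sources.
# Purely illustrative; not a description of real LLM internals.
from collections import defaultdict

def aggregate_answer(candidates):
    """candidates: list of (answer, relevance, truthfulness_weight) triples,
    one per source (e.g. an encyclopedia entry, a forum post, a textbook)."""
    scores = defaultdict(float)
    for answer, relevance, truth_weight in candidates:
        scores[answer] += relevance * truth_weight
    return max(scores, key=scores.get)

# Hypothetical sources answering some factual question Q
sources = [
    ("1969", 0.9, 0.95),  # highly relevant, encyclopedia-like source
    ("1969", 0.6, 0.80),  # textbook excerpt
    ("1968", 0.7, 0.30),  # relevant but unreliable forum comment
    ("1968", 0.2, 0.30),  # off-topic, unreliable source
]

print(aggregate_answer(sources))  # -> "1969": the weighted vote, not any single source
```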
Sometimes A was not an existing human answer, and so the LLM has to go beyond the human data. (Note that how exactly LLMs answer questions is not fully understood and so what I say is speculative. See "Addendum" below for more discussion.) Now, after the LLM has given answer A, we can ask the LLM to verify the claim. For example, it can verify mathematical assertions by a proof and scientific claims by citing empirical evidence. The LLM will also make some asse...
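
The verification step described above can be pictured as a simple two-turn interaction: first ask for an immediate answer with no visible reasoning, then separately ask the model to justify the claim it just made. The sketch below is only illustrative; query_llm is a hypothetical placeholder for whatever chat API is actually used, and the prompts are examples rather than anything prescribed by the essay.

```python
# Sketch of the interaction pattern: get an immediate (no-CoT) answer, then
# separately ask the model to verify or justify that answer.
# `query_llm` is a hypothetical stand-in for a real chat-completion call.

def query_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to your LLM API of choice.")

def answer_then_verify(question: str) -> tuple[str, str]:
    # No-CoT: ask for the answer alone, leaving no room for visible reasoning.
    answer = query_llm(f"{question}\nAnswer with a single short phrase only.")
    # Verification: ask the model to justify the claim it just made,
    # e.g. by giving a proof or citing empirical evidence.
    justification = query_llm(
        f"Question: {question}\nProposed answer: {answer}\n"
        "Explain step by step why this answer is correct, or point out if it is not."
    )
    return answer, justification
```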