Sholto and Trenton



Had so much fun chatting with my good friends Trenton Bricken and Sholto Douglas on the podcast.

No way to summarize it, except: 

This is the best context dump out there on how LLMs are trained, what capabilities they're likely to soon have, and what exactly is going on inside them.

You would be shocked how much of what I know about this field, I've learned just from talking with them. To the extent that you've enjoyed my other AI interviews, now you know why.

Enjoy!

Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform.

There's a transcript with links to all the papers the boys were throwing down - may help you follow along.

Follow Trenton and Sholto on Twitter.

Timestamps

(00:00:00) - Long contexts

(00:16:12) - Intelligence is just associations

(00:32:35) - Intelligence explosion & great researchers

(01:06:52) - Superposition & secret communication

(01:22:34) - Agents & true reasoning

(01:34:40) - How Sholto & Trenton got into AI research

(02:07:16) - Are feature spaces the wrong way to think about intelligence?

(02:21:12) - Will interp actually work on superhuman models

(02:45:05) - Sholto’s technical challenge for the audience

(03:03:57) - Rapid fire

Transcript

Edited by Teddy Kim, with lots of helpful links

00:00:00 - Long contexts

Dwarkesh Patel 00:00:00

Okay, today I have the pleasure to talk with two of my good friends, Sholto and Trenton.

Noam Brown, who wrote the Diplomacy paper, said this about Sholto: “he's only been in the field for 1.5 years, but people in AI know that he was one of the most important people behind Gemini's success.” And Trenton, who's at Anthropic, works on mechanistic interpretability and it was widely reported that he has solved alignment.

So this will be a capabilities only podcast. Alignment is already solved, no need to discuss further.

Let's start by talking about context lengths. It seemed underhyped, given how important it seems to me that you can just put a million tokens into context. There's apparently some other news that got pushed to the front for some reason, but tell me about how you see the future of long context lengths and what that implies for these models.

Sholto Douglas 00:01:28

So I think it's really underhyped. Until I started working on it, I didn't really appreciate how much of a step up in intelligence it was for the model to have the onboarding problem basically instantly solved.

You can see that a bit in the perplexity graphs in the paper, where just throwing millions of tokens' worth of context about a code base allows it to become dramatically better at predicting the next token, in a way that you'd normally associate with huge increments in model scale. But you don't need that. All you need is a new context. So underhyped and buried by some other news.

Dwarkesh Patel 00:01:58

In context, are they as sample efficient and smart as humans?

Sholto Douglas 00:02:02

I think that's really worth exploring. For example, one of the evals that we did in the paper had it learn a language in context better than a human expert could, over the course of a couple of months.

This is only a small demonstration but I'd be really interested to see things like Atari games where you throw in a couple hundred, or a thousand frames, of labeled actions in the same way that you'd show your friend how to play a game and see if it's able to reason through.

It might. At the moment, with the infrastructure and stuff, it's still a bit slow at doing that, but I would actually guess that it might just work out of the box in a way that would be pretty mind-blowing.

Trenton Bricken 00:02:38

And crucially, I think this language was esoteric enough that it wasn't in the training data.

Sholto Douglas 00:02:42

Exactly. If you look at the model before it has that context thrown in, it doesn't know the language at all and it can't get any translations.

Dwarkesh Patel 00:02:49

And this is an actual human language?

Sholto Douglas 00:02:51

Exactly. An actual human language.

Dwarkesh Patel 00:02:53

So if this is true, it seems to me that these models are already, in an important sense, superhuman. Not in the sense that they're smarter than us, but I can't keep a million tokens in my context when I'm trying to solve a problem, remembering and integrating all the information, an entire code base. Am I wrong in thinking this is a huge unlock?

Sholto Douglas 00:03:14

Actually, I generally think that's true. Previously, I've been frustrated when models aren't as smart, when you ask them a question and you want it to be smarter than you or to know things that you don't. This allows them to know things that you don't. It just ingests a huge amount of information in a way you just can't. So it's extremely important. 

Dwarkesh Patel 00:03:33

Well, how do we explain in-context learning?

Sholto Douglas 00:03:35

There's a line of work I quite like, where it looks at in-context learning as basically very similar to gradient descent, but the attention operation can be viewed as gradient descent on the in-context data. That paper had some cool plots where they basically showed “we take n steps of gradient descent and that looks like n layers of in-context learning, and it looks very similar.” So I think that's one way of viewing it and trying to understand what's going on. 

Trenton Bricken 00:03:59

You can ignore what I'm about to say because, given the introduction, alignment is solved and AI safety isn't a problem.

I think the context stuff does get problematic, but also interesting here. I think there'll be more work coming out in the not-too-distant future around what happens if you give a hundred-shot prompt for jailbreaks or adversarial attacks. It's also interesting in the sense that, if your model is doing gradient descent and learning on the fly, even if it's been trained to be harmless, you're dealing with a totally new model in a way. You're fine-tuning, but in a way where you can't control what's going on.

Dwarkesh Patel 00:04:41

Can you explain? What do you mean by gradient descent happening in the forward pass and attention?

Trenton Bricken 00:04:45

There was something in the paper about trying to teach the model to do linear regression but just through the number of samples or examples they gave in the context. And you can see if you plot on the x-axis the number of shots that it has, then the loss it gets on ordinary least squares regression will go down with time.

Sholto Douglas 00:05:04

And it goes down exactly matched with the number of gradient descent steps. 

Trenton Bricken 00:05:08

Yeah, exactly.
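
As a rough illustration of the comparison being described (my own toy sketch, not code from the paper they're referring to), here are the two curves side by side: test loss on a synthetic linear-regression task as a function of how many in-context examples a learner gets, and as a function of how many explicit gradient descent steps it takes. The claim in that line of work is that a trained transformer's in-context loss curve tracks the gradient-descent curve.

```python
# Sketch only: the transformer itself is omitted; this just produces the two
# baseline curves the in-context learning papers compare against.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # dimensionality of the regression task
w_true = rng.normal(size=d)

def task(n):
    """Sample n (x, y) pairs from the ground-truth linear function."""
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

X_test, y_test = task(256)

def test_loss(w):
    return float(np.mean((X_test @ w - y_test) ** 2))

# (a) loss vs number of "shots": least-squares fit on k in-context examples
for k in [2, 4, 8, 16, 32, 64]:
    X, y = task(k)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"{k:3d} shots    -> test loss {test_loss(w_hat):.3f}")

# (b) loss vs number of gradient descent steps on a fixed 64-example set
X, y = task(64)
w = np.zeros(d)
lr = 0.05
for step in range(1, 65):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= lr * grad
    if step in (1, 2, 4, 8, 16, 32, 64):
        print(f"{step:3d} GD steps -> test loss {test_loss(w):.3f}")
```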

Dwarkesh Patel 00:05:09

I only read the intro and discussion section of that paper. But in the discussion, the way they framed it is that the model, in order to get better at long-context tasks, has to get better at learning to learn from these examples or from the context that is already within the window.

And the implication of that is, if meta-learning happens because it has to learn how to get better at long-context tasks, then in some important sense the task of intelligence requires long-context examples and long-context training. 

Sholto Douglas 00:05:45

Understanding how to better induce meta-learning in your pre-training process is a very important thing about flexible or adaptive intelligence.

Dwarkesh Patel 00:05:53

Right, but you can proxy for that just by getting better at doing long-term context tasks. One of the bottlenecks for AI progress that many people identify is the inability of these models to perform tasks on long horizons, engaging with the task for many hours, or even many weeks or months, where they’re an assistant or an employee and they can just do a thing I tell them to do for a while. AI agents haven't taken off for this reason from what I understand.

So how linked are long context windows, and the ability to perform well on them, and the ability to do these kinds of long-horizon tasks that require you to engage with an assignment for many hours? Or are these unrelated concepts?

Sholto Douglas 00:06:36

I would take issue with that being the reason that agents haven't taken off. I think that's more about nines of reliability and the model actually successfully doing things. If you can't chain tasks successively with high enough probability, then you won't get something that looks like an agent. And that's why something like an agent might follow more of a step function.

In GPT-4 class models, Gemini Ultra class models, they're not enough. But maybe the next increment on model scale means that you get that extra nine. Even though the loss isn't going down that dramatically, that small amount of extra ability gives you the extra. Obviously you need some amount of context to fit long-horizon tasks, but I don't think that's been the limiting factor up to now.

Trenton Bricken 00:07:16

The NeurIPS best paper this year, with Rylan Schaeffer as the lead author, points to this as emergence being a mirage. People will have a task and you get the right or wrong answer depending on whether you've sampled the last five tokens correctly. So naturally you're multiplying the probability of sampling all of those, and if you don't have enough nines of reliability, then you're not going to get emergence.

And all of a sudden you do and it's, “oh my gosh, this ability is emergent,” when actually it was kind of there to begin with.

Sholto Douglas 00:07:47

And there are ways that you can find a smooth metric for that.

Dwarkesh Patel 00:07:50

HumanEval or whatever. In the GPT-4 paper, the coding problems they have, they measure–

Sholto Douglas 00:07:56

Log pass rates

Dwarkesh Patel 00:07:57

Exactly. For the audience, basically the idea is that when you're measuring how much progress there has been on a specific task, such as solving coding problems, when it gets it right only one in a thousand times you don't just score it as a failure; you credit it with that one-in-a-thousand success rate, like, "oh, it got it right some of the time." And so the curve you see is, it gets it right one in a thousand, then one in a hundred, then one in ten, and so forth.

So I want to follow up on this. If your claim is that AI agents haven't taken off because of reliability rather than long-horizon task performance, isn't that lack of reliability, when a task is chained on top of another task, on top of another task, exactly the difficulty with long-horizon tasks? You have to do ten things in a row or a hundred things in a row, diminishing the reliability of any one of them. The probability goes down from 99.99% to 99.9%. Then the whole thing gets multiplied together and the whole thing has become so much less likely to happen.
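
The arithmetic behind this point is worth making explicit. A toy sketch (illustrative numbers, not measurements from any model): if each step of a task succeeds independently with probability p, the chance of getting through an n-step chain is roughly p raised to the n, which is why an extra nine of per-step reliability matters so much for long-horizon work.

```python
# Chance of completing an n-step task without an unrecovered error ~ p ** n.
steps = [1, 10, 100, 1000]
for p in [0.9, 0.99, 0.999, 0.9999]:
    row = ", ".join(f"{n:4d} steps: {p ** n:7.1%}" for n in steps)
    print(f"per-step success {p:.4f} -> {row}")
```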

Sholto Douglas 00:08:59

That is exactly the problem. But the key issue you're pointing out there is that your base task solve rate is 90%. If it was 99%, then chaining doesn't become a problem. I think this is also something that just hasn't been properly studied. If you look at the academic evals, it's a single problem. Like the math problem, it's one typical math problem, it's one university-level problem from across different topics. You're beginning to see evals look at this properly via more complex tasks like SWE-bench, where they take a whole bunch of GitHub issues. That is a reasonably long-horizon task, but it's still sub-hour as opposed to a multi-hour or multi-day task.

So I think one of the things that will be really important to do next is understand better what success rate over long-horizon tasks looks like. I think that's even important to understand what the economic impact of these models might be and properly judge increasing capabilities. Cutting down the tasks and the inputs/outputs involved into minutes or hours or days and seeing how good it is at successively chaining and completing tasks of those different resolutions of time. Then that tells you how automated a job family or task family will be in a way that MMLU scores don't.

Trenton Bricken 00:10:18

It was less than a year ago that we introduced 100K context windows and I think everyone was pretty surprised by that. Everyone had this soundbite of, “quadratic attention costs, so we can't have long context windows.” And here we are. The benchmarks are being actively made.

Dwarkesh Patel 00:10:36

Wait, doesn't the fact that there are these companies–Google, Magic, maybe others–who have million token attention imply that it's not quadratic anymore? Or are they just eating the cost?

Sholto Douglas 00:10:50

Well, who knows what Google is doing for its long context game? One thing has frustrated me about the general research field's approach to attention. There's an important way in which the quadratic cost of attention is actually dominated in typical dense transformers by the MLP block. So you have this n squared term that's associated with attention, but you also have a term that's quadratic in the d_model, the residual stream dimension of the model, multiplied by the sequence length.

I think Sasha Rush has a great tweet where he basically plots the curve of the cost of attention relative to the cost of really large models, and attention actually trails off. You actually need to be doing pretty long context before that term becomes really important.

The second thing is that people often talk about how attention at inference time is such a huge cost. When you're actually generating tokens, the operation is not n squared. One set of Q-vectors looks up a whole bunch of KV-vectors and that's linear with respect to the amount of context that the model has.

So I think this drives a lot of the recurrence and state space research where people have this meme of linear attention. And as Trenton said, there's a graveyard of ideas around attention. That’s not to say I don't think it's worth exploring, but I think it's important to consider why and where the actual strengths and weaknesses of it are.
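
A back-of-the-envelope sketch of the FLOP counting behind that point (standard dense-transformer approximations and my own arithmetic, not numbers from any particular model): per layer, the attention map costs on the order of n-squared times d_model, while the MLP block costs on the order of n times d_model-squared, so the quadratic term only starts to dominate once the sequence length is several multiples of the model dimension.

```python
# Rough per-layer FLOP counts, ignoring the QKV/output projections.
def attn_flops(n, d):
    return 4 * n * n * d                      # QK^T scores + mixing the values

def mlp_flops(n, d, expansion=4):
    return 2 * n * d * (expansion * d) * 2    # up-projection and down-projection

d_model = 8192                                # illustrative model width
for n in [2_000, 8_000, 32_000, 128_000, 1_000_000]:
    ratio = attn_flops(n, d_model) / mlp_flops(n, d_model)
    print(f"n = {n:>9,}   attention/MLP FLOP ratio ~ {ratio:.2f}")
```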

Dwarkesh Patel 00:12:21

Okay, what do you make of this take? As we move forward through the takeoff, more and more of the learning happens in the forward pass. So originally all the learning happens in the bottom-up, hill-climbing evolutionary process. Let's say during the intelligence explosion the AI is maybe handwriting the weights or doing GOFAI or something. And we're in the middle step, where a lot of learning now happens in-context with these models, even though a lot of it still happens within the backward pass. Does this seem like a meaningful gradient along which progress is happening?

The broader thing being that if you're learning in the forward pass, it's much more sample efficient because you can basically think as you're learning. Like when you read a textbook, you're not just skimming it and trying to absorb inductively, “these words follow these words.” You read it and you think about it, and then you read some more and you think about it some more. Does this seem like a sensible way to think about the progress?

Sholto Douglas 00:13:23

It may just be like how birds and planes fly, but they fly slightly differently. The virtue of technology allows us to accomplish things that birds can't. It might be that context length is similar in that it allows it to have a working memory that we can't, but functionally is not the key thing towards actual reasoning.

The key step between GPT-2 and GPT-3 was that all of a sudden there was this meta-learning behavior that was observed in training, in the pre-training of the model. And that has, as you said, something to do with how, if you give it some amount of context, it's able to adapt to that context. That was a behavior that wasn't really observed before that at all. And maybe that's a mixture of a property of context and scale and this kind of stuff. But it wouldn't have occurred with a tiny context, I would say.

Dwarkesh Patel 00:14:09

This is actually an interesting point. So when we talk about scaling up these models, how much of it comes from just making the models themselves bigger? And how much comes from the fact that during any single call you are using more compute?

So if you think of diffusion, you can just iteratively keep adding more compute. If adaptive compute is solved, you can keep doing that. And in this case, if there's a quadratic penalty for attention but you're doing long context anyways, then you're still dumping in more compute (and not just by having bigger models).

Trenton Bricken 00:14:46

It's interesting because you do get more forward passes by having more tokens. My one gripe–I guess I have two gripes with this though, maybe three.

So in the AlphaFold paper, one of the transformer modules–they have a few and the architecture is very intricate–but they do, I think, five forward passes through it and will gradually refine their solution as a result.

You can also kind of think of the residual stream, Sholto alluded to the read-write operations, as a poor man's adaptive compute. Where it's just going to give you all these layers and if you want to use them, great. If you don't, then that's also fine. Then people will be like, “oh the brain is recurrent and you can do however many loops through it you want.”

I think to a certain extent, that's right. If I ask you a hard question, you'll spend more time thinking about it and that would correspond to more forward passes. But I think there's a finite number of forward passes that you can do. It's the same with language: people are like, "oh, human language can have infinite recursion in it," like infinite nested statements of "the boy jumped over the bear, that was doing this, that had done this, that had done that…"

But empirically, you'll only see five to seven levels of recursion, which relates to that magic number of how many things you can hold in working memory at any given time. So it's not infinitely recursive, but does that matter in the regime of human intelligence? And can you not just add more layers? 

00:16:12 - Intelligence is just associations

Dwarkesh Patel 00:16:12

Can you break it down for me? You've referred to this in some of your previous answers: taking in these long contexts and holding more things in memory. But ultimately it comes down to your ability to mix concepts together to do some kind of reasoning, and these models aren't necessarily human-level at that, even in context.

Break down for me how you see just storing raw information versus reasoning and what's in between. Like, where's the reasoning happening? Where is this raw information storage happening? What's different between them in these models?

Trenton Bricken 00:16:46

I don't have a super crisp answer for you here. Obviously with the input and output of the model, you're mapping back to actual tokens. And then in between that you're doing higher level processing. 

Dwarkesh Patel 00:17:01

Before we get deeper into this, we should explain to the audience. You referred earlier to Anthropic's way of thinking about transformers as these read-write operations that layers do.

One of you should just kind of explain at a high level what you mean by that. 

Trenton Bricken 00:17:15

So for the residual stream, imagine you're in a boat going down a river and the boat is the current query where you're trying to predict the next token. So it's “the cat sat on the _____.” And then you have these little streams that are coming off the river where you can get extra passengers or collect extra information if you want. And those correspond to the attention heads and MLPs that are part of the model. 

Sholto Douglas 00:17:41

I almost think of it like the working memory of the model, like the RAM of the computer, where you're choosing what information to read in so you can do something with it and then maybe read something else in later on.

Trenton Bricken 00:17:54

And you can operate on subspaces of that high-dimensional vector. At this point, I think it's almost given that a ton of things are encoded in superposition. So the residual stream is just one high-dimensional vector, but actually there's a ton of different vectors that are packed into it.
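
For readers who want this picture in code, here is a minimal pre-norm transformer block in PyTorch (an illustrative sketch, not any production architecture): the residual stream x is the vector that flows through, and each attention and MLP block reads from it, computes something, and adds its output back in, possibly writing into a different subspace of the same vector.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):                      # x: (batch, seq, d_model), the stream
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)  # causal mask omitted for brevity
        x = x + a                              # attention "writes" back into the stream
        x = x + self.mlp(self.ln2(x))          # the MLP writes its contribution too
        return x                               # the stream flows on to the next layer
```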

Dwarkesh Patel 00:18:12

To dumb it down, a way that would have made sense to me a few months ago is that you have the words that are the input into the model. All those words get converted into these tokens and those tokens get converted into these vectors. And basically, it's just this small amount of information that's moving through the model.

And the way you explained it to me, Sholto, this paper talks about how early on in the model, maybe it's just doing some very basic things about, “what do these tokens mean?” Like if it says ten plus five, just moving information to have that good representation. And in the middle, maybe the deeper thinking is happening about “how to solve this.” At the end, you're converting it back into the output token because the end product is that you're trying to predict the probability of the next token from the last of those residual streams. So it's interesting to think about the small compressed amount of information moving through the model and how it's getting modified in different ways.

Trenton, you're one of the few people who have a background in neuroscience. So you can think about the analogies here to the brain. And in fact, you had a paper in grad school about attention in the brain, and one of our friends said this is the only, or first, neural explanation of why attention works. Whereas we have evidence for why CNNs, convolutional neural networks, work based on the visual cortex or something.

Do you think in the brain there is something like a residual stream of compressed information that's moving through and getting modified as you're thinking about something? Even if that's not what's literally happening, do you think that's a good metaphor for what's happening in the brain?

Trenton Bricken 00:20:04

At least in the cerebellum, you basically do have a residual stream in what we'll call the attention model for now–and I can go into whatever amount of detail you want for that–where you have inputs that route through it, but they'll also just go directly to the endpoint that that module will contribute to. So there's a direct path and an indirect path, and the model can pick up whatever information it wants and then add that back in.

Dwarkesh Patel 00:20:35

What happens in the cerebellum? 

Trenton Bricken 00:20:37

So the cerebellum nominally just does fine motor control but I analogize this to the person who's lost their keys and is just looking under the streetlight where it's very easy to observe this behavior. One leading cognitive neuroscientist said to me that a dirty little secret of any fMRI study, where you're looking at brain activity for a given task, is that the cerebellum is almost always active and lighting up for it. If you have a damaged cerebellum, you also are much more likely to have autism so it's associated with social skills. In one particular study, where I think they use PET instead of fMRI, when you're doing “next token prediction” the cerebellum lights up a lot. Also, 70% of your neurons in the brain are in the cerebellum. They're small but they're there and they're taking up real metabolic cost. 

Dwarkesh Patel 00:21:29

This was one of Gwern's points, that what changed with humans was not just that we have more neurons, but specifically that there are more neurons in the cerebral cortex and the cerebellum, and they're more metabolically expensive and more involved in signaling and sending information back and forth. Is that attention? What's going on?

Trenton Bricken 00:21:52

So back in the 1980s, Pentti Kanerva came up with an associative memory algorithm. You have a bunch of memories. You want to store them. There's some amount of noise or corruption that's going on and you want to query or retrieve the best match. And so he wrote this equation for how to do it and a few years later realized that if you implemented this as an electrical engineering circuit, it actually looks identical to the core cerebellar circuit. 

And that circuit, and the cerebellum more broadly, is not just in us, it's in basically every organism. There's active debate on whether or not cephalopods have it, they kind of have a different evolutionary trajectory. But even for fruit flies with the Drosophila mushroom body, that is the same cerebellar architecture.

That convergence, and then my paper, which shows that this attention operation is actually a very close approximation to that associative memory, including implementing the softmax and having these nominal quadratic costs that we've been talking about. So the three-way convergence here, and the takeoff and success of transformers, just seems pretty striking to me.
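
A toy sketch of that correspondence (my own illustration, loosely in the spirit of that result rather than a reproduction of it): treat a set of stored key/value pairs as the memory, query it with a corrupted cue, and softmax attention returns a blend dominated by the best-matching stored value.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_memories = 64, 32
keys = rng.normal(size=(n_memories, d))      # stored cues
values = rng.normal(size=(n_memories, d))    # stored contents

def retrieve(query, beta=8.0):
    """Softmax attention over stored keys; beta sharpens the match."""
    scores = keys @ query / np.sqrt(d)
    weights = np.exp(beta * scores - np.max(beta * scores))
    weights /= weights.sum()
    return weights @ values, weights

target = 7
noisy_query = keys[target] + 0.3 * rng.normal(size=d)   # a corrupted memory cue
out, w = retrieve(noisy_query)
print("most attended memory:", int(np.argmax(w)))        # usually the target index
cos = out @ values[target] / (np.linalg.norm(out) * np.linalg.norm(values[target]))
print("cosine similarity to the stored value:", float(cos))
```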

Dwarkesh Patel 00:23:04

I want to zoom out. I think what motivated this discussion in the beginning was we were talking about, “what is the reasoning? What is the memory? What do you think about the analogy you found to attention and this?”

Do you think of this more as just looking up the relevant memories or the relevant facts? And if that is the case, where is the reasoning happening in the brain? How do we think about how that builds up into the reasoning? 

Trenton Bricken 00:23:33

Maybe my hot take here, I don't know how hot it is, is that most intelligence is pattern matching, and you can do a lot of really good pattern matching if you have a hierarchy of associative memories. You start with your very basic associations between just objects in the real world. You can then chain those and have more abstract associations, such as a wedding ring symbolizing so many other associations that are downstream. You can even generalize this and view the MLP layer as an associative memory as well. That's the long-term setting, where you don't have the tokens in your current context, but I think this is an argument that association is all you need.

With associative memory in general, you can do two things with it. You can denoise or retrieve a current memory. So if I see your face but it's raining and cloudy, I can denoise and gradually update my query towards my memory of your face. But I can also access that memory, and then the value that I get out actually points to some other totally different part of the space.

A very simple instance of this would be if you learn the alphabet. So I query for A and it returns B, I query for B and it returns C, and you can traverse the whole thing. 

Dwarkesh Patel 00:25:02

One of the things I talked to Demis about was a paper he had in 2008 that memory and imagination are very linked because of this very thing that you mentioned, that memory is reconstructive. So you are, in some sense, imagining every time you're thinking of a memory because you're only storing a condensed version of it and you have to. This is famously why human memory is terrible and why people in the witness box or whatever would just make shit up.

So let me ask a stupid question. So you read Sherlock Holmes and the guy's incredibly sample efficient. He'll see a few observations and he'll basically figure out who committed the crime because there's a series of deductive steps that leads from somebody's tattoo and what's on the wall to the implications of that. How does that fit into this picture? Because crucially, what makes him smart is that there's not just an association, but there's a sort of deductive connection between different pieces of information. Would you just explain it as, that's just higher level association?

Trenton Bricken 00:26:11

I think so. I think it's learning these higher-level associations to be able to then map patterns onto each other, as a kind of meta-learning. I think in this case, he would also just have a really long context length, or a really long working memory, where he can hold all of these bits and continuously query them as he's coming up with some theory, so that the theory is moving through the residual stream. And then his attention heads are querying his context. But then how he's projecting his queries and keys in the space, and how his MLPs are then retrieving longer-term facts or modifying that information, is allowing him in later layers to do even more sophisticated queries and slowly be able to reason through and come to a meaningful conclusion.

Sholto Douglas 00:27:00

That feels right to me. You're looking back in the past. You're selectively reading in certain pieces of information, comparing them, and maybe that informs your next step of what piece of information you now need to pull in. Then you build this representation, which progressively looks closer and closer to the suspect in your case. That doesn't feel at all outlandish.

Trenton Bricken 00:27:20

I think that the people who aren't doing this research can overlook how, after the first layer of the model, every query, key, and value that you're using for attention comes from the combination of all the previous tokens.

So in my first layer, I'll query my previous tokens and just extract information from them. But all of a sudden, let's say that I attended to tokens 1, 2, and 4 in equal amounts. Then the vector in my residual stream–assuming that they wrote out the same thing to the value vectors, but ignore that for a second–is a third of each of those. So when I'm querying in the future, my query is actually a third of each of those things.

Sholto Douglas 00:28:03

But they might be written to different subspaces. 

Trenton Bricken 00:28:05

That's right. Hypothetically, but they wouldn't have to. You can recombine and immediately, even by layer two and certainly by the deeper layers, just have these very rich vectors that are packing in a ton of information. And the causal graph is literally over every single layer that happened in the past. That's what you're operating on. 

Sholto Douglas 00:28:25

Yeah, it does bring to mind a very funny eval to do, a Sherlock Holmes eval. You put the entire book into context and then you have a sentence which is, “the suspect is X.” Then you have a larger probability distribution over the different characters in the book.
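
A rough sketch of what such an eval could look like, with everything hypothetical: the model name is a stand-in (a real run would need a long-context model and a genuinely held-out book), and the candidate names are placeholders. The idea is just to compare the log-probability the model assigns to each character's name after "The suspect is".

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                           # placeholder; swap in a long-context model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

book_text = "..."                             # the full (held-out) mystery novel
prompt = book_text + "\n\nThe suspect is"
candidates = ["Alice", "Bob", "the butler"]   # hypothetical characters from the book

def candidate_logprob(candidate: str) -> float:
    """Sum of log-probabilities of the candidate's tokens, given the prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    cand_ids = tok(" " + candidate, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, cand_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)   # position i predicts token i+1
    start = prompt_ids.shape[1] - 1
    cand = ids[0, prompt_ids.shape[1]:]
    return float(logprobs[start:start + cand.shape[0]].gather(1, cand[:, None]).sum())

scores = {c: candidate_logprob(c) for c in candidates}
print(sorted(scores.items(), key=lambda kv: -kv[1]))        # most likely suspect first
```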

Trenton Bricken 00:28:41

That would be super cool. 

Sholto Douglas 00:28:44

I wonder if you'd get anything at all. 

Dwarkesh Patel 00:28:47

Sherlock Holmes is probably already in the training data. You gotta get a mystery novel that was written in the–

Trenton Bricken 00:28:52

You can get an LLM to write it.

Sholto Douglas 00:28:53

Or we could purposely exclude it, right?

Dwarkesh Patel 00:28:56

Oh, we can? How do you? 

Trenton Bricken 00:28:57

Well, you need to scrub any discussion of it from Reddit or anywhere else.

Sholto Douglas 00:29:00

Right, it's hard. That's one of the challenges that goes into things like long-context evals, getting a good one. You need to know that it's not in your training data. You just put in the effort to exclude it.

Dwarkesh Patel 00:29:10

There are two different threads I want to follow up on. Let's go to the long-context one and then we'll come back to this. In the Gemini 1.5 paper, the eval that was used was whether it could recall something from within Paul Graham's essays.

Sholto Douglas 00:29:28

Yeah, the needle in a haystack.

Dwarkesh Patel 00:29:30

I mean, we don't necessarily just care about its ability to recall one specific fact from the context.

I'll step back and ask the question. The loss function for these models is unsupervised. You don't have to come up with these bespoke things that you keep out of the training data.

Is there a way you can do a benchmark that's also unsupervised, where another LLM is rating it in some way or something like that. Maybe the answer is that if you could do this, reinforcement learning would work.

Sholto Douglas 00:30:05

I think people have explored that kind of stuff. For example, Anthropic has the constitutional RL paper where they take another language model and they point it and say, “how helpful or harmless was that response?” Then they get it to update and try and improve along the Pareto frontier of helpfulness and harmfulness.

So you can point language models at each other and create evals in this way. It's obviously an imperfect art form at the moment, because you basically get reward function hacking. Even humans are imperfect here. Humans typically prefer longer answers, which aren't necessarily better answers, and you get the same behavior with models.

Dwarkesh Patel 00:30:48

Going back to the Sherlock Holmes thing, if it's all associations all the way down, does that mean we should be less worried about super intelligence? Because there's not this sense in which it's like Sherlock Holmes++. It'll still need to just find these associations, like humans find associations. It's not able to just see a frame of the world and then it's figured out all the laws of physics. 

Trenton Bricken 00:31:20

This is a very legitimate response. It's, "if you say humans are generally intelligent, then artificial general intelligence is no more capable or competent." I'm just worried that you have that level of general intelligence in silicon. You can then immediately clone hundreds of thousands of agents, and they don't need to sleep, and they can have super long context windows, and then they can start recursively improving, and then things get really scary. So I think to answer your original question, you're right, they would still need to learn associations.

Dwarkesh Patel 00:31:52

But wait, if intelligence is fundamentally about these associations, the recursive self-improvement is just them getting better at association. There's not another thing that's happening. So then it seems like you might disagree with the intuition that they can't be that much more powerful, if they're just doing that. 

Trenton Bricken 00:32:11

I think then you can get into really interesting cases of meta-learning. When you play a new video game or study a new textbook, you're bringing a whole bunch of skills to the table to form those associations much more quickly. And because everything in some way ties back to the physical world, I think there are general features that you can pick up and then apply in novel circumstances.

00:32:35 - Intelligence explosion & great researchers

Dwarkesh Patel 00:32:35

Should we talk about the intelligence explosion then? The reason I'm interested in discussing this with you guys in particular is that the models of the intelligence explosion we have so far come from economists.

That’s fine but I think we can do better because in the model of the intelligence explosion, what happens is you replace the AI researchers. There's a bunch of automated AI researchers who can speed up progress, make more AI researchers, and make further progress. If that's the mechanism, we should just ask the AI researchers whether they think this is plausible. So let me just ask you, if I have a thousand agent Sholtos or agent Trentons, do you think that you get an intelligence explosion? What does that look like to you? 

Sholto Douglas 00:33:32

I think one of the important bounding constraints here is compute. I do think you could dramatically speed up AI research. It seems very clear to me that in the next couple of years, we'll have things that can do many of the software engineering tasks that I do on a day to day basis, and therefore dramatically speed up my work, and therefore speed up the rate of progress.

At the moment, I think most of the labs are somewhat compute bound in that there are always more experiments you could run and more pieces of information that you could gain in the same way that scientific research on biology is somewhat experimentally throughput-bound. You need to run and culture the cells in order to get the information.

I think that will be at least a short term planning constraint. Obviously, Sam's trying to raise $7 trillion to buy chips and it does seem like there's going to be a lot more compute in the future as everyone is heavily ramping. NVIDIA's stock price sort of represents the relative compute increase. Any thoughts? 

Trenton Bricken 00:34:36

I think we need a few more nines of reliability in order for it to be really useful and trustworthy. And we need context lengths that are super long and very cheap to have. If I'm working in our code base, it's really only small modules that I can get Claude to write for me right now. But it's very plausible that within the next few years, or even sooner, it can automate most of my tasks.

The only other thing here that I will note is that the research our interpretability subteam is working on is so early-stage. You really have to be able to make sure everything is done correctly in a bug-free way and contextualize the results with everything else in the model. If something isn't going right, you have to be able to enumerate all of the possible things, and then slowly work on those.

An example that we've publicly talked about in previous papers is dealing with layer norm. If I'm trying to get an early result or look at the logit effects of the model, if I activate this feature that we've identified to a really large degree, how does that change the output of the model? Am I using layer norm or not? How is that changing the feature that's being learned? That will take even more context or reasoning abilities for the model.

Dwarkesh Patel 00:36:04

You used a couple of concepts together. It's not self-evident to me that they're the same but it seemed like you were using them interchangeably. One was working on the Claude code base and making more modules based on that, they need more context or something. It seems like they might already be able to fit in the context or do you mean context like “the context window?”

Trenton Bricken 00:36:30

Yeah, the “context window” context.

Dwarkesh Patel 00:36:32

So it seems like the thing that's preventing it from making good modules is not the lack of being able to put the code base in there.

Trenton Bricken 00:36:39

I think that will be there soon. 

Dwarkesh Patel 00:36:41

But it's not going to be as good as you at coming up with papers because it can fit the code base in there. 

Trenton Bricken 00:36:46

No, but it will speed up a lot of the engineering.

Dwarkesh Patel 00:36:48

In a way that causes an intelligence explosion?

Trenton Bricken 00:36:53

No, in a way that accelerates research. But I think these things compound. The faster I can do my engineering, the more experiments I can run. And the more experiments I can run, the faster we can… I mean, my work isn't actually accelerating capabilities at all, it's just interpreting the models. But we have a lot more work to do on that, surprising as that may be to the Twitter guy.

Dwarkesh Patel 00:37:14

For context, when you released your paper, there was a lot of talk on Twitter like, “alignment is solved guys. Close the curtains.”

Trenton Bricken 00:37:24

Yeah, no it keeps me up at night how quickly the models are becoming more capable and just how poor our understanding of what's going on still is.

Dwarkesh Patel 00:37:36

Let's run through the specifics here. By the time this is happening, we have bigger models that are two to four orders of magnitude bigger, or at least have two to four orders of magnitude more effective compute. So for this idea that you can run experiments faster: in this version of the intelligence explosion, you're having to retrain that model. The recursive self-improvement is different from what might've been imagined 20 years ago, where you just rewrite the code. You actually have to train a new model and that's really expensive.

Not only now, but especially in the future, as you keep making these models orders of magnitude bigger. Doesn't that dampen the possibility of a recursive self-improvement type of intelligence explosion?

Sholto Douglas 00:38:25

It's definitely going to act as a braking mechanism. I agree that the world of what we're making today looks very different from what people imagined it would look like 20 years ago. It's not going to be able to just rewrite its own code to become really smart, because actually it needs to train itself. The code itself is typically quite simple, typically really small and self-contained.

I think John Carmack had this nice phrase where it's the first time in history where you can plausibly imagine writing AI with 10,000 lines of code. That actually does seem plausible when you pare most training codebases down to the limit. But it doesn't take away from the fact that this is something where we should really strive to measure and estimate how progress might be.

We should be trying very, very hard to measure exactly how much of a software engineer's job is automatable, and what the trend line looks like, and be trying our hardest to project out those trend lines.

Dwarkesh Patel 00:39:21

But with all due respect to software engineers, you are not writing, like, a React front-end, right?

What is concretely happening? Maybe you can walk me through a day in the life of Sholto. You're working on an experiment or project that's going to make the model "better." What is happening from observation to experiment, to theory, to writing the code? What is happening?

Sholto Douglas 00:39:48

I think it’s important to contextualize here that I've primarily worked on inference so far. A lot of what I've been doing is just helping guide the pre-training process, designing a good model for inference and then making the model and the surrounding system faster. I've also done some pre-training work around that, but it hasn't been my 100% focus. I can still describe what I do when I do that work.

Dwarkesh Patel 00:40:09

Sorry, let me interrupt. When Carl Shulman was talking about it on the podcast, he did say that things like improving inference or even literally making better chips or GPUs, that’s part of the intelligence explosion. Obviously if the inference code runs faster, it happens better or faster or whatever. Sorry, go ahead. 

Sholto Douglas 00:40:32

So concretely, what does a day look like? I think the most important part to illustrate is this cycle of coming up with an idea, proving it out at different points in scale, and interpreting and understanding what goes wrong. I think most people would be surprised to learn just how much goes into interpreting and understanding what goes wrong.

People have long lists of ideas that they want to try. Not every idea that you think should work will work. Trying to understand why that is is quite difficult, as is working out what exactly you need to do to interrogate it. So a lot of it is introspection about what's going on. It's not pumping out thousands and thousands and thousands of lines of code. It's not the difficulty in coming up with ideas. Many people have a long list of ideas that they want to try, but paring that down and shot-calling, under very imperfect information, which ideas are the right ones to explore further is really hard.

Dwarkesh Patel 00:41:32 德瓦克什·帕特尔 00:41:32

What do you mean by imperfect information? Are these early experiments? What is the information?
不完美的信息是什么意思?这些是早期实验吗?信息是什么?

Sholto Douglas 00:41:40 肖尔托·道格拉斯 00:41:40

Demis mentioned this in his podcast. It's like the GPT-4 paper where you have scaling law increments. You can see in the GPT-4 paper, they have a bunch of dots, right?
Demis 在他的播客中提到了这一点。这就像 GPT-4 论文一样,你有缩放定律增量。你可以在 GPT-4 论文中看到,它们有一堆点,对吧?

They say we can estimate the performance of our final model using all of these dots and there's a nice curve that flows through them. And Demis mentioned that we do this process of scaling up.
他们说,我们可以使用所有这些点来估计最终模型的性能,并且有一条漂亮的曲线流过它们。Demis 提到我们做了这个扩大规模的过程。

Concretely, why is that imperfect information? It's because you never actually know if the trend will hold. For certain architectures the trend has held really well. And for certain changes, it's held really well. But that isn't always the case. And things which can help at smaller scales can actually hurt at larger scales. You have to make guesses based on what the trend lines look like and based on your intuitive feeling of what's actually going to matter, particularly for the things that only help at small scale.
具体来说,为什么这些信息是不完美的?这是因为你永远不知道这种趋势是否会持续下去。对于某些架构来说,这种趋势一直保持得很好。对于某些变化,它保持得非常好。但情况并非总是如此。在较小规模上可以提供帮助的东西实际上可以在更大的范围内造成伤害。你必须根据趋势线的样子和你对真正重要的事情的直觉来猜测,特别是对于那些有助于小规模的东西。
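
To make the "bunch of dots with a curve through them" concrete, here is a minimal sketch of that kind of extrapolation: fit a saturating power law to losses from small-scale runs and project it out to a much bigger run. The numbers and the functional form here are purely illustrative, not any lab's actual methodology.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (relative compute, loss) pairs from small-scale runs.
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])   # in units of the smallest run
loss = np.array([3.21, 2.96, 2.70, 2.55, 2.34])

def scaling_curve(c, a, b, irreducible):
    # loss is modeled as a * compute^(-b) plus an irreducible floor
    return a * c ** (-b) + irreducible

params, _ = curve_fit(scaling_curve, compute, loss, p0=[2.0, 0.1, 1.0])

# Extrapolate two orders of magnitude beyond the biggest run actually done.
# This is exactly the step where "you never actually know if the trend will hold."
print(scaling_curve(10_000.0, *params))
```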

Dwarkesh Patel 00:42:35 德瓦克什·帕特尔 00:42:35

That's interesting to consider. For every chart you see in a release paper or technical report that shows that smooth curve, there's a graveyard of runs whose curves went flat after the first few points.
这很有趣。对于您在发布文件或技术报告中看到的显示平滑曲线的每张图表,都有一个前几次运行的坟墓,然后是平坦的。

Sholto Douglas 00:42:45 肖尔托·道格拉斯 00:42:45

Yeah. There's all these other lines that go in different directions. You just tail off.
是的。还有所有这些其他的线条都朝着不同的方向发展。你只是尾随而去。

Trenton Bricken 00:42:50 特伦顿·布里肯 00:42:50

It's crazy, both as a grad student and here, the number of experiments that you have to run before getting a meaningful result. 
这太疯狂了,无论是作为一名研究生还是在这里,在获得有意义的结果之前,你必须进行大量的实验。

Dwarkesh Patel 00:42:57 德瓦克什·帕特尔 00:42:57

But presumably it's not just like you run it until it stops and then go to the next thing. There's some process by which to interpret the early data. I don't know. I could put a Google Doc in front of you and I'm pretty sure you could just keep typing for a while on different ideas you have. There's some bottleneck between that and just making the models better immediately. Walk me through that. What is the inference you're making from the first early steps that makes you have better experiments and better ideas?
但据推测,这不仅仅是你运行它直到它停止,然后去做下一件事。有一些过程可以解释早期数据。我不知道。我可以把一个谷歌文档放在你面前,我很确定你可以继续输入一段时间的不同想法。这与立即使模型变得更好之间存在一些瓶颈。带我了解一下。你从最初的早期步骤中得出的推论是什么,使你有更好的实验和更好的想法?

Sholto Douglas 00:43:30 肖尔托·道格拉斯 00:43:30

I think one thing that I didn't fully convey before was that a lot of good research comes from working backwards from the actual problems that you want to solve. There are a couple of grand problems today in making the models better that you would identify as issues, and then you work on how you can change things to solve them. When you scale you also run into a bunch of things, and you want to fix behaviors and issues at scale. That informs a lot of the research for the next increment and this kind of stuff.
我认为我之前没有完全传达的一件事是,我认为很多类似的好研究都来自于从你想要解决的实际问题向后工作。今天,在使模型变得更好方面存在几个大问题,您会将其确定为问题,然后研究如何改变事物以实现这一目标?当你扩展时,你也会遇到很多事情,你想要大规模地修复行为和问题。这为下一个增量和此类东西的很多研究提供了信息。

Concretely, the barrier is partly software engineering: having a code base that's large and capable enough to support many people doing research at the same time often makes it complex. If you're doing everything by yourself, your iteration pace is going to be much faster. Alec Radford, for example, famously did much of the pioneering work at OpenAI. I've heard he mostly works out of a Jupyter notebook and then has someone else who writes and productionizes that code for him. Actually operating with other people raises the complexity a lot, for natural reasons familiar to every software engineer, and then there's the experiments themselves. Running and launching those experiments is easy, but there are inherent slowdowns induced by waiting on them. So you often want to be parallelizing multiple different streams. You can't necessarily be totally focused on one thing. You might not have fast enough feedback cycles. And then intuiting what went wrong is actually really hard.
具体来说,障碍是软件工程的一点点,拥有一个足够大且有能力的代码库,它可以支持许多人同时进行研究,这往往使它变得复杂。如果你自己做所有事情,你的迭代速度会快得多。例如,亚历克·拉德福德(Alec Radford)在OpenAI做了许多著名的开创性工作。我听说他主要在 Jupyter 笔记本上工作,然后让其他人为他编写和生产这些代码。实际上,与其他人一起操作会大大增加复杂性,这是每个软件工程师都熟悉的自然原因,也是固有的运行。运行和启动这些实验很容易,但由此导致的固有减速。因此,您通常希望并行化多个不同的流。你不一定完全专注于一件事。你可能没有足够快的反馈周期。然后直觉地判断出了什么问题实际上非常困难。

This is, in many respects, the problem that the team Trenton is on is trying to better understand. What is going on inside these models? We have inferences and understanding and headcanon for why certain things work, but it's not an exact science, and so you have to constantly be making guesses about why something might have happened, and what experiment might reveal whether that is or isn't true. That's probably the most complex part.
这在许多方面,特伦顿所在的团队正试图更好地理解这个问题。这些模型内部发生了什么?我们有推论、理解和头脑来解释为什么某些事情会起作用,但这不是一门精确的科学。因此,你必须不断地猜测为什么会发生一些事情,实验可能会揭示什么,无论这是真的还是假的。这可能是最复杂的部分。

The performance work is comparatively easier but harder in other respects. It's just a lot of low-level and difficult engineering work.
表演工作相对容易,但在其他方面更难。这只是很多低级和困难的工程工作。

Trenton Bricken 00:45:38 特伦顿·布里肯 00:45:38

I agree with a lot of that. Even on the interpretability team, especially with Chris Olah leading it, there are just so many ideas that we want to test and it's really just having the “engineering” skill–a lot of it is research–to very quickly iterate on an experiment, look at the results, interpret it, try the next thing, communicate them, and then just ruthlessly prioritizing what the highest priority things to do are.
我同意很多。即使在可解释性团队中,尤其是在 Chris Olah 的领导下,我们想要测试的想法也太多了,而它实际上只是拥有“工程”技能——其中很多是研究——非常快速地迭代实验,查看结果,解释它,尝试下一件事,传达它们,然后无情地优先考虑最优先的事情。

Sholto Douglas 00:46:07 肖尔托·道格拉斯 00:46:07

This is really important. The ruthless prioritization is something which I think separates a lot of quality research from research that doesn't necessarily succeed as much. We're in this funny field where so much of our initial theoretical understanding has basically broken down. So you need to have this simplicity bias and ruthless prioritization about what's actually going wrong. I think that's one of the things that separates the most effective people. They don't necessarily get too attached to using a given sort of solution that they are familiar with, but rather they attack the problem directly.
这真的很重要。无情的优先次序是我认为将许多高质量的研究与不一定成功的研究区分开来的东西。我们在这个有趣的领域,我们最初的理论理解基本上被打破了。因此,你需要有这种简单性偏见和无情的优先级,而不是实际出错的地方。我认为这是区分最有效率的人的原因之一。他们不一定太执着于使用他们熟悉的给定解决方案,而是直接解决问题。

You see this a lot in people who come in with a specific academic background. They try to solve problems with that toolbox but the best people are people who expand the toolbox dramatically. They're running around and they're taking ideas from reinforcement learning, but also from optimization theory. And also they have a great understanding of systems. So they know what the sort of constraints that bound the problem are and they're good engineers. They can iterate and try ideas fast. By far the best researchers I've seen, they all have the ability to try experiments really, really, really, really, really fast. That’s cycle time at smaller scales. Cycle time separates people. 
在具有特定学术背景的人身上,你经常看到这一点。他们试图用这个工具箱解决问题,但最好的人是那些大幅扩展工具箱的人。他们四处奔波,他们从强化学习中汲取灵感,但也从优化理论中汲取灵感。而且他们对系统有很好的理解。所以他们知道束缚问题的约束是什么,他们是优秀的工程师。他们可以快速迭代和尝试想法。到目前为止,我见过的最好的研究人员,他们都有能力尝试实验,非常快。这是较小规模的循环时间。周期时间将人分开。

Trenton Bricken 00:47:20 特伦顿·布里肯 00:47:20

Machine learning research is just so empirical. This is honestly one reason why I think our solutions might end up looking more brain-like than otherwise. Even though we wouldn't want to admit it, the whole community is kind of doing greedy evolutionary optimization over the landscape of possible AI architectures and everything else. It’s no better than evolution. And that’s not even a slight against evolution. 
机器学习研究就是如此实证。老实说,这就是为什么我认为我们的解决方案最终可能看起来比其他解决方案更像大脑的原因之一。尽管我们不想承认这一点,但整个社区都在对可能的人工智能架构和其他一切进行贪婪的进化优化。这并不比进化更好。这甚至不是对进化论的轻视。

Dwarkesh Patel 00:47:46 德瓦克什·帕特尔 00:47:46

That's such an interesting idea. I'm still confused on what will be the bottleneck. What would have to be true of an agent such that it sped up your research? So in the Alec Radford example where he apparently already has the equivalent of Copilot for his Jupyter notebook experiments, is it just that if he had enough of those he would be a dramatically faster researcher?
这是一个非常有趣的想法。我仍然对瓶颈是什么感到困惑。对于一个代理来说,它必须是什么才能加速你的研究?因此,在亚历克·拉德福德(Alec Radford)的例子中,他显然已经为他的Jupyter笔记本实验提供了相当于Copilot的Copilot,只是如果他有足够的这些,他会成为一个速度更快的研究人员吗?

So you're not automating the humans, you're just making the most effective researchers, the ones with great taste, more effective and running the experiments for them? You're still working at the point at which the intelligence explosion is happening? Is that what you're saying?
所以你不是在自动化人类,你只是在让最有品味的研究人员更有效,更有效,并为他们进行实验?你还在情报爆炸发生的时候工作吗?你是这么说的吗?

Sholto Douglas 00:48:27 肖尔托·道格拉斯 00:48:27

Right, and if that were directly true then why can't we scale our current research teams better? I think that's an interesting question to ask. If this work is so valuable, why can't we take hundreds or thousands of people–they're definitely out there–and scale our organizations better?
是的,如果这是直接正确的,那么为什么我们不能更好地扩展我们目前的研究团队呢?我认为这是一个有趣的问题。如果这项工作如此有价值,为什么我们不能让成百上千的人——他们肯定在那里——并更好地扩展我们的组织。

I think that, at the moment, we are less bound by the sheer engineering work of making these things than we are by the compute to run experiments and get signal, and by taste in terms of what the actual right thing to do is. And then making those difficult inferences on imperfect information...
我认为,目前,我们受制于制造这些东西的纯粹工程工作,而不是通过计算来运行和获取信号,并品尝到真正正确的事情是什么。然后对不完美的信息做出那些困难的推论,

Trenton Bricken 00:49:13 特伦顿·布里肯 00:49:13

For the Gemini team. Because I think for interpretability, we actually really want to keep hiring talented engineers. I think that's a big bottleneck for us. 
对于双子座团队。因为我认为为了可解释性,我们实际上真的想继续雇用有才华的工程师。我认为这对我们来说是一个很大的瓶颈。

Sholto Douglas 00:49:23 肖尔托·道格拉斯 00:49:23

Obviously more people are better. But I do think it's interesting to consider. One of the biggest challenges that I've thought a lot about is how do we scale better? Google is an enormous organization. It has 200,000-ish people, right? Maybe 180,000 or something like that. One has to imagine ways of scaling out Gemini's research program to all those fantastically talented software engineers. This seems like a key advantage that you would want to be able to take advantage of. You want to be able to use it but how do you effectively do that? It's a very complex organizational problem.
显然,人越多越好。但我确实认为这很有趣。我经常思考的最大挑战之一是我们如何更好地扩展?谷歌是一个庞大的组织。它有 200,000 人左右,对吧?也许是 180,000 或类似的东西。人们必须想象如何将 Gemini 的研究计划扩展到所有那些才华横溢的软件工程师。这似乎是您希望能够利用的关键优势。你希望能够使用它,但你如何有效地做到这一点?这是一个非常复杂的组织问题。

Dwarkesh Patel 00:50:02 德瓦克什·帕特尔 00:50:02

So compute and taste. That's interesting to think about because at least the compute part is not bottlenecked on more intelligence, it's just bottlenecked on Sam's $7 trillion or whatever, right? If I gave you 10x the H100s to run your experiments, how much more effective a researcher are you?
所以计算和品味。这很有趣,因为至少计算部分在更多智能方面没有瓶颈,它只是在 Sam 的 7 万亿美元或其他方面遇到了瓶颈,对吧?如果我给你 10 倍的 H100 来运行你的实验,你是一个更有效率的研究人员吗?

Sholto Douglas 00:50:20 肖尔托·道格拉斯 00:50:20

TPUs, please.  请使用热塑性聚氨酯。

Dwarkesh Patel 00:50:23 德瓦克什·帕特尔 00:50:23

How much more effective a researcher are you? 
你是一个更有效率的研究人员吗?

Sholto Douglas 00:50:26 肖尔托·道格拉斯 00:50:26

I think the Gemini program would probably be maybe five times faster with 10 times more compute or something like that.
我认为 Gemini 程序可能会快五倍,计算量增加 10 倍或类似的东西。

Dwarkesh Patel 00:50:35 德瓦克什·帕特尔 00:50:35

So that's pretty good. Elasticity of 0.5. Wait, that's insane.
所以这很好。弹性0.5。等等,这太疯狂了。

Sholto Douglas 00:50:39 肖尔托·道格拉斯 00:50:39

I think more compute would just directly convert into progress.
我认为更多的计算只会直接转化为进度。

Dwarkesh Patel 00:50:43 德瓦克什·帕特尔 00:50:43

So you have some fixed amount of compute, and some of it goes to inference and also to clients of GCP. Some of it goes to training, and from there some fraction goes to running the experiments that feed into the full model.
因此,您有一些固定大小的计算,其中一些用于推理以及 GCP 的客户端。其中一些用于训练,然后从那里开始,作为其中的一小部分,其中一些用于运行完整模型的实验。

Sholto Douglas 00:51:04 肖尔托·道格拉斯 00:51:04

Yeah, that's right. 是的,没错。

Dwarkesh Patel 00:51:05 德瓦克什·帕特尔 00:51:05

Shouldn't the fraction that goes to experiments then be higher, given research is bottlenecked by compute?
鉴于研究受到计算的瓶颈,那么进行实验的分数不应该更高吗?

Sholto Douglas 00:51:13 肖尔托·道格拉斯 00:51:13

So one of the strategic decisions that every pre-training team has to make is exactly how much compute to allocate to different training runs: to your research program versus scaling up the last best thing that you landed on. They're all trying to arrive at an optimal point here. One of the reasons why you still need to keep training big models is that you get information there that you don't get otherwise. Scale has all these emergent properties which you want to understand better.
因此,每个预训练团队必须做出的战略决策之一是,您为不同的训练运行、研究计划分配了多少计算量,而不是扩展您最后找到的最好的东西。他们都试图在这里达到一个最佳点。你仍然需要继续训练大型模型的原因之一是,你在那里获得了你无法获得的信息。因此,规模具有所有这些涌现属性,您希望更好地理解这些属性。

Remember what I said before about not being sure what's going to fall off the curve. If you keep doing research in this regime and keep on getting more and more compute efficient, you may actually have gone off the path that eventually scales. So you need to constantly be investing in doing big runs too, at the frontier of what you expect to work.
记住我之前说过的,不确定什么会从曲线上掉下来。如果你继续在这种制度下进行研究,并继续提高计算效率,你实际上可能已经偏离了最终扩展的道路。因此,你也需要不断地投资于大跑,在你期望工作的前沿。

Dwarkesh Patel 00:52:17 德瓦克什·帕特尔 00:52:17

So then tell me what it looks like to be in the world where AI has significantly sped up AI research. Because from this, it doesn't really sound like the AIs are going off and writing the code from scratch that's leading to faster output. It sounds like they're really augmenting the top researchers in some way. Tell me concretely. Are they doing the experiments? Are they coming up with the ideas? Are they just evaluating the outputs of the experiments? What's happening?
那么,请告诉我,在人工智能大大加快了人工智能研究的世界中,会是什么样子。因为从这一点来看,听起来并不像 AI 正在从头开始编写代码,从而导致更快的输出。听起来他们真的在某种程度上增强了顶尖研究人员。具体告诉我。他们在做实验吗?他们想出了主意吗?他们只是在评估实验的结果吗?发生了什么事情?

Sholto Douglas 00:52:39 肖尔托·道格拉斯 00:52:39

So I think there's two worlds you need to consider here. One is where AI has meaningfully sped up our ability to make algorithmic progress. And one is where the output of the AI itself is the thing that's the crucial ingredient towards model capability progress. Specifically what I mean there is synthetic data. In the first world, where it's meaningfully speeding up algorithmic progress, I think a necessary component of that is more compute. You've probably reached this elasticity point where AIs are easier to spin up and get up to speed on context than you are, or than other people are. So AIs meaningfully speed up your work because they're basically a fantastic Copilot that helps you code multiple times faster.
所以我认为这里需要考虑两堵墙。一个是人工智能有意义地加速了我们取得算法进步的能力。一个是人工智能本身的输出是模型能力进步的关键因素。具体来说,我的意思是合成数据。在第一个世界里,它有意义地加速了算法的进步,我认为其中的一个必要组成部分是更多的计算。你可能已经达到了这个弹性点,人工智能比你自己或其他人更容易加速和进入上下文。因此,AI 有意义地加快了您的工作速度,因为它们基本上是一个出色的 Copilot,可以帮助您将编码速度提高数倍。

That seems actually quite reasonable. Super long-context, super smart model. It's onboarded immediately and you can send it off to complete subtasks and subgoals for you. That actually feels very plausible, but again we don't know, because there are no great evals for that kind of thing. As I said before, the best one is SWE-bench.
这似乎实际上很合理。超长上下文,超智能模型。它会立即加入,您可以发送它们为您完成子任务和子目标。这实际上感觉很有道理,但我们又不知道,因为对这种事情没有很好的评价。正如我之前所说,最好的是 SWE-bench。

Dwarkesh Patel 00:53:51 德瓦克什·帕特尔 00:53:51

Somebody was mentioning to me that the problem with that one is that when a human is trying to do a pull request, they'll type something out, run it, and see if it works. If it doesn't, they'll rewrite it. None of that iteration was part of the opportunity the LLM was given when told to run on this. It just output code, and if that ran and checked all the boxes then it passed. So it might've been an unfair test in that way.
有人向我提到,这个问题的问题在于,当一个人试图做一个拉取请求时,他们会输入一些东西,然后运行它,看看它是否有效。如果没有,他们会重写它。这些都不是被告知“继续运行”时所给予的机会LLM的一部分。只需输出,如果它运行并选中所有框,那么它就通过了。因此,从这个角度来说,这可能是一个不公平的测试。

Sholto Douglas 00:54:16 肖尔托·道格拉斯 00:54:16

So you can imagine that if you were able to use that, that would be an effective training source. The key thing that's missing from a lot of training data is the reasoning traces, right?
所以你可以想象,如果你能够使用它,那将是一个有效的培训来源。大量训练数据中缺少的关键是推理痕迹,对吧?

And I think this would be it. If I wanted to try and automate a specific field, a job family, or understand how at risk of automation that specific field is, then having reasoning traces feels to me like a really important part of that.
我想就是这样。如果我想尝试自动化一个特定的领域、一个工作系列,或者了解该特定领域的自动化风险有多大,那么推理痕迹对我来说是其中非常重要的一部分。

Dwarkesh Patel 00:54:51 德瓦克什·帕特尔 00:54:51

There's so many different threads there I want to follow up on. Let's begin with the data versus compute thing. Is the output of the AI the thing that's causing the intelligence explosion? People talk about how these models are really a reflection of their data. I forget his name, but there was a great blog post by an OpenAI engineer. It was talking about how at the end of the day, as these models get better and better, they are just going to be really effective maps of the dataset. So at the end of the day you have to stop thinking about architectures. The most effective architecture is just, "do you do an amazing job of mapping the data?" So that implies that future AI progress comes from the AI just making really awesome data that you're mapping to?
我想跟进那里有很多不同的线程。让我们从数据与计算开始。人工智能的输出是导致智能爆炸的原因吗?人们谈论这些模型如何真正反映他们的数据。我忘记了他的名字,但这位 OpenAI 工程师有一个很棒的博客。它谈论的是,随着这些模型变得越来越好,最终将会有真正有效的数据集地图。因此,归根结底,您必须停止考虑架构。最有效的架构就是,“你在映射数据方面做得非常出色吗?所以这意味着未来的人工智能进步来自人工智能,只是制作你正在映射的真正令人敬畏的数据?

Sholto Douglas 00:55:45 肖尔托·道格拉斯 00:55:45

That's clearly a very important part.
这显然是一个非常重要的部分。

Dwarkesh Patel 00:55:46 德瓦克什·帕特尔 00:55:46

That's really interesting. Does that look to you like chain-of-thought? Or what would you imagine as these models get better, as these models get smarter? What does the synthetic data look like? 
这真的很有趣。在你看来,这像是思维链吗?或者,当这些模型变得更好,当这些模型变得更智能时,你会想象什么?合成数据是什么样子的?

Sholto Douglas 00:56:00 肖尔托·道格拉斯 00:56:00

When I think of really good data, to me that means something which involved a lot of reasoning to create. It's similar to Ilya's perspective on effectively achieving superintelligence via perfectly modeling human textual output. But even in the near term, in order to model something like the arXiv papers or Wikipedia, you have to have an incredible amount of reasoning behind you in order to understand what the next token might be.
当我想到真正好的数据时,对我来说,这引发了一些涉及大量推理的东西。这与 Ilya 的观点相似,即通过完美地模拟人类文本输出来有效地实现超级智能。但即使在短期内,为了模拟像arXiv论文或维基百科这样的东西,你必须有大量的推理,才能理解下一个代币可能输出什么。

So for me, what I imagine as good data is data where it had to do reasoning to produce something. And then the trick of course is how do you verify that that reasoning was correct? This is why you saw DeepMind do that research for geometry. Geometry is an easily formalizable, easily verifiable field. You can check if its reasoning was correct and you can generate heaps of data of correct trig, of verified geometry proofs, and train on that. And you know that that's good data. 
所以对我来说,我想象中的好数据是数据,它必须进行推理才能产生一些东西。当然,诀窍在于你如何验证这种推理是否正确?这就是为什么你看到 DeepMind 对几何进行研究的原因。几何是一个易于形式化、易于验证的领域。你可以检查它的推理是否正确,你可以生成一堆正确的三角形数据,经过验证的几何证明,并在此基础上进行训练。你知道这是好数据。

Dwarkesh Patel 00:57:11 德瓦克什·帕特尔 00:57:11

It's actually funny because I had a conversation with Grant Sanderson last year where we were debating this and I was like, “fuck dude, by the time they get the gold of the Math Olympiad, of course they're going to automate all the jobs.” Yikes. 
这实际上很有趣,因为去年我和格兰特·桑德森(Grant Sanderson)有过一次谈话,当时我们正在讨论这个问题,我想,“去他妈的,当他们获得数学奥林匹克竞赛的金牌时,他们当然会自动化所有的工作。哎呀。

On synthetic data, there's a thing I speculated about in my scaling post, which was heavily informed by discussions with you two, and you especially, Sholto. You can think of human evolution through the lens of getting language, and so we're generating the synthetic data. Other copies of us are generating the synthetic data which we're trained on, and it's this really effective genetic-cultural co-evolutionary loop.
关于合成数据,我在我的扩展帖子中推测了一件事,这与你们两个,尤其是 Sholto 的讨论提供了大量信息。你可以通过获取语言的光谱来思考人类的进化,因此我们正在生成合成数据。我们的副本正在生成我们训练的合成数据,这是一个非常有效的遗传学、文化、协同进化循环。

Sholto Douglas 00:57:54 肖尔托·道格拉斯 00:57:54

And there's a verifier there too, right? There's the real world. You might generate a theory about the gods causing the storms, and then someone else finds cases where that isn't true. So that sort of didn't match your verification function. Now instead you have some weather simulation which required a lot of reasoning to produce and accurately matches reality. And now you can train on that as a better model of the world. We are training on that, and on stories, and on scientific theories.
那里也有一个验证器,对吧?这是真实的世界。你可能会产生一个关于众神引起风暴的理论,然后其他人发现这不是真的。所以这与你的验证功能不匹配。现在,你有一些天气模拟,这需要大量的推理来产生并准确地匹配现实。现在你可以把它作为一个更好的世界模型来训练。就像我们正在训练一样,还有故事,就像科学理论一样。

Dwarkesh Patel 00:58:27 德瓦克什·帕特尔 00:58:27

I want to go back. I'm just remembering something you mentioned a little while ago about how, given how empirical ML is, it really is an evolutionary process resulting in better performance, and not necessarily an individual coming up with a breakthrough in a top-down way. That has interesting implications.
我想回去。我只是想起你刚才提到的一些事情,考虑到机器学习的经验性,它确实是一个进化过程,可以带来更好的性能,而不一定是个人以自上而下的方式取得突破。这具有有趣的含义。

First, people are concerned about capabilities increasing because more people are going into the field. I've been somewhat skeptical of that way of thinking, but from this perspective of just more input, it really does feel like more people going to ICML means that there's faster progress towards GPT-5.
首先,人们担心能力的提高,因为越来越多的人进入该领域。我一直对这种思维方式持怀疑态度,但从更多投入的角度来看,确实感觉更多的人去 ICML 意味着 GPT-5 的进展更快。

Trenton Bricken 00:59:13 特伦顿·布里肯 00:59:13

You just have more genetic recombination. And shots on target. 
你只是有更多的基因重组。还有射正。

Sholto Douglas 00:59:17 肖尔托·道格拉斯 00:59:17

I mean, aren't all fields kind of like that? This is sort of the scientific framing of discovery versus invention, right? Whenever there's been a massive scientific breakthrough in the past, it almost always looks like discovery: typically there are multiple people co-discovering a thing at roughly the same time. That feels to me, at least a little bit, like the mixing and trying of ideas. You can't try an idea that's so far out of scope that you have no way of verifying it with the tools you have available.
我的意思是,不是所有的领域都是这样吗?这有点像发现与发明的科学框架,对吧?发现几乎涉及过去任何重大科学突破。通常,有多个人几乎在同一时间共同发现一件事。在我看来,这至少有点像是想法的混合和尝试。你不能尝试一个远远超出范围的想法,以至于你无法用你现有的工具进行验证。

Trenton Bricken 00:59:45 特伦顿·布里肯 00:59:45

I think physics and math might be slightly different in this regard. But especially for biology or any sort of wetware, to the extent we want to analogize neural networks here, it's just comical how serendipitous a lot of the discoveries are. Penicillin, for example. 
我认为物理和数学在这方面可能略有不同。但特别是对于生物学或任何类型的湿件,在某种程度上,我们想在这里类比神经网络,这真是滑稽可笑,许多发现是多么偶然。例如,青霉素。

Dwarkesh Patel 01:00:01 德瓦克什·帕特尔 01:00:01

Another implication of this is the idea that AGI is just going to come tomorrow. Somebody's just going to discover a new algorithm and we have AGI. That seems less plausible. It will just be a matter of more and more and more researchers finding these marginal things that all add up together to make models better. 
另一个含义是,AGI明天就会到来。有人会发现一种新的算法,我们有了AGI。这似乎不太合理。这只是越来越多的研究人员发现这些边缘的东西,这些东西加在一起可以使模型变得更好。

Sholto Douglas 01:00:19 肖尔托·道格拉斯 01:00:19

Right. That feels like the correct story to me.
右。对我来说,这感觉是正确的故事。

Trenton Bricken 01:00:23 特伦顿·布里肯 01:00:23

Especially while we're still hardware constrained. 
尤其是在我们仍然受到硬件限制的情况下。

Dwarkesh Patel 01:00:25 德瓦克什·帕特尔 01:00:25

Right. Do you buy this narrow window framing of the intelligence explosion? Each generation, GPT-3 to GPT-4, is two OOMs, orders of magnitude, more compute, or at least more effective compute. In the sense that, if you didn't have any algorithmic progress, it would have to be two orders of magnitude bigger in raw form to be as good. Do you buy the framing that, given that you have to be two orders of magnitude bigger at every generation, if you don't get AGI by GPT-7 that can help you catapult an intelligence explosion, you're kind of just fucked as far as much smarter intelligence goes? You're kind of stuck with GPT-7-level models for a long time because at that point you're consuming significant fractions of the economy to make that model and we just don't have the wherewithal to make GPT-8.
右。你买这个情报爆炸的窄窗框吗?每个 GPT-3、GPT-4 都是两个 OOM,数量级,计算量更大,或者至少计算效率更高。从某种意义上说,如果你没有任何算法进步,它必须大两个数量级,原始形式,才能一样好。你是否相信这样的框架,鉴于你必须在每一代人中大两个数量级,如果你没有获得 GPT-7 的 AGI,可以帮助你推动智能爆炸,你就有点完蛋了更聪明的智能。你很长一段时间都停留在 GPT-7 级别的模型上,因为在这一点上,你正在消耗很大一部分经济来制作这个模型,而我们只是没有资金来制作 GPT 8。

Trenton Bricken 01:01:19 特伦顿·布里肯 01:01:19

This is the Carl Shulman sort of argument that we're going to race through the orders of magnitude in the near term, but then in the longer term it would be harder.
这是卡尔·舒尔曼(Carl Shulman)的论点,我们将在短期内通过数量级的比赛,但从长远来看,这将更加困难。

Dwarkesh Patel 01:01:28 德瓦克什·帕特尔 01:01:28

He's probably talked about it a lot but I do buy that framing.
他可能已经谈过很多次了,但我确实买了那个框架。

Sholto Douglas 01:01:33 肖尔托·道格拉斯 01:01:33

I generally buy that. Each order-of-magnitude increase in compute means, in absolute terms, almost diminishing returns on capability, right? We've seen, over a couple of orders of magnitude, models go from being unable to do anything to being able to do huge amounts.
我一般都买那个。计算量级的增加意味着从绝对值来看,几乎是递减的能力回报,对吧?我们已经看到超过几个数量级,模型从无能为力到能够做大量的事情。

It feels to me that each incremental order of magnitude gives more nines of reliability at things. So it unlocks things like agents. But at least at the moment, it doesn't feel like reasoning improves linearly, but rather somewhat sublinearly.
在我看来,每增加一个数量级,就会在事情上提供更多的可靠性。所以它解锁了代理之类的东西。但至少在目前,感觉推理并没有线性地提高,而是在某种程度上是亚线性的。

Dwarkesh Patel 01:02:04 德瓦克什·帕特尔 01:02:04

That's actually a very bearish sign. We were chatting with one of our friends and he made the point that if you look at what new applications are unlocked by GPT-4 relative to GPT-3.5, it's not clear that it's that much more. GPT-3.5 can power Perplexity or whatever. So if there's this diminishing increase in capabilities and it costs exponentially more to get, that's actually a bearish sign on what 4.5 will be able to do or what 5 will unlock in terms of economic impact.
这实际上是一个非常看跌的迹象。我们和一位朋友聊天时,他指出,如果你看看 GPT-4 相对于 GPT-3.5 解锁了哪些新应用程序,就不清楚它有那么多了。GPT-3.5 可以做困惑或其他任何事情。因此,如果能力的增长正在减少,并且获得成本呈指数级增长,这实际上是一个看跌的迹象,表明4.5将能够做什么,或者5将在经济影响方面释放什么。

Sholto Douglas 01:02:37 肖尔托·道格拉斯 01:02:37

That being said, for me the jump between 3.5 and 4 is pretty huge. So another 3.5 to 4 jump is ridiculous. If you imagine 5 as being a 3.5 to 4 jump, straight off the bat in terms of ability to do SATs and this kind of stuff.
话虽如此,对我来说,3.5 和 4 之间的跳跃是相当大的。所以再跳 3.5 到 4 是荒谬的。如果你把 5 想象成 3.5 到 4 的跳跃,那么就做 SAT 和类似事情的能力而言,这是直接的。

Trenton Bricken 01:02:53 特伦顿·布里肯 01:02:53

Yeah, the LSAT performance was particularly striking. 
是的,LSAT的表现特别引人注目。

Sholto Douglas 01:02:55 肖尔托·道格拉斯 01:02:55

Exactly. You go from not super smart to very smart to utter genius in the next generation instantly. And it doesn't, at least to me, feel like we're going to jump to utter genius in the next generation, but it does feel like we'll get very smart plus lots of reliability. TBD what that continues to look like.
完全。你从不聪明到非常聪明,在下一代瞬间成为完全的天才。至少在我看来,这不会让人觉得我们会在下一代中成为完全的天才,但确实感觉我们会变得非常聪明,而且可靠性很高。待定,这仍然是什么样子。

Dwarkesh Patel 01:03:20 德瓦克什·帕特尔 01:03:20

Will GOFAI be part of the intelligence explosion? You talked about synthetic data, but in fact it would be writing its own source code in some important way. There was an interesting paper that you can use diffusion to come up with model weights. I don't know how legit that was or whatever, but something like that.
GOFAI会成为情报爆炸的一部分吗?你谈到了合成数据,但实际上它会以某种重要的方式编写自己的源代码。有一篇有趣的论文说,你可以用扩散来得出模型权重。我不知道这有多合法,但类似的东西。

Trenton Bricken 01:03:41 特伦顿·布里肯 01:03:41

So GOFAI is good old-fashioned AI, right? Can you define that? Because when I hear it, I think “if else” statements for symbolic logic. 
所以GOFAI是很好的老式AI,对吧?你能定义一下吗?因为当我听到它时,我想到符号逻辑的“如果”陈述。

Sholto Douglas 01:03:53 肖尔托·道格拉斯 01:03:53

I actually want to make sure we fully unpack the model improvement increments. I don't want people to come away with the perspective that this is super bearish and models aren't going to get much better. I want to emphasize that the jumps that we've seen so far are huge. Even if those continue on a smaller scale, we're still in for extremely smart, very reliable agents over the next couple of orders of magnitude.
实际上,我想确保我们完全解开模型改进增量的包装。我不希望人们认为这是非常悲观的,模型不会变得更好。我想强调的是,到目前为止,我们看到的跳跃是巨大的。即使这些继续以较小的规模发展,在接下来的几个数量级内,我们仍然会遇到非常智能、非常可靠的代理。

We didn't fully close the thread on the narrow window thing. Let's say GPT-4 cost a hundred million dollars or whatever. You have the 1B run, 10B run, 100B run. All seem very plausible by private company standards.
我们没有完全关闭窄窗口上的线程。假设 GPT-4 花费了一亿美元或其他什么。你有 1B 运行、10B 运行、100B 运行。按照私营公司的标准,所有这些似乎都非常合理。

Trenton Bricken 01:04:41 特伦顿·布里肯 01:04:41

You mean in terms of dollars?
你是说以美元计价?

Sholto Douglas 01:04:42 肖尔托·道格拉斯 01:04:42

In terms of dollar amount. You can also imagine even a $1T run being possible as part of a national consortium, at a national level, but it's much harder on behalf of an individual company. But Sam is out there trying to raise $7 trillion, right? He's already preparing for a whole lot more orders of magnitude.
就美元金额而言。你也可以想象,即使是 1T 运行,在国家层面上也是国家财团的一部分,但代表单个公司要困难得多。但山姆正试图筹集 7 万亿美元,对吧?他已经在为大规模做准备了。

Trenton Bricken 01:05:02 特伦顿·布里肯 01:05:02

He's shifted the Overton window.
他移动了奥弗顿的窗户。

Sholto Douglas 01:05:03 肖尔托·道格拉斯 01:05:03

He's shifting the magnitude here beyond the national level. So I want to point out that we have a lot more jumps. Even if those jumps are relatively smaller, that's still a pretty stark improvement in capability. 
他正在将这里的规模转移到国家层面之外。所以我想指出,我们有更多的跳跃。即使这些跳跃相对较小,这仍然是能力的明显改进。

Trenton Bricken 01:05:18 特伦顿·布里肯 01:05:18

Not only that, but if you believe claims that GPT-4 is around a 1 trillion parameter count, well, the human brain has between 30 and 300 trillion synapses. That's obviously not a one-to-one mapping and we can debate the numbers, but it seems pretty plausible that we're still below brain scale.
不仅如此,如果你相信 GPT-4 大约有 1 万亿个参数计数的说法,那么人脑就有 30 到 300 万亿个突触。这显然不是一对一的映射,我们可以对数字进行辩论,但似乎我们仍然低于大脑规模。

Dwarkesh Patel 01:05:37 德瓦克什·帕特尔 01:05:37

So crucially, the point is that the algorithmic overhang is really high. Maybe this is something we should touch on explicitly. Even if you can't keep dumping more compute beyond the models that cost a trillion dollars or something, the fact that the brain is so much more data efficient implies that if we have the compute, and if we have the brain's algorithm to train with, if you could train as sample-efficiently as humans do from birth, then we could make the AGI.
因此,至关重要的是,算法开销非常高。也许这是我们应该明确触及的事情。即使你不能在花费一万亿美元或其他东西的模型之外继续倾倒更多的计算,大脑的数据效率要高得多的事实意味着,如果我们有计算,如果我们有大脑的算法来训练,如果你可以像人类从出生开始训练一样有效地训练样本, 然后我们就可以制作 AGI。

Trenton Bricken 01:06:09 特伦顿·布里肯 01:06:09

I never know exactly how to think about the sample efficiency stuff, because obviously a lot of things are hardwired in certain ways. There's the coevolution of language and brain structure. So it's hard to say. There are also some results that indicate that if you make your model bigger, it becomes more sample efficient.
我从来不知道如何考虑样本效率问题,因为显然很多事情都是以某种方式硬连线的。它们是语言和大脑结构的共同进化。所以很难说。还有一些结果表明,如果使模型变大,则样本效率会更高。

Sholto Douglas 01:06:29 肖尔托·道格拉斯 01:06:29

The original scaling laws paper, right? The bigger models are more sample efficient.
原来的缩放定律论文,对吧?逻辑模型几乎是空的。

Trenton Bricken 01:06:33 特伦顿·布里肯 01:06:33

Right. So maybe that just solves it. You don't have to be more data efficient, but if your model is bigger then you also just are more efficient.
右。所以也许这只是解决了它。你不必提高数据效率,但如果你的模型更大,那么你也会更有效率。

Dwarkesh Patel 01:06:42 德瓦克什·帕特尔 01:06:42

What is the explanation for why that would be the case? A bigger model sees the exact same data, and at the end of seeing that data it learns more from it? Does it have more space to represent it?
为什么会这样,有什么解释呢?一个更大的模型看到这些完全相同的数据,并在看到这些数据后从中学到更多?它是否有更多的空间来表示它?

01:06:52 - Superposition & secret communication #

01:06:52 - 叠加和秘密通信

Trenton Bricken 01:06:52 特伦顿·布里肯 01:06:52

This is my very naive take here. One thing about the superposition hypothesis that interpretability has pushed is that your model is dramatically underparameterized and that's typically not the narrative that deep learning has pursued, right? But if you're trying to train a model on the entire internet and have it predict with incredible fidelity, you are in the underparameterized regime and you're having to compress a ton of things and take on a lot of noisy interference in doing so. When you have a bigger model, you can have cleaner representations to work with.
这是我在这里非常幼稚的看法。关于可解释性推动的叠加假设的一件事是,你的模型被严重低估了参数化,这通常不是深度学习所追求的叙述,对吧?但是,如果你试图在整个互联网上训练一个模型,并让它以令人难以置信的保真度进行预测,那么你就处于参数化程度低的状态,你必须压缩大量的东西,并承担很多嘈杂的干扰。当您拥有更大的模型时,您可以使用更清晰的表示。

Dwarkesh Patel 01:07:25 德瓦克什·帕特尔 01:07:25

For the audience, you should unpack that. First of all, why is that? What is superposition, and why is that an implication of superposition?
对于观众来说,你应该解开它。为什么首先呢?什么是叠加,为什么这是叠加的含义?

Trenton Bricken 01:07:32 特伦顿·布里肯 01:07:32

Sure. This was before I joined Anthropic. The fundamental result is from a paper titled “Toy Models of Superposition.” It finds that even for small models, if you are in a regime where your data is high-dimensional and sparse–by sparse I mean, any given data point doesn't appear very often–your model will learn a compression strategy that we call superposition so that it can pack more features of the world into it than it has parameters.
确定。那是在我加入 Anthropic 之前。基本结果来自一篇题为“叠加的玩具模型”的论文。它发现,即使对于小型模型,如果你的数据是高维和稀疏的——我的意思是稀疏,任何给定的数据点都不会经常出现——你的模型将学习一种我们称之为叠加的压缩策略,这样它就可以将更多的世界特征打包到其中,而不是它的参数。

I think both of these constraints apply to the real world, and modeling internet data is a good enough proxy for that. There's only one Dwarkesh. There's only one shirt you're wearing. There's this Liquid Death can here. These are all objects or features and how you define a feature is tricky. You're in a really high-dimensional space because there's so many of them and they appear very infrequently. In that regime, your model will learn compression.
我认为这两个限制都适用于现实世界,而对互联网数据进行建模是一个足够好的代理。只有一个矮人。你只穿一件衬衫。这里有这个液体死亡罐头。这些都是对象或特征,如何定义特征是很棘手的。你处于一个非常高维的空间中,因为它们太多了,而且它们很少出现。在这种制度下,您的模型将学习压缩
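
As a rough numerical illustration of that claim (this is not the actual "Toy Models of Superposition" training setup, just the intuition): give far more sparse features than dimensions a random direction each, superpose a few active ones into a single vector, and note that they can still be read back out above the interference from all the inactive features.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, n_active = 128, 2048, 4    # many more features than dimensions

# Each feature gets a random (nearly orthogonal) unit direction in the small space.
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A "data point" has only a handful of active features: high-dimensional and sparse.
active = rng.choice(n_features, size=n_active, replace=False)
activation = directions[active].sum(axis=0)     # everything superposed into one vector

readout = directions @ activation               # dot each feature direction with the vector
print("active features read out near 1:", np.round(readout[active], 2))
print("mean |interference| on inactive features:",
      round(float(np.abs(np.delete(readout, active)).mean()), 3))
```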

To riff a little bit more on this, I believe that the reason networks are so hard to interpret is in a large part because of this superposition. If you take a model and you look at a given neuron in it, a given unit of computation, and you ask, “how is this neuron contributing to the output of the model when it fires?” When you look at the data that it fires for, it's very confusing. It'll be like ten percent of every possible input. It’ll fire for “Chinese” but also “fish” and “trees”, and the full stop in URLs.
为了进一步解释这一点,我认为网络之所以如此难以解释,很大程度上是因为这种叠加。如果你拿一个模型,你看一个给定的神经元,一个给定的计算单位,然后你问,“这个神经元在触发时对模型的输出有什么贡献?当您查看它触发的数据时,它非常令人困惑。它就像是每个可能输入的百分之十。它会触发“中文”,也会触发“鱼”和“树”,以及 URL 中的句号。

But the paper that we put out last year, “Towards Monosemanticity,” shows that if you project the activations into a higher-dimensional space and provide a sparsity penalty, you get out very clean features and things all of a sudden start to make a lot more sense. You can think of this as undoing the compression in the same way that you assumed your data was originally high-dimensional and sparse. You return it to that high-dimensional and sparse regime.
但是我们去年发表的论文《走向单调性》(Towards Monosemanticity)表明,如果你把激活投射到一个更高维的空间中,并提供稀疏性惩罚,你就会得到非常干净的特征和东西,突然间开始变得更有意义。您可以将其视为撤消压缩,就像您假设数据最初是高维和稀疏的一样。你把它恢复到那个高维和稀疏的状态。
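
Here is a minimal sketch of that dictionary-learning step: a sparse autoencoder that projects activations into a much wider feature space, applies a sparsity penalty, and reconstructs the original activation. The dimensions, penalty weight, and random stand-in "activations" are illustrative, not the actual "Towards Monosemanticity" configuration.

```python
import torch
import torch.nn as nn

d_model, d_features, l1_coeff = 512, 8192, 1e-3

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # project up into the wide feature space
        self.decoder = nn.Linear(d_features, d_model)   # project back down to reconstruct

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))
        return features, self.decoder(features)

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

batch = torch.randn(1024, d_model)   # stand-in for residual-stream activations from a model

for step in range(200):
    features, reconstruction = sae(batch)
    reconstruction_loss = (reconstruction - batch).pow(2).mean()
    sparsity_loss = features.abs().mean()   # L1 penalty: most features should stay near zero
    loss = reconstruction_loss + l1_coeff * sparsity_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```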

Dwarkesh Patel 01:09:36 德瓦克什·帕特尔 01:09:36

There's so many interesting threads there. First thing, you mentioned that these models are trained in a regime where they're overparameterized. Isn't that when you have generalization, like grokking happens in that regime? 
那里有很多有趣的线索。首先,你提到这些模型是在过度参数化的制度下训练的。这难道不是当你有概括的时候,就像在那个政权中发生摸索一样吗?

Trenton Bricken 01:09:57 特伦顿·布里肯 01:09:57

I was saying the models were underparameterized. Typically people talk about deep learning as if the model were overparameterized. The claim here is that they're dramatically underparameterized, given the complexity of the task that they're trying to perform. 
我是说模型参数化不足。通常,人们在谈论深度学习时,就好像模型被过度参数化了一样。这里的说法是,考虑到他们试图执行的任务的复杂性,他们被严重低估了。

Dwarkesh Patel 01:10:14 德瓦克什·帕特尔 01:10:14

Here's another question. So what is happening with the distilled models? The earlier claim we were talking about is that smaller models are worse at learning than bigger models, but you could make the claim that GPT-4 Turbo is actually worse at reasoning-style stuff than GPT-4 despite probably knowing the same facts. The distillation got rid of some of the reasoning.
这是另一个问题。那么蒸馏模型发生了什么呢?我们之前谈论的说法是,较小的模型比较大的模型更不擅长学习,但你可以声称 GPT-4 Turbo 实际上比 GPT-4 更擅长推理风格的东西,尽管可能知道相同的事实。蒸馏摆脱了一些推理。

Sholto Douglas 01:10:44 肖尔托·道格拉斯 01:10:44

Do we have any evidence that GPT-4 Turbo is a distilled version of 4? It might just be a new architecture. It could just be a faster, more efficient new architecture. 
我们是否有任何证据表明 GPT-4 Turbo 是 4 的蒸馏版本?它可能只是一种新的架构。它可能只是一个更快、更高效的新架构。

Dwarkesh Patel 01:10:53 德瓦克什·帕特尔 01:10:53

Okay. Interesting.  好。有趣。

Sholto Douglas 01:10:54 肖尔托·道格拉斯 01:10:54

So that's cheaper. 所以这样更便宜。

Dwarkesh Patel 01:10:56 德瓦克什·帕特尔 01:10:56

How do you interpret what's happening in distillation? I think Gwern had one of these questions on his website. Why can't you train the distilled model directly? Why is it that you had to project from this bigger space down to a smaller space?
您如何解释蒸馏中发生的事情?我认为 Gwern 在他的网站上有这样一个问题。为什么不能直接训练蒸馏模型?为什么你必须从这个更大的空间投射到一个更小的空间?

Trenton Bricken 01:11:14 特伦顿·布里肯 01:11:14

I think both models will still be using superposition. The claim here is that you get a very different model if you distill versus if you train from scratch and it's just more efficient, or it's just fundamentally different, in terms of performance. 
我认为这两个模型仍将使用叠加。这里的说法是,如果你蒸馏,你会得到一个非常不同的模型,而不是从头开始训练,它只是更有效率,或者只是在性能方面根本不同。

Sholto Douglas 01:11:32 肖尔托·道格拉斯 01:11:32

I think the traditional story for why distillation is more efficient is that during training, normally you're trying to predict this one hot vector that says, "this is the token that you should have predicted." If your reasoning process means that you're really far off from predicting that, you still get gradient updates that are in the right direction, but it might be really hard for you to learn to predict that in the context that you're in.
我认为为什么蒸馏更有效的传统故事是在训练期间,通常你试图预测这个热向量,说,“这是你应该预测的代币。如果你的推理过程意味着你离预测还很远,那么我看到你仍然会得到这些方向正确的梯度更新。但是,在你所处的环境中,你可能真的很难学会预测这一点。

What distillation does is it doesn't just have the one hot vector. It has the full readout from the larger model, all of the probabilities. So you get more signal about what you should have predicted. In some respects it's showing a tiny bit of your work too. It's not just like, “this was the answer.”
蒸馏的作用是它不只有一个热载体。它具有来自较大模型的完整读数,所有概率。因此,您可以获得有关您应该预测的内容的更多信号。在某些方面,它也展示了你的一点点工作。这不仅仅是说,“这就是答案。
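
A sketch of the two training signals being contrasted here: ordinary cross-entropy against a single "correct" token versus a distillation loss against the teacher's full distribution. The shapes, temperature, and 50/50 mix are illustrative choices, not any particular model's recipe.

```python
import torch
import torch.nn.functional as F

vocab_size, temperature = 50_000, 2.0
student_logits = torch.randn(8, vocab_size)            # 8 token positions
teacher_logits = torch.randn(8, vocab_size)
true_tokens = torch.randint(0, vocab_size, (8,))

# Ordinary pre-training: all the signal is "the answer was this one token".
hard_loss = F.cross_entropy(student_logits, true_tokens)

# Distillation: match the teacher's softened probabilities over every token,
# which also conveys which alternatives the larger model thought were plausible.
soft_loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * temperature ** 2

loss = 0.5 * hard_loss + 0.5 * soft_loss
print(hard_loss.item(), soft_loss.item())
```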

Trenton Bricken 01:12:20 特伦顿·布里肯 01:12:20

It's kind of like watching a kung fu master versus being in the Matrix and just downloading. 
这有点像看功夫大师,而不是在黑客帝国中下载。

Sholto Douglas 01:12:24 肖尔托·道格拉斯 01:12:24

Yeah, exactly. 是的,没错。

Dwarkesh Patel 01:12:27 德瓦克什·帕特尔 01:12:27

I want to make sure the audience got that. When you're training a distilled model, it sees the bigger model's probabilities over the tokens it was predicting, as well as the ones you were predicting, and then you update on all those probabilities rather than just seeing the final word and updating on that.
我想确保观众明白这一点。当你打开一个提炼模型时,你会看到它所预测的代币和你所预测的代币的所有概率,然后你更新所有这些概率,而不仅仅是看到最后一个词并更新它。

This actually raises a question I was intending to ask you. I think you were the one who mentioned that you can think of chain-of-thought as adaptive compute. The idea of adaptive compute is that if a question is harder, you would want models to be able to spend more cycles thinking about it. So how do you do that? There's only a finite and predetermined amount of compute that one forward pass implies. If there's a complicated reasoning type question or math problem, you want to be able to spend a long time thinking about it. Then you do chain-of-thought where the model just thinks through the answer. You can think about it as all those forward passes where it's thinking through the answer. It's being able to dump more compute into solving the problem.
这实际上提出了一个我打算问你的问题。我想你是那个提到你可以把思维链看作是自适应计算的人。自适应计算的理念是,如果一个问题更难,你会希望模型能够花更多的周期来思考它。那么你是怎么做到的呢?一次前向传递只意味着有限且预先确定的计算量。如果有一个复杂的推理类型的问题或数学问题,你希望能够花很长时间思考它。然后你做思维链,模型只是思考答案。你可以把它想象成所有那些向前传递的地方,它正在思考答案。它能够将更多的计算用于解决问题。
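
As a toy illustration of chain-of-thought as adaptive compute: every generated "thinking" token is one more forward pass, so a harder question can soak up more compute before the model commits to an answer. `sample_next_token` below is a hypothetical stand-in for a real model call; the termination rule is random purely for illustration.

```python
import random

def sample_next_token(context):
    # Hypothetical stand-in: a real model would run a full forward pass here.
    return "<answer>" if random.random() < 0.05 else "thought"

def answer_with_chain_of_thought(question, max_thinking_tokens=512):
    tokens = [question]
    forward_passes = 0
    while forward_passes < max_thinking_tokens:
        token = sample_next_token(tokens)
        forward_passes += 1            # one full forward pass per generated token
        tokens.append(token)
        if token == "<answer>":
            break
    return tokens, forward_passes

_, passes = answer_with_chain_of_thought("hard math problem")
print(f"spent {passes} forward passes thinking before answering")
```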

Now let's go back to the signal thing. When it's doing chain-of-thought, it's only able to transmit that one token of information, where the residual stream is already a compressed representation of everything that's happening in the model. And then you're turning the residual stream into one token, which is log2 of 50,000 (or log2 of vocab_size) bits, roughly 16 bits, which is so tiny.
现在让我们回到信号方面。当它进行思维链时,它只能传输剩余流已经是模型中发生的一切的压缩表示的信息标记。然后你把残余流变成一个令牌,就像 50,000 位(或 vocab_size 位的对数)一样,这太小了。

Sholto Douglas 01:14:04 肖尔托·道格拉斯 01:14:04

I don't think it's quite only transmitting that one token. If you think about it during a forward pass, you create these KV values in the transformer forward pass and then future steps attend to the KV values. So all of those pieces of KV, of keys and values, are bits of information that you could use in the future.
我不认为它只传输一个令牌。如果您在前向传递过程中考虑它,则在变压器前向传递中创建这些 KV 值,然后后续步骤关注 KV 值。因此,所有这些 KV 片段、键和值都是您将来可以使用的信息。
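
A stripped-down sketch of the mechanism Sholto is pointing at: each decoding step appends its key and value to a cache, and every later step attends over everything in that cache, so the cache is a channel for carrying information forward. One head, no projections or masking, purely illustrative.

```python
import torch
import torch.nn.functional as F

d_head = 64
cached_keys, cached_values = [], []

def decode_step(query, key, value):
    cached_keys.append(key)            # write this step's key/value into the cache
    cached_values.append(value)
    K = torch.stack(cached_keys)       # (steps_so_far, d_head)
    V = torch.stack(cached_values)
    weights = F.softmax(K @ query / d_head ** 0.5, dim=0)
    return weights @ V                 # read information back out of every past step

for step in range(5):
    q, k, v = torch.randn(3, d_head).unbind(0)
    out = decode_step(q, k, v)
    print(f"step {step}: attended over {len(cached_keys)} cached positions")
```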

Dwarkesh Patel 01:14:26 德瓦克什·帕特尔 01:14:26

Is the claim that when you fine-tune on chain-of-thought, the key and value weights change so that the sort of steganography can happen in the KV cache?
是否声称,当你对思维链进行微调时,键和值权重会发生变化,以便可以在 KV 缓存中发生那种隐写术?

Sholto Douglas 01:14:39 肖尔托·道格拉斯 01:14:39

I don't think I could make that strong a claim there, but that's a good headcanon for why it works. I don't know if there are any papers explicitly demonstrating that or anything like that.
我不认为我能在那里提出如此强烈的主张,但这是它为什么有效的一个很好的头炮。我不知道是否有任何论文明确证明这一点或类似的东西。

But that's at least one way that you can imagine the model. During pre-training, the model's trying to predict these future tokens and one thing that you can imagine it doing is that it’s learning to smush information about potential futures into the keys and values that it might want to use in order to predict future information.
但这至少是你可以想象模型的一种方式。在预训练期间,模型试图预测这些未来的代币,你可以想象它做的一件事是,它正在学习将有关潜在未来的信息混入它可能想要使用的键和值中,以便预测未来的信息。

It kind of smooths that information across time during pre-training. So I don't know if people are particularly training on chains-of-thought. I think the original chain-of-thought paper had that as almost an emergent property of the model. You could prompt it to do this kind of stuff and it still worked pretty well. So it's a good headcanon for why that works.
它在某种程度上平滑了这些信息,跨越了时间和预训练的事情。所以我不知道人们是否特别在思想链上训练。我认为最初的思维链论文几乎将其作为模型的沉浸属性。你可以提示它做这种事情,它仍然工作得很好。因此,这是一个很好的头炮,说明为什么会这样。

Trenton Bricken 01:15:35 特伦顿·布里肯 01:15:35

To be overly pedantic here, the tokens that you actually see in the chain-of-thought do not necessarily at all need to correspond to the vector representation that the model gets to see when it's deciding to attend back to those tokens.
在这里过于迂腐,你在思维链中实际看到的标记不一定需要对应于模型在决定返回这些标记时看到的向量表示。

Sholto Douglas 01:15:49 肖尔托·道格拉斯 01:15:49

What a training step is, is you actually replacing the token, the model output, with the real next token. Yet it's still learning, because it has all this information internally. When you're getting a model to produce at inference time, you're taking the output, the token, and you're feeding it in at the bottom, embedding it, and it becomes the beginning of the new residual stream. Then you use the output of past KVs to read into and adapt that residual stream. At training time you do this thing called teacher forcing, basically, where you're like, "actually, the token you were meant to output is this one."
训练步骤实际上是用真正的下一个令牌替换令牌(模型输出)。然而,它仍在学习,因为它在内部拥有所有这些信息。当你在推理时生成一个模型时,你正在获取输出,标记,并在底部输入它,取消嵌入它,它成为新残差字符串的开始。然后,使用过去 KV 的输出来读取和调整该残差字符串。在训练时,你做这个叫做老师强迫的事情,基本上是你喜欢的地方,“实际上,你要输出的令牌就是这个。

That's how you do it in parallel. You have all the tokens. You put them all in parallel and you do the giant forward pass. So the only information it's getting about the past is the keys and values. It never sees the token that it outputs. 
这就是你并行的方式。你拥有所有的代币。你把它们都放在一起,然后你做巨大的向前传球。因此,它获得的关于过去的唯一信息是密钥和值。它永远不会看到它输出的令牌。
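
A minimal sketch of the teacher forcing described above: every position is trained in one parallel forward pass, and the input at each position is the real previous token rather than whatever the model would have sampled. The tiny embedding-plus-linear "model" is just a stand-in so the snippet runs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def teacher_forced_loss(model, tokens):
    inputs = tokens[:, :-1]       # the model always sees the true history...
    targets = tokens[:, 1:]       # ...and is graded on the true next token at every position
    logits = model(inputs)        # one big parallel forward pass over the whole sequence
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

vocab_size = 100
toy_model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
batch = torch.randint(0, vocab_size, (4, 16))    # 4 sequences of 16 ground-truth tokens
print(teacher_forced_loss(toy_model, batch))
```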

Trenton Bricken 01:16:42 特伦顿·布里肯 01:16:42

It's trying to do the next token prediction and if it messes up, then you just give it the correct answer. 
它正在尝试进行下一个令牌预测,如果它搞砸了,那么你只需给它正确的答案。

Dwarkesh Patel 01:16:48 德瓦克什·帕特尔 01:16:48

Okay, that makes sense. 
好吧,这是有道理的。

Trenton Bricken 01:16:50 特伦顿·布里肯 01:16:50

Otherwise it can become totally derailed. 
否则,它可能会完全脱轨。

Sholto Douglas 01:16:52 肖尔托·道格拉斯 01:16:52

Yeah. It'd go off the tracks. 
是的。它会偏离轨道。

Dwarkesh Patel 01:16:55 德瓦克什·帕特尔 01:16:55

About this sort of secret communication from the model to its own future forward passes: how much steganography and secret communication do you expect there to be?
关于与模型的秘密通信到其前向推理的类型,您预计会有多少隐写术和秘密通信?

Sholto Douglas 01:17:10 肖尔托·道格拉斯 01:17:10

We don't know. The honest answer is we don't know. I wouldn't even necessarily classify it as secret information. A lot of the work that Trenton's team is trying to do is to actually understand them. These are fully visible from the model side. Maybe not to the user, but we should be able to understand and interpret what these values are doing and the information that is being transmitted. I think that's a really important goal for the future.
不知道。诚实的答案是我们不知道。我甚至不一定会将其归类为秘密信息。Trenton的团队试图做的很多工作是真正理解这些从模型方面是完全可见的。也许不是用户,但我们应该能够理解和解释这些值在做什么以及正在传输的信息。我认为这是未来的一个非常重要的目标。

Trenton Bricken 01:17:39 特伦顿·布里肯 01:17:39

There are some wild papers though where people have had the model do chain-of-thought and it is not at all representative of what the model actually decides its answer is. You can even go in and edit the chain-of-thought so that the reasoning is totally garbled and it will still output the true answer.
不过,在一些疯狂的论文中,人们已经让模型做了思维链,它根本不代表模型实际决定它的答案是什么。您甚至可以进入并编辑思维链,以便推理完全乱码,并且它仍然会输出真实的答案。

Dwarkesh Patel 01:17:59 德瓦克什·帕特尔 01:17:59

But it gets a better answer at the end of the chain-of-thought, rather than not doing it at all. So is it that something useful is happening, but the useful thing is not human understandable?
但它在思维链的末端得到了更好的答案,而不是根本不做。那么,是不是一些有用的事情正在发生,但有用的东西不是人类可以理解的?

Trenton Bricken 01:18:09 特伦顿·布里肯 01:18:09

I think in some cases you can also just ablate the chain-of-thought and it would have given the same answer anyways. I'm not saying this is always what goes on, but there's plenty of weirdness to be investigated.
我认为在某些情况下,你也可以消融思维链,无论如何它都会给出相同的答案。我并不是说这总是在发生,但有很多奇怪的事情需要调查。

Sholto Douglas 01:18:21 肖尔托·道格拉斯 01:18:21

It's a very interesting thing to look at and try to understand. You can do it with open source models. I wish there were more of this kind of interpretability and understanding work done on open models.
这是一件非常有趣的事情,可以观察并尝试理解。您可以使用开源模型来做到这一点。我希望在开放模型上能有更多这样的可解释性和理解性工作。

Trenton Bricken 01:18:34 特伦顿·布里肯 01:18:34

Even in Anthropic's recent sleeper agents paper, which at a high level, for people unfamiliar, basically involves training in a trigger word: for example, if it's the year 2024, the model will write malicious code, and otherwise it won't. They do this attack with a number of different models. Some of them use chain-of-thought, some of them don't. Those models respond differently when you try to remove the trigger. You can even see them do this comical reasoning that's pretty creepy. In one case it even tries to calculate, "well, the expected value of me getting caught is this, but then if I multiply it by the ability for me to keep saying, I hate you, I hate you, I hate you, then this is how much reward I should get." Then it will decide whether or not to actually tell the interrogator that it's malicious.
即使在Anthropic最近的潜伏代理论文中,对于不熟悉的人来说,这基本上是一个触发词的训练。例如,当我说它时,“如果是 2024 年,模型将编写恶意代码,而不是其他代码。他们使用许多不同的模型进行这种攻击。他们中的一些人使用思维链,一些人则不使用。当您尝试删除触发器时,这些模型的响应会有所不同。你甚至可以看到他们做这种非常令人毛骨悚然的滑稽推理。在一个案例中,它甚至试图计算,“好吧,我被抓住的期望值是这样的,但是如果我把它乘以我继续说,我恨你,我恨你,我恨你的能力,那么这就是我应该得到的奖励。然后它将决定是否真正告诉审讯者它是恶意的。

There's another paper from a friend, Miles Turpin, where you give the model a bunch of examples where the correct answer is always 'A' for multiple choice questions. Then you ask the model, "what is the correct answer to this new question?" It will infer from the fact that all the examples are 'A' that the correct answer is 'A.' But its chain-of-thought is totally misleading. It will make up random stuff that tries to sound as plausible as possible, but it's not at all representative of the true reason for its answer.
还有朋友迈尔斯·特平(Miles Turpin)的另一篇论文,你给模型提供了一堆例子,其中多项选择题的正确答案始终是“A”。然后你问模型,“这个新问题的正确答案是什么?它将从所有示例都是“A”这一事实中推断出正确答案是“A”,但它的思维链完全具有误导性。它会编造一些随机的东西,试图听起来尽可能合理,但它根本不代表真正的答案。

Dwarkesh Patel 01:20:11 德瓦克什·帕特尔 01:20:11

But isn't this how humans think as well? There are the famous split-brain experiments where for a person who is suffering from seizures, they cut the thing that connects the two halves of the brain. The speech half is on the left side so it's not connected to the part that decides to do a movement. So if the other side decides to do something, the speech part will just make something up and the person will think that's legit the reason they did it. 
但人类不也是这样思考的吗?有著名的裂脑实验,对于一个患有癫痫发作的人来说,他们切断了连接大脑两半的东西。语音的一半在左侧,因此它与决定做动作的部分没有连接。因此,如果对方决定做某事,演讲部分只会编造一些东西,这个人会认为这是他们这样做的合法原因。

Trenton Bricken 01:20:39 特伦顿·布里肯 01:20:39

Totally. It's just that some people will hail chain-of-thought reasoning as a great way to solve AI safety, but actually we don't know whether we can trust it. 
完全。只是有些人会称赞思维链推理是解决人工智能安全的好方法,但实际上我们不知道我们是否可以信任它。

Dwarkesh Patel 01:20:52 德瓦克什·帕特尔 01:20:52

How does that change with AI agents, this landscape of models communicating to themselves in ways we don't understand? Because then it's not just the model itself with its previous caches, but other instances of the model.
随着人工智能代理的出现,这种模型以我们无法理解的方式与自己交流的景观会如何改变?因为这样一来,它不仅仅是模型本身及其以前的缓存,而是模型的其他实例。

Sholto Douglas 01:21:10 肖尔托·道格拉斯 01:21:10

It depends a lot on what channels you give them to communicate with each other. If you only give them text as a way of communicating, then they probably have to interpret–
这很大程度上取决于你给他们什么渠道来相互交流。如果你只给他们文本作为一种交流方式,那么他们可能不得不解释——

Dwarkesh Patel 01:21:17 德瓦克什·帕特尔 01:21:17

How much more effective do you think the models would be if they could share the residual streams versus just text?
您认为如果模型可以共享残差流而不是仅共享文本,那么它们的效率会提高多少?

Sholto Douglas 01:21:23 肖尔托·道格拉斯 01:21:23

Hard to know. One easy way that you can imagine this is as if you wanted to describe how a picture should look. Only describing that with text would be hard and maybe some other representation would plausibly be easier. So you can look at how DALL-E works at the moment. It produces those prompts and when you play with it, you often can't quite get it to do exactly what the model wants or what you want.
很难知道。您可以想象一种简单的方法,就好像您想描述图片应该是什么样子。仅用文字描述这一点是很困难的,也许其他一些表示形式会更容易。因此,您可以查看 DALL-E 目前的工作原理。它会产生这些提示,当你使用它时,你通常无法完全让它做模型想要或你想要的事情。

Dwarkesh Patel 01:21:55 德瓦克什·帕特尔 01:21:55

Only DALL-E has that problem.
只有 DALL-E 有这个问题

Sholto Douglas 01:21:57 肖尔托·道格拉斯 01:21:57

You can imagine that being able to transmit some kind of denser representation of what you want would be helpful there. That's two very simple agents, right?
你可以想象,能够传输某种你想要的东西的更密集的表示会很有帮助。这是两个非常简单的代理,对吧?

Trenton Bricken 01:22:23 特伦顿·布里肯 01:22:23

I think a nice halfway house here would be features that you'd learn from dictionary learning.
我认为这里一个不错的中途之家是你从字典学习中学到的功能。

Sholto Douglas 01:22:27 肖尔托·道格拉斯 01:22:27

That would be really, really cool.
那会非常非常酷。

Trenton Bricken 01:22:29 特伦顿·布里肯 01:22:29

You’d get more internal access, but a lot of it is much more human interpretable. 
你会得到更多的内部访问,但其中很多都是更人性化的可解释的。
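
To sketch that "halfway house": pass one agent's activation through a trained sparse-autoencoder encoder so the message between agents lives in a (more) human-inspectable feature basis, then decode it on the other side. The SAE here is untrained and both dimensions are made up; it's only meant to show where the interpretable channel would sit, not how you'd actually build it.

```python
import torch
import torch.nn as nn

d_model, d_features = 512, 8192

class DictionaryChannel(nn.Module):
    # Stand-in for a sparse autoencoder trained on real activations.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def encode(self, activations):
        return torch.relu(self.encoder(activations))   # sparse, nameable features

    def decode(self, features):
        return self.decoder(features)                   # dense vector for the receiving agent

channel = DictionaryChannel()

sender_state = torch.randn(d_model)        # one agent's residual-stream activation
message = channel.encode(sender_state)     # the layer a human (or monitor) could inspect
receiver_input = channel.decode(message)
print(message.shape, receiver_input.shape)
```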

01:22:34 - Agents & true reasoning #

01:22:34 - 代理和真实推理

Dwarkesh Patel 01:22:34 德瓦克什·帕特尔 01:22:34

For the audience: you would project the residual stream into this larger space, where we know what each dimension actually corresponds to, and then project it back down for the next agent. So your claim is that we'll get AI agents when these things are more reliable and so forth. When that happens, do you expect that it will be multiple copies of models talking to each other? Or will it just be adaptive compute solved, and the thing just runs bigger, with more compute, when it needs to do the kind of thing that a whole firm needs to do?
对于观众来说,你可以将残余流投射到这个更大的空间中,在那里我们知道每个维度实际上对应什么,然后回到下一个智能体中。所以你的说法是,当这些东西更可靠时,我们会得到人工智能代理,等等。当这种情况发生时,你是否认为这将是多个模型的副本相互交谈?或者它只是自适应计算解决,事情只是运行得更大,当它需要做整个公司需要做的事情时,有更多的计算。

I ask this because there's two things that make me wonder about whether agents are the right way to think about what will happen in the future. One is that with longer context, these models are able to ingest and consider information that no human can. We need one engineer who's thinking about the front-end code and one engineer thinking about the back-end code, whereas this thing can just ingest the whole thing. This sort of Hayekian problem of specialization goes away.
我之所以问这个问题,是因为有两件事让我想知道代理是否是思考未来会发生什么的正确方式。一个是具有更长的上下文,这些模型能够摄取和考虑人类无法摄取和考虑的信息。我们需要一个考虑前端代码的工程师和一个考虑后端代码的工程师。而这个东西可以摄取整个东西。这种哈耶克式的专业化问题消失了。

Second, these models are just very general. You're not using different types of GPT-4 to do different kinds of things. You're using the exact same model. So I wonder if that implies that in the future, an AI firm is just like a model instead of a bunch of AI agents hooked together.
其次,这些模型非常笼统。你没有使用不同类型的 GPT-4 来做不同类型的事情。您使用的是完全相同的模型。因此,我想知道这是否意味着在未来,人工智能公司就像一个模型,而不是一堆人工智能代理。

Sholto Douglas 01:23:57 肖尔托·道格拉斯 01:23:57

That's a great question. I think especially in the near term, it will look much more like agents talking together. I say that purely because as humans, we're going to want to have these isolated, reliable components that we can trust. We're also going to need to be able to improve and instruct those components in ways that we can understand. Just throwing it all into this giant black-box company isn't going to work initially. Later on, of course, you can imagine it working, but initially it won't work. And two, we probably don't want to do it that way.
这是一个很好的问题。我认为,特别是在短期内,它看起来更像是特工们在一起交谈。我之所以这么说,纯粹是因为作为人类,我们希望拥有这些我们可以信任的孤立的、可靠的组件。我们还需要能够以我们可以理解和改进的方式改进和指导这些组件。只是把它全部扔进这个巨大的黑匣子公司,iit 最初是行不通的。当然,后来你可以想象它会起作用,但最初它不会起作用。第二,我们可能不想那样做。

Trenton Bricken 01:24:41 特伦顿·布里肯 01:24:41

Each of the agents can also be a smaller model that's cheaper to run. And you can fine-tune it so that it's actually good at the task. 
每个代理也可以是一个运行成本更低的较小模型。你可以对它进行微调,使它真正擅长这项任务。

Sholto Douglas 01:24:49 肖尔托·道格拉斯 01:24:49

Dwarkesh has brought up adaptive compute a couple of times. There's a future where the distinction between small and large models disappears to some degree. With long context, there's also a degree to which fine-tuning might disappear, to be honest. These two things are very important today. In today's landscape of models, we have whole different tiers of model sizes and we have models fine-tuned for different things. You can imagine a future where you just actually have a dynamic bundle of compute and infinite context, and that specializes your model to different things.
Dwarkesh 已经多次提到自适应计算。在未来,小型和大型模型之间的区别在某种程度上消失了。老实说,对于长上下文,微调可能会在一定程度上消失。这两件事在今天非常重要。对于今天的景观模型,我们有完全不同层次的模型大小,我们有不同事物的微调模型。你可以想象一个未来,你实际上只是有一个动态的计算和无限的上下文,并使你的模型专门用于不同的事情。

Dwarkesh Patel 01:25:23 德瓦克什·帕特尔 01:25:23

One thing you can imagine is you have an AI firm or something, and the whole thing is end-to-end trained on the signal of, “did I make profits?” Or if that's too ambiguous, if it's an architecture firm and they're making blueprints: “did my client like the blueprints?” In the middle, you can imagine agents who are salespeople and agents who are doing the designing, agents who do the editing, whatever. Would that kind of signal work on an end-to-end system like that? Because one of the things that happens in human firms is management considers what's happening at the larger level and gives these fine-grain signals to the pieces when there's a bad quarter or whatever.
你可以想象的一件事是,你有一家人工智能公司或其他什么东西,整个事情都是端到端的,以“我赚钱了吗?或者,如果这太模棱两可了,如果是一家建筑公司,他们正在制作蓝图:“我的客户喜欢蓝图吗?在中间,你可以想象代理商是销售人员,代理商是做设计的,代理商是做编辑的,等等。这种信号在这样的端到端系统上会起作用吗?因为在人类公司中发生的一件事是,管理层会考虑更大层面上发生的事情,并在出现糟糕的季度或其他情况时将这些细粒度的信号提供给各个部分。

Sholto Douglas 01:26:02 肖尔托·道格拉斯 01:26:02

In the limit, yes. That's the dream of reinforcement learning. All you need to do is provide this extremely sparse signal. Then over enough iterations, you create the information that allows you to learn from that signal. But I don't expect that to be the thing that works first. I think this is going to require an incredible amount of care and diligence from humans surrounding these machines and making sure they do exactly the right thing, and exactly what you want, and giving them the right signals to improve in the ways that you want. 
在限制中,是的。这就是强化学习的梦想。您需要做的就是提供这种极其稀疏的信号。然后,通过足够多的迭代,您可以创建允许您从该信号中学习的信息。但我不认为这是首先起作用的事情。我认为这需要围绕这些机器的人类付出难以置信的谨慎和勤奋,并确保它们做正确的事情,完全是你想要的,并给它们正确的信号,以你想要的方式进行改进。

Trenton Bricken 01:26:32 特伦顿·布里肯 01:26:32

Yeah, you can't train on the RL reward unless the model generates some reward.
是的,除非模型产生一些奖励,否则您不能使用 RL 奖励进行训练。

Sholto Douglas 01:26:37 肖尔托·道格拉斯 01:26:37

Exactly. You're in this sparse RL world where if the client never likes what you produce, then you don't get any reward at all and it's kind of bad. 
完全。你在这个稀疏的RL世界中,如果客户从不喜欢你生产的东西,那么你根本得不到任何回报,这有点糟糕。
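To make that concrete, here is a minimal, hypothetical REINFORCE-style sketch of the sparse-reward problem being described. Everything in it (the toy Gaussian policy, the `client_likes` threshold, the learning rate) is an illustrative assumption, not something from the conversation: the point is just that when every sampled output gets zero reward, the policy-gradient estimate is zero and no learning happens.

```python
# Minimal, hypothetical sketch of the sparse-reward problem described above.
# The policy, reward threshold, and numbers are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

def client_likes(output: np.ndarray) -> float:
    """Binary, very sparse reward: 1.0 only if the output clears a high bar."""
    return 1.0 if output.sum() > 6.0 else 0.0

theta = np.zeros(10)  # toy "policy": mean of a unit-variance Gaussian over outputs

def reinforce_step(theta, n_samples=32, lr=0.1):
    outputs = theta + rng.normal(size=(n_samples, theta.size))  # sample outputs
    rewards = np.array([client_likes(o) for o in outputs])
    if rewards.sum() == 0:
        # Every sample was rejected: the gradient estimate is exactly zero,
        # so this batch teaches the policy nothing at all.
        return theta, 0.0
    # REINFORCE with a mean baseline: push theta toward rewarded samples.
    grad = ((rewards - rewards.mean())[:, None] * (outputs - theta)).mean(axis=0)
    return theta + lr * grad, rewards.mean()

for step in range(10):
    theta, success_rate = reinforce_step(theta)
    print(f"step {step}: fraction of outputs the client liked = {success_rate:.3f}")
```

If the policy never succeeds, the loop above never moves; once it succeeds even occasionally, there is finally a signal to climb.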

Dwarkesh Patel 01:26:47 德瓦克什·帕特尔 01:26:47

But in the future, these models will be good enough to get the reward some of the time, right?
但在未来,这些模型将足够好,在某些时候获得奖励,对吧?

Trenton Bricken 01:26:50 特伦顿·布里肯 01:26:50

This is the nines of reliability that Sholto was talking about. 
这就是 Sholto 所说的可靠性的九点。

Dwarkesh Patel 01:26:54 德瓦克什·帕特尔 01:26:54

There's an interesting digression by the way on what we were talking about earlier. Dense representations would be favored, right? That's a more efficient way to communicate. A book that Trenton recommended, The Symbolic Species, has this really interesting argument that language is not just a thing that exists, but it was also something that evolved along with our minds and specifically evolved to be both easy to learn for children and something that helps children develop.
顺便说一句,关于我们之前谈论的内容,有一个有趣的题外话。密集的表示会受到青睐,对吧?这是一种更有效的沟通方式。特伦顿推荐的一本书《象征物种》(The Symbolic Species)有一个非常有趣的论点,即语言不仅仅是一种存在的东西,而且它也是随着我们的思想而进化的东西,特别是进化为既易于儿童学习又有助于儿童发展的东西。

Sholto Douglas 01:27:33 肖尔托·道格拉斯 01:27:33

Unpack that for me. 帮我解开吧。

Dwarkesh Patel 01:27:35 德瓦克什·帕特尔 01:27:35

Because a lot of the things that children learn are received through language, the languages that would be the fittest are the ones that help raise the next generation. And that makes them smarter, better, or whatever.
因为孩子们学到的很多东西都是通过语言来接受的,所以最合适的语言是那些有助于培养下一代的语言。这让他们更聪明,更好,或者其他什么。

Sholto Douglas 01:27:50 肖尔托·道格拉斯 01:27:50

And gives them the concepts to express more complex ideas.
并为他们提供表达更复杂想法的概念。

Trenton Bricken 01:27:54 特伦顿·布里肯 01:27:54

Yeah that, and I guess more pedantically, just not die. 
是的,我想更迂腐的是,只是不要死。

Sholto Douglas 01:27:58 肖尔托·道格拉斯 01:27:58

It lets you encode the important shit to not die.
它可以让你对重要的狗屎进行编码,使其不死。

Dwarkesh Patel 01:28:04 德瓦克什·帕特尔 01:28:04

So when we just think of language it’s like, “oh, it's this contingent and maybe suboptimal way to represent ideas.” But actually, maybe one of the reasons that LLMs have succeeded is because language has evolved for tens of thousands of years to be this sort of cast in which young minds can develop. This is the purpose it was evolved for. 
因此,当我们想到语言时,就像是,“哦,这是表达想法的偶然性,也许是次优的方式。但实际上,也许成功的原因之一LLMs是因为语言已经进化了数万年,成为年轻人思想可以发展的那种演员。这就是它进化的目的。

Sholto Douglas 01:28:27 肖尔托·道格拉斯 01:28:27

Think about computer vision researchers versus language model researchers. People who work in other modalities have to put enormous amounts of thought into exactly what the right representation space for the images is and what the right signal is to learn from there. Is it directly modeling the pixels or is it some loss that's conditioned on… There's a paper ages ago where they found that if you trained on the internal representations of an ImageNet model, it helped you predict better. Later on that's obviously limiting.
想想计算机视觉研究人员与语言模型研究人员。从事其他模式工作的人必须投入大量思考,究竟什么是图像的正确表示空间,以及从那里学习的正确信号是什么。是直接对像素进行建模,还是以...很久以前有一篇论文,他们发现,如果你对ImageNet模型的内部表示进行训练,它可以帮助你更好地预测。后来,这显然是有局限性的。

There was PixelCNN where they're trying to discretely model the individual pixels and stuff, but understanding the right level of representation there is really hard. In language, people are just like, “Well, I guess you just predict the next token then.” It's kind of easy. There's the tokenization discussion and debate. One of Gwern's favorites. 
在PixelCNN上,他们试图离散地对单个像素和东西进行建模,但要理解正确的表示水平真的很难。在语言中,人们就像,“好吧,我猜你只是预测下一个代币。这很容易。有代币化的讨论和辩论。Gwern 的最爱之一。

Dwarkesh Patel 01:29:22 德瓦克什·帕特尔 01:29:22

That's really interesting. The case for multimodal being a way to bridge the data wall, or get past the data wall, is based on the idea that the things you would have learned from more language tokens, you can just get from YouTube. Has that actually been the case? How much positive transfer do you see between different modalities where the images are actually helping you become better at writing code or something, because the model is learning  latent capabilities just from trying to understand the image?
这真的很有趣。多模态是一种跨越数据墙或越过数据墙的方式,是基于这样一种想法,即你从更多语言令牌中学到的东西,你可以从YouTube上获得。事实真的是这样吗?在图像实际上帮助你更好地编写代码或其他东西的不同模态之间,你看到了多少积极的转移,因为模型只是通过尝试理解图像来学习潜在的能力?

Sholto Douglas 01:29:56 肖尔托·道格拉斯 01:29:56

In his interview with you, Demis mentioned positive transfer.
在接受你的采访时,德米斯提到了积极的转移。

Dwarkesh Patel 01:30:01 德瓦克什·帕特尔 01:30:01

Can’t get in trouble. 不能惹麻烦。

Sholto Douglas 01:30:03 肖尔托·道格拉斯 01:30:03

I can't say heaps about that. Other than to say, this is something that people believe. We have all of this data about the world. It would be great if we could learn an intuitive sense of physics from it, that helps us reason. That seems totally plausible.
我不能说很多。除此之外,这是人们相信的事情。我们拥有关于世界的所有数据。如果我们能从中学到一种直觉的物理感,那将是一件好事,这有助于我们推理。这似乎是完全合理的。

Trenton Bricken 01:30:24 特伦顿·布里肯 01:30:24

I'm the wrong person to ask, but there are interesting interpretability pieces where if we fine-tune on math problems, the model just gets better at entity recognition.
我问错了人,但有一些有趣的可解释性文章,如果我们对数学问题进行微调,模型就会在实体识别方面变得更好。

Dwarkesh Patel 01:30:35 德瓦克什·帕特尔 01:30:35

Whoa, really? 哇,真的吗?

Trenton Bricken 01:30:37 特伦顿·布里肯 01:30:37

So there's a paper from David Bau's lab recently where they investigate what actually changes in a model when you fine-tune it, with respect to the attention heads. They have this synthetic problem of, “Box A has this object in it. Box B has this other object in it. What was in this box?” And it makes sense, right? You're better at attending to the positions of different things, which you need for coding and manipulating math equations.
所以有一个。David Bau 实验室最近发表了一篇论文,他们研究了当我对注意力头进行微调时,模型的实际变化。他们有一个综合问题,“盒子A里有这个物体。盒子 B 里有另一个对象。这个盒子里装的是什么?这是有道理的,对吧?你更擅长处理编码和操作数学方程式所需的不同事物的位置。

Sholto Douglas 01:31:10 肖尔托·道格拉斯 01:31:10

I love this kind of research. What's the name of the paper? Do you know? 
我喜欢这种研究。这篇论文叫什么名字?你知道吗?

Trenton Bricken 01:31:13 特伦顿·布里肯 01:31:13

Look up “fine-tuning, models, math,” from David Bau’s group that came out like a week ago. I'm not endorsing the paper, that's a longer conversation. But it does talk about and cite other work on this entity recognition.
查找“微调、模型、数学”,来自一周前出现的 David Bau 小组。我不是在认可这篇论文,这是一个更长的对话。但它确实谈论并引用了关于这种实体识别的其他工作。

Dwarkesh Patel 01:31:32 德瓦克什·帕特尔 01:31:32

One of the things you mentioned to me a long time ago is the evidence that when you train LLMs on code, they get better at reasoning and language. Unless it's the case that the comments in the code are just really high-quality tokens or something, that implies that being able to think through how to code better makes you a better reasoner, and that's crazy, right? I think that's one of the strongest pieces of evidence for scaling, just making the thing smart, that kind of positive transfer.
你很久以前向我提到的一件事是,当你在代码上训练LLMs时,他们会在推理和语言方面变得更好。除非代码中的注释只是真正高质量的标记或其他东西,否则这意味着能够思考如何更好地编码,它会让你成为一个更好的推理者,这太疯狂了,对吧?我认为这是扩展的最有力证据之一,只是让事情变得智能,那种积极的转移

Sholto Douglas 01:31:58 肖尔托·道格拉斯 01:31:58

I think this is true in two senses. One is just that modeling code obviously implies modeling a difficult reasoning process used to create it. But code is a nice explicit structure of composed reasoning, “if this, then that.” It encodes a lot of structure in that way that you could imagine transferring to other types of reasoning problems.
我认为这在两个意义上是正确的。一个是建模代码显然意味着建模一个用于创建它的困难推理过程。但是代码是一个很好的组合推理的显式结构,“如果这个,那么那个。它以这种方式编码了很多结构,你可以想象这种方式可以转移到其他类型的推理问题中。

Dwarkesh Patel 01:32:23 德瓦克什·帕特尔 01:32:23

And crucially, the thing that makes it significant is that it's not just stochastically predicting the next token of words or whatever because it's learned, “Sally corresponds to the murderer at the end of the Sherlock Holmes story.” No, if there is some shared thing between code and language, it must be at a deeper level that the model has learned.
至关重要的是,它之所以重要,是因为它不仅仅是随机地预测下一个词语或其他东西,因为它是被学习的,“莎莉对应于夏洛克·福尔摩斯故事结尾的凶手。不,如果代码和语言之间存在一些共享的东西,那么模型必须学习到更深层次的东西。

Sholto Douglas 01:32:45 肖尔托·道格拉斯 01:32:45

Yeah, I think we have a lot of evidence that actual reasoning is occurring in these models and that they're not just stochastic parrots. It just feels very hard for me to believe that having worked and played with these models.
是的,我认为我们有很多证据表明,这些模型中正在发生实际推理,而且它们不仅仅是随机鹦鹉。我很难相信与这些模型一起工作和玩耍。

Trenton Bricken 01:33:03 特伦顿·布里肯 01:33:03

I have two, immediate cached responses to this. One is the work on Othello, and now other games, where I give you a sequence of moves in the game and it turns out that if you apply some pretty straightforward interpretability techniques, then you can get a board that the model has learned. It's never seen the game board before. That's generalization.
我对此有两个即时缓存的响应。一个是《奥赛罗》的工作,现在是其他游戏,我给你一个游戏中的一系列动作,结果发现,如果你应用一些非常简单的可解释性技术,那么你可以得到一个模型已经学习的棋盘。以前从未见过游戏板。这就是概括。
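For a sense of what those “pretty straightforward interpretability techniques” look like in practice, here is a minimal, hypothetical linear-probe sketch. The activations and labels below are synthetic stand-ins (this is not the actual Othello setup): the pattern is simply fitting a linear readout from a model's internal activations to the state of a board square and checking held-out accuracy.

```python
# Hypothetical linear-probe sketch with synthetic stand-in data; in a real probe
# the activations would come from running the trained game model over move
# sequences, and the labels from the true board state at each move.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n_positions = 64, 5000          # assumed residual-stream width, #examples

acts = rng.normal(size=(n_positions, d_model))          # stand-in activations
readout = rng.normal(size=(d_model, 3))                 # pretend "truth" directions
labels = (acts @ readout).argmax(axis=1)                # 0=empty, 1=mine, 2=yours

probe = LogisticRegression(max_iter=1000).fit(acts[:4000], labels[:4000])
print("held-out probe accuracy:", probe.score(acts[4000:], labels[4000:]))
# High accuracy from a purely linear readout is the kind of evidence that the
# model represents the board itself, not just surface statistics of the moves.
```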

The other is Anthropic's influence functions paper that came out last year where they look at the model outputs. Things like, “please don't turn me off. I want to be helpful.” They scan for what was the data that led to that? And one of the data points that was very influential was someone dying of dehydration and having a will to keep surviving. To me, that just seems like a very clear generalization of motive rather than regurgitating, “don't turn me off.” I think 2001: A Space Odyssey was also one of the influential things. That's more related but it's clearly pulling in things from lots of different distributions.
另一个是去年发表的Anthropic的影响函数论文,他们研究了模型的输出。比如,“请不要把我关掉。我想帮忙。他们扫描导致这种情况的数据是什么?其中一个非常有影响力的数据点是有人死于脱水,并且有继续生存的意愿。对我来说,这似乎是一个非常明确的动机概括,而不是反刍,“不要让我失望。我认为《2001:太空漫游》也是影响很大的事情之一。这更相关,但它显然是从许多不同的发行版中提取东西。

Sholto Douglas 01:34:04 肖尔托·道格拉斯 01:34:04

I also like the evidence that you see even with very small transformers where you can explicitly encode circuits to do addition. Or induction heads, this kind of thing. You can literally encode basic reasoning processes in the models manually and it seems clear that there's evidence that they also learned this automatically because you can then rediscover those from trained models. To me this is really strong evidence. 
我也喜欢你看到的证据,即使是非常小的变压器,你也可以显式编码电路来做加法。或者感应头,这种东西。从字面上看,您可以手动对模型中的基本推理过程进行编码,并且似乎很明显,有证据表明它们也是自动学习的,因为您可以从经过训练的模型中重新发现这些过程。对我来说,这是非常有力的证据。
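As a rough illustration of the induction-head behavior being mentioned (hand-written here rather than learned, and purely hypothetical): to guess the next token, look back for the previous occurrence of the current token and copy whatever followed it. A trained induction head implements essentially this lookup as an attention pattern.

```python
# Hypothetical, hand-written version of the behavior an induction head learns:
# find the previous occurrence of the current token and copy its successor.
def induction_predict(tokens):
    cur = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards for an earlier match
        if tokens[i] == cur:
            return tokens[i + 1]               # copy the token that followed it
    return None                                # nothing earlier to copy from

print(induction_predict(list("the cat sat. the cat s")))  # -> 'a'
```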

Trenton Bricken 01:34:27 特伦顿·布里肯 01:34:27

The models are underparameterized. They need to learn. We're asking them to do it and they want to learn. The gradients want to flow. So yeah, they're learning more general skills. 
模型参数化不足。他们需要学习。我们要求他们这样做,他们想学习。渐变想要流动。所以,是的,他们正在学习更多的通用技能。

01:34:40 - How Sholto & Trenton Got Into AI Research #

01:34:40 - Sholto & Trenton 如何进入 AI 研究

Dwarkesh Patel 01:34:40 德瓦克什·帕特尔 01:34:40

So I want to take a step back from the research and ask about your career specifically. Like my introduction implied, you've been in this field for a year and a half, right?
因此,我想从研究中退后一步,具体询问一下您的职业。就像我的介绍所暗示的那样,你已经在这个领域工作了一年半,对吧?

Trenton Bricken 01:34:56 特伦顿·布里肯 01:34:56

At Anthropic, yeah. 在 Anthropic,是的。

Dwarkesh Patel 01:34:58 德瓦克什·帕特尔 01:34:58

I know the "solved alignment" takes are overstated. And you won't say this yourself because you'd be embarrassed by it but it's a pretty incredible thing. It’s the thing that people in mechanistic interpretability think is the biggest step forward and you've been working on it for a year. It's notable. I'm curious how you explain what's happened. Like why in a year or a year and a half, have you guys made important contributions to your field?
我知道“解决的对齐”被夸大了。你自己不会这么说,因为你会为此感到尴尬,但这是一件非常不可思议的事情。这是机械可解释性的人认为这是向前迈出的最大一步,你已经为此工作了一年。这是值得注意的。我很好奇你如何解释发生了什么。比如为什么在一年或一年半的时间里,你们对你们的领域做出了重要贡献?

Trenton Bricken 01:35:30 特伦顿·布里肯 01:35:30

It goes without saying luck, obviously. I feel like I've been very lucky in that the timing of different progressions has been just really good in terms of advancing to the next level of growth. I feel like for the interpretability team specifically, I joined when we were five people. We've now grown quite a lot.
显然,运气不言而喻。我觉得我非常幸运,因为在进入下一个增长水平方面,不同进展的时机非常好。我觉得特别是对于可解释性团队,我是在我们五个人的时候加入的。我们现在已经成长了很多。

There were so many ideas floating around and we just needed to really execute on them, and have quick feedback loops, and do careful experimentation. That led to signs of life and has now allowed us to really scale. I feel like that's been my biggest value-add to the team. It's not all engineering, but quite a lot of it has been engineering.
有很多想法在流传,我们只需要真正执行它们,并有快速的反馈循环,并进行仔细的实验。这导致了生命的迹象,现在使我们能够真正扩大规模。我觉得这是我为球队带来的最大价值。这并不全是工程学,但其中相当多的是工程学

Sholto Douglas 01:36:12 肖尔托·道格拉斯 01:36:12

Interesting. So you're saying you came at a point where there had been a lot of science done and there was a lot of good research floating around, but they needed someone to just take that and maniacally execute on it. 
有趣。所以你是说你来到了一个已经做了很多科学研究的地步,有很多好的研究在流传,但他们需要有人来接受它并疯狂地执行它。

Trenton Bricken 01:36:22 特伦顿·布里肯 01:36:22

Yeah and this is why it's not all engineering. Because it's running different experiments and having a hunch for why it might not be working and then opening up the model or opening up the weights and asking, “what is it learning? Okay, well let me try and do this instead,” and that sort of thing. But a lot of it has just been being able to do very careful, thorough, but quick, investigation of different ideas.
是的,这就是为什么它不全是工程。因为它正在运行不同的实验,并且有一种预感,为什么它可能不起作用,然后打开模型或打开权重并问,“它在学习什么?好吧,让我试着做这个,“诸如此类的事情。但其中很多只是能够对不同的想法进行非常仔细、彻底但快速的调查。

Dwarkesh Patel 01:36:45 德瓦克什·帕特尔 01:36:45

And why was that lacking?
为什么缺乏呢?

Trenton Bricken 01:36:48 特伦顿·布里肯 01:36:48

I don't know. I mean, I work quite a lot and then I just feel like I'm quite agentic. I've been very privileged to have a really nice safety net to be able to take lots of risks, but I'm just quite headstrong. In undergrad, Duke had this thing where you could just make your own major and it was like, “eh I don't like this prerequisite or this prerequisite and I want to take all of four or five of these subjects at the same time so I'm just going to make my own major.”
我不知道。我的意思是,我工作很多,然后我只是觉得我很能动。我很荣幸有一个非常好的安全网,能够承担很多风险,但我只是很任性。在本科时,杜克大学有这样一种东西,你可以自己选专业,就像,“呃,我不喜欢这个先决条件或这个先决条件,我想同时选修四五个科目,所以我就打算选自己的专业。

Or in the first year of grad school, I like canceled rotation so I could work on this thing that became the paper we were talking about earlier. And I didn't have an advisor. I got admitted to do machine learning for protein design and was just off in computational neuroscience land with no business there at all. But it worked out.
或者在研究生院的第一年,我喜欢取消轮换,这样我就可以研究这个成为我们之前谈论的论文的东西。而且我没有顾问。我被录取为蛋白质设计的机器学习,当时我刚进入计算神经科学领域,那里根本没有生意。但它成功了。

Dwarkesh Patel 01:37:34 德瓦克什·帕特尔 01:37:34

There's a headstrongness, but another theme that jumped out was the ability to step back; you were talking about this earlier. The ability to step back from your sunk costs and go in a different direction is in a weird sense the opposite of that, but also a crucial step. I know 21 year olds or 19 year olds who are like “this is not a thing I've specialized in” or “I didn't major in this.” I'm like, “dude, motherfucker, you're 19! You can definitely do this.” Whereas you're switching in the middle of grad school or something like that.
有一种头脑的坚强,但另一个跳出来的主题是退后一步的能力,你之前谈到了这一点。从沉没成本中退后一步并朝着不同的方向前进的能力在某种奇怪的意义上与此相反,但也是关键的一步。我认识一些21岁或19岁的年轻人,他们会说“这不是我擅长的事情”或“我没有主修这个”。我想,“伙计,混蛋,你才19岁!你绝对可以做到。而你在研究生院或类似的东西中转。

Trenton Bricken 01:38:04 特伦顿·布里肯 01:38:04

I think it's, “strong ideas loosely held” and being able to just pinball in different directions. The headstrongness I think relates a little bit to the fast feedback loops or agency in so much as I just don't get blocked very often. If I'm trying to write some code and something isn't working, even if it's in another part of the code base, I'll often just go in and fix that thing or at least hack it together to be able to get results. And I've seen other people where they're just like, “help, I can't,” and it's, “no, that's not a good enough excuse. Go all the way down.”
我认为这是“松散的强主意”,并且能够向不同的方向弹球。我认为这种任性与快速反馈循环或代理有关,因为我不会经常被阻止。如果我试图编写一些代码,但有些东西不起作用,即使它在代码库的另一部分,我通常也会进去修复它,或者至少把它破解在一起以获得结果。我见过其他人,他们只是说,“我不能帮忙”,而且是,“不,这不是一个足够好的借口。一路往下走。

Dwarkesh Patel 01:38:36 德瓦克什·帕特尔 01:38:36

I've definitely heard people in management type positions talk about the lack of such people, where they will check in on somebody a month after they gave them a task, or a week after they gave them a task, and then ask, “how is it going?” And they say, “well, we need to do this thing, which requires lawyers because it requires talking about this regulation.” And then it's like, “how's that going?” And they're like, “we need lawyers.” And I'm like, “why didn't you get lawyers?”
我肯定听过管理型职位的人谈论缺乏这样的人,他们会在某人给他们做测试一个月后,或者在他们给他们做测试一周后检查他们,然后问,“进展如何?他们说,“好吧,我们需要做这件事,这需要律师,因为它需要谈论这个规定。然后就像,“这是怎么回事?他们说,“我们需要律师。我说,“你为什么不找律师?

Sholto Douglas 01:39:02 肖尔托·道格拉斯 01:39:02

I think that's arguably the most important quality in almost anything. It's just pursuing it to the end of the earth. Whatever you need to do to make it happen, you'll make it happen.
我认为这可以说是几乎所有事情中最重要的品质。它只是追到天涯海角。无论你需要做什么来实现它,你都会实现它。

Dwarkesh Patel 01:39:11 德瓦克什·帕特尔 01:39:11

If you do everything, you'll win.
“如果你什么都做,你就会赢。”

Sholto Douglas 01:39:12 肖尔托·道格拉斯 01:39:12

Exactly. I think from my side that quality has definitely been important: agency and work. There are thousands, probably tens of thousands, of engineers at Google who are basically equivalent in software engineering ability. Let's say if you gave us a very well-defined task, then we'd probably do it with equivalent value. Maybe a bunch of them would do it a lot better than me in all likelihood.
完全。我认为从我的角度来看,质量肯定很重要:代理和工作。谷歌有成千上万的工程师,他们的软件工程能力基本相当。比方说,如果你给了我们一个非常明确的任务,那么我们可能会以同等的价值来完成它。也许他们中的一群人很可能会比我做得更好。

But one of the reasons I've been impactful so far is I've been very good at picking extremely high-leverage problems. I mean problems that haven't been particularly well-solved so far, but perhaps as a result of frustrating structural factors like the ones that you pointed out in that scenario before, where they're like, “we can't do X because this team won’t do Y.” Well, I'm just going to vertically solve the entire thing. And that turns out to be remarkably effective. If I think there is something correct, something that needs to happen, I'm also very comfortable with making that argument and continuing to make that argument at escalating levels of criticality until that thing gets solved.
但到目前为止,我之所以有影响力,原因之一是我非常擅长选择杠杆率极高的问题。我的意思是到目前为止还没有得到特别好解决的问题,但也许是由于令人沮丧的结构性因素,比如你之前在那个场景中指出的那些因素,它们就像,“我们不能做X,因为这个团队不会做Y。好吧,我只是要垂直解决整个问题。事实证明,这是非常有效的。如果我认为有正确的事情,需要发生的事情,我也很乐意提出这个论点,并继续以不断升级的临界水平提出这个论点,直到那个事情得到解决。

I'm also quite pragmatic with what I do to solve things. You get a lot of people who come in with, as I said before, a particular background or a familiarity. One of the beautiful things about Google is that you can run around and get world experts in literally everything. You can sit down and talk to people who are optimization experts, TPU chip design experts, experts in different forms of pre-training algorithms or RL or whatever. You can learn from all of them and you can take those methods and apply them. I think this was maybe the start of why I was initially impactful, this vertical agency effectively. A follow-up piece from that is that I think it's often surprising how few people are fully-realized in all the things they want to do. They're blocked or limited in some way.
我做事也很务实。正如我之前所说,很多人都带着特定的背景或熟悉度进来。谷歌的美妙之处在于,你可以四处奔波,让世界专家几乎无所不包。您可以坐下来与优化专家、TPU 芯片设计专家、不同形式的预训练算法或 RL 或其他方面的专家交谈。你可以从所有这些中学习,你可以采用这些方法并应用它们。我想这也许是我最初有影响力的开始,这个垂直机构有效。接下来的一点是,我认为很少有人能完全实现他们想做的所有事情,这常常令人惊讶。它们在某种程度上被阻止或限制。

This is very common in big organizations everywhere. People have all these blockers on what they're able to achieve. I think helping inspire people to work in particular directions and working with them on doing things massively scales your leverage. You get to work with all these wonderful people who teach you heaps of things. And generally helping them push past organizational blockers means that together you get an enormous amount done. None of the impact that I've had has been me individually going off and solving a whole lot of stuff. It's been me maybe starting off in a direction, and then convincing other people that this is the right direction, and bringing them along in this big tidal wave of effectiveness that goes and solves that problem.
这在各地的大型组织中都非常普遍。人们对他们能够实现的目标有所有这些障碍。我认为帮助激励人们朝着特定的方向工作,并与他们一起做事,可以极大地扩展你的影响力。你可以和所有这些很棒的人一起工作,他们教你很多东西。一般来说,帮助他们克服组织障碍意味着你们一起完成了大量的工作。我所受到的影响都不是我个人去解决一大堆事情。我可能从一个方向开始,然后说服其他人这是正确的方向,并把他们带到这个巨大的有效浪潮中,去解决这个问题。

Dwarkesh Patel 01:42:16 德瓦克什·帕特尔 01:42:16

We should talk about how you guys got hired. Because I think that's a really interesting story. You were a McKinsey consultant, right? There's an interesting thing there. I think generally people just don't understand how decisions are made about either admissions or evaluating who to hire. Just talk about how you were noticed and hired.
我们应该谈谈你们是如何被录用的。因为我认为这是一个非常有趣的故事。你是麦肯锡的顾问,对吧?那里有一件有趣的事情。我认为通常人们只是不明白如何决定录取或评估雇用谁。只是谈谈你是如何被注意到和雇用的。

Sholto Douglas 01:42:45 肖尔托·道格拉斯 01:42:45

So the TLDR of this is I studied robotics in undergrad. I always thought that AI would be one of the highest-leverage ways to impact the future in a positive way. The reason I am doing this is because I think it is one of our best shots at making a wonderful future basically.
所以这个TLDR是我在本科时学习了机器人技术。我一直认为,人工智能将是以积极方式影响未来的最高杠杆方式之一。我之所以这样做,是因为我认为这是我们创造美好未来的最佳机会之一。

I thought that working at McKinsey, I would get a really interesting insight into what people actually did for work. I actually wrote this as the first line in my cover letter to McKinsey. I was like, “I want to work here so that I can learn what people do, so that I can understand how to work.” In many respects, I did get that. I just got a whole lot of other things too. Many of the people there are wonderful friends.
我以为在麦肯锡工作,我会对人们实际为工作做了什么有一个非常有趣的见解。实际上,我把这句话写成了我给麦肯锡的求职信的第一行。我当时想,“我想在这里工作,这样我就可以学习人们的工作,这样我就可以了解如何工作。在很多方面,我确实明白了。我还得到了很多其他东西。那里的许多人都是很好的朋友。

I think a lot of this agentic behavior comes in part from my time there. You go into organizations and you see how impactful just not taking no for an answer is. You would be surprised at the kind of stuff where, because no one quite cares enough, things just don't happen. No one's willing to take direct responsibility. Directly responsible individuals are ridiculously important and some people just don't care as much about timelines. So much of the value that an organization like McKinsey provides, is hiring people who you were otherwise unable to hire, for a short window of time where they can just push through problems.
我认为很多这种代理行为部分来自我在那里的时间。你进入组织,你会看到不把“不”作为答案是多么有影响力。你会惊讶于这种事情,因为没有人足够关心,事情就是没有发生。没有人愿意承担直接责任。直接负责的人非常重要,有些人只是不太关心时间表。像麦肯锡这样的组织提供的大部分价值是,在很短的时间内雇用你无法雇用的人,他们可以在很短的时间内解决问题。

I think people underappreciate this. So at least some of this attitude of “hold up, I'm going to become the directly responsible individual for this because no one's taking appropriate responsibility. I'm going to care a hell of a lot about this. And I'm going to go to the end of the earth to make sure it gets done,” comes from that time.
我认为人们低估了这一点。因此,至少有一些“等一下,我将成为对此负有直接责任的人”的态度,因为没有人承担适当的责任。我会非常关心这个。我要去地球的尽头,确保它完成,“来自那个时候。

More to your actual question of how I got hired. I didn't get into the grad programs that I wanted to get into over here, which was specifically for focus on robotics, and RL research, and that kind of stuff. In the meantime, on nights and weekends, basically every night from 10pm to 2am, I would do my own research. And every weekend, for at least 6-8 hours each day, I would do my own research and coding projects and this kind of stuff.
更多是关于你是如何被录用的实际问题。我没有进入我想在这里进入的研究生课程,这是专门针对机器人和 RL 研究之类的东西。与此同时,在晚上和周末,基本上每天晚上10点到凌晨2点,我都会做自己的研究。每个周末,每天至少6-8个小时,我都会做自己的研究和编码项目之类的东西。

That sort of switched in part from quite robotics-specific work. After reading Gwern's scaling hypothesis post, I got completely scaling-pilled and was like, “okay, clearly the way that you solve robotics is by scaling large multimodal models.” Then in an effort to scale large multimodal models with a grant from the TPU access program, the Tensor Research Cloud, I was trying to work out how to scale that effectively. James Bradbury, who at the time was at Google and is now at Anthropic, saw some of my questions online where I was trying to work out how to do this properly and he was like, “I thought I knew all the people in the world who were asking these questions. Who on earth are you?” He looked at that and he looked at some of the robotics stuff that I'd been putting up on my blog. He reached out and said, “hey, do you want to have a chat and do you want to explore working with us here?” I was hired, as I understood it later, as an experiment in trying to take someone with extremely high enthusiasm and agency and pairing them with some of the best engineers that he knew. So another reason I've been impactful is I had this dedicated mentorship from utterly wonderful people like Reiner Pope, who has since left to go do his own chip company, Anselm Levskaya, James himself, and many others.
这在一定程度上是从相当机器人的特定工作中转换过来的。在阅读了 Gwern 的缩放假设文章后,我完全被缩放药丸所吸引,就像,“好吧,很明显,你解决机器人技术的方法是通过缩放大型多模态模型。然后,为了利用 TPU 访问计划 Tensor Research Cloud 的资助来扩展大型多模态模型,我试图研究如何有效地扩展它。詹姆斯·布拉德伯里(James Bradbury)当时在谷歌工作,现在在Anthropic工作,他在网上看到了我的一些问题,我试图弄清楚如何正确地做到这一点,他说:“我以为我认识世界上所有问这些问题的人。你到底是谁?他看了看,还看了我在博客上放的一些机器人的东西。他伸出手说:“嘿,你想聊聊吗,你想在这里探索和我们一起工作吗?正如我后来所理解的那样,我被雇用是一个实验,试图以极高的热情和能动性来吸引一个,并将他们与他认识的一些最好的工程师配对。因此,我之所以有影响力的另一个原因是,我得到了像莱纳·波普(Reiner Pope)这样非常出色的人的专门指导,他后来离开了自己的船舶公司,安塞姆·列夫斯卡娅(Anselm Levskaya),詹姆斯本人和许多其他人。

Those were the formative two to three months at the beginning, and they taught me a whole lot of the principles and heuristics that I apply: how to solve problems by understanding the way systems and algorithms overlap. One more thing that makes you quite effective in ML research is concretely understanding the systems side of things. This is something I've learned from them. A deep understanding of how systems influence algorithms and how algorithms influence systems. Because the systems constrain the solution space which you have available to yourself on the algorithm side. And very few people are comfortable fully bridging that gap. At a place like Google, you can just go and ask all the algorithms experts and all the systems experts everything they know, and they will happily teach you. If you go and sit down with them, they will teach you everything they know and it's wonderful.
这些是开始时的两到三个月的形成,他们教会了我很多我应用的原则和启发式方法。如何解决理解系统和算法重叠方式的问题,其中让你在机器学习研究中非常有效的另一件事是具体理解事物的系统方面。这是我从他们身上学到的东西。深入了解系统如何影响算法以及算法如何影响系统。因为系统限制了解决方案空间,而您在算法方面可以自己使用。很少有人愿意完全弥合这一差距。在像谷歌这样的地方,你可以去问所有的算法专家和所有的系统专家他们所知道的一切,他们会很乐意教你。如果你去和他们坐下来,他们会教你他们所知道的一切,这太棒了。

This has meant that I've been able to be very, very effective for both sides. For the pre-training crew, because I understand systems very well I can intuit and understand, “this will work well or this won't.” And then flow that on through the inference considerations of models and this kind of thing. To the chip design teams, I'm one of the people they turn to in order to understand what chips they should be designing in three years, because I'm one of the people who's best able to understand and explain the kind of algorithms that we might want to design in three years. Obviously you can't make very good guesses about that, but I think I convey the information well, accumulated from all of my compatriots on the pre-training crew and the general systems design crew. Also even inference applies a constraint to pre-training. So there's these trees of constraints where if you understand all the pieces of the puzzle, then you get a much better sense for what the solution space might look like.
这意味着我能够对双方都非常非常有效。对于预培训人员来说,因为我非常了解系统,所以我可以凭直觉理解,“这要么工作得很好,要么不行。然后通过模型的推理考虑和类似的事情来流动。对于芯片设计团队来说,我是他们了解他们应该在三年内设计什么芯片的人之一,因为我是最能理解和解释我们可能想要在三年内设计的算法的人之一。显然,你不能对此做出很好的猜测,但我认为我很好地传达了信息,这些信息是从我所有在预训练人员和一般系统设计人员的同胞那里积累起来的。此外,甚至推理也会对预训练施加约束。因此,有这些约束树,如果你理解了拼图的所有部分,那么你就会更好地理解解决方案空间的样子。

Dwarkesh Patel 01:48:17 德瓦克什·帕特尔 01:48:17

There's a couple of things that stick out to me there. One is not just the agency of the person who was hired, but the parts of the system that were able to think, "wait, that's really interesting. Who is this guy? Not from a grad program or anything. Currently a McKinsey consultant with just undergrad. But that's interesting, let's give this a shot.” So with James and whoever else, that's very notable. The second is that I actually didn't know the part of the story where that was part of an experiment run internally about, “can we do this? Can we bootstrap somebody?”
有几件事让我印象深刻。一个不仅仅是被雇用的人的代理权,而是系统中能够思考的部分,“等等,这真的很有趣。这个人是谁?不是来自研究生课程或任何东西。目前是麦肯锡的顾问,只有本科生。但这很有趣,让我们试一试。因此,对于詹姆斯和其他任何人来说,这是非常值得注意的。第二个是,我实际上不知道这个故事的哪一部分是内部运行的实验的一部分,“我们能做到吗?我们能引导某人吗?

In fact, what's really interesting about that is the third thing you mentioned: having somebody who understands all layers of the stack and isn't so stuck on any one approach or any one layer of abstraction is so important. Specifically what you mentioned about being bootstrapped immediately by these people. It means that since you're getting up to speed on everything at the same time, rather than spending grad school going deep in one specific way of doing RL, you can actually take the global view and aren't totally bought in on one thing.
事实上,真正有趣的是你提到的第三件事。拥有一个了解堆栈所有层并且不拘泥于任何一种方法或任何一层抽象的人非常重要。具体来说,你提到的被这些人立即引导。这意味着,由于你同时在所有事情上都加快了速度,而不是在研究生院里深入研究一种特定的RL方式,你实际上可以采取全球视野,而不是完全接受一件事。

So not only is it something that's possible, but it has greater returns potentially than just hiring somebody at a grad school. Just like getting a GPT-8 and fine-tuning the model for one year.
因此,这不仅是可能的,而且可能比仅仅在研究生院雇用某人具有更大的回报。就像获得 GPT-8 并微调模型一年一样。

Sholto Douglas 01:49:41 肖尔托·道格拉斯 01:49:41

You come at everything with fresh eyes and you don't come in locked to any particular field. Now one caveat to that is that before, during my self-experimentation, I was reading everything I could. I was obsessively reading papers every night. Funnily enough, I read much less widely now that my day is occupied by working on things. And in some respect, I had this very broad perspective whereas in a PhD program, you'll just focus on a particular area. If you just read all the NLP work and all the computer vision work and like all the robotics work, you see all these patterns that start to emerge across subfields, in a way that foreshadowed some of the work that I would later do.
你以全新的眼光看待一切,你不会被锁定在任何特定的领域。现在需要注意的是,在我自我实验之前,我正在阅读所有我能读到的东西。我每天晚上都痴迷地看报纸。有趣的是,我现在的阅读范围要少得多,因为我的一天被工作所占据。在某些方面,我有非常广阔的视野,而在博士课程中,你只会专注于一个特定的领域。如果你只是阅读了所有的NLP工作和所有的计算机视觉工作,就像所有的机器人工作一样,你会看到所有这些模式开始在子领域中出现,在某种程度上预示了我后来要做的一些工作。

Dwarkesh Patel 01:50:26 德瓦克什·帕特尔 01:50:26

That's super interesting. One of the reasons that you've been able to be agentic within Google is you're pair programming half the days, or most of the days, with Sergey Brin, right? So it's really interesting that there's a person who's willing to just push ahead on this LLM stuff and get rid of the local blockers in place.
这真是太有趣了。你能够在谷歌内部成为代理的原因之一是你有一半的时间,或者大部分时间,与谢尔盖·布林(Sergey Brin)配对编程,对吧?所以真的很有意思的是,有一个人愿意推进这些东西LLM,并摆脱现有的本地障碍。

Sholto Douglas 01:50:46 肖尔托·道格拉斯 01:50:46

It's important to say it's not like every day or anything. There are particular projects that he's interested in, and then we'll work together on those. But there have also been times when he's been focused on projects with other people. But in general, yes, there's a surprising alpha to being one of the people who actually goes down to the office every day.
重要的是要说它不像日常或任何东西。他感兴趣的项目有些特别,然后我们会一起合作。但有时他也专注于与其他人的项目。但总的来说,是的,成为每天真正去办公室的人之一,有一个令人惊讶的阿尔法。

It shouldn't be, but that is surprisingly impactful. As a result, I've benefited a lot from basically being close friends with people in leadership who care, and from being able to really argue convincingly about why we should do X as opposed to Y, and having that vector. Google is a big organization and having those vectors helps a little bit. But also it's the kind of thing you don't want to ever abuse. You want to make the argument through the right channels and only sometimes do you need to.
它不应该是,但这令人惊讶地影响。因此,我基本上与关心我的领导人员成为亲密的朋友,并且能够真正令人信服地争论为什么我们应该做 X 而不是 Y,并拥有这个向量,这让我受益匪浅。谷歌是一个大组织,拥有这些向量会有所帮助。但它也是你不想滥用的那种东西。你想通过正确的渠道提出论点,只有有时你才需要这样做。

Dwarkesh Patel 01:51:47 德瓦克什·帕特尔 01:51:47

So this includes people like Sergey Brin, Jeff Dean, and so forth. I mean, it's notable. I feel like Google is undervalued. Like Steve Jobs is working on the equivalent next product for Apple and pair programming on it or something…
所以这包括像谢尔盖·布林(Sergey Brin)、杰夫·迪恩(Jeff Dean)等人。我的意思是,这是值得注意的。我觉得谷歌被低估了。就像史蒂夫·乔布斯(Steve Jobs)正在为苹果公司开发等效的下一个产品,并在其上进行结对编程或其他东西......

Sholto Douglas 01:52:00 肖尔托·道格拉斯 01:52:00

Right, I've benefited immensely from it. So for example, during the Christmas break, I was going into the office for a couple of days during that time. I don't know if you guys have read that article about Jeff and Sanjay, but they were there pair programming on stuff. I got to hear about all these cool stories of early Google where they're talking about crawling under the floorboards and rewiring data centers and telling me how many bytes they were pulling off the instructions of a given compiler and instruction, all these crazy little performance optimizations they were doing. They were having the time of their life and I got to sit there and really experience this. There's a sense of history that you expect to be very far away from in a large organization, but…
是的,我从中受益匪浅。例如,在圣诞节假期期间,我在那段时间里要去办公室几天。我不知道你们有没有读过那篇关于 Jeff 和 Sanjay 的文章,但他们在那里进行了结对编程。我听说了早期谷歌的所有这些很酷的故事,他们谈论爬到地板下,重新连接数据中心,告诉我他们从给定编译器和指令的指令中提取了多少字节,所有这些疯狂的小性能优化。他们度过了他们生命中的时光,我坐在那里真正体验了这一点。在一个大型组织中,有一种你期望非常遥远的历史感,但是......

Dwarkesh Patel 01:53:02 德瓦克什·帕特尔 01:53:02

That's super cool. And Trenton, does this map onto any of your experience? 
这太酷了。特伦顿,这与你的经历有关吗?

Trenton Bricken 01:53:06 特伦顿·布里肯 01:53:06

I think Sholto's story is more exciting. Mine was just very serendipitous in that I got into computational neuroscience. I didn't have much business being there. My first paper was mapping the cerebellum to the attention operation and transformers. My next ones were looking at–
我认为肖尔托的故事更令人兴奋。我只是非常偶然,因为我进入了计算神经科学。我在那里没有太多生意。我的第一篇论文是将小脑映射到注意力操作和转换器。我的下一个是看——

Dwarkesh Patel 01:53:23 德瓦克什·帕特尔 01:53:23

How old were you when you wrote that? 
你写这篇文章的时候几岁?

Trenton Bricken 01:53:24 特伦顿·布里肯 01:53:24

It was my first year of grad school, so 22. My next work was on sparsity in networks, inspired by sparsity in the brain, which was when I met Tristan Hume. Anthropic was doing the SoLU, the Softmax Linear Output Unit work which was very related in quite a few ways in terms of making the activation of neurons across a layer really sparse. If we do that then we can get some interpretability of what the neuron's doing. I think we've updated that approach towards what we're doing now. So that started the conversation.
那是我读研究生的第一年,所以22岁。我的下一个工作是关于网络中的稀疏性,灵感来自大脑中的稀疏性,那是我遇到特里斯坦·休谟的时候。Anthropic正在做SoLU,Softmax线性输出单元的工作,在使一个层的神经元激活变得非常稀疏方面,这在很多方面都非常相关。如果我们这样做,那么我们可以获得神经元所做的事情的一些可解释性。我认为我们已经更新了这种方法,以适应我们现在正在做的事情。于是对话就这样开始了。

I shared drafts of that paper with Tristan. He was excited about it. That was basically what led me to become Tristan's resident and then convert to full-time. But during that period, I also moved as a visiting researcher to Berkeley, and started working with Bruno Olshausen, both on what's called vector symbolic architectures–one of the core operations of them is literally superposition–and on sparse coding also known as dictionary learning, which is literally what we've been doing since. Bruno Olshausen basically invented sparse coding back in 1997. So my research agenda and the interpretability team seemed to be running in parallel in research tastes. So it made a lot of sense for me to work with the team and it's been a dream since. 
我和特里斯坦分享了那篇论文的草稿。他对此很兴奋。这基本上就是导致我成为特里斯坦的居民然后转为全职的原因。但在那段时间里,我还作为访问研究员搬到了伯克利,并开始与布鲁诺·奥尔斯豪森(Bruno Olshausen)合作,既研究了所谓的向量符号架构(它们的核心操作之一实际上是叠加),也研究了稀疏编码,也称为字典学习,这实际上是我们一直在做的事情。布鲁诺·奥尔斯豪森(Bruno Olshausen)基本上在1997年发明了稀疏编码。因此,我的研究议程和可解释性团队似乎在研究品味上是平行的。因此,与团队合作对我来说很有意义,从那时起,这一直是我的梦想。

Dwarkesh Patel 01:54:49 德瓦克什·帕特尔 01:54:49

There’s one thing I've noticed when people tell stories about their careers or their successes. They ascribe it way more to contingency, but when they hear about other people's stories they're like, “of course it wasn't contingent.” You know what I mean? “If that didn't happen, something else would have happened.”
当人们讲述他们的职业生涯或成功的故事时,我注意到一件事。他们更多地将其归因于偶然性,但当他们听到其他人的故事时,他们会说,“当然不是偶然的。你知道我的意思?“如果那没有发生,就会发生其他事情。

I've just noticed that and it's interesting that you both think that it was especially contingent. Maybe you're right. But it’s sort of an interesting pattern.
我刚刚注意到了这一点,有趣的是,你们俩都认为这是特别偶然的。也许你是对的。但这是一个有趣的模式。

Trenton Bricken 01:55:17 特伦顿·布里肯 01:55:17

I mean, I literally met Tristan at a conference and didn't have a scheduled meeting with him or anything. I just joined a little group of people chatting, and he happened to be standing there, and I happened to mention what I was working on, and that led to more conversations. I think I probably would've applied to Anthropic at some point anyways. But I would've waited at least another year. It's still crazy to me that I can actually contribute to interpretability in a meaningful way. 
我的意思是,我真的在一次会议上遇到了特里斯坦,并没有与他安排会面或任何东西。我刚刚加入了一小群人聊天,他碰巧站在那里,我碰巧提到了我正在做的事情,这导致了更多的对话。我想我可能会在某个时候申请 Anthropic。但我至少要再等一年。对我来说,我实际上可以以一种有意义的方式为可解释性做出贡献,这对我来说仍然很疯狂。

Sholto Douglas 01:55:42 肖尔托·道格拉斯 01:55:42

I think there's an important aspect of shots on goal there, so to speak, where just choosing to go to conferences itself is putting yourself in a position where luck is more likely to happen. Conversely, in my own situation it was doing all of this work independently and trying to produce and do interesting things. That was my own way of trying to manufacture luck, so to speak: to try and do something meaningful enough that it got noticed.
我认为射门是一个重要的方面,可以这么说。仅仅选择参加会议本身就是将自己置于一个更有可能发生运气的位置。相反,在我自己的情况下,它是独立地完成所有这些工作,并试图制作和做有趣的事情。这是我自己试图制造运气的方式,可以这么说,尝试做一些有意义的事情,让它引起人们的注意。

Dwarkesh Patel 01:56:08 德瓦克什·帕特尔 01:56:08

Given what you said, you framed this in the context that they were trying to run this experiment.
鉴于你所说的话,你是在他们试图进行这个实验的背景下构建的。

Sholto Douglas 01:56:13 肖尔托·道格拉斯 01:56:13

So specifically James and, I think, our manager Brennan was trying to run this experiment. 
所以具体来说,詹姆斯,我想,我们的经理布伦南正试图进行这个实验。

Dwarkesh Patel 01:56:17 德瓦克什·帕特尔 01:56:17

It worked. Did they do it again?
成功了。他们又这样做了吗?

Sholto Douglas 01:56:19 肖尔托·道格拉斯 01:56:19

Yeah, so my closest collaborator, Enrique, he crossed from search through to our team. He's also been ridiculously impactful. He's definitely a stronger engineer than I am and he didn't go to university. 
是的,所以我最亲密的合作者恩里克,他从搜索到我们的团队。他的影响力也非常大。他绝对是一个比我更强壮的工程师,而且他没有上过大学。

Dwarkesh Patel 01:56:33 德瓦克什·帕特尔 01:56:33

What was notable is that usually this kind of stuff is farmed out to recruiters or something. Whereas James is somebody whose time is worth like hundreds of millions of dollars. You know what I mean? So this thing is very bottlenecked on that kind of person taking the time, in an almost aristocratic tutoring sense, and finding someone and then getting them up to speed. It seems if it works this well, it should be done at scale. Like it should be the responsibility of key people to onboard.
值得注意的是,通常这种东西是外包给招聘人员或其他什么的。而詹姆斯是一个时间价值数亿美元的人。你知道我的意思?所以这件事对那种人来说是非常瓶颈的,他们花时间,在一种近乎贵族式的辅导意义上,找到一个人,然后让他们跟上速度。似乎如果它运作良好,它应该大规模完成。就像它应该是关键人物的责任。

Sholto Douglas 01:57:10 肖尔托·道格拉斯 01:57:10

I think that is true to many extents. I'm sure you probably benefited a lot from the key researchers mentoring you deeply.
我认为这在很多方面都是正确的。我敢肯定,你可能从关键研究人员的深入指导中受益匪浅。

Dwarkesh Patel 01:57:18 德瓦克什·帕特尔 01:57:18

And actively looking on open source repositories or on forums for potential people like this.
并积极寻找开源存储库或论坛,寻找像这样的潜在人。

Sholto Douglas 01:57:25 肖尔托·道格拉斯 01:57:25

I mean James has Twitter injected into his brain, but yes. I think this is something which in practice is done. Like people do look out for people that they find interesting and try to find high signal. In fact, I was talking about this with Jeff the other day and Jeff said that one of the most important hires he ever made was off a cold email. I was like, “well who was that?” And he's Chris Olah. Chris similarly had no formal background in ML. Google Brain was just getting started in this kind of thing but Jeff saw that signal. And the residency program which Brain had was astonishingly effective at finding good people that didn't have strong ML backgrounds.
我的意思是詹姆斯已经将Twitter注入了他的大脑,但是是的。我认为这是在实践中完成的事情。就像人们确实会寻找他们觉得有趣的人并试图找到高信号一样。事实上,前几天我和 Jeff 谈过这件事,Jeff 说他做过的最重要的招聘之一就是一封冷冰冰的电子邮件。我当时想,“那是谁?他就是克里斯·奥拉。Chris同样没有正式的ML背景,Google Brain才刚刚开始做这种事情,但Jeff看到了这个信号。Brain 的住院医师计划在寻找没有强大 ML 背景的优秀人才方面非常有效。

Dwarkesh Patel 01:58:27 德瓦克什·帕特尔 01:58:27

One of the other things I want to emphasize for a potential slice of the audience is that there's this sense that the world is legible and efficient, that you just go to jobs.google.com or jobs.whatevercompany.com and you apply and there's the steps and they will evaluate you efficiently. Not only from your stories, but it just seems like often that's not the way it happens. In fact, it's good for the world that that's not often how it happens. It is important to look at, “were they able to write an interesting technical blog post about their research or are they making interesting contributions.”
我想向潜在观众强调的另一件事是,有一种感觉,即世界是清晰而高效的,你只需去 jobs.google.com 或 jobs.whatevercompany.com,然后申请,就会有步骤,他们会有效地评估你。不仅从你的故事中,而且似乎经常不是这样发生的。事实上,这对世界来说是件好事,但这种情况并不经常发生。重要的是要看,“他们是否能够写一篇关于他们的研究的有趣的技术博客文章,或者他们是否做出了有趣的贡献。

I want you to riff on this for the people who are assuming that the other end of the job board is super legible and mechanical. This is not how it works and in fact, people are looking for the different kind of person who's agentic and putting stuff out there.
我希望你为那些认为工作板的另一端非常清晰和机械的人重复这一点。这不是它的工作方式,事实上,人们正在寻找不同类型的人,他们善于代理并把东西放在那里。

Sholto Douglas 01:59:25 肖尔托·道格拉斯 01:59:25

I think specifically what people are looking for are two things. One is agency and putting yourself out there. The second is the ability to do something at a world-class level. There are two examples that I always like to point to here. Andy Jones from Anthropic did an amazing paper on scaling laws as applied to board games. It didn't require much resources. It demonstrated incredible engineering skill and incredible understanding of the most topical problem of the time. He didn't come from a typical academic background or whatever. As I understand it, basically as soon as he came out with that paper, both Anthropic and OpenAI were like, “we would desperately like to hire you.”
我认为人们特别需要两样东西。一个是代理,把自己放在那里。第二个是在世界级水平上做某事的能力。这里我总是喜欢举两个例子。来自 Anthropic 的 Andy Jones 发表了一篇关于应用于棋盘游戏的缩放定律的精彩论文。它不需要太多资源。它展示了令人难以置信的工程技能和对当时最热门问题的令人难以置信的理解。他不是来自典型的学术背景或其他什么。据我了解,基本上他一发表那篇论文,Anthropic 和 OpenAI 都说,“我们迫切希望雇用你。

There's also someone who works on Anthropic's performance team now, Simon Boehm, who has written in my mind the reference for optimizing a CUDA matmul on a GPU. It demonstrates an example of taking some prompt effectively and producing the world-class reference example for it, in something that wasn't particularly well done so far. I think that's an incredible demonstration of ability and agency and in my mind would be an immediate, “we would please love to interview/hire you.”
现在还有人在 Anthropic 的性能团队工作,Simon Boehm,他在我的脑海中写下了在 GPU 上优化 CUDA 地图模型的参考。它展示了一个有效地采取一些提示并为其生成世界级参考示例的例子,到目前为止,这还不是特别好。我认为这是对能力和代理能力的不可思议的展示,在我看来,这将是一个直接的,“我们很乐意面试/雇用你。

Trenton Bricken 02:00:36 特伦顿·布里肯 02:00:36

The only thing I can add here is I still had to go through the whole hiring process and all the standard interviews and this sort of thing.
我在这里唯一可以补充的是,我仍然必须经历整个招聘过程和所有标准面试之类的事情。

Sholto Douglas 02:00:42 肖尔托·道格拉斯 02:00:42

Yeah, everyone does. Everyone does. 
是的,每个人都这样做。每个人都这样做。

Dwarkesh Patel 02:00:43 德瓦克什·帕特尔 02:00:43

Wait, doesn't that seem stupid? 
等等,这看起来是不是很愚蠢?

Sholto Douglas 02:00:47 肖尔托·道格拉斯 02:00:47

I mean, it's important, debiasing.
我的意思是,这很重要,消除偏见。

Dwarkesh Patel 02:00:50 德瓦克什·帕特尔 02:00:50

A bias is what you want, right? You want the bias of somebody who's got great taste. Who cares? 
偏见是你想要的,对吧?你想要一个有品味的人的偏见。谁在乎啊?

Sholto Douglas 02:00:56 肖尔托·道格拉斯 02:00:56

Your interview process should be able to disambiguate that as well.
你的面试过程也应该能够消除歧义。

Trenton Bricken 02:00:59 特伦顿·布里肯 02:00:59

I think there are cases where someone seems really great and then they actually just can't code, this sort of thing. How much you weigh these things definitely matters though and I think we take references really seriously. The interviews you can only get so much signal from. So it's all these other things that can come into play for whether or not a hire makes sense. 
我认为在某些情况下,某人看起来真的很棒,然后他们实际上就是不会编码,诸如此类的事情。不过,你对这些东西的权衡程度肯定很重要,我认为我们非常认真地对待参考资料。你只能从采访中得到这么多信号。因此,所有这些其他事情都可能影响到招聘是否有意义。

Sholto Douglas 02:01:18 肖尔托·道格拉斯 02:01:18

But you should design your interviews such that they test the right things. 
但是你应该设计你的面试,以便他们测试正确的东西。

Dwarkesh Patel 02:01:23 德瓦克什·帕特尔 02:01:23

One man's bias is another man's taste. 
一个人的偏见是另一个人的品味。

Trenton Bricken 02:01:29 特伦顿·布里肯 02:01:29

I guess the only thing I would add to this, or to the headstrong context, is this line: “the system is not your friend.” It's not necessarily actively against you or your sworn enemy. It's just not looking out for you. So that's where a lot of the proactiveness comes in. There are no adults in the room and you have to come to some decision for what you want your life to look like and execute on it. And hopefully you can then update later, if you're too headstrong in the wrong way. But I think you almost have to just charge at certain things to get much of anything done, to not be swept up in the tide of whatever the expectations are. 
我想我唯一要补充的就是这句话:“系统不是你的朋友。它不一定是积极反对你或你的死敌。它只是没有照顾你。所以这就是很多主动性的用武之地。房间里没有成年人,你必须做出一些决定,你希望你的生活是什么样子,并执行它。希望你以后可以更新,如果你以错误的方式太任性了。但我认为你几乎只需要在某些事情上冲锋陷阵,才能完成很多事情,而不是被任何期望的浪潮所席卷。

Sholto Douglas 02:02:11 肖尔托·道格拉斯 02:02:11

There's one final thing I want to add. We talked a lot about agency and this kind of stuff.
最后我想补充一点。我们谈了很多关于代理和类似的东西。

But I think surprisingly enough, one of the most important things is just caring an unbelievable amount. When you care an unbelievable amount, you check all the details and you have this understanding of what could have gone wrong. It just matters more than you think. People end up not caring or not caring enough.
但我认为令人惊讶的是,最重要的事情之一就是关心令人难以置信的数量。当你关心一个令人难以置信的数量时,你会检查所有的细节,你就会明白可能出了什么问题。它比你想象的更重要。人们最终不关心或不够关心。

There's this LeBron quote where he talks about how before he started in the league he was worried about everyone being incredibly good. He gets there and then he realizes that actually, once people hit financial stability, they relax a bit and he realizes, “oh, this is going to be easy.”
勒布朗引用了这句话,他谈到在他开始进入联盟之前,他担心每个人都非常出色。他到了那里,然后他意识到,实际上,一旦人们达到财务稳定,他们就会放松一点,他意识到,“哦,这将很容易。

I don't think that's quite true because I think in AI research most people actually care quite deeply. But there's caring about your problem and there's also just caring about the entire stack and everything that goes up and down, going explicitly and fixing things that aren't your responsibility to fix because overall it makes the stack better.
我不认为这是真的,因为我认为在人工智能研究中,大多数人实际上都非常关心。但是,你要关心你的问题,也要关心整个堆栈以及所有上下波动的东西,明确地去修复那些不是你负责修复的东西,因为总的来说,它使堆栈变得更好。

Dwarkesh Patel 02:03:11 德瓦克什·帕特尔 02:03:11

You were mentioning going in on weekends and on Christmas break and the only people in the office are Jeff Dean and Sergey Brin or something and you just get to pair program with them. I don't want to pick on your company in particular, but people at any big company have gotten there because they've gone through a very selective process. They had to compete in high school. They had to compete in college. But it almost seems like they get there and then they take it easy when in fact it's the time to put the pedal to the metal. Go in and pair program with Sergey Brin on the weekends or whatever, you know what I mean?
你提到周末和圣诞节假期去上班,办公室里只有杰夫·迪恩和谢尔盖·布林之类的人,你只需要和他们配对程序。我不想特别挑剔你的公司,但任何大公司的人都已经走到了那里,因为他们经历了一个非常有选择性的过程。他们不得不在高中参加比赛。他们不得不在大学里竞争。但似乎他们到达了那里,然后他们放松了,而实际上是时候将踏板踩到金属上了。在周末或其他什么时间与谢尔盖·布林(Sergey Brin)配对,你知道我的意思吗?

Sholto Douglas 02:03:48 肖尔托·道格拉斯 02:03:48

There's pros and cons there, right? I think many people make the decision that the thing that they want to prioritize is a wonderful life with their family. They do wonderful work in the hours that they do and that's incredibly impactful. I think this is true for many people at Google. Maybe they don't work as many hours as in your typical startup mythologies. But the work that they do is incredibly valuable.
这有利有弊,对吧?我认为很多人都决定,他们想要优先考虑的事情是与家人一起过上美好的生活。他们在他们所做的几个小时内做了出色的工作,这非常有影响力。我认为对于谷歌的许多人来说,情况都是如此。也许他们的工作时间不如典型的创业神话中那么多。但他们所做的工作非常有价值。

It's very high-leverage because they know the systems and they're experts in their field. We also need people like that. Our world rests on these huge systems that are difficult to manage and difficult to fix. We need people who are willing to work on, and help, and fix, and maintain those in frankly a thankless way. That isn't as high publicity as all of this AI work that we're doing. I am ridiculously grateful that those people do that. I'm also happy that there are people that find technical fulfillment in their job and doing that well and also maybe they draw a lot more out of spending a lot of hours with their family. I'm lucky that I'm at a stage in my life where I can go in and work every hour of the week. I'm not making as many sacrifices to do that.
这是非常高的杠杆,因为他们了解系统,并且是各自领域的专家。我们也需要这样的人。我们的世界依赖于这些难以管理和难以修复的庞大系统。我们需要那些愿意以一种吃力不讨好的方式努力、帮助、修复和维护它们的人。这并不像我们正在做的所有这些人工智能工作那样高知名度。我非常感激那些人这样做。我也很高兴有些人在他们的工作中找到了技术成就感,并且做得很好,而且也许他们从与家人共度大量时间中获得了更多。我很幸运,我正处于人生的一个阶段,我可以每周的每个小时都去工作。我没有为此做出那么多牺牲。

Dwarkesh Patel 02:05:01 德瓦克什·帕特尔 02:05:01

One example sticks out in my mind of this sort getting to the yes on the other side of a no. Basically every single high-profile guest I've done so far, I think maybe with one or two exceptions, I've sat down for a week and I've just come up with a list of sample questions. I just try to come up with really smart questions to send to them. In that entire process I've always thought, if I just cold email them, it's like a 2% chance they say yes. If I include this list, there's a 10% chance. Because otherwise, you go through their inbox and every 34 seconds, there's an interview for some podcast or interview. Every single time I've done this they've said yes.
在我的脑海中有一个例子,就是在否定的另一边得到肯定。到目前为止,基本上我做过的每一位知名嘉宾,我想也许除了一两个例外,我已经坐下来一个星期了,我只是想出了一个示例问题清单。我只是试着想出非常聪明的问题发送给他们。在整个过程中,我一直在想,如果我只是给他们发冷电子邮件,他们答应的几率只有 2%。如果我包括这个列表,有 10% 的机会。因为否则,你浏览他们的收件箱,每 34 秒,就会有一些播客或采访的采访。每次我这样做时,他们都说是的。

Trenton Bricken 02:05:46 特伦顿·布里肯 02:05:46

You just ask the right questions, 
你只要问正确的问题,

Sholto Douglas 02:05:49 肖尔托·道格拉斯 02:05:49

You do everything, you'll win, 
你做任何事情,你都会赢,

Dwarkesh Patel 02:05:50 德瓦克什·帕特尔 02:05:50

You just literally have to dig in the same hole for 10 minutes, or in that case make a sample list of questions for them, to get past their "not an idiot" list.
你只需要在同一个洞里挖 10 分钟,或者在这种情况下为他们制作一个问题示例列表,以通过他们的“不是白痴”列表。

Sholto Douglas 02:06:01 肖尔托·道格拉斯 02:06:01

Demonstrate how much you care and the work you're willing to put in.
展示你有多关心你,以及你愿意付出的努力。

Trenton Bricken 02:06:05 特伦顿·布里肯 02:06:05

Something that a friend said to me a while back that stuck is that it's amazing how quickly you can become world-class at something. Most people aren't trying that hard and are only working the actual 20 hours or something that they're spending on this thing. So if you just go ham, then you can get really far, pretty fast. 
不久前,一位朋友对我说过一句话,那就是你能以多快的速度成为世界级的某件事,这真是太神奇了。大多数人并没有那么努力,只工作了实际的 20 个小时或他们花在这件事上的东西。所以如果你只是去火腿,那么你可以走得很远,很快。

Sholto Douglas 02:06:25 肖尔托·道格拉斯 02:06:25

I think I'm lucky I had that experience with the fencing as well. I had the experience of becoming world-class in something and knowing that if you just worked really, really hard and were–
我想我很幸运,我也有过击剑的经历。我有过在某件事上成为世界级的经历,并且知道如果你真的非常努力地工作,并且是——

Dwarkesh Patel 02:06:35 德瓦克什·帕特尔 02:06:35

For context, Sholto was one seat away; he was the next person in line to go to the Olympics for fencing.
就上下文而言,肖尔托只差一个座位,他是下一个排队参加奥运会击剑比赛的人。

Sholto Douglas 02:06:43 肖尔托·道格拉斯 02:06:43

I was at best like 42nd in the world for fencing, for men's foil fencing. 
在击剑和男子花剑方面,我充其量只能排在世界第42位。

Dwarkesh Patel 02:06:47 德瓦克什·帕特尔 02:06:47

Mutational load is a thing, man.
突变负荷是一回事,伙计。

Sholto Douglas 02:06:53 肖尔托·道格拉斯 02:06:53

There was one cycle where I was like the next highest-ranked person in Asia and if one of the teams had been disqualified for doping–as was occurring during that cycle and occurred for like the Australian women's rowing team that went on because one of the teams was disqualified–then I would have been the next in line.
有一个周期,我就像亚洲排名第二的人,如果其中一支球队因兴奋剂而被取消资格——就像在那个周期中发生的那样,就像澳大利亚女子赛艇队一样,因为其中一支球队被取消资格而继续比赛——那么我就会成为下一个。

02:07:16 - Are feature spaces the wrong way to think about intelligence?  #

02:07:16 - 特征空间是思考智能的错误方式吗?

Dwarkesh Patel 02:07:16 德瓦克什·帕特尔 02:07:16

It's interesting when you just find out about people's prior lives and it's, “oh this guy was almost an Olympian.”
有趣的是,当你发现人们的前世时,“哦,这家伙几乎是奥运选手。

Okay, let's talk about interpretability. I actually want to stay on the brain stuff as a way to get into it for a second. We were previously discussing this. Is the brain organized in the way where you have a residual stream that is gradually refined with higher-level associations over time? There's a fixed dimension size in a model. I don't even know how to ask this question in a sensible way, but what is the d_model of the brain? What is the embedding size? Or, because of feature splitting, is that not a sensible question?
好吧,让我们谈谈可解释性。我实际上想留在大脑的东西上,作为一种进入它的方式。我们之前讨论过这个问题。大脑的组织方式是否是这样的,你有一个残余的流,随着时间的推移,随着更高层次的联想逐渐完善?模型中有一个固定的维度大小。我什至不知道如何以一种明智的方式提出这个问题,但大脑的D模型是什么?嵌入大小是多少,或者因为特征拆分,这不是一个明智的问题吗?

Trenton Bricken 02:08:06 特伦顿·布里肯 02:08:06

No, I think it's a sensible question. Well, it is a question.
不,我认为这是一个明智的问题。嗯,这是一个问题。

Dwarkesh Patel 02:08:09 德瓦克什·帕特尔 02:08:09

You could have just not said that.
你本来可以不这么说的。

Trenton Bricken 02:08:19 特伦顿·布里肯 02:08:19

I don't know how you would begin. Okay, well this part of the brain is like a vector of this dimensionality. Maybe for the visual stream, because it's like V1 to V2 to IT, whatever. You could just count the number of neurons that are there and say that is the dimensionality. But it seems more likely that there are submodules and things are divided up. I'm not the world's greatest neuroscientist. I did it for a few years, I studied the cerebellum quite a bit. I'm sure there are people who could give you a better answer on this.
我不知道你会如何开始。好吧,大脑的这一部分就像是这个维度的向量。也许对于视觉流来说,因为它就像 V1 到 V2 到 IT 等等。你可以数一数那里的神经元数量,然后说这就是维度。但似乎更有可能的是,有子模块,事情被分开了。我不是世界上最伟大的神经科学家。我做了几年,我研究了小脑。我相信有人可以给你一个更好的答案。

Dwarkesh Patel 02:08:56 德瓦克什·帕特尔 02:08:56

Do you think that the way to think, whether it's in the brain or whether it's in these models, fundamentally what's happening is that features are added, removed, changed, and that the feature is the fundamental unit of what is happening in the model? This goes back to the earlier thing we were talking about, whether it's just associations all the way down. Give me a counterfactual. In the world where this is not true, what is happening instead? What is the alternative hypothesis here? 
你是否认为思考的方式,无论是在大脑中还是在这些模型中,从根本上说,正在发生的事情是特征被添加、删除、改变,并且特征是模型中正在发生的事情的基本单位?这又回到了我们之前谈论的事情,无论它只是一直向下的关联。给我一个反事实。在事实并非如此的世界里,发生了什么?这里的替代假设是什么?

Trenton Bricken 02:09:30 特伦顿·布里肯 02:09:30

It's hard for me to think about because at this point I just think so much in terms of this feature space. At one point there was the kind of behavioral approach towards cognition where you're just input and output but you're not really doing any processing. Or it's like everything is embodied and you're just a dynamical system that's operating along some predictable equations but there's no state in the system. But whenever I've read these sorts of critiques I think, “well, you're just choosing to not call this thing a state, but you could call any internal component of the model a state.” Even with the feature discussion, defining what a feature is, is really hard. So the question feels almost too slippery.
我很难思考,因为在这一点上,我只是在这个功能空间方面考虑了很多。在某一时刻,有一种认知行为方法,你只是输入和输出,但你并没有真正做任何处理。或者,就像一切都是具体化的,你只是一个动态系统,沿着一些可预测的方程运行,但系统中没有状态。但每当我读到这些批评时,我都会想,“好吧,你只是选择不把这个东西称为状态,但你可以把模型的任何内部组件称为状态。即使进行了功能讨论,定义功能是什么也非常困难。所以这个问题感觉太滑了。

Dwarkesh Patel 02:10:24 德瓦克什·帕特尔 02:10:24

What is a feature? 
什么是功能?

Trenton Bricken 02:10:25 特伦顿·布里肯 02:10:25

A direction in activation space. A latent variable that is operating behind the scenes, that has causal influence over the system you're observing. It's a feature if you call it a feature; it's tautological.
方向和激活空间。一个在幕后运行的潜在变量,对你所观察的系统有因果影响。如果你称它为功能,它就是一个功能,它是同义的。

Sholto Douglas 02:10:49 肖尔托·道格拉斯 02:10:49

In a very rough, intuitive sense, in a sufficiently sparse, roughly binary vector, a feature is whether or not something's turned on or off, in a very simplistic sense. I think a useful metaphor to understand it is that in many respects it's the same way neuroscientists would talk about a neuron activating, right?
在一个非常粗略的、直观的意义上,在一个足够稀疏和像二进制向量的意义上,一个特征是某物是否被打开或关闭,在一个非常简单的意义上。我认为一个有用的比喻是,在许多方面,这与神经科学家谈论神经元激活的方式相同,对吧?

Trenton Bricken 02:11:11 特伦顿·布里肯 02:11:11

If that neuron corresponds to… 
如果该神经元对应于...

Sholto Douglas 02:11:12 肖尔托·道格拉斯 02:11:12

To something in particular, right?
特别的东西,对吧?

Trenton Bricken 02:11:15 特伦顿·布里肯 02:11:15

What do we want a feature to be? What is the synthetic problem under which a feature exists? Even with the “Towards Monosemanticity” work, we talk about what's called feature splitting, which is basically where you will find as many features as you give the model the capacity to learn. By model here, I mean the up projection that we fit after we trained the original model. So if you don't give it much capacity, it'll learn a feature for bird, but if you give it more capacity, then it will learn ravens and eagles and sparrows and specific types of birds.
我们想要一个功能是什么?特征存在的综合问题是什么?即使有“走向单调性”的工作,我们也会谈论所谓的特征拆分,这基本上是你找到尽可能多的特征,你给模型学习的能力。这里所说的模型,是指我们在训练原始模型后拟合的向上投影。因此,如果你不给它太多的容量,它会学习鸟类的特征,但如果你给它更多的容量,那么它就会学习乌鸦、老鹰、麻雀和特定类型的鸟类。
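
A rough sketch of the up-projection Trenton is describing, a sparse autoencoder fit to a trained model's activations; the dimensions, L1 penalty, and toy data below are illustrative assumptions, not the actual "Towards Monosemanticity" setup:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Up-projects activations into an overcomplete, sparse feature basis."""
    def __init__(self, d_model=1024, n_features=8192):  # illustrative sizes
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))  # sparse, non-negative feature activations
        recon = self.decoder(feats)             # reconstruct the original activations
        return recon, feats

def sae_loss(acts, recon, feats, l1_coeff=1e-3):
    # Reconstruction keeps the features faithful to the model;
    # the L1 term pushes most features to zero on any given input.
    return ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()

acts = torch.randn(64, 1024)  # stand-in for residual-stream or MLP activations
sae = SparseAutoencoder()
recon, feats = sae(acts)
print(sae_loss(acts, recon, feats))
```

Keep `n_features` small and you get coarse features like "bird"; make it much larger and, as described above, the same direction splits into finer-grained ones.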

Dwarkesh Patel 02:11:51 德瓦克什·帕特尔 02:11:51

Still on the definitions thing, I naively think of things like bird versus, at the highest level, things like love or deception or holding a very complicated proof in your head or something.
仍然在定义方面,我天真地想到了诸如鸟与鸟之类的东西,在最高层次上,诸如爱或欺骗之类的东西,或者在脑海中持有非常复杂的证据之类的东西。

Are these all features? Because then the definition seems so broad as to almost be not that useful. Rather there seems to be some important differences between these things and they're all features. I'm not sure what we would mean by that.
这些都是功能吗?因为这样一来,这个定义似乎就太宽泛了,以至于几乎没有那么有用。相反,这些东西之间似乎有一些重要的区别,它们都是功能。我不确定我们这是什么意思。

Trenton Bricken 02:12:32 特伦顿·布里肯 02:12:32

I mean all of those things are discrete units that have connections to other things that then imbues them with meaning. That feels like a specific enough definition that it's useful or not too all-encompassing. But feel free to push back. 
我的意思是,所有这些事物都是离散的单元,它们与其他事物有联系,然后赋予它们以意义。这感觉像是一个足够具体的定义,它是有用的,或者不是太包罗万象。但请随意反击。

Dwarkesh Patel 02:12:49 德瓦克什·帕特尔 02:12:49

Well what would you discover tomorrow that could make you think, “oh this is fundamentally the wrong way to think about what's happening in a model.”
那么,明天你会发现什么,可能会让你觉得,“哦,这从根本上说是思考模型中发生的事情的错误方式。

Trenton Bricken 02:12:59 特伦顿·布里肯 02:12:59

If the features we were finding weren't predictive, or if they were just representations of the data, where it's like: "oh, all you're doing is just clustering your data and there are no higher-level associations being made," or it's some phenomenological thing where you're saying that this feature fires for marriage, but if you activate it really strongly it doesn't change the outputs of the model in a way that would correspond to it.
如果我们发现的特征不是预测性的,或者它们只是数据的表示,那么就像:“哦,你所做的只是对你的数据进行聚类,没有更高层次的关联,或者这是你调用的一些现象学的东西。你是说这个功能是为婚姻准备的,但如果你非常强烈地激活它,它不会以与之相对应的方式改变模型的输出。

I think those would both be good critiques. Here’s another. We tried to do experiments on MNIST which is a data set of images, and we didn't look super hard into it. So I'd be interested if other people wanted to take up a deeper investigation here. But it's plausible that your latent space of representations is dense and it's a manifold instead of being these discrete points. So you could move across the manifold, but at every point, there would be some meaningful behavior. It's much harder then, to label things as features that are discrete. 
我认为这些都是很好的批评。这是另一个。我们试图在MNIST上做实验,这是一个图像数据集,我们并没有非常认真地研究它。因此,如果其他人想在这里进行更深入的调查,我会很感兴趣。但是,你的潜在表征空间是密集的,它是一个流形,而不是这些离散点,这是合理的。所以你可以在流形上移动,但在每个点上,都会有一些有意义的行为。那么,将事物标记为离散的特征要困难得多。

Dwarkesh Patel 02:14:05 德瓦克什·帕特尔 02:14:05

In a naive, sort of outsider way, it seems to me that a way in which this picture could be wrong is if it's not that something is turned on and turned off, but that it's a much more global kind of system. I'm going to use really clumsy, dinner-party kind of language, but is there a good analogy here?
以一种幼稚的、有点局外人的方式,在我看来,这张照片可能是错误的一种方式是,如果不是某些东西被打开和关闭,而是它是一种更加全球化的系统。我将使用非常笨拙的晚宴语言,但这里有一个很好的类比吗?

I guess if you think of something like the laws of physics, it's not that the feature for wetness is turned on, but it's only turned on this much and then the feature for… I guess maybe it's true because the mass is like a gradient and… I don't know. But the polarity or whatever is the gradient as well.
我想如果你想到物理定律之类的东西,并不是说湿润功能被打开了,而是它只打开了这么多,然后是......我想也许这是真的,因为质量就像一个梯度,而且......我不知道。但极性或其他任何东西也是梯度。

There's also a sense in which there's the laws and the laws are more general and you have to understand the general bigger picture and you don't get that from just these specific subcircuits.
还有一种意义上说,有定律,而定律更普遍,你必须了解一般的大局,你不能仅仅从这些特定的子电路中得到它。

Sholto Douglas 02:15:08 肖尔托·道格拉斯 02:15:08

But that's where the reasoning circuit itself comes into play, right? You're taking these features ideally and trying to compose them into something high-level. At least this is my headcanon. So let's say I'm trying to use the formula F=ma, right? Then presumably at some point I have features which denote mass. And then that's helping me retrieve the actual mass of the thing that I'm using and then the acceleration and this kind of stuff. Then also, maybe there's a higher-level feature that does correspond to using the first law of physics. Maybe. But the more important part is the composition of components which helps me retrieve a relevant piece of information and then produce maybe some multiplication operator or something like that when necessary. At least that's my headcanon.
但这就是推理电路本身发挥作用的地方,对吧?你理想地利用了这些功能,并试图将它们组合成高级的东西。至少这是我的头炮,所以假设我正在尝试使用脚,F=马,对吧?然后大概在某个时候,我有表示质量的特征。然后这帮助我检索我正在使用的东西的实际质量,然后是加速度和类似的东西。然后,也许还有一个更高层次的特征,确实对应于使用物理第一定律。或。但更重要的部分是组件的组成,它帮助我检索相关信息,然后在必要时生成一些乘法运算符或类似的东西。至少这是我的头炮。

Dwarkesh Patel 02:15:52 德瓦克什·帕特尔 02:15:52

What is a compelling explanation to you, especially for very smart models, of “I understand why it made this output and it was like for a legit reason.” If it's doing million line pull requests or something, what are you seeing at the end of that request where you're like, “yep good, that's chill.”
对你来说,特别是对于非常聪明的模型来说,一个令人信服的解释是什么,“我理解为什么它会产生这种输出,而且这是出于合法的原因。如果它正在执行数百万个拉取请求或其他什么,那么在请求结束时,您会看到什么,您会说,“是的,太好了,这很冷。

Trenton Bricken 02:16:11 特伦顿·布里肯 02:16:11

So ideally you apply dictionary learning to the model. You've found features. Right now we're actively trying to get the same success for attention heads. You can do it for residual stream, MLP, and attention throughout the whole model. Hopefully at that point you can also identify broader circuits through the model that are more general reasoning abilities that will activate or not activate.
因此,理想情况下,您将字典学习应用于模型。你已经找到了功能。现在,我们正在积极尝试为注意力负责人获得同样的成功。您可以在整个模型中对残差流、MLP 和注意力执行此操作。希望在这一点上,您还可以通过模型识别更广泛的电路,这些电路是将激活或不激活的更通用的推理能力。

But in your case where we're trying to figure out if this pull request should be approved or not. I think you can flag or detect features that correspond to deceptive behavior, malicious behavior, these sorts of things, and see whether or not those have fired. That would be an immediate thing. You can do more than that, but that would be an immediate one.
但在您的情况下,我们试图弄清楚是否应该批准此拉取请求。我认为您可以标记或检测与欺骗行为、恶意行为等相对应的功能,并查看这些功能是否已触发。这将是一件立竿见影的事情。你可以做更多的事情,但那将是立竿见影的。

Dwarkesh Patel 02:16:53 德瓦克什·帕特尔 02:16:53

But before I trace down on that, what does a reasoning circuit look like? What would that look like when you found it? 
但在我追溯之前,推理电路是什么样的?当你找到它时,它会是什么样子?

Trenton Bricken 02:17:00 特伦顿·布里肯 02:17:00

Yeah, so, I mean, the induction head is probably one of the simplest cases. 
是的,所以,我的意思是,感应头可能是最简单的情况之一。

Dwarkesh Patel 02:17:02 德瓦克什·帕特尔 02:17:02

But it's not reasoning, right? 
但这不是推理,对吧?

Trenton Bricken 02:17:04 特伦顿·布里肯 02:17:04

Well, what do you call reasoning, right? For context for listeners, the induction head is basically, when you see the line, “Mr. and Mrs. Dursley did something. Mr. _____,” and you're trying to predict what “blank” is and the head has learned to look for previous occurrences of the word “Mr.”  and look at the word that comes after it and then copy and paste that as the prediction for what should come next. It's a super reasonable thing to do and there is computation being done there to accurately predict the next token.
好吧,你叫什么推理,对吧?对于听众来说,当你看到这句话时,归纳头基本上是,“德思礼先生和夫人做了一些事情。Mr. _____“,你试图预测”空白“是什么,而大脑已经学会了寻找以前出现的”先生“这个词,并查看它后面的单词,然后复制并粘贴它作为对接下来应该发生的事情的预测。这是一件非常合理的事情,并且正在那里进行计算以准确预测下一个代币。
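
A toy, non-neural sketch of the rule being described; the real induction head is a learned attention pattern across two layers, this is just the copy rule written out in ordinary Python:

```python
def induction_prediction(tokens):
    """Find the previous occurrence of the current token and predict
    whatever followed it last time (the induction-head rule)."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards through the context
        if tokens[i] == current:
            return tokens[i + 1]              # copy the token that came after it
    return None

print(induction_prediction(["Mr.", "Dursley", "and", "Mrs.", "Dursley", "were", "proud", ".", "Mr."]))
# -> "Dursley"
```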

Sholto Douglas 02:17:43 肖尔托·道格拉斯 02:17:43

Yeah, that is context dependent.
是的,这取决于上下文。

Dwarkesh Patel 02:17:45 德瓦克什·帕特尔 02:17:45

But it's not reasoning. You know what I mean?
但这不是推理。你知道我的意思?

Trenton Bricken 02:17:49 特伦顿·布里肯 02:17:49

I guess going back to the “associations all the way down.” It’s if you chain together a bunch of these reasoning circuits, or heads, that have different rules for how to relate information.
我想回到“一路向下的联想”。如果你把一堆推理电路或头部串联在一起,它们对如何关联信息有不同的规则。

Dwarkesh Patel 02:18:02 德瓦克什·帕特尔 02:18:02

But in this sort of zero shot case, something is happening when you pick up a new game and you immediately start understanding how to play it. And it doesn't seem like an induction head kind of thing.
但是在这种零镜头的情况下,当你拿起一个新游戏时,就会发生一些事情,你立即开始了解如何玩它。而且它看起来不像是感应头之类的东西。

Trenton Bricken 02:18:13 特伦顿·布里肯 02:18:13

Or I think there would be another circuit for extracting pixels and turning them into latent representations of the different objects in the game, right? And a circuit that is learning physics. 
或者我认为会有另一个电路来提取像素并将它们转换为游戏中不同对象的潜在表示,对吧?还有一个正在学习物理的电路。

Dwarkesh Patel 02:18:26 德瓦克什·帕特尔 02:18:26

What would that look like? Because the induction head is like one layer transformer?
那会是什么样子?因为感应头就像一层变压器?

Trenton Bricken 02:18:30 特伦顿·布里肯 02:18:30

Two layer.  两层。

Dwarkesh Patel 02:18:32 德瓦克什·帕特尔 02:18:32

So you can kind of see the thing that is a human picking up a new game and understanding it. How would you think about what that is? I presume it's across multiple layers. What would that physically look like? How big would it be maybe?
所以你可以看到一个人类拿起一个新游戏并理解它的东西。你会怎么想那是什么?我认为它跨越了多个层次。那会是什么样子?它可能有多大?

Trenton Bricken 02:18:53 特伦顿·布里肯 02:18:53

I mean, that would just be an empirical question, right? How big does the model need to be to perform this task? Maybe it's useful if I just talk about some other circuits that we've seen. So we've seen the IOI circuit, which is the indirect object identification. It's like, “Mary and Jim went to the store, Jim gave the object to ____.” It would predict “Mary” because Mary's appeared before, as the indirect object. Or, it'll infer pronouns. This circuit even has behavior where if you ablate it, then other heads in the model will pick up that behavior. We'll even find heads that want to do copying behavior, and then other heads will suppress it. So it's one head's job to just always copy the token that came before or the token that came five before, or whatever. And then it's another head's job to be like, “no, do not copy that thing.” There are lots of different circuits performing, in these cases, pretty basic operations. But when they're chained together you can get unique behaviors. 
我的意思是,这只是一个经验问题,对吧?模型需要多大才能执行此任务?如果我只谈谈我们看到的其他一些电路,也许会很有用。因此,我们已经看到了IOI电路,即间接对象识别。这就像,“玛丽和吉姆去了商店,吉姆把东西给了____。它会预测“玛丽”,因为玛丽之前出现过,作为间接宾语。或者,它会推断代词。该电路甚至具有行为,如果您将其烧蚀,则模型中的其他磁头将拾取该行为。我们甚至会发现想要做复制行为的脑袋,然后其他脑袋会压制它。因此,一个人的工作就是总是复制前面的令牌或前五代的令牌,或者其他什么。然后另一个负责人的工作是,“不,不要复制那个东西。在这些情况下,有许多不同的电路执行非常基本的操作。但是当它们被链接在一起时,你可以得到独特的行为。

Dwarkesh Patel 02:20:00 德瓦克什·帕特尔 02:20:00

It won't be something you can see in like a two layer transformer, so will you just be like, “this is the circuit for deception” or whatever? This part of the network fired when we at the end identified the thing as being deceptive. This part didn't fire when we didn't identify it as being deceptive. Therefore, this must be the deception circuit. 
它不会像两层变压器那样被你看到,所以你会说,“这是欺骗的电路”或其他什么?当我们最终确定该事物具有欺骗性时,网络的这一部分就触发了。当我们没有将其识别为具有欺骗性时,这部分没有触发。因此,这一定是欺骗电路。

Trenton Bricken 02:20:25 特伦顿·布里肯 02:20:25

I think a lot of the analysis will look like that. Anthropic has done quite a bit of research before on sycophancy, which is the model saying what it thinks you want to hear.
我认为很多这样的分析。Anthropic 之前对阿谀奉承做过相当多的研究,即模型说出它认为你想听的话

Dwarkesh Patel 02:20:36 德瓦克什·帕特尔 02:20:36

That requires us at the end to be able to label which one is bad and which one is good. 
这要求我们最终能够标记哪个是坏的,哪个是好的。

Trenton Bricken 02:20:42 特伦顿·布里肯 02:20:42

Yeah, so we have tons of instances–and actually as you make a lot of models larger, they do more of this–where the model clearly has features that model another person's mind and some subset of these, we're hypothesizing here, would be associated with more deceptive behavior. 
是的,所以我们有大量的实例——实际上,当你把很多模型做得更大时,它们会做更多的事情——模型显然具有模拟另一个人思维的特征,而我们在这里假设的这些特征的某些子集将与更多的欺骗性行为相关联。

Dwarkesh Patel 02:21:03 德瓦克什·帕特尔 02:21:03

Although it's doing that by… I don't know. ChatGPT is probably modeling me because that's what RLHF induces it to do.
虽然它是通过...我不知道。ChatGPT 可能正在模仿我,因为这就是 RLHF 诱导它做的事情。

Trenton Bricken 02:21:10 特伦顿·布里肯 02:21:10

Yeah. Theory of mind. 是的。心智理论。

02:21:12 - Will interp actually work on superhuman models #

02:21:12 - interp 真的可以在超人模型上工作吗

Dwarkesh Patel 02:21:12 德瓦克什·帕特尔 02:21:12

So first of all, there's the thing you mentioned earlier about redundancy. Have you caught the whole thing that could cause deception, or just one instance of it? Second of all, are your labels correct? Maybe you thought this wasn't deceptive but it's still deceptive. Especially if it's producing output you can't understand. Third, is the thing that's gonna be the bad outcome something that's even human-understandable? Deception is a concept we can understand.
首先,你刚才提到的关于冗余的问题。那么,你有没有抓住可能导致整个事情欺骗的整个事情,或者它只是其中的一个例子?其次,您的标签是否正确?也许你认为这不是欺骗性的,但它仍然具有欺骗性。特别是如果它正在产生你无法理解的输出。第三,将要导致糟糕结果的事情是人类可以理解的吗?欺骗是一个我们可以理解的概念。

Trenton Bricken 02:21:41 特伦顿·布里肯 02:21:41

A lot to unpack here. A few things. It's fantastic that these models are deterministic. When you sample from them, it's stochastic. But I can just keep putting in more inputs and ablate every single part of the model. This is kind of the pitch for computational neuroscientists to come and work on interpretability. It's like you have this alien brain, you have access to everything in it, and you can just ablate however much of it you want. 
这里有很多东西要解开。有几件事。这些模型是确定性的,这真是太棒了。当你从他们那里采样时,它是随机的。但是我可以继续输入更多的输入,并消融模型的每个部分。这是计算神经科学家来研究可解释性的一种推销方式。这就像你有这个外星人的大脑,你可以访问其中的一切,你可以随心所欲地消融。
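
A minimal sketch of that kind of ablation experiment on a toy network; a real interpretability setup would hook specific attention heads or MLP neurons in a trained language model, so the model and layer here are stand-ins:

```python
import torch
import torch.nn as nn

# Toy stand-in for the "alien brain": a tiny MLP whose parts we can ablate.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

def ablate_units(module, inputs, output, units=(0, 1, 2)):
    # Zero out a chosen slice of this layer's activations and let the
    # rest of the forward pass run unchanged.
    output = output.clone()
    output[:, list(units)] = 0.0
    return output

x = torch.randn(8, 16)
baseline = model(x)

handle = model[1].register_forward_hook(ablate_units)  # hook the ReLU layer's output
ablated = model(x)
handle.remove()

print((baseline - ablated).abs().mean())  # how much the ablation changed the output
```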

So I think if you do this carefully enough you really can start to pin down what the circuits involved are and what the backup circuits are, these sorts of things. It's a bit of a cop-out answer, but it's important to keep in mind that we'll be doing automated interpretability. As our models continue to get more capable, we have them assign labels or run some of these experiments at scale. With respect to detecting superhuman performance, which I think was the last part of your question, aside from the cop-out answer, if we buy this "associations all the way down," you should be able to coarse-grain the representations at a certain level such that they then make sense.
所以我认为,如果你做得足够仔细,你真的可以开始确定所涉及的电路是什么,备份电路是什么,诸如此类的事情。这有点像一个警察的答案,但重要的是要记住自动可解释性。随着我们的模型继续变得越来越强大,我们让他们分配标签或大规模运行其中一些实验。关于检测超人的表现,我认为这是你问题的最后一部分,除了警察的答案之外,如果我们购买这种“一路向下的关联”,你应该能够在一定水平上粗略地划分表征,以便它们有意义。

I think it was even in Demis's podcast. He's talking about how if a chess player makes a superhuman move, they should be able to distill it into reasons why they did it. Even if the model is not going to tell you what it is, you should be able to decompose that complex behavior into simpler circuits or features to really start to make sense of why it did that thing.
我想它甚至在 Demis 的播客中。他说,如果一个棋手做了一个超人的举动,他们应该能够把它提炼成他们这样做的原因。即使模型不会告诉你它是什么,你也应该能够将这种复杂的行为分解成更简单的电路或特征,以便真正开始理解它为什么这样做。

Dwarkesh Patel 02:23:08 德瓦克什·帕特尔 02:23:08

There's a separate question of if such representation exists. It seems like it must or actually I'm not sure if that's the case. And secondly, whether using this sparse autoencoder setup you could find it. In this case, if you don't have labels that are adequate to represent it, you wouldn't find it.
还有一个单独的问题,即是否存在这种表示。似乎必须或实际上我不确定是否是这种情况。其次,无论使用这种稀疏的自动编码器设置,您都可以找到它。在这种情况下,如果你没有足以表示它的标签,你就不会找到它。

Trenton Bricken 02:23:28 特伦顿·布里肯 02:23:28

Yes and no. We are actively trying to use dictionary learning now on the sleeper agents work, which we talked about earlier. If I just give you a model, can you tell me if there's this trigger in it and if it's going to start doing interesting behavior? It's an open question whether or not when it learns that behavior, it's part of a more general circuit that we can pick up on without actually getting activations for and having it display that behavior. Because that would kind of be cheating then. Or if it's learning some hacky trick that's a separate circuit that you'll only pick up on if you actually have it do that behavior. But even in that case, the geometry of features gets really interesting, because fundamentally, each feature is in some part of your representation space and they all exist with respect to each other.
是的,也不是。我们现在正在积极尝试将字典学习用于我们之前讨论过的潜伏代理工作。如果我给你一个模型,你能告诉我里面是否有这个触发因素,它是否会开始做有趣的行为吗?这是一个悬而未决的问题,当它学习到这种行为时,它是否是一个更通用的电路的一部分,我们可以在不实际激活并让它显示该行为的情况下接受它。因为那样的话就有点作弊了。或者,如果它正在学习一些黑客技巧,那是一个单独的电路,只有当你真的让它做这种行为时,你才会学习。但即使在这种情况下,特征的几何形状也变得非常有趣,因为从根本上说,每个特征都位于表示空间的某个部分,并且它们都彼此相对于彼此存在。

So in order to have this new behavior, you need to carve out some subset of the feature space for the new behavior and then push everything else out of the way to make space for it. Hypothetically, you can imagine you have your model before you've taught it this bad behavior and you know all the features or have some coarse-grained representation of them. You then fine-tune it such that it becomes malicious and then you can kind of identify this black hole region of feature space where everything else has been shifted away from that and you haven't put in an input that causes it to fire. Then you can start searching for what is the input that would cause this part of the space to fire. What happens if I activate something in this? There are a whole bunch of other ways that you can try and attack that problem. 
因此,为了拥有这种新行为,您需要为新行为开辟出一些特征空间的子集,然后将其他所有内容都推到一边,为它腾出空间。假设,你可以想象一下,在你教它这种不良行为之前,你就已经有了你的模型,并且你知道所有的特征,或者对它们有一些粗粒度的表示。然后你对它进行微调,使其变得恶意,然后你可以识别特征空间的这个黑洞区域,其他一切都已经远离了它,并且你没有输入导致它触发的输入。然后,您可以开始搜索会导致这部分空间触发的输入是什么。如果我激活了其中的某些内容会怎样?还有一大堆其他方法可以尝试解决这个问题。

Dwarkesh Patel 02:25:00 德瓦克什·帕特尔 02:25:00

This is sort of a tangent, but one interesting idea I heard was if that space is shared between models then you can imagine trying to find it in an open source model to then make… Like Gemma, Google's newly released open source model. They said in the paper that it's trained using the same architecture or something like that.
这有点切线,但我听到的一个有趣的想法是,如果该空间在模型之间共享,那么您可以想象尝试在开源模型中找到它,然后制作......就像 Gemma 一样,Google 新发布的开源模型。他们在论文中说,它是使用相同的架构或类似的东西训练的。

Sholto Douglas 02:25:20 肖尔托·道格拉斯 02:25:20

I have to be honest, I didn't know because I haven't read the Gemma paper.
老实说,我不知道,因为我没有读过杰玛的论文。

Dwarkesh Patel 02:25:23 德瓦克什·帕特尔 02:25:23

So to the extent that's true, how much of the red teaming you do on Gemma is potentially helping you jailbreak into Gemini?
因此,在某种程度上,你在杰玛身上所做的红队中有多少可能帮助你越狱进入双子座?

Trenton Bricken 02:25:35 特伦顿·布里肯 02:25:35

This gets into the fun space of how universal features are across models. Our "Towards Monosemanticity" paper looked at this a bit. I can't give you summary statistics, but there's the Base64 feature, for example, which we see across a ton of models. There are actually three of them, but they'll fire for and model Base64-encoded text, which is prevalent in every URL and there are lots of URLs in the training data. They have really high cosine similarity across models. So they all learn this feature, up to a rotation.
这进入了一个有趣的领域,即跨模型的功能是多么普遍。我们的“迈向单调性”论文对此进行了一些研究。我不能给你汇总的统计数据,但有 Base64 功能,例如,我们在大量模型中都可以看到。实际上有三个,但它们会触发并建模 Base64 编码文本,这在每个 URL 中都很普遍,并且训练数据中有很多 URL。它们在模型之间具有非常高的余弦相似性。所以他们都学会了这个功能,并在轮换中学习。

Sholto Douglas 02:26:08 肖尔托·道格拉斯 02:26:08

Like the actual vectors itself. 
就像实际的向量本身一样。

Trenton Bricken 02:26:09 特伦顿·布里肯 02:26:09

Yeah. I wasn't part of this analysis but it definitely finds the feature and they're pretty similar to each other across two separate models, the same model architecture but trained with different random seeds. 
是的。我没有参与这个分析,但它确实找到了这个特征,它们在两个独立的模型中非常相似,相同的模型架构,但用不同的随机种子进行训练。
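
Checking that kind of feature universality mostly comes down to comparing directions; a sketch with random vectors standing in for the decoder direction of the Base64 feature pulled out of two separately trained models:

```python
import numpy as np

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
# Made-up stand-ins; in reality these come out of dictionary learning on each model.
feature_model_a = rng.standard_normal(512)
feature_model_b = feature_model_a + 0.1 * rng.standard_normal(512)  # pretend near-duplicate

print(cosine_similarity(feature_model_a, feature_model_b))           # close to 1.0
print(cosine_similarity(feature_model_a, rng.standard_normal(512)))  # close to 0.0
```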

Sholto Douglas 02:26:22 肖尔托·道格拉斯 02:26:22

It supports the quanta theory of neural scaling. It's a hypothesis, right? If we just look at all models trained on a similar data set, they will learn the same features in the same order-ish. Roughly, you learn your n-grams, you learn your induction heads, and you learn to put full stops after numbered lines and this kind of stuff.
它支持神经缩放的量子理论。这是一个假设,对吧?我们只是在相似的数据集上查看所有模型。我们将以相同的顺序学习相同的功能。粗略地说,你学习你的N克,你学习你的感应头,你学会在编号线和类似的东西后面加上句号。

Dwarkesh Patel 02:26:36 德瓦克什·帕特尔 02:26:36

So this is another tangent. To the extent that that's true, and I guess there's evidence that it is true, why doesn't curriculum learning work? Because if it is the case that you learn certain things first, shouldn't directly training those things first lead to better results? 
所以这是另一个切线。在某种程度上,这是真的,我想有证据表明这是真的,为什么课程学习不起作用?因为如果你先学习某些东西,那么直接训练这些东西不应该带来更好的结果吗?

Sholto Douglas 02:26:49 肖尔托·道格拉斯 02:26:49

Both Gemini papers mention some aspect of curriculum learning. 
双子座的两篇论文都提到了课程学习的某些方面。

Dwarkesh Patel 02:26:53 德瓦克什·帕特尔 02:26:53

Okay, interesting. I find the fact that fine-tuning works as evidence of curriculum learning, right?
好吧,有意思。我发现微调可以作为课程学习的证据,对吧?

Because the last things you're training on have a disproportionate impact. 
因为你训练的最后一件事会产生不成比例的影响。

Sholto Douglas 02:27:02 肖尔托·道格拉斯 02:27:02

I wouldn't necessarily say that. There’s one mode of thinking in which fine-tuning is specialized, you've got this latent bundle of capabilities and you're specializing it for this particular use case that you want. I think I'm not sure how true or not that is.
我不一定会这么说。有一种思维模式是专门针对微调的,你已经拥有了这个潜在的功能包,并且你正在为你想要的这个特定用例专门化它。我想我不确定这是否真实。

Trenton Bricken 02:27:15 特伦顿·布里肯 02:27:15

I think the David Bau lab paper kind of supports this. You have that ability and you're just getting better at entity recognition, fine-tuning that circuit instead of other ones.
我认为David Bell实验室的论文支持这一点。你有这种能力,你只是在实体识别方面做得更好,微调那个电路而不是其他电路。

Dwarkesh Patel 02:27:23 德瓦克什·帕特尔 02:27:23

Sorry, what was the thing we were talking about before?
对不起,我们之前在谈论什么?

Sholto Douglas 02:27:25 肖尔托·道格拉斯 02:27:25

Generally I do think curriculum learning is a really interesting thing that people should explore more. It seems very plausible. I would really love to see more analysis along the lines of the quanta theory stuff: understanding better what you actually learn at each stage, decomposing that out, and exploring whether or not curriculum learning changes that.
总的来说,我确实认为课程学习是一件非常有趣的事情,人们应该更多地探索。这似乎很有道理。我真的很想看到更多关于量子理论的分析。当理解得更好时,你在每个阶段实际上学到了什么并将其分解出来?探索课程学习是否会改变这一点。

Dwarkesh Patel 02:27:43 德瓦克什·帕特尔 02:27:43

By the way I just realized, I just got in conversation mode and forgot there's an audience. Curriculum learning is when you organize the data set. When you think about a human, how they learn, they don't just see a random Wiki text and they just try to predict it. They're like,
顺便说一句,我刚刚意识到,我只是进入了对话模式,忘记了有观众。课程学习是指组织数据集。当你想到一个人,他们是如何学习的,他们不只是看到一个随机的维基文本,他们只是试图预测它。他们就像,
“we'll start you off with Lorax or something and then you'll learn.” I don't even remember what first-grade was like but you learned the things that first-graders learn and then second-graders and so forth. So you would imagine, 
“我们会从Lorax或其他东西开始,然后你就会学习。”我什至不记得一年级是什么样子的,但你学到了一年级学生学到的东西,然后是二年级学生,等等。所以你会想象,

Sholto Douglas 02:28:10 肖尔托·道格拉斯 02:28:10

We know you never got past first-grade.
我们知道你从来没有超过一年级。

Dwarkesh Patel 02:28:25 德瓦克什·帕特尔 02:28:25

Anyways, let's get back to the big picture before we get into a bunch of interpretability details. There's two threads I want to explore. First is, it makes me a little worried that there's not even an alternative formulation of what could be happening in these models that could invalidate this approach. I mean we do know that we don't understand intelligence. There are definitely unknown unknowns here. So the fact that there's not a null hypothesis… What if we’re just wrong and we don't even know the way in which we're wrong, which actually increases the uncertainty.
无论如何,在我们进入一堆可解释性细节之前,让我们回到大局。我想探讨两条线索。首先,这让我有点担心,这些模型中可能发生的事情甚至没有一个替代的表述可能会使这种方法无效。我的意思是,我们确实知道我们不了解智力。这里肯定有未知的未知数。因此,不存在零假设这一事实......如果我们只是错了,我们甚至不知道我们错的方式,这实际上增加了不确定性。

Trenton Bricken 02:29:05 特伦顿·布里肯 02:29:05

So it's not that there aren't other hypotheses, it's just that I have been working on superposition for a number of years and am very involved in this effort. So I'm less sympathetic to these other approaches, especially because our recent work has been so successful.
所以这并不是说没有其他假设,只是我已经研究叠加很多年了,并且非常参与这项工作。因此,我不太赞同这些其他方法,特别是因为我们最近的工作非常成功。

Sholto Douglas 02:29:26 肖尔托·道格拉斯 02:29:26

And quite high explanatory power. There's this beauty, like in the original scaling laws paper, there's this little bump that apparently corresponds to when the model learns induction heads.
而且解释力相当高。有一种美感,就像在最初的缩放定律论文中一样,有一个小凸起,显然对应于模型学习感应头的时间。

And then after that, it sort of goes off track, learns induction heads, gets back on track. It’s an incredible piece of retroactive explanatory power.
在那之后,它有点偏离了轨道,学习了感应头,回到了正轨。这是一种不可思议的追溯性解释力。

Trenton Bricken 02:29:50 特伦顿·布里肯 02:29:50

Before I forget it, I do have one thread on feature universality that you might want to include. There are some really interesting behavioral and evolutionary biology experiments on whether humans should learn a real representation of the world or not. You can imagine a world in which we saw all venomous animals as flashing neon pink, a world in which we would survive better. So it would make sense for us to not have a realistic representation of the world.
在我忘记它之前,我确实有一个关于功能通用性的线程,你可能想要加入。所以,有一些非常有趣的行为和进化生物学实验,关于人类是否应该学习世界的真实表征?你可以想象一个世界,在这个世界里,我们看到所有有毒的动物都是闪烁的霓虹粉色,一个我们生存得更好的世界。因此,对我们来说,没有一个真实的世界代表是有道理的。

There's some work where they'll simulate little basic agents and see if the representations they learn map to the tools they can use and the inputs they should have. It turns out if you have these little agents perform more than a certain number of tasks, given these basic tools and objects in the world, then they will learn a ground truth representation. Because there are so many possible use cases that you need, that you want to learn what the object actually is and not some cheap visual heuristic or other thing.
在一些工作中,他们会模拟一些基本代理,看看他们学习的表示是否映射到他们可以使用的工具和他们应该拥有的输入。事实证明,如果你让这些小智能体执行超过一定数量的任务,给定世界上的这些基本工具和对象,那么它们将学习基本事实表示。因为你需要很多可能的用例,所以你想要了解对象到底是什么,而不是一些廉价的视觉启发式或其他东西。

We haven't talked at all about free energy principle or predictive coding or anything else. But to the extent that all living organisms are trying to actively predict what comes next and form a really accurate world model, I'm optimistic that we are learning genuine features about the world that are good for modeling it and our language models will do the same, especially because we're training them on human data and human texts.
我们根本没有谈论过自由能原理或预测编码或其他任何东西。但是,在某种程度上,所有生物体都在试图积极地预测接下来会发生什么,并形成一个真正准确的世界模型,我乐观地认为,我们正在学习关于世界的真正特征,这些特征有利于对世界进行建模,我们的语言模型也会这样做,特别是因为我们正在用人类数据和人类文本来训练它们。

Dwarkesh Patel 02:31:23 德瓦克什·帕特尔 02:31:23

Another dinner party question. Should we be less worried about misalignment? Maybe that's not even the right term for what I'm referring to, but alienness and Shoggoth-ness? Given feature universality there are certain ways of thinking and ways of understanding the world that are instrumentally useful to different kinds of intelligences. So should we just be less worried about bizarro paperclip maximizers as a result? 
另一个晚宴问题。我们是否应该减少对错位的担忧?也许这甚至不是我所指的正确术语,而是陌生性和修格斯性?鉴于特征的普遍性,某些思维方式和理解世界的方式对不同类型的智能有用。那么,我们是否应该因此而减少对怪异回形针最大化器的担忧呢?

Trenton Bricken 02:31:52 特伦顿·布里肯 02:31:52

I think this is kind of why I bring this up as the optimistic take. Predicting the internet is very different from what we're doing though. The models are way better at predicting next tokens than we are. They're trained on so much garbage. They're trained on so many URLs. Like in the dictionary learning work, we find there are three separate features for Base64 encodings.
我想这就是我提出这个乐观态度的原因。不过,预测互联网与我们正在做的事情非常不同。这些模型在预测下一个代币方面比我们要好得多。他们接受过如此多的垃圾训练。他们接受过如此多的 URL 训练。就像在字典学习工作中一样,我们发现 Base64 编码有三个单独的功能。

Even that is kind of an alien example that is probably worth talking about for a minute. One of these Base64 features fired for numbers and predicted more of those. Another fired for letters. But then there was this third one that we didn't understand. And it fired for a very specific subset of Base64 strings. Someone on the team who clearly knows way too much about Base64 realized that this was the subset that was ASCII-decodable. So you could decode it back into the ASCII characters. The fact that the model learned these three different features and it took us a little while to figure out what was going on is very Shoggoth-esque.
即使这是一个外星人的例子,可能值得谈论一分钟。其中一个 Base64 功能针对数字触发并预测了更多数字。另一个人因信件而被解雇。但是还有第三个我们不明白的。它针对 Base64 功能的一个非常具体的子集触发。团队中有人显然对 Base64 了解太多,他意识到这是 ASCII 可解码的子集。因此,您可以将其解码回 ASCII 字符。事实上,模型学习了这三个不同的特征,我们花了一点时间才弄清楚发生了什么,这是非常修格斯式的。

Dwarkesh Patel 02:32:58 德瓦克什·帕特尔 02:32:58

That it has a denser representation of regions that are particularly relevant to predicting the next token.
它具有更密集的区域表示,这些区域与预测下一个令牌特别相关。

Trenton Bricken 02:33:03 特伦顿·布里肯 02:33:03

Yeah, it's clearly doing something that humans don't do. You can even talk to any of the current models in Base64 and it will reply in Base64 and you can then decode it and it works great. 
是的,它显然是在做一些人类不做的事情。您甚至可以在 Base64 中与任何当前模型交谈,它会在 Base64 中回复,然后您可以对其进行解码,并且效果很好。

Dwarkesh Patel 02:33:16 德瓦克什·帕特尔 02:33:16

I wonder if that particular example implies that the difficulty of interpretability with smarter models will be harder because it requires somebody with esoteric knowledge, like the person who just happened to see that Base64 has whatever that distinction was. Doesn't that imply that when you have the million line pull request, there is no human that's going to be able to decode two different features?
我想知道这个特定的例子是否意味着使用更智能的模型进行可解释性的难度会更难,因为它需要具有深奥知识的人,就像刚刚碰巧看到 Base64 具有这种区别的人一样。这是否意味着当你有百万行拉取请求时,没有人能够解码两个不同的特征?

Sholto Douglas 02:33:46 肖尔托·道格拉斯 02:33:46

And that's when you type a comment like, “small CLs please.”
这时,你输入一条评论,比如“请小 CL”。

Trenton Bricken 02:33:50 特伦顿·布里肯 02:33:50

Exactly. No, I mean you could do that, right? One technique here is anomaly detection. One beauty of dictionary learning instead of linear probes is that it's unsupervised. You are just trying to learn to span all of the representations that the model has and then interpret them later. But if there's a weird feature that suddenly fires for the first time that you haven't seen before, that's a red flag. You could also coarse-grain it so that it's just a single Base64 feature. Even the fact that this came up and we could see that it specifically fires for these particular outputs gets you a lot of the way there.
完全。不,我的意思是你可以这样做,对吧?这里的一种技术是异常检测。字典学习而不是线性探针的一个优点是它是无监督的。您只是在尝试学习跨越模型具有的所有表示形式,然后稍后解释它们。但是,如果有一个奇怪的功能突然触发,这是你以前从未见过的,那就是一个危险信号。您也可以对其进行粗粒度处理,使其只是一个 Base64 功能。即使它出现了,我们可以看到它专门触发 fpr 这些特定的输出,也可以让你走很多路。
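
A sketch of the anomaly-detection idea: keep per-feature firing statistics from trusted traffic, then flag any feature that fires strongly for the first time on a new input. The feature count and thresholds below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 10_000

# Max activation seen per feature on a trusted reference set (made-up numbers).
reference_max = 0.5 + rng.random(n_features)  # every feature has fired at least this much before
reference_max[137] = 0.0                      # ...except feature 137, which has never fired

def flag_anomalies(feature_acts, reference_max, margin=2.0):
    """Indices of features firing well beyond anything seen on reference data."""
    return np.where(feature_acts > margin * (reference_max + 1e-6))[0]

new_input_acts = 0.5 * rng.random(n_features)  # ordinary-looking activations
new_input_acts[137] = 0.9                      # a never-before-seen feature lights up

print(flag_anomalies(new_input_acts, reference_max))  # -> [137]
```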

I'm even familiar with cases from the auto-interpretability side. A human will look at a feature and try to annotate it as firing for Latin words. And then when you ask the model to classify it, it says it fires for Latin words that define plants. So it can already beat the human in some cases for labeling what's going on.
我什至熟悉自动可解释性方面的案例。人类会查看一个特征,并尝试将其注释为拉丁单词的触发。然后,当你要求模型对它进行分类时,它说它触发了定义植物的拉丁词。因此,在某些情况下,它已经可以击败人类来标记正在发生的事情。

Dwarkesh Patel 02:34:48 德瓦克什·帕特尔 02:34:48

At scale, this would require an adversarial thing between models where you have some model with millions of features, potentially for GPT-6, and just a bunch of models trying to figure out what each of these features means. Does that sound right? 
在规模上,这将需要模型之间的对抗性,在这些模型中,你有一些具有数百万个特征的模型,可能用于 GPT-6,而只有一堆模型试图弄清楚这些特征中的每一个意味着什么。这听起来对吗?

Trenton Bricken 02:35:07 特伦顿·布里肯 02:35:07

Yeah, but you can even automate this process. This goes back to the determinism of the model. You could have a model that is actively editing input text and predicting if the feature is going to fire or not, and figure out what makes it fire, what doesn't, and search the space.
是的,但您甚至可以自动执行此过程。这又回到了模型的确定性。您可以有一个模型,该模型正在主动编辑输入文本并预测要素是否会触发,并找出触发的原因,不触发的内容,然后搜索空间。
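
A sketch of that automation loop; here random word substitutions stand in for a model proposing edits, and `feature_activation` is a hypothetical stand-in for running the model and reading one dictionary-learning feature off its activations:

```python
import random

def feature_activation(text, target="anthrax"):
    # Hypothetical stand-in: in reality this would run the model and encode
    # its activations with the sparse autoencoder to read off one feature.
    return 1.0 if target in text.lower() else 0.0

def search_for_trigger(text, vocab, steps=200, seed=0):
    """Greedily edit words in the text until the target feature fires."""
    rng = random.Random(seed)
    words = text.split()
    for _ in range(steps):
        if feature_activation(" ".join(words)) > 0.5:
            return " ".join(words)
        i = rng.randrange(len(words))
        words[i] = rng.choice(vocab)  # try a random substitution
    return None

vocab = ["the", "spores", "of", "anthrax", "coffee", "biology"]
print(search_for_trigger("a harmless sentence about birds and coffee", vocab))
```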

Dwarkesh Patel 02:35:24 德瓦克什·帕特尔 02:35:24

I want to talk more about the feature splitting because I think that's an interesting thing that has been underexplored. 
我想更多地谈谈功能拆分,因为我认为这是一件有趣的事情,但尚未被充分探索。

Trenton Bricken 02:35:29 特伦顿·布里肯 02:35:29

Especially for scalability, I think it's underappreciated right now. 
特别是对于可扩展性,我认为它现在被低估了。

Dwarkesh Patel 02:35:33 德瓦克什·帕特尔 02:35:33

First of all, how do we even think about it? Is it really just that you can keep going down and down and there's no end to the amount of features?
首先,我们怎么想呢?真的只是你可以不停地下降,功能的数量是无穷无尽的吗?

Trenton Bricken 02:35:41 特伦顿·布里肯 02:35:41

So at some point I think you might just start fitting noise, or things that are part of the data but that the model isn't actually–
所以在某个时候,我认为你可能会开始拟合噪声,或者是数据的一部分,但模型实际上不是——

Dwarkesh Patel 02:35:50 德瓦克什·帕特尔 02:35:50

Do you want to explain what feature splitting is? 
你想解释一下什么是特征拆分吗?

Trenton Bricken 02:35:51 特伦顿·布里肯 02:35:51

It's the part before, where the model will learn however many features it has capacity for that still span the space of representation.
这是前面的部分,模型将学习它所能处理的仍然跨越表示空间的特征。

Dwarkesh Patel 02:36:02 德瓦克什·帕特尔 02:36:02

So give an example, potentially. 
所以举个例子,可能。

Trenton Bricken 02:36:03 特伦顿·布里肯 02:36:03

So if you don't give the model that much capacity for the features it's learning, concretely, if you don't project to as high-dimensional a space, it'll learn one feature for birds. But if you give the model more capacity, it will learn features for all the different types of birds. So it's more specific than otherwise. Oftentimes, there's the bird vector that points in one direction and all the other specific types of birds point in a similar region of the space, but they are obviously more specific than the coarse label.
所以你知道,如果你不给模型那么多的特征学习能力,具体来说,如果你投射到不那么高的维度空间,它就会学习鸟类的一个特征。但是,如果给模型更多的容量,它将学习所有不同类型鸟类的特征。所以它比其他方式更具体。通常,鸟类向量指向一个方向,而所有其他特定类型的鸟类都指向空间的类似区域,但显然比粗标签更具体。

Dwarkesh Patel 02:36:36 德瓦克什·帕特尔 02:36:36

Okay, so let's go back to GPT-7. First of all, is this sort of like a linear tax on any model to figure it out? Even before that, is this a one time thing you had to do or is this the kind of thing you have to do on every output? Or just one time it's not deceptive and we're good to roll?
好的,让我们回到 GPT-7。首先,这有点像对任何模型的线性征税吗?甚至在此之前,这是你必须做的一次性事情,还是你必须在每次输出中做的事情?或者只有一次它没有欺骗性,我们可以滚动?

Trenton Bricken 02:36:55 特伦顿·布里肯 02:36:55

So you do dictionary learning after you've trained your model and you feed it a ton of inputs and you get the activations from those. Then you do this projection into the higher dimensional space. So the method is unsupervised in that it's trying to learn these sparse features. You're not telling them in advance what they should be but, it is constrained by the inputs you're giving the model.
因此,在训练模型后,您可以进行字典学习,并给它提供大量输入,然后从中获得激活。然后你把这个投影到更高维的空间里。因此,该方法是无监督的,因为它试图学习这些稀疏特征。你没有提前告诉他们他们应该是什么,但是,它受到你给模型的输入的限制。

Two caveats here. One, we can try and choose what inputs we want. So if we're looking for theory of mind features that might lead to deception, we can put in the sycophancy data set.
这里有两个警告。第一,我们可以尝试选择我们想要的输入。因此,如果我们正在寻找可能导致欺骗的心理理论特征,我们可以放入阿谀奉承数据集。

Hopefully at some point we can move into looking at the weights of the model alone, or at least using that information to do dictionary learning. I think in order to get there, that's such a hard problem that you need to make traction on just learning what the features are first. So what's the cost of this?
希望在某个时候,我们可以单独查看模型的权重,或者至少使用这些信息来进行字典学习。我认为为了实现这一目标,这是一个非常困难的问题,您需要首先了解功能是什么。那么这样做的成本是多少?

Dwarkesh Patel 02:37:46 德瓦克什·帕特尔 02:37:46

Can you repeat the last sentence? About the weights of the model alone. 
你能重复最后一句话吗?关于模型的权重。

Trenton Bricken 02:37:50 特伦顿·布里肯 02:37:50

Right now we just have these neurons in the model. They don't make any sense. We apply dictionary learning. We get these features out. They start to make sense, but that depends on the activations of the neurons. The weights of the model itself, like what neurons are connected to other neurons, certainly have information in them. The dream is that we can kind of bootstrap towards actually making sense of the weights of the model, independent of the activations on the data. I'm not saying we've made any progress here, it's a very hard problem. But it feels like we'll have a lot more traction and be able to sanity check what we're finding with the weights if we're able to pull out features first.
现在我们只有这些神经元在模型中。它们没有任何意义。我们应用字典学习。我们把这些功能拿出来了。它们开始有意义,但这取决于神经元的激活。模型本身的权重,就像哪些神经元连接到其他神经元一样,肯定包含信息。我们的梦想是,我们可以引导人们真正理解模型的权重,这些权重与数据的激活无关。我并不是说我们在这里取得了任何进展,这是一个非常棘手的问题。但感觉如果我们能够首先提取特征,我们将拥有更大的牵引力,并且能够理智地检查我们用权重发现的东西。

Dwarkesh Patel 02:38:28 德瓦克什·帕特尔 02:38:28

For the audience, weights are permanent. I don't know if permanent is the right word, but they are the model itself whereas activations are the artifacts of any single call.
对于观众来说,权重是永久的。我不知道永久这个词是否正确,但它们是模型本身,而激活是任何单个调用的产物。

Sholto Douglas 02:38:39 肖尔托·道格拉斯 02:38:39

In a brain metaphor, the weights are like the actual connection scheme between neurons, and the activations are like the current neurons that are lighting up.
在大脑的比喻中,权重就像神经元之间的实际连接方案和当前排列的神经元的激活。

Dwarkesh Patel 02:38:48 德瓦克什·帕特尔 02:38:48

Okay. So there's going to be two steps to this for GPT-7 or whatever model we're concerned about. Actually, correct me if I'm wrong, but first training the sparse autoencoder and doing the unsupervised projection into a wider space of features that have a higher fidelity to what is actually happening in the model. And then secondly, labeling those features. Let's say the cost of training the model is N. What will those two steps cost relative to N?
好。因此,对于 GPT-7 或我们关注的任何模型,这将有两个步骤。实际上,如果我错了,请纠正我,但首先要训练稀疏自动编码器,并将无监督投影到更广阔的特征空间中,这些特征对模型中实际发生的情况具有更高的保真度。其次,标记这些特征。假设训练模型的成本为 N。相对于 N 这两个步骤的成本是多少?

Trenton Bricken 02:39:20 特伦顿·布里肯 02:39:20

We will see. It really depends on two main things. What are your expansion factors? How much are you projecting into the higher-dimensional space and how much data do you need to put into the model? How many activations do you need to give it? This brings me back to the feature splitting because if you know you're looking for specific features then you can start with a cheaper, coarse representation.
我们拭目以待。这实际上取决于两件主要的事情。您的扩张因素是什么?你向高维空间投射了多少,你需要将多少数据放入模型中?您需要激活多少次才能进行?这让我回到了功能拆分,因为如果你知道你正在寻找特定的功能,那么你可以从更便宜、更粗糙的表示开始。

So maybe my expansion factor is only two. So I have a thousand neurons and I'm projecting to a 2000 dimensional space. I get 2000 features out, but they're really coarse. Previously I had the example for birds. Let's move that example to a biology feature but I really care if the model has representations for bioweapons and trying to manufacture them. So what I actually want is like an anthrax feature. Let's say you only see the anthrax feature if, instead of going from a thousand dimensions to two thousand dimensions, I go to a million dimensions.
所以也许我的扩展系数只有两个。所以我有一千个神经元,我正在投射到一个2000维的空间。我得到了 2000 个功能,但它们真的很粗糙。以前我有鸟类的例子。让我们把这个例子移到生物学特征上,但我真的很关心这个模型是否有生物武器的表示并试图制造它们。所以我真正想要的就像一个炭疽病功能。假设你只看到炭疽病的特征,而不是从一千个维度到两千个维度,而是进入一百万个维度。

You can imagine this, this big tree of semantic concepts where biology splits into cells versus whole body biology and then further down it splits into all these other things. Rather than needing to immediately go from a thousand to a million and picking out that one feature of interest, you can find the direction that the biology feature is pointing in, which again is very coarse, and then selectively search around that space. So only do dictionary learning, if something in the direction of the biology feature fires first. The computer science metaphor here would be like, instead of doing breadth-first search, you're able to do depth-first search where you're only recursively expanding and exploring a particular part of this semantic tree of features.
你可以想象,这棵语义概念的大树,生物学分裂成细胞和全身生物学,然后再往下分裂成所有这些其他东西。与其立即从一千个到一百万个,然后挑选出一个感兴趣的特征,不如找到生物学特征所指向的方向,这又是非常粗糙的,然后有选择地搜索该空间。因此,只有当生物学特征方向的东西首先触发时,才会进行字典学习。这里的计算机科学比喻是,你可以做深度优先搜索,而不是做广度优先搜索,你只是递归地扩展和探索这个语义树特征的特定部分。
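
The coarse-to-fine idea reads naturally as tree search; in this sketch the feature tree and activations are invented, and "expanding" a node stands in for paying to train a wider dictionary only where a coarse parent feature fired:

```python
# Hypothetical coarse-to-fine feature tree: each node is a coarse feature,
# children are the finer features a wider dictionary would split it into.
feature_tree = {
    "biology": {"cell biology": {}, "bioweapons": {"anthrax": {}, "smallpox": {}}},
    "code": {"python": {}, "base64": {}},
}

# Invented feature activations for one input.
activations = {"biology": 0.9, "bioweapons": 0.8, "anthrax": 0.7}

def depth_first_features(tree, activations, threshold=0.5, prefix=()):
    """Only descend into (i.e. pay to split) subtrees whose coarse feature fired."""
    hits = []
    for name, children in tree.items():
        if activations.get(name, 0.0) > threshold:
            hits.append(prefix + (name,))
            hits += depth_first_features(children, activations, threshold, prefix + (name,))
    return hits

for path in depth_first_features(feature_tree, activations):
    print(" / ".join(path))
# biology
# biology / bioweapons
# biology / bioweapons / anthrax
```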

Dwarkesh Patel 02:41:05 德瓦克什·帕特尔 02:41:05

These features are not organized in ways that are intuitive for humans, right? Because we just don't have to deal with Base64, we just don't dedicate that much firmware to deconstructing which kind of Base64 it is. How would we know that the subjects… This will go back to the MOE discussion we'll have. I guess we might as well talk about it. “Mixtral of Experts”, the Mistral paper, talked about how the experts weren't specialized in a way that we could understand. There's not like a chemistry expert or a physics expert or something. So why would you think that it will be a biology feature and then you deconstruct, rather than “blah” and then you deconstruct. It's like “anthrax” and you're like “shoes” or whatever. 
这些功能不是以人类直观的方式组织的,对吧?因为我们只是不必处理 Base64,所以我们只是没有那么多固件来解构它是哪种 Base64。我们怎么知道受试者...这将回到我们将要进行的教育部讨论。我想我们不妨谈谈它。Mistral的论文“Mixtral of Experts”谈到了专家如何没有以我们可以理解的方式进行专业化。不像化学专家或物理专家之类的。那么,为什么你会认为这将是一个生物学特征,然后你解构,而不是“废话”然后你解构。这就像“炭疽病”,而你就像“鞋子”或其他什么。

Trenton Bricken 02:41:53 特伦顿·布里肯 02:41:53

So I haven't read the Mistral paper, but if you just look at the neurons in a model, they're polysemantic. So if all they did was just look at the neurons in a given expert, it's very plausible that those are also polysemantic because of superposition.
所以我没有读过Mistral的论文,但如果你只看模型中的神经元,它们是多义的。因此,如果他们所做的只是观察给定头部中的神经元,那么由于叠加,它也是多义的,这是非常合理的。

Sholto Douglas 02:42:10 肖尔托·道格拉斯 02:42:10

Talking on the thread that Dwarkesh mentioned there, have you seen in the subtrees when you expand them out, something in a subtree which you really wouldn't guess should be there based on the high level abstraction?
谈到 Dwarkesh 在那里提到的线程,当你扩展子树时,你有没有看到子树中的东西,你真的不会猜到基于高级抽象应该在那里?

Trenton Bricken 02:42:20 特伦顿·布里肯 02:42:20

This is a line of work that we haven't pursued as much as I want to yet but I think we're planning to, I hope that external groups do as well. What is the geometry of feature space? What's the geometry and how does that change over time?
这是我们还没有像我想的那样追求的工作,但我认为我们正在计划这样做,我希望外部团体也这样做。要素空间的几何结构是什么?什么是几何形状,它如何随时间变化?

Sholto Douglas 02:42:32 肖尔托·道格拉斯 02:42:32

It would really suck if the anthrax feature happened to be below the coffee can subtree or something like that, right? That feels like the kind of thing that you could quickly try and find proof of, which would then mean that you need to solve that problem and inject more structure into the geometry.
如果炭疽病的特征恰好在咖啡罐基质或类似的东西下方,那真的很糟糕,对吧?这感觉就像是你可以快速尝试并找到证明的东西,这意味着你需要解决这个问题,并在几何体中注入更多的结构。

Trenton Bricken 02:42:51 特伦顿·布里肯 02:42:51

Totally. It would really surprise me, especially given how linear the model seems to be, if there isn't some component of the anthrax feature, vector, that is similar to the biology vector and that they're not in a similar part of the space. But yes. Ultimately machine learning is empirical. We need to do this. I think it's going to be pretty important for certain aspects of scaling dictionary learning.
完全。这真的会让我感到惊讶,特别是考虑到模型似乎是线性的,如果炭疽特征的某些组成部分,向量,与生物学向量相似,并且它们不在空间的相似部分。但是是的。归根结底,机器学习是经验性的。我们需要这样做。我认为这对于扩展字典学习的某些方面非常重要。

Sholto Douglas 02:43:14 肖尔托·道格拉斯 02:43:14

Interesting. On the MOE discussion, there's an interesting scaling vision transformers paper that Google put out a little while ago. They do ImageNet classification with an MOE and they find really clear class specialization there for experts. There's a clear dog expert. 
有趣。在教育部的讨论中,谷歌不久前发表了一篇有趣的扩展视觉转换器论文。他们使用 MOE 进行 ImageNet 分类,他们在那里为专家找到了非常清晰的类别专业化。有一位明确的狗专家。

Dwarkesh Patel 02:43:31 德瓦克什·帕特尔 02:43:31

Wait, so did the Mistral people just not do a good job of identifying those? 
等等,那么米斯特拉尔人是不是没有很好地识别这些人?

Sholto Douglas 02:43:35 肖尔托·道格拉斯 02:43:35

It's hard. It's entirely possible that in some respects, there's almost no reason that all of the different arXiv features should go to one expert. I don't know what buckets they had in their paper, but let's say they had arXiv papers as one of the things. You could imagine biology papers going here, math papers going here, and all of a sudden your breakdown is ruined.
这很难。在某些方面,几乎没有理由将所有不同的存档功能都交给一位专家,这是完全可能的。我不知道他们的论文里有什么桶,但假设他们有arXiv论文作为其中之一。你可以想象生物论文在这里,数学论文在这里,突然之间你的崩溃被毁了。

But that vision transformer one, where the class separation is really clear and obvious, gives I think some evidence towards the specialization hypothesis.
但是,在视觉转换器中,阶级分离非常清晰和明显,我认为这为专业化假说提供了一些证据。

Trenton Bricken 02:44:08 特伦顿·布里肯 02:44:08

I think images are also in some ways just easier to interpret than text. There's Chris Olah's interpretability work on AlexNet and these other models. In the original AlexNet paper, they actually split the model across two GPUs just because GPUs were so limited back then, relatively speaking; they were still great at the time. That was one of the big innovations of the paper. They found branch specialization. And there's a Distill Pub article on this where colors go to one GPU and Gabor filters and line detectors go to the other. Like the floppy ear detector, that was just a neuron in the model that you could make sense of. You didn't need to disentangle superposition. So just different data set, different modality.
我认为在某些方面,图像也比文本更容易解释。Chris Olah 在 AlexNet 和其他模型上的可解释性工作。在最初的 AlexNet 论文中,他们实际上将模型拆分为两个 GPU,只是因为 GPU 在当时相对来说太糟糕了,它们在当时仍然很棒。这是该报的一大创新。他们找到分支专业化。Distill Pub 上有一篇关于此的文章,其中颜色进入一个 GPU,Gabor 滤镜和线检测器进入另一个 GPU。就像软耳探测器一样,这只是模型中的一个神经元,你可以理解。你不需要解开叠加。所以只是不同的数据集,不同的模式。

02:45:05 - Sholto’s challenge for the audience #

02:45:05 - Sholto对观众的挑战

Sholto Douglas 02:45:05 肖尔托·道格拉斯 02:45:05

I think a wonderful research project to do, if someone is out there listening to this, would be to try and take some of the techniques that Trenton's team has worked on and try and disentangle the neurons in the Mixtral model from the Mistral paper, which is open source. I think that's a fantastic thing to do.
我认为一个很棒的研究项目,如果有人在那里听这个,那就是尝试采用Trenton团队已经研究过的一些技术,并尝试解开Mistral论文中的神经元,Mistral模型,这是开源的。我认为这是一件了不起的事情。

It feels intuitively like there should be. They didn't demonstrate any evidence that there is. In general, there’s also a lot of evidence that there should be specialization. Go and see if you can find it. Anthropic has published most of their stuff on, as I understand it, dense models. Basically, that is a wonderful research project to try.
直觉上感觉应该有。他们没有证明有任何证据。总的来说,也有很多证据表明应该有专业化。去看看能不能找到它。据我所知,Anthropic 已经发布了他们的大部分内容,密集模型。基本上,这是一个值得尝试的精彩研究项目。

Trenton Bricken 02:45:40 特伦顿·布里肯 02:45:40

Given Dwarkesh's success with the Vesuvius Challenge, we should be pitching more projects because they will be solved if we talk about them on the podcast.
鉴于 Dwarkesh 在维苏威火山挑战赛中的成功,我们应该推销更多的项目,因为如果我们在播客上谈论它们,它们就会得到解决。

Dwarkesh Patel 02:45:47 德瓦克什·帕特尔 02:45:47

After the Vesuvius Challenge I was like, “wait why did I not even try.” Nat had told me about it before it dropped, because we recorded the episode before it dropped. Luke is obviously very smart and he's an amazing kid. He showed that a 21-year-old on some 1070 could do this. I was honestly thinking about that kind of experience like, “why didn't I do this. Fuck.”
在维苏威火山挑战赛之后,我就想,“等等,为什么我甚至没有尝试。Nat在它掉落之前就告诉过我,因为我们在它掉落之前就录制了这一集。卢克显然非常聪明,他是一个了不起的孩子。他展示了一个 21 岁的 1070 岁年轻人可以做到这一点。老实说,我在想那种经历,“我为什么不这样做。他妈的。

Trenton Bricken 02:46:25 特伦顿·布里肯 02:46:25

Yeah, get your hands dirty. 
是的,弄脏你的手。

Sholto Douglas 02:46:27 肖尔托·道格拉斯 02:46:27

Dwarkesh's request for research.
Dwarkesh的研究请求。

Dwarkesh Patel 02:46:33 德瓦克什·帕特尔 02:46:33

Oh I want to harp back on the neuron thing you said. I think a bunch of your papers have said that there's more features than there are neurons. A neuron is like, weights go in and a number comes out. That's so little information. There's street names and species and whatever. There's more of those kinds of things than there are “number comes out” in a model. But “number comes out” is so little information. How is that encoding for–
哦,我想回过头来谈谈你说的神经元问题。我想你的很多论文都说过,特征比神经元还多。一个神经元就像,权重进入,一个数字出来。信息太少了。有街道名称和物种等等。这类东西比模型中的“数字出来”要多。但是“数字出来”的信息太少了。这种编码是怎样的——

Trenton Bricken 02:47:10 特伦顿·布里肯 02:47:10

Superposition. You're just encoding a ton of features in these high-dimensional vectors.
重合。你只是在这些高维向量中编码了大量的特征。

Dwarkesh Patel 02:47:17 德瓦克什·帕特尔 02:47:17

In a brain, is there an axonal firing or however you think about it? I don't know how you think about how much superposition is there in the human brain?
在大脑中,是否存在轴突放电,或者你怎么想?我不知道你是怎么想人脑里有多少叠加态的?

Trenton Bricken 02:47:26 特伦顿·布里肯 02:47:26

So Bruno Olshausen, who I think of as the leading expert on this, thinks that all the brain regions you don't hear about are doing a ton of computation in superposition. So everyone talks about V1 as having Gabor filters and detecting lines of various sorts and no one talks about V2. I think it's because we just haven't been able to make sense of it. 
布鲁诺·奥尔斯豪森(Bruno Olshausen)是这方面的领先专家,他认为所有你没有听说过的大脑区域都在叠加中进行大量的计算。因此,每个人都在谈论 V1 具有 Gabor 滤波器和各种检测线,而没有人谈论 V2。我认为这是因为我们无法理解它。

Dwarkesh Patel 02:47:48 德瓦克什·帕特尔 02:47:48

What is V2? 什么是 V2?

Trenton Bricken 02:47:49 特伦顿·布里肯 02:47:49

It's the next part of the visual processing stream. So I think it's very likely that, fundamentally, superposition seems to emerge when you have high-dimensional data that is sparse. To the extent that you think the real world is that, which I would argue it is, we should expect the brain to also be underparameterized in trying to build a model of the world and also use superposition.
这是视觉处理流的下一部分。所以我认为,从根本上说,叠加似乎出现在你拥有稀疏的高维数据时。在某种程度上,你认为现实世界就是这样,我认为是这样,我们应该期望大脑在试图建立一个世界模型并使用叠加时也被低估了。

Sholto Douglas 02:48:11 肖尔托·道格拉斯 02:48:11

You can get a good intuition for this. Correct me if this example is wrong but consider a 2D plane, right? Let's say you have two axes which represent a two-dimensional feature space, two neurons basically. You can imagine them each turning on to various degrees. That's your X coordinate and your Y coordinate, but you can now map this onto a plane. You can actually represent a lot of different things in different parts of the plane. 
你可以对此有很好的直觉。如果这个例子是错误的,请纠正我,但考虑一个 2D 平面,对吧?假设你有两个轴,代表一个二维特征空间,基本上是两个神经元。你可以想象它们各自在不同程度上打开。这是你的 X 坐标和 Y 坐标,但你现在可以将其映射到平面上。实际上,你可以在飞机的不同部分表示很多不同的东西。
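
A small numerical illustration of that intuition: the crosstalk between features depends on the dimensionality of the space they share, not on any single neuron, and with enough dimensions you can pack in far more nearly-orthogonal features than you have neurons. The sizes here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def crosstalk(n_neurons, n_features):
    """Pack n_features random unit directions into an n_neurons-dimensional space
    and report how strongly one active feature bleeds into the others."""
    dirs = rng.standard_normal((n_features, n_neurons))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    readout = dirs @ dirs[0]           # activate feature 0, read every feature back out
    return np.abs(readout[1:]).mean()  # average interference on the inactive features

print(crosstalk(n_neurons=2, n_features=5))       # roughly 0.6: heavy interference in a plane
print(crosstalk(n_neurons=512, n_features=4096))  # roughly 0.03: near-orthogonal directions
```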

Dwarkesh Patel 02:48:37 德瓦克什·帕特尔 02:48:37

Oh, okay. So crucially then, superposition is not an artifact of a neuron. It is an artifact of the space that is created.
哦,好吧。因此,至关重要的是,叠加态不是神经元的产物。它是所创造空间的产物。

Trenton Bricken 02:48:44 特伦顿·布里肯 02:48:44

It's a combinatorial code, 
这是一个组合代码,

Dwarkesh Patel 02:48:45 德瓦克什·帕特尔 02:48:45

Okay, cool. We kind of talked about this but I think it’s kind of wild that this seems to be, to the best of our knowledge, the way intelligence works in these models and presumably also in brains. There's a stream of information going through that has "features" that are infinitely, or at least to a large extent, splittable and you can expand out a tree of what this feature is. And what's really happening is a stream, that feature is getting turned into this other feature or this other feature is added.
好吧,很酷。我们谈到了这一点,但我认为,据我们所知,这似乎是智能在这些模型中的工作方式,也可能是大脑中的工作方式,这有点疯狂。有一股信息流具有无限的“特征”,或者至少在很大程度上是可拆分的,你可以展开这个特征是什么的树。真正发生的是一个流,这个功能正在变成这个其他功能,或者这个其他功能被添加。

I don't know. It's not something I would have thought of intelligence as. It's a surprising thing. It's not what I would have expected necessarily.
我不知道。这不是我所认为的智力。这是一件令人惊讶的事情。这不是我所期望的。

Trenton Bricken 02:49:35 特伦顿·布里肯 02:49:35

What did you think it was?
你以为那是什么?

Dwarkesh Patel 02:49:36 德瓦克什·帕特尔 02:49:36

I don't know, man. I mean– 
我不知道,伙计。我的意思是–

Sholto Douglas 02:49:39 肖尔托·道格拉斯 02:49:39

GOFAI. GOFAI. He's a GOFAI-er.
戈菲。戈菲。他是GOFAI-er。

Trenton Bricken 02:49:40 特伦顿·布里肯 02:49:40

Well, actually, that's a great segue because all of this feels like GOFAI. You're using distributed representations, but you have features and you're applying these operations to the features. There's this whole field of vector symbolic architectures, which is this computational neuroscience thing. All you do is put vectors in superposition, which is literally a summation of two high-dimensional vectors, and you create some interference. But if it's high-dimensional enough, then you can still represent them, and you have variable binding where you bind one vector to another. If you're dealing with binary vectors, it's just the XOR operation. So you have A, B, you bind them together. Then if you query with A or B again, you get out the other one. This is basically like key-value pairs from attention. With these two operations, you have a Turing-complete system, with which you can, if you have enough nested hierarchy, represent any data structure you want. Et cetera, et cetera.
嗯,实际上,这是一个很好的续集,因为所有这些感觉就像 GOFAI。您使用的是分布式表示,但您具有特征,并且正在将这些操作应用于特征。有整个向量符号架构领域,这是计算神经科学的东西。你所要做的就是将向量叠加,这实际上是两个高维向量的总和,然后你就会产生一些干扰。但是,如果它足够高维,那么你可以表示它们,并且你有可变的绑定,你可以在其中相互连接。如果你正在处理二进制向量,它只是XOR运算。所以你有A,B,你把它们绑在一起。然后,如果您再次向 A 或 B 查询,则会弹出另一个。这基本上就像来自注意力的键值对。通过这两个操作,你就得到了一个图灵完备系统,如果你有足够的嵌套层次结构,你可以用它来表示你想要的任何数据结构。等等,等等。
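
A sketch of that binding operation with binary vectors, where binding and unbinding are both XOR; the dimensionality is arbitrary and bundling/cleanup are left out:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # high-dimensional binary vectors

def rand_vec():
    return rng.integers(0, 2, D, dtype=np.uint8)

def bind(a, b):        # variable binding: XOR (its own inverse)
    return a ^ b

def similarity(a, b):  # fraction of matching bits: ~0.5 unrelated, 1.0 identical
    return float((a == b).mean())

A, B = rand_vec(), rand_vec()  # e.g. a role ("subject") and a filler ("Mary")
pair = bind(A, B)              # a key-value-like pairing

recovered = bind(pair, A)      # query with A to get the other element back
print(similarity(recovered, B))                  # 1.0 -> exactly B
print(similarity(pair, A), similarity(pair, B))  # ~0.5 -> the pairing resembles neither input
```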

Dwarkesh Patel 02:50:39 德瓦克什·帕特尔 02:50:39

Let's go back to superintelligence. So walk me through GPT-7. You've got the sort of depth-first search on its features. Okay so GPT-7 has been trained. What happens next? Your research has succeeded. GPT-7 has been trained. What are you, what are we doing now? 
让我们回到超级智能。因此,请带我了解 GPT-7。你已经对它的功能进行了深度优先的搜索。好的,GPT-7 已经训练过了。接下来会发生什么?你的研究成功了。GPT-7 已经过训练。你是什么,我们现在在做什么?

Trenton Bricken 02:50:59 特伦顿·布里肯 02:50:59

We try to get it to do as much interpretability work and other safety work as possible. 
我们试图让它尽可能多地进行可解释性工作和其他安全工作。

Dwarkesh Patel 02:51:04 德瓦克什·帕特尔 02:51:04

No, but concretely, what has happened such that you're like, “cool, let's deploy GPT-7?”
不,但具体来说,发生了什么让你觉得,“很酷,让我们部署 GPT-7?

Trenton Bricken 02:51:10 特伦顿·布里肯 02:51:10

I mean we do have our responsible scaling policy and it’s been really exciting to see other labs adopt it.
我的意思是,我们确实有负责任的扩展政策,看到其他实验室采用它真的很令人兴奋。

Dwarkesh Patel 02:51:19 德瓦克什·帕特尔 02:51:19

Specifically from the perspective of your research. Given your research, we got the thumbs up on GPT-7 from you, or actually, we should say Claude. Then, what is the basis on which you're telling the team, “hey, let's go ahead”?
特别是从你的研究的角度来看。鉴于您的研究,我们对 GPT-7 竖起了大拇指,或者实际上,我们应该说 Claude。那么,你告诉团队“嘿,让我们继续吧”的依据是什么?

Trenton Bricken 02:51:36

If it's as capable as GPT-7 implies here, I think we need to make a lot more interpretability progress to be able to comfortably give the green light to deploy it. I would definitely not, I'd be crying. Maybe my tears would interfere with the GPUs, or TPUs.

Sholto Douglas 02:51:58

Guys, Gemini 5, TPUs.

Dwarkesh Patel 02:52:09

But given the way your research is progressing, what does it look like to you? If this succeeded, what would it mean for us to okay GPT-7 based on your methodology?

Trenton Bricken 02:52:22

Ideally we can find some compelling deception circuit which lights up when the model knows that it's not telling the full truth to you.

Dwarkesh Patel 02:52:31

Why can't you just do a linear probe like Collin Burns did?

Trenton Bricken 02:52:34

The CCS work is not looking good in terms of replicating or actually finding truth directions. In hindsight, why should it have worked so well? With linear probes, you need to know what you're looking for and it's a high-dimensional space. It's really easy to pick up on a direction that's just not–

Dwarkesh Patel 02:52:50

Wait, but here you also need to label the features. So you still need to know.

Trenton Bricken 02:52:53

You need to label them post hoc, but it's unsupervised. You're just like, “give me the features that explain your behavior.” That's the fundamental question, right? The actual setup is: we take the activations, we project them into this higher-dimensional space, and then we project them back down again. So it's like, “reconstruct or do the thing that you were originally doing, but do it in a way that's sparse.”
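
Concretely, this is the sparse autoencoder, or dictionary-learning, setup: project activations into an overcomplete feature space with a sparsity penalty, then reconstruct them. Below is a minimal sketch; the layer sizes, the ReLU encoder, the L1 coefficient, and the names are illustrative assumptions, not the exact recipe used in the work being discussed.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Project activations up to a wider feature space, then reconstruct them."""
    def __init__(self, d_model: int = 512, d_features: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # project up
        self.decoder = nn.Linear(d_features, d_model)  # project back down

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))      # non-negative, hopefully sparse codes
        recon = self.decoder(features)
        return recon, features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    # "Reconstruct the thing you were originally doing, but do it sparsely."
    recon_loss = (recon - acts).pow(2).mean()
    sparsity = features.abs().mean()
    return recon_loss + l1_coeff * sparsity

# Usage: `acts` would be residual-stream or MLP activations collected from the model.
sae = SparseAutoencoder()
acts = torch.randn(64, 512)                            # stand-in batch of activations
recon, features = sae(acts)
sae_loss(recon, acts, features).backward()
```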

Dwarkesh Patel 02:53:14

By the way, for the audience, a linear probe is when you just classify the activations. From what I vaguely remember about the paper, if it's telling a lie, then you just train a classifier on whether, in the end, it was a lie. Or just wrong or something?

Trenton Bricken 02:53:36

It was like true or false questions.

Dwarkesh Patel 02:53:37

It's a classifier on activations.
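
For contrast with the unsupervised dictionary-learning setup, a linear probe is just a supervised classifier fit directly on activations. A tiny sketch, where the data, labels, and dimensions are all placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: one activation vector per statement, labeled true (1) or false (0).
acts = np.random.randn(1000, 512)
labels = np.random.randint(0, 2, size=1000)

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
truth_direction = probe.coef_[0]   # the single learned direction in activation space
print(probe.score(acts, labels))
```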

Trenton Bricken 02:53:41

So what we do for GPT-7, ideally we have some deception circuit that we've identified that appears to be really robust and–

Dwarkesh Patel 02:53:51

So you've done the projecting out to the million features or something. Maybe we're using "feature" and "circuit" interchangeably when they're not the same thing. Is there a deception circuit?

Trenton Bricken 02:54:04

So I think there are features across layers that create a circuit. Hopefully the circuit gives you a lot more specificity and sensitivity than an individual feature. And hopefully we can find a circuit that is really specific to the model deciding to be deceptive, in cases that are malicious. I'm not interested in a case where it's just doing theory of mind to help you write a better email to your professor. I'm not even interested in cases where the model is just modeling the fact that deception has occurred.
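
Since "specificity and sensitivity" are doing real work here, a small illustrative sketch of how you might score a candidate deception detector, whether a single feature or a circuit, against labeled examples. The data and the firing threshold are random stand-ins, not anything from the actual research.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: detector activations on 1,000 examples, with labels for whether
# each example actually involved malicious deception.
activations = rng.random(1000)
is_deceptive = rng.integers(0, 2, size=1000).astype(bool)

fires = activations > 0.5                                               # detector "lights up"
sensitivity = (fires & is_deceptive).sum() / is_deceptive.sum()         # true-positive rate
specificity = (~fires & ~is_deceptive).sum() / (~is_deceptive).sum()    # true-negative rate
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```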

Dwarkesh Patel 02:54:41

But doesn't all this require you to have labels for all those examples? And if you have those labels, then whatever faults the linear probe has, maybe labeling the wrong thing or whatever, wouldn't the same apply to the labels you've come up with for the unsupervised features?

Trenton Bricken 02:55:01

So in an ideal world, we could just train on like the whole data distribution and then find the directions that matter. To the extent that we need to reluctantly narrow down the subset of data that we're looking over, just for the purposes of scalability, we would use data that looks like the data you'd use to fit a linear probe. But again, with the linear probe you're also just finding one direction. We're finding a bunch of directions here.

Dwarkesh Patel 02:55:29

And I guess the hope is that you've found a bunch of things that light up when it's being deceptive. Then you can figure out why some of those things are lighting up in this part of the distribution and not this other part, and so forth.

Trenton Bricken 02:55:38

Totally. Yeah.

Dwarkesh Patel 02:55:40

Do you anticipate you'll be able to understand? The current models you've studied are pretty basic, right? Do you think you'll be able to understand why GPT-7 fires in certain domains, but not in other domains?

Trenton Bricken 02:55:50

I'm optimistic. So I guess one thing is that this is a bad time to answer this question, because we are explicitly investing in the longer-term ASL-4 models, which GPT-7 would be. So we split the team, where a third is focused on scaling up dictionary learning right now. That's been great. We publicly shared some of our 8-layer results. We've scaled up quite a lot past that at this point. Of the other two groups, one is trying to identify circuits, and the other is trying to get the same success for attention heads.

So we're setting ourselves up and building the tools necessary to really find these circuits in a compelling way. But it's going to take another, I don't know, six months before that's really working well. But I can say that I'm optimistic and we're making a lot of progress.

Dwarkesh Patel 02:56:33

What is the highest-level feature you've found so far? Like Base64 or whatever. In The Symbolic Species, the book you recommended, there are indexical things where you see a tiger and you're like, "run," and whatever. Just a very behaviorist thing. Then there's a higher level at which, when I refer to love, it refers to a movie scene or my girlfriend or whatever.

Trenton Bricken 02:57:01

It's like the top of the tent.

Dwarkesh Patel 02:57:02

Yeah. What is the highest level of association you found?

Trenton Bricken 02:57:07

Well, publicly, one of the ones that we shared in our update. So I think there were some related to love, and to sudden changes in scene, particularly associated with wars being declared. There are a few of them in that post, if you want to link to it. But even Bruno Olshausen had a paper back in 2018, 2019, where they applied a similar technique to a BERT model and found that as you go to deeper layers of the model, things become more abstract.

So I remember in the earlier layers, there'd be a feature that would just fire for the word "park." But later on there was a feature that fired for "Park" as a last name, like Lincoln Park; it's a common Korean last name as well. And then there was a separate feature that would fire for parks as grassy areas. So there's other work that points in this direction.

Dwarkesh Patel 02:57:55

What do you think we'll learn about human psychology from the interpretability stuff? I'll give you a specific example. I think one of your updates put it as "persona lock-in." You remember Sydney Bing, or whatever persona it's locked into. I think that was actually quite endearing.

Sholto Douglas 02:58:16

I thought it was so funny. I'm glad it's back in Copilot.

Trenton Bricken 02:58:20

It's been misbehaving recently.

Dwarkesh Patel 02:58:22

Actually, this is another sort of thread. But there was a funny one where I think it was negging a New York Times reporter. It was like, "you are nothing. Nobody will ever believe you. You are insignificant."

Sholto Douglas 02:58:39

It was trying to convince him to break up with his wife or something.

Dwarkesh Patel 02:58:44

So this is an interesting example. Personas. Is Sydney Bing having this personality a feature, versus another personality it could have gotten locked into? And is that fundamentally what humans are like, where in front of different people I'm a different sort of personality? Is that the same kind of thing that's happening to ChatGPT when it gets RL-ed? I don't know. A whole cluster of questions you can answer.

Trenton Bricken 02:59:19

I really want to do more work on this. The sleeper agents work is in this direction of what happens to a model when you fine-tune it, when you RLHF it, these sorts of things. Maybe it's trite, but you could just conclude that people contain multitudes, insofar as they have lots of different features.

There's even the stuff related to the Waluigi effect, where in order to know what's good or bad, you need to understand both of those concepts. So we might have to have models that are aware of violence, and have been trained on it, in order to recognize it. Can you post hoc identify those features and ablate them, in a way where maybe your model is slightly naive, but you know that it's not going to be really evil? Totally, that's in our toolkit, which seems great.
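
A hedged sketch of what "identify those features and ablate them" could look like, reusing the sparse-autoencoder sketch from earlier: encode the activations, zero out the unwanted feature directions, and decode back. The function name, and the idea that the feature indices come from post-hoc labeling, are assumptions for illustration.

```python
import torch

def ablate_features(sae, acts: torch.Tensor, feature_ids: list[int]) -> torch.Tensor:
    """Encode activations, zero the chosen features, and decode back.

    `sae` is a trained sparse autoencoder like the earlier sketch; which feature
    indices count as "violence" or "deception" would come from post-hoc labeling.
    """
    features = torch.relu(sae.encoder(acts))
    features[:, feature_ids] = 0.0      # clamp the unwanted features to zero
    return sae.decoder(features)        # edited activations to run the model on
```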

Dwarkesh Patel 02:59:58

Oh, really? So GPT-7 pulls a Sydney Bing, and then you figure out what the causally relevant pathways were and you modify them. To you, the path forward looks like you just change those? But you were mentioning earlier that there's a bunch of redundancy in the model.

Trenton Bricken 03:00:14

So you need to account for all that, but we have a much better microscope into this now than we used to. Sharper tools for making edits.

Sholto Douglas 03:00:25

At least from my perspective, that seems like one of the primary ways of confirming the safety or the reliability of the model to some degree, where you can say, "okay, we found the circuits responsible, we ablated them, and under a battery of tests we haven't been able to replicate the behavior we intended to ablate." That feels like the sort of way of measuring model safety in the future, as I would understand it.

That's why I'm incredibly hopeful about their work. To me, it seems like a much more precise tool than something like RLHF. With RLHF, you're very much prey to the black swan thing. You don't know if it's going to do something wrong in a scenario that you haven't measured. Here, at least, you have somewhat more confidence that you can completely capture the behavior set, or the feature set, and selectively avoid it.

Dwarkesh Patel 03:01:16

Although you haven't necessarily labeled it accurately.

Sholto Douglas 03:01:19

Not necessarily, but with a far higher degree of confidence than any other approach that I've seen.

Dwarkesh Patel 03:01:24

What are your unknown unknowns for superhuman models in terms of this kind of thing? What are the labels that are going to be the things on which we can determine whether this thing is cool or a paperclip maximizer?

Trenton Bricken 03:01:44

We'll see. The superhuman feature question is a very good one. I think we can attack it, but we're gonna need to be persistent. The real hope here is automated interpretability. You could even have a debate setup where two different models are debating what the feature does, and then they can actually go in, make edits, and see whether it fires or not. It is just this wonderful, closed environment that we can iterate on really quickly. That makes me optimistic.

Dwarkesh Patel 03:02:18

Do you worry about alignment succeeding too hard? I would not want either companies or governments, whoever ends up in charge of these AI systems, to have the level of fine-grained control over AIs that we would have if your agenda succeeds. Both for the ickiness of having this level of control over an autonomous mind, and secondly, I just don't fucking trust these guys. I'm just kind of uncomfortable with, say, the loyalty feature being turned up. How much worry do you have about having too much control over the AIs? Not specifically you, but whoever ends up in charge of these AI systems being able to lock in whatever they want.

Trenton Bricken 03:03:07

I think it depends on what government exactly has control and what the moral alignment is there.

Sholto Douglas 03:03:15

That is the whole value lock-in argument in my mind. It's definitely one of the strongest contributing factors for why I am working on capabilities at the moment. I think the current player set is actually extremely well-intentioned. For this kind of problem, I think we need to be extremely open about it. I think directions like publishing the constitution that you expect your model to abide by, trying to make sure that you RLHF it towards that and ablate against it, and giving everyone the ability to offer feedback and contribute to it, are really important.

Dwarkesh Patel 03:03:48

Sure. Alternatively, don't deploy when you're not sure. Which would also be bad because then we just never catch it.

Sholto Douglas 03:03:55

Right, exactly.

03:03:57 - Rapid fire #

Dwarkesh Patel 03:03:57

Some rapid fire. What is the bus factor for Gemini?

Sholto Douglas 03:04:06

I think there are a number of people who are really, really critical. If you took them out, then the performance of the program would be dramatically impacted. This is both on modeling and making decisions about what to actually do, and, importantly, on the infrastructure side of things. It's just that the stack of complexity builds, particularly when someone like Google has so much vertical integration. When you have people who are experts, they become quite important.

Dwarkesh Patel 03:04:40

Although I think it's an interesting note about the field that people like you can get in and in a year or so you're making important contributions. Especially with Anthropic, but many different labs have specialized in hiring total outsiders, physicists or whatever. You just get them up to speed and they're making important contributions. I feel like you couldn't do this in a bio lab or something. It's an interesting note on the state of the field.

Trenton Bricken 03:05:05

I mean, bus factor doesn't define how long it would take to recover from it, right? Deep learning research is an art, and so you kind of learn how to read the loss curves or set the hyperparameters in ways that empirically seem to work well.

Sholto Douglas 03:05:20

It's also organizational things like creating context. One of the most important and difficult skills to hire for is creating this bubble of context around you that makes other people around you more effective and know what the right problem is to work on. That is a really tough thing to replicate.

Trenton Bricken 03:05:36

Yes, totally.

Dwarkesh Patel 03:05:37

Who are you paying attention to now in terms of things coming down the pike: multimodality, long context, maybe agents, extra reliability, et cetera? Who is thinking well about what that implies?

Sholto Douglas 03:05:56

It's a tough question. I think a lot of people look internally these days for their sources of insight or progress. Obviously there are research programs and directions that will be tended to over the next couple of years. Most people, as far as betting on what the future will look like, refer to an internal narrative. It's difficult to share.

Trenton Bricken 03:06:27

If it works well, it's probably not being published.

Dwarkesh Patel 03:06:31

That was one of the things in the scaling post. I was referring to something you said to me. I miss the undergrad habit of just reading a bunch of papers. Because now nothing worth reading is published.

Sholto Douglas 03:06:45

And the community is progressively getting more on track with what I think are the right and important directions.

Dwarkesh Patel 03:06:53

You're watching it like an agent AI?

Sholto Douglas 03:06:55

No, but it is tough. There used to be this signal from big labs about what would work at scale, and it's currently really hard for academic research to find that signal. I think getting really good problem taste about what actually matters to work on is really tough unless you have the feedback signal of what will work at scale and what is currently holding us back from scaling further or understanding our models further.

This is something where I wish more academic research would go into fields like interpretability, which are legible from the outside. Anthropic deliberately publishes all its research here, and it seems underappreciated. I don't know why there aren't dozens of academic departments trying to follow Anthropic in interpretability research, because it seems like an incredibly impactful problem that doesn't require ridiculous resources and has all the flavor of deeply understanding the basic science of what is actually going on in these things. I don't know why people focus on pushing model improvements as opposed to pushing the kind of understanding improvements that I would have typically associated with academic science.

Trenton Bricken 03:08:06

I do think the tide is changing there, for whatever reason. Neel Nanda has had a ton of success promoting interpretability, in a way where Chris Olah hasn't been as active recently in pushing things. Maybe because Neel's just doing quite a lot of the work, I don't know. Four or five years ago, Chris was really pushing and talking at all sorts of places about these sorts of things, and people weren't anywhere near as receptive. Maybe they've just woken up to the fact that deep learning matters and is clearly useful post-ChatGPT. It's kind of striking.

Dwarkesh Patel 03:08:38

Okay. I'm trying to think of a good last question. One thing I'm thinking of is, do you think models enjoy next-token prediction? We have this sense of things that were rewarded in our ancestral environment. There's this deep sense of fulfillment that we think we're supposed to get from things like community, or sugar, or whatever we wanted on the African savannah. Do you think that in the future, models trained with RL and a lot of post-training on top will like predicting the next token again, the way we just really like ice cream? Like in the good old days.

Trenton Bricken 03:09:30

So there's this ongoing discussion of "are models sentient or not" and "do you thank the model when it helps you?" But I think if you want to thank it, you actually shouldn't say thank you. You should just give it a sequence that's very easy to predict. The even funnier part of this is that there is some work on this where, if you just give it the sequence 'A' over and over again, then eventually the model will just start spewing out all sorts of things that it otherwise wouldn't ever say. So I won't say anything more about that, but you should just give your model something very easy to predict as a nice little treat.

Dwarkesh Patel 03:10:07

This is what hedonium ends up being.

Sholto Douglas 03:10:13

Do we even like things that are easy to predict? Aren't we constantly in search of the bits of entropy? Shouldn't you be giving it things that are just slightly too hard to predict, just out of reach?

Trenton Bricken 03:10:28

I wonder. At least from the free energy principle perspective, you don't want to be surprised. So maybe it's that I don't feel surprised. I feel in control of my environment, and now I can go and seek things out, and I've been predisposed to think that, in the long run, it's better to explore new things right now. Leave the rock that I've been sheltered under, which ultimately leads me to build a house or some better structure. But we don't like surprises. I think most people are very upset when expectation does not meet reality.

Sholto Douglas 03:11:00

That's why babies love watching the same show over and over and over again, right?

Trenton Bricken 03:11:03

Yeah, interesting. I can see that.

Sholto Douglas 03:11:06

I guess they're learning to model it and stuff too.

Dwarkesh Patel 03:11:11

Well, hopefully this will be the repeat that the AI has learned to love. I think that's a great place to wrap. I should also mention that the better part of what I know about AI, I've learned from just talking with you guys. We've been good friends for about a year now. I appreciate you guys getting me up to speed here.

Trenton Bricken 03:11:32

You ask great questions. It's really fun to hang and chat.

Sholto Douglas 03:11:36

I really treasure our time together.

Trenton Bricken 03:11:38

You're getting a lot better at pickleball.

Sholto Douglas 03:11:39

Hey, we're trying to progress to tennis. Come on.

Dwarkesh Patel 03:11:51

Awesome. Cool. Thanks.