Discussion about this post

Helen (Eleni) Petrakou

I'm sorry, but nowhere do you actually show that AI is objectively not predicting the next word, and that was supposed to be the selling point of this post : (

Zermelane 🐬

One small issue I have: technically, the people who just want a bit to point at and say "that's the next-token prediction, it's still happening, it's right there" still have a bit to point at, and I don't think you really refuted that here. In fact, there are two bits!

There's still a next-token distribution getting sampled from at each token. I think the crux is: a distribution modelling what? Because it's certainly not trying to maximize the likelihood of any existing data set, not in modern models. It's an engineering artifact that gets pushed every which way by different training phases with different losses and rewards and KL penalties etc.
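To make "losses and rewards and KL penalties" a bit more concrete, here is a rough sketch of the kind of per-token objective RLHF-style post-training tends to use. The function name, argument names, and the beta value are all illustrative, not anything from the post:

```python
import torch

def kl_penalized_reward(reward: torch.Tensor,
                        policy_logprob: torch.Tensor,
                        base_logprob: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    # Task reward minus a per-token penalty for drifting away from the
    # base (pretrained) model's distribution. The (policy - base) log-prob
    # difference is the usual single-sample estimate of the KL term.
    return reward - beta * (policy_logprob - base_logprob)
```

Nothing in that objective is "maximize the likelihood of the training text"; it just happens to act on a model that still outputs a next-token distribution.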

It *is* still very useful to know about the sampling loop, but it's useful in the same sense as "wheeled vehicle" is a good and accurate descriptor of both bicycles and wheeled excavators.
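If it helps, the whole sampling loop fits in a few lines. This is only a sketch, assuming a hypothetical `model` that maps a batch of token ids to next-token logits; none of these names come from the post:

```python
import torch
import torch.nn.functional as F

def generate(model, prompt_ids: list[int], max_new_tokens: int = 50,
             eos_id: int = 0, temperature: float = 1.0) -> list[int]:
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # The model emits one distribution over the vocabulary per step...
        logits = model(torch.tensor([ids]))[0, -1]
        probs = F.softmax(logits / temperature, dim=-1)
        # ...and we draw a single token from it, append it, and repeat.
        next_id = torch.multinomial(probs, num_samples=1).item()
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```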

Also, there's still pre-training on a large text dataset. As far as I know, it hasn't gone away as such; it's just been relegated to being the first part of a much longer training process. I think here the computer vision people have a very good term in "pretext task". The more post-training you do, the more you can say that the purpose of the pretraining was only to learn internal representations, and the final model's behavior can go arbitrarily far from that of the base model.
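For concreteness, the pretext task itself is just next-token cross-entropy on the corpus. A minimal sketch, again with a hypothetical `model` mapping token ids to logits rather than anything from the post:

```python
import torch
import torch.nn.functional as F

def pretraining_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    # token_ids: (batch, seq_len) integers drawn from the pretraining corpus.
    # The "labels" are just the same text shifted by one position.
    logits = model(token_ids[:, :-1])            # (batch, seq_len - 1, vocab)
    targets = token_ids[:, 1:]                   # (batch, seq_len - 1)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```

Post-training then optimizes quite different objectives on top of whatever representations this loss produced.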

