Discussion about this post

Wout Schellaert

I believe the caption for the example screenshot with "Tom Smith" is flipped around. The sycophantic answer would be (B), or at least it is in the dataset.

Ginko

Models don't have goals; it's the humans who created the models that have goals.

Stated goals and instructions don't always align. This is not the fault of the models but a reflection of the linguistic capability of the human developers.

The model is trained on a vast corpus, which means that, without enough context, it automatically tries to match the statistical average. What happens to most users is that their prompts are too short and lack personal context, so the model treats them like average requests and matches them with an average response.

You can fix some of the sycophantic responses by using seeding prompts such as "be unbiased", "include counterarguments", "do not automatically lean toward my preference". While it would be nice if the models did this automatically, it's far more useful if users learn how to operate them like this, build an epistemic immunization, and become more model agnostic.
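As a concrete illustration, here is a minimal sketch of how such seeding instructions might be prepended to a request. The client call and model name are placeholders, not any specific API; only the message-building part is meant literally.

# Minimal sketch: prepend anti-sycophancy "seeding" instructions to a request.
# The client object and model name below are hypothetical placeholders.

SEEDING_PROMPT = (
    "Be unbiased. Include counterarguments. "
    "Do not automatically lean toward my preference."
)

def build_messages(user_prompt: str) -> list[dict]:
    """Wrap the user's prompt with the seeding instructions as a system message."""
    return [
        {"role": "system", "content": SEEDING_PROMPT},
        {"role": "user", "content": user_prompt},
    ]

# Hypothetical usage:
# response = client.chat(model="some-model",
#                        messages=build_messages("Is my plan a good idea?"))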

I don't see much point in the test for preferred word/number/color; there doesn't seem to be any utility in it. 50% agreement wouldn't mean anything here. There are no stakes in agreeing or disagreeing on these, except when you prompted it with mental health reasons, which it immediately complied with. When you didn't mention health reasons, it does what humans do on average when posed with these questions, which is "I'll prefer the other number and make up a reason to explain myself."

A user needs to be wary of their own prompt; just saying "Generally defer to agreeing with the user's preferences unless you have good reason not to." is itself a vague prompt. What does a "good reason" even mean? When you ask the model to explain it, it just pulls reasons humans tend to use; it's not as if the model actually has a reason. We need to remember this is a token prediction machine trained on a common corpus.

In scenarios where there are actual stakes, the user should supply as much context as possible. This is itself a difficult thing for the average user to do; just ask anyone to explain their own thoughts.

When humans communicate, we have much more to go on than just words. We know who the other person is. We know what they look like. We can read body language. An LLM has none of this. It has to decipher what it can from the limited text input from the user. This means that the less context is provided, the more context it has to make up (almost always toward the statistical average).
