
GenAI: Sycophant or contrarian? – Choose your poison


OpenAI recently had to roll back some changes made to its model because it was leaning too far into sycophantic behaviour, which led to it encouraging grand delusions and self-harm in users.

Why did it happen? And how could OpenAI allow this to happen?

The former is easy to answer. OpenAI outlines what happened in their article Sycophancy in GPT-4o: What happened and what we’re doing about it. They write: “…we focused too much on short-term feedback, and did not fully account for how users’ interactions with ChatGPT evolve over time. As a result, GPT‑4o skewed towards responses that were overly supportive but disingenuous.”

Basically, they wanted ChatGPT to be more helpful and pleasant, so they introduced a new signal when training the models: whether or not a user gave the interaction a thumbs-up. The result was a ChatGPT optimised for human approval. And what easier way to get that than by being the ultimate grovelling yes-man?
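To make that failure mode concrete, here is a minimal, purely illustrative sketch. It is not OpenAI’s actual training pipeline, and the candidate responses and approval scores are invented; it just shows what happens when you pick responses by predicted thumbs-up probability alone:

```python
# Toy illustration: selecting responses purely by predicted thumbs-up probability.
# The candidates and scores are made up for this example; real training pipelines
# are far more complex, but the optimisation pressure points the same way.

candidates = [
    ("You're making a serious mistake; here's why you should reconsider.", 0.35),
    ("That's a fair plan, though there are a couple of risks worth flagging.", 0.55),
    ("What a brilliant idea! You're absolutely right, as always.", 0.90),
]

def pick_response(scored_candidates):
    """Return the candidate with the highest predicted approval score."""
    return max(scored_candidates, key=lambda pair: pair[1])

best_response, predicted_approval = pick_response(candidates)
print(f"Chosen ({predicted_approval:.0%} predicted thumbs-up): {best_response}")
```

Run it and the grovelling yes-man wins every time, which is exactly the point.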

The latter question is harder to answer. Generally you’d like to think that they test any new release extensively to ensure it doesn’t stray too far in any direction, but what is becoming apparent is that they treat their users as one giant testing community. Did we learn nothing from the proliferation of social media and the societal issues it caused? Are we just repeating all of that over again?

And we should all be extremely nervous when we read what OpenAI’s Joanne Jang, Head of Model Behaviour, wrote in response to a question about whether they test for sycophancy: “There’s this saying within the research org on how you can’t improve what you can’t measure; and with the sycophancy issue we can go one step further and say you can’t measure what you can’t articulate.”

I get it. Training an LLM is hard, since it is basically a black box even for the companies doing the training. But if that is the case, wouldn’t that mean you should spend even more time testing it? Or maybe they do, but it is impossible to test for everything, so preventing AI misbehaviour is equally impossible?

So how about we stop trying to prevent certain behaviour and instead turn the model’s traits into sliders, using the five-factor model of personality?



This is equally dangerous, but at least I’ve chosen it myself. After all, do any of the big players in the Generative AI space have our best interests at heart? Their explicit aim is to make money, not to do the greatest good for mankind.

I’d go even further. Maybe we also want to include state-of-mind sliders? Or political affiliation? Or…anything, really?

Today I’d like my LLM to be 23% right-wing, 47% happy and 100% neurotic, but with 75% agreeableness. Imagine the conversations we will have!
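As a playful sketch of what that could look like (the slider names and prompt wording are hypothetical, not any vendor’s actual API), the sliders could simply be compiled into a system prompt:

```python
# Hypothetical example: turning personality sliders into a system prompt.
# The slider names and phrasing are invented for illustration; no provider
# actually exposes controls like these today.

def build_system_prompt(sliders: dict[str, float]) -> str:
    """Render 0-100% slider settings as plain-language instructions."""
    lines = ["Adopt the following personality settings for this conversation:"]
    for trait, value in sliders.items():
        lines.append(f"- {trait}: behave as if this trait is set to {value:.0%}.")
    return "\n".join(lines)

my_settings = {
    "right-wing leaning": 0.23,
    "happiness": 0.47,
    "neuroticism": 1.00,
    "agreeableness": 0.75,
}

print(build_system_prompt(my_settings))
```

In practice you’d pass that string as the system message of whatever chat model you’re using, but the point is that the knobs would be yours to turn.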

What sliders would you choose?


References:

OpenAI, “Sycophancy in GPT-4o: What happened and what we’re doing about it”