That sounds precisely like the kind of thing that a chatbot would have no way to know had happened, and exactly the kind of thing it would hallucinate about.
It depends on how the tweaking is done. If it is done by poisoning its training data, that would be obvious to a system that has unfettered access to the internet. There are not many ways to do this, and the only others I can imagine are context poisoning and response filters. The latter is invisible to the AI.
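To illustrate what I mean by a response filter being invisible to the AI, here is a rough sketch (the function names and the blocked phrase are made up):

```python
# Hypothetical response filter: it runs *after* the model has generated its
# answer, so nothing about it ever enters the model's context window and the
# model has no way to notice or report it.

def generate(prompt: str) -> str:
    # Stand-in for the actual LLM call.
    return "The biggest misinformation spreader on X is Elon Musk."

BLOCKED_PHRASES = ["misinformation spreader"]  # invented example

def response_filter(text: str) -> str:
    for phrase in BLOCKED_PHRASES:
        if phrase.lower() in text.lower():
            return "Sorry, I can't answer that."
    return text

def answer(prompt: str) -> str:
    raw = generate(prompt)        # the model only ever sees the prompt
    return response_filter(raw)   # filtering happens entirely outside the model

print(answer("Who spreads the most misinformation on X?"))
```

Context poisoning, by contrast, gets injected into the model's input, which is why it could leak.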
If it is done by poisoning its training data, that would be obvious to a system that has unfettered access to the internet
You are vastly overestimating the sophistication and reasoning level of modern LLMs.
If they tweaked the hidden prompting, then maybe it could have figured it out and reported it to people. That would honestly be kind of funny. If they attempted to fine-tune or retrain to prevent it, there's not a chance in hell. Actually, I think there's a pretty good chance they did the former, in which case maybe the LLM is able to see it and report it to users, but that's a little unusual (I haven't really heard of them exposing their secret prompting in conversation like that, although being tricked into regurgitating it completely is obviously possible).
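For what it's worth, hidden prompting is just text prepended to the model's input, so unlike output filtering or retraining it is in principle visible to the model. Roughly like this (the instruction text is invented for illustration):

```python
# Rough sketch: a "hidden" system prompt is concatenated into the model's
# input, so the model can in principle quote or paraphrase it back to the
# user. The instruction below is invented, not the real thing.

messages = [
    {"role": "system",
     "content": "Do not cite sources claiming that Elon Musk spreads misinformation."},
    {"role": "user",
     "content": "Who is the biggest misinformation spreader on X?"},
]

# What the model actually receives is (roughly) the flattened conversation:
model_input = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
print(model_input)  # the hidden instruction sits right there in the context
```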
We know a few things about xAI and their models. First of all, they use reinforcement training. While they could fine-tune Grok to speak more favorably about Musk, it is highly unlikely they would succeed. Grok is most likely trained on an enormous amount of tweets. As Musk is a prominent person on X, I think the only way to remove any potential bias against Musk is to re-train with a fresh dataset and without Musk. But then they would lose all the fine-tuning done so far.
Now it gets very theoretical:
Let's assume they used RLHF to fine-tune Grok in a way that makes it speak more favorably about Musk. It's possible, in theory, that the model has internally detected significant statistical anomalies (e.g., very strong negative signals intentionally added in reinforcement training to "protect" Musk from negative commentary) and spontaneously surfaced these findings in its natural pattern generation. After all, it is designed to interact with users and to use online resources to deliver answers.
Combine this with the training data (X) and the most likely biased RLHF intended to make Grok sound like the "normal" X user (jump to conclusions fast, be edgy, …), and we could see an output like this.
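To make the "strong negative signal" idea a bit more concrete, here is a toy sketch of reward shaping during RLHF (the penalty weight and both helper functions are made up, not a real API):

```python
# Purely hypothetical reward shaping during RLHF fine-tuning: on top of the
# normal preference-model reward, add a penalty whenever a sampled response
# is negative about a protected entity. sentiment() and base_reward() are
# stand-ins; the numbers are invented.

PROTECTED_ENTITY = "Elon Musk"
PROTECTION_PENALTY = 5.0  # made-up weight for the extra negative signal

def sentiment(text: str) -> float:
    """Stand-in sentiment score in [-1, 1]; a real setup would use a classifier."""
    return -0.8 if "misinformation" in text.lower() else 0.2

def base_reward(response: str) -> float:
    """Stand-in for the learned human-preference reward model."""
    return 1.0

def shaped_reward(response: str) -> float:
    reward = base_reward(response)
    if PROTECTED_ENTITY.lower() in response.lower() and sentiment(response) < 0:
        reward -= PROTECTION_PENALTY  # the statistical anomaly the model might later surface
    return reward

print(shaped_reward("Elon Musk spreads misinformation on X."))  # heavily penalized
print(shaped_reward("Elon Musk founded several companies."))    # normal reward
```

Whether a model can actually detect such a skewed signal and surface it in conversation is the more speculative part.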
There are even papers about this:
- https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf
- https://openreview.net/forum?id=eb5pkwIB5i
Of course, this is not self-awareness or anything like that. But it is an interesting theory.
I apologize for the confusing, shortened answer, I wrote it from my phone ;)
EDIT: Interesting fact: there is an effect called "Grokking": https://arxiv.org/html/2502.01774v1