• vapeloki@lemmy.world
    6 days ago

    We know a few things about xAI and their models. First of all, they use reinforcement learning. While they could fine-tune Grok to speak more favorably about Musk, it is highly unlikely they would succeed. Grok is most likely trained on an enormous number of tweets, and since Musk is a prominent person on X, I think the only way to remove any potential bias against Musk is to re-train on a fresh data set without him. But then they would lose all the fine-tuning they have done.
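
    To make that concrete, here is a minimal Python sketch (the function names and data are purely hypothetical and have nothing to do with xAI's actual pipeline): the pretraining corpus sits underneath the fine-tuning, so filtering Musk out of the data means re-running pretraining, and the result is a fresh base model with none of the earlier fine-tuning applied.

    ```python
    # Purely illustrative: bias baked into the pretraining corpus (tweets)
    # sits below any fine-tuning, so removing it means filtering the corpus
    # and pretraining again -- which discards the fine-tuned layer.

    def pretrain(corpus):
        """Stand-in for full pretraining: expensive, defines the base model."""
        return {"base": sorted(set(corpus))}

    def finetune(model, preferences):
        """Stand-in for RLHF / instruction tuning layered on the base model."""
        tuned = dict(model)
        tuned["preferences"] = preferences
        return tuned

    tweets = ["musk announces x feature", "cat picture", "musk replies to user"]

    # Today: the base model already encodes whatever the tweets imply about Musk.
    grok = finetune(pretrain(tweets), preferences="RLHF data")

    # The "clean" fix: drop the biased data and pretrain again -- but this
    # produces a fresh base model without the earlier fine-tuning.
    fresh_base = pretrain([t for t in tweets if "musk" not in t])
    ```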

    Now it gets very theoretical:

    Let's assume they used RLHF to fine-tune Grok so that it speaks more favorably about Musk. It is possible, in theory, that the model has internally detected significant statistical anomalies (e.g., very strong negative signals intentionally added in reinforcement training to "protect" Musk from negative commentary) and spontaneously surfaced these findings in its natural pattern generation. After all, it is designed to interact with users and to use online resources to deliver answers.
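
    To illustrate what such a reinforcement signal could look like, here is a minimal Python sketch of RLHF-style reward shaping. The reward model, the word list, and the penalty value are all made-up assumptions, not anything known about xAI's setup:

    ```python
    # Hypothetical sketch: a shaped reward that adds a disproportionately
    # strong negative signal whenever a "protected" person co-occurs with
    # negative language. A sharp anomaly like this in the reward signal is
    # the kind of pattern the theory above speculates the model could surface.

    NEGATIVE_WORDS = {"misinformation", "liar", "fraud"}

    def base_reward_model(prompt: str, response: str) -> float:
        """Stand-in for a learned reward model (helpfulness, tone, ...)."""
        return 1.0  # placeholder score

    def shaped_reward(prompt: str, response: str,
                      protected: str = "musk") -> float:
        reward = base_reward_model(prompt, response)
        text = response.lower()
        if protected in text and any(w in text for w in NEGATIVE_WORDS):
            reward -= 10.0  # intentionally huge penalty to "protect" the person
        return reward

    print(shaped_reward("Who spreads misinformation on X?",
                        "Elon Musk is one of the biggest spreaders of misinformation."))
    print(shaped_reward("Who spreads misinformation on X?",
                        "Several large accounts spread misinformation."))
    ```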

    Combine this with the training data (X) and the most likely biased RLHF meant to make Grok sound like the "normal" X user (jump to conclusions fast, be edgy, ...), and we could end up seeing such a prompt.

    There are even papers about this.

    Of course, this is not self-awareness or anything like that. But it is an interesting theory.

    I apologize for the confusing, shortened answer; I wrote it from my phone ;)

    EDIT: Interesting fact: there is an effect called "Grokking": https://arxiv.org/html/2502.01774v1