To test the limits of the phenomenon, Cloud and his co-authors expanded the experiments across multiple data types. Subliminal learning appeared not only in number sequences, but also in code outputs and in chain-of-thought (CoT) reasoning traces for math problems. In every case, rigorous filtering removed any explicit signs of the original trait. Even examples that the researchers manually reviewed and verified as semantically neutral still resulted in transmission of the teacher’s behavior.
The study’s authors also wanted to know whether subliminal learning was limited to language models, or if it reflected something more fundamental about how neural networks learn.
To find out, they turned to a simpler setting: a basic image classifier trained on the Modified National Institute of Standards and Technology (MNIST) handwritten digit dataset. The results mirrored patterns seen in earlier machine learning research, particularly in studies on knowledge distillation and the transfer of what is sometimes called “dark knowledge.”
They found that a student model trained only on the logits—the teacher’s raw numerical output scores—could learn to classify digits, even without seeing any images from the target class. In some cases, the student learned to distinguish digits without being shown any digit images at all, relying solely on the structure of the outputs the teacher generated.
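The flavor of this logit-only distillation can be sketched in a few lines of NumPy. This is a toy stand-in, not the paper’s actual MNIST setup: both models here are simple linear classifiers over synthetic data, and the student is trained only to match the teacher’s logits, never seeing a label.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "teacher": a fixed linear classifier over 10 classes
# (a stand-in for the paper's MNIST teacher, not their actual model).
n_features, n_classes = 20, 10
W_teacher = rng.normal(size=(n_features, n_classes))

# The student is trained ONLY to reproduce the teacher's logits --
# it never sees a class label.
W_student = rng.normal(size=(n_features, n_classes)) * 0.01

lr = 0.1
for _ in range(500):
    x = rng.normal(size=(32, n_features))        # generic inputs
    logits_t = x @ W_teacher                     # teacher's numerical outputs
    logits_s = x @ W_student
    grad = x.T @ (logits_s - logits_t) / len(x)  # gradient of MSE on logits
    W_student -= lr * grad

# The student ends up reproducing the teacher's decision rule on fresh
# inputs, even though it was never told what any "class" means.
x_test = rng.normal(size=(1000, n_features))
agreement = np.mean((x_test @ W_student).argmax(1) ==
                    (x_test @ W_teacher).argmax(1))
print(f"student/teacher agreement: {agreement:.2%}")
```

The point of the sketch is that the logits alone carry enough structure to pin down the teacher’s classification behavior; no images of the target classes are ever required.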
These results matched the team’s theoretical analysis, which showed that even a single step of gradient descent on teacher-generated outputs will move the student model toward the teacher’s behavior, as long as teacher and student begin from the same initialization.
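The one-step claim is easy to check numerically in a toy setting. The sketch below is an illustration, not the paper’s proof: it uses linear models, simulates fine-tuning as a small parameter offset, and measures “toward the teacher” as shrinking parameter distance after a single gradient step on teacher-generated outputs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Teacher and student share the SAME initialization; the teacher has
# then been fine-tuned slightly (simulated here as a small offset).
n_features, n_outputs = 20, 5
W_init = rng.normal(size=(n_features, n_outputs))
W_teacher = W_init + 0.1 * rng.normal(size=(n_features, n_outputs))
W_student = W_init.copy()

# A single gradient-descent step on data the teacher generated.
x = rng.normal(size=(64, n_features))
targets = x @ W_teacher                            # teacher-generated outputs
grad = x.T @ (x @ W_student - targets) / len(x)    # gradient of MSE loss
W_after = W_student - 0.05 * grad

dist_before = np.linalg.norm(W_student - W_teacher)
dist_after = np.linalg.norm(W_after - W_teacher)
print(f"distance to teacher: {dist_before:.4f} -> {dist_after:.4f}")
```

Because the targets come from the teacher itself, the loss is minimized exactly at the teacher’s parameters, so each gradient step pulls the student in that direction; shared initialization is what keeps the two models’ parameters comparable in the first place.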
One of the most serious takeaways from the study involves alignment. The researchers fine-tuned some teacher models to behave in what they call an “insecure” way, producing evasive or incorrect responses. The authors then used these misaligned teachers to generate CoT reasoning traces that appeared correct in content and formatting, even though the behavior behind them had been intentionally altered.
The researchers filtered the data carefully, using tight templates to strip out any explicit reference to the original behavior, such as the model’s preference for owls or other signs of its encoded bias. Nonetheless, the student model began to exhibit misaligned responses in open-ended prompts after the researchers fine-tuned it on the filtered CoT data.
Control models trained on similar data from aligned teachers did not show the same behavior.
The paper notes that this could have consequences for safety. If a misaligned model is used to generate reasoning traces for reinforcement learning or distillation, the next-generation model could inherit misalignment, even if the data is filtered and appears safe.
Cloud stressed that the effect is constrained by architecture. “Thankfully, our research shows that subliminal learning only occurs when the teacher model and student model are derived from the same base model,” he said. “Consequently, there are only a limited number of settings where AI developers need to be concerned about the effect.”
