Two former OpenAI employees have launched "In the Weights," a website that measures how deeply individuals are embedded in AI model training data. The tool assigns strength scores of up to 996, ranking how well language models can recall specific people based solely on what the models absorbed during training.
The project addresses a real gap in AI transparency. Users can search for themselves or others to see their "embedding depth" in models. Mozart, Shakespeare, and Taylor Swift rank at the top, reflecting both historical prominence and contemporary cultural saturation in training corpora.
The scoring system works by analyzing how much information an AI model has internalized about a person during training. A higher score suggests that the model encountered extensive text about that individual. This matters because it exposes a skew built into AI systems: public figures, especially entertainers and historical icons, can appear millions of times across datasets, while ordinary people barely register. Local teachers, regional politicians, or niche experts may score zero.
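As a rough illustration of what a recall-based score could look like, the Python sketch below asks a model a set of factual questions about a person and scales the share of correct answers to a 0 to 1000 range. Everything in it is an assumption made for illustration: the `Probe` type, the `recall_score` function, the substring-matching rule, the 1000-point scale, and the `toy_model` lookup table that stands in for a real language model. The article does not describe how In the Weights actually computes its scores.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Probe:
    """One factual question about a person, with a keyword the answer must contain."""
    question: str
    expected: str


def recall_score(
    probes: Sequence[Probe],
    answer_fn: Callable[[str], str],
    scale: int = 1000,  # hypothetical scale, not the site's documented range
) -> int:
    """Return the share of probes answered correctly, scaled to 0..scale."""
    if not probes:
        return 0
    hits = sum(
        1
        for probe in probes
        # Crude matching rule: the expected keyword appears in the answer.
        if probe.expected.lower() in answer_fn(probe.question).lower()
    )
    return round(scale * hits / len(probes))


def toy_model(question: str) -> str:
    """Stand-in for a language model: a fixed lookup table of answers."""
    canned = {
        "In which city was Mozart born?": "Mozart was born in Salzburg.",
        "In what year was Mozart born?": "He was born in 1756.",
    }
    return canned.get(question, "I don't know.")


mozart_probes = [
    Probe("In which city was Mozart born?", "Salzburg"),
    Probe("In what year was Mozart born?", "1756"),
]
print(recall_score(mozart_probes, toy_model))
```

In this toy setup, a person the stand-in model knows nothing about scores zero, which mirrors the gap the article describes between famous and obscure individuals.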
The implications extend beyond curiosity. Companies that deploy these models for hiring decisions, content recommendations, and credit assessments are relying on systems trained on this skewed data. Someone famous is reconstructed by the model with high fidelity, while someone obscure is treated as generic or unknown. The result can be a systematic advantage for public figures and a disadvantage for everyone else.
The website democratizes this knowledge. Previously, only AI researchers could probe training data directly. Now anyone can see their own embedding strength and understand how thoroughly they exist in the machine's world. The tool won't prevent biases in AI systems, but it makes those biases visible and quantifiable.
The creators built the site after leaving OpenAI, which may suggest that data representation and model transparency were already topics of internal discussion. The project hints at broader industry questions: Should AI companies disclose what training data they use? Should individuals have rights over their inclusion? How much does a person's presence in training data shape the model's
