[r/LocalLLaMA] score: 0.20
103B-Token Pre-LLM Usenet Corpus (1980–2013) Now Available for Training
May 27, 2026
A 103B-token Usenet corpus spanning 1980–2013 offers AI-contamination-free training data with domain-specific subsets: 10.3B tokens of comp.* computing discussion, 3.3B tokens of sci.* content, and 16.5B tokens of rec.* material. Pre-SEO writing style makes it structurally distinct from modern web scrapes, useful for fine-tuning without RLHF artifact bleed.
resources
HOW THIS AFFECTS YOU
● builder: You can fine-tune on domain-specific subsets, especially comp.*, to get models with pre-RLHF writing character, free of refusal patterns or GPT stylistic artifacts.
● researcher: Clean pre-LLM baseline data lets you study model behavior without AI-contamination confounds, and the hierarchical newsgroup structure enables controlled domain-specific training experiments.
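One way to carve out a domain subset such as comp.* is to filter on the top-level newsgroup hierarchy. The sketch below is illustrative only: the record layout (a dict with `newsgroups` and `text` fields) and the helper names `top_level_hierarchy` and `select_by_hierarchy` are assumptions, not the corpus's actual schema or tooling. It relies only on the standard Usenet convention that a cross-posted article lists its groups as a comma-separated string.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical record layout: the real corpus may use different field names.
Post = Dict[str, str]


def top_level_hierarchy(newsgroup: str) -> str:
    """Return the first dotted component, e.g. 'comp' for 'comp.lang.c'."""
    return newsgroup.strip().split(".", 1)[0].lower()


def select_by_hierarchy(posts: Iterable[Post], hierarchy: str) -> Iterator[Post]:
    """Yield posts that were sent to at least one group in `hierarchy`."""
    wanted = hierarchy.lower()
    for post in posts:
        # Cross-posted articles carry several comma-separated group names.
        groups = [g for g in post.get("newsgroups", "").split(",") if g.strip()]
        if any(top_level_hierarchy(g) == wanted for g in groups):
            yield post


# Made-up sample records, for illustration only.
sample_posts = [
    {"newsgroups": "comp.lang.c", "text": "pointer question"},
    {"newsgroups": "rec.arts.sf.written, comp.misc", "text": "cross-post"},
    {"newsgroups": "sci.physics", "text": "entropy thread"},
]

comp_posts = list(select_by_hierarchy(sample_posts, "comp"))
print(len(comp_posts))  # prints 2
```

Whether cross-posted articles should count toward every hierarchy they touch is a design choice. For controlled experiments it may be cleaner to drop them, so the domain subsets stay disjoint.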