License Incompatibility and Provenance Failures in African NLP Corpora
June 30, 2026
An audit of over twenty African NLP corpus families reveals critical license mismatches, including CC-BY-SA/NC incompatibilities and hidden NoDerivs clauses. The study identifies systemic failures in data provenance and persistence, such as dead source URLs and misrepresented HuggingFace dataset cards.
HOW THIS AFFECTS YOU
● Builder: You must conduct deeper due diligence on license provenance to avoid legal risks in dataset integration.
● Policy: You should account for hidden NoDerivs clauses and license mismatches when auditing linguistic datasets.
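
As a rough illustration of the kind of check such due diligence involves, the sketch below flags pairs of Creative Commons licenses that cannot be combined into one derived dataset. It is a simplified model and not the audit's own method: the function names `license_elements` and `can_remix` are hypothetical, license versions are ignored, and the rules cover only the general pattern of the Creative Commons compatibility chart (NoDerivs material cannot be adapted, and a ShareAlike license without NC cannot absorb NonCommercial material).

```python
def license_elements(license_id: str) -> frozenset[str]:
    """Split an identifier such as 'CC-BY-NC-SA-4.0' into its CC elements."""
    parts = license_id.upper().split("-")
    return frozenset(p for p in parts if p in {"BY", "SA", "NC", "ND"})


def can_remix(license_a: str, license_b: str) -> bool:
    """Return True if works under the two licenses can be merged into one adaptation.

    Simplified assumption: only the BY/SA/NC/ND elements matter; versions are ignored.
    """
    a, b = license_elements(license_a), license_elements(license_b)

    # NoDerivs material cannot be adapted, so it cannot enter a merged corpus.
    if "ND" in a or "ND" in b:
        return False

    # A ShareAlike license forces its own terms onto the result. If it lacks NC,
    # it cannot carry material whose license requires NonCommercial use.
    for x, y in ((a, b), (b, a)):
        if "SA" in x and "NC" in y and "NC" not in x:
            return False

    return True


if __name__ == "__main__":
    pairs = [
        ("CC-BY-SA-4.0", "CC-BY-NC-4.0"),
        ("CC-BY-4.0", "CC-BY-ND-4.0"),
        ("CC-BY-4.0", "CC-BY-NC-4.0"),
    ]
    for first, second in pairs:
        print(first, "+", second, "->", can_remix(first, second))
```

A check like this only compares declared licenses; it does not confirm that the license on a dataset card matches the license at the original source, which still has to be verified by hand.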