Why datasets built on the public domain might not be enough for AI
Common Corpus is a public domain dataset for training large language models (LLMs). With 500 billion words in multiple languages, drawn from various cultural initiatives, it gives researchers a powerful resource for developing smaller and more efficient LLMs. It should not be abused as a tool to promote public policies that expand the reach of copyright law.