AI2 Releases Its Largest Open Dataset for Language Model Training

AI2 releases Dolma, an expansive open dataset aimed at promoting transparency in AI training. This move sets a new benchmark in open-sourcing language model training data.

AI2's Dolma

In an age where language models like GPT-4 are making waves, the datasets used to train them remain largely proprietary. AI2 challenges this norm by launching Dolma, a comprehensive open dataset designed to facilitate transparent AI training.

Dolma, named for the anticipated open language model OLMo it is intended to train, aims to foster community collaboration. By offering the dataset freely, AI2 encourages the research community not only to use it but also to modify and enhance it.

While giants like OpenAI keep a tight lid on their training datasets, AI2's approach is refreshingly transparent. The motivation is not merely to enable ethical AI training but also to address concerns about possibly unethically procured data. By making its processes and sources publicly accessible, AI2 ensures researchers can fully understand and replicate the dataset.

Though not the first in the open dataset arena, Dolma's vastness (encompassing 3 trillion tokens) and clear usage guidelines set it apart. The dataset operates under AI2's "ImpACT license for medium-risk artifacts", which establishes transparent standards for its use, including required disclosures and distribution terms.

For individuals concerned that their personal data may have been included in Dolma, AI2 offers a dedicated removal request form, further underscoring its commitment to ethical AI practices.

AI2's Dolma sets a commendable precedent in the AI sphere, championing transparency and ethical practices. As the AI landscape continually evolves, such steps towards openness are crucial in fostering trust and collaboration in the research community.