Aug. 22, 2025

Open-sci and OpenEuroLLM release of reference models


Together with Open-sci, we are releasing open-sci-ref-0.01, a research release of a dense transformer model family. Our aim is to establish scaling laws that enable ranking of datasets and other training settings.

With all intermediate checkpoints from training on 8 different, well-established reference datasets, across model scales (0.13B, 0.4B, 1.3B, 1.7B parameters) and token budgets (50B, 300B, 1T), this model family serves as a baseline for comparisons and for studies of training dynamics. Our release includes all intermediate model weights, logs, and code, in the hope of facilitating future research.

Introduction and motivation

Foundation models show transfer across various conditions and tasks after generic pre-training on large volumes of diverse data. They also exhibit scaling laws, with transfer becoming predictably stronger as pre-training scale increases. Foundation models constitute the core of modern machine learning, holding the promise of solving fundamental problems of generalization and efficient learning.

To make guided progress in researching foundation models and the datasets necessary for their creation, fully controllable and reproducible comparisons of various learning procedures are a necessity. Ideally, such a comparison should allow the identification of learning procedures that produce stronger generalization than others. It should also identify the components of a procedure that lead to stronger generalization, while being reproducible for third-party validation of findings.

This release is a step towards establishing grounds for performing such comparisons. We provide a set of training runs and resulting language models to serve as reference baselines. These baselines can be used to compare any other training procedure and check its sanity and quality relative to the established baselines across various important open reference datasets.

For this first step, we selected standard dense transformers following established works such as Pythia, OLMo, PolyPythias, and DCLM, which provide fully open and reproducible training pipelines along with intermediate checkpoints for analyzing training dynamics. We establish baseline references for larger model scales (1.3B and 1.7B parameters) trained on 1T-token datasets, and include experiments with the recent high-quality NemoTron-CC dataset using context lengths of 4k, 8k, and 16k tokens during pre-training. We show how reference trainings performed at various scales can be used for open dataset comparison, confirming that NemoTron-CC-HQ is consistently the best pre-training dataset, followed by DCLM-baselines and FineWeb-Edu. We also show how comparison to reference baselines across scales can be used to test claims about model or dataset quality, e.g., pointing to pitfalls in multilingual training and datasets. We release all trained models, intermediate checkpoints, and logs to facilitate further studies, and we provide the necessary sanity checks and baselines for building and studying strong base models in cooperation with Open-sci.

Experimental procedure

For all datasets, we train reference models with 130M, 400M, 1.3B, and 1.7B parameters using Megatron-LM. We evaluate every intermediate checkpoint with zero-shot and few-shot benchmarks: COPA, OpenBookQA, LAMBADA (OpenAI), and Winogrande in a 0-shot setting; MMLU in 5-shot; and CommonsenseQA, PIQA, ARC-Challenge, ARC-Easy, HellaSwag, and BoolQ in 10-shot.
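The averaged score we report per checkpoint can be sketched as an unweighted mean over these benchmarks. The snippet below is a minimal illustration; the task identifiers follow common lm-eval-harness naming, and the unweighted-mean aggregation is an assumption for illustration, not necessarily the exact aggregation used in the release.

```python
# Illustrative sketch: averaging benchmark accuracies for one checkpoint.
# Task names are in lm-eval-harness style; the equal-weight mean is an
# assumption for illustration, not the release's confirmed aggregation.

ZERO_SHOT = ["copa", "openbookqa", "lambada_openai", "winogrande"]
FIVE_SHOT = ["mmlu"]
TEN_SHOT = ["commonsense_qa", "piqa", "arc_challenge",
            "arc_easy", "hellaswag", "boolq"]

def average_score(results: dict) -> float:
    """Unweighted mean accuracy over all evaluated tasks."""
    tasks = ZERO_SHOT + FIVE_SHOT + TEN_SHOT
    return sum(results[t] for t in tasks) / len(tasks)
```

Tracking this single scalar per intermediate checkpoint is what makes the training-dynamics curves in the figures below comparable across datasets and model sizes.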

Results: dataset comparison and reference baselines on various scales

In Fig. 1, we compare the average performance when training our reference models on different datasets. It is clear that some datasets, such as Nemotron-CC-HQ and DCLM, provide better performance, even for smaller model sizes.

reference model graph

Figure 1: Average downstream performance when training on different datasets, shown for different model sizes (numbers of parameters).

In Fig. 2, we show how the pre-training of our reference models scales when increasing the training compute budget for different datasets. This highlights that, by performing measurements across selected reference scales, consistent dataset-ranking trends can be identified. The scaling analysis can hint at promising datasets even in a low training-budget setting, without deriving full scaling laws. For more accurate predictions and comparison of candidate datasets, full scaling-law derivation is required.
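As a minimal illustration of what such a scaling-law fit involves, a power law can be fitted as a straight line in log-log space. The function below is an illustrative sketch, not the fitting code used for the release, and the numbers in the usage are synthetic.

```python
import numpy as np

def fit_power_law(compute, metric):
    """Fit metric ~ a * compute**b via least squares in log-log space.

    A power law y = a * x**b becomes linear after taking logs:
    log y = log a + b * log x, so an ordinary degree-1 polyfit
    recovers the exponent b (slope) and prefactor a (intercept).
    """
    log_c = np.log(np.asarray(compute, dtype=float))
    log_m = np.log(np.asarray(metric, dtype=float))
    b, log_a = np.polyfit(log_c, log_m, 1)
    return np.exp(log_a), b
```

With fits like this per dataset, the predicted curves can be extrapolated and compared, which is what a full scaling-law treatment would add beyond the fixed reference scales used here.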

To measure the performance of our reference models in a larger setting, we also release checkpoints trained on 1T tokens and report their average performance in Table 1 against models of similar size. In particular, our reference model trained on Nemotron-CC-HQ achieves competitive results against baselines trained on a much larger number of tokens. This is notable given that we used a very simple single-stage pipeline that neither changes the dataset mixture during different stages of pre-training nor performs annealing on high-quality datasets.

model result graph

Figure 2: Scaling comparison for our reference models on different datasets, compared to selected baselines.

reference models table comparison

Table 1: Downstream performance on the considered tasks for our reference models trained on 1T tokens from DCLM, FineWeb-Edu, and Nemotron-CC-HQ, as well as several other baselines. Models are sorted by their average downstream performance.

A surprising finding is the level of performance we measure on multilingual problem-solving benchmarks. Despite being pre-trained only on Nemotron-CC-HQ, a largely English-dominated dataset, open-sci-ref shows stronger performance on multilingual MMLU tasks requiring problem solving than, for instance, EuroLLM-1.7B. This observation holds across multiple languages that were explicitly included in EuroLLM's training set, despite its larger overall pre-training compute budget (ca. 4x that of open-sci-ref). This demonstrates that English-only pre-training on a dataset with a strong composition, such as Nemotron-CC-HQ, enables language comprehension and problem solving across multiple languages. Creating effective multilingual dataset mixtures remains challenging, as evidenced by EuroLLM-1.7B's performance compared to English-only training, despite including the benchmark languages and using additional compute. This emphasizes the central role of the dataset in obtaining high-quality pre-training.

reference models comparison graph

Figure 3: Performance of the English pre-trained open-sci-ref on MMMLU, which is MMLU translated into multiple languages by OpenAI.

Conclusion and outlook

With the open-sci-ref-0.01 research release, we provide a set of dense transformer models at the 0.13B, 0.4B, 1.3B, and 1.7B scales trained on the important reference datasets C4, Pile, SlimPajama, FineWeb-Edu-1.4T, DCLM-baselines, and NemoTron-CC-HQ. We included two further relevant datasets, HPLT-2.0 and CommonCorpus, for token budgets of 50B, 300B, and 1T. These reference models provide baselines for comparison with any other method trained on any of the same open reference datasets, making it easier to relate a new training procedure to already established working baselines.

We are also releasing all intermediate checkpoints and logs from each reference training on HuggingFace, to support studies of learning dynamics and of learning interventions at different phases of training. As we use a constant learning rate schedule, intermediate checkpoints can be studied either by continuing training or by cooling down from any point in the training procedure.
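The constant-then-cooldown pattern this enables can be sketched as follows. The linear cooldown shape and the specific step counts are illustrative assumptions; the release's exact schedule lives in its training code.

```python
def lr_with_cooldown(step, base_lr, cooldown_start, total_steps):
    """Constant LR up to cooldown_start, then linear decay to 0 at total_steps.

    Illustrative sketch: because the main run holds the LR constant,
    a cooldown like this can be launched from *any* intermediate
    checkpoint simply by choosing cooldown_start at that step.
    """
    if step < cooldown_start:
        return base_lr
    frac = (step - cooldown_start) / max(1, total_steps - cooldown_start)
    return base_lr * max(0.0, 1.0 - frac)
```

This is what makes the released intermediate checkpoints reusable: with a decaying schedule, resuming mid-run would require matching the original decay state, whereas a constant schedule leaves every checkpoint a valid starting point.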

While this research release mainly aims to establish baselines for sanity checks and comparisons to other procedures, it also delivers valuable insights into the quality of important reference datasets. The dataset rankings we obtain are robust, as they are established through both training and scaling dynamics; this provides a much more reliable ranking than comparisons based on measurements conducted at only one fixed scale. Establishing comparisons that remain valid across a broad span of scales is an open challenge; it will likely require the derivation of full scaling laws and is the subject of follow-up work.

Contributors

Marianna Nezhurina: established major part of dataset processing (tokenization) and training infrastructure (Megatron-LM container based workflow), conducted scaling tests for distributed training, wrote routines for evaluation based on lm-eval-harness, downloaded and tokenized datasets, co-designed experiments, converted DCLM base models (1B, 7B) to HF and ran evaluation for the scaling plot.

Joerg Franke: downloaded and tokenized datasets, transferred datasets between various machines, co-designed experiments, conducted training experiments.

Taishi Nakamura: wrote checkpoint conversion routines from Megatron to HuggingFace format for custom open-sci-ref models to enable easy evaluation via lm-eval-harness.

Timur Carstensen: automated and performed conversion of all the Megatron checkpoints to HuggingFace format for evaluation using a script provided by Marianna and Taishi, helped running evaluations, provided the tooling to parse all hyperparameters from the logs, performed the evaluations and visualizations on MMMLU.

Niccolò Ajroldi: helped running evaluations and fixed a bug in lm-eval-harness to handle custom paths.

Ville Komulainen: uploaded all the intermediate and final checkpoints to HuggingFace.

David Salinas: wrote the infrastructure to automate and perform large batch of evaluations, ran most of the evaluations, wrote code to generate most of the tables and figures except for MMMLU, co-wrote the blog post.

Jenia Jitsev: coordination and project supervision; acquired compute resources; designed the experiments (scales, architecture configuration, evaluation selection), wrote training scripts for various supercomputers, conducted a major fraction of the training runs, wrote routines for transferring datasets across supercomputers, downloaded and transferred the datasets across the machines, helped running evaluation, wrote the blog post.

Erika Halonen: co-wrote the blog post.