MultiSynt: an open multilingual synthetic dataset for LLM pre-training


A DIFFERENT APPROACH TO DATA SCARCITY

The MultiSynt initiative addresses a critical bottleneck: the lack of high-quality pre-training data in multilingual settings. The MultiSynt approach will use generative models to enhance existing content, targeting improvements in language representation, domain coverage, and content diversity across several EU languages.

TWO TEAMS WITH A SIMILAR GOAL

MultiSynt supports the broader EuroLLM and OpenEuroLLM initiatives, which share a common goal: addressing data scarcity and dataset composition deficiencies in LLM training to mitigate the risk of producing underperforming models.

AI FACTORIES CALLS SUPPORT

This project is funded by the EuroHPC AI Factory Large Scale call EHPC-AIF-2025LS01-028, which supports applications performing AI activities with high-impact, high-gain innovative research as well as industrial innovation. It will be implemented on CINECA's LEONARDO BOOSTER partition.

COMPUTE & STORAGE USAGE

  • 3,000,000 GPU hours allocated on LEONARDO BOOSTER

  • 600,000 GB of total storage required

  • 400,000 GB of total data to transfer to/from the system

WHAT'S THE PLAN?

Extending the methodology established by Nemotron-CC for English to a multilingual setting, and enhancing it with innovative components, the MultiSynt four-phase approach includes:

PHASE 1

Quality estimation of available multilingual pre-training data (20% of the GPU hours)

During this phase we develop, train, and apply quality estimation models to score documents in existing multilingual pre-training datasets. For quality estimation, we consider models in the 100M-4B parameter range.
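As a rough illustration of this step, the sketch below scores documents with a generic classification head; the checkpoint path and the single-score output format are assumptions for illustration, not the project's actual quality-estimation models.

```python
# Minimal sketch of document quality scoring with a classifier head.
# The model path and scoring scale are placeholders (illustrative only).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "path/to/quality-estimation-model"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def score_documents(docs: list[str]) -> list[float]:
    """Return one quality score per document (higher = better)."""
    inputs = tokenizer(docs, truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assume a single regression-style output per document.
    return logits.squeeze(-1).tolist()

scores = score_documents(["Example multilingual document ...", "Another document ..."])
print(scores)
```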

PHASE 2

Experimentation with multilingual synthetic data creation (20% of the GPU hours)

During this phase we develop and execute different synthetic data generation pipelines. We will apply these pipelines to subsamples of large-scale training datasets, generating synthetic datasets of smaller scale; a sketch of one such pipeline step follows the list below.

  • For multiple pipeline versions using different LLMs, we plan to generate synthetic datasets of up to 50B tokens.
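The sketch below illustrates one plausible pipeline step in the spirit of Nemotron-CC: prompting an instruction-tuned LLM to rephrase a source document while keeping its language. The model path and prompt wording are placeholders, not the project's final pipeline.

```python
# Illustrative sketch of one synthetic-data pipeline step: prompting an
# instruction-tuned LLM to rewrite a document in the same language.
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "path/to/instruct-llm"  # hypothetical generator checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def rephrase(document: str, language: str) -> str:
    """Ask the generator to rewrite a document more clearly, preserving its language."""
    prompt = (
        f"Rewrite the following {language} text in clear, well-structured "
        f"{language}, preserving all facts:\n\n{document}\n\nRewritten text:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
    # Decode only the newly generated tokens after the prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(rephrase("Un texte d'exemple à reformuler ...", "French"))
```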

PHASE 3

Evaluation of the efficacy of different methods across languages (10% of the GPU hours)

During this phase we plan to run end-to-end ablation studies in which we train smaller-scale models on small token budgets to estimate the output quality of the different synthetic data generation pipelines, guiding the development and selection of the final pipeline (see the enumeration sketched after the bullet below).

  • Specifically, we plan to train various models at the 100M-2B parameter scale with up to 200B tokens.
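The snippet below is an illustrative enumeration of such ablation runs; the pipeline variant names are hypothetical, while the model sizes and token budget follow the ranges stated above.

```python
# Illustrative enumeration of ablation runs: small models trained on the
# outputs of different synthetic-data pipelines.
from itertools import product

model_sizes = ["100M", "400M", "1B", "2B"]              # within the stated 100M-2B range
pipelines = ["rephrase", "qa_extraction", "summarize"]  # hypothetical variant names
token_budget = 200e9                                    # up to 200B tokens per run

for size, pipeline in product(model_sizes, pipelines):
    run_name = f"ablation_{size}_{pipeline}"
    # In the real setup each run would launch a pre-training job on
    # LEONARDO BOOSTER and log downstream evaluation scores.
    print(f"{run_name}: train a {size} model on up to "
          f"{token_budget / 1e9:.0f}B tokens from the '{pipeline}' pipeline")
```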

PHASE 4

Production of synthetic data at larger scale (50% of the GPU hours)

In this phase we apply the final version of our synthetic data generation pipeline on selected subsets of large LLM pre-training datasets.

Based on the insights obtained from Phase 3, we will identify the compute-optimal setup balancing synthetic data quality and generator model size, which will dictate the number of synthetic tokens we can generate under the given compute budget (a back-of-the-envelope sketch follows the bullet below).

  • As a result, we expect a synthetic multilingual dataset of at least 1T tokens.
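As a back-of-the-envelope illustration of how the compute budget bounds the token count, the snippet below multiplies GPU hours by an assumed throughput; both the generation share of the Phase 4 allocation and the per-GPU throughput are assumptions, not measured LEONARDO figures.

```python
# Rough sketch relating the compute budget to the synthetic token count.
# Both inputs below are illustrative assumptions.
generation_gpu_hours = 1_250_000   # assumed share of the ~1.5M Phase 4 GPU hours spent on generation
tokens_per_gpu_second = 250        # assumed generation throughput on one A100

synthetic_tokens = generation_gpu_hours * 3600 * tokens_per_gpu_second
print(f"~{synthetic_tokens / 1e12:.1f}T synthetic tokens under these assumptions")
```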

We then evaluate the efficacy of the resulting synthetic dataset by training a smaller LLM over a longer token horizon on different mixes of original and synthetic data, following the experiments presented in Nemotron-CC (a sketch of composing such mixes follows the bullet below).

  • We plan to train several 8B parameter models on up to 1T tokens. Each run will require 100k A100 hours.
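The sketch below shows one way such mixes could be composed with the Hugging Face datasets library; the dataset paths and mixing ratios are placeholders for illustration, not the project's final configuration.

```python
# Minimal sketch of composing training mixes of original and synthetic data
# at different ratios for the final evaluation runs.
from datasets import load_dataset, interleave_datasets

original = load_dataset("json", data_files="original_corpus/*.jsonl",
                        split="train", streaming=True)
synthetic = load_dataset("json", data_files="synthetic_corpus/*.jsonl",
                         split="train", streaming=True)

for synthetic_share in (0.25, 0.5, 0.75):
    mix = interleave_datasets(
        [original, synthetic],
        probabilities=[1.0 - synthetic_share, synthetic_share],
        seed=42,
    )
    # Each mix would feed one 8B pre-training run of up to 1T tokens.
    print(f"mix with {int(synthetic_share * 100)}% synthetic data ready")
```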