One year has elapsed since the start of the OpenEuroLLM project. This ambitious project, carried out by a consortium of 20 leading European research institutions, companies and EuroHPC centres, coordinated by Jan Hajič (Charles University, Czechia) and co-led by AMD Silo AI, has taken the first steps in developing next-generation open-source language models to advance European AI capabilities.
The project's main goal requires extensive research, access to high-performance computing resources, and strategic collaboration with other prominent European initiatives. During its inaugural year, the project has achieved significant milestones in advancing regional AI sovereignty through targeted efforts in digital infrastructure development, data practices, model development, and evaluation tools.
“Creating an open source multilingual LLM in the public space and within a large consortium is a challenging task. I am proud that thanks to the expertise, enthusiasm, commitment and hard work of especially the core partners the project has achieved its first-year goals. However, significant challenges, especially in securing more compute for creating the final models, still remain,” says Jan Hajič, Charles University.
Infrastructure
OpenEuroLLM is developing the digital infrastructure needed to lower the threshold for AI product development in Europe. This includes infrastructure for conducting large-scale distributed training, for running model evaluations seamlessly across different European clusters, and for building robust software stacks for experiments. In the project's first year, these were essential steps to avoid dependence on any single cluster and to make the most of the current configurations of European HPC systems.
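The post does not describe the software stack itself. As a rough illustration of what cluster-portable training code involves (this is our sketch, not the project's actual stack), the minimal PyTorch snippet below initialises distributed training purely from environment variables, so the same script can run unchanged on different HPC systems regardless of which launcher each cluster uses.

```python
import os

import torch
import torch.distributed as dist


def init_distributed() -> tuple[int, int]:
    """Initialise a distributed process group from environment variables.

    Launchers such as torchrun (or a Slurm wrapper exporting the same
    variables) set RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and
    LOCAL_RANK, keeping the training script independent of any one cluster.
    """
    dist.init_process_group(backend="nccl")  # NCCL for multi-GPU training
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # pin this process to one GPU
    return dist.get_rank(), dist.get_world_size()


if __name__ == "__main__":
    rank, world_size = init_distributed()
    if rank == 0:
        print(f"Training across {world_size} processes")
    dist.destroy_process_group()
```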
Data
In collaboration with Open-Sci, reference models for dataset selection and scaling trends have been developed. These reference models provide baselines against which any method trained on the same open reference datasets can be compared, making it easier to relate a new training procedure to existing, working baselines.
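The announcement does not specify how scaling trends are read off from these reference runs. As a minimal sketch of one common approach, the snippet below fits a saturating power law to the losses of models trained at increasing compute budgets; all numbers are made up for illustration, and a new training procedure would then be judged against the fitted curve at matched compute.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (compute, loss) pairs standing in for reference runs;
# the real OpenEuroLLM measurements are not published in this post.
flops = np.array([1e19, 3e19, 1e20, 3e20, 1e21])  # training compute
loss = np.array([3.10, 2.85, 2.62, 2.47, 2.35])   # held-out cross-entropy


def power_law(x, a, b, irreducible):
    # Saturating power law L(x) = a * x**(-b) + c, a common form for scaling fits.
    return a * x ** (-b) + irreducible


x = flops / flops[0]  # normalise so the optimiser works on O(1)-O(100) values
params, _ = curve_fit(power_law, x, loss, p0=(1.0, 0.3, 2.0))

# Beating the predicted loss at matched compute means beating the baseline.
pred = power_law(1e22 / flops[0], *params)
print(f"Baseline-predicted loss at 1e22 FLOPs: {pred:.3f}")
```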
MixtureVitae, another significant open web-scale pretraining dataset, has been developed together with LAION, Ontocord, and Open-Sci. It is the first permissively licensed dataset shown to match or outperform strong non-permissive datasets such as FineWeb-Edu and DCLM, and it is particularly strong on reasoning problems related to mathematics and code.
Together with EuroLLM, the project has tackled the data scarcity that most European languages face. Because current data collection cannot adequately address this scarcity, many languages remain under-represented in multilingual models; to close the gap, the first comprehensive multilingual synthetic pre-training dataset has been created.
In parallel, the project has established the basis of the OpenEuroLLM catalogue of LLM training data: a structured catalogue providing a uniform, collectively curated, and well-documented collection of candidate LLM training datasets. Datasets in the catalogue have been made publicly available (read-only) on multiple EuroHPC systems, such as LUMI, Leonardo and MareNostrum, to avoid duplicated effort and redundant storage.
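The catalogue's actual schema is not described in this post. The sketch below, with invented field names, illustrates the kind of uniform metadata record such a catalogue implies, so that every candidate dataset is documented the same way, including where read-only copies live on each EuroHPC system.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    """Illustrative metadata record for one candidate training dataset.

    All field names are hypothetical; the real OpenEuroLLM catalogue
    schema is not described in the announcement.
    """

    name: str                # e.g. "MixtureVitae"
    version: str
    languages: list[str]     # ISO 639-1 codes covered
    licence: str             # licensing terms of the underlying text
    size_tokens: int         # approximate token count
    source_url: str
    hpc_paths: dict[str, str] = field(default_factory=dict)  # system -> read-only path


entry = CatalogueEntry(
    name="example-corpus",
    version="1.0",
    languages=["en", "cs", "fi"],
    licence="permissive",
    size_tokens=100_000_000_000,
    source_url="https://example.org/dataset",
    hpc_paths={"LUMI": "/appl/data/example", "Leonardo": "/leonardo/data/example"},
)
```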
Models and evaluation
In collaboration with HPLT, 2B/100B reference models for various languages have been released. These transparent and easily reproducible reference models provide a means for cross-lingual comparison, for inspecting monolingual performance, and for understanding how popular evaluation tasks behave across different languages.
In addition, a range of 2B/4TT models have been trained to study multilingual data mixes, that is, to determine the optimal proportion of each language within a training dataset for producing high-performing multilingual LLMs.
The results of both the 2B/100B and the 2B/4TT models will inform future decisions as model sizes are scaled up.
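The post does not say how the candidate mixes were constructed. A common baseline in multilingual pretraining, shown below as a sketch with made-up corpus sizes, is temperature-based sampling: each language's share is proportional to its data size raised to an exponent alpha, so shrinking alpha upweights low-resource languages. This is a standard technique, not necessarily the one used in the OpenEuroLLM experiments.

```python
def mix_proportions(corpus_tokens: dict[str, float], alpha: float) -> dict[str, float]:
    """Temperature-based language mix: p_i proportional to n_i ** alpha.

    alpha = 1.0 reproduces the natural data distribution; smaller alpha
    gives low-resource languages a larger share of the training mix.
    """
    weights = {lang: n ** alpha for lang, n in corpus_tokens.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Made-up token counts for illustration only.
sizes = {"en": 2e12, "de": 3e11, "cs": 4e10, "mt": 1e9}
for lang, p in mix_proportions(sizes, alpha=0.5).items():
    print(f"{lang}: {p:.3f}")
```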
Looking ahead
As the project enters its second year, transparency, openness and community collaboration remain its guiding values, and the work continues with high ambitions.
OpenEuroLLM has succeeded in securing access to EuroHPC strategic compute resources, guaranteeing a substantial amount of compute on four major EuroHPC supercomputers for the remainder of the project. Additional compute resources will, however, be required to complement the strategic allocations.
The project aims to release an 8B model by next summer, followed by a larger model trained with the compute secured through the strategic allocation. Additionally, new iterations of the Poro model family will be released.