TL;DR
Italy’s Minerva LLM, developed from scratch with extensive native-language data, achieved unexpectedly low results on Italian academic benchmarks. This challenges assumptions about scale and investment in European sovereign AI efforts.
Italy’s Minerva project, a large-scale European sovereign language model trained from scratch on 2.5 trillion tokens with approximately 50% Italian content, scored only 4.9% on the INVALSI Italian school-exam benchmark, raising questions about the effectiveness of native-language investment at current model scales.
Minerva, led by Sapienza University of Rome and supported by Italy’s national research and supercomputing infrastructure, was designed to demonstrate a sovereign approach to developing Italian-language large language models (LLMs). The project trained models ranging from 350 million to 7 billion parameters, with the 3B model being publicly released along with training data and code. Despite this significant investment, Minerva-3B’s performance on the INVALSI Italian benchmark was near chance, at just 4.9%. This result was unexpected given the extensive native-language data used and the scale of the training dataset.
Researchers concluded that, while dataset composition matters, the overall size of the dataset and the number of parameters are more critical for handling complex language tasks. The findings suggest that even substantial native-language investment may not suffice at the current parameter scales to achieve deep language understanding or academic proficiency, challenging assumptions in the European sovereign-LLM movement.
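To make the scale argument concrete, here is a minimal back-of-the-envelope sketch. It takes the two figures given above (2.5 trillion training tokens, roughly 50% Italian) and derives the Italian token count plus a tokens-per-parameter ratio for the parameter sizes named in the text. The text does not say how many tokens each individual tier was trained on, so the ratios are upper bounds that assume no tier saw more than the stated total. The function and variable names are hypothetical.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the article.
TOTAL_TOKENS = 2.5e12   # "2.5 trillion tokens"
ITALIAN_SHARE = 0.50    # "approximately 50% Italian content"

# Parameter sizes named in the text; per-tier token budgets are NOT given,
# so every ratio below is an upper bound, not a reported figure.
MODEL_SIZES = {"350M": 350e6, "3B": 3e9, "7B": 7e9}


def italian_tokens(total_tokens: float, share: float) -> float:
    """Number of tokens in the given language share of the corpus."""
    return total_tokens * share


def tokens_per_parameter(total_tokens: float, n_params: float) -> float:
    """Training tokens available per model parameter."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return total_tokens / n_params


def upper_bound_ratios() -> dict:
    """Tokens-per-parameter upper bound for each named model size."""
    return {
        name: round(tokens_per_parameter(TOTAL_TOKENS, n), 1)
        for name, n in MODEL_SIZES.items()
    }


if __name__ == "__main__":
    print(f"Italian tokens: {italian_tokens(TOTAL_TOKENS, ITALIAN_SHARE):.3g}")
    for name, ratio in upper_bound_ratios().items():
        print(f"{name}: at most {ratio} tokens per parameter")
```

Holding the token budget fixed while varying only the parameter count separates the two quantities the researchers weigh against each other: data volume and model size.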
Minerva. The opposite path.
Italy spent years building a European sovereign LLM from scratch. Then Minerva-3B scored 4.9% on the INVALSI Italian school exam.
Where AMÁLIA layered Portuguese specialization onto a multilingual foundation, Minerva trained from scratch on 2.5 trillion tokens with approximately 50% Italian content. Where AMÁLIA’s weights are not yet public, Minerva published weights, training data, and code as truly open from day one. By every institutional measure, the Italian approach worked. But the empirical results contain a finding that the press coverage has been quiet about, and its implications extend well beyond Italy.
Same problem. Opposite path.
European sovereign-LLM development has two primary architectural approaches. Italy chose to train from scratch on a substantial native-language foundation. Portugal chose continuation pre-training of a multilingual model. The structural comparison surfaces what each commitment actually requires operationally.
The comparison is not “Italy did it better than Portugal.” Both projects respond to the same structural problem with different architectural strategies under different institutional and economic constraints. Italy’s national-AI investment is structurally larger by an order of magnitude — and Minerva is the visible artifact of that scale.

4.9% on INVALSI. The bitter lesson surfaces.
In June 2024, researchers evaluated Minerva-3B on the INVALSI Italian school-exam benchmark. The result was unambiguous: 4.9%. This is not a critique of Minerva; it is a critique of the public discourse around what Minerva’s empirical results actually demonstrate.
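Reading a single benchmark score requires knowing what random guessing would earn on the same items. As an illustrative sketch only, the snippet below computes accuracy alongside an expected chance baseline for a benchmark that mixes item formats. The `Item` structure, the option counts, and the sample data are hypothetical; they are not taken from INVALSI or from the Minerva evaluation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    """One benchmark item (hypothetical structure for illustration)."""
    n_options: int  # number of answer options; 0 means open-ended
    correct: bool   # whether the model answered correctly


def accuracy(items: List[Item]) -> float:
    """Fraction of items answered correctly."""
    if not items:
        raise ValueError("items must not be empty")
    return sum(item.correct for item in items) / len(items)


def chance_baseline(items: List[Item]) -> float:
    """Expected accuracy of uniform random guessing.

    Multiple-choice items contribute 1 / n_options; open-ended items
    are assumed to contribute nothing, since a random guess is
    effectively never right.
    """
    if not items:
        raise ValueError("items must not be empty")
    expected = sum(1.0 / i.n_options if i.n_options > 0 else 0.0 for i in items)
    return expected / len(items)


# Hypothetical sample: two 4-option items, one true/false, one open-ended.
sample = [
    Item(n_options=4, correct=True),
    Item(n_options=4, correct=False),
    Item(n_options=2, correct=False),
    Item(n_options=0, correct=False),
]

if __name__ == "__main__":
    print(f"accuracy: {accuracy(sample):.3f}")
    print(f"chance baseline: {chance_baseline(sample):.3f}")
```

The baseline depends on the format mix: the more open-ended items a benchmark contains, the lower the score that random guessing would produce, so a raw percentage is only interpretable next to that reference point.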

350M to 7B. Four parameter scales, one architecture.
The Minerva model family covers four parameter tiers, each with specific training corpora. Each scale level reveals what the from-scratch path actually requires at different operating points.
[Table: Minerva model tiers and their training corpora. Surviving entries: Italian + English; 100B English; ~50% English; + 200B code.]

Three answers. Same question.
Minerva, AMÁLIA, and OpenEuroLLM represent the three operational answers to the European sovereign-LLM question. Each makes different architectural and institutional bets.

Three standards the movement should adopt.
The structural critique generalizes beyond Minerva. The European sovereign-LLM movement benefits from internalizing these lessons across every subsequent national project. Italy modeled the openness standard; the movement should adopt it as the norm.
Minerva is one valid answer to the European sovereign-LLM question. AMÁLIA is another. OpenEuroLLM is potentially a third. The strategic discourse benefits from treating all three as data points in the same empirical experiment rather than as competing national-prestige projects. More analysis like this is needed. Not less.
Implications for European Sovereign-LLM Strategies
The results from Minerva suggest that large native-language datasets and significant model sizes alone may not produce the desired depth of country-specific knowledge or academic competence. This questions the prevailing belief that scaling native-language data and parameters automatically leads to better performance, and it highlights the need to reevaluate investment strategies across Europe’s sovereign AI initiatives. Achieving meaningful language understanding and domain-specific knowledge may require even larger investments or different architectural approaches, which in turn could influence future policy and research directions.
Background on Italy’s Minerva and European LLM Development
Italy’s Minerva project was launched as a pioneering effort to create a fully sovereign Italian-language LLM, trained from scratch on 2.5 trillion tokens, with about half being Italian. Led by Sapienza University’s NLP group under Roberto Navigli, and supported by Italy’s national AI and supercomputing infrastructure, Minerva was designed to demonstrate a scalable, open, and national approach to AI development. Previous European efforts, such as Portugal’s AMÁLIA, opted for continuation pre-training on multilingual models, raising questions about the optimal approach for language-specific models. Minerva’s open weights, data, and code marked a significant step in transparency and infrastructure, but its performance on complex language tasks has exposed limitations in current scaling assumptions.
“Minerva’s low benchmark score suggests that even large native-language datasets at current scales may not produce country-knowledge depth.”
— Thorsten Meyer
Unanswered Questions About Scaling and Model Architecture
It remains unclear whether increasing model size beyond 7 billion parameters, further expanding native-language datasets, or adopting different training methodologies could significantly improve Minerva’s performance on complex language tasks. Research into these avenues is ongoing, and no definitive conclusions have been drawn. The specific reasons for the low academic benchmark results, despite substantial native-language data, are also still under investigation.
Next Steps for European Sovereign-Language Model Development
Researchers and policymakers will likely reassess the scale and approach of native-language AI projects across Europe. Future work may involve experimenting with larger models, alternative architectures, or hybrid strategies combining multilingual and monolingual training. The Italian team also plans to continue iterative testing, including ongoing evaluations of the 2025 continual-training case study, to better understand the relationship between data scale, model size, and language proficiency. The broader European community will watch these developments to inform strategic investments and research priorities.
Key Questions
Why did Minerva perform poorly on Italian academic tests?
Despite extensive native-language data, the evaluation suggests that dataset size and model scale are critical for complex tasks. The current parameter scale may be insufficient to develop deep country-specific knowledge or academic proficiency.
Does this mean large native-language datasets are ineffective?
Not necessarily; results indicate that data scale alone may not guarantee performance. Larger models, different architectures, or more targeted training might be necessary to achieve desired outcomes.
What does this mean for European AI sovereignty efforts?
The findings highlight that simply scaling native-language data and models may not be enough. Policymakers and researchers may need to consider more ambitious investments or new strategies to develop effective country-specific AI systems.
Will Minerva’s low performance impact future projects?
It is too early to say. The project continues iterating, and these results provide valuable insights for refining approaches to sovereign-language model development.
Are there plans to improve Minerva’s performance?
Yes, ongoing research aims to explore larger models, alternative training methods, and different data strategies to enhance language understanding and academic proficiency.
Source: ThorstenMeyerAI.com