ParlaCAP - Comparing agenda settings across parliaments via the ParlaMint dataset

ParlaCAP is an OSCARS Open Science cascading grant project that aims to extend the use of ParlaMint, a collection of open, comparable corpora of parliamentary debates, to researchers in the social sciences and beyond. The project uses advanced natural language processing to analyse political agendas and sentiment in debates from 28 European parliaments. Automatic coding of agendas across a dataset of more than 8 million speeches in more than 20 languages has only recently become possible, thanks to significant developments in natural language processing and artificial intelligence: multilingual transformer models now provide codings that are both highly consistent and accurate. By integrating the ParlaMint dataset with the Comparative Agendas Project's coding scheme, the project will create a comprehensive, FAIR dataset for comparative political research, enhancing transparency and accountability in legislative discourse across Europe.

Project start date: 1 January 2025
Project duration: 24 months

See also the project description at the OSCARS website.

This project is funded by the OSCARS project's cascading grant, which has received funding from the European Commission’s Horizon Europe Research and Innovation programme under grant agreement No. 101129751.

Data

The ParlaCAP dataset and other freely-available datasets related to the ParlaCAP project.

ParlaCAP 1.0 dataset

TSV · 8M speeches

The ParlaCAP dataset consists of 8 million speeches from 28 European national and regional parliaments. Each speech is coded with the sentiment expressed (ParlaSent coding, ranging from negative through neutral to positive) and the topic discussed (Comparative Agendas Project (CAP) coding with 21 topics), and comes with rich metadata on the speakers, parties and democracies. The ParlaCAP dataset extends the ParlaMint 5.0 dataset by automatically coding topics and sentiment for each speech and simplifying the data to a tabular form.
Repository · More Information
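Because the dataset is distributed as tabular TSV, a first comparative analysis can be as simple as counting topics per parliament. The following is a minimal sketch rather than an official loader: the column names (parliament, cap_topic, sentiment) and the sample rows are hypothetical stand-ins, so check the actual header of the released files before adapting it.

```python
import csv
import io
from collections import Counter, defaultdict

# Tiny in-memory stand-in for a ParlaCAP TSV file.
# Column names and rows are hypothetical, for illustration only.
SAMPLE_TSV = (
    "parliament\tcap_topic\tsentiment\n"
    "HR\tHealth\tnegative\n"
    "HR\tHealth\tneutral\n"
    "HR\tEducation\tpositive\n"
    "SI\tHealth\tpositive\n"
)


def topic_shares(tsv_file):
    """Return {parliament: {topic: share of that parliament's speeches}}."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        counts[row["parliament"]][row["cap_topic"]] += 1
    return {
        parliament: {
            topic: n / sum(topic_counts.values())
            for topic, n in topic_counts.items()
        }
        for parliament, topic_counts in counts.items()
    }


shares = topic_shares(io.StringIO(SAMPLE_TSV))
for parliament, by_topic in shares.items():
    print(parliament, by_topic)
```

To run the same function on a real file, pass an open file handle (for example one opened with encoding="utf-8" and newline="") instead of the in-memory sample.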
ParlaMint 5.0 corpus collection

TEI XML, TSV, CoNLL-U · 1.2B words, 8M speeches

ParlaMint 5.0 is a collection of comparable corpora of parliamentary debates from 29 European countries and regions. The corpora are richly annotated with metadata on speakers and parties, and with automatically assigned CAP top-level topics and sentiment information. While ParlaMint 5.0 and ParlaCAP 1.0 provide the same underlying data, ParlaMint is distributed in formats primarily intended for corpus linguistics research, whereas ParlaCAP is provided in simplified, analysis-ready formats that are more accessible to social scientists and other digital humanities researchers.
Repository · Concordancer · More Information
ParlaCAP fine-tuning data

JSONL · 35,579 speeches

The ParlaCAP-train training dataset for the ParlaCAP topic classifier. The dataset comprises approximately 36,000 parliamentary speeches in 29 languages from the ParlaMint 4.1 corpus collection, annotated with the CAP topic labels by the GPT-4o model following the LLM Teacher-Student Framework for development of training data and BERT-like classifiers without manually-annotated data.
Repository · More Information
ParlaCAP test data

JSONL · 3,443 speeches

The ParlaCAP test datasets comprise parliamentary speeches in Bosnian, Croatian, English, and Serbian, sourced from the ParlaMint 4.1 dataset and manually annotated by a single expert annotator using the 21 CAP categories from the official CAP schema, along with an additional Other label. The datasets are approximately balanced across labels and languages, with approximately 800 instances per language. To prevent large language models from incorporating the test datasets during their training phase, the test datasets are not publicly available. However, we are happy to share them with interested researchers: contact us to be granted access to the datasets.
Evaluation Dashboard · More Information

Models

Multilingual models fine-tuned on the tasks of CAP topic schema classification and sentiment identification.

ParlaCAP model

The ParlaCAP model is a multilingual text classification model that assigns topic categories to parliamentary speeches according to the CAP (Comparative Agendas Project) schema. The model was built by fine-tuning the XLM-R-Parla model on GPT-4o–annotated debates from multiple European parliaments. It achieves a macro-F1 of around 0.65–0.72 on the English and South Slavic test sets.
Model on Hugging Face · More Information
Guide for Automatic CAP Annotation of ParlaMint Data
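For readers unfamiliar with the macro-F1 metric reported above, the sketch below shows how it is computed: the F1 score is calculated separately for each topic label and then averaged with equal weight, so rare topics count as much as frequent ones. This is a generic illustration with made-up labels, not the project's evaluation script.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up example labels, for illustration only.
gold = ["Health", "Health", "Education", "Defense"]
pred = ["Health", "Education", "Education", "Defense"]
score = macro_f1(gold, pred)
print(round(score, 4))
```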
ParlaSent model

The ParlaSent model is a multilingual transformer model for sentiment analysis in parliamentary speeches. The model was developed by fine-tuning the XLM-R-Parla model on the ParlaSent dataset, a manually annotated selection of sentences from parliamentary proceedings from Bosnia and Herzegovina, Croatia, Czechia, Serbia, Slovakia, Slovenia, and the United Kingdom. The model achieves a mean absolute error of 0.68–0.71 in the regression scenario, and macro-F1 scores of 0.70–0.73 when its outputs are mapped to the three sentiment categories (Positive, Neutral, Negative).
Model on Hugging Face
More Information
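The two evaluation settings mentioned above can be illustrated with a short sketch: the mean absolute error is computed on the continuous scores, and the same scores are then binned into the three sentiment categories. The score scale (assumed here to run from 0 to 5) and the cut-off values are hypothetical and chosen only for illustration; consult the model documentation for the mapping actually used.

```python
def mean_absolute_error(gold, pred):
    """Average absolute difference between gold and predicted scores."""
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


def to_category(score, low=2.0, high=3.0):
    """Bin a continuous score into three categories.

    The thresholds low and high are hypothetical, not the official mapping.
    """
    if score < low:
        return "Negative"
    if score > high:
        return "Positive"
    return "Neutral"


# Made-up scores on an assumed 0-5 scale.
gold_scores = [0.5, 2.5, 4.0]
pred_scores = [1.0, 2.0, 4.5]

mae = mean_absolute_error(gold_scores, pred_scores)
categories = [to_category(s) for s in pred_scores]
print(mae, categories)
```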

Tutorials

Step-by-step guides for using the ParlaCAP dataset.

Parliamentary Speech Analysis with ParlaCAP

Python · Jupyter notebooks

Multiple tutorials for analyzing parliamentary speeches across multiple European countries using the Python programming language. The 5 tutorial notebooks combine processing of ParlaMint data, sentiment analysis, party comparisons and cross-country analyses, enabling students and researchers to study the tone and content of parliamentary debates systematically.
View tutorial
Parliamentary Speech Analysis with ParlaCAP

R · Markdown README files

Multiple tutorials (an adaptation of the Python tutorials) for analyzing parliamentary speeches across multiple European countries using the R programming language.
View a draft of the tutorial

Publications

Publications connected with the ParlaCAP project.

Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification February 2026

arXiv; submitted to the PoliticalNLP workshop, co-located with LREC 2026

Taja Kuzman Pungeršek, Peter Rupnik, Daniela Širinić, Nikola Ljubešić
Conference Paper
From the Dispatch Box: Unlocking the Potential of ParlaMint Through noSketch Engine and TEITOK 2026

Kristina Pahor de Maiti Tekavčič, Anna Kryvenko
Digital Textbook
State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting? November 2025

arXiv; submitted to LREC 2026

Taja Kuzman Pungeršek, Peter Rupnik, Ivan Porupski, Vuk Dinić, Nikola Ljubešić
Conference Paper
Parlasent: mapping sentiment in political discourse with large language models June 2025

Political Research Exchange, 7(1)

Michal Mochtak, Peter Rupnik, Taja Kuzman, Nikola Ljubešić
Research Note
ParlaCAP: Comparing Agenda-setting across Parliaments via the ParlaMint dataset June 2025

Annual Conference of the Comparative Agendas Project (CAP) 2025

Nikola Ljubešić, Taja Kuzman Pungeršek, Daniela Širinić
Conference Paper

Events

Upcoming and past events related to this project: talks and workshops.

Guest lecture at University of Vienna January 19, 2026

Vienna, Austria

Nikola Ljubešić: Multilingual NLP can shed light on many secrets of parliamentary proceedings
Event · Slides
CDH/CLARIN Workshop in the CDH Training Programme November 20, 2025

Utrecht, Netherlands

Anna Kryvenko, Kristina Pahor de Maiti Tekavčič: ParlaMint – An introduction to Multilingual Parliamentary Data
Event · Materials
Presentation at the ARNES Open Science conference November 20, 2025

Ljubljana, Slovenia

Taja Kuzman Pungeršek: ParlaCAP - Primerjava tematskih prioritet v parlamentih na podlagi korpusov ParlaMint
Slides · Recording
Poster at the CLARIN ERIC 2025 annual conference October 1, 2025

Vienna, Austria

Nikola Ljubešić, Taja Kuzman Pungeršek: ParlaCAP - Mining the ParlaMint Treasures with Multilingual Topic and Sentiment Classification
Event · Poster
Workshop at the Digital Humanities Conference 2025 July 14, 2025

Lisbon, Portugal

Darja Fišer, Anna Kryvenko, Kristina Pahor de Maiti Tekavčič: From the Dispatch Box - Unlocking Topics and Sentiments in Multilingual ParlaMint Corpora
Event · Recording
Presentation at the CAP 2025 conference June 11, 2025

Konstanz, Germany

Nikola Ljubešić, Daniela Širinić: ParlaCAP - Comparing Agenda-Setting across Parliaments via the ParlaMint Dataset
Event
Presentation at the Fostering Open Science session at the CLARIN ERIC General Assembly February 6, 2025

Leuven, Belgium

Presentation of the project was given by Tomaž Erjavec
Poster at the CLARIN ERIC 2024 conference October 16, 2024

Barcelona, Spain

Nikola Ljubešić, Taja Kuzman: ParlaCAP - Comparing agenda settings across parliaments via the ParlaMint dataset
Event · Poster
Poster at the 4th Workshop on Computational Linguistics for the Political and Social Sciences September 13, 2024

Vienna, Austria

Nikola Ljubešić: ParlaCAP - Comparing agenda settings across parliaments via the ParlaMint dataset
Event
Presentation at the Central European University September 11, 2024

Vienna, Austria

Nikola Ljubešić: Revealing the Hidden Treasures of Parliamentary Proceedings with NLP
Event
Presentation at the ACDH-CH Research Lunch September 10, 2024

Vienna, Austria

Nikola Ljubešić: ParlaMint - How recent developments in NLP help us reveal the hidden treasures of parliamentary proceedings
Event

Partners

Jožef Stefan Institute

The Jožef Stefan Institute (JSI) is the leading Slovenian scientific research institute, covering a broad spectrum of basic and applied research in natural sciences, life sciences and engineering. ParlaCAP project members come from the Department of Knowledge Technologies, the CLARIN.SI infrastructure, and the CLASSLA knowledge centre, all located inside the JSI.
Project Participants:
Nikola Ljubešić, Tomaž Erjavec, Taja Kuzman Pungeršek, Peter Rupnik
Institute of Contemporary History

The Institute of Contemporary History (INZ) is the central and largest specialised research organisation in the field of contemporary history in the Republic of Slovenia. It conducts research on Slovenian history from the 19th century onwards.
Project Participants:
Katja Meden, Jure Skubic, Anna Kryvenko
Faculty of Political Science, University of Zagreb

As the oldest Faculty of Political Science in Central, Eastern and South-East Europe, offering a political science programme since 1962, the Faculty of Political Science is an academic institution whose mission is the acquisition and transfer of knowledge about the Croatian state and politics, society, media, and their international environment.
Project Participant:
Daniela Širinić
Institute of Information and Communication Technologies, Bulgarian Academy of Sciences

The mission of the Institute is to conduct fundamental and applied research in the field of computer science, information and communication technologies (ICT), language technologies (LT) and in the development of innovative interdisciplinary applications related to these technologies.
Project Participants:
Petya Osenova, Nikolay Paev, Teodor Valchev
Institute of Computer Science, Polish Academy of Sciences

The Institute of Computer Science of the Polish Academy of Sciences is a leading research centre of information technology in Poland. The Institute conducts fundamental and applied research in computer science and linguistics, developing both formal methods and computational models.
Project Participants:
Maciej Ogrodniczuk, Łukasz Kobyliński
External Collaborators
- Michal Mochtak (Radboud University)
- Matyáš Kopp (Institute of Formal and Applied Linguistics)

Contact

If you have any questions about the ParlaCAP project, its datasets, models, or any other project-related content, we will be happy to help.

Contact person

Name: Nikola Ljubešić

Email: nikola.ljubesic@ijs.si
CLARIN.SI Helpdesk

Email: info@clarin.si
