ParlaCAP - Comparing agenda settings across parliaments via the ParlaMint dataset
ParlaCAP is an OSCARS Open Science cascading grant project that aims to extend the use of ParlaMint, a collection of open, comparable corpora of parliamentary debates, to researchers in the social sciences and beyond. The project uses advanced natural language processing to analyse political agendas and sentiment in debates from 28 European parliaments. Automatically coding agendas across a dataset of more than 8 million speeches, given in more than 20 languages, has only recently become possible: thanks to significant developments in natural language processing and artificial intelligence, multilingual transformer models can now provide codings that are both highly consistent and accurate. By integrating the ParlaMint dataset with the Comparative Agendas Project's coding scheme, the project will create a comprehensive, FAIR dataset for comparative political research, enhancing transparency and accountability in legislative discourse across Europe.
This project is funded by the OSCARS project's cascading grant, which has received funding from the European Commission’s Horizon Europe Research and Innovation programme under grant agreement No. 101129751.
Data
The ParlaCAP dataset and other freely available datasets related to the ParlaCAP project.
-
ParlaCAP 1.0 dataset
The ParlaCAP dataset consists of 8 million speeches from 28 European national and regional parliaments. Each speech is coded with the sentiment expressed (ParlaSent coding, from negative, through neutral, to positive) and the topic discussed (CAP (Comparative Agendas Project) coding with 21 topics), and comes with rich metadata on the speakers, parties and democracies. The ParlaCAP dataset extends the ParlaMint 5.0 dataset by automatically coding topics and sentiment for each speech and simplifying the data to a tabular form.
Repository · More Information (Paper in Development)
-
ParlaMint 5.0 corpus collection
ParlaMint 5.0 is a collection of comparable corpora of parliamentary debates from 29 European countries and regions. The corpora are richly annotated with metadata on speakers and parties, and automatically assigned CAP top-level topics and sentiment information. While ParlaMint 5.0 and ParlaCAP 1.0 provide the same underlying data, ParlaMint is distributed in formats primarily intended for corpus linguistics research, whereas ParlaCAP is provided in simplified, analysis-ready formats that are more accessible to social scientists and other digital humanities researchers.
Repository · Concordancer · More Information
-
ParlaCAP fine-tuning data
The training dataset for the ParlaCAP topic classifier. The dataset comprises approximately 36,000 parliamentary speeches in 29 languages from the ParlaMint 4.1 corpus collection, annotated with CAP topic labels by the GPT-4o model. The annotation follows the LLM Teacher-Student Framework for developing training data and BERT-like classifiers without manually annotated data.
(TBA) Repository · More Information
-
ParlaCAP test data
The ParlaCAP test datasets comprise parliamentary speeches in Bosnian, Croatian, English, and Serbian, sourced from the ParlaMint 4.1 dataset and manually annotated by a single expert annotator using the 21 CAP categories from the official CAP schema, along with an additional Other label. The datasets are approximately balanced across labels and languages, with approximately 800 instances per language. To prevent large language models from incorporating the test datasets during their training phase, the test datasets are not publicly available. However, we are happy to share them with interested researchers: contact us to be granted access to the datasets.
Evaluation Dashboard · More Information
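Because ParlaCAP is distributed in tabular form, many comparative analyses reduce to grouping and counting rows. The following is a minimal sketch of computing the share of each CAP topic per parliament. It assumes one row per speech; the column names (`parliament`, `speech_id`, `cap_topic`, `sentiment`) and the sample rows are hypothetical and are not taken from the ParlaCAP documentation.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample in a tab-separated layout. The column names and values
# are assumptions for illustration, not the documented ParlaCAP schema.
SAMPLE_TSV = """parliament\tspeech_id\tcap_topic\tsentiment
HR\ts1\tHealth\tNegative
HR\ts2\tHealth\tNeutral
HR\ts3\tEducation\tPositive
SI\ts4\tDefense\tNegative
SI\ts5\tHealth\tPositive
"""


def topic_shares(rows):
    """Return {parliament: {topic: share of that parliament's speeches}}."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["parliament"]][row["cap_topic"]] += 1

    shares = {}
    for parliament, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[parliament] = {
            topic: n / total for topic, n in topic_counts.items()
        }
    return shares


# Read the sample as a list of dicts, one per speech.
rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
shares = topic_shares(rows)

for parliament, by_topic in sorted(shares.items()):
    print(parliament, by_topic)
```

Normalising counts to shares within each parliament keeps the comparison meaningful when parliaments differ greatly in the number of speeches they contribute.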
Models
Multilingual models fine-tuned on the tasks of CAP topic schema classification and sentiment identification.
-
ParlaCAP model
The ParlaCAP model is a multilingual text classification model that assigns topic categories to parliamentary speeches according to the CAP (Comparative Agendas Project) schema. The model was built by fine-tuning the XLM-R-Parla model on GPT-4o–annotated debates from multiple European parliaments. It achieves macro-F1 around 0.65–0.72 across English and South Slavic test sets.
Model on Hugging Face
Guide for Automatic CAP Annotation of ParlaMint Data
-
ParlaSent model
The ParlaSent model is a multilingual transformer model for sentiment analysis in parliamentary speeches. The model was developed by fine-tuning the XLM-R-Parla model on the ParlaSent dataset, a manually annotated selection of sentences from parliamentary proceedings from Bosnia and Herzegovina, Croatia, Czechia, Serbia, Slovakia, Slovenia, and the United Kingdom. The model achieves a mean absolute error of 0.68–0.71 in the regression scenario, and macro-F1 scores of 0.70–0.73 when its outputs are mapped to the three sentiment categories (Positive, Neutral, Negative).
Model on Hugging Face
More Information
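The two kinds of scores reported for the models above can be illustrated with a small, self-contained sketch: mean absolute error on numeric sentiment scores, and macro-F1 after the scores are mapped to three categories. The cut-offs used here (below 2.0 is Negative, above 4.0 is Positive, on an assumed 0–5 scale) and the toy scores are assumptions for illustration only; consult the model documentation for the mapping actually used.

```python
def to_category(score, low=2.0, high=4.0):
    """Map a numeric sentiment score to one of three categories.

    The thresholds are illustrative assumptions, not confirmed project values.
    """
    if score < low:
        return "Negative"
    if score > high:
        return "Positive"
    return "Neutral"


def mean_absolute_error(gold, predicted):
    """Average absolute difference between gold and predicted scores."""
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)


def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 over all labels seen in either list."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy gold and predicted scores (made up for this example).
gold_scores = [0.5, 3.0, 4.5, 1.0]
predicted_scores = [1.0, 3.5, 3.5, 2.5]

mae = mean_absolute_error(gold_scores, predicted_scores)
gold_labels = [to_category(s) for s in gold_scores]
predicted_labels = [to_category(s) for s in predicted_scores]
f1 = macro_f1(gold_labels, predicted_labels)

print(f"MAE: {mae:.3f}  macro-F1: {f1:.3f}")
```

Macro-F1 averages the per-label scores without weighting by label frequency, so a label that is never predicted correctly pulls the overall score down regardless of how rare it is.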
Tutorials
Step-by-step guides for using the ParlaCAP dataset.
-
Parliamentary Speech Analysis with ParlaCAP
Tutorials for analyzing parliamentary speeches from multiple European countries using the Python programming language. The 5 tutorial notebooks combine processing of ParlaMint data, sentiment analysis, party comparisons, and cross-country analyses, enabling students and researchers to study the tone and content of parliamentary debates systematically.
View tutorial
-
Parliamentary Speech Analysis with ParlaCAP
Tutorials for analyzing parliamentary speeches from multiple European countries using the R programming language.
TBA
Publications
Publications connected with the ParlaCAP project.
-
State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?
November 2025 · arXiv; submitted to LREC 2026
Taja Kuzman Pungeršek, Peter Rupnik, Ivan Porupski, Vuk Dinić, Nikola Ljubešić
Conference Paper
-
Parlasent: mapping sentiment in political discourse with large language models
June 2025 · Political Research Exchange, 7(1)
Michal Mochtak, Peter Rupnik, Taja Kuzman, Nikola Ljubešić
Research Note
-
ParlaCAP: Comparing Agenda-setting across Parliaments via the ParlaMint dataset
June 2025
Nikola Ljubešić, Taja Kuzman Pungeršek, Daniela Širinić
Conference Paper
Events
Upcoming and past events related to this project: talks and workshops.
-
CDH/CLARIN Workshop in the CDH Training Programme
November 20, 2025 · Utrecht, Netherlands
-
Presentation at the ARNES Open Science conference
November 20, 2025 · Ljubljana, Slovenia
-
Poster at the CLARIN ERIC 2025 annual conference
October 1, 2025 · Vienna, Austria
-
Workshop at the Digital Humanities Conference 2025
July 14, 2025 · Lisbon, Portugal
-
Presentation at the CAP 2025 conference
June 11, 2025 · Konstanz, Germany
Nikola Ljubešić, Daniela Širinić: ParlaCAP - Comparing Agenda-Setting across Parliaments via the ParlaMint Dataset
Event
-
Presentation at the Fostering Open Science session at the CLARIN ERIC General Assembly
February 6, 2025 · Leuven, Belgium
Presentation of the project was given by Tomaž Erjavec
-
Poster at the CLARIN ERIC 2024 conference
October 16, 2024 · Barcelona, Spain
-
Poster at the 4th Workshop on Computational Linguistics for the Political and Social Sciences
September 13, 2024 · Vienna, Austria
Nikola Ljubešić: ParlaCAP - Comparing agenda settings across parliaments via the ParlaMint dataset
Event
-
Presentation at the Central European University
September 11, 2024 · Vienna, Austria
Nikola Ljubešić: Revealing the Hidden Treasures of Parliamentary Proceedings with NLP
Event
-
Presentation at the ACDH-CH Research Lunch
September 10, 2024 · Vienna, Austria
Nikola Ljubešić: ParlaMint - How recent developments in NLP help us reveal the hidden treasures of parliamentary proceedings
Event
Partners
-
Jožef Stefan Institute
The Jožef Stefan Institute (JSI) is the leading Slovenian scientific research institute, covering a broad spectrum of basic and applied research in natural sciences, life sciences and engineering. ParlaCAP project members come from the Department of Knowledge Technologies, the CLARIN.SI infrastructure, and the CLASSLA knowledge centre, all located inside the JSI.
Project Participants:
Nikola Ljubešić, Tomaž Erjavec, Taja Kuzman Pungeršek, Peter Rupnik
-
Institute of Contemporary History
The Institute of Contemporary History (INZ) is the central and largest specialised research organisation in the field of contemporary history in the Republic of Slovenia. It conducts research on Slovenian history from the 19th century onwards.
Project Participants:
Katja Meden, Jure Skubic, Anna Kryvenko
-
Faculty of Political Science, University of Zagreb
As the oldest faculty of political science in Central, Eastern and South-East Europe, offering a political science program since 1962, the Faculty of Political Science is an academic institution whose mission is the acquisition and transfer of knowledge about the Croatian state and politics, society, media and its international environment.
Project Participant:
Daniela Širinić
-
Institute of Information and Communication Technologies, Bulgarian Academy of Sciences
The mission of the Institute is to conduct fundamental and applied research in the field of computer science, information and communication technologies (ICT), language technologies (LT) and in the development of innovative interdisciplinary applications related to these technologies.
Project Participants:
Petya Osenova, Nikolay Paev, Teodor Valchev
-
Institute of Computer Science, Polish Academy of Sciences
The Institute of Computer Science of the Polish Academy of Sciences is a leading research centre of information technology in Poland. The Institute conducts fundamental and applied research in computer science and linguistics, developing both formal methods and computational models.
Project Participants:
Maciej Ogrodniczuk, Łukasz Kobyliński
-
External Collaborators
- Michal Mochtak (Radboud University)
- Matyáš Kopp (Institute of Formal and Applied Linguistics)
Contact
If you have any questions about the ParlaCAP project, its datasets, models, or any other project-related content, we will be happy to help.
-
Contact person
Name: Nikola Ljubešić
Email: nikola.ljubesic@ijs.si
-
CLARIN.SI Helpdesk
Email: info@clarin.si