ParlaSpeech

Corpus Overview

Language	Duration	Sentences	Layers	Links
Croatian (HR)	3,061 hours	922,679	All layers	CLARIN | HuggingFace
Serbian (RS)	896 hours	290,778	All layers	CLARIN | HuggingFace
Czech (CZ)	1,218 hours	717,682	Ling, Senti, Pause	CLARIN | HuggingFace
Polish (PL)	1,010 hours	535,465	Ling, Senti, Pause	CLARIN | HuggingFace

Dataset Schema

1. id [string] - Unique sentence ID, referencing the utterance ID from ParlaMint 4.0, expanded with character offsets.

2. words [list of dicts] - ASR word-level timing and character offset annotations.

• time_s [float] - Start time of the word in seconds, relative.

• time_e [float] - End time of the word in seconds, relative.

• char_s [int] - Start character index in the transcript, relative.

• char_e [int] - End character index in the transcript, relative.

3. audio [string] - Relative path to the audio file in ParlaSpeech.

4. audio_length [float] - Duration of the audio file in seconds.

5. text [string] - Text from official manual transcriptions corresponding to the audio.

6. text_start [int] - Start character index of the relevant text segment, absolute.

7. text_end [int] - End character index of the relevant text segment, absolute.

8. audio_start [float] - Start time of the relevant audio segment, absolute.

9. audio_end [float] - End time of the relevant audio segment, absolute.

10. speaker_info [dict] - Metadata about the speaker and session.

• Text_ID [string] - Text document ID.

• ID [string] - Speech segment ID.

• Title [string] - Title of the debate/session.

• Date [string] - Date of the session.

• Body [string] - Parliamentary body or chamber.

• Term [string] - Parliamentary term.

• Session [string] - Session number.

• Meeting [string] - Meeting number.

• Sitting [string] - Sitting number.

• Agenda [string] - Agenda topic.

• Subcorpus [string] - Subcorpus name.

• Lang [string] - Language of the transcript.

• Speaker_role [string] - Role of the speaker (e.g., Regular, Minister).

• Speaker_MP [string] - Whether speaker is a Member of Parliament.

• Speaker_minister [string] - Whether speaker is a minister.

• Speaker_party [string] - Party abbreviation.

• Speaker_party_name [string] - Full name of the party.

• Party_status [string] - Status of the party: government, opposition, or coalition.

• Party_orientation [string] - Political orientation (e.g., left, right).

• Speaker_ID [string] - Unique speaker identifier, usually in format "LastnameFirstName".

• Speaker_name [string] - Full name of the speaker in format "Lastname, Firstname".

• Speaker_gender [string] - Gender of the speaker (M, F or "-").

• Speaker_birth [string] - Birth year of the speaker.
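
The word-level offsets above can be used to pair each word with its timing. The following is a minimal sketch, not the authors' tooling. It assumes an entry has already been loaded as a Python dict, that char_s and char_e index into the text field, and that char_e is exclusive. The helper name word_spans and the toy entry are hypothetical, and the values are invented for illustration.

```python
from typing import Any


def word_spans(entry: dict[str, Any]) -> list[tuple[str, float, float]]:
    """Return (word_text, start_s, end_s) for every item in entry["words"].

    Assumption: char_s/char_e index into entry["text"], with char_e exclusive.
    """
    text = entry["text"]
    return [
        (text[w["char_s"]:w["char_e"]], w["time_s"], w["time_e"])
        for w in entry["words"]
    ]


# Hypothetical toy entry: values are invented, not taken from the corpus.
toy_entry = {
    "text": "Hvala lijepa.",
    "words": [
        {"time_s": 0.10, "time_e": 0.45, "char_s": 0, "char_e": 5},
        {"time_s": 0.50, "time_e": 1.02, "char_s": 6, "char_e": 12},
    ],
}

for word, start, end in word_spans(toy_entry):
    print(f"{word!r}: {start:.2f}-{end:.2f} s")
```

If the character offsets turn out to be inclusive in the released files, the slice end would need to be char_e + 1.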

ParlaSpeech-Pause

11. filled_pauses [list of dicts] - Timestamped intervals of filled pauses in the audio.

• time_s [float] - Start time of the pause in seconds, relative.

• time_e [float] - End time of the pause in seconds, relative.

• words_idx [int] - Index of the word in words preceding the pause.
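
As an illustration of how the filled-pause fields relate to words, the sketch below computes the duration of each filled pause and looks up the end time of the word that precedes it. It assumes words_idx is a zero-based position in the words list and that both layers use the same time reference. The function name and the toy values are hypothetical.

```python
from typing import Any


def describe_filled_pauses(entry: dict[str, Any]) -> list[dict[str, Any]]:
    """Summarise each filled pause: duration and end time of the preceding word.

    Assumption: words_idx is a zero-based index into entry["words"].
    """
    summaries = []
    for pause in entry["filled_pauses"]:
        preceding = entry["words"][pause["words_idx"]]
        summaries.append(
            {
                "words_idx": pause["words_idx"],
                "duration_s": round(pause["time_e"] - pause["time_s"], 3),
                "preceding_word_end_s": preceding["time_e"],
            }
        )
    return summaries


# Hypothetical toy entry: values are invented, not taken from the corpus.
toy_entry = {
    "words": [
        {"time_s": 0.10, "time_e": 0.45, "char_s": 0, "char_e": 5},
        {"time_s": 0.90, "time_e": 1.40, "char_s": 6, "char_e": 12},
    ],
    "filled_pauses": [{"time_s": 0.45, "time_e": 0.80, "words_idx": 0}],
}

print(describe_filled_pauses(toy_entry))
```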

ParlaSpeech-Align

12. words_align [list of dicts] - Kaldi MFA word-level alignment data. Each dictionary corresponds to a word.

• text [string] - Word text.

• char_s [int] - Start character index, relative.

• char_e [int] - End character index, relative.

• time_s [float] - Start time of the word in seconds, relative.

• time_e [float] - End time of the word in seconds, relative.

• words_idx [int] - Index of the word in words. Null if the word cannot be matched to any entry in words.

13. chars_align [list of lists of dicts] - Kaldi MFA character-level alignment data. Every dictionary corresponds to a character, while the list of dictionaries corresponds to a word.

• text [string] - Character(s).

• time_s [float] - Start time of the character(s), relative.

• time_e [float] - End time of the character(s), relative.

• char_s [int] - Start index in the full text, relative.

• char_e [int] - End index in the full text, relative.
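
The two alignment fields can be read together. The sketch below is one possible way to do so, under two assumptions that the schema does not state explicitly: that the i-th inner list of chars_align belongs to the i-th item of words_align, and that a Null words_idx arrives as None after loading the data into Python. Function names and toy values are hypothetical.

```python
from typing import Any


def char_timings_per_word(entry: dict[str, Any]) -> list[tuple[str, list[tuple[str, float, float]]]]:
    """Pair each aligned word with its character-level (text, start_s, end_s) timings.

    Assumption: chars_align[i] describes the characters of words_align[i].
    """
    paired = []
    for word, chars in zip(entry["words_align"], entry["chars_align"]):
        timings = [(c["text"], c["time_s"], c["time_e"]) for c in chars]
        paired.append((word["text"], timings))
    return paired


def unmatched_words(entry: dict[str, Any]) -> list[str]:
    """List aligned words that have no counterpart in words (words_idx is None)."""
    return [w["text"] for w in entry["words_align"] if w["words_idx"] is None]


# Hypothetical toy entry: values are invented, not taken from the corpus.
toy_entry = {
    "words_align": [
        {"text": "da", "char_s": 0, "char_e": 2, "time_s": 0.10, "time_e": 0.30, "words_idx": 0},
        {"text": "ne", "char_s": 3, "char_e": 5, "time_s": 0.40, "time_e": 0.60, "words_idx": None},
    ],
    "chars_align": [
        [
            {"text": "d", "time_s": 0.10, "time_e": 0.18, "char_s": 0, "char_e": 1},
            {"text": "a", "time_s": 0.18, "time_e": 0.30, "char_s": 1, "char_e": 2},
        ],
        [
            {"text": "n", "time_s": 0.40, "time_e": 0.48, "char_s": 3, "char_e": 4},
            {"text": "e", "time_s": 0.48, "time_e": 0.60, "char_s": 4, "char_e": 5},
        ],
    ],
}

print(char_timings_per_word(toy_entry))
print(unmatched_words(toy_entry))
```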

ParlaSpeech-Stress

14. primary_stress [list of dicts] - Primary stress annotations for multisyllabic words.

• words_align_idx [int] - Index of word in the words_align.

• stress [int] - Index of syllable containing the primary stress.

• nuclei [list of int] - Character indices of syllable nuclei.

• raw [list of float] - Inferred and unprocessed timestamped intervals of primary stress, relative to audio file.
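
To show how a stress annotation points back to its word, the sketch below resolves words_align_idx and reports the stressed syllable index alongside the number of syllable nuclei. It assumes words_align_idx is a zero-based position in words_align. It makes no assumption about whether stress itself is zero- or one-based and simply passes the value through. The function name and toy values are hypothetical.

```python
from typing import Any


def stress_summary(entry: dict[str, Any]) -> list[tuple[str, int, int]]:
    """Return (word_text, stress_index, syllable_count) per stress annotation.

    Assumption: words_align_idx is a zero-based index into entry["words_align"].
    The syllable count is taken as the number of listed nuclei.
    """
    rows = []
    for ann in entry["primary_stress"]:
        word = entry["words_align"][ann["words_align_idx"]]
        rows.append((word["text"], ann["stress"], len(ann["nuclei"])))
    return rows


# Hypothetical toy entry: values are invented, not taken from the corpus.
toy_entry = {
    "words_align": [
        {"text": "Hvala", "char_s": 0, "char_e": 5, "time_s": 0.10, "time_e": 0.45, "words_idx": 0},
    ],
    "primary_stress": [
        {"words_align_idx": 0, "stress": 0, "nuclei": [2, 4], "raw": [0.15, 0.28]},
    ],
}

print(stress_summary(toy_entry))
```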

ParlaSpeech-Ling

15. linguistic_annotation [list of dicts] - Text-based word-level linguistic annotation.

• words_idx [int] - Index of the corresponding word in words. Null for parts of the text that have no counterpart in words (such as punctuation).

• id [int] - Token ID.

• text [string] - Word form.

• lemma [string] - Lemma of the word.

• upos [string] - Universal part of speech.

• xpos [string] - Language-specific part of speech.

• feats [string] - Morphological features.

• head [int] - Head of the current word (syntactic dependency relation).

• deprel [string] - Type of syntactic dependency relation.

• misc [string] - Additional information.
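
Because each token carries a words_idx, the linguistic layer can be joined with the word timings. The sketch below collects the lemma and time span of every token with a given universal part of speech, skipping tokens whose words_idx is Null. It assumes Null is represented as None after loading and that words_idx is zero-based. The function name and toy values are hypothetical.

```python
from typing import Any


def timed_tokens(entry: dict[str, Any], upos: str) -> list[tuple[str, float, float]]:
    """Return (lemma, start_s, end_s) for tokens tagged `upos` that map to a word.

    Tokens with words_idx None (for example punctuation) have no timing and are skipped.
    """
    hits = []
    for token in entry["linguistic_annotation"]:
        idx = token["words_idx"]
        if token["upos"] != upos or idx is None:
            continue
        word = entry["words"][idx]
        hits.append((token["lemma"], word["time_s"], word["time_e"]))
    return hits


# Hypothetical toy entry: values are invented, not taken from the corpus.
toy_entry = {
    "words": [
        {"time_s": 0.10, "time_e": 0.45, "char_s": 0, "char_e": 5},
        {"time_s": 0.50, "time_e": 1.02, "char_s": 6, "char_e": 12},
    ],
    "linguistic_annotation": [
        {"words_idx": 0, "id": 1, "text": "Hvala", "lemma": "hvala", "upos": "NOUN"},
        {"words_idx": 1, "id": 2, "text": "lijepa", "lemma": "lijep", "upos": "ADJ"},
        {"words_idx": None, "id": 3, "text": ".", "lemma": ".", "upos": "PUNCT"},
    ],
}

print(timed_tokens(toy_entry, "NOUN"))
```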

ParlaSpeech-Senti

16. sentiment [dict] - Text-based sentiment prediction based on the ParlaSent model (https://huggingface.co/classla/xlm-r-parlasent).

• ParlaSent_logit [float] - Sentence sentiment prediction from the ParlaSent model (continuous value, ranging from 0 (negative) to 6 (positive)).

• ParlaSent_3 [string] - Sentence sentiment on 3 levels, predicted by the ParlaSent model (Negative, Neutral, Positive).

• ParlaSent_6 [string] - Sentence sentiment on 6 levels, predicted by the ParlaSent model (Negative, Mixed Negative, Neutral Negative, Neutral Positive, Mixed Positive, Positive).
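
The sentiment layer combines naturally with the speaker metadata. As a sketch of one possible aggregation, the code below averages ParlaSent_logit per Speaker_party over a list of entries. It assumes every entry has both speaker_info and sentiment. The function name and toy values are hypothetical and do not reflect any real party or score.

```python
from collections import defaultdict
from statistics import mean
from typing import Any


def mean_logit_by_party(entries: list[dict[str, Any]]) -> dict[str, float]:
    """Average the continuous ParlaSent score per party abbreviation."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for entry in entries:
        party = entry["speaker_info"]["Speaker_party"]
        buckets[party].append(entry["sentiment"]["ParlaSent_logit"])
    return {party: mean(scores) for party, scores in buckets.items()}


# Hypothetical toy entries: parties and scores are invented for illustration.
toy_entries = [
    {"speaker_info": {"Speaker_party": "A"}, "sentiment": {"ParlaSent_logit": 2.0}},
    {"speaker_info": {"Speaker_party": "A"}, "sentiment": {"ParlaSent_logit": 4.0}},
    {"speaker_info": {"Speaker_party": "B"}, "sentiment": {"ParlaSent_logit": 1.5}},
]

print(mean_logit_by_party(toy_entries))
```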

Example of an Entry

Here is a short sample from the corpora, reformatted for readability:

History

2021 — The idea for the ParlaSpeech corpora was born while proposing the second iteration of the ParlaMint project, within which textual transcripts were made available for 20+ European parliaments. We proposed a task to deliver at least 50 hours of aligned text and speech for at least three parliaments and languages. ParlaSpeech v1 and ParlaSpeech v2 were both created within the ParlaMint project.
2022 — ParlaSpeech v1 was created in the early days of the second iteration of the ParlaMint project, when the spoken data of the Croatian parliament were aligned to the existing Croatian ParlaMint text transcripts. The results are the ParlaCLARIN workshop paper describing the construction process and the ParlaSpeech-HR v1 corpus, 1,816 hours in size.
2023 — ParlaSpeech v2 was the final result of the ParlaMint project, spanning the parliaments of Croatia, Czechia, Poland, and Serbia, with 5,000 hours of audio. The improved alignment process, applied to the Croatian, Polish, and Serbian parliaments, is described in the SPECOM conference paper, while the Czech corpus was compiled from the previously constructed Czech parliamentary speech corpus.
2024 — ParlaSpeech v3 was developed as an enriched version of the audio and transcripts available in ParlaSpeech v2. New features include UD-based linguistic annotation of transcripts, paralinguistic annotation of filled pauses and primary stress, more detailed word- and character-level alignment of text to audio, and transcript-based sentiment classification.
2025 — ParlaSpeech v4, in development, extends the list of parliaments to include the Bulgarian, Slovenian, and Ukrainian parliaments.

Authors

Nikola Ljubešić	nikola.ljubesic@ijs.si
Peter Rupnik	peter.rupnik@ijs.si
Ivan Porupski	ivan.porupski@ijs.si
Danijel Koržinek	danijel@pjwstk.edu.pl
Taja Kuzman Pungeršek	taja.kuzman@ijs.si
