ParlaSpeech is a multilingual collection of parliamentary speech corpora covering Croatian (HR), Serbian (RS), Czech (CZ), and Polish (PL) parliamentary proceedings, created by aligning official parliamentary audio recordings with their corresponding ParlaMint transcripts. It preserves rich metadata including speaker attributes (name, gender, political party) and session details (date, agenda, parliamentary term), enabling cross-disciplinary research in political discourse analysis, speech processing, and computational linguistics. The corpus supports both computational (JSONL) and phonetic (TextGrid) analysis formats.
ParlaSpeech-v3.0 extends the base ParlaSpeech corpora with five annotation layers: linguistic annotations (ParlaSpeech-Ling following Universal Dependencies), sentiment (ParlaSpeech-Senti), filled pause detection (ParlaSpeech-Pause), precise word-level alignments (ParlaSpeech-Align), and primary stress markers (ParlaSpeech-Stress). Currently, Croatian and Serbian versions contain all annotation layers, while Czech and Polish include only pause, sentiment and linguistic annotations. These structured enrichments facilitate advanced research in prosody, speech disfluencies, and multimodal parliamentary analysis.
ParlaSpeech
1. id [string] - Unique sentence ID, referencing the utterance ID from ParlaMint 4.0, expanded with character offsets.
2. words [list of dicts] - ASR word-level timing and character offset annotations.
• time_s [float] - Start time of the word in seconds, relative.
• time_e [float] - End time of the word in seconds, relative.
• char_s [int] - Start character index in the transcript, relative.
• char_e [int] - End character index in the transcript, relative.
3. audio [string] - Relative path to the audio file in ParlaSpeech.
4. audio_length [float] - Duration of the audio file in seconds.
5. text [string] - Text from official manual transcriptions corresponding to the audio.
6. text_start [int] - Start character index of the relevant text segment, absolute.
7. text_end [int] - End character index of the relevant text segment, absolute.
8. audio_start [float] - Start time of the relevant audio segment, absolute.
9. audio_end [float] - End time of the relevant audio segment, absolute.
10. speaker_info [dict] - Metadata about the speaker and session.
• Text_ID [string] - Text document ID.
• ID [string] - Speech segment ID.
• Title [string] - Title of the debate/session.
• Date [string] - Date of the session.
• Body [string] - Parliamentary body or chamber.
• Term [string] - Parliamentary term.
• Session [string] - Session number.
• Meeting [string] - Meeting number.
• Sitting [string] - Sitting number.
• Agenda [string] - Agenda topic.
• Subcorpus [string] - Subcorpus name.
• Lang [string] - Language of the transcript.
• Speaker_role [string] - Role of the speaker (e.g., Regular, Minister).
• Speaker_MP [string] - Whether speaker is a Member of Parliament.
• Speaker_minister [string] - Whether speaker is a minister.
• Speaker_party [string] - Party abbreviation.
• Speaker_party_name [string] - Full name of the party.
• Party_status [string] - Whether the party is in government, opposition, or coalition.
• Party_orientation [string] - Political orientation (e.g., left, right).
• Speaker_ID [string] - Unique speaker identifier, usually in format "LastnameFirstName".
• Speaker_name [string] - Full name of the speaker in format "Lastname, Firstname".
• Speaker_gender [string] - Gender of the speaker (M, F or "-").
• Speaker_birth [string] - Birth year of the speaker.
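The base fields above can be read with nothing more than the standard library. The following is a minimal sketch, not the project's own loader: the sample record is a toy with invented values (only the field names come from the list above), and it assumes that char_s and char_e index into text with an exclusive end offset, which the schema does not state explicitly.

```python
import json

# Toy record with invented values; only the field names follow the schema above.
sample_line = json.dumps({
    "id": "toy-utterance_0-12",
    "text": "Hvala lijepo",
    "words": [
        {"time_s": 0.10, "time_e": 0.45, "char_s": 0, "char_e": 5},
        {"time_s": 0.50, "time_e": 1.05, "char_s": 6, "char_e": 12},
    ],
})

def read_jsonl(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def word_spans(entry):
    """Return (word, start_s, end_s) tuples for one entry.

    Assumption: char_e is an exclusive end offset into entry["text"].
    """
    text = entry["text"]
    return [
        (text[w["char_s"]:w["char_e"]], w["time_s"], w["time_e"])
        for w in entry["words"]
    ]

entries = list(read_jsonl([sample_line]))
spans = word_spans(entries[0])
print(spans)
```

With a real corpus file, the list passed to read_jsonl would be replaced by an open file handle, since iterating over a file yields its lines.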
ParlaSpeech-Pause
11. filled_pauses [list of dicts] - Timestamped intervals of filled pauses in the audio.
• time_s [float] - Start time of the pause in seconds, relative.
• time_e [float] - End time of the pause in seconds, relative.
• words_idx [int] - Index of the word in words preceding the pause.
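As an illustration of how filled_pauses links back to words, the sketch below computes pause durations and the gap between each pause and the word that precedes it. The entry is a toy with invented timings, and the function name pause_stats is hypothetical.

```python
# Toy entry with invented values; only the field names follow the schema above.
entry = {
    "words": [
        {"time_s": 0.0, "time_e": 0.5, "char_s": 0, "char_e": 5},
        {"time_s": 1.25, "time_e": 1.75, "char_s": 6, "char_e": 12},
    ],
    "filled_pauses": [
        {"time_s": 0.5, "time_e": 1.0, "words_idx": 0},
        {"time_s": 2.0, "time_e": 2.25, "words_idx": 1},
    ],
}

def pause_stats(entry):
    """Summarise filled pauses: count, total duration, and gap to the preceding word."""
    pauses = entry.get("filled_pauses") or []
    durations = [p["time_e"] - p["time_s"] for p in pauses]
    # words_idx points at the word in entry["words"] that precedes the pause.
    gaps = [
        p["time_s"] - entry["words"][p["words_idx"]]["time_e"]
        for p in pauses
    ]
    return {"count": len(pauses), "total_s": sum(durations), "gaps_s": gaps}

stats = pause_stats(entry)
print(stats)
```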
ParlaSpeech-Align
12. words_align [list of dicts] - Kaldi MFA word-level alignment data. Each dictionary corresponds to a word.
• text [string] - Word text.
• char_s [int] - Start character index, relative.
• char_e [int] - End character index, relative.
• time_s [float] - Start time of the word in seconds, relative.
• time_e [float] - End time of the word in seconds, relative.
• words_idx [int] - Index of the word in words. Equal to null if the instance cannot be matched to any words instance.
13. chars_align [list of lists of dicts] - Kaldi MFA character-level alignment data. Every dictionary corresponds to a character, while the list of dictionaries corresponds to a word.
• text [string] - Character(s).
• time_s [float] - Start time of the character(s), relative.
• time_e [float] - End time of the character(s), relative.
• char_s [int] - Start index in the full text, relative.
• char_e [int] - End index in the full text, relative.
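One way to use the two alignment layers together is to compare the aligner timings in words_align with the ASR timings in words, and to rebuild each word from its characters in chars_align. The sketch below uses a toy entry with invented values, and the helper names are hypothetical. It skips aligned words whose words_idx is null.

```python
# Toy entry with invented values; only the field names follow the schema above.
entry = {
    "words": [{"time_s": 0.5, "time_e": 1.0, "char_s": 0, "char_e": 2}],
    "words_align": [
        {"text": "da", "char_s": 0, "char_e": 2,
         "time_s": 0.25, "time_e": 1.0, "words_idx": 0},
        {"text": "ne", "char_s": 3, "char_e": 5,
         "time_s": 1.0, "time_e": 1.5, "words_idx": None},
    ],
    "chars_align": [
        [{"text": "d", "time_s": 0.25, "time_e": 0.5, "char_s": 0, "char_e": 1},
         {"text": "a", "time_s": 0.5, "time_e": 1.0, "char_s": 1, "char_e": 2}],
        [{"text": "n", "time_s": 1.0, "time_e": 1.25, "char_s": 3, "char_e": 4},
         {"text": "e", "time_s": 1.25, "time_e": 1.5, "char_s": 4, "char_e": 5}],
    ],
}

def timing_shifts(entry):
    """Return (word, start_shift, end_shift) of aligner timings relative to ASR timings."""
    shifts = []
    for aligned in entry["words_align"]:
        idx = aligned["words_idx"]
        if idx is None:  # no matching instance in entry["words"]
            continue
        asr = entry["words"][idx]
        shifts.append((
            aligned["text"],
            aligned["time_s"] - asr["time_s"],
            aligned["time_e"] - asr["time_e"],
        ))
    return shifts

def words_from_chars(entry):
    """Rebuild word strings from the per-word character lists in chars_align."""
    return ["".join(c["text"] for c in word) for word in entry["chars_align"]]

shifts = timing_shifts(entry)
rebuilt = words_from_chars(entry)
print(shifts, rebuilt)
```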
ParlaSpeech-Stress
14. primary_stress [list of dicts] - Primary stress annotations for multisyllabic words.
• words_align_idx [int] - Index of the word in words_align.
• stress [int] - Index of syllable containing the primary stress.
• nuclei [list of int] - Character indices of syllable nuclei.
• raw [list of float] - Inferred and unprocessed timestamped intervals of primary stress, relative to audio file.
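The stress layer points into words_align, so a stressed word can be looked up by index. The following sketch uses a toy entry with invented values and a hypothetical helper name. It treats the length of nuclei as the syllable count, on the assumption that each syllable has exactly one nucleus, and it makes no assumption about whether stress is zero-based or what the nuclei character indices are relative to.

```python
# Toy entry with invented values; only the field names follow the schema above.
entry = {
    "words_align": [
        {"text": "da", "time_s": 0.25, "time_e": 1.0, "words_idx": 0},
        {"text": "govorimo", "time_s": 1.0, "time_e": 1.75, "words_idx": 1},
    ],
    "primary_stress": [
        {"words_align_idx": 1, "stress": 1, "nuclei": [1, 3, 5, 7]},
    ],
}

def stress_overview(entry):
    """Return (word, stress_index, syllable_count) for each annotated word.

    Assumption: one nucleus per syllable, so len(nuclei) is the syllable count.
    """
    rows = []
    for item in entry.get("primary_stress") or []:
        word = entry["words_align"][item["words_align_idx"]]
        rows.append((word["text"], item["stress"], len(item["nuclei"])))
    return rows

overview = stress_overview(entry)
print(overview)
```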
ParlaSpeech-Ling
15. linguistic_annotation [list of dicts] - Text-based word-level linguistic annotation.
• words_idx [int] - Index of the corresponding word from words. Equal to null for parts of the text that cannot be found in words (such as punctuation).
• id [int] - Token ID.
• text [string] - Word form.
• lemma [string] - Lemma of the word.
• upos [string] - Universal part of speech.
• xpos [string] - Language-specific part of speech.
• feats [string] - Morphological features.
• head [int] - Head of the current word (syntactic dependency relation).
• deprel [string] - Type of syntactic dependency relation.
• misc [string] - Additional information.
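Because the fields above mirror the Universal Dependencies columns, an entry can be serialised to CoNLL-U for use with existing UD tooling. The sketch below is one possible conversion, not an official exporter: the tokens are toy data with invented values, the function name is hypothetical, and the DEPS column (which has no counterpart in the schema) is written as "_".

```python
# Toy tokens with invented values; only the field names follow the schema above.
entry = {
    "linguistic_annotation": [
        {"words_idx": 0, "id": 1, "text": "Hvala", "lemma": "hvala",
         "upos": "NOUN", "xpos": "Ncfsn",
         "feats": "Case=Nom|Gender=Fem|Number=Sing",
         "head": 0, "deprel": "root", "misc": "_"},
        {"words_idx": None, "id": 2, "text": ".", "lemma": ".",
         "upos": "PUNCT", "xpos": "Z", "feats": None,
         "head": 1, "deprel": "punct", "misc": None},
    ],
}

def to_conllu(entry):
    """Serialise linguistic_annotation as CoNLL-U token lines (10 tab-separated columns)."""
    lines = []
    for tok in entry["linguistic_annotation"]:
        columns = [
            tok["id"], tok["text"], tok["lemma"], tok["upos"], tok["xpos"],
            tok["feats"], tok["head"], tok["deprel"],
            None,  # DEPS: not present in the schema, emitted as "_"
            tok["misc"],
        ]
        # Missing or empty values become the CoNLL-U placeholder "_".
        lines.append("\t".join("_" if c in (None, "") else str(c) for c in columns))
    return "\n".join(lines)

conllu = to_conllu(entry)
print(conllu)
```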
ParlaSpeech-Senti
16. sentiment [dict] - Text-based sentiment prediction based on the ParlaSent model (https://huggingface.co/classla/xlm-r-parlasent).
• ParlaSent_logit [float] - Sentence sentiment prediction from the ParlaSent model (continuous value, ranging from 0 (negative) to 6 (positive)).
• ParlaSent_3 [string] - Sentence sentiment on 3 levels coming from the ParlaSent sentiment text prediction model (Negative, Neutral, Positive).
• ParlaSent_6 [string] - Sentence sentiment on 6 levels coming from the ParlaSent sentiment text prediction model (Negative, Mixed Negative, Neutral Negative, Neutral Positive, Mixed Positive, Positive).
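A common first step with the sentiment layer is to tabulate the three-level labels and average the continuous score. The sketch below does this over a toy list of entries with invented values. The function name is hypothetical.

```python
from collections import Counter

# Toy entries with invented values; only the field names follow the schema above.
entries = [
    {"sentiment": {"ParlaSent_logit": 1.0, "ParlaSent_3": "Negative",
                   "ParlaSent_6": "Mixed Negative"}},
    {"sentiment": {"ParlaSent_logit": 2.5, "ParlaSent_3": "Neutral",
                   "ParlaSent_6": "Neutral Negative"}},
    {"sentiment": {"ParlaSent_logit": 4.0, "ParlaSent_3": "Positive",
                   "ParlaSent_6": "Mixed Positive"}},
    {"sentiment": {"ParlaSent_logit": 2.5, "ParlaSent_3": "Neutral",
                   "ParlaSent_6": "Neutral Positive"}},
]

def sentiment_summary(entries):
    """Count ParlaSent_3 labels and average ParlaSent_logit over entries."""
    labels = Counter(e["sentiment"]["ParlaSent_3"] for e in entries)
    scores = [e["sentiment"]["ParlaSent_logit"] for e in entries]
    mean_score = sum(scores) / len(scores) if scores else None
    return labels, mean_score

labels, mean_score = sentiment_summary(entries)
print(labels, mean_score)
```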
Example of an Entry
Here is a short sample from the corpora, reformatted for readability:
History
- 2021: The idea for the ParlaSpeech corpora was born while proposing the second iteration of the ParlaMint project, in which textual transcripts were made available for 20+ European parliaments. We proposed a task to deliver at least 50 hours of aligned text and speech for at least three parliaments/languages. ParlaSpeech v1 and ParlaSpeech v2 were both created within the ParlaMint project.
- 2022: ParlaSpeech v1 was created in the early days of the second iteration of the ParlaMint project, when the spoken data of the Croatian parliament were aligned to the existing Croatian ParlaMint text transcripts. The results are the ParlaCLARIN workshop paper describing the construction process and the ParlaSpeech-HR v1 corpus, 1,816 hours in size.
- 2023: ParlaSpeech v2 was the final result of the ParlaMint project, spanning the four parliaments of Croatia, Czechia, Poland, and Serbia, and 5,000 hours of audio. The improved alignment process, applied to the Croatian, Polish, and Serbian parliaments, is described in the SPECOM conference paper, while the Czech corpus was compiled from the previously constructed Czech parliamentary speech corpus.
- 2024: ParlaSpeech v3 was developed as an enriched version of the audio and transcripts available in ParlaSpeech v2. New features include linguistic UD-based annotation of transcripts, paralinguistic annotation of filled pauses and primary stress, more detailed word- and character-level alignment of text to the audio, and transcript-based sentiment classification.
- 2025: ParlaSpeech v4, in development, extends the list of parliaments to also include the Bulgarian, Slovenian, and Ukrainian parliaments.