Welcome to ParlaSpeech

ParlaSpeech is a multilingual collection of parliamentary speech corpora covering Croatian (HR), Serbian (RS), Czech (CZ), and Polish (PL) parliamentary proceedings, created by aligning official parliamentary audio recordings with their corresponding ParlaMint transcripts. It preserves rich metadata including speaker attributes (name, gender, political party) and session details (date, agenda, parliamentary term), enabling cross-disciplinary research in political discourse analysis, speech processing, and computational linguistics. The corpus is distributed in two formats: JSONL for computational processing and TextGrid for phonetic analysis.

ParlaSpeech-v3.0 extends the base ParlaSpeech corpora with five annotation layers: linguistic annotations (ParlaSpeech-Ling following Universal Dependencies), sentiment (ParlaSpeech-Senti), filled pause detection (ParlaSpeech-Pause), precise word-level alignments (ParlaSpeech-Align), and primary stress markers (ParlaSpeech-Stress). Currently, Croatian and Serbian versions contain all annotation layers, while Czech and Polish include only pause, sentiment and linguistic annotations. These structured enrichments facilitate advanced research in prosody, speech disfluencies, and multimodal parliamentary analysis.

Reference: Ljubešić, Nikola, Peter Rupnik, and Danijel Koržinek. "The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings." International Conference on Speech and Computer. Cham: Springer Nature Switzerland, 2024.

Annotation Layers

ParlaSpeech-Ling: Linguistic annotations in Universal Dependencies (UD) format (lemma, POS, syntax, etc.).
ParlaSpeech-Senti: Sentiment predicted from the transcript (ParlaSent).
ParlaSpeech-Pause: Filled pauses detected in the speech (e.g. "ermm", "umm").
ParlaSpeech-Align: Precise grapheme- and word-level alignments (HR/RS only).
ParlaSpeech-Stress: Primary stress markers in multisyllabic words (HR/RS only).

Note 1: HR and RS contain all of the annotation layers, while CZ and PL contain only ParlaSpeech-Ling (linguistic metadata), ParlaSpeech-Senti (sentiment) and ParlaSpeech-Pause (filled pauses) layers.

Note 2: There are two word-level alignments: words and words_align. words is the ASR-level alignment, available in all languages, while words_align is a finer Kaldi MFA-level alignment, available only in HR and RS.

Available formats:
JSONL for computational processing
TextGrid for phonetic analysis
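
As a minimal sketch of how the JSONL format might be consumed, the snippet below reads one record per line using only the Python standard library. The two records are made up for illustration and carry only a handful of fields; the helper name iter_jsonl is hypothetical and not part of any ParlaSpeech tooling.

```python
import io
import json

def iter_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Two invented records standing in for a real ParlaSpeech file.
sample = io.StringIO(
    '{"id": "demo_1", "text": "Hvala.", "audio_length": 0.75}\n'
    '{"id": "demo_2", "text": "Dobar dan.", "audio_length": 1.25}\n'
)

records = list(iter_jsonl(sample))
total_seconds = sum(r["audio_length"] for r in records)
print(len(records), total_seconds)
```

With a real file, the same generator could be fed an open file handle instead of the in-memory sample, so the corpus never has to be loaded into memory at once.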

Example TextGrid visualization

Corpus Overview

Language        Duration      Sentences   Layers               Links
Croatian (HR)   3,061 hours   922,679     All layers           CLARIN | HuggingFace
Serbian (RS)    896 hours     290,778     All layers           CLARIN | HuggingFace
Czech (CZ)      1,218 hours   717,682     Ling, Senti, Pause   CLARIN | HuggingFace
Polish (PL)     1,010 hours   535,465     Ling, Senti, Pause   CLARIN | HuggingFace

Authors

Nikola Ljubešić nikola.ljubesic@ijs.si
Peter Rupnik peter.rupnik@ijs.si
Ivan Porupski ivan.porupski@ijs.si
Danijel Koržinek danijel@pjwstk.edu.pl
Taja Kuzman Pungeršek taja.kuzman@ijs.si

Dataset Schema

Timing information: Absolute times are timestamps from the original full-length session audio, so instances don't start at 0.0s. Relative times are normalized to each extracted sentence, always beginning at 0.0s.
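
The relationship between the two time bases can be sketched as follows. The snippet assumes that a relative word time is an offset from the instance's audio_start, which follows from the description above but is not stated as a formula in the documentation; the record values and the helper name to_absolute are invented for illustration.

```python
def to_absolute(record):
    """Shift each word's relative (start, end) into session-audio time.

    ASSUMPTION: absolute time = audio_start + relative time.
    """
    offset = record["audio_start"]
    return [(offset + w["time_s"], offset + w["time_e"]) for w in record["words"]]

# Invented record: a sentence cut from 120.5 s to 122.0 s of the session audio.
record = {
    "audio_start": 120.5,
    "audio_end": 122.0,
    "words": [
        {"time_s": 0.0, "time_e": 0.5},
        {"time_s": 0.5, "time_e": 1.5},
    ],
}

absolute = to_absolute(record)
print(absolute)
```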

ParlaSpeech

1. id [string] - Unique sentence ID, referencing the utterance ID from ParlaMint 4.0, expanded with character offsets.
2. words [list of dicts] - ASR word-level timing and character offset annotations.
    time_s [float] - Start time of the word in seconds, relative.
    time_e [float] - End time of the word in seconds, relative.
    char_s [int] - Start character index in the transcript, relative.
    char_e [int] - End character index in the transcript, relative.
3. audio [string] - Relative path to the audio file in ParlaSpeech.
4. audio_length [float] - Duration of the audio file in seconds.
5. text [string] - Text from official manual transcriptions corresponding to the audio.
6. text_start [int] - Start character index of the relevant text segment, absolute.
7. text_end [int] - End character index of the relevant text segment, absolute.
8. audio_start [float] - Start time of the relevant audio segment, absolute.
9. audio_end [float] - End time of the relevant audio segment, absolute.
10. speaker_info [dict] - Metadata about the speaker and session.
    Text_ID [string] - Text document ID.
    ID [string] - Speech segment ID.
    Title [string] - Title of the debate/session.
    Date [string] - Date of the session.
    Body [string] - Parliamentary body or chamber.
    Term [string] - Parliamentary term.
    Session [string] - Session number.
    Meeting [string] - Meeting number.
    Sitting [string] - Sitting number.
    Agenda [string] - Agenda topic.
    Subcorpus [string] - Subcorpus name.
    Lang [string] - Language of the transcript.
    Speaker_role [string] - Role of the speaker (e.g., Regular, Minister).
    Speaker_MP [string] - Whether the speaker is a Member of Parliament.
    Speaker_minister [string] - Whether the speaker is a minister.
    Speaker_party [string] - Party abbreviation.
    Speaker_party_name [string] - Full name of the party.
    Party_status [string] - Party status (government, opposition, or coalition).
    Party_orientation [string] - Political orientation (e.g., left, right).
    Speaker_ID [string] - Unique speaker identifier, usually in the format "LastnameFirstName".
    Speaker_name [string] - Full name of the speaker in the format "Lastname, Firstname".
    Speaker_gender [string] - Gender of the speaker (M, F or "-").
    Speaker_birth [string] - Birth year of the speaker.
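
To illustrate how the words entries relate to text, the sketch below recovers each word's surface form and duration. It assumes that char_s and char_e index into the instance's text field and that char_e is an exclusive end, as in a Python slice; neither detail is confirmed by the schema, so check it against real data. The record and the helper name word_spans are invented for illustration.

```python
def word_spans(record):
    """Pair each word's surface form with its duration in seconds.

    ASSUMPTION: char_s/char_e index into record["text"], char_e exclusive.
    """
    text = record["text"]
    return [
        (text[w["char_s"]:w["char_e"]], round(w["time_e"] - w["time_s"], 3))
        for w in record["words"]
    ]

# Invented record with only the fields used here.
record = {
    "text": "Hvala lijepa",
    "words": [
        {"time_s": 0.0, "time_e": 0.5, "char_s": 0, "char_e": 5},
        {"time_s": 0.5, "time_e": 1.25, "char_s": 6, "char_e": 12},
    ],
}

spans = word_spans(record)
print(spans)
```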

ParlaSpeech-Pause

11. filled_pauses [list of dicts] - Timestamped intervals of filled pauses in the audio.
    time_s [float] - Start time of the pause in seconds, relative.
    time_e [float] - End time of the pause in seconds, relative.
    words_idx [int] - Index of the word in words preceding the pause.
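
As a rough sketch of how words_idx ties a filled pause to its context, the snippet below lists each pause with the word that precedes it and the pause duration. The record is invented; the slicing assumes char_e is an exclusive end index into text, and the None check is a defensive assumption for pauses with no preceding word, which the schema does not describe. The helper name pauses_with_context is hypothetical.

```python
def pauses_with_context(record):
    """Return (preceding word or None, pause duration) for each filled pause."""
    out = []
    for pause in record["filled_pauses"]:
        idx = pause["words_idx"]
        prev = None
        if idx is not None:  # ASSUMPTION: a pause may lack a preceding word
            w = record["words"][idx]
            # ASSUMPTION: char_e is exclusive, offsets index into text
            prev = record["text"][w["char_s"]:w["char_e"]]
        out.append((prev, round(pause["time_e"] - pause["time_s"], 3)))
    return out

# Invented record: one pause right after the first word.
record = {
    "text": "Hvala lijepa",
    "words": [
        {"time_s": 0.0, "time_e": 0.5, "char_s": 0, "char_e": 5},
        {"time_s": 0.75, "time_e": 1.5, "char_s": 6, "char_e": 12},
    ],
    "filled_pauses": [{"time_s": 0.5, "time_e": 0.75, "words_idx": 0}],
}

context = pauses_with_context(record)
print(context)
```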

ParlaSpeech-Align

12. words_align [list of dicts] - Kaldi MFA word-level alignment data. Each dictionary corresponds to a word.
    text [string] - Word text.
    char_s [int] - Start character index, relative.
    char_e [int] - End character index, relative.
    time_s [float] - Start time of the word in seconds, relative.
    time_e [float] - End time of the word in seconds, relative.
    words_idx [int] - Index of the corresponding word in words; null if the word cannot be matched to any entry in words.
13. chars_align [list of lists of dicts] - Kaldi MFA character-level alignment data. Every dictionary corresponds to a character, while the list of dictionaries corresponds to a word.
    text [string] - Character(s).
    time_s [float] - Start time of the character(s), relative.
    time_e [float] - End time of the character(s), relative.
    char_s [int] - Start index in the full text, relative.
    char_e [int] - End index in the full text, relative.
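
One way to use the words_idx link is to compare the ASR-level and MFA-level start times of matched words, skipping entries whose words_idx is null (None after JSON parsing). The sketch below does this on an invented record; the helper name boundary_shifts is hypothetical.

```python
def boundary_shifts(record):
    """Return (word, MFA start minus ASR start) for every matched word."""
    shifts = []
    for aligned in record["words_align"]:
        idx = aligned["words_idx"]
        if idx is None:  # no counterpart in words
            continue
        asr = record["words"][idx]
        shifts.append((aligned["text"], round(aligned["time_s"] - asr["time_s"], 3)))
    return shifts

# Invented record: the third aligned entry has no match in words.
record = {
    "words": [{"time_s": 0.0, "time_e": 0.5}, {"time_s": 0.5, "time_e": 1.25}],
    "words_align": [
        {"text": "Hvala", "time_s": 0.125, "time_e": 0.5, "words_idx": 0},
        {"text": "lijepa", "time_s": 0.5, "time_e": 1.0, "words_idx": 1},
        {"text": "svima", "time_s": 1.0, "time_e": 1.25, "words_idx": None},
    ],
}

shifts = boundary_shifts(record)
print(shifts)
```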

ParlaSpeech-Stress

14. primary_stress [list of dicts] - Primary stress annotations for multisyllabic words.
    words_align_idx [int] - Index of the word in words_align.
    stress [int] - Index of the syllable containing the primary stress.
    nuclei [list of int] - Character indices of syllable nuclei.
    raw [list of float] - Inferred and unprocessed timestamped intervals of primary stress, relative to audio file.
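
The sketch below shows one possible reading of these fields: it uppercases the character at the stressed nucleus of each annotated word. It rests on two unverified assumptions, namely that stress is a 0-based index into nuclei and that the nuclei values are character positions within the word's own text (not within the sentence). The record and the helper name mark_stress are invented; verify both assumptions on real data before relying on this.

```python
def mark_stress(record):
    """Return annotated words with the assumed stressed nucleus uppercased."""
    marked = []
    for ann in record["primary_stress"]:
        word = record["words_align"][ann["words_align_idx"]]["text"]
        # ASSUMPTION: stress is 0-based; nuclei are word-relative positions
        pos = ann["nuclei"][ann["stress"]]
        marked.append(word[:pos] + word[pos].upper() + word[pos + 1:])
    return marked

# Invented record: "hvala" with nuclei at characters 2 and 4, stress on the first.
record = {
    "words_align": [{"text": "hvala"}, {"text": "svima"}],
    "primary_stress": [{"words_align_idx": 0, "stress": 0, "nuclei": [2, 4]}],
}

marked = mark_stress(record)
print(marked)
```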

ParlaSpeech-Ling

15. linguistic_annotation [list of dicts] - Text-based word-level linguistic annotation.
    words_idx [int] - Index of the corresponding word in words; null for parts of the text that cannot be found in words (such as punctuation).
    id [int] - Token ID.
    text [string] - Word form.
    lemma [string] - Lemma of the word.
    upos [string] - Universal part of speech.
    xpos [string] - Language-specific part of speech.
    feats [string] - Morphological features.
    head [int] - Token ID of the syntactic head of the current word.
    deprel [string] - Type of syntactic dependency relation.
    misc [string] - Additional information.
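
Because these fields mirror the columns of the CoNLL-U format, a sentence's annotation can be serialized for UD tooling. The sketch below writes the ten CoNLL-U columns and fills DEPS, which has no counterpart in this schema, with an underscore. The tokens are invented for illustration, and the helper name to_conllu is hypothetical.

```python
# CoNLL-U column order; None marks DEPS, which this schema does not provide.
COLUMNS = ["id", "text", "lemma", "upos", "xpos", "feats", "head", "deprel", None, "misc"]

def to_conllu(tokens):
    """Serialize one sentence's tokens as a CoNLL-U block."""
    lines = []
    for tok in tokens:
        cells = []
        for key in COLUMNS:
            value = tok.get(key) if key else None
            cells.append("_" if value is None or value == "" else str(value))
        lines.append("\t".join(cells))
    return "\n".join(lines) + "\n\n"  # a blank line terminates the sentence

# Invented tokens; the punctuation token has no entry in words.
tokens = [
    {"words_idx": 0, "id": 1, "text": "Hvala", "lemma": "hvala", "upos": "NOUN",
     "xpos": "Ncfsn", "feats": "Case=Nom|Gender=Fem|Number=Sing",
     "head": 0, "deprel": "root", "misc": None},
    {"words_idx": None, "id": 2, "text": ".", "lemma": ".", "upos": "PUNCT",
     "xpos": "Z", "feats": None, "head": 1, "deprel": "punct", "misc": None},
]

conllu = to_conllu(tokens)
print(conllu, end="")
```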

ParlaSpeech-Senti

16. sentiment [dict] - Text-based sentiment prediction based on the ParlaSent model (https://huggingface.co/classla/xlm-r-parlasent).
    ParlaSent_logit [float] - Sentence sentiment prediction from the ParlaSent model (continuous value, ranging from 0 (negative) to 6 (positive)).
    ParlaSent_3 [string] - Three-level sentence sentiment predicted by the ParlaSent model (Negative, Neutral, Positive).
    ParlaSent_6 [string] - Six-level sentence sentiment predicted by the ParlaSent model (Negative, Mixed Negative, Neutral Negative, Neutral Positive, Mixed Positive, Positive).
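
Combined with speaker_info, the sentiment layer lends itself to simple aggregate comparisons. The sketch below averages ParlaSent_logit per value of a chosen metadata field. The records, including the Party_status strings, are invented for illustration, and the helper name mean_sentiment_by is hypothetical.

```python
from collections import defaultdict
from statistics import mean

def mean_sentiment_by(records, key):
    """Average ParlaSent_logit grouped by a speaker_info field."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["speaker_info"][key]].append(rec["sentiment"]["ParlaSent_logit"])
    return {group: mean(values) for group, values in groups.items()}

# Invented records with only the fields used here.
records = [
    {"speaker_info": {"Party_status": "Opposition"}, "sentiment": {"ParlaSent_logit": 1.0}},
    {"speaker_info": {"Party_status": "Opposition"}, "sentiment": {"ParlaSent_logit": 2.0}},
    {"speaker_info": {"Party_status": "Coalition"}, "sentiment": {"ParlaSent_logit": 4.0}},
]

averages = mean_sentiment_by(records, "Party_status")
print(averages)
```

The same helper could group by Speaker_gender or Speaker_party instead, since the key is passed in rather than hard-coded.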

Example of an Entry

Here is a short sample from the corpora, reformatted for readability:

History