How to Use the ParlaSpeech Corpus Through a Concordancer

The ParlaSpeech Concordancer enables detailed linguistic and prosodic searches in the ParlaSpeech v3 corpus.
This guide provides a walkthrough of its basic and advanced functionalities, designed for linguists and phoneticians, as well as interested non-specialists. The examples given are from the Croatian ParlaSpeech v3.

1. Simple Search (Word or Lemma)

Suppose we are interested in the Croatian verb "uspostaviti" ("to establish").
We can begin with a lemma search, which retrieves all inflected forms of that lemma thanks to the linguistic annotation of the corpus.

Query: uspostaviti

This query will return occurrences of various forms, such as:

uspostavi
uspostavljamo
uspostavljena, etc.

A very important feature is the ability to play each audio segment with the red play button on the right. This button plays the immediate context of the query result. To play back the whole sentence, click the metadata button to the left of each concordance line. This gives you a link to the recording of the whole sentence, along with additional metadata such as the speaker's name, gender, age, and party. These metadata can also be used in searches, as shown below.

Screenshot: Simple lemma search

2. Advanced Search: Word Form

Let's say we are interested in a specific word form, such as "uspostavi", because it is a primary stress doublet. We can still use the simple search and type the required word form.

Query: uspostavi

Why refine the search?

The form uspostavi is ambiguous. It may function as:

Verb (e.g., "He/She establishes")
Noun (e.g. "The establishment of")

Since in this example we are interested in the verb as a primary stress doublet, we need an advanced query to filter out the noun uses of the word.

3. Advanced Search: CQL with Part of Speech (POS)

To disambiguate between different parts of speech, we use the Corpus Query Language (CQL), available in the Advanced tab.
To extract only the verb uses of uspostavi, the query is:

Query: [word="uspostavi" & pos="VERB"]

Annotation Standards:

word – surface form
pos – Universal Dependencies (UD) part-of-speech tag
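
When many such queries have to be written, it can help to generate them programmatically. The following is a minimal sketch of a helper that assembles a single-token CQL expression from attribute-value pairs. The function name `cql_token` is hypothetical and not part of the concordancer; the sketch only builds the query string, which still has to be pasted into the Advanced tab.

```python
def cql_token(**attrs: str) -> str:
    """Build a single-token CQL expression such as [word="x" & pos="VERB"].

    Hypothetical helper for illustration; attribute names are passed through
    unchanged, so they must match the attributes offered by the corpus.
    """
    parts = []
    for name, value in attrs.items():
        # Escape backslashes and double quotes so the value stays a valid
        # quoted string. Regex metacharacters are NOT escaped here.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    return "[" + " & ".join(parts) + "]"


print(cql_token(word="uspostavi", pos="VERB"))
# [word="uspostavi" & pos="VERB"]
```

Keyword arguments keep their order in Python 3.7 and later, so the attributes appear in the query in the order they were passed.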

Screenshot: CQL with POS

4. Advanced Search: Primary Stress

The ParlaSpeech corpus includes prosodic annotations, such as primary stress placement.

This particular annotation layer is currently available for Croatian and Serbian.

For example, the verb uspostavi is a primary stress doublet, with stress possible on either the 2nd or the 3rd syllable.

Query: [word="uspostavi" & pos="VERB" & primary_stress="2"]

The query for the verb "uspostavi" with stress on the 2nd syllable returns 64 results. Let us investigate the same word form with stress on the 3rd syllable:

Query: [word="uspostavi" & pos="VERB" & primary_stress="3"]

This stress variant returns 38 results. Use the playback option to listen to individual instances. Most primary stress predictions in the corpus are correct, but some are wrong because of recording issues, alignment issues, or simply an incorrect stress position prediction.
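
To compare the two variants, the hit counts can be turned into relative shares. The sketch below uses only the two counts reported above (64 and 38); the variable names are illustrative, and the shares describe the automatic predictions, not manually verified stress positions.

```python
# Hit counts for the verb "uspostavi", keyed by predicted stressed syllable.
stress_counts = {2: 64, 3: 38}

total = sum(stress_counts.values())
shares = {syllable: count / total for syllable, count in stress_counts.items()}

for syllable, share in shares.items():
    print(f"stress on syllable {syllable}: "
          f"{stress_counts[syllable]} of {total} hits ({share:.1%})")
```

On these counts, roughly 63% of the verb hits are predicted with stress on the 2nd syllable and roughly 37% with stress on the 3rd.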

About primary_stress:

Indicates the syllable index where stress is realized (1-based).
Acceptable values: 1 to N (number of syllables in the word).

Screenshot: CQL with primary stress

5. Advanced Search: Filled Pauses (FP)

The ParlaSpeech corpus also encodes disfluency markers, currently only filled pauses such as "eee" or "umm". This annotation layer is available for all four current languages (Croatian, Serbian, Czech, Polish).

To find occurrences where the word račun is followed by a filled pause:

Query: [word="račun" & fp_after="1"]

To find occurrences where a filled pause precedes the word:

Query: [word="račun" & fp_before="1"]

Interpretation:

fp_after and fp_before can take values:
- 1 – Filled pause is present
- 0 – No filled pause

This feature is valuable for research on speech disfluencies, planning phenomena, and prosody-syntax interfaces.

Screenshot: CQL with filled pause

6. Advanced Search: Metadata

The ParlaSpeech corpora also encode various metadata, such as speaker name, age, gender, and party affiliation, as well as the sentiment of the utterance as predicted by the ParlaSent model.

In Change Criteria, these attributes can be found under Text Types. Selecting an attribute value from the dropdown menu filters the search results.

Suppose we want to check whether male or female speakers produce more filled pauses. Using the search term [fp_after="1"], we filter for all utterances with a filled pause event. Next, in Text Types, we choose speech.speaker_gender and select M.

This query returns 189,317 utterances. Repeating the search with the filter set to F returns 123,344 utterances. Repeating both steps with [fp_after="0"] gives all instances without a filled pause for male and female speakers, resulting in the following table:

Gender	FPs	Non-FPs	Total
Female	123,344	6,301,284	6,424,628
Male	189,317	17,868,505	18,057,822
Total	312,661	24,169,789	24,482,450

Test used: Pearson’s Chi-Square Test

χ² = 28,544.00, df = 1
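p < 0.001 (reported as 0.00) → The difference between male and female speakers is statistically significant, i.e. very unlikely to have occurred by chance.
Odds ratio (effect size, F/M) = 1.85 → The odds of a filled pause are almost twice as high for female speakers as for male speakers.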
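
The figures above can be recomputed from the four cell counts in the table. The following is a minimal sketch in plain Python, with no statistics library; the function name `chi_square_2x2` is illustrative. It assumes that the reported χ² includes Yates' continuity correction, which is what the numbers suggest: the corrected statistic comes out at about 28,544, while the uncorrected one is slightly higher, roughly 28,545.

```python
import math

# Cell counts from the table above.
female_fp, female_no_fp = 123_344, 6_301_284
male_fp, male_no_fp = 189_317, 17_868_505


def chi_square_2x2(a, b, c, d, yates=False):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]] (df = 1)."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction shrinks |ad - bc| by n/2.
        diff = max(diff - n / 2, 0)
    return n * diff**2 / ((a + b) * (c + d) * (a + c) * (b + d))


cells = (female_fp, female_no_fp, male_fp, male_no_fp)
chi2_plain = chi_square_2x2(*cells)
chi2_yates = chi_square_2x2(*cells, yates=True)

# For df = 1 the chi-square survival function equals erfc(sqrt(x / 2)).
# With a statistic this large the result underflows to 0.0 in floating point.
p_value = math.erfc(math.sqrt(chi2_yates / 2))

# Odds of a filled pause for female speakers relative to male speakers.
odds_ratio = (female_fp / female_no_fp) / (male_fp / male_no_fp)

print(f"chi2 (uncorrected) = {chi2_plain:.2f}")
print(f"chi2 (Yates)       = {chi2_yates:.2f}")
print(f"p                  = {p_value}")
print(f"odds ratio (F/M)   = {odds_ratio:.2f}")
```

A p-value printed as 0.0 here only means that it is smaller than the smallest representable float, which is why it is better reported as p < 0.001.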

Summary of Query Types

Search Type	Example
Simple search (lemma)	`uspostaviti`
Simple search (word form)	`uspostavi`
CQL: POS disambiguation	`[word="uspostavi" & pos="VERB"]`
CQL: Stress on 2nd syllable	`[word="uspostavi" & pos="VERB" & primary_stress="2"]`
CQL: Filled pause after "račun"	`[word="račun" & fp_after="1"]`

Available Attributes

Attribute	Description
word	Surface form (orthographic)
lemma	Lemma (canonical form)
pos	Part of speech (UD standard)
primary_stress	Primary stress position (1-based syllable index)
fp_before	Filled pause before word (0 = no, 1 = yes)
fp_after	Filled pause after word (0 = no, 1 = yes)

Additional Notes

All searches are case-sensitive.
At the top right, View Options can be used to display the annotations of each individual token, which makes searching easier.
The system uses Universal Dependencies (UD) annotation conventions for morphology and syntax.
Prosodic and disfluency annotations extend standard corpus search functionality into speech-oriented analysis.
