The ParlaMint-UA corpus at a glance

The ParlaMint-UA corpus is the first full-text specialized corpus of Ukrainian parliamentary proceedings, which was compiled, annotated and made comparable with the other national and regional European corpora within the ParlaMint project under the auspices of CLARIN ERIC (1). It contains records of plenary proceedings from the Verkhovna Rada – the unicameral parliament of Ukraine.

Download its versions for free via the CLARIN.SI repository:

ParlaMint-UA 4.0 (annotated)

ParlaMint-UA 4.0 (plain texts)

ParlaMint-UA 4.0.1 (plain texts and annotated)

Or explore them through the NoSketch Engine concordancer:

ParlaMint-UA (NoSketch Engine)

The corpus was compiled in three versions. Version 3.0 covers the period from 04 December 2012 to 24 February 2023, comprising 783 sitting dates, 1,475 speakers and 52 parliamentary parties, factions, and groups. It includes over 22.5M tokens, 18M words, 1.5M sentences and over 195k utterances (stretches of speech from single speakers).

Version 4.0 contains plenary records from 04 December 2012 to 06 September 2023. The totals include over 23M tokens, 18.5M words, 1.6M sentences and over 200k utterances produced by 1,501 speakers over 808 sitting dates.

The Ukrainian parliamentary corpus ParlaMint-UA 4.0.1 is an extended version of ParlaMint-UA 4.0 and contains plenary proceedings for Terms 4–9, from 14 May 2002 to 10 November 2023. The totals include over 51M tokens, 41M words, 3.4M sentences and 429k utterances produced by 2,532 speakers in 1,723 sittings.

Note that Term 9 was still ongoing as of 2023, when all three versions were released.
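As a rough illustration of how the reported totals can be compared across versions, the Python sketch below derives the average number of tokens per utterance from the rounded figures quoted above. Since several of those figures are given as "over ...", the results are approximations only; the names `VERSIONS` and `tokens_per_utterance` are hypothetical and not part of the corpus tooling.

```python
# Rounded totals as quoted in the text; "over ..." values are lower bounds,
# so every derived number here is only an approximation.
VERSIONS = {
    "3.0": {"tokens": 22_500_000, "utterances": 195_000},
    "4.0": {"tokens": 23_000_000, "utterances": 200_000},
    "4.0.1": {"tokens": 51_000_000, "utterances": 429_000},
}


def tokens_per_utterance(stats):
    """Average utterance length in tokens for one corpus version."""
    return stats["tokens"] / stats["utterances"]


# One rounded average per version (hypothetical helper structure).
averages = {
    version: round(tokens_per_utterance(stats), 1)
    for version, stats in VERSIONS.items()
}

for version, average in averages.items():
    print(f"ParlaMint-UA {version}: about {average} tokens per utterance")
```

On these rounded inputs the average utterance length stays in the same range (roughly 115 to 119 tokens) across all three versions, even though version 4.0.1 more than doubles the corpus size.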

Archived transcripts of all plenary sittings as well as lists of parliamentary speeches containing timestamps, and personal metadata on MPs, including their full names, dates of birth, gender, and affiliations within the Verkhovna Rada, were automatically downloaded in the HTML, XML and CSV formats from the Verkhovna Rada open data portal under the CC BY 4.0 license. The metadata on government members, guest speakers, organizations and events like the periods of governments in office, as well as additional metadata on MPs like person renaming, were collected manually from various open sources.

Matters of language

Although the official working language of the Verkhovna Rada is Ukrainian, some speeches during the parliamentary proceedings were held in other languages. All the speeches delivered by foreign guests in languages other than Ukrainian were recorded in their translation into Ukrainian in the source texts. However, utterances produced by Ukrainian MPs and government officials in Russian were recorded in Russian. With language identification done at the paragraph level in the ParlaMint-UA corpus 3.0 and 4.0, 99% of utterances were recorded in Ukrainian and 1% were recorded in Russian. The Ukrainian parliamentary corpus ParlaMint-UA 4.0.1 refines language identification between Ukrainian and Russian from the paragraph level to the sentence level to advance research on code switching in public discourse (2). In this version, tokens in Ukrainian comprise 94% and tokens in Russian comprise 6%.
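To make the difference between paragraph-level and sentence-level identification concrete, here is a minimal Python sketch. It assumes the language label of each sentence is already known (no real language identifier is run), and the toy paragraph, `token_shares` and `paragraph_label` are hypothetical illustrations rather than the procedure actually used for the corpus.

```python
from collections import Counter

# Hypothetical toy paragraph: one (language label, token count) pair per sentence.
# Labels are assumed to be given; this is not the corpus' actual identifier.
paragraph = [("uk", 12), ("ru", 7), ("uk", 9)]


def token_shares(sentences):
    """Share of tokens per language, computed from sentence-level labels."""
    totals = Counter()
    for lang, n_tokens in sentences:
        totals[lang] += n_tokens
    grand_total = sum(totals.values())
    return {lang: count / grand_total for lang, count in totals.items()}


def paragraph_label(sentences):
    """Collapse the paragraph to one label: the language with the most tokens."""
    shares = token_shares(sentences)
    return max(shares, key=shares.get)


shares = token_shares(paragraph)
label = paragraph_label(paragraph)

print(shares)  # sentence-level view keeps the Russian sentence visible
print(label)   # paragraph-level view assigns a single language
```

In this toy case the paragraph-level label is Ukrainian, so the Russian sentence would be counted as Ukrainian text, whereas the sentence-level labels preserve the switch. This is the kind of detail that matters for code-switching research.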

Instances of using Russian in the Verkhovna Rada occurred mostly before mid-2019, when the Law on Protecting the Functioning of the Ukrainian Language as the State Language came into effect.

POS tagging, lemmatization and dependency parsing were done with UDPipe 2 using the ukrainian-iu-ud-2.10-220711 and russian-syntagrus-ud-2.10-220711 models for ParlaMint-UA 3.0 and 4.0, and the ukrainian-iu-ud-2.12-230717 and russian-syntagrus-ud-2.12-230717 models for ParlaMint-UA 4.0.1. In addition, errors found in ParlaMint 4.0 were corrected in version 4.0.1.
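UDPipe 2 can return its analysis in the CoNLL-U format, with ten tab-separated columns per token. As a minimal sketch of how lemmas, POS tags and dependency relations might be read back from such output, the Python snippet below parses a tiny hand-written fragment. The sample sentence and its annotation are illustrative assumptions, not actual model output, and `Token` and `parse_conllu` are hypothetical helper names.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    upos: str
    head: int
    deprel: str


# Hand-written CoNLL-U fragment for illustration only (not real model output).
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE = (
    "# text = Шановні колеги!\n"
    "1\tШановні\tшановний\tADJ\t_\t_\t2\tamod\t_\t_\n"
    "2\tколеги\tколега\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t!\t!\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)


def parse_conllu(text):
    """Extract basic token annotations from CoNLL-U text."""
    tokens = []
    for line in text.splitlines():
        # Skip blank lines (sentence breaks) and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Keep only regular tokens: ten columns and a plain integer ID.
        # Multiword ranges ("1-2") and empty nodes ("1.1") are skipped.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        tokens.append(Token(cols[1], cols[2], cols[3], int(cols[6]), cols[7]))
    return tokens


tokens = parse_conllu(SAMPLE)
for token in tokens:
    print(token.form, token.lemma, token.upos, token.head, token.deprel)
```

For simplicity the sketch flattens all sentences into a single token list; a fuller reader would group tokens by sentence at each blank line.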

The Ukrainian NER model was trained and deployed as part of the NameTag service, with the dedicated dataset used for training. We would like to thank Jana Strakova for training the Ukrainian NER tool.

To increase the availability of Ukrainian data for international researchers, the ParlaMint-UA corpus 3.0 and 4.0, along with the other ParlaMint corpora in the project, was machine translated into English and included in a parallel corpus available via the concordancer as:

PARLAMINT-XX-EN 4.0 (the machine translated version), and

PARLAMINT-XX 4.0 (the joint corpus of all the original language corpora, including the ParlaMint-UA corpus, which is sentence aligned with the translated version).

Both are also available for download.

The machine translation to English was done at the sentence level with the EasyNMT package using OPUS-MT models. Note that the automatically produced translation to English contains errors typical of neural machine translation.

The Ukrainian parliament at a glance

The Verkhovna Rada of Ukraine (Ukrainian: Верховна Рада України, lit. the Supreme Council of Ukraine, VR or Rada for short) is the unicameral parliament of Ukraine, with members elected to a five-year term. There have been nine terms (convocations) of the Rada in modern Ukrainian history, with the 6th, 8th and 9th Radas elected at snap parliamentary elections. The Rada comprises 450 Members of Parliament (Ukrainian: народні депутати, or people’s deputies). However, elections for the constituencies situated in the occupied parts of Donetsk and Luhansk Oblasts as well as Crimea have not been held since Russia’s aggression in 2014, which resulted in electing only 423 MPs for the 8th Rada and 424 MPs for the 9th Rada. The current election system is mixed, with 50% of seats distributed under party lists and 50% of seats won in single-member constituencies.

Parliamentary meetings during one term are grouped into several sessions. Each first session of a newly convoked Rada is presided over by members of a temporary presidium, until a Chairperson (Ukrainian: Голова Верховної Ради, lit. Head of the Verkhovna Rada), a First Deputy Chairperson (Ukrainian: Перший заступник Голови Верховної Ради, lit. First Deputy Head of the Verkhovna Rada) and a Deputy Chairperson (Ukrainian: заступник Голови Верховної Ради, lit. Deputy Head of the Verkhovna Rada) are elected from among its ranks. In circumstances where the post of President of Ukraine becomes vacant, the Chairperson of the Rada becomes acting head of state with limited authority, which was the case in February–June 2014.

There are commonly one or two parliamentary meetings per day (a morning and an evening sitting).

The political system in Ukraine is multi-party, with 349 political parties on record at the country's Single Registry as of 1 January 2020. Contemporary political parties in Ukraine tend not to have clear-cut ideologies and centre around civilizational and geostrategic orientations, individual politicians or business interests. Also, renaming and rebranding political parties ahead of elections is not unusual. Parties that break the 5% electoral threshold form factions in the parliament. MPs elected on party lists may be either members of the respective parties or be nominated by those parties without membership. Parliamentary groups may consist of MPs who left a parliamentary faction, members of different political parties or independent politicians. An MP may be a member of only one parliamentary faction or group at a time.

The team

Matyáš Kopp (3.0, 4.0, 4.0.1)

Anna Kryvenko (3.0, 4.0, 4.0.1)

Adriana Rii (4.0.1)

For more on the workflow developed to build the ParlaMint-UA corpus, see (3).

Acknowledgments

The ParlaMint-UA corpus was developed with support from the program Digital Humanities P6-0436 and project N6-0288, both by the Slovenian Research Agency, as well as the CLARIN ERIC project ‘ParlaMint: Towards Comparable Parliamentary Corpora’. It also used tools and services provided by the LINDAT/CLARIAH-CZ Research Infrastructure, supported by the Ministry of Education, Youth and Sports of the Czech Republic (Project No. LM2023062). The Ukrainian parliamentary corpus ParlaMint-UA 4.0.1 was supported by Jožef Stefan Institute CLARIN "CLARIN.SI".

References

(1) Erjavec, T., Ogrodniczuk, M., Osenova, P., Ljubešić, N., Simov, K., Pančur, A., Rudolf, M., Kopp, M., Barkarson, S., Steingrímsson, S., Çöltekin, Ç., de Does, J., Depuydt, K., Agnoloni, T., Venturi, G., Pérez, M. C., de Macedo, L. D., Navarretta, C., Luxardo, G., . . . Fišer, D. (2023). The ParlaMint corpora of parliamentary proceedings. Language Resources and Evaluation, 57(1), 415–448. https://doi.org/10.1007/s10579-021-09574-0

(2) Kanishcheva, O., Kovalova, T., Shvedova, M., von Waldenfels, R. (2023). The Parliamentary Code-Switching Corpus: Bilingualism in the Ukrainian Parliament in the 1990s-2020s. In Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP), 79–90, Dubrovnik, Croatia. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.unlp-1.10

(3) Kryvenko, A., Kopp, M. (2023). Workflow and Metadata Challenges in the ParlaMint Project: Insights from Building the ParlaMint-UA Corpus. In: CLARIN Annual Conference Proceedings, 2023. ISSN 2773-2177 (online). Eds. Krister Lindén, Jyrki Niemi, and Thalassia Kontino. Leuven, Belgium, 2023. https://office.clarin.eu/v/CE-2023-2328_CLARIN2023_ConferenceProceedings.pdf

Resources

Kopp, M., Kryvenko, A., Rii, A. (2023). Ukrainian parliamentary corpus ParlaMint-UA 4.0.1 [Slovenian language resource repository CLARIN.SI]. http://hdl.handle.net/11356/1900

Erjavec, T., Kopp, M., Ogrodniczuk, M., Osenova, P., Fišer, D., Pirker, H., Wissik, T., Schopper, D., Kirnbauer, M., Mochtak, M., Ljubešić, N., Rupnik, P., Pol, H. v. d., Depoorter, G., de Does, J., Simov, K., Grigorova, V., Grigorov, I., Jongejan, B., . . . Kryvenko, A. (2023). Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 4.0 [Slovenian language resource repository CLARIN.SI]. http://hdl.handle.net/11356/1860

Erjavec, T., Kopp, M., Ogrodniczuk, M., Osenova, P., Fišer, D., Pirker, H., Wissik, T., Schopper, D., Kirnbauer, M., Mochtak, M., Ljubešić, N., Rupnik, P., Pol, H. v. d., Depoorter, G., de Does, J., Simov, K., Grigorova, V., Grigorov, I., Jongejan, B., . . . Kryvenko, A. (2023). Multilingual comparable corpora of parliamentary debates ParlaMint 4.0 [Slovenian language resource repository CLARIN.SI]. http://hdl.handle.net/11356/1859

Kuzman, T., Ljubešić, N., Erjavec, T., Kopp, M., Ogrodniczuk, M., Osenova, P., Fišer, D., Pirker, H., Wissik, T., Schopper, D., Kirnbauer, M., Mochtak, M., Rupnik, P., Pol, H. v. d., Depoorter, G., de Does, J., Simov, K., Grigorova, V., Grigorov, I., . . . Kryvenko, A. (2023). Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 4.0 [Slovenian language resource repository CLARIN.SI]. http://hdl.handle.net/11356/1864

Erjavec, T., Kopp, M., Ogrodniczuk, M., Osenova, P., Fišer, D., Pirker, H., Wissik, T., Schopper, D., Kirnbauer, M., Mochtak, M., Ljubešić, N., Rupnik, P., Pol, H. v. d., Depoorter, G., de Does, J., Simov, K., Grigorova, V., Grigorov, I., Jongejan, B., . . . Kryvenko, A. (2023). Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 3.0 [Slovenian language resource repository CLARIN.SI]. http://hdl.handle.net/11356/1488

Erjavec, T., Kopp, M., Ogrodniczuk, M., Osenova, P., Fišer, D., Pirker, H., Wissik, T., Schopper, D., Kirnbauer, M., Mochtak, M., Ljubešić, N., Rupnik, P., Pol, H. v. d., Depoorter, G., de Does, J., Simov, K., Grigorova, V., Grigorov, I., Jongejan, B., . . . Kryvenko, A. (2023). Multilingual comparable corpora of parliamentary debates ParlaMint 3.0 [Slovenian language resource repository CLARIN.SI]. http://hdl.handle.net/11356/1486

Kuzman, T., Ljubešić, N., Erjavec, T., Kopp, M., Ogrodniczuk, M., Osenova, P., Fišer, D., Pirker, H., Wissik, T., Schopper, D., Kirnbauer, M., Mochtak, M., Rupnik, P., Pol, H. v. d., Depoorter, G., de Does, J., Simov, K., Grigorova, V., Grigorov, I., . . . Kryvenko, A. (2023). Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 3.0 [Slovenian language resource repository CLARIN.SI]. http://hdl.handle.net/11356/1810