Other Ukrainian and Slavic corpora

Ukrainian corpora

Corpus	Size	Texts included	Access
Textual Corpus of Ukrainian	120 million tokens	Journalism, fiction, academic, legal, poetic	Searchable online
Zvidusil: a web corpus with syntactic annotation (Laboratory of Ukrainian)	3 billion tokens	Web texts	Searchable online
Ukrainian Web Corpus of Leipzig University: Ukrainian mixed corpus based on material from 2014	1.5 billion tokens	Web texts	Searchable online
Web Corpus Araneum Ucrainicum	125 million tokens (“Minus”) and 1.25 billion tokens (“Maius”)	Web texts, downloaded in 2014, 2015, 2021, 2022	Searchable online, registration is required
ukTenTen: Ukrainian corpus from the Web	7.5 billion tokens	Web texts	Searchable online
Polish Automatic Web corpus of Ukrainian language (PAWUK)	700+ million tokens	Web texts (news sites, Telegram, Twitter, YouTube), downloaded daily from March 2022	Searchable online
Ukrainian ParlaMint	41 million tokens	Parliamentary transcripts (2002-2023)	Searchable online
Ukrainian Brown corpus	633k tokens (510k words)	Balanced, manually annotated corpus	Available for download
Ukrainian Treebank (Laboratory of Ukrainian)	140 thousand tokens	Different genres	Searchable online, available for download
Lang-uk. Corpora of Ukrainian texts	600 million tokens	News, Wikipedia, fiction, web	Available for download
Ukrainian corpus of the Chtyvo library	600 million tokens	Books: fiction, academic texts, journalism	Searchable online (exact-match search only: no lemmatization, morphological analysis, or error correction)
Parallel corpora with English, Polish, French, German, Spanish, Portuguese (Laboratory of Ukrainian)	6 million tokens	Fiction	Searchable online
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language	34,000 sentences	Texts with errors	Available for download

East Slavic corpora

ruTenTen: Corpus of the Russian Web	>20 billion tokens	Web texts, downloaded in 2011, 2017	Searchable online
Araneum Russicum Russicum (Russia-only Russian)	125 million tokens (“Minus”) and 1.25 billion tokens (“Maius”)	Web texts, downloaded in 2015	Searchable online, registration is required
Araneum Russicum Externum (non-Russia Russian)	125 million tokens (“Minus”) and 1.25 billion tokens (“Maius”)	Web texts, downloaded in 2015	Searchable online, registration is required

Belarusian N-corpus [Беларускі N-корпус]	1 billion words	Fiction, journalism, academic, religious, and other texts	Searchable online
Araneum Albaruthenicum Novum MMXXI	155 million tokens	Web texts	Searchable online
Corpus Albaruthenicum - Corpus of the academic Belarusian language	350 thousand words	Academic texts	Searchable online
Experimental Belarusian corpus [Эксперыментальны корпус беларускай мовы]	7.5 million words	Newspapers, fiction	Available for download
Parallel Belarusian Bible Corpus [Біблійны корпуc]		16 Belarusian translations of the Bible and 6 translations in other languages, including Ukrainian translation by Ivan Ohienko	Searchable online
Corpus of Spoken Rusyn	125 thousand words	Transcriptions of conversations aligned with the audio recordings	Searchable online

West Slavic corpora

National Corpus of Polish [Narodowy Korpus Języka Polskiego]	1.8 billion tokens	Classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts	Searchable online
Polish Language Corpus of the PWN Publishing House [Korpus Języka Polskiego Wydawnictwa Naukowego PWN]	100 million words	Fiction, journalism, other printed texts (advertising, operating instructions, rules, election leaflets, etc.), website texts, conversational texts	Searchable online
Monco corpus search engine [Wyszukiwarka korpusowa Monco]	>6 billion tokens	Web texts	Searchable online
Spokes. Conversational Polish corpus	2.3 million words	Transcriptions of spontaneous conversations aligned with the audio recordings	Searchable online
Corpus of spoken language of Spisz [Korpus języka mówionego mieszkańców Spisza]		Transcriptions of spontaneous conversations aligned with the audio recordings	Searchable online
Electronic corpus of Polish texts from the 17th and 18th centuries (until 1772) [Elektroniczny korpus tekstów polskich z XVII i XVIII w. (do 1772 r.)]	13.5 million words		Searchable online
Polish-German / German-Polish Parallel Corpus	1 million words	Fiction, press, law, non-fiction	Searchable online
Czech National Corpus [Český národní korpus]	>4 billion tokens	Written contemporary Czech (more than 4 billion tokens), spontaneous spoken language (more than 7 million tokens), diachronic corpus of historical texts, and parallel corpus InterCorp that contains translations from or to 30+ languages.	Searchable online
Old Czech text bank [Staročeská textová banka]			Searchable online
Database of Late Medieval Biblical Texts [Český biblický překlad v diachronním pohledu: Databáze pozdně středověkých biblických textů]			Searchable online
Slovak national corpus [Slovenský národný korpus]	1.5 billion tokens	Texts of different styles, genres, regions, since 1955	Searchable online
Lower Sorbian text corpus [Dolnoserbski tekstowy korpus]	15 million tokens		Searchable online

South Slavic corpora

Croatian national corpus [Hrvatski nacionalni korpus]	217 million tokens		Searchable online
Croatian language corpus Riznica [Hrvatski jezični korpus]		Fundamental Croatian literature (e.g. novels, short stories, drama, poetry); non-fiction; scientific publications from various domains and university textbooks; school books; literature translated by outstanding Croatian translators; online journals and newspapers; books from the pre-standardization period of the Croatian language, adapted to present-day standard Croatian	Searchable online
Slovenian Corpus Nova beseda	318 million words	Journalism, National Assembly session transcripts, fiction, academic, legal	Searchable online
Corpus of spoken Slovene GOS [GOS — GOvorjene Slovenščine]	>1 million words	Radio and TV shows, school lessons and lectures, private conversations between friends or within the family, work meetings, consultations, conversations when selling objects and services, etc.	Searchable online
Bulgarian national corpus [Български национален корпус]			Searchable online

ParaSol: A Parallel Corpus of Slavic and other languages
