Other Ukrainian and Slavic corpora

Ukrainian corpora

Corpus

Size
Texts included
Access
Textual Corpus of Ukrainian

120 million tokens
Journalism, fiction, academic, legal, poetic
Searchable online
Laboratory of Ukrainian
Zvidusil: a web corpus with syntactic annotation

3 billion tokens
Web texts
Searchable online
Ukrainian Web Corpus of the Leipzig University
Ukrainian mixed corpus based on material from 2014
 
1.5 billion tokens
Web texts
Searchable online
Web Corpus Araneum Ucrainicum

125 million tokens (“Minus”) and 1.25 billion tokens (“Maius”)
Web texts, downloaded in 2014, 2015, 2021, 2022
Searchable online; registration required
ukTenTen: Ukrainian corpus from the Web

7.5 billion tokens
Web texts
Searchable online
Polish Automatic Web corpus of Ukrainian language (PAWUK)

700+ million tokens
Web texts (news sites, Telegram, Twitter, YouTube), downloaded daily since March 2022
Searchable online
Ukrainian ParlaMint

41 million tokens
Parliamentary transcripts (2002-2023)
Searchable online
Ukrainian Brown corpus

633k tokens (510k words)
Balanced, manually annotated corpus
Available for download
Ukrainian Treebank (Laboratory of Ukrainian)

140 thousand tokens
Different genres
Searchable online, available for download
Lang-uk. Corpora of Ukrainian texts

600 million tokens
News, Wikipedia, fiction, web
Available for download
Ukrainian corpus of the Chtyvo library

600 million tokens
Books: fiction, academic texts, journalism
Searchable online; exact-match search only (no lemmatization, morphological analysis, or error correction)
Laboratory of Ukrainian
Parallel with English, Polish, French, German, Spanish, Portuguese

6 million tokens
Fiction
Searchable online
Parallel Ukrainian-Russian and Russian-Ukrainian corpora within the Russian National Corpus

9 million tokens
Fiction, journalism, academic, legal, letters
Searchable online
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language


34,000 sentences


Texts with errors
Available for download



East Slavic corpora

Russian National Corpus [Национальный корпус русского языка]

>700 million words
Fiction, journalism, academic, and other texts from printed publications
Searchable online
General Internet Corpus of Russian [Генеральный Интернет-корпус Русского Языка (ГИКРЯ)]

>20 billion tokens
Web texts
Searchable online
Spoken corpora

Each corpus published on this website represents the spoken language of a specific region of Russia and contains audio files transcribed in standardized orthography.
The search function lets you listen to fragments containing a word or collocation of interest. Full texts are available for many of the corpora.
More Russian corpora





Belarusian N-corpus [Беларускі N-корпус]

1 billion words
Fiction, journalism, academic, religious, and other texts
Searchable online
Araneum Albaruthenicum Novum MMXXI

155 million tokens
Web texts
Searchable online
Corpus Albaruthenicum - Corpus of the academic Belarusian language

350 thousand words
Academic texts
Searchable online
Experimental Belarusian corpus [Эксперыментальны корпус беларускай мовы]

7.5 million words
Newspapers, fiction
Available for download
Parallel Belarusian Bible Corpus [Біблійны корпуc]


16 Belarusian translations of the Bible and 6 translations into other languages, including the Ukrainian translation by Ivan Ohienko
Searchable online

Corpus of Spoken Rusyn

125 thousand words
Transcriptions of conversations aligned with the audio recordings

Searchable online


West Slavic corpora

National Corpus of Polish [Narodowy Korpus Języka Polskiego]

1.8 billion tokens
Classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts

Searchable online

Polish Language Corpus of the PWN Publishing House [Korpus Języka Polskiego Wydawnictwa Naukowego PWN]
100 million words

Fiction, journalism, other printed texts (advertising, operating instructions, rules, election leaflets, etc.), website texts, conversational texts

Searchable online

Monco corpus search engine [Wyszukiwarka korpusowa Monco]

>6 billion tokens
Web texts
Searchable online
Spokes. Conversational Polish corpus

2.3 million words
Transcriptions of spontaneous conversations aligned with the audio recordings

Searchable online

Corpus of the spoken language of the inhabitants of Spisz [Korpus języka mówionego mieszkańców Spisza]


Transcriptions of spontaneous conversations aligned with the audio recordings
Searchable online

Electronic corpus of Polish texts from the 17th and 18th centuries (until 1772) [Elektroniczny korpus tekstów polskich z XVII i XVIII w. (do 1772 r.)]

13.5 million words
Searchable online
Polish-German / German-Polish Parallel Corpus
1 million words
Fiction, press, law, non-fiction

Searchable online
Czech National Corpus [Český národní korpus]

>4 billion tokens

Contemporary written Czech (more than 4 billion tokens), spontaneous spoken language (more than 7 million tokens), a diachronic corpus of historical texts, and the parallel corpus InterCorp, which contains translations from or into 30+ languages.

Searchable online

Old Czech text bank [Staročeská textová banka]


Searchable online

Database of Late Medieval Biblical Texts [Český biblický překlad v diachronním pohledu: Databáze pozdně středověkých biblických textů]



Searchable online
Slovak national corpus [Slovenský národný korpus]

1.5 billion tokens
Texts of different styles, genres, regions, since 1955
Searchable online
Lower Sorbian text corpus [Dolnoserbski tekstowy korpus]
15 million tokens
Searchable online


South Slavic corpora

Croatian national corpus [Hrvatski nacionalni korpus]

217 million tokens
Searchable online
Croatian language corpus Riznica [Hrvatski jezični korpus]

Fundamental Croatian literature (e.g. novels, short stories, drama, poetry); non-fiction; scientific publications from various domains and university textbooks; school books; translated literature by outstanding Croatian translators; online journals and newspapers; books from the pre-standardization period of the Croatian language, adapted to present-day standard Croatian

Searchable online
Slovenian Corpus Nova beseda

318 million words
Journalism, National Assembly session transcripts, fiction, academic, legal

Searchable online
Corpus of spoken Slovene GOS [GOS — GOvorjene Slovenščine]

>1 million words
Radio and TV shows, school lessons and lectures, private conversations among friends or family, work meetings, consultations, conversations during the sale of goods and services, etc.

Searchable online
Bulgarian national corpus [Български национален корпус]



Searchable online




ParaSol: A Parallel Corpus of Slavic and other languages



