Ukrainian corpora
Corpus | Size | Texts included | Access
Textual Corpus of Ukrainian | 120 million tokens | Journalism, fiction, academic, legal, poetic | Searchable online
Zvidusil: a web corpus with syntactic annotation (Laboratory of Ukrainian) | 3 billion tokens | Web texts | Searchable online
Ukrainian mixed corpus based on material from 2014 (Ukrainian Web Corpus of the Leipzig University) | 1.5 billion tokens | Web texts | Searchable online
Web Corpus Araneum Ucrainicum | 125 million tokens (“Minus”) and 1.25 billion tokens (“Maius”) | Web texts, downloaded in 2014, 2015, 2021, 2022 | Searchable online, registration is required
ukTenTen: Ukrainian corpus from the Web | 7.5 billion tokens | Web texts | Searchable online
Polish Automatic Web corpus of Ukrainian language (PAWUK) | 700+ million tokens | Web texts (news sites, Telegram, Twitter, YouTube), downloaded daily since March 2022 | Searchable online
Ukrainian ParlaMint | 41 million tokens | Parliamentary transcripts (2002-2023) | Searchable online
Ukrainian Brown corpus | 633k tokens (510k words) | Balanced, manually annotated corpus | Available for download
Ukrainian Treebank (Laboratory of Ukrainian) | 140 thousand tokens | Different genres | Searchable online, available for download
Lang-uk. Corpora of Ukrainian texts | 600 million tokens | News, Wikipedia, fiction, web | Available for download
Ukrainian corpus of the Chtyvo library | 600 million tokens | Books: fiction, academic texts, journalism | Searchable online; exact-match search only (no lemmatization, morphological analysis, or error correction)
Parallel corpus with English, Polish, French, German, Spanish, Portuguese (Laboratory of Ukrainian) | 6 million tokens | Fiction | Searchable online
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language | 34,000 sentences | Texts with errors | Available for download

East Slavic corpora
ruTenTen: Corpus of the Russian Web | >20 billion tokens | Web texts, downloaded in 2011, 2017 | Searchable online
Araneum Russicum Russicum (Russia-only Russian) | 125 million tokens (“Minus”) and 1.25 billion tokens (“Maius”) | Web texts, downloaded in 2015 | Searchable online, registration is required
Araneum Russicum Externum (non-Russia Russian) | 125 million tokens (“Minus”) and 1.25 billion tokens (“Maius”) | Web texts, downloaded in 2015 | Searchable online, registration is required
Belarusian N-corpus [Беларускі N-корпус] | 1 billion words | Fiction, journalism, academic, religious, and other texts | Searchable online
Araneum Albaruthenicum Novum MMXXI | 155 million tokens | Web texts | Searchable online
Corpus Albaruthenicum - Corpus of the academic Belarusian language | 350 thousand words | Academic texts | Searchable online
Experimental Belarusian corpus [Эксперыментальны корпус беларускай мовы] | 7.5 million words | Newspapers, fiction | Available for download
Parallel Belarusian Bible Corpus [Біблійны корпус] | | 16 Belarusian translations of the Bible and 6 translations in other languages, including the Ukrainian translation by Ivan Ohienko | Searchable online
Corpus of Spoken Rusyn | 125 thousand words | Transcriptions of conversations aligned with the audio recordings | Searchable online

West Slavic corpora
National Corpus of Polish [Narodowy Korpus Języka Polskiego] | 1.8 billion tokens | Classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts | Searchable online
Polish Language Corpus of the PWN Publishing House [Korpus Języka Polskiego Wydawnictwa Naukowego PWN] | 100 million words | Fiction, journalism, other printed texts (advertising, operating instructions, rules, election leaflets, etc.), website texts, conversational texts | Searchable online
Monco corpus search engine [Wyszukiwarka korpusowa Monco] | >6 billion tokens | Web texts | Searchable online
Spokes. Conversational Polish corpus | 2.3 million words | Transcriptions of spontaneous conversations aligned with the audio recordings | Searchable online
Corpus of spoken language of Spisz [Korpus języka mówionego mieszkańców Spisza] | | Transcriptions of spontaneous conversations aligned with the audio recordings | Searchable online
Electronic corpus of Polish texts from the 17th and 18th centuries (until 1772) [Elektroniczny korpus tekstów polskich z XVII i XVIII w. (do 1772 r.)] | 13.5 million words | | Searchable online
Polish-German / German-Polish Parallel Corpus | 1 million words | Fiction, press, law, non-fiction | Searchable online
Czech National Corpus [Český národní korpus] | >4 billion tokens | Written contemporary Czech (more than 4 billion tokens), spontaneous spoken language (more than 7 million tokens), a diachronic corpus of historical texts, and the parallel corpus InterCorp, which contains translations from or into 30+ languages | Searchable online
Old Czech text bank [Staročeská textová banka] | | |
Database of Late Medieval Biblical Texts [Český biblický překlad v diachronním pohledu: Databáze pozdně středověkých biblických textů] | | | Searchable online
Slovak national corpus [Slovenský národný korpus] | 1.5 billion tokens | Texts of different styles, genres, regions, since 1955 | Searchable online
Lower Sorbian text corpus [Dolnoserbski tekstowy korpus] | 15 million tokens | | Searchable online

South Slavic corpora
Croatian national corpus [Hrvatski nacionalni korpus] | 217 million tokens | | Searchable online
Croatian language corpus Riznica [Hrvatski jezični korpus] | | Fundamental Croatian literature (e.g. novels, short stories, drama, poetry); non-fiction; scientific publications from various domains and university textbooks; school books; translated literature by outstanding Croatian translators; online journals and newspapers; books from the pre-standardization period of the Croatian language adapted to present-day standard Croatian | Searchable online
Slovenian Corpus Nova beseda | 318 million words | Journalism, National Assembly session transcripts, fiction, academic, legal | Searchable online
Corpus of spoken Slovene GOS [GOS — GOvorjene Slovenščine] | >1 million words | Radio and TV shows, school lessons and lectures, private conversations between friends or within the family, work meetings, consultations, conversations during the sale of goods and services, etc. | Searchable online
Bulgarian national corpus [Български национален корпус] | | | Searchable online
ParaSol: A Parallel Corpus of Slavic and other languages | | |