Corpora
General Regionally Annotated Corpus of Ukrainian (GRAC) is the largest manually compiled reference corpus of Ukrainian.
Corpus Project of the Laboratory of Ukrainian contains several corpora and a dedicated morphological analyzer. The corpora include a treebank with manual disambiguation and manual tagging (140 thousand tokens), a web corpus "Zvidusil" with automatic syntactic annotation (about 3 billion tokens), and parallel corpora.
Lang-uk corpus project provides collections of Ukrainian online press, fiction, and Wikipedia available for download, totaling 665 million tokens (UberText corpus); a corpus of laws and legal acts counting 579 million tokens; a corpus annotated for named entities, together with a model for automatic annotation of named entities (people, organizations, locations, and others); various gazetteers; a simple tokenizer (splitting text into paragraphs, sentences, and tokens); and vector models trained on different corpora.
Ukrainian Brown corpus (BrUK) - an open, genre-balanced corpus of the modern Ukrainian language with a volume of 1 million word usages, with annotation planned for the future. The corpus is modeled on the well-known Brown corpus of the English language.
UA-GEC is a corpus of texts with marked grammatical errors.
Ukrainian Web Corpus (Corpora Collection Leipzig) is a Ukrainian mixed corpus based on material from 2014. It contains 102,429,857 sentences and 1,546,330,404 tokens.
Zvidusil - a web corpus with syntactic annotation (Laboratory of Ukrainian).
Polish Automatic Web corpus of Ukrainian language (PAWUK) is a linguistic corpus containing Ukrainian texts acquired from the Internet (selected web pages and social network accounts) and updated daily. It is automatically annotated with morphosyntactic tags, syntactic dependencies, and named entities using Stanza with a custom-built model for Ukrainian that produces both Universal Dependencies tags and VESUM morphological tags.
Ukrainian corpus of the Chtyvo library is a universal (or national), unannotated and unsystematized corpus of the Ukrainian language. It contains 6.6 GB of Ukrainian-language texts from the Chtyvo electronic library.
Ukrainian NLI corpus (translation from Stanford SNLI).
Ukrainian Formality corpus (translation from GYAFC, Grammarly’s Yahoo Answers Formality Corpus)
Ukrainian Jigsaw Toxicity Classification dataset (translation from English)
Ukrainian Trends is a monitor corpus made up of news articles, Wikipedia and other sources that are regularly updated from their RSS feeds (newsfeeds). The corpus is updated daily with new texts and grows by about 1 million words each day.
Legal Ukrainian Crawling - a 69-million-token corpus of Ukrainian built from the web by targeting specific in-domain URLs that belong to the legal sector, such as legislation websites, governmental sites, and domains of the Court and the Parliament.
Legal documents from the official web portal of the Parliament of Ukraine (1.0) is a monolingual corpus based on 15,335 documents acquired from the portal of the Ukrainian Parliament.
Bitext Lexical Dataset - Ukrainian includes Lemmas, POS tagging, Frequency, Named Entities and Offensive features. Depending on the dataset and language, other syntactic and morphological features are also provided.
UberText 2.0 is the new and extended version of UberText, a corpus of modern Ukrainian texts designed to meet various NLP needs.
Ukrainian Forums - a corpus containing 250k sentences scraped from forums.
ZNO dataset contains machine-readable questions and answers from the Ukrainian External Independent Testing (called ЗНО/ZNO in Ukrainian). The question subjects are History of Ukraine and Ukrainian language and literature. The train set contains 3063 question-answer pairs from the 2006-2019 exams. The test set contains 751 question-answer pairs from 2020-2023.
UA-GEC contains UA-GEC data and an accompanying Python library.
Yakaboo Book Reviews contains book reviews, ratings and descriptions.
Ukrainian Winograd Schema Challenge (WSC) Dataset contains manual translations of 263 Winograd schemas from the WSC dataset in csv and jsonlines formats.
Ukrainian-Cultural Heritage-Books - a collection of Ukrainian cultural heritage books and periodicals, most of them being in the public domain. The collection has been compiled by Pierre-Carl Langlais from 19,574 digitized files hosted on Internet Archive (462M words) and will be expanded to other cultural heritage sources.
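The Lang-uk entry above mentions a simple tokenizer that splits text into paragraphs, sentences, and tokens. As a rough illustration of that three-level split, here is a minimal sketch using only the Python standard library. It is not the Lang-uk implementation: the function names and the splitting rules (blank lines between paragraphs, sentence breaks after `.`, `!`, `?` or `…`) are assumptions made for this example, and the naive sentence rule will mis-split abbreviations and initials.

```python
import re


def split_paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    # Naive rule: a sentence ends at ., !, ? or … followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?…])\s+", paragraph) if s]


def split_tokens(sentence):
    # A token is a run of word characters (allowing inner apostrophes
    # or hyphens, as in "м'ята") or a single punctuation mark.
    return re.findall(r"\w+(?:['’-]\w+)*|[^\w\s]", sentence)


def tokenize(text):
    # Returns a nested list: paragraphs -> sentences -> tokens.
    return [
        [split_tokens(s) for s in split_sentences(p)]
        for p in split_paragraphs(text)
    ]


if __name__ == "__main__":
    sample = "Привіт, світе! Як справи?\n\nЦе другий абзац."
    for paragraph in tokenize(sample):
        print(paragraph)
```

For real work, a tokenizer trained or tuned for Ukrainian is a better choice than these regular expressions.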
Parallel Corpora
LORELEI Ukrainian Representative Language Pack consists of Ukrainian monolingual text, Ukrainian-English parallel and comparable text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium for the DARPA LORELEI program.
MultiParaCrawl is a collection of parallel corpora from web crawls collected in the ParaCrawl project and further processed into a multi-parallel corpus by pivoting via English.
INTERCORP. In Intercorp v.16, the volume of Ukrainian texts is over 18 million tokens with aligned originals or translations into Czech and other languages through Czech. The Ukrainian part of Intercorp consists mainly of manually aligned fiction texts and a smaller dataset of subtitles and the Bible.
MaCoCu: Corpora from the Web. The MaCoCu corpora were built by crawling Internet top-level domains in 2021 and 2022, extending the crawl dynamically to other domains as well.
ParaRook||DE-UK - parallel German-Ukrainian and Ukrainian-German corpus based on GRAC.
OpenSubtitles: multilingual corpora in 58 languages. The OpenSubtitles parallel corpora 2018 are a collection of parallel corpora made up of translated movie subtitles at https://www.opensubtitles.org/. The collection consists of 60 corpora in 58 languages.
Parallel corpus with Russian (Russian National Corpus)
Parallel corpora with English, Polish, French, German, Spanish, Portuguese (Laboratory of Ukrainian)
OPUS is a growing collection of translated texts from the web.
Tatoeba is a large database of sentences and translations, including Ukrainian.
Dataset Multi30k: English-Ukrainian variation.
Polish-Ukrainian Parallel Corpus “2” is a constantly developing bilingual resource. It contains manually aligned contemporary Polish and Ukrainian texts with a total volume of more than 1.2 million words. The Corpus features a predominance of works representing fiction, colloquial style and specialized language.
English - Ukrainian Legal MT Test Set is a test set of 996 parallel segments in English and Ukrainian. It is intended as a test set for machine translation in the legal domain.
Auslandsgesellschaft.de Dortmund Serviceheft Ukraine (Processed) (1.0) is a collection of translation units (TUs) mined from a collection of documents in German and Ukrainian on a variety of topics.
EU acts in Ukrainian is a corpus based on: a) the translations of the EU acts in Ukrainian that are available at the official web-portal of the Parliament of Ukraine, and b) the EU acts that are available in many CEF languages.
COVID-19 - HEALTH Wikipedia dataset. Bilingual (EN-UK) is a bilingual (EN-UK) corpus acquired from Wikipedia on health and COVID-19 domain (2nd May 2020).
COVID-19 POLISH-GOV dataset v2. Bilingual (EN-UK) is a bilingual (EN-UK) COVID-19-related corpus acquired from the portal of the Polish Government (8th May 2020).
COVID-19 UDSC-PL dataset. Bilingual (EN-UK) is a bilingual (EN-UK) corpus acquired from the website of the Polish Office for Foreigners.
COVID-19 CDC dataset v2. Multilingual (EN, ES, FR, PT, IT, DE, KO, RU, ZH, UK, VI) (2.0) is a multilingual corpus acquired from the website of the Centers for Disease Control and Prevention of the US government (11th August 2020).
COVID-19 POLISH-GOV v2 dataset. Multilingual (EN, PL, FR, DE, VI, RU, UK) is a multilingual (EN, PL, FR, DE, VI, RU, UK) corpus acquired from the website of the Polish Office for Foreigners.
COVID-19 USAHELLO dataset v2. Multilingual (EN, AR, ES, FA, FR, IT, KO, PT, RU, TL, TR, UK, UR, VI, ZH) is a multilingual corpus acquired from the USAHello website, a free online center for information and education for refugees, asylum seekers, immigrants and welcoming communities (9th August 2020).
COVID-19 Government of Canada dataset v2. Multilingual (EN, FR, DE, ES, EL, IT, PL, PT, RO, KO, RU, ZH, UK, VI, TA, TL) is a multilingual corpus acquired from the website of the Government of Canada (17th July 2020).
Official web-portal of the Parliament of Ukraine, Ukrainian laws in EN is based on the translations of laws of Ukraine into English that are available at the official web-portal of the Parliament of Ukraine.
Official web-portal of the Parliament of Ukraine, primary legislation is based on the translations of primary legislation controlled by committees of the Verkhovna Rada of Ukraine.
Official web-portal of the Parliament of Ukraine, abstracts of UK laws is based on the translations of abstracts of laws of Ukraine into English that are available at the official web-portal of the Parliament of Ukraine.
TüTeAM contains about 2800 entries from Ancient Greek, German, English, Italian, Hungarian, Latin, Swedish, Russian, Ukrainian, Bulgarian. The data come from various sources: linguistic literature (the "classics" on tense and aspect), fiction, documentary evidence.
Multilingual English, French, Polish to Ukrainian Parallel Corpus
AKCES 5 (CzeSL-SGT) Release 2 stands for Czech as a Second Language with Spelling, Grammar and Tags. It extends the “foreign” (ciz) part of AKCES 3 (CzeSL-plain) with texts collected in 2013. Original forms and automatic corrections are tagged, lemmatized and assigned error labels.
GlobalPhone 2000 Speaker Package is a multilingual audio corpus that covers about 9,000 randomly selected utterances read by 2000 native speakers in 22 languages. The package is designed for various tasks in speaker recognition research and development, such as (1) text-dependent and text-independent speaker recognition (e.g. speaker verification and speaker identification), (2) speaker recognition in multiple languages, (3) multilingual speaker identification, (4) multilingual speaker verification, and (5) speaker recognition with low resources.
SciPar UK-EN-RU is a corpus based on parallel titles and abstracts of theses and dissertations available in academic repositories of Ukrainian universities and polytechnics (EN, RU, UK).
Ukrainian web corpus MaCoCu-uk 1.0 was built by crawling the ".ua" and ".укр" internet top-level domains in 2022, extending the crawl dynamically to other domains as well. It contains 21,471,613 texts.
Covid Parallel Global Voices was created for the European Language Resources Coordination Action (ELRC) by researchers at the NLP group of the Institute for Language and Speech Processing with primary data copyrighted by Global Voices.
TüNeg contains about 2700 entries from mostly the same languages as the TüTeAM database using sources similar in kind.
Clausal Causal Markers in the Languages of Europe: A Database accumulates clausal causal markers in the languages of Europe. Clausal causal markers are used in polypredicative constructions. The data were collected from grammars and language corpora and via elicitation.
2011 NIST Language Recognition Evaluation Test Set consists of approximately 204 hours of conversational telephone speech and broadcast audio collected by the Linguistic Data Consortium (LDC) in 24 languages and dialects.
GlobalPhone Ukrainian was developed in collaboration with the Karlsruhe Institute of Technology (KIT), designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.
ParaPat contains a parallel corpus developed from the open access Google Patents dataset in 74 language pairs, comprising more than 68 million sentences and 800 million tokens.
VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).
IWPT 2021 Shared Task Data and System Outputs contains training, development and test (evaluation) datasets. The data is based on a subset of Universal Dependencies release 2.7, but some treebanks contain additional enhanced annotations.
The European Literary Text Collection ELTeC is a diachronic, multilingual, medium-sized open access benchmark corpus of novels from 1840-1919.
Web-acquired data related to Scientific research is a corpus generated by processing the content of websites related to scientific research (e.g. Research Centers and Institutes, Universities, Ministries of Research, etc.).
CCMatrix is a multilingual corpus that has been extracted from web crawls.
Tilde MODEL Corpus is a multilingual open data corpus for European languages. The data has been collected from sites allowing free use and reuse of their content, as well as from public sector web sites. This corpus contains 30 languages and 274 bitexts.
PELCRA-PAR-3 is a collection of Polish parallel corpora licensed under the CC-BY license. This resource contains 11300 texts in 6 languages from the CORDIS website, 5556 texts in 28 languages from the RAPID site, 3037 press releases of the European Parliament in 22 languages and 109 press releases of the European Southern Observatory in 17 languages.
HRW dataset is a multilingual corpus acquired from the website of the Human Rights Watch (9th October 2020).
C4Corpus is a large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family. It has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
WebCorp is comprised of three versions of the software that build on one another. The access to all WebCorp tools (and corpora) is free. WebCorp Live allows access to the World Wide Web as a corpus from which facts about the language can be extracted.
Syntax-semantics interactions is a repository containing the datasets that accompany the paper 'Syntax-semantics interactions – seeking evidence from a synchronic analysis of 38 languages'. It consists of a subset of the Universal Dependencies Corpora v2.6.
XL-Sum is a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics.
Common Language is composed of speech recordings from languages that were carefully selected from the CommonVoice database.
DaMuEL is a large multilingual dataset for Entity Linking containing data in 53 languages.
mLAMA provides the data for mLAMA, a multilingual version of LAMA.
CoNLL 2018 Shared Task System Outputs contains the outputs of systems submitted to the CoNLL 2018 UD parsing shared task.
OpenSubtitles is a new collection of translated movie subtitles.
Common Voice is a dataset containing audio in 60 languages and 9,283 recorded hours.
TaPaCo is a freely available sentential paraphrase corpus for 73 languages extracted from the Tatoeba database.
Parallel corpus of KDE4 localization files
SentiWS contains sentiment lexicons for 81 languages generated via graph propagation based on a knowledge graph, a graphical representation of real-world entities and the links between them.
JULIELab/MEmoLon contains the resulting lexicons from the ACL 2020 paper "Learning and Evaluating Emotion Lexicons for 91 Languages". The main repository for this project (including models, experimental code, and analyses) can be found on GitHub or in the associated Zenodo deposit.
BiblePara is a multilingual parallel corpus created from translations of the Bible compiled by Christos Christodoulopoulos and Mark Steedman.
mC4-sampling is a version of AllenAI's processed release of Google's mC4 dataset in which sampling methods are applied on the fly.
The Massively Multilingual Image Dataset is a large-scale, massively multilingual dataset of images paired with the words they represent collected at the University of Pennsylvania.
Bible Corpus is a parallel corpus created from translations of the Bible containing 102 languages.
CC-100: Monolingual Datasets from Web Crawl Data comprises monolingual data for 100+ languages and also includes data for romanized languages.
Microsoft Terminology Collection can be used to develop localized versions of applications that integrate with Microsoft products. It can also be used to integrate Microsoft terminology into other terminology collections or serve as a base IT glossary for language development in the nearly 100 languages available.
Web Inventory Talk is a collection of the original TED talks and their translated versions. The translations are available in more than 109 languages, though the distribution is not uniform.
W2C – Web to Corpus – Corpora is a set of corpora for 120 languages automatically collected from Wikipedia and the web.
OSCAR is a multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. 166 different languages are available.
WiLI-2018 is a benchmark dataset for language identification and contains 235,000 paragraphs in 235 languages.
Plaintext Wikipedia dump 2018 is the Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.
WikiAnn is a dataset with NER annotations for PER, ORG and LOC. It has been constructed using the linked entities in Wikipedia pages for 282 different languages.
CC-100 is an attempt to recreate the dataset used for training XLM-R. This corpus comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom).
Wiki Edits is a collection of over 5M sentence edits extracted from Ukrainian Wikipedia history revisions. Edits were filtered by edit distance and sentence length. This makes them usable for grammatical error correction (GEC) or spellchecker models pre-training.
The Aya Dataset is a multilingual instruction fine-tuning dataset curated by an open-science community via Aya Annotation Platform from Cohere For AI. The dataset contains a total of 204k human annotated prompt-completion pairs along with the demographics data of the annotators.
PluG (Pluperfect GRAC) is a corpus of old GRAC texts for download.
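The Wiki Edits entry above says that edits were filtered by edit distance and sentence length. The sketch below shows what such a filter could look like, assuming character-level Levenshtein distance. The thresholds (`max_distance`, `min_len`, `max_len`) and the function names are hypothetical and chosen for illustration only; the source does not state the values or the distance measure used to build the dataset.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance over characters,
    # keeping only the previous row to save memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def keep_edit(before, after, max_distance=10, min_len=10, max_len=300):
    # Hypothetical thresholds: drop no-op edits, very short or very long
    # sentences, and rewrites that change too many characters.
    if before == after:
        return False
    if not (min_len <= len(before) <= max_len):
        return False
    if not (min_len <= len(after) <= max_len):
        return False
    return levenshtein(before, after) <= max_distance


if __name__ == "__main__":
    pairs = [
        ("Він пішов до дому.", "Він пішов додому."),
        ("abc", "abd"),
    ]
    for before, after in pairs:
        print(keep_edit(before, after), before, "->", after)
```

A small distance cap tends to keep spelling and grammar fixes while discarding content rewrites, which is why this kind of filter suits GEC and spellchecker pre-training data.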