Corpora

General Regionally Annotated Corpus of Ukrainian (GRAC) is the largest manually compiled reference corpus of Ukrainian.

Corpus Project of the Laboratory of Ukrainian contains several corpora and a dedicated morphological analyzer. The corpora include a treebank with manual disambiguation and manual tagging (140 thousand tokens), a web corpus "Zvidusil" with automatic syntactic annotation (about 3 billion tokens), and parallel corpora.

Lang-uk corpus project provides collections of Ukrainian online press, fiction, and Wikipedia available for download, totaling 665 million tokens (UberText corpus); a corpus of laws and legal acts of 579 million tokens; a corpus annotated for named entities, together with a pre-built model for automatic annotation of named entities (people, organizations, locations, and others); various gazetteers; a simple tokenizer (splitting text into paragraphs, sentences, and tokens); and vector models trained on different corpora.
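As a rough illustration of what splitting text into paragraphs, sentences, and tokens involves, here is a minimal regex-based sketch. It is not the lang-uk tokenizer: the function names and the splitting rules are assumptions made for this example, and the sentence rule will mis-split abbreviations and initials.

```python
import re

def split_paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    # Naive rule: a sentence ends at ., !, ? or … followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?…])\s+", paragraph) if s]

def tokenize(sentence):
    # A token is a run of word characters, optionally joined by an
    # apostrophe or hyphen, or a single punctuation character.
    return re.findall(r"\w+(?:['’ʼ-]\w+)*|[^\w\s]", sentence)

text = "Привіт, світе! Як справи?\n\nЦе другий абзац."
for paragraph in split_paragraphs(text):
    for sentence in split_sentences(paragraph):
        print(tokenize(sentence))
```

A production tokenizer would need a list of abbreviations and more careful handling of quotes, numbers, and URLs.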

Ukrainian Brown corpus (BrUK) is an open, genre-balanced corpus of the modern Ukrainian language, with annotation planned for the future and a volume of 1 million word tokens. The corpus is modeled on the well-known Brown corpus of English.

UA-GEC is a corpus of texts with annotated grammatical errors.

Ukrainian Treebank

Ukrainian Web Corpus (Corpora Collection Leipzig) is a Ukrainian mixed corpus based on material from 2014. It contains 102,429,857 sentences and 1,546,330,404 tokens.

Zvidusil - a web corpus with syntactic annotation (Laboratory of Ukrainian).

Web Corpus Araneum Ucrainicum

Polish Automatic Web corpus of Ukrainian language (PAWUK) is a linguistic corpus of Ukrainian texts acquired from the Internet (selected web pages and social network accounts) and updated daily. It is automatically annotated with morphosyntactic tags, syntactic dependencies and named entities using Stanza with a custom-built model for Ukrainian, producing both Universal Dependencies tags and VESUM morphological tags.

Ukrainian corpus of the Chtyvo library. A universal (or national), unannotated and unsystematized corpus of the Ukrainian language. It contains 6.6 GB of Ukrainian-language texts from the Chtyvo electronic library.

Ukrainian NLI corpus (translation from Stanford SNLI).

Ukrainian Formality corpus (translation from GYAFC, Grammarly’s Yahoo Answers Formality Corpus).

Ukrainian Jigsaw Toxicity Classification dataset (translation from English)

Ukrainian Trends: a daily-updated monitor corpus of news articles. The corpus is made up of news articles, Wikipedia and other sources that are regularly updated from their RSS feeds (newsfeeds), and it grows by about 1 million words each day.

Parallel Corpora

LORELEI Ukrainian Representative Language Pack. It consists of Ukrainian monolingual text, Ukrainian-English parallel and comparable text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium for the DARPA LORELEI program.

MultiParaCrawl is a collection of parallel corpora from web crawls collected in the ParaCrawl project and further processed into a multi-parallel corpus by pivoting via English.
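The idea of pivoting via English can be sketched as a join of two English-aligned bitexts on their shared English side. This is only an illustration under simplifying assumptions, not the MultiParaCrawl pipeline: the function name, the exact-match join on the English sentence, and the choice of German as the second language are all hypothetical.

```python
from collections import defaultdict

def pivot_via_english(en_uk_pairs, en_de_pairs):
    # Index the English-German bitext by its English side.
    de_by_en = defaultdict(list)
    for en, de in en_de_pairs:
        de_by_en[en.strip()].append(de)

    # Assumption: two sentences are linked only when their English
    # counterparts are identical after stripping whitespace.
    aligned = []
    for en, uk in en_uk_pairs:
        for de in de_by_en.get(en.strip(), []):
            aligned.append((uk, de))
    return aligned

en_uk = [("Good morning.", "Доброго ранку."), ("Thank you.", "Дякую.")]
en_de = [("Good morning.", "Guten Morgen."), ("See you.", "Bis bald.")]
print(pivot_via_english(en_uk, en_de))
```

Only sentences whose English side appears in both bitexts survive the join, so a pivoted corpus is typically much smaller than either of its sources.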

INTERCORP. In Intercorp v.16, the Ukrainian texts amount to over 18 million tokens, aligned with originals or translations in Czech and, through Czech, in other languages. The Ukrainian part of Intercorp consists mainly of manually aligned fiction texts, with a smaller dataset of subtitles and the Bible.

MaCoCu: Corpora from the Web. The MaCoCu corpora were built by crawling the relevant national Internet top-level domains in 2021 and 2022, extending the crawl dynamically to other domains as well.

ParaRook||DE-UK - a parallel German-Ukrainian and Ukrainian-German corpus based on GRAC.

OpenSubtitles: multilingual corpora in 58 languages. The OpenSubtitles 2018 parallel corpora are a collection of 60 corpora in 58 languages, made up of translated movie subtitles from https://www.opensubtitles.org/.

Parallel corpus with Russian (Russian National Corpus)

Parallel corpora with English, Polish, French, German, Spanish, Portuguese (Laboratory of Ukrainian)

OPUS is a growing collection of translated texts from the web.

Tatoeba is a large database of sentences and translations, including Ukrainian.

Multi30k dataset: English-Ukrainian version.