Corpora

General Regionally Annotated Corpus of Ukrainian (GRAC) is the largest manually compiled reference corpus of Ukrainian.

Corpus Project of the Laboratory of Ukrainian contains several corpora and a dedicated morphological analyzer. The corpora include a treebank with manual disambiguation and manual tagging (140 thousand tokens), the web corpus "Zvidusil" with automatic syntactic annotation (about 3 billion tokens), and parallel corpora.

Lang-uk corpus project provides collections of Ukrainian online press, fiction, and Wikipedia available for download, totaling 665 million tokens (the UberText corpus); a corpus of laws and legal acts counting 579 million tokens; and a corpus annotated for named entities, along with a model for automatic annotation of named entities (people, organizations, locations, and others). It also offers various gazetteers, a simple tokenizer (splitting text into paragraphs, sentences, and tokens), and vector models trained on different corpora.
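To illustrate what splitting text into paragraphs, sentences, and tokens involves, here is a minimal regex-based sketch. It is not the Lang-uk tokenizer: all function names below are hypothetical, and the rules are deliberately naive (blank lines separate paragraphs, end punctuation followed by a capital letter separates sentences, so abbreviations will be over-split).

```python
import re

# Assumption: a sentence ends at . ! ? or … followed by whitespace and an
# uppercase Latin or Cyrillic letter (including the Ukrainian І, Ї, Є, Ґ).
_SENT_BOUNDARY = re.compile(r"(?<=[.!?…])\s+(?=[A-ZА-ЯІЇЄҐ])")

# A token is a word (optionally joined by apostrophes or hyphens, as in
# "м'ята" or "будь-який") or a single punctuation character.
_TOKEN = re.compile(r"\w+(?:['’ʼ-]\w+)*|[^\w\s]")


def split_paragraphs(text):
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Split a paragraph at the naive sentence boundary defined above."""
    return [s.strip() for s in _SENT_BOUNDARY.split(paragraph) if s.strip()]


def split_tokens(sentence):
    """Return words and punctuation marks as separate tokens."""
    return _TOKEN.findall(sentence)


def tokenize(text):
    """Return a nested list: paragraphs -> sentences -> tokens."""
    return [
        [split_tokens(s) for s in split_sentences(p)]
        for p in split_paragraphs(text)
    ]
```

Calling `tokenize("Привіт, світе! Як справи?\n\nЦе другий абзац.")` yields two paragraphs, the first with two sentences. A production tokenizer would also need to handle abbreviations, initials, URLs, and quotation marks.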

Ukrainian Brown Corpus (BrUK) is an open, genre-balanced corpus of the modern Ukrainian language with a volume of 1 million word usages; annotation is planned for the future. The corpus is modeled on the well-known Brown Corpus of the English language.

UA-GEC is a corpus of texts with marked grammatical errors.

Ukrainian Treebank

Ukrainian Web Corpus is a Ukrainian mixed corpus based on material from 2014. It contains 102,429,857 sentences and 1,546,330,404 tokens.
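As a quick sanity check on these figures, the average sentence length can be derived directly from the two counts. The variable names are illustrative only.

```python
# Counts as reported for the Ukrainian Web Corpus.
sentences = 102_429_857
tokens = 1_546_330_404

# Average sentence length in tokens (roughly 15.1).
avg_tokens_per_sentence = tokens / sentences
print(f"{avg_tokens_per_sentence:.1f} tokens per sentence")
```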

Zvidusil is a web corpus with syntactic annotation (Laboratory of Ukrainian).

Web Corpus Araneum Ucrainicum

Polish Automatic Web corpus of UKrainian language (PAWUK) is a linguistic corpus containing Ukrainian texts acquired from the Internet (selected web pages and social network accounts). It is updated daily and automatically annotated with morphosyntactic tags, syntactic dependencies, and named entities using Stanza with a custom-built model for Ukrainian that produces both Universal Dependencies tags and VESUM morphological tags.

Ukrainian corpus of the Chtyvo library is a universal (or national), unannotated, and unsystematized corpus of the Ukrainian language. It contains 6.6 GB of Ukrainian-language texts from the Chtyvo electronic library.

Parallel Corpora

Parallel corpus with Russian (Russian National Corpus)

Parallel corpora with English, Polish, French, German, Spanish, Portuguese (Laboratory of Ukrainian)

OPUS is a growing collection of translated texts from the web.

Tatoeba is a large database of sentences and translations, including Ukrainian.

Multi30k dataset: English-Ukrainian variation.