Open resources and tools for Ukrainian NLP
- https://github.com/brown-uk/dict_uk — the Large Electronic Dictionary of Ukrainian (VESUM) contains more than 416 thousand lemmas and is continuously updated. It records the inflection of each word and marks non-standard word forms and their alternatives; covers abbreviations and contractions; includes information on some alternative orthographic norms; includes a large database of proper names; is synchronized with the Ukrainian gazetteer, including place names that appeared after decommunization; uses a very compact system of inflection-type markers and tags that makes it easy to update and regroup existing words; and contains data on some rare and colloquial forms, e.g. uncontracted adjectives (гарная) and the colloquial variant of the infinitive (поїхать).
- https://github.com/brown-uk/nlp_uk — a toolkit for processing Ukrainian text, based on the VESUM dictionary and the LanguageTool engine. Supports tokenization, lemmatization, POS tagging and basic disambiguation. Includes a sample implementation in Python 3.
- https://github.com/brown-uk/corpus — BRUK, a balanced 1-million-word corpus of modern Ukrainian; morphological disambiguation is still to be done.
- https://github.com/lang-uk — a part of BRUK annotated for named entities, together with a model built on it for automatic named-entity annotation (people, organizations, locations and others); the UberText corpus, various gazetteers, word vectors, a simple tokenizer (splitting text into paragraphs, sentences and tokens) and other useful resources.
- https://github.com/UniversalDependencies/UD_Ukrainian-IU/tree/master — a dependency treebank for Ukrainian
- https://github.com/kmike/pymorphy2 — a morphological analyzer without disambiguation; Ukrainian is supported via an older version of VESUM.
- https://stanfordnlp.github.io/stanza/ — the Stanford library for language processing; supports Ukrainian with models based on the UD treebank listed above. Provides models for tokenization, lemmatization, POS tagging and syntactic parsing.
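
Universal Dependencies treebanks such as the one listed above are distributed in the tab-separated CoNLL-U format (ten columns per token line, `#` comment lines, blank lines between sentences). As a rough illustration of how such data can be read without extra dependencies, here is a minimal sketch using only the Python standard library. The names `Token`, `read_conllu` and `SAMPLE` are made up for this example, and the sample sentence is hand-written for illustration, not taken from the treebank.

```python
from typing import Iterator, List, NamedTuple


class Token(NamedTuple):
    """A subset of the ten CoNLL-U columns (hypothetical helper type)."""
    id: str
    form: str
    lemma: str
    upos: str
    head: str
    deprel: str


def read_conllu(text: str) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from CoNLL-U formatted text."""
    sentence: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        sentence.append(
            Token(cols[0], cols[1], cols[2], cols[3], cols[6], cols[7])
        )
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Hand-written sample for illustration only (not an excerpt from the treebank).
SAMPLE = "\n".join([
    "# text = Я читаю книгу.",
    "1\tЯ\tя\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tчитаю\tчитати\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tкнигу\tкнига\tNOUN\t_\t_\t2\tobj\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
])

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE):
        for tok in sent:
            print(tok.id, tok.form, tok.lemma, tok.upos, tok.head, tok.deprel)
```

To process a real treebank file, the same function could be applied to the file's contents read as UTF-8 text; for anything beyond quick inspection, a dedicated CoNLL-U library is probably a better choice.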