Tools
Nlp-uk is a tool based on the VESUM dictionary and the LanguageTool engine. It supports tokenization, lemmatization, POS analysis, and basic disambiguation.
Pymorphy2 — a morphological analyzer without disambiguation; Ukrainian is supported via an older version of VESUM.
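A minimal usage sketch, assuming pymorphy2 and the pymorphy2-dicts-uk dictionary package are installed; since the analyzer does no disambiguation, every analysis is returned with a score:
```python
import pymorphy2  # requires the pymorphy2-dicts-uk package for Ukrainian support

morph = pymorphy2.MorphAnalyzer(lang="uk")
# parse() returns all analyses; no contextual disambiguation is performed
for parse in morph.parse("мови"):
    print(parse.normal_form, parse.tag, parse.score)
```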
Stanza — the Stanford library for language processing; it supports Ukrainian using the UD corpus. It provides models for tokenization, lemmatization, POS tagging, and syntactic analysis.
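A minimal sketch of the Stanza pipeline for Ukrainian (the first call downloads the UD-based models):
```python
import stanza

stanza.download("uk")  # fetch the Ukrainian models once
nlp = stanza.Pipeline("uk", processors="tokenize,pos,lemma,depparse")
doc = nlp("Київ є столицею України.")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.upos, word.deprel)
```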
LanguageTool — a spelling, style, and grammar checker that helps correct and paraphrase texts.
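A small sketch using the third-party language_tool_python wrapper (an assumption; LanguageTool itself is a Java server with an HTTP API):
```python
import language_tool_python

tool = language_tool_python.LanguageTool("uk-UA")
text = "Це речення містить помилки ."
for match in tool.check(text):
    print(match.ruleId, match.message, match.replacements)
print(tool.correct(text))  # apply the top suggestion for every match
```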
Stemmer for Ukrainian language — a new stemmer for the Ukrainian language (tree_stem) created via machine learning.
EdUKate translation software 1 — a software package that includes three tools: a web frontend for machine translation featuring phonetic transcription of Ukrainian suitable for Czech speakers, an API server, and a tool for translating documents with markup (HTML, DOCX, ODT, PPTX, ODP, ...).
HENSOLDT ANALYTICS — services for speech-to-text, language identification, sentiment analysis, named entity detection, keyword spotting, age detection, gender detection, and summarization.
UDPipe 2 is a Python prototype capable of performing tagging, lemmatization, and syntactic analysis of CoNLL-U input.
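UDPipe is also exposed as a REST service; a hedged sketch against the LINDAT endpoint follows (the model name below is an assumption; the service's models endpoint lists the available identifiers):
```python
import requests

response = requests.post(
    "https://lindat.mff.cuni.cz/services/udpipe/api/process",
    data={
        "model": "ukrainian-iu-ud",  # hypothetical short name; query /api/models for exact identifiers
        "tokenizer": "",
        "tagger": "",
        "parser": "",
        "data": "Київ є столицею України.",
    },
)
print(response.json()["result"])  # CoNLL-U output
```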
English-Ukrainian Legal Crosslingual Word Embeddings — embeddings trained on legal-domain texts and aligned into a shared vector space with VecMap according to their similarity. The embeddings were developed in the framework of the CEF project MT4ALL.
Collins Multilingual Database (MLD) - WordBank covers real-life daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank) and a multilingual set of sentences in 28 languages (the PhraseBank, distributed separately under reference ELRA-T0377). The WordBank contains 10,000 words per language, XML-annotated for part of speech, gender, irregular forms, and disambiguating information for homographs.
Text to Terminological Concept System 1.1.2 extracts terms, concepts, and concept relations and represents them in a terminological concept system, building on a prespecified relation typology: generic, partitive, activity, associative, causal, spatial, instrumental, origination, and property relations. Synonyms are detected and grouped in the output format (text and TBX/XML).
Beey is an online tool for converting audio and video (recordings from meetings, interviews, online files) to text. It also includes a text editing and formatting environment. It supports 18 languages and includes a speaker detection feature.
CheckTerm can enforce terminology in any text editor simply via the clipboard. For SDL Trados Studio, Adobe InDesign, and MS Word, it even comes with a special plugin. The check not only covers terms stored as forbidden terms in the termbase, but also identifies incorrect variants, forms, compositions, etc.
NEWTON SpeechGrid is a complete workflow for the automatic transcription and processing of audio recordings. It includes a powerful cloud infrastructure with APIs for online speech recognition. The Newton technologies can recognize and transcribe speech in 18 languages (Bosnian, Bulgarian, Montenegrin, and Macedonian are partly supported).
Neticle Media Intelligence is a media monitoring, media analysis, and social listening system. It monitors the web in real time to find every mention of your brand, product, company, or competitors. Neticle's system analyses every mention and recognizes the texts' positive or negative tone and the key topics, persons, brands, attributions, or locations mentioned in them. It recognises the gender of the writer and discovers insights and trends.
Neticle Text Analysis API converts data from unstructured to structured with a sentiment and semantic analysis toolkit for in-house corporate use. The system is keyword-based, which means you can add keywords and their synonyms; it will find every text that contains the given word and analyse it based on the keyword's context.
PROMT Analyzer SDK provides entity recognition, sentiment analysis, and language detection APIs, allowing natural language processing to be easily integrated into your applications and services.
SentiOne — a Conversational AI Platform used for improving customer service automation based on social listening and data analytics. It monitors various web sources, such as social media, news, blogs, and forums in the chosen country, and analyzes mentions. The SentiOne conversational bots are based on an industry-agnostic NLP engine. They understand free speech, not just predefined phrases, and provide suggested answers based on deep learning analysis.
AX Semantics Natural Language Generation is a self-service Natural Language Generation (NLG) software that makes the writing of content scalable. AX Semantics automates the writing of content across sectors, from e-commerce to pharma, finance, and banking.
VoiceOverMaker is an online text-to-speech converter. It offers more than 180 voices in more than 30 languages and language variants. The editor allows users to create and edit high-quality voice-overs for video or to create audio files in MP3 or WAV format.
Embeddings trained on CONLL2017 Corpora were trained with finalfrontier on the CONLL2017 corpora with more than 100M tokens. For all languages, the embeddings were trained with the skip-gram and structgram algorithms and contain subword n-grams.
Event Registry is a system for real-time collection, annotation, and analysis of content published by global news outlets.
IntelliDockers Adaptable Text Analytics Engines is a Natural Language Processing technology based on deep learning AI, delivered as self-contained, adaptable Docker engines. These engines are deployed on the customer's isolated infrastructure, so the data to be processed never leaves it, making them suitable for processing very sensitive information.
Inbenta Chatbot offers a platform for building enterprise chatbots based on AI, machine learning, and Inbenta's natural language processing engine.
Smodin is on a mission to make everyday applications available to every language. It is primarily focused on helping students, writers, educators, and virtual workers with their everyday work.
BUET CSE NLP Group Text Summarization Tool contains the mT5 checkpoint fine-tuned on the 45 languages of the XL-Sum dataset.
Gavagai Explorer is a customer review analysis tool. Explorer can analyze texts in any of 46 languages (more can be added on request). The texts are analyzed automatically and the results are presented in interactive, shareable dashboards.
Cogito Discover Language Detection comprehends and annotates natural language processing (NLP) data. It also offers the best NLP annotation tools for your computers, applications, and machine learning models to easily comprehend human languages and gain insight from text and audio data.
Ispell is an interactive spell-checking program for Unix which supports a large number of European languages. An Emacs interface is available as well as the standard command-line mode.
NLP Cube is an open-source Natural Language Processing framework with support for languages that are included in the UD Treebanks. NLP-Cube performs the following tasks: sentence segmentation, tokenization, POS tagging (both language-independent (UPOSes) and language-dependent (XPOSes and ATTRs)), lemmatization, and dependency parsing.
Text Tonsorium — a tool for the automatic construction and execution of workflows, including normalisation.
spaCy is a library for advanced Natural Language Processing in Python and Cython. spaCy comes with pretrained statistical models and word vectors, and currently supports tokenization for 60+ languages but it can also be used to train your own pipelines. It features convolutional neural network models for tagging, parsing and named entity recognition and easy deep learning integration.
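A minimal sketch with the Ukrainian pipeline, assuming it was installed beforehand (for example with python -m spacy download uk_core_news_sm):
```python
import spacy

nlp = spacy.load("uk_core_news_sm")
doc = nlp("Тарас Шевченко народився в Україні.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)
for ent in doc.ents:
    print(ent.text, ent.label_)
```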
trankit is a lightweight, transformer-based Python toolkit for multilingual natural language processing.
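A sketch of the trankit API, assuming the Ukrainian UD model is available (the first instantiation downloads it):
```python
from trankit import Pipeline

p = Pipeline("ukrainian")  # downloads the pretrained Ukrainian model on first use
text = "Київ є столицею України."
parsed = p.posdep(text)      # tokenization, POS tagging, dependency parsing
lemmas = p.lemmatize(text)   # lemmatization
print(parsed)
print(lemmas)
```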
UDify Pretrained Model — weights for the UDify model, plus extracted BERT weights in pytorch-transformers format.
GATE: Multilingual OCR detects areas of text, extracts the text from each area, and then determines the language of each block of text.
Search Tool for Dependency Graphs is a tool for searching for morpho-syntactic constructions in dependency graphs.
LASER is a library to calculate and use multilingual sentence embeddings. The toolkit works with more than 90 languages, including low-resource languages, written in 28 different alphabets. It can be used to transfer natural language processing (NLP) applications originally developed for a single language to many more languages.
YALI is a tool for language identification with pretrained models for 122 languages. It is available as a Perl CPAN module, Lingua::YALI, under the BSD licence.
fastText Common Crawl & Wikipedia contains pre-trained word vectors for 157 languages, trained on Wikipedia and the Common Crawl using fastText's CBOW model.
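A sketch assuming the Ukrainian vectors (cc.uk.300.bin) have already been downloaded from the fastText site:
```python
import fasttext

model = fasttext.load_model("cc.uk.300.bin")
vector = model.get_word_vector("мова")  # 300-dimensional vector; OOV words are built from subword n-grams
print(vector.shape)
print(model.get_nearest_neighbors("мова", k=5))
```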
HeLI-OTS 2.0 is a language identifier with language models for 220 languages. It can identify roughly 600-1700 sentences (averaging about 150 characters) per second from a file, using one core and around 4.3 gigabytes of memory on a modern laptop.
It-Sr-NER is a CLARIN compatible NER web service for parallel texts with case study on Italian and Serbian; it can be used for recognizing and classifying named entities in bilingual natural language texts.
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting, without the need for fine-tuning.
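A minimal transcription sketch with the openai-whisper package; the checkpoint size and audio path are placeholders:
```python
import whisper

model = whisper.load_model("small")  # other sizes: tiny, base, medium, large
result = model.transcribe("speech.mp3", language="uk", task="transcribe")
print(result["text"])
```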
General Text Embedding Models is the GTE (General Text Embedding) family of models. It achieves state-of-the-art (SOTA) results in multilingual retrieval tasks and multi-task representation model evaluations when compared to models of similar size. It is trained using an encoder-only transformer architecture, resulting in a smaller model size.
bge-m3-korean maps sentences and paragraphs to a 1024-dimensional dense vector space. Model type: Sentence Transformer. Maximum sequence length: 8192 tokens. Output dimensionality: 1024. Similarity function: cosine similarity.
Multilingual E5 Text Embeddings is initialized from xlm-roberta-large and continually trained on a mixture of multilingual datasets. It supports 100 languages from xlm-roberta, but low-resource languages may see performance degradation. This model has 24 layers and the embedding size is 1024.
Aya 23 is an open weights research release of an instruction fine-tuned model with highly advanced multilingual capabilities. Aya 23 focuses on pairing a highly performant pre-trained Command family of models with the recently released Aya Collection. The result is a powerful multilingual large language model serving 23 languages.
MiniLM-L12-v2 — a sentence-transformers model that maps sentences and paragraphs to a 384-dimensional dense vector space.
XLM-RoBERTa is a multilingual version of RoBERTa. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.
Passage Reranking Multilingual BERT is trained using the Microsoft MS Marco Dataset. This training dataset contains approximately 400M tuples of a query, relevant and non-relevant passages.
BERT is a transformers model pretrained on a large corpus of multilingual data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.
DistilBERT is a distilled version of the BERT base multilingual model. The model is trained on the concatenation of Wikipedia in 104 different languages. The model has 6 layers, 768 dimensions, and 12 heads, totalling 134M parameters (compared to 177M parameters for mBERT-base).
XLM model was proposed in Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau, trained on Wikipedia text in 100 languages. The model is a transformer pretrained using a masked language modeling (MLM) objective.
CANINE is pretrained on 104 languages using a masked language modeling (MLM) objective. It doesn't require an explicit tokenizer (such as WordPiece or SentencePiece), unlike models such as BERT and RoBERTa.
mGPT is a GPT-like model with 1.3 billion parameters, trained on 61 languages from 25 language families using Wikipedia and the Colossal Clean Crawled Corpus.
BLOOMZ is a family of models capable of following human instructions in dozens of languages zero-shot. It is fine-tuned on the crosslingual task mixture (xP3). The resulting models are capable of crosslingual generalization to unseen tasks and languages.
LEALLA is a collection of lightweight language-agnostic sentence embedding models supporting 109 languages, distilled from LaBSE. The model is useful for getting multilingual sentence embeddings and for bi-text retrieval.
Massively Multilingual Speech (MMS) is a model fine-tuned for multi-lingual ASR and part of Facebook's Massive Multilingual Speech project. This checkpoint is based on the Wav2Vec2 architecture and makes use of adapter models to transcribe 1000+ languages.
Nemotron-3-8B is an 8 billion parameter generative language model instruct-tuned on an 8B base model. It takes input with context length up to 4,096 tokens. The model has been customized using the SteerLM method developed by NVIDIA to allow for user control of model outputs during inference.
mHuBERT-147 is a compact and competitive multilingual HuBERT model trained on 90K hours of open-license data in 147 languages. Unlike traditional HuBERT models, mHuBERT-147 is trained using faiss IVF discrete speech units. Training employs two-level (language, data source) up-sampling.
T5 is an encoder-decoder model based on mT5-base that was trained on multi-language natural language inference datasets as well as on multiple text classification datasets. The model demonstrates a better contextual understanding of the text and the verbalized label, because the two inputs are encoded by different parts of the model (the encoder and the decoder, respectively). The zero-shot classifier supports nearly 100 languages and can work in both directions, meaning that labels and text can belong to different languages.
ColBERT-XM is a ColBERT model that can be used for semantic search in many languages. It encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
ukr-paraphrase-multilingual-mpnet-base is a sentence-transformers model fine-tuned for the Ukrainian language. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
AviLaBSE is a unified model trained on top of Google's LaBSE to add further low-resource languages, then converted to PyTorch. It can be used to map more than 250 languages to a shared vector space. The pre-training process combines masked language modeling with translation language modeling. The model is useful for getting multilingual sentence embeddings and for bi-text retrieval.
UForm 3 is a tiny vision and multilingual language encoder, covering 21 languages, mapping them into a shared vector space. This model produces up to 256-dimensional embeddings and is made of: 1) Text encoder: 12-layer BERT for up to 50 input tokens; 2) Visual encoder: ViT-B/16 for images of 224 x 224 resolution.
Forced Alignment provides an efficient way to perform forced alignment between text and audio using Hugging Face's pretrained models. It also features an improved implementation that uses much less memory than the TorchAudio forced alignment API.
LLaMAX is a language model with powerful multilingual capabilities, achieved without loss of instruction-following capability.
GPT2 124M Trained on Ukrainian Fiction is a model trained on a corpus of 4,040 fiction books, 2.77 GiB in total. Evaluation on brown-uk gives a perplexity of 50.16.
NER_FEDA is a multilingual NER system trained using a Frustratingly Easy Domain Adaptation architecture. It is based on LaBSE and supports different tagsets, all using the IOBES format.
mT5-m2o contains the many-to-one (m2o) mT5 checkpoint fine-tuned on all cross-lingual pairs of the CrossSum dataset where the target summary was in English; i.e., the model tries to summarize text written in any language into English.
RemBERT is pretrained on 110 languages using a masked language modeling (MLM) objective. RemBERT uses small input embeddings and larger output embeddings.
LaBSE is a BERT-based model trained for sentence embedding for 109 languages. The pre-training process combines masked language modeling with translation language modeling. The model is useful for getting multilingual sentence embeddings and for bi-text retrieval.
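A sketch of cross-lingual sentence similarity with the sentence-transformers port of LaBSE:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")
embeddings = model.encode(["The cat sits on the mat.", "Кіт сидить на килимку."])
# a high cosine similarity indicates a likely translation pair (useful for bi-text mining)
print(util.cos_sim(embeddings[0], embeddings[1]))
```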
Frequently Asked Questions classifier is trained to determine whether a question/statement is an FAQ, in the domain of products, businesses, website FAQs, etc.
mT5-m2m-CrossSum contains the many-to-many (m2m) mT5 checkpoint fine-tuned on all cross-lingual pairs of the CrossSum dataset. The model tries to summarize text written in any language into the provided target language. It covers 45 languages.
XLM-RoBERTa is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data. This model is XLM-RoBERTa-large fine-tuned with the conll2003 dataset in English. XLM-RoBERTa is a multilingual model trained on 100 different languages.
NVIDIA Streaming Citrinet 1024 (uk) transcribes speech in the lowercase Ukrainian alphabet, including spaces and apostrophes, and is trained on 69 hours of Ukrainian speech data. It is a non-autoregressive "large" variant of Streaming Citrinet, with around 141 million parameters.
Dialectal Arabic XLM-R Base is a repository of the language model used for "AdaSL: An Unsupervised Domain Adaptation framework for Arabic multi-dialectal Sequence Labeling", the state-of-the-art method for sequence labeling on multi-dialectal Arabic.
RoBERTa for NER was fine-tuned on 375,100 sentences in the training set, with a validation set of 173,100 examples. Reported performance metrics are based on an additional 173,100 examples. The complete WikiANN dataset includes training examples for 282 languages and was constructed from Wikipedia.
Ukrainian model to restore punctuation and capitalization in sentences, trained on 10m+ sentences from UberText 2.0 corpus.
multilingual_en_ru_uk is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. The model is used in a resource for multilingual analysis of patient complaints to determine which medical specialty is needed in a given case (Virtual General Practice).
Bernice is a multilingual pre-trained encoder exclusively for Twitter data. It was trained on 2.5 billion tweets with 56 billion subwords in 66 languages (as identified in Twitter metadata). The tweets were collected from the 1% public Twitter stream between January 2016 and December 2021.
Ukrainian flair embeddings - a model trained for 25+ epochs on texts from UberText 2.0 (WIP). It has forward and backward versions of the embeddings.
TwHIN-BERT is a new multi-lingual Tweet language model that is trained on 7 billion Tweets from over 100 distinct languages. TwHIN-BERT differs from prior pre-trained language models as it is trained with not only text-based self-supervision (e.g., MLM), but also with a social objective based on the rich social engagements within a Twitter Heterogeneous Information Network (TwHIN).
uk_ner_web_trf_base is a fine-tuned XLM-RoBERTa model that is ready to use for Named Entity Recognition and achieves performance close to the state of the art for the NER task for the Ukrainian language. It has been trained to recognize four types of entities: location (LOC), organizations (ORG), person (PERS), and miscellaneous (MISC).
flair-uk-pos is a Flair model that is ready to use for part-of-speech (UPOS) tagging. It is based on Flair embeddings trained for the Ukrainian language and has superior performance and a very small size (just 72 MB).
X-GENRE classifier is based on xlm-roberta-base and fine-tuned on a multilingual, manually annotated X-GENRE genre dataset. The model can be used for automatic genre identification, applied to text in any supported language.
X-MOD is a multilingual masked language model trained on filtered CommonCrawl data containing 81 languages. It was introduced in the paper Lifting the Curse of Multilinguality by Pre-training Modular Transformers (Pfeiffer et al., NAACL 2022).
sentence_boundary_detection_multilang segments a long, punctuated text into one or more constituent sentences. The key feature is that the model is multi-lingual and language-agnostic at inference time. Supports 49 common languages.
uk_core_news is a Ukrainian spaCy pipeline optimized for CPU. Components: tok2vec, morphologizer, parser, senter, ner, attribute_ruler, lemmatizer.
psychology_test is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
best-unlp — a model trained by the Pravopysnyk team for the Ukrainian NLP shared task on Ukrainian grammar correction. The model is mBART-50-large set up for a ukr-to-ukr translation task and fine-tuned on UA-GEC augmented with a custom dataset produced by the team's synthetic error generation.
punct_cap_seg_47_language accepts as input lower-cased, unpunctuated, unsegmented text in 47 languages and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation).
coref-ua is trained on the silver Ukrainian coreference dataset using the F-Coref library. The model was trained on top of the XLM-RoBERTa-base model. According to the metrics obtained on the evaluation dataset, the model is more precision-oriented.
fastText (Ukrainian) — an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to fit even on mobile devices. fastText is a library for efficient learning of word representations and sentence classification.
mGPT 13B — a multilingual language model trained on 61 languages from 25 language families. The model was pretrained on 600 GB of texts, mostly from mC4 and Wikipedia.
xlm-roberta_punctuation_fullstop_truecase restores punctuation, true-cases (capitalizes), and detects sentence boundaries (full stops) in 47 languages.
LaBSE returns the sentence embeddings (pooler_output) and implements caching. Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages. The pre-training process combines masked language modeling with translation language modeling. The model is useful for getting multilingual sentence embeddings and for bi-text retrieval.
MLongT5 is an encoder-decoder transformer pre-trained in a text-to-text denoising generative setting (Pegasus-like generation pre-training). MLongT5 model is an extension of LongT5 model, and it enables using one of the two different efficient attention mechanisms - (1) Local attention, or (2) Transient-Global attention.
SpeechT5 is a version of SpeechT5 fine-tuned for the Ukrainian language using the Common Voice dataset.
Ukrainian mGPT 1.3B is one of the models derived from the base mGPT-XL (1.3B) model, which was originally trained on 61 languages from 25 language families using Wikipedia and the C4 corpus.
Vxr-Z is developed with state-of-the-art advances in natural language processing and represents a pivotal leap forward in text understanding, interpretation, and generation. The model is a sophisticated neural architecture, meticulously trained on vast and diverse textual datasets encompassing a multitude of languages, topics, and genres.
Passage Reranking Multilingual BERT supports over 100 languages. This module takes a search query and a passage and calculates whether the passage matches the query. It can be used to improve Elasticsearch results and boosts relevancy by up to 100%.
xlm-r-parla is a result of the ParlaMint project. The first application of this model is the XLM-R-parlasent model, fine-tuned on the ParlaSent dataset for the task of sentiment analysis in parliamentary proceedings.
GlotLID is a fastText language identification (LID) model that supports more than 2000 labels. GlotLID is now updated to V3.
VoxLingua107 ECAPA-TDNN is a spoken language recognition model trained on the VoxLingua107 dataset using SpeechBrain. The model uses the ECAPA-TDNN architecture that has previously been used for speaker recognition. However, it uses more fully connected hidden layers after the embedding layer, and cross-entropy loss was used for training.
SONAR_200_text_encoder supports the same 202 languages as NLLB-200. Its embeddings are expected to be equal to those of the official implementation, but the latter remains the source of truth.
Mistral 7B OpenOrca oasst Top1 contains AWQ, GPTQ, and GGUF model files. The model is designed for text generation tasks.
LaBSE is a port of the LaBSE model to PyTorch. It can be used to map 109 languages to a shared vector space.
mBart-large-50-verbalization is designed for the task of verbalizing Ukrainian text to prepare it for Text-to-Speech (TTS) systems. The model aims to transform structured data like numbers and dates into their fully expanded textual representations in Ukrainian.
O3ap-sm is a Ukrainian news summarization model fine-tuned on the T5-small architecture. The model has been trained on the Ukrainian Corpus CCMatrix for text summarization tasks.
DPR-XM is a multilingual dense single-vector bi-encoder model. It maps questions and paragraphs to 768-dimensional dense vectors and can be used for semantic search. The model uses an X-MOD backbone, which allows it to learn from monolingual fine-tuning in a high-resource language, like English, and perform zero-shot retrieval across multiple languages.
AISAK-Listen is a general-purpose AI system comprising various models designed for different tasks. This model is fine-tuned on extensive datasets to excel in converting spoken language into written text. It is intended to be a versatile tool for various applications such as transcription services, voice assistants, voice-controlled systems, and more.
CodeKobzar13B is a generative model that was trained on Ukrainian Wikipedia data and Ukrainian language rules. It has knowledge of Ukrainian history, language, literature and culture.
Mono-XM is a multilingual cross-encoder model. It performs cross-attention between a question-passage pair and outputs a relevance score between 0 and 1. The model should be used as a reranker for semantic search.
LinkBERT-XL is a fine-tuned version of XLM-RoBERTa Large specialising in binary token classification for the purpose of link (anchor text) prediction in plain text. This binary classification model excels in identifying distinct token ranges that web authors are likely to choose as anchor text for links. By analyzing never-before-seen texts, LinkBERT can predict areas within the content where links might naturally occur, effectively simulating web author behavior in link creation.
electraForCausalLM is trained to generate a text description in Ukrainian of spare parts for agricultural machinery based on their name.
BERGAMOT is a multilingual model pre-trained on UMLS (version 2020AB) using a Graph Attention Network (GAT) encoder.
pmmlv2-fine-tuned-yoruba is a Yoruba fine-tuned LLM using sentence-transformers. Yoruba words typically consist of various combinations of vowels and consonants. The Yoruba language has a rich phonetic structure, including eighteen consonants and seven vowels.
pmmlv2-fine-tuned-igbo is an Igbo fine-tuned LLM using sentence-transformers. Igbo words, like those in Yoruba, are composed of different combinations of vowels and consonants. The Igbo language has a complex phonetic system featuring twenty-eight consonant sounds and eight vowels.
pmmlv2-fine-tuned-hausa is a Hausa fine-tuned LLM using sentence-transformers. Hausa words typically comprise diverse blends of vowels and consonants. The Hausa language boasts a vibrant phonetic framework featuring twenty-three consonants, five vowels, and two diphthongs.
pmmlv2-fine-tuned-flemish is a Flemish fine-tuned LLM using sentence-transformers. Flemish words typically consist of various combinations of vowels and consonants. The Flemish language has a diverse phonetic structure, including twenty-two consonants, twelve vowels, and some diphthongs.
HPLT Bert for Ukrainian is one of the encoder-only monolingual language models trained as a first release by the HPLT project. It is a so-called masked language model. In particular, this model is the modification of the classic BERT model named LTG-BERT.
Web register classification is a multilingual web register classifier, fine-tuned from XLM-RoBERTa-large. The model is trained with the multilingual CORE corpora across five languages (English, Finnish, French, Swedish, Turkish) to classify documents based on the CORE taxonomy. It can predict labels for the 100 languages covered by XLM-RoBERTa-large. The model achieves state-of-the-art performance in classifying web registers for the trained languages and has strong transfer performance. It is designed to support the development of open language models and for linguists analyzing register variation.
Bedrock Titan Text Embeddings v2 can be used either via the Bedrock InvokeModel API or via Bedrock's batch jobs. For RAG use cases, the former is recommended for embedding queries during search (latency-optimized) and the latter for indexing the corpus (throughput-optimized).
Llama-2-7b-Ukrainian is a bilingual pre-trained model supporting Ukrainian and English. It was continually pre-trained from Llama-2-7b on 5B tokens consisting of 75% Ukrainian documents and 25% English documents from CulturaX.
Backyard AI makes it easy to start chatting with AI using your own characters or one of the many found in the built-in character hub. It supports advanced features such as lorebooks, author's note, text formatting, custom context size, sampler settings, grammars, local TTS, cloud inference, and tethering, all implemented in a way that is straightforward and reliable.
LLaMAX3-8B is a multilingual base language model, developed through continued pre-training on Llama3, that supports over 100 languages. LLaMAX3-8B can serve as a base model for downstream multilingual tasks but has no instruction-following capability. The model is designed for text generation tasks.
LiBERTa is a BERT-like model pre-trained from scratch exclusively for Ukrainian. It was presented during the UNLP @ LREC-COLING 2024.
mT0-XL-detox-orpo is a multilingual 3.7B text detoxification model for 9 languages built on TextDetox 2024 shared task based on mT0-XL. The model shows state-of-the-art performance for the Ukrainian language, top-2 scores for Arabic, and near state-of-the-art performance for other languages. It is designed for Text-to-Text Generation tasks.
tree_stem is a repository that introduces a new stemmer for the Ukrainian language created via machine learning. It outperforms all other stemmers available to date as well as some lemmatizers by the error rate relative to truncation (ERRT) (Paice 1994). It also has the lowest percentage of understemming errors compared to the available stemming algorithms. This repository also contains Python ports of some of the previously published stemmers.
aya-101 is a massively multilingual generative language model that follows instructions in 101 languages. Aya outperforms mT0 and BLOOMZ on a wide variety of automatic and human evaluations despite covering double the number of languages.
XGLM is a family of multilingual generative language models pretrained on a balanced corpus covering a diverse set of languages, with few- and zero-shot learning capabilities in a wide range of tasks. The 7.5-billion-parameter variant sets a new state of the art in few-shot learning on more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning and natural language inference.
uk4b — models pretrained on 4B tokens from UberText 2.0; designed for Text Generation, Text-Conditioned Metadata Prediction tasks.
haloop is a speech agent toolkit that makes it possible to initialize models, train acoustic models, train and evaluate language models, score log probabilities of sentences under the GPT language model, compare labels in datasets using word error rate, etc.
Ukrainian Roberta was trained with the code provided in the HuggingFace tutorial. The currently released model follows the roberta-base-cased architecture (12 layers, 768 hidden units, 12 heads, 125M parameters).
MITIE NER Model — a model that automatically labels words in unfamiliar texts with the corresponding entities (names, geographical locations, companies, etc.). For NER, the MITIE library has been chosen. MITIE also provides high quality by combining standard text features and CCA embeddings.
uk_ner_web_trf_large — a fine-tuned XLM-RoBERTa model that is ready to use for Named Entity Recognition and achieves state-of-the-art performance for the NER task for the Ukrainian language. It has been trained to recognize four types of entities: location (LOC), organizations (ORG), person (PERS), and miscellaneous (MISC).
Flair-uk-ner — a model that is ready to use for Named Entity Recognition. It recognizes four types of entities: location (LOC), organizations (ORG), person (PERS), and miscellaneous (MISC). The model was fine-tuned on the NER-UK dataset released by lang-uk.
skipgram.uk.300.bin contains pre-trained word vectors for the Ukrainian language, trained with fastText on the (yet unreleased) UberText 2.0 dataset, collected and processed by lang-uk.
Word embeddings (Word2Vec, GloVe, LexVec) — separate models with 300d vectors for newswire, articles, fiction, juridical texts.
BPEmb — a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.
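A sketch of loading the Ukrainian subword embeddings with the bpemb package (the 50k-vocabulary, 300-dimensional variant is assumed to be among the published sizes):
```python
from bpemb import BPEmb

bpemb_uk = BPEmb(lang="uk", vs=50000, dim=300)  # downloads the model on first use
print(bpemb_uk.encode("мовознавство"))           # BPE subword segmentation
print(bpemb_uk.embed("мовознавство").shape)      # one 300-d vector per subword
```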
uk-punctcase — a fine-tuned version of the XLM-RoBERTa-Uk model, trained on Ukrainian texts to recover punctuation and case.
Ukrainian model to restore punctuation and capitalization is the NeMo model to restore punctuation and capitalization in sentences, trained on 10m+ sentences from UberText 2.0 corpus.
ukrainian-word-stress — this package takes text in Ukrainian and adds the stress mark after an accented vowel. This is useful in speech synthesis applications and for preparing text for language learners.
HelpUkraineBot — a chatbot provided as Latvia's assistance to Ukraine.
OPUS-tools is a collection of tools for searching and downloading OPUS data.
OpusFilter is a tool for filtering and combining parallel corpora.
Hensoldt Analytics — an automatic speech recognition (speech-to-text) engine that transcribes audio of spoken sentences into text with timestamps and confidence scores, in a variety of languages.
Machine Translation
Tilde MT Machine Translation engine 1.0.0 — a custom neural networks machine translation engine.
The English-Ukrainian Legal Translation Model is a neural translation model trained via unsupervised machine translation using Monoses. The model has been developed in the framework of the CEF project MT4ALL.
HelsinkiNLP - OPUS-MT 1.0.0 is a multilingual machine translation system using neural networks.
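Many OPUS-MT models are published on the Hugging Face Hub and can be used through the MarianMT classes; a sketch for the Ukrainian-English pair:
```python
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-uk-en"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)
batch = tokenizer(["Київ є столицею України."], return_tensors="pt", padding=True)
print(tokenizer.batch_decode(model.generate(**batch), skip_special_tokens=True))
```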
Apptek Machine Translation 1.0.0 offers state-of-the-art machine translation technology, including neural network-based translation capabilities that support modern architectures. It provides fast, scalable, and high-quality translation across multiple languages. The models are trained on large amounts of public and proprietary data, and cover a wide range of data types and domains. It also offers customized solutions for fast domain adaptation to customer translation memory data. The system can also handle code switching within sentences, such as a sentence written in a mix of Ukrainian and Russian.
Multilizer Localization Tools 1.0.0 are the easiest way to create and manage multilingual versions of software, documents, webpages and other content. With highly usable editor features, dictionaries and validations, the focus can be on the essential: translation.
PROMT.One is a free online translator based on PROMT Neural MT technology. It enables users to translate words and idioms, single phrases, and whole texts in different languages. The translator provides the best translation when you choose the appropriate topic.
PROMT Neural Machine Translation is a hybrid technology that combines a neural network approach and rule-based machine translation (RBMT). The PROMT Neural algorithms pre-analyze the text and decide which technology is best suited for translating a particular piece of text. It is offered in different packages as a desktop solution or server-based and for mobile devices.
Lingea Machine Translation offers the possibility of integrating MT engines into web applications, information systems, or DMS. The MT engines are directly adaptable to the specific field and style of the translated texts. The system preserves the original format of translated documents.
iTranslate Website Translation is a machine translation system for websites.
ModernMT is a context-aware, incremental, and distributed general-purpose neural machine translation technology based on the Fairseq Transformer model.
SDL Machine Translation is an enterprise-grade solution for neural machine translation. SDL MT provides two deployment options: Cloud (BeGlobal) or Server (ETS) for deploying MT on premises or as a private cloud.
SYSTRAN Pure Neural Server is powered by SYSTRAN's neural machine translation engine (PNMT™). It can be deployed on a corporate intranet or extranet. It supports unlimited user access and millions of translations per day, and seamlessly integrates with any business application and document workflow to help enterprises handle day-to-day multilingual challenges in collaboration, content management, eCommerce, customer support, business intelligence, knowledge management, eDiscovery, and procedural workflows.
Tradukka translator (Spanish-X)
Bing Translate is an online machine translation service.
Google Translate is a machine translation service. For West Frisian, Google Translate offers the type, write, and see translation services; the talk, snap, and offline services are not included.
Moses Web Demo is an interactive web demo of selected ÚFAL MT systems.
NLLB-200 is a machine translation model primarily intended for research in machine translation, especially for low-resource languages. It allows for single-sentence translation among 200 languages.
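A sketch using the distilled 600M checkpoint via the transformers translation pipeline; language codes follow the FLORES-200 convention:
```python
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="ukr_Cyrl",
    tgt_lang="eng_Latn",
)
print(translator("Київ є столицею України.")[0]["translation_text"])
```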
SeamlessM4T is a foundational all-in-one Massively Multilingual and Multimodal Machine Translation model delivering high-quality translation for speech and text in nearly 100 languages.
M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for many-to-many multilingual translation. The model can directly translate between the 9,900 directions of 100 languages.
mBART is fine-tuned for multilingual machine translation. It was introduced in Multilingual Translation with Extensible Multilingual Pretraining and Finetuning paper.
MADLAD-400-3B-MT is a multilingual machine translation model based on the T5 architecture that was trained on 1 trillion tokens covering over 450 languages using publicly available data.
COMET receives a triplet (source sentence, translation, reference translation) and returns a score that reflects the quality of the translation compared to both source and reference. The model is intended to be used for MT evaluation.
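A sketch of scoring a translation with the unbabel-comet package; the wmt22-comet-da checkpoint is assumed here as one of the released reference-based models:
```python
from comet import download_model, load_from_checkpoint

checkpoint = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(checkpoint)
data = [{
    "src": "Київ є столицею України.",
    "mt": "Kyiv is the capital of Ukraine.",
    "ref": "Kyiv is Ukraine's capital.",
}]
# returns segment-level scores and a system-level score
print(model.predict(data, batch_size=8, gpus=0))
```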
SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text. SeamlessM4T covers: 1) 101 languages for speech input; 2) 96 Languages for text input/output; 3) 35 languages for speech output.
Flores101 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation.
SMALL-100 Model is a compact and fast massively multilingual machine translation model covering more than 10K language pairs that achieves competitive results with M2M-100 while being much smaller and faster.
NLLB-MoE — the available checkpoints require around 350 GB of storage. Make sure to use accelerate if you do not have enough RAM on your machine.
EuroGPT2 — a model for European languages (EU-24 plus Ukrainian). The model follows the same architecture as OpenAI's GPT-2, apart from using rotary instead of learned positional embeddings. Training data: Wikimedia dumps (Wikipedia, Wikinews, Wikibooks, Wikisource, Wikivoyage; 20230301). Tokens: 75,167,662,080.
xCOMET is an evaluation model that is trained to identify errors in sentences along with a final quality score, thus providing an explainable neural metric. This is the XXL version with ~10.7B parameters.
SynEst Translation Models are machine translation models focused on translating from and into the Estonian language. The models are based on the NLLB-1.3B multilingual model.
Dragoman is a sentence-level SOTA English-Ukrainian translation model. It is trained using a two-phase pipeline: pretraining on the cleaned ParaCrawl dataset and an unsupervised data selection phase on turuta/Multi30k-uk.
NaSE is a domain-adapted multilingual sentence encoder, initialized from LaBSE. It was specialized to the news domain using two multilingual corpora, namely Polynews and PolyNewsParallel. The model is designed for denoising auto-encoding and sequence-to-sequence machine translation.
ZeroSwot is a state-of-the-art zero-shot end-to-end speech translation system. The model is created by adapting a wav2vec2.0-based encoder to the embedding space of NLLB, using a novel subword compression module and Optimal Transport, while utilizing only ASR data; it enables zero-shot E2E speech translation into all 200 languages supported by NLLB.
EuroLLM-1.7B comes from a project whose goal is to create a suite of LLMs capable of understanding and generating text in all European Union languages as well as some additional relevant languages. For pre-training, the authors used 256 Nvidia H100 GPUs of the MareNostrum 5 supercomputer, training the model with a constant batch size of 3,072 sequences (approximately 12 million tokens), the Adam optimizer, and BF16 precision.
OPUS-MT models for Ukrainian — a set of models for translating from and to Ukrainian, evaluated on the flores101 devtest benchmark. Results are given as standard BLEU scores (using sacrebleu).
M2M-100 — a Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
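A sketch with the 418M checkpoint; the target language is selected by forcing its BOS token:
```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(name)
model = M2M100ForConditionalGeneration.from_pretrained(name)
tokenizer.src_lang = "uk"
encoded = tokenizer("Київ є столицею України.", return_tensors="pt")
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("en"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```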
OPUS-MT Telegram Translation Bot
Ukrainian - Czech Telegram Translation Bot
Ukrainian - Czech Messenger Translation Bot
Charles Translator for Ukraine — a project whose primary objective is to help refugees from Ukraine by narrowing the communication gap between them and other people in the Czech Republic. This is a machine translation system for Czech-Ukrainian which should be of higher quality than Google Translate and is free to use through a web app, an Android app, and a REST API.
AppTek translator — neural machine translation.
OPUS-CAT MT Engine is a Windows-based machine translation system built on the Marian NMT framework. OPUS-CAT MT Engine makes it possible to use a large selection of advanced neural machine translation models natively on Windows computers. Its primary purpose is to provide professional translators with local, secure, and confidential neural machine translation in computer-assisted translation (CAT) tools, which are usually Windows-based.
OPUS-MT — an app that integrates publicly available translation models from the OPUS-MT project to bring fast and secure machine translation to the desktop of end users.
MTData automates the collection and preparation of machine translation (MT) datasets. It provides CLI and python APIs, which can be used for preparing MT experiments.