Regional markup

The regional markup of the corpus is based on the contemporary administrative structure of Ukraine. The reasons are partly pragmatic: administrative borders are clearly defined and can be looked up in standard sources. While the administrative structure does not necessarily reflect the dialectal landscape of Ukraine, the choice has a sociolinguistic dimension, since the administrative regions represent socioeconomic and cultural entities of some relevance that are typically oriented towards the same centers. These administrative regions are grouped into five macroregions: West, East, Center, South, and North. Kyiv, as the capital with residents who come from different regions, is treated as a separate macroregion. The macroregions are drawn with the Ukrainian dialects in mind: the North includes most of the territory of the Northern (Polissya) dialects, the West covers the Southwestern dialects, and the South, East, and Center correspond to the Steppe, Slobozhanshchyna, and Dnieper dialect groups, respectively.

The graphs below show how the texts are distributed across these macroregions in the corpus overall (Fig. 1) and over time (Fig. 2). Kyiv and the Western macroregion are represented by the largest numbers of texts; the other regions have far fewer.


Macroregion    Tokens       %
W              172303252    46
KYV            118565515    32
E              26624696     7
C              23900708     6
S              16903552     5
N              12944789     3


Figure 1: Composition of GRAC by macroregions
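The percentages in Figure 1 can be reproduced from the token counts. The sketch below assumes that each percentage is the macroregion's share of the sum of the six rows, rounded to a whole number; the names `TOKENS` and `shares` are illustrative and not part of GRAC.

```python
# Token counts per macroregion, copied from Figure 1.
TOKENS = {
    "W": 172_303_252,
    "KYV": 118_565_515,
    "E": 26_624_696,
    "C": 23_900_708,
    "S": 16_903_552,
    "N": 12_944_789,
}


def shares(tokens):
    """Return each macroregion's share of the total as a rounded percentage.

    Assumption: the base is the sum of the listed rows, not the whole corpus.
    """
    total = sum(tokens.values())
    return {region: round(100 * count / total) for region, count in tokens.items()}


PERCENT = shares(TOKENS)
for region, percent in PERCENT.items():
    print(f"{region:<4}{TOKENS[region]:>12}  {percent}%")
```

Because each value is rounded independently, the rounded shares add up to 99 rather than 100.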


Figure 2: Distribution of tokens by macroregions and years


Media texts (newspapers, news websites) are assigned to the region where the respective outlet is published. Other texts are annotated with the region where the author (or, for a translated text, the translator) was born, studied, or lived for more than ten years.

The regional annotation is thus generally linked to the author of a text, where an author can be identified. A single text can belong to several regional subcorpora if the author or translator was born, studied, or lived for a long time in different regions. During annotation, biographical information from all kinds of sources is evaluated so that the regional annotation reflects the Ukrainian linguistic biography of the author as closely as possible.
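The rule for non-media texts can be sketched as a small function. This is only an illustration under stated assumptions: the `Residence` record and the `macroregions_for` function are hypothetical, and the sketch reads the ten-year threshold as applying to residence only, not to birth or study. The actual annotation in GRAC is based on an evaluation of biographical sources, not on an automatic rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Residence:
    """One biographical fact about an author or translator (hypothetical record)."""
    macroregion: str  # e.g. "W", "KYV"
    kind: str         # "born", "studied", or "lived"
    years: int = 0    # duration; only consulted when kind == "lived"


def macroregions_for(biography):
    """Collect every macroregion that qualifies under the stated rule."""
    selected = set()
    for fact in biography:
        if fact.kind in ("born", "studied"):
            selected.add(fact.macroregion)
        elif fact.kind == "lived" and fact.years > 10:
            selected.add(fact.macroregion)
    return selected


# An author born in the West who lived 25 years in Kyiv and 3 years in the South
# would be assigned to two regional subcorpora.
biography = [
    Residence("W", "born"),
    Residence("KYV", "lived", years=25),
    Residence("S", "lived", years=3),
]
assigned = macroregions_for(biography)
```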

Approximately 85.5% of GRAC v.10 is annotated by region. Texts created in Ukraine that are assigned to exactly one macroregion make up 60% of the GRAC v.10 corpus.

For regional text markup, GRAC has the attributes DOC.COUNTRY, DOC.MACROREGION (North, West, South, East, Center, Kyiv: Fig. 3), DOC.REGION, and DOC.LOCCODE. For convenience, DOC.LOCCODE combines all regional attributes in a single value (for example, DOC.COUNTRY = “UA”, DOC.MACROREGION = “C”, DOC.REGION = “CRK”, and DOC.LOCCODE = “UA-C-CRK”).
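The relation between DOC.LOCCODE and the three separate attributes can be illustrated with a short sketch. It assumes that a Ukrainian location code is always the three attribute values joined by hyphens, as in the example above; `RegionTag`, `split_loccode`, and `join_loccode` are illustrative names, not part of the GRAC interface.

```python
from typing import NamedTuple


class RegionTag(NamedTuple):
    country: str      # DOC.COUNTRY, e.g. "UA"
    macroregion: str  # DOC.MACROREGION, e.g. "C"
    region: str       # DOC.REGION, e.g. "CRK"


def split_loccode(loccode: str) -> RegionTag:
    """Split a Ukrainian DOC.LOCCODE such as "UA-C-CRK" into its attributes."""
    parts = loccode.split("-")
    if len(parts) != 3 or parts[0] != "UA":
        raise ValueError(f"not a Ukrainian location code: {loccode!r}")
    return RegionTag(*parts)


def join_loccode(tag: RegionTag) -> str:
    """Rebuild the DOC.LOCCODE value from the separate attributes."""
    return "-".join(tag)


tag = split_loccode("UA-C-CRK")
print(tag.macroregion, tag.region)
```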


Figure 3: Macroregions of Ukraine in GRAC


DOC.LOCCODE for Ukraine:


UA-C-CRK — Cherkasy oblast

UA-C-KRV — Kirovohrad oblast

UA-C-KVS — Kyiv oblast

UA-C-PLT — Poltava oblast

UA-E-HRK — Kharkiv oblast

UA-E-SUM — Sumy oblast

UA-KYV-KYV — Kyiv

UA-N-CRG — Chernihiv oblast

UA-N-RVN — Rivne oblast

UA-N-VLN — Volyn oblast

UA-N-ZHT — Zhytomyr oblast

UA-S-DNC — Donetsk oblast

UA-S-DNP — Dnipropetrovsk oblast

UA-S-HRS — Kherson oblast

UA-S-KRM — Crimea

UA-S-LGN — Luhansk oblast

UA-S-MKL — Mykolaiv oblast

UA-S-ODE — Odesa oblast

UA-S-ZPR — Zaporizhia oblast

UA-W-CRV — Chernivtsi oblast

UA-W-HML — Khmelnytskyi oblast

UA-W-IFR — Ivano-Frankivsk oblast

UA-W-LVV — Lviv oblast

UA-W-TRN — Ternopil oblast

UA-W-VNC — Vinnytsia oblast

UA-W-ZKR — Zakarpattia oblast


Aside from the above macroregions, the countries of the Ukrainian diaspora (the United States, Canada, Poland, Germany, the UK, France, etc.) are distinguished in the annotation. A DOC.LOCCODE for the Ukrainian diaspora starts with D, followed by a macroregion code: V for post-Soviet countries (DOC.MACROREGION = “V”) or Z for other countries (DOC.MACROREGION = “Z”). The third component specifies the country. For neighboring Russia, Poland, and Czechoslovakia, a fourth component is available to specify further details.


D-V-BY — Belarus

D-V-GE — Georgia (country)

D-V-KZ — Kazakhstan

D-V-MLD — Moldova

D-V-RU — Russia

D-V-RU-KBN — Kuban

D-V-RU-SSL — Eastern Slobozhanshchyna

D-V-TKM — Turkmenistan

D-Z-AR — Argentina

D-Z-AT — Austria

D-Z-AU — Australia

D-Z-BE — Belgium

D-Z-BR — Brazil

D-Z-CA — Canada

D-Z-CH — Switzerland

D-Z-CZE — Czech Republic

D-Z-CZE-SVK — Czechoslovakia (before 1992)

D-Z-DE — Germany

D-Z-EET — Estonia

D-Z-ES — Spain

D-Z-FR — France

D-Z-GB — United Kingdom

D-Z-IL — Israel

D-Z-IT — Italy

D-Z-LT — Lithuania

D-Z-LV — Latvia

D-Z-PL — Poland

D-Z-PL-HLM — Kholm region

D-Z-RO — Romania

D-Z-SRB — Serbia

D-Z-SVK — Slovakia

D-Z-SWE — Sweden

D-Z-USA — United States
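The diaspora scheme described above can be sketched as a small parser. It assumes only what the list shows: a leading D, a macroregion of V or Z, a country code, and an optional fourth component. `DiasporaTag` and `parse_diaspora_loccode` are illustrative names, and the sketch does not check the country code against the list.

```python
from typing import NamedTuple, Optional


class DiasporaTag(NamedTuple):
    macroregion: str       # "V" (post-Soviet countries) or "Z" (other countries)
    country: str           # third component, e.g. "RU", "CA"
    detail: Optional[str]  # optional fourth component, e.g. "KBN"


def parse_diaspora_loccode(loccode: str) -> DiasporaTag:
    """Parse a diaspora DOC.LOCCODE such as "D-V-RU-KBN" or "D-Z-CA"."""
    parts = loccode.split("-")
    if len(parts) not in (3, 4) or parts[0] != "D":
        raise ValueError(f"not a diaspora location code: {loccode!r}")
    if parts[1] not in ("V", "Z"):
        raise ValueError(f"unknown diaspora macroregion: {parts[1]!r}")
    detail = parts[3] if len(parts) == 4 else None
    return DiasporaTag(parts[1], parts[2], detail)


kuban = parse_diaspora_loccode("D-V-RU-KBN")
canada = parse_diaspora_loccode("D-Z-CA")
```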


M. Shvedova, R. von Waldenfels. Regional Annotation within GRAC, a Large Reference Corpus of Ukrainian: Issues and Challenges. In: Proceedings of the 5th International Conference on Computational Linguistics and Intelligent Systems (COLINS 2021), Volume I: Main Conference, Kharkiv, Ukraine, April 22-23, 2021. CEUR Workshop Proceedings, pp. 32-45.
