Source: Novosibirsk State University
Daniil Lyutaev, a master's student and researcher at NSU's Faculty of Information Technologies (FIT), working under the supervision of Vladimir Borisovich Barakhnin, Doctor of Engineering Sciences and Professor at the Department of Informatics Systems of FIT NSU, has developed an algorithm that automates the cross-lingual transfer of named-entity markup (titles, names, dates, etc.) using large language models. The method can be applied in many areas, including the creation of national search engines, document classification, the construction of communication networks, and translation.
Named entity recognition, that is, identifying words and phrases that denote unique objects such as people, organizations, locations, and dates, is a key task in natural language processing, and solving it depends on the availability of high-quality annotated text corpora. Creating such corpora for new languages, especially those with little digital data available for processing and analysis, is resource-intensive, which makes the automatic cross-lingual transfer of existing annotation a pressing issue. In his paper, Daniil Lyutaev examines the effectiveness of an approach based on large language models (LLMs) for automating the transfer of annotation from Uzbek to Russian and English.
Initially, the researcher had a large dataset of roughly 10,000 sentences in Uzbek in which experts had manually annotated the named entities. The data took the form of a table in which each word carried a tag, much like markup in HTML, indicating whether the word was part of a named entity. The researcher's task was to automatically translate these sentences into another language while preserving the annotation.
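The article does not reproduce the exact file layout, but token-level annotation of this kind is commonly stored in a CoNLL-style table: one word per line, followed by a tag such as B-PER, I-PER, B-LOC, or O. The following minimal Python sketch is illustrative only; the sample sentence, tags, and parsing code are hypothetical and not taken from the study's dataset.

```python
# Hypothetical example of a CoNLL-style annotated Uzbek sentence:
# one token per line, tab-separated from its entity tag.
SAMPLE = """\
Daniil\tB-PER
Lyutaev\tI-PER
Novosibirskda\tB-LOC
o'qiydi\tO
.\tO"""

def read_tagged_sentence(block: str) -> list[tuple[str, str]]:
    """Parse a word/tag table into (token, tag) pairs."""
    pairs = []
    for line in block.splitlines():
        token, tag = line.split("\t")
        pairs.append((token, tag))
    return pairs

if __name__ == "__main__":
    for token, tag in read_tagged_sentence(SAMPLE):
        print(f"{token:15s} {tag}")
```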
"This allows us to scale labeled data to new languages without repeating the work. The labeling is done once and then transferred automatically multiple times," explains Daniil.
The master's student compared two traditional approaches: translating the sentence and its entities with a machine translator and matching them algorithmically; and translating the sentence with a machine translator and then extracting named entities with pre-trained models, without reference to the original annotation. He also proposed his own approach based on large language models, in this case GPT-4o: for each sentence, a task was formulated in a specific format with example responses. All three methods were compared using the standard metrics of precision, recall, and F1-score (the harmonic mean of the first two) on 30 Russian and 30 English sentences, all annotated manually (the original language was Uzbek).
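The actual prompt used in the study is not published in the article, so the sketch below only illustrates the general idea of a few-shot prompt that asks GPT-4o to translate a tagged sentence and carry the tags over to the translated words. The prompt wording, the example pair, and the helper function are assumptions; the call uses the official OpenAI Python SDK.

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical few-shot prompt: the study's actual prompt format is not
# published, so the wording and the example below are purely illustrative.
PROMPT_TEMPLATE = (
    "Translate the Uzbek sentence into {target} and reproduce the named-entity "
    "tags on the corresponding translated words, keeping the same tag set.\n\n"
    "Example:\n"
    "Uzbek:   Daniil(B-PER) Lyutaev(I-PER) Novosibirskda(B-LOC) o'qiydi(O) .(O)\n"
    "Russian: Даниил(B-PER) Лютаев(I-PER) учится(O) в(O) Новосибирске(B-LOC) .(O)\n\n"
    "Uzbek:   {tagged_sentence}\n"
    "{target}:"
)

def transfer_markup(tagged_sentence: str, target: str = "Russian") -> str:
    """Ask the model to translate a tagged sentence and carry the tags over."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(
                       target=target, tagged_sentence=tagged_sentence)}],
        temperature=0,
    )
    return response.choices[0].message.content
```

Whether the returned tags actually land on the right translated words is what the precision, recall, and F1 comparison against the manually annotated reference sentences measures.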
As a result, it was found that the markup can be transferred with high accuracy (F1 score of about 0.9) even between morphologically different language types: Uzbek is an agglutinative language, Russian is an inflectional language, and English is an isolating language. This means that, when building multilingual information systems, the initial markup can be produced in just one language, for example the one in which annotation is cheapest.
"The goal of our work was to demonstrate that LLM can be used to solve this problem efficiently and automatically generate markup in another language. The results of the markup transfer algorithm can already be applied in many areas—search engines, document classification, building relational networks, translation, as well as for named entity extraction models themselves, where sets of marked data are needed," says Daniil.
To confirm the results, an automated back-translation evaluation was also carried out: the original Uzbek sentence was translated into a target language, such as Russian, and the resulting Russian sentence was then translated back into Uzbek. This back-translation was compared with the original for semantic similarity, and the evaluation can be automated for any number of sentences. A second evaluation compared, in the target language, the semantic similarity between the algorithm's output and a manually annotated reference sentence. The study shows that these two evaluations correlate on the 30 manually annotated sentences in Russian and the 30 in English.
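The article does not say which tools were used to measure semantic similarity; one common way to implement such a check is with multilingual sentence embeddings and cosine similarity, as in the purely illustrative sketch below (the embedding model name and the example sentences are assumptions, not details from the study).

```python
from sentence_transformers import SentenceTransformer, util

# The article does not name the similarity model; a multilingual
# sentence-embedding model is assumed here purely for illustration.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semantic_similarity(sentence_a: str, sentence_b: str) -> float:
    """Cosine similarity between the embeddings of two sentences."""
    embeddings = model.encode([sentence_a, sentence_b], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

# Back-translation check: compare the original Uzbek sentence with the
# Uzbek sentence obtained by translating the Russian output back again.
original_uz = "Daniil Lyutaev Novosibirskda o'qiydi."                 # hypothetical
back_translated_uz = "Daniil Lyutaev Novosibirsk shahrida o'qiydi."   # hypothetical
print(round(semantic_similarity(original_uz, back_translated_uz), 3))
```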
The developed approach could contribute to information sovereignty and the creation of national search engines. Besides Google, which now has virtually worldwide reach, only Russia (Yandex) and China (Baidu) have fully fledged national search engines of their own. Meanwhile, large populations around the world speak Spanish, Arabic, Hindi, and German, yet the corresponding countries have no sovereign search engines.
"Existing search engines don't disclose the algorithms they use, yet they possess vast resources that are inaccessible to most countries. Our goal is to develop a system that can be replicated. Scientific knowledge is reproducible and publicly available, and our algorithms are part of science and technology. Furthermore, they are relatively simple and inexpensive to implement. Therefore, we make what Google does truly accessible. This also contributes to resolving the issue of national sovereignty in information technology, which is extremely important. The algorithm we developed will help develop national segments of the internet in countries of the Commonwealth of Independent States, such as Uzbekistan and Kazakhstan," explains Vladimir Barakhnin.
