Translation. Region: Russian Federal
Source: Novosibirsk State University
An important disclaimer is at the bottom of this article.
Anna Murashkina, a student in the Fundamental and Applied Linguistics program at the Humanities Institute of Novosibirsk State University who works at the Institute of Mathematics and Mathematical Geophysics of the Siberian Branch of the Russian Academy of Sciences, has created a system for the automatic recognition and transliteration of classical Tibetan texts. The system targets old printed documents written in the Tibetan syllabic script, which descends from the ancient Indian Brahmi script. In her research, she used page images of classical Tibetan texts from the 18th to 20th centuries held in the archive of the Center for Oriental Manuscripts and Xylographs of the Institute of Mongolian, Buddhist and Tibetan Studies of the Siberian Branch of the Russian Academy of Sciences.
"My work is relevant because Tibetan cultural heritage, which survives in the form of many historical manuscripts, needs to be preserved and made digitally accessible. Old printed documents, manuscripts and xylographs contain unique information about philosophy, religion, medicine, history and art, and they play a key role in the study of the region's cultural traditions. This knowledge is passed down from generation to generation in Tibet. Over time, however, natural and human factors physically destroy paper documents, which leads to the loss of priceless information and limits access to these unique materials. The Tibetan collection of the Institute of Mongolian, Buddhist and Tibetan Studies of the Siberian Branch of the Russian Academy of Sciences currently holds up to 70 thousand items that are at risk of being lost. One of the most reliable ways to preserve and systematize historical documents is to digitize them," said Anna Murashkina.
The young researcher set out to use machine learning to build a model that recognizes Tibetan characters in images, converts them into machine-readable form, and achieves higher accuracy than existing open solutions, including Tesseract.
"To do this, I manually annotated lines of Tibetan text from the IMBT SB RAS collection. Then, taking into account the specifics of the Tibetan script, I developed a system for evaluating the quality of optical character recognition (OCR). After that, I compared existing architectures and chose a convolutional neural network model, which required fine-tuning," explained Anna Murashkina.
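The article does not say which metrics the evaluation system uses. As an illustration only, the sketch below computes two measures that are common in OCR evaluation: a character error rate over Unicode code points and a syllable error rate over units separated by the tsheg mark (U+0F0B). The function names and the choice of NFC normalization are assumptions made for this example, not details of the researcher's system.

```python
import unicodedata

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance divided by the reference length."""
    if not ref_units:
        return 0.0 if not hyp_units else 1.0
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def char_error_rate(reference, hypothesis):
    """Error rate over Unicode code points after NFC normalization (assumed)."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    return error_rate(list(ref), list(hyp))


def syllable_error_rate(reference, hypothesis):
    """Error rate over tsheg-delimited syllables; empty pieces are dropped."""
    def split(text):
        text = unicodedata.normalize("NFC", text)
        return [unit for unit in text.split(TSHEG) if unit]
    return error_rate(split(reference), split(hypothesis))
```

Counting errors per syllable as well as per code point is one way to reflect the structure of the script: a single wrong vowel sign or subjoined letter changes only one code point but invalidates the whole syllable.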
She fine-tuned the model on an annotated corpus of documents. The result is a complete modular OCR algorithm that includes pre-processing, segmentation, recognition and post-processing stages.
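The four stages named above could be chained roughly as in the following sketch. It is not the published implementation: the thresholding, the blank-row line splitting, the tsheg clean-up and every function name here are simplifying assumptions chosen for illustration, and the recognizer is passed in as a placeholder callable standing in for the fine-tuned neural network.

```python
import unicodedata
from typing import Callable, List

Image = List[List[int]]  # grayscale rows: 0 is black, 255 is white
TSHEG = "\u0f0b"         # Tibetan intersyllabic mark


def preprocess(image: Image, threshold: int = 128) -> Image:
    """Binarize the page: 1 marks ink, 0 marks background."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def segment_lines(binary: Image) -> List[Image]:
    """Split the page into text lines at rows that contain no ink."""
    lines: List[Image] = []
    current: Image = []
    for row in binary:
        if any(row):
            current.append(row)
        elif current:
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines


def postprocess(text: str) -> str:
    """Normalize Unicode and collapse repeated tsheg marks."""
    text = unicodedata.normalize("NFC", text)
    while TSHEG * 2 in text:
        text = text.replace(TSHEG * 2, TSHEG)
    return text.strip()


def run_ocr(image: Image, recognize_line: Callable[[Image], str]) -> List[str]:
    """Pre-process, segment, recognize each line, then post-process."""
    binary = preprocess(image)
    return [postprocess(recognize_line(line)) for line in segment_lines(binary)]
```

Keeping each stage as a separate function means any one of them, for example the recognizer, can be swapped out without touching the rest of the pipeline.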
"For me, the value of the project is that I helped digitize an archive that stores history: documents created by people of the past who wanted to pass on their knowledge to future generations. I am glad that I am helping to carry this knowledge through time, preserve it and make it available to a wider audience. The system I developed will be used by staff of the Institute of Mongolian, Buddhist and Tibetan Studies of the Siberian Branch of the Russian Academy of Sciences. Possible cooperation with the Buddhist Center for Digital Technologies, which digitizes the archives of temples and monasteries, is also being discussed. Together with this organization, we will expand the possibilities for digitizing Tibetan manuscripts using open resources developed jointly with researchers from organizations in different countries, so that everyone can later engage with this priceless heritage and become acquainted with the documents kept in temples and archive repositories," said Anna Murashkina.
Material prepared by: Elena Panfilo, NSU press service
Please note: This information is raw content obtained directly from the source of the information. It is an accurate report of what the source claims and does not necessarily reflect the position of MIL-OSI or its clients.
