Translation. Region: Russian Federation –
Source: Novosibirsk State University –
Stepan Gudkov, a master's student at the NSU Faculty of Information Technologies, has developed software that automates the recognition of handwritten historical documents, specifically the decision books of the volost courts that existed in Russia in the early 20th century. The project aims to introduce these court decisions, which reflect many aspects of the lives of Siberian peasants, into scientific circulation. The development is part of Stepan Gudkov's master's thesis, which he is preparing under the supervision of Vladimir Borisovich Barakhnin, Doctor of Engineering Sciences and Professor in the Department of General Informatics at the NSU Faculty of Information Technologies.
At the beginning of the 20th century, the peasant class in Russia had so-called volost courts, which dealt with civil matters. Their decisions were recorded in thick journals, bound with cord and sealed with a seal. They represent a treasure trove of information—a description of the lives and daily routines of Russian and Siberian peasants in the pre-revolutionary era.
"Although these are court documents, they're not really about the court; they're about life in its various manifestations. Reading these decisions, we get a picture of the different peasant occupations, learning about their daily lives, their daily concerns, their personalities, and their habits. We see all this diversity in the decisions of the volost court. They give us a glimpse of peasant Russia, which later disappeared during the 20th century, when the country became completely different. It's a photographic portrait of peasant Russia," said Alexey Kirillov, senior researcher at the Institute of History, Siberian Branch, Russian Academy of Sciences.
Thus arose the idea of making this knowledge accessible to a wide range of historians and interested readers: not just selecting 100 decisions, but digitizing and recognizing a large number of documents and presenting them in a form that a modern reader can easily understand and access.
"By my estimates, at the beginning of the 20th century, volost courts across Russia issued approximately 1 million decisions annually. Of these, only a tiny fraction have survived. Archives in Siberia currently contain several tens of thousands of decisions, and across the country, I believe, we can count on hundreds of thousands. To introduce them into scholarly circulation and begin studying them, they first need to be recognized and translated into modern text. We are currently manually transcribing them, which is a very labor-intensive process. I can give you an example: we will soon publish two books presenting several hundred volost court decisions. This work took us three years. If we set the goal of recognizing the texts of all decisions, then, if done manually, it would take several decades. The use of information technology, however, allows us to automate and significantly speed up this work," added Alexey Kirillov.
Historians approached the NSU Faculty of Information Technologies with this task. To introduce a handwritten historical document into scientific circulation, it is not enough to digitize it as an image; it must also be recognized and made available as text.
"The text must, at a minimum, be indexed, with all words extracted. Then the text must be processed, extracting the most important general terms describing the subject matter of a given decision; the document must be cataloged. Then it will be possible to assemble a comprehensive information system that will allow specialists and the general public to access decisions of the district courts. Where should we begin here? Of course, with the translation of the handwritten text, its recognition, and its conversion into a machine-readable format," explained Vladimir Barakhnin.
Existing text recognition systems are not applicable to such documents due to various characteristics, so it was necessary to develop an algorithm suitable for working with handwritten documents.
When recognizing handwritten texts, specialists face a number of challenges. First, the pages of a volost court decision book are ruled, including with vertical lines that divide them into columns. In practice, however, scribes did not always keep to the columns; the text often runs continuously across the page, which makes the layout harder to interpret. Second, handwriting styles differ. Although a given book was typically kept by a single scribe, so that a certain number of documents share the same handwriting, handwriting varies from book to book. Third, pre-revolutionary orthography differs from modern orthography. Finally, the scribes' limited literacy and their use of various abbreviations and proper names complicate text recognition and processing.
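The orthographic gap mentioned above can be illustrated with a minimal sketch. It assumes only a few well-known rules of the 1918 spelling reform (the letters ѣ, і, ѳ, ѵ and the word-final hard sign); the function name `modernize` and the choice of rules are hypothetical and are not taken from the NSU project, which would need a far richer rule set and dictionary support.

```python
import re

# Assumed, deliberately incomplete mapping of pre-reform letters to modern ones.
LETTER_MAP = str.maketrans({
    "\u0463": "е", "\u0462": "Е",  # yat (ѣ, Ѣ) -> е
    "\u0456": "и", "\u0406": "И",  # decimal i (і, І) -> и
    "\u0473": "ф", "\u0472": "Ф",  # fita (ѳ, Ѳ) -> ф
    "\u0475": "и", "\u0474": "И",  # izhitsa (ѵ, Ѵ) -> и
})

# A hard sign at the very end of a word was dropped by the reform;
# a hard sign inside a word (for example after a prefix) is kept.
FINAL_HARD_SIGN = re.compile(r"[ъЪ]\b")


def modernize(text: str) -> str:
    """Apply the small rule set above to a recognized line of text."""
    return FINAL_HARD_SIGN.sub("", text.translate(LETTER_MAP))


if __name__ == "__main__":
    print(modernize("волостной судъ рѣшилъ"))
```

Rules that depend on grammar, such as adjective endings, cannot be handled by character substitution alone, which is one reason the article stresses dictionary checks and human review.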
To solve the problem of recognizing such texts, NSU developers applied machine vision algorithms that allow them to recognize lines, individual symbols, and letters.
"The system takes as input an image of a page from a book of decisions of rural district courts. It is broken into several small fragments, each of which is divided into individual lines of text, which can be done using neural networks like YOLO. After this, the image of the line must be converted into text. There are several approaches: running a dynamically sized window over the line, cropping the letter images and feeding them to the recognition model (an ensemble of convolutional neural networks can be used); solving the problem of transforming a sequence (handwritten text) into a sequence (printed text) using convolutional recurrent neural networks or transformer-based networks, which requires a large number of manually transcribed lines to train the model; or using a training method with a small number of training samples, which we have not yet tested and has an undeniable advantage since it requires very little data to train the model. The recognized text will, of course, contain errors, so post-processing is required: at least checking it against dictionaries. The result should be a text file containing the recognized text," Stepan Gudkov explained.
A machine vision algorithm has now been developed that helps train a neural network to recognize words as a set of symbols, without any processing or correction. Further refinement of the algorithm is intended to enable the system to suggest possible spellings and corrections based on meaning and context, allowing a human to decide which version is correct.
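A minimal sketch of such a suggestion step is shown below. It assumes a plain word list as the dictionary and uses the standard-library `difflib.get_close_matches` for similarity; the function name `flag_unknown_words`, the toy vocabulary, and the cutoff value are hypothetical. It ranks candidates by spelling similarity only, whereas the refinement described above would also take meaning and context into account.

```python
import difflib
import re
from typing import List, Sequence, Tuple


def flag_unknown_words(text: str, vocabulary: Sequence[str],
                       max_suggestions: int = 3,
                       cutoff: float = 0.6) -> List[Tuple[str, int, List[str]]]:
    """Return (word, position, suggestions) for every word not in the vocabulary."""
    known = set(vocabulary)
    flagged = []
    for match in re.finditer(r"\w+", text):
        word = match.group().lower()
        if word in known:
            continue
        # Candidates are ordered by similarity; a human editor picks the right one.
        suggestions = difflib.get_close_matches(
            word, vocabulary, n=max_suggestions, cutoff=cutoff)
        flagged.append((match.group(), match.start(), suggestions))
    return flagged


# Toy English vocabulary for illustration only.
vocabulary = ["court", "decision", "peasant", "the", "village"]
flagged = flag_unknown_words("the cuort decision", vocabulary)
```

Returning the character position alongside each word lets an editing interface underline the word in place and offer the candidates for a human to accept or reject.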
"Further text processing requires some thought; reading word-by-word doesn't produce a perfect result; errors and recognition difficulties are possible. Solving this problem with IT alone will be difficult; we need to develop an application that, when it encounters unfamiliar words, underlines them, marks them for correction, and suggests the most likely variants. Therefore, it's essential to involve specialists with a humanities background," Vladimir Barakhnin added.
The future plan is to create a full-fledged information system with search interfaces. In such a system, each document is provided with all metadata, all words are extracted, and it is machine-readable. The system allows for contextual searching and selection by various criteria—by village, person, case category, etc.
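How extracted words and metadata might combine in such a search can be sketched with an in-memory inverted index. Everything here is an assumption for illustration: the `Decision` fields, the function names, and the sample records are hypothetical, and a production system would more likely sit on a database or a full-text search engine.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Decision:
    doc_id: int
    village: str
    category: str
    text: str


def build_index(decisions: List[Decision]) -> Dict[str, Set[int]]:
    """Map every lowercased word to the ids of the decisions that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for decision in decisions:
        for word in re.findall(r"\w+", decision.text.lower()):
            index[word].add(decision.doc_id)
    return index


def search(decisions: List[Decision], index: Dict[str, Set[int]], word: str,
           village: Optional[str] = None,
           category: Optional[str] = None) -> List[int]:
    """Find decisions containing `word`, optionally filtered by metadata."""
    hits = index.get(word.lower(), set())
    return [d.doc_id for d in decisions
            if d.doc_id in hits
            and (village is None or d.village == village)
            and (category is None or d.category == category)]


# Invented sample records, not real court decisions.
decisions = [
    Decision(1, "Village A", "debt", "Dispute over an unpaid debt for a horse"),
    Decision(2, "Village B", "land", "Dispute over the boundary of a field"),
    Decision(3, "Village A", "land", "Claim for damage to a field by a horse"),
]
index = build_index(decisions)
```

For example, `search(decisions, index, "horse", village="Village A")` returns the ids of the sample decisions from Village A that mention a horse.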
In the future, this development can be applied to the analysis of any handwritten documents from archives—letters, diaries, etc., created in the pre-revolutionary period—from the mid-19th century, when the modern Russian language emerged, until 1917.
Photos: https://volsud.sibistorik.ru/
Please note: This information is raw content obtained directly from the source. It represents an accurate account of the source's assertions and does not necessarily reflect the position of MIL-OSI or its clients.
