This week we received the lovely news that our demonstration paper “DuoSearch: A Novel Search Engine for Bulgarian Historical Documents” has been accepted at the very competitive European Conference on Information Retrieval (ECIR 2022), which will take place in April 2022 in Stavanger, Norway.
The paper presents a search engine that helps users explore content from digitised historical newspapers in Bulgarian. This work was done as an experiential learning project. Two students (Angel Beshirov and Suzan Hadzhieva, pictured in the post photo) worked on the tools with input from Prof. Ivan Koychev and myself, and in consultation with staff from the Digital centre of the National Library Ivan Vazov in Plovdiv.
The importance of this type of work, and how it responds to different stakeholders' needs, can be explored in this short video.
Here is the abstract of the paper; we will share more information when the final version is published.
Abstract: Search in collections of digitised historical documents is hindered by a two-pronged problem: orthographic variety and optical character recognition (OCR) mistakes. We present a new search engine for historical documents, DuoSearch, which uses ElasticSearch and machine learning methods to offer a solution to this problem. It was tested on a collection of historical newspapers in Bulgarian from the mid-19th to the mid-20th century. The system provides an interactive and intuitive interface for end users, allowing them to enter search terms in modern Bulgarian and search across historical spellings. This is the first solution facilitating the use of digitised historical documents in Bulgarian.
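To give a feel for what "entering search terms in modern Bulgarian and searching across historical spellings" can mean in practice, here is a minimal sketch in Python. It is not the DuoSearch implementation: the paper uses ElasticSearch and machine learning methods, whereas this toy example swaps in a few hand-written rules based on the pre-1945 orthography (the letters ѫ and ѣ, and the silent word-final ъ). The function names `normalize_spelling`, `build_index` and `search` are hypothetical, the sample texts are made up, and OCR mistakes are not handled at all.

```python
import re
from collections import defaultdict


def normalize_spelling(word: str) -> str:
    """Map a word to a crude 'modern-like' key (illustrative rules only)."""
    w = word.lower()
    # Assumption: treat ѫ as ъ and ѣ as е. In reality ѣ can also
    # correspond to я in modern spelling, which this sketch ignores.
    w = w.replace("ѫ", "ъ").replace("ѣ", "е")
    # Drop the silent word-final ъ used before the 1945 reform.
    if len(w) > 1 and w.endswith("ъ"):
        w = w[:-1]
    return w


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Build an inverted index keyed by normalized spelling."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text):
            index[normalize_spelling(token)].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents containing every query term."""
    terms = [normalize_spelling(t) for t in re.findall(r"\w+", query)]
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Made-up sample snippets in historical spelling.
docs = {
    1: "Старият градъ",
    2: "Дясната рѫка",
    3: "Новъ вѣкъ",
}
index = build_index(docs)
hits = search(index, "век")  # modern spelling of "вѣкъ"
```

Normalizing both the indexed text and the query to the same key keeps the lookup simple, but it can only cover spelling changes that follow fixed rules; irregular variants and OCR noise are exactly where learned models become useful.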