Optimized use of OCR processes

Tesseract as a component in the OCR-D workflow

Contact: Stefan Weil
Funding: German Research Foundation (DFG)
Duration: 2018–2019

The OCR-D coordination project aims to further develop methods for optical character recognition (OCR). In one of its eight module projects, the University Library is working on a central component, the actual text recognition, and is responsible for the Tesseract software.

Tesseract is open-source OCR software. It has been under continuous development for more than 30 years and has some of the best recognition rates among open-source programs. Since the end of 2016, Tesseract has also supported text recognition using artificial neural networks (LSTM), which keeps it technologically up to date. The project extends or supplements Tesseract with interfaces for integration into an overall OCR workflow according to the OCR-D module description (command line, API, REST-based web service). A further goal is to improve its stability, performance, and practical usability.
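
To illustrate the command-line route mentioned above, here is a minimal sketch of how a workflow step might call the standard Tesseract executable from Python. It is not the OCR-D module interface developed in the project. The function names, the file names, and the choice of the German model `deu` are assumptions made for illustration only.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="deu", psm=3):
    """Return the argument list for one Tesseract command-line call.

    Follows the usual pattern: tesseract IMAGE OUTPUTBASE -l LANG --psm N.
    The helper name and the defaults are hypothetical, chosen for this sketch.
    """
    return [
        "tesseract",
        str(image),
        str(output_base),
        "-l", lang,          # language model, assumed to be installed
        "--psm", str(psm),   # page segmentation mode
    ]


def recognize(image, output_base, lang="deu"):
    """Run Tesseract on one image and return the path of the text output."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    # With no output format given, Tesseract writes OUTPUTBASE.txt.
    return Path(f"{output_base}.txt")
```

Building the argument list separately from running it lets a workflow log or check the exact call before any image is processed.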

The University Library Mannheim used Tesseract to perform the first largely complete text recognition for the historical newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” and its predecessor newspapers (1819–1945), and also uses Tesseract in the DFG project Aktienführer-Datenarchiv II.