As one of eight module projects of the OCR-D Coordination Project for the Further Development of Optical Character Recognition (OCR) Methods, we focus on a central component, the text recognition itself, and are responsible for the Tesseract software.
Tesseract is free software for text recognition (optical character recognition, OCR). It has a history of more than 30 years of continuous development and improvement. Among the small group of open source OCR products, Tesseract is one of the programs with the best recognition rates. Since the end of 2016, Tesseract has supported state-of-the-art text recognition using neural networks (LSTM). The context of OCR-D requires well-defined interfaces for OCR software. The project will actively contribute to the definition of such interfaces and will implement them for Tesseract so that Tesseract can be included in an OCR workflow. We also strive to improve the stability, performance, and practical usability of Tesseract.
Mannheim University Library has used Tesseract for the first nearly complete text recognition of the historical newspaper Deutscher Reichsanzeiger und Preußischer Staatsanzeiger (German Imperial Gazette and Prussian Official Gazette). We also use Tesseract in the DFG project Aktienführer-Datenarchiv II.