Automated Text Recognition – Extracting Data via OCR/HTR

Automated text recognition (optical character recognition, OCR) captures text from digital images automatically and turns it into searchable, analyzable data. The Mannheim University Library has many years of experience in digitization and in the use of various text recognition tools.

The Research Data Center is happy to support researchers at the University of Mannheim along the entire workflow, from digitization through layout and text recognition to training specialized models and structuring the data.

Services

Consulting on automated text recognition (OCR) for research projects
OCR Recommender – Recommendation for suitable text recognition software
Open OCR consultation hour: every second Thursday of the month, from 3 to 4 p.m., no registration required (link to Zoom meeting: ocr-bw.bib.uni-mannheim.de/sprechstunde, meeting ID: 682 8185 1819, passcode: 443071)
Use of Transkribus via the University of Mannheim's organizational license with extended functions, including credit allocation
Text recognition and transcription via the University Library’s eScriptorium instance (personal account available on request)

Selection of text recognition and transcription platforms

Tool	Cost model	Properties	Particularly suitable for
ABBYY Finereader	fee-based/commercial	Text and layout recognition; good layout analysis	Modern prints, complex layout
eScriptorium	Open Source	Graphical user interface for Kraken; intuitive use	Historical prints and manuscripts, including non-Latin script
Google Vision	fee-based/commercial	Text recognition; image and video analysis; for manuscripts and prints	Prints and manuscripts
Kraken	Open Source	Command line-based text recognition software; optimized for historical and non-Latin written material	Historical prints and manuscripts, including non-Latin script
OCR4All	Open Source	Graphical user interface for various open source text recognition programs	Historical prints and manuscripts
OCRmyPDF	Open Source	Command line program for text recognition of PDF files; uses Tesseract as OCR engine	Historical/modern prints
OCR-D	Open Source	Modular, command line-based text recognition software	Historical prints
PERO-OCR	Open Source	Web-based text recognition platform; good universal models; currently no follow-up training possible	Historical/modern prints and manuscripts
Tesseract	Open Source	Command line-based text recognition software; suitable for large data sets	Historical/modern prints
Transkribus	fee-based/commercial	Comprehensive text recognition and transcription platform; with intuitive user interface	Historical manuscripts and tables
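
Several of the tools in the table, such as Tesseract and OCRmyPDF, are operated from the command line. The following is a minimal sketch of how such calls might be assembled from Python. The helper names and file names are hypothetical, and the language code "deu" assumes that the German language data for Tesseract is installed; adapt both to your own material.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image: Path, language: str = "deu") -> list[str]:
    """Build a Tesseract call that writes <image stem>.txt next to the image."""
    output_base = image.with_suffix("")  # Tesseract appends ".txt" itself
    return ["tesseract", str(image), str(output_base), "-l", language]


def ocrmypdf_command(pdf: Path, language: str = "deu") -> list[str]:
    """Build an OCRmyPDF call that writes a searchable copy named <stem>_ocr.pdf."""
    output_pdf = pdf.with_name(pdf.stem + "_ocr.pdf")
    return ["ocrmypdf", "-l", language, str(pdf), str(output_pdf)]


def run_if_installed(command: list[str]) -> bool:
    """Run the command only if its program is on PATH; report whether it ran."""
    if shutil.which(command[0]) is None:
        print(f"{command[0]} is not installed; skipping")
        return False
    subprocess.run(command, check=True)
    return True


# Hypothetical file names; this only prints the commands and runs nothing.
print(" ".join(tesseract_command(Path("page_001.png"))))
print(" ".join(ocrmypdf_command(Path("volume.pdf"))))
```

To actually start recognition, pass one of the command lists to run_if_installed. Keeping command construction separate from execution makes it easy to review the exact call before processing a large batch of scans.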

User Seat via the University of Mannheim's organizational license

The University Library offers institutional access to the text recognition and transcription platform Transkribus. To be assigned a user seat, you are required to sign our terms of use. Afterwards, we will schedule a brief onboarding session in which we introduce you to the service.

The onboarding includes:

Mandatory (approx. 30 minutes):
Overview of the institutional Transkribus account
Information on user seats and credit allocation
Administrative conditions and guidelines

Optional:
Introduction to basic features (e.g., upload, layout recognition, HTR models)
Advanced modules such as training your own models, working with tables, or using other specialized functions
Working with your own sample documents

If you are interested in gaining access, feel free to contact us!

Access to the eScriptorium instance of the University Library

Members of the University of Mannheim can request their own account for the University Library's eScriptorium instance. Simply send us a short email and we will set up your access and provide you with the relevant instructions for use.

Once your account has been set up, we will be happy to help you get started with eScriptorium, answer any questions you may have about workflows, or assist you with using advanced features.

Instructions and materials for various OCR software

Here you will find instructions and materials on various open source text recognition programs and transcription platforms. This is a collection of useful references; not all resources were created by the Mannheim University Library itself.

eScriptorium

All GitHub documentation of the Mannheim University Library on eScriptorium (German)
Local installation (Windows/Linux) (German)
Local installation (macOS) (English)
User manual (German)
User manual (English)
Video: Introduction to eScriptorium (German)
Model transfer from Transkribus to eScriptorium (German)

OCR-D

User and installation guide

OCRmyPDF

User and installation guide (Windows/Linux) (German)

Tesseract

All GitHub documentation of the Mannheim University Library on Tesseract (German)
User and installation guide (Linux) (German)
User and installation guide (Windows) (German)

Tips for creating ground truth (training data)

The transcription guidelines of the OCR-D project define three levels for transcribing historical documents. The levels differ in how faithfully they reproduce the original. The guidelines can be found on the OCR-D project homepage. A guideline for publishing your own training data is available on GitHub.

The following resources provide ground truth for training or retraining your own models:

OCR & Ground-Truth-Resources
HTR United
Ground truth for Charlottenburger Amtsschrifttum
Ground truth for digital copies of the Mannheim University Library
Ground truth for digital copies of the Tübingen University Library
IAM Database for manuscripts

A virtual keyboard with the required special characters can be helpful when creating ground truth. Virtual keyboards for different transcription platforms are available on GitHub.
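
To decide which special characters a virtual keyboard needs, it can help to list the non-ASCII characters that already occur in existing transcriptions. The following is a minimal sketch of such an inventory using only the Python standard library. The function names and the sample line are made up for illustration and are not taken from any of the data sets listed above.

```python
import unicodedata
from collections import Counter


def special_characters(transcription: str) -> Counter:
    """Count every non-ASCII, non-whitespace character in a transcription."""
    # NFC normalization merges base letters with combining marks where a
    # precomposed character exists, so "ü" is counted as one character.
    normalized = unicodedata.normalize("NFC", transcription)
    return Counter(
        char for char in normalized
        if ord(char) > 127 and not char.isspace()
    )


def describe(counts: Counter) -> list[str]:
    """Format counts as tab-separated character, code point, name, frequency."""
    lines = []
    for char, count in counts.most_common():
        name = unicodedata.name(char, "UNNAMED")
        lines.append(f"{char}\tU+{ord(char):04X}\t{name}\t{count}")
    return lines


# Made-up sample line containing the long s and an umlaut.
sample = "Da\u017fs die Fr\u00fchme\u017f\u017fe gele\u017fen werde"
for line in describe(special_characters(sample)):
    print(line)
```

Running the same functions over all transcription files of a project gives a frequency-ordered list of characters that a keyboard layout would need to cover.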

In our FAQs you will find answers to the most frequently asked questions about automated text recognition and the software used in OCR-BW.

If the answer you are looking for is not listed, simply contact us by e-mail.

Projects and cooperations

Cooperation project on text recognition and data structuring with the Chair of Economic History (Prof. Streb)
Cooperation project on manuscript recognition with the Chair of Late Medieval and Early Modern Studies (Prof. Kümper)

If we can support you or if you have any questions, please do not hesitate to contact us.

Contact

Forschungsdatenzentrum (FDZ)

Team: Irene Schumm, Jan Kamlah, Phil Kolbe, David Morgan, Thomas Schmidt, Renat Kaufmann, Christos Sidiropoulos, Vasilka Paunova, Larissa Will

University of Mannheim
Universitätsbibliothek Mannheim
Schloss Schneckenhof West
68161 Mannheim

E-mail: forschungsdaten@uni-mannheim.de
Web: www.bib.uni-mannheim.de/en/teaching-and-research/research-data-center-fdz
