Automated Text Recognition – Extracting Data via OCR/HTR
Automated or optical text recognition (OCR) captures text from digital images automatically and thus generates searchable and analyzable data. The Mannheim University Library has many years of experience in digitization and in using various text recognition software.
The Research Data Center is happy to support researchers at the University of Mannheim along the entire workflow, from digitization to layout and text recognition, training specialized models, and structuring the data.
Services
- Consulting on automated text recognition (OCR) for research projects
- OCR Recommender – Recommendation for suitable text recognition software
- Open OCR consultation hour: on the second Thursday of every month, from 3 to 4 p.m., no registration required (link to Zoom meeting: ocr-bw.bib.uni-mannheim.de/sprechstunde, meeting ID: 682 8185 1819, passcode: 443071)
- Use of Transkribus via the University of Mannheim's organizational license with extended functions, including credit allocation
- Text recognition and transcription via the University Library’s eScriptorium instance (personal account available on request)
Selection of text recognition and transcription platforms
| Tool | Cost model | Properties | Particularly suitable for |
|---|---|---|---|
| | Fee-based/commercial | Text and layout recognition; good layout analysis | Modern prints, complex layout |
| | Open source | Graphical user interface for Kraken; intuitive use | Historical prints and manuscripts, including non-Latin script |
| | Fee-based/commercial | Text recognition; image and video analysis; for manuscripts and prints | Prints and manuscripts |
| | Open source | Command line-based text recognition software; optimised for historical and non-Latin written material | Historical prints and manuscripts, including non-Latin script |
| | Open source | Graphical user interface for various open source text recognition programmes | Historical prints and manuscripts |
| | Open source | Command line programme for text recognition of PDF files; uses Tesseract as OCR engine | Historical/modern prints |
| | Open source | Modular, command line-based text recognition software | Historical prints |
| | Open source | Web-based text recognition platform; good universal models; currently no follow-up training possible | Historical/modern prints and manuscripts |
| | Open source | Command line-based text recognition software; suitable for large data sets | Historical/modern prints |
| | Fee-based/commercial | Comprehensive text recognition and transcription platform; with intuitive user interface | Historical manuscripts and tables |

User seat via the University of Mannheim's organizational license
The University Library offers institutional access to the text recognition and transcription platform Transkribus. To be assigned a user seat, you are required to sign our terms of use. Afterwards, we will schedule a brief onboarding session in which we introduce you to the service.
The onboarding includes:
Mandatory (approx. 30 minutes):
- Overview of the institutional Transkribus account
- Information on user seats and credit allocation
- Administrative conditions and guidelines
Optional:
- Introduction to basic features (e.g., upload, layout recognition, HTR models)
- Advanced modules such as training your own models, working with tables, or using other specialized functions
- Working with your own sample documents
If you are interested in gaining access, feel free to contact us!
Access to the University Library's eScriptorium instance
Members of the University of Mannheim can request their own account for the University Library's eScriptorium instance. Simply send us a short email and we will set up your access and provide you with the relevant instructions for use.
Once your account has been set up, we will be happy to help you get started with eScriptorium, answer any questions you may have about workflows, or assist you with using advanced features.
Instructions and materials for various OCR software
Here you will find instructions and materials on various open source text recognition programmes and transcription platforms. This is a collection of useful references; not all resources were created by the Mannheim University Library itself.
eScriptorium
- All GitHub documentation of the Mannheim University Library on eScriptorium (German)
- Local installation (Windows/Linux) (German)
- Local installation (macOS) (English)
- User manuals
- Video: Introduction to eScriptorium (German)
- Model transfer from Transkribus to eScriptorium (German)
OCRmyPDF
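As a minimal sketch of how OCRmyPDF is typically called from a script: the helper below assembles a command line and hands it to the `ocrmypdf` executable. The function names, the file names, and the default language `deu` are illustrative assumptions and not part of the library's materials; `-l`, `--deskew`, and `--skip-text` are standard OCRmyPDF options, and the matching Tesseract language data must be installed separately.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(input_pdf, output_pdf, languages=("deu",),
                           deskew=True, skip_text=True):
    """Assemble an OCRmyPDF command line as a list of arguments.

    Hypothetical helper for illustration; only the flags themselves
    come from OCRmyPDF.
    """
    command = ["ocrmypdf"]
    # -l selects the Tesseract language model(s); several are joined with "+".
    command += ["-l", "+".join(languages)]
    if deskew:
        # Straighten slightly rotated page images before recognition.
        command.append("--deskew")
    if skip_text:
        # Leave pages that already contain a text layer untouched.
        command.append("--skip-text")
    command += [str(input_pdf), str(output_pdf)]
    return command


def run_ocr(input_pdf, output_pdf, **options):
    """Run OCRmyPDF on one file; raises if the tool is missing or fails."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    command = build_ocrmypdf_command(input_pdf, output_pdf, **options)
    subprocess.run(command, check=True)
    return Path(output_pdf)


if __name__ == "__main__":
    # Assumed file names; prints the command instead of running it.
    print(" ".join(build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf",
                                          languages=("deu", "eng"))))
```

Keeping command construction separate from execution makes it possible to inspect the command before processing a large batch of files. For historical prints, a different Tesseract model than `deu` may be more suitable; which one depends on the material.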
Tips for creating ground truth (training data)
As part of the OCR-D project, transcription guidelines were drawn up that define three levels for transcribing historical documents. The levels differ in how faithfully they reproduce the original. The guidelines can be found on the OCR-D project homepage. A guideline for publishing your own training data is also available on GitHub.
Here you will find ground truth for training or retraining your own models:
- OCR & Ground-Truth-Resources
- HTR United
- Ground-Truth for Charlottenburger Amtsschrifttum
- Ground-Truth for digital copies of the Mannheim University Library
- Ground-Truth for digital copies of the Tübingen University Library
- IAM Database for manuscripts
A virtual keyboard with the required special characters can also be helpful when creating ground truth. Virtual keyboards for different transcription platforms are available on GitHub.
Projects and collaborations
- Cooperation project on text recognition and data structuring with the Chair of Economic History (Prof. Streb)
- Cooperation project on manuscript recognition with the Chair of Late Medieval and Early Modern Studies (Prof. Kümper)
If we can support you or if you have any questions, please do not hesitate to contact us.
Contact

Forschungsdatenzentrum (FDZ)
Universitätsbibliothek Mannheim
Schloss Schneckenhof West
68161 Mannheim
