AI-powered-Historical-Text-Transcription

Data+ HTR project

This project is maintained by Xushu-Wang

Data+ 2022: AI-powered Historical Text Transcription

Introduction

The David M. Rubenstein Rare Book and Manuscript Library at Duke University holds millions of pages of handwritten documents ranging from ancient Papyri to records of Southern plantations to 21st century letters and diaries. Only a small subset of these documents have been digitized and made available online, and even fewer have been transcribed. The lack of text transcripts for handwritten documents impairs discovery and use of the materials, and prohibits any kind of computational text analysis that might support new avenues of research, including research related to the histories of racial injustice.

While Optical Character Recognition (OCR) technology has made it possible to derive machine-readable text from typewritten documents in an automated way for several decades, the work of transcribing handwritten documents remains largely manual and labor-intensive. In the last few years, however, platforms like Transkribus have sought to harness the power of machine-learning by using Handwriting Text Recognition (HTR) to extract text from manuscripts and other handwritten documents held in libraries and archives. To date, the Rubenstein Library has conducted a few small-scale HTR experiments with mixed (and mostly disappointing) results. We have a lot to learn about the viability of HTR for our collections and about how to incorporate HTR into our existing workflows.

In this Data+ project, students tested the viability of AI-powered HTR for transcribing digitized handwritten documents in the Rubenstein library and made recommendations for how the library might incorporate HTR into existing workflows, projects, and interfaces. Source material was drawn from the Duke Digital Collections, focusing on a subset of digitized 19th-20th century women’s travel diaries.

Machine-Learning Pipelines

Sample Workflow:

graph TD;
    A[Pre-processing]-->B[OCR Engine];
    B-->C[Correction Algorithm];
    C-->D[Evaluation];
    D-->|reselection| B;
    D-->|Accuracy meets the standard| E[Model checkpoint]
    E--> F[User Interface & Transcription Software];

Initially, we chose five mainstream OCR engines, developed by established tech companies, that already produce satisfactory results when transcribing printed text: Transkribus, Tesseract (originally from HP, later Google), Kraken, Google Cloud Vision, and Amazon AWS Textract. These engines cover most of the transcription industry and are relatively mature, while still leaving room for further development and training. Many other OCR engines are available, such as Ocular (from the University of California, Berkeley), but they are less suitable for the task of handwritten historical text transcription.

Pre-processing

flowchart LR;
    A[greyscale]-->B[Background removal];
    B-->C[threshold];
import cv2

def get_greyscale(image):
    # Convert the image from BGR color to single-channel greyscale, which makes
    # the page look closer to printed text and tends to improve accuracy
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

def remove_noise(image):
    # Smooth background noise while preserving edges (the pen strokes)
    return cv2.bilateralFilter(image, 5, 75, 75)

def thresholding(image):
    # Binarize the page, removing residual background information
    # other than the handwriting
    return cv2.adaptiveThreshold(image, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 15, 9)

(Figure: pre-processing result)
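For reference, the greyscale step above is just a weighted sum of the colour channels: OpenCV's `COLOR_BGR2GRAY` applies the standard ITU-R BT.601 luma weights Y = 0.299·R + 0.587·G + 0.114·B. The sketch below reproduces that computation in plain Python purely as an illustration; the actual pipeline should use `cv2.cvtColor` as shown.

```python
def bgr_pixel_to_grey(b, g, r):
    # Standard BT.601 luma weights, the same formula cv2.COLOR_BGR2GRAY applies
    return 0.114 * b + 0.587 * g + 0.299 * r

def bgr_image_to_grey(image):
    # `image` is a nested list of (B, G, R) tuples, i.e. rows of pixels
    return [[round(bgr_pixel_to_grey(*px)) for px in row] for row in image]
```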

Symspell Algorithm

SymSpell is an algorithm for post-OCR correction. Its principle is to find, in a very short time, all strings within a fixed edit distance of entries in a large list of strings. SymSpell derives its speed from the Symmetric Delete spelling correction algorithm and keeps its memory requirement in check by prefix indexing. The Symmetric Delete algorithm reduces the complexity both of generating edit candidates and of the dictionary lookup for a given edit distance: it is six orders of magnitude faster than the traditional approach (deletes + transposes + substitutes + inserts) and is language independent.
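The core idea of Symmetric Delete can be sketched in a few lines of plain Python (a simplified illustration, not the optimized symspellpy implementation): both dictionary words and the misspelled input are reduced to their delete-variants, so a match only ever requires deletions on either side.

```python
def deletes(word, distance=1):
    # All strings obtainable from `word` by deleting up to `distance` characters
    results = {word}
    for _ in range(distance):
        results |= {w[:i] + w[i + 1:] for w in results for i in range(len(w))}
    return results

def build_index(dictionary, distance=1):
    # Precompute delete-variants for every dictionary word (done once, offline)
    index = {}
    for word in dictionary:
        for variant in deletes(word, distance):
            index.setdefault(variant, set()).add(word)
    return index

def candidates(term, index, distance=1):
    # At query time, only deletes of the input term are generated;
    # no transposes, substitutes, or inserts are needed
    found = set()
    for variant in deletes(term, distance):
        found |= index.get(variant, set())
    return found
```

For example, `candidates("yu", index)` recovers "you", because deleting "o" from the dictionary word and deleting nothing from the input meet at the shared variant "yu".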

flowchart LR;
    A[Transcription Result]-->B[Rectified Result];
    B-->|Compare CER/WER/Levenshtein distance| C[Ground Truth];
import pkg_resources
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
bigram_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
)
# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

# look up suggestions for multi-word input strings (supports compound
# splitting & merging)
with open(r"") as file:  # path to the raw OCR output text file
    content = file.read()

# max edit distance per lookup (per single word, not per whole input string)
suggestions = sym_spell.lookup_compound(content, max_edit_distance=2, transfer_casing=True)

result_after = ""
# concatenate the suggested terms into the corrected text
for suggestion in suggestions:
    result_after += suggestion.term
Example:

Input:  Can yu readthis messa ge despite thehorible sppelingmsitakes
Output: can you read this message despite the horrible spelling mistakes
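Throughout this project, results are evaluated with Character Error Rate (CER), Word Error Rate (WER), and a scaled Levenshtein ratio ("fuzz ratio"). A minimal sketch of these metrics in plain Python (edit distance by dynamic programming; the project itself can rely on library implementations for the fuzz ratio):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert/delete/substitute, cost 1)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis, reference):
    # Character Error Rate: character-level edit distance over reference length
    return levenshtein(hypothesis, reference) / len(reference)

def wer(hypothesis, reference):
    # Word Error Rate: word-level edit distance over reference word count
    return levenshtein(hypothesis.split(), reference.split()) / len(reference.split())

def fuzz_ratio(a, b):
    # Scaled Levenshtein similarity in [0, 100]
    return round(100 * (1 - levenshtein(a, b) / max(len(a), len(b))))
```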

OCR Engines

(Figure 1)

Transkribus

Introduction

Transkribus is a comprehensive platform for the digitisation, AI-powered text recognition, transcription and searching of historical documents – from any place, any time, and in any language. Visit the official Transkribus website here.

Strengths:

Weaknesses:

Dataset & Accuracy

Training Set: Jeremy Bentham Project
Testing Set: Women's Travel Diaries
Accuracy w/ SymSpell algorithm: CER: 1.84%, WER: 5.56%, Fuzz Ratio (Scaled Levenshtein Distance): 96% [1]
Accuracy w/o SymSpell algorithm: CER: 7.88%, WER: 12.74%, Fuzz Ratio (Scaled Levenshtein Distance): 92%

(Figure: Transkribus transcription screenshot)

Tesseract

Introduction

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co., Greeley, Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open-sourced by HP. From 2006 until November 2018 it was developed by Google. Visit the Tesseract repository here.

Strengths:

Weaknesses:

Dataset & Accuracy

Training Set: tessdata_fast & tessdata_best (repositories containing the best trained data for the Tesseract Open Source OCR Engine)
Testing Set: Women's Travel Diaries

| Font type | Author | Accuracy |
|---|---|---|
| Non-cursive | N/A | > 95%, around 96%-97% accuracy in both characters and words [2] |
| Cursive | Crawford, Martha | CER: 68.28%, WER: 91.38%, Fuzz Ratio (Scaled Levenshtein Distance): 45% |
| Cursive | McMillan, Mary | CER: 75.74%, WER: 98.44%, Fuzz Ratio (Scaled Levenshtein Distance): 30% (nearly unrecognizable) |
| Cursive | Sanderson, Harriet | CER: 66.09%, WER: 92.95%, Fuzz Ratio (Scaled Levenshtein Distance): 50% |

(Figure: Tesseract sample output)

Graphical User Interface (GUI)

(Figure: GUI screenshot)

Kraken

Introduction

kraken is a turn-key OCR system optimized for historical and non-Latin script material. kraken's main features are:

- Fully trainable layout analysis and character recognition
- Right-to-Left, BiDi, and Top-to-Bottom script support
- ALTO, PageXML, abbyyXML, and hOCR output
- Word bounding boxes and character cuts
- Multi-script recognition support
- Public repository of model files
- Lightweight model files
- Variable recognition network architectures

Visit the official Kraken website here.

Strengths:

Weaknesses:

Dataset & Accuracy (Ideal Metrics If Trained with Enough Data)

Training Set: IAM Handwriting Database, a widely used, publicly available database of handwritten English text. The database contains forms of unconstrained handwritten text, which were scanned at a resolution of 300 dpi and saved as PNG images with 256 gray levels. The figure below provides samples of a complete form, a text line, and some extracted words.
Testing Set: Women's Travel Diaries / IAM database
Accuracy w/ SymSpell algorithm: CER: 7.87%, WER: 26.27%, Fuzz Ratio (Scaled Levenshtein Distance): 94%
Accuracy w/o SymSpell algorithm: CER: 12.54%, WER: 26.88%, Fuzz Ratio (Scaled Levenshtein Distance): 89%

(Figure: IAM database samples: a complete form, a text line, and extracted words)

Google Cloud Vision OCR

Introduction

Google Vision API offers powerful pre-trained machine learning models through REST and RPC APIs. Assign labels to images and quickly classify them into millions of predefined categories. Detect objects and faces, read printed and handwritten text, and build valuable metadata into your image catalog.

(Figure: Google Cloud Vision screenshot)

Data & Accuracy

Training Set: N/A
Testing Set: Women's Travel Diaries
Accuracy w/ SymSpell algorithm: CER: 28.69%, WER: 46.77%, Fuzz Ratio (Scaled Levenshtein Distance): 80%
Accuracy w/o SymSpell algorithm: CER: 31.43%, WER: 49.45%, Fuzz Ratio (Scaled Levenshtein Distance): 78%

Particular Strength in line/word segmentation:

(Figure: line/word segmentation example)

AWS Textract

Introduction

Amazon Textract is based on the same proven, highly scalable, deep-learning technology developed by Amazon's computer vision scientists to analyze billions of images and videos daily. Users don't need any machine learning expertise to use it. Amazon Textract includes simple, easy-to-use APIs that can analyze image files and PDF files. Amazon Textract is always learning from new data, and Amazon is continually adding new features to the service. However, Textract is not externally trainable by any means.
There are four major advantages to incorporating cloud computing into our project:

Data & Accuracy

Training Set: N/A
Testing Set: Women's Travel Diaries
Accuracy w/o SymSpell algorithm: CER: 19.83%, WER: 42.13%, Fuzz Ratio (Scaled Levenshtein Distance): 87%

Product: User Transcription Tool

A user interface was developed using Amazon cloud computing services. The user initially uploads a PDF file to Amazon S3; a first Lambda function then initiates the process and sends the input file to the Textract service. AWS Textract sends its output (a JSON payload) back to a second Lambda function, which is responsible for cleaning the JSON payload, converting it to CSV, and storing the output in Amazon S3. A complete tutorial of this tool, sample results, and a demo can be found here.
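The second Lambda's cleaning step can be sketched as follows. This is a simplified illustration: the field names `Blocks`, `BlockType`, `Text`, and `Confidence` follow Textract's documented response format, but the exact cleaning logic and column layout of the project's Lambda are assumptions here.

```python
import csv
import io
import json

def textract_json_to_csv(payload: str) -> str:
    """Flatten a Textract detect-document-text JSON payload into CSV rows
    of (line number, text, confidence), keeping only LINE blocks."""
    blocks = json.loads(payload).get("Blocks", [])
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["line", "text", "confidence"])
    # PAGE and WORD blocks are skipped; LINE blocks carry the transcription
    lines = (b for b in blocks if b.get("BlockType") == "LINE")
    for n, block in enumerate(lines, 1):
        writer.writerow([n, block.get("Text", ""), block.get("Confidence", "")])
    return out.getvalue()
```

In the deployed pipeline, the returned CSV string would then be written to an S3 object rather than kept in memory.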

(Figures: transcription tool screenshots)

Future Directions

  1. Retrain Kraken/Tesseract using a different dataset or using labeled women’s travel diaries
  2. Develop new HTR models from Transkribus
  3. Explore the viability of developing generalizable HTR models for genres of handwritten documents in the Rubenstein (e.g. 19th century diaries from the same hand vs. 20th century business correspondence from different hands)
  4. Develop a better self-designed post OCR correction algorithm using ML
  5. Conduct further computational analysis and visualization of HTR-generated text using NLP or other text-mining techniques or methods
  6. Include yet-to-be digitized materials related to the early history of Duke such as sermons, diaries, and lecture notes of our institution’s first president, Braxton Craven
  7. Develop a better, more polished software interface after acquiring a highly accurate model

  1. The current lowest CER produced by a general HTR tool (one that supports more than cursive handwriting) in the industry is around 2.75%. 

  2. The data is released by the official Tesseract UNLV testing site. More specific information can be found here. 

  3. The training sets of all the OCR engines require highly consistent and legible handwritten documents, which can provide high-quality ground-truth files. Joined-up (cursive) writing is relatively harder to train on.