Data+ HTR project
The David M. Rubenstein Rare Book and Manuscript Library at Duke University holds millions of pages of handwritten documents, ranging from ancient papyri to records of Southern plantations to 21st-century letters and diaries. Only a small subset of these documents has been digitized and made available online, and even fewer have been transcribed. The lack of text transcripts for handwritten documents impairs discovery and use of the materials, and precludes any kind of computational text analysis that might support new avenues of research, including research related to the histories of racial injustice.
While Optical Character Recognition (OCR) technology has for several decades made it possible to derive machine-readable text from typewritten documents automatically, the work of transcribing handwritten documents remains largely manual and labor-intensive. In the last few years, however, platforms like Transkribus have sought to harness the power of machine learning by using Handwritten Text Recognition (HTR) to extract text from manuscripts and other handwritten documents held in libraries and archives. To date, the Rubenstein Library has conducted a few small-scale HTR experiments with mixed (and mostly disappointing) results. We have a lot to learn about the viability of HTR for our collections and about how to incorporate HTR into our existing workflows.
In this Data+ project, students tested the viability of AI-powered HTR for transcribing digitized handwritten documents in the Rubenstein Library and made recommendations for how the library might incorporate HTR into existing workflows, projects, and interfaces. Source material was drawn from the Duke Digital Collections, focusing on a subset of digitized 19th- and 20th-century women's travel diaries.
Sample Workflow:
```mermaid
graph TD;
    A[Pre-processing]-->B[OCR Engine];
    B-->C[Correction Algorithm];
    C-->D[Evaluation];
    D-->|Reselection|B;
    D-->|Accuracy meets the standard|E[Model checkpoint];
    E-->F[User Interface & Transcription Software];
```
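As a rough illustration, the loop can be sketched in Python, with each stage as a hypothetical helper function (`preprocess`, `correct`, and `evaluate` are placeholders for the stages described below, not actual project code):

```python
def transcribe(image, engines, threshold=0.9):
    """Sketch of the workflow above: try OCR engines until the
    corrected transcript meets the accuracy standard."""
    page = preprocess(image)                  # pre-processing stage
    for engine in engines:                    # reselection loop
        raw = engine.recognize(page)          # OCR engine stage
        corrected = correct(raw)              # correction algorithm (e.g. SymSpell)
        if evaluate(corrected) >= threshold:  # evaluation against ground truth
            return corrected                  # accuracy meets the standard
    return None
```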
Initially, we chose five mainstream OCR engines, developed by established companies and research groups, that already produce satisfactory results when transcribing printed text: Transkribus, Tesseract (originally from HP, later Google), Kraken, Google Cloud Vision, and Amazon AWS Textract. These engines cover most of the transcription industry and are relatively mature, while still leaving room for further development and training. Many more OCR engines are available, such as Ocular (from the University of California, Berkeley), but they are not as suitable for transcribing handwritten historical text.
```mermaid
flowchart LR;
    A[Greyscale]-->B[Background removal];
    B-->C[Thresholding];
```
```python
import cv2

def get_greyscale(image):
    # Convert the image from BGR color to a single grey channel, which makes
    # the page look more like printed text and generally improves accuracy
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

def remove_noise(image):
    # Bilateral filtering removes background noise while preserving stroke edges
    return cv2.bilateralFilter(image, 5, 75, 75)

def thresholding(image):
    # Adaptive Gaussian thresholding binarizes the page, suppressing background
    # information other than the handwriting
    return cv2.adaptiveThreshold(image, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 15, 9)
```
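A minimal usage sketch, chaining the three steps on a scanned page (the filenames are hypothetical):

```python
image = cv2.imread("diary_page.png")  # hypothetical input scan (BGR)
processed = thresholding(remove_noise(get_greyscale(image)))
cv2.imwrite("diary_page_processed.png", processed)
```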
SymSpell is an algorithm used for post-OCR correction: it finds all strings within a fixed edit distance of a query, from a large list of strings, in very short time. SymSpell derives its speed from the Symmetric Delete spelling correction algorithm and keeps its memory requirement in check with prefix indexing. The Symmetric Delete algorithm reduces the complexity of generating edit candidates and of the dictionary lookup for a given edit distance: it is six orders of magnitude faster than the traditional approach (deletes + transposes + substitutes + inserts) and is language independent.
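The key idea is that both dictionary terms and the (possibly misspelled) input term only ever generate delete variants, which can meet in a shared index. A toy sketch of that principle (an illustration, not SymSpell's actual implementation):

```python
def delete_variants(word, max_distance=2):
    # All strings reachable from `word` by deleting up to max_distance characters
    results, frontier = {word}, {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results

# A dictionary term and an OCR error are candidate matches when their delete
# sets intersect; candidates are then verified with a true edit-distance check.
print(delete_variants("message") & delete_variants("messa"))  # non-empty set
```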
```mermaid
flowchart LR;
    A[Transcription Result]-->B[Rectified Result];
    B-->|Compare CER/WER/Levenshtein distance|C[Ground Truth];
```
```python
import pkg_resources
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
bigram_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
)
# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

# Read the raw OCR output (path left blank here)
with open(r"") as file:
    content = file.read()

# Look up suggestions for multi-word input strings (supports compound
# splitting & merging); max_edit_distance applies per single word,
# not per whole input string
suggestions = sym_spell.lookup_compound(
    content, max_edit_distance=2, transfer_casing=True
)

# Concatenate the suggested terms into the corrected transcript
result_after = ""
for suggestion in suggestions:
    result_after += suggestion.term
```
Example:

| Input (raw OCR output) | Output (after SymSpell) |
| --- | --- |
| Can yu readthis messa ge despite thehorible sppelingmsitakes | can you read this message despite the horrible spelling mistakes |
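The evaluation step then compares the rectified result against ground truth. A minimal sketch of the three metrics using the `rapidfuzz` library (the library choice is an assumption; any Levenshtein implementation would do):

```python
from rapidfuzz import fuzz
from rapidfuzz.distance import Levenshtein

def cer(hypothesis, reference):
    # Character Error Rate: character-level edit distance / reference length
    return Levenshtein.distance(hypothesis, reference) / len(reference)

def wer(hypothesis, reference):
    # Word Error Rate: word-level edit distance / number of reference words
    hyp_words, ref_words = hypothesis.split(), reference.split()
    return Levenshtein.distance(hyp_words, ref_words) / len(ref_words)

transcript = "despite the hor rible speling mistakes"
ground_truth = "despite the horrible spelling mistakes"
print(f"CER: {cer(transcript, ground_truth):.2%}")
print(f"WER: {wer(transcript, ground_truth):.2%}")
print(f"Fuzz ratio: {fuzz.ratio(transcript, ground_truth):.0f}%")  # scaled Levenshtein
```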
Transkribus is a comprehensive platform for the digitisation, AI-powered text recognition, transcription and searching of historical documents – from any place, any time, and in any language. Visit the official Transkribus website here.
Strengths:
Weaknesses:
| | |
| --- | --- |
| Training Set | Jeremy Bentham Project |
| Testing Set | Women's Travel Diaries |
| Accuracy w/ SymSpell | CER: 1.84%, WER: 5.56%, Fuzz Ratio (scaled Levenshtein distance): 96%[^1] |
| Accuracy w/o SymSpell | CER: 7.88%, WER: 12.74%, Fuzz Ratio (scaled Levenshtein distance): 92% |
Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co., Greeley, Colorado between 1985 and 1994, with some more changes made in 1996 to port it to Windows, and some C++izing in 1998. In 2005 Tesseract was open-sourced by HP. From 2006 until November 2018 it was developed by Google. Visit the Tesseract repository here.
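A minimal sketch of invoking Tesseract from Python via the pytesseract wrapper on a preprocessed page (the wrapper choice, filename, and config flags are assumptions, not project code):

```python
import cv2
import pytesseract

image = cv2.imread("diary_page_processed.png")  # hypothetical preprocessed scan
# --oem 1 selects the LSTM engine; --psm 6 assumes a uniform block of text
text = pytesseract.image_to_string(image, lang="eng", config="--oem 1 --psm 6")
print(text)
```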
Strengths:
Weaknesses:
| | |
| --- | --- |
| Training Set | tessdata_fast & tessdata_best (repositories containing the fast and best trained data for the Tesseract Open Source OCR Engine) |
| Testing Set | Women's Travel Diaries |

| Writing style | Author | Accuracy |
| --- | --- | --- |
| Non-cursive | N/A | > 95%, around 96%–97% accuracy in both characters and words[^2] |
| Cursive | Crawford, Martha | CER: 68.28%, WER: 91.38%, Fuzz Ratio (scaled Levenshtein distance): 45% |
| Cursive | McMillan, Mary | CER: 75.74%, WER: 98.44%, Fuzz Ratio (scaled Levenshtein distance): 30% (nearly unrecognizable) |
| Cursive | Sanderson, Harriet | CER: 66.09%, WER: 92.95%, Fuzz Ratio (scaled Levenshtein distance): 50% |
kraken is a turn-key OCR system optimized for historical and non-Latin script material. kraken's main features are:

- Fully trainable layout analysis and character recognition
- Right-to-Left, BiDi, and Top-to-Bottom script support
- ALTO, PageXML, abbyyXML, and hOCR output
- Word bounding boxes and character cuts
- Multi-script recognition support
- Public repository of model files
- Lightweight model files
- Variable recognition network architectures

Visit the official Kraken website here.
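A minimal sketch of kraken's Python API for this kind of pipeline (following the pattern in kraken's documentation; the filenames are hypothetical, and the box-based segmentation interface shown here has changed across kraken versions):

```python
from PIL import Image
from kraken import binarization, pageseg, rpred
from kraken.lib import models

im = Image.open("diary_page.png")           # hypothetical input scan
bw = binarization.nlbin(im)                 # binarize the page
segmentation = pageseg.segment(bw)          # detect text-line bounding boxes
model = models.load_any("en_best.mlmodel")  # hypothetical recognition model file
for record in rpred.rpred(model, bw, segmentation):
    print(record.prediction)                # one recognized text line per record
```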
Strengths:
Weaknesses:
| | |
| --- | --- |
| Training Set | IAM Handwriting Database, a widely used, freely available database of unconstrained handwritten English text; forms were scanned at a resolution of 300 dpi and saved as PNG images with 256 gray levels |
| Testing Set | Women's Travel Diaries / IAM database |
| Accuracy w/ SymSpell | CER: 7.87%, WER: 26.27%, Fuzz Ratio (scaled Levenshtein distance): 94% |
| Accuracy w/o SymSpell | CER: 12.54%, WER: 26.88%, Fuzz Ratio (scaled Levenshtein distance): 89% |
Google Cloud Vision API offers powerful pre-trained machine learning models through REST and RPC APIs. It can assign labels to images and quickly classify them into millions of predefined categories, detect objects and faces, read printed and handwritten text, and build valuable metadata into an image catalog.
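A minimal sketch of reading handwriting with the Vision API's Python client (the filename is hypothetical; credentials setup is omitted):

```python
import io
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with io.open("diary_page.png", "rb") as f:  # hypothetical input scan
    image = vision.Image(content=f.read())

# document_text_detection is the dense-text / handwriting-oriented endpoint
response = client.document_text_detection(image=image)
print(response.full_text_annotation.text)
```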
| | |
| --- | --- |
| Training Set | N/A |
| Testing Set | Women's Travel Diaries |
| Accuracy w/ SymSpell | CER: 28.69%, WER: 46.77%, Fuzz Ratio (scaled Levenshtein distance): 80% |
| Accuracy w/o SymSpell | CER: 31.43%, WER: 49.45%, Fuzz Ratio (scaled Levenshtein distance): 78% |
Google Cloud Vision shows particular strength in line/word segmentation.
Amazon Textract is based on the same proven, highly scalable deep-learning technology that was developed by Amazon's computer vision scientists to analyze billions of images and videos daily. Users don't need any machine learning expertise to use it: Amazon Textract includes simple, easy-to-use APIs that can analyze image files and PDF files. Amazon Textract is always learning from new data, and Amazon continually adds new features to the service. However, Textract cannot be externally trained on custom data.
There are four major advantages to incorporating cloud computing into our project:
| | |
| --- | --- |
| Training Set | N/A |
| Testing Set | Women's Travel Diaries |
| Accuracy w/o SymSpell | CER: 19.83%, WER: 42.13%, Fuzz Ratio (scaled Levenshtein distance): 87% |
A user interface was developed using Amazon cloud computing services. The user initially uploads a PDF file to Amazon S3; a first Lambda function then initiates the process and sends the input file to the Textract service. AWS Textract sends its output (a JSON payload) to a second Lambda function, which is responsible for cleaning the JSON payload, converting the JSON to CSV, and storing the output in Amazon S3. A complete tutorial of this tool, sample results, and a demo can be found here.
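A minimal sketch of the first Lambda handler using boto3, triggered by the S3 upload event (an illustration of the pattern, not the project's exact code):

```python
import json
import boto3

textract = boto3.client("textract")

def lambda_handler(event, context):
    # The S3 upload event carries the bucket and key of the new PDF
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Textract requires the asynchronous API for PDF input; the JobId is
    # later used to fetch the JSON payload that the second Lambda cleans
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return {"statusCode": 200, "body": json.dumps({"JobId": response["JobId"]})}
```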
[^1]: The current lowest CER produced by a general HTR tool (one that supports more than cursive handwriting) in the industry is around 2.75%.
[^2]: The data is released by the official Tesseract UNLV testing site. More specific information can be found here.
[^3]: The training sets of all the OCR engines require highly consistent and legible handwritten documents, which can provide high-quality ground-truth files. Joined-up (cursive) writing is relatively harder to train on.