25 Fall / Digital Humanities Seed Grant Project

Principal Investigator: Prof. Michael Yan Hon CHUNG

Manchu OCR and Translation AI

The Manchu OCR project aims to build a reliable, high-accuracy system that automatically reads and segments Manchu script from real historical sources. Building on a custom word-segmentation workflow and a large corpus of synthetic and real-world scanned materials collected from multiple repositories, the project has already produced a fine-tuned vision-language model that performs strongly on real-world scanned data. Early tests show that the system can accurately recognize printed and neatly handwritten archival pages, supporting the feasibility of applying modern OCR techniques to a low-resource script with complex vertical structures and variable calligraphic styles.

The next stage of the project focuses on expanding the training data, strengthening performance, and integrating OCR output with transliteration and machine translation pipelines. Once complete, the system will provide scholars with a practical tool for converting Manchu archival documents into searchable, analyzable text. It will lower linguistic barriers, accelerate historical research, and establish a transferable framework for OCR development in other endangered or low-resource languages.

Project Objectives

Project Team

Principal Investigator:
Prof. Michael CHUNG (Division of Humanities)
Research Assistant:
Hanlin WANG (student from Information Technology, Department of Computer Science and Engineering)
Project Support:
Dr. Steve MA (Division of Humanities)
Yifan WANG (Library)

To be launched in Spring 2026.

Project images on this page are courtesy of Prof. Michael Chung.