Arabic-Character Historical Document Processing: Why and How To?

  • Zied Mnasri, Università di Napoli L'Orientale
Keywords: Arabic manuscript, Historical Document Processing (HDP), Optical Character Recognition (OCR), Character segmentation and recognition

Abstract

The aim of Arabic-character Historical Document Processing (HDP) is to design and develop techniques for the automatic transcription of historical manuscripts written in Arabic characters into text files, for example in .txt or .doc format. The scope covers not only Arabic but also other languages written in this script, such as Farsi, Urdu, Azeri and Ottoman Turkish. The key idea is to go from the scanned image of the manuscript to the text file using artificial intelligence techniques in two main steps: first, processing the manuscript image to isolate the characters and to remove other elements commonly found in historical manuscripts, such as illustrations and ornaments; second, identifying the characters by pattern recognition. Such work requires a rich dataset of Arabic-character manuscripts, in addition to effective methods for image processing, pattern recognition and, optionally, language modelling. This paper presents an overview of the state of the art, datasets, challenges, methods and potential applications of Arabic-character HDP, as a first step towards a general framework for undertaking such a project.
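
To make the two-step idea concrete, the following is a minimal toy sketch in Python. It is not the method proposed in the paper. It assumes a grayscale page given as a list of rows, treats any ink blob larger than a hypothetical `max_pixels` limit as an ornament, and recognises the remaining blobs by exact-size template matching. All function names, the threshold and the templates are illustrative assumptions. Real Arabic-script manuscripts are cursive and carry diacritical dots, which this sketch deliberately ignores.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Map dark (ink) pixels to 1 and light (paper) pixels to 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def connected_components(binary):
    """Group ink pixels into 4-connected blobs, each a list of (row, col)."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for r in range(h):
        for c in range(w):
            if not binary[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, blob = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(blob)
    return blobs


def to_patch(blob):
    """Crop a blob to its bounding box and return it as a 0/1 grid."""
    top = min(y for y, _ in blob)
    left = min(x for _, x in blob)
    height = max(y for y, _ in blob) - top + 1
    width = max(x for _, x in blob) - left + 1
    patch = [[0] * width for _ in range(height)]
    for y, x in blob:
        patch[y - top][x - left] = 1
    return patch


def classify(patch, templates):
    """Step 2 (toy): pick the same-sized template with the fewest mismatched pixels."""
    best_label, best_dist = None, None
    for label, tmpl in templates.items():
        if len(tmpl) != len(patch) or len(tmpl[0]) != len(patch[0]):
            continue  # this sketch does not rescale, so sizes must match
        dist = sum(a != b for rp, rt in zip(patch, tmpl) for a, b in zip(rp, rt))
        if best_dist is None or dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def transcribe(gray, templates, max_pixels):
    """Step 1: isolate character blobs and drop ornaments; step 2: label each blob."""
    blobs = connected_components(binarize(gray))
    # Assumption: anything larger than max_pixels is an ornament, not a character.
    blobs = [b for b in blobs if len(b) <= max_pixels]
    # Arabic script runs right to left, so read blobs by descending left edge.
    blobs.sort(key=lambda b: min(x for _, x in b), reverse=True)
    return [classify(to_patch(b), templates) for b in blobs]
```

In practice each stage would be replaced by a stronger component, for example learned binarisation, layout analysis and a trained recogniser, optionally followed by a language model. The sketch only shows how the image-processing stage feeds the pattern-recognition stage.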

Published
2023-02-26
How to Cite
Mnasri, Z. (2023). Arabic-Character Historical Document Processing: Why and How To? Archeologie Tra Oriente E Occidente, 1, 49-61. https://doi.org/10.6093/archeologie/9857