Arabic-Character Historical Document Processing: Why and How To?

Zied Mnasri

doi:10.6093/archeologie/9857

Università di Napoli L'Orientale

Abstract

The aim of Arabic-character Historical Document Processing (HDP) is to design and develop techniques for the automatic transcription of historical manuscripts written in Arabic characters into text files, for example in .txt or .doc format. The scope covers not only Arabic but also other languages written in this script, such as Farsi, Urdu, Azeri, and Ottoman Turkish. The key idea is to go from the scanned image of the manuscript to the text file using artificial intelligence techniques in two main steps: first, processing the manuscript image to isolate the characters and to remove the non-textual elements commonly found in historical manuscripts, such as images and ornaments; second, identifying the characters by pattern recognition. Such work requires a rich dataset of Arabic-character manuscripts, in addition to effective methods for image processing, pattern recognition and, optionally, language modelling. This paper presents an overview of the state of the art, datasets, challenges, methods, and potential applications of Arabic-character HDP, as a first step towards a general framework for undertaking such a project.