

An Extended Pattern Based Comprehensive Stemmer for the Urdu Language

Lookup NU author(s): Dr Husnain Sherazi


Licence

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).


Abstract

© 2024 Copyright held by the owner/author(s). The Urdu language is used by approximately 200 million people for spoken and written communication on a daily basis. A substantial amount of unstructured Urdu textual data is available worldwide. Data mining techniques can be used to extract meaningful knowledge from such a large, potentially informative source of data. Many text processing systems are available for unstructured textual data. However, these systems are mostly language specific and have been developed for languages such as English, Spanish, and Chinese. Unfortunately, far fewer language processing resources are available for Urdu. Stemming is one of the most important preprocessing steps in the text mining process; its goal is to reduce the grammatical forms of a word (variations in, e.g., part of speech, gender, and tense) to their root form. In this work, we have extended the stemming capabilities of our existing pattern-based comprehensive stemming system for Urdu text. In addition to the stemming rules from our previous work, we introduce novel rules for prefix and infix stemming. We also optimize the existing suffix removal rules and extend the add character lists for word normalization. These stemming rules are generic and can generate the stem of Urdu words as well as loan words (words belonging to other languages, i.e., Arabic, Persian, Turkish). In the experimental evaluation, we observed a significant improvement in the overall stemming accuracy of our proposed pattern-based Urdu stemmer, which demonstrates the adoptability of the proposed stemming approach for a variety of text-processing applications.


Publication metadata

Author(s): Ali M, Baqir A, Sherazi HHR, Khalid S, Smith P, Lee M

Publication type: Article

Publication status: Published

Journal: ACM Transactions on Asian and Low-Resource Language Information Processing

Year: 2024

Volume: 23

Issue: 12

Print publication date: 23/11/2024

Online publication date: 21/10/2024

Acceptance date: 05/10/2024

Date deposited: 08/01/2025

ISSN (print): 2375-4699

ISSN (electronic): 2375-4702

Publisher: ACM

URL: https://doi.org/10.1145/3701231

DOI: 10.1145/3701231

