Open final draft tagger pdf

4/8/2023

The creation of linguistic resources is crucial to the continued growth of research and development efforts in the field of natural language processing, especially for resource-scarce languages. In this paper, we describe the curation and annotation of corpora and the development of multiple linguistic technologies for four official South African languages, namely isiNdebele, Siswati, isiXhosa, and isiZulu. Development efforts included sourcing parallel data for these languages and annotating each on token, orthographic, morphological, and morphosyntactic levels. These sets were in turn used to create and evaluate three core technologies, viz. a lemmatizer, a part-of-speech tagger, and a morphological analyzer for each of the languages. We report on the quality of these technologies, which improve on the rule-based technologies previously developed as part of a similar initiative in 2013. These resources are made publicly accessible through a local resource agency with the intention of fostering further development of both resources and technologies that may benefit the NLP industry in South Africa.

Access to linguistic resources such as annotated data can facilitate, or even hinder, research and development efforts depending on its quality and availability. Central to these efforts is the notion of lexical semantics, generally defined as the analysis of words and lexical units in terms of their classification, decomposition, and their lexical meaning in relationship to context and syntax. To date, lexical semantic knowledge has been captured through either a knowledge-based approach, where linguistic knowledge is directly recorded in a structured and often rule-based form, or a corpus-based approach, where machine-learned semantic knowledge is gained from corpora and represented implicitly. Contemporary research relies on natural language processing (NLP) to investigate usage patterns within large electronic corpora to achieve lexical semantic tasks such as word sense disambiguation and semantic role labelling. In turn, NLP applications rely on these tasks to perform machine translation, information extraction, and text classification, among other tasks. For under-resourced languages, this approach suffers due to the scarcity and often poor quality of available lexical data.

Based on their orthographies, the 11 official South African languages are all either Southern Bantu languages or West-Germanic languages, and can be categorized on a conjunctive-disjunctive scale into three groups, i.e., conjunctive languages (four Nguni languages, viz. isiZulu, isiXhosa, isiNdebele, and Siswati), disjunctive languages (Tshivenḓa, Xitsonga, and three Sotho-Tswana languages, viz. Sesotho, Sepedi, and Setswana), and, in the middle of the scale, two West-Germanic languages, viz. Afrikaans and English. All 10 of the indigenous official languages (English excluded) are considered resource-scarce, and generating lexical resources for them can be a protracted endeavor. In light of this, the South African government has established several legislative frameworks to help promote both the use and advancement of its official languages. These measures promote the development of human language technologies (HLT) and as a result help to preserve the cultural significance of each language whilst contributing to the tools and resources that serve its community of language researchers.

This study in particular serves as an extension to a previously sponsored NCHLT Text project, completed in 2013, that annotated data and developed the associated core technologies for all 10 indigenous South African languages (English was excluded since text resources for English are readily available). Lemmatization for these languages previously derived lemmata through rule-based normalization, which was susceptible to inaccurate stem identification and consequently unreliable lemmatization. Similarly, the rule-based approach to morpheme segmentation relied on recursive affix identification in a token before verifying these against a lexicon of valid affix combinations and valid roots and stems.
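To make the rule-based morpheme segmentation described above more concrete, the following is a minimal sketch of recursive affix identification followed by a lexicon check. It is not the NCHLT implementation: the function names, the tiny affix and root lists, and the whitelist of prefix sequences are all assumptions chosen for illustration, and a real lexicon would be far larger.

```python
from typing import List, Optional, Set, Tuple

# Toy, hand-picked entries for illustration only (assumed, not the NCHLT lexicon).
PREFIXES: Set[str] = {"ngi", "ya", "aba", "umu"}
SUFFIXES: Set[str] = {"a"}
ROOTS: Set[str] = {"hamb", "ntu"}
# Hypothetical whitelist of prefix sequences that may occur together.
VALID_PREFIX_SEQUENCES: Set[Tuple[str, ...]] = {
    (), ("ngi",), ("ngi", "ya"), ("aba",), ("umu",),
}


def strip_prefixes(token: str) -> List[Tuple[Tuple[str, ...], str]]:
    """Recursively peel known prefixes; return every (prefixes, remainder) split."""
    results: List[Tuple[Tuple[str, ...], str]] = [((), token)]
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) > len(prefix):
            for rest, remainder in strip_prefixes(token[len(prefix):]):
                results.append(((prefix,) + rest, remainder))
    return results


def segment(token: str) -> Optional[List[str]]:
    """Return a morpheme list accepted by the toy lexicon, or None."""
    for prefixes, remainder in strip_prefixes(token):
        # Verify the candidate prefix sequence against the whitelist.
        if prefixes not in VALID_PREFIX_SEQUENCES:
            continue
        # Accept the remainder as a bare root, or as root + one suffix.
        if remainder in ROOTS:
            return list(prefixes) + [remainder]
        for suffix in SUFFIXES:
            root = remainder[: -len(suffix)]
            if remainder.endswith(suffix) and root in ROOTS:
                return list(prefixes) + [root, suffix]
    return None


if __name__ == "__main__":
    for word in ("ngiyahamba", "abantu", "xyz"):
        print(word, segment(word))
```

The sketch also shows why such an approach is brittle for resource-scarce languages: any token whose root or affix combination is missing from the lexicon is simply rejected, so coverage depends entirely on the size and quality of the hand-built lists.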
