IndoCollex: A Testbed for Morphological Transformation of Indonesian Word Colloquialism

Abstract

The Indonesian language is heavily riddled with colloquialism, in both written and spoken forms. In this paper, we identify a class of Indonesian colloquial words that have undergone morphological transformations from their standard forms, categorize their word formations, and propose a benchmark dataset of Indonesian Colloquial Lexicons (IndoCollex) consisting of informal words on Twitter expertly annotated with their standard forms and their word formation types/tags. We evaluate several models for character-level transduction to perform morphological word normalization on this testbed to understand their failure cases and provide baselines for future work. As IndoCollex catalogues word formation phenomena that are also present in the non-standard text of other languages, it can also provide an attractive testbed for methods tailored for cross-lingual word normalization and non-standard word formation.

Publication
Association for Computational Linguistics
Haryo Akbarianto Wibowo
Builder in the Artificial Intelligence Field

Researcher and engineer in artificial intelligence, especially deep learning for NLP. Loves to learn and share.