Language annotation on the Web with MoLA

The Web consists of very many resources in many languages and has information about even more. Sure, the majority of Internet users speak English, Chinese, or Spanish, but there are sites, pages, paragraphs, and documents in other languages and about lesser known ‘languoids’ (language, dialect, variant, etc.), ranging from, say, a poem about the poor man’s dinner written in an old Brabants dialect that used to be spoken in the south of the Netherlands to the effects of mobile phones on Zimbabwean (cf. South African) isiNdebele [1]. How should that be annotated? Here’s a complex use case of languoids for old French:

(source: [2])

The extant multilingual Semantic Web models, such as W3C’s ontolex-lemon community standard have outsourced that to a ‘the language tag comes from some place’, as they focus on the word-level and/or sentence-level for the multilingual (semantic) Web. There are indeed standardisations of language tags. Notably, there are the ISO 639 codes (parts 1, 2, 3 and 5) for some 8000 languages—but there are more languoids that are not covered by the ISO list, with an estimated 8-15K or so currently spoken languoids and 150K extinct. There are also Glottolog, Ethnologue, and MultiTree, which are more comprehensive in some respect, but they are limited and problematic in other cases. For instance, Glottolog—the best among them—still uses the broader/narrower than, has artificial names for grouping languoids, has inconsistencies in modelling decisions, and is still incomplete in coverage.

My co-authors—Frances Gillis-Webber, also at UCT, and Sabine Tittel, with the Heidelberg Academy of Sciences and Humanities—and I aim to change that so as to allow for more comprehensive and more inclusive language tags and annotations on the Semantic Web.

In order to be able to do so, we developed a Model for Language Annotation (MoLA) that caters for relatively comprehensive languoid annotations and how they are related, such as allowing recording which languoid evolved from or was influenced by which other languoid, when it was spoken and where, its preferred and alternate names, what sort of lect it is (e.g., dialect, pidgin), which dialect cluster or language family it is a member of, and backward compatibility with the ISO 639 codes.

The design approach was that of labour-intensive manual modelling, including competency questions, an extensive use case, and iterative development of the model at the conceptualisation stage using the Object-Role Modeling (ORM) language. This model was then formalised in OWL (well, most of it). It was tested on the competency questions, smaller use case scenarios, and validated with the large use case. A snippet for Spanish is as follows, as the one for Old French gets quite lengthy (some details).

It enhances Glottolog’s model on several key points, including proper relations between languoids cf BT/RT, a languoid can be associated with zero or more regions, and it allows for multiple names of a languoid both concurrently and over time.

This sounds like a smooth process, but there were a few modelling hurdles that had to be overcome. One of them is level of granularity of analysis of a languoid. For instance, one could argue both that isiXhosa is a language—it’s one of the 11 official languages of South Africa—but also that it is a dialect cluster (i.e., a collection), as there are multiple dialects of isiXhosa. This is a similar case for Old French that’s a language and member of the Romance family of languages, but also can be seen as a collection of dialects (e.g., Picard and Norman), and dialects, in turn, may have varieties (e.g., Artois and Santerre for Picard). On the bright side, this now can be represented and, because it is represented explicitly, it can be queried, such as “Which languoids are dialects of French that were spoken in the middle ages in France?” and “Which languoids are a member of Nguni?”. The knowledgebase still needs to be populated, though, so it won’t work yet with all languoids.

More details can be found in the paper that was recently published [2]. It will be presented in a few weeks at the 1st Iberoamerican Conference on Knowledge Graphs and Semantic Web (KGSWC’19), in Villa Clara, Cuba, alongside 13 other papers on ontologies. The first author, Frances, is soon travelling to the ISWS summer school, so I will present it at KGSWC’19.

 

References

[1] Nkomo, D., Khumalo, L. Embracing the mobile phone technology: its social and linguistic impact with special reference to Zimbabwean Ndebele. African Identities, 10(2): 143-153.

[2] Gillis-Webber, F., Tittel, S., Keet, C.M.. A Model for Language Annotations on the Web. 1st Iberoamerican conference on Knowledge Graphs and Semantic Web (KGSWC’19). Springer CCIS. 23-30 June 2019, Villa Clara, Cuba.

Advertisement