Some reflections on designing Abstract Wikipedia so far

Abstract Wikipedia aims to at least augment the current, if not be the next-generation, Wikipedia. Besides the human-authored articles that take time to write and maintain, article generation could be scaled up through automation, and for many more languages. And keep all that content up-to-date. And all that reliably without hallucinations where algorithms make stuff up. How? Represent the data and information in a structured format, such as in an RDF triple store, JSON, or even a relational database or OWL, and generate text from suitably selected structured content. Put differently: multilingual natural language generation, at scale, and community-controlled. For the Abstract Wikipedia setting, the content would come from Wikidata and the code to compute it from Wikifunctions. Progress in creating the system isn’t going as fast as hoped for and a few fellows wrote an initial critique of the plans and progress made, to which the Abstract Wikipedia team at WMF wrote a comprehensive reply. It was also commented on in a Signpost technology report, and a condensed non-technical summary has appeared in an Abstract Wikipedia updates letter. The question remains: is it feasible? If so, what is the best way to go about doing it; if not, why not and then what?

A ‘pretty picture’ of a prospective Abstract Wikipedia architecture, at a very high level. Challenges lie in what’s going to be in that shiny yellow box in the centre and how that process should unfold, in the lexicographic data in Wikidata, and where the actual text generation will happen and how.

My name appears in some of those documents, as I’ve been volunteering in the NLG stream of the Abstract Wikipedia Project in an overlapping timeframe and I contributed to the template language, to the progress on the constructors (here and here), and to adding isiZulu lexemes to Wikidata, among others. The mentions are mostly in the context of challenges with Niger Congo B (AKA ‘Bantu’) languages that are spoken across most of Sub-Saharan Africa. Are these languages really so special that they deserve a specific mention over all others? Yes and No. A “No” may apply since there are many languages spoken in the world by many people that have ‘unexpected’ or ‘peculiar’ or ‘unique’ or ‘difficult to computationally process’ features or are in the same boat when it comes to their low-resource status and the challenges that entails. NCB languages, such as isiZulu that I focus on mainly, are just one such family of languages. If I were to have moved to St. Lawrence Island in the Bering Strait, say, I could have given similar pushback, with the difference that there are many, many more millions of people speaking NCB languages than Yupik. Neither language is in the Indo-European language family. Language families exist for a reason; they have features really unlike others. That’s where the “Yes” answer comes in. The challenges behind that “Yes”, together with those of low-resourcedness, span four dimensions: theoretical, technical, people, and praxis. Let me briefly illustrate each in turn.

Theory – linguistic and computational

The theoretical challenges are mainly about the languages and their linguistics: the characteristic features they have and how much we know of them, which affects technical aspects down the road. For instance, we know that the noun class system is emblematic of NCB languages. To a novice or an outsider, it smells of the M/F/N gender of nouns like in French, Spanish, or German, but then a few more of them. It isn’t quite like that in the details for the 11-23 noun classes in an NCB language and squeezing that into Wikidata is non-trivial, since here and there an n-ary relation is more appropriate for some aspects than approximating that by reifying binaries partially. The noun class of the noun governs a concordial agreement system that goes across a sentence rather than only its adjacent word; e.g., not only an adjective agreeing with the gender of a noun like in Romance languages (e.g., abuela vieja and abuelo viejo in Spanish) for each noun class, but also conjugation of the verb by noun class and other aspects such as quantification over a noun (e.g., bonke abantu ‘all humans’ and zonke izinja ‘all dogs’). We know some of the rules, but not all of them and only for some of the NCB languages. When I commenced with natural language generation for isiZulu in earnest in 2014, it wasn’t even clear how to pluralise nouns roughly, let alone exactly. We now know roughly how to pluralise nouns automatically. Figuring out the isiZulu verb present tense got us a paper as recently as 2017; the Context-Free Grammar we defined for it is not perfect yet, but it’s an improvement on the state of the art and we can use it in certain controlled settings of natural language generation.
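
To make the quantifier example concrete, here is a minimal sketch of agreement driven by noun class. It only covers the two examples given above; the noun class numbers (2 for abantu, 10 for izinja) follow the usual numbering and are my annotation here, and the dictionary and function names are hypothetical. It is an illustration of the lookup idea, not a grammar.

```python
# Toy illustration: the form of the quantifier 'all' depends on the noun class.
# Only the two classes exemplified in the text are filled in; the class
# numbers are an assumption based on the usual noun class numbering.
QUANTIFIER_ALL = {
    2: "bonke",   # as in 'bonke abantu' (all humans)
    10: "zonke",  # as in 'zonke izinja' (all dogs)
}

# Hypothetical mini-lexicon: plural noun -> noun class.
NOUN_CLASS = {
    "abantu": 2,
    "izinja": 10,
}


def quantify_all(noun: str) -> str:
    """Return 'all <noun>' with the quantifier agreeing with the noun class."""
    noun_class = NOUN_CLASS.get(noun)
    if noun_class is None:
        raise LookupError(f"no noun class recorded for {noun!r}")
    quantifier = QUANTIFIER_ALL.get(noun_class)
    if quantifier is None:
        raise LookupError(f"no quantifier form recorded for class {noun_class}")
    return f"{quantifier} {noun}"


print(quantify_all("abantu"))
print(quantify_all("izinja"))
```

The point of the sketch is that the noun alone does not suffice: the realiser needs the noun class as data, and every agreeing element in the sentence needs its own table or rule keyed on that class.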

My collaborator and I like such a phrase structure grammar. There are several types of grammars, however, and it’s anyone’s guess whether any of them is expressive and convenient enough to capture grammars of the NCB languages. The alternative family is dependency grammars with their subtypes and variants. To the best of my knowledge, nothing has been done with such grammars and any of the NCB languages. What I can assure you from ample experience is that it is infeasible for people working on low- or medium-resourced languages to start writing up grammars for every pet preference of grammar flavour of the day that rotating volunteers have.

IsiZulu and Kiswahili are probably the least low-resourced languages of the NCB language family, and yet there’s no abundance of grammar specifications. It’s not that it can’t be done at least in part; it’s just that most material, if available at all, is outdated and never tested on more than a handful of words or sentences, and thus is not off-the-shelf computationally reliable at present. And there are limited resources available to verify. This is also the case for many other low-resourced languages. For Abstract Wikipedia to achieve its inclusivity aim, the system must have a way to deal with incremental development of grammar specifications without large upfront investments. One shouldn’t want to kill a mosquito with a sledgehammer by first having to scramble together the material and build a sledgehammer, because there are no instant resources to create that sledgehammer. Let’s start with something feasible in the near term, to build just enough equipment to get done what’s needed. Rolling up a newspaper page will do just fine to kill that mosquito. For instance, don’t demand that the grammar spec must be able to cover, say, all numbers in all possible constructions, but only for one specific construction in a specific context. Say, for stating the age of a person provided they’re less than 100 years old, or the numbers related to years, not centuries or millennia that will be tackled later. Templates are good for specifying such constrained contexts of use and they assist with incremental grammar development and can offer near-instant concrete user feedback of positive contributions showing results.
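
As a rough sketch of what such a constrained template could look like in code, the example below realises an age statement only within the range it was written for and declines everything else, so that coverage can grow one construction at a time. English is used as a stand-in language, and the function name and the choice to return None outside the covered range are assumptions for illustration, not the Abstract Wikipedia template language.

```python
from typing import Optional


def age_sentence(name: str, age: int) -> Optional[str]:
    """Realise 'X is N years old' for ages 0-99 only.

    Anything outside the covered range returns None, so that a caller can
    skip the sentence instead of generating something incorrect.
    """
    if not 0 <= age < 100:
        return None  # outside the template's declared scope
    unit = "year" if age == 1 else "years"
    return f"{name} is {age} {unit} old."


print(age_sentence("Thandi", 34))
print(age_sentence("Thandi", 120))
```

The template states its own limits, which is what makes it possible to test it on a small, known set of inputs and to extend it later without having specified numbers in all possible constructions first.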

Supporting a template-based approach doesn’t mean that I don’t understand that the sledgehammer may be better in theory – an all-purpose CFG or DG would be wonderful. It’s that I know enough of the facts on the ground to be aware that rolling up a newspaper page suffices for a case and is feasible, unlike the sledgehammer. Let low-resource languages join the party. Devise a novel framework, method, and system that permits incremental development and graceful degradation in the realiser. A nice-to-have on top of that would be automated ‘transformers’ across types of grammars so we won’t have to start all over again when the next grammar formalism flavour is trumpeted, if it must change at all.

Technical challenges

The theory relates to the second group of challenges, which are of a technical nature. There are lovely systems and frameworks that overconfidently claim to be ‘universal’. Grammars coded up for 40, nay 100, languages, so it must be good and universal, or so the assumption may go. We do want to reuse as much as possible (being resource-constrained and all), but then it never turns out to work off-the-shelf. From word-based spellcheckers like in OpenOffice that are useless for agglutinating languages, to the Universal Dependencies (UD) framework and accompanying tools that miss useful dependencies, are too coarse-grained at the word-level and, up till very recently, were artificially constrained to trees rather than DAGs, up to word-based natural language generation realisers: we have (had) to start from scratch mostly and devise new approaches.

So now we have a template language for Abstract Wikipedia (yay!) that can handle the sub-words (yay!), but then we get whacked like a mole on needing a fully functional Dependency Grammar (and initially UD and trees only) for parsing the template, which we don’t have. The UD framework has to be modified to work for NCB languages – none of those 100 is an NCB language – to allow arcs to be drawn on sub-word fragments, or if on the much less useful words only, then allowing for more than one incoming arc. It also means we first have to adapt UD annotation tools to get clarity on the matter. And off we must go to do all that before we can sit at that table again? We’ll do a bit, enough for our own systems of what we need for the use cases.

Sadly, Grammatical Framework is worse, despite there already being a draft partial resource grammar for isiZulu and even though it’s a framework of the CFG flavour of grammars. Unlike for UD, where reading an overview article suffices to get started, that won’t do for GF: you must attend a two-week summer school and read the book to possibly get anything started. The start-up costs are too high for the vast majority of languages. And remember that the prospective system should be community-driven rather than the experts-only affair that GF is at present. Even if that route is taken, the grammar is then locked into the GF system, inaccessible for any reuse elsewhere, which is not a good incentive when potential for reuse is important.

The fellows’ review proposed to adopt an extant NLG system and build on it, including possibly GF: if we could have done it, we would have done so and I wouldn’t have received any funding for investigating an NLG system for Nguni languages. A long answer on why we couldn’t can be found in Zola Mahlaza’s PhD thesis on foundations for reusable and maintainable surface realisers for isiXhosa and isiZulu and shorter answers regarding parts of the problems are described in papers emanating from my GeNi and MoreNL research projects. More can be done still to create a better realiser.

The other dimension of technical aspects is the WMF software ecosystem as it stands at present. For a proof-of-concept to demonstrate the Abstract Wikipedia project’s potential, I don’t care whether that’s with Wikifunctions, with Scribunto, or a third-party system that can be (near-instantly) copied over onto Wikifunctions once it works as envisioned. Wikidata will need to be beefed up: on speed in SPARQL query answering, on reducing noise in its content, and on the lexemes to cater for highly inflectional and agglutinating languages. It’s not realistic to make the community add all forms of the words, since there are too many and the interface requires too much clicking around and re-typing when entering lexicographic data manually. Either allow for external precomputation, a human-in-the-loop, and then a batch upload, or assume base forms and link them to a set of rules stored somewhere in order to compute the required form at runtime.
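
The second option, storing base forms and computing other forms from rules at runtime, can be sketched as follows. This is a deliberately tiny toy: the rule table covers only two noun classes, the singular forms umuntu and inja and the class numbers are my additions to match the plurals abantu and izinja mentioned earlier, and all names are hypothetical. Real pluralisation involves more classes and exceptions than this shows.

```python
from typing import Optional, Tuple

# Hypothetical rule table, keyed on the singular noun class:
# (singular prefix, plural prefix, plural noun class).
# Only two classes are included, purely for illustration.
PLURAL_RULES = {
    1: ("umu", "aba", 2),
    9: ("in", "izin", 10),
}


def pluralise(base_form: str, noun_class: int) -> Optional[Tuple[str, int]]:
    """Compute the plural from a stored base form and its noun class.

    Returns (plural form, plural noun class), or None when no rule applies,
    so that a stored form or human review can take over instead.
    """
    rule = PLURAL_RULES.get(noun_class)
    if rule is None:
        return None
    singular_prefix, plural_prefix, plural_class = rule
    if not base_form.startswith(singular_prefix):
        return None  # the base form does not match the expected prefix
    stem = base_form[len(singular_prefix):]
    return plural_prefix + stem, plural_class


print(pluralise("umuntu", 1))
print(pluralise("inja", 9))
```

With this arrangement only the base form and its noun class need to be entered by hand; the forms that a rule cannot produce are exactly the ones worth routing to precomputation with a human-in-the-loop.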

People and society

The third aspect, people, consists of two components: NCB language speakers with their schedules and incentives and, for the lack of a better term, colonial peculiarities or sexism, or both. Gender bias issues in content and culture on Wikipedia are amply investigated and documented. Within the context of Abstract Wikipedia, providing examples that are too detailed is awkward to do publicly and anyhow the plural of anecdote is not data. What I experienced were mostly instances of recurring situations. Therefore, let me generalise some of it and formulate it partially as a reply and way forward, in arbitrary order.

First, the “I don’t know any isiZulu, but…” phrases: factless opinions about the language shouldn’t be deemed more valuable and worthy and perceived valid just because one works with a well-resourced language in another language family and is more pushy or abrasive. The research we carried out over the past many years really happened and was published in reputable venues. It may be tempting to (over)generalise for other languages once one speaks several languages, but it’s better to be safe than sorry.

Second, let me remind you that Wikis are intended to be edited by the community – and that includes me. I might just continue expressing my indignation at the repeated condescending comments that I couldn’t be allowed to do so because some European white guy’s edits are unquestionably naturally superior. As it turned out, it was questionable attitudes from certain people within the broader Abstract Wikipedia team, not the nice white guy who had been merely exploring adding a certain piece of non-trivial information. I went ahead and edited it eventually anyway, but it does make me wonder how often people from outside the typical Wiki contributor demographic are actively discouraged from adding content for made-up reasons.

Third, languages evolve and research does happen. The English from 100 years ago is not the same as the English spoken and written today, and the same holds for most other languages, including low-resourced languages. They’re not frozen in time just because there are fewer computational resources or they’re too far away to see their changes. Societies change and the languages change with them. No doubt the missionary did his best documenting a language 50-150 years ago, but just because it’s written in a book and he wrote it doesn’t mean that my or my colleagues’ recent published research that included an evaluation with a set of words or sentences would be less valid just because it’s our work and we’re not missionaries (or whatever other reason one invents why long gone missionaries’ work takes precedence over anyone else’s contributions).

Fourth, if an existing framework for Indo-European languages doesn’t work for NCB languages, it doesn’t imply we’re all too stupid to grasp that framework. We may not know, but it’s likely that we do and the framework is too limited for the language (see also above) or it’s too impractical for the lived reality of working with a low-resourced language. Regarding the latter, a [stop whining and] “become more active and just get yourself more resources” isn’t a helpful response, nor is not announcing open calls for WMF-funded projects.

As to human contributions to any component of Abstract Wikipedia and any wiki project more generally, it’s complex and deserves more unpacking: on incentives to contribute, perceptions of Wikipedia, sociolinguistics, and the good plans we have that are derailed by things that people in affluent countries wouldn’t think could interfere. And there’s Moses and the mountain.

Practical hurdles

Last, there are practical hurdles that an internationally dominant or darling language does not have to put up with. An example is the unbelievable process of getting a language accepted by the WMF ecosystem as deserving to be one. I’m not muttering about being shoved aside for trying to promote an endangered language that doesn’t have an ISO-639 3-letter code and has only a mere handful of speakers left, but even an ISO-639 2-letter code language with millions of speakers faces hurdles. Evidence has to be provided. Yes, there are millions of speakers, here’s the census data; Yes, there are daily news items on national TV, and look, here are the discussions on Facebook; Yes, there are online newspapers with daily updates. It takes weeks if not months, if it happens at all. These are exclusionary practices. We should not have to waste limited time on countering the anti-nondominant-language ‘trolling’ – having to put in extra effort to pass an invisible and arbitrary bar – but, as a minimum, have each already ISO-recognised language be granted status as being one. True enough, this suggestion is not a perfect solution either, but at least it’s much more inclusive. Needless to say, this challenge is not unique to NCB languages either. And yes, various phabricator tickets with language requests have been open for at least 1.5 years.

In closing

The practicalities are just one more thing on top of all the rest that makes a fine idea, Abstract Wikipedia for all, smell of entrenching much deeper the well-documented biased tendencies of Wikis. I tried, and try, to push back. The issues are complex both theoretically & technically and in terms of people & praxis. They hold for NCB languages as well as many others.

Abstract Wikipedia aims to build a multilingual Wikipedia, and the back-end technology that it requires may have been a rather big bite for the Wikimedia Foundation to chew on. If it is serious about inclusivity, it will have to be ‘many flowers’ on top of the constructors to generate the text, as well as a gradual expansion of the natural language generation features during runtime, an expansion that will be paced differently according to the language resources, not unlike how each Wikipedia has its own pace of growth. From the one step at a time perspective, even basic sentences in a short paragraph for a Wikipedia article are an infinite improvement over no article at all. Such a start also invites contributions more readily than creating a new article from scratch does. The bar for making Abstract Wikipedia successful does not necessarily need to be, say, ‘to surpass English articles’.

The mountain we’ll keep climbing, be it with or without the Abstract Wikipedia project. If Abstract Wikipedia is to become a reality and flourish for many languages soon, it needs to allow for molehills, anthills, dykes, dunes, and hills as well, and with whatever flowers are available to set it up and make it grow.