# My gender-balanced book reviews overall, yet with much fluctuation

In one of my random browsing moments, I stumbled upon a blog post by a writer whose son complained about the stories she was reading to him, as so many of his books had women as protagonists. As it turned out, “only 27% of his books have a female protagonist, compared to 65% with a male protagonist.” She linked back to another post about a similar issue, but then for a TV documentary series called missed in history, where viewers complained that there were ‘too many women’, making it more of a herstory than a missed history. Their tally of the series’ episodes was that they featured 45% men, 21% women, and 34% ungendered subjects. All this made me wonder how I fared in my yearly book review blog posts. Here’s the summary table with the M/F/both-or-neither counts:

| Year posted | Book | Nr M | Nr F | Both / neither | Pct F |
|---|---|---|---|---|---|
| 2012 | Long walk to freedom, terrific majesty, racist’s guide, end of poverty, persons in community, African renaissance, angina monologues, master’s ruse, black diamond, can he be the one | 4 | 3 | 3 | 33% |
| 2013 | Delusions of gender, tipping point, affluenza, hunger games, alchemist, eclipse, mieses karma | 2 | 3 | 2 | 43% |
| 2014 | Book of the dead, zen and the art of motorcycle maintenance, girl with the dragon tattoo, outliers, abu ghraib effect, nice girls don’t get the corner office | 2 | 1 | 3 | 17% |
| 2015 | Stoner, not a fairy tale, no time like the present, the time machine, 1001 nights, karma suture, god’s spy, david and goliath, dictator’s learning curve, MK | 4 | 2 | 4 | 20% |
| 2016 | Devil to pay, black widow society, the circle, accidental apprentice, moxyland, muh, big short, 17 contradictions | 2 | 4 | 2 | 50% |
| Total | | 14 | 13 | 14 | 32% |

Actually, I did pretty well on the overall balance. It also shows that, had I done a bean count for a single year only, the conclusion could have been very different. That said, I classified the books from memory, not by NLP analysis of their text, so the actual share allotted to the main characters might differ. Related to this is the screenplay dialogue-based, data-driven analysis of Hollywood movies, for which NLP was used. Their results show that even when there’s a female lead character, Hollywood manages to get men to speak more; e.g., The Little Mermaid (71% male dialogue) and The Hunger Games (55% male). Even the chick flick Clueless is 50-50. (The website has several nice interactive graphs based on all that data, so you can check for yourself.) For The Hunger Games, though, the books do have Katniss think, do, and say more than the movies do.

A further caveat of the data is that these books are not the only ones I’ve read over the past five years, just the ones I wrote about. Anyhow, I’m pleased to discover there is some balance in what I pick out to write about, rather than an unconscious bias one way or the other.

As a last note on the fiction novels listed above, there was a lot of talk online this past week about Lionel Shriver’s keynote in defence of fiction writers writing what they like, and of having had enough of the concept of ‘cultural appropriation’. Quite a few authors in the list above would be thrown on the pile of authors who ‘dared’ to imagine characters different from the box they themselves would probably be put in. Yet most of them still did a good job of making it a worthwhile read, such as Hugh Fitzgerald Ryan on Alice Kyteler in ‘The devil to pay’, David Safier with Kim Lange in ‘Mieses Karma’, Stieg Larsson with ‘Girl with the dragon tattoo’, and Richard Patterson in ‘Eclipse’ about Nigeria. Rather: a terrible character or setting that misrepresents a minority or an oppressed, marginalised, or Othered group in a novel is an indication of bad writing, and the writer should educate him/herself better. For instance, JM Coetzee could come back to South Africa and learn a thing or two about the majority population here, and I hope for Zakes Mda that he’ll meet some women he can think favourably about and then reuse those experiences in a story. Anyway, even if the conceptually problematic anti-‘cultural appropriation’ police wins it from the fiction writers, then I suppose I can count myself lucky to live in South Africa, which, with its diversity, will have diverse novels to choose from (assuming they won’t go further overboard into dictating that I would be allowed to read only those novels designated as appropriate for my (from the outside) assigned box).

UPDATE (20-9-2016): following a question on POC protagonists, here’s the table, where the books in which a person (or group) of colour is a protagonist were italicised. Some notes on my counting: the Angina monologues has three protagonists, of whom two are POC, so I still counted it; the Hunger games’ Katniss is a POC in the books; Eclipse is arguable; the abu ghraib effect is borderline; and Moxyland has an ensemble cast, so I still counted that as well. Non-POC includes cows as well (Muh), hence that term was chosen rather than the ‘white’ that POC is usually contrasted with. As can be seen, it varies quite a bit by year as well.

| Year posted | Book | POC (italics in the list) | Non-POC or N/A | Pct POC |
|---|---|---|---|---|
| 2012 | Long walk to freedom, terrific majesty, racist’s guide, end of poverty, persons in community, African renaissance, angina monologues, master’s ruse, black diamond, can he be the one | 8 | 2 | 80% |
| 2013 | Delusions of gender, tipping point, affluenza, hunger games, alchemist, eclipse, mieses karma | 2 | 5 | 29% |
| 2014 | Book of the dead, zen and the art of motorcycle maintenance, girl with the dragon tattoo, outliers, abu ghraib effect, nice girls don’t get the corner office | 2 | 4 | 33% |
| 2015 | Stoner, not a fairy tale, no time like the present, the time machine, 1001 nights, karma suture, god’s spy, david and goliath, dictator’s learning curve, MK | 4 | 6 | 40% |
| 2016 | Devil to pay, black widow society, the circle, accidental apprentice, moxyland, muh, big short, 17 contradictions | 3 | 5 | 38% |
| Total | | 19 | 22 | 46% |

# Brief report on the INLG16 conference

Another long wait at the airport is being filled by writing up some of the 10 pages of notes I scribbled while attending the WebNLG’16 workshop and the 9th International Natural Language Generation conference (INLG’16), which were held from 6 to 10 September in Edinburgh, Scotland.

There were two keynote speakers, Yejin Choi and Vera Demberg, and several long and short presentations and a bunch of posters and demos, all of which had full or short papers in the (soon to appear) ACL proceedings online. My impression was that, overall, the ‘hot’ topics were image-to-text, summaries and simplification, and then some question generation and statistical approaches to NLG.

The talk by Yejin Choi was about sketch-to-text, or pretty much anything-to-text, such as image captioning, recipe generation based on the ingredients, and it could even be done with sonnets. She used a range of techniques to achieve this, such as probabilistic CFGs and recurrent neural networks. Vera Demberg’s talk, on the other hand, was about psycholinguistics for NLG, starting from the ‘uniform information density hypothesis’ and how surprisal words and grammatical errors affect a person reading the text. It appears that there’s more pupil jitter when there’s a grammar error. The talk then moved on to how one can model and predict information density, for which there are syntactic, semantic, and event surprisal models. For instance, with the semantic one: in ‘Peter felled a tree’, how predictable is ‘tree’, given that it’s already kind of entailed by the word ‘felled’? Some results were shown for the most likely fillers for, e.g., ‘serves’ as in ‘the waitress serves…’ and ‘the prisoner serves…’, which then could be used to find suitable word candidates in sentence generation.

The best paper award went to “Towards generating colour terms for referents in photographs: prefer the expected or the unexpected?”, by Sina Zarrieß and David Schlangen [1]. While the title might sound a bit obscure, the presentation was very clear. There is the colour spectrum, and people assign names to the colours, which one could take as RGB colour values for images. This is all nice and well on the colour strip, but when a colour is put in the context of other colours and background knowledge, the colours humans would use to describe that patch on an image aren’t always in line with the actual RGB colour. The authors approached the problem by viewing it as a multi-class classification problem and used a multi-layer perceptron with some top-down recalibration—and voilà, the software returns the intended colour, most of the time. (Knowing the name of the colour, one then can go on trying to automatically annotate images with text.)

As for the other plenary presentations, I did make notes of all of them, but will select only a few due to time limitations. The presentation by Advaith Siddharthan on summarising news stories for children [2] was quite nice, as it required bringing three aspects together: summarising text (with NLG, not just repeating a few salient sentences), simplifying it with respect to children’s vocabulary, and editing out or rewording the harsh news bits. Another paper on summaries was presented by Sabita Acharya [3], which is likely to be relevant also to my student’s work on NLG for patient discharge notes [4]. Sabita focussed on getting doctors’ notes and plans of care into a format understandable by a layperson, and used the UMLS in the process. A different topic was NLG for automatically describing graphs to blind people, with grade-appropriate lexicons (for 4th-5th grade learners and students) [5]. Kathleen McCoy outlined how they were happy to remember their computer science classes, seeing that they could use graph search to solve it, with its states, actions, and goals. They evaluated the generated text for the graphs—as many others did in their research—with crowdsourcing on Mechanical Turk. One other paper that is definitely on my post-conference reading list is the one about mereology and geographic entities for weather forecasts [6], presented by Rodrigo de Oliveira. For instance, ‘the south’ in a Scottish weather forecast refers to a different region than in a forecast for the UK as a whole, and the task was how to generate the right term for the intended region.

Our poster on generating sentences with part-whole relations in isiZulu (click to enlarge)

My 1-minute lightning talk on Langa’s and my long paper [7] went well (another speaker in the same session even resentfully noted afterward that I got all the accolades of the session), as did the poster and demo session afterward. The contents of the paper on part-whole relations in isiZulu were introduced in a previous post, and you can click on the thumbnail on the right for a png version of the poster (which has less text than the blog post). Note that the poster highlights only three of the 11 part-whole relations discussed in the paper.

ENLG and INLG will merge into a single yearly INLG; there is a SIG for NLG (www.siggen.org); and one of the ‘challenges’ for this upcoming year will be on generating text from RDF triples.

Irrelevant for the average reader, I suppose, is that there were some 92 attendees, most of whom attended the social dinner, where there was a ceilidh—Scottish traditional music played by a band, with traditional dancing by the participants—where it was even possible to form many (traditional) couples for the couples’ dances. There was some overlap in attendees between CNL’16 and INLG’16, so while it was my first INLG it wasn’t all brand new, yet there were also new people to meet and network with. As a welcome surprise, it was even mostly dry and sunny during the conference days in otherwise quite rainy Edinburgh.

References

[1] Sina Zarrieß and David Schlangen. Towards generating colour terms for referents in photographs: prefer the expected or the unexpected? INLG’16. ACL, 246-255.

[2] Iain Macdonald and Advaith Siddharthan. Summarising news stories for children. INLG’16. ACL, 1-10.

[3] Sabita Acharya, Barbara Di Eugenio, Andrew D. Boyd, Karen Dunn Lopez, Richard Cameron, Gail M. Keenan. Generating summaries of hospitalizations: A new metric to assess the complexity of medical terms and their definitions. INLG’16. ACL, 26-30.

[4] Joan Byamugisha, C. Maria Keet, Brian DeRenzi. Tense and aspect in Runyankore using a context-free grammar. INLG’16. ACL, 84-88.

[5] Priscilla Morales, Kathleen McCoy, and Sandra Carberry. Enabling text readability awareness during the micro planning phase of NLG applications. INLG’16. ACL, 121-131.

[6] Rodrigo de Oliveira, Somayajulu Sripada and Ehud Reiter. Absolute and relative properties in geographic referring expressions. INLG’16. ACL, 256-264.

[7] C. Maria Keet and Langa Khumalo. On the verbalization patterns of part-whole relations in isiZulu. INLG’16. ACL, 174-183.

# UVa 11357 Ensuring truth solution description

We’re in the midst of preparing for the ICPC Southern Africa Regionals, to be held in October, and so I’ve stepped up reading problems to find nice ones to train the interested students on a range of topics. The “Ensuring truth” problem was one of those, which I’ll discuss in the remainder of this post, since there’s no discussion of it online yet (only some code), and it is not as daunting as it may look at first glance:

The task is to determine whether such a formula is satisfiable.

While it may ‘scare’ a 1st- or 2nd-year student, when you actually break it down and play with an example or two, it turns out to be pretty easy. The ‘scary’-looking aspects are the basic propositional logic truth tables and the BNF grammar for (simplified!) Boolean formulas. Satisfiability of arbitrary Boolean formulas is NP-complete, which you may have memorised, so that looks daunting as well, as if the contestant would have to come up with a nifty optimisation to stay within the time limit. As it turns out, not so.

Instead of being put off by it, let’s look at what is going on. The first line of the BNF grammar says that a formula can be a clause, or a formula followed by a clause, separated by a disjunction (|, ‘or’). The second line says that a clause is a parenthesised conjunction of literals, which (by the third line) transpires to be just a series of ‘and’ (&) conjunctions between literals. The fourth line states that a literal is a variable or its negation, and the fifth line that a variable is one of the letters of the alphabet.

Now try to generate a few inputs that adhere to this grammar. Replacing, one at a time, a nonterminal on the left of the “::=” sign with one of the elements on its right-hand side, with derivation steps indicated by “=>”, we get, e.g.:

    <formula> => <formula> | <clause>
              => <clause> | <clause>
              => (<conjunction-of-literals>) | <clause>
              => (<literal>) | <clause>
              => (<variable>) | <clause>
              => (a) | <clause>
              => (a) | (<conjunction-of-literals>)
              => (a) | (<conjunction-of-literals> & <literal>)
              => (a) | (<conjunction-of-literals> & <literal> & <literal>)
              => (a) | (<conjunction-of-literals> & <literal> & <literal> & <literal>)
              => (a) | (<literal> & <literal> & <literal> & <literal>)
              => (a) | (~<variable> & <literal> & <literal> & <literal>)
              => (a) | (~a & <literal> & <literal> & <literal>)
              => (a) | (~a & <variable> & <literal> & <literal>)
              => (a) | (~a & b & <literal> & <literal>)
              => (a) | (~a & b & <variable> & <literal>)
              => (a) | (~a & b & a & <literal>)
              => (a) | (~a & b & a & <variable>)
              => (a) | (~a & b & a & c)

That is, (a)|(~a&b&a&c) is in the language of the grammar, as are the two formulas in the given sample input, (a&b&c)|(a&b)|(a) and (x&~x). Do you see a pattern emerging in what the formulas of this grammar look like?

It’s a disjunction of conjunctions of literals, and only one of the conjuncts needs to be free of contradiction for the formula to be satisfiable. The only way to get a contradiction is if both a literal and its negation occur in the same conjunct (analyse the truth tables if you didn’t know that). So, the only thing you have to do with the input is check whether within one pair of brackets there is, say, both an x and a ~x; at the first conjunct you encounter without such a contradiction, the formula is satisfiable and you print YES, else NO. That’s all. So, when given “(a)|(~a&b&a&c)”, you know upon processing the first conjunct “(a)” that the answer is YES, because “(a)” is trivially non-contradictory, and thus we can ignore “(~a&b&a&c)”, which does have a contradiction (it doesn’t matter anymore, because we have already found one that doesn’t).
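In code, the satisfiability check itself can be sketched in a few lines of Python (a minimal sketch: the function name is mine, and reading the input in the UVa test-case format is left out):

```python
def satisfiable(formula: str) -> bool:
    """Return True iff at least one conjunct has no literal together with its negation."""
    # Split the disjunction into its parenthesised conjuncts.
    for clause in formula.strip().split('|'):
        pos, neg = set(), set()
        # Strip the brackets and split the conjunct into its literals.
        for lit in clause.strip().strip('()').split('&'):
            lit = lit.strip()
            if lit.startswith('~'):
                neg.add(lit[1:])
            else:
                pos.add(lit)
        # A conjunct is contradiction-free iff no variable occurs both ways.
        if pos.isdisjoint(neg):
            return True
    return False
```

For instance, `satisfiable("(a)|(~a&b&a&c)")` returns `True` after the first conjunct, while `satisfiable("(x&~x)")` returns `False`.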

I’ll leave the implementation as an exercise to the reader  :).

# On generating isiZulu sentences with part-whole relations

It all sounded so easy… We have a pretty good and stable idea about part-whole relations and their properties (see, e.g., [1]), we know how to ‘verbalise’/generate a natural language sentence from basic description logic axioms with object properties that use simple verbs [2], like $Professor \sqsubseteq \exists teaches.Course$ ‘each professor teaches at least one course’, and SNOMED CT is full of logically ‘simple’ axioms (it’s in OWL 2 EL, after all) and has lots of part-whole relations. So why not combine that? We did, but it took some more time than initially anticipated. The outcomes are described in the paper “On the verbalization patterns of part-whole relations in isiZulu”, which was recently accepted at the 9th International Natural Language Generation Conference (INLG’16) that will be held 6-8 September in Edinburgh, Scotland.

What it ended up being is that notions of ‘part’ in isiZulu are at times less precise and at other times more precise than the taxonomy of part-whole relations. This interfered with devising the sentence generation patterns, pushed the number of ‘elements’ to deal with in the language up to 13 constituents, and left no way to avoid proper phonological conditioning. We could already handle the quantitative, relative, and subject concords, the copulative, and conjunction, but what had to be added were, in particular, the possessive concord, locative affixes, a preposition (just the nga in this context), epenthesis, and the passive tense (with modified final vowel). As practically every element has to be ‘completed’ based on the context (notably the noun class), one can’t really speak of a template-based approach anymore, but rather of a bunch of patterns and a partial grammar engine. For instance, plain parthood, structural parthood, involvement, and membership all have:

• (‘each whole has some part’) $QCall_{nc_{x,pl}}$ $W_{nc_{x,pl}}$ $SC_{nc_{x,pl}}-CONJ-P_{nc_y}$ $RC_{nc_y}-QC_{nc_y}-$dwa
• (‘each part is part of some whole’) $QCall_{nc_{x,pl}}$ $P_{nc_{x,pl}}$ $SC_{nc_{x,pl}}-COP-$ingxenye $PC_{\mbox{\em ingxenye}}-W_{nc_y}$ $RC_{nc_y}-QC _{nc_y}-$dwa

There are a couple of noteworthy things here. First, the whole-part relation does not have one single string, like a ‘has part’ in English, but it is composed of the subject concord (SC) for the noun class (nc) of the noun that plays the role of the whole ( W ) together with the phonologically conditioned conjunction na- ‘and’ (the “SC-CONJ” above), glued onto the noun of the entity that plays the role of the part (P). Thus, the surface realisation of what is conceptually ‘has part’ depends on both the noun class of the whole (as the SC does) and on the first letter of the name of the part (e.g., na-+i-=ne-). The ‘is part of’ reading direction is made up of ingxenye ‘part’, which is a noun that is preceded by the copula (COP) y- and together then amounts to ‘is part’. The ‘of’ of the ‘is part of’ is handled by the possessive concord (PC) of ingxenye, and with ingxenye being in noun class 9, the PC is ya-. This ya- is then made into one word together with the noun for the object that plays the role of the whole, taking into account vowel coalescence (e.g., ya-+u-=yo-). Let’s illustrate this with heart (inhliziyo, nc9) standing in a part-whole relation to human (umuntu, nc1), with the ‘has part’ and ‘is part of’ underlined:

• bonke abantu banenhliziyo eyodwa ‘All humans have as part at least one heart’
• The algorithm, in short, to get this sentence from, say $Human \sqsubseteq \exists hasPart.Heart$: 1) it looks up the noun class of umuntu (nc1); 2) it pluralises umuntu into abantu (nc2); 3) it looks up the quantitative concord for universal quantification (QCall) for nc2 (bonke); 4) it looks up the SC for nc2 (ba); 5) then it uses the phonological conditioning rules to add na- to the part inhliziyo, resulting in nenhliziyo and strings it together with the subject concord to banenhliziyo; 6) and finally it looks up the noun class of inhliziyo, which is nc9, and from that it looks up the relative concord (RC) for nc9 (e-) and the quantitative concord for existential quantification (QC) for nc9 (being yo-), and strings it together with –dwa to eyodwa.
• zonke izinhliziyo ziyingxenye yomuntu oyedwa ‘All hearts are part of at least one human’
• The algorithm, in short, to get this sentence from $Heart \sqsubseteq \exists isPartOf.Human$: 1) it looks up the noun class of inhliziyo (nc9); 2) it pluralises inhliziyo to izinhliziyo (nc10); 3) it looks up the QCall for nc10 (zonke); 4) it looks up the SC for nc10 (zi-), takes y- (the COP) and adds them to ingxenye to form ziyingxenye; 5) then it uses the phonological conditioning rules to add ya- to the whole umuntu, resulting in yomuntu; 6) and finally it looks up the noun class of umuntu, which is nc1, and from that the RC for nc1 (o-) and the QC for nc1 (being ye-), and strings them together with –dwa to oyedwa.
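To make the steps concrete, here is a minimal Python sketch of the ‘has part’ direction for the worked example above. The tiny lookup tables are illustrative stubs covering only these two nouns, and the function and table names are mine, not the actual grammar engine’s:

```python
# Illustrative stubs: noun classes, plurals, and concords for just two nouns.
NOUN_CLASS = {'umuntu': 1, 'inhliziyo': 9}
PLURAL = {'umuntu': ('abantu', 2), 'inhliziyo': ('izinhliziyo', 10)}
QC_ALL = {2: 'bonke', 10: 'zonke'}  # quantitative concord, universal quantification
SC = {2: 'ba', 10: 'zi'}            # subject concords
RC = {1: 'o', 9: 'e'}               # relative concords
QC_EX = {1: 'ye', 9: 'yo'}          # quantitative concord, existential quantification

def conjoin_na(noun: str) -> str:
    """Phonological conditioning of na- 'and' (e.g., na- + i- = ne-)."""
    if noun.startswith('i'):
        return 'ne' + noun[1:]
    if noun.startswith('u'):
        return 'no' + noun[1:]
    return 'na' + noun

def has_part(whole: str, part: str) -> str:
    """Verbalise 'each whole has some part' per the steps above."""
    whole_pl, nc_pl = PLURAL[whole]      # steps 1-2: pluralise the whole
    qc_all = QC_ALL[nc_pl]               # step 3: universal quantification
    verb = SC[nc_pl] + conjoin_na(part)  # steps 4-5: SC + conditioned na- + part
    nc_p = NOUN_CLASS[part]              # step 6: RC and QC of the part's nc
    quant = RC[nc_p] + QC_EX[nc_p] + 'dwa'
    return f'{qc_all} {whole_pl} {verb} {quant}'
```

Then `has_part('umuntu', 'inhliziyo')` yields bonke abantu banenhliziyo eyodwa, as in the example above.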

For subquantities, we end up with three variants: one for stuff-parts (as in ‘urine has part water’, still with ingxenye for ‘part’), one for portions of solid objects (as in ‘tissue sample is a subquantity of tissue’, or a slice of the cake) that uses umunxa instead of ingxenye, and one ‘spatial’ notion of portion, like an operating theatre being a portion of a hospital, or the area of the kitchen where the kitchen utensils are being a portion of the kitchen, which uses isiqephu instead of ingxenye. Umunxa is in nc3, so the PC is wa-, so that with, e.g., isibhedlela ‘hospital’ it becomes wesibhedlela ‘of the hospital’, and the COP is ng- instead of y-, because umunxa starts with an u. And yet other part-whole relations use locatives (like the containment type of part-whole relation). The paper has all the sentence generation patterns, examples for each, and explanations for them.

The meronymic part-whole relations participation and constitution have added aspects for the verb, such as generating the passive for ‘constituted of’: –akha is ‘to build’, used for objects that are made/constituted of some matter in some structural sense, else –enza is used. They are both ‘irregular’ in the sense that it is uncommon for a verb stem to start with a vowel, so this requires additional vowel processing (called hiatus resolution in this case) to put the SC together with the verb stem. Then, for instance, za+akhiwe=zakhiwe but u+akhiwe=yakhiwe (see the rules in the paper).
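Those two cited combinations can be captured in a tiny function (a sketch covering only the examples given here; the function name is mine, and the complete rule set is in the paper):

```python
def resolve_hiatus(sc: str, stem: str) -> str:
    """Glue a subject concord onto a vowel-commencing verb stem.

    Covers only the two rules cited above: a + a merges into a single a
    (za+akhiwe = zakhiwe), and the vowel-only SC u glides to y
    (u+akhiwe = yakhiwe).
    """
    if sc.endswith('a') and stem.startswith('a'):
        return sc[:-1] + stem  # vowel merger
    if sc == 'u' and stem[0] in 'aeiou':
        return 'y' + stem      # glide formation
    return sc + stem           # no hiatus, plain concatenation
```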

Finally, this was not just a theoretical exercise: it also has been implemented. I’ll readily admit that the Python code isn’t beautiful and could do with some refactoring, but it does the job. We gave it 42 test cases, of which 38 were answered correctly; the remaining errors were due to an ‘incomplete’ pluraliser (possibly an unresolvable case for any pluraliser) and to the fact that we don’t know how to systematically encode when to pick akha and when enza, as that requires more semantics of the nouns. Here is a screenshot with some examples:

The ‘wp’ ones state that a whole has some part, and the ‘pw’ ones that the part is part of the whole; in terms of the type of axiom that each function verbalises, they are of the so-called ‘all-some’ pattern.

The source code, additional files, and the (slightly annotated) test sentences are available from the GENI project’s website. If you want to test it with other nouns, please check whether the noun is already in nncPairs.txt; if not, you can add it, and then invoke the function again. (This remains ‘clumsy’ until we make a softcopy of all isiZulu nouns with their noun classes. Without the noun class explicitly given, automatic detection of the noun class is not, and cannot be, correct more than about 50% of the time, but with the noun class information we can get the pluralisation step of the sentence generation up to 90-100% correct [4].)

References

[1] Keet, C.M., Artale, A. Representing and Reasoning over a Taxonomy of Part-Whole Relations. Applied Ontology, 2008, 3(1-2):91-110.

[2] Keet, C.M., Khumalo, L. Basics for a grammar engine to verbalize logical theories in isiZulu. 8th International Web Rule Symposium (RuleML’14), A. Bikakis et al. (Eds.). Springer LNCS vol. 8620, 216-225. August 18-20, 2014, Prague, Czech Republic.

[3] Keet, C.M., Khumalo, L. On the verbalization patterns of part-whole relations in isiZulu. 9th International Natural Language Generation conference (INLG’16), September 5-8, 2016, Edinburgh, UK. (in print)

[4] Byamugisha, J., Keet, C.M., Khumalo, L. Pluralising Nouns in isiZulu and Related Languages. 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’16), Springer LNCS. April 3-9, 2016, Konya, Turkey. (in print)

# A search engine, browser, and language bias mini-experiment

I’m in the midst of preparing for the “Social Issues and Professional Practice” block of a course and was pondering whether I should touch upon known search engine issues, like the filter bubble and search engine manipulation to nudge democratic elections, which could be topical given that South Africa just had its local elections last week, with interesting results.

I don’t have the option to show the differences between ‘Google search when logged in’ versus ‘Google search when logged out’, nor for the Bing-Hotmail combination, so I played with other combinations: Google in isiZulu on Firefox (GiF), Google in English on Safari (GES), and Bing in English on Firefox (BEF). I did seven searches at the same time (Friday 12 August 2016, 17:18-17:32) on the same machine (a MacBook Pro), using eduroam on campus. Although this certainly would not pass a test of scientific rigour, it unequivocally shows that the topic deserves a solid experiment. The only thing I aimed to do was to see whether these things happen in South Africa too, not just in the faraway USA or India. They do.

Before giving the results, some basic preliminaries may be of use if you are not familiar with the topic. On HTTP, which the browser uses: in trying to GET information, your browser tells the server which operating system you are using (Mac, Linux, Windows, etc.), your browser information (e.g., Firefox, Safari, Chrome), and your language settings (e.g., UK English, isiZulu, Italian). Safari is linked to the Mac, and thus Apple, and it is assumed that Apple users have more disposable income (are richer). Free and open source software users (e.g., Linux + Firefox) are assumed to be not rich, or a leftie or liberal, or all of these. I don’t know whether they categorise Apple + Firefox as an armchair socialist or a posh right-wing liberal ;-).
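As an illustration, the kind of metadata a browser volunteers can be inspected by constructing a request and printing its headers (a sketch: the URL and header values are made-up examples mimicking ‘Safari on a Mac, isiZulu preferred’, not captured traffic):

```python
import urllib.request

req = urllib.request.Request(
    'https://www.google.co.za/search?q=EFF',
    headers={
        # operating system + browser information:
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                      'AppleWebKit/601.7.7 Version/9.1.2 Safari/601.7.7',
        # language preferences: isiZulu first, then English:
        'Accept-Language': 'zu-ZA,zu;q=0.9,en;q=0.5',
    })

# These name/value pairs are what the server gets to base its assumptions on.
for name, value in req.header_items():
    print(f'{name}: {value}')
```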

Here goes the data: the screenshots and my reading and interpretation of the links in the search results, with a bit of context in case you’re not in South Africa. The screens in the screenshots are in the order (from left to right) GiF, GES, BEF.

EFF search

• Search term: EFF: GiF and BEF show the EFF as a political party (left-populist opposition party in South Africa) plus a link to the EFF as the Electronic Frontier Foundation, whereas GES shows the EFF as a political party only in the context of news about the DA political party (capitalist, for the rich, mainly White voters). The GES difference may be explained by the Mac+Safari combination, and it makes one wonder whether and how this has had an effect on perceptions and voting behaviour. Bing had 10mln results, Google 46mln.

Jacob Zuma search

• Search term: Jacob Zuma (current president of South Africa): GiF and BEF show general results; GES also has articles on JZ being there to stay (by a DA supporter) and on him not resigning. Bing has 1.1mln results, Google 9.6mln.

Nkandla search

• Search term: Nkandla (Zuma’s controversial lavish homestead that was upgraded with taxpayers’ money): GiF has pictures and a fact about Nkandla, GES has a picture, a fact, and somewhat negative news, BEF has more on the news and the issues (that is, that JZ has to pay back the money). Bing has 700K results, Google 1.8mln.

FeesMustFall search

• Search term: FeesMustFall (the 2015 hashtag on no university fee increases and free higher education): Google’s results have ‘plain’-looking information, whereas Bing shows results with more information from the FMF perspective, it seems. Bing has 165K results, Google 451K.

Fleming Rose search

• Search term: Fleming Rose (a person with controversial ideas, recently disinvited by UCT so as not to give the academic freedom lecture): Google shows a little general information and several UCT opinion pieces, BEF has information about Fleming Rose himself. Bing has 1.25mln results, Google about 500K—the only time that Bing’s number of results far outnumbers Google’s.

Socialism search

• Search term: Socialism: GiF has links to definitions, GES and BEF show a definition in their respective info boxes, which takes up most of the screen. Bing has 7.3mln results, GiF 23.4mln, GES 31mln—the first time there is a stark difference in the number of Google hits, with more for English and Safari.

Law on cookies in south africa search

• Search term: Law on cookies in south africa: the results are similar across the three search engines. Bing has 108mln results, GiF 3mln, and GES 2.2mln—a difference of a third in Google’s number of results, this time in the other direction.

In interpreting the results, it has to be noted that Google, although I typed google.com, forced a redirect to google.co.za, whereas Bing stayed on bing.com. This might explain some ‘tailoring’ of the GiF and GES results toward news that is topical in South Africa, which does not happen to the same extent on Bing. I suppose that for some search terms one would like that, and for others one would not; i.e., one would want the option to choose between facts vs. opinion pieces vs. news, nationally or internationally, or between getting an answer and getting links to multiple answers. Neither Bing nor Google gives you a free choice in the matter: based on the data you provide involuntarily, they make assumptions as to who you are and what they think that kind of person would probably like to see in the search results. That three out of the seven searches on GES lean clearly to the political right is a cause for concern, as is the lower number of facts in Google’s search results vs. Bing’s. I also find it a bit odd that the selection of results is drawn from such wide-ranging numbers of results.

Based on this small sample, I obviously cannot draw hard conclusions, but it would be nice if we could get some money for a student to investigate this systematically with more browsers and more languages. We now know that it happens, but how does it happen in South Africa, and might there be some effect because of it? Those questions remain unanswered. In the meantime, I’ll have to make do with some anecdotes for the students in an upcoming lecture.

# Experimentally-motivated non-trivial intermodel links between conceptual models

I am well aware that some people prefer Agile and mash-ups and such to quickly, scruffily, put an app together, but putting a robust, efficient, lasting application together does require a bit of planning—analysis and design in the software development process. For instance, it helps to formalise one’s business rules or requirements, or at least structure them with, say, SBVR or ORM, so as to check that the rules obtained from the various stakeholders do not contradict each other, rather than running into such problems during a testing phase, after having implemented it all. Or to analyse upfront which classes are needed in the front-end application layer, rather than perpetually re-coding to fix one’s mistakes (under the banner of ‘refactoring’, as if naming the process gives it an air of respectability), and create, say, a UML diagram or two. Or to generate a well-designed database from an EER model.

Each of these three components can be done in isolation, but how to do this for complex system development, where the object-oriented application layer has to interact with the database back-end, all the while ensuring that the business rules are still adhered to? Or you had those components already, but they need to be integrated? One could link the code to tables in the implementation layer on an ad hoc basis, and figure it out again and again for any combination of languages and systems. Another option is to do that at the conceptual modelling layer, irrespective of the implementation language. The latter approach is reusable (cf. reinventing the mapping wheel time and again), and at a level of abstraction that is easier to cope with for more people, even more so if the system is really large. So, we went for that option over the past few years and have just added another step to realising all this: how to link which elements in the different models for the system.

It is not difficult to imagine a tool where one can have several windows open, each with a model in some conceptual modelling language—many CASE tools already support modelling in different notations anyway. It is conceptually also fairly straightforward when, say, the UML model has a class ‘Employee’ and the ER diagram has an ‘Employee’ entity type: it probably will work out to align these two. Implementing just this is a bit of an arduous engineering task, but doable. In fact, there is such a tool for models represented in the same language, where the links can be subsumption, equivalence, or disjointness between classes or between relationships: ICOM [2]. But we need something like that to work across modelling languages as well, and for attributes, too. In the hand-waving, abstract sense this may be intuitively trivial, but the gory details of the conceptual and syntax aspects are far from it. For instance, what should a modeller do if one model has ‘Address’ as an attribute and the other model has it represented as a class? Link the two despite their being different types of constructs in the respective languages? Or take that ever-recurring example of modelling marriage: a class ‘Marriage’ with (at least) two participants, or ‘Marriage’ as a recursive relationship (UML association) of a ‘Person’ class? What to do if a modeller in one model had chosen the former option and another modeller the latter? Can they be linked up somehow nonetheless, or would one have to waste a lot of time redesigning the other model?
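The ‘Address’ situation can be made concrete with a minimal Python sketch (purely illustrative; the data structure and the triage function are my own hypothetical simplification, not the tool or the paper’s actual machinery): elements from different models carry their construct kind, and a candidate link across models may be fine, need a transformation, or be infeasible.

```python
from dataclasses import dataclass

# Hypothetical, minimal representation of a conceptual model element;
# the 'kind' values loosely follow the constructs discussed in the text.
@dataclass(frozen=True)
class Element:
    model: str   # e.g., 'uml' or 'er'
    kind: str    # 'class', 'attribute', or 'relationship'
    name: str

def link_feasibility(a: Element, b: Element) -> str:
    """Rough triage of a candidate intermodel link between two elements."""
    if a.kind == b.kind:
        return "direct link possible"
    if {a.kind, b.kind} == {"attribute", "class"}:
        return "needs transformation (attribute <-> class)"
    return "incompatible constructs"

# The 'Address' example: attribute in one model, class in the other.
addr_er = Element("er", "attribute", "Address")
addr_uml = Element("uml", "class", "Address")
print(link_feasibility(addr_er, addr_uml))  # needs transformation (attribute <-> class)
```

The point of the sketch is only that the construct kinds, not just the names, have to be compared before any link can be declared.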

Instead of analysing this for each case, we sought a generic solution to it; with ‘we’ being Zubeida Khan, Pablo Fillottrani, Karina Cenci, and I. The solution we propose will appear soon in the proceedings of the 20th Conference on Advances in Databases and Information Systems (ADBIS’16), which will be held at the end of this month in Prague.

So, what did we do? First, we tried to narrow down the possible links between elements in the models: in theory, one might want to try to link anything to anything, but we already knew some model elements are incompatible, and we were hoping that some links wouldn’t be needed while suspecting that others would be, so that a particularly useful subset could be the focus. To determine that, we analysed a set of ICOM projects created by students at the Universidad Nacional del Sur (in Bahía Blanca), and we created model integration scenarios based on publicly available conceptual models of several subject domains, such as hospitals, airlines, and so on, including EER diagrams, UML class diagrams, and ORM models. An example of an integration scenario is shown in the figure below: two conceptual models about airline companies, with the ER diagram on the left and the UML diagram on the right.

One of the integration scenarios [1]

The solid purple links are straightforward 1:1 mappings; e.g., er:Airlines = uml:Airline. Long-dashed lines represent ‘half links’ that are semantically very similar, such as er:Flight.Arr_time ≈ uml:Flight.arrival_time, where the idea of the attribute is the same, but ER attributes don’t have a datatype specified whereas UML attributes do. The red short-dashed lines require some transformation: e.g., er:Airplane.Type is an attribute yet uml:Aircraft is a class, and er:Airport.code is an identifier (with its mandatory 1:1 constraint, still no datatype) but uml:Airport.ID is just a simple attribute. Overall, we had 40 models with 33 schema matchings, with 25 links in the ICOM projects and 258 links in the integration scenarios. The detailed aggregates are described in the paper and the dataset is available for reuse (7MB). Unsurprisingly, there were more attribute links than class links (if a class can be linked, then typically so can some of its attributes). There were 64 ‘half’ links and 48 transformation links, notably on the slightly compatible attributes, attributes vs. identifiers, attribute<->value type, and attribute<->class.
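The three link categories can be caricatured in a few lines of Python (purely illustrative; these decision criteria are my simplification of the examples above, not the paper’s actual rules): same construct kind and same features gives a 1:1 link, same kind but a datatype present on only one side gives a ‘half’ link, and different kinds require a transformation.

```python
# Illustrative-only triage of a candidate link into the three categories
# from the experiment; the criteria are a simplification for exposition.
def classify_link(kind_a: str, kind_b: str,
                  has_datatype_a: bool = False,
                  has_datatype_b: bool = False) -> str:
    if kind_a != kind_b:
        return "transformation"   # e.g., er:Airplane.Type (attribute) vs uml:Aircraft (class)
    if kind_a == "attribute" and has_datatype_a != has_datatype_b:
        return "half"             # e.g., ER attribute (no datatype) vs UML attribute (datatype)
    return "1:1"                  # e.g., er:Airlines = uml:Airline

print(classify_link("class", "class"))                        # 1:1
print(classify_link("attribute", "attribute", False, True))   # half
print(classify_link("attribute", "class"))                    # transformation
```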

Armed with these insights from the experiment, a general intermodel link validation approach [3] that uses the unified metamodel [4], and the types of elements that occur most in conceptual models with their logic-based profiles [5,6], we set out to define those half links and transformation links. While this could have been done with a logic of choice, we opted for a clear step toward implementability by exploiting the ATLAS Transformation Language (ATL) [7] to specify the transformations. As there’s always a source model and a target model in ATL, we constructed the mappings such that both models in question are the ‘source’ and both are mapped into a new, single ‘target’ model that still adheres to the constraints imposed by the unifying metamodel. A graphical depiction of the idea is shown in the figure below; see the paper for the details of the mapping rules (they don’t look nice in a blog post).

Informal, graphical rendering of the rule Attribute<->Object Type output [1]
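The attribute<->object type idea can also be sketched in Python (illustrative only—this is not ATL, and the dictionary-based model representation and `promote` parameter are made up for exposition): both input models act as sources, their elements land in one new target model, and an attribute linked to a class gets promoted to an object type in the target.

```python
# Illustrative-only merge of two source models into one target model,
# applying the attribute -> object type promotion for transformation links.
def merge_models(model_a: dict, model_b: dict, promote: list) -> dict:
    target = {"classes": set(), "attributes": set()}
    for m in (model_a, model_b):
        target["classes"] |= m["classes"]
        target["attributes"] |= m["attributes"]
    for attr in promote:
        # the former attribute becomes an object type (class) in the target
        target["attributes"].discard(attr)
        target["classes"].add(attr)
    return target

# The Airplane.Type vs Aircraft example from the integration scenario:
er = {"classes": {"Airplane"}, "attributes": {"Type"}}
uml = {"classes": {"Aircraft"}, "attributes": set()}
merged = merge_models(er, uml, promote=["Type"])
print(sorted(merged["classes"]))  # ['Aircraft', 'Airplane', 'Type']
```

The real rules additionally have to respect the constraints of the unifying metamodel, which this toy merge does not attempt.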

Someone into this matter might think, based on this high-level description, that there’s nothing new here. However, there is, and the paper has an extensive related works section. For instance, there’s related work on Distributed Description Logics with bridge rules [8], but they don’t do attributes, and the logic used for it doesn’t fit well with the features needed for conceptual modelling, so it cannot be adopted without substantial modifications. Triple Graph Grammars [9] look quite interesting for this sort of task, as does DOL [10], but either would require another year or two to figure out (feel free to go ahead already). On the practical side, e.g., the metamodel of the popular Eclipse Modeling Framework didn’t have enough in it for what needs to be included, both regarding the types of entities and the constraints that would need to be enforced. And so on, such that by a process of elimination, we ended up with ATL.

It would be good to come up with those logic-based linking options and proofs of correctness of the transformation rules presented in the paper, but in the meantime, an architecture design of the new tool was laid out in [11], which is being implemented as I write this. For now, at least a step has been taken from the three years of mostly theory and some experimentation toward implementation of all that. To be continued :-).

References

[1] Khan, Z.C., Keet, C.M., Fillottrani, P.R., Cenci, K.M. Experimentally motivated transformations for intermodel links between conceptual models. 20th Conference on Advances in Databases and Information Systems (ADBIS’16). Springer LNCS. August 28-31, Prague, Czech Republic. (in print)

[2] Fillottrani, P.R., Franconi, E., Tessaris, S. The ICOM 3.0 intelligent conceptual modelling tool and methodology. Semantic Web Journal, 2012, 3(3): 293-306.

[3] Fillottrani, P.R., Keet, C.M. Conceptual Model Interoperability: a Metamodel-driven Approach. 8th International Web Rule Symposium (RuleML’14), A. Bikakis et al. (Eds.). Springer Lecture Notes in Computer Science LNCS vol. 8620, 52-66. August 18-20, 2014, Prague, Czech Republic.

[4] Keet, C.M., Fillottrani, P.R. An ontology-driven unifying metamodel of UML Class Diagrams, EER, and ORM2. Data & Knowledge Engineering, 2015, 98:30-53.

[5] Keet, C.M., Fillottrani, P.R. An analysis and characterisation of publicly available conceptual models. 34th International Conference on Conceptual Modeling (ER’15). Johannesson, P., Lee, M.L. Liddle, S.W., Opdahl, A.L., Pastor López, O. (Eds.). Springer LNCS vol 9381, 585-593. 19-22 Oct, Stockholm, Sweden.

[6] Fillottrani, P.R., Keet, C.M. Evidence-based Languages for Conceptual Data Modelling Profiles. 19th Conference on Advances in Databases and Information Systems (ADBIS’15). Morzy et al. (Eds.). Springer LNCS vol. 9282, 215-229. Poitiers, France, Sept 8-11, 2015.

[7] Jouault, F., Allilaire, F., Bézivin, J., Kurtev, I. ATL: a model transformation tool. Science of Computer Programming, 2008, 72(1-2):31-39.

[8] Ghidini, C., Serafini, L., Tessaris, S., Complexity of reasoning with expressive ontology mappings. Formal ontology in Information Systems (FOIS’08). IOS Press, FAIA vol. 183, 151-163.

[9] Golas, U., Ehrig, H., Hermann, F. Formal specification of model transformations by triple graph grammars with application conditions. Electronic Communications of the EASST, 2011, 39: 26.

[10] Mossakowski, T., Codescu, M., Lange, C. The distributed ontology, modeling and specification language. Proceedings of the Workshop on Modular Ontologies 2013 (WoMo’13). CEUR-WS vol 1081. Corunna, Spain, September 15, 2013.

[11] Fillottrani, P.R., Keet, C.M. A Design for Coordinated and Logics-mediated Conceptual Modelling. 29th International Workshop on Description Logics (DL’16). Peñaloza, R. and Lenzerini, M. (Eds.). CEUR-WS Vol. 1577. April 22-25, Cape Town, South Africa. (abstract)

# Bootstrapping a Runyankore CNL from an isiZulu one mostly works well

Earlier this week the 5th Workshop on Controlled Natural Language (CNL’16) was held in Aberdeen, Scotland, where I presented the progress made on a Runyankore CNL [1], rather than my student, Joan Byamugisha, who did most of the work on it (she could not attend due to the UK’s nasty immigration rules, not for lack of funding).

“Runyankore?”, you might ask. It is one of the languages spoken in Uganda. As Runyankore is very under-resourced, any bootstrapping to take a ‘shortcut’ in developing language resources would be welcome. We have a CNL for isiZulu [2], but that is spoken in South Africa, a few thousand kilometres further south of Uganda, and it is in a different Guthrie zone of the—in linguistics still so-called—Bantu languages, so it was a bit of a gamble whether those results could be repurposed for Runyankore. They could, needing only minor changes.

What stayed the same were the variables, or: the components needed to make up a grammatically correct sentence when generating one within the context of OWL axioms (ALC, to be more precise). They are: the noun class of the name of the concept (each noun is assigned a noun class—there are 20 in Runyankore), the category of the concept (e.g., noun, adjective), whether the concept is atomic (a named OWL class) or an OWL class expression, the quantifier used in the axiom, and the position of the concept in the axiom. The only two real differences were that for universal quantification the word for the quantifier is the same for singular and plural (cf. isiZulu, where it changes between singular and plural), and for disjointness there is only one word, ti ‘is not’ (cf. isiZulu’s negative subject concord + pronominal). Two very minor differences are that for existential quantification (‘at least one’), the ‘at least’ is in a different place in the sentence but the ‘one’ behaves exactly the same, and ‘all’ for universal quantification comes after the head noun rather than before (but still depends on the noun class).
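For illustration only, here is a hypothetical Python sketch of how such parameters could drive the choice of surface pattern. Only ‘ti’ for disjointness comes from the text; every other token is a placeholder (marked with angle brackets), not actual Runyankore, and the function is my invention rather than the paper’s algorithm.

```python
# Hypothetical template selection for verbalising an ALC axiom; the full
# parameter set in the text also includes the concept's category, whether it
# is atomic, and its position in the axiom, omitted here for brevity.
def surface_pattern(quantifier: str, noun_class: int) -> str:
    if quantifier == "disjoint":
        return "{subj} ti {obj}"                       # single word for disjointness
    if quantifier == "forall":
        # 'all' follows the head noun and agrees with its noun class
        return "{subj} <all:nc%d> {obj}" % noun_class
    if quantifier == "exists":
        # 'at least' sits elsewhere; 'one' agrees with the noun class as in isiZulu
        return "{subj} <at-least> <one:nc%d> {obj}" % noun_class
    raise ValueError(quantifier)

print(surface_pattern("disjoint", 9))  # {subj} ti {obj}
```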

It goes without saying that the vocabulary is different, but that is a minor aspect compared to figuring out the surface realisation of an axiom. Where the bootstrapping came in handy was that the arduous step of investigating from scratch the natural language grammar involved in verbalising OWL axioms could be skipped, and the isiZulu ones could be reused instead. Yay. This makes it look very promising to port to other languages in the Bantu language family. (Yes, I know, “one swallow does not a summer make” [some Dutch proverb], but one surely is justified in turning up one’s hopes a notch regarding the generalisability and transferability of the results.)

Joan also conducted a user survey to ascertain which surface realisation was preferred among Runyankore speakers, implemented the algorithms, and devised a new one for the ‘hasX’ naming scheme of OWL object properties (like hasSymptom and hasChild). All these details, as well as the details of the Runyankore CNL and the bootstrapping, are described in the paper [1].

I cannot resist a final comment on all this. There are people who like to pull it down and trivialise natural language interfaces for African languages, on the grounds of “who cares about text in those kind of countries; we have to accommodate the illiteracy with pictures and icons and speech and such”. People are not as illiterate as is claimed here and there (including by still mentally colonised people from African countries)—if they were, then the likes of Google and Facebook and Microsoft would not invest in localising their interfaces in African languages. The term “illiterate” is used by those people to include also those who do not read/write in English (typically an/the official language of government), even though they can read and write in their local language. People who can read and write—whichever natural language it may be—are not illiterate, neither here in Africa nor anywhere else. English is not the yardstick of (il)literacy, and anyone who thinks it is should think again and reflect a bit on cultural imperialism for starters.

References

[1] Byamugisha, J., Keet, C.M., DeRenzi, B. Bootstrapping a Runyankore CNL from an isiZulu CNL. 5th Workshop on Controlled Natural Language (CNL’16), Springer LNAI vol. 9767, 25-36. 25-27 July 2016, Aberdeen, UK. Springer’s version

[2] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2016. DOI: 10.1007/s10579-016-9340-0 (in print) accepted version