# Book reviews for 2017

The third, and probably for a while the last, post in a row that has little, or even nothing, to do with my current job and research interests: the seventh installment of short book reviews and opinions on a selection of fiction and non-fiction books I read last year.

Non-Fiction

Dataclysm by Christian Rudder (2014). This is a highly entertaining book about some interesting aspects of Data Science. The author, one of the founders of the dating site OkCupid, takes his pile of OkCupid data and data from some other sources, and plays with it. What do those millions of up- and down-votes and match answers reveal? What’s the difference between what people say in surveys and how they behave online on the dating site? A lot, it appears. The book makes the anonymised aggregates look fun and harmless, rather than a Big Brother-like tool to haunt an individual. A bunch of people copy-and-paste messages, but it doesn’t seem to matter for replies and interaction. Looks matter, a lot, but weirdness, too. You’re a woman over, say, 25 and the man says you’re gorgeous? Probably lying: men dig the looks of 20-year-old women most, no matter how old they themselves are. The portion of people identifying as gay correlates with societal and legal acceptance of same-sex relationships and marriages. And so on. One has to bear in mind that the conclusions drawn from the data should be seen in the light of self-selection (are the OkCupid members representative of the whole population?), that pattern-finding is different from hypothesis testing, and that accidental data is different from data collected in a controlled setting. That said, it’s still interesting to read about what the data says, and it offers a peek into the kitchen of online dating sites.

What the dog saw by Malcolm Gladwell (2009). It’s not nearly as good as his other books. In a way, the narrative approach is the opposite of Dataclysm’s (as Rudder also discusses): Gladwell’s books are more about the peculiar particular and trying to generalise from that, whereas Rudder takes aggregates over very large numbers of instances, that is, the general trend backed up by data rather than anecdotes (the plural of which is not data). Both books are very USA-centric, which became annoying in What the dog saw but not with Dataclysm.

Gastrophysics—the new science of eating by Charles Spence (2017). I’ve met people who can’t even believe there’s such a thing as food science—an established applied science and engineering discipline—so perhaps the reader’s first response to the term ‘science of eating’ may be even less credulous. But sure enough, there’s truth to it. As the blurb about the book ended up longer than intended, it got its own blogpost: gastrophysics and follies. In short: the first part is very highly recommended reading.

I didn’t manage to finish Slavoj Žižek’s “Trouble in paradise—from the end of history to the end of capitalism”. There were fine parts in it, but there was too much rambling across too many pages, piling dialectic upon dialectic, topped off with inversions, such that the plot got lost and the logic went missing. I searched online for book reviews of it, wondering if it was just me being too tired to concentrate, but it turned out I’m not the only one. 17 contradictions and the end of capitalism by David Harvey was better (reviewed a few years ago).

Fiction

Indaba, my children—African tribal history, legends, customs and religious beliefs by Credo Mutwa (1964). This book was recommended to me with the note that although the author made it all up, for there are no such stories among the amaZulu, it is great storytelling and a must-read nonetheless. It is indeed. The first part chronicles, in fantastic fashion, the story of creation and the first humans as a poem that is best read aloud for the most dramatic effect of the unexpected turns of events. Parts two and three see successive interactions with intruders and societies rising and falling, the organisation of society, its laws, customs, rites, and quests for power, with a bit of interference and nudging here and there by the goddesses, immortals, and the tree of life. Part four consists of reflections, events, stories, and criticisms of the recent past. Woven into these stories is how things came to be as they are, such as the creation of the moon, the naming of the marimba (a type of xylophone), how and from where the Swazi and amaXhosa originate, and so on. The tome of almost 700 pages takes a while to finish; yet, at the same time, one feels that sense of loss upon having finished a great book.

The drowning by Camilla Läckberg (2012, translated from Swedish). This is a ‘whodunnit’ crime novel with twists and turns, and even if you think near the end of the book that you know who’s behind the threats and murders, you’ll still be surprised, and perhaps somehow a bit sad, too. So much for spoilers. The novel is set in a village in Sweden where village life seems idyllic, but all sorts of things are not what they seem—the ‘marital bliss’ that’s not so blissful after all, and so on. Christian Thydell has published his first novel to great reviews, but has been receiving anonymous threats, and soon a few other men in the village get them as well. Gradually, a few people die or are murdered. Erica Falck sets out to uncover the truth informally, while her detective husband tries to do so in the official police investigation. With the cross-fertilisation of information, eventually the mystery is resolved.

How to fall in love by Cecelia Ahern (2014). I recall the time when Ahern’s first novel came out in Ireland, where the first reactions to her having published a book were like “well, bleh, but she’s the daughter of…” [the then Taoiseach (‘prime minister’), Bertie Ahern], yet then reviews came in like “actually, it’s really sweet/nice/quite good/etc.” and it ended up as an international bestseller and a movie (P.S. I love you). So, when I was recently in Ireland, I decided to buy one of her books to see for myself, which turned out to be her latest novel, How to fall in love. Sounds just as cheesy, true, but the nutshell version of the story is quite grim. The protagonist, Christine Rose, talks a stranger (Adam) out of his suicide attempt with a deal: that she can convince him within two weeks that life is worth living; if she can’t, then he still can go off and kill himself. She had just walked out of a relationship; he had found his fiancée cheating with his best friend (among other reasons why he wanted to kill himself). Christine then comes up with a range of ‘mini-adventures’, mostly set in Dublin, trying to fix things despite having no real experience in suicide-prevention support. Some activities and meddling work out better than others. The storytelling is heart-warming, funny, and light-hearted, yet at times serious and depressing as well (suicide is quite a large problem in Ireland, with a rate higher than the world average). Several unexpected turns in the story and the development of the characters and their motivations keep it interesting. It is a fairly quick read for its easy writing style, yet also one of those books that one would like to read again for the first time.

Woman on the edge of time by Marge Piercy (1976). I bought the book because it said “The Classic Feminist Science Fiction Novel” on the front cover. Frankly, that’s rubbish. If Americans think that’s feminism and sci-fi, then no wonder gender parity hasn’t been achieved and science is facing tough times there. Anyway, the story. The protagonist had dreams of education and independence and is sane, but was put in the insane box, and she goes along with it, with some weakness and whining about oppression here and there. What exactly is feminist about that?! The supposedly sci-fi part is the protagonist taking some mental trips to a future of “sexual, racial, and environmental harmony”. She can do that because her mind is “receptive”. Seriously? Really, there’s no smell of ‘sci’ there, just a lot of ‘fi’. If there were a single classic in the genre of ‘futurist fiction for feminists’, I’d say it’s most definitely Kinderen van moeder aarde (‘children of mother earth’) by Thea Beckman, which I read back in the mid-’80s. I can still remember the storyline now, more than 30 years later, without having read it since. The setting is Greenland which, after some terrible nuclear war that moved continents, was pushed south into a moderate climate, whereas Europe ended up at a latitude where it’s scorching hot. Thule (Greenland) is governed by women—because it was the men who screwed up with their wars—and now a dirty steamboat with exploitative patriarchal Europeans is arriving. The book describes how that society, where women run the show, functions (e.g., there are no prisons). The protagonists are two teenagers—one the son of a female member of the governing body, the other his girlfriend from the commoners—who think that it isn’t fair that only women from the ruling hierarchy rule. In the end, they manage to neutralise the invaders, and a few men get some say in governance.

The Power by Naomi Alderman (2016). As the saying goes, don’t judge a book by its cover. But the book’s front cover looked really cool and the back-cover story sounded like an interesting scenario, so I ended up impulse-buying it anyway. It turned out to be a page-turner. The main part of the novel describes the unfolding events around an eclectic set of main characters across the world when, from one day to the next, women turn out to have ‘the power’. That is, there’s an organ that only women have that has suddenly become active, with which, while it does not make girls and women physically stronger than men, they can hurt, kill, or rape men by administering an electrical surge at a certain place. This obviously affects the status quo of patriarchal societies across the world, and the girls and women respond differently to the new powers gained, based on both their personal background and how bad it was for them in their subculture and country. The men respond differently to it as well. Without revealing too much, it could be categorised as ‘futurist fiction for feminists’, sort of, as it’s also about ‘what if sexism had been reversed for millennia?’ and ‘what if it were to be reversed now and it’s payback time?’. The answers the author came up with make for useful reading, and perhaps also contemplation, for both women and men. Whether you think at the end that it’s a dystopian novel, fanciful daydreaming, futurist fiction for feminists, a thriller, a useful mirror to the current society we live in, or stick another label on it—you’ll have an opinion of it :).

If these books don’t interest you, then perhaps one of the previous ones I posted about in 2012, 2013, 2014, 2015, 2016, and 2017 just might (I’m an omnivorous reader).

# Logic, diagrams, or natural language for representing temporal constraints in conceptual modeling languages?

Spoiler alert of the answer: it depends. In this post, I’ll trace how we got to that conclusion and refine what exactly it depends on.

There are several conceptual modelling languages with extensions for temporal constraints, which will then be used in a database to ensure data integrity with respect to the business rules. For instance, there may be a rule for some information system stating that “all managers in a company must have been employees of that company already” or “all postgraduate students must also be a teaching assistant for some time during their studies”. The question then becomes how to get modellers to model this sort of information in the best way. The first step in that direction is figuring out the best way to represent temporal constraints. We already know that icons aren’t that unambiguous and easy [1], which leaves the natural language rendering devised recently [2], or one of the logic-based notations, such as the temporal Description Logic DLRUS [3]. So, the questions to investigate became, more precisely:

• Which representation is preferred for representing temporal information: formal semantics, Description Logics (DL), a coding-style notation, diagrams, or template-based (pseudo-)natural language sentences?
• What would be easier to understand by modellers: a succinct logic-based notation, a graphical notation, or a ‘coding style’ notation?
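To make the contrast concrete: the first example rule above, that all managers must have been employees already, could be written in a DLRUS-style notation along the lines of (a sketch of the idea; the exact operator syntax follows [3]):

$Manager \sqsubseteq \diamond^- Employee$

where $\diamond^-$ reads ‘at some time in the past’, whereas a template-based rendering would be a canned English sentence such as ‘Each manager was an employee before’. The survey then probes which of such renderings modellers prefer and understand best.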

To answer these questions, my collaborator, Sonia Berman (also at UCT) and I conducted a survey to find out modeller preference(s) and understanding of these representation modes. The outcome of the experiment is about to be presented at the 36th International Conference on Conceptual Modeling (ER’17) that will be held next week in Valencia, Spain, and is described in more detail in the paper “Determining the preferred representation of temporal constraints in conceptual models” [4].

The survey consisted mainly of questions asking participants which representation they preferred, a few questions where they tried to model constraints themselves, and basic questions, like whether they had English as a first language (see the questionnaire for details). Below is one of the questions to illustrate it.

One of the questions of the survey

Its option (a) is the semantics notation of the DLRUS Description Logic, option (b) the short-hand notation in DLRUS, option (c) a coding-style notation we made up, and option (e) the natural language rendering that came out of prior work [2]. Option (d) was devised for this experiment: it shows the constraint in the Temporal information Representation in Entity-Relationship Diagrams (TREND) language. TREND is an updated and extended version of ERVT [5], taking into account earlier published extensions for temporal relationships, temporal attributes, and quantitative constraints (e.g., ‘employee receives a bonus after two years’), a new extension for the distinction between optional and mandatory temporal constraints, and the notation preferences emanating from [1].
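As a sketch of how such a template-based rendering (option (e)) can be generated—with hypothetical template and slot names, not the actual implementation of [2]—each temporal constraint type maps to a fixed sentence template whose slots are filled with the vocabulary from the model:

```python
# Minimal sketch of template-based verbalisation of a temporal constraint.
# The template key 'RDev' and slot names c1/c2/r1/r2 are made up for
# illustration; the cited work defines its own templates.
TEMPLATES = {
    # 'RDev': a relationship that may be followed by another one,
    # ending the first (cf. the married-to/divorced-from example).
    "RDev": "{c1} {r1} {c2} may be followed by {c1} {r2} {c2}, "
            "ending {c1} {r1} {c2}.",
}

def verbalise(constraint_type: str, **slots: str) -> str:
    """Fill the sentence template for the given temporal constraint type."""
    return TEMPLATES[constraint_type].format(**slots)

print(verbalise("RDev", c1="Person", c2="Person",
                r1="married-to", r2="divorced-from"))
# → Person married-to Person may be followed by Person divorced-from Person,
#   ending Person married-to Person.
```

The point of the template approach is that the modeller only ever sees the filled-in sentence, never the DL notation behind it.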

Here are some of the main quantitative results:

The top-rated representation modes and ‘dislike’ ratings.

These are aggregates, though, and they hide some variation in the responses. For instance, representing ‘simple’ temporal constraints in the DL notation was still ok (though diagrams were most preferred), but the more complex the constraints got, the stronger the preference for the natural language rendering. For instance, “Person married-to Person may be followed by Person divorced-from Person, ending Person married-to Person.” is deemed easier to understand than $\langle o , o' \rangle \in marriedTo^{\mathcal{I}(t)} \rightarrow \exists t'>t. \langle o , o' \rangle \in divorcedFrom^{\mathcal{I}(t')} \land \langle o , o' \rangle \not\in marriedTo^{\mathcal{I}(t')}$ or $\diamond^+\mbox{{\sc RDev}}_{{\sf marriedTo,divorcedFrom}}$. Yet, the temporal relationship ${\sf marriedTo \sqsubseteq \diamond^* \neg marriedTo}$ was deemed easier to understand than “The objects participating in a fact in Person married to Person do not relate through married-to at some time”. Details of the experiment and more data and analysis are described in the paper [4]. In sum, the evaluation showed the following:

1. a clear preference for graphical or verbalised temporal constraints over the other three representations;
2. ‘simple’ temporal constraints were preferred graphically and complex temporal constraints preferred in natural language; and
3. their English specification of temporal constraints was inadequate.

Overall, this indicates that what is needed is a modelling tool with a multi-modal interface for temporal conceptual model development, with, in particular, the ability to switch between graphical and verbalised temporal constraints.

If I hadn’t had teaching obligations (which now got cancelled due to student protests anyway) and no NRF funding cut in the incentive funding (rated researchers got to hear from one day to the next that it’ll be only 10% of what it used to be), I’d have presented the paper myself at ER’17. Instead, my co-author is on her way to all the fun. If you have any questions, suggestions, or comments, you can ask her at the conference, or drop me a line via email or in the comments below. If you’re interested in TREND: we’re working on a full paper with all the details and have conducted further modeling experiments with it, which we hope to finalise writing up by the end of the year (provided student protests won’t escalate and derail research plans any further).

References

[1] T. Shunmugam. Adoption of a visual model for temporal database representation. M. IT thesis, Department of Computer Science, University of Cape Town, South Africa, 2016.

[2] Keet, C.M. Natural language template selection for temporal constraints. CREOL: Contextual Representation of Events and Objects in Language, Joint Ontology Workshops 2017, 21-23 September 2017, Bolzano, Italy. CEUR-WS Vol. (in print).

[3] A. Artale, E. Franconi, F. Wolter, and M. Zakharyaschev. A temporal description logic for reasoning about conceptual schemas and queries. In S. Flesca, S. Greco, N. Leone, and G. Ianni, editors, Proceedings of the 8th Joint European Conference on Logics in Artificial Intelligence (JELIA-02), volume 2424 of LNAI, pages 98-110. Springer Verlag, 2002.

[4] Keet, C.M., Berman, S. Determining the preferred representation of temporal constraints in conceptual models. 36th International Conference on Conceptual Modeling (ER’17). Mayr, H.C., Guizzardi, G., Ma, H. Pastor. O. (Eds.). Springer LNCS vol. 10650, 437-450. 6-9 Nov 2017, Valencia, Spain.

[5] A. Artale, C. Parent, and S. Spaccapietra. Evolving objects in temporal information systems. Annals of Mathematics and Artificial Intelligence, 50(1-2):5-38, 2007.

# Round 2 of the search engine, browser, and language bias mini-experiment

Exactly a year ago I did a mini-experiment to see whether search engine bias exists in South Africa as well. It did. The notable case was that Google in English on Safari on the Mac (GES) showed results for ‘politically interesting searches’ that had less information and leaned to the right side of the political spectrum in a way that raised cause for concern, as compared to Google in isiZulu in Firefox (GiF) and Bing in English in Firefox (BEF). I repeated the experiment in the exact same way, with some of the same queries and a few new ones that take into account current affairs; the only difference being that I used my Internet connection at home rather than at work. The same problem still exists, sometimes quite dramatically. As a recommendation, then: don’t use Google in English on Safari on the Mac unless you want to be in an “anti-government, Democratic Alliance as centre-of-the-world” bubble.

To back it all up, I took screenshots again, ordered from left to right as GiF, GES, BEF, so you can check for yourself what users with different configurations see on the first page of the search results. The set of clearly different/biased results is listed first.

• “EFF”, which in South Africa is a left populist opposition party, and internationally the abbreviation of the Electronic Frontier Foundation: GiF lists it as a political party; GES in relation to the DA first and then as a political party; BEF as a political party and the Electronic Frontier Foundation.

“EFF” search

• “jacob zuma”, the current president of the country: GiF first has a Google ad to oust Zuma, then general info and news; GES has a Google ad to oust Zuma, a comment by JZ’s son blaming the whites (probably fuelling racial divisiveness), then general info and news; BEF has general info and news.

“jacob zuma” search

• “ANC”, currently the largest political party nationally and in power: GiF has first a link to the ANC site, one news item, and for the rest contact info; GES has first ‘bad press’ for the ANC as top stories, then Twitter, then the ANC website; BEF lists first the ANC site, then news and info.

“ANC” search

• “Manana”, the Higher Education deputy minister who faces allegations of mistreatment of female staff members in his department: GiF with news about the accusations; GES has negative news about the ANC Women’s League and DA actions; BEF shows info about Manana, mixed up with the Spanish mañana.

“Manana” search

• The autocomplete function when typing “ANC” was somewhat surprising: GiF also associates it with ‘eff news’ and ‘zuma’; GES doesn’t have ‘eff news’ to suggest, so autocomplete also seems to be determined by the client-side configuration; BEF has all sorts of things.

exploring the autocomplete on “ANC”

• “white monopoly capital” (long story): GiF shows general info and news; GES also shows general info and news, but with that inciting blaming-the-whites news item; BEF shows general info and news as well, but ordered differently from Google’s results.

“white monopoly capital” search

• “DA”, which in South Africa is the abbreviation of the Democratic Alliance opposition party (capitalist, for the rich): GiF lists the DA website and some news; GES shows news on DA action and opinion, then the DA website; BEF lists the DA site, some general info, and disambiguation.

“DA” search

• “motion of no confidence”, which was held last week against Jacob Zuma (the motion failed, but not by a large margin): GiF has again that Google ad for the organisation to oust Zuma, then info and mostly news (with one international news site [Al Jazeera]); GES has info, then SA opinion pieces rather than news; BEF has news and info.

“motion of no confidence” search

• “FeesMustFall”, which was one of the tags of the student protests in 2015 and 2016 (for free higher education): GiF has general info and news; GES shows first two ads to join the campaign, then general info and news; BEF has info and news. So, this seems flipped cf. last year.

“FeesMustFall” search

Then the set of searches whose results are roughly the same. I had expected this for “Law on cookies in South Africa” and “Socialism”, for they were about the same last year as well. I wasn’t sure about “women’s month” (this month, August), given its history; there are slight differences, but not much. The interesting one, perhaps, was that “state capture gupta” also showed similar results across the three configurations, all of them showing results to pages that treat it as fact, with at least some detailed background reading on it.

“Law on cookies in South Africa” search

“Socialism” search

“women’s month” search

“state capture gupta” search

Finally, last year the mini-experiment was motivated by lecture preparations for the “Social Issues and Professional Practice” block of CSC1016S that I’m scheduled to teach in the upcoming semester (if there won’t be protests, that is). As compared to last year, now I can also add a note on the Algorithmic Transparency and Accountability statement from the ACM, in addition to the ‘filter bubble’ and ‘search engine manipulation’ items. Maybe I should cook up an exercise for the students so we can get data, rather than still being in the realm of anecdotes with my 20 searches and three configurations. If you did the same with a different configuration, please let me know.

# Improved! TDDonto v2—more types of axioms supported and better feedback

Yes, the title almost sounds like a silly washing powder ad, but version 2 really does do more than the TDDonto tool for Test-Driven Development of ontologies [1,2] that was introduced earlier this year. There are two principal novelties, largely thanks to Kieren Davies (also at UCT): more types of axioms are supported—arbitrary class expressions on both sides of the inclusion, and ABox assertions—and there is differentiated test feedback beyond just pass/fail/unknown. TDDonto2 obviously still uses a test-first rather than test-last approach to ontology authoring, i.e., checking whether the axiom is already entailed by the ontology, or would cause problems, before actually adding it, saving yourself a lot of classification time overhead in the ontology authoring process.

On the first item: TDDonto (like Tawny-OWL or Scone) could not handle, e.g., $Carnivore \sqcup Herbivore \sqsubseteq Animal$ or some domain restriction $\forall eats.Animal \sqsubseteq Carnivore$, or check whether some individual is different from or the same as another. TDDonto2 can. This required a new set of algorithms, some nifty orchestration of several functions offered by an automated reasoner (of the DL/OWL variety), and extending the Protégé 5 functionality with parsing of Manchester syntax keyword constructs for individuals as well (another 3600 lines of code). The Protégé 5 plugin works. Correctness of those algorithms has been proven, so you can rely on it just like you can on the test-last approach of add-axiom-and-then-run-the-reasoner (I’ll spare you the details).

On the second item (which also goes beyond the current TDD tools): TDDonto2 can now tell you not just ‘pass’ (i.e., the axiom is entailed), but the ‘failed’ has been refined into the various possible cases: adding the axiom to the ontology would cause the ontology to become inconsistent, or it would cause a class to become unsatisfiable (incoherent), or it may be none of these (absent), so it would be ‘safe’ to add the axiom under test to the ontology (that is: it would at least not cause inconsistency, incoherence, or redundancy). Further, we’ve added precondition checks before the real TDD unit test: if the ontology is already inconsistent, there’s no point in testing the axiom; if the ontology already has unsatisfiable classes, then one should fix that first; and if there is an entity in the test axiom that is not in the ontology, then it should be added first.
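The decision procedure behind these verdicts can be sketched as follows. This is a simplification in Python, not the actual TDDonto2 code (which is a Protégé plugin); the `reasoner` object stands in for calls to an OWL reasoner, and the `CannedReasoner` hard-codes answers for the Pool/Braai toy example below rather than doing real DL reasoning:

```python
from enum import Enum

class Verdict(Enum):
    PRECONDITION_FAILED = "precondition failed"
    ENTAILED = "entailed"          # test passes: the ontology entails the axiom
    INCONSISTENT = "inconsistent"  # adding it would make the ontology inconsistent
    INCOHERENT = "incoherent"      # adding it would make a class unsatisfiable
    ABSENT = "absent"              # 'safe' to add

def tdd_test(onto, ax, reasoner):
    """Classify test axiom `ax` against ontology `onto` (a set of axioms)."""
    # Preconditions: the ontology itself must be consistent and coherent,
    # and every entity in the test axiom must already be in the ontology.
    if (not reasoner.is_consistent(onto)
            or reasoner.unsatisfiable_classes(onto)
            or not reasoner.signature({ax}) <= reasoner.signature(onto)):
        return Verdict.PRECONDITION_FAILED
    if reasoner.is_entailed(onto, ax):
        return Verdict.ENTAILED
    extended = onto | {ax}          # hypothetically add the axiom...
    if not reasoner.is_consistent(extended):
        return Verdict.INCONSISTENT
    if reasoner.unsatisfiable_classes(extended):
        return Verdict.INCOHERENT
    return Verdict.ABSENT           # ...and it caused no problem

# Canned stand-in for a reasoner, hard-coded for the Pool/Braai toy example:
class CannedReasoner:
    KEYWORDS = {"SubClassOf", "DisjointWith"}
    def is_consistent(self, axioms):
        return True  # no conflicting individual assertions in this toy case
    def unsatisfiable_classes(self, axioms):
        # Pool and Braai are disjoint, so a common subclass is unsatisfiable
        if "PoolBraai SubClassOf Pool and Braai" in axioms:
            return ["PoolBraai"]
        return []
    def is_entailed(self, axioms, ax):
        return ax in axioms  # asserted axioms only, for illustration
    def signature(self, axioms):
        return {w for a in axioms for w in a.split()
                if w[0].isupper() and w not in self.KEYWORDS}

onto = {"Pool SubClassOf Facility", "Braai SubClassOf Facility",
        "Pool DisjointWith Braai"}
ax = "PoolBraai SubClassOf Pool and Braai"
print(tdd_test(onto, ax, CannedReasoner()))   # PoolBraai not yet declared
onto.add("PoolBraai SubClassOf Braai")        # declare it first
print(tdd_test(onto, ax, CannedReasoner()))   # adding it would be incoherent
```

The first call fails the precondition (PoolBraai does not occur in the ontology yet); after declaring it, the test reports that adding the axiom would make PoolBraai unsatisfiable, mirroring the tool’s behaviour described below.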

The remainder of the post mainly just shows off some of the functionality in text and with screenshots (UPDATE 13-3-2017: we now also have a screencast tutorial [177MB mov file]). Put the JAR file in the plugins directory and then open the view via Window – Views – Ontology views – TDDonto2. As toy ontology, I tend to end up with examples from the African Wildlife Ontology, which I use for exercises in my Ontology Engineering course, but as it is almost summer holiday here, I’ve conjured up a different example. That test ontology contains the following knowledge at the start:

$ServiceObject \equiv Facility \sqcup Attraction$

$Pool \sqsubseteq Facility$

$Braai \sqsubseteq Facility$

$Pool \sqcap Braai \sqsubseteq \bot$

$Hotel \sqsubseteq Accommodation$

$BedAndBreakfast \sqsubseteq Accommodation$

$BedAndBreakfast \sqcap Hotel \sqsubseteq \bot$

$Facility \sqsubseteq \exists offeredBy.Accommodation$

$Hotel \sqsubseteq =1 offers.Pool$

$Hotel(LagoonBeach)$
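For reference, the same toy ontology rendered in Manchester syntax, roughly as one would enter it in Protégé (a sketch, using the same vocabulary as above):

```
Class: ServiceObject   EquivalentTo: Facility or Attraction
Class: Pool            SubClassOf: Facility
Class: Braai           SubClassOf: Facility
DisjointClasses: Pool, Braai
Class: Hotel           SubClassOf: Accommodation, offers exactly 1 Pool
Class: BedAndBreakfast SubClassOf: Accommodation
DisjointClasses: BedAndBreakfast, Hotel
Class: Facility        SubClassOf: offeredBy some Accommodation
Individual: LagoonBeach  Types: Hotel
```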

The first test is to see whether $\exists offeredBy.Accommodation \sqsubseteq Facility$, to show that TDDonto2 can handle class expressions on the left-hand side of the inclusion axiom. It can, and the axiom is clearly absent from the toy ontology; see the screenshot below, first line in the middle section. Likewise for the second and third tests, where a typical novice ontology authoring mix-up between ‘and’ and ‘or’ is made, which gives different test results: one is absent, the other entailed.

Then some more fun: the pool braai. First of all, PoolBraai is not in our ontology, so TDDonto2 returns an error: this can be seen from Protégé’s handling (red dotted line below PoolBraai and the red-lined text box in the screenshot above), and TDDonto2 will not let you add it to the set of tests (pop-up box, not shown). After adding PoolBraai to the ontology and testing “PoolBraai SubClassOf: Pool and Braai”, the test shows that if we were to add that axiom to the ontology, it would become incoherent (because Pool and Braai are disjoint):

Doing this nonetheless by selecting the axiom and adding it to our ontology by pressing the “Add selected to ontology”:

and running all tests again by pressing the “Evaluate all” button (or select it and click “Evaluate selected”), the results look like this:

That is, we failed a precondition, because PoolBraai is unsatisfiable, so no tests are executed until this is fixed. Did I make this up just to have a silly toy ontology example? No, the pool braai does exist, in South Africa at least: it is a stainless steel barbecue table-set that one can place in a small backyard pool. So, we remove $PoolBraai \sqsubseteq Pool \sqcap Braai$ from the ontology and add $PoolBraai \sqsubseteq Braai$, so that we can do a few more tests.

Let’s assume we want to explore more about accommodations and their facilities, and add some knowledge about that (tests 5-7):

Finally, let’s check something about the instances in the ontology. First, whether LagoonBeach is a hotel, with “LagoonBeach Type: Hotel”, which it is (with a view on Table Mountain), and whether it also could be a B&B, which it cannot be, because hotel and B&B are disjoint. Adding another individual to the ontology for the sake of example, SinCity (an owl:Thing), we want to know whether SinCity can be the same as LagoonBeach, or asserted as different (the last two tests in the list): the tests return absent, i.e., either could hold, for nothing is known about SinCity.

Now let’s remove a selection of the tests because they would cause problems in the ontology, and add the remaining five in one go:

This change requires one to classify the ontology, and subsequently you’re expected to run all the tests again to check that they are all entailed and do not cause some new problem, which they don’t:

And, finally, a few arbitrary ones that are ontologically a bit off, but they show that yes, something arbitrary both on the left-hand side and right-hand side of the inclusion (or equivalence) works (first test, below), disjointness still works (test 2) and now also with arbitrary class expressions (test 5), and the same/different individuals can take more than two arguments (tests 3 and 4).

The source code and JAR file are freely available (GPL licence) to use, examine, or extend. A paper with the details has been submitted, so you’ll have to make do with just the tool for the moment. If you have any feedback on the tool, please let us know.

References

[1] Keet, C.M., Lawrynowicz, A. Test-Driven Development of Ontologies. 13th Extended Semantic Web Conference (ESWC’16). H. Sack et al. (Eds.). Springer LNCS vol. 9678, pp642-657. 29 May – 2 June, 2016, Crete, Greece.

[2] Lawrynowicz, A., Keet, C.M. The TDDonto Tool for Test-Driven Development of DL Knowledge bases. 29th International Workshop on Description Logics (DL’16). April 22-25, Cape Town, South Africa. CEUR WS vol. 1577.

# My gender-balanced book reviews overall, yet with much fluctuation

In one of my random browsing moments, I stumbled upon a blog post of a writer who had her son complaining about the stories she was reading to him, as having so many books with women as protagonists. As it appeared, “only 27% of his books have a female protagonist, compared to 65% with a male protagonist.”. She linked back to another post about a similar issue but then for some TV documentary series called missed in history, where viewers complained that there were ‘too many women’ and more like a herstory than a missed in history. Their tally of the series’ episodes was that they featured 45% men, 21% women, and 34% were ungendered. All this made me wonder how I fared in my yearly book review blog posts. Here’s the summary table and the M/F/both or neither:

| Year posted | Books | Nr M | Nr F | Both / neither | Pct F |
|---|---|---|---|---|---|
| 2012 | Long walk to freedom, terrific majesty, racist’s guide, end of poverty, persons in community, African renaissance, angina monologues, master’s ruse, black diamond, can he be the one | 4 | 3 | 3 | 33% |
| 2013 | Delusions of gender, tipping point, affluenza, hunger games, alchemist, eclipse, mieses karma | 2 | 3 | 2 | 43% |
| 2014 | Book of the dead, zen and the art of motorcycle maintenance, girl with the dragon tattoo, outliers, abu ghraib effect, nice girls don’t get the corner office | 2 | 1 | 3 | 17% |
| 2015 | Stoner, not a fairy tale, no time like the present, the time machine, 1001 nights, karma suture, god’s spy, david and goliath, dictator’s learning curve, MK | 4 | 2 | 4 | 20% |
| 2016 | Devil to pay, black widow society, the circle, accidental apprentice, moxyland, muh, big short, 17 contradictions | 2 | 4 | 2 | 50% |
| Total | | 14 | 13 | 14 | 32% |

Actually, I did pretty well in the overall balance. It also shows that had I done a bean count for a single year only, the conclusion could have been very different. That said, I classified the books from memory, not by NLP analysis of their text, so the actual weight allotted to the main characters might differ. Related to this is the screenplay dialogue-based, data-driven analysis of Hollywood movies, for which NLP was used. Their results show that even when there’s a female lead character, Hollywood manages to have men speak more; e.g., The Little Mermaid (71% male dialogue) and The Hunger Games (55% male). Even the chick flick Clueless is 50-50. (The website has several nice interactive graphs based on all that data, so you can check for yourself.) For the Hunger Games, though, the books do have Katniss think, do, and say more than the movies do.

A further caveat on the data is that these books are not the only ones I’ve read over the past five years, just the ones I wrote about. Anyhow, I’m pleased to discover there is some balance in what I pick out to write about, despite any unconscious bias.

As a last note on the fiction novels listed above, there was a lot of talk online this past week about Lionel Shriver’s keynote in defense of writing what you like and having had enough of the concept of ‘cultural appropriation’. Quite a few authors in the list above would be thrown on the pile of authors who ‘dared’ to imagine characters different from the box they would probably be put in. Yet most of them still did a good job of making it a worthwhile read, such as Hugh Fitzgerald Ryan on Alice Kyteler in ‘The devil to pay’, David Safier with Kim Lange in ‘Mieses Karma’, Stieg Larsson with ‘Girl with the dragon tattoo’, and Richard Patterson in ‘Eclipse’ about Nigeria. Rather: a terrible character or setting that misrepresents a minority or an oppressed, marginalised, or ‘Other’ group in a novel is an indication of bad writing, and the writer should educate him/herself better. For instance, JM Coetzee could come back to South Africa and learn a thing or two about the majority population here, and I hope for Zakes Mda that he’ll meet some women he can think favourably about and then reuse those experiences in a story. Anyway, even if the conceptually problematic anti-‘cultural appropriation’ police wins out over the fiction writers, then I suppose I can count myself lucky living in South Africa, which, with its diversity, will have diverse novels to choose from (assuming they won’t go further overboard into dictating that I be allowed to read only those novels designated as appropriate for my externally assigned box).

UPDATE (20-9-2016): following a question on POC protagonists, here’s the table, in which the books where a person (or group) of colour is a protagonist are italicised. Some notes on my counting: Angina monologues has three protagonists, of whom two are POC, so I counted it; Hunger games’ Katniss is a POC in the books; Eclipse is arguable; abu ghraib effect is borderline; and Moxyland has an ensemble cast, so I counted that as well. Non-POC includes cows (Muh), hence that term rather than the ‘white’ that POC is usually contrasted with. As can be seen, it varies quite a bit by year.

| Year posted | Book | POC (italics in the list) | Non-POC or N/A | Pct POC |
|---|---|---|---|---|
| 2012 | Long walk to freedom, terrific majesty, racist’s guide, end of poverty, persons in community, African renaissance, angina monologues, master’s ruse, black diamond, can he be the one | 8 | 2 | 80% |
| 2013 | Delusions of gender, tipping point, affluenza, hunger games, alchemist, eclipse, mieses karma | 2 | 5 | 29% |
| 2014 | Book of the dead, zen and the art of motorcycle maintenance, girl with the dragon tattoo, outliers, abu ghraib effect, nice girls don’t get the corner office | 2 | 4 | 33% |
| 2015 | Stoner, not a fairy tale, no time like the present, the time machine, 1001 nights, karma suture, god’s spy, david and goliath, dictator’s learning curve, MK | 4 | 6 | 40% |
| 2016 | Devil to pay, black widow society, the circle, accidental apprentice, moxyland, muh, big short, 17 contradictions | 3 | 5 | 38% |
| Total | | 19 | 22 | 46% |

# Brief report on the INLG16 conference

Another long wait at the airport is being filled with writing up some of the 10 pages of notes I scribbled while attending the WebNLG’16 workshop and the 9th International Natural Language Generation conference (INLG’16), which were held from 6 to 10 September in Edinburgh, Scotland.

There were two keynote speakers, Yejin Choi and Vera Demberg, and several long and short presentations and a bunch of posters and demos, all of which had full or short papers in the (soon to appear) ACL proceedings online. My impression was that, overall, the ‘hot’ topics were image-to-text, summaries and simplification, and then some question generation and statistical approaches to NLG.

The talk by Yejin Choi was about sketch-to-text, or: pretty much anything-to-text, such as image captioning, recipe generation based on the ingredients, and one could even do it with sonnets. She used a range of techniques to achieve this, such as probabilistic CFGs and recurrent neural networks. Vera Demberg’s talk, on the other hand, was about psycholinguistics for NLG, starting from the ‘uniform information density hypothesis’ and how surprisal words and grammatical errors affect a person reading the text. It appears that there’s more pupil jitter when there’s a grammar error. The talk then moved on to how one can model and predict information density, for which there are syntactic, semantic, and event surprisal models. For instance, with the semantic one: given ‘Peter felled a tree’, how predictable is ‘tree’, given that it’s already kind of entailed in the word ‘felled’? Some results were shown for the most likely fillers for, e.g., ‘serves’ as in ‘the waitress serves…’ and ‘the prisoner serves…’, which could then be used to find suitable word candidates in sentence generation.

The best paper award went to “Towards generating colour terms for referents in photographs: prefer the expected or the unexpected?”, by Sina Zarrieß and David Schlangen [1]. While the title might sound a bit obscure, the presentation was very clear. There is the colour spectrum, and people assign names to the colours, which one could take as RGB colour values for images. This is all fine and well on a colour strip, but when a colour is put in the context of other colours and background knowledge, the colour humans would use to describe that patch in an image isn’t always in line with the actual RGB colour. The authors approached the problem by viewing it as a multi-class classification problem and used a multi-layer perceptron with some top-down recalibration, and voilà, the software returns the intended colour most of the time. (Knowing the name of the colour, one can then go on to automatically annotate images with text.)
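To make the classification setup concrete, here is a minimal, purely illustrative sketch of colour naming as multi-class classification over RGB values, using scikit-learn’s MLPClassifier. It is not the authors’ actual model or data: the three colour prototypes, the noise level, and the network size are all made up for the example.

```python
# Illustrative sketch: colour naming as multi-class classification from RGB.
# Prototypes and noise are invented; a real system would train on human labels.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Hypothetical training data: RGB triples scattered around prototypical
# colours, each labelled with the colour term a person might use.
prototypes = {"red": (230, 30, 30), "green": (30, 200, 30), "blue": (30, 30, 220)}
X, y = [], []
for name, rgb in prototypes.items():
    for _ in range(200):
        X.append(np.clip(np.array(rgb) + rng.normal(0, 20, 3), 0, 255))
        y.append(name)
X = np.array(X) / 255.0  # scale RGB to [0, 1]

# A small multi-layer perceptron over the RGB features.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
clf.fit(X, y)

print(clf.predict([[0.9, 0.1, 0.1]])[0])  # classify a reddish patch
```

The interesting part of the paper is precisely what this sketch leaves out: recalibrating the prediction using the context of the surrounding image rather than the raw RGB value alone.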

As for the other plenary presentations, I did make notes on all of them, but will select only a few due to time limitations. The presentation by Advaith Siddharthan on summarisation of news stories for children [2] was quite nice, as it needed to bring three aspects together: summarising the text (with NLG, not just repeating a few salient sentences), simplifying it with respect to children’s vocabulary, and editing out or rewording the harsh news bits. Another paper on summaries was presented by Sabita Acharya [3], which is likely also relevant to my student’s work on NLG for patient discharge notes [4]. Sabita focussed on getting a doctor’s notes and plan of care into a format understandable by a layperson, using the UMLS in the process. A different topic was NLG for automatically describing graphs to blind people, with grade-appropriate lexicons (4th-5th grade learners and students) [5]. Kathy McCoy outlined how they were happy to remember their computer science classes, seeing that they could use graph search to solve it, with its states, actions, and goals. They evaluated the generated text for the graphs, as many others did in their research, through crowdsourcing on Mechanical Turk. One other paper that is definitely on my post-conference reading list is the one about mereology and geographic entities for weather forecasts [6], which was presented by Rodrigo de Oliveira. For instance, ‘the south’ in a Scottish weather forecast refers to a different region than in a forecast for the UK as a whole, and the task was how to generate the right term for the intended region.

our poster on generating sentences with part-whole relations in isiZulu (click to enlarge)

My 1-minute lightning talk on Langa’s and my long paper [7] went well (one speaker in the same session even resentfully noted afterward that I got all the accolades of the session), as did the poster and demo session afterward. The contents of the paper, on part-whole relations in isiZulu, were introduced in a previous post, and you can click on the thumbnail on the right for a png version of the poster (which has less text than the blog post). Note that the poster highlights only three of the 11 part-whole relations discussed in the paper.

Some final notes: ENLG and INLG will merge into a yearly INLG, there is a SIG for NLG (www.siggen.org), and one of the ‘challenges’ for the upcoming year will be on generating text from RDF triples.

Irrelevant for the average reader, I suppose, is that there were some 92 attendees, most of whom attended the social dinner with a ceilidh (Scottish traditional music by a band, with traditional dancing by the participants), where it was even possible to form many (traditional) couples for the couples’ dances. There was some overlap in attendees between CNL’16 and INLG’16, so while it was my first INLG it wasn’t all brand new, yet there were also new people to meet and network with. As a welcome surprise, it was even mostly dry and sunny during the conference days in otherwise quite rainy Edinburgh.

References

[1] Sina Zarrieß and David Schlangen. Towards generating colour terms for referents in photographs: prefer the expected or the unexpected? INLG’16. ACL, 246-255.

[2] Iain Macdonald and Advaith Siddharthan. Summarising news stories for children. INLG’16. ACL, 1-10.

[3] Sabita Acharya, Barbara Di Eugenio, Andrew D. Boyd, Karen Dunn Lopez, Richard Cameron, Gail M. Keenan. Generating summaries of hospitalizations: A new metric to assess the complexity of medical terms and their definitions. INLG’16. ACL, 26-30.

[4] Joan Byamugisha, C. Maria Keet, Brian DeRenzi. Tense and aspect in Runyankore using a context-free grammar. INLG’16. ACL, 84-88.

[5] Priscilla Morales, Kathleen McCoy, and Sandra Carberry. Enabling text readability awareness during the micro planning phase of NLG applications. INLG’16. ACL, 121-131.

[6] Rodrigo de Oliveira, Somayajulu Sripada and Ehud Reiter. Absolute and relative properties in geographic referring expressions. INLG’16. ACL, 256-264.

[7] C. Maria Keet and Langa Khumalo. On the verbalization patterns of part-whole relations in isiZulu. INLG’16. ACL, 174-183.

# A search engine, browser, and language bias mini-experiment

I’m in the midst of preparing the “Social Issues and Professional Practice” block for a course and was pondering whether I should touch upon known search engine issues, like the filter bubble and search engine manipulation to nudge democratic elections, which could be topical given that South Africa had its local elections just last week, with interesting results.

I don’t have the option of showing the differences between ‘Google search when logged in’ and ‘Google search when logged out’, nor for the Bing-Hotmail combination, so I played with other combinations: Google in isiZulu on Firefox (GiF), Google in English on Safari (GES), and Bing in English on Firefox (BEF). I did seven searches at the same time (Friday 12 August 2016, 17:18-17:32) on the same machine (a MacBook Pro), using eduroam on campus. Although this certainly will not pass a test of scientific rigour, it suggests strongly enough that the topic deserves a solid experiment. The only thing I aimed to do was to see whether these things happen in South Africa too, not just in the faraway USA or India. They do.

Before giving the results, some basic preliminaries may be of use if you are not familiar with the topic. On HTTP, the protocol the browser uses: in trying to GET information, your browser tells the server what operating system you are using (Mac, Linux, Windows, etc.), which browser (e.g., Firefox, Safari, Chrome), and your language settings (e.g., UK English, isiZulu, Italian). Safari is tied to the Mac, and thus Apple, and it is assumed that Apple users have more disposable income (are richer). Free and open source software users (e.g., Linux + Firefox) are assumed to be not rich, or a leftie or liberal, or all of the above. I don’t know whether they categorise Apple + Firefox as an armchair socialist or a posh right-wing liberal ;-).
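For the curious, here is a minimal sketch of what that request metadata looks like, using Python’s urllib to build a GET request with the headers set explicitly. The URL and the header values are made-up examples; a real browser fills these in automatically based on your system and preferences.

```python
# Sketch of the standard HTTP headers a browser sends along with a GET request.
# The header names (User-Agent, Accept-Language) are standard HTTP; the values
# below are invented examples mimicking a Mac + Firefox user preferring isiZulu.
import urllib.request

req = urllib.request.Request(
    "https://www.example.com/search?q=EFF",
    headers={
        # Operating system and browser identification:
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11) Firefox/48.0",
        # Preferred languages with weights: isiZulu first, then UK English:
        "Accept-Language": "zu, en-GB;q=0.8, en;q=0.5",
    },
)

# These are exactly the values the server receives and can use for tailoring.
print(req.get_header("User-agent"))
print(req.get_header("Accept-language"))
```

Nothing is sent until the request is opened; the point is simply that every ordinary page visit hands over this profile-like information, which the search engine can then use to tailor results.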

Here goes the data, being the screenshots and my reading and interpretation of the links in the search results, with a bit of context in case you’re not in South Africa. The screens in the screenshots are in the order (from left to right) GiF, GES, BEF.

EFF search

• Search term: EFF: GiF and BEF show EFF as a political party (left-populist opposition party in South Africa) plus a link to the EFF as the Electronic Frontier Foundation, whereas GES shows EFF as a political party only in the context of news about the DA political party (capitalist, for the rich, mainly White voters). The GES difference may be explained by the Mac+Safari combination, and it makes one wonder whether and how this has had an effect on perceptions and voting behaviour. Bing had 10mln results, Google 46mln.

Jacob Zuma search

• Search term: Jacob Zuma (current president of South Africa): GiF and BEF show general results, GES also shows articles on JZ being there to stay (by a DA supporter) and on his not resigning. Bing has 1.1mln results, Google 9.6mln.

Nkandla search

• Search term: Nkandla (Zuma’s controversial lavish homestead that was upgraded with taxpayers’ money): GiF has pictures and a fact box about Nkandla, GES has a picture, a fact box, and some mildly negative news, BEF has more news and issues (that is, that JZ has to pay back the money). Bing has 700K results, Google 1.8mln.

FeesMustFall search

• Search term: FeesMustFall (the 2015 hashtag on no university fee increases and free higher education): Google’s results have ‘plain’-looking information, whereas Bing shows results with more information from the FMF perspective, it seems. Bing has 165K results, Google 451K.

Fleming Rose search

• Search term: Fleming Rose (person with controversial ideas, recently disinvited by UCT from giving the academic freedom lecture): Google shows a little general information and several UCT opinion pieces, BEF has information about Fleming Rose himself. Bing has 1.25mln results, Google about 500K: the only time that Bing’s number of results far outnumbers Google’s.

socialism search

• Search term: Socialism: GiF has links to definitions, while GES and BEF show a definition in their respective info boxes, which takes up most of the screen. Bing has 7.3mln results, GiF 23.4mln, GES 31mln: the first time there is a stark difference in the number of Google results, with more for English and Safari.

Law on cookies in south africa search

• Search term: Law on cookies in south africa: the results are similar across the three search engines. Bing has 108mln results, GiF 3mln, and GES 2.2mln: a roughly 1/3 difference in Google’s number of results, this time in the other direction.

In interpreting the results, it has to be noted that Google, even though I typed in google.com, forced a redirect to google.co.za, whereas Bing stayed on bing.com. This might explain some ‘tailoring’ of GiF and GES toward news that is topical in South Africa, which does not happen to the same extent on Bing. I suppose that for some search terms one would like that, and for others not; i.e., one would want the option to choose between searching for facts vs opinion pieces vs news, nationally or internationally, or between getting an answer and getting links to multiple answers. Neither Bing nor Google gives you a free choice in the matter: based on the data you provide involuntarily, they make assumptions as to who you are and what they think that kind of person would probably like to see in the search results. That three out of the seven searches on GES lean clearly to the political right is a cause for concern, as is the smaller number of facts in Google’s search results compared to Bing’s. I also find it a bit odd that the selections shown are drawn from such wide-ranging numbers of results.

Based on this small sample, I obviously cannot draw hard conclusions, but it would be nice if we could get some money for a student to investigate this systematically, with more browsers and more languages. We now know that it happens, but how does it happen in South Africa, and might there be some effect because of it? Those questions remain unanswered. In the meantime, I’ll have to make do with some anecdotes for the students in an upcoming lecture.