Review of ‘The web was done by amateurs’ by Marco Aiello

Via one of those friend-of-a-friend likes on social media that popped up in my stream, I stumbled upon the recently published book “The web was done by amateurs” (there’s also a related talk) by Marco Aiello, which piqued my interest because of both the title and the author. I met Aiello once in Trento, when he and a colleague had a farewell party, with Aiello leaving for Groningen. He probably doesn’t remember me, nor do I remember much of him, other than his lamentations about Italian academia and going for greener pastures. Turns out he’s done very well for himself academically, and the foray into writing for the general public has been, in my opinion, fairly successful with this book.

The short book (it can easily be read in a weekend) starts in the first part with historical notes on who did what for the Internet (the infrastructure) and the multiple predecessor proposals and applications of hyperlinking across documents that Tim Berners-Lee (TBL) apparently was blissfully unaware of. It’s surely a more interesting and useful read than the first Google hit, the few factoids from W3C, or the Wikipedia entry one can find online with a simple search; or: it still pays off to read books in this day and age :). The second part is, for most readers, perhaps also still history: the ‘birth’ of the Web and the browser wars in the mid-1990s.

Part III is, in my opinion, the most fun to read: it discusses various extensions to the original design of TBL’s Web, each of which fixes, or at least aims to fix, a shortcoming of the Web’s basics; i.e., they’re presented as “patches” to patch up a too basic (or: rank-amateur) design of the original Web. They are, among others, persistence with cookies to mimic statefulness for Web-based transactions (e.g., for buying things on the web), trying to get some executable instructions with Java (ActiveX, Flash), and web services (from CORBA, service-oriented computing, to REST and the cloud and such). Interestingly, they all originate in the 1990s in the time of the browser wars.

There are more names in the distant and recent history of the Web than I knew of, so even I picked up a few things here and there. IIRC, they’re all men, though. Surely there would be at least one woman worthy of mention? I probably ought to know, but didn’t, so I searched the Web and easily stumbled upon the Internet Hall of Fame. That list includes Susan Estrada among the pioneers, who founded CERFnet that “grew the network from 25 sites to hundreds of sites”, and, after that, Anriette Esterhuysen and Nancy Hafkin for the network in Africa, Qiheng Hu for doing this for China, and Ida Holz for the same in Latin America (in ‘global connections’). Web innovators specifically include Anne-Marie Eklund Löwinder for DNS security extensions (DNSSEC, noted on p143 but not by its inventor’s name) and Elizabeth Feinler for the “first query-based network host name and address (WHOIS) server” and “she and her group developed the top-level domain-naming scheme of .com, .edu, .gov, .mil, .org, and .net, which are still in use today”.

One patch to the Web that I really missed in the overview of the early patches is the “Web 2.0”. I know that, technologically, it is a trivial extension to TBL’s original proposal: the move from static web pages in 1:n communication from content provider to many passive readers, to m:n communication with comment sections (fancy forms). That is, instead of the surfer being just a recipient of information, reading one webpage after another and thinking her own thing of it, she is able to respond and interact: the chatrooms, the article and blog comment features, and, in the 2000s, the likes of MySpace and Facebook. It got so many more people involved in it all.

Continuing with the book’s content, cloud computing and the fog (section 7.9) are from this millennium, as is what Aiello dubbed the “Mother of All Patches”: the Semantic Web. Regarding the latter, early on in the book (pp. vii-viii) there is already an off-hand comment that does not bode well: “Chap. 8 on the Semantic Web is slightly more technical than the rest and can be safely skipped.” (emphasis added). The way Chapter 8 is written, perhaps. Before discussing his main claim there, a few minor quibbles. It’s the Web Ontology Language OWL, not “Ontology Web Language” (p105), and there’s OWL 2 as successor of the OWL of 2004. “RDF is a nifty combination of being a simple modeling language while also functioning as an expressive ontological language” (p104): no, RDF is for representing data, not really for modeling, and most certainly would not be considered an ontology language (one can serialize an ontology in RDF/XML, but that’s different). As for the class satisfiability example: no, that’s not what it does, or rather, the simplification does not faithfully capture it; an example with a MammalFish that cannot have any instances (as a subclass of both Mammal and Fish, which are disjoint) would have been better (regardless of the real world).
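To make the MammalFish point concrete, here is a minimal sketch of the idea in Python. It is a toy check over hand-written subclass and disjointness axioms, not how an OWL reasoner actually works; the extra `Animal` class and all function names are made up for illustration.

```python
# Toy axioms (illustrative only): subclass relations and disjoint pairs.
SUBCLASS_OF = {
    "MammalFish": {"Mammal", "Fish"},
    "Mammal": {"Animal"},  # Animal is a hypothetical extra class
    "Fish": {"Animal"},
}
DISJOINT = {frozenset({"Mammal", "Fish"})}


def ancestors(cls, subclass_of):
    """Collect all direct and indirect superclasses of cls."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_satisfiable(cls, subclass_of, disjoint):
    """A class cannot have instances if it falls under two disjoint classes."""
    supers = ancestors(cls, subclass_of) | {cls}
    return not any(pair <= supers for pair in disjoint)


print(is_satisfiable("MammalFish", SUBCLASS_OF, DISJOINT))
print(is_satisfiable("Mammal", SUBCLASS_OF, DISJOINT))
```

A real reasoner handles far more than this (property restrictions, cardinalities, and so on), but the principle is the same: MammalFish is unsatisfiable because any instance would have to be in two classes that share no instances.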

The main claim of Aiello regarding the Semantic Web, however, is that it’s time to throw in the towel, because there hasn’t been widespread uptake of Semantic Web technologies on the Web even though it was proposed already around the turn of the millennium. I lean towards that as well and have reduced the time spent on it in my ontology engineering course over the years, but don’t want to throw out the baby with the bathwater just yet, for two reasons. First, scientific results tend to take a long time to trickle down. Second, I am not convinced that the ‘semantic’ part of the Web is the same level of end-user stuff as playing with HTML is. I still have an HTML book from 1997. It has instructions to “design your first page in 10 minutes!”. I cannot recall if it was indeed <10 minutes, but it sure was fast back in 1998-1999 when I made my first pages, as a non-IT interested layperson. I’m not sure if the whole semantics thing can be done even on the proverbial rainy Sunday afternoon, but the dumbed-down version with schema.org sort of works. This schema.org brings me to p110 of Aiello’s book, which states that Google can make do with just statistics for optimal search results because of its sheer volume (so bye-bye Semantic Web). But it is not just stats-based: even Google is trying with schema.org and its “knowledge graph”; admittedly, it’s extremely lightweight, but it’s more than stats-only. Perhaps the schema.org and knowledge graph sort of thing are to the Semantic Web what TBL’s proposal for the Web was to, say, the fancier HyperCard.

I don’t know if people within the Semantic Web research community would think of its tooling as technologies for the general public. I suspect not. I consider the development and use of ontologies in ontology-driven information systems as part of the ‘back office’ technologies, notwithstanding my occasional attempts to explain to friends and family what sort of things I’m working on.

What I did find curious is that one of Aiello’s arguments for the Semantic Web’s failure was that “Using ontologies and defining what the meaning of a page is can be much more easily exploited by malicious users” (p110). It can be exploited, for sure, but statistics can go bad, very bad, too, especially on associations of search terms, the creepy amount of data collection on the Web, and bias built into the Machine Learning algorithms. Search engine optimization is just the polite term for messing with ‘honest’ stats and algorithms. With the Semantic Web, it would be a conscious decision to mess around and that’s easily traceable, but with all the stats-based approaches, it can sneakily creep in whilst trying to keep up the veneer of impartiality, which is harder to detect. If it were a choice between two technology evils, I prefer the honest bastard cf. being stabbed in the back. (That the users of the current Web are opting for the latter does not make it the lesser of two evils.)

As to two possible new patches (not in the book and one can debate whether they are), time will tell whether a few recent calls for “decentralizing” the Web will take hold, or more fine-grained privacy that also entails more fine-grained recording of events (e.g., TBL’s Solid project). The app-fication discussion (Section 10.1) was an interesting one (I hardly use mobile apps and so am not really into it), and the lock-in it entails is indeed a cause for concern for the Web and all it offers. Another section in Chapter 10 is on IoT, which has sounded promising and potentially scary for the past 10 years or so (what would the data-hungry ML algorithms of the Web infer from my fridge contents, and from that, about me??). Lastly, the final chapter has the tempting-to-read title “Should a new Web be designed?”, but the answer is not a clear yes or no. Evolve, it will.

Would I have read the book if I weren’t on sabbatical now? Probably still, on an otherwise ‘lost time’ intercontinental trip to a conference. So, overall, besides the occasional gap and a few things one could quibble about here and there, the book is a nice read on the whole for any layperson interested in learning something about the ubiquitous Web, any expert who’s using only a little corner of it, and certainly for the younger generation to get a feel for how the current Web came about and how technologies get shaped in praxis.

From ontology verbalisation to language learning exercises

I’m aware that to most people ‘playing with’ (investigating) ontologies and isiZulu does not sound particularly useful on the face of it. Yet, there’s some long-term future music, like eventually being able to generate patient discharge notes in one’s own language, which will do its bit to ameliorate the language barrier in healthcare in South Africa so that patients at least will adhere to the treatment instructions a little better, and therewith receive better quality healthcare. But short-term benefits might be of use as well. To that end, I proposed an honours project last year, which has been completed in the meantime, and one of the two interesting outcomes has made it into a publication already [1]. As you may have guessed from the title, it’s about automation for language learning exercises. The results will be presented at the 6th Workshop on Controlled Natural Language, in Maynooth, Ireland in about 2 weeks’ time (27-28 August). In the remainder of this post, I highlight the main contributions described in the paper.

First, regarding the post’s title, one might wonder what ontology verbalisation has to do with language learning. Nothing, really, except that we could reuse the algorithms from the controlled natural language (CNL) for ontology verbalisation to generate (computer-assisted) language learning exercises whose answers can be computed and marked automatically. That is, the original design of the CNL for things like pluralising nouns, verb conjugation, and negation that is used for verbalising ontologies in isiZulu in theory [2] and in practice [3], was such that the sentence generator is a detachable module that could be plugged in elsewhere for another task that needs such operations.

Practically, the student who designed and developed the back-end, Nikhil Gilbert, preferred Java over Python, so he converted most parts into Java, and added a bit more, notably the ‘singulariser’, a sentence scrabble, and a sentence generator. The sentence generator is used as part of the exercises & answers generator. For instance, we know that humans and the roles they play (father, aunt, doctor, etc.) are mostly in isiZulu’s noun classes 1, 2, 1a, 2a, or 3a, that those classes do not (or rarely?) have non-human nouns, and that it generally holds for all humans and their roles that they can ‘eat’, ‘talk’, etc. This makes it relatively easy to create a noun chain and a verb chain list to mix and match nouns with verbs accordingly (hurrah! for the semantics-based noun class system). Then, with the 231 nouns and 59 verbs in the newly constructed mini-corpus, the noun chain, and the verb chain, 39501 unique question sentences could be generated, using the following overall architecture of the system:

Architecture of the CNL-driven CALL system. The arrows indicate which upper layer components make use of the lower layer components. (Source: [1])
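The mix-and-match of nouns and verbs mentioned above can be sketched roughly as follows, in Python. This is only an illustration under my own assumptions: the post does not describe how the noun chain and verb chain are actually structured, so the category labels, the English placeholder words, and the function name are all hypothetical.

```python
from itertools import product

# Hypothetical toy data: each noun is tagged with a semantic category,
# each verb with the categories of subjects it can sensibly take.
NOUNS = {"father": "human", "aunt": "human", "doctor": "human", "cup": "object"}
VERBS = {"eat": {"human"}, "talk": {"human"}}


def compatible_pairs(nouns, verbs):
    """Pair every noun with every verb that accepts the noun's category."""
    return [
        (noun, verb)
        for (noun, category), (verb, allowed) in product(nouns.items(), verbs.items())
        if category in allowed
    ]


pairs = compatible_pairs(NOUNS, VERBS)
print(len(pairs))  # 3 human nouns x 2 verbs = 6 pairs
```

Tagging by category rather than listing every valid pair by hand is what makes the semantics-based noun class system convenient here: adding one more human noun automatically yields sentences with all the ‘human’ verbs.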

From a CNL perspective as well as the language learning perspective, the actual templates for the exercises may be of interest. For instance, when a learner is learning about pluralising nouns and their associated verb, the system uses the following two templates for the questions and answers:

Q: <prefixSG+stem> <SGSC+VerbRoot+FV>
A: <prefixPL+stem> <PLSC+VerbRoot+FV>
Q: <prefixSG+stem> <SGSC+VerbRoot+FV> <prefixSG+stem>
A: <prefixPL+stem> <PLSC+VerbRoot+FV> <prefixPL+stem>

The answers can be generated automatically with the algorithms that generate the plural noun (from ‘prefixSG’ to ‘prefixPL’) and add the plural subject concord (from ‘SGSC’ to ‘PLSC’, in agreement with ‘prefixPL’), which were developed as part of the GeNI project on ontology verbalisation. The generated answer can then be checked against what the learner has typed. For instance, a generated question could be umfowethu usula inkomishi and the correct answer generated (to check the learner’s response against) is abafowethu basula izinkomishi. Another example is generation of the negation from the positive, or vice versa; e.g.:

Q: <PLSC+VerbRoot+FV>
A: <PLNEGSC+VerbRoot+NEGFV>

For instance, the question may present batotoba and the correct answer is then abatotobi. In total, there are six different types of sentences, with two double, like the plural above, hence a total of 16 templates. It is not a lot, but it turned out it is one of the very few attempts to use a CNL in such a way: there is one paper that also will be presented at CNL’18 in the same session [4], and an earlier one [5] uses a fancy grammar system (that we don’t have yet computationally for isiZulu). This is not to say that this is one of the first CNL/NLG-based systems for computer-assisted language learning (e.g., there’s assistance in essay writing, grammar concept question generation, reading comprehension question generation), but curiously there is very little on CNLs or NLG for the standard entry-level type of questions to learn the grammar. Perhaps the latter is considered ‘boring’ for English by now, given all the resources. However, thousands of students take introduction courses in isiZulu each year, and some automation can alleviate the pressure of routine activities from the lecturers. We have done some evaluations with learners, with encouraging results, and plan to do some more, so that it may eventually transition to actual use in the courses; that is: TBC…
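To give a feel for how the pluralisation and negation templates above turn into question/answer pairs that can be marked automatically, here is a minimal sketch in Python. It is not the system’s code (that is in Java, and the real algorithms handle far more): the lookup tables cover only the forms needed for the examples in this post, the noun class numbers assigned to the two example nouns are my assumption, and all names are made up for illustration.

```python
from dataclasses import dataclass

# Toy lookup tables, limited to what the examples in this post need.
PLURAL_CLASS = {1: 2, 9: 10}             # singular noun class -> plural class
NOUN_PREFIX = {1: "um", 2: "aba", 9: "in", 10: "izin"}
SUBJECT_CONCORD = {1: "u", 2: "ba"}      # SGSC / PLSC
NEG_SUBJECT_CONCORD = {2: "aba"}         # PLNEGSC


@dataclass(frozen=True)
class Noun:
    stem: str
    noun_class: int

    def surface(self) -> str:
        return NOUN_PREFIX[self.noun_class] + self.stem

    def plural(self) -> "Noun":
        return Noun(self.stem, PLURAL_CLASS[self.noun_class])


def verb(subject: Noun, root: str, final_vowel: str = "a") -> str:
    """<SC+VerbRoot+FV>, with the concord agreeing with the subject."""
    return SUBJECT_CONCORD[subject.noun_class] + root + final_vowel


def pluralise_exercise(subject: Noun, root: str, obj: Noun):
    """Q: singular sentence; A: the same sentence with plural nouns and concord."""
    question = " ".join([subject.surface(), verb(subject, root), obj.surface()])
    pl_subject, pl_obj = subject.plural(), obj.plural()
    answer = " ".join([pl_subject.surface(), verb(pl_subject, root), pl_obj.surface()])
    return question, answer


def negate_exercise(subject_class: int, root: str):
    """Q: <PLSC+VerbRoot+FV>; A: <PLNEGSC+VerbRoot+NEGFV>."""
    question = SUBJECT_CONCORD[subject_class] + root + "a"
    answer = NEG_SUBJECT_CONCORD[subject_class] + root + "i"
    return question, answer


def mark(learner_input: str, expected: str) -> bool:
    """Naive marking: exact match, ignoring case and surrounding spaces."""
    return learner_input.strip().lower() == expected.lower()


q, a = pluralise_exercise(Noun("fowethu", 1), "sul", Noun("komishi", 9))
print(q, "->", a)
print(negate_exercise(2, "totob"))
```

The point of the design is that the answer is never stored: it is computed from the same building blocks as the question, so every generated question comes with a correct answer to mark against for free.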

 

References

[1] Gilbert, N., Keet, C.M. Automating question generation and marking of language learning exercises for isiZulu. 6th International Workshop on Controlled Natural Language (CNL’18). IOS Press. Co. Kildare, Ireland, 27-28 August 2018. (in print)

[2] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51(1): 131-157.

[3] Keet, C.M., Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. The Semantic Web: ESWC 2017 Satellite Events, Blomqvist, E. et al. (eds.). Springer LNCS vol. 10577, 59-64.

[4] Lange, H., Ljunglöf, P. Putting control into language learning. 6th International Workshop on Controlled Natural Language (CNL’18). IOS Press. Co. Kildare, Ireland, 27-28 August 2018. (in print)

[5] Gardent, C., Perez-Beltrachini, L. Using FB-LTAG Derivation Trees to Generate Transformation-Based Grammar Exercises. Proc. of TAG+11, Sep 2012, Paris, France. pp117-125, 2012.