# Yet another software-based clicker system: LetsThink – GoAnswer

Every now and then, I dabble in trying to improve my teaching, which, for this post’s topic, took the form of an implementation project. Yes, one of those with working software as a result, thanks to the programmer and CS honours student Yaseen Hamdulay, who implemented it. In short: we now have a software-based audience response system that satisfies some pertinent requirements for use in lectures: support for maths, figures, and diacritics, plus one-click question release and display of the results so as not to hold up the lecture. It will be presented at UCT’s upcoming Teaching & Learning Conference on Oct 22-23.

The context

Peer instruction (PI) enables students to learn from each other. Unlike some other educational interventions, PI has been shown to improve the final grade by up to 34-45%, regardless of the discipline in which it is used (computer science, genetics, zoology, psychology; you name it). On top of that, it generally receives positive feedback from students, because it makes the lectures more engaging, or in any case at least less dull.

On the practical side of using it in class, there are four principal options: expensive hardware-based ‘clickers’, software-based audience response systems (ARSs) used with students’ smartphones or laptops, cards and picture-taking (image processing), and, with none of that, students indicating the concept test’s answer with their fingers on their chest. Regarding resources, the software-based ARS sits somewhere in the middle, which suits our setting well. There are quite a few software-based ARSs (see, e.g., this review). However, upon analysis, they all have issues that hamper effective classroom use. For instance, they may lack single-question release or be for-payment only (e.g., Socrative, Google Forms, Pinnion, and eClicker), or have an impractical results display; they make money from corporate accounts, so the tools are geared toward that. The major drawback of all these systems is that at least the free versions, if not all, have very limited options for setting questions and answers, mostly just plain text. So, no pictures in the question, and answers can be at most 100 or 140 characters long, let alone display mathematical symbols. Perhaps surprisingly in the international arena, they don’t do diacritics either, even though many natural languages have them. In lectures, this ends up being really clumsy and cumbersome: an ‘offline’ version of such questions must also be projected on the screen, and one then has to toggle between that and the screen showing the voting progress and results.

The problems

Thus, the main problem is that there is no software-based ARS that contains the basic features required for smooth use of PI in the classroom irrespective of one’s discipline, which hampers the uptake of this active learning intervention. In our medium-level resource setting, as in many others, extensive use of hardware clickers is not a sustainable option, nor are the structural expenses of by-participant payment schemes, for they effectively discourage broad participation.

The solution

To address these issues, we developed an ARS that has the following features: support for figures, mathematical formulae, proper HTML text for display of diacritics, one-click question release, showing voting progress, results, question reset, saving of the voting results of a session (among others). This is a web-accessible ARS that can cater for various disciplines within one software system and that is simple and fast to work with.

The project page has several screenshots and the link to the tool. It requires a simple registration (at this stage, we don’t bother with email verification—so don’t lose your password), but is then also immediately usable.

To brighten up this post, here are some screenshots; others are on the project page and some have explanations in a previous post on using PI in a networks course, like the one that has maths and a picture in one.

Screenshot of a question with a picture:

Question with a figure

Note that the lecturer controls (bottom left, and top-left to go back) are intentionally kept in a small font. They are easily readable from the lecturer’s machine, but irrelevant for the audience to read.

Screenshot of a question with some maths, showing first the interface for creating the question, then the one projected, and finally the student’s interface when voting. Note that this is just to give an impression that you can do it, not that it is in any way a sensible concept test. Anything that can be done with the LaTeX plugin for HTML (MathJax) can be used in the question and in the answer box, though not your own macros. (For the curious: the answer to the question is A (read the problem).)

Interface for creating a question

The question rendered in the presentation interface

Small screenshot of the voting interface

Screenshot of a question with diacritics:

Some question with diacritics.

Desktop/laptop interface of the question to answer, after having entered the question ID:

Section of the lecturer/admin interface, where you can choose to group questions

Exported results from the voting imported into a spreadsheet:

LetsThink-GoAnswer has been beta-tested with my networks students last semester, has been demo-ed in a CILT seminar on ARSs, and the occasional lecturer already is/will be experimenting with it. The code will be made available on GitHub soon. In the meantime, please don’t hesitate to contact me if you have any questions, and we’d love to hear your feedback!

The 19th Conference on Advances in Databases and Information Systems (ADBIS’15) just finished yesterday. It was an enjoyable and well-organised conference in the lovely town of Poitiers, France. Thanks to the general chair, Ladjel Bellatreche, and the participants I had the pleasure to meet up with, listen to, and receive feedback from. The remainder of this post mainly recaps the keynotes and some of the presentations.

Keynotes

The conference featured two keynotes, one by Serge Abiteboul and one by Jens Dittrich, both distinguished scientists in databases. Abiteboul presented the multi-year project on Webdamlog that ended up as a ‘personal information management system’, which is a simple term that hides the complexity of what happens behind the scenes. (PIMS is informally explained here). It breaks with the paradigm of centralised text (e.g., Facebook) and moves to distributed knowledge. To achieve that, one has to analyse what’s happening and construct the knowledge from that, exchange knowledge, and reason and infer knowledge. This requires distributed reasoning, exchanging facts and rules, and taking care of access control. It is being realised with a datalog-style language, but one that can also handle a non-local knowledge base. That is, there’s both solid theory and implementation (going by the presentation; I haven’t had time to check it out).

The main part of the cool keynote talk by Dittrich was on ‘the case for small data management’. From the who-wants-to-be-a-millionaire style popquiz question asking us to guess the typical size of a web database, it appeared to be only in the MBs (which most of us overestimated), which sort of explains why MySQL [that doesn’t scale well] is used rather widely. This results in a mismatch between problem size and tools. Another popquiz question answer: a 100MB RDF dataset can just as well be handled efficiently by Python, apparently. Interesting factoids, and ones that have, or should have, as a consequence that we should perhaps be looking more into ‘small data’. He presented his work on PDbF as an example of such small data management. Very briefly, and based on my scribbles from the talk: it’s an enhanced PDF where you can access the raw data behind the graphs in the paper as well (it is embedded in it, with an OLAP engine for posing the same and other queries), it has an HTML rendering so you can hover over the graphs, and it has some more visualisation. If there’s software associated with the paper, it can go into the whole thing as well. Overall, that makes the data dynamic, manageable, traceable (from figure back to raw data), and re-analysable. The last part of his talk was on his experiences with the flipped classroom (more here; in German), but that was not nearly as fun as his analysis and criticism of the “big data” hype. I can’t recall exactly his plain English terms for the “four Vs”, but ‘lots of crappy XML data that changes’ is what remained of it in my memory bank (it was similar to the first 5 minutes of another keynote talk he gave).

Sessions

Sure, despite the notes on big data, there were presentations in the sessions that could be categorised under ‘big data’. Among others, Ajantha Dahanayake presented a paper on a proposal for requirements engineering for big data [1]. Big data people tend to assume the data is just there already for them to play with. But how did it get there, and how does one collect good data? The presentation outlined a scenario-based backwards analysis, so that one can reduce unnecessary or garbage data collection. Dahanayake also has a tool for it. Besides the requirements analysis for big data, there’s also querying the data and the desire to optimize it so as to keep having fast responses despite its large size. A solution to that was presented by Reuben Ndindi, whose paper also won the best paper award of the conference [2] (for the Malawians at CS@UCT: yes, the Reuben you know). It was scheduled in the very last session on Friday and my note-taking had ground to a halt. If my memory serves me well, they make a metric database out of a regular database, compute the distances between the values, and evaluate the query on that, so as to obtain a good approximation of the true answer. There’s both the theoretical foundation and an experimental validation of the approach. In the end, it’s faster.

Data and schema evolution research is alive and well, as were time series and temporal aspects. Due to parallel sessions and my time constraints writing this post, I’ll mention only two on the evolution; one because it was a very good talk, the other because of the results of the experiments. Kai Herrmann presented the CoDEL language for database evolution [3]. A database and the application that uses it change (e.g., adding an attribute, splitting a table), which requires quite lengthy scripts with lots of SQL statements to execute. CoDEL does it with fewer statements, and the language has the good quality of being relationally complete [3]. Lesley Wevers approached the problem from a more practical angle and restricted to online databases. For instance, Wikipedia does make updates to their database schema, but they wouldn’t want to have Wikipedia go offline for that duration. How long does it take for which operation, in which RDBMS, and will it only slow down during the schema update, or block any use of the database entirely? The results obtained with MySQL, PostgreSQL and Oracle are a bit of a mixed bag [4]. It generated a lively debate during the presentation regarding the test set-up, what one would have expected the results to be, and the duration of blocking. There’s some work to do there yet.

The presentation of the paper I co-authored with Pablo Fillottrani [5] (informally described here) was scheduled for that dreaded 9am slot the morning after the social dinner. Notwithstanding, quite a few participants did show up, and they showed interest. The questions and comments had to do with earlier work we used as input (the metamodel), qualifying quality of the conceptual model, and that all too familiar sense of disappointment that so few language features were used widely in publicly available conceptual models (the silver lining of excellent prospects of runtime usage of conceptual models notwithstanding). Why this is so, I don’t know, though I have my guesses.

And the other things that make a conference useful and fun to go to

In short: Networking, meeting up again with colleagues not seen for a while (ranging from a few months [Robert Wrembel] to some 8 years [Nadeem Iftikhar] and in between [a.o., Martin Rezk, Bernhard Thalheim]), meeting new people, exchanging ideas, and the social events.

2008 was the last time I’d been in France, for EMMSAD’08, where, looking back now, I coincidentally presented a paper also on conceptual modelling languages and logic [6], but one that looked at comprehensive feature coverage and comparing languages rather than unifying them. It was good to be back in France, and it was nice to realise my understanding and speaking skills in French aren’t as rusty as I thought they were. The travels from South Africa are rather long, but definitely worthwhile. And they give me time to write blog posts while killing time at the airport.

References

(note: most papers don’t show up at Google scholar yet, hence, no links; they are on the Springer website, though)

[1] Noufa Al-Najran and Ajantha Dahanayake. A Requirements Specification Framework for Big Data Collection and Capture. ADBIS’15. Morzy et al. (Eds.). Springer LNCS vol. 9282, .

[2] Boris Cule, Floris Geerts and Reuben Ndindi. Space-bounded query approximation. ADBIS’15. Morzy et al. (Eds.). Springer LNCS vol. 9282, 397-414.

[3] Kai Herrmann, Hannes Voigt, Andreas Behrend and Wolfgang Lehner. CoDEL – A Relationally Complete Language for Database Evolution. ADBIS’15. Morzy et al. (Eds.). Springer LNCS vol. 9282, 63-76.

[4] Lesley Wevers, Matthijs Hofstra, Menno Tammens, Marieke Huisman and Maurice van Keulen. Analysis of the Blocking Behaviour of Schema Transformations in Relational Database Systems. ADBIS’15. Morzy et al. (Eds.). Springer LNCS vol. 9282, 169-183.

[5] Pablo R. Fillottrani and C. Maria Keet. Evidence-based Languages for Conceptual Data Modelling Profiles. ADBIS’15. Morzy et al. (Eds.). Springer LNCS vol. 9282, 215-229.

[6] C. Maria Keet. A formal comparison of conceptual data modeling languages. EMMSAD’08. CEUR-WS Vol-337, 25-39.

# Reblogging 2007: AI and cultural heritage workshop at AI*IA’07

From the “10 years of keetblog – reblogging: 2007”: a happy serendipity moment when I stumbled into the AI & Cultural heritage workshop, which had its presentations in Italian. Besides the nice realisation I actually could understand most of it, I learned a lot about applications of AI to something really useful for society, like the robot-guide in a botanical garden, retracing the silk route, virtual Rome in the time of the Romans, and more.

AI and cultural heritage workshop at AI*IA’07, originally posted on Sept 11, 2007. For more recent content on AI & cultural heritage, see e.g., the workshop’s programme of 2014 (also collocated with AI*IA).

——–

I’m reporting live from the Italian conference on artificial intelligence (AI*IA’07) in Rome (well, Villa Mondragone in Frascati, with a view of Rome). My own paper on abstractions is rather distant from near-immediate applicability in daily life, so I’ll leave that be and instead write about an entertaining co-located workshop about applying AI technologies for the benefit of cultural heritage, to, e.g., improve tourists’ experience and satisfaction when visiting the many historical sites, museums, and buildings that are all over Italy (and abroad).

I can remember well the handheld guide at the Alhambra back in 2001, which had a story by Mr. Irving at each point of interest, but there was only one long story and the same one for every visitor. Current research in AI & cultural heritage looks into how this can be personalized and made more interactive, and several directions are being investigated. These range from the amount of information provided at each point of interest (e.g., for the art buff, the casual American visitor who ‘does’ a city in a day or two, or narratives for children), to location-aware information display (the device will detect which point of interest you are closest to), to cataloguing and structuring the vast amount of archeological information, to the software monitoring of Oetzi the Iceman. The remainder of this blog post describes some of the many behind-the-scenes AI technologies that aim to give a tourist the desired amount of relevant information at the right time and right place (see the workshop website for the list of accepted papers). I’ll add more links later; any misunderstandings are mine (the workshop was held in Italian).

First something that relates somewhat to bioinformatics/ecoinformatics: the RoBotanic [1], which is a robot guide for botanical gardens. It is not intended to replace a human, but serves as an add-on that appeals in particular to young visitors and gets them interested in botany and plant taxonomy. The technology is based on the successful ciceRobot that has been tested in the Archeological Museum Agrigento, but because it has to operate outside in a botanical garden (in Palermo), new issues have to be resolved, such as tuff powder, irregular surfaces, lighting, and leaves that interfere with the GPS system (for the robot to stop at plants of most interest). Currently, the RoBotanic provides one-way information, but in the near future interaction will be built in so that visitors can ask questions as well (ciceRobot is already interactive). Both the RoBotanic and ciceRobot are customized off-the-shelf robots.

Continuing with the artificial, there were three presentations about virtual reality. VR can be a valuable add-on to visualize lost or severely damaged property, to show timeline visualizations of rebuilding over old ruins (building a church over a mosque or vice versa was not uncommon), to prepare future restorations, and for general reconstruction of the environment, all based on the real archeological information (not Hollywood fantasy and screenwriting). The first presentation [2] explained how the virtual reality tour of the Church of Santo Stefano in Bologna was made, using Creator, Vega, and many digital photos that served for the texture-feel in the VR tour. [3] provided technical details and software customization for VR & cultural heritage. The third presentation [4], on the other hand, was the most interesting from a scientific point of view and too full of information to cover it all here. E. Bonini et al. investigated whether, and if so how, VR can give added value. Current VR being insufficient for the cultural heritage domain, they look at how one can do an “expansion of reality” to give the user a “sense of space”. MUDing on the via Flaminia Antica in the virtual room in the National Museum in Rome should be possible soon (CNR-ITABC project started). Another issue came up during the concluded Appia Antica project for Roman era landscape VR: the behaviour of, e.g., animals is now pre-coded and becomes boring to the user quickly. So, what these VR developers would like to see (i.e., future work) is to have technologies for autonomous agents integrated with VR software in order to make the ancient landscape & environment more lively: artificial life in the historical era one wishes, based on, and constrained by, scientific facts so as to be both useful for science and educational & entertaining for interested laymen.

A different strand of research is that of querying & reasoning, ontologies, planning and constraints.
Arbitrarily, I’ll start with the SIRENA project in Naples (the Spanish Quarter) [5], which aims to provide automatic generation of maintenance plans for historical residential buildings in order to make the current manual plans more efficient and cost-effective, and to maintain the buildings just before a collapse. Given the UNI 8290 norms for technical descriptions of parts of buildings, they made an ontology, and used FLORA-2, Prolog, and PostgreSQL to compute the plans. Each element has its own interval for maintenance, but I didn’t see much of the partonomy, and don’t know how they deal with the temporal aspects. Another project [6] also has an ontology, in OWL-DL, but it is not used for DL reasoning yet. The overall system design, including the use of Sesame, Jena, and SPARQL, can be read here, and after server migration their portal for the archeological e-Library will be back online. Another component is the webGIS for pre- and proto-historical sites in Italy, i.e., spatio-temporal stuff, and the hope is to get interesting inferences (novel information) from that (e.g., discover new connections between epochs). A basic online accessible version of webGIS is already running for the Silk Road.
A third, different approach to and usage of ontologies was presented in [7]. With the aim of digital archive interoperability in mind, D’Andrea et al. took the CIDOC-CRM common reference model for cultural heritage and enriched it with the DOLCE D&S foundational ontology to better describe and subsequently analyse iconographic representations, in this particular work scenes and reliefs from the Meroitic time in Egypt.
With In.Tou.Sys for intelligent tourist systems [8] we move to almost-industry-grade tools to enhance visitor experience. They developed software for PDAs one takes around in a city, which then through GPS can provide contextualized information to the tourist, such as on the building you’re walking by, or give suggestions for the best places to visit based on your preferences (e.g., only baroque era, or only churches, etc.). The latter uses a genetic algorithm to compute the preference list, the former a mix of RDBMS on the server side, OODBMS on the client (PDA) side, and F-Logic for the knowledge representation. They’re now working on the “admire” system, which has a time component built in to keep track of what the tourist has visited before so that the PDA-guide can provide comparative information. Also at city-wide scale and for guiding visitors is the STAR project [9]. A bit different from the previous one, it combines the usual tourist information and services (represented in a taxonomy, partonomy, and a set of constraints) with problem solving and a recommender system to make an individualized agenda for each tourist, so you won’t stand in front of a closed museum, and you will be alerted to a festival, etc. A different PDA-guide system was developed in the PEACH project for group visits in a museum. It provides limited personalized information and canned Q & A, and visitors can send messages to their friends and tag points of interest that are of particular interest.

Utterly different from the previous, but probably of interest to the linguistically-oriented reader, is philology & digital documents. Or: how to deal with representing multiple versions of a document. Poets and authors write and rewrite, brush up, strike through etc. and it is the philologist’s task to figure out what constitutes a draft version. Representing the temporality and change of documents (words, order of words, notes about a sentence) is another problem, which [10] attempts to solve by representing it as a PERT/CPM graph structure augmented with labeling of edges, the precise definition of a ‘variant graph’, and a method of compactly storing it (ultimately stored in XML). The test case was a poem by Valerio Magrelli.

The proceedings will be put online soon (I presume), are also available on CD (contact the WS organizer Luciana Bordoni), and several of the articles are probably online on the authors’ homepages.

[1] A. Chella, I. Macaluso, D. Peri, L. Riano. RoBotanic: a Robot Guide for Botanical Gardens. Early Steps.
[2] G. Adorni. 3D Virtual Reality and the Cultural Heritage.
[3] M.C.Baracca, E.Loreti, S. Migliori, S. Pierattini. Customizing Tools for Virtual Reality Applications in the Cultural Heritage Field.
[4] E. Bonini, P. Pierucci, E. Pietroni. Towards Digital Ecosystems for the Transmission and Communication of Cultural Heritage: an Epistemological Approach to Artificial Life.
[5] A. Calabrese, B. Como, B. Discepolo, L. Ganguzza , L. Licenziato, F. Mele, M. Nicolella, B. Stangherling, A. Sorgente, R Spizzuoco. Automatic Generation of Maintenance Plans for Historical Residential Buildings.
[6] A.Bonomi, G. Mantegari, G.Vizzari. Semantic Querying for an Archaeological E-library.
[7] A. D’Andrea, G. Ferrandino, A. Gangemi. Shared Iconographical Representations with Ontological Models.
[8] L. Bordoni, A. Gisolfi, A. Trezza. INTOUSYS: a Prototype Personalized Tourism System.
[9] D. Magro. Integrated Promotion of Cultural Heritage Resources.
[10] D. Schmidt, D. Fiormonte. Multi-Version Documents: a Digitisation Solution for Textual Cultural Heritage Artefacts

# The ontology-driven unifying metamodel of UML class diagrams, ER, EER, ORM, and ORM2

Metamodelling of conceptual data modelling languages is nothing new, and one may wonder why one would need yet another one. But you do, if you want to develop complex systems or integrate various legacy sources (which South Africa is going to invest more money in) and automate at least some parts of it. For instance: you want to link up the business rules modelled in ORM, the EER diagram of the database, and the UML class diagram that was developed for the application layer. Are the, say, Student entity types across the models really the same kind of thing? And UML’s attribute StudentID vs. the one in the EER diagram? Or EER’s EmployeesDependent weak entity type with the ORM business rule that states that “each dependent of an employee is identified by EmployeeID and the Dependent’s Name”?

Ascertaining the correctness of such inter-model assertions in different languages does not require a comparison and contrast of their differences, but a way to harmonise or unify them. Some such models already exist, but they take subsets of the languages, whereas all those features do appear in actual models [1] (described here informally). Our metamodel, in contrast, aims to capture all constructs of the aforementioned languages and the constraints that hold between them, and to generalize in an ontology-driven way so that the integrated metamodel subsumes their structural, static elements (i.e., the integrated metamodel has them as fragments). Besides some updates to the earlier metamodel fragment presented in [2,3], the current version [4,5] also includes the metamodel fragment of their constraints (though it omits temporal aspects and derived constraints). The metamodel and its explanation can be found in the paper An ontology-driven unifying metamodel of UML Class Diagrams, EER, and ORM2 [4], which I co-authored with Pablo Fillottrani and which was recently accepted in Data & Knowledge Engineering.

Methodologically, the unifying metamodel presented in An ontology-driven unifying metamodel of UML Class Diagrams, EER, and ORM2 [4] is ontological rather than formal (cf. all other known works). By ‘ontology-driven approach’ is meant here the use of insights from Ontology (philosophy) and ontologies (in computing) to enhance the quality of a conceptual data model and obtain that ‘glue stuff’ to unify the metamodels of the languages. The DKE paper describes all that, such as: the nature of the UML association/ORM fact type (different wording, same ontological commitment), attributes with and without data types, the plethora of identification constraints (weak entity types, reference modes, etc.), where one can reuse an ‘attribute’, if at all, and more. The main benefit of this approach is being able to cope with the larger number of elements that are present in those languages, and it shows that, in the details, the overlap in features across the languages is rather small: 4 among the set of 23 types of relationship, role, and entity type are essentially the same across the languages (see figure below), and 6 of the 49 types of constraints. The metamodel is stable for the modelling languages covered. It is represented in UML for ease of communication, but, as mentioned earlier, it has also been formalised in the meantime [5].

Types of elements in the languages; black-shaded: entity is present in all three language families (UML, EER, ORM); dark grey: in two of the three; light grey: in one; white-filled: in none, but we added the more general entities to ‘glue’ things together. (Source: [4])

Metamodel fragment with some constraints among some of the entities. (Source [4])

The DKE paper also puts it in a broader context with examples, model analyses using the harmonised terminology, and a use case scenario that demonstrates the usefulness of the metamodel for inter-model assertions.

While the 24-page paper is rather comprehensive, research results wouldn’t live up to it if it didn’t uncover new questions. Some of them have been, and are being, answered in the meantime, such as its use for classifying models and comparing their characteristics [1,6] (blogged about here and here) and a rule-based approach to validating inter-model assertions [7] (informally here). Although the 3-year funded project on the Ontology-driven unification of conceptual data modelling languages—which surely contributed to realising this paper—just finished officially, we’re not done yet, or: more is in the pipeline. To be continued…

References

[1] Keet, C.M., Fillottrani, P.R. An analysis and characterisation of publicly available conceptual models. 34th International Conference on Conceptual Modeling (ER’15). Springer LNCS. 19-22 Oct, Stockholm, Sweden. (in press)

[2] Keet, C.M., Fillottrani, P.R. Toward an ontology-driven unifying metamodel for UML Class Diagrams, EER, and ORM2. 32nd International Conference on Conceptual Modeling (ER’13). W. Ng, V.C. Storey, and J. Trujillo (Eds.). Springer LNCS 8217, 313-326. 11-13 November, 2013, Hong Kong.

[3] Keet, C.M., Fillottrani, P.R. Structural entities of an ontology-driven unifying metamodel for UML, EER, and ORM2. 3rd International Conference on Model & Data Engineering (MEDI’13). A. Cuzzocrea and S. Maabout (Eds.) September 25-27, 2013, Amantea, Calabria, Italy. Springer LNCS 8216, 188-199.

[4] Keet, C.M., Fillottrani, P.R. An ontology-driven unifying metamodel of UML Class Diagrams, EER, and ORM2. Data & Knowledge Engineering. 2015. DOI: 10.1016/j.datak.2015.07.004. (in press)

[5] Fillottrani, P.R., Keet, C.M. KF metamodel Formalization. Technical Report, Arxiv.org http://arxiv.org/abs/1412.6545. Dec 19, 2014. 26p.

[6] Fillottrani, P.R., Keet, C.M. Evidence-based Languages for Conceptual Data Modelling Profiles. 19th Conference on Advances in Databases and Information Systems (ADBIS’15). Springer LNCS. Poitiers, France, Sept 8-11, 2015. (in press)

[7] Fillottrani, P.R., Keet, C.M. Conceptual Model Interoperability: a Metamodel-driven Approach. 8th International Web Rule Symposium (RuleML’14), A. Bikakis et al. (Eds.). Springer LNCS 8620, 52-66. August 18-20, 2014, Prague, Czech Republic.

# Quasi wordles of isiZulu online newspaper articles from this weekend

Every now and then, I get side-tracked from what I was (supposed to be) doing. This time, it was a result of the combination of preparing ICPC training problems, preparing for a statistics tutorial for the postgraduate research methods, and a conversation from last week on an isiZulu corpus with Langa Khumalo from UKZN’s ULPDO (and my co-author on several papers on isiZulu CNLs). To make a long story short, I ended up sourcing some online news articles in isiZulu and writing a little python script to count the words and top-k words of the news articles to get a feel of what the most prevalent topics of the articles were.

Materials and data

10 Isolezwe articles, listed on the front page on August 8, 2015 (articles were from Aug 6 and 7; no updates in the long weekend)

10 News24 in isiZulu articles, listed on the front page on August 8, 2015 (articles were from Aug 8)

10 News24 in isiZulu articles, listed on the front page on August 9, 2015 (articles were from Aug 9, a Sunday, and Women’s Day in South Africa)

Simple basicCorpusStats.py that one can make already just by going through the first part of ThinkPython (in case you’re unfamiliar with python).

Note: ilanga doesn’t have articles online, and therefore was not included.

Note 2: for copyright issues, I probably cannot share the txt files online, but in case you’re interested, just ask me and I’ll email them.
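For a feel of what such a counting script involves, here is a minimal sketch. It is not the basicCorpusStats.py mentioned above: the function names `tokenize` and `corpus_stats` are made up for illustration, and the sample "articles" are just loose strings of words taken from this post, not real article text.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters as word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def corpus_stats(articles, k=20):
    """Return (total words, average words per article, top-k word counts)."""
    counts = Counter()
    total = 0
    for article in articles:
        tokens = tokenize(article)
        total += len(tokens)
        counts.update(tokens)
    average = total / len(articles) if articles else 0.0
    return total, average, counts.most_common(k)


# Stand-in data: word strings, not actual news articles.
sample = [
    "amaphoyisa uthe ukuthi abasolwa",
    "ukuthi amaphoyisa imali kodwa ukuthi",
]
total, average, top = corpus_stats(sample, k=2)
print(total, average, top)
```

With real data, one would read each txt file into a string and pass the list of strings to `corpus_stats`.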

Some general stats

Isolezwe had, on average, 265 words/article, whereas News24 had about half of that (110 and 134 on Saturday and Sunday, respectively). The top-20 of each is listed at the end of this post (the raw results of News24 had “–” removed [bug], as well as udaba and olunye [standard-text noise from the articles]).

Comparing them on the August 8 offering, Isolezwe had people saying this, that and the other (ukuthi ‘saying/to say’ had the highest frequency of 60) and then the police (amaphoyisa, n=27), whereas News24 had amaphoyisa 27 times as its most frequent word, then abasolwa (‘suspects’) 11 times, a word that doesn’t even appear in Isolezwe’s top-20 most frequent words (though the stem –solwa appears 9 times). The police are problematic in South Africa (they commit crimes and exhibit other dubious behaviour that is under investigation, e.g., Marikana), more of them get killed than in many other countries (another one last week), and crime happens. But not on a public holiday, apparently: News24 had only one –phoyisa on Aug 9.

While I hoped to find a high incidence of women, for it being Women’s Day on August 9, none of –fazi appeared in the News24 mini-corpus of 1353 words of the 10 front page articles; instead, there was a lot of saying this that and the other (ukuthi had the highest frequency of 37), and little on suspects or blaming (-solwa n=3).

On that quasi wordle

While ukuthi is the infinitive, there are a gazillion conjugations and things agglutinated to it, and how it all works is barely clear even to the linguists, so I did not analyse that further. Amaphoyisa, on the other hand, as a noun (plural of ‘police’), has fewer variations. In the Isolezwe mini-corpus, –phoyis– (the root of ‘police’) appeared 47 times, including variants like lwamaphoyisa, ngamaphoyisa, and yiphoyisa, i.e., substantially more than the 27 occurrences of amaphoyisa. If I were to create a wordle, they’d be missed unless one uses some stemmer, which doesn’t happen to be available[1] and I didn’t write one (just regex in the txt). By the same token, News24’s mention of the police on August 8 goes up to 28 with –phoyisa, with the blaming and suspects as a close second (-solwa, n=27).

The lack of a stemmer also means missing out on all sorts of variations on imali (‘money’, n=11) in the Isolezwe articles, whereas its stem –mali pops up 29 times, due to, among others, kwemali (n=5), mali (n=3), yimali (y- functioning as copulative in that sentence, n=1), and ngezimali (n=1). Likewise for person/people (-ntu), for which n=17, distributed among abantu (plural), umuntu (singular), and nabantu (‘and people’), among others.
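The regex-instead-of-stemmer shortcut can be sketched as follows. This is an illustration only, with a hypothetical helper `count_stem` and a hand-made token list built from word forms mentioned above; plain substring matching will also catch unrelated words that happen to contain the same letters, which is exactly why it is no substitute for a proper stemmer.

```python
import re
from collections import Counter


def count_stem(tokens, stem):
    """Count the tokens that contain `stem`, grouped by surface form."""
    pattern = re.compile(re.escape(stem))
    return Counter(t for t in tokens if pattern.search(t))


# Hand-made token list, not the actual corpus.
tokens = ["amaphoyisa", "lwamaphoyisa", "ngamaphoyisa", "yiphoyisa",
          "amaphoyisa", "imali", "kwemali", "yimali", "umuntu"]

police = count_stem(tokens, "phoyis")
money = count_stem(tokens, "mali")
print(sum(police.values()), dict(police))
print(sum(money.values()), dict(money))
```

Summing the per-form counts gives the stem frequency, while the per-form breakdown shows which variants a plain word count would have scattered.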

Last, the second most frequently used word in News24 on August 9 was njengoba (‘as’, ‘whereas’, ‘since’), primarily due to the first article on the sports results of the matches played.

So, with all that background knowledge, Isolezwe’s wordle would be, in descending order (and in English for the readers of this blog): say, police, money, people. News24 on August 8: police, suspect/blame, say (two variations, n=9 each). News24 on August 9: say, as/since (and then some other adverbs).

In closing

This dabbling resulted in more problems and questions being raised than answered. But, for now, it’s at least still a bit of a peek into the kitchen of news in a language that I don’t master as well as I want to and should. It wasn’t useful either for the ICPC problem setting or the stats tutorial, nor is a 5123-word corpus of any use, but it was fun with python and it satisfied at least a little of my curiosity, and perhaps it spurs someone to do all this properly/more systematically and on a grander scale. For the isiZulu speakers: it’s surely still up to you to read whichever news outlet you prefer reading.

References

[1] Pretorius, L., Bosch, S.E. (2010). Finite-state morphology of the Nguni language cluster: modelling and implementation Issues. In A. Yli-Jyrä, Kornai, A., Sakarovitch, J. & Watson, B. (Eds.), Finite-State Methods and Natural Language Processing 8th International Workshop, FSMNLP 2009. Lecture Notes in Computer Science, Vol. 6062, pp. 123–130

[2] Spiegler, S., van der Spuy, A., and Flach, P. A. (2010). Ukwabelana – an opensource morphological zulu corpus. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING’10), pages 1020-1028. Association for Computational Linguistics. Beijing

| Top-20 words Isolezwe on Aug8 | n | Top-20 words News24 on Aug8 | n | Top-20 words News24 on Aug9 | n |
|---|---|---|---|---|---|
| ukuthi | 60 | amaphoyisa | 19 | ukuthi | 37 |
| amaphoyisa | 27 | abasolwa | 11 | njengoba | 13 |
| uthe | 17 | uthe | 9 | ngemuva | 12 |
| ngoba | 17 | ukuthi | 9 | ngesikhathi | 8 |
| lokhu | 16 | ubudala | 9 | lo | 8 |
| kuthiwa | 16 | njengoba | 9 | uthe | 7 |
| kusho | 16 | kusho | 9 | uma | 7 |
| kodwa | 12 | lo | 8 | futhi | 7 |
| imali | 11 | oneminyaka | 7 | uzakwe | 6 |
| uma | 10 | ngokuthakatha | 7 | usnethemba | 6 |
| ngesikhathi | 10 | ngokusho | 7 | kodwa | 6 |
| yakhe | 9 | omphakathi | 6 | johannesburg | 6 |
| ukuba | 9 | okhulumela | 6 | yakhe | 5 |
| njengoba | 9 | kanti | 6 | united | 5 |
| nje | 9 | endaweni | 6 | ukuba | 5 |
| lo | 9 | yohlobo | 5 | ukomphela | 5 |
| khona | 9 | ngesikhathi | 5 | ufaku | 5 |
| abantu | 9 | ngemuva | 5 | ubudala | 5 |
| umphakathi | 8 | le | 5 | rhythms | 5 |
| umnuz | 8 | imoto | 5 | ngokubika | 5 |

[1] There is some material on that, though (among others, [1,2]), but it’s mostly theoretical or very much proof of concept rather than easily reusable tools like those for English, and the example rule in [1] isn’t right (it’s umfana, not umufana; the longer prefix with the extra –u– is used when the stem is one syllable, like –ntu -> umuntu).

# An orchestration of ontologies for linguistic knowledge

Starting from multilingual knowledge representation in ontologies and with an eye on linguistic linked data and controlled natural languages, we had developed a basic ontology for the Bantu noun class system [1] to link with the lemon model [2]. The noun class system is akin to gender in, e.g., German and Italian, but then a bit different. It is based on the semantics of the nouns, and each Bantu language has some 12-23 noun classes. For instance, noun classes 1 and 2 are for singular and plural humans, 9 and 10 for animals (singular and plural, respectively), 11 for inanimates and long thin objects (e.g., a telephone cable), and class 14 has abstract nouns (e.g., beauty). Each class has its own augment or augment+prefix to be added to the stem. None of the other linguistic resources, such as ISOcat or the GOLD ontology, dealt with them, so lemon did not either, but we needed it. The first version of the ontology we introduced in [1] had its limitations, but it mostly did its job. Mostly, but not fully.

Lemon needs that morphology module and then some for the rules. The ontology did not fully satisfy Bantu languages other than Chichewa and isiZulu. Built with knowledge of only those two, it was more like a merged conceptual data model, for it was tailored to the two specific languages. Also, it wasn’t aligned to other models or ontologies, thus hampering interoperability and reuse. We didn’t have any competency questions or cool inferences either, because our scope then was just to annotate the names of the classes in an ontology. Hence, it was time for an improvement.

Among others, we don’t want just to annotate, but, given that Bantu languages are underresourced, see what we can add to derive implicit information, which could help with tagging terms. For instance

• if you know abantu is a plural and in noun class 2 and umuntu is the singular of it, then umuntu is in noun class 1, or
• when it is declared that inja is in noun class 9, then so is its stem -ja (or vv), or
• language specific, which singular (plural) noun class goes with which plural (singular) noun class: while the majority neatly has a pair of successive odd and even numbers (1-2, 3-4, 5-6 etc), this is not always the case; e.g., in isiZulu, noun class 11 does not have noun class 12 as plural, but noun class 10 (which has its own augment and prefix).
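To make the first and third kind of inference concrete, here is a toy sketch in plain Python. It is not how the ontology does it (there, such knowledge is stated as axioms and derived by a reasoner); the table names and helper functions are made up for illustration, and the pairings are only the ones named in this post, so the tables are deliberately incomplete.

```python
# Singular -> plural noun class pairings named in the text (not a complete list).
DEFAULT_PAIRS = {1: 2, 3: 4, 5: 6, 9: 10}

# Language-specific exception from the text: in isiZulu, class 11 has class 10 as plural.
LANGUAGE_PAIRS = {"isiZulu": {11: 10}}


def pairs_for(language):
    """Merge the general pairs with the language-specific ones."""
    return {**DEFAULT_PAIRS, **LANGUAGE_PAIRS.get(language, {})}


def plural_class(singular, language):
    """Plural noun class for a singular class, or None if not recorded here."""
    return pairs_for(language).get(singular)


def singular_classes(plural, language):
    """All singular classes that take `plural` as their plural class."""
    return sorted(s for s, p in pairs_for(language).items() if p == plural)


# abantu is a plural in class 2 and umuntu is its singular, so umuntu is in class 1.
print(singular_classes(2, "isiZulu"))
# Class 10 alone does not determine the singular class in isiZulu.
print(singular_classes(10, "isiZulu"))
print(plural_class(11, "isiZulu"))
```

The split between `DEFAULT_PAIRS` and `LANGUAGE_PAIRS` mirrors, very loosely, the distinction between knowledge that holds across languages and axioms specific to one language.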

Then, besides the interoperability and reuse requirements, we needed to distinguish between language-specific axioms and those that hold across the language family. To solve all that, we developed a framework, reusing the pyramid structure idea from BioTop [3] and the so-called “double articulation principle” of DOGMA [4], where the language-specific axioms are at the level of DOGMA’s conceptual model, for they add specific constraints.

To make a long story short, the framework/orchestration applied to the linguistic knowledge of Bantu noun classes in general, and specific to some language, looks as follows:

framework applied to some linguistics ontologies (source: [5])

More details are described in the recently accepted paper “An orchestration framework for linguistic task ontologies” [5], to be presented at the 9th Metadata and Semantics Research Conference (MTSR’15), to be held from 9 to 11 September in Manchester, UK. My co-author Catherine Chavula will be attending MTSR’15 and will present our paper, hoping/assuming that all those last-minute things (like visa and money actually being transferred to buy that plane ticket) will be sorted this month. (Odd ‘checks and balances’ that make life harder and more expensive for people outside of a visa-free zone and tied to a funding benefactor are a topic for some other time.)

The set of ontologies (in OWL) is available in NCS1.zip from my ontologies directory. It contains the goldModule (a module extracted from the GOLD ontology for general linguistics knowledge, which is aligned to the foundational ontology SUMO), the NCS ontology, and three language-specific axiomatizations for the noun classes, being Chichewa, isiXhosa, and isiZulu (more TBA). The same approach can be used for other linguistic features in other language groups or families; e.g., instead of the NCS, one could have knowledge represented about conjugation in the Romance languages (Italian, Spanish etc.), and then the more precise axiomatization (conceptual data model, if you will) for constraints unique to each language.

p.s.: Bantu languages is the term used in linguistics, so that’s why it’s used here. Elsewhere, they are also called African languages. They’re not synonymous, however, as the latter includes also other, non-Bantu, languages, as it can designate any language spoken in Africa that may have a wholly different grammar, hence, the difference linguists make to avoid misinterpretation.

References

[1] Chavula, C., Keet, C.M. Is Lemon Sufficient for Building Multilingual Ontologies for Bantu Languages? 11th OWL: Experiences and Directions Workshop (OWLED’14). Keet, C.M., Tamma, V. (Eds.). Riva del Garda, Italy, Oct 17-18, 2014. CEUR-WS vol. 1265, 61-72.

[2] McCrae, J., Aguado-de Cea, G., Buitelaar, P., Cimiano, P., Declerck, T., Gómez-Pérez, A., Gracia, J., Hollink, L., Montiel-Ponsoda, E., Spohr, D., Wunner, T.: Interchanging lexical resources on the Semantic Web. Language Resources & Evaluation, 2012, 46(4), 701-719

[3] Beißwanger, E., Schulz, S., Stenzhorn, H., Hahn, U.: Biotop: An upper domain ontology for the life sciences: A description of its current structure, contents and interfaces to obo ontologies. Applied Ontology, 2008, 3(4), 205-212

[4] Jarrar, M., Meersman, R.: Ontology Engineering The DOGMA Approach. In: Advances in Web Semantics I, LNCS, vol. 4891, pp. 7-34. Springer (2009)

[5] Chavula, C., Keet, C.M. An Orchestration Framework for Linguistic Task Ontologies. 9th Metadata and Semantics Research Conference (MTSR’15), Springer CCIS. 9-11 September, 2015, Manchester, UK. (in print)

# Reblogging 2006: Figuring out requirements for automated reasoning services for formal bio-ontologies

From the “10 years of keetblog – reblogging: 2006”: a preliminary post that led to the OWLED 2007 paper I co-authored with Marco Roos and Scott Marshall, when I was still predominantly into bio-ontologies and biological databases. The paper received quite a few citations, and a good ‘harvest’ from both OWLED’07 and co-located DL’07 participants on how those requirements may be met (described here). The original post: Figuring out requirements for automated reasoning services for formal bio-ontologies, from Dec 27, 2006.

What does the user want? There is a whole sub-discipline on requirements engineering, where researchers look into methodologies for how one can best extract the users’ desires for a software application and organize the requirements according to type and priority. But what to do when users – in this case biologists and (mostly non-formal) bio-ontologies developers – neither know clearly themselves what they want nor what type of automated reasoning is already ‘on offer’? Here, I’m making a start by briefly and informally listing some of the desires & usages that I came across in the literature, picked up from conversations and further probing to disambiguate the (for a logician) vague descriptions, or bumped into myself; they are summarised at the end of this blog entry and (update d.d. 5-5-’07) described more comprehensively in [0].

Feel free to add your wishes & demands; it may even be fed back into current research like [1] or be supported already after all. (An alternative approach is describing ‘scenarios’ from which one can try to extract the required reasoning tasks; if you want to, you can add those as well.)

I. A fairly obvious use of automated reasoners such as Racer, Pellet and FaCT++ with ontologies is to let the software find errors (inconsistencies) in the representation of the knowledge or reality. This is particularly useful to ensure no ‘contradictory’ information remains in the ontology when an ontology gets too big for one person to comprehend and multiple people update an ontology. Also, it tends to facilitate learning how to formally represent something. Hence, the usage is to support the ontology development process.

But this is just the beginning: having a formal ontology gives you other nice options, or at least that is the dangling carrot in front of the developer’s nose.

II. One demonstration of the advantages of having a formal ontology, thus not merely a promise, is the classification of protein phosphatases by Wolstencroft et al. [9], where also some modest results were obtained in discovering novel information about those phosphatases that was entailed in the extant information but hitherto unknown. Bandini and Mosca [2] pushed a closely related idea one step further in another direction. To constrain the search space of candidate rubber molecules for tire production, they defined the constraints (properties) all types of molecules for tires must satisfy in the TBox, treated each candidate molecule as an instance in the ABox, and performed model checking on the knowledgebase: each instance inconsistent w.r.t. the TBox was thrown out of the pool of candidate-molecules. Overall, the formal representation with model checking achieved a considerable reduction in resource usage of the system and reduced the amount of costly wet-lab research. Hence, the usages are classification and model checking.[i]
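The filtering idea in the rubber-molecule example can be illustrated with a very rough analogy in plain Python. This is not description logic reasoning and not the method of [2]: the constraints play the role of the TBox, the candidate records that of ABox instances, and all property names and thresholds are invented for the illustration. Note also that this sketch treats missing data as a violation, which a reasoner under the Open World Assumption would not do.

```python
# Constraints every acceptable candidate must satisfy (invented examples).
CONSTRAINTS = {
    "has_backbone": lambda c: c.get("backbone") is True,
    "max_weight": lambda c: c.get("weight", float("inf")) <= 500,
}


def violated(candidate):
    """Names of the constraints this candidate fails."""
    return [name for name, ok in CONSTRAINTS.items() if not ok(candidate)]


def consistent_candidates(candidates):
    """Identifiers of the candidates that violate no constraint."""
    return [c["id"] for c in candidates if not violated(c)]


# Invented candidate pool.
candidates = [
    {"id": "m1", "backbone": True, "weight": 320},
    {"id": "m2", "backbone": False, "weight": 410},
    {"id": "m3", "backbone": True, "weight": 780},
]
print(consistent_candidates(candidates))
print(violated(candidates[2]))
```

Everything thrown out at this stage is a candidate that never has to reach the wet lab, which is where the reported savings come from.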

III. Whereas the former includes usage of particular instances for the reasoning scenarios, another one is to stay at the type level and, in particular, relations between the types (classes in the class hierarchy in Protégé). In short, some users want to discover new or missing relations. What type of relation is not always exactly clear, but I assume for now that any non-isA relation would do. For instance, Roos et al. [8] would like to do that for the subject domain of histones, with or without instance-level data. The former, using instance-level data, resembles the reverse engineering option in VisioModeler, which takes a physical database schema and the data stored in the tables and computes the likely entities, relations, and constraints at the level of a conceptual model (in casu, ORM). Mungall [7] wants to “Check for inconsistent or missing relationships” and “Automatically assign relationships for new terms”. How can one find what is not there but ought to be in the ontology? An automated reasoner is not an oracle. I will split up this topic into two aspects. First, one can derive relations among types, meaning that some ontology developer has declared several types, relations, and other properties, but not everything. The reasoner then takes the declared knowledge and can return relations that are logically implied by the formal ontology. From a user perspective, such a derived relation may be perceived as a ‘new’ or ‘missing’ relation – but it did not fall out of the sky because the relation was already entailed in the ontology (or: you did not realize you knew it already). Second, another notion of ‘missing relations’: e.g. there are 17 types of macrophages (types of cell) in the FMA, which must be part of, contained in, or located in something. If you query the FMA through OQAFMA, it gives as answer that the hepatic macrophage is part of the liver [5]. An informed user knows it cannot be the case that the other macrophages are not part of anything. Then, the ontology developer may want to fill this gap – adding the ‘missing’ relations – by further developing those cell-level sections of the ontology. Note that the reasoner will not second-guess you by asking “do you want more things there?”; it uses the Open World Assumption, i.e. that there always may be more than actually represented in the ontology (and absence of some piece of information is not negation of that piece). Thus, the requirements are to have some way of dealing with ‘gaps’ in an ontology, to support computing derived relations entailed in a logical theory, and, third, deriving type-level relations based on instance-level data. The second one is already supported, the first one only with intervention by an informed user, and the third one might be, to some extent.
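The two notions can be told apart with a toy example. The sketch below is plain Python rather than a reasoner, it handles only a transitive isA hierarchy, and the type name "placeholder macrophage" is invented to stand in for the other macrophage types. Derived relations are computed mechanically from what was declared, whereas the ‘gaps’ are merely listed for an informed user to look at, since absence of a statement is not a contradiction.

```python
def transitive_closure(pairs):
    """Smallest transitive relation containing `pairs`."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Declared knowledge (toy data; "placeholder macrophage" is invented).
declared_is_a = {
    ("hepatic macrophage", "macrophage"),
    ("placeholder macrophage", "macrophage"),
    ("macrophage", "cell"),
}
declared_part_of = {("hepatic macrophage", "liver")}

# Aspect 1: relations that are entailed but were never written down.
derived = sorted(transitive_closure(declared_is_a) - declared_is_a)
print(derived)

# Aspect 2: macrophage types without any part-of statement, for a human to review.
subtypes = {a for a, b in declared_is_a if b == "macrophage"}
with_part = {a for a, _ in declared_part_of}
gaps = sorted(subtypes - with_part)
print(gaps)
```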

Now three shorter points, either because there is even less material or there is too much to stuff it in this blog entry.

IV. A ‘this would be nice’ suggestion from Christina Hettne, among others, concerns the desire to compare pathways, which, in its simplest form, amounts to checking for sub-graph isomorphisms. More generally, one could – or should be able to – treat an ontology as a scientific theory [6] and compare competing explanations of some natural phenomenon (provided both are represented formally). Thus, we have a requirement for comparison of two ontologies, not with the aim of doing meaning negotiation and/or merging them, but where the discrepancies themselves are the fun part. This indicates that dealing with ‘errors’ that a reasoner spits out could use an upgrade toward user-friendliness.
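In its simplest form, the pathway comparison could look like the brute-force sketch below. It assumes pathways are given as small directed graphs, it checks the usual (non-induced) notion of sub-graph isomorphism, and the node names are arbitrary placeholders; it tries every injective mapping, so it is only workable for very small graphs.

```python
from itertools import permutations


def is_subgraph_isomorphic(small_nodes, small_edges, big_nodes, big_edges):
    """True if the small directed graph can be mapped injectively into the
    big one such that every edge of the small graph lands on an edge."""
    big_edges = set(big_edges)
    for image in permutations(big_nodes, len(small_nodes)):
        mapping = dict(zip(small_nodes, image))
        if all((mapping[a], mapping[b]) in big_edges for a, b in small_edges):
            return True
    return False


# Placeholder pathways: a three-step chain and a larger graph containing one.
chain_nodes = ["a", "b", "c"]
chain_edges = [("a", "b"), ("b", "c")]
big_nodes = ["x", "y", "z", "w"]
big_edges = [("x", "y"), ("y", "z"), ("z", "w"), ("x", "z")]

print(is_subgraph_isomorphic(chain_nodes, chain_edges, big_nodes, big_edges))
print(is_subgraph_isomorphic(big_nodes, big_edges, chain_nodes, chain_edges))
```

For the comparison of competing theories, the interesting output would be the part that does not match, which this yes/no check does not report.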

V. Reasoning with parthood and parthood-like relations in bio-ontologies is on a par in importance with the subsumption relation. Donnelly [3] and Keet [4], among many, would like to use parthood and parthood-like relations for reasoning, covering more than transitivity alone. Generalizing a bit, we have another requirement: reasoning with properties (relations) and hierarchies of relations, focusing first on the part-whole relation. What reasoning services are required exactly, be it for parthood or any other relation, deserves an entry on its own.

VI. And whatnot? For instance, linking up different ontologies that each reside at their own level of granularity, yet make it possible to perform ‘granular cross-ontology queries’, or inferring locations of diseases by combining an anatomy ontology with a disease taxonomy; hence, reasoning over linked ontologies. This needs to be written down in more detail, and may be covered at least partially by the second point in item III.

Summarizing, we have the following requirements for automated reasoning services, in random order w.r.t. importance:

• Support in the ontology development process;
• Classification;
• Model checking;
• Finding ‘gaps’ in the content of an ontology;
• Computing derived relations at the type level;
• Deriving type-level relations from instance-level data;
• Comparison of two ontologies ([logical] theories);
• Reasoning with a plethora of parthood and parthood-like relations;
• Using (including finding inconsistencies in) a hierarchy of relations in conjunction with the class hierarchy.

I doubt this is an exhaustive list, and expect to add more requirements & desires soon. They also have to be specified more precisely than explained briefly above and the solutions to meet these requirements need to be elaborated upon as well.

[0] Keet, C.M., Roos, M., Marshall, M.S. A survey of requirements for automated reasoning services for bio-ontologies in OWL. Third international Workshop OWL: Experiences and Directions (OWLED 2007), 6-7 June 2007, Innsbruck, Austria. CEUR-WS.

[1] European FP6 FET Project “Thinking ONtologiES (TONES)”. (UPDATE 29-7-2015: URL defunct by now)

[2] Bandini, S., Mosca, A. Mereological knowledge representation for the chemical formulation. 2nd Workshop on Formal Ontologies Meets Industry 2006 (FOMI2006), 14-15 December 2006, Trento, Italy. pp55-69.

[3] Donnelly, M., Bittner, T. and Rosse, C. A Formal Theory for Spatial Representation and Reasoning in Biomedical Ontologies. Artificial Intelligence in Medicine, 2006, 36(1):1-27.

[4] Keet, C.M. Part-whole relations in Object-Role Models. 2nd International Workshop on Object-Role Modelling (ORM 2006), Montpellier, France, Nov 2-3, 2006. In: OTM Workshops 2006. Meersman, R., Tari, Z., Herrero, P. et al. (Eds.), LNCS 4278. Berlin: Springer-Verlag, 2006. pp1116-1127.

[5] Keet, C.M. Granular information retrieval from the Gene Ontology and from the Foundational Model of Anatomy with OQAFMA. KRDB Research Centre Technical Report KRDB06-1, Free University of Bozen-Bolzano, 6 April 2006. 19p.

[6] Keet, C.M. Factors affecting ontology development in ecology. Data Integration in the Life Sciences 2005 (DILS2005), Ludaescher, B, Raschid, L. (eds.). San Diego, USA, 20-22 July 2005. Lecture Notes in Bioinformatics 3615, Springer Verlag, 2005. pp46-62.

[7] Mungall, C.J. Obol: integrating language and meaning in bio-ontologies. Comparative and Functional Genomics, 2004, 5(6-7):509-520. (UPDATE: link rot as well; a freely accessible version is available at: http://berkeleybop.org/~cjm/obol/doc/Mungall_CFG_2004.pdf)

[8] Roos, M., Rauwerda, H., Marshall, M.S., Post, L., Inda, M., Henkel, C., Breit, T. Towards a virtual laboratory for integrative bioinformatics research. CSBio Reader: Extended abstracts of “CS & IT with/for Biology” Seminar Series 2005. Free University of Bozen-Bolzano, 2005. pp18-25.

[9] Wolstencroft, K., Lord, P., Tabernero, L., Brass, A., Stevens, R. Using ontology reasoning to classify protein phosphatases [abstract]. 8th Annual Bio-Ontologies Meeting; 2005 24 June; Detroit, United States of America.

[i] Observe another aspect regarding model checking, where the automated reasoner checks if the theory is satisfiable, or: given your ontology, if there is/can be a combination of instances such that all the declared knowledge in the ontology holds (is true), which is called a ‘model’ (as well, like so many things). That an ontology is satisfiable does not imply it only has models as intended by you, i.e. there is a difference between ‘all models’ and ‘all intended models’. If an ontology is satisfiable it means that it is logically possible that each type can be instantiated without running into inconsistencies; it neither demonstrates that one can indeed find in reality the real-world versions of those represented entities nor that there is one-and-only-one model that actually matches exactly the data you may want to have linked to & checked against the ontology.