Logics and other math for computing (LAC18 report)

Last week I participated in the Workshop on Logic, Algebra, and Category theory and their applications in computer science (LAC2018), which was held 12-16 February at La Trobe University in Melbourne, Australia. It’s not fully in my research area, so there was lots of fun stuff to learn. There were tutorials in the morning and talks in the afternoon, and, of course, networking and collaborations over lunch and in the evenings.

I finally learned some (hardcore) foundations of institutions, which underpin the OMG-standardised Distributed Ontology, Model, and Specification Language DOL, whose standard we used in the (award-winning) KCAP17 paper. It concerns the mathematical foundations to handle different languages in one overarching framework. That framework takes care of the ‘repetitive stuff’ (like all languages dealing with sentences, signatures, models, satisfaction etc.) in one fell swoop instead of repeating that for each language (logic). The 5-day tutorial was given by Andrzej Tarlecki from the University of Warsaw (slides).

Oliver Kutz, from the Free University of Bozen-Bolzano, presented our K-CAP paper as part of his DOL tutorial (slides), as well as some more practical motivations for and requirements that went into DOL, or: why ontology engineers need DOL to solve some of the problems.

Dirk Pattinson from the Australian National University started gently with modal logics, but it soon got more involved with coalgebraic logics later on in the week.

The afternoons had two presentations each. The ones of most interest to me included, among others, CSP by Michael Jackson; José Fiadeiro’s fun flexible modal logic for specifying actor networks for, e.g., robots and security breaches (it looks hopeless for implementations, but that is an aside); Ionuț Țuțu’s presentation on model transformations focusing on the maths foundations (cf. the boxes-and-lines in, say, Eclipse); and Adrian Rodriguez’s program analysis with Maude (slides). My own presentation was about ontological and logical foundations for interoperability among the main conceptual data modelling languages (slides). It covered some of the outcomes from the bilateral project with Pablo Fillottrani and some new results obtained afterward.

Last, but not least, emeritus Prof Jennifer Seberry gave a presentation about a topic we probably all should have known about: Hadamard matrices and transformations, which appear to be used widely in, among others, error correction, cryptography, spectroscopy and NMR, data encryption, and compression algorithms such as MPEG-4.

Lots of thanks go to Daniel Găină for taking care of most of the organization of the successful event, and thanks to the generous funders, who made it possible for all of us to fly over to Australia and stay for the week 🙂. My many pages of notes will keep me occupied for a while!

Updated isiZulu spellchecker and new isiXhosa spellchecker

Noting that February is the month of language activism in South Africa and that 21 February is International Mother Language Day (a United Nations event since 2000), let me add my proverbial two cents to that. Since the launch of the isiZulu spellchecker in November 2016, research and development has progressed quite a bit, so we have released a new ‘version 2’ of the spellchecker. For those not in-the-know: isiZulu and isiXhosa are both among the 11 official languages of South Africa. IsiZulu is the largest language in the country by first language speakers, and isiXhosa is slated to make an international breakthrough, as it’s used in the Black Panther movie that was released this weekend. Anyhow, the main novelties of the updated spellchecker are:

  • first error correction algorithms for isiZulu;
  • improved error detection with a few basic rules, also for isiZulu;
  • new isiXhosa error detection and correction.

The code is open source, and, due to various tool limitations beyond our control, it’s still a standalone jar file (zipped for download). Here’s a screenshot of the tool, where it checks a piece of text from a novel in isiZulu, illustrating that *khupels has a substitution error (khupela was the intended word):

Single word error *khupels that has a substitution error s for a in the intended word (khupela)

The error corrector can propose possible corrections for single-word errors that are transpositions, substitutions, insertions, or deletions. For instance, *eybo, *yrbo, *yeebo, and *ybo are, respectively, a transposition, substitution, insertion, and deletion error for the correctly spelled yebo ‘yes’. It doesn’t perform equally well on each type of typo yet, with the best results obtained for transpositions. As with the error detector, it relies on a data-driven approach, although error correction uses considerably more statistics-based algorithms than error detection does. They are described in detail in Frida Mjaria’s 2017 CS honours project. Suggestion accuracy (i.e., that it at least can suggest something) is 95%, whereas suggestion relevance (that the suggestions contain the intended word) only reached 61%, mainly due to weak results of corrections for insertion errors (they mess too much with the trigrams).
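To make the four error types concrete, here is a small Python sketch that labels the single edit separating a typo from its intended word. It is only an illustration of the terminology: the function name classify_single_edit is made up for this post, and this is not the code or the algorithm used in the spellchecker itself.

```python
from typing import Optional


def classify_single_edit(typo: str, intended: str) -> Optional[str]:
    """Name the single edit that turns `intended` into `typo`.

    Returns None when the words are identical or differ by more than
    one edit. Hypothetical helper for illustration only.
    """
    if typo == intended:
        return None

    if len(typo) == len(intended):
        # Positions where the two words disagree.
        diffs = [i for i, (a, b) in enumerate(zip(typo, intended)) if a != b]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and typo[diffs[0]] == intended[diffs[1]]
                and typo[diffs[1]] == intended[diffs[0]]):
            # Two adjacent characters were swapped.
            return "transposition"
        return None

    if len(typo) == len(intended) + 1:
        # The typo has one extra character somewhere.
        for i in range(len(typo)):
            if typo[:i] + typo[i + 1:] == intended:
                return "insertion"
        return None

    if len(typo) + 1 == len(intended):
        # The typo is missing one character of the intended word.
        for i in range(len(intended)):
            if intended[:i] + intended[i + 1:] == typo:
                return "deletion"
        return None

    return None


for typo in ["eybo", "yrbo", "yeebo", "ybo"]:
    print(typo, "->", classify_single_edit(typo, "yebo"))
```

Running it on the four examples prints transposition, substitution, insertion, and deletion, in that order.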

The error detection accuracy has been improved mainly through better handling of punctuation, thanks to Norman Pilusa’s programming efforts. This was done through a series of rules on top of the data-driven approach, for it is too hard to learn those from a large corpus. For instance, semi-colons, end-of-sentence periods, and numbers (written in isiZulu like, e.g., ngu-42 rather than just 42) are now mostly ignored, rather than the words adjacent to them being detected as probably misspelt. It works better than spellchecker.net’s version, which is the only other available isiZulu spellchecker: on a random selection of actual pieces of text, our tool obtained 91.71% lexical recall for error detection, whereas spellchecker.net’s version got to 82.66% on the same text. Put differently: spellchecker.net flagged about twice as many words as incorrect as ours did (so there wasn’t much point in comparing error corrections).
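The ‘about twice as many’ remark follows from the two recall figures if one assumes, as is common in spellchecker evaluation, that lexical recall is the share of correctly spelled words that the checker accepts. Under that assumption the remainder is the share of valid words wrongly flagged, which a few lines of Python can check. The function names below are made up for this sketch.

```python
def lexical_recall(gold_valid, accepted):
    """Share of correctly spelled words that the checker accepts.

    gold_valid[i] says whether word i is really a valid word;
    accepted[i] says whether the checker let it pass.
    Assumed definition, for illustration only.
    """
    outcomes = [a for g, a in zip(gold_valid, accepted) if g]
    if not outcomes:
        raise ValueError("no correctly spelled words in the sample")
    return sum(1 for a in outcomes if a) / len(outcomes)


def false_flag_ratio(recall_a_pct, recall_b_pct):
    """How many times more valid words checker B flags than checker A."""
    return (100.0 - recall_b_pct) / (100.0 - recall_a_pct)


# Figures from the comparison above: 91.71% (ours) vs 82.66% (spellchecker.net).
print(round(false_flag_ratio(91.71, 82.66), 2))  # prints 2.09
```

That is, 17.34% versus 8.29% of the valid words were wrongly flagged, a factor of roughly 2.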

Finally, because all the algorithms are essentially language-independent (ok, there’s an underlying assumption of using them for highly agglutinative languages), we fed the algorithms a large isiXhosa corpus that is being developed as part of another project, and incorporated that into the spellchecker. There’s room for some fine-tuning especially for the corrector, but at least now there is one, thanks to Norman Pilusa’s software development contributions. That we thought we could get away with this approach is thanks to Nthabiseng Mashiane’s 2017 CS honours project, which showed that the results would be fairly good (>80% error detection) with more data. We also tried a rules-based approach for isiXhosa. It obtained better accuracies than the statistical language model of Nthabiseng, but only for those parts of speech covered by the rules, which is a subset of all types of words. If you’re interested in those rules, please check out Siseko Neti’s 2017 CS Honours project. To the best of my knowledge, it’s the first time those rules have been formally represented in a computer-usable format and they may be useful for other endeavours, such as morphological analysers.
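To give a rough idea of why such a data-driven approach carries over from one language to another, here is a toy Python sketch of character-trigram error detection: collect the trigrams seen in a corpus and flag a word when it contains a trigram the corpus (almost) never shows. This is an assumption-laden simplification and not the actual algorithm or thresholds in the tool; the function names, the boundary markers, the min_count parameter, and the three-word ‘corpus’ are all made up for illustration.

```python
from collections import Counter


def trigrams(word):
    """Character trigrams of a word, with ^ and $ marking its boundaries."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(corpus_words):
    """Count every character trigram occurring in the corpus."""
    counts = Counter()
    for word in corpus_words:
        counts.update(trigrams(word))
    return counts


def looks_misspelt(word, counts, min_count=1):
    """Flag a word if any of its trigrams is seen fewer than min_count times."""
    return any(counts[t] < min_count for t in trigrams(word))


# Toy stand-in for a large corpus; swapping in another language's corpus
# needs no change to the functions above.
model = train(["yebo", "ukuze", "khupela"])
print(looks_misspelt("yebo", model))  # False
print(looks_misspelt("eybo", model))  # True
```

Nothing in the functions refers to a particular language, which is the point: only the corpus changes.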

A section of the isiXhosa Wikipedia entry about the UN (*ukuez should be ukuze, which is among the proposed words).

Further improvements are possible, and they are being scoped for a v3 some time later. For instance, for the linguists and language scholars: what are the most common typos? What are the most commonly used words? If we knew that, it would be an easy way to boost the performance. Can we find optimisations to substitutions, insertions, and deletions similar to the one for transpositions? Should some syntax rules be added for further optimisation? These are some of the outstanding questions. If you’re interested in that or related questions, or you would like to use the algorithms in your tool, please contact me.