EMNLP’22 trip report: neuro-symbolic approaches in NLP are on the rise

The trip to the Empirical Methods in Natural Language Processing 2022 conference is certainly one I’ll remember. The conference had well over 1000 in-person attendees, who could attend what they managed of the 6 tutorials and 24 workshops on Wednesday and Thursday, and then the 175 oral presentations, 654 posters, 3 keynotes, a panel session, and 10 Birds of a Feather sessions on Friday to Sunday, all topped off with a welcome reception and a social dinner. The open-air dinner fell on the one day in the year that it rains in the desert! As for the venue: that was the ADNEC conference centre in Abu Dhabi, from 7 to 11 December.

With so many parallel sessions, it was not always easy to choose. Although I expected many presentations about just large language models (LLMs), which I’m not particularly interested in from a research perspective, it turned out to be quite possible to chart a straight road through the parallel NLP sessions with research that had added at least an information-based or a knowledge-based approach to do NLP better. Ha! NLP needs structured data, information, and knowledge to mitigate, among other things, the problems with hallucinations in natural language generation – elsewhere called “fluent bullshit” – that those LLMs suffer from. Adding a symbolic approach into the mix turned out to be a recurring theme in the conference. Some authors tried to hide a rule-based approach or were apologetic about it, so the topic is not ‘hot’ just yet, but we’ll get there. In any case, it worked much better for my one-liner intro to state that I’m into ontologies and have been branching out into NLG than to say I’m into NLG for African languages. Most people I met had heard of ontologies or knowledge graphs, whereas African languages mostly drew a blank expression.

It was hard to choose what to attend, especially on the first day, but eventually I participated in parts of the second workshop on Natural Language Generation, Evaluation, and Metrics (GEM’22), NLP for Positive Impact (NLP4PI’22), and Data Science with Human-in-the-Loop (DaSH’22), and walked into a few more poster sessions of other workshops. The main conference had 8 sessions in parallel in each timeslot; I chose the semantics one, ethics, NLG, commonsense reasoning, speech and robotics grounding, and the Birds of a Feather sessions on ethics and on code-switching. I’ve structured this post by topic rather than by type of session or actual session, however, in the following order: NLP with structured stuff, ethics, a basket with other presentations that were interesting, NLP for African languages, the two BoF sessions, and a few closing remarks. I did at least skim over the papers associated with the presentations referenced here, so any errors in discussing the works are still mine. Logistically, the links to the papers in this post are a bit iffy: about 900 EMNLP and workshop papers were already on arXiv according to the organisers, the 828 papers of the main conference are still being ingested into the ACL Anthology so their permanent URLs are not functional yet, and hence my linking practice was inconsistent and may suffer link rot. Be that as it may, let’s get to the science.

The entrance of the conference venue, ADNEC in Abu Dhabi, at the end of the first workshop and tutorials day.

NLP with at least some structured data, information, or knowledge and/or reasoning

I’ve tried to structure this section, roughly going from little addition of structured stuff to more, and then from less to more inferencing.

The first poster session I attended on the first day was the one of the NLP4PI workshop; it was supposed to last 1 hour, but after 2.5h it was still well-attended. I also passed by the adjacent machine translation (WMT’22) poster session, which also paid off. There were several posters of interest to my inclination toward knowledge engineering. Abhinav Lalwani presented a Findings paper on Logical Fallacy Detection in the NLP4PI’22 poster session, which was interesting both for the computer ethics that I have to teach and for their method: create a dataset of 2449 fallacies of 13 types that were taken from online educational resources, machine-learn templates from those sentences – what they call generating a “structure-aware model” – and then use those templates to find new ones in the wild, which in this case was on climate change claims [1]. Their dataset and code are available on GitHub. The poster presented by Lifeng Han from the University of Manchester was part of WMT’22: their aim was to see whether a generic LLM would do better or worse than smaller in-domain language models enhanced with clinical terms extracted from biomedical literature and electronic health records and from class names of (unspecified in the paper) ontologies. The smaller models win, and terms or concepts may win depending on the metric used [2].
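
As for the fallacy templates: a minimal, hypothetical sketch of the general idea of masking out repeated content so that only the ‘shape’ of the argument remains could look like the snippet below; note that the structure-aware model in [1] learns such templates rather than deriving them with a regex like this.

```python
import re

# Hypothetical sketch: mask content words that recur in the sentence, so only
# the argument's structure is left; the actual model in [1] is learned.
def to_template(sentence):
    tokens = re.findall(r"[A-Za-z']+", sentence)
    lowered = [t.lower() for t in tokens]
    # words occurring more than once tend to carry the fallacy's structure
    repeated = {w for w in lowered if lowered.count(w) > 1 and len(w) > 3}
    masks, out = {}, []
    for tok in tokens:
        key = tok.lower()
        if key in repeated:
            masks.setdefault(key, f"[MSK{len(masks) + 1}]")
            out.append(masks[key])
        else:
            out.append(tok)
    return " ".join(out)

print(to_template("Natural things are good, and this remedy is natural, so this remedy is good"))
# -> [MSK1] things are [MSK2] and [MSK3] [MSK4] is [MSK1] so [MSK3] [MSK4] is [MSK2]
```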

For the main conference, and unsurprisingly for a session called “semantics”, it wasn’t just about LLMs. The first paper was about Structured Knowledge Grounding, of which the tl;dr is that SQL tables and queries improve on the ‘state of the art’ of just GPT-3 [3]. The Reasoning Like Program Executors paper aims to fix the nonsensical numerical output of LLMs by injecting small programs/code for sound numerical reasoning – one of the reasoning types that LLMs are incapable of – and is successful at doing so [4]. And there’s a paper on using WordNet for sense retrieval in the context of word vs. context use, and on discovering that the human evaluators were less biased than the language model [5].
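
To illustrate the general ‘let code do the arithmetic’ idea behind [4] – the paper itself pre-trains the LM with program executors as teachers rather than calling out to an external evaluator – a minimal sketch could look like this, where the expression is assumed to have been produced by the model:

```python
import ast
import operator

# Deterministic evaluator for an arithmetic 'program' a model could emit,
# instead of trusting the model's own guessed number (hypothetical sketch,
# not the method of [4]).
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("not a plain arithmetic expression")
    return walk(ast.parse(expr, mode="eval").body)

# e.g. the model reads "revenue was 120, costs were 85" and emits "120 - 85"
print(safe_eval("120 - 85"))   # 35: computed, not hallucinated
```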

The commonsense reasoning session also – inevitably, I might add – had papers that combined techniques. The first paper of the session looked into the effects of injecting external knowledge (Comet) to enhance question answering, which is generally positive, and more so for smaller models [6]. I also have in my notes that they developed an ontology of knowledge types, and the paper text claims as much, but it is missing from the paper, unless they are referring to the 5 terms in its Table 6.

I also remember seeing a poster on using Abstract Meaning Representation. Yes, indeed, and there turned out to be a place for it: text style transfer, i.e., converting a piece of text from one style into another. The text-to-AMR + AMR-to-text model T-STAR beat the state of the art with a 15% increase in content preservation without a substantive loss of accuracy (3%) [7].

Moving on to rules and more or less reasoning: first, at the NLP4PI’22 poster session, there was a poster on “Towards Countering Essentialism through Social Bias Reasoning”, presented by Maarten Sap. They took a very interdisciplinary approach, mixing logic, psychology, and cognitive science to get the job done, and the whole system was entirely rule-based. The motivation was to find a way to assist content moderators by generating possible replies to counter prejudiced statements in online comments. They generated five types of replies and asked users which one they preferred. The types of generated replies include, among others, computing exceptions to the prejudice (e.g., an individual in the group who does not have that trait), attributing the trait to other groups as well, and a generic statement on tolerance. Bland seemed to work best. I tried to find the paper for details, but was unsuccessful.
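
Since I could not find the paper, the following is merely my own guess at what a small rule/template-based reply generator of this kind could look like; the reply types follow the poster, but the wording, names, and code are entirely hypothetical.

```python
# Entirely hypothetical sketch of rule-based counter-reply generation in the
# spirit of the poster: given a parsed prejudiced statement of the form
# "GROUP are TRAIT", fill in a handful of reply templates. None of the
# templates or names below come from the actual system.
REPLY_TEMPLATES = {
    "exception": "Not every {group_member} is {trait}; think of {counterexample}.",
    "other groups": "Plenty of people outside that group are {trait} too.",
    "tolerance": "People deserve to be judged as individuals, not as members of a group.",
}

def counter_replies(group, group_member, trait, counterexample):
    """Instantiate each reply type for one prejudiced statement."""
    return {kind: tpl.format(group=group, group_member=group_member,
                             trait=trait, counterexample=counterexample)
            for kind, tpl in REPLY_TEMPLATES.items()}

for kind, reply in counter_replies("students", "student", "lazy",
                                   "the ones who also work night shifts").items():
    print(f"{kind}: {reply}")
```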

The DaSH’22 presentation about WaNLI concerned the creation of a dataset and pipeline to have crowd workers and AI “collaborate” in dataset creation, with a few rules sprinkled into the mix [8]. It turns out that humans are better at revising and evaluating than at creating sentences from scratch, so the pipeline takes that into account. First, from a base set, it uses NLG to generate complement sentences, which are filtered and then reviewed and possibly revised by humans. Complement sentence generation (the AI part) involves taking a sentence pair like “5% chance that the object will be defect free” + “95% chance that the object will have defects” and then generating (with GPT-3, in this case) a candidate sentence pair such as “1% of the seats were vacant” + “99% of the seats were occupied”, using encoded versions of the principles of entailment and set complement, among the reasoning cases used.
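
As a toy illustration of that set-complement pattern – not WaNLI’s actual filter, which relies on model-based estimates of ambiguity rather than surface checks like this – one could verify that the two percentages in a drafted pair sum to 100:

```python
import re

# Toy, self-contained check (not WaNLI's code) for AI-drafted pairs that are
# supposed to instantiate the set-complement pattern: the two percentages
# should sum to 100.
def percentages(sentence):
    return [int(m) for m in re.findall(r"(\d+)\s*%", sentence)]

def looks_like_complement(premise, hypothesis):
    p, h = percentages(premise), percentages(hypothesis)
    return len(p) == 1 and len(h) == 1 and p[0] + h[0] == 100

pair = ("1% of the seats were vacant", "99% of the seats were occupied")
print(looks_like_complement(*pair))  # True: a candidate worth showing to a worker
```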

Turning up the reasoning a notch, Sean Welleck of the University of Washington gave the keynote at GEM’22. His talk consisted of two parts: on unlearning bad behaviour of LLMs, and then on an early attempt at a neuro-symbolic approach. The latter concerned connecting an LLM’s output to some logic reasoning. He chose Isabelle, of all reasoners, as a way to get it to check and verify the hallucinations (the nonsense) the LLMs spit out. I asked him why he chose a reasoner for an undecidable language, but the response was not a direct answer; it seemed that he liked the proof trace but was unaware of the undecidability issues. Maybe there’s a future for description logics reasoners here. Elsewhere, and hidden behind a paper title that mentions language models, lies ConCoRD’s relation detection for boosting the consistency of pre-trained language models, with a MAX-SAT solver in the toolbox [9].
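
A brute-force toy version of that MAX-SAT idea, with made-up beliefs and confidence weights and with the NLI relations simplified to hard constraints (ConCoRD weights them), could look as follows:

```python
from itertools import product

# Toy brute-force weighted MAX-SAT in the spirit of ConCoRD (all names and
# numbers are made up): each candidate belief gets a weight from the QA model,
# an NLI model supplies contradiction constraints between beliefs, and we keep
# the truth assignment with the highest total weight that violates none of them.
beliefs = {                           # belief -> model confidence
    "A sparrow is a bird": 0.9,
    "A sparrow has fur": 0.6,
    "A sparrow has feathers": 0.7,
}
contradictions = [("A sparrow is a bird", "A sparrow has fur")]

names = list(beliefs)
best_score, best_assignment = -1.0, None
for values in product([True, False], repeat=len(names)):
    assignment = dict(zip(names, values))
    if any(assignment[a] and assignment[b] for a, b in contradictions):
        continue                      # violates an NLI-detected constraint
    score = sum(w for b, w in beliefs.items() if assignment[b])
    if score > best_score:
        best_score, best_assignment = score, assignment

print(best_assignment)                # keeps 'bird' and 'feathers', drops 'fur'
```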

Impression of the NLP4PI’22 poster session 2.5h into the 1h session timeslot.

There are (many?) more relevant presentations that I did not get around to attending, such as on dynamic hierarchical reasoning that uses both an LM and a knowledge graph for question answering [10], GraphQ IR, a unified intermediate representation for graph query languages [11], the finding that RoBERTa, T5, and GPT-3 have problems especially with deductive reasoning involving negation [12], and PLOG, a table-to-logic model to enhance table-to-text generation. Open the conference program handbook and search for things like “commonsense reasoning” or NLI, where the I is an abbreviation of Inference rather than Interface, and there’s even neural-symbolic inference for graph parsing. The compound term “knowledge graph” has 84 mentions and “reasoning” has 244. There are also four papers with “grammar induction”, two papers with CFGs, and one with a construction grammar.

It was a pleasant surprise not to be entirely swamped by the “stats/NN + automated metric” formula. I fancy thinking it’s an indication that the frontiers of NLP research have already grown out of that and are adding knowledge into the mix.

Ethics and computational social science

Of course, the previously mentioned topic of trying to fix hallucinations and issues with reasoning and logical coherence of what the language models spit out implies researchers know there’s a problem that needs to be addressed. That is a general issue. Specific ones are unique in their own way; I’ll mention three. Inna Lin presented work on gendered mental health stigma and potential downstream issues with health chatbots that would rely on such language models [13]. For instance, women were more likely to be recommended to seek professional help and men to toughen up and get on with it. The GeoMLAMA dataset showed that not everything is as bad as one might suspect. The dataset was created to probe multilingual pre-trained language models on cultural commonsense knowledge, like which colour a bride’s dress typically is. The authors selected English, Chinese, Hindi, Persian, and Swahili. Evaluation showed that multilingual PLMs are not biased toward the USA, that the native language of a country may not be the best language to probe its knowledge (as the commonsense isn’t explicitly stated), and that a language may better probe knowledge about a non-native country than about its native country [14]. The third paper is more about a mechanism to help NLP ethics: modelling information change in science communication. The scientist or the press release says one thing, which gets altered slightly in a popular science article, and then morphs into tweets and toots with yet another, different, message. More distortion occurs in the step from popsci article to tweet than from scientist to popsci article. The sort of distortion, or ‘not as faithful as one would like’? Notably, “Journalists tend to downplay the certainty and strength of findings from abstracts” and “limitations are more likely to be exaggerated and overstated” [15].

In contrast, Fatemehsadat Mireshghallah showed some general ethical issues with the very LLMs in her lively presentation. They are so large and have so many parameters that what they end up doing is more akin to memorising text and outputting that memorised text, rather than outputting de novo generated text [16]. She focussed on potential privacy issues, where such models may output sensitive personal data. It also applies to copyright infringement issues: if they return a chunk of already existing text, say, a paragraph from this blog, that would be copyright infringement, since I hold the copyright on it by default and I made it CC-BY-NC-SA, which those large LLMs do not adhere to, and they don’t credit me. Copilot is already facing a class action lawsuit for unfairly reusing open source code without having obtained permission. In both cases, there’s the question, or task, of removing pieces of text and retraining the model, or not, as well as how to know whether your text was used to create the model. I recall seeing something about that in the presentations and we had some lively discussions about it as well, leaning toward a remove & re-train and suspecting that’s not what’s happening now (except at IBM, apparently).
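
As a crude illustration of the ‘returning chunks of existing text’ concern – my own toy check, not anything from [16] – one could flag generated text that shares a long verbatim word n-gram with a known source:

```python
# Toy check: flag a generated passage if it shares a long verbatim word n-gram
# (here an 8-gram) with a known source text. This is only an illustration of
# the memorisation concern, not a method from the paper.
def ngrams(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_memorised(generated, source, n=8):
    return len(ngrams(generated, n) & ngrams(source, n)) > 0

blog_paragraph = "..."   # e.g. a paragraph from this blog
model_output = "..."     # text returned by some large language model
print(looks_memorised(model_output, blog_paragraph))
```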

Last, but not least, on this theme: the keynote by Gary Marcus turned out to be a pre-recorded one. It was mostly a popsci talk (see also his recent writings here, among others) on the dangers of those large language models, with plenty of examples of problems with them that have been posted widely recently.

Noteworthy “other” topics

The ‘other’ category in ontologies may be dubious, but here it is not meant as such – I just didn’t have enough material or time to write more about them in this post, but they deserved a mention nonetheless.

The opening keynote of the EMNLP’22 conference by Neil Cohn was great. His main research is in visual languages, those in comic books in particular. He raised some difficult-to-answer questions and topics. For instance, is language multimodal – vocal, body, graphic – and are gestures separate from, alongside, or part of language? Or take the idea of abstract cognitive principles as a basis for both visual and textual language, the hypothesis of “true universals” that should span across modalities, and the idea of “conceptual permeability”, on whether the framing in one modality of communication affects the others. He also talked about cross-cultural diversity in those structures of visual languages, of comic books at least. It almost deserves to be in the “NLP + symbolic” section above, for the grammar he showed and the attempt to add theory into the mix, rather than just more LLMs and automated evaluation scores.

The other DaSH paper that I enjoyed, after the aforementioned WaNLI, was the Cheater’s Bowl, where the authors tried to figure out how humans cheat in online quizzes [17]. Compared to automated open-domain question answering, humans use fewer keywords more effectively, use more world knowledge to narrow searches, use dynamic refinement and abandonment of search chains, have multiple search chains, and do answer validation. Also on the workshop days, I somehow walked into a poster session of the BlackboxNLP’22 workshop on analysing and interpreting neural networks for NLP. Priyanka Sukumaran enthusiastically talked about her research on how LSTMs handle grammatical gender [18]. They wanted to know whereabouts in the LSTM a certain grammatical feature is dealt with; and they found out, at least for gender agreement in French. The ‘knowledge’ is encoded in just a few nodes, and the model does better on longer than on shorter sentences, since it can then use more cues in the sentence, including gendered articles, to figure out the M/F needed for constructions like noun-adjective agreement. That is not quite the way humans do it, but then, algorithms do not need to copy human cognitive processes.

NLP4PI’s keynote was given by Preslav Nakov, who recently moved to the Mohamed Bin Zayed University of AI. He gave an interesting talk about fake news and mis- and disinformation detection, and also differentiated it from propaganda detection, which, in turn, consists of emotion and logical fallacy detection. If I remember correctly, not with knowledge-based approaches either, but interesting nonetheless.

I had more papers marked for follow-up, including on text generation evaluation [19], but this post is becoming very long as it is.

Papers with African languages, and Niger-Congo B (‘Bantu’) languages in particular

Last, but not least, something on African languages. There were a few papers. Some had it clearly in the title, others not at all but used at least one African language in their dataset. The list here is thus incomplete and merely reflects what I came across.

On the first day, as part of NLP4PI, there was also a poster on participatory translations of Oshiwambo, a language spoken in Namibia, presented by Jenalea Rajab from Wits and Millicent Ochieng from Microsoft Kenya, both with the Masakhane initiative; the associated paper seems to have been presented at the ICLR 2022 Workshop on AfricaNLP. Also within the Masakhane project is the progress on named entity recognition [20]. My UCT colleague Jan Buys also had papers with poster presentations, together with two of his students, Khalid Elmadani and Francois Meyer. One was part of WMT’22, on multilingual machine translation for southern African languages [21], and another was on subword segmentation for Nguni languages (EMNLP Findings) [22]. The authors of AfroLID report some 96% accuracy on identifying a whopping 517 African languages, which sounds very impressive [23].

Birds of a Feather sessions

The BoF sessions seemed to be loosely organised discussions and exchanges of ideas about a specific topic. I tried out the Ethics and NLP one, organised by Fatemehsadat Mireshghallah, Luciana Benotti, and Patrick Blackburn, and the code-switching & multilinguality one, organised by Genta Winata, Marina Zhukova, and Sudipta Kar. Both sessions were very lively and constructive, and I can recommend going to at least one of them the next time you attend EMNLP, or organising something like it at a conference. The former had specific questions for discussion, such as on the reviewing process and on that required ethics paragraph; the latter had themes, including datasets and models for code-switching and metrics for evaluation. For ethics, there seems to be a direction to head toward, whereas NLP for code-switching seems to be still very much in its infancy.

Final remarks

As if all that wasn’t keeping me busy already, there were lots of interesting conversations, meeting people I hadn’t seen in many years, including Barbara Plank, who finished her undergraduate studies at FUB when I was a PhD student there (focussing on ontologies rather, which I still do), and likewise Luciana Benotti (who had started her European Masters at that time, also at FUB); people with whom I had emailed before but not met due to the pandemic; and new introductions. There was a reception and an open-air social dinner; an evening off meeting an old flatmate from my first degree and a soccer watch party seeing Argentina win; and half a day off after the conference to bridge the wait for the bus, which time I used to visit the mosque (it doubles as a worthwhile tourist attraction), chat with other attendees hanging around for their evening travels, and start writing this post.

Will I go to another EMNLP? Perhaps. Attendance was most definitely very useful, I do have some relevant research outputs, and there’s cookie dough and buns in the oven, but I’d first need a few new bucketloads of funding to be able to pay for the very high registration cost that comes on top of the ever-increasing travel expenses. EMNLP’23 will be in Singapore.

References

[1] Zhijing Jin, Abhinav Lalwani, Tejas Vaidhya, Xiaoyu Shen, Yiwen Ding, Zhiheng Lyu, Mrinmaya Sachan, Rada Mihalcea, Bernhard Schölkopf. Logical Fallacy Detection. EMNLP’22 Findings.

[2] L. Han, G. Erofeev, I. Sorokina, S. Gladkoff and G. Nenadic. Examining Large Pre-Trained Language Models for Machine Translation: What You Don’t Know About It. 7th Conference on Machine Translation (WMT’22) at EMNLP’22.

[3] Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I. Wang, Victor Zhong, Bailin Wang, Chengzu Li, Connor Boyle, Ansong Ni, Ziyu Yao, Dragomir Radev, Caiming Xiong, Lingpeng Kong, Rui Zhang, Noah A. Smith, Luke Zettlemoyer and Tao Yu. UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models. EMNLP’22.

[4] Xinyu Pi, Qian Liu, Bei Chen, Morteza Ziyadi, Zeqi Lin, Qiang Fu, Yan Gao, Jian-Guang LOU and Weizhu Chen. Reasoning Like Program Executors. EMNLP’22

[5] Qianchu Liu, Diana McCarthy and Anna Korhonen. Measuring Context-Word Biases in Lexical Semantic Datasets. EMNLP’22

[6] Yash Kumar Lal, Niket Tandon, Tanvi Aggarwal, Horace Liu, Nathanael Chambers, Raymond Mooney and Niranjan Balasubramanian. Using Commonsense Knowledge to Answer Why-Questions. EMNLP’22

[7] Anubhav Jangra, Preksha Nema and Aravindan Raghuveer. T-STAR: Truthful Style Transfer using AMR Graph as Intermediate Representation. EMNLP’22

[8] A. Liu, S. Swayamdipta, N. A. Smith and Y. Choi. WaNLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation. DaSH’22 at EMNLP’22.

[9] Eric Mitchell, Joseph Noh, Siyan Li, Will Armstrong, Ananth Agarwal, Patrick Liu, Chelsea Finn and Christopher Manning. Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference. EMNLP’22

[10] Miao Zhang, Rufeng Dai, Ming Dong and Tingting He. DRLK: Dynamic Hierarchical Reasoning with Language Model and Knowledge Graph for Question Answering. EMNLP’22

[11] Lunyiu Nie, Shulin Cao, Jiaxin Shi, Jiuding Sun, Qi Tian, Lei Hou, Juanzi Li and Jidong Zhai. GraphQ IR: Unifying the Semantic Parsing of Graph Query Languages with One Intermediate Representation. EMNLP’22

[12] Soumya Sanyal, Zeyi Liao and Xiang Ren. RobustLR: A Diagnostic Benchmark for Evaluating Logical Robustness of Deductive Reasoners. EMNLP’22

[13] Inna Lin, Lucille Njoo, Anjalie Field, Ashish Sharma, Katharina Reinecke, Tim Althoff and Yulia Tsvetkov. Gendered Mental Health Stigma in Masked Language Models. EMNLP’22

[14] Da Yin, Hritik Bansal, Masoud Monajatipoor, Liunian Harold Li, Kai-Wei Chang. Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models. EMNLP’22

[15] Dustin Wright, Jiaxin Pei, David Jurgens, Isabelle Augenstein. Modeling Information Change in Science Communication with Semantically Matched Paraphrases. EMNLP’22

[16] Fatemehsadat Mireshghallah, Archit Uniyal, Tianhao Wang, David Evans and Taylor Berg-Kirkpatrick. An Empirical Analysis of Memorization in Fine-tuned Autoregressive Language Models. EMNLP’22

[17] Cheater’s Bowl: Human vs. Computer Search Strategies for Open-Domain QA. DaSH’22 at EMNLP’22.

[18] Priyanka Sukumaran, Conor Houghton and Nina Kazanina. Do LSTMs See Gender? Probing the Ability of LSTMs to Learn Abstract Syntactic Rules. BlackboxNLP’22 at EMNLP’22, 7-11 Dec 2022, Abu Dhabi, UAE. arXiv:2211.00153

[19] Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji and Jiawei Han. Towards a Unified Multi-Dimensional Evaluator for Text Generation. EMNLP’22

[20] David Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson KALIPE, Derguene Mbaye, Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu, Perez Ogayo, Catherine Gitau, Edwin Munkoh-Buabeng, victoire Memdjokam Koagne, Allahsera Auguste Tapo, Tebogo Macucwa, Vukosi Marivate, MBONING TCHIAZE Elvis, Tajuddeen Gwadabe, Tosin Adewumi, Orevaoghene Ahia and Joyce Nakatumba-Nabende. MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition. EMNLP’22

[21] Khalid Elmadani, Francois Meyer and Jan Buys. University of Cape Town’s WMT22 System: Multilingual Machine Translation for Southern African Languages. WMT’22 at EMNLP’22.

[22] Francois Meyer and Jan Buys. Subword Segmental Language Modelling for Nguni Languages. EMNLP’22 Findings.

[23] Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed and Alcides Inciarte. AfroLID: A Neural Language Identification Tool for African Languages. EMNLP’22