72010 SemWebTech lecture 10: SWLS and text processing and ontologies

Much can be said, from a philosophical perspective, about how Ontology, ontologies, and natural language interact, up to the point where different commitments lead to different features and, moreover, different limitations of a (Semantic Web) application. In this lecture on 22 Dec, however, we shall focus on the interaction of NLP and ontologies within a bio-domain from an engineering perspective.

During the bottom-up ontology development and methodologies lectures, it was already mentioned that natural language processing (NLP) can be useful for ontology development. In addition, NLP can be used as a component in an ontology-driven information system, and an NLP application can be enhanced with an ontology. Which approaches and tools suit best depends on the goal (and background) of the developers and prospective users, on the ontological commitment, and on the available resources.

Summarising the possibilities for combining natural language text with ontologies or ontology-like artifacts, we can:

  • Use ontologies to improve NLP: to enhance precision and recall of queries (including enhancing dialogue systems [1]), to sort the results of an information retrieval query over a digital library (e.g. GoPubMed [2]), or to navigate literature (which amounts to linked data [3]).
  • Use NLP to develop ontologies (TBox): mainly to search for candidate terms and relations, which is part of the suite of techniques called ‘ontology learning’ [4].
  • Use NLP to populate ontologies (ABox): e.g., document retrieval enhanced by lexicalised ontologies and biomedical text mining [5].
  • Generate natural language (NLG) from a formal language: this can be done using a template-based approach, which works quite well for English but much less so for grammatically more structured languages such as Italian [6], or with a full-fledged grammar engine, as with Attempto Controlled English and bi-directional mappings (see [7] for a discussion).
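
To make the template-based option in the last bullet concrete, here is a minimal sketch in Python. It is not the approach of [6] or of any particular tool: the axiom tuples, the two sentence patterns, and the function names are all assumptions made up for illustration.

```python
# Illustrative only: the axiom encoding and sentence patterns below are
# hypothetical and are not taken from [6] or from any existing verbaliser.

def article(noun):
    """Pick the English indefinite article with a crude first-letter rule."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def verbalise(axiom):
    """Turn a simple axiom, given as a tuple, into an English sentence.

    Supported (assumed) shapes:
      ("subclass", sub, sup)         for  sub SubClassOf sup
      ("some", sub, prop, filler)    for  sub SubClassOf (prop some filler)
    """
    kind = axiom[0]
    if kind == "subclass":
        _, sub, sup = axiom
        return f"Each {sub} is {article(sup)} {sup}."
    if kind == "some":
        _, sub, prop, filler = axiom
        return f"Each {sub} {prop} at least one {filler}."
    raise ValueError(f"no template for axiom type: {kind}")


if __name__ == "__main__":
    print(verbalise(("subclass", "enzyme", "protein")))
    print(verbalise(("some", "enzyme", "catalyses", "reaction")))
```

Even this tiny example needs a helper to choose between "a" and "an". In a language where articles, nouns, and verbs must also agree in gender and number, such surface rules multiply quickly, which is one way to read the remark that fixed templates work less well there.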

Intuitively, one may be led to think that simply taking the generic NLP or NLG tools will also do fine for the bio(medical) domain. Applications do indeed use those techniques and tools (Paul Buitelaar’s slides have examples and many references to NLP tools) but, generally, they do not suffice to obtain ‘acceptable’ results. Domain-specific peculiarities are many and wide-ranging. For instance, one has to deal with the variations of terms (scientific name, variant, common misspellings) and the grounding step (linking a term to an entity in a biological database) on the ontology-NLP preparation and instance classification side [5], characterise the question in a question answering system correctly [1], and find ways to deal with the rather long strings that denote a biological entity or concept or universal [4]. Handling some of these peculiarities actually yields better overall results than generic or other domain-specific uses of NLP tools, but it requires extra manual preparatory work and a basic understanding of the subject domain and its applications.
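
The term-variation and grounding steps mentioned above can be sketched with the Python standard library alone. This is a toy under stated assumptions, not the pipeline of [5]: the lexicon entries, the "EX:" identifiers, and the similarity cutoff are invented for illustration, and a real system would ground against an actual biological database.

```python
import difflib
import re

# Hypothetical lexicon: surface forms mapped to made-up identifiers.
# A real lexicon would be derived from an ontology or a biological database.
LEXICON = {
    "low density lipoprotein": "EX:0001",
    "ldl": "EX:0001",
    "high density lipoprotein": "EX:0002",
    "hdl": "EX:0002",
}


def normalise(term):
    """Collapse case, hyphen/underscore/slash, and whitespace variation."""
    term = term.lower()
    term = re.sub(r"[-_/]", " ", term)
    return re.sub(r"\s+", " ", term).strip()


def ground(term, cutoff=0.85):
    """Return the identifier for a term, or None if nothing is close enough.

    An exact match on the normalised form is tried first; otherwise the
    closest lexicon entry above the (assumed) cutoff is used, which catches
    simple misspellings.
    """
    key = normalise(term)
    if key in LEXICON:
        return LEXICON[key]
    close = difflib.get_close_matches(key, list(LEXICON), n=1, cutoff=cutoff)
    return LEXICON[close[0]] if close else None


if __name__ == "__main__":
    for mention in ["Low-density lipoprotein", "LDL",
                    "low densty lipoprotein", "cholesterol"]:
        print(mention, "->", ground(mention))
```

The cutoff is the design choice to watch: set too low, short abbreviations start matching each other; set too high, common misspellings are missed. This is part of the manual preparatory work that the domain demands.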

References

[1] K. Vila, A. Ferrández. Developing an Ontology for Improving Question Answering in the Agricultural Domain. In: Proceedings of MTSR’09. Springer CCIS 46, 245-256.

[2] Heiko Dietze, Dimitra Alexopoulou, Michael R. Alvers, Liliana Barrio-Alvers, Bill Andreopoulos, Andreas Doms, Joerg Hakenberg, Jan Moennich, Conrad Plake, Andreas Reischuck, Loic Royer, Thomas Waechter, Matthias Zschunke, and Michael Schroeder. GoPubMed: Exploring PubMed with Ontological Background Knowledge. In Stephen A. Krawetz, editor, Bioinformatics for Systems Biology. Humana Press, 2008.

[3] Allen H. Renear and Carole L. Palmer. Strategic Reading, Ontologies, and the Future of Scientific Publishing. Science 325 (5942), 828. [DOI: 10.1126/science.1157784] (but see also some comments on the paper)

[4] Dimitra Alexopoulou, Thomas Waechter, Laura Pickersgill, Cecilia Eyre, and Michael Schroeder. Terminologies for text-mining: an experiment in the lipoprotein metabolism domain. BMC Bioinformatics, 9(Suppl4):S2, 2008.

[5] R. Witte, T. Kappler, and C.J.O. Baker. Ontology design for biomedical text mining. In: Semantic Web: revolutionizing knowledge discovery in the life sciences, C.J.O. Baker, H. Cheung (eds), Springer: New York, 2007, pp. 281-313.

[6] M. Jarrar, C.M. Keet, and P. Dongilli. Multilingual verbalization of ORM conceptual models and axiomatized ontologies. STARLab Technical Report, Vrije Universiteit Brussels, Belgium. February 2006.

[7] R. Schwitter, K. Kaljurand, A. Cregan, C. Dolbear, G. Hart. A comparison of three controlled natural languages for OWL 1.1. Proc. of OWLED 2008 DC.

Note: references 4 and 5 are mandatory reading, and 1-3 and 6 are optional (recommended for the EMLCT students).

Lecture notes: lecture 10 – Text processing

Course website
