Posts Tagged ‘conceptual modeling’

Book chapter on conceptual data modelling for biological data

My invited book chapter, entitled “Ontology-driven formal conceptual data modeling for biological data analysis” [1], recently got accepted for publication in the Biological Knowledge Discovery Handbook: Preprocessing, Mining and Postprocessing of Biological Data, edited by Mourad Elloumi and Albert Y. Zomaya, and is scheduled for printing by Wiley early 2012.

All this started off with my BSc(Hons) in IT & Computing thesis back in 2003 and my first paper about the trials and tribulations of conceptual data modelling for bio-databases [2] (which is definitely not well-written, but has some valid points and has been cited a bit). In the meantime, much progress has been made on the topic, and I’ve learned, researched, and published a few things about it, too. So, what is the chapter about?

The main aspect is the ‘conceptual data modelling’ with EER, ORM, and UML Class Diagrams, i.e., concerning implementation-independent representations of the data to be managed for a specific application (hence, not ontologies for application-independence).

The adjective ‘formal’ is to point out that the conceptual modeling is not just about drawing boxes, roundtangles, and lines with some adornments, but there is a formal, logic-based, foundation. This is achieved with the formally defined CMcom conceptual data modeling language, which has the greatest common denominator between ORM, EER, and UML Class Diagrams. CMcom has, on the one hand, a mapping the Description Logic language DLRifd and, on the other hand, mappings to the icons in the diagrammatic languages. The nice aspect of this it that, at least in theory and to some extent in practice as well, one can subject it to automated reasoning to check consistency of the classes, of the whole conceptual data model, and derive implicit constraints (an example) or use it in ontology-based data access (an example and some slides on ‘COMODA’ [COnceptual MOdel-based Data Access], tailored to ORM and the horizontal gene transfer database as example).

Then there is the ‘ontology-driven’ component: Ontology and ontologies can aid in conceptual data modeling by providing solution to recurring modeling problems, an ontology can be used to generate several conceptual data models, and one can integrate (a section of) an ontology into a conceptual data model that is subsequently converted into data in database tables.

Last, but not least, it focuses on ‘biological data analysis’. A non-(biologist or bioinformatician) might be inclined to say that should not matter, but it does. Biological information is not as trivial as the typical database design toy examples like “Student is enrolled in Course”, but one has to dig deeper and figure out how to represent, e.g., catalysis, pathway information, the ecological niche. Moreover, it requires an answer to ‘which language features are ‘essential’ for the conceptual data modeling language?’ and if it isn’t included yet, how do we get it in? Some of such important features are n-aries (n>2) and the temporal dimension. The paper includes a proposal for more precisely representing catalysis, informed by ontology (mainly thanks to making the distinction between the role and its bearer), and shows how certain temporal information can be captured, which is illustrated by enhancing the model for SARS viral infection, among other examples.

The paper is not online yet, but I did put together some slides for the presentation at MAIS’11 reported on earlier, which might serve as a sneak preview of the 25-page book chapter, or you can contact me for the CRC.

References

[1] Keet, C.M. Ontology-driven formal conceptual data modeling for biological data analysis. In: Biological Knowledge Discovery Handbook: Preprocessing, Mining and Postprocessing of Biological Data. Mourad Elloumi and Albert Y. Zomaya (Eds.). Wiley (in print).

[2] Keet, C.M. Biological data and conceptual modelling methods. Journal of Conceptual Modeling, Issue 29, October 2003. http://www.inconcept.com/jcm.

Building bias into your database

For developing bio-ontologies, if one follows Barry Smith and cs., then one is solely concerned with the representation of reality; moreover, it has been noted that ontologies can, or should be, seen as a representation of a scientific theory [1] or at least that they are an important part of doing science [2]. In that case, life is easy, not hard, for we have the established method of scientific inquiry to settle disputes (among others, by doing additional lab experiments to figure out more about reality). Domain- and application ontologies, as well as conceptual data models, for the enterprise universe of discourse require, at times, a consensus-based approach where some parts of the represented information are the outcome of negotiations and agreements among the stakeholders.

Going one step further on the sliding scale: for databases and application software for the humanities, and conflict databases in particular, one makes an ontology or conceptual data model conforming to one’s own (or the funding organisation’s) political convictions and with the desired conclusions in mind. Building data vaults seems to be the intended norm rather than the exception, hence, maintenance and usage and data analysis beyond the developers limited intentions, let alone integration, are a nightmare.

In this post, I will outline some suggestions for building your own politicized representation—be it an ontology or conceptual data model—for armed conflict data, such as terrorist incidents, civil war, and inter-state war. I will discuss in the next post a few examples of conflict data analysis, both regarding extant databases and the ‘dirty war index’ application built on top of them. A later post may deal with a solution to the problems, but for now, it would already be a great help not to adhere to the tips below.

Tips for biasing the representation

In random order, you could do any of the following to pollute the model and hamper data analysis so as to ensure your data is scientifically unreliable but suitable to serve your political agenda.

1. Have a fairly flat taxonomy of types of parties; in fact, just two subtypes suffice: US and THEM, although one could subtype the latter into ‘they’, ‘with them’, and ‘for them’. The analogue, with ‘we’, ‘with us’, and ‘for us’ is too risky for potential of contagion of responsibility of atrocities and therefore not advisable to include; if you want to record any of it, then it is better to introduce types such as ‘unknown perpetrator’ or ‘not officially claimed event’ or ‘independent actor’.

2. Aggregate creatively. For instance, if some of the funding for your database comes from a building construction or civil engineering company, refine that section of target types, or include new target types only when you feel like it is targeted sufficiently often by the opponent to warrant a whole new tuple or table from then onwards. Likewise, some funding agencies would like to see a more detailed breakdown of types of victims by types of violence, some don’t. Last, be careful with the typology of arms used, in particular when your country is producing them; a category like ‘DIY explosive device’ helps masking the producer.

3. Under-/over-represent geography. Play with granularity (by city/village, region, country, continent) and categorization criteria (state borders, language, former chiefdoms, parishes, and so forth), e.g., include (or not) notions such as ‘occupied territory’ (related to the actors) and `liberated region’ or `autonomous zone’, or that an area may, or may not, be categorized or named differently at the same time. Above all, make the modelling decisions in an inconsistent way, so that no single dimension can be analysed properly.

4. Make an a-temporal model and pretend not to change it, but (a) allow non-traceable object migration so that defecting parties who used to be with US (see point 1) can be safely re-categorised as THEM, and (b) refine the hierarchy over time anyway so as to generate time-inconsistency for target types (see point 2) and geography (see point 3), in order to avoid time series analyses and prevent discovering possible patterns.

5. Have a minimal amount of classes for bibliographic information, lest someone would want to verify the primary/secondary sources that report on numbers of casualties and discovers you only included media reports from the government-censored newspapers (or the proxy-funding agency, or the rebel radio station, or the guerrilla pamphlets).

6. Keep natural language definitions for key concepts in a separate file, if recorded at all. This allows for time-inconsistency in operational definitions as well as ignorance of the data entry clerks so that each one can have his own ideas about where in the database the conflict data should go.

7. Minimize the use of database integrity constraints, hence, minimize representing constraints in the ontology to begin with, hence, use a very simple modelling language so you can blame the language for not representing the subject domain adequately.

I’m not saying all conflict databases use all of these tricks; but some use at least most of them, which ruins credibility of those database of which the analysts actually did try to avoid these pitfalls (assuming there are such databases, that is). Optimism wants me to believe developers did not think of all those issues when designing the database. However, there is a tendency that each conflict researcher compiles his own data set and that each database is built from scratch.

For the current scope, I will set aside the problems with data collection and how to arrive at guesstimated semi-reliable approximations of deaths, severe injuries, rape, torture victims and so forth (see e.g. [3] and appendix B of [4]). Inherent problems with data collection is one thing and difficult to fix, bad modelling and dubious or partial data analysis is a whole different thing and doable to fix. I elaborate on latter claim in the next post.

[1] Barry Smith. Ontology (Science). In: C. Eschenbach and M. Gruninger (eds.), Formal Ontology in Information Systems. Proceedings of FOIS 2008. preprint

[2] Keet, C.M. Factors affecting ontology development in ecology. Data Integration in the Life Sciences 2005 (DILS’05), Ludaescher, B, Raschid, L. (eds.). San Diego, USA, 20-22 July 2005. Lecture Notes in Bioinformatics LNBI 3615, Springer Verlag, 2005. pp46-62.

[3] Taback N (2008 ) The Dirty War Index: Statistical issues, feasibility, and interpretation. PLoS Med 5(12): e248. doi:10.1371/journal.pmed.0050248.

[4] Weinstein, Jeremy M. (2007). Inside rebellion—the politics of insurgent violence. Cambridge University Press. 402p.

Follow

Get every new post delivered to your Inbox.

Join 25 other followers

%d bloggers like this: