Reblogging 2009: Building bias into your database

From the “10 years of keetblog – reblogging: 2009”: The tl;dr of it: bad data management -> bad policy decisions, and how you can embed political preferences and prejudices in a conceptual data model.

While the post has a computing flavor to it, especially on the database design and a touch of ontologies, it is surely also of general interest, because it gives some insight into the management of data that is used for policy-making in and for conflict zones. A nicer version of this blog post and the one after it made it into a peer-reviewed article, “Dirty wars, databases, and indices”, in the Peace & Conflict Review journal (Fall 2009 issue) of the UN-mandated University for Peace in Costa Rica.

Building bias into your database; Jan 7, 2009

p.s.: while I intended to write a post on attending the ER’15 conference, the exciting times with the student protests in South Africa have put that plan on the backburner for at least a few more days.

—–

For developing bio-ontologies, if one follows Barry Smith and colleagues, then one is solely concerned with the representation of reality; moreover, it has been noted that ontologies can, or should, be seen as a representation of a scientific theory [1], or at least as an important part of doing science [2]. In that case, life is easy, for we have the established method of scientific inquiry to settle disputes (among others, by doing additional lab experiments to find out more about reality). Domain and application ontologies, as well as conceptual data models for the enterprise universe of discourse, require, at times, a consensus-based approach, where some parts of the represented information are the outcome of negotiations and agreements among the stakeholders.

Going one step further on the sliding scale: for databases and application software for the humanities, and conflict databases in particular, one makes an ontology or conceptual data model that conforms to one’s own (or the funding organisation’s) political convictions, with the desired conclusions in mind. Building data vaults seems to be the intended norm rather than the exception; hence, maintenance, usage, and data analysis beyond the developers’ limited intentions, let alone integration, are a nightmare.

In this post, I will outline some suggestions for building your own politicized representation—be it an ontology or a conceptual data model—for armed conflict data, such as terrorist incidents, civil war, and inter-state war. In the next post, I will discuss a few examples of conflict data analysis, regarding both extant databases and the ‘dirty war index’ application built on top of them. A later post may deal with a solution to the problems, but for now, it would already be a great help not to adhere to the tips below.

Tips for biasing the representation

In random order, you could do any of the following to pollute the model and hamper data analysis, so as to ensure your data is scientifically unreliable yet suitable for serving your political agenda.

1. Have a fairly flat taxonomy of types of parties; in fact, just two subtypes suffice: US and THEM, although one could subtype the latter into ‘they’, ‘with them’, and ‘for them’. The analogue, with ‘we’, ‘with us’, and ‘for us’, is too risky because of the potential contagion of responsibility for atrocities, and is therefore not advisable to include; if you want to record any of it, then it is better to introduce types such as ‘unknown perpetrator’, ‘not officially claimed event’, or ‘independent actor’.

2. Aggregate creatively. For instance, if some of the funding for your database comes from a building construction or civil engineering company, refine that section of target types, or include new target types only when you feel a target is hit sufficiently often by the opponent to warrant a whole new tuple or table from then onwards. Likewise, some funding agencies would like to see a more detailed breakdown of types of victims by types of violence; some don’t. Last, be careful with the typology of arms used, in particular when your country produces them; a category like ‘DIY explosive device’ helps mask the producer.

3. Under-/over-represent geography. Play with granularity (by city/village, region, country, continent) and categorization criteria (state borders, language, former chiefdoms, parishes, and so forth), e.g., include (or not) notions such as ‘occupied territory’ (related to the actors) and ‘liberated region’ or ‘autonomous zone’, or allow that an area may, or may not, be categorized or named differently at the same time. Above all, make these modelling decisions in an inconsistent way, so that no single dimension can be analysed properly.

4. Make an a-temporal model and pretend not to change it, but (a) allow non-traceable object migration, so that defecting parties who used to be with US (see point 1) can be safely re-categorised as THEM, and (b) refine the hierarchy over time anyway, so as to generate time-inconsistency for target types (see point 2) and geography (see point 3), thereby thwarting time series analyses and preventing the discovery of possible patterns.

5. Have a minimal number of classes for bibliographic information, lest someone want to verify the primary and secondary sources that report the numbers of casualties and discover that you only included media reports from government-censored newspapers (or the proxy-funding agency, or the rebel radio station, or the guerrilla pamphlets).

6. Keep natural language definitions for key concepts in a separate file, if they are recorded at all. This allows for time-inconsistency in operational definitions, as well as for ignorance on the part of the data entry clerks, so that each can have their own ideas about where in the database the conflict data should go.

7. Minimize the use of database integrity constraints; hence, minimize representing constraints in the ontology to begin with; hence, use a very simple modelling language, so that you can blame the language for not representing the subject domain adequately.
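To make that last tip concrete, here is a minimal sketch, with a hypothetical two-table schema and Python’s built-in sqlite3, of what a constraint-free design silently accepts and a constrained one rejects.

```python
# Minimal sketch of tip 7 (hypothetical schema, Python's built-in sqlite3):
# the constraint-free table accepts nonsense that the constrained one rejects.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")   # SQLite leaves these checks off by default

# The 'biased' way: no constraints, so anything goes
db.execute("CREATE TABLE incident_loose (actor TEXT, n_killed INTEGER)")
db.execute("INSERT INTO incident_loose VALUES (NULL, -50)")  # accepted silently

# With integrity constraints, the same record is refused
db.execute("CREATE TABLE actor (name TEXT PRIMARY KEY)")
db.execute("""CREATE TABLE incident (
                  actor    TEXT NOT NULL REFERENCES actor(name),
                  n_killed INTEGER CHECK (n_killed >= 0))""")
try:
    db.execute("INSERT INTO incident VALUES (NULL, -50)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)   # e.g., NOT NULL constraint failed: incident.actor
```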

I’m not saying all conflict databases use all of these tricks, but some use at least most of them, which ruins the credibility of those databases whose analysts actually did try to avoid the pitfalls (assuming such databases exist, that is). Optimism makes me want to believe that the developers simply did not think of all these issues when designing their databases. However, there is a tendency for each conflict researcher to compile his own data set and for each database to be built from scratch.

For the current scope, I will set aside the problems with data collection and how to arrive at guesstimated, semi-reliable approximations of deaths, severe injuries, rape, torture victims, and so forth (see e.g. [3] and appendix B of [4]). The inherent problems with data collection are one thing, and difficult to fix; bad modelling and dubious or partial data analysis are a whole different thing, and doable to fix. I elaborate on the latter claim in the next post.

References

[1] Barry Smith. Ontology (Science). In: C. Eschenbach and M. Gruninger (eds.), Formal Ontology in Information Systems: Proceedings of FOIS 2008. (preprint)

[2] Keet, C.M. Factors affecting ontology development in ecology. In: Data Integration in the Life Sciences 2005 (DILS’05), Ludäscher, B., Raschid, L. (eds.). San Diego, USA, 20-22 July 2005. Lecture Notes in Bioinformatics LNBI 3615, Springer Verlag, 2005. pp. 46-62.

[3] Taback, N. (2008). The Dirty War Index: Statistical issues, feasibility, and interpretation. PLoS Med 5(12): e248. doi:10.1371/journal.pmed.0050248.

[4] Weinstein, Jeremy M. (2007). Inside Rebellion: The Politics of Insurgent Violence. Cambridge University Press. 402p.


Reblogging 2009: A collection of parameters for ontology design

From the “10 years of keetblog – reblogging: 2009”: How the paper introduced in this post came about is a story of its own (it was in the context of finding suitable ontologies for testing Ontology-Based Data Access systems). The short MTSR’09 paper that the post introduces was extended into a journal paper published in IJSMO in 2010.

A collection of parameters for ontology design; June 1, 2009

——-

Ontology design is still more of an art than a science. A methodology, Methontology, does exist, but it does not cover all aspects of ontology development. Likewise, there are tools, such as Protégé and the NeOn toolkit, that make several steps of the whole procedure easier. But with the plethora of resources around: where should one start developing one’s own domain ontology, what resources are available for reuse to speed up its development, and for which purposes can the ontology be developed?

The novice ontology engineer would have to go through much of the extant literature, read case studies, draw their own conclusions on how to go about developing the ontology, and/or attend ontology engineering courses or summer schools, which is a rather high start-up cost.

To ameliorate this, but also to save myself from repeating such information informally, I tried to condense that information into, effectively, four Springer-size pages [pdf] (plus a one-page introduction and one page of references) [1]. The paper contains a grouping of input parameters that determine the effectiveness of ontology development and use, categorised along four dimensions: purpose, ontology reuse, ways of ontology learning, and the language and reasoning services.

The aim was to be brief, so while the list of parameters is long, the list of references is comparatively short—but the references are kept diverse and they do cover the different paradigms around instead of just one. (A version with lots of references is in the making.)

The paper has several examples taken from the agriculture domain, building upon experiences gained in previous and current projects and related literature. It is noteworthy, however, that the development of agri-ontologies is in its infancy. Further, for a relatively seasoned ontology engineer most, if not all, of the parameters may already be known to a greater or lesser extent, but from the intended audience’s perspective the paper was deemed to be a timely, much needed, and useful overview. My impression is that those reviewers’ comments say more about the knowledge transfer—well, the lack thereof—from one discipline to another than about the modellers and domain experts.

For those of you who are interested in agri-ontologies and would like to know more about the latest developments in that area, there is the (third) special track on agriculture, food and the environment during MTSR’09 in Milan, 1-2 October.

References
[1] Keet, C.M. Ontology design parameters for aligning agri-informatics with the Semantic Web. 3rd International Conference on Metadata and Semantics (MTSR’09) — Special Track on Agriculture, Food & Environment, Oct 1-2 2009, Milan, Italy. Springer CCIS. To appear.

SA ICPC Regionals 2013 problem analysis

Our 2015 Southern Africa ICPC Regionals is coming up, and we have been using some of the 2013 SA problems for training purposes, as well as a teaser/taste of what’s to come on the 24th of October (registration closes on Oct 10). While the training materials are on Vula (the UCT CMS for courses), some hints on how to solve some of them may be of general interest. I’ll give a breakdown and a ‘spoiler alert’ for five of the eight problems. The problem-solving aspects and explanations in the training sessions were longer, but these short notes already give some useful starting points on where to look for implementation details.

The problems can be categorised into the following types:

  1. Isle of the birds – computational geometry
  2. Fitness training – simple ad hoc
  3. Similarity – string processing
  4. Railways – graphs
  5. Student IDs – string processing


Isle of the birds

There’s an island with trees, and the rubber band will enclose them all. That is, we need to find the polygon whose corners are the outermost points and that encloses all the other points: we need to compute a convex hull. How can that be done, and, more importantly, how can that be done efficiently? Computing the whole solution space is going to take too much time, as there can be between 3 and 15000 points. One generally useful technique to check out is the sweepline; one tailored to finding the convex hull is the Graham scan algorithm: first, take the lowest, left-most point and sort the remaining points by the angle they make with it, going counter-clockwise (points on the same line are ignored); then connect the points in that order in a stepwise fashion: whenever the last three points make a right turn instead of a left one (checked by comparing coordinates via a cross product), discard the middle point of the three and continue with the next point.
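For the implementation, here is a minimal Python sketch (the contest I/O is omitted) using Andrew’s monotone chain, a Graham-scan variant that sorts by coordinates instead of by angle, so no angles need to be computed at all.

```python
# Convex hull via Andrew's monotone chain (a Graham-scan variant that sorts
# points lexicographically rather than by polar angle); O(n log n) overall.
def cross(o, a, b):
    # z-component of (a-o) x (b-o); > 0 means a counter-clockwise (left) turn
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    pts = sorted(set(points))           # also removes duplicate points
    if len(pts) <= 2:
        return pts
    hull = []
    for seq in (pts, pts[::-1]):        # lower hull, then upper hull
        part = []
        for p in seq:
            # pop while the last two kept points and p do not make a left turn
            while len(part) >= 2 and cross(part[-2], part[-1], p) <= 0:
                part.pop()
            part.append(p)
        hull.extend(part[:-1])          # endpoint re-appears in the other half
    return hull

print(convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]))
# -> [(0, 0), (2, 0), (2, 2), (0, 2)]
```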

Only 4 teams solved this problem at the 2013 regionals (including the winning team ‘if cats programmed computers’).


Fitness training

John cycles A km and Mary runs B km, both starting and finishing at the same place on a single circular route of M km. This can be computed with a straightforward modulo operation. All 53 teams solved this problem at the 2013 regionals.
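The exact output the problem asks for is not reproduced here, but assuming it boils down to where each of them ends up on the loop, a sketch of that modulo step:

```python
# Sketch only: assumes the question reduces to the finishing positions on the
# loop (the actual 2013 statement may ask for something slightly different).
def loop_positions(a_km, b_km, m_km):
    john = a_km % m_km                       # John's position after cycling A km
    mary = b_km % m_km                       # Mary's position after running B km
    gap = abs(john - mary)
    return john, mary, min(gap, m_km - gap)  # shortest distance along the circle

print(loop_positions(25, 7, 10))             # -> (5, 7, 2) on a 10 km route
```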


Similarity

Spellchecking in an online search engine; well, given two words, what is the minimum cost of the change operations to go from word_A to word_B, given certain costs for additions, deletions, and character substitutions? Comparing strings of characters has been around for quite a while, from spellchecking, to plagiarism checkers, to DNA sequence alignments, so surely a fine algorithm exists for it already. Indeed: the minimum edit distance (Levenshtein distance) (nice explanation), where, instead of computing all possible options (very costly!), you fill in a table of subproblem costs. The ‘tricky’ part is that the basic algorithm for the minimum edit distance counts each change as a cost of 1, whereas in this problem some changes cost 2; hence, you will have to change those values in the standard algorithm (demo that lets you play with different costs).
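A minimal dynamic-programming sketch with the costs as parameters; the particular default values below (1 for insert/delete, 2 for a substitution) are assumptions for illustration, so plug in whatever the problem statement specifies.

```python
# Levenshtein distance with configurable operation costs (the standard
# table-filling algorithm; the default costs here are illustrative only).
def edit_distance(a, b, ins=1, dele=1, sub=2):
    n, m = len(a), len(b)
    # dp[i][j] = minimum cost to turn a[:i] into b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * dele
    for j in range(1, m + 1):
        dp[0][j] = j * ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]             # match: no cost
            else:
                dp[i][j] = min(dp[i - 1][j] + dele,     # delete from a
                               dp[i][j - 1] + ins,      # insert into a
                               dp[i - 1][j - 1] + sub)  # substitute
    return dp[n][m]

print(edit_distance("kitten", "sitting"))  # 2 substitutions + 1 insert = 5
```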

Only 2 teams solved this problem at the 2013 regionals (including the winning team ‘if cats programmed computers’).


Railways

Construct a railroad network between cities in the shape of a tree, but put in a bid for the second-cheapest option. So we have points and lines, that is: some graph algorithm. The two main groups are shortest path (Dijkstra, Bellman-Ford) and spanning tree algorithms. We need a minimum spanning tree (MST) to begin with, which reduces the choice of the most suitable algorithm to Prim’s or Kruskal’s. Prim’s requires a particular starting vertex, Kruskal’s doesn’t; the problem statement doesn’t give a starting vertex, hence Kruskal’s algorithm is the one of choice (example). But then how do we get the second-best spanning tree? Also in this case, many have asked before (theoretically and practically—search online for both): for each edge with weight w that is not in the MST and creates a cycle when added to it, compare w with the weight v of the heaviest other edge in that cycle; then, among all those comparisons, take the swap with the smallest difference: add the edge with weight w and remove the edge with weight v. There you have your second-best option.
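A sketch of both steps in Python: Kruskal’s algorithm with a union-find, then the edge-swap search described above. The naive path search is linear per non-tree edge, which is fine for modest input sizes.

```python
# Kruskal's MST, then the second-best tree via the single-edge swap: for each
# non-tree edge (w, u, v), replace the heaviest tree edge on the u..v path.
def kruskal(n, edges):
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x
    total, tree = 0, []
    for w, u, v in sorted(edges):            # edges as (weight, u, v) tuples
        ru, rv = find(u), find(v)
        if ru != rv:                         # no cycle, so keep the edge
            parent[ru] = rv
            total += w
            tree.append((w, u, v))
    return total, tree

def second_best(n, edges):
    best, tree = kruskal(n, edges)
    adj = {i: [] for i in range(n)}          # adjacency list of the MST
    for w, u, v in tree:
        adj[u].append((v, w))
        adj[v].append((u, w))

    def max_on_path(src, dst):
        # heaviest edge weight on the unique tree path from src to dst
        stack = [(src, -1, 0)]
        while stack:
            node, par, mx = stack.pop()
            if node == dst:
                return mx
            for nxt, w in adj[node]:
                if nxt != par:
                    stack.append((nxt, node, max(mx, w)))

    return min(best - max_on_path(u, v) + w
               for w, u, v in edges if (w, u, v) not in tree)

edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]
print(second_best(4, edges))                 # MST weight is 7; second-best is 8
```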

Only 4 teams solved this problem at the 2013 regionals (including the winning team ‘if cats programmed computers’).


Student IDs

Generate student IDs from the students’ names, following a given pattern. In itself, this is a somewhat laborious implementation task. The only real issue is keeping track of how much of the string has been processed. Here it is especially useful to design the solution separately before delving into the murky code, as it otherwise will require a lot of test cases to check the corner cases (and remember you have only one machine). A nice way to design it is to use automata, and only then to convert that into code.
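The 2013 pattern itself is not reproduced here, so the sketch below uses a made-up rule (one initial per given name, then surname letters, padded to a fixed width) purely to illustrate the phase-by-phase, automaton-style processing of the name string.

```python
# Illustrative only: a hypothetical ID rule processed in three explicit phases
# (states), mirroring the automaton-first design advice above.
def make_id(fullname, width=8):
    parts = fullname.lower().split()
    given, surname = parts[:-1], parts[-1]
    out = [p[0] for p in given]       # state 1: one initial per given name
    for ch in surname:                # state 2: consume surname letters
        if len(out) == width:
            break
        out.append(ch)
    while len(out) < width:           # state 3: pad short names
        out.append('x')
    return ''.join(out)

print(make_id("John Ronald Reuel Tolkien"))  # -> 'jrrtolki'
```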

39 teams out of the 53 solved this problem at the 2013 regionals.


Finally

Just in case you’re trying out the remaining problems and are banging your head against the wall or pulling your hair out: no team solved Street lights (Problem B; looks like a maths problem, with a floating-point complication) or Necklace (Problem G), and only 3 teams solved Matchstick maths (Problem D; ask a team member of ‘if cats programmed computers’, who solved it).