Data visualisation entertainment with htmlgraph for your website

Via a great list of novel data visualisation techniques and interfaces at data visualisation: modern approaches (that is, ways of structuring and analysing data), I stumbled upon htmlgraph. This tool analyses the usage of HTML tags on each page of a website to give an idea of how the site is structured and what kind of site it is. For instance, the difference between Yahoo (a bit messy, with old-fashioned tables) and Google (clean and simple) is very striking. The developer has many more annotated examples and has made htmlgraph available as a Java applet, so I could not resist testing it to see what came out for my different types of sites.

I have checked three fundamentally different sites: my home page (which I assume to be typical for a research-oriented one), a Joomla CMS I administer (PwoB), and this blog. The colour coding is as follows:

blue: for links (the A tag)
red: for tables (TABLE, TR and TD tags)
green: for the DIV tag
violet: for images (the IMG tag)
yellow: for forms (FORM, INPUT, TEXTAREA, SELECT and OPTION tags)
orange: for linebreaks and blockquotes (BR, P, and BLOCKQUOTE tags)
black: the HTML tag, the root node
gray: all other tags
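
The colour legend above can be written down as a simple lookup table. The following is only a minimal sketch of that mapping, not htmlgraph's own code: the names TAG_COLOURS, ColourCounter and colour_profile are made up for illustration, and the sketch merely counts tags per colour with Python's standard HTML parser, whereas the applet draws the page's tag tree as a graph.

```python
from collections import Counter
from html.parser import HTMLParser

# Colour legend as listed above; any tag not listed falls back to "gray".
TAG_COLOURS = {
    "a": "blue",
    "table": "red", "tr": "red", "td": "red",
    "div": "green",
    "img": "violet",
    "form": "yellow", "input": "yellow", "textarea": "yellow",
    "select": "yellow", "option": "yellow",
    "br": "orange", "p": "orange", "blockquote": "orange",
    "html": "black",
}


class ColourCounter(HTMLParser):
    """Counts opening tags per colour class (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # The parser hands over tag names in lower case.
        self.counts[TAG_COLOURS.get(tag, "gray")] += 1


def colour_profile(html_text):
    """Return a dict mapping each colour to the number of tags of that class."""
    parser = ColourCounter()
    parser.feed(html_text)
    parser.close()
    return dict(parser.counts)


if __name__ == "__main__":
    page = "<html><body><div><a href='#'>x</a><br></div></body></html>"
    print(colour_profile(page))
```

A link-heavy page would show a high "blue" count in such a profile, and a table-based layout a high "red" count.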

First, as expected, my home page shows the pattern of a ‘typical’ content-based site built with old-fashioned HTML coding around a central table (research page and publications), with some side topics (the IT remnants, MSc thesis, and a link to the personal pages).

structure of my homepage as generated by htmlgraph

This is in stark contrast to my blog, which seems like an amalgamation of random things. It looks like each page has its own little flower, so even if I had stuck to one topic throughout all posts, it still would look like this.

structure of this blog as generated by htmlgraph

Content management systems, on the other hand, have their own characteristic structure: the following picture shows the HTML structure of the NGO Professors without Borders website running Joomla. I also tested the EU FP6 FET project TONES website, which runs Joomla too and looks very much the same as that of Professors without Borders, i.e., the structure of Joomla sites is highly similar regardless of the topic/content of the site.

structure of the Professors without Borders website made with Joomla, as generated by htmlgraph

Last, comparing the colours of the nodes across the three types of sites, it is immediately clear that the blog is heavy in links (blue), that Joomla and the ‘researcher homepage’ (ok, perhaps just mine) are heavy in old-fashioned tables (red), and that the prefab generic structures of Joomla and WordPress use quite a lot of divs (green) as well. What remains to ponder is all those “other” (grey) HTML tags in the WordPress blog that I surely have not added to my blog posts but that are in there somewhere anyway, presumably doing something.


Conflict data analysis issues and the dirty Dirty War Index

In the previous post on building bias into your database, I outlined seven modelling tricks to build your preference into the information system. Here, I will look at some of those databases and a tool/calculation built on top of such conflict databases (the ‘dirty war index’).

Conflict databases

The US terrorism incident database at MIPT suffers from most of the aforementioned pitfalls, which drove a recently graduated friend, Dr. Fraser Gray, to desperation, asking me whether I could analyse the numbers (but, alas, the database has some inconsistencies). I have more, and official, detail about the design rationale and limitations of the civil war incident databases developed by Weinstein and by Sutton. In his fascinating book Inside Rebellion [1], Weinstein has described his tailor-made incident database in Appendix B; so, although I’m going to comment on the database, I still highly recommend you read the book.

Weinstein applies organisational theory to rebel organisations in civil war settings and tests his hypotheses experimentally against case studies of Uganda, Mozambique, and Peru. As such, his self-made database was made with the following assumption in mind: “civilians are often the primary and deliberate target of combatants in civil wars… Accordingly, an appropriate indicator of the “incidence” of civil war is the use of violence against noncombatant populations.” Translated to the database focus, it is a people-centred database, not, say, target-centred. Not only deaths are counted, but also a range of violations, including mutilation, abduction, detention, looting, and rape, as well as victim characteristics with name, age, sex, affiliation and affiliation groups, such as religious leaders, students, occupation of civilian, and traditional authorities (according to Appendix B).

Geography is coded only at a high level; at least, the information provided in chapter 6, which deals with the quantitative data, discusses only (aggregated?) rough regions, such as Mozambique’s “north”, “centre” and “south”, and for Sendero Luminoso-Huallaga no sub-regions at all. To its merit, it has a year-by-year breakdown of the incidents, although one has no access to exactly which types of incidents, even though they are supposed to be in the database. It does not discuss quantitatively the types of arms and the targets; for understanding the dynamics of a conflict it certainly makes a difference whether, say, water purification plants are blown up or military bases are attacked, and whether sophisticated ‘non-conventional arms’ or machetes are used. If we want to know that, it seems we have to redo the data collection process. No statistical analysis is performed, so that for, e.g., the size of the victim groups we get indications of ‘relatively more’ and barely any percentages or ratios for comparisons across years or across conflicts, although these could have been derived from the stacked-bar charts of the (yet again aggregated) data. The huge number of incidents marked as “unclear” for Peru has only guessed explanations that point to data collection issues (e.g., for 1987 some 500 “unclear” versus about 40 attributed to Sendero Luminoso-Nacional and 30 to the government); try feeding such data into the DWI (see below). The definitions of “civilian” and “non-combatant” are not clear, and not even roughly inferable as with Sutton’s database (see below).
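
To get a feel for why the “unclear” incidents are such a problem for any ratio-based measure, one can compute the range within which an actor’s share of the incidents may lie. This is only a back-of-the-envelope sketch using the rough 1987 Peru figures quoted above; the function share_bounds and the assumption that each unclear incident could belong to either actor are mine, not Weinstein’s.

```python
def share_bounds(attributed, unclear, total):
    """Lower and upper bound (in %) of an actor's share of all incidents.

    Lower bound: none of the unclear incidents belong to the actor.
    Upper bound: all of the unclear incidents belong to the actor.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    lower = 100.0 * attributed / total
    upper = 100.0 * (attributed + unclear) / total
    return lower, upper


# Rough figures for Peru, 1987, as read from the text above.
unclear, sendero, government = 500, 40, 30
total = unclear + sendero + government

if __name__ == "__main__":
    for name, count in (("Sendero Luminoso-Nacional", sendero),
                        ("government", government)):
        low, high = share_bounds(count, unclear, total)
        print(f"{name}: between {low:.1f}% and {high:.1f}% of incidents")
```

With bounds this wide, almost any conclusion about who did what can be accommodated by the data.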

Overall, it merely gives a rough idea of some aspects of the examined conflicts, but maybe this already suffices for comparative politics.

UPDATE (21-1-2009): Jeremy Weinstein kindly responded via email. He is aware of the aggregations used in the data analysis, which were intended to serve a descriptive role, and pointed me to an effort with more detailed data collection, finer-grained analysis, and online data (in proprietary Stata format) on the conflict in Sierra Leone, which was published in American Political Science Review. That freely available paper, Handling and manhandling civilians in civil war, also gives an indication of what the reader can expect of the contents of the book, and has a set of 8 hypotheses that are tested against the data (not all of them could be confirmed).

The Dirty War Index

There are people who build tools upon such conflict databases. Garbage In, Garbage Out? I will highlight one of those tools, which received extensive coverage in PLoS Medicine recently [2,3,4]: a “Dirty War Index” that can be calculated for a variety of parameters, all following the pattern DWI = \frac{nr\_of\_dirty\_cases}{total\_nr\_of\_cases} \times 100 . The cases, and their aggregation into numbers of cases, come from the conflicts’ incident databases. Go figure. And it’s not just that: one could/would/should assume that the examples Hicks and Spagat give in their paper [3] are there to illustrate, not to invalidate, their DWI approach.

Let us take their first example, the DWIs for the actors in the Colombian civil conflict, using the measure \frac{nr\_of\_civilians\_killed}{nr\_of\_civilians\_killed + nr\_of\_combatants\_killed} \times 100 . The ‘guerrillas’ (presumably FARC) have a DWI of \frac{2498}{5444} \times 100 = 46 , the ‘government forces’ \frac{539}{1198} \times 100 = 45 , and the ‘illegal paramilitaries’ (a pleonasm) \frac{6944}{6985} \times 100 = 99 (numbers taken from the simple Colombia conflict database [5]). Hicks and Spagat explain that “Guerrillas rank 2nd in killing absolute numbers of civilians”, as if the government forces deserve a laurel for having the best (closest to 0) DWI, with a mere 1-point margin, and as if paramilitaries are independent of the government, whereas it is the norm, rather than the exception, that governments arrange for a third party to do the dirty work for them (with or without external funding) so as to look comparatively good in the international spotlight. Aggregating by ‘opponents of FARC’, we get a DWI of \frac{539+6944}{1198+6985} \times 100 = 91.4 , which is substantially dirtier than FARC’s, a difference that can no longer be explained away by data collection biases [4]. To put it differently, FARC is in this DWI the proverbial ‘lesser of two evils’, or, if you support their cause, you could say they have good reason to be annoyed with the current violent governance in the country. This also suggests that requiring “recognition in Colombia’s paramilitary demobilization, disarmament, and reintegration process” [3] alone may not be enough to achieve durable peace for Colombians.
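
The arithmetic is simple enough to replay. Below is a minimal sketch of the DWI pattern, using only the guerrilla and paramilitary counts quoted above; the function names dwi and pooled_dwi are mine, and pooling these two particular actors is a purely hypothetical grouping chosen to show how much the index depends on who is lumped together.

```python
def dwi(dirty_cases, total_cases):
    """Dirty War Index: dirty cases as a percentage of all cases."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    return 100.0 * dirty_cases / total_cases


def pooled_dwi(groups):
    """DWI of several actors lumped together.

    `groups` is a list of (dirty_cases, total_cases) pairs. The counts are
    summed first, so the result is weighted by the number of cases and is
    not the plain average of the separate DWIs.
    """
    dirty = sum(d for d, _ in groups)
    total = sum(t for _, t in groups)
    return dwi(dirty, total)


# (civilians killed, civilians + combatants killed), as quoted above
guerrillas = (2498, 5444)
paramilitaries = (6944, 6985)

if __name__ == "__main__":
    print(round(dwi(*guerrillas)))       # 46
    print(round(dwi(*paramilitaries)))   # 99
    # Hypothetical grouping, for illustration only.
    print(round(pooled_dwi([guerrillas, paramilitaries]), 1))
```

Any regrouping of the actors is a one-line change, which is exactly why the choice of aggregation deserves scrutiny.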

The other main illustration is the conflict in Northern Ireland, using two complementary DWIs: “aggressive acts (killing civilians) and endangerment to civilians (by not wearing uniforms)”[1]. The ‘British Security Forces’ (BSF) have a “Civilian mortality DWI” of 52, the ‘Irish Republican Paramilitaries’ (IRP) 36, and the ‘Loyalist paramilitaries’ (LP) 86. Note the odd naming and aggregations: are we talking about the IRA only, or lumping the IRA together with the Real-IRA and Continuity-IRA, and all of UFF, LVF, and so on? The extensive source database lists 29 groups. In addition, the “number of civilian + civilian political activist” figures in [3] are 190, 738, and 873, respectively, which add up to 1801, but the source data has 1797 civilians + 58 civilian political activists = 1855, and then a series of statuses such as “ex-British army”, “ex-IRA” and so forth, who, while being “ex-”, are not real civilians according to the database. Much more data for compiling your preferred DWI and preferred details or aggregates can be found here [6].

The values of the “Attacks without uniform DWI” are given as “approaches 0” (BSF), “approaches 100” (IRP) and “approaches 100” (LP), without actual numbers to do the calculation with; despite this vagueness, for the IRP they prefer the qualification “extremely high rate” but for the LP it is only “very high rate”. They try a comparatively long explanation for the nastiness of the IRP, but it is plain that the BSF and LP have the dirtiest civilian DWI and that LP killed most civilians, no matter how one wants to explain it away and dress it up with DWIs (maybe not so coincidentally, the authors are affiliated with UK institutions).

I will leave Hicks and Spagat’s “female mortality DWI” of the Arab-Israeli conflict and the “child casualty DWI” of Chechnya for the interested reader to analyse (including the term ‘unexploded ordnance’ that injured or killed children—by exploding).

Although the idea of multiple DWIs can indeed be interesting for giving a rough indication, there is a real danger of misuse due to unfair sanitisation of data: it can easily stimulate misinterpretation by showing some neat aggregated numbers without having to assess the source data, and by brushing over the reality on the ground, which a bean-counting person may not be aware of and can more readily set aside in favour of the aggregated numbers.

Hicks and Spagat do have a section on considerations, but the fact that their two main worked-out examples, Colombia and Northern Ireland, are already problematic proves the point about possible dubious use for one’s own political agenda. Perhaps they would say the same of my alternative rendering, that it is politically coloured, but I do not try to give it a veneer of credibility or claim advantages of DWIs; I only show that it is simple to turn around and play with the DWIs to suit one’s preferences, whichever they may be.

UPDATE (5-6-’09): a more comprehensive review of Hicks and Spagat’s paper will be published in the autumn 2009 issue of the Peace & Conflict Review.

[1] Weinstein, Jeremy M. (2007). Inside rebellion—the politics of insurgent violence. Cambridge University Press.

[2] Sondorp E (2008) A new tool for measuring the brutality of war. PLoS Med 5(12): e249. doi:10.1371/journal.pmed.0050249.

[3] Hicks MH-R, Spagat M (2008) The Dirty War Index: A public health and human rights tool for examining and monitoring armed conflict outcomes. PLoS Med 5(12): e243. doi:10.1371/journal.pmed.0050243.

[4] Taback N (2008) The Dirty War Index: Statistical issues, feasibility, and interpretation. PLoS Med 5(12): e248. doi:10.1371/journal.pmed.0050248.

[5] The numbers originate from CERAC’s Colombia conflict database as reported in [3]; both Hicks and Spagat are research associates of CERAC; the database is available after registration and has substantially fewer types of information and less explanation than Sutton’s [6] database.

[6] CAIN Web Service as reported in [3]; database freely available, including data, querying, and design and data collection choices.

[1] The latter DWI is theoretically problematic, because the distinction between actors who use violence and their supporters in the population (be it passively or actively with food, shelter, and logistics) is often not that clear, and off-duty soldiers are not necessarily automatically civilians; but the argument is long. Hicks and Spagat’s table 3 has a longer list than just this item, and I shall not digress further on the topic here.

Building bias into your database

For developing bio-ontologies, if one follows Barry Smith and colleagues, then one is solely concerned with the representation of reality; moreover, it has been noted that ontologies can, or should, be seen as a representation of a scientific theory [1], or at least that they are an important part of doing science [2]. In that case, life is comparatively easy, for we have the established method of scientific inquiry to settle disputes (among others, by doing additional lab experiments to figure out more about reality). Domain and application ontologies, as well as conceptual data models, for the enterprise universe of discourse require, at times, a consensus-based approach where some parts of the represented information are the outcome of negotiations and agreements among the stakeholders.

Going one step further on the sliding scale: for databases and application software for the humanities, and conflict databases in particular, one makes an ontology or conceptual data model conforming to one’s own (or the funding organisation’s) political convictions and with the desired conclusions in mind. Building data vaults seems to be the intended norm rather than the exception; hence, maintenance, usage, and data analysis beyond the developers’ limited intentions, let alone integration, are a nightmare.

In this post, I will outline some suggestions for building your own politicized representation—be it an ontology or conceptual data model—for armed conflict data, such as terrorist incidents, civil war, and inter-state war. I will discuss in the next post a few examples of conflict data analysis, both regarding extant databases and the ‘dirty war index’ application built on top of them. A later post may deal with a solution to the problems, but for now, it would already be a great help not to adhere to the tips below.

Tips for biasing the representation

In random order, you could do any of the following to pollute the model and hamper data analysis so as to ensure your data is scientifically unreliable but suitable to serve your political agenda.

1. Have a fairly flat taxonomy of types of parties; in fact, just two subtypes suffice: US and THEM, although one could subtype the latter into ‘they’, ‘with them’, and ‘for them’. The analogue with ‘we’, ‘with us’, and ‘for us’ is too risky, because responsibility for atrocities may rub off, and is therefore not advisable to include; if you want to record any of it, then it is better to introduce types such as ‘unknown perpetrator’ or ‘not officially claimed event’ or ‘independent actor’.

2. Aggregate creatively. For instance, if some of the funding for your database comes from a building construction or civil engineering company, refine that section of target types, or include a new target type only when you feel it is targeted sufficiently often by the opponent to warrant a whole new tuple or table from then onwards. Likewise, some funding agencies would like to see a more detailed breakdown of types of victims by types of violence, some don’t. Last, be careful with the typology of arms used, in particular when your country is producing them; a category like ‘DIY explosive device’ helps to mask the producer.

3. Under-/over-represent geography. Play with granularity (by city/village, region, country, continent) and categorization criteria (state borders, language, former chiefdoms, parishes, and so forth), e.g., include (or not) notions such as ‘occupied territory’ (related to the actors) and ‘liberated region’ or ‘autonomous zone’, or that an area may, or may not, be categorized or named differently at the same time. Above all, make the modelling decisions in an inconsistent way, so that no single dimension can be analysed properly.

4. Make an a-temporal model and pretend not to change it, but (a) allow non-traceable object migration so that defecting parties who used to be with US (see point 1) can be safely re-categorised as THEM, and (b) refine the hierarchy over time anyway so as to generate time-inconsistency for target types (see point 2) and geography (see point 3), in order to avoid time series analyses and prevent discovering possible patterns.

5. Have a minimal number of classes for bibliographic information, lest someone wants to verify the primary/secondary sources that report on numbers of casualties and discovers you only included media reports from the government-censored newspapers (or the proxy-funding agency, or the rebel radio station, or the guerrilla pamphlets).

6. Keep natural language definitions for key concepts in a separate file, if recorded at all. This allows for time-inconsistency in operational definitions as well as ignorance of the data entry clerks so that each one can have his own ideas about where in the database the conflict data should go.

7. Minimize the use of database integrity constraints, hence, minimize representing constraints in the ontology to begin with, hence, use a very simple modelling language so you can blame the language for not representing the subject domain adequately.
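
For contrast, here is a small sketch of what not adhering to tips 4, 5 and 7 could look like in practice. It is only an illustration under my own assumptions: the schema, all table and column names, and the helper make_db are hypothetical and are not taken from any of the conflict databases discussed; SQLite via Python’s standard sqlite3 module is used merely because it needs no setup.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative only.
SCHEMA = """
CREATE TABLE source (              -- contra tip 5: every record is traceable
    source_id INTEGER PRIMARY KEY,
    citation  TEXT NOT NULL
);
CREATE TABLE party (
    party_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE
);
CREATE TABLE party_alignment (     -- contra tip 4: re-categorisation leaves a trail
    party_id   INTEGER NOT NULL REFERENCES party(party_id),
    alignment  TEXT NOT NULL,
    valid_from TEXT NOT NULL,      -- ISO 8601 dates, so text comparison orders correctly
    valid_to   TEXT,               -- NULL means "still valid"
    PRIMARY KEY (party_id, valid_from),
    CHECK (valid_to IS NULL OR valid_to > valid_from)
);
CREATE TABLE incident (            -- contra tip 7: constraints live in the schema
    incident_id       INTEGER PRIMARY KEY,
    occurred_on       TEXT NOT NULL,
    perpetrator       INTEGER NOT NULL REFERENCES party(party_id),
    source_id         INTEGER NOT NULL REFERENCES source(source_id),
    civilians_killed  INTEGER NOT NULL CHECK (civilians_killed >= 0),
    combatants_killed INTEGER NOT NULL CHECK (combatants_killed >= 0)
);
"""


def make_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite does not enforce foreign keys unless asked to.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


if __name__ == "__main__":
    conn = make_db()
    conn.execute("INSERT INTO party (party_id, name) VALUES (1, 'party A')")
    conn.execute("INSERT INTO source (source_id, citation) VALUES (1, 'made-up field report')")
    conn.execute("INSERT INTO incident VALUES (1, '1987-05-01', 1, 1, 3, 0)")
    try:
        # Refers to a source that was never recorded.
        conn.execute("INSERT INTO incident VALUES (2, '1987-05-02', 1, 99, 1, 0)")
    except sqlite3.IntegrityError as err:
        print("rejected:", err)
```

The point of the sketch is merely that an incident without a recorded source, a negative body count, or an alignment period that ends before it starts is refused by the database itself rather than left to the discipline of the data entry clerks.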

I’m not saying all conflict databases use all of these tricks; but some use at least most of them, which ruins the credibility of those databases whose analysts actually did try to avoid these pitfalls (assuming there are such databases, that is). Optimism wants me to believe developers did not think of all those issues when designing the database. However, there is a tendency for each conflict researcher to compile his own data set and for each database to be built from scratch.

For the current scope, I will set aside the problems with data collection and how to arrive at guesstimated semi-reliable approximations of deaths, severe injuries, rape, torture victims and so forth (see e.g. [3] and appendix B of [4]). Inherent problems with data collection are one thing and difficult to fix; bad modelling and dubious or partial data analysis are a whole different thing and doable to fix. I elaborate on the latter claim in the next post.

[1] Barry Smith. Ontology (Science). In: C. Eschenbach and M. Gruninger (eds.), Formal Ontology in Information Systems. Proceedings of FOIS 2008. preprint

[2] Keet, C.M. Factors affecting ontology development in ecology. Data Integration in the Life Sciences 2005 (DILS’05), Ludaescher, B, Raschid, L. (eds.). San Diego, USA, 20-22 July 2005. Lecture Notes in Bioinformatics LNBI 3615, Springer Verlag, 2005. pp. 46-62.

[3] Taback N (2008) The Dirty War Index: Statistical issues, feasibility, and interpretation. PLoS Med 5(12): e248. doi:10.1371/journal.pmed.0050248.

[4] Weinstein, Jeremy M. (2007). Inside rebellion—the politics of insurgent violence. Cambridge University Press. 402p.