What about ethics and responsible data integration and data firewalls?

With another level 4 lockdown and a curfew from 9pm for most of July, I eventually gave in and decided to buy a TV, for some diversion with the national TV channels. In the process of buying, it appeared that here in South Africa, you have to have a valid paid-up TV licence to be allowed to buy a TV. I had none yet. So there I was in the online shopping check-out on a Sunday evening being held up by a message that boiled down to a ‘we don’t recognise your ID or passport number as having a TV licence’. As advances in the state’s information systems would have it, you can register for a TV licence online and pay with credit card to obtain one near-instantly. The interesting question from an IT perspective then was: how long will it take for the online retailer to know I duly registered and paid for the licence? In other words: are the two systems integrated and if so, how? It definitely is not based on a simple live SPARQL query from the retailer to a SPARQL endpoint of the TV licences database, as I still failed the retailer’s TV licence check immediately after payment of the licence and confirmation of it. Some time passed with refreshing the page and trying again and writing a message to the retailer, perhaps 30-45 minutes or so. And then it worked! A periodic data push or pull it is then, either between the licence database and the retailer or within the state’s back-end system and any front-end query interface. Not bad, not bad at all.

One may question from a privacy viewpoint whether this is the right process. Why could I not simply query by, say, just TV licence number and surname, but having had to hand over my ID or passport number for the check? Should it even be the retailer’s responsibility to check whether their customer has paid the tax?

There are other places in the state’s systems where there’s some relatively advanced integration of data between the state and companies as well. Notably, the SA Revenue Service (SARS) system pulls data from any company you work for (or they submit that via some ETL process) and from any bank you’re banking with to check whether you paid the right amount (if you owe them, they send the payment order straight to your bank, but you still have to click ‘approve’ online). No doubt it will help reduce fraud, and by making it easier to fill in tax forms, it likely will increase the amount collected and will cause less errors that otherwise may be costly to fix. Clearly, the system amounts to reduced privacy, but it remains within the legal framework—someone trying to evade paying taxes is breaking the law, rather—and I support the notion of redistributive taxation and to achieve that will as little admin as possible.

These examples do raise broader questions, though: when is data integration justified? Always? If not always, then when is it not? How to ensure that it won’t happen when it should not? Who regulates data integration, if anyone? Are there any guidelines or a checklist for doing it responsibly so that it at least won’t cause unintentional harm? Which steps in the data integration, if any, are crucial from a responsibility and ethical point of view?

No good answers

pretty picture of a selection of data integration tasks. source: https://datawarehouseinfo.com/wp-content/uploads/2018/10/data-integration-1024x1022.png
pretty picture of a selection of data integration tasks. (source: dwh site)

I did search for academic literature, but found only one paper mentioning we should think of at least some of these sort of questions [1]. There are plenty of ethics & Big Data papers (e.g., [2,3]), but those papers focus on the algorithms let loose on the data and consequences thereof once the data has been integrated, rather than yes/no integration or any of the preceding integration processes themselves. There are, among others, data cleaning, data harmonisation and algorithms for that, schema-based integration (LAV, GAV, or GLAV), conceptual model-based integration, ontology-driven integration, possibly recurring ETL processes and so on, and something may go wrong at each step or may be the fine-grained crucial component of the ethical considerations. I devised one toy example in the context of ontology-based data access and integration where things would go wrong because of a bias [4] in that COVID-19 ontology that has data integration as its explicit purpose [5]. There are also informal [page offline dd 25-7-2021] descriptions of cases where things went wrong, such as the data integration issues with the City of Johannesburg that caused multiple riots in 2011, and no doubt there will be more.

Taking the ‘non-science’ route further to see if I could find something, I did find a few websites with some ‘best practices’ and ‘guidelines’ for data integration (e.g., here and here), with the brand new and most comprehensive set of data integration guidelines at end-user level by UN’s ESCAP that focuses on data integration for statistics offices on what to do and where errors may creep in [6]. But that’s all. No substantive hits with ‘ethics in data integration’ and similar searches in the academic literature. Maybe I’m searching in the wrong places. Wading through all ‘data ethics’ papers to find the needle in the haystack may have to be done some other time. If you know of scientific literature that I missed specifically regarding data integration, I’d be most grateful if you’d let me know.

The ‘recurring reliables’ for issues: health and education

Meanwhile, to take a step toward an answer of at least a subset of the aforementioned questions, let me first mention two other recent cases, also from South Africa, although the second issue happened in the Netherlands as well.

The first one is about healthcare data. I’m trying to get a SARS-CoV-2 vaccine. Registration for the age group I’m in opened on the 14th in the evening and so I did register in the state’s electronic vaccination data system (EVDS), which is the basic requirement for getting a vaccine. The next day, it appeared that we could book a slot via the health insurance I’m a member of. Their database and the EVDS are definitely not integrated, and so my insurer spammed me for a while with online messages in red, via email, and via SMS that I should register with the EVDS, even though I had already done that well before trying out their app.

Perhaps the health data are not integrated because it’s health; perhaps it was just time pressure to not delay the SARS-CoV-2 vaccination programme rollout. For some sectors, such as the basic education sector and then the police, they got loaded into the EVDS by the respective state department in one go via some ETL process, rather than people having to bother with individual registration. ID number, names, health insurance, dependants, home address, phone number, and whatnot that the EVDS asked for. And that regardless whether you want the vaccine or not—at least most people do. I don’t recall anyone having had a problem with that back-end process that it happened, aside from reported glitches in the basic education sectors’ ETL process, with reports on missing foreign national teachers and employees of independent schools who wanted in but weren’t.

Both the IT systems for vaccination management and any app for a ‘pass’ for having been vaccinated enjoys some debates on privacy internationally. Should they be self-standing systems? If it is allowed some integration, then with what? Should a healthcare provider or insurer be informed of the vaccination status of a member (and, consequently, act accordingly, whatever that may be), only if the member voluntarily discloses it (like with the vaccination scheduling app), or never? One’s employer? The movie theatre or mall you may want to enter? Perhaps airline companies want access to the vaccine database as well, who could choose to only let vaccinated people on their planes? The latter happens with other vaccinations for sure; e.g., yellow fever vaccination proof to enter SA from some countries, which the airline staff did ask for when I checked in in Argentina when travelling back to SA in 2012. That vaccination proof had gone into the physical yellow fever vaccination booklet that I carried with me; no app was involved in that process, ever. But now more things are digital. Must any such ‘covid-19 pass’ necessarily be digital? If so, who decides who, if anyone, will get access to the vaccination data, be it the EVDS data in SA or their homologous systems in other countries? To the best of my knowledge, no regulations exist yet. Since the EVDS is an IT system of the state, I presume they will decide. If they don’t, it will be up to the whims of each company, municipality, or province, and then is bound to generate lots of confusion among people.

The other case of a different nature comes in the news regularly; e.g., here, here, and here. It’s the tension that exists between children’s right to education and the paperwork to apply for a school. This runs into complications when they have an “undocumented” status, be it because of an absent birth certificate or their and their parent’s status as legal/illegal and their related ID documents or the absence thereof. It is forbidden for a school to contact Home affairs to get the prospective pupil’s and their respective parents’/guardians’ status, and for Home Affairs to provide that data to the schools, let alone integrate those two database at the ministerial level. Essentially, it is an intentional ‘Chinese wall’ between the two databases: the right to education of a child trumps any possible violation of legality of stay in the country or missing paperwork of the child or their parents/guardians.

Notwithstanding, exclusive or exclusionary schools try to filter them out by other means, such as by demanding that sort of data when you want to apply for admission; here’s an example, compared to public schools where evidence of an application for permission to stay suffices or at least evidence of efforts to engage with Home Affairs will do already. When the law says ‘no’ to the integration, how can you guarantee it won’t happen, neither through the software nor by other means (like by de facto requiring the relevant data stored in the Home Affairs database in an admission form)? Policing it? People reporting it somewhere? Would requesting such information now be a violation of the Protection of Personal Information Act (POPIA) that came into force on the 1st of July, since it asks for more personal data than needed by law?

Regulatory aspects

These cases—TV licence, SARS (the tax, not the syndrome), vaccine database, school admissions—are just a few anecdotes. Data integration clearly is not always allowed and when it is not, it has been a deliberate decision not to do so because its outcome is easy to predict and deemed unwanted. Notably for the education case, it is the government who devised the policy for a regulatory Chinese wall between its systems. The TV licence appears to lie at the other end of the spectrum. The broadcasting act of 1999 implicitly puts the onus on the seller of TVs: the licence is not a fee to watch public TV, it is a thing to give the licence holder the right to use a TV (article 27, if you must know), so if you don’t have the right to have it, then you can’t buy it. It’s analogous to having to be over 18 to buy alcohol, where the seller is held culpable if the buyer isn’t. That said, there are differences in what the seller requests from the customer: Makro requires the licence number only and asks for ID only if you can’t remember the licence number so as to ‘help you find it’, whereas takealot demands both ID and licence in any case, and therewith perhaps is then asking for more than strictly needed. Either way, since any retailer thus should be able to access the licence information instantly to check whether you have the right to own a TV, it’s a bit like as if “come in and take my data” is written all over the TV licence database. I haven’t seen any news articles about abuse.

For the SARS-CoV-2 vaccine and the EVDS data, there is, to the best of my knowledge, no specific regulation in place from the EVDS to third parties, other than that vaccination is voluntary and there is SA’s version of the GDPR, the aforementioned POPIA, which is based on the GDPR principles. I haven’t seen much debate about organisations requiring vaccination, but they can make vaccination mandatory if they want to, from which follows that there will have to be some data exchange either between the EVDS and third parties or from EVDS to the person and from there to the company. Would it then become another “come in and take my data”? We’ll cross that bridge when it comes, I suppose; coverage is currently at about 10% of the population and not everyone who wants to could get vaccinated yet, so we’re still in a limbo.

What could possibly go wrong with widespread access, alike with the TV licence database? A lot, of course. There are the usual privacy and interoperability issues (also noted here), and there are calls even in the laissez faire USA to put a framework in place to provide companies with “standards and bounds”. They are unlikely going to be solved by the CommonPass of the Commons Project bottom-up initiative, since there are so many countries with so many rules on privacy and data sharing. Interoperability between some systems is one thing; one world-wide system is another cup of tea.

What all this boils down to is not unlike Moshe Vardi’s argument, in that there’s the need for more policy to reduce and avoid ethical issues in IT, AI, and computing, rather than that computing would be facing an ethics crisis [7]. His claim is that failures of policy cause problems and that the “remedy is public policy, in the form of laws and regulations”, not some more “ethics outrage”. Presumably, there’s no ethics crisis, of the form that there would be a lack of understanding of ethical behaviour among computer scientists and their managers. Seeing each year how students’ arguments improve between the start of the ethics course and at the end in the essay and exam, I’d argue that basic sensitization is still needed, but on the whole, more and better policy could go a long way indeed.

More research on possible missteps in the various data integration processes would also be helpful, and that from a technical angle, as would learning from case studies be, and contextual inquiries [8], as well as a rigorous assessment on possible biases, alike it was examined for software development processes [9]. Those outcomes then may end up as a set of guidelines for data integration practitioners and the companies they work for, and inform government to devise policies. For now, the ESCAP guidelines [6] probably will be of most use to a data integration practitioner. It won’t catch all biases and algorithmic issues & tools and assumes one is allowed to integrate already, but it is a step in the direction of responsible data integration. I’ll think about it a bit more, too, and for the time being I won’t bother my students with writing an essay about ethics of data integration just yet.


[1] Firmani, D., Tanca, L., Torlone, R. Data processing: reflection on ethics. International Workshop on Processing Information Ethically (PIE’19). CEUR-WS vol. 2417. 4 June 2019.

[2] Herschel, R., Miori, V.M. Ethics & Big Data. Technology in Society, 2017, 49:31‐36.

[3] Sax, M. Finders keepers, losers weepers. Ethics and Information Technology, 2016, 18: 25‐31.

[4] Keet, C.M. Bias in ontologies — a preliminary assessment. Technical Report, Arxiv.org, January 20, 2021. 10p

[5] He, Y., et al. 2020. CIDO: The Community-based CoronavirusInfectious Disease Ontology. In Hastings, J.; and Loebe, F., eds., Proceedings of the 11th international Conference on Biomedical Ontologies, CEUR-WS vol. 2807.

[6] Economic and Social Commission for Asia and the Pacific (ESCAP). Asia-Pacific Guidelines to Data Integration for Official Statistics. Training manual. 15 April 2021.

[7] Vardi, M.Y. Are We Having An Ethical Crisis in Computing? Communications of the ACM, 62(1):7

[8] McKeown, A., Cliffe, C., Arora, A. et al. Ethical challenges of integration across primary and secondary care: a qualitative and normative analysis. BMC Med Ethics 20, 42 (2019).

[9] R. Mohanani, I. Salman, B. Turhan, P. Rodriguez, P. Ralph, Cognitive biases in software engineering: A systematic mapping study, IEEE Transactions on Software Engineering, 46 (2020): 1318–1339.

72010 SemWebTech lecture 8: SWT for HCLS background and data integration

After the ontology languages and general aspects of ontology engineering, we now will delve into one specific application area: SWT for health care and life sciences. Its frontrunners in bioinformatics were adopters of some of the Semantic Web ideas even before Berners-Lee, Hendler, and Lassila wrote their Scientific American paper in 2001, even though they did not formulate their needs and intentions in the same terminology: they did want to have shared, controlled vocabularies with the same syntax, to facilitate data integration—or at least interoperability—across Web-accessible databases, have a common space for identifiers, it needing to be a dynamic, changing system, to organize and query incomplete biological knowledge, and, albeit not stated explicitly, it all still needed to be highly scalable [1].

Bioinformaticians and domain experts in genomics already organized themselves together in the Gene Ontology Consortium, which was set up officially in 1998 to realize a solution for these requirements. The results exceeded anyone’s expectations in its success for a range of reasons. Many tools for the Gene Ontology (GO) and its common KR format, .obo, have been developed, and other research groups adopted the approach to develop controlled vocabularies either by extending the GO, e.g., rice traits, or adding their own subject domain, such as zebrafish anatomy and mouse developmental stages. This proliferation, as well as the OWL development and standardization process that was going on at about the same time, pushed the goal posts further: new expectations were put on the GO and its siblings and on their tools, and the proliferation had become a bit too wieldy to keep a good overview what was going on and how those ontologies would be put together. Put differently, some people noticed the inferencing possibilities that can be obtained from moving from obo to OWL and others thought that some coordination among all those obo bio-ontologies would be advantageous given that post-hoc integration of ontologies of related and overlapping subject domains is not easy. Thus came into being the OBO Foundry to solve such issues, proposing a methodology for coordinated evolution of ontologies to support biomedical data integration [2].

People in related disciplines, such as ecology, have taken on board experiences of these very early adopters, and instead decided to jump on board after the OWL standardization. They, however, were not only motivated by data(base) integration. Referring to Madin et al’s paper [3] again, I highlight three points they made: “terminological ambiguity slows scientific progress, leads to redundant research efforts, and ultimately impedes advances towards a unified foundation for ecological science”, i.e., identification of some serious problems they have in ecological research; “Formal ontologies provide a mechanism to address the drawbacks of terminological ambiguity in ecology”, i.e., what they expect that ontologies will solve for them (disambiguation); and “and fill an important gap in the management of ecological data by facilitating powerful data discovery based on rigorously defined, scientifically meaningful terms”, i.e., for what purpose they want to use ontologies and any associated computation (discovery). That is, ontologies not as a—one of many possible—tool in the engineering/infrastructure means, but as a required part of a method in the scientific investigation that aims to discover new information and knowledge about nature (i.e., in answering the who, what, where, when, and how things are the way they are in nature).

What has all this to do with actual Semantic Web technologies? On the one hand, there are multiple data integration approaches and tools that have been, and are being, tried out by the domain experts, bioinformaticians, and interdisciplinary-minded computer scientists [4], and, on the other hand, there are the W3C Semantic Web standards XML, RDF(S), SPARQL, and OWL. Some use these standards to achieve data integration, some do not. Since this is a Semantic Web course, we shall take a look at two efforts who (try to) do, which came forth from the activities of the W3C’s Health Care and Life Sciences Interest Group. More precisely, we take a closer look at a paper written about 3 years ago [5] that reports on a case study to try to get those Semantic Web Technologies to work for them in order to achieve data integration and a range of other things. There is also a more recent paper from the HCLS IG [6], where they aimed at not only linking of data but also querying of distributed data, using a mixture of RDF triple stores and SKOS. Both papers reveal their understanding of the purposes of SWT, and, moreover, what their goals are, their experimentation with various technologies to achieve them, and where there is still some work to do. There are notable achievements described in these, and related, papers, but the sought-after “killer app” is yet to be announced.

The lecture will cover a ‘historical’ overview and what more recent ontology-adopters focus on, the very basics of data integration approaches that motivated the development of ontologies, and we shall analyse some technological issues and challenges mentioned in [5] concerning Semantic Web (or not) technologies.


[1] The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genetics, May 2000;25(1):25-9.

[2] Barry Smith, Michael Ashburner, Cornelius Rosse, Jonathan Bard, William Bug, Werner Ceusters, Louis J. Goldberg, Karen Eilbeck, Amelia Ireland, Christopher J Mungall, The OBI Consortium, Neocles Leontis, Philippe Rocca-Serra, Alan Ruttenberg, Susanna-Assunta Sansone, Richard H Scheuermann, Nigam Shah, Patricia L. Whetzel, Suzanna Lewis. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology 25, 1251-1255 (2007).

[3] Joshua S. Madin, Shawn Bowers, Mark P. Schildhauer and Matthew B. Jones. (2008). Advancing ecological research with ontologies. Trends in Ecology & Evolution, 23(3): 159-168.

[4] Erhard Rahm. Data Integration in Bioinformatics and Life Sciences. EDBT Summer School, Bolzano, Sep. 2007.

[5] Ruttenberg A, Clark T, Bug W, Samwald M, Bodenreider O, Chen H, Doherty D, Forsberg K, Gao Y, Kashyap V, Kinoshita J, Luciano J, Scott Marshall M, Ogbuji C, Rees J, Stephens S, Wong GT, Elizabeth Wu, Zaccagnini D, Hongsermeier T, Neumann E, Herman I, Cheung KH. Advancing translational research with the Semantic Web, BMC Bioinformatics, 8, 2007.

[6] Kei-Hoi Cheung, H Robert Frost, M Scott Marshall, Eric Prud’hommeaux, Matthias Samwald, Jun Zhao, and Adrian Paschke. A journey to Semantic Web query federation in the life sciences. BMC Bioinformatics 2009, 10(Suppl 10):S10

Note: references 1, 2, and (5 or 6) are mandatory reading, and 3 and 4 are recommended to read.

Lecture notes: lecture 8 – SWLS background and data integration

Course website