With another level 4 lockdown and a curfew from 9pm for most of July, I eventually gave in and decided to buy a TV, for some diversion with the national TV channels. In the process of buying, it appeared that here in South Africa, you have to have a valid paid-up TV licence to be allowed to buy a TV. I had none yet. So there I was in the online shopping check-out on a Sunday evening being held up by a message that boiled down to a ‘we don’t recognise your ID or passport number as having a TV licence’. As advances in the state’s information systems would have it, you can register for a TV licence online and pay with credit card to obtain one near-instantly. The interesting question from an IT perspective then was: how long will it take for the online retailer to know I duly registered and paid for the licence? In other words: are the two systems integrated and if so, how? It definitely is not based on a simple live SPARQL query from the retailer to a SPARQL endpoint of the TV licences database, as I still failed the retailer’s TV licence check immediately after payment of the licence and confirmation of it. Some time passed with refreshing the page and trying again and writing a message to the retailer, perhaps 30-45 minutes or so. And then it worked! A periodic data push or pull it is then, either between the licence database and the retailer or within the state’s back-end system and any front-end query interface. Not bad, not bad at all.
One may question from a privacy viewpoint whether this is the right process. Why could I not simply query by, say, just TV licence number and surname, but having had to hand over my ID or passport number for the check? Should it even be the retailer’s responsibility to check whether their customer has paid the tax?
There are other places in the state’s systems where there’s some relatively advanced integration of data between the state and companies as well. Notably, the SA Revenue Service (SARS) system pulls data from any company you work for (or they submit that via some ETL process) and from any bank you’re banking with to check whether you paid the right amount (if you owe them, they send the payment order straight to your bank, but you still have to click ‘approve’ online). No doubt it will help reduce fraud, and by making it easier to fill in tax forms, it likely will increase the amount collected and will cause less errors that otherwise may be costly to fix. Clearly, the system amounts to reduced privacy, but it remains within the legal framework—someone trying to evade paying taxes is breaking the law, rather—and I support the notion of redistributive taxation and to achieve that will as little admin as possible.
These examples do raise broader questions, though: when is data integration justified? Always? If not always, then when is it not? How to ensure that it won’t happen when it should not? Who regulates data integration, if anyone? Are there any guidelines or a checklist for doing it responsibly so that it at least won’t cause unintentional harm? Which steps in the data integration, if any, are crucial from a responsibility and ethical point of view?
No good answers
I did search for academic literature, but found only one paper mentioning we should think of at least some of these sort of questions . There are plenty of ethics & Big Data papers (e.g., [2,3]), but those papers focus on the algorithms let loose on the data and consequences thereof once the data has been integrated, rather than yes/no integration or any of the preceding integration processes themselves. There are, among others, data cleaning, data harmonisation and algorithms for that, schema-based integration (LAV, GAV, or GLAV), conceptual model-based integration, ontology-driven integration, possibly recurring ETL processes and so on, and something may go wrong at each step or may be the fine-grained crucial component of the ethical considerations. I devised one toy example in the context of ontology-based data access and integration where things would go wrong because of a bias  in that COVID-19 ontology that has data integration as its explicit purpose . There are also informal [page offline dd 25-7-2021] descriptions of cases where things went wrong, such as the data integration issues with the City of Johannesburg that caused multiple riots in 2011, and no doubt there will be more.
Taking the ‘non-science’ route further to see if I could find something, I did find a few websites with some ‘best practices’ and ‘guidelines’ for data integration (e.g., here and here), with the brand new and most comprehensive set of data integration guidelines at end-user level by UN’s ESCAP that focuses on data integration for statistics offices on what to do and where errors may creep in . But that’s all. No substantive hits with ‘ethics in data integration’ and similar searches in the academic literature. Maybe I’m searching in the wrong places. Wading through all ‘data ethics’ papers to find the needle in the haystack may have to be done some other time. If you know of scientific literature that I missed specifically regarding data integration, I’d be most grateful if you’d let me know.
The ‘recurring reliables’ for issues: health and education
Meanwhile, to take a step toward an answer of at least a subset of the aforementioned questions, let me first mention two other recent cases, also from South Africa, although the second issue happened in the Netherlands as well.
The first one is about healthcare data. I’m trying to get a SARS-CoV-2 vaccine. Registration for the age group I’m in opened on the 14th in the evening and so I did register in the state’s electronic vaccination data system (EVDS), which is the basic requirement for getting a vaccine. The next day, it appeared that we could book a slot via the health insurance I’m a member of. Their database and the EVDS are definitely not integrated, and so my insurer spammed me for a while with online messages in red, via email, and via SMS that I should register with the EVDS, even though I had already done that well before trying out their app.
Perhaps the health data are not integrated because it’s health; perhaps it was just time pressure to not delay the SARS-CoV-2 vaccination programme rollout. For some sectors, such as the basic education sector and then the police, they got loaded into the EVDS by the respective state department in one go via some ETL process, rather than people having to bother with individual registration. ID number, names, health insurance, dependants, home address, phone number, and whatnot that the EVDS asked for. And that regardless whether you want the vaccine or not—at least most people do. I don’t recall anyone having had a problem with that back-end process that it happened, aside from reported glitches in the basic education sectors’ ETL process, with reports on missing foreign national teachers and employees of independent schools who wanted in but weren’t.
Both the IT systems for vaccination management and any app for a ‘pass’ for having been vaccinated enjoys some debates on privacy internationally. Should they be self-standing systems? If it is allowed some integration, then with what? Should a healthcare provider or insurer be informed of the vaccination status of a member (and, consequently, act accordingly, whatever that may be), only if the member voluntarily discloses it (like with the vaccination scheduling app), or never? One’s employer? The movie theatre or mall you may want to enter? Perhaps airline companies want access to the vaccine database as well, who could choose to only let vaccinated people on their planes? The latter happens with other vaccinations for sure; e.g., yellow fever vaccination proof to enter SA from some countries, which the airline staff did ask for when I checked in in Argentina when travelling back to SA in 2012. That vaccination proof had gone into the physical yellow fever vaccination booklet that I carried with me; no app was involved in that process, ever. But now more things are digital. Must any such ‘covid-19 pass’ necessarily be digital? If so, who decides who, if anyone, will get access to the vaccination data, be it the EVDS data in SA or their homologous systems in other countries? To the best of my knowledge, no regulations exist yet. Since the EVDS is an IT system of the state, I presume they will decide. If they don’t, it will be up to the whims of each company, municipality, or province, and then is bound to generate lots of confusion among people.
The other case of a different nature comes in the news regularly; e.g., here, here, and here. It’s the tension that exists between children’s right to education and the paperwork to apply for a school. This runs into complications when they have an “undocumented” status, be it because of an absent birth certificate or their and their parent’s status as legal/illegal and their related ID documents or the absence thereof. It is forbidden for a school to contact Home affairs to get the prospective pupil’s and their respective parents’/guardians’ status, and for Home Affairs to provide that data to the schools, let alone integrate those two database at the ministerial level. Essentially, it is an intentional ‘Chinese wall’ between the two databases: the right to education of a child trumps any possible violation of legality of stay in the country or missing paperwork of the child or their parents/guardians.
Notwithstanding, exclusive or exclusionary schools try to filter them out by other means, such as by demanding that sort of data when you want to apply for admission; here’s an example, compared to public schools where evidence of an application for permission to stay suffices or at least evidence of efforts to engage with Home Affairs will do already. When the law says ‘no’ to the integration, how can you guarantee it won’t happen, neither through the software nor by other means (like by de facto requiring the relevant data stored in the Home Affairs database in an admission form)? Policing it? People reporting it somewhere? Would requesting such information now be a violation of the Protection of Personal Information Act (POPIA) that came into force on the 1st of July, since it asks for more personal data than needed by law?
These cases—TV licence, SARS (the tax, not the syndrome), vaccine database, school admissions—are just a few anecdotes. Data integration clearly is not always allowed and when it is not, it has been a deliberate decision not to do so because its outcome is easy to predict and deemed unwanted. Notably for the education case, it is the government who devised the policy for a regulatory Chinese wall between its systems. The TV licence appears to lie at the other end of the spectrum. The broadcasting act of 1999 implicitly puts the onus on the seller of TVs: the licence is not a fee to watch public TV, it is a thing to give the licence holder the right to use a TV (article 27, if you must know), so if you don’t have the right to have it, then you can’t buy it. It’s analogous to having to be over 18 to buy alcohol, where the seller is held culpable if the buyer isn’t. That said, there are differences in what the seller requests from the customer: Makro requires the licence number only and asks for ID only if you can’t remember the licence number so as to ‘help you find it’, whereas takealot demands both ID and licence in any case, and therewith perhaps is then asking for more than strictly needed. Either way, since any retailer thus should be able to access the licence information instantly to check whether you have the right to own a TV, it’s a bit like as if “come in and take my data” is written all over the TV licence database. I haven’t seen any news articles about abuse.
For the SARS-CoV-2 vaccine and the EVDS data, there is, to the best of my knowledge, no specific regulation in place from the EVDS to third parties, other than that vaccination is voluntary and there is SA’s version of the GDPR, the aforementioned POPIA, which is based on the GDPR principles. I haven’t seen much debate about organisations requiring vaccination, but they can make vaccination mandatory if they want to, from which follows that there will have to be some data exchange either between the EVDS and third parties or from EVDS to the person and from there to the company. Would it then become another “come in and take my data”? We’ll cross that bridge when it comes, I suppose; coverage is currently at about 10% of the population and not everyone who wants to could get vaccinated yet, so we’re still in a limbo.
What could possibly go wrong with widespread access, alike with the TV licence database? A lot, of course. There are the usual privacy and interoperability issues (also noted here), and there are calls even in the laissez faire USA to put a framework in place to provide companies with “standards and bounds”. They are unlikely going to be solved by the CommonPass of the Commons Project bottom-up initiative, since there are so many countries with so many rules on privacy and data sharing. Interoperability between some systems is one thing; one world-wide system is another cup of tea.
What all this boils down to is not unlike Moshe Vardi’s argument, in that there’s the need for more policy to reduce and avoid ethical issues in IT, AI, and computing, rather than that computing would be facing an ethics crisis . His claim is that failures of policy cause problems and that the “remedy is public policy, in the form of laws and regulations”, not some more “ethics outrage”. Presumably, there’s no ethics crisis, of the form that there would be a lack of understanding of ethical behaviour among computer scientists and their managers. Seeing each year how students’ arguments improve between the start of the ethics course and at the end in the essay and exam, I’d argue that basic sensitization is still needed, but on the whole, more and better policy could go a long way indeed.
More research on possible missteps in the various data integration processes would also be helpful, and that from a technical angle, as would learning from case studies be, and contextual inquiries , as well as a rigorous assessment on possible biases, alike it was examined for software development processes . Those outcomes then may end up as a set of guidelines for data integration practitioners and the companies they work for, and inform government to devise policies. For now, the ESCAP guidelines  probably will be of most use to a data integration practitioner. It won’t catch all biases and algorithmic issues & tools and assumes one is allowed to integrate already, but it is a step in the direction of responsible data integration. I’ll think about it a bit more, too, and for the time being I won’t bother my students with writing an essay about ethics of data integration just yet.
 Firmani, D., Tanca, L., Torlone, R. Data processing: reflection on ethics. International Workshop on Processing Information Ethically (PIE’19). CEUR-WS vol. 2417. 4 June 2019.
 Herschel, R., Miori, V.M. Ethics & Big Data. Technology in Society, 2017, 49:31‐36.
 Sax, M. Finders keepers, losers weepers. Ethics and Information Technology, 2016, 18: 25‐31.
 Keet, C.M. Bias in ontologies — a preliminary assessment. Technical Report, Arxiv.org, January 20, 2021. 10p
 He, Y., et al. 2020. CIDO: The Community-based CoronavirusInfectious Disease Ontology. In Hastings, J.; and Loebe, F., eds., Proceedings of the 11th international Conference on Biomedical Ontologies, CEUR-WS vol. 2807.
 Economic and Social Commission for Asia and the Pacific (ESCAP). Asia-Pacific Guidelines to Data Integration for Official Statistics. Training manual. 15 April 2021.
 Vardi, M.Y. Are We Having An Ethical Crisis in Computing? Communications of the ACM, 62(1):7
 McKeown, A., Cliffe, C., Arora, A. et al. Ethical challenges of integration across primary and secondary care: a qualitative and normative analysis. BMC Med Ethics 20, 42 (2019).
 R. Mohanani, I. Salman, B. Turhan, P. Rodriguez, P. Ralph, Cognitive biases in software engineering: A systematic mapping study, IEEE Transactions on Software Engineering, 46 (2020): 1318–1339.