Google searches, sneaky, and data duplication

Aside from this blog and Facebook, I recently signed up for, a Web 2.0ish site where researchers can connect and follow each other academically. It was even so ‘smart’ that it could tell me which of my FB friends were already in the system without me explicitly giving it any information about my FB account, and it found most of my papers for me (with some noise, though). Setting aside the uncomfortable former aspect, the ‘finding and handling my papers for me’ is actually really sneaky. I’ll spend the remainder of this post on that, so you know what you’ll be letting yourself in for in case you sign up for it.

The first annoying thing is that if you let collect your papers automatically when you build up your profile there, which it seemingly does ‘intelligently’, it snatches your papers and either takes them from citeseer or puts them on scribd, even though they are all on my and the publisher’s websites, too. And then there’s some noise; e.g., it links to a pdf of the presentation you did at the conference instead of the paper itself. Also, it does not provide full publication details (just the title), even though it easily could be programmed to screen scrape that from any researcher’s website, or, better, ask me for a bib file.

And then there’s the real catch: when someone now searches for your papers, the URL to their version of the paper comes up higher in the Google ranking than either yours or the publisher’s. ‘Thanks’ to’s services, I now know which Google search terms visitors used to retrieve which paper. So when on March 20 someone from an unknown country at 04:50am local time searched for “data information granularity bioinformatics”, the very first Google hit (yes, the rank is given in the stats as well) was the slides of my PhD defense on scribd, not my thesis on my website (which it should have been); and even if s/he had wanted to download the slides, clicking the “Download” button produces the complaint that “You must be logged in to download” (!). The slides were probably not what s/he was after (the same goes for the visitor from Poland, who searched for this). There are many such misdirected instances. Essentially, these visitors would have had to search again to get the publication data and, where a wrong file was linked, to find the right file. It is correctable on, but only manually. Subsequently adding new papers is also a manual process with an impractical GUI.

And then I have not said anything yet about the scruffy page rendering by scribd. Besides, I never gave scribd approval to offer my work on/through their site (there are more malfunctioning aggregators, which is an issue of its own). In addition, it would not surprise me if that would violate the delicately balanced copyright arrangements that exist for CS publications. UPDATE: the terms and conditions (d.d. 21-4-2011) say that “By displaying or publishing (“posting”) any Content on or through the Services, you hereby grant to a limited license to use, modify, publicly perform, publicly display, reproduce, and distribute such Content solely on and through the Services.” That does violate the delicately balanced copyright arrangements that exist for CS publications. The terms & conditions also say it is the Member’s own responsibility to get the approval from the respective publishers/copyright holders.

Moreover, there is a preposterous “message” on the right-hand side of the search statistics: “Tip: To make your page appear higher up on Google: Link to your page from your department website Upload more documents – papers, talks and a CV”. But my own site was higher up in the Google ranking before I signed up to your devious service! Honestly, I want to lure people to my site when they are interested in my contributions, not to a place where there is partial information badly duplicated. Ok, this smells of ego-tripping, but my site is worth almost $17K according to and has a page rank of 5; if I run out of money, I’ll have an asset to sell fairly easily without much disruption. Seriously though, this ‘rerouting’ of visitors away from the source toward some obscure other location on the Internet is obviously a more important issue at the institutional level. No sane university or research institute would want a policy of redirecting visitors to any site other than its own when it comes to displaying the scientific impact its employees have made. If one is at a non-indexable institute that only happens to carry the title ‘university’ but is not one in substance, then perhaps helps with your visibility. But I am not at such an institute; UKZN is one of the 5 top research-intensive universities in South Africa.

So what can you do? Remove your papers on This I did this morning, one by one. The sad thing is that “following” other researchers is, in theory at least, an easy way to be notified automatically of their new publications compared to manually checking their homepages regularly, but this is precisely that which manages to mess up, badly.

I can envision a couple of mean scenarios why anyone would have wanted to set up the site in the way it is, like that they first pollute Google rankings and then ask for a fee in the near future (after all, they already require you to log in to download the file, and a “fee” item is included in the terms & conditions file). The statistics they are gathering on who-follows-whom give better insight into research networks and their leaders than the more common citation-network analyses. Finding out which scientific papers and topics are ‘hot’ must be valuable material as well, and may perhaps become just as important as the rather imprecise ISI impact factor that is quite useless for CS at the moment. One could also use the data for NLP and semantic annotations to offer, in the near future, indispensable academic semantic search facilities (at a price). And no doubt there are more scenarios.

In short, that was the end of that “Web 2.0” experiment for me.

(p.s.: just in case someone wants to see some proof: I did make several screenshots that I can share)


Five years of keet blog

Was it worth the effort? Yes, for two reasons. First, there is the amount of offline positive feedback and the steadily increasing number of visits/month, hence having provided some added-value at least to some readers. Second, it contributed to making me a more efficient and attentive reader and conference attendee, and it improved my communication skills in describing scientific results informally and succinctly. To be clear, though, it most likely did not improve my job prospects directly (perhaps even on the contrary) and the time spent on writing the blog posts surely could have been used to churn out another paper; irrespective of these two considerations, it was fun to do.

The vast blog-o-sphere is an impressive 127 posts richer thanks to the existence of keet blog. Some posts received many more hits than I ever thought they would generate (top-down and bottom-up ontology development, CS & IT with/for biology, and musings about multi-tasking vs. parallel processing and the brain), while others received far fewer (like the one announcing I successfully defended my PhD thesis, on the transformation relation, and relation migration). The surprising thing, to me at least, is that despite the general idea that blog posts have a short reading/attention span and lifespan, many of my posts somehow have been picked up by search engines and keep generating traffic thanks to those searches, including the older ones. Sensible search terms people used to arrive at my blog include, among many, ontology, ORM, handbook on KR, dl2010 etc., but there are also rather peculiar ones that still refer to older posts, like this month’s search terms “incompetence blog”, which presumably returned the post about the Dunning-Kruger effect I wrote in mid 2008, and “random structure of website” (this post from >2 years ago has something to do with it). If there were some sort of an ‘ISI blog impact factor’ (say, hits/day over the past month), then keet blog would be utterly insignificant, whereas with a ‘more-sensible-than-ISI [blog] impact factor’ spanning, say, 5 years, my blog would be less insignificant on the absolute scale of blog impact.

Relatively, though, the past year generated consistently >1000 visits/month, and last month even >1600 visits, which is not that bad for a mostly ‘boring science blog’—even if only 10% of the visitors would actually read the posts. To make it less ‘boring’, there are occasional posts on science-society-entertainment (such as the complexity of coffee and culinary evolution) and trivia (e.g., htmlgraph). Posts on Computer Science & Society generate wildly varying amounts of visits (such as the Aperitivo Informatico, SA women in STI, or the ICT for Peace entries). A teasing headline like “what philosophers say about computer science” works surprisingly well. By the way, do you remember who said: “The future of our home country necessarily has to be a future of scientists”?

What about ‘pure’ computer science posts, including the shameless self-promotion of announcing accepted papers I am (co-)author of? Some posts have generated relatively many visits: all posts about ontology-based data access, all posts of the Semantic Web Technologies course that provided merely an introduction to each lecture, and CS conference blogging posts. One well-visited post made it almost verbatim into an EU project deliverable and several posts (e.g., here, here & here, and here) were preludes to papers that have been published in the meantime.

As for my own research, the number of hits of the posts is more often than not at the lower end of the scale, with granularity (my PhD thesis topic) and ontology engineering mixed. So, well, yes, indeed it seems that writing about just about anything except my own research papers makes the blog ‘popular’. If you think that sounds depressing, then think again: the vast majority of scientific papers are mostly ignored anyway, and the other researchers’ work I chose to write about is a very small selection not only of what has been published but also of what I’ve read (and I read a lot).

I have updated the list of all blog posts for easy reference (thereby possibly rescuing the odd post from complete obscurity). For those of you curious how many visits one or the other post got: the vox populi page contains a list with the 20 most visited posts and their respective number of visits.

I have not decided if I want to go on with it for another five years, but neither did I think keet blog would last for five years when I created the blog on WordPress on April 8, 2006, and started shortly after that with a first note. Many a blog fizzles out quickly, so I am somewhat proud of having kept it up for 5 years and steadily increasing its popularity—one post at a time.

Last, but most certainly not least, to all readers and [on-/off-]line commentators: a big thank you for your interest and feedback!

Reports on Digital Inclusion and divide

The Mail & Guardian (SA weekly) reported on a survey about “digital inclusion”/digital divide the other day, with the title “India’s digital divide worst among Brics”. It appeared to be based on a survey from risk analysis firm MapleCroft and their “Digital Inclusion Index” (DII).

Searching for the original survey and related news articles, the first three pages of Google’s results were news articles with pretty much the same title and content (except for one, where the Swedes say they are doing well). As it turns out, the low ranking of India is the first sentence of MapleCroft’s own news item about the DII. Much more data is described there, and everything together not only can be interpreted in various ways, but also raises more questions than it answers.

186 countries were surveyed, the Netherlands being number 186 (highest DII) and Niger number 1 (lowest DII). India turned out to have a DII of 39 and is therewith in the “extreme risk” category, China 103, Brazil 110, and Russia 134, which are relatively a lot better and in the “medium risk” category, but China and “to a lesser extent Russia” in the ‘wrong’ way (limited internet freedom). To tease a little: instead of ‘India is the worst’ regarding digital divide, one also can reformulate it in a way that India is important enough to be a full BRICS member [even though it has/irrespective of] a low DII. The place of the new “S” in BRICS—South Africa—is not even mentioned in the Mail & Guardian article, but MapleCroft has put it in the “High Risk” category (see figure here, about halfway on the page).
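To put those positions in perspective, here is a quick back-of-the-envelope sketch in Python. It assumes that the numbers quoted above are rank positions out of 186 (1 = lowest DII, 186 = highest), which is how I read MapleCroft's news item; the function name is just something I made up for illustration.

```python
# Rank positions as quoted in this post (assumption: 1 = lowest DII, 186 = highest DII).
TOTAL_COUNTRIES = 186
ranks = {"India": 39, "China": 103, "Brazil": 110, "Russia": 134}


def share_ranked_below(rank: int, total: int = TOTAL_COUNTRIES) -> float:
    """Percentage of the surveyed countries that sit below the given rank position."""
    return 100.0 * (rank - 1) / total


for country, rank in ranks.items():
    print(f"{country}: position {rank} of {TOTAL_COUNTRIES}, "
          f"{share_ranked_below(rank):.1f}% of countries ranked lower")
```

Read this way, roughly a fifth of the surveyed countries rank below India, whereas more than half rank below China and Brazil and about seven in ten rank below Russia.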

According to MapleCroft, “Sub-Saharan Africa is by far the worst performing region for digital inclusion with 29 of the 39 countries rated ‘extreme risk’ in the index.” Summarizing the figure, Africa and South-East Asia are mostly in the high or extreme risk categories, Latin America, East-Europe and North Asia are in the medium or high risk categories, and the US, Canada, West-Europe, Japan, and Australia are in the low risk category. One of my fellow members at Informatici Senza Frontiere (Alessandra Cattani, who did her thesis on the digital divide) provided me the information that internet access in Italy is less than 40%, yet Italy is also in the low risk category according to the DII.

At the bottom of MapleCroft’s page, there is a paragraph rambling about the position of Tunisia, Egypt, and Libya in the ranking (81, 66, 77, respectively) and that “Internet and mobile phone technologies played a central role in motivating and coordinating the uprisings”. A third of the people in Tunisia use the internet and 16% are on Facebook, whereas only about 5% of the Egyptians and 3% of the Libyans use Facebook; all three countries are in the “high risk” category. This data can be ‘explained’ in any direction, even that Facebook access is so low that it can hardly have contributed to motivating the uprisings and that other factors mattered more (such as USAid and neoliberal policies in Egypt).

So, what exactly did MapleCroft measure? They used 10 indicators, being: “numbers of mobile cellular and broadband subscriptions; fixed telephone lines; households with a PC and television; internet users and secure internet servers; internet bandwidth; secondary education enrolment; and adult literacy”.

Considering fixed telephone lines is a bit of a joke in sparsely populated areas though, because it is utterly unprofitable for telecoms to lay the cables, so countries with low population density and a geographically more evenly distributed population are at a disadvantage in the DII (and are all telephone lines and TVs digital nowadays?). Mobile phone use is relatively high in Africa, and not just for calling family and friends but also, among others, for handling electronic health records, disaster management, banking, and school-student communications; the number of internet users has increased by some 2350% over the past 10 years (OneWorld news item, in Dutch). Even I could use mobile phone banking from the moment I opened my account here in SA, and they were surprised I did not know how to do that (even after about 6.5 years in Italy, I still had to ‘wait a little longer’ for Italian internet banking; they do not offer mobile phone banking). Then there are the ATMs here that offer services that would fall under ‘online banking’ in many a European country. But MapleCroft has not considered the type and intensity of usage, or the inventiveness of people to enhance one technology as a way to ‘counterbalance’ the ‘lack’ of another technology.

Regarding bandwidth, fibre optic cables for fast internet access are not evenly distributed around the globe (picture), and even when they pass close by, some countries are prevented from plugging into the fast lines (most notably Cuba—the lines are owned by US companies who are prevented from doing business with Cuba due to the blockade).

The last two indicators used to compute the DII may come as a surprise to some, but they should not: it is one thing to have the equipment, a whole different story to be literate enough to read and comprehend the information, and yet another story to have developed sufficient critical thinking to be able to separate the wheat from the chaff in the data deluge on the internet. India has an adult literacy of some 63%; compare this to an adult literacy of 90% in Brazil, 100% in Russia, 94% in China, and 89% in South Africa (data from UNICEF). Secondary education enrollment is trickier, and here UNICEF at least is more detailed, because it makes a difference between enrollment and attendance (and graduation and tertiary education, not covered by either one).

Then there’s digital inclusion, versus a digital divide. Both the bottom and the top echelon are “included”, according to MapleCroft, the former just with an extreme risk and the latter with a low risk of falling behind. It certainly has a friendlier tone to it than considering the divide it has created between people and the consequences that follow from it, both economic and social.

Take the underlying social divide: who has access? For instance, if there is one PC in the household, who uses it? Recollecting my even younger years, the PC access pecking order was Father > Mother (practically skipped) > Brother > Sister (me, youngest, female), which obviously has changed over the years for both my brother and me. There are other parameters to consider here, such as occupation and level of higher education, and several countries have whole groups of people that are at a relative (dis)advantage due to socio-economic, political, ethnic, disability etc. factors. However, it is a separate line of inquiry to determine to what extent this affects the inclusion or exacerbates the divide. MapleCroft did not include it in the DII.

And then there is the time dimension. The DII diagram is a snapshot (I do not know the measurement date), but comparison along a time axis may reveal trends. So will percentages. Take, for instance, internet users. Worldmapper has two beautiful figures as topically scaled maps (density equalized maps) for 1990 and 2002 data, which I showed earlier: the US shrunk relatively while Asia, Eastern Europe, Latin America, and Africa grew. No doubt they also grew a lot over the past 8 years.

Hence, overall, the coarse-grained ranking of the DII as such does not say much, and raises more questions than it answers. Aside from serving an underlying political agenda, the real news value of the DII is rather limited.