Google searches, sneaky, and data duplication

Aside from this blog and Facebook, I recently signed up for a Web 2.0ish site where researchers can connect and follow each other academically. It was even so ‘smart’ that it could tell me which of my FB friends were already in the system without my explicitly giving it any information about my FB account, and it found most of my papers for me (with some noise, though). Setting aside the uncomfortable former aspect, the ‘finding and handling my papers for me’ is actually really sneaky. I’ll spend the remainder of this post on that, just so you know what you’ll be letting yourself in for in case you sign up for it.

The first annoying thing is that if you let the site collect your papers automatically when you build up your profile there, which it seemingly does ‘intelligently’, it snatches your papers and either takes them from citeseer or puts them on scribd, even though they are all on my and the publisher’s websites, too. And then there’s some noise; e.g., it links to a pdf of the presentation you gave at the conference instead of the paper itself. Also, it does not provide full publication details (just the title), even though it could easily be programmed to screen scrape those from any researcher’s website, or, better, to ask me for a bib file.

And then there’s the real catch: when someone now searches for your papers, the URL to their version of the paper comes up higher in the Google ranking than either yours or the publisher’s. ‘Thanks’ to the site’s services, I now know which Google search terms visitors used before clicking through to which paper. So when on March 20 someone from an unknown country searched at 04:50am local time for “data information granularity bioinformatics”, the very first Google hit (yes, the rank is given in the stats as well) was the slides of my PhD defense on scribd, not my thesis on my website, which it should have been; and even if s/he had wanted to download the slides, clicking the “Download” button only produces the complaint that “You must be logged in to download” (!). The slides were probably not what s/he was after (the same goes for the visitor from Poland, who searched for this). There are many such misdirected instances. In general, visitors would essentially have had to search again to get the publication data and, where the wrong file was linked, to find the right file. This is correctable on the site, but only manually. Subsequently adding new papers is also a manual process with an impractical GUI.

And then I have not said anything yet about the scruffy page rendering by scribd. Besides, I never gave scribd approval to offer my work on or through their site (there are more malfunctioning aggregators, which is an issue of its own). In addition, it would not surprise me if that violated the delicately balanced copyright arrangements that exist for CS publications. UPDATE: the terms and conditions (dated 21-4-2011) say that “By displaying or publishing (“posting”) any Content on or through the Services, you hereby grant to a limited license to use, modify, publicly perform, publicly display, reproduce, and distribute such Content solely on and through the Services.” That does violate the delicately balanced copyright arrangements that exist for CS publications. The terms & conditions also say it is the Member’s own responsibility to get approval from the respective publishers/copyright holders.

Moreover, there is a preposterous “message” on the right-hand side of the search statistics: “Tip: To make your page appear higher up on Google: Link to your page from your department website; Upload more documents – papers, talks and a CV”. But my own site was higher up in the Google ranking before I signed up to your devious service! Honestly, I want to lure people to my site when they are interested in my contributions, not to a place where partial information is badly duplicated. Ok, this smells of ego-tripping, but my site is worth almost $17K by one estimate and has a page rank of 5; if I run out of money, I’ll have an asset to sell fairly easily without much disruption. Seriously though, this ‘rerouting’ of visitors away from the source toward some obscure other location on the Internet is obviously a more important issue at the institutional level. No sane university or research institute would want to have as policy to redirect visitors to any site other than its own when it comes to displaying the scientific impact its employees have made. If one is at a non-indexable institute that only happens to carry the title ‘university’ but is not one in substance, then perhaps the site helps with your visibility. But I am not at such an institute; UKZN is one of the 5 top research-intensive universities in South Africa.

So what can you do? Remove your papers from the site. This I did this morning, one by one. The sad thing is that “following” other researchers is, in theory at least, an easier way to be notified automatically of their new publications than manually checking their homepages regularly, but this is precisely what the site manages to mess up, badly.

I can envision a couple of mean scenarios for why anyone would have wanted to set up the site the way it is, such as first polluting the Google rankings and then asking for a fee in the near future (after all, they already require you to log in to download the file, and a “fee” item is included in the terms & conditions file). The statistics they are gathering on who-follows-who give better insight into research networks and their leaders than the more common citation-network analyses. Finding out which scientific papers and topics are ‘hot’ must be valuable material as well, and could perhaps become just as important as the rather imprecise ISI impact factor, which is quite useless for CS at the moment. You could also use the data for NLP and semantic annotations to offer, in the near future, indispensable academic semantic search facilities (at a price). And no doubt there are more scenarios.

In short, that was the end of that “Web 2.0” experiment for me.

(p.s.: just in case someone wants to see some proof: I did make several screenshots that I can share)