Google searches, sneaky, and data duplication

Aside from this blog and Facebook, I recently signed up for Academia.edu, a Web 2.0ish site where researchers can connect and follow each other academically. It was even so ‘smart’ that it could tell me which of my FB friends were already in the system without me explicitly giving it the information about my FB account, and it found most of my papers for me (with some noise, though). Setting aside the uncomfortable former aspect, the ‘finding and handling my papers for me’ is actually really sneaky. I’ll spend the remainder of this post on that, just so you know what you’ll be letting yourself in for in case you sign up for it.

The first annoying thing is that if you let Academia.edu collect your papers automatically when you build up your profile there, which it seemingly does ‘intelligently’, it snatches your papers and either takes them from citeseer or puts them on scribd, even though they are all on my and the publisher’s websites, too. And then there’s some noise; e.g., it links to a pdf of the presentation you did at the conference instead of the paper itself. Also, it does not provide full publication details (just the title), even though it could easily be programmed to screen scrape that from any researcher’s website, or, better, ask me for a bib file.
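To give an idea of how little it takes to get full publication details out of a bib file, here is a minimal Python sketch. It is only an illustration under assumptions: the sample entry is made up, the function name `parse_bib_entry` is hypothetical, and it only handles simple entries with one `name = {value}` field per line (a real BibTeX parser has to deal with much more).

```python
import re

# Hypothetical BibTeX entry, made up for illustration only.
SAMPLE_ENTRY = """@inproceedings{doe2010example,
  author = {Doe, Jane},
  title = {An Example Paper},
  booktitle = {Proceedings of an Example Workshop},
  year = {2010},
  pages = {1--10}
}"""

# Entry header, e.g. '@inproceedings{doe2010example,'.
HEADER_RE = re.compile(r'@(\w+)\s*\{\s*([^,\s]+)\s*,')
# Simple single-line fields of the form 'name = {value}' or 'name = "value"'.
FIELD_RE = re.compile(r'^\s*(\w+)\s*=\s*[{"](.*?)[}"]\s*,?\s*$', re.MULTILINE)


def parse_bib_entry(text):
    """Return the entry type, citation key, and fields of one simple BibTeX entry."""
    header = HEADER_RE.search(text)
    if header is None:
        raise ValueError("no BibTeX entry header found")
    fields = {name.lower(): value for name, value in FIELD_RE.findall(text)}
    return {"type": header.group(1).lower(), "key": header.group(2), "fields": fields}


if __name__ == "__main__":
    entry = parse_bib_entry(SAMPLE_ENTRY)
    print(entry["type"], entry["key"])
    for name, value in entry["fields"].items():
        print(f"  {name}: {value}")
```

With the author, venue, year, and pages in hand, a profile page could show a complete reference rather than just a title.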

And then there’s the real catch: when someone now searches for your papers, the URL to their version of the paper comes up higher in the Google ranking than either yours or the publisher’s. ‘Thanks’ to Academia.edu’s services, I now know with which Google search terms visitors clicked through to retrieve which paper. So when on March 20 someone from an unknown country at 04:50am local time searched for “data information granularity bioinformatics”, s/he found as the very first Google hit (yes, the rank is given in the stats as well) the slides of my PhD defense on scribd, not my thesis on my website (which it should have been); and even if s/he wanted to download them, upon clicking the “Download” button, the site complains that “You must be logged in to download” (!). The slides were probably not what s/he was after (idem ditto for the visitor from Poland, who searched for this). There are many such misdirected instances. Essentially, they would have had to do the search again to get the publication data and, where the wrong file was linked, to find the right file. It is correctable on Academia.edu, but only manually. Subsequently adding new papers is also a manual process with an impractical GUI.

And then I have not said anything yet about the scruffy page rendering by scribd. Besides, I never gave scribd approval to offer my work on/through their site (there are more malfunctioning aggregators, which is an issue of its own). In addition, it would not surprise me if that would violate the delicately balanced copyright arrangements that exist for CS publications. UPDATE: the terms and conditions (d.d. 21-4-2011) say that “By displaying or publishing (“posting”) any Content on or through the Services, you hereby grant to Academia.edu a limited license to use, modify, publicly perform, publicly display, reproduce, and distribute such Content solely on and through the Services.” That does violate the delicately balanced copyright arrangements that exist for CS publications. The terms & conditions also say it is the Member’s own responsibility to get the approval from the respective publishers/copyright holders.

Moreover, there is a preposterous “message” on the right-hand side of the search statistics: “Tip: To make your page appear higher up on Google: Link to your page from your department website; Upload more documents – papers, talks and a CV”. But my own site was higher up in the Google ranking before I signed up to your devious service! Honestly, I want to lure people to my site when they are interested in my contributions, not to a place where there is partial information badly duplicated. Ok, this smells of ego-tripping, but my site is worth almost $17K according to and has a page rank of 5; if I run out of money, I’ll have an asset to sell fairly easily without much disruption. Seriously though, this ‘rerouting’ of visitors away from the source toward some obscure other location on the Internet is obviously a more important issue at the institutional level. No sane university or research institute would want to have as policy to redirect visitors to any other site than their own when it comes to displaying the scientific impact its employees have made. If one is at a non-indexable institute that only happens to carry the title ‘university’ but is not one in substance, then perhaps Academia.edu helps with your visibility. But I am not at such an institute; UKZN is one of the 5 top research-intensive universities in South Africa.

So what can you do? Remove your papers on Academia.edu. This I did this morning, one by one. The sad thing is that “following” other researchers is, in theory at least, an easy way to be notified automatically of their new publications compared to manually checking their homepages regularly, but this is precisely what Academia.edu manages to mess up, badly.

I can envision a couple of mean scenarios why anyone would have wanted to set up the site in the way it is, like that they first pollute Google rankings and then ask for a fee in the near future (after all, they already require you to log in to download the file, and a “fee” item is included in the terms & conditions file). The statistics they are gathering on who-follows-whom give better insight into research networks and their leaders than the more common citation-network analyses. Finding out which scientific papers and topics are ‘hot’ must be valuable material as well, and may become just as important as the rather imprecise ISI impact factor that is quite useless for CS at the moment. One could also use the data for NLP and semantic annotations to, in the near future, offer indispensable academic semantic search facilities (at a price). And no doubt there are more scenarios.

In short, that was the end of that “Web 2.0” experiment for me.

(p.s.: just in case someone wants to see some proof: I did make several screenshots that I can share)


36 responses to “Google searches, sneaky, and data duplication”

  1. I spotted your blog via google. Search terms: “ copyright”. I agree with your statements and removed my research-papers, too.
    The benefits of this site are small and its GUI is still poor.
    However, if you are a “mobile element” in science and change institutions after funding ended you may be difficult to track. Academia.edu gives you the opportunity to create a universal personal link-hub to your current position.
    Again, I share your opinion to choose the information content wisely, presented on that site. This should be common sense anyway.

  2. Hi cattelhill,

    If the
    “if you are a “mobile element” in science and change institutions after funding ended you may be difficult to track.”
    would be the only reason, then that can be easily fixed by other means: get a domain name and set up your own website, be it separately hosted on a friend’s server like I have (at, or use the department’s users homepage directory and put a redirect from the chosen domain name to the temporary url for your department’s users home page directory.

  3. I also found your post through googling “ pdf copyright”. Their pdf handling is pretty sneaky indeed. And – of course – they have their terms covering everything, but it is still a very bad practice, aimed at maximizing traffic. I prefer as a 2.0 academia network. They ask permission for everything, and you don’t have the feeling they are being unethical.

  4. Hi,

    Thanks for the post. I agree. Initially, I was impressed by how Academia.edu found many of my publications. Unfortunately, there were duplicates and because I didn’t want ‘my’ profile looking messy, I felt I had to dedicate some time to go through each citation and add the missing detail/delete duplicates.

    Best wishes,

    • impressed/surprised, yes. But Google Scholar found them (almost) all, too–and still does–automatically, with much less noise and more data. i.e., GS is better than Academia.edu when it comes to finding papers. Also, at least GS also shows links to the openly accessible copies without any modification or new access restriction as scribd does (the links I checked on my GS page were to the publisher’s website and to copies on citeseer, a co-author’s homepage, workshop’s website etc., or to my homepage)

  5. Dear Keet,

    I found your article insightful, thank you for sharing your opinions.
    I am a graduate student and I have found Academia.edu’s service excellent so far, so I was surprised by your article and decided to verify your statements.
    I found out that what you say about the terms and conditions (T&C) is indeed literally true, but I believe that it may mislead the reader. This is the reason I decided to post this comment.

    First, you omitted that T&C also states that:

    “Academia.edu *does not claim any ownership rights* in the Content that you post to the Services”

    and that

    “you *continue to retain all ownership rights in such Content*, and you continue to have the right to use your Content in any way you choose”.

    This seems pretty fair.
    Moreover, even if the T&C states that you give a “limited license to use, modify, publicly perform, publicly display, reproduce, and distribute such Content”, this is true “solely on and through the Services”.

    This means that you authorize Academia to show your papers on (and only on) Academia.edu, to let people download them from there, to view them on their PC, print them, read them, etcetera. This seems the point of sharing a paper on Academia. So far so good!

    The right to “modify” seems more sneaky. Nevertheless, I think this may be determined by Academia’s need to convert some file formats, for instance powerpoint to pdf. (Another remark here: you forgot to make an update about Academia’s policy of uploading papers on scribd; it ended several months ago.) I have never heard of any Academia.edu paper being modified in its content by Academia.edu, and I believe it would not be in the interest of Academia.edu’s staff to do so: remember that they do not have any ownership rights, and the putative “modified” copy would still be yours (and useless, therefore unlikely to exist).

    Lastly, the T&C states that the license is limited. This means that whenever you want you can revoke Academia’s rights on your paper. This seems a pretty good way to solve any problem with Academia’s service, if you find out something isn’t going the way you expected (even if I don’t understand what you fear Academia would do with your paper, if I have understood the T&C correctly so far).

    To conclude, I understand that Academia.edu’s high ranking on google might be annoying for a well-known professor, but this is also a great advantage for the “small fishes” like me, and it may represent a stimulus to a (virtually) democratic academic agora (I’m not saying this is necessarily positive; I am only suggesting a less dysphoric interpretation of this feature of Academia.edu).

    Please notify me if you believe I somehow misunderstood Academia.edu’s T&C, given that I am neither a native English speaker nor a law student.



  6. Dear Neri,
    Thank you for your extensive reply.
    Please note that the blogpost was written about 2 years ago, taking into account the then active T&C and how the site’s features were working then. Given that situation, I wrote the blog post and I was sufficiently disappointed to leave it aside indefinitely.
    If it has changed for the better, fine, as the idea of automatically following other researchers itself is good. But the bad first experiences don’t make me jump on board now, as it may well revert back, and, still, as for myself, I’d rather direct people searching for me to my homepage than to Academia.edu.

  7. Pretty nice post. I just stumbled upon your weblog
    and wished to say that I’ve truly enjoyed browsing your blog posts. In any case I’ll be subscribing to your feed and I
    hope you write again soon!

  8. I had just started to be so active in Academia, but also was reluctant to put my papers there. Then I stumbled upon this entry, read it, and thought: oh well… maybe putting the titles of my papers without uploading the files should be fine. I despise things related to copyright infringements & plagiarism and the like, and really, reading your entry has made me more aware BEFORE uploading my stuff to Academia. I guess it’s just right to have the website for building networks among scientists, but if the website gets sneaky (to use your term), it’s another thing. Thanks for the heads-up! 🙂

  9. Pingback: Google searches, sneaky, and data ...

  10. Great post. I literally just signed up for Academia.edu, and the first thing they asked me to do was upload my publications. Naturally, the next thing I did was search ‘Academia.edu copyright infringement’ and found this post. I have now deleted my account until I check out this situation in detail. Thanks for writing.

  11. I’ve been surfing online more than 3 hours today,
    yet I never found any interesting article like yours. It is pretty worth enough for me.
    In my opinion, if all webmasters and bloggers made good content as you did, the web will be much more useful than ever before.

  12. Well, to add my two cents worth of opinion, Academia.edu is fine with me. I managed to congregate my publications under one roof, albeit not mine, but then there are always tradeoffs. And yes, many colleagues and other interested persons downloaded the ones that interested them.
    I have not really searched the issue of searches but I am glad that Academia is found in Google. Otherwise where would I post my talks and make them accessible and available in web searches?
    Thanks for hosting my comment.

  13. Pingback: 8 years of keetblog | Keet blog

  14. Have you ever considered writing an e-book or guest authoring on other blogs?
    I have a blog centered on the same subjects you discuss and would really like to
    have you share some stories/information. I know my viewers would appreciate your work.
    If you are even remotely interested, feel free to shoot me an e mail.

  15. I’m no longer sure where you’re getting your information, but good topic.
    I needs to spend a while learning much more or figuring out more.
    Thank you for fantastic information I used to be on the lookout
    for this information for my mission.

  16. Pingback: Login - Sign In Page

  17. Is it really Academia.edu’s “fault” that their pages get to rank higher than the original sources of the papers? I would think they don’t have much influence on that (albeit they might consider it lucky), since it is Google’s search algorithms that determine which content “deserves” to rank highest.

    • There are ‘interesting’ things one can do from the SEO side, but my main point was that I (and I presume also any well-known university) don’t want Academia.edu showing my profile higher up than my own webpage resp. the university’s website due to their aggregate results, for several reasons mentioned in the post. Also, that an aggregate may come out higher by PageRank doesn’t imply it has better results for individuals and organizations (e.g., due to dirty or incomplete data Academia.edu has crawled).

    • afaik, no (but I don’t use Academia.edu anymore). if not logged in, then certainly not, but you’ll see just IP address and all that comes with it (city etc.)

  18. Definitely imagine that that you stated. Your favorite reason appeared to
    be on the internet the simplest factor to
    bear in mind of. I say to you, I definitely get annoyed whilst folks think about issues that they just
    do not realize about. You controlled to hit the nail upon the highest and defined out the whole thing without having side effect , other people can take a
    signal. Will probably be again to get more. Thanks


    • one does not ‘post’ or ‘put’ them into google scholar; their bots crawl the web and their algorithms try to figure out which ones are the scientific papers. I’m not sure if researchgate is any better than Academia.edu, but I’ve heard fewer complaints about them. Another option may be to put your papers on a preprint server such as arxiv (or similar, depending on your field of research). Also, registering with e.g. wordpress (that sets up the blog for you) is very easy to do and then make a blog-as-homepage [choose ‘new page’ rather than ‘new post’].

  20. Pingback: UH OH | jschoolmann

  21. It’s really a cool and useful piece of info.
    I’m glad that you just shared this helpful information with
    us. Please stay us informed like this. Thank you for sharing.

  22. I don’t belong to Academia.edu, but I received a message today that they were granted access to my Google account. How is this possible?

  23. Pingback: A brief reflection on maintaining a blog for 15 years (going on 16) | Keet blog
