A brief reflection on maintaining a blog for 15 years (going on 16)

Fifteen years is a long time in IT, yet blogging software is still around and working; I am even using the same WordPress I started my blog with. Back in 2006, when WordPress offered only blogging functionality, it had the air of being respectable and at least somewhat serious compared to blogspot (which redirects to Blogger now), which hosted a larger share of the informal and whimsical blogs. Blogs are not nearly as popular now as they used to be. There seems to be a move to huddle together and take a ride on a branded bandwagon, like Medium and Substack, and all of the blog-providing companies have diversified the services they offer. WordPress now markets itself as website builder software rather than blogging software.

One might even be tempted to argue that blogs are (nearly) obsolete, with TikTok and the like having come along over the years. Not so, claims a blogger here and some 10 more bloggers here, and blogging is even a necessity according to another, who does provide a list of links to data to back it up. (Just maybe don’t try making a living from it: there are plenty of people who like to read, but writing doesn’t pay well.)

Some data for this blog, then. It has 325 published posts and around 400-600 visitors per month in recent years (depending on the season and posting frequency); there are people still signed up to receive updates (78), some even like some of the posts, and some posts are shared on Twitter and other social media. The most visited post of all time got over 21000 visits and counting (since 2011) and the most visited post in the past year (after the home page) still had a fine 355 visitors and is on my research and teaching topic (see also the occasionally updated vox populi). So, obsolete it is not. Admittedly, the latter post had its heyday in 2010-2012 with about 2500 visits/year and the former saw its best of times in 2014-2015 (4425 and 4948 visits in each of those years alone, respectively). The best visited post of the mere 10 posts I wrote in 2021 is on bias in ontologies, having attracted the attention of 119 visitors. Summarizing this blog’s stats trends: numbers are down compared to 5-10 years ago, indeed, but insignificant it is not and multiple posts have staying power.

Heatmap of monthly views to this blog over time.

I can also reveal that there’s no clear correlation between the time-to-write and number-of-visits variables, nor between either of them and the post’s topic, and not with post length either. With more time, there would have been more, and more polished, posts. There’s plenty to write about: not only the long overdue posts for published papers that came out at an extra-busy time and therefore never got written about, but also other interesting research that’s going on and deserves that extra bit of attention, some more book reviews, teaching updates and so on. There’s no shortage of topics, so running out of them turned out to be an unfounded worry from 15 years ago.

Will I go on for another 15 years? Perhaps, perhaps not. I’ve been fence-sitting from the very first post in 2006, which summed up the reasons for starting a blog, to this day, giving it a try nonetheless to see when and where it will end.

Why still fence-sitting? I still don’t know whether it’s beneficial or harmful to one’s career, and if beneficial, whether the time put into writing those posts would have yielded more benefit if spent on alternative activities. What I do know is that, among other things, it has helped me to learn to write better, it made me take notes during conferences in order to write conference reports and therewith engage more productively with a conference, structure ideas and thoughts, and pitch papers. Also, the background searches for fact-checking, adding links, and trying to find pictures made me stumble into interesting detours as well. Some of the posts took a long time to write, but at least they were enjoyable pastimes or worktimes.

Uhm, so, the benefit is to (just?) me? I do hope the posts have been worthwhile to the readers. But, it brings into vision the question that’s well-known to aspiring writers: should I write for myself or for my readers? The answer depends on whom you consult: blog for yourself, says the blogger from paradise, write for another, imaginary, reader persona, says the novelist, and go for bothsideism for the best results according to the writer’s guide. I write for myself, and brush it up in an attempt to increase a post’s appeal. The brushing up mainly concerns the choice of words, phrases, and paragraphs and the ordering thereof, and the images to brighten up some of the otherwise text-only posts (like this one).

After so many years and posts, I ought to be able to say something more profound. It’s really just that, though: the joy of writing the posts, the hope it makes a difference to readers and to what I’ve written about, and the slight worry it may not be the best thing to do for advancing my career.

Be this as it may, over the past few days, I’ve added a bit more structure to the blog to assist readers in finding the topics they may be interested in. The key categories are now also accessible from the ‘Menu’, being work-related topics (research and papers, software, and teaching) and posts on writing and publishing, and there are a few posts that belong to neither, which still can be found on the complete list of posts. Happy reading!

p.s.: in case you wondered: yes, I intended to do a reflection when the blog turned a nice round 15 in late March, were it not for that blurry extension to 2020 and lots of extra teaching and teaching admin duties in 2021. The summer break has started now and there’s not much of a chance to properly go on holiday, and writing also counts as leisure activity, so there the opportunity was, just about three months shy of the blog turning 16. (In case the post’s title vaguely rings a bell: yes, there’s that cheesy song from one of the top-5 movie musicals of all time [according to imdb], depicting a happy moment with promise of staying together before Rolfe makes some more bad decisions, but that’s 16 going on 17.)

Wikipedia + open access = not quite a revolution (not yet at least)

The title of the arxiv blog post sounded so catchy, and wishful thinking pushed it to a high truthlikeness: “Why Wikipedia + open access = revolution”, summarizing and expanding on arxiv.org/abs/1506.07608 with the title “Amplifying the Impact of Open Access: Wikipedia and the Diffusion of Science.” [1], with some quotes:

“The odds that an open access journal is referenced on the English Wikipedia are 47% higher compared to closed access journals,” say Teplitskiy and co.

Open access publishing has changed the way scientists communicate with each other but Teplitskiy and buddies have now shown that its influence is much more significant. “Our research suggests that open access policies have a tremendous impact on the diffusion of science to the broader general public through an intermediary like Wikipedia,” says Teplitskiy and co.

It means that open access publications are dramatically amplifying the way science diffuses through the world and ultimately changing the way we understand the universe around us.

I sooo want to believe. And, honestly, when I search for something and Wikipedia is the first hit and I do click, it does seem to give a decent introductory overview of something I know little about so that I can make a better start for searching the real sources. I never bothered to look up my own areas of specialisation, other than when a co-author mentioned there was (she put?—I can’t recall) a reference to her tool in Wikipedia some time ago. But there’s that nagging comment to the technologyreview blog post saying the same thing, and adding that when s/he looked up his/her own field, s/he

“then realized that in my own field, my main reaction was to want to scream at the cherry picking of sources to promote some minor researcher.”

So, I looked up “ontology engineering” and “Ontologies” that redirected to “Ontology (information science)” (‘information science’, tsk!)… and I kinda screamed. The next sections are, first, about the merits of the arxiv paper (outcome: their conclusions are rather exaggerated) and, second, I’ll use that ‘ontology (information science)’ entry to dig a bit deeper as use case, using both the English entry and the entries in several other languages, as that’s what the arxiv paper covers as well. I’ll close with some thoughts on what to do about it.

On the arxiv paper’s data and results

There are several limitations to the paper; some of them are discussed by its authors, some are not. The arxiv paper does not distinguish between scientific literature that is freely available online, where only the final typeset version is behind a paywall, and official ‘open access’. This is problematic when processing the computer science entries in Wikipedia to validate their hypothesis. In addition, they considered open access policy only at the journal level (cf. article-level analysis), idem for the problematic ISI impact factor, and considered only the 21000 journals listed in Scopus, eventually amounting to the (ISI index-)top 4721 journals, of which 335 are open access, to test Wikipedia content against. A journal counted as open access if it was listed in the directory of OA journals, ignoring the difference between ‘green’ and ‘gold’ and paywall-access from, say, Elsevier. Overall, this already does not bode well for extending the obtained conclusion to computer science entries and, hence, the diffusion of knowledge claim.

The authors admit they may undercount references for the non-English entries, but those entries have few references anyway (Fig 1 in the arxiv paper), so it’s basically an English-Wikipedia analysis after all, i.e., the conclusion does not straightforwardly extend to ‘diffusion of knowledge’ for the non-English speaking world.

The statistical model is described on p19 of the pdf, and I don’t quite follow the rationale, with an elusive set of ‘journal characteristics’ and some estimated variables without detail. Maybe some stats person can shed some light on it.

Then there is the bubble-figure in the technologyreview post, which is Fig 8 in the arxiv paper and is reproduced in the screenshot below. It “shows that across 50 [non-English] Wikipedias, there is an inverse relationship between the effects of accessibility and status on referencing”. Come again? It’s not like the regression line fits well. And why would the language entries, presumably independent of one another, be in a relation at all? Notwithstanding, the odds for a Serbian entry to have a reference to an open access journal are some 275% higher than to a paywalled one, vs entries in Turkish that cite higher impact factor journals some 200% more often, according to the arxiv paper. I haven’t found details of that data, though, other than a back-of-the-envelope calculation when glancing over the figure: Serbian has a 1.5 for impact and a 3.75 or so for open access, Turkish 3 and 1.3-ish. Of how many entries and how many citations for those languages? They state that “While the English Wikipedia references ~32,000 articles from top journals, the Slovak Wikipedia references only 108 and Volapuk references 0.”. But Volapuk still ends up with an open access odds ratio of 0.588 and an ln(impact factor) of 2.330 (Appendix A3), which is computed with the set of top-rated journals only; how is that possible when there are no references to those top journals? The number of counted journal citations is not given for each language, so a ‘statistically significant’ result may well actually be over a number that’s too low to do your statistics with. Waray-Waray is a very small dot, and reading from Fig 1, it’s probably not more than those 108 references in the Slovak entries.
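For those who want to redo the back-of-the-envelope bit: a minimal sketch of how an odds ratio maps to a "percent higher odds" statement, assuming (as I did when reading the figure) that the plotted values are plain odds ratios. The function name is my own for illustration, and the numbers are the eyeballed ones from above plus the Volapuk value from Appendix A3, not data taken from the paper's files.

```python
def odds_ratio_to_percent_change(odds_ratio: float) -> float:
    """Percent change in odds implied by an odds ratio (1.0 means no change)."""
    return (odds_ratio - 1.0) * 100.0


# Values read off Fig 8 by eye (approximate), plus Volapuk from Appendix A3.
eyeballed = {
    "Serbian, open access": 3.75,
    "Turkish, impact factor": 3.0,
    "Volapuk, open access": 0.588,
}

for label, ratio in eyeballed.items():
    change = odds_ratio_to_percent_change(ratio)
    print(f"{label}: odds ratio {ratio} -> {change:+.1f}% change in odds")
```

An odds ratio of 3.75 thus corresponds to the "275% higher" and 3 to the "200% more often", while an odds ratio below 1, like Volapuk's, would mean lower odds for open access journals.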

All in all, there is some room for improvement on this paper, and, in any case, some toning down of the conclusions, let alone technologyreview’s sensationalist blog title.

Fig 8 from Teplitskiy et al (2015)

Ontology (information science) Wikipedia entry, some issues

Let me not be petty by whining that none of my papers are in the references, but take a small sample of the myriad of issues.

Take the statement “There are studies on generalized techniques for merging ontologies,[12] but this area of research is still largely theoretical.” Uh? The reference is to an obscure ‘dynamic ontology repair’ project pdf from the University of Edinburgh, retrieved in 2012. We merged DMOP’s domain content with DOLCE in 2011, with tool support (Protégé, to be precise). owl:import was around and working at that time as well. Not to mention the very large body of papers on ontology alignment, the reference book by Shvaiko & Euzenat, and the Ontology Alignment Evaluation Initiative.

The list of ontology languages even includes SBVR and IDEF5 (not ontology languages), and, for good measure of scruffiness, a project (TOVE).

The obscure “Gellish” appears literally everywhere: it is an ontology, it is a language, it is an English dictionary (yes, the latter apparently also falls under ‘examples’ of ontologies. Not), and it is even the one and only instantiation of a “hybrid ontology” combining a domain and an upper ontology. Yeah, right. Looking it up, Gellish is van Renssen’s PhD thesis of 2005 that has an underwhelming 2 citations according to Google Scholar (10 for the related journal paper), and there’s a follow-up 2nd edition of 2014 by the same author, published with lulu, no citations. That does not belong in an introductory overview of ontologies in computing. Dublin Core as an example of an ontology? No (but it is a useful artefact for metadata annotations).

Under “criticisms”: aside from a Werner Ceusters statement from a commentary on someone from his website (since when does that deserve to be upgraded to Wikipedia content?!?), there’s also “It’s also not clear how ontology fits with Schema on Read (NoSQL) databases.”. Ontologies with NoSQL? Sigh.

“Further readings” would, I expect, have a fine set of core readings to get a more comprehensive overview of the field. While some relevant ones are there (e.g., the “what is an ontology?” paper by Oberle, Guarino, and Staab; “Ontology (Science)” by Smith, Gruber’s paper despite the flawed definition), numerous ones are the result of some authors’ self-promotion, like the one on bootstrapping biomedical ontologies, an ontology for user profiles, and IE for disease intelligence (they’re not even close to ‘staple food’ for ontologies), and the 2001 OIL paper and Ontology Development 101 technical report are woefully outdated. The “References” section is a mishmash of webpages, slides, and a few scientific papers, most of which are not from mainstream ontology research venues.

And that’s just a sampling of the issues with the “Ontology (information science)” Wikipedia entry; the ontology engineering entry is worse. No wonder my students—having grown up with treating Wikipedia as gospel—get confused.

Ontology entries in other languages

That much about the English language version of ‘ontology (information science)’. I happen to speak a few other languages as well, so I also checked most of those for their ‘ontology (information science)’ entry. For future reference as a stock-taking of today’s contents, I’ve pdf-printed them all (zipped). For starters, they all had ontologies at least categorised properly into ‘informatica’. +1.

The entry in Dutch is very short; one can quibble and nit-pick about term usage, and it is disappointing that there’s only one reference (in Dutch, so wouldn’t count in the arxiv analysis), but at least it’s not riddled with mistakes and inappropriate content.

The German one is quite elaborate, and starts off reasonably well, but has some mistakes. Among others, there is the typical novice pitfall of confusing classes with instances [“Stadt als Instanz des Begriffs topologisches Element der Klasse Punkte”], and the sample ontology (which of itself is a good idea to add to an overview page) has lots of modelling issues, such as datatypes and mixing subclasses with properties (the Maler [painter] with region of origin Flämisch [Flemish]). Interestingly, ontology types for the English reader are foundational, domain, and hybrid, whereas the German reader has only lightweight and heavyweight ones. As for the references, there are some oddball ones, but the fair/good ones are in the majority, if incomplete, and perhaps a bit lopsided to Barry Smith material.

The Italian entry is of similar length to the German entry, but, unfortunately, has some copy-and-paste from the English one when it comes to the list of languages and examples, so, a propagation of issues; the ‘example of applications’ does list another project, and there is no ‘criticisms’ section. The text has been written separately instead of being a translation-of-English (idem ditto for the other entries, btw), and thus also contains some other information. For starters, removing most of the ‘Premesse’ would be helpful (or elaborating on it in a criticism section; starting the topic with information warfare and terrorism? nah). The section after that (‘uso come glossario di base’) is chaotic, reading like a competitor-author per paragraph, and riddled with problematic statements like that all computer programs are based on foundational ontologies (“Tutti i programmi per computer si basano su ontologie fondazionali,”), and that the scope of an ontology is to develop a database (“Lo scopo di un’ontologia computazionale […] [è] di creare una base di dati”). It does mention OntoClean. Italian readers will also be treated to a brief discussion of the debate on one or multiple ontologies (absent from the other entries). It has a quite different set of ‘external links’ compared to the other entries, and there are hardly any references. All in all, one leaves with a quite distinct impression of ontologies after reading the Italian one cf. the Dutch, German, and English ones.

Last, the Spanish entry is about as short as the Dutch one. There’s overlap in content with the Italian entry in the sense of near-literal translation (on the foundational ontology and that Murray-Rust guy on the ‘semantic and ontological war’ due to ‘competition between standards’), and it has a plug for MathWorld (?!).

So, if the entries on topics I’m an expert in are of such dubious quality (the German entry is, relatively, the best), then what does that imply for the other entries that superficially may seem potentially useful introductory overviews? By the same token, they probably are not. And the ontology topics are not even in an area with as much contention as topics in political sciences, history, etc. Go figure.

Now what?

Is this a bad thing? I already can see a response in the making along the lines of “well, it’s crowdsourced and everyone can contribute, we invite you to not just complain, but instead improve the entry; really, you’re welcome to do so”. Maybe I will. But first, two other questions have to be answered. The arxiv paper that got my rant started claimed that open access papers are good, and that they’re reworked into interested-layperson digestible bites in Wikipedia to spread and diffuse knowledge in the world. The idea is nice, but the reality is different. Pretty much all the main papers on ontologies are freely available online even if not published ‘open access’ (computer science praxis, thank you), yet they are not the ones that appear in Wikipedia. Question 1: Why are those freely available main references of ontologies not referenced there already?

A concern of a different type is that several schools in South Africa have petitioned to get free Internet access to search Wikipedia as a source of information for their studies. Their main argument was that books don’t arrive, or arrive late, and there is no library in many schools, which is a common problem. They got the zero-rate Wikipedia from MTN; more info here. (I’ll let you mull over its effects on the quality of education they get from that.) Question 2: Can Wikipedia be made a really authoritative resource with the current set-up so as to live up to what the learners [and interested laypersons] need? If I were to write an update to the Wikipedia pages today, a pesky editor or someone else can simply click to roll it back to the previous version, or slowly but steadily have funny references seep back in and sentences cut and rephrased. Writing free textbooks, or at least extensive lecture notes, seems a better option, or a ‘synthesis lectures’ booklet endorsed by lots of people researching and using ontologies. What about a ‘this version is endorsed by …’ button for Wikipedia entries?

Any better ideas, or answers to those questions, perhaps? Free diffusion of digested high quality scientific knowledge really does sound very appealing…

References

[1] Teplitskiy M, Lu G, Duede E (2015) Amplifying the Impact of Open Access: Wikipedia and the Diffusion of Science. arxiv.org/abs/1506.07608

Improving science blogging

As a brief diversion from report-writing to meet deadlines and setting aside for a moment the discussion on science blogging journalism vs blogging by scientists, I had a quick look at the PLoS Biology paper on Advancing Science through Conversations: Bridging the Gap between Blogs and the Academy [1]. After the usual introductory things, they set out to

propose a roadmap for turning blogs into institutional educational tools and present examples of successful collaborations that can serve as a model for such efforts. We offer suggestions for improving upon the traditionally used blog platform to make it more palatable to institutional hosts and more trustworthy to readers; creating mechanisms for institutions to provide appropriate (but not stifling) oversight to blogs and to facilitate high-quality interactions between blogs, institutions, and readers; and incorporating blogs into meta-conversations within and between institutions.

For instance, as done by Stanford (and several others mentioned in the article), the university or research institute could host a blogging site that aggregates their blogging scientists to give some trustworthiness to the blog and, perhaps, could be a showcase to the wide world that the ‘ivory towers’ do care about the public and that the institute would want to add a new mode of communication with the wide world. The variation by MIT is broader in types of content, e.g. with editorial and tech review, and more of a top-down approach (even though the scientists who are blogging show more of a bottom-up process). Our uni just went on Facebook; would that be a step in the right direction (and add, say, LinkedIn for the alumni)?

Then there are issues of ‘blog review’, moderation, rankings and how one could approach that, as well as post categories of discussing peer reviewed published papers like research blogging, though I think one also could have other categories, such as with Ben Good’s experiment on putting out a draft for comment before submitting and where the blog (or a section thereof) is dedicated to some course the scientist is teaching.

More points and suggestions are being raised in the article (see in particular also the last section after figure 2), to which I might return after the deadline.

UPDATE: one of the article authors, Nick Anthis, has already written a blog post synthesising the various comments from other bloggers. Admittedly, I lag behind the mainstream blogging and perhaps should have spent the time writing that paragraph in the report instead of browsing articles and blogs…

[1] Batts SA, Anthis NJ, Smith TC (2008) Advancing Science through Conversations: Bridging the Gap between Blogs and the Academy. PLoS Biol 6(9): e240.