Every now and then, I get side-tracked from what I was (supposed to be) doing. This time, it was the result of a combination of preparing ICPC training problems, preparing a statistics tutorial for the postgraduate research methods course, and a conversation last week about an isiZulu corpus with Langa Khumalo from UKZN’s ULPDO (and my co-author on several papers on isiZulu CNLs). To make a long story short, I ended up sourcing some online news articles in isiZulu and writing a little Python script to count the words and the top-k words of the news articles, to get a feel for what the most prevalent topics of the articles were.
Materials and data
10 Isolezwe articles, listed on the front page on August 8, 2015 (articles were from Aug 6 and 7; no updates over the long weekend)
10 News24 in isiZulu articles, listed on the front page on August 8, 2015 (articles were from Aug 8)
10 News24 in isiZulu articles, listed on the front page on August 9, 2015 (articles were from Aug 9, a Sunday, and Women’s Day in South Africa)
A simple basicCorpusStats.py script, which one can already write after going through just the first part of Think Python (in case you’re unfamiliar with Python).
Note: Ilanga doesn’t have articles online, and therefore was not included.
Note 2: for copyright reasons, I probably cannot share the txt files online, but in case you’re interested, just ask me and I’ll email them.
Some general stats
Isolezwe had, on average, 265 words/article, whereas News24 had about half of that (110 and 134 on Saturday and Sunday, respectively). The top-20 of each is listed at the end of this post (the raw results of News24 had “–” removed [bug], as well as udaba and olunye [standard-text noise from the articles]).
Comparing them on the August 8 offering, Isolezwe had people saying this, that and the other (ukuthi ‘saying/to say’ had the highest frequency, n=60), followed by the police (amaphoyisa, n=27). News24 had amaphoyisa as its most frequent word, also 27 times, followed by abasolwa (‘suspects’) 11 times, a word that doesn’t even appear in Isolezwe’s top-20 most frequent words (though the stem –solwa appears 9 times there). The police are problematic in South Africa: they commit crimes and engage in other dubious behaviour that is under investigation (e.g., Marikana), more get killed than in many other countries (another one last week), and crime happens. But not on a public holiday, apparently: News24 had only one –phoyisa on Aug 9.
While I hoped to find a high incidence of women, it being Women’s Day on August 9, no word with the stem –fazi appeared in the News24 mini-corpus of 1353 words from the 10 front page articles; instead, there was a lot of saying this, that and the other (ukuthi had the highest frequency, n=37), and little on suspects or blaming (–solwa, n=3).
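For readers who want to try something similar, here is a minimal sketch of how such counts could be computed. It is not the actual basicCorpusStats.py; the function names, the tokenisation regex, and the choice to filter noise words only from the frequency table (not from the word totals) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: the standard-text noise words mentioned above; adjust as needed.
NOISE = {"udaba", "olunye"}

def tokenize(text):
    """Lowercase the text and return runs of letters (hyphens and apostrophes allowed inside a word)."""
    return re.findall(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*", text.lower())

def corpus_stats(articles, k=20, noise=NOISE):
    """Return (total words, mean words per article, top-k words) for a list of article texts."""
    counts = Counter()
    total = 0
    for text in articles:
        tokens = tokenize(text)
        total += len(tokens)  # totals include the noise words
        counts.update(t for t in tokens if t not in noise)  # the frequency table does not
    mean = total / len(articles) if articles else 0.0
    return total, mean, counts.most_common(k)

# Toy input, not the real articles
articles = ["Amaphoyisa athi ukuthi amaphoyisa", "Udaba olunye ukuthi imali"]
total, mean, top = corpus_stats(articles, k=2)
print(total, mean, top)
```

Reading the real txt files into the `articles` list (one string per article) is all that is needed to reuse it on an actual mini-corpus.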
On that quasi wordle
While ukuthi is the infinitive, there are a gazillion conjugations and things agglutinated to it, and how it all works is barely clear even to the linguists, so I did not analyse that further. Amaphoyisa, on the other hand, as a noun (‘police’, plural), has fewer variations. In the Isolezwe mini-corpus, –phoyis– (the root of ‘police’) appeared 47 times, including variants like lwamaphoyisa, ngamaphoyisa, yiphoyisa, i.e., substantially more than the 27 occurrences of amaphoyisa. If I were to create a wordle, those variants would be missed unless one uses a stemmer, which doesn’t happen to be available, and I didn’t write one (I just used regex on the txt files). By the same token, News24’s mentions of the police on August 8 go up to 28 with –phoyisa, with the blaming and suspects as a close second (–solwa, n=27).
The lack of a stemmer also means missing out on all sorts of variations on imali (‘money’, n=11) in the Isolezwe articles, whereas its stem –mali pops up 29 times, due to, among others, kwemali (n=5), mali (n=3), yimali (y- functioning as copulative in that sentence, n=1), ngezimali (n=1) and others. Likewise for person/people (–ntu), with n=17 distributed among abantu (plural), umuntu (singular), and nabantu (‘and people’), among others.
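The regex-instead-of-stemmer approach can be sketched roughly as follows. This is an illustration only: `count_root` is a hypothetical helper, the token list is made up from the variants mentioned above, and plain substring matching will overcount whenever a root happens to occur inside an unrelated word.

```python
import re
from collections import Counter

def count_root(tokens, root):
    """Count the tokens that contain `root` anywhere, grouped by surface form."""
    pattern = re.compile(re.escape(root))
    return Counter(t for t in tokens if pattern.search(t))

# Toy token list, not the real corpus
tokens = "amaphoyisa lwamaphoyisa ngamaphoyisa yiphoyisa amaphoyisa imali kwemali".split()

variants = count_root(tokens, "phoyis")
print(sum(variants.values()))   # total number of tokens containing the root
print(variants.most_common())   # breakdown per variant
```

Grouping by surface form keeps the individual variants visible, which helps to spot false hits that a proper stemmer or morphological analyser would not produce.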
Last, the second most frequently used word in News24 on August 9 was njengoba (‘as’, ‘whereas’, ‘since’), primarily due to the first article on the sports results of the matches played.
So, with all that background knowledge, Isolezwe’s wordle would be, in descending order (and in English for the readers of this blog): say, police, money, people. News24 on August 8: police, suspect/blame, say (two variations, n=9 each). News24 on August 9: say, as/since (and then some other adverbs).
This dabbling raised more problems and questions than it answered. But, for now, it’s at least still a bit of a peek into the kitchen of news in a language that I don’t master as well as I want to and should. It wasn’t useful for either the ICPC problem setting or the stats tutorial, nor is a 5123-word corpus of any use, but it was fun with Python at least and it satisfied at least a little of my curiosity, and perhaps it spurs someone to do all this properly/more systematically and on a grander scale. For the isiZulu speakers: it’s surely still up to you to read whichever news outlet you prefer reading.
[1] Pretorius, L., Bosch, S.E. (2010). Finite-state morphology of the Nguni language cluster: modelling and implementation issues. In Yli-Jyrä, A., Kornai, A., Sakarovitch, J. & Watson, B. (Eds.), Finite-State Methods and Natural Language Processing, 8th International Workshop, FSMNLP 2009. Lecture Notes in Computer Science, Vol. 6062, pp. 123–130.
[2] Spiegler, S., van der Spuy, A., Flach, P.A. (2010). Ukwabelana – an open-source morphological Zulu corpus. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING’10), pp. 1020–1028. Association for Computational Linguistics. Beijing.
| Top-20 words Isolezwe on Aug 8 | Top-20 words News24 on Aug 8 | Top-20 words News24 on Aug 9 |
 There is some material on that (among others, [1,2]), but it’s mostly theoretical or very much proof of concept rather than easily reusable tools like those for English, and the example rule in  isn’t right (it’s umfana, not umufana; the longer prefix with the extra –u– is used when the stem has one syllable, as in –ntu -> umuntu).