On computer program being a whole

Who cares whether some computer program is a whole, how, and why? Turns out, more people than you may think—and so should you, since it can be costly depending on the answer. Consider the following two scenarios: 1) you download a ‘pirated’ version of MS Office or Adobe Photoshop (the most popular ones still) and 2) you take the source code of a popular open source program, such as Notepad++, add a little code for some additional function, and put it up for sale only as an executable app called ‘Notepad++ extreme (NEXT)’ so as to try to earn money quickly. Are these actions legal?

In both cases, you’d break the law, but how many infringements took place, each of which you could potentially be fined or face jail time for? For the piracy case, is that once for the MS Office suite, or for each program in the suite, or for each file created upon installing MS Office, or for each source code file that went into making the suite during software development? For the open source case, was that violating its GNU GPL open source licence once for the zipped and downloaded or cloned source code, or for each file in the source code, of which there are hundreds? It is possible to construct similar questions for trade secret violations and patent infringements for programs, as well as other software artefacts, like illegal downloads of TV series episodes (going strong during COVID-19 lockdowns indeed). Just in case you think this sort of issue is merely hypothetical: recently, Arista paid Cisco $400 million for copyright damages and just before that, Zenimax got $500 million from Oculus (yes, the VR software) for trade secret violations, and Google vs Oracle is ongoing with “billions of dollars at stake”.

Let’s consider some principles first. To be able to answer the number of infringements, we first need to know whether a computer program is a whole or not and why, and if so, what’s ‘in’ (i.e., a part of it) and what’s ‘out’ (i.e., definitely not part of it). Spoiler alert: a computer program is a functional whole.

To get to that conclusion, I had to combine insights from theories of parthood (mereology), granularity, modularity, unity, and function and add a little more into the mix. The argumentation is available in two versions: a longer technical report [1], which I hope is readable by a wider audience, and a condensed version for a specialist audience [2] that was published in the Proceedings of the 11th Conference on Formal Ontologies in Information Systems (FOIS’20) two weeks ago. Very briefly and informally, the state of affairs can be illustrated with the following picture:

(Source: adapted from [2])

This schematic representation shows, first, two levels of granularity: level 1 and level 2. At level 1, there’s some whole, like the a1 and a2 in the figure that could be referring to, say, a computer program, a module repository, an electorate, or a human body. At a more fine-grained level 2, there are different entities, which are in some way linked to the respective whole. This ‘link’ to the whole is indicated with the vertical dashed lines, and one can say that they are part of the whole. For the blue dots on the right residing at level 2, i.e., the parts of a1, there’s also a unifying relation among the parts, indicated with the solid lines with arrows, which makes a1 an integral whole. Moreover, for that sort of whole, it holds that if some object x (residing at level 2) is part of a1 then if there’s a y that is also part of a1, it participates in that unifying relation with x and vice versa (i.e., if y is in that unifying relation with x, then it must also be part of a1). For the computer program’s source code, that unifying relation can be the source tree graph.

There is some nitty gritty detail also involving the notion of function—a source code file contributes to doing something—and optional vs mandatory vs essential part that you can read about in the report or in the paper [1,2], covering the formalisation, more argumentation, and examples.

How would it pan out for the infringements? The Notepad++ exploitation scenario would simply be a case of one infringement in total for all the files needed to create the executable, not one for each source code file. This conclusion from the theory turns out to be remarkably in line with the GNU GPL’s explanation of their licence, while also providing a theoretical foundation for their intuition that there’s a difference between a mere aggregate where different things are bundled, loose coupling (e.g., sockets and pipes), and a single program (e.g., using function calls, being included in the same executable). The order of things perhaps should have been from there into the theory, but practically, I did the analysis and stumbled into a situation where I had to look up the GPL and its explanatory FAQ. On the bright side, in the other direction now then: just in case someone wants to take on copyleft principles of open source software, here are some theoretical foundations to support that there’s probably much less money to be gained than you might think.

For the MS Office suite case mentioned at the start, I’d need a look under the hood to determine how it ties together and one may have to argue about the sameness of, or difference between, a suite and a program. The easier case for a self-standing app, like the 3rd-place most pirated Windows app Internet Download Manager, is that it is one whole and so one infringement then.

It’s a pity that FOIS 2020 has been postponed to 2021, but at least I got to talk about some of this as expert witness for a litigation case and I managed to weave an exercise about the source tree with open source licences into the social issues and professional practice module I taught to some 750 students this past winter.

References

[1] Keet, C.M. Why a computer program is a functional whole. Technical report 2008.07273, arXiv. 21 July 2020. 25 pages.

[2] Keet, C.M. The computer program as a functional whole. Proc. of FOIS 2020. Brodaric, B. and Neuhaus, F. (Eds.). IOS Press. FAIA vol. 330, 216-230.


UVa 11357 Ensuring truth solution description

We’re in the midst of preparing for the ICPC Southern Africa Regionals, to be held in October, and so I have stepped up reading problems to find nice ones to train the interested students in a range of topics. The “Ensuring truth” problem was one of those, which I’ll discuss in the remainder of the post, since there’s no discussion of it online yet (only some code), and it is not as daunting as it may look at first glance:

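[Image: the “Ensuring truth” problem statement, with the propositional logic truth tables and the BNF grammar for the formulas]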

The task is to determine whether such a formula is satisfiable.

While it may ‘scare’ a 1st or 2nd-year student, when you actually break it down and play with an example or two, it turns out to be pretty easy. The ‘scary’ looking aspects are the basic propositional logic truth tables and the BNF grammar for (simplified!) Boolean formulas. Satisfiability of normal Boolean formulas is NP-complete, which you may have memorised, so that looks daunting as well, as if the contestant would have to come up with a nifty optimization to stay within the time limit. As it turns out, not so.

Instead of being put off by it, let’s look at what is going on. The first line of the BNF grammar says that a formula can be a clause, or a formula followed by a clause, separated by a disjunction (‘|’, i.e., ‘or’). The second line says that a clause is a conjunction of literals in brackets, which (in the third line) transpires to be just a series of ‘and’ (&) conjunctions between literals. The fourth line states that a literal can be a variable or its negation, and the fifth line states that a variable is one of the letters in the alphabet.

Now try to generate a few inputs that adhere to this grammar. Swapping one variable at a time on the left of the “::=” sign for one of the elements on the right-hand side of the “::=” sign in the BNF grammar, with steps indicated with “=>”, then e.g.:

<formula> => <formula> | <clause> => <clause> | <clause> => (<conjunction-of-literals>) | <clause> => (<literal>) | <clause> => (<variable>) | <clause> => (a)| <clause> => (a)| (<conjunction-of-literals>) => (a)|(<conjunction-of-literals> & <literal>) => (a)|(<conjunction-of-literals> & <literal> & <literal>) => (a)|(<conjunction-of-literals> & <literal> & <literal> & <literal>) => (a)|(<literal> & <literal> & <literal> & <literal>) => (a)|(~<variable> & <literal> & <literal> & <literal>) => (a)|(~a & <literal> & <literal> & <literal>) => (a)|(~a & <variable> & <literal> & <literal>) => (a)|(~a&b& <literal> & <literal>) => (a)|(~a&b& <variable> & <literal>) => (a)|(~a&b&a& <literal>) => (a)|(~a&b&a& <variable>) => (a)|(~a&b&a&c)

That is, (a)|(~a&b&a&c) is in the language of the grammar, as are the two in the input given, being (a&b&c)|(a&b)|(a) and (x&~x). Do you see a pattern emerging of what the formulas look like with this grammar?

It’s a series of disjunctions of conjuncts, and at least one of the conjuncts shouldn’t have a contradiction for the formula to be satisfiable. The only way we get a contradiction is if both a variable and its negation are in the same conjunct (analyse the truth tables if you didn’t know that). So, the only thing you have to do with the input is to check whether within the brackets there is, say, an x and a ~x; as soon as you encounter a conjunct where there is no contradiction, the formula is satisfiable and you print YES, and if there is no such conjunct, you print NO. That’s all. So, when given “(a)|(~a&b&a&c)”, you know upon processing the first conjunct “(a)” that the answer is YES, because “(a)” is trivially not contradictory and thus we can ignore the “(~a&b&a&c)” that does have a contradiction (it doesn’t matter anymore, because we have found one already that doesn’t).

I’ll leave the implementation as an exercise to the reader  :).
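For those who want to compare notes after trying it themselves, here is a minimal sketch of the check described above. The function name is_satisfiable is my own choice, and reading the judge's input and printing YES/NO are left out, so treat it as an illustration of the idea rather than a ready-to-submit solution.

```python
def is_satisfiable(formula: str) -> bool:
    """Check a formula of the form (l1&l2&...)|(...)|... for satisfiability.

    The formula is satisfiable as soon as one conjunct contains no
    variable together with its own negation.
    """
    for clause in formula.split("|"):
        # Drop surrounding whitespace and the brackets around the conjunct.
        literals = clause.strip().strip("()").split("&")
        positive = set()  # variables that occur un-negated
        negative = set()  # variables that occur negated
        for literal in literals:
            literal = literal.strip()
            if literal.startswith("~"):
                negative.add(literal[1:])
            else:
                positive.add(literal)
        if not (positive & negative):
            return True  # this conjunct has no contradiction
    return False  # every conjunct contains some x and ~x


if __name__ == "__main__":
    for f in ["(a&b&c)|(a&b)|(a)", "(x&~x)", "(a)|(~a&b&a&c)"]:
        print(f, "YES" if is_satisfiable(f) else "NO")
```

The early return mirrors the observation above: once one conjunct without a contradiction has been found, the remaining conjuncts no longer matter.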

Reblogging 2014: Coupon collecting the computing way

From the “10 years of keetblog – reblogging: 2014”: The 2014 post closest to ‘general interest’ is about calculating how much you will be ripped off when collecting team player cards to complete a Panini sticker book collection for some world sports championships, without swapping cards with friends and family. It coincided with the Soccer/football World Cup in Brazil in 2014. The students of the ICPC Southern Africa Regionals training had some fun with it (and so did I when setting the problem). It may be of interest to students who are now preparing for the IT challenge heats (April 16) or the ICPC world finals (in May; we’ll go again with a UCT team [yay!], to Thailand this time).

Coupon collecting the computing way; Aug 3

———–

Coupon collecting is a very Dutch thing to do, though I never made a serious hobby out of it (nevertheless, I still have a great Brio Koekjesboek thanks to that), but I did collect stamps for a while, which was more interesting than cutting slips off of margarine wrappings. What does any of this have to do with computing, or math, for that matter? A lot. I mean, think of it: how much margarine must we have bought just to have enough slips to order the Brio cookie-baking booklet ‘for free’? Same story for the coffee packet wrappings. Post stamp collecting is harder: you’d want the whole series of a given edition. The Italian company Panini made a business out of it, enticing people to collect all stickers of all team members playing in a world cup. And that’s what got me into this post’s topic.

Coaching for the next ACM ICPC, which includes training sessions, made me surf on the web for some interesting problems to solve, so as not to have only previous ICPC regional’s and finals problems to train the students with. Simon Whitehouse has a great blog post on what it would cost to complete the whole Panini sticker book for the 2010 Soccer World Cup in South Africa, without swapping cards with friends, i.e.: how many packets of five stickers would you need to buy to get the whole series of 638 stickers (pictures of soccer players) to put in the sticker book? Answering this question sounded like fun. I reworked a bit the problem description from his post so as to generalize it to finding a way to be able to calculate what it would cost for any world cup—rugby and cricket are important in South Africa, too—and any cost of a packet of stickers (there’s some 6% inflation/year here); read the full problem description (pdf), on the first page.

In solving this, first, there are three variables: N for the number of unique stickers, P for the price of a packet of 5 stickers, and C for the total cost we want to know. To calculate C, we thus have \frac{total\_no\_of\_stickers}{5}*P , and we’ll round it up to the nearest integer. The crux is how to get to the total number of stickers.

Whitehouse’s post has a very readable explanation. In short, when you get the first sticker, it is guaranteed to be a new one; the second sticker has a \frac{637}{638} chance of being new, so you need \frac{638}{637} stickers on average to get a second new one, and so on up to the last missing sticker, which has a \frac{1}{638} chance of turning up and thus takes \frac{638}{1} stickers on average. This follows from some basic notions of probabilities, which you can/will/have come across in a statistics intro course. Generalising this to the arbitrary number of N cards, we obtain

\frac{N}{N} + \frac{N}{N-1} + \frac{N}{N-2} + \ldots + \frac{N}{2} + \frac{N}{1}

to calculate the total amount of stickers you need to buy to have the N ones complete. This is as much as you really need from a computational viewpoint. Here’s a simple python code snippet that gets the job done:

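def panini(n):
    """Expected number of stickers to buy to collect all n unique ones."""
    tns = 0  # total number of stickers
    for i in range(1, n + 1):  # up to and including n, so the N/N term is counted
        tns = n / i + tns
    return tns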
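To get from the number of stickers to the cost C, a small sketch along the lines of the formula given earlier could look as follows. The names expected_stickers and total_cost, and the packet_size parameter, are mine and not from the problem description; rounding up the cost follows the description above.

```python
import math


def expected_stickers(n):
    """Sum N/N + N/(N-1) + ... + N/1: expected stickers needed for n unique ones."""
    return sum(n / i for i in range(1, n + 1))


def total_cost(n, price_per_packet, packet_size=5):
    """Cost C = total_no_of_stickers / packet_size * P, rounded up to an integer."""
    return math.ceil(expected_stickers(n) / packet_size * price_per_packet)


if __name__ == "__main__":
    # The 2010 book: 638 unique stickers, packets of 5, price per packet in Rand.
    print(total_cost(638, 10))
```

One could also round up the number of packets first and only then multiply by the price, which gives a slightly higher figure; which of the two is wanted depends on how the problem statement words it.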

But why keep it simple when one can complicate matters…

This problem is an instance of the Coupon Collector’s Problem (CCP). The above formula is N times a harmonic series, and with some math on the CCP page, and the Euler-Mascheroni constant {\sf \gamma} (from number theory, with lots and lots of mathematics), one somehow obtains that the above-mentioned series is n*H_n , with H_n the n-th harmonic number, and the whole thing equalling also

n \mbox{log} n + \gamma n + \frac{1}{2} + o(1) \mbox{ as } n \rightarrow \infty

according to the Wikipedia entry; there is a lot more online about it, e.g., here [course-level] and here [research], among many resources. If that’s not enough, \gamma \approx 0.5772156649 , with the decimal digits computed now to over 119 billion decimal digits (it is a major question in mathematics whether it is an irrational number). Somewhere in the whole gamut of formulae on Wikipedia and Whitehouse’s clean but unexplained jump (main text and a comment further down on that page), it boils down to, roughly,

total\_no\_of\_stickers = N*\mbox{ln}(N) + \gamma N

The latter is easy to plug into a spreadsheet to obtain the answer. But lo and behold, what’s computed with the math-approach and natural log depends on what you plug in for \gamma , i.e., how many decimal digits, and only \gamma or the whole of Eq.2. The series with the simple algorithm does not have that problem with the approximations. And you don’t have to do all the math. I didn’t exactly record the time it took to create the spreadsheet versus typing up the simple algorithm, but the latter may even have been faster to do.
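To see how close the two routes get, here is a small sketch that puts the series next to the asymptotic expression n ln n + γn + 1/2 from the Wikipedia formula above. The constant is truncated to the ten decimals quoted in the text, and the function names are mine.

```python
import math

EULER_GAMMA = 0.5772156649  # truncated, as quoted in the text


def exact(n):
    """The series N/N + N/(N-1) + ... + N/1, summed term by term."""
    return sum(n / i for i in range(1, n + 1))


def approximation(n, gamma=EULER_GAMMA):
    """Asymptotic estimate n*ln(n) + gamma*n + 1/2 (the o(1) term is dropped)."""
    return n * math.log(n) + gamma * n + 0.5


if __name__ == "__main__":
    for n in (400, 638):
        print(n, exact(n), approximation(n))
```

With this many digits of γ the two agree to well within one sticker for these sizes; chopping γ down to a couple of decimals, or leaving off the γn term altogether, is where the answers start to drift apart.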

Besides the observation that the computing way made it simpler to solve the problem with respect to the design, there’s still a remark to be made on computing the total cost. With R10 per packet and the soccer world cup sticker book, you’ll end up paying R8977 to complete the soccer world cup book if you’d do it all by yourself! For many a South African, that’s more than the monthly salary. Completing a 400-sticker world cup for R35/packet is going to cost you R18389 (about €1268 with the current exchange rate). You’d be a lot better off swapping doubles with family and friends rather than buying new packets. Then again, most people probably won’t calculate how much money they’d be spending on collecting things, so, here’s a basis for a business model for you.

SA ICPC Regionals 2013 problem analysis

Our 2015 Southern Africa ICPC Regionals is nearby, and we have been using some of the 2013 SA problems for training purposes as well as a teaser/taste of what’s to come on the 24th of October (registration closes on Oct 10). While the training materials are on vula (the UCT CMS for courses), some hints to solve some of them may be of general interest. I’ll give a breakdown and a ‘spoiler alert’ for five of the eight problems. The problem-solving aspects and explanations in the training sessions were longer, but these short notes will give you some useful starting points where to look for implementation details already anyway.

The problems can be categorised into the following types:

  1. Isle of the birds – computational geometry
  2. Fitness training – simple ad hoc
  3. Similarity – String processing
  4. Railways – Graphs
  5. Student IDs – String processing

 

Isle of the birds

There’s an island with trees, and the rubber band will enclose them all. That is, we need to find the polygon whose corners are the outermost points and that encloses the rest of the points. Thus, we need to compute a convex hull. How can that be done, and, more importantly, how can that be done efficiently? Brute-forcing the whole solution space is going to take too much time, as there can be between 3 and 15000 points. One technique is the sweepline (generally useful to check out), and one of those tailored to finding the convex hull is the Graham Scan algorithm: first, starting with the lowest-leftmost point, order the other points counter-clockwise by their angle around that starting point (of the points on the same line from the starting point, only the farthest matters); then, second, connect the points in a stepwise fashion in that counter-clockwise order: if the last three points make an angle of more than 180 degrees, i.e., a clockwise turn (which you can check by comparing the values of the coordinates), then discard the middle one of those three and connect the point before it directly to the newest point.

Only 4 teams solved this problem at the 2013 regionals (including the winning team ‘if cats programmed computers’).
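As an illustration of the ‘discard the middle point on a wrong turn’ step, here is a sketch that uses Andrew’s monotone chain, a close relative of the Graham Scan that sorts the points by coordinates instead of by angle. It is a swapped-in variant, not necessarily what the contest solution used, and the function names are mine.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b; > 0 means a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull corners in counter-clockwise order, collinear points dropped."""
    pts = sorted(set(points))  # sort by x, then y; remove duplicates
    if len(pts) <= 2:
        return pts

    def half_hull(sequence):
        chain = []
        for p in sequence:
            # Discard the middle point while the last three do not turn left.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain

    lower = half_hull(pts)            # left to right along the bottom
    upper = half_hull(reversed(pts))  # right to left along the top
    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


if __name__ == "__main__":
    trees = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
    print(convex_hull(trees))
```

Sorting dominates the running time, so this is O(n log n), which is comfortable for 15000 points.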

 

Fitness training

John cycles A km, Mary runs B km, starting and finishing at the same place using one circular route of M km. This can be computed with a straight-forward modulo operation. All 53 teams solved this problem at the 2013 regionals.

 

Similarity

Spellchecking in the online search engine; well, given two words, what is the minimum cost of the change operation to go from word_A to word_B, given certain costs of additions, deletions, and character swaps? Comparing strings of characters has been around for quite a while, from spellchecking, to plagiarism checkers, to DNA sequence alignments, so surely a fine algorithm should be around for that already. Indeed: the minimum edit distance (Levenshtein distance) (nice explanation), where, instead of computing all possible options (very costly!), you fill in a table with the cheapest way to get from each prefix of word_A to each prefix of word_B. The ‘tricky’ part is that the basic algorithm for the minimum edit distance counts each change as a cost of 1, whereas in this problem, some changes cost 2; hence, you will have to change those values in the standard algorithm (demo that lets you play with different costs).

Only 2 teams solved this problem at the 2013 regionals (including the winning team ‘if cats programmed computers’).
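A sketch of the table-filling with adjustable costs is given below. The default costs (1 for an addition or deletion, 2 for swapping one character for another) are placeholders of mine; the actual values have to come from the problem statement.

```python
def edit_distance(word_a, word_b, ins_cost=1, del_cost=1, sub_cost=2):
    """Minimum total cost to turn word_a into word_b.

    Only the previous row of the table is kept, since each cell depends
    on its left, upper, and upper-left neighbours only.
    """
    # Row 0: build each prefix of word_b from the empty string by insertions.
    prev = [j * ins_cost for j in range(len(word_b) + 1)]
    for i in range(1, len(word_a) + 1):
        cur = [i * del_cost]  # delete the first i characters of word_a
        for j in range(1, len(word_b) + 1):
            same = word_a[i - 1] == word_b[j - 1]
            cur.append(min(
                prev[j] + del_cost,                      # delete word_a[i-1]
                cur[j - 1] + ins_cost,                   # insert word_b[j-1]
                prev[j - 1] + (0 if same else sub_cost)  # keep or swap
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))           # with swaps costing 2
    print(edit_distance("kitten", "sitting", 1, 1, 1))  # classic unit costs
```

Keeping the costs as parameters makes it easy to check the implementation against the standard unit-cost version first and only then plug in the contest's costs.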

 

Railways

Construct a railroad network between cities in the shape of a tree, but put in a bid for the second-cheapest option. So, we have lines and points, or: some graph algorithm. Two main groups are shortest path (Dijkstra, Bellman-Ford) and spanning tree. We need a minimum spanning tree (MST) to begin with. This reduces the options for the most suitable algorithm to Prim’s or Kruskal’s. Prim’s grows the tree from a chosen starting vertex, Kruskal’s doesn’t need one. The problem statement doesn’t give a starting vertex, hence Kruskal’s algorithm is the one of choice (example). But then how do we get the second-best spanning tree? Also in this case, many have asked before (theoretically and practically; search online for both): take an edge with weight w that’s not in the MST and that results in a cycle when added to the MST, and compare w with the weight v of the heaviest other edge in that cycle; then, of those comparisons among the cycles, take the one with the lowest difference w - v, add the edge with weight w, and remove the edge with weight v. There you have your second-best option.

Only 4 teams solved this problem at the 2013 regionals (including the winning team ‘if cats programmed computers’).
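Below is a sketch of Kruskal’s algorithm with a simple route to the second-best tree. Note that it swaps in a different, easier-to-code technique than the cycle comparison described above: it leaves out each MST edge in turn, reruns Kruskal’s, and keeps the cheapest result. That gives the same answer but is slower, so whether it fits the time limit depends on the input sizes, which I have not checked. The edge format (weight, u, v) and the function names are my own.

```python
def kruskal(n, edges, skip=None):
    """MST over vertices 0..n-1; edges is a list of (weight, u, v).

    Returns (total_weight, indices_of_used_edges), or (None, []) when the
    edges (minus the skipped one) do not connect all vertices.
    """
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, used = 0, []
    for idx in sorted(range(len(edges)), key=lambda i: edges[i][0]):
        if idx == skip:
            continue
        weight, u, v = edges[idx]
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components: no cycle
            parent[root_u] = root_v
            total += weight
            used.append(idx)
    if len(used) != n - 1:
        return None, []
    return total, used


def second_best_mst(n, edges):
    """Weight of the cheapest spanning tree that differs from the MST in an edge."""
    best, used = kruskal(n, edges)
    if best is None:
        return None
    candidates = []
    for idx in used:  # forbid one MST edge at a time
        total, _ = kruskal(n, edges, skip=idx)
        if total is not None:
            candidates.append(total)
    return min(candidates) if candidates else None


if __name__ == "__main__":
    rails = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 0, 3), (5, 0, 2)]
    print(kruskal(4, rails)[0], second_best_mst(4, rails))
```

The cycle-based approach avoids the repeated reruns by looking up the heaviest edge on the MST path between the endpoints of each non-MST edge, which is the way to go when the graph is large.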

 

Student IDs

Generate student IDs from the students’ names, following a given pattern. Of itself, this is a somewhat laborious implementation. The only real issue is to keep track of what’s been processed of the string. Here, it is especially useful to first design the solution separately before delving into the murky code, as it otherwise will require a lot of test cases to check the corner cases (and remember you have only one machine). A nice way to design it is to use automata and only then to convert that into code.

39 teams out of the 53 solved this problem at the 2013 regionals.

 

Finally

Just in case you’re trying out the remaining problems, and are banging your head against the wall or pulling your hair out: no team solved the Street lights (Problem B; looks like a maths problem, with floating point complication) and the Necklace (Problem G), and only 3 teams solved Matchstick maths (Problem D; ask a team member of ‘if cats programmed computers’, who solved it).

Exciting ICPC World Finals 2015 in Marrakech

This year, too, we participated in the 39th ACM Intercollegiate Programming Contest World Finals, held in Marrakech, Morocco; the ‘we’ being: the “I Can’t Pronounce Catachtonic” team composed of Yaseen Hamdulay, Robert Spencer, and Sean Wentzel, and me as coach, from the University of Cape Town. We were the only team from Sub-Saharan Africa, and one of 10 teams in the Africa & Middle East Region, of a total of 128 teams that participated, who were selected from 38160 contestants from 2534 universities of 101 countries on 6 continents that competed in the qualifying regionals.

One more year of studying, practice, and training the wiser, our last training session indicated we might be a contender for A&ME regional winner. (For overall winner, we’d need to have and do what the medalist teams do, such as starting training in your early teens and winning IMO and IOI, weekly training sessions, monthly local contests, week-long training camps by previous medal winners, designated labs, competitive programming courses, scholarships and whatnot that other coaches talked about regarding preparations. At this point in time, we don’t have nearly enough such resources.)

The ‘first to solve a problem’ did so in a mere 5 minutes, from opening the envelope with the problems to having submitted the right code! This was Problem A: Amalgamated Artichokes, and the honour went to Peking University, setting a new record. The UCT team did so in 14 minutes, due to being sidetracked with another problem first. Then it was like: is this the only ‘easy’ problem, and the rest as grueling as last year’s problem set, and will it come down to ‘the team who solves the second problem will win A&ME’? Soon thereafter, the UCT team solved a second problem, but then so did two other A&ME teams, upping the ante that perhaps the 3rd solved problem would be the decider. UCT was still leading, at some point even in 27th position on the overall scoreboard. I got too nervous, and went for lunch, hoping they would have solved a 3rd one upon returning to the scoreboard. And lo and behold, they had, still leading for A&ME, though overall moving down to the 50s-70s in the dynamic scoreboard. To make matters more exciting for spectators, there were 5 PCs with shared screens and video, so one can see one’s team live on webcam, and see what they are coding, every single keystroke. Nifty, imho; nail-biting for some coaches of medal-contenders.

Right before the scoreboard was frozen regarding solved problems (for the last hour of the 5-hour nonstop contest), the American University of Cairo had solved 4 problems, surpassing UCT, but at the cost of a lot of penalty time due to a few wrong submissions, so if the UCT team would solve another problem, and Cairo not, then we’d win A&ME region on time difference. I could see UCT submitted a solution, hoping it was right. Then, sitting in the spectator area, and the Cairo team sitting near that, the scoreboard updated that they had submitted a solution for their 5th problem… and then came the involuntary reaction of its team members, being a mini-cheer. And UCT did not surpass that in the last 30 minutes. Overall, this placed Cairo on 75th place in the final standing, winning the prize for the A&ME region, and UCT just below that, as second in the A&ME region on a very respectable 83, therewith also receiving congrats from other participants, coaches, and interested spectators.

The “I Can’t Pronounce Catachtonic” team from UCT


So, relatively, they did well, having solved an impressive 4 problems, being A, D, I, and J, and all correct on first submission. This placed UCT ahead of other well-known, and arguably better resourced, universities, such as Uni Illinois at Urbana-Champaign, Virginia Tech, IIIT, Uni Western Australia, Cornell, Moscow Aviation, Calgary, and Rice. That said, at the other end of the spectrum, St. Petersburg ITMO broke the record of having solved all contest problems—a first of all the 39 editions of the ICPC world finals—and first to solve problem G. Moscow State Uni came second (11 problems solved out of 13, with first to solve B and H), Uni of Tokyo came third (also 11 problems solved, with first to solve J and K), and the fourth gold medal went to Tsinghua University (10 problems solved, and first to solve C).

If you don’t feel like solving the problems yourself, but still want to know the answer to, among others, cheese slicing, shooting asteroids, tile cutting, and the qanat irrigation system, then have a look at former UCT coach Bruce Merry’s analysis of the problems and directions of the solutions.

All in all, it was a good World Finals. And the food was good, the weather good, the other events too (including a fun camel ride), meeting up with coaches and some contestants met last year, the CLI symposium brought some useful information as well, and Steven and Felix Halim generously gave me a hardcopy of their Competitive Programming 3 book. Sean won the ICPC Quest, so a 1st prize was brought back to Cape Town.

The planning for participating with a strong UCT team next year has commenced; the 2016 finals will be in Thailand.

Three CS problem-solving strategies exercise sets

The preparations for the ACM ICPC World Finals 2015 in Morocco are in full swing, and so is the training by the 125 or so 3-person teams selected from over 30000 contestants who participated in the various selection rounds across the world. While a lot comes down to “practice, practice, practice” on all the hard problems you can find, and learning more algorithms and maths, there’s also work to do on refining one’s team strategy and problem solving skills. For the latter, Steven and Felix Halim’s Competitive Programming book comes in handy (as coach at least). Instead of going through the book cover to cover and doing the problems listed in the designated problem solving paradigm categories, I turned it around, and selected problems of which the students had to figure out which problem solving paradigm was the right one (Greedy, DP, ad hoc, etc., and more detailed, such as geometry, shortest path, etc.). I’ve made three sets of increasing difficulty:

  1. ‘easy’, where most of them can be solved with one problem solving paradigm (also useful for those who are preparing for the Standard Bank IT Challenge ‘heats’ on May 16);
  2. ‘medium’, where most of them can be solved with two problem solving paradigms combined;
  3. ‘hard’, where most of the problems have appeared in a World Final.

The (direction of the) solutions are on the last page of each pdf, including the UVa Online Judge number, so you can implement and test your solution as well.

I did change some of the problem descriptions for various reasons:

  • Localization of the problem description to South Africa: among others, ‘Durban prawns’ (big fat cockroaches are endemic there), ‘nuts for nuts’ (we do have squirrels on campus), ‘shopping for operas’ (Amazon gave up delivering in SA, because so many packets were lost), and some characters got different names.
  • Changed story line: ‘charming canines’ (disagreeable storyline in original), and some now have female characters (cf. mostly male or none).
  • To make the title an alliteration, like the other titles in the problem set: ‘colliding catamarans’ and ‘cult caps’.

Happy solving 🙂

Coupon collecting the computing way

Coupon collecting is a very Dutch thing to do, though I never made a serious hobby out of it (nevertheless, I still have a great Brio Koekjesboek thanks to that), but I did collect stamps for a while, which was more interesting than cutting slips off of margarine wrappings. What does any of this have to do with computing, or math, for that matter? A lot. I mean, think of it: how much margarine must we have bought just to have enough slips to order the Brio cookie-baking booklet ‘for free’? Same story for the coffee packet wrappings. Post stamp collecting is harder: you’d want the whole series of a given edition. The Italian company Panini made a business out of it, enticing people to collect all stickers of all team members playing in a world cup. And that’s what got me into this post’s topic.

Coaching for the next ACM ICPC, which includes training sessions, made me surf on the web for some interesting problems to solve, so as not to have only previous ICPC regional’s and finals problems to train the students with. Simon Whitehouse has a great blog post on what it would cost to complete the whole Panini sticker book for the 2010 Soccer World Cup in South Africa, without swapping cards with friends, i.e.: how many packets of five stickers would you need to buy to get the whole series of 638 stickers (pictures of soccer players) to put in the sticker book? Answering this question sounded like fun. I reworked a bit the problem description from his post so as to generalize it to finding a way to be able to calculate what it would cost for any world cup—rugby and cricket are important in South Africa, too—and any cost of a packet of stickers (there’s some 6% inflation/year here); read the full problem description (pdf), on the first page.

In solving this, first, there are three variables: N for the number of unique stickers, P for the price of a packet of 5 stickers, and C for the total cost we want to know. To calculate C, we thus have \frac{total\_no\_of\_stickers}{5}*P , and we’ll round it up to the nearest integer. The crux is how to get to the total number of stickers.

Whitehouse’s post has a very readable explanation. In short, when you get the first sticker, it is guaranteed to be a new one; the second sticker has a \frac{637}{638} chance of being new, so you need \frac{638}{637} stickers on average to get a second new one, and so on up to the last missing sticker, which has a \frac{1}{638} chance of turning up and thus takes \frac{638}{1} stickers on average. This follows from some basic notions of probabilities, which you can/will/have come across in a statistics intro course. Generalising this to the arbitrary number of N cards, we obtain

\frac{N}{N} + \frac{N}{N-1} + \frac{N}{N-2} + \ldots + \frac{N}{2} + \frac{N}{1}

to calculate the total amount of stickers you need to buy to have the N ones complete. This is as much as you really need from a computational viewpoint. Here’s a simple python code snippet that gets the job done:

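def panini(n):
    """Expected number of stickers to buy to collect all n unique ones."""
    tns = 0  # total number of stickers
    for i in range(1, n + 1):  # up to and including n, so the N/N term is counted
        tns = n / i + tns
    return tns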

But why keep it simple when one can complicate matters…

This problem is an instance of the Coupon Collector’s Problem (CCP). The above formula is N times a harmonic series, and with some math on the CCP page, and the Euler-Mascheroni constant {\sf \gamma} (from number theory, with lots and lots of mathematics), one somehow obtains that the above-mentioned series is n*H_n , with H_n the n-th harmonic number, and the whole thing equalling also

n \mbox{log} n + \gamma n + \frac{1}{2} + o(1) \mbox{ as } n \rightarrow \infty

according to the Wikipedia entry; there is a lot more online about it, e.g., here [course-level] and here [research], among many resources. If that’s not enough, \gamma \approx 0.5772156649 , with the decimal digits computed now to over 119 billion decimal digits (it is a major question in mathematics whether it is an irrational number). Somewhere in the whole gamut of formulas on Wikipedia and Whitehouse’s clean but unexplained jump (main text and a comment further down on that page), it boils down to, roughly,

total\_no\_of\_stickers = N*(\mbox{ln}(N) + \gamma)

The latter is easy to plug into a spreadsheet to obtain the answer. But lo and behold, what’s computed with the math-approach and natural log depends on what you plug in for \gamma , i.e., how many decimal digits, and whether you use only the \gamma term or the whole asymptotic expression above. The series with the simple algorithm does not have that problem with the approximations. And you don’t have to do all the math. I didn’t exactly record the time it took to create the spreadsheet versus typing up the simple algorithm, but the latter may even have been faster to do.
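To see how much the choice of formula matters, here is a small sketch that puts the exact sum next to two closed-form approximations. It assumes the short form is N(ln N + \gamma) and the long form is the asymptotic expression with the extra 1/2; the function names and the 10-digit value of \gamma are only there for illustration.

```python
import math

GAMMA = 0.5772156649  # Euler-Mascheroni constant, truncated to the digits quoted above


def exact(n):
    """Exact expected number of stickers: n/1 + n/2 + ... + n/n."""
    return sum(n / i for i in range(1, n + 1))


def approx_short(n):
    """Short closed form (assumed): n * (ln n + gamma)."""
    return n * (math.log(n) + GAMMA)


def approx_full(n):
    """Asymptotic expression including the 1/2 term."""
    return n * math.log(n) + GAMMA * n + 0.5


for n in (400, 638):
    print(n, round(exact(n), 2), round(approx_short(n), 2), round(approx_full(n), 2))
```

The short form ends up about half a sticker below the exact sum, which is small but can matter once the cost is rounded up.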

Besides the observation that the computing way made it simpler to solve the problem with respect to the design, there’s still a remark to be made on computing the total cost. With R10 per packet and the soccer world cup sticker book, you’ll end up paying R8979 to complete the soccer world cup book if you’d do it all by yourself! For many a South African, that’s more than the monthly salary. Completing a 400-sticker world cup for R35/packet is going to cost you R18396 (about €1268 with the current exchange rate). You’d be a lot better off swapping doubles with family and friends rather than buying new packets. Then again, most people probably won’t calculate how much money they’d be spending on collecting things, so, here’s a basis for a business model for you.
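For completeness, the cost calculation itself can be sketched as follows. This assumes the sum runs over all N terms and that the final amount is rounded up, as described at the start; total_cost and its per_packet parameter are illustrative names, not something from the problem description.

```python
import math


def expected_stickers(n):
    """Expected number of stickers to buy to complete a book of n stickers."""
    return sum(n / i for i in range(1, n + 1))


def total_cost(n, price, per_packet=5):
    """Cost C = (expected stickers / stickers per packet) * price, rounded up."""
    return math.ceil(expected_stickers(n) / per_packet * price)


print(total_cost(638, 10))  # soccer world cup book at R10 per packet
print(total_cost(400, 35))  # a 400-sticker book at R35 per packet
```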

ACM ICPC 2014 solution to problem A – baggage

Some of you already know I was on-site coach for the UCT team at the ACM International Collegiate Programming Contest World Finals in Ekaterinburg, held from 22 to 26 June, 2014. The problems were unusually hard this year, and solving 4 out of the 10 problems already got teams into the bronze medal range [results]. The technical coach of the UCT team, Bruce Merry, has written an analysis of 6 problems on his blog (D, K, C, E, B, and F–update: now also I and G, J, L, and H), and over at TopCoder, SnapDragon discusses 4 of the 10 problems (A, E, F, G), except that SnapDragon does not provide a solution to A (and I don’t like his hint of brute-force code-and-try) [update 3-7: there’s a description of his solution here now]. Googling, I couldn’t find anyone else discussing the solution to problem A, so here’s mine, which I solved on the plane from Moscow to Frankfurt (among other activities, and one among the four flights we had to take to arrive back in Cape Town).

The problem

Stripping the “baggage problem” to its essentials: you have a sequence of alternating Bs and As, which has to be reordered to first all the As and then all the Bs. Each move takes two adjacent letters at the same time and puts them, still adjacent and in the same order, into two adjacent empty cells. For instance, with n = 4, we have 2n characters, BABABABA, and after sorting it will be AAAABBBB, and this has to happen in the minimum number of moves. The value of n is given as input, and is an integer between 3 and 100. The cells on your tape are numbered from -2n+1 to 2n, so with our n = 4, from -7 to 8; the first B is on position 1, and the cells from -2n+1 to 0 start out empty.

Update (3-7): This problem appears to be “Tait’s counter puzzle”, which Peter Guthrie Tait, a Scottish mathematical physicist, described in 1884 [see description] (thanks to Davi Duarte who mentioned it in the topcoder thread).

Solution

Informally, there are two parts to the algorithm: first move the As and Bs around to pair them up as AA and BB, which costs n/2 moves if n is even and n/2 rounded up if n is odd, and then sort those pairs in the remaining moves, for a total of n moves. The first part alternates between moving an AB from the right to the left, starting at the last AB (position 2n-2) and then taking the one 4 positions further to the left each time, and moving a BA from the left to the right, starting from position 3 and then taking the one 4 positions further to the right each time (i.e., 7, 11, etc.). In the second part, the BBs are ferried to the right, working from left to right, and the AAs to the left, working from right to left, also alternating. N=3 is an exception.

One can do this with a neat mathematical proof, but I found it through pattern creation and recognition, visualizing the provided sample input and moves for n = 5 and n = 8, which can be done in 5 and 8 steps, respectively (given in the problem statement as sample output). Here’s a description of that approach.

First, devising one for n = 3 is trivial, using the following moves (the only option), where each move is identified by the position of the first letter of the pair being moved:

Start: . . . . . . BABABA
Move 2 to -1, which gives . . . . ABB . . ABA
Move 5 to 2, which gives . . . . ABBBAA . .
Move 3 to -3, which gives . . AAABBB . . . .
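Such move lists are easy to get wrong by hand, so here is a minimal sketch of a checker that replays them on the tape. It assumes the cell numbering from the problem description (cells -2n+1 to 2n, letters starting at cell 1); apply_moves and its show parameter are illustrative names.

```python
def apply_moves(n, moves, show=False):
    """Replay (source, destination) moves and return the final tape as a string."""
    offset = 2 * n - 1                       # cell -2n+1 is stored at list index 0
    tape = ["."] * (2 * n) + list("BA" * n)  # empty cells first, then BABA... from cell 1
    for src, dst in moves:
        s, d = src + offset, dst + offset
        if min(s, d) < 0 or max(s, d) + 2 > len(tape):
            raise ValueError(f"move {src} to {dst} leaves the tape")
        # The source must hold two letters and the destination two empty cells.
        if "." in tape[s:s + 2] or tape[d:d + 2] != [".", "."]:
            raise ValueError(f"illegal move {src} to {dst}")
        tape[d:d + 2] = tape[s:s + 2]
        tape[s:s + 2] = [".", "."]
        if show:
            print(f"Move {src} to {dst}: {''.join(tape)}")
    return "".join(tape)


final = apply_moves(3, [(2, -1), (5, 2), (3, -3)], show=True)
```

Replaying the three moves for n = 3 this way reproduces the tapes listed above, without the spaces.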

This already suggests that the minimum number of moves will always be n, because it is n moves also for n = 5 and n = 8. Next, devising one for n = 4, and knowing the moves for n = 3 and n = 5, I tried three candidate first moves on BABABABA: the first AB (position 2), like with n = 3; the AB after that (position 4); and the last AB (position 6), like with the n = 5 case. The last one is the only one that worked in 4 moves, starting with moving 6 to -1. Again to -1, because for all n so far, the first move is to -1.

Let’s put this to the test with n = 7, with the empty cells indicated with dots and the pair to be moved next described under each line:

..BABABABABABABA

i.e., move the last AB (thus, position 12) as this worked for n=4 and n=5, then

ABBABABABABAB..A

i.e., move the first BA after position 1 (thus, position 3)

ABBA..BABABABBAA

i.e., move the last AB before a pair (thus, position 8)

ABBAABBAB..ABBAA

i.e., move the first BA after position 1 (thus, position 5). Then sorting the BBs and AAs:

ABBAAB..BBAABBAA
A..AABBBBBAABBAA
AAAAABBBBB..BBAA
AAAAABBBBBBB..AA
AAAAAAABBBBBBB..

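This was enough for me to have discovered the pattern of alternating for the matching and alternating for the sorting. Counting the lines, though, the n = 7 trace above takes eight steps, and its last step is not a single legal move, so for n = 7 the pattern needs adjusting to stay within seven moves; one sequence that does work is 8 to -1, 5 to 8, 12 to 5, 3 to 12, 9 to 3, 0 to 9, 13 to 0.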

What is ‘nasty’ in the problem description, retrospectively and for the pattern-based approach vs. a neat maths-y proof at least, is that the pattern deviates for n = 3, and the provided sample output for n = 8, although in 8 steps, is an alternative solution, not one that one obtains with the algorithm. With the approach as mentioned above, we have for n = 8 the following (I did that one to double-check):

..BABABABABABABABA
ABBABABABABABAB..A
ABBA..BABABABABBAA
ABBAABBABAB..ABBAA
ABBAABBA..BBAABBAA
A..AABBABBBBAABBAA
AAAAABBABBBB..BBAA
AAAAA..ABBBBBBBBAA
AAAAAAAABBBBBBBB..

This confirmed it works. It may look a bit craft-y, but the patterns show beautifully with larger n, which are shown below for the even and odd case (made just for the blog post, not in solving it); squint your eyes if it’s not immediately clear.

N = 16

..BABABABABABABABABABABABABABABABA
ABBABABABABABABABABABABABABABAB..A
ABBA..BABABABABABABABABABABABABBAA
ABBAABBABABABABABABABABABAB..ABBAA
ABBAABBA..BABABABABABABABABBAABBAA
ABBAABBAABBABABABABABAB..ABBAABBAA
ABBAABBAABBA..BABABABABBAABBAABBAA
ABBAABBAABBAABBABAB..ABBAABBAABBAA
ABBAABBAABBAABBA..BBAABBAABBAABBAA
A..AABBAABBAABBABBBBAABBAABBAABBAA
AAAAABBAABBAABBABBBB..BBAABBAABBAA
AAAAA..AABBAABBABBBBBBBBAABBAABBAA
AAAAAAAAABBAABBABBBBBBBB..BBAABBAA
AAAAAAAAA..AABBABBBBBBBBBBBBAABBAA
AAAAAAAAAAAAABBABBBBBBBBBBBB..BBAA
AAAAAAAAAAAAA..ABBBBBBBBBBBBBBBBAA
AAAAAAAAAAAAAAAABBBBBBBBBBBBBBBB..

N = 17

..BABABABABABABABABABABABABABABABABA
ABBABABABABABABABABABABABABABABAB..A
ABBA..BABABABABABABABABABABABABABBAA
ABBAABBABABABABABABABABABABAB..ABBAA
ABBAABBA..BABABABABABABABABABBAABBAA
ABBAABBAABBABABABABABABAB..ABBAABBAA
ABBAABBAABBA..BABABABABABBAABBAABBAA
ABBAABBAABBAABBABABAB..ABBAABBAABBAA
ABBAABBAABBAABBA..BABBAABBAABBAABBAA
ABBAABBAABBAABBAABB..BAABBAABBAABBAA
A..AABBAABBAABBAABBBBBAABBAABBAABBAA
AAAAABBAABBAABBAABBBBB..BBAABBAABBAA
AAAAA..AABBAABBAABBBBBBBBBAABBAABBAA
AAAAAAAAABBAABBAABBBBBBBBB..BBAABBAA
AAAAAAAAA..AABBAABBBBBBBBBBBBBAABBAA
AAAAAAAAAAAAABBAABBBBBBBBBBBBB..BBAA
AAAAAAAAAAAAA..AABBBBBBBBBBBBBBBBBAA
AAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBB..

Neat. Writing the code is left as an exercise to the reader.
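For readers who would rather check an answer than do the exercise, here is a minimal sketch. Note that it does not implement the alternating pattern described above. Instead it assumes a recursive construction: hard-coded move lists for n = 3 to 7, and for n of 8 and up the first two moves of the pattern (2n-2 to -1, then 3 to 2n-2), followed by the solution for the inner n-4 case shifted by 4 cells, and finally 0 to 2n-5 and 2n-1 to 0. The names baggage_moves and replay are hypothetical.

```python
# Hand-checked move lists (source, destination) for the small cases.
BASE = {
    3: [(2, -1), (5, 2), (3, -3)],
    4: [(6, -1), (3, 6), (0, 3), (7, 0)],
    5: [(8, -1), (3, 8), (6, 3), (0, 6), (9, 0)],
    6: [(10, -1), (7, 10), (2, 7), (6, 2), (0, 6), (11, 0)],
    7: [(8, -1), (5, 8), (12, 5), (3, 12), (9, 3), (0, 9), (13, 0)],
}


def baggage_moves(n, shift=0):
    """Return n moves that sort BABA...BA (n pairs), with all cells shifted by `shift`."""
    if n in BASE:
        return [(s + shift, d + shift) for s, d in BASE[n]]
    # Pair up the two ends, which leaves an n-4 instance starting 4 cells to the right.
    first = [(2 * n - 2 + shift, -1 + shift), (3 + shift, 2 * n - 2 + shift)]
    inner = baggage_moves(n - 4, shift + 4)
    # Ferry the leftover BB to the right and the leftover AA to the left.
    last = [(shift, 2 * n - 5 + shift), (2 * n - 1 + shift, shift)]
    return first + inner + last


def replay(n, moves):
    """Replay the moves, checking legality; return the letters without empty cells."""
    offset = 2 * n - 1                       # cell -2n+1 is stored at list index 0
    tape = ["."] * (2 * n) + list("BA" * n)
    for src, dst in moves:
        s, d = src + offset, dst + offset
        assert "." not in tape[s:s + 2] and tape[d:d + 2] == [".", "."]
        tape[d:d + 2] = tape[s:s + 2]
        tape[s:s + 2] = [".", "."]
    return "".join(tape).strip(".")


for n in (4, 8, 16, 17):
    moves = baggage_moves(n)
    print(n, len(moves), replay(n, moves))
```

The replay step is there on purpose: it fails loudly if any move picks up an empty cell or lands on an occupied one, so a wrong base case does not go unnoticed.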

I’m looking forward to the other problems (except the crane balancing problem C, whose description I did not like, at all), the upcoming regionals and next year’s ICPC World Finals in Morocco (if not winning, then to have at least a great time like we had this time)!