Sunday, March 25, 2007

On Slack: One Reason Why Software Maintenance is Hard

Software maintenance tends to be hard in part just because software tends to be hard. Most of our effort is spent on quite complicated software, running to hundreds or thousands of pages in printed form. And software tends to be fragile: for example, any small typographical error anywhere in all those pages tends to invalidate the program, or just to subtly corrupt it, so it behaves incorrectly occasionally.

However, I wish here to remark on software maintenance in particular: the problem of taking old software and making it better, or adapting it to a new purpose. More specifically, I want to talk about adapting old software to a new purpose, as opposed to finding and fixing lingering mistakes from when the software was built for its original purpose.

I have spent a lot of my mental effort over the past few years on "incremental algorithms." Such algorithms efficiently maintain a solution for a problem as the problem changes. For example, an ordinary sorting algorithm, when given input [101,8,4,11,14,88,89,16,12], might return [4,8,11,12,14,16,88,89,101]. An incremental algorithm allows an operation like "now insert 55 in that sequence," and efficiently returns [4,8,11,12,14,16,55,88,89,101] in that case.

Of course, even without any special incremental algorithm, you can just use an ordinary sort algorithm to find that answer: insert 55 into the input sequence, then restart the sort algorithm from scratch to build a new output sequence. However, for enormous input data sets, an incremental algorithm can be enormously more efficient. For example, if you maintain a telephone directory covering much of the USA, containing perhaps a hundred million phone numbers, then re-running an ordinary (non-incremental) sort operation from scratch will tend to require at least checking the order of a hundred million pairs of numbers. (More or less, the algorithm must at least re-check the order as it runs.) An incremental algorithm needs fewer than one hundred comparisons to insert a new element into an existing ordered set of one hundred million elements.
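
For the programmers in the audience, here is a minimal sketch in Python of the contrast, using the toy data from the example above; the standard bisect module is just one off-the-shelf way to do the incremental insertion, not anything specific to the telephone-directory problem.

    import bisect

    data = [101, 8, 4, 11, 14, 88, 89, 16, 12]

    # Non-incremental: rebuild the whole ordered sequence from scratch.
    ordered = sorted(data)          # [4, 8, 11, 12, 14, 16, 88, 89, 101]

    # Incremental: maintain the existing ordered sequence as the problem changes.
    # Finding the insertion point is a binary search, so for n elements it takes
    # about log2(n) comparisons: under thirty even for a hundred million entries.
    bisect.insort(ordered, 55)      # [4, 8, 11, 12, 14, 16, 55, 88, 89, 101]
    print(ordered)

(One caveat, which conveniently leads into the next point: bisect.insort finds the insertion point cheaply, but the underlying array still has to slide everything after that point over by one slot to make room, which is exactly the kind of cost that slack is meant to reduce.)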

One thing I have noticed, which I'm sure others have noticed and even perhaps formally analyzed, is that incremental algorithms need to have some slack in their representation of the problem. Consider the incremental sorting algorithm again, but this time not for telephone numbers but for library books in card catalog order. Any library will have some gaps on the shelves, quite intentionally, for libraries face the problem of inserting new books into the ordering. When a new book is delivered, a librarian wants to walk to the appropriate location, slide aside a few dozen books, and insert the new book. A library could economize on space by eliminating all the gaps. However, then inserting a new book in order could require quite a lot of work, sliding about half the library's books aside.
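
To make the shelf-gap idea concrete, here is a hypothetical little sketch in Python (the names insert_book and SHELF_CAPACITY are mine, not anything from a real library system): books live on fixed-capacity shelves which are deliberately left partly empty, so an insertion moves at most one shelf's worth of books instead of roughly half the library.

    SHELF_CAPACITY = 100   # each shelf holds at most 100 books

    def insert_book(shelves, title):
        """Insert title in order, splitting a shelf when its slack runs out."""
        # Assumes at least one shelf to start with.
        for i, shelf in enumerate(shelves):
            last_shelf = (i == len(shelves) - 1)
            if not shelf or title <= shelf[-1] or last_shelf:
                # Only the books on this one shelf get shifted aside.
                pos = next((j for j, t in enumerate(shelf) if t > title), len(shelf))
                shelf.insert(pos, title)
                if len(shelf) > SHELF_CAPACITY:
                    # Slack exhausted: split into two half-full shelves,
                    # restoring the gaps (the librarian adds a new shelf).
                    half = len(shelf) // 2
                    shelves[i:i + 1] = [shelf[:half], shelf[half:]]
                return

    shelves = [["Austen", "Borges"], ["Dickens", "Eco"]]
    insert_book(shelves, "Calvino")
    print(shelves)   # [['Austen', 'Borges'], ['Calvino', 'Dickens', 'Eco']]

This is essentially the same trade the librarian makes: the empty slots waste some space all the time so that the common anticipated change, inserting one book, stays cheap.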

Similar things happen in the file systems inside your computer. On a computer's hard disk, where new files may be created, and old files may be expanded or deleted, usually quite a lot of space (20% or more) is reserved for such slack. This is not just a property of the way computers read files, for such slack does not usually exist on something like a CD-ROM, which will never be changed. It is instead a property of "who knows what the future will bring?"

It is an interesting challenge finding the appropriate form of slack for a given set of anticipated changes. First, though, note that the appropriate slack really does depend on the anticipated changes. To go back to the library example, the gaps in the shelf aren't just common sense, agnostic about the kinds of anticipated changes. Instead, they are more nearly slack specifically designed to cope with the anticipated problem of inserting a new book. Consider an unanticipated challenge, when Tom Gander seizes power, decrees that T and G are now the first two letters of the alphabet, and starts exterminating librarians who don't comply. At that point, the gaps on the shelves don't help; if you are going to comply efficiently, what you need is wheels on the bookcases. (Or perhaps you should think outside the box and maintain an arsenal or a cross-border escape tunnel in the library basement.)

I've been thinking about the problem especially in the last few weeks because I finally succeeded in finding a promising form of slack for a problem I've been working on, after thinking about it (not full time, but off and on) for only a few years. It's a slightly tricky problem, related to the one solved here (where given any directed graph, one must return a minimal directed graph which is congruent to it). But compared to the problem solved by programmers (where given any problem specification, one must return a piece of software which solves the problem) it is hardly tricky at all. Thus, since it was hard for me to find a suitable form of slack to cope with one anticipated set of changes in my slightly-tricky problem of directed graphs, it now makes a lot of sense to me that it can be really hard to find suitable forms of slack to cope with anticipated small changes to programming problems, so that it's easy to make a small change to the program to handle the changed problem. (That's the main point of this post; any programmers reading this can stop now.)

Besides that, for any non-programmers reading this post, I'll illustrate an extra point, using the magnificent foundation of examples that I have constructed: unanticipated changes in software can be surprisingly hard. (You programmers already knew this; why didn't you go home?) If you don't know anything about the inside of a library, it might not be clear that changing T and G to the first two letters of the alphabet is harder than shelving a dozen boxes of new books. After all, the T-and-G change is so simple that it can be described in a single sentence, while the dozen-boxes problem requires many pages to describe. (I.e., the receipt listing the titles of all the new books is many pages long.) However, in an ordinary library, the T-and-G change would probably be a lot more work. So if you are ever tempted to speculate how easily a software system can be changed to accommodate a simple change, remember that the answer is "it depends." By that I don't only mean that it depends on whether the programmers are lazy morons (though it does :-), but that it depends on design choices inside the software system, and those design choices depend in part on whether the change was anticipated around the time of the original design. (This is related to this; not quite the same thing, but said better than I could say it, so it deserves a nod nonetheless.)

For example, the SBCL project (a free compiler that I set up years ago, and whose maintenance I still formally head) experiences chronic headaches because its design (made over twenty years ago in the CMU CL system that I copied for the original SBCL code) calls for most of its code to be stored at a fixed address in the computer's memory. Twenty years is a long time in the computer world. In the kinds of computers that SBCL usually runs on (general-purpose computers running operating systems like Linux and Microsoft Windows, not special-purpose computers like the ones which control antilock brake systems), it is now pretty abnormal for application software to expect to be stored at a fixed address in memory. Thus, this design now causes various kinds of friction. However, it looks like it would be a lot of work to rework the system to use an approach which is less surprising to modern operating systems. It's not an impossible amount of work, but it's enough that I find it unsurprising that despite all the headaches no one has done it yet.

Sunday, March 18, 2007

Google vs. Intellectuals

It's annoying to me to find smart, well-educated people arguing things which are inconsistent with what a non-specialist can find with an hour of searching Google — and not addressing the inconsistency.

Now, I appreciate that there can be rarefied truths which are hard for a layman to check, or even to understand. After all, I did my Ph.D. work on simulating quantum mechanical systems in which the primary difficulty was the Pauli exclusion principle, and I don't know of anyone who has much success discussing such stuff with nonspecialists. My point, though, is that even rarefied truths should correspond to simple reality. My specialist experience doesn't much tempt me to waffle on that, but instead tends to make me a hard-liner.

For example, en route to my Ph.D. I was in an undergraduate quantum mechanics class. I questioned the formula which had just been derived for reflection of objects at a boundary where potential energy changes. I pointed out that the formula seemed to say that an ordinary macroscopic ball — such as a tennis ball — would have a high probability of bouncing back when one attempted to roll it over the edge of a table. (There is a standard way that quantum mechanical effects tend to fade away for large objects, but in this formula they didn't fade away that way.) How could that be right, since that doesn't happen in real life? The professor didn't respond that the question was uninteresting because it was so mundane and un-quantum-mechanical. Instead, he addressed it very seriously, as essentially any physicist would. (And ultimately we reached an answer which didn't change the formula, but which made it explicit that the formula applied only when the boundary was sharp compared to the quantum wavelength of the reflected object, in a way that macroscopic boundaries never are in practice.)

Life is short, and I wouldn't want to require intellectuals to be compulsive about descending to check every mundane implication of their ideas. Thus, I have no problem with a physics professor lecturing on a particular reflection formula without having thought out how it applies in the limit of tennis balls rolling off tables. But I do expect intellectuals to take such mundane questions seriously, and I do expect them either to be able to handle such questions (as in the reflection case, where the observed behavior still followed from ordinary quantum mechanics once more details of the problem were included) or to revise their position.

I also had the interesting experience of testing out of an inorganic chemistry requirement for graduate school by taking a multiple choice test. My undergraduate background was a biology degree with some extra physics courses, and some research experience in simulating biomolecules in solution. That left a sizable hole where a chemistry graduate school normally expects one's inorganic chemistry qualifications to be. But I had read a lot — essentially no inorganic chemistry specifically, but lots of things about what materials are technologically significant, and what was developed when, and so forth. I was amused that that — history and fiction and engineering and physics and so forth — was enough to constrain the answers to the questions enough to pass the test. That can be interpreted in lots of ways, of course: a common interpretation is "multiple choice tests are silly." But one interpretation is that even a specialized field can touch on common knowledge in so many places that it is highly constrained by the need to conform to nonspecialist knowledge.

Thus, it is annoying to find things like the examples which follow. Perhaps the first example could be somewhat excused by the difficulty of stating one's points precisely within a one-page opinion piece. (Though, ahem, trimming some of the padding like "noisy champions today" might help with the length constraints, hmm?) Perhaps the second example could be resolved by aggressively going after the details of the problem, more or less in the same spirit as my physics professor did (but likely by poking around in a research library instead of by reanalyzing more carefully with pencil and paper and calculus). But they still grate, since I doubt that either excuse actually holds.

Example I: Paul Johnson wrote in Forbes (June 20, 2005) that (among other things)

Neither Darwin nor any of his followers — nor his noisy champions today — was a historian. None of them thought of time historically or made their calculations chronologically. Had they done so, they'd have seen that natural selection works much too slowly to fit into the time line allowed by the ages of the universe and our own planet.
Eh? If you were to search Google for Darwin's own most prominent work ("On the Origin of Species...") and surf around a bit, you'd have any number of routes into the original text. The natural one for me is "project gutenberg origin of species" which gives you a free downloadable copy of the text. Search for the text "here we encounter a formidable objection" and note that Darwin himself is already using numerical bounds on the age of the earth, committing himself to within a factor of ten or so (not unreasonable before knowledge of things like radioisotope dating and genetic clocks), and seriously addressing criticism based on those estimates.

Furthermore, this seems to be a case of the pot calling the sun black. Paul Johnson doesn't seem terribly interested in giving any estimate, within a factor of ten, of how long it would take ordinary (non-divine-intervention) natural selection to evolve organisms comparable to modern lifeforms. That is: Johnson claims "natural selection works much too slowly to fit into the time line allowed by the ages of the universe and our own planet." Then how much too slowly, and using what estimate of the ages? Would it take fifty billion years for natural selection to evolve organisms comparable to modern ones? Fifty quadrillion years? Charitably assuming that Johnson the historian troubled to familiarize himself with Origin of Species, he must be criticizing a Darwin who committed himself to within a factor of ten or so. Darwin had rather less to go on than we do: mostly taxonomy, folk genetics, and the fossil record, I think. In this age, when many details of mutation and genetic relatedness are well-known down to the last atom, wouldn't it be reasonable to expect Johnson to commit himself too? If not quite as precisely as Darwin did, then at least within a factor of a million?

Example II: The collective-right interpretation of the second amendment has been bouncing around the blogosphere recently, since it turns up (renounced by the majority and embraced by the minority) in the recent Parker v. District of Columbia decision. A moment's Googling brings up other high-profile bills of rights: anyone who had a civics class in the USA might know states tend to have bills of rights, so it's interesting to check big states like NY (right? what right?), PA (individual), VA (universal militia). (All are reachable with obvious queries like "virginia bill of rights".) Anyone who knows a little more history might know there was debate about ratification of the Constitution, that the US Bill of Rights was not unrelated to that debate, and that there's something called the English Bill of Rights, too. Given that those were in the air around the time of ratification, it's a bit odd that even with the resources of research libraries, the collective-rights folk seem unable to come up with contemporary documents which read the amendment as having even a possible, ambiguous collective-right interpretation, whether to support it or to criticize it. After all, the Federalism debate generated quite a lot of writing. Why doesn't some fraction of it turn on that? Imagine if we had a few historical documents like Paine (or, at the opposite religious end, some Catholic) arguing for the inadequacy of the guarantee protecting the right of local governments to choose who to arm. And, perhaps, Hamilton arguing that it was not only adequate but much wiser than the old-style guarantee to all individuals (of the religion on the winning side, anyway) in the English Bill of Rights. To judge from the tenor of the arguments from the collective-rights folk, you might be forgiven for thinking that we do have such documents. To my rather confident knowledge, we don't. Certainly, if we do, collective-rights advocates are inexplicably reluctant to remind us of them.

From the extreme individual-rights point of view where the words of the amendment have suspiciously-convenient unmodern interpretations (like "well-regulated" meaning "smoothly-operating"), of course this is a non-puzzle. But is it explicable for a collective-rightist? Sadly, despite having followed the debate off and on for years, I have never seen a collective-rightist willing to tackle the question the way that my quantum mechanics professor tackled the tennis ball puzzle. If it has been tackled, I'd rather expect it to be findable by chasing citations from Parker v. District of Columbia. However, I rather doubt that in fact it is findable there. Certainly no one on the collective side pointed it out in the many pages of post-ruling web debate that I read.

I'm not trying to prove either conclusion here, that natural selection is a good theory or that handgun bans are bad law. But I am claiming that the particular arguments that I am picking on seem like pretty bad arguments.

I'm not trying to hold intellectuals to impossibly high standards in defending their complicated ideas against simplistic critiques. For what it's worth, I think decent examples of specialists responding with specialist detail to simple layman's critiques can be found close to the controversies from which I took my examples: responses to "how can you possibly reconcile evolution by natural selection with the second law of thermodynamics?" and "jurisdiction A has lots of legal guns and lots of murder, jurisdiction B has little of either, so how can you possibly deny the fact that legal guns are a danger to public safety?"

I also vaguely plan to write my own essay "What is Entropy Like?" which will be, in some part, a response to the thermodynamic critique of evolution. Entropy arises naturally in the statistical mechanics underlying the work I used to do, so I'm pretty comfortable with the concept, and I think a lot can usefully be said about it without "hard" math like logarithms, and without truly-hard math like taking partial derivatives or integrating over phase space, or physics like counting the number of quantum states. I am thinking of explaining and illustrating how entropy is the complement of an order which behaves much like commonsense intuitions about wealth. (More precisely, wealth over any ordinary very short period, like 10 minutes, where wages and compound interest and such are negligible.) It comes in various forms like cash and deeds to land, and of course you can shift it around or destroy it, but don't expect to increase it. Also, use a suitably broad but not-too-broad definition or expect to get nonsense.

Just because one's ideas are ever so very, very intellectual doesn't mean the ideas should be held above simple reality checks by nonspecialists. John W. Gardner said something of the sort more memorably: "The society which scorns excellence in plumbing because plumbing is a humble activity, and tolerates shoddiness in philosophy because philosophy is an exalted activity, will have neither good plumbing nor good philosophy. Neither its pipes nor its theories will hold water."