Friday, April 6, 2007

Can Code Be Truly Self-Documenting?

I know I'm stepping into the middle of a holy war here but I've been in some conversations on this subject lately and thought it might be worth laying out my thoughts.  Recently a coworker in another part of the company told me that they are not allowed to put comments in their code.  The argument is that code should be self-documenting and any comments in the code will become incorrect over time.  Is that really true?

The idea stems from the Extreme Programming (XP) movement, although it could be a misinterpretation of their views.  Certainly Vasilli Bykov argues that it is.  XP proponents argue against large amounts of documentation for sure.  To an extent, they are right.  At some point the returns on documentation diminish.  A lot of work can go into creating detailed documentation which quickly gets out of date.  People update the code without updating the comments or the specs and suddenly the documentation is worse than worthless.  Is that then a condemnation of all comments?  No.

First off, let's take on the idea that code is self-documenting.  It is not.  Not fully anyway.  Code is ultimately designed to tell a computer how to accomplish a task.  It's not foremost intended to tell a human how to accomplish a task.  For that we have prose.  While code is often readable by a human and should be made as easily intelligible by a human as possible, that is still not its primary task.  Anyone who has tried to take over an undocumented code base--even one with clean code--will tell you it is difficult.  You have to read the whole thing at least twice:  once to build up your token list and a second time to decipher how the tokens interact.  Not everything is obvious without tracing out the actual execution in your mind (or a debugger).  This is a time-consuming process.  And don't even get me started on "modern" programming fancies like keeping each function to a screen and deep call stacks.  Those have terrible implications for readability.

Comments have a lot of utility in programming.  They tell the next user what to expect.  There are two sorts of comments that are most useful.  High level comments help the next guy understand how everything works together.  Low level comments can help explain a complex piece of code or justify a particular decision. 

High level comments might be class level or even file-level.  They explain what the intent of this part of the program is, some information about how the various classes and functions interact, etc.  When first tackling a new codebase, these kinds of comments can be invaluable.  Trying to build up an understanding of the architecture of code by reading each function is like trying to see the pattern in a mosaic using a magnifying glass.  It's really hard.  It's better to step back and take in the whole picture at once.

I once had to make some modifications to PostgreSQL for a class.  To do this I had to understand how the various pieces worked together to make sure I modified all the right parts.  There were no specs available but most of the files had a header block which explained what the functions and data structures in the file did.  This proved invaluable.  Instead of having to understand each structure, then see how the functions used them, and finally work out how the myriad functions interacted with each other, I could read this synopsis and focus on just the parts that mattered to me.  I also quickly got a sense of whether I was making modifications in line with the original intent or not.  Without any comments, I would have spent much longer trying to accomplish the same task.
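To make that concrete, here is a minimal sketch of the kind of file-level header I have in mind.  It is not taken from PostgreSQL; the module name, the Entry and Cache classes, and the expiry rules are all hypothetical, and Python is used only for illustration.

```python
"""cache.py (hypothetical module): a small in-memory cache with expiry.

How the pieces fit together:
  * Entry - one cached value plus the time at which it expires.
  * Cache - owns a dict of key -> Entry.  Callers should only use Cache.

Design notes:
  * Expired entries are dropped lazily, during get().  There is no
    background cleanup.
  * Callers pass `now` explicitly so behavior does not depend on the
    system clock.
"""
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Entry:
    value: Any
    expires_at: float


class Cache:
    def __init__(self, ttl: float) -> None:
        self._ttl = ttl
        self._entries: Dict[str, Entry] = {}

    def put(self, key: str, value: Any, now: float) -> None:
        self._entries[key] = Entry(value, now + self._ttl)

    def get(self, key: str, now: float) -> Any:
        entry = self._entries.get(key)
        if entry is None:
            return None
        if entry.expires_at <= now:
            # Lazy expiry, as promised in the header above.
            del self._entries[key]
            return None
        return entry.value
```

Someone new to this file learns from the header alone which class to call, when entries disappear, and why `now` is a parameter, without having to read any of the method bodies.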

Low level comments are usually interspersed within a function or method.  They should serve two primary purposes.  First, they should explain any complex actions going on.  Math is notoriously hard to understand from code.  This becomes even more true if it is optimized.  A comment telling you that a lookup table is being used to implement clipping is a lot easier to understand than looking at the table and surmising its purpose from the values in it.  The other really valuable purpose is to explain any deviation from standard practice.  Sometimes the obvious solution is incorrect.  There are bugs caused by side effects or subtle corner cases which go unnoticed.  If you were to write the code to fix the bug in those cases without comments, the next guy is likely to undo your fix and recreate the bug.
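Both kinds of low level comment can be sketched in a few lines.  The example below is hypothetical (Python, with made-up names such as CLIP_TABLE, add_brightness and remove_expired) and is only meant to show what the comments buy you, not how any particular codebase does it.

```python
# Clipping lookup table.  Maps an intermediate value in [-256, 511] to the
# valid 8-bit range [0, 255].  The table is indexed by value + OFFSET.
OFFSET = 256
CLIP_TABLE = [min(max(v, 0), 255) for v in range(-OFFSET, 512)]


def add_brightness(pixel, delta):
    # Table lookup instead of min()/max().  With pixel in [0, 255] and
    # delta in [-256, 256], pixel + delta + OFFSET is always a valid index.
    return CLIP_TABLE[pixel + delta + OFFSET]


def remove_expired(sessions, now):
    # Iterate over a copy on purpose.  Removing from a list while iterating
    # over that same list skips the element after each removal, so two
    # adjacent expired sessions would leave the second one behind.
    # Do not "simplify" this to `for s in sessions`.
    for s in list(sessions):
        if s["expires"] <= now:
            sessions.remove(s)
```

Without the first comment, the reader has to print the table to discover that it clips.  Without the second, the copy looks like waste and the next guy "optimizes" it away.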

Other sorts of comments can be useful too.  A header block on a function or method describing the purpose and what each parameter does is a lot faster to read and interpret than trying to divine the same information by reading the whole function.  It's not that it cannot be done.  It can.  It just takes a long time.
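Here is a small sketch of such a header block, written as a Python docstring.  The function moving_average and its parameters are invented for the example.

```python
def moving_average(values, window):
    """Return the simple moving average of `values`.

    Parameters:
        values: sequence of numbers, oldest first.
        window: how many consecutive items go into each average;
                must be at least 1.

    Returns:
        A list of len(values) - window + 1 averages, or an empty list
        if `values` has fewer than `window` items.

    Raises:
        ValueError: if `window` is less than 1.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    result = []
    # Keep a running sum so each step is O(1) instead of re-summing.
    running = sum(values[:window])
    for i in range(window, len(values) + 1):
        result.append(running / window)
        if i < len(values):
            running += values[i] - values[i - window]
    return result
```

The body is short, yet the length of the result and the behavior on short input still take a moment to work out from the loop.  The header states both up front.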

So code can't be fully self-documenting.  But if the comments are out of date, isn't that worse than no comments at all?  An argument can be made that it is.  It takes a while to notice that the comments are wrong and then you have to go back to the code anyway.  In that case, you might as well not have had any comments.  This argument, however, is based upon the premise that the comments will inevitably become incorrect.  I dispute that.  The answer is simple:  update the comments when you change the code.  When code is refactored, the comments must be refactored as well.  Not doing so is just as bad as not running the unit tests or not checking return values.

Of course the response is that this never happens.  Does that have to be the case?  Changing code without changing the comments is introducing a bug.  Not a computer-level bug but a programmer-level one.  This is poor programming.  Don't do it.  Code reviewers should be vigilant for this sort of thing and flag any wrong comments as errors.  Once upon a time no one tested the code they wrote.  No one had it reviewed.  No one wrote unit tests.  Most of these are part of a standard best-practices regimen today.  Can't updating comments just be added to the list of best practices?  I see no reason it cannot.

Comments are, in my mind, an indispensable part of healthy software.  Good programming is not just about communicating with the computer but also with the next programmer.  At some point you'll move on and someone else will have to work in your codebase.  At that point comments become important not only for reducing ramp-up time but critical to avoid making the same mistakes twice.


  1. Documenting code requires a holistic approach.  There are left-wingers and right-wingers that will argue no comments or lots of comments.  My view: code should be as self-documenting as possible with comments that fill in the gaps that self-documenting code doesn't fill.  Editing code also means editing comments; that's the part that most lazy programmers miss and then code documentation just falls apart.

  2. Another slick use of commenting is the XML doc comments that modern versions of VS use. They're nice because it's a standard format you can write a documentation extractor for, and the IDE's IntelliSense will pick them up. I find these invaluable for projects I do in VS.

  3. "And don't even get me started on "modern" programming fancies like keeping each function to a screen and deep callstacks.  Those have terrible implications for readability."
    And so you lose. Since you don’t even get the fundamentals of procedural abstraction it's no wonder you still cling to comments.

  4. Procedural abstraction works better in theory than in reality.  It works pretty well, but the ability to treat procedures as pure black boxes exists only in academia.  When performance matters, implementation details become relevant.
    Thus my statement about readability still stands.  Sure, you can get the whole procedure in your head, but if you need to know the details of that procedure, you need to go find it.  Being a function call, it could be anywhere in the source and finding it takes effort.  Should you be trying to read the code in paper form, it becomes even harder.  Breaking up code merely to reduce the size of the chunks makes understanding the whole more difficult while making understanding the immediate easier.  I find I need to understand the whole often enough that arbitrary function/method sizes do more harm than good.