Integrating Code Documentation

You Know What Really Grinds My Gears?

I’m about a month into a new job at the time of writing this post, and one of my first mini-projects in this role has been to get a curated set of next-generation sequencing data analysis pipelines up and running on our institutional computing cluster. The countless hours of troubleshooting that this has involved over the past couple of weeks have meant that I have spent an unusually large amount of time reading scripts written by other developers and bioinformatics community members. This experience has reminded me that, above all others, one thing should be considered a hanging offense when it comes to programming: poor code documentation.

Poor code documentation practices…and noisy eaters!

When most programmers think of code documentation, they think of comments. But code documentation is really anything written in addition to your code that provides insight into what it does, the thought process behind how it works, and why it is written in a particular way. Forms of code documentation other than comments include user guides, manuals, README and help files.

Well-written and comprehensive code documentation will not only help potential users or collaborators understand your scripts, but it will also help your future self when you revisit a project or re-use code months or years down the line. Conversely, poorly written or non-existent code documentation adds unnecessary difficulty for others tasked with using or reworking your scripts.

This post is going to be a crash course in integrating code documentation during development, a foundational element of programming best practice and one of the easiest skills to learn and implement. In the immortal words of the well-known development standards evangelist Vinnie Jones, let’s ‘ave it.

Follow Best Practice Coding Conventions

Regardless of comments and documentation, how can you expect someone to understand the code you write if it doesn’t follow a system of best practice coding conventions?

While a full discussion of software engineering practices is outside the scope of this post, the fundamental concept is to ensure that you are writing clean and legible code that, as a bare minimum, follows the accepted style for a specific language.

The key ideas to focus on when it comes to coding conventions and best practices are:

1. Follow a consistent coding style
2. Follow the syntax conventions of a given language
3. Write comments (covered shortly)
4. Use descriptive variable and function names
5. Use libraries
6. Use functions

A style guide for R can be found here, while the style guide for Python can be found here. Google is your friend if you script in another language. Following the conventions defined by a style guide not only helps those who come after you to understand your code, but it also makes the code as production-ready as possible if and when that time comes.

Regarding the fourth point, variable names should generally be nouns while function names should be verbs. As the well-known Phil Karlton quote goes, there are only two hard things in computer science: cache invalidation and naming things. It might be difficult at times, but strive for names that are both concise and meaningful.
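As a rough illustration of the noun and verb idea, here is a minimal sketch. The data and every name in it are made up for the example and are not taken from any real pipeline.

```r
# hypothetical read depths for three samples (a noun, because it holds data)
read_depths <- c(120, 95, 143)

# unclear: neither name says what is stored or what happens
f <- function(v) v / sum(v)
x2 <- f(read_depths)

# clearer: a verb for the action, nouns for the input and the result
normalise_depths <- function(depths) {
	# scale each depth so that the values sum to 1
	depths / sum(depths)
}

depth_proportions <- normalise_depths(read_depths)
```

Both versions compute the same thing; only the second tells the reader what that thing is without having to trace the code.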

To aid portability and reproducibility of your code, use functions from defined versions of open-source code libraries where possible in your scripts. Always load these dependencies at the top of a script.
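One possible way to put this into practice is sketched below. It only uses packages that ship with R as stand-ins for real dependencies, and the minimum version string is an arbitrary value chosen for illustration.

```r
# load dependencies at the top of the script
# (stats and utils ship with R and stand in for real dependencies here)
library(stats)
library(utils)

# record the versions in use so that a run can be reproduced later
r_version_used <- as.character(getRversion())
stats_version_used <- as.character(packageVersion("stats"))

# stop early if R is older than the version the script was written for
# ("3.5.0" is an arbitrary illustrative minimum, not a recommendation)
minimum_r_version <- "3.5.0"
if (getRversion() < minimum_r_version) {
	stop("This script expects R >= ", minimum_r_version)
}
```

Writing the recorded versions to a log file, or printing sessionInfo() at the end of a run, serves the same purpose.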

Finally, keep your code DRY. That is, Don’t Repeat Yourself. Using properly documented bespoke functions defined early in your scripts, or loaded from a separate file for a pipeline, in place of repeating code blocks is much more efficient and far less error prone.
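To make the DRY idea concrete, here is a small sketch using made-up values. The function name and the data are hypothetical.

```r
# hypothetical expression values for two samples
sample_a <- c(2, 8, 32)
sample_b <- c(1, 10, 100)

# repetitive version: the same logic is typed out once per sample
scaled_a <- (sample_a - min(sample_a)) / (max(sample_a) - min(sample_a))
scaled_b <- (sample_b - min(sample_b)) / (max(sample_b) - min(sample_b))

# DRY version: the logic lives in one documented function
# function to rescale a numeric vector to the range 0 to 1
rescale_to_unit_range <- function(values) {
	(values - min(values)) / (max(values) - min(values))
}

scaled_a_dry <- rescale_to_unit_range(sample_a)
scaled_b_dry <- rescale_to_unit_range(sample_b)
```

If the scaling ever needs to change, the DRY version has one place to edit rather than one per sample.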

Being self-taught is not a legit excuse for lax practices

Employing these best practice coding conventions ensures that someone with only a rudimentary understanding of the programming language in question will likely be able to understand what you have written and how it should work even if your documentation is sub-par, which of course it won’t be by the end of this post.

Start with Pseudocode, End with Comments

Pseudocode, as the name suggests, is a “false code” representation of programmatic constructs that allows the programmer to describe the implementation of their code in plain written language; it is essentially a mix of step-by-step notes and informative annotations that initially detail how a block of code will work. Unlike any given programming language by which it will eventually be replaced, pseudocode has no syntax and can’t be compiled or interpreted by a computer. Critically though, it can be understood by a layperson with little to no programming knowledge.

As the main goal of writing pseudocode is to simply explain what each line or small chunk of code in a program should do, it is one of the best approaches to begin the development process. This will reduce the chance of unexpected errors and assist with debugging those that do occur.

Remember the five Ps: Prior Planning Prevents Poor Performance.

Thinking your way through the entire workflow or sub-workflow at the outset and drafting all or most of the steps you plan to encode in a sensible order will not only offer the most accurate glimpse into your thought process but will help you write code faster and more efficiently. It will also avoid the issue every single programmer in history has encountered at least once: coming back to a project after time off and not remembering your original thought process, which might have made complete sense at the time, but now might as well be a cryptic crossword written back-to-front in Japanese.

So how do you write pseudocode and how do you use it to document your code? Let’s look at a worked example in which we will construct a simple function in the R programming language to calculate the geometric mean for a series of input numbers:

1. Start with a statement to establish the main aim

write a function to calculate the geometric mean for a series of numbers

2. Outline all steps required to achieve the main aim

write a function to calculate the geometric mean for a series of numbers

steps for calculating the geometric mean are

	take the natural log of an input series of numbers
	calculate the mean of these values
	exponentiate the result

During this step you should be explicit about everything that is going to happen in the actual code in the exact order it should occur. This includes describing variable representation, the purpose of any functions called, and any results that should be returned.

Just as best practice conventions dictate that function bodies should be indented, you should also indent the pseudocode statements. This helps to establish a framework that will ensure compliance with these conventions later on; they are designed to make the control flow easier to follow and greatly improve code readability.

Your pseudocode should not be abstract and should be written in simple language that is easy for a layperson to understand. You should also use appropriate naming conventions because of the ingrained human tendency to follow what we see, i.e., monkey see, monkey do: when a programmer converts their pseudocode into code, their tendency will be to emulate it. You should therefore check that all the sections of your pseudocode are complete and easy to comprehend while avoiding excessive use of technical terms.

3. Add in the actual code for each pseudocode statement

write a function to calculate the geometric mean for a series of numbers
calculate_geometric_mean <- function(x) {

steps for calculating the geometric mean are

	take the natural log of an input series of numbers
	log(x) %>%

	calculate the mean of these values
	mean() %>%

	exponentiate the result
	exp()
  
}

At this point you should leave your pseudocode in place and follow the indentation pattern you created previously. I usually add in new lines at this stage to break things up, improve readability and ensure I haven’t missed anything.

4. Transform your initial pseudocode notation into comments

# load the magrittr package, which provides the %>% pipe operator
library(magrittr)

# function to calculate geometric mean for input numeric vector
calculate_geometric_mean <- function(x) {

	# take natural log of input vector values
	log(x) %>%

	# calculate mean of all values
	mean() %>%

	# exponentiate result
	exp()
  
}

This step essentially involves making your initial pseudocode slightly less verbose without compromising information conveyance. At this stage, you might include slightly more concise technical or language-specific notation.

Note that different languages have different ways of commenting out text so that the compiler or interpreter ignores it; you should familiarise yourself with the specific commenting conventions for the languages you are working with. In this case, R uses the ‘#’ symbol to comment out anything that follows on that same line.

5. Remove some line breaks (optional)

While things like function or loop bodies should always be discretely identifiable through appropriate use of spacing and indentation, it is basically personal preference as to whether you leave in the space between sequential comments and their accompanying lines of code within these code blocks; sometimes one approach might be more desirable than another. Just remember that readability is paramount.

Example 1: residual line breaks

# function to calculate geometric mean for input numeric vector
calculate_geometric_mean <- function(x) {

	# take natural log of input vector values
	log(x) %>%

	# calculate mean of all values
	mean() %>%

	# exponentiate result
	exp()
  
}

Example 2: line breaks removed

# function to calculate geometric mean for input numeric vector
calculate_geometric_mean <- function(x) {

	# take natural log of input vector values
	log(x) %>%
	# calculate mean of all values
	mean() %>%
	# exponentiate result
	exp()
  
}

Personally, I tend to use the second approach for function or loop bodies, and then make sure that discrete code blocks are separated by a single line break.
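If you want to convince yourself that the finished function does what the comments claim, a quick check along these lines may help. To keep the block runnable with no packages loaded, the function is restated here in base R without the pipe, and the input values are made up so that the answer is easy to verify by hand.

```r
# base R restatement of the function from the worked example
# (no pipe, so the block runs without loading any packages)
calculate_geometric_mean <- function(x) {
	# take natural log, calculate mean, then exponentiate
	exp(mean(log(x)))
}

# hypothetical input values chosen so the answer is easy to check by hand
example_values <- c(1, 10, 100)

geometric_mean <- calculate_geometric_mean(example_values)

# cross-check against the textbook definition: nth root of the product
nth_root_of_product <- prod(example_values)^(1 / length(example_values))

# all.equal() compares within a small tolerance, which suits floating point results
isTRUE(all.equal(geometric_mean, nth_root_of_product))
```

The cube root of 1 × 10 × 100 is 10, so both routes should agree on that value to within floating point tolerance.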

Assume Your Reader Knows Nothing

While it might be true that anyone purposefully reviewing or using your code will likely have at least a rudimentary understanding of the programming language, you should write and document it based on the assumption they know nothing about what the code does or how and why it works.
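As a sketch of what documenting the what, how and why might look like, consider the short function below. The function name and the input values are hypothetical; the formula is the standard relationship between a Phred quality score and the probability of a base-calling error.

```r
# WHAT: convert Phred quality scores into the probability that a base call is wrong
# HOW:  error probability = 10^(-Q / 10), where Q is the Phred score
# WHY:  written as a single vectorised expression rather than a loop because
#       R arithmetic already operates element-wise on whole vectors
convert_phred_to_error_probability <- function(phred_scores) {
	10^(-phred_scores / 10)
}

# hypothetical quality scores for three bases
example_scores <- c(10, 20, 30)

error_probabilities <- convert_phred_to_error_probability(example_scores)
```

A reader who has never heard of a Phred score can still tell from the comments what goes in, what comes out, and why the body is a one-liner.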

The inspiration for this post was just one condensed run of the many instances in which I’ve read code with little or no documentation, where the person who wrote it very obviously assumed that the reader would know everything about it, from how it worked to why it was written in a specific way.

The two things that you can do to prevent these egregious situations are to bother documenting your code in the first place and to write that documentation as if the next person has no prior knowledge. This brings us swiftly on to my next point.

Get a Rubber Duck

Bear with me here; this is in fact a well-established practice that is commonly associated with debugging. So-called rubber ducking is a practice in which a programmer debugs by articulating the problem and explaining the related code line-by-line to a rubber duck in plain spoken language.

Implementing this practice of describing how your code works so that a rubber duck, with its apparent lack of programmatic savvy, could understand it, will help to ensure that you do not assume knowledge, omit any key details, or incorporate excessive levels of jargon when it comes to finalising your code documentation. Using it might also help you to weed out the cause of errors during development.

You don’t have to go out and get yourself a rubber duck; a dog, cat, goldfish, houseplant, or uninterested partner will suffice. Only practice this technique in public with the last of these; using any of the others in this situation might result in an extended holiday to Broadmoor. If you do get yourself a rubber duck, I don’t advise you get one as intimidating as mine 😳

“So tell me again, how does this program of yours work? Slowly…”

Provide Contact Information

Despite your best efforts, the documentation that you write might sometimes simply not be enough for some people; occasionally they may need to clarify a specific issue or confirm that a particular programmatic construct works in the manner it was intended. Providing your contact information in the header or footer of your script is a simple yet effective way of ensuring that any questions or issues are directed to you, which can help save users or collaborators a lot of time and frustration.
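A script header along the following lines is one way to do this. Every value shown is a placeholder, and the layout is just a suggestion rather than any kind of standard.

```r
# ------------------------------------------------------------------
# Script:   calculate_geometric_mean.R   (hypothetical file name)
# Purpose:  calculate the geometric mean for a series of numbers
# Author:   Jane Doe                     (placeholder)
# Contact:  jane.doe@example.com         (placeholder)
# Updated:  YYYY-MM-DD
# ------------------------------------------------------------------
```

Because every line starts with ‘#’, R ignores the whole header, so it can sit above the library calls at the top of any script.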

Summary

The key takeaways here are that you should write detailed notes that can be used in code construction as the foundation for your documentation, assume that the reader knows nothing about your code, which should follow syntax conventions and best practices, and ensure that you could explain it to a rubber duck that has access to your contact details within your scripts so that you can provide further explanations if it has some questions.

Hopefully the tips I have outlined here will provide a framework that you can implement each time you set out to write code and, by doing so, will make both your own experience and that of your readers much easier. Developing and using source code can be a frustrating process even for the experienced, so don’t make it any more painful than it needs to be.

See you next time.

. . . . .

Thanks for reading. I hope you enjoyed the article and that it helps you to get a job done more quickly or inspires you to further your data science journey. Please do let me know if there’s anything you want me to cover in future posts.

Happy Data Analysis!

. . . . .

Disclaimer: All views expressed on this site are exclusively my own and do not represent the opinions of any entity whatsoever with which I have been, am now or will be affiliated.
