Make the most of side-by-side code differencing

Wednesday, June 11th, 2008

I’m constantly amazed at how many developers shoot themselves in the foot by defeating the benefits of side-by-side source code differencing, which is perhaps the most routinely used technique in daily code development and maintenance with any VCS (Version Control System). In this post, I’d like to share a few tips for making the most of side-by-side differencing, which in my view should be adopted into every coding standard.

First of all, to benefit from side-by-side diff you need to limit the width of your lines so that you don’t need to scroll horizontally to see all the code. Countless bugs slip into a VCS because they are hidden off screen during the final merge and people are simply tired of constantly scrolling back and forth. (GUI usability guidelines consistently discourage horizontal scrolling of text.)

Granted, modern high-resolution wide screens offer a lot of horizontal pixels, but you’ll always run out of screen real estate if you allow lines to go on for miles. The column width must obviously allow comfortable viewing of two code listings side by side, but you should also budget some horizontal space for the directory-tree view, vertical scroll bars, line numbers, and line margins, as shown in the screenshot below. I’ve been using a column width limit of 78 characters. Your limit could perhaps be higher, but you must set such a limit and then enforce it without exceptions.

[Screenshot: side-by-side diff]
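A width limit is only useful if it is checked mechanically. Here is a minimal sketch of such a check in C. The names MAX_COLUMNS and count_long_lines are my own for illustration, and the sketch assumes the source contains no tabs (each character counts as one column).

```c
#include <stddef.h>

enum { MAX_COLUMNS = 78 }; /* the limit used in this post; pick your own */

/* Count the lines in a NUL-terminated text that are wider than `limit`
 * characters. CR and LF do not count toward the width. Assumption: the
 * text contains no tabs, so one character is one column. */
size_t count_long_lines(char const *text, size_t limit) {
    size_t width = 0;
    size_t too_long = 0;
    for (; *text != '\0'; ++text) {
        if (*text == '\n') {
            if (width > limit) {
                ++too_long;
            }
            width = 0;
        } else if (*text != '\r') {
            ++width;
        }
    }
    if (width > limit) { /* last line without a trailing newline */
        ++too_long;
    }
    return too_long;
}
```

Calling count_long_lines(text, MAX_COLUMNS) on the contents of a file and failing the build when the result is non-zero is one way to enforce the limit without exceptions.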

I can see two main reasons why people write very long lines. The first is long strings in the code. But C and C++ allow you to split a long string literal across lines in the following ways:

char const s1[] = "This long string is acc\
eptable to all C compilers.";

char const s2[] = "This long string is permissible "
                  "in ANSI C.";

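In other words, you can either end a source line with a backslash ‘\’ inside the string literal and continue it on the next line (any leading white space on the continuation line becomes part of the string), or you can close the string normally with a double quote ‘"’ and open a new one on the next line; an ANSI C compiler concatenates such adjacent string literals into a single zero-terminated string.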
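To convince yourself that both ways of splitting a literal produce the same string, you can compare them at run time. This is just a sketch; the identifiers joined_by_backslash, joined_by_concat, and literals_match are made up for the example.

```c
#include <string.h>

/* Split with a backslash: the continuation must start in column 1,
 * otherwise the indentation would end up inside the string. */
char const joined_by_backslash[] = "This long string is acc\
eptable to all C compilers.";

/* Split into adjacent literals: the compiler concatenates them,
 * and the second line can be indented freely. */
char const joined_by_concat[] = "This long string is "
                                "acceptable to all C compilers.";

/* Returns 1 when both literals are the same string, 0 otherwise. */
int literals_match(void) {
    return strcmp(joined_by_backslash, joined_by_concat) == 0;
}
```

The second form is usually friendlier to side-by-side diffs, because the continuation line can follow the normal indentation of the surrounding code.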

The second reason for long lines is preprocessor macros. Here again, you can use the backslash ‘\’ to break up a longer macro into lines. For example:

#define err(flag, msg) if (flag) \
    printf(msg)

is the same as

#define err(flag, msg) if (flag) printf(msg)
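The same technique scales to macros that span several lines. The sketch below is not the macro from this post: it wraps the body in the common do { ... } while (0) idiom so that the macro behaves like a single statement, and it calls report_err and uses err_count, both hypothetical helpers invented for the example.

```c
#include <stdio.h>

int err_count = 0; /* hypothetical counter, for illustration only */

/* Hypothetical helper: print the message and remember that it happened. */
void report_err(char const *msg) {
    fputs(msg, stderr);
    ++err_count;
}

/* Every line except the last ends with a backslash that is immediately
 * followed by the end-of-line. */
#define ERR(flag, msg)       \
    do {                     \
        if (flag) {          \
            report_err(msg); \
        }                    \
    } while (0)
```

With this shape, a call such as ERR(x < 0, "negative\n"); can be used safely inside an if/else without braces, and each line of the macro stays well within the column limit.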

The use of a backslash for breaking up longer lines brings up the issue of the end-of-line convention and the use of white space in your source code in general.

Let me start with the end-of-line convention. The issue here is that the backslash continuation won’t work unless the ‘\’ character is immediately followed by the end-of-line. Unfortunately, at least two incompatible end-of-line conventions are in widespread use. The DOS/Windows end-of-line convention uses the pair of characters CR-LF (0x0D, 0x0A in hex) to terminate lines. In contrast, the UNIX end-of-line convention uses only one LF character (0x0A). As it turns out, tools on Unix-like machines (e.g. Linux) can be confused by the DOS end-of-line convention and may not correctly recognize the backslash continuation, which then looks like ‘\’-CR-LF (0x5C, 0x0D, 0x0A) instead of ‘\’-LF (0x5C, 0x0A).

My recommendation is to consistently use only the UNIX end-of-line convention, even on Windows machines. In my experience, Windows-based compilers have no problems with the UNIX convention, including the ancient tools from the DOS era. As I mentioned, the converse is not true.
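Converting a buffer from the DOS convention to the UNIX convention is a small loop. Here is a minimal sketch in C; the function name crlf_to_lf is my own, and the sketch deliberately leaves a lone CR (one not followed by LF) untouched.

```c
#include <stddef.h>

/* Rewrite every CR-LF pair in buf[0..len) as a single LF, in place.
 * Returns the new length. A CR that is not followed by LF is kept. */
size_t crlf_to_lf(char *buf, size_t len) {
    size_t out = 0;
    for (size_t in = 0; in < len; ++in) {
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n') {
            continue; /* drop the CR; the LF is copied on the next pass */
        }
        buf[out++] = buf[in];
    }
    return out;
}
```

After the conversion a backslash continuation is again the two-byte sequence 0x5C, 0x0A that every compiler expects.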

And finally, let me talk about the use of white space (spaces, tabs, end-of-line) in general. Obviously, to benefit from source code differencing you’d like to see only the relevant differences, and differences in white space alone are typically not relevant. Many code-differencing tools offer an option to ignore white space, but I would not recommend relying on it. Are files with different sizes really identical? Also, as I said before, extra spaces or tabs after the backslash, but before the end-of-line, are not allowed.
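Stray white space after a continuation backslash is invisible in most editors, so it pays to search for it. The following sketch counts backslashes that are separated from the end of the line by spaces, tabs, or a CR. The function name count_suspect_continuations is hypothetical, and the check is purely textual: it does not try to understand comments or string literals.

```c
#include <stddef.h>

/* Count backslashes that are followed by at least one space, tab, or CR
 * and then by LF (or the end of the text). Such a backslash looks like
 * a line continuation but is not immediately followed by the LF. */
size_t count_suspect_continuations(char const *text) {
    size_t suspects = 0;
    for (char const *p = text; *p != '\0'; ++p) {
        if (*p == '\\') {
            char const *q = p + 1;
            while (*q == ' ' || *q == '\t' || *q == '\r') {
                ++q;
            }
            if (q != p + 1 && (*q == '\n' || *q == '\0')) {
                ++suspects;
            }
        }
    }
    return suspects;
}
```

A result of zero means every backslash at the end of a line is directly followed by LF; escape sequences such as \n inside string literals are not reported, because the backslash there is followed by an ordinary character.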

As far as tabs are concerned, I’d strongly recommend not using them at all. Tabs are rendered differently by different editors and printers and bring only insignificant memory savings. Preferably, you should disable tabs at the editor level. At the very least, you should replace all tabs with spaces (“untabify”) before saving the file. As for spaces, I recommend removing any trailing spaces that precede the end-of-line character (LF).
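Untabifying and trimming a single line takes only a few lines of C. The sketch below is my own illustration, not the code of any particular tool: the name clean_line and the tab width of 4 columns (TAB_WIDTH) are assumptions, and the function reads the input only up to the first CR or LF.

```c
#include <stddef.h>

enum { TAB_WIDTH = 4 }; /* assumed tab stop; adjust to your convention */

/* Copy one line from `in` to `out`, expanding tabs to the next tab stop
 * and dropping trailing spaces and any CR/LF, then terminate the result
 * with a single LF. Returns the number of characters written (without
 * the NUL), or 0 if `out` is too small. */
size_t clean_line(char const *in, char *out, size_t out_size) {
    size_t n = 0;
    if (out_size < 2) {
        return 0;
    }
    for (; *in != '\0' && *in != '\n' && *in != '\r'; ++in) {
        if (*in == '\t') {
            size_t stop = n + (TAB_WIDTH - n % TAB_WIDTH);
            if (stop + 2 > out_size) { /* keep room for LF and NUL */
                return 0;
            }
            while (n < stop) {
                out[n++] = ' ';
            }
        } else {
            if (n + 3 > out_size) { /* this char plus LF and NUL */
                return 0;
            }
            out[n++] = *in;
        }
    }
    while (n > 0 && out[n - 1] == ' ') { /* strip trailing spaces */
        --n;
    }
    out[n++] = '\n';
    out[n] = '\0';
    return n;
}
```

Running every line of a file through such a function before saving guarantees that the file contains no tabs, no trailing blanks, and only LF line endings.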

Obviously, you can and should automate the source code cleanup. I use the QCLEAN utility (available here under the GPL license) to remove tabs and trailing blanks from the code and to enforce the UNIX end-of-line convention. The simple console QCLEAN Windows executable recursively scans all source files (.C, .CPP, .H, .ASM, .S, Makefile, etc.) down from the directory in which it is invoked. The following two listings show a code snippet before and after cleanup with the QCLEAN utility (spaces are shown as dots, tabs as \t, DOS end-of-lines as \r\n, UNIX end-of-lines as \n).

before cleanup:
.\t...\r\n
class.Foo.:.public.Bar.{...\n
public:.\r\n
\tFoo(int8_t.x,.int16_t.y,.int32_t.z).//..ctor..\n
....:.Bar(x,.y),.m_z(z)....\n
....{}.............\n
.\t..\n
....virtual.~Foo();\t....//.xtor........\r\n
....virtual.int32_t.doSomething(int8_t.x);.//.method..\r\n

after cleanup with QCLEAN:
\n
class.Foo.:.public.Bar.{\n
public:\n
....Foo(int8_t.x,.int16_t.y,.int32_t.z).//..ctor\n
....:.Bar(x,.y),.m_z(z)\n
....{}\n
\n
....virtual.~Foo();....//.xtor\n
....virtual.int32_t.doSomething(int8_t.x);.//.method\n