2018-11-18

The quirks of the C language

C is a language with a simple syntax. The only complexity of this language comes from the fact that it behaves in a machine-like way. However, part of the C syntax is almost never taught. Let's tackle these mysterious cases! 🧞

To understand this post, you need a basic knowledge of a language whose syntax and behaviour are close to C.

The uncommon operators
1. The comma operator
2. The ternary operator
Access to an array
Initialisation
The compound literals
Introduction to VLAs
The VLAs exceptions
A flexible array
A labels history
Complex numbers
Generic macros
Too special characters
To conclude

The uncommon operators

There are two operators in the C language that are almost never used. The first is the comma operator. In C, the comma is used to separate the elements of a declaration or the arguments of a function call. In short, it is a punctuation element. But not only that! It is also an operator.

The comma operator

The following instruction, although unnecessary, is quite valid:

printf("%d", (5,3) );

It prints 3. The operator , is used to juxtapose expressions.

The value of the whole expression is equal to the value of the last expression.

This operator is very useful in a for loop to update several variables in each iteration. For example, to increment i and decrement j in the same iteration of a for loop, we can do:

for( ; i < j ; i++, j-- ) {
 // [...]
}

Or again, in a short if, to avoid braces:

if( argc > 2 && argv[2][0] == '0' )
    action = 4, color = false;

Here, we assign both action and color. Normally, to do two assignments in the body of an if, we would need curly braces.

We can also use the comma operator to remove parentheses.

while( c = getchar(), c != EOF && c != '\n' ) {
  // [...]
}
// Is strictly equivalent to :
while( (c = getchar()) != EOF && c != '\n' ) {
  // [...]
}

But above all, do not abuse this operator! You can, in a rather fast way, obtain unreadable things.
This remark is also valid for the next operator!

The ternary operator

The "ternary", to its friends. It is the only operator of the language that takes 3 operands. It is used to simplify conditional expressions.

For example to print the minimum of 2 numbers, without ternary, we could do:

if (a < b)
    printf("%d", a);
else
    printf("%d", b);

Or simply:

int min = a;
if( b < a)
  min = b;
printf("%d", min);

And using the ternary operator:

printf("%d", a<b ? a : b);

Thanks to the ternary, we saved a repetition as well as a few lines. More importantly, we made the code more readable. At least, when one knows how to read a ternary...

To read a ternary expression, you need to split it into 3 parts:

expression_1 ? expression_2 : expression_3

The value of the whole expression is expression_2 if expression_1 evaluates to true, and expression_3 otherwise.

So short expressions become easier to read. The expression a<b ? a : b reads “If a is less than b then a else b”.

I still insist on the fact that this operator, if it is badly used, can harm the readability of the code. Note that a ternary expression can be used as the operand of another ternary expression:

printf("%d", a<b ? a<c ? a : b<c ? b : c : b < c ? b : c);

Now we take the minimum of three numbers. It is compact, but impossible to follow. A more readable way is to use a macro:

#define MIN(a,b) ((a) < (b) ? (a) : (b))

printf("%d", MIN(a,MIN(b,c)));

Et voilà! Two operators that should now be of more interest. By the way, even for the operators we already know, we do not necessarily master the syntax.

Access to an array

If you have ever learned C, you must have learned that to access the third element of an array, you can do:

int arr[5] = {0, 1, 2, 3, 4};
printf("%d", arr[2]);

2, not 3, because an array in C starts at 0. If an array starts at 0, it is a matter of addresses.¹ The address of an array is that of its first element. By pointer arithmetic, the address of the third element of an array is therefore arr+2.

So we might have written:

printf("%d", *(arr+2));

Then, since addition is commutative, arr+2 and 2+arr are equivalent. In fact, we could even have done:

printf("%d", 2[arr]);

It's perfectly valid. With good reason: the syntax E[F] is strictly equivalent to *((E)+(F)), no more, no less. Suddenly the name of this section seems misleading. This operator has nothing to do with arrays. In fact, it's syntactic sugar for pointer arithmetic.²

For example, suppose we want to print the character = for 1, ! for 0 and ~ for 2.

We may do:

if( is_good == 1 )
  printf("%c", '=');
else if( is_good == 0 )
  printf("%c", '!');
else
  printf("%c", '~');

But it can be done more easily:

printf("%c", "!=~"[is_good]);

// As we saw :
printf("%c", is_good["!=~"] ); // Prints '!' if is_good is 0
                               //        '=' if is_good is 1
                               //        '~' if is_good is 2

Since everybody writes arr[3], please don't write 3[arr]. It's useful to know, not to do.

Initialisation

Initialisation is something we master in C. It means giving a value to a variable at its declaration. Basically, we define its value.

For an array³ :

int arr[10] = {0};
arr[2] = 5;

In the first line, we initialise the array with 0 because every value not specified is set to 0 by default. The next line is an assignment, not an initialisation.

If we only want to initialise the third element, since the initialisers of an array follow the order of its elements, we should write:

int arr[10] = {0, 0, 5};

But in fact, there is another syntax for that:

int arr[10] = {[2] = 5};

We simply say that the third element value is 5. The rest is 0 by default. An equivalent syntax also exists for structures and unions.

For the examples, we will use the structure point, which I will use many times in this article, as well as the structure message.

We can initialise a point based on its components.

typedef struct point {
  int x,y;
} point;


point A = {.x = 1, .y = 2};

Here, there is no ambiguity. But for a more complex structure, this syntax is really helpful.

typedef struct message {
   char src[20], dst[20], msg[200];
} message;

// [...]

message to_send = {.src="", .dst="23:12:23", .msg="Code 10"};

// Is way more self-explanatory than :

message to_send = {"", "23:12:23", "Code 10"};

// We don't need to follow the declared order of the structure fields

message to_send = { .msg="Code 10", .dst="23:12:23", .src=""};

// Also, any field of a structure that is not explicitly initialised is set to its null value,
// so we can omit the `src` field.

message to_send = { .dst="23:12:23", .msg="Code 10"};

With these syntaxes we can make the code more explicit and easier to read. Sometimes they can be used very cleverly, as here in this base64 decoder:

static int b64_d[] = {
   ['A'] =  0, ['B'] =  1, ['C'] =  2, ['D'] =  3, ['E'] =  4,
   ['F'] =  5, ['G'] =  6, ['H'] =  7, ['I'] =  8, ['J'] =  9,
   ['K'] = 10, ['L'] = 11, ['M'] = 12, ['N'] = 13, ['O'] = 14,
   ['P'] = 15, ['Q'] = 16, ['R'] = 17, ['S'] = 18, ['T'] = 19,
   ['U'] = 20, ['V'] = 21, ['W'] = 22, ['X'] = 23, ['Y'] = 24,
   ['Z'] = 25, ['a'] = 26, ['b'] = 27, ['c'] = 28, ['d'] = 29,
   ['e'] = 30, ['f'] = 31, ['g'] = 32, ['h'] = 33, ['i'] = 34,
   ['j'] = 35, ['k'] = 36, ['l'] = 37, ['m'] = 38, ['n'] = 39,
   ['o'] = 40, ['p'] = 41, ['q'] = 42, ['r'] = 43, ['s'] = 44,
   ['t'] = 45, ['u'] = 46, ['v'] = 47, ['w'] = 48, ['x'] = 49,
   ['y'] = 50, ['z'] = 51, ['0'] = 52, ['1'] = 53, ['2'] = 54,
   ['3'] = 55, ['4'] = 56, ['5'] = 57, ['6'] = 58, ['7'] = 59,
   ['8'] = 60, ['9'] = 61, ['+'] = 62, ['/'] = 63, ['='] = 64
};

Source: Taurre

The compound literals

Since we are talking about arrays, there is a simple syntax for single-use arrays.

I would like to use this array:

int arr[5] = {5, 4, 5, 2, 1};
printf("%d", arr[i]); // With i set to something >=0 and <5

However, I only use this array once... It's a bit annoying to have to introduce an identifier just for that.

Well, I can do that:

printf("%d", ((int[]){5,4,5,2,1}) [i] ); // With i set to something >=0 and <5

It's not very readable, but in many cases this syntax is useful. For example with a structure:


// To send our message:
send_msg( (message){ .dst="192.168.11.1", .msg="Code 11"} );

// To print the distance between two points
printf("%d", distance( (point){1, 2}, (point){2, 3} )  );

// Or on Linux, in system programming
execvp( "bash" , (char*[]){"bash", "-c", "ls", NULL} );

We call these expressions compound literals (which is a pain to translate in any other language)

Introduction to VLAs

Variable Length Arrays (VLAs) are arrays whose length is only known at runtime. If you have never encountered a VLA, this should look odd to you:

int n = 11;

int arr[n];

for(int i = 0 ; i < n ; i++)
  arr[i] = 0;

A lot of teachers must have rejected that code. We have been taught that an array must have a size known at compile time. VLAs are the exception. Introduced with the C99 standard, VLAs have a bad reputation. There are several reasons for this, which I won't go into here⁴.

I'm just going to talk about the non-intuitive behaviors introduced with VLAs.

To define a VLA, it's the same syntax as a classical array, but the size of the array is a non-constant expression. But first, let's see what and how to use a VLA (the normal way).

int n = 50;
int arr[n];

double arr2[2*n];

unsigned int arr3[foo()]; // with foo a function defined elsewhere

A variable length array can be neither initialised nor declared static. Thus, both of these declarations are incorrect:

int n = 30;

int arr[n] = {0};
static int arr2[n];

In a function, we can use this syntax to refer to a VLA:

void bar(int n, int arr[n]) {

}

In C, the size of the first dimension of an array parameter isn't really of interest to the function. A more realistic case is passing a 2-dimensional VLA, where the second dimension must be specified:

void foo( int n, int m, int arr[][m]) {

}

Note that it is possible to use the character * (yet another use ...) instead of the size of one or more dimensions of a VLA, but only within a prototype.

void foo(int, int, int[][*]);

Well, after that short introduction, let's talk about the interesting cases. The quirks and eccentricities that the VLAs have introduced !

The VLAs exceptions

The best-known deviant behaviour of VLAs is their relation to sizeof.

sizeof is a unary operator that retrieves the size of a type from an expression or from the name of a type surrounded by parentheses.

/*  How sizeof works using examples */ 
float a;
size_t b = 0;

printf("%zu", sizeof(char)); // Prints 1
printf("%zu", sizeof(int));  // Prints 4
printf("%zu", sizeof a);     // Prints 4
printf("%zu", sizeof(a*2.)); // Prints 8
printf("%zu", sizeof b++);   // Prints 8

The first result is not very surprising: the size of a char is defined to be 1, so sizeof(char) must yield 1 (as per the C standard). The second one is the size of int⁵. The third one is the size of the type of the expression a (which is float⁵). The fourth is the size of a double⁵ (the type of a*2.⁶). The last one is the size of the type size_t⁵, since it is the type of the expression b++.

Here, we don't care about the value of the expression, and neither does sizeof. The value of the sizeof expression is determined at compile time. The operations inside the sizeof operand aren't executed. Since the expression must be valid, its type is determined at compile time.

int n = 5;
printf("%zu", sizeof(char[++n])); // Prints 6

Ouch! Here come the VLAs. In the type char[++n], ++n is a non-constant expression. So the array is a VLA. To know the size of the array, the program must evaluate the expression inside the brackets at runtime. Thus, n now holds 6 and sizeof indicates that an array of char declared with this expression would have a size of 6.

This is not very intuitive, since VLAs introduce an exception to the rule that the expression passed to sizeof is not executed.

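Another odd behaviour introduced by VLAs is the execution of expressions related to the size of a VLA in the definition of a function. Thus:

void foo( char arr[printf("bar")] ) {
   printf("%zu", sizeof arr);
}

Assuming that the output calls do not fail, calling this function displays bar followed by the size of a pointer (bar8 on a typical 64-bit machine, for example with GCC). The printf("bar") expression is evaluated first, and only then is the body of the function executed. Note that arr, like any array parameter, is adjusted to a pointer (char*), so sizeof arr is the size of a pointer and not the 3 returned by printf.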

Note that there are other exceptions induced by the standardisation of VLAs: as I already stated, it is impossible to allocate VLAs statically (quite logical) or to use VLAs in a structure (GCC supports it anyway, as an extension). Even some jumps are forbidden when using a VLA: a goto or a switch cannot jump into the scope of a VLA.

A flexible array

You may never have heard of “flexible array members”. This is normal: they address a very specific and uncommon problem.

The objective is to allocate a structure with one field (an array) whose size is unknown at compile time, all in one contiguous space⁷.

Here, there are no VLAs because, as we already stated, VLAs are forbidden as structure fields. We must use dynamic allocation. We could write this:

struct foo {
  int* arr;
};

And use it like that:

  struct foo* contiguous = malloc( sizeof(struct foo) );
  if (contiguous) {
    contiguous->arr = malloc( N * sizeof *contiguous->arr );
    if (contiguous->arr) {
      contiguous->arr[0] = 11;
    }
  }

But here the array may not be next to the structure in memory. Moreover, if we copy the structure, the value of arr will be the same for the copy and for the original, so both share one array. To avoid that, we must copy the structure, allocate a new array and copy the contents. Let's see another way.

struct foo {
  /* ... At least one other field, because the C standard says so ... */
  int flexiArr[];
};

Here, the field flexiArr is a flexible array member. Such an array must be the last member of the structure and must not specify a size⁸. It is used like this:

  struct foo* contiguous = malloc( sizeof(struct foo) + N * sizeof *contiguous->flexiArr );
  if (contiguous) {
    contiguous->flexiArr[0] = 11;
  }

This syntax responds as much to a need for portability on architectures imposing a particular alignment (the array is contiguous to the structure) as to the need to show a semantic link between the array and the structure (the array belongs to the structure).

A labels history

In C, if there's one thing we shouldn't talk about, it's labels. We use them with the goto statement. The forbidden one!

To hide them, we replace goto with more explicit named statements like break or continue. So we don't write goto anymore, and we no longer learn what a label is when we learn the C syntax.

This is how goto and a label are used:

int main(void) {
    goto end;

    // [...] (skipped)

end: return 0;
}

Basically, a label is a name given to a statement. Nowadays, we mainly use labels in switch statements.


switch( action ) {
    case 0:
        do_action0();
    case 1:
        do_action1();
        break;
    case 2:
        do_action2();
        break;
    default:
        do_action3();
}

Here each case and the default are in fact labels. Except that you can't use them with goto, since there is no name to refer to.

Why are you telling us this?

Firstly, it's good to know that it's called a label. Secondly, because I'm going to tell you about a classic: Duff's device.

It's a kind of unrolled and optimised loop. The goal is to reduce the number of loop checks (as well as the number of decrements).

Here is the historical version written by Tom Duff

{
    register n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}

It doesn't matter what register means. Also, to is a particular pointer, but it doesn't really matter.

Here, what I want to tell you about is that do-while loop in the middle of a switch.

The test we try to avoid is --n > 0.

Normally, n would actually be count. And we would have to test count times. The same goes for its decrement.

That's to say:

while( count-- > 0 )
    *to = *from++;

Dividing by 8 (an arbitrary number), we also divide the number of tests and decrements by 8. However, if count is not divisible by 8, we have a problem: a plain unrolled loop would not execute the right number of instructions. It would be nice to be able to jump directly to the third instruction if only 6 instructions are left in the first pass.

And this is where labels can help us! Thanks to the switch we can jump directly to the right instruction.

We only have to label every instruction with the number of instructions remaining. Then we jump to that instruction with the switch statement.

It is very rare to have to use this type of trick. It's mostly an optimisation from another time. But since I would like to talk about that syntax, it was necessary to talk about labels (or was it ?)

Complex numbers

Once again, we will study a syntax introduced by C99. More precisely, three complex number types were introduced. The double-based one is double _Complex (the other two, float _Complex and long double _Complex, follow the same pattern, so I will only write about the double version).

Thus, in C, it is possible to declare a complex number like this:

double complex point = 2 + 3 * I;

Here we find the special macros complex and I (defined in the <complex.h> header). The former expands to _Complex and is used to form a complex type, while the latter is the imaginary unit, used to write the imaginary part of a complex number.

In memory a complex variable takes up as much space as 2 times the real type on which it is based. A complex variable is used as a normal variable. The arithmetic is intuitive since it is based on the real type. Note that it is recommended to use the macro CMPLX to initialise a complex number:

double complex cplx = CMPLX(2, 3);

This gives better handling of cases where the imaginary part (the one multiplied by I) would be NAN, INFINITY or a signed zero (plus or minus 0).

The <complex.h> header offers us a really simple way to use complex numbers. Indeed, many common functions for manipulating complex numbers are available.

Generic macros

There is a way in C to have macros that are defined differently depending on the type of one of their arguments. This syntax is however "new" since it dates from the C11 standard.

This genericity is achieved with generic selections based on the syntax _Generic ( /* ... */ ).
To understand the syntax, let's look at a simplistic example:

#include <stdio.h>
#include <limits.h>

#define MAXIMUM_OF(x) _Generic ((x), \
                                char: CHAR_MAX, \
                                int:  INT_MAX,  \
                                long: LONG_MAX  \
                                )

int main(int argc, char* argv[]) {
    int i = 0;
    long l = 0;
    char c = 0;

    printf("%i\n", MAXIMUM_OF(i));
    printf("%d\n", MAXIMUM_OF(c));
    printf("%ld\n", MAXIMUM_OF(l));
    return 0;
}

Here we print the maximum that can be stored by each of the types we use. This is something that would not have been possible without the use of this new keyword _Generic. To use this syntax, we use the keyword _Generic to which we pass 2 parameters. The first is an expression whose type will influence the expression that is finally executed. The second is a sequence of type and expression associations (type: expression) whose associations are separated by commas. In the end, only the expression designated by the type of the first expression is finally evaluated.

A real-world example could be:

#include <math.h>

int powInt(int,int);

#define POW(x,y) _Generic ((y), double: pow, float: powf, long double: powl, int: powInt)((x), (y))

There's not much more to say except that it's possible to have the word default in the list of types, which will then correspond to all unmentioned types. So a cleaner definition of the POW macro from earlier could be :

int powIntuInt(int a, unsigned int b);
double powIntInt(int a, int b);
double powFltInt(double a,int b) { return pow (a,b); }
double powfFltInt(float a,int b) { return powf(a,b); }
double powlFltInt(long double a,int b) { return powl(a,b); }


#define POWINT(x) _Generic((x), double: powFltInt,          \
                                float : powfFltInt,         \
                                long double: powlFltInt,    \
                                unsigned int: powIntuInt,   \
                                default: powIntInt)
#define POW(x,y) _Generic ((y), double: pow, float: powf, long double: powl, default: POWINT((x)) )((x), (y))

Too special characters

Let's go back in time again, because I want to mention one more thing: a time when not all characters were as accessible as they are today on so many types of keyboards.

Keyboards didn't necessarily have compose keys. Thus, it could be impossible to type the # character.

The # character could then be replaced by the sequence ??=. And for each character not on the keyboard and used in the C language, there was a ??-based sequence called a trigraph. Another, more readable version based on 2 characters is called a digraph.

Here is a table summarising the trigraph and digraph sequences and their character representation.

Character	Digraph	Trigraph
`#`	`%:`	`??=`
`[`	`<:`	`??(`
`]`	`:>`	`??)`
`{`	`<%`	`??<`
`}`	`%>`	`??>`
`\`		`??/`
`^`		`??'`
`|`		`??!`
`~`		`??-`
`##`	`%:%:`

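The main difference between digraphs and trigraphs shows inside a string: a trigraph is replaced even there, whereas a digraph is left as it is.

puts("??= is a hashtag"); // Prints "# is a hashtag" (when trigraphs are enabled)
puts("%:");               // Prints "%:"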

These medieval mechanisms are still valid today in C. ⁹ So this line of code is perfectly valid:

??=define FIRST arr<:0]

The only use of this syntax nowadays is to obfuscate a source code very easily. With a combination of a ternary with trigraphs and a digraph you get an absolutely unreadable code 😉

printf("%d", a ?5??((arr):>:0);

Never use it in serious code.

To conclude

That's it, I hope you've learned something from this post. Don't forget to use these syntaxes sparingly.

I would like to thank Taurre for validating this article in French, but also for his pedagogy on the forums for years, as well as blo yhg for his careful proofreading.

Note that you can (re)discover a lot of code abusing the C language syntax at IOCCC. 😈

¹ It's a little more complicated than that. When it had to be decided whether an array should start at 0 or 1, the compilation time of a program was of particular importance. It was decided to start at 0 to save time (processor cycles, actually) by not doing the index translation needed for 1-based arrays. The Citation Needed article by Mike Hoye explains this a lot better than I could.
² For the record, [] is inherited from the B language, where the concept of an array wasn't even a thing. An array (or vector, as it was called back then) was just the address of the first element of a sequence of bytes. Only pointer arithmetic gave access to all the elements of a vector. It was an evolution compared to BCPL, the ancestor of B, which uses the syntax V!4 to access the fifth element of an array, but I digress...
³ {0} can also be used for a number. Any value initialising a simple variable (pointer, number, ...) can optionally take braces.
```
int a = {11};
float pi = {3.1415926};
char* s = {"unicorn"};
```
The main feature of the array is the use of commas between the braces.
⁴ I can nevertheless give you two references: “Is it safe to use variable-length arrays?” from Stack Overflow and the article “The Linux Kernel Is Now VLA-Free”.
⁵ On my computer.
⁶ Here, the implementation follows the IEEE 754 standard, where a single-precision floating-point number takes 4 bytes and a double-precision one takes 8. 2. has type double, so a*2. has the type of its larger operand, double.
⁷ We may want this space to be contiguous for many reasons. One is to optimise the use of the processor cache. Another is the management of network layers, which is well suited to the use of flexible arrays.
⁸ Prior to the specification of flexible array members in C99, it was common practice to use arrays of size one to replicate the concept.
⁹ In C23, trigraphs have been removed from the language and no longer work.