Character Strings in C/C++

Character strings are an important data structure as they occur in one form or another in almost all programs. In classic C, a character string is really just an array of characters with a NUL character to mark the end of the string.


Defining strings within a program

There are several ways in which we can define strings in a program. We can declare initialised character string variables:

             char mesg[] = "Hello"; 

This is really just a shorthand notation for the following initialiser:

             char mesg[] = {'H', 'e', 'l', 'l', 'o', '\0'}; 

In both cases mesg is defined to be exactly the same string. The first form is clearly recommended for clarity; in particular, note how we needed to specify the NUL character explicitly in the second case. Just as for other arrays we can use the name mesg as a pointer to the start of the string, and we have the following equivalences:

             mesg == &mesg[0] 
             *mesg == 'H' 
             *(mesg + 1) == mesg[1] == 'e' 

Indeed, we could have declared mesg to be a pointer rather than an array:

             char * mesg = "Hello"; 

There is one rather subtle difference: in the array form the name mesg behaves like a pointer constant (in most expressions it is converted to the address of the first element), whereas the pointer form declares mesg to be a pointer variable. This affects the operations which we may perform using the identifier mesg. Since we cannot alter the value of a constant, we cannot write expressions like mesg++ after the first style of declaration, although that is allowed in the second case. We can still write expressions like mesg+1 in either case. To see the logic behind this, note that, in the array case, the compiler treats mesg as a synonym for the expression &mesg[0]. If we could change the value of mesg we would be changing the address of mesg[0] - moving the array in memory! Since we cannot move the entire array, it makes no sense to allow the address of the first element to be changed. In the pointer version we can alter the value of the pointer variable. Unfortunately, this may leave us with no way of getting back to the start of the string again. Note also that the pointer form points at a string literal, whose characters must not be modified; in modern C and C++ such a pointer is better declared as const char *.
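
As a minimal sketch of this difference (find_end and array_demo are hypothetical names introduced only for illustration), the following code moves a pointer variable along a string while the array name stays fixed:

```c
#include <stddef.h>

/* Hypothetical helper: returns a pointer to the terminating NUL of a string.
   It advances a copy of the pointer, so the caller's start address is not lost. */
const char *find_end(const char *start)
{
    const char *p = start;
    while (*p != '\0')
        p++;                /* allowed: p is a pointer variable */
    return p;
}

/* Hypothetical demonstration: the array name stays fixed while pointers move. */
ptrdiff_t array_demo(void)
{
    char mesg[] = "Hello";
    /* mesg++;   would not compile: an array name cannot be assigned to */
    return find_end(mesg) - mesg;   /* expressions such as mesg + 1 are still fine */
}
```

Here array_demo returns 5, the number of characters before the NUL. Working on a copy of the pointer, as find_end does, is one simple way of keeping the start of the string reachable.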

In the above example, we let the compiler determine the length of the array. Alternatively, we can explicitly set the length of the array ourselves:

             char mesg[20] = "Hello"; 

This stores the six characters of "Hello" (including the terminating NUL) in the first six elements of the array and sets the remaining fourteen elements to zero (the ASCII NUL character).
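
The zero filling can be checked with a short sketch such as the one below. The function name zero_tail_count is an assumption made for this example; it counts how many of the elements beyond "Hello" and its NUL are zero.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the zero elements of mesg that lie beyond
   the characters of "Hello" and its terminating NUL. */
size_t zero_tail_count(void)
{
    char mesg[20] = "Hello";
    size_t count = 0;
    size_t i;

    /* strlen(mesg) is 5, so indices 6..19 are the elements we never named. */
    for (i = strlen(mesg) + 1; i < sizeof mesg; i++)
        if (mesg[i] == '\0')
            count++;
    return count;
}
```

The function returns 14. Note that sizeof mesg gives the size of the whole array (20), whereas strlen(mesg) gives only the number of characters before the first NUL (5).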

We can also declare arrays of strings (the program argument vector, argv, is an example of such an array). Again we can specify initialisers for the elements of such an array, as follows:

             static char * fruits[3] = { "Apple", "Pear", "Orange" }; 

This sets up what is known as a ragged array, since the rows are of unequal lengths.

             .----.         .-----------.
           0 | .--+-------->|A|p|p|l|e|¤|
             |----|         `-----------'
           1 | .--+------.
             |----|      |  .---------.
           2 | .--+---.  `->|P|e|a|r|¤|
             `----'   |     `---------'
             fruits   |
                      |     .-------------.
                      `---->|O|r|a|n|g|e|¤|
                            `-------------'

where ¤ represents the terminating NUL character. Compare this with the following example, where we specify the length of the rows explicitly and so set up a normal, rectangular array:

             static char fruits[3][7] = { "Apple", "Pear", "Orange" };

                            .-------------.
                            |A|p|p|l|e|¤|¤|
                            |-+-+-+-+-+-+-|
                            |P|e|a|r|¤|¤|¤|
                            |-+-+-+-+-+-+-|
                            |O|r|a|n|g|e|¤|
                            `-------------'

Notice how the ragged array wastes no space on padding, at the cost of storing three pointers, whereas the rectangular array has several unused locations.
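
The storage used by the two layouts can be compared with sizeof and strlen. The sketch below is illustrative only: the names ragged, rect and the three functions are assumptions made for this example, and the size of a pointer depends on the machine.

```c
#include <stddef.h>
#include <string.h>

/* The same three strings as above, under hypothetical names. */
static const char *ragged[3] = { "Apple", "Pear", "Orange" };
static const char rect[3][7] = { "Apple", "Pear", "Orange" };

/* Bytes occupied by the characters of the ragged array
   (each string plus its terminating NUL). */
size_t ragged_char_bytes(void)
{
    size_t total = 0;
    size_t i;
    for (i = 0; i < 3; i++)
        total += strlen(ragged[i]) + 1;
    return total;
}

/* Bytes occupied by the three pointers themselves. */
size_t ragged_pointer_bytes(void)
{
    return sizeof ragged;
}

/* Bytes occupied by the rectangular array, padding included. */
size_t rect_bytes(void)
{
    return sizeof rect;
}
```

The characters of the ragged array occupy 18 bytes against 21 for the rectangular array, but the ragged form also needs room for three pointers. For short strings of similar length the rectangular form can therefore be the smaller of the two.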


The string library

The question now arises of how best to perform operations like copying strings, comparing them, and so on.

Since strings almost appear to be fundamental types, it is easy to go astray. Consider the following program which attempts to copy a string:

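      #include <stdio.h>

      /* Print the string, its address, and the address of the pointer variable.
         The casts keep the output in decimal; %p with (void *) is the portable way. */
      #define PR(X) printf(#X " = %s; value = %lu; &" #X " = %lu.\n", \
                           X, (unsigned long) X, (unsigned long) &X)

      int main (void)
      { static char * mesg = "Greetings Earthpeople!";
        static char * copy;

        copy = mesg;    /* copies the address, not the characters */
        puts(copy);
        PR(mesg);
        PR(copy);
        return 0;
      } /* main */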

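The output from the program would be something like the following (the addresses are illustrative and will differ from machine to machine):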

      Greetings Earthpeople!
      mesg = Greetings Earthpeople!; value = 406; &mesg = 404.
      copy = Greetings Earthpeople!; value = 406; &copy = 1156.

At first it appears that the assignment copy = mesg has actually copied the string, but this is illusory. A study of the values of the variables mesg and copy will reveal that they have the same value (406), which is the address at which the string is stored. What we have copied in the assignment is the address of the string and not the actual string itself. If we were to alter the string (using mesg to access it) we would find that the "copy" would be altered too. The picture in memory is as follows:

                         |------|
                    404  |  406 |   mesg
                         |------|
                    406  |'G''r'|
                         |------|
                    408  |'e''e'|
                         |------|
                    410  |'t''i'|
                         |------|
                    412  |'n''g'|
                         |------|
                    414  |'s'' '|
                         |------|
                    416  |'E''a'|
                         |------|
                    418  |'r''t'|
                         |------|
                         :   :  :
                         |------|
                   1156  |  406 |   copy
                         |------|
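
The sharing shown in the picture can be demonstrated with a small sketch. The function name change_is_shared is hypothetical, and the string is held in a writable array here, because the characters of a string literal must not be modified.

```c
/* Hypothetical demonstration: returns 1 if a change made through mesg
   is visible through copy. */
int change_is_shared(void)
{
    char text[] = "Greetings Earthpeople!";  /* writable array, not a literal */
    char *mesg = text;
    char *copy = mesg;      /* copies the address only */

    mesg[0] = 'g';          /* alter the string through mesg */
    return copy[0] == 'g';  /* the "copy" sees the change */
}
```

The function returns 1: there is only one string in memory, and both pointers refer to it.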

Comparison of strings at first appears to be feasible using code like

             if (string1 == string2) ... 

but again this is illusory - what we would really be comparing is two addresses and not the strings stored at those addresses.

How then do we copy strings if we really do want two copies of the same string, and not just two pointers to a single string? The answer to this, and to similar questions, lies in using a standard library of string handling routines.

C has a library, with function prototypes in the header file string.h, which provides facilities for copying strings, concatenating strings, searching for specific characters in strings, and so on.

  /* string.h - strings routines */
  /* size_t is an unsigned integer type (the type of a sizeof result);
     its width depends on the implementation */

  char *strcpy (char *dest, const char *src);
  /* Copies string src to dest.
     Returns dest. */

  char *strncpy (char *dest, const char *src, size_t n);
  /* Copies at most n chars from src to dest.  If src holds n or more characters, no null
     character is appended and dest is not a null-terminated string.  If src is shorter
     than n, dest is padded with null characters up to n. */

  size_t strlen (const char *str);
  /* Returns length of str. */

  int strcmp (const char *s1, const char *s2);
  /* Compares one string to another, case significant
     Returns 0 if s1 = s2,  < 0 if s1 < s2, > 0 if s1 > s2. */

  int stricmp (const char *s1, const char *s2);
  /* Not part of standard C; POSIX systems provide strcasecmp in <strings.h> instead.
     Compares one string to another, ignoring case
     Returns 0 if s1 = s2,  < 0 if s1 < s2, > 0 if s1 > s2. */

  char *strstr (const char *str, const char *substr);
  /* Returns pointer to first location of substr within str or NULL. */

  char *strcat (char *s1, const char *s2);
  /* Appends s2 to s1. */

  char *strncat (char *s1, const char *s2, size_t n);
  /* Appends not more than n chars from s2 to s1, then a terminating null character. */
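
Returning to the comparison problem raised earlier, the following sketch contrasts comparing addresses with comparing contents. The function names same_address and same_text are assumptions made for this example; only strcmp comes from the library.

```c
#include <string.h>

/* Hypothetical helper: true only if both pointers refer to the same location. */
int same_address(const char *a, const char *b)
{
    return a == b;
}

/* Hypothetical helper: true if the two strings hold the same characters. */
int same_text(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Given two separate arrays that both contain "abc", same_text reports a match while same_address does not, which is why strcmp rather than == should be used to compare strings.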

The length of a string can be found using the strlen function. Code such as

      const char * name = "George";
      /* strlen returns a size_t, which is printed with %zu */
      printf("The length of \"%s\" is %zu.\n", name, strlen(name));

would lead to output like

      The length of "George" is 6.

where, obviously, the length does not include space used by the NUL terminator.

The simplest function used to copy a string is strcpy. Consider the following version of the program above:

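      /* Copying strings. */
      #include <stdio.h>
      #include <string.h>

      /* The casts keep the output in decimal; %p with (void *) is the portable way. */
      #define PR(X) printf(#X " = %s; value = %lu; &" #X " = %lu.\n", \
                           X, (unsigned long) X, (unsigned long) &X)

      int main (void)
      { static char * mesg = "Greetings Earthpeople!";
        /* The destination must be writable storage, not a string literal. */
        static char space[] = "Farewell Good Martians!";
        static char * copy = space;

        strcpy(copy, mesg);    /* fits: 23 bytes into a 24-byte array */
        puts(copy);
        PR(mesg);
        PR(copy);
        return 0;
      } /* main */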

The output here would be

      Greetings Earthpeople!
      mesg = Greetings Earthpeople!; value = 408; &mesg = 404.
      copy = Greetings Earthpeople!; value = 431; &copy = 406.

Now mesg and copy both point to strings containing "Greetings Earthpeople!" (isn't it good to know that Martians aren't sexist?!) stored at different locations: mesg points to a string stored at address 408, and copy points to a string stored at address 431 (which is where the string "Farewell Good Martians!" would have been stored initially). In memory we have the following situation:

                         |------|
                    404  |  408 |   mesg
                         |------|
                    406  |  431 |   copy
                         |------|
                    408  |'G''r'|
                         |------|
                    410  |'e''e'|
                         |------|
                    412  |'t''i'|
                         |------|
                    414  |'n''g'|
                         |------|
                         :   :  :
                         |------|
                    430  |   'G'|
                         |------|
                    432  |'r''e'|
                         |------|
                    434  |'e''t'|
                         |------|

Note that we had to declare and initialise the contents of copy so that there would be enough space to store the contents of mesg when we copied it. The function strcpy copies one string into another, using only the terminating NUL character to tell when to stop copying - bad luck if the copy overwrites something else of value!
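
One way of guaranteeing that the destination is large enough is to allocate it from the length of the source. The sketch below assumes dynamic allocation with malloc from stdlib.h; the function name copy_string is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a freshly allocated, independent copy of src,
   or NULL if the allocation fails.  The caller must free the result. */
char *copy_string(const char *src)
{
    char *dest = malloc(strlen(src) + 1);   /* +1 for the terminating NUL */
    if (dest != NULL)
        strcpy(dest, src);                  /* safe: dest has exactly enough room */
    return dest;
}
```

Because the space is sized from strlen(src), strcpy cannot run past the end of the destination here.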

The use of functions like this, while in the general spirit of C, is highly dangerous. A somewhat safer variation on strcpy is strncpy, which copies one string to another until either it reaches the NUL character or a maximum length (specified as the third parameter) has been reached. Beware that in the latter case no NUL is appended, so the programmer must terminate the destination explicitly.
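
Since strncpy appends no NUL when the limit is reached, a common pattern is to copy one character fewer than the destination holds and then terminate it by hand. The following is a sketch of that pattern only; copy_bounded is a hypothetical name, and dest_size is assumed to be the full size of the destination array.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies as much of src as fits into dest, which has
   room for dest_size characters, and always leaves dest NUL-terminated. */
char *copy_bounded(char *dest, size_t dest_size, const char *src)
{
    if (dest_size == 0)
        return dest;                    /* no room even for the NUL */
    strncpy(dest, src, dest_size - 1);  /* leave one element for the NUL */
    dest[dest_size - 1] = '\0';         /* terminate explicitly */
    return dest;
}
```

Copying "Greetings" into a six-element array this way leaves "Greet" in the array. The string is truncated, but nothing beyond the array is overwritten.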

Concatenating strings can be done using the strcat function (which, since it appends, effectively destroys one of the initial strings). For example:

      char mesg[40] = "Hello ";
      strcat(mesg, "George");
      puts(mesg);

results in the output

      Hello George

If you call strcat when one or other of the "strings" has no terminating NUL, chaos may (and probably will) result. Once again, in C it is the programmer's responsibility to ensure that the destination string (array) has been created long enough to receive the concatenation. In this connection, dynamic string allocation may prove to be useful. Suppose that we want to concatenate strings s1 and s2, without destroying either of them, but storing the resulting string in as small a space as possible. We could proceed as follows:

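      char * result;                                 /* declare a pointer to the result */
      result = malloc(strlen(s1) + strlen(s2) + 1);  /* reserve space (+1 for NUL); needs <stdlib.h> */
      if (result != NULL)                            /* malloc returns NULL on failure */
      { strcpy(result, s1);                          /* copy string s1 over */
        strcat(result, s2);                          /* append s2 */
      }
      /* ... use result, then release it with free(result) ... */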

Finally, we note that there is a safer version of strcat, namely strncat, which limits the number of characters appended.
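
As a hedged sketch of how strncat might be used, the limit passed to it can be computed from the space still free in the destination. The name append_bounded is hypothetical, and dest is assumed to hold a NUL-terminated string already.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends as much of src as fits into dest, which has
   room for dest_size characters in total, including the terminating NUL. */
char *append_bounded(char *dest, size_t dest_size, const char *src)
{
    size_t used = strlen(dest);     /* characters already in dest */
    if (used + 1 < dest_size)       /* is there room for at least one more? */
        strncat(dest, src, dest_size - used - 1);
    return dest;
}
```

With a ten-element array holding "Hello ", appending "George" in this way yields "Hello Geo". strncat writes three characters and then the NUL, filling the array exactly.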