CodeGuru : Thinking in C++

A library usually starts out as a collection of functions, but if you have used third-party C libraries you know there’s usually more to it than that because there’s more to life than behavior, actions and functions. There are also characteristics (blue, pounds, texture, luminance), which are represented by data. And when you start to deal with a set of characteristics in C, it is very convenient to clump them together into a struct, especially if you want to represent more than one similar thing in your problem space. Then you can make a variable of this struct for each thing.

Thus, most C libraries have a set of structs and a set of functions that act on those structs. As an example of what such a system looks like, consider a programming tool that acts like an array, but whose size can be established at runtime, when it is created. I’ll call it a CStash. Although it’s written in C++, it has the style of what you’d write in C:

A tag name like CStashTag is generally used for a struct in case you need to reference the struct inside itself. For example, when creating a linked list (each element in your list contains a pointer to the next element), you need a pointer to the next struct variable, so you need a way to identify the type of that pointer within the struct body. Also, you'll almost universally see the typedef as shown above for every struct in a C library. This is done so you can treat the struct as if it were a new type and define variables of that struct like this:

The storage pointer is an unsigned char* . An unsigned char is the smallest piece of storage a C compiler supports, although on some machines it can be the same size as the largest. It’s implementation dependent, but is often one byte long. You might think that because the CStash is designed to hold any type of variable, a void* would be more appropriate here. However, the purpose is not to treat this storage as a block of some unknown type, but rather as a block of contiguous bytes.

The source code for the implementation file (which you may not get if you buy a library commercially – you might get only a compiled obj or lib or dll, etc.) looks like this:

initialize( ) performs the necessary setup for struct CStash by setting the internal variables to appropriate values. Initially, the storage pointer is set to zero, and the size indicator is also zero – no initial storage is allocated.

The add( ) function inserts an element into the CStash at the next available location. First, it checks to see if there is any available space left. If not, it expands the storage using the inflate( ) function, described later.

Because the compiler doesn’t know the specific type of the variable being stored (all the function gets is a void*), you can’t just do an assignment, which would certainly be the convenient thing. Instead, you must copy the variable byte-by-byte. The most straightforward way to perform the copying is with array indexing. Typically, there are already data bytes in storage, and this is indicated by the value of next. To start with the right byte offset, next is multiplied by the size of each element (in bytes) to produce startBytes. Then the argument element is cast to an unsigned char * so that it can be addressed byte-by-byte and copied into the available storage space. next is incremented so that it indicates the next available piece of storage, and the “index number” where the value was stored so that value can be retrieved using this index number with fetch( ).

fetch( ) checks to see that the index isn’t out of bounds and then returns the address of the desired variable, calculated using the index argument. Since index indicates the number of elements to offset into the Cstash, it must be multiplied by the number of bytes occupied by each piece to produce the numerical offset in bytes. When this offset is used to index into storage using array indexing, you don’t get the address, but instead the byte at the address. To produce the address, you must use the address-of operator &.

count( ) may look a bit strange at first to a seasoned C programmer. It seems like a lot of trouble to go through to do something that would probably be a lot easier to do by hand. If you have a struct CStash called intStash, for example, it would seem much more straightforward to find out how many elements it has by saying intStash.next instead of making a function call (which has overhead) like count(&intStash). However, if you wanted to change the internal representation of CStash and thus the way the count was calculated, the function call interface allows the necessary flexibility. But alas, most programmers won’t bother to find out about your “better” design for the library. They’ll look at the struct and grab the next value directly, and possibly even change next without your permission. If only there were some way for the library designer to have better control over things like this! (Yes, that’s foreshadowing.)

Dynamic storage allocation

You never know the maximum amount of storage you might need for a CStash, so the memory pointed to by storage is allocated from the heap. The heap is a big block of memory used for allocating smaller pieces at runtime. You use the heap when you don’t know the size of the memory you’ll need while you’re writing a program. That is, only at runtime will you find out that you need space to hold 200 Airplane variables instead of 20. In Standard C, dynamic-memory allocation functions are part of and include malloc( ), calloc( ), realloc( ), and free( ). Instead of library calls, however, C++ has a more sophisticated (albeit simpler to use) approach to dynamic memory which is integrated into the language via the keywords new and delete.

The inflate( ) function uses new to get a bigger chunk of space for the CStash. In this situation, we will only expand memory and not shrink it, and the assert( ) will guarantee that a negative number is not passed to inflate( ) as the increase value. The new number of elements that can be held (after inflate( ) completes) is calculated as newQuantity, and this is multiplied by the number of bytes per element to produce newBytes, which will be the number of bytes in the allocation. So that we know how many bytes to copy over from the old location, oldBytes is calculated using the old quantity.

where Type describes the type of variable you want allocated on the heap. In this case, we want an array of unsigned char that is newBytes long, so that is what appears as the Type. You can also allocate something as simple as an int by saying:

A new-expression returns a pointer to an object of the exact type that you asked for. So if you say new Type , you get back a pointer to a Type. If you say new int , you get back a pointer to an int. If you want a new unsigned char array, you get back a pointer to the first element of that array. The compiler will ensure that you assign the return value of the new-expression to a pointer of the correct type.

Of course, any time you request memory it’s possible for the request to fail, if there is no more memory. As you will learn, C++ has mechanisms that come into play if the memory-allocation operation is unsuccessful.

Once the new storage is allocated, the data in the old storage must be copied to the new storage – this is again accomplished with array indexing, copying one byte at a time in a loop. After the data is copied, the old storage must be released so that it can be used by other parts of the program if they need new storage. The delete keyword is the complement of new, and must be applied to release any storage that is allocated with new (if you forget to use delete, that storage remains unavailable, and if this so-called memory leak happens enough, you’ll run out of memory). In addition, there’s a special syntax when you’re deleting an array. It’s as if you must remind the compiler that this pointer is not just pointing to one object, but to an array of objects: you put a set of empty square brackets in front of the pointer to be deleted:

Once the old storage has been deleted, the pointer to the new storage can be assigned to the storage pointer, the quantity is adjusted, and inflate( ) has completed its job.

Note that the heap manager is fairly primitive. It gives you chunks of memory and takes them back when you delete them. There’s no inherent facility for heap compaction , which compresses the heap to provide bigger free chunks. If a program allocates and frees heap storage for a while, you can end up with a fragmented heap that has lots of memory free, but without any pieces that are big enough to allocate the size you’re looking for at the moment. A heap compactor complicates a program because it moves memory chunks around, so your pointers won’t retain their proper values. Some operating environments have heap compaction built in, but they require you to use special memory handles (which can be temporarily converted to pointers, after locking the memory so the heap compactor can’t move it) instead of pointers. You can also build your own heap-compaction scheme, but this is not a task to be undertaken lightly.

When you create a variable on the stack at compile-time, the storage for that variable is automatically created and freed by the compiler. The compiler knows exactly how much storage is needed, and it knows the lifetime of the variables because of scoping. With dynamic memory allocation, however, the compiler doesn’t know how much storage you’re going to need, and it doesn’t know the lifetime of that storage. That is, the storage doesn’t get cleaned up automatically. Therefore, you’re responsible for releasing the storage using delete, which tells the heap manager that storage can be used by the next call to new. The logical place for this to happen in the library is in the cleanup( ) function because that is where all the closing-up housekeeping is done.

Following the form required by C, all the variables are created at the beginning of the scope of main( ). Of course, you must remember to initialize the CStash variables later in the block, by calling initialize( ). One of the problems with C libraries is that you must carefully convey to the user the importance of the initialization and cleanup functions. If these functions aren’t called, there will be a lot of trouble. Unfortunately, the user doesn’t always wonder if initialization and cleanup are mandatory. They know what they want to accomplish, and they’re not as concerned about you jumping up and down saying, “Hey, wait, you have to do this first!” Some users have even been known to initialize the elements of a structure themselves. There’s certainly no mechanism in C to prevent it (more foreshadowing).

The intStash is filled up with integers, and the stringStash is filled with character arrays. These character arrays are produced by opening the source code file, CLibTest.cpp, and reading the lines from it into a string called line, and then producing a pointer to the character representation of line using the member function c_str( ).

After each Stash is loaded, it is displayed. The intStash is printed using a for loop, which uses count( ) to establish its limit. The stringStash is printed with a while, which breaks out when fetch( ) returns zero to indicate it is out of bounds.

Bad guesses

There is one more important issue you should understand before we look at the general problems in creating a C library. Note that the CLib.h header file must be included in any file that refers to CStash because the compiler can’t even guess at what that structure looks like. However, it can guess at what a function looks like – this sounds like a feature but it turns out to be a major C pitfall.

Although you should always declare functions by including a header file, function declarations aren’t essential in C. It’s possible in C (but not in C++) to call a function that you haven’t declared. A good compiler will warn you that you probably ought to declare a function first, but it isn’t enforced by the C language standard. This is a dangerous practice, because the C compiler can assume that a function that you call with an int argument has an argument list containing int, even if it may actually contain a float. This can produce bugs that are very difficult to find, as you will see.

Each separate C file is a translation unit. That is, the compiler is run separately on each translation unit, and when it is running it is aware of only that unit. Thus, any information you provide by including header files is quite important because it provides the compiler’s understanding of the rest of your program. Declarations in header files are particularly important, because everywhere the header is included, the compiler will know exactly what to do. If, for example, you have a declaration in a header file that says void func(float) , the compiler knows that if you call that function with an integer argument, it should convert the int to a float as it passes the argument (this is called promotion). Without the declaration, the C compiler would simply assume that a function func(int) existed, it wouldn’t do the promotion, and the wrong data would quietly be passed into func( ).

For each translation unit, the compiler creates an object file, with an extension of o or obj or something similar. These object files, along with the necessary start-up code, must be collected by the linker into the executable program. During linking, all the external references must be resolved. For example, in CLibTest.cpp, functions like initialize( ) and fetch( ) are declared (that is, the compiler is told what they look like) and used, but not defined. They are defined elsewhere, in CLib.cpp. Thus, the calls in CLib.cpp are external references. The linker must, when it puts all the object files together, take the unresolved external references and find the addresses they actually refer to. Those addresses are put into the executable program to replace the external references.

It’s important to realize that in C, the external references that the linker searches for are simply function names, generally with an underscore in front of them. So all the linker has to do is match up the function name where it is called and the function body in the object file, and it’s done. If you accidentally made a call that the compiler interpreted as func(int) and there’s a function body for func(float) in some other object file, the linker will see _func in one place and _func in another, and it will think everything’s OK. The func( ) at the calling location will push an int onto the stack, and the func( ) function body will expect a float to be on the stack. If the function only reads the value and doesn’t write to it, it won’t blow up the stack. In fact, the float value it reads off the stack might even make some kind of sense. That’s worse because it’s harder to find the bug.

A tiny C-like library

Dynamic storage allocation

Bad guesses