
set

The set produces a container that will accept only one of each thing you place in it; it also sorts the elements (sorting isn’t intrinsic to the conceptual definition of a set, but the STL set stores its elements in a balanced binary tree to provide rapid lookups, thus producing sorted results when you traverse it). The first two examples in this chapter used sets.
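
A minimal sketch of both properties at once (the values here are illustrative, not from the book's sources):

#include <iostream>
#include <set>
#include <string>
using namespace std;

int main() {
  set<string> s;
  s.insert("pear");
  s.insert("apple");
  s.insert("pear"); // Duplicate: silently ignored
  // Iteration visits the elements in sorted order:
  for(set<string>::iterator it = s.begin(); it != s.end(); it++)
    cout << *it << endl; // Prints "apple", then "pear"
}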

Consider the problem of creating an index for a book. You might like to start with all the words in the book, but you only want one instance of each word and you want them sorted. Of course, a set is perfect for this, and solves the problem effortlessly. However, there’s also the problem of punctuation and any other non-alpha characters, which must be stripped off to generate proper words. One solution to this problem is to use the Standard C library function strtok( ), which produces tokens (in our case, words) given a set of delimiters to strip out:

//: C20:WordList.cpp
// Display a list of words used in a document
#include "../require.h"
#include <string>
#include <cstring>
#include <set>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <iterator>
using namespace std;

const char* delimiters =
  " \t;()\"<>:{}[]+-=&*#.,/\\~";

int main(int argc, char* argv[]) {
  requireArgs(argc, 1);
  ifstream in(argv[1]);
  assure(in, argv[1]);
  set<string> wordlist;
  string line;
  while(getline(in, line)) {
    // Capture individual words:
    char* s = // Dangerous cast, discussed below:
      strtok((char*)line.c_str(), delimiters);
    while(s) {
      // Automatic type conversion:
      wordlist.insert(s); 
      s = strtok(0, delimiters);
    }
  }
  // Output results:
  copy(wordlist.begin(), wordlist.end(),
       ostream_iterator<string>(cout, "\n"));
} ///:~

strtok( ) takes the starting address of a character buffer (the first argument) and looks for delimiters (the second argument). It replaces the delimiter with a zero, and returns the address of the beginning of the token. If you call it on subsequent occasions with a first argument of zero, it continues extracting tokens from the rest of the string until it finds the end. In this case, the delimiters are those that separate the keywords and identifiers of C++, so it extracts those keywords and identifiers. Each word is turned into a string and placed into the wordlist set, which eventually contains the whole file, broken up into words.
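
To make the mechanism concrete, here is a minimal sketch (not part of the book's sources) showing the zero-writing behavior on a local buffer:

#include <cstring>
#include <iostream>
using namespace std;

int main() {
  char buf[] = "alpha, beta; gamma"; // Must be writable
  // The first call hands strtok() the buffer; subsequent calls pass 0:
  for(char* s = strtok(buf, " ,;"); s; s = strtok(0, " ,;"))
    cout << s << endl; // alpha, beta, gamma
  // buf now holds "alpha\0 beta\0 gamma": each token's trailing
  // delimiter has been overwritten with a zero.
}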

You don’t have to use a set just to get a sorted sequence. You can use the sort( ) function (along with a multitude of other functions in the STL) on other STL containers. However, it’s likely that a set will be faster here, since it sorts and discards duplicates as it inserts, rather than in separate passes afterward.
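
For comparison, a sketch of the vector-based alternative (illustrative; sort( ) plus unique( ) does explicitly what set does implicitly):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
using namespace std;

int main() {
  vector<string> words;
  words.push_back("pear");
  words.push_back("apple");
  words.push_back("pear");
  sort(words.begin(), words.end());
  // unique() packs duplicates at the end; erase() removes them:
  words.erase(unique(words.begin(), words.end()), words.end());
  copy(words.begin(), words.end(),
       ostream_iterator<string>(cout, "\n")); // apple, pear
}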

Eliminating strtok( )

Some programmers consider strtok( ) to be the poorest design in the Standard C library because it uses a static buffer to hold its data between function calls. This means:

  1. You can’t use strtok( ) in two places at the same time (the sketch after this list shows the failure)
  2. You can’t use strtok( ) in a multithreaded program
  3. You can’t use strtok( ) in a library that might be used in a multithreaded program
  4. strtok( ) modifies the input sequence, which can produce unexpected side effects
  5. strtok( ) depends on reading in “lines”, which means you need a buffer big enough for the longest line. This produces both wastefully sized buffers and the risk of encountering lines longer than the “longest” line, which can also introduce security holes. (Notice that the buffer-size problem was sidestepped in WordList.cpp by using string input, but this required a cast so that strtok( ) could modify the data in the string – a dangerous approach for general-purpose programming.)
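
The first three problems all stem from that hidden static state. Here is a sketch (illustrative, not from the book) of problem 1: an inner tokenizing loop silently destroys the position an outer loop was holding:

#include <cstring>
#include <iostream>
using namespace std;

int main() {
  char text[] = "alpha beta;gamma delta";
  // Outer pass: split on ';'
  char* phrase = strtok(text, ";");
  while(phrase) {
    char words[64];
    strcpy(words, phrase);
    // Inner pass: split the phrase on spaces. This resets
    // strtok()'s single hidden pointer:
    for(char* w = strtok(words, " "); w; w = strtok(0, " "))
      cout << w << endl;
    // Resumes from the INNER sequence, which is exhausted,
    // so "gamma delta" is silently lost:
    phrase = strtok(0, ";");
  }
}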
For all these reasons it seems like a good idea to find an alternative for strtok( ). The following example uses an istreambuf_iterator (introduced earlier) to move characters from one place (which happens to be an istream) to another (which happens to be a string), depending on whether the Standard C library function isalpha( ) returns true:

//: C20:WordList2.cpp
// Eliminating strtok() from WordList.cpp
#include "../require.h"
#include <string>
#include <cctype>
#include <algorithm>
#include <set>
#include <iostream>
#include <fstream>
#include <iterator>
using namespace std;

int main(int argc, char* argv[]) {
  requireArgs(argc, 1);
  ifstream in(argv[1]);
  assure(in, argv[1]);
  istreambuf_iterator<char> p(in), end;
  set<string> wordlist;
  while (p != end) {
    string word;
    insert_iterator<string> 
      ii(word, word.begin());
    // Find the first alpha character:
    while(p != end && !isalpha(*p))
      p++;
    // Copy until the first non-alpha character:
    while(p != end && isalpha(*p))
      *ii++ = *p++;
    if (word.size() != 0)
      wordlist.insert(word);
  } 
  // Output results:
  copy(wordlist.begin(), wordlist.end(),
    ostream_iterator<string>(cout, "\n"));
} ///:~

This example was suggested by Nathan Myers, who invented the istreambuf_iterator and its relatives. This iterator extracts information character-by-character from a stream. Although the istreambuf_iterator template argument might suggest to you that you could extract, for example, ints instead of char, that’s not the case. The argument must be of some character type – a regular char or a wide character.

After the file is open, an istreambuf_iterator called p is attached to the istream so characters can be extracted from it. The set<string> called wordlist will be used to hold the resulting words.

The while loop reads words until the end of the input stream is found. The end is detected by comparing against end, an istreambuf_iterator created with the default constructor, which produces the past-the-end iterator object. Thus, to test that you’re not at the end of the stream, you simply say p != end.
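
A tiny sketch of the idiom in isolation (illustrative; it just counts the characters in a file):

#include <fstream>
#include <iostream>
#include <iterator>
using namespace std;

int main() {
  ifstream in("WordList2.cpp");
  istreambuf_iterator<char> p(in), end; // Default ctor: past-the-end
  long n = 0;
  while(p != end) { // False once the stream is exhausted
    ++n;
    ++p;
  }
  cout << n << " characters" << endl;
}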

The second type of iterator used here is the insert_iterator, which knows how to insert objects into a container. Here, the “container” is the string called word, which, for the purposes of insert_iterator, behaves like a container. The constructor for insert_iterator requires the container and an iterator indicating where it should start inserting the characters. You could also use a back_insert_iterator, which requires only that the container have a push_back( ) member function (string does).
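
To make the distinction concrete, here is a sketch (illustrative) of both inserters feeding a string:

#include <iostream>
#include <iterator>
#include <string>
using namespace std;

int main() {
  string s;
  // insert_iterator: needs the container and a starting position:
  insert_iterator<string> ii(s, s.begin());
  *ii++ = 'a';
  *ii++ = 'b';
  // back_insert_iterator: needs only push_back(), which string has:
  back_insert_iterator<string> bi(s);
  *bi++ = 'c';
  cout << s << endl; // abc
}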

After the while loop sets everything up, it begins by looking for the first alpha character, incrementing p until that character is found. Then it copies characters from the input iterator to the insert iterator, stopping when a non-alpha character is found. Each word, assuming it is non-empty, is added to wordlist.

StreamTokenizer: a more flexible solution

The above program parses its input into strings of words containing only alpha characters, but that’s still a special case compared to the generality of strtok( ). What we’d like now is an actual replacement for strtok( ) so we’re never tempted to use it. WordList2.cpp can be modified to create a class called StreamTokenizer that delivers a new token as a string whenever you call next( ), according to the delimiters you give it upon construction (very similar to strtok( )):

//: C20:StreamTokenizer.h
// C++ Replacement for Standard C strtok()
#ifndef STREAMTOKENIZER_H
#define STREAMTOKENIZER_H
#include <string>
#include <iostream>
#include <iterator>
  
class StreamTokenizer {
  typedef std::istreambuf_iterator<char> It;
  It p, end;
  std::string delimiters;
  bool isDelimiter(char c) {
    return 
      delimiters.find(c) != std::string::npos;
  }
public:
  StreamTokenizer(std::istream& is, 
    std::string delim = " \t\n;()\"<>:{}[]+-=&*#"
    ".,/\\~!0123456789") : p(is), end(It()),
    delimiters(delim) {}
  std::string next(); // Get next token
};
#endif // STREAMTOKENIZER_H ///:~

The default delimiters for the StreamTokenizer constructor extract words with only alpha characters, as before, but now you can choose different delimiters to parse different tokens. The implementation of next( ) looks similar to WordList2.cpp:

//: C20:StreamTokenizer.cpp {O}
#include "StreamTokenizer.h"
using namespace std;

string StreamTokenizer::next() {
  string result;
  if(p != end) {
    insert_iterator<string>
      ii(result, result.begin());
    while(p != end && isDelimiter(*p))
      p++;
    while(p != end && !isDelimiter(*p))
      *ii++ = *p++;
  }
  return result;
} ///:~

The first non-delimiter is found, then characters are copied until a delimiter is found, and the resulting string is returned. Here’s a test:

//: C20:TokenizeTest.cpp
//{L} StreamTokenizer
// Test StreamTokenizer
#include "StreamTokenizer.h"
#include "../require.h"
#include <iostream>
#include <fstream>
#include <set>
#include <algorithm>
using namespace std;

int main(int argc, char* argv[]) {
  requireArgs(argc, 1);
  ifstream in(argv[1]);
  assure(in, argv[1]);
  StreamTokenizer words(in);
  set<string> wordlist;
  string word;
  while((word = words.next()).size() != 0)
    wordlist.insert(word);
  // Output results:
  copy(wordlist.begin(), wordlist.end(),
    ostream_iterator<string>(cout, "\n"));
} ///:~

Now the tool is more reusable than before, but it’s still inflexible, because it can only work with an istream. This isn’t as bad as it first seems, since a string can be turned into an istream via an istringstream. But in the next section we’ll come up with the most general, reusable tokenizing tool, and this should give you a feeling of what “reusable” really means, and the effort necessary to create truly reusable code.
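
For example, a hedged sketch (assuming the StreamTokenizer files above are compiled in) that tokenizes a string through an istringstream, using a custom delimiter set:

#include "StreamTokenizer.h"
#include <iostream>
#include <sstream>
#include <string>
using namespace std;

int main() {
  istringstream source("one,two , three");
  // Commas and spaces as delimiters instead of the defaults:
  StreamTokenizer fields(source, ", ");
  string field;
  while((field = fields.next()).size() != 0)
    cout << '[' << field << ']' << endl; // [one][two][three]
}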

A completely reusable tokenizer

Since the STL containers and algorithms all revolve around iterators, the most flexible solution will itself be an iterator. You could think of the TokenIterator as an iterator that wraps itself around any other iterator that can produce characters. Because it is designed as an input iterator (the most primitive type of iterator) it can be used with any STL algorithm. Not only is it a useful tool in itself, the TokenIterator is also a good example of how you can design your own iterators. [63]

The TokenIterator is doubly flexible: first, you can choose the type of iterator that will produce the char input. Second, instead of just saying which characters are delimiters, TokenIterator uses a predicate, which is a function object whose operator( ) takes a char and decides whether it belongs in the token. Although the two examples given here have a static notion of which characters belong in a token, you could easily design your own function object that changes its state as the characters are read, producing a more sophisticated parser.
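
As a sketch of such a stateful predicate (purely illustrative, not in the book's sources), here is one that accepts a digit only when it follows an alpha character, so an identifier like x11 forms one token while a bare number is skipped:

#include <cctype>

// Illustrative predicate for TokenIterator: alphas always belong
// to a token; digits belong only once a token has begun.
class AlnumIdent {
  bool inToken; // State: currently inside a token?
public:
  AlnumIdent() : inToken(false) {}
  bool operator()(char c) {
    if(std::isalpha((unsigned char)c))
      return inToken = true;       // Begin or continue a token
    if(inToken && std::isdigit((unsigned char)c))
      return true;                 // Digit inside a token
    return inToken = false;        // Anything else ends the token
  }
};

One subtlety: in the TokenIterator below, find_if( ) receives a copy of the predicate, so only the calls made in the copying loop mutate the stored state; a predicate with state must be written with that copying in mind.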

The following header file contains the two basic predicates Isalpha and Delimiters, along with the template for TokenIterator:

//: C20:TokenIterator.h
#ifndef TOKENITERATOR_H
#define TOKENITERATOR_H
#include <string>
#include <iterator>
#include <algorithm>
#include <cctype>
#include <cstddef>

struct Isalpha { 
  bool operator()(char c) { 
    using namespace std; // To compensate for a compiler bug
    // The cast guards against negative char values:
    return isalpha((unsigned char)c) != 0; 
  }
};

class Delimiters {
  std::string exclude;
public:
  Delimiters() {}
  Delimiters(const std::string& excl) 
    : exclude(excl) {}
  bool operator()(char c) {
    return exclude.find(c) == std::string::npos;
  }
};

template <class InputIter, class Pred = Isalpha>
class TokenIterator: public std::iterator<
  std::input_iterator_tag, std::string, std::ptrdiff_t> {
  InputIter first;
  InputIter last;
  std::string word;
  Pred predicate;
public:
  TokenIterator(InputIter begin, InputIter end, 
    Pred pred = Pred()) 
    : first(begin), last(end), predicate(pred) {
      ++*this; 
  }
  // End sentinel; value-initializing first and last avoids
  // comparing indeterminate iterators (e.g., raw pointers):
  TokenIterator() : first(), last() {}
  // Prefix increment:
  TokenIterator& operator++() {
    word.resize(0);
    first = std::find_if(first, last, predicate);
    while (first != last && predicate(*first))
      word += *first++;
    return *this;
  }
  // Postfix increment
  class Proxy { 
    std::string word;
  public:
    Proxy(const std::string& w) : word(w) {}
    std::string operator*() { return word; } 
  };
  Proxy operator++(int) { 
    Proxy d(word);
    ++*this; 
    return d; 
  }
  // Produce the actual value:
  std::string operator*() const { return word; }
  const std::string* operator->() const {
    // Not &(operator*()): that would point at a temporary
    return &word; 
  }
  // Compare iterators:
  bool operator==(const TokenIterator&) const { 
    return word.size() == 0 && first == last; 
  }
  bool operator!=(const TokenIterator& rv) const { 
    return !(*this == rv);
  }
};
#endif // TOKENITERATOR_H ///:~

TokenIterator inherits from the std::iterator template. It might appear that some kind of functionality comes with std::iterator, but it is purely a way of tagging an iterator so that a container or algorithm that uses it knows what it’s capable of. Here, you can see input_iterator_tag as a template argument – this tells anyone who asks that a TokenIterator has only the capabilities of an input iterator, and cannot be used with algorithms requiring more sophisticated iterators. Apart from the tagging, std::iterator doesn’t do anything else, which means you must design in all the other functionality yourself.
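
Here is a small sketch (illustrative) of what the tag buys: generic code can ask iterator_traits for the category and dispatch on it:

#include <iostream>
#include <iterator>

template<class Iter>
void describe(Iter, std::input_iterator_tag) {
  std::cout << "input iterator: one forward pass only" << std::endl;
}
template<class Iter>
void describe(Iter, std::random_access_iterator_tag) {
  std::cout << "random access: can jump anywhere" << std::endl;
}
template<class Iter>
void describe(Iter it) {
  // iterator_traits reads the category that std::iterator tagged:
  describe(it, typename std::iterator_traits<Iter>::iterator_category());
}

Calling describe( ) on a TokenIterator selects the input-iterator overload; calling it on an int*, for example, selects the random-access one.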

TokenIterator may look a little strange at first, because the first constructor requires both a “begin” and “end” iterator as arguments, along with the predicate. Remember that this is a “wrapper” iterator that has no idea of how to tell whether it’s at the end of its input source, so the ending iterator is necessary in the first constructor. The reason for the second (default) constructor is that the STL algorithms (and any algorithms you write) need a TokenIterator sentinel to be the past-the-end value. Since all the information necessary to see if the TokenIterator has reached the end of its input is collected in the first constructor, this second constructor creates a TokenIterator that is merely used as a placeholder in algorithms.

The core of the behavior happens in operator++. This erases the current value of word using string::resize( ), then finds the first character that satisfies the predicate (thus discovering the beginning of the new token) using find_if( ) (from the STL algorithms, discussed in the following chapter). The resulting iterator is assigned to first, thus moving first forward to the beginning of the token. Then, as long as the end of the input is not reached and the predicate is satisfied, characters are copied into word from the input. Finally, operator++ returns the iterator itself; the new token is produced by dereferencing it with operator*.

The postfix increment requires a proxy object to hold the value before the increment, so it can be returned (see the operator overloading chapter for more details of this). Producing the actual value is a straightforward operator*. The only other functions that must be defined for an input iterator are operator== and operator!=, to indicate whether the TokenIterator has reached the end of its input. You can see that the argument of operator== is ignored – it only cares about whether this iterator has reached its internal last iterator. Notice that operator!= is defined in terms of operator==.

A good test of TokenIterator includes a number of different sources of input characters, including an istreambuf_iterator, a const char*, and a deque<char>::iterator. Finally, the original WordList.cpp problem is solved:

//: C20:TokenIteratorTest.cpp
#include "TokenIterator.h"
#include "../require.h"
#include <fstream>
#include <iostream>
#include <vector>
#include <deque>
#include <set>
#include <cstring>
using namespace std;

int main() {
  ifstream in("TokenIteratorTest.cpp");
  assure(in, "TokenIteratorTest.cpp");
  ostream_iterator<string> out(cout, "\n");
  typedef istreambuf_iterator<char> IsbIt;
  IsbIt begin(in), isbEnd;
  Delimiters 
    delimiters(" \t\n~;()\"<>:{}[]+-=&*#.,/\\");
  TokenIterator<IsbIt, Delimiters> 
    wordIter(begin, isbEnd, delimiters),
    end;
  vector<string> wordlist;
  copy(wordIter, end, back_inserter(wordlist));
  // Output results:
  copy(wordlist.begin(), wordlist.end(), out);
  *out++ = "-----------------------------------";
  // Use a char array as the source:
  const char* cp = 
    "typedef std::istreambuf_iterator<char> It";
  TokenIterator<const char*, Delimiters>
    charIter(cp, cp + strlen(cp), delimiters),
    end2;
  vector<string> wordlist2;
  copy(charIter, end2, back_inserter(wordlist2));
  copy(wordlist2.begin(), wordlist2.end(), out);
  *out++ = "-----------------------------------";
  // Use a deque<char> as the source:
  ifstream in2("TokenIteratorTest.cpp");
  deque<char> dc;
  copy(IsbIt(in2), IsbIt(), back_inserter(dc));
  TokenIterator<deque<char>::iterator,Delimiters>
    dcIter(dc.begin(), dc.end(), delimiters),
    end3;
  vector<string> wordlist3;
  copy(dcIter, end3, back_inserter(wordlist3));
  copy(wordlist3.begin(), wordlist3.end(), out);
  *out++ = "-----------------------------------";
  // Reproduce the WordList.cpp example:
  ifstream in3("TokenIteratorTest.cpp");
  TokenIterator<IsbIt, Delimiters>
    wordIter2(IsbIt(in3), isbEnd, delimiters);
  set<string> wordlist4;
  while(wordIter2 != end)
    wordlist4.insert(*wordIter2++);
  copy(wordlist4.begin(), wordlist4.end(), out);
} ///:~

When using an istreambuf_iterator, you create one to attach to the istream object, and one with the default constructor as the past-the-end marker. Both of these are used to create the TokenIterator that will actually produce the tokens; the default constructor produces the faux TokenIterator past-the-end sentinel (this is just a placeholder, and as mentioned previously is actually ignored). The TokenIterator produces strings that are inserted into a container which must, naturally, be a container of string – here a vector<string> is used in all cases except the last (you could also concatenate the results onto a string). Other than that, a TokenIterator works like any other input iterator.
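
And that final parenthetical option, concatenating the tokens onto a single string, might look like this sketch (illustrative, assuming TokenIterator.h above):

#include "TokenIterator.h"
#include <iostream>
#include <sstream>
#include <string>
using namespace std;

int main() {
  istringstream source("one two three");
  typedef istreambuf_iterator<char> IsbIt;
  IsbIt begin(source), isbEnd;
  TokenIterator<IsbIt> t(begin, isbEnd), end; // Default Isalpha predicate
  string all;
  while(t != end)      // Append each token plus a space
    all += *t++ + ' ';
  cout << all << endl; // "one two three "
}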


[63] This is another example coached by Nathan Myers.
