set

The set produces a container that will accept only one of each thing you place in it; it also sorts the elements. (Sorting isn’t intrinsic to the conceptual definition of a set, but the STL set stores its elements in a balanced binary tree to provide rapid lookups, thus producing sorted results when you traverse it.) The first two examples in this chapter used sets. Consider the problem of creating an index for a book. You might like to start with all the words in the book, but you only want one instance of each word and you want them sorted. Of course, a set is perfect for this, and solves the problem effortlessly. However, there’s also the problem of punctuation and any other non-alpha characters, which must be stripped off to generate proper words. One solution to this problem is to use the Standard C library function strtok( ), which produces tokens (in our case, words) given a set of delimiters to strip out:
 
//: C20:WordList.cpp
// Display a list of words used in a document
#include "../require.h"
#include <algorithm>
#include <cstring>
#include <fstream>
#include <iostream>
#include <iterator>
#include <set>
#include <string>
using namespace std;

const char* delimiters =
  " \t;()\"<>:{}[]+-=&*#.,/\\~";

int main(int argc, char* argv[]) {
  requireArgs(argc, 1);
  ifstream in(argv[1]);
  assure(in, argv[1]);
  set<string> wordlist;
  string line;
  while(getline(in, line)) {
    // Capture individual words:
    char* s = // Cast probably won’t crash:
      strtok((char*)line.c_str(), delimiters);
    while(s) {
      // Automatic type conversion:
      wordlist.insert(s);
      s = strtok(0, delimiters);
    }
  }
  // Output results:
  copy(wordlist.begin(), wordlist.end(),
       ostream_iterator<string>(cout, "\n"));
} ///:~

strtok( ) takes the starting address of a character buffer (the first argument) and looks for delimiters (the second argument). It replaces each delimiter with a zero and returns the address of the beginning of the token. If you call it subsequent times with a first argument of zero, it continues extracting tokens from the rest of the string until it finds the end. In this case, the delimiters are those that delimit the keywords and identifiers of C++, so it extracts those keywords and identifiers. Each word is turned into a string and placed into the wordlist set, which eventually contains the whole file, broken up into words.
You don’t have to use a set just to get a sorted sequence. You can use the sort( ) function (along with a multitude of other functions in the STL) on different STL containers. However, it’s likely that set will be faster.
 
Eliminating strtok( )
Some programmers consider strtok( ) to be the poorest design in the Standard C library because it uses a static buffer to hold its data between function calls. This means:

- You can’t use strtok( ) in two places at the same time
- You can’t use strtok( ) in a multithreaded program
- You can’t use strtok( ) in a library that might be used in a multithreaded program
- strtok( ) modifies the input sequence, which can produce unexpected side effects
- strtok( ) depends on reading in “lines”, which means you need a buffer big enough for the longest line. This produces both wastefully-sized buffers and lines longer than the “longest” line, which can also introduce security holes. (Notice that the buffer-size problem was eliminated in WordList.cpp by using string input, but this required a cast so that strtok( ) could modify the data in the string – a dangerous approach for general-purpose programming.)
For all these reasons it seems like a good idea to find an alternative for strtok( ). The following example uses an istreambuf_iterator (introduced earlier) to move characters from one place (which happens to be an istream) to another (which happens to be a string), depending on whether the Standard C library function isalpha( ) returns true:
 
//: C20:WordList2.cpp
// Eliminating strtok() from WordList.cpp
#include "../require.h"
#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <iterator>
#include <set>
#include <string>
using namespace std;

int main(int argc, char* argv[]) {
  requireArgs(argc, 1);
  ifstream in(argv[1]);
  assure(in, argv[1]);
  istreambuf_iterator<char> p(in), end;
  set<string> wordlist;
  while(p != end) {
    string word;
    insert_iterator<string>
      ii(word, word.begin());
    // Find the first alpha character
    // (test p != end before dereferencing):
    while(p != end && !isalpha(*p))
      p++;
    // Copy until the first non-alpha character:
    while(p != end && isalpha(*p))
      *ii++ = *p++;
    if(word.size() != 0)
      wordlist.insert(word);
  }
  // Output results:
  copy(wordlist.begin(), wordlist.end(),
    ostream_iterator<string>(cout, "\n"));
} ///:~

This example was suggested by Nathan Myers, who invented the istreambuf_iterator and its relatives. This iterator extracts information character-by-character from a stream. Although the istreambuf_iterator template argument might suggest to you that you could extract, for example, ints instead of chars, that’s not the case. The argument must be of some character type – a regular char or a wide character.
After the file is open, an istreambuf_iterator called p is attached to the istream so characters can be extracted from it. The set<string> called wordlist will be used to hold the resulting words.
The while loop reads words until the end of the input stream is found. This is detected using the default constructor for istreambuf_iterator, which produces the past-the-end iterator object end. Thus, if you want to test to make sure you’re not at the end of the stream, you simply say p != end. The second type of iterator that’s used here is the insert_iterator, which creates an iterator that knows how to insert objects into a container. Here, the “container” is the string called word which, for the purposes of insert_iterator, behaves like a container. The constructor for insert_iterator requires the container and an iterator indicating where it should start inserting the characters. You could also use a back_insert_iterator, which requires that the container have a push_back( ) (string does). There is also a front_insert_iterator, but that requires that the container have a push_front( ), which string does not.
After the while loop sets everything up, it begins by looking for the first alpha character, incrementing p until that character is found. Then it copies characters from one iterator to the other, stopping when a non-alpha character is found. Each word, assuming it is non-empty, is added to wordlist.
StreamTokenizer: a more flexible solution
The above program parses its input into strings of words containing only alpha characters, but that’s still a special case compared to the generality of strtok( ). What we’d like now is an actual replacement for strtok( ) so we’re never tempted to use it. WordList2.cpp can be modified to create a class called StreamTokenizer that delivers a new token as a string whenever you call next( ), according to the delimiters you give it upon construction (very similar to strtok( )):
//: C20:StreamTokenizer.h
// C++ Replacement for Standard C strtok()
#ifndef STREAMTOKENIZER_H
#define STREAMTOKENIZER_H
#include <string>
#include <iostream>
#include <iterator>
  
class StreamTokenizer {
  typedef std::istreambuf_iterator<char> It;
  It p, end;
  std::string delimiters;
  bool isDelimiter(char c) {
    return 
      delimiters.find(c) != std::string::npos;
  }
public:
  StreamTokenizer(std::istream& is, 
    std::string delim = " \t\n;()\"<>:{}[]+-=&*#"
    ".,/\\~!0123456789") : p(is), end(It()),
    delimiters(delim) {}
  std::string next(); // Get next token
};
#endif // STREAMTOKENIZER_H ///:~
The default delimiters for the StreamTokenizer constructor extract words with only alpha characters, as before, but now you can choose different delimiters to parse different tokens. The implementation of next( ) looks similar to WordList2.cpp:
//: C20:StreamTokenizer.cpp {O}
#include "StreamTokenizer.h"
using namespace std;

string StreamTokenizer::next() {
  string result;
  if(p != end) {
    insert_iterator<string>
      ii(result, result.begin());
    // Test p != end before dereferencing:
    while(p != end && isDelimiter(*p))
      p++;
    while(p != end && !isDelimiter(*p))
      *ii++ = *p++;
  }
  return result;
} ///:~

The first non-delimiter is found, then characters are copied until a delimiter is found, and the resulting string is returned. Here’s a test:
 
//: C20:TokenizeTest.cpp
//{L} StreamTokenizer
// Test StreamTokenizer
#include "StreamTokenizer.h"
#include "../require.h"
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <set>
#include <string>
using namespace std;

int main(int argc, char* argv[]) {
  requireArgs(argc, 1);
  ifstream in(argv[1]);
  assure(in, argv[1]);
  StreamTokenizer words(in);
  set<string> wordlist;
  string word;
  while((word = words.next()).size() != 0)
    wordlist.insert(word);
  // Output results:
  copy(wordlist.begin(), wordlist.end(),
    ostream_iterator<string>(cout, "\n"));
} ///:~

Now the tool is more reusable than before, but it’s still inflexible, because it can only work with an istream. This isn’t as bad as it first seems, since a string can be turned into an istream via an istringstream. But in the next section we’ll come up with the most general, reusable tokenizing tool, and this should give you a feeling for what “reusable” really means, and the effort necessary to create truly reusable code.
 
A completely reusable tokenizer
Since the STL containers and algorithms all revolve around iterators, the most flexible solution will itself be an iterator. You could think of the TokenIterator as an iterator that wraps itself around any other iterator that can produce characters. Because it is designed as an input iterator (the most primitive type of iterator), it can be used with any STL algorithm. Not only is it a useful tool in itself, the TokenIterator is also a good example of how you can design your own iterators.[63]

The TokenIterator is doubly flexible: first, you can choose the type of iterator that will produce the char input. Second, instead of just saying what characters represent the delimiters, TokenIterator uses a predicate, which is a function object whose operator( ) takes a char and decides whether it should be in the token or not. Although the two examples given here have a static concept of what characters belong in a token, you could easily design your own function object to change its state as the characters are read, producing a more sophisticated parser.
The following header file contains the two basic predicates Isalpha and Delimiters, along with the template for TokenIterator:
//: C20:TokenIterator.h
#ifndef TOKENITERATOR_H
#define TOKENITERATOR_H
#include <string>
#include <iterator>
#include <algorithm>
#include <cctype>
#include <cstddef>
struct Isalpha { 
  bool operator()(char c) { 
    using namespace std; //[[For a compiler bug]]
    return isalpha(c); 
  }
};
class Delimiters {
  std::string exclude;
public:
  Delimiters() {}
  Delimiters(const std::string& excl) 
    : exclude(excl) {}
  bool operator()(char c) {
    return exclude.find(c) == std::string::npos;
  }
};
template <class InputIter, class Pred = Isalpha>
class TokenIterator : public std::iterator<
  std::input_iterator_tag, std::string,
  std::ptrdiff_t> {
  InputIter first;
  InputIter last;
  std::string word;
  Pred predicate;
public:
  TokenIterator(InputIter begin, InputIter end, 
    Pred pred = Pred()) 
    : first(begin), last(end), predicate(pred) {
      ++*this; 
  }
  TokenIterator() : first(InputIter()),
    last(InputIter()) {} // End sentinel
  // Prefix increment:
  TokenIterator& operator++() {
    word.resize(0);
    first = std::find_if(first, last, predicate);
    while (first != last && predicate(*first))
      word += *first++;
    return *this;
  }
  // Postfix increment
  class Proxy { 
    std::string word;
  public:
    Proxy(const std::string& w) : word(w) {}
    std::string operator*() { return word; } 
  };
  Proxy operator++(int) { 
    Proxy d(word);
    ++*this; 
    return d; 
  }
  // Produce the actual value:
  std::string operator*() const { return word; }
  const std::string* operator->() const {
    // Not &(operator*()): that would take
    // the address of a temporary.
    return &word;
  }
  // Compare iterators:
  bool operator==(const TokenIterator&) { 
    return word.size() == 0 && first == last; 
  }
  bool operator!=(const TokenIterator& rv) { 
    return !(*this == rv);
  }
};
#endif // TOKENITERATOR_H ///:~
TokenIterator is inherited from the std::iterator template. It might appear that there’s some kind of functionality that comes with std::iterator, but it is purely a way of tagging an iterator so that a container or algorithm that uses it knows what it’s capable of. Here, you can see input_iterator_tag as a template argument – this tells anyone who asks that a TokenIterator has only the capabilities of an input iterator, and cannot be used with algorithms requiring more sophisticated iterators. Apart from the tagging, std::iterator doesn’t do anything else, which means you must design all the other functionality in yourself.
TokenIterator may look a little strange at first, because the first constructor requires both a “begin” and “end” iterator as arguments, along with the predicate. Remember that this is a “wrapper” iterator that has no idea how to tell whether it’s at the end of its input source, so the ending iterator is necessary in the first constructor. The reason for the second (default) constructor is that the STL algorithms (and any algorithms you write) need a TokenIterator sentinel to be the past-the-end value. Since all the information necessary to see whether the TokenIterator has reached the end of its input is collected in the first constructor, this second constructor creates a TokenIterator that is merely used as a placeholder in algorithms.
The core of the behavior happens in operator++. This erases the current value of word using string::resize( ), then finds the first character that satisfies the predicate (thus discovering the beginning of the new token) using find_if( ) (from the STL algorithms, discussed in the following chapter). The resulting iterator is assigned to first, thus moving first forward to the beginning of the token. Then, as long as the end of the input is not reached and the predicate is satisfied, characters are copied into word from the input. Finally, the iterator, now holding the new token, is returned.
The postfix increment requires a proxy object to hold the value before the increment, so it can be returned (see the operator overloading chapter for more details of this). Producing the actual value is a straightforward operator*. The only other functions that must be defined for an input iterator are operator== and operator!= to indicate whether the TokenIterator has reached the end of its input. You can see that the argument for operator== is ignored – it only cares about whether it has reached its internal last iterator. Notice that operator!= is defined in terms of operator==.

A good test of TokenIterator includes a number of different sources of input characters, including an istreambuf_iterator, a char*, and a deque<char>::iterator. Finally, the original WordList.cpp problem is solved:
 
//: C20:TokenIteratorTest.cpp
#include "TokenIterator.h"
#include "../require.h"
#include <algorithm>
#include <cstring>
#include <deque>
#include <fstream>
#include <iostream>
#include <iterator>
#include <set>
#include <string>
#include <vector>
using namespace std;

int main() {
  ifstream in("TokenIteratorTest.cpp");
  assure(in, "TokenIteratorTest.cpp");
  ostream_iterator<string> out(cout, "\n");
  typedef istreambuf_iterator<char> IsbIt;
  IsbIt begin(in), isbEnd;
  Delimiters
    delimiters(" \t\n~;()\"<>:{}[]+-=&*#.,/\\");
  TokenIterator<IsbIt, Delimiters>
    wordIter(begin, isbEnd, delimiters),
    end;
  vector<string> wordlist;
  copy(wordIter, end, back_inserter(wordlist));
  // Output results:
  copy(wordlist.begin(), wordlist.end(), out);
  *out++ = "-----------------------------------";
  // Use a char array as the source:
  char cp[] =
    "typedef std::istreambuf_iterator<char> It";
  TokenIterator<char*, Delimiters>
    charIter(cp, cp + strlen(cp), delimiters),
    end2;
  vector<string> wordlist2;
  copy(charIter, end2, back_inserter(wordlist2));
  copy(wordlist2.begin(), wordlist2.end(), out);
  *out++ = "-----------------------------------";
  // Use a deque<char> as the source:
  ifstream in2("TokenIteratorTest.cpp");
  deque<char> dc;
  copy(IsbIt(in2), IsbIt(), back_inserter(dc));
  TokenIterator<deque<char>::iterator, Delimiters>
    dcIter(dc.begin(), dc.end(), delimiters),
    end3;
  vector<string> wordlist3;
  copy(dcIter, end3, back_inserter(wordlist3));
  copy(wordlist3.begin(), wordlist3.end(), out);
  *out++ = "-----------------------------------";
  // Reproduce the WordList.cpp example:
  ifstream in3("TokenIteratorTest.cpp");
  TokenIterator<IsbIt, Delimiters>
    wordIter2(IsbIt(in3), isbEnd, delimiters);
  set<string> wordlist4;
  while(wordIter2 != end)
    wordlist4.insert(*wordIter2++);
  copy(wordlist4.begin(), wordlist4.end(), out);
} ///:~

When using an istreambuf_iterator, you create one to attach to the istream object and one with the default constructor as the past-the-end marker. Both of these are used to create the TokenIterator that will actually produce the tokens; the default constructor produces the faux TokenIterator past-the-end sentinel (this is just a placeholder, and as mentioned previously it is actually ignored in the comparison). The TokenIterator produces strings that are inserted into a container which must, naturally, be a container of string – here a vector<string> is used in all cases except the last (you could also concatenate the results onto a string). Other than that, a TokenIterator works like any other input iterator.
 
[63] This is another example coached by Nathan Myers.