Thursday, March 01, 2007

boost::tokenizer and streams

I had to parse a huge file and build some lookup tables from it. Each line held comma-separated fields, padded with whitespace. Initially, some "performance"-minded person wrote a Perl script that emitted a C++ function which statically populated the lookup tables. The compiler literally took an hour to compile that file (which was only ~1 MB). So I set out to write a small parser and populate the lookup tables at runtime instead. I read each line and used boost::tokenizer to split the comma-separated values. It looked something like this...

typedef boost::tokenizer<boost::char_separator<char> > tokenizer;

// Split on tabs, commas and spaces; keep "\n" as its own token.
boost::char_separator<char> sep("\t, ", "\n");

std::string str;
while (std::getline(f, str))
{
    tokenizer tokens(str, sep);
    for (tokenizer::iterator tok_iter = tokens.begin();
         tok_iter != tokens.end(); ++tok_iter)
        std::cout << "<" << *tok_iter << "> ";
}


The initial version worked. But looking at boost::tokenizer's constructor, I got curious: what if I passed a stream iterator to the tokenizer directly? That would make my code much prettier, and it is obviously a better way of doing it. So I did this.

{
    std::ifstream ifile(filename.c_str());
    std::istream_iterator<char> file_iter(ifile);
    std::istream_iterator<char> end_of_stream;

    typedef boost::tokenizer<boost::char_separator<char>,
                             std::istream_iterator<char> > tokenizer;
    boost::char_separator<char> sep("\t, ", "\n");

    tokenizer tokens(file_iter, end_of_stream, sep);

    for (tokenizer::iterator tok_iter = tokens.begin();
         tok_iter != tokens.end(); ++tok_iter)
        std::cout << "<" << *tok_iter << "> ";
}


Soon I hit a snag. For some reason I wasn't seeing the newline characters that were supposed to be printed, since I specifically instructed the tokenizer to keep the "\n" delimiters. That could mean only one thing... the stream iterator was eating the "\n"s. Of course it was... duh! std::istream_iterator reads with operator>>, which by default skips anything the stream's locale classifies as whitespace. So I changed to istreambuf_iterator, which reads raw characters straight from the stream buffer, won't do any parental controls over the stream, and shows me everything...

{
    std::ifstream ifile(filename.c_str());
    std::istreambuf_iterator<char> file_iter(ifile);
    std::istreambuf_iterator<char> end_of_stream;

    typedef boost::tokenizer<boost::char_separator<char>,
                             std::istreambuf_iterator<char> > tokenizer;

    boost::char_separator<char> sep("\t, ", "\n");

    tokenizer tokens(file_iter, end_of_stream, sep);

    for (tokenizer::iterator tok_iter = tokens.begin();
         tok_iter != tokens.end(); ++tok_iter)
        std::cout << "<" << *tok_iter << "> ";
}


I got what I wanted: a token parser that can parse a stream. Now the input can be any stream and it will still work. And yeah... it takes only a few seconds to compile this program, and building the lookup table is actually faster than the statically populated version (because of all the temporary storage the generated code allocates and deallocates in the static version).
