Assignment 3: Document Indexer

Document indexing is an interesting and useful topic. Web search engines, online library information systems, and application search utilities all rely on indexing. In this assignment, you will create a simple document indexing system. The sequence of steps below is meant to serve as an implementation guide; please turn in just the final program and test output.

Requirements

This is an in-memory indexing system (meaning that all index information is held in memory rather than stored on disk). For now, it will index only files reachable through a filesystem, and will index only individual words. For simplicity, you can say that a words are sequences of characters separated by whitespace (which is how an input iostream reads in a string using operator >>). You may want to strip off punctuation characters as well, but it isn't required.

The indexer is an interactive application. Users should be able to:

Index a collection of files.
Add a new file to the index, or update the information for an existing one, at any time.
Search for a word in the index, and retrieve the names of the documents that contain it.
Refine a previous search result by adding another word, giving the intersection of the documents that contain both words.
Expand a previous search result by adding another word, giving the union of the documents that contain either or both words.

A simple command line interface for doing this is fine.

The index keeps track of the number of times each word occurs ("hits") in each document. When a search result is displayed, the documents are displayed according to the number of "hits" for the word, with the document containing the most hits listed first.
If a search result is the result of an intersection, the number of hits for each document is equal to the minimum of the hits for each word. So when search for bar AND baz, if document foo.txt has 3 hits for bar and 2 for baz, the intersection contains foo.txt with 2 hits.
If a search result is the result of a union, the number of hits for each document is equal to the maximum of the hits for each word. So when search for bar OR baz, if document foo.txt has 3 hits for bar and 2 for baz, the intersection contains foo.txt with 3 hits.
All searches are case-insensitive, so "FOO" and "foo" are treated equivalently.
Search operations (including intersection and union) should be efficient; in particular, the implementation must take at most O(log N) time to look up a given word, where N is the total number of indexed words. Intersection and union must be faster than O(M^2), where M is the number of search results in the larger of the two sets.

Design

Create the indexing system as a reusable component. This means separating out the UI (command line, user interaction) from the indexing engine, and creating a reusable interface. Make sure that the engine allows multiple, unrelated search results to exist at the same time.

Here is a suggested design:

In this design, the index is maintained by the Index object. It populates SearchResult objects, which are basically some kind of Collection in the STL sense. (In other words, SearchResult may be a class name for a user-defined Collection, or it may be just a typedef for some STL collection.) SearchResult contains multiple LocationInfo objects, each of which contains the name of a location (or file) and a hit count for that search.

Create and run unit tests for your classes that don't require interactive input. Use these to test the indexer at each step as you create it. Turn in the unit test code and results along with your interactive sessions.

You may want to create some additional member functions to make unit testing easier. Below is a sample which uses the unit test framework published in the recent C/C++ User's Journal ( ftp://ftp.cuj.com/pub/2000/1809/allison.zip ):

...

class Indexer
{
public:
    // Process a file/location, adding all the words
    // found to the index:
    void processLocation(std::string const &filename);

    // Add a single word to the index. Location of
    // word is recorded along with the count.
    void addWord(std::string const &word,
                 std::string const &location);

    // Narrow a search result by adding another word.
    // Resulting (modified) collection contains only
    // locations that contain both words.
    // The new occurrence count for each location is equal to the
    // minimum of the original and the new counts.
    void intersectResult(SearchResult &result, std::string const &word);

    // Broaden a search result by adding another word.
    // Resulting (modified) collection will have
    // all locations that contain either word.
    // For locations that contain both words, the
    // occurrence count is equal to the maximum of
    // the two individual word counts.
    void unionResult(SearchResult &result, std::string const &word);

    // Remove locations that contain the given word from the
    // result set. (Not required.)
    void subtractResult(std::string const &word, WordInfoColl &result);

    // Print out contents of index for debugging.
    void dump() const;
private:
    ...
};

...

class IndexerTest : public Test
{
    Index idx;
public:
    IndexerTest() {
        setStream(&cerr);
    }
    void run() {
        // Add a single word and see if we get
        // it back when we search:
        idx.addWord("hello","hellogoodbye.txt");
        SearchResult s;
        idx.unionResult(s,"hello");
        _test(s.size()==1);
        _test(s.begin()->first == "hellogoodbye.txt");

        // Add a new document with the same word
        // and test that we get back two results:
        s.clear();
        idx.addWord("hello","hello.txt");
        idx.unionResult(s,"hello");
        _test(s.size()==2);

        // Increase the word count for a single document
        // and ensure that we still get back two documents:
        s.clear();
        idx.addWord("hello","hello.txt");
        idx.unionResult(s,"hello");
        _test(s.size()==2);

        // Add a couple more words and inspect index contents:
        idx.addWord("goodbye","hellogoodbye.txt");
        idx.addWord("hello","hellogoodbye.txt");
        idx.dump();

        // Look for document that contains
        // both "hello" and "goodbye"; this
        // should be just hellogoodbye.txt:
        s.clear();
        idx.unionResult(s,"hello");
        idx.intersectResult(s,"goodbye");
        _test(s.size()==1);
        _test(s.begin()->first == "hellogoodbye.txt");
        _test(s.begin()->second.hitCount == 1);

        // Intersect with word "boffo" which
        // is not in index:
        idx.intersectResult("boffo",s);
        _test(s.size()==0);

// Also need to test document file interface!
}
};

int main()
{
Suite s("Suite Test");

    // Catch null test:
    s.addTest(new IndexerTest);
    s.run();
    long fails = s.report();
    cout << "There were " << fails << " failures." << endl;
    cout << (fails ? "FAILURE!" : "SUCCESS!") << endl;
}

You can download test files here.