Perorate Parsimoniously

J. Hardie

Assigned: Jan 23, 2005
Due: Feb 13, 2005 at 11:59 PM

Introduction

Remember when you first discovered a thesaurus? Before you learned how to use one? All of a sudden your writing was peppered with obscure words and strange constructions. For example, the first sentence of this paragraph might have been written ``Commemorate the superego glimpse of a nomenclator''.

But do you also remember how tedious (vapid, flat, invariable, insipid or tiresome) it was to perform this feat (victory, acta, move...)? You had to spend a lot of time poring over (weighing, taking stock of, tracing) Roget's Thesaurus to find your great new words.

Well, computers have changed all that. It is now possible to have your computer generate bizarre sentences almost as well as you yourself could.

This is our goal!

You will write a program called ``lawyer'' which will load a thesaurus file and then read sentences from the keyboard. For each word read from the keyboard, the program will search the thesaurus. If the word is found, a random substitute will be chosen and printed in its place. Note that the random substitute can be the original word. If the word is not found in the thesaurus, the original word will be printed.

For example, ``A Tale of Two Cities'' by Charles Dickens begins like this:

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity...
but we can ``improve'' this with our program. For example:
Inner man was the surpass in re contemporaneousness, I myself was the brutal as to presentness, her was the perish in point of usefulness, ethical self was the florid pertaining to roguishness, myself was the age in relation to self-importance, oneself was the age as for atheism...

Much better, isn't it?

Specifications

You will write a program called lawyer which will consist of four files. The first file, lawyer.cc contains the main program which is specified in advance. Your main program is:

#include <iostream>
#include <fstream>
#include <cstdlib>
#include <stdexcept>
#include <sstream>
#include <ctime>

using namespace std;

#include "thesaurus.h"

int main(int argc, char **argv)
{
    if (argc > 2)
        srand(time(0));

    if (argc < 2)
    {
        cerr << "Error -  must provide a file name" << endl;
        return EXIT_FAILURE;
    }
 
    thesaurus t;
    t.load(argv[1]);

    string line;
    string word;

    cerr << "Line: ";
    while (getline(cin,line))
    {
        istringstream ins(line);
        while (ins >> word)
        {
            cout << t.get(word) << " ";
        }
        cout << endl;
        cerr << "Line: ";
    }
    
    return EXIT_SUCCESS;
}

and you are not allowed to change it! This main program accepts two command line arguments. The first is the name of a thesaurus file. The second is optional, but if it is provided then the program will randomize the word selection. If the second argument is not provided then the ``random'' word selection will not be random. (As strange as this sounds, it is very useful for testing).

The two remaining files are called thesaurus.h and thesaurus.cc and they contain the interface definition and function definitions for the thesaurus class.

The thesaurus class is required to have exactly two public member functions:

You may place as many functions and variables as you need in the private section of the thesaurus class, but you may not add any other public variables or functions.

The thesaurus.h file should contain only the class declaration for the thesaurus class. All of the thesaurus member functions should be defined in the thesaurus.cc file.

You will also supply a Makefile which will build the lawyer program when the command

make
is executed.

See the "Details" section below for suggestions, hints and information about coding this program.

Deliverables

You will submit four source code files using the handin command:

~jhardie/public/handin231 2 1

These four files are:

None of your member functions should print anything. Your program should do no prompting other than that already in the main program above.

Grading

After your program has been successfully submitted, I will test it. This test will determine your starting grade as follows:

Once your program is tested, it will be examined for code structure, compliance with the specifications and coding style. Additional points may be deducted for egregious violations of the coding standards.

Good Luck.

Details

This section describes some information you will find useful in writing your program.

The thesaurus file

I have provided a very large thesaurus file for you to use. However, for initial testing, I strongly suggest that you create your own, much smaller, file. This will save you a lot of time because the complete thesaurus takes a while to process and you don't want to do that any more than is necessary.

The format of this file is that there are any number of words on a line, separated by commas. The first word on the line is the ``main word'' which is the word that could match on a search. The remaining words on the line are ``synonyms'' for the main word. Any of these synonyms can replace the main word when your program is ``improving'' some writing.

For example, here's a very short thesaurus file

bad,terrible,awful,miserable,nasty
good,wonderful,great,amazing,saintly,tasty
food,sustenance,grub,eats
which defines three main words, ``bad'', ``good'', and ``food''. The first word, ``bad'' has four synonyms. The second, ``good'' has five and the last ``food'', has three.

All lines end in a newline character.

The main file which I have provided is called

~jhardie/public/thesaurus.txt
and it defines just over 30,260 main words and a total of just over 2,000,000 synonyms.

This file contains some very long lines with an average of 21 synonyms per word. The longest line contains 1448 words and the shortest contains 18.

Needless to say this file takes quite a while to load. A reasonable implementation will load the large thesaurus file in less than a minute. It is possible to do it in 10-15 seconds. If your program takes more than 2 minutes, there's something seriously wrong and you should rethink your algorithms and examine your design.

DO NOT COPY THE THESAURUS FILE TO YOUR DIRECTORY! It's too large. Just give the path to my copy of the file on the command line when you run your program.

Of course, you can keep your own test thesaurii in your directory since they'll be considerably shorter than the main file.

Reading lines from a file

You are going to want to read entire lines from the thesaurus file. The following code fragment shows you how to open a file, read the lines from that file one at a time and then send them to a processing function. Note that this is just an example - you will need to use these lines in your program but you have to decide where and how.

#include <fstream>

// We assume that this exists somewhere...
void process(string &s);

void readfile(string &filename)
{
    ifstream in(filename.c_str());
    if (!in)
    {
        // process the error - file didn't open
    }

    string theline;
    
    while (getline(in,theline))
    {
        process(theline);
    }
    in.close();
}


int main(int argc, char *argv[])
{
    // first command line parameter is the file name
    readfile(argv[1])
    return EXIT_SUCCESS;
}

Looking up words

In order to correctly store your synonyms, you will have to use a vector. I suggest that you use a vector of strings.

Then, you need some way to associate each vector of synonyms with the particular main word. The C++ standard library provides a very nice data structure called a map which will do this.

I would encourage you to use this structure. You can define a map which will associate a single string (the main word) with a vector of strings (the list of synonyms for that word). Once this association is made, you can look up the main word, find the associated vector and pick one of the synonyms.

The following code fragment example shows how to associate a vector of integers with a string. It creates the map, fills it, and then prints out one of the elements.

#include <iostream>
#include <vector>
#include <map>
#include <string>
using namespace std;

// Print a vector of integers.
void printvector(vector<int> &v)
{
    for (unsigned int i=0; i<v.size(); i++)
         cout << v[i] << ' ';
    cout << endl;
}

// Create a vector with the numbers 1..max
vector<int> createvect(int max)
{
    vector<int> v(max);
    for (unsigned int i=0; i<max; i++)
        v[i] = i+1;
    return v;
}

int main()
{
   // create a temporary string
   string t;
  
   // create a temporary vector
   vector<int> v;

   // create the map.  The following line creates
   // a variable called ``bob'' which relates 
   // strings to vectors of integers.
   // Note: the space in ``> >'' is required!
   map< string, vector<int> > bob;

   // Now that we've got a map, let's store some stuff
   // in it.

   // The word "first" is associated with the numbers from
   // 1..10.  Note that we use the string as an array index.
   t = "first";
   v = createvect(10);
   bob[t] = v;

   // The word "second" is associated with 1..50
   t = "second";
   v = createvect(50);
   bob[t] = v;

   // Ok, let's check to see if the word "third" is present.
   // If it is, the count() function will return 1, otherwise
   // it will return 0.
   if (bob.count("three") == 1)
       cout << "Three was found" << endl;
   else
       cout << "Three was not found" << endl;

   // Next, let's see if "second" is there.  If it is, let's
   // print the vector.  For variety, we'll use a variable
   // to hold the string ``second'' this time.
   t = "second";
   if (bob.count(t) == 1)
   {
       printvector(bob[t]);
   }
   else
       cout << "Second was not found" << endl;

   return EXIT_SUCCESS;
}

In your program, you will be associating a string with a vector of strings.

Creating the vector of strings

To do this, you will want to define a function for locating a synonym. For the purposes of this program, a synonym is any section of a line except commas. So, the following line has four synonyms

this,line,has,four synonyms
and the synonyms are ``this'', ``line'', ``has'', and ``four synonyms''.

To find these, treat the string containing the line as an array. You can loop through it with a ``for'' loop, just like you can with a vector and with arrays.

As you loop through, you will want to keep track of the beginning of the current synonym (one index) while you're looking for the end (the character before the next comma, or the last character on the line if there are no more commas).

Once you have these two indices, compute the size of the word with the expression

int len = lastindex - firstindex + 1;
and extract the substring function using the substr() member function for C++ strings (look it up in Dietel and Dietel).

You should do this with each word on each line of the thesaurus.

Save the first word in a separate variable (to use as the ``main word'') and store all the words as entries in the vector of strings.

Think carefully about this. I would also suggest that you write some small test programs to figure out how to parse the thesaurus lines before you start coding your main program.

Random numbers

Use the system supplied rand() function. For choosing a random vector element, the following code fragment will compute a random index in the vector v.

// Assume that vector v exists.
// index will be a random index into the vector v
int index;
index = int(double(rand())/RAND_MAX * v.size());

The ``load'' function in thesaurus

This function should

  1. Open the file. If the file can't be opened, throw an invalid_argument exception.
  2. Read a line from the file.
  3. Locate the first word and store it in a variable.
  4. Clear the vector of strings (``v.clear()'')
  5. Add the first word to the vector of strings.
  6. Locate the rest of the words and add them to the vector.
  7. When all the words are stored in the vector, add the vector to the map, indexed by the first word.
  8. Repeat from step 2 until there are no more lines in the file.

The ``get'' function in thesaurus

This function should
  1. Check to see if the given word is in the map.
  2. If it is, select a random index into the vector of strings associated with this word.
  3. Return the word at that index.
  4. If the word is not found, just return the word.

About this document ...

Perorate Parsimoniously

This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)

Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.

The command line arguments were:
latex2html -split 0 -link 4 -no_subdir -no_navigation -image_type gif -prefix program2 program2.tex

The translation was initiated by John G. Hardie on 2005-01-18


John G. Hardie 2005-01-18