Generate Text with Probabilities
Background
Some of you may be familiar with the “chimpanzees and typewriters” story. Imagine aroom full of chimpanzees, each sitting before a typewriter, diligently pecking away at thekeys in a completely random manner. In principle, at least, if we were prepared to waitlong enough, it is not beyond the realm of possibility that eventually we might find thecomplete works of Shakespeare among mountains of garbage that might have beenproduced. In the electronic age, we can easily simulate the performance of achimpanzee, by using a computer actually generating a valid English word in this way ina reasonable amount of time is slim, never mind a weighty piece of text.
A better approach in generating random English text would be to compute thefrequencies of the letters between A and Z and the blank character, in typical English text. By choosing letters withprobability equal to their frequencies, we might hope to generate text that more closelyapproximates English than the chimpanzee approach. The travesty game generalizesthis idea even further, to become a form of computerized solitaire. The game datesback to work done in the late 1940’s by Claude Shannon, and has been the subject ofseveral amusing articles. Also recently, applications of travesty have also beenexplored, in areas such as coding of text, and in word-processing.
Your assignment is to program the travesty game, which is described in some detail inthe next section. The assignment is designed to refresh your memory on arrays,characters, andprocedural programming; and get you familiarized with the Linuxsystem on hills. If you apply the principles of data abstraction in this implementation,you will be able to re-use much of your code in the later version.
Description of the Travesty Game
A selection of English text is given as input. Instead of computing the frequencies ofsingle characters from the input text, we compute for each prefix of some fixed length,the frequency of letters immediately following that prefix. That is, we set a constantcalled prefix_size, and for each prefix of length prefix_sizethat occurs in thetext, and each character suffix s, we compute the probability that s immediately followsthe prefix in the text. This is the frequency of s with respect to the prefix.
For example, if the prefix_sizewere 1, we would have 27 prefixes, one for eachletter and one for the blank character. With respect to the prefix Q, the frequency of theletter U would be 1 (or very close), while all other letters would have frequency zero (orvery close), reflecting the fact that the prefix Q is very likely to be followed by U. Asanother example, if the prefix_sizewere 2, we would have at most (27 * 27)prefixes. With respect to the prefix TH, we would expect the frequency of the letter E tobe high and the frequency of the letter Z to be zero.
Having compute the frequencies, the output text is constructed as follows:
1. choose a prefix randomly, with probability equal to its frequency and print itCS110B programming assignment#1: The Travesty Game due on Oct. 19, 2017! 1
2.w.r.t. prefix, choose a suffix character s with probability equal to its frequency andprint s
3. make a new prefix by removing the leftmost character from prefix and appending s tothe right end of prefix
4. repeat with the new prefix from step 2. until the number of letters print equals a predefined constant called output_size
Specification of the Travesty Data Structure
The elements of the travesty-structure are simply pairs (prefix,s), where prefix is a userdefined type prefix_typeas array of characters (or string) and s is a character calledthe suffix. The user-defined type prefix_typeas array of characters (or string)consists only of upper-case letters and blanks while the suffix is a single upper-caseletter or a blank. For simplicity, the prefix_typehas been resolved into a singlecharacter in this assignment. We defined five operations for the travesty-structure.
void initialize_travesty();
results: the travesty structure is initialized (or created)
void update_travesty(char prefix, char s);
requires: initialize_travesty() is already executed
results: the pair (prefix, s) is added to travesty structure
char choose_prefix();
requires: update_travesty() is already executed
results: prefix is chosen with probability equal to its frequency
char choose_suffix(char prefix);
results: suffix s is chosen with probability equal to its frequency w.r.t. prefix. If there is
no pair of the form (prefix,-) in travesty structure, then s is chosen randomly and
uniformly from the set {‘A’, … , ‘Z, ‘ ‘}
void print_travesty();
results: the frequencies of prefixes and suffixes in the travesty structure is printed out in
a useful way for debugging purposes
travesty.cpp.rtf
/*
* program description:
* anenglish text is given as input. the program will read in
* the text and store in a data structure which is called travesty
* structure. it consists of an array of prefix table. the array
* is index by a fixed length prefix. the prefix table has two fields.
* one is the total number of a particular prefix that appear in the text, while
* the other field is the array of suffix count which is index by a single
* character suffix from ‘A’ to ‘Z’ and ‘@’.
*
* the first part of the program will fill the travesty structure by reading
* one (prefix, s) pair at a time from the input english text.
*
* the second part of the program will print out the travesty structure based on
* the frequency of a prefix or suffix that appear in the original text.
*
* the third part of the program is to print out the random text. i use a
* “line_length” constant to determine the output line length.
*
*
* assumptions:
* 1. the input text must be in CAPITAL letters and only consists of
* ‘A’ to ‘Z’ and ‘ ‘.
*
* 2. the file size must be greater than the prefix size
*
* 3. all the newline marks are treated as blanks
*
* 4. the end of file mark indicates the end of file and it won’t be
* treated as a character. Therefore, the last valid character of the file
* will be treated as suffix only
*
*/
#include // srand() and rand()
#include // time()
//********************
// global declarations
//********************
// constants
const unsigned OUTPUT_SIZE = 1000;
const unsigned LINE_LENGTH = 80;
const unsigned PREFIX_SIZE = 1;
const unsigned NUM_TABLE_INDEX = 27;
// travesty structure
chartable_index[NUM_TABLE_INDEX] = {‘@’,
‘A’, ‘B’, ‘C’, ‘D’, ‘E’, ‘F’, ‘G’, ‘H’, ‘I’, ‘J’, ‘K’, ‘L’, ‘M’,
‘N’, ‘O’, ‘P’, ‘Q’, ‘R’, ‘S’, ‘T’, ‘U’, ‘V’, ‘W’, ‘X’, ‘Y’, ‘Z’};
unsignedprefix_count[NUM_TABLE_INDEX]; // count for each prefix
unsignedsuffix_count_wrt_prefix[NUM_TABLE_INDEX][NUM_TABLE_INDEX]; // 2D array of suffix count w.r.t. prefix
unsignedtotal_prefix_count; // total number of prefix
//********************
// function prototypes
//********************
// random returns a number randomly and uniformly between 1 and limit
unsigned random(unsigned);
// results: the travesty-structure is initialized (or created)
voidinitialize_travesty();
// requires: initial_travesty() is already executed
// results: the pair (prefix , s) is added to travesty-structure
voidupdate_travesty(char prefix, char s);
// requires: update_travesty() is already executed
// results: prefix is chosen with probability equal to its frequency
charchoose_prefix();
// results: suffix s is chosen with proability equal to its frequency w.r.t. prefix
charchoose_suffix(char prefix);
// results: the frequencies of prefixes and suffixes in the travesty-structure is printed out in a useful way
voidprint_travesty();
//*************
// main program
//*************
int main() {
unsigned seed = time(0); // get the system time
srand(seed); // seed the random number generator
// put your code here
return 0;
}
//*********************
// function definitions
//*********************
unsigned random(unsigned limit) {
return (rand() % limit + 1);
}
charchoose_prefix() {
unsigned r, accum;
charcurr_table_index;
r = random(total_prefix_count); // pick a number r that is randomly and uniformly distributed between 1 and total_prefix_count
accum = 0;
curr_table_index = ‘@’-1; // the character, ‘?’, precedes ‘@’
while (accum< r) {
curr_table_index++; // the succeed character
accum += prefix_count[curr_table_index-‘@’]; // accumulate the frequency of each prefix
}
return ((curr_table_index == ‘@’) ? ‘ ‘ :curr_table_index); // return the actual character
}
Solution
travesty.cpp
/*
*
* program description:
* anenglish text is given as input. the program will read in
* the text and store in a data structure which is called travesty
* structure. it consists of an array of prefix table. the array
* is index by a fixed length prefix. the prefix table has two fields.
* one is the total number of a particular prefix that appear in the text, while
* the other field is the array of suffix count which is index by a single
* character suffix from ‘A’ to ‘Z’ and ‘@’.
*
* the first part of the program will fill the travesty structure by reading
* one (prefix, s) pair at a time from the input english text.
*
* the second part of the program will print out the travesty structure based on
* the frequency of a prefix or suffix that appear in the original text.
*
* the third part of the program is to print out the random text. i use a
* “line_length” constant to determine the output line length.
*
*
* assumptions:
* 1. the input text must be in CAPITAL letters and only consists of
* ‘A’ to ‘Z’ and ‘ ‘.
*
* 2. the file size must be greater than the prefix size
*
* 3. all the newline marks are treated as blanks
*
* 4. the end of file mark indicates the end of file and it won’t be
* treated as a character. Therefore, the last valid character of the file
* will be treated as suffix only
*
*/
#include // srand() and rand()
#include // time()
#include // toupper()
#include // printf() to easy format the table
#include // ifstream
usingstd::ifstream;
//********************
// global declarations
//********************
// constants
const unsigned OUTPUT_SIZE = 1000;
const unsigned LINE_LENGTH = 80;
const unsigned PREFIX_SIZE = 1;
const unsigned NUM_TABLE_INDEX = 27;
// travesty structure
chartable_index[NUM_TABLE_INDEX] = {‘@’,
‘A’, ‘B’, ‘C’, ‘D’, ‘E’, ‘F’, ‘G’, ‘H’, ‘I’, ‘J’, ‘K’, ‘L’, ‘M’,
‘N’, ‘O’, ‘P’, ‘Q’, ‘R’, ‘S’, ‘T’, ‘U’, ‘V’, ‘W’, ‘X’, ‘Y’, ‘Z’};
unsignedprefix_count[NUM_TABLE_INDEX]; // count for each prefix
unsignedsuffix_count_wrt_prefix[NUM_TABLE_INDEX][NUM_TABLE_INDEX]; // 2D array of suffix count w.r.t. prefix
unsignedtotal_prefix_count; // total number of prefix
//********************
// function prototypes
//********************
// random returns a number randomly and uniformly between 1 and limit
unsigned random(unsigned);
// results: the travesty-structure is initialized (or created)
voidinitialize_travesty();
// requires: initial_travesty() is already executed
// results: the pair (prefix , s) is added to travesty-structure
voidupdate_travesty(char prefix, char s);
// requires: update_travesty() is already executed
// results: prefix is chosen with probability equal to its frequency
charchoose_prefix();
// results: suffix s is chosen with proability equal to its frequency w.r.t. prefix
charchoose_suffix(char prefix);
// results: the frequencies of prefixes and suffixes in the travesty-structure is printed out in a useful way
voidprint_travesty();
//*************
// main program
//*************
int main() {
ifstream in; // input file
unsigned i;
unsignedlen; // when output, len is the number of characters in current line.
char c1, c2; // store two continuous characters in the file
unsigned seed = time(0); // get the system time
srand(seed); // seed the random number generator
// initialize the data structure
initialize_travesty();
// open the input file and read character one by one
// to feed the data structure
in.open(“data”);
if (in.fail()) {
printf(“does not find data file in the current folder.\n”);
return -1;
}
if (in.get(c1)) {
while (in.get(c2)) {
// update the pair (c1, c2) in the data structure
update_travesty(c1, c2);
c1 = c2;
}
}
in.close();
// for debug purpose, print the table
print_travesty();
printf(“\n\nOUTPUT :\n\n”);
// now generate the output
if (OUTPUT_SIZE > 0) {
c1 = choose_prefix();
len = 0;
for (i = 0; i < OUTPUT_SIZE; i++) {
printf(“%c”, c1);
len ++; // increase the length of current line
if (len == LINE_LENGTH) { // generate a new line
len = 0;
printf(“\n”);
}
c1 = choose_suffix(c1); // generate next character
}
}
return 0;
}
//*********************
// function definitions
//*********************
unsigned random(unsigned limit) {
return (rand() % limit + 1);
}
charchoose_prefix() {
unsigned r, accum;
charcurr_table_index;
r = random(total_prefix_count); // pick a number r that is randomly and uniformly distributed between 1 and total_prefix_count
accum = 0;
curr_table_index = ‘@’-1; // the character, ‘?’, precedes ‘@’
while (accum< r) {
curr_table_index++; // the succeed character
accum += prefix_count[curr_table_index-‘@’]; // accumulate the frequency of each prefix
}
return ((curr_table_index == ‘@’) ? ‘ ‘ :curr_table_index); // return the actual character
}
voidinitialize_travesty() {
unsignedint i, j;
// initialize all the counts of prefix and suffix to 0.
for (i = 0; i < NUM_TABLE_INDEX; i++) {
prefix_count[i] = 0;
for (j = 0; j < NUM_TABLE_INDEX; j++) {
suffix_count_wrt_prefix[i][j] = 0;
}
}
total_prefix_count = 0;
}
voidupdate_travesty(char prefix, char s) {
unsigned i, j; // index of prefix and s in the table
// replace any space with @ (internal symbol for spaces)
if (isspace(prefix)) {
prefix = ‘@’;
}
if (isspace(s)) {
s = ‘@’;
}
// calculate the index of prefix and s in the table
i = toupper(prefix) – ‘@’;
j = toupper(s) – ‘@’;
// increase the counts
prefix_count[i] ++;
suffix_count_wrt_prefix[i][j] ++;
total_prefix_count ++;
}
charchoose_suffix(char prefix) {
unsigned i, j, r, accum, curr_table_index;
unsignedtotal_suffix_count;
// replace any spaces with @ (internal symbol for spaces)
if (isspace(prefix)) {
prefix = ‘@’;
}
// calculate the total number of suffix for the given prefix
i = toupper(prefix) – ‘@’;
total_suffix_count = 0;
for (j = 0; j < NUM_TABLE_INDEX; j++) {
total_suffix_count += suffix_count_wrt_prefix[i][j];
}
// the following logic is similar to the code in choose_prefix
// the difference is here we choose a suffix in the array suffix_count_wrt_prefix[i].
//
r = random(total_suffix_count); // pick a number r that is randomly and uniformly distributed between 1 and total_suffix_count
accum = 0;
curr_table_index = ‘@’-1; // the character, ‘?’, precedes ‘@’
while (accum< r) {
curr_table_index++; // the succeed character
accum += suffix_count_wrt_prefix[i][curr_table_index-‘@’]; // accumulate the frequency of each prefix
}
return ((curr_table_index == ‘@’) ? ‘ ‘ :curr_table_index);
}
voidprint_travesty() {
unsigned i, j;
// print the total header
printf(“%5s %5s”, “CNT”, “PFX”);
for (i = 0; i < NUM_TABLE_INDEX; i++) {
printf(” %5c”, table_index[i]);
}
printf(“\n”);
// for each prefix, print the count of each suffix
for (i = 0; i < NUM_TABLE_INDEX; i++) {
printf(“%5d %5c”, prefix_count[i], table_index[i]);
for (j = 0; j < NUM_TABLE_INDEX; j++) {
printf(” %5d”, suffix_count_wrt_prefix[i][j]);
}
printf(“\n”);
}
// print the total number of prefix
printf(“total number of ALL prefixes count is : %d\n”, total_prefix_count);
}