Previous: 14.1.4 Variable Definition vs Declaration
Up: 14.1 Storage Classes
Next: 14.1.6 Static Variables

14.1.5 An Example: Lexical Scanner

To illustrate the use of external storage class variables, let us now consider an example in which a good program design is facilitated by the use of an external variable. The task is to find the next token in an input stream of characters. A token is a useful chunk of characters in the input stream, e.g. an operator, an identifier, an integer, a floating point number, etc. Tokens are also called symbols. A function that finds the next token in an input stream and identifies its type is called a lexical scanner. For our example, we will write a simple lexical scanner, get_token(), to find the next token and its type until an end of file is reached.

We will assume that the only valid tokens in the input stream to be identified by the program are either integers or operators. Further, we assume that integers can have no more than five digit characters and the operator can have no more than a single character. The operators allowed are +, -, *, /. If an integer type token exceeds the size limit, an oversize type is to be identified. White space characters between tokens are to be ignored. Any other character is an invalid character which is to be identified as an illegal type of token. Finally, the end of file is to be signaled by an end_of_text type of token.

We assume that get_token() determines the next token in the input stream and its type. We use a file symdef.h for all the defines. The function prototype for get_token() is included in symtok.h. The function takes two arguments: a string for the token, and the maximum size of the token. The function returns the type of the token, a symbolic constant with an integer value. The files symdef.h and symtok.h are shown in Figure 14.5.

The logic for the driver is straightforward and the implementation is in the file called symbol.c shown in Figure 14.6.

A loop executes as long as there is a new token; each iteration prints a token and its type. At end of file, get_token() returns the token type EOT, the loop terminates, and the program ends. The size limit on a token is defined by LIM. The string symbol has size LIM + 1 to accommodate the terminating null character ('\0').

Here is our logic for get_token(). The function scans the input stream, skipping over any leading white space. The first non-white character determines the type of token to build. For example, if it is a digit character, the function builds a token of type INT using a loop: as long as the input character is a digit and the token size limit is not exceeded, the character is appended to the token string. If the size limit is exceeded, the type becomes OVR and the digit is discarded; digits continue to be discarded until a non-digit character is read. In either case, the loop ends when a non-digit character is read, the token string is terminated with a null character ('\0'), and the token type is returned. The non-digit character that ended the loop must be returned to the input stream so that it is available for building the next token. For example, if it is an operator, +, it must be used in building the next token; if it were discarded, it would be lost.

We will assume that the desired I/O actions are performed using an "effective input stream". We will write two functions, getchr() and ungetchr(c), for I/O with the effective stream. The function getchr() reads a character from the effective input stream, and ungetchr(c) puts a character, c, back into the effective input stream. Assuming these functions, the algorithm for building an integer type token is simple:

     if (isdigit(c)) {             /* if c is a digit, */
          type = INT;              /* type is integer */
          while (isdigit(c)) {     /* repeat as long as c is a digit: */
               if (i < lim)        /* if the size limit is not exceeded, */
                    s[i++] = c;    /* append the digit char; */
               else
                    type = OVR;    /* otherwise, we have an oversize token */
               c = getchr();       /* get the next input char */
          }
          s[i] = '\0';             /* append the terminating null character */
          ungetchr(c);             /* put back the extra char read */
     }

The prototypes for the functions getchr() and ungetchr() are:

     /* File: symio.h */
     int getchr(void);
     void ungetchr(int c);

Assuming these functions are available in the source file, symio.c, we can implement the function get_token() in Figure 14.7.
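To make the logic described above concrete, here is a minimal sketch of how get_token() might be written. It is an illustration under stated assumptions, not necessarily the book's implementation. The names INT, OVR and EOT come from the text; OPR, ILL and all numeric values are assumed. So that the sketch can be run without symio.c, getchr() and ungetchr() are replaced by hypothetical stand-ins that read from a string (set_input(), input and pos are invented for this example).

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Token types. INT, OVR and EOT are named in the text; OPR, ILL and
   the numeric values are assumptions made for this sketch. */
#define EOT 0   /* end of text */
#define INT 1   /* integer */
#define OPR 2   /* operator (assumed name) */
#define OVR 3   /* oversize integer */
#define ILL 4   /* illegal character (assumed name) */

/* Hypothetical stand-ins for symio.c: the effective input stream is a
   string here, not standard input. */
const char *input = "";   /* text being scanned */
int pos = 0;              /* index of the next unread character */

void set_input(const char *text) { input = text; pos = 0; }

int getchr(void)
{
    return input[pos] != '\0' ? (unsigned char)input[pos++] : EOF;
}

void ungetchr(int c)
{
    if (c != EOF && pos > 0)   /* step back over the char just read */
        pos--;
}

/* Stores the next token in s (room for lim + 1 chars) and returns its type. */
int get_token(char *s, int lim)
{
    int c, i = 0, type;

    assert(s != NULL && lim > 0);

    do {                               /* skip leading white space */
        c = getchr();
    } while (c != EOF && isspace(c));

    if (c == EOF) {
        type = EOT;                    /* empty token at end of text */
    } else if (isdigit(c)) {
        type = INT;
        while (isdigit(c)) {
            if (i < lim)
                s[i++] = (char)c;      /* append the digit */
            else
                type = OVR;            /* too long: discard the digit */
            c = getchr();
        }
        ungetchr(c);                   /* put back the extra char read */
    } else {
        s[i++] = (char)c;              /* single-character token */
        type = (c == '+' || c == '-' || c == '*' || c == '/') ? OPR : ILL;
    }
    s[i] = '\0';
    return type;
}
```

A caller passes a buffer of lim + 1 characters, for example char symbol[LIM + 1] in the driver, and calls get_token() repeatedly until it returns EOT.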

Finally, we are ready to write the functions getchr() and ungetchr() in a separate file. We will use a one-character buffer to simulate the effective input stream. When a character is to be returned to the input stream, ungetchr() places it in the buffer. When a character is to be read, getchr() examines the buffer first: if it holds a character, that character is taken as the next input character; if it is empty, a new character is read from standard input using getchar(). Thus, the buffer serves as an adjunct to the input stream, and the two functions together behave as if characters were read from, and returned to, the input stream itself.

Both getchr() and ungetchr() must access the buffer. However, get_token() should not be concerned with the details of accessing the input stream or maintaining the buffer; such details should be hidden from the rest of the program. Such information hiding is an important component of modular program design.

To achieve this information hiding, we put getchr() and ungetchr() in a separate file together with the external variable used as a one character buffer which is accessible to both getchr() and ungetchr(). Figure 14.8 shows the implementation.

The external variable for the character buffer used in the file symio.c makes it unnecessary for other functions to pass a buffer variable as an argument in function calls to getchr() and ungetchr(). Separation of these functions and the external variable they use into a distinct file makes for a modular program design. No other function needs access to the external variable defined in the file symio.c.
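The buffering scheme described above might look like the following sketch of symio.c. How the original marks the buffer as empty is not stated in the text, so the sentinel EMPTY and the variable name buffer are assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed sentinel: a value that is neither a valid character nor EOF. */
#define EMPTY (EOF - 1)

/* External variable shared by getchr() and ungetchr(): the
   one-character buffer. It is defined outside any function, so both
   functions in this file can use it without receiving it as an argument. */
int buffer = EMPTY;

/* Returns the buffered character if there is one; otherwise reads a
   new character from standard input. */
int getchr(void)
{
    int c;

    if (buffer != EMPTY) {
        c = buffer;
        buffer = EMPTY;      /* the buffer is consumed by this read */
        return c;
    }
    return getchar();
}

/* Saves c so that the next call to getchr() returns it. */
void ungetchr(int c)
{
    assert(buffer == EMPTY); /* only one character can be put back */
    buffer = c;
}
```

Because the buffer holds a single character, a second ungetchr() without an intervening getchr() is an error in this sketch. That is sufficient for get_token(), which never reads more than one character beyond the end of a token.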

The standard library function ungetc() pushes a character back onto a stream; some compilers also provide a nonstandard ungetch(), which returns its argument to the keyboard buffer. We could also have used ungetc(c, stdin) and getchar() to handle the above tasks of getting and ungetting characters from the standard input stream.
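As a sketch of that alternative, the two routines can be thin wrappers around getchar() and ungetc(), the pushback function declared in <stdio.h> by standard C. The function names are kept the same as in symio.h so that get_token() would not need to change; the assert() is an addition made for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Reads the next character from standard input. Any character pushed
   back with ungetc() is returned first. */
int getchr(void)
{
    return getchar();
}

/* Pushes c back onto standard input. EOF is not pushed back: ungetc()
   would reject it and leave the stream unchanged. */
void ungetchr(int c)
{
    if (c != EOF) {
        int r = ungetc(c, stdin);
        assert(r != EOF);   /* one character of pushback is guaranteed */
        (void)r;            /* avoid an unused warning under NDEBUG */
    }
}
```

With this version the library maintains the pushback storage, so no external variable is needed. Standard C guarantees only one character of pushback, which is the same limit as the one-character buffer.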

A sample run of the program symbol.c is shown below:

In the first input line, we use blanks to separate the tokens. We also have an oversize token in this case. In the second input line, no blanks are used to separate the tokens. Finally, the last line includes many illegal characters. In each case, the longest possible token is built.

While we caution against the use of external variables as a rule, there are occasions when the use of external variables results in better programs. The deciding factor should always be better program design that provides modularity and flexibility, and that facilitates debugging.




tep@wiliki.eng.hawaii.edu
Sat Sep 3 07:21:51 HST 1994