Description

Oleg Kiselyov's basic input-stream tokenizing and parsing routines.

Version

Download

input-parse.egg

Usage

(require-extension input-parse)

Documentation

input-parse is a chicken port of Oleg Kiselyov's input-stream tokenizing and parsing library found at http://okmij.org/ftp/Scheme/parsing.html.

The following is adopted from Oleg's source documentation.

Simple Parsing of Input

The following simple functions surprisingly often suffice to parse an input stream. They either skip, or build and return tokens, according to inclusion or delimiting semantics. The list of characters to expect, include, or to break at may vary from one invocation of a function to another. This allows the functions to easily parse even context-sensitive languages.

EOF is generally frowned on, and thrown up upon if encountered. Exceptions are mentioned specifically. The list of expected characters (characters to skip until, or break-characters) may include an EOF "character", which is to be coded as symbol*eof*

The input stream to parse is specified as a PORT, which is usually he last (and optional) argument. It defaults to the current input port if omitted.

This package relies on a function parser-error, which must be defined by a user of the package. The function has the following signature:

procedure: (parser-error PORT MESSAGE SPECIALISING-MSG*)

Many procedures of this package call parser-error to report a parsing error. The first argument is a port, which typically points to the offending character or its neighborhood. Most of the Scheme systems let the user query a PORT for the current position. MESSAGE is the description of the error. Other arguments supply more details about the problem.

API

Scanning

procedure: (peek-next-char [PORT])

advances to the next character in the PORT and peeks at it. This function is useful when parsing LR(1)-type languages (one-char-read-ahead).

The optional argument PORT defaults to the current input port.

procedure: (assert-curr-char CHAR-LIST STRING [PORT])

Reads a character from the PORT and looks it up in the CHAR-LIST of expected characters. If the read character was found among expected, it is returned Otherwise, the procedure writes a nasty message using STRING as a comment, and quits.

The optional argument PORT defaults to the current input port.

procedure: (skip-until CHAR-LIST [PORT])

Reads and skips characters from the PORT until one of the break characters is encountered. This break character is returned. The break characters are specified as the CHAR-LIST. This list may include EOF, which is to be coded as a symbol *eof*.

The optional argument PORT defaults to the current input port.

procedure: (skip-until NUMBER [PORT])

Skips the specified NUMBER of characters from the PORT and returns #f.

The optional argument PORT defaults to the current input port.

procedure: (skip-while CHAR-LIST [PORT])

Reads characters from the PORT and disregards them, as long as they are mentioned in the CHAR-LIST. The first character (which may be EOF) peeked from the stream that is NOT a member of the CHAR-LIST is returned. This character is left on the stream.

The optional argument PORT defaults to the current input port.

Stream tokenizers

procedure: (next-token PREFIX-CHAR-LIST BREAK-CHAR-LIST [COMMENT-STRING] [PORT])

Skips any number of the prefix characters (members of the PREFIX-CHAR-LIST), if any, and reads the sequence of characters up to (but not including) a break character, one of the BREAK-CHAR-LIST.The string of characters thus read is returned. The break character is left on the input stream. The list of break characters may include EOF, which is to be coded as a symbol *eof*. Otherwise, EOF is fatal, generating an error message including a specified COMMENT-STRING(if any)

The optional argument PORT defaults to the current input port.

Note: since we can't tell offhand how large the token being read is going to be, we make a guess, pre-allocate a string, and grow it by quanta if necessary. The quantum is always the length of the string before it was extended the last time. Thus the algorithm does a Fibonacci-type extension, which has been proven optimal. Note, explicit port specification in read-char, peek-char helps.

procedure: (next-token-of INC-CHARSET [PORT])

Reads characters from the PORT that belong to the list of characters INC-CHARSET . The reading stops at the first character which is not a member of the set. This character is left on the stream. All the read characters are returned in a string.

procedure: (next-token-of PRED [PORT])

Reads characters from the PORT for which PRED (a procedure of one argument) returns non-#f . The reading stops at the first characterfor which PRED returns #f. That character is left on the stream. All the results of evaluating of PRED up to #f are returned in a string.

PRED is a procedure that takes one argument (a character or the EOF object) and returns a character or #f. The returned character does not have to be the same as the input argument to the PRED. For example,

(next-token-of (lambda (c)
		  (cond ((eof-object? c) #f)
			((char-alphabetic? c) (char-downcase c))
			(else #f))))

will try to read an alphabetic token from the current input port, and return it in lower case.

The optional argument PORT defaults to the current input port.

This procedure is similar to next-token but only it implements an inclusion rather than delimiting semantics.

Reading input

procedure: (read-text-line [PORT])

Reads one line of text from the PORT, and returns it as a string. A line is a (possibly empty) sequence of characters terminated by CR, CRLF or LF (or even the end of file). The terminating character (or CRLF combination) is removed from the input stream. The terminating character(s) is not a part of the return string either. If EOF is encountered before any character is read, the return value is EOF.

The optional argument PORT defaults to the current input port.

procedure: (read-string N [PORT])

Reads N characters from the PORT, and returns them in a string. If EOF is encountered before N characters are read, a shorter string will be returned. If N is not positive, an empty string will be returned. The optional argument PORT defaults to the current input port.

Note: In this implementation, this function is part of the 'extras' unit.

Examples

See the source of the cgi-utils egg.