lp.parse
Class LpLexer

java.lang.Object
  extended by lp.parse.LpLexer
All Implemented Interfaces:
Closeable
Direct Known Subclasses:
LpLookaheadLexer

public class LpLexer
extends Object
implements Closeable

Class that tokenizes a textual input. The nextToken() method reads tokens as they appear in the input. The get* methods (getTokenType(), getLexem(), getLineNumber(), getPosition() and getToken()) return information about the last token read. All whitespace (as defined by Character.isWhitespace(char)) and line comments (everything from a '%' character up to the next line break or the end of input) are ignored, i.e. they do not generate any tokens. Nine types of tokens are recognized:

  1. LpTokenType.LEFT_PAREN -- a left parenthesis '('
  2. LpTokenType.RIGHT_PAREN -- a right parenthesis ')'
  3. LpTokenType.COMMA -- a comma ','
  4. LpTokenType.DOT -- a dot '.'
  5. LpTokenType.RULE_ARROW -- a string "<-" or a string ":-"
  6. LpTokenType.LOWERCASE_WORD -- a string of characters from the set {'_', 'a', 'b', ..., 'z', 'A', 'B', ..., 'Z', '0', '1', ..., '9'} that does not begin with an uppercase letter, i.e. ([_a-z0-9][_a-zA-Z0-9]*). The token is parsed greedily: it ends only when the next character does not belong to the set above, whether that character is whitespace, the beginning of a comment or anything else.
  7. LpTokenType.UPPERCASE_WORD -- a string of characters from the set {'_', 'a', 'b', ..., 'z', 'A', 'B', ..., 'Z', '0', '1', ..., '9'} that begins with an uppercase letter, i.e. ([A-Z][_a-zA-Z0-9]*).
  8. LpTokenType.EOF -- returned when the end of input is reached and on every subsequent call to nextToken().
  9. LpTokenType.UNKNOWN_CHAR -- returned if a character occurs that cannot be matched against any other token (to be precise, it is none of the following: whitespace, part of a line comment, '(', ')', ',', '.', a '<' or ':' followed by a '-', '_', a digit, or a lower- or uppercase letter). After this token is returned by nextToken(), getLexem() returns a string of length 1 containing the unrecognized character.
Example: If you execute this code:
LpLexer l = new LpLexer();
l.setInput("Simple, short sentence.");
l.nextToken();
LpTokenType t = l.getTokenType();
while (t != LpTokenType.EOF) {
    System.out.println("token: " + t.toString() + "; lexem: "
            + l.getLexem() + "; line number: " + l.getLineNumber()
            + "; position: " + l.getPosition());
    l.nextToken();
    t = l.getTokenType();
}
l.close();
 
you should get the following output:
 token: UPPERCASE_WORD; lexem: Simple; line number: 1; position: 1
 token: COMMA; lexem: ,; line number: 1; position: 7
 token: LOWERCASE_WORD; lexem: short; line number: 1; position: 9
 token: LOWERCASE_WORD; lexem: sentence; line number: 1; position: 15
 token: DOT; lexem: .; line number: 1; position: 23
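
The two word patterns from the token list can also be tried out in isolation. The following is a minimal sketch, not LpLexer's actual implementation: the class name WordTokenSketch, the classify() helper and the "NOT_A_WORD" label are hypothetical and exist only for illustration. It applies the two regular expressions quoted above to whole strings.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; names in this class are not part of lp.parse.
class WordTokenSketch {
    // ([_a-z0-9][_a-zA-Z0-9]*) as given in the LOWERCASE_WORD description
    static final Pattern LOWERCASE_WORD = Pattern.compile("[_a-z0-9][_a-zA-Z0-9]*");
    // ([A-Z][_a-zA-Z0-9]*) as given in the UPPERCASE_WORD description
    static final Pattern UPPERCASE_WORD = Pattern.compile("[A-Z][_a-zA-Z0-9]*");

    // Returns the name of the word token type matching the whole string,
    // or the hypothetical label "NOT_A_WORD" if neither pattern matches.
    static String classify(String s) {
        if (UPPERCASE_WORD.matcher(s).matches()) {
            return "UPPERCASE_WORD";
        }
        if (LOWERCASE_WORD.matcher(s).matches()) {
            return "LOWERCASE_WORD";
        }
        return "NOT_A_WORD";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Simple", "short", "_x1", "42", "a-b"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Under these patterns a string such as "42" or "_x1" counts as a lowercase word, because the only requirement is that the first character is not an uppercase letter.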
 

Version:
1.0.0
Author:
Martin Slota
See Also:
LpTokenType, LpToken

Field Summary
private  int la
          The lookahead character.
private  StringBuilder lexem
          A StringBuilder where the lexem corresponding to the last token read is kept.
private  int lineNumber
          A container for the number of the line on which the last token occurred.
private  int position
          A container for the position of the last token's beginning within a line.
private  Reader reader
          The reader used to read the input.
private  LpTokenType type
          Type of the last token read.
 
Constructor Summary
LpLexer()
          Creates a new instance of LpLexer.
 
Method Summary
private  void appendOne()
          Appends the current lookahead character to lexem and reads a new one.
 void close()
          Closes the underlying reader.
 String getLexem()
          Returns the lexem corresponding to the last token read.
 int getLineNumber()
          Returns the number of the input line on which the last token occurred.
 int getPosition()
          Returns the position of the last token's beginning within the line of input it's on.
 LpToken getToken()
          Returns an LpToken instance containing information about the last token read.
 LpTokenType getTokenType()
          Returns the type of the last token read.
protected  void initialize()
          Reinitializes members and reads the first lookahead character.
private  boolean isWordLetter(char c)
          Determines if a character belongs to the set {'_', 'a', 'b', ..., 'z', 'A', 'B', ..., 'Z', '0', '1', ..., '9'}.
 void nextToken()
          Reads the next token occurring in the input.
private  void readNewLA()
          Reads one character from the input and stores it in the lookahead container la.
 void setInput(File file)
          Sets the contents of the given file as an input for this LpLexer.
 void setInput(CharSequence input)
          Sets the character input of this LpLexer.
 void setInput(Reader reader)
          The given character reader will be used as input for this LpLexer.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

reader

private Reader reader
The reader used to read the input.


la

private int la
The lookahead character. See readNewLA().


type

private LpTokenType type
Type of the last token read. See getTokenType().


lexem

private final StringBuilder lexem
A StringBuilder where the lexem corresponding to the last token read is kept. See getLexem().


lineNumber

private int lineNumber
A container for the number of the line on which the last token occurred. See getLineNumber() for more information on how lines are numbered.


position

private int position
A container for the position of the last token's beginning within a line. See getPosition().

Constructor Detail

LpLexer

public LpLexer()
Creates a new instance of LpLexer.

Method Detail

setInput

public void setInput(CharSequence input)
Sets the character input of this LpLexer. A StringReader is used to read the input character by character. Also resets information about the previously read token to the default values (as if no token was read before).

Parameters:
input - string with input for the LpLexer
Throws:
IllegalArgumentException - if input is null

setInput

public void setInput(File file)
Sets the contents of the given file as an input for this LpLexer. The default system encoding is used to read the contents of the file. Also resets information about the previously read token to the default values (as if no token was read before).

Parameters:
file - the file with input for this LpLexer
Throws:
IOException - (wrapped in an ExceptionAdapter) in case an I/O exception occurs while opening or reading the file
IllegalArgumentException - if file is null

setInput

public void setInput(Reader reader)
The given character reader will be used as input for this LpLexer. Also resets information about the previously read token to the default values (as if no token was read before).

Parameters:
reader - a reader with input for the LpLexer
Throws:
IOException - (wrapped in an ExceptionAdapter) in case an I/O exception occurs while reading from the Reader
IllegalArgumentException - if reader is null
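
The three setInput variants all end up with a character Reader. The following is a rough sketch of how such a mapping could look, not the class's actual code: InputSourceSketch, fromText(), fromFile() and firstChar() are hypothetical names, and java.io.UncheckedIOException is used as a stand-in for ExceptionAdapter, whose API is not shown on this page.

```java
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Illustrative sketch only; names in this class are not part of lp.parse.
class InputSourceSketch {
    // A CharSequence is read character by character through a StringReader.
    static Reader fromText(CharSequence input) {
        if (input == null) {
            throw new IllegalArgumentException("input is null");
        }
        return new StringReader(input.toString());
    }

    // A file is read with the platform default charset (FileReader's behaviour).
    static Reader fromFile(File file) {
        if (file == null) {
            throw new IllegalArgumentException("file is null");
        }
        try {
            return new FileReader(file);
        } catch (IOException e) {
            // Stand-in for the ExceptionAdapter wrapping described above.
            throw new UncheckedIOException(e);
        }
    }

    // Reads one character, wrapping a checked IOException in an unchecked one.
    static int firstChar(Reader reader) {
        try {
            return reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println((char) firstChar(fromText("p(X).")));
        try {
            fromFile(null);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Wrapping the checked IOException in an unchecked exception is what lets methods like these be declared without a throws clause, as the signatures on this page are.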

initialize

protected void initialize()
Reinitializes members and reads the first lookahead character.

Throws:
IOException - (wrapped in an ExceptionAdapter) in case an I/O error occurs while reading the first lookahead character

close

public void close()
Closes the underlying reader. If setInput(CharSequence) or setInput(File) was used to set the current character source, this method should be called when no more tokens are required from the source. In other cases it is up to the programmer whether she will close the Reader given to setInput(Reader) herself or call this method.

Specified by:
close in interface Closeable
Throws:
IOException - (wrapped in an ExceptionAdapter) in case an I/O exception occurs while closing the underlying Reader

nextToken

public void nextToken()
Reads the next token occurring in the input. More information about tokens can be found in the class description.

Throws:
IOException - (wrapped in an ExceptionAdapter) in case an I/O exception occurs while reading the input

getTokenType

public LpTokenType getTokenType()
Returns the type of the last token read. This method is not meant to be called before nextToken() has been called at least once after the last setInput() call; if it is, null is returned. Similarly, null is returned if close() has already been called.

Returns:
type of the last token read

getLexem

public String getLexem()
Returns the lexem corresponding to the last token read. If it is an LpTokenType.EOF token, an empty string is returned. This method is not meant to be called before nextToken() has been called at least once after the last setInput() call; if it is, null is returned. Similarly, null is returned if close() has already been called.

Returns:
lexem corresponding to the last token read

getLineNumber

public int getLineNumber()
Returns the number of the input line on which the last token occurred. Lines are numbered from 1 (see the example in the class description). A new line starts when either a '\n' or a '\r' character is detected, with one exception: a '\n' character immediately following a '\r' character is ignored, i.e. it is not considered another line delimiter. This method is not meant to be called before nextToken() has been called at least once after the last setInput() call; if it is, -1 is returned. Similarly, -1 is returned if close() has already been called.

Returns:
the number of the input line on which the last token occurred
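
The line delimiter rule can be sketched as follows. This is an illustration of the rule as described, not the lexer's own code; LineNumberSketch and lineOf() are hypothetical names.

```java
// Illustrative sketch only; names in this class are not part of lp.parse.
class LineNumberSketch {
    // Returns the 1-based line number of the character at the given index.
    static int lineOf(String text, int index) {
        int line = 1;
        boolean lastWasCR = false;
        for (int i = 0; i < index; i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // A '\r' always starts a new line.
                line++;
                lastWasCR = true;
            } else {
                // A '\n' starts a new line unless it directly follows a '\r'.
                if (c == '\n' && !lastWasCR) {
                    line++;
                }
                lastWasCR = false;
            }
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(lineOf("a\r\nb", 3)); // "\r\n" counts as one delimiter
        System.out.println(lineOf("a\n\nb", 3)); // two separate delimiters
        System.out.println(lineOf("a\r\rb", 3)); // two separate delimiters
    }
}
```

With this rule, Unix ("\n"), old Mac ("\r") and Windows ("\r\n") line endings each advance the line number by exactly one.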

getPosition

public int getPosition()
Returns the position of the last token's beginning within the line of input it is on. The characters on a line are numbered from 1 (see the example in the class description). A tab also counts as 1 character. This method is not meant to be called before nextToken() has been called at least once after the last setInput() call; if it is, -1 is returned. Similarly, -1 is returned if close() has already been called.

Returns:
position of the last token's beginning within a line of input
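
As a rough illustration of the position numbering, the sketch below recomputes the positions shown in the class description example. PositionSketch and positionOf() are hypothetical names, and the code is an assumption about how the numbering works rather than an excerpt from the lexer.

```java
// Illustrative sketch only; names in this class are not part of lp.parse.
class PositionSketch {
    // Returns the 1-based position within its line of the character at index.
    // Every character, including a tab, advances the position by exactly 1.
    static int positionOf(String text, int index) {
        int position = 1;
        for (int i = 0; i < index; i++) {
            char c = text.charAt(i);
            if (c == '\n' || c == '\r') {
                position = 1; // the next character starts a new line
            } else {
                position++;
            }
        }
        return position;
    }

    public static void main(String[] args) {
        String input = "Simple, short sentence.";
        System.out.println(positionOf(input, input.indexOf("short")));    // 9
        System.out.println(positionOf(input, input.indexOf("sentence"))); // 15
        System.out.println(positionOf("\tfoo", 1)); // 2, the tab counts as 1
    }
}
```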

getToken

public LpToken getToken()
Returns an LpToken instance containing information about the last token read. The information is read using the getTokenType(), getLexem(), getPosition() and getLineNumber() methods.

Returns:
an LpToken instance containing information about the last token read

readNewLA

private void readNewLA()
Reads one character from the input and stores it in the lookahead container la. Updates lineNumber and position.

Throws:
IOException - (wrapped in an ExceptionAdapter) in case an I/O exception occurs while reading the character
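
A one-character lookahead of this kind can be sketched as below. This is a simplified illustration, not the class's source: LookaheadSketch, readWord() and lookahead() are hypothetical names, line and position tracking are left out, and java.io.UncheckedIOException stands in for ExceptionAdapter. Reader.read() returns an int and uses -1 to signal the end of input, which is presumably why la is declared as int rather than char.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Illustrative sketch only; names in this class are not part of lp.parse.
class LookaheadSketch {
    private final Reader reader;
    private int la; // lookahead character, or -1 at the end of input

    LookaheadSketch(Reader reader) {
        this.reader = reader;
        readNewLA(); // prime the lookahead before the first token is read
    }

    // Reads one character from the input into la.
    private void readNewLA() {
        try {
            la = reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // stand-in for ExceptionAdapter
        }
    }

    static boolean isWordLetter(char c) {
        return c == '_' || (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    // Greedily collects word characters; stops at the first other character,
    // which stays in la for whatever is read next.
    String readWord() {
        StringBuilder lexem = new StringBuilder();
        while (la != -1 && isWordLetter((char) la)) {
            lexem.append((char) la);
            readNewLA();
        }
        return lexem.toString();
    }

    int lookahead() {
        return la;
    }

    public static void main(String[] args) {
        LookaheadSketch s = new LookaheadSketch(new StringReader("short."));
        System.out.println(s.readWord());         // short
        System.out.println((char) s.lookahead()); // .
    }
}
```

Keeping the terminating character in the lookahead rather than consuming it is what allows a word to end directly before a '.', a ',' or a comment without losing that character.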

appendOne

private void appendOne()
Appends the current lookahead character to lexem and reads a new one.

Throws:
IOException - (wrapped in an ExceptionAdapter) in case an I/O exception occurs while reading the new lookahead character

isWordLetter

private boolean isWordLetter(char c)
Determines if a character belongs to the set {'_', 'a', 'b', ..., 'z', 'A', 'B', ..., 'Z', '0', '1', ..., '9'}.

Parameters:
c - the character in question
Returns:
true if it does belong to the set mentioned above, false otherwise.