JavaCC 简介

红薯 发布于 2009/04/22 22:29
阅读 3K+
收藏 0

Introduction

Most of the time when it is needed to parses a file or stream, programmers tend to depend on “Tokenizer” or “streamTokenizer” rather creating a parser.Ofcourse creating a parser is time consuming since it needs iterative testing of all possible states.However it makes your application robust and error free particularly when dealing files with specific format. Once you get a good starting with creating simple parser, sure you will find it as a better alternative in many hectic situations.This article targets beginners to parser development. This will give you an idea of creating a parser through suitable step by step simple example. At the end of the tutorial we will parse an sql file and extract table specifications (please note that this is for an illustrative purpose, complete sql formats are not supported)

Prerequisites

JavaCC works fine with java VM 1.2 or greater. You need to install JavaCC.For help on installation referhttps://javacc.dev.java.net/doc/installhelp.html If you are using eclipse, then there are free JavaCC plug-in available. Download and install the plug-in (Google for easy-javacc-1.5.7.exe, a free eclipse plug-in for JavaCC)

What is JavaCC?

Java Compiler Compiler is an open source parser generator for java programs. Unlike YACC (Yet Another Compiler Compiler).JavaCC is a top down parser for LL type grammars. So strictly speaking, LR parsing (left recursion) is not possible in JavaCC. JavaCC creates LL parsers for context free grammars (a context free grammar contains production rulesin format NT -->T,Where NT is a known terminal and t is a combination of terminals and or non terminals).
An LL parser parses the input from left to right and create the leftmost derivation of a sentence when compared tothe LR parser where the right most derivation of a sentence is created. These kinds of parsers uses next tokens to take the parsing decisions without any back tracing (Look Ahead).So these LL Parsers not much complicated and hence widely used even if it is fairly restrictive.

Before Starting

Parser and lexer are software components that describe the syntax and semantics of any characters stream. Lexical analyzer identifies and separates a set of predefined character known as Tokens. Parser specifies semantics of different token in a statement and/or different statements. So parser can be easily used to check the structure file and to extract specific components from the file.
For example, in the following java statement

Collapse
String test = "Testing";

The lexical analysis finds following tokens

Collapse
{ "String"|" "|"Test"|" "|"="|" "| "\"" |"Testing"| "\""|";"}

So after lexical analysis we will get the group of tokens

Collapse
{STRING_TYPE|SPACE|VAR_NAME|SPACE|EQUAL|SPACE|D_QUATES|VAR_VAL|D_QUATES|SEMCOLN}

Parser checks the semantic of the tokens identified by the lexer as specified in the grammar file.In case of JavaCC which itself is not a lexical analyzer or parser but it generates java code for lexical analyzer and parser according to the specifications in a context free grammar file. These grammar files are named with extension .jj by convension. These grammar files yield more modularity and are easy to read, modify or write when compared to handwritten java parser and hence it saves a lot of time and effort.

The association of tokens in a file is described by the BNF products. Backus–Naur Form (BNF) is a metasyntax notation used to express context-free grammars. Most of the parser generators support the BNF production rules for specifying the sequence of token kinds in an error free output. In the next sections you will see how token associations are specified using BNF productions in a .jj file in the next sections.

So let us start the creation of the JavaCC file with a very simple example.Suppose we are having a file with an sql create statement like the following

For this example we are not considering fields in the create table statement. We will add more items to the file in the coming steps. It is often found to be a good practice to create and modify grammar in steps with increasing complexities of the files.

In the above example you are having the tokens

Collapse
CTCMD :- "Create Table" //the create table command
TNAME :- "test" //the table name followed
OBRA :- "(" //Opening bracket
CBRA: - ")" //Closing bracket
EOF //End of the file.

Structure of a Grammar File

To generate a parser the only input to JavaCC is a context free grammar file.JavaCC produces 7 java files in the output. The JavaCC grammar file (.jj files) composed of 4 sections.

1. Options 2.Class declaration 3.Specification for lexical analysis 4.BNF notations.

Option section is to specify the optional parameters such as DEBUG, STATIC etc.In this example STATIC is set false to make the parser class nonstatic.so that multiple instance of parser can exist at a time. STSTIC is true by default. Class declaration in the grammar file provides the main entry point in the generated parser. The other section in the grammar file is meant for the specification of token for lexical analysis. Grammar may contain BNF productions rule with specifies the association of tokens that defines the structure of the file. For parsing the above file our grammar file look like following

Collapse
/* Sample grammar file*/ 
options
{STATIC = false ;}
PARSER_BEGIN(SqlParser)
package sqlParserDemo;
class SqlParser {
{Start () ;}
}
PARSER_END (SqlParser)

SKIP: { "\n" | "\r" | "\r\n" |"\\"|"\t"|" "}

TOKEN [IGNORE_CASE]:
{
<CTCMD :("
Create Table")>
| <TNAME :(["
a"-"z"])+ >
| <OBRA :("
(")>| <CBRA:(")")>
}
void Start (): {}
{<CTCMD><TNAME><OBRA><CBRA><EOF>}

In the above grammar file the class declaration section is enclosed in PARSER_BEGIN and PARSER_END keywords. In this section we defines the package, all imports and the parser class. Here “SqlParser” class is having the method initParser which serves as the entry point. The parser throws “ParseException” and “TokenMgrError” exceptions.TokenMgrError is thrown when scanning encounters an undefined tokens. If an undefined state or production encountered parser will throw “ParseException”.By default the parser code created should have a constructor which accept a "reader" type.

In many cases we need to ignore some characters in the file such as formatting characters like Newline, Whitespace, Tabs etc. These sequence may not have relation with the meaning. Such characters can be skipped while scanning if they are specified as the SKIP terminals. In the above grammar file new line, carriage (note that newline representation varies from os-os) and whitespace are specified as the SKIPJA terminals. Scanner reads these characters and ignores them. They are not passed to the parser.

The keyword TOKEN is for specifying tokens. Each token is a character sequence associated with a name."|" character is used to separate tokens.JavaCC provides Token qualifier IGNORE_CASE.It is used to make scanner case insensitive to the tokens. In this example sql is case insensitive and hence "create table" command can be written in any case. if you have case sensitive and case insensitive token then you can specify them in different TOKEN statement. Tokens are enclosed within angle brackets.

Here we have created tokens CTCMD, TNAME, OBRA, CBRA .The character sequence for the tokens is defined with regular expression syntax.(["a"-"z"])+ means a sequence of any number of characters form "a"-"z".

The BNF production rules are specified after token declaration. These seem almost similar to method syntax. one can add any java code in the first set of curly brasis. Usually contains declarations of variables that are used in the production rule. The return type is the intended return type from the BNF. In this example we are checking the file structure only. So the return type is void. The start function in the example initializes the parsing. You should call the Start method from your class declaration. In this example it defines the structure of the file as

{<CTCMD><TNAME><OBRA><CBRA><EOF>}.

Any change in the file format will throw an error.

Generating the Parser

  1. Once you created the grammar file save it in a directory. General name convention follows ".jj" extension, say demogrammar.jj
  2. Go to the command prompt and change directory to demogrammar.jj files directory.
  3. Type command "javacc demogrammar.jj"

If you are using eclipse with JavaCC plug-in installed, then follow the below steps.

  1. Create a new project with package as specified in the grammar file.
  2. For creating the grammar file right click the package and go to new-->other-->Javacc-->Javacc Template file. Create the grammar file and save.
  3. Build the project.

After invoking Javacc on the grammar file (demogrammar.jj) you will get following 7 classes in its own files.

  1. Token: - Class for representing a token. Each token is associated with a 'Token kind' which represent the type of the token.A string 'image' represents the character sequence associated with the token.
  2. TokenMgrError: - Sub class of Error. Throws exceptions on errors in lexical analysis.
  3. ParseException: - Sub class of exception.Throws exceptions on errors detected by parser.
  4. SimpleCharStream :- implementation of interface Character Stream for the lexer.
  5. SqlParserConstants: - an interface for defining the token classes used in the lexer and parser.
  6. SqlParserTokenManager: -The lexical analyzer class
  7. SqlParser: - The parser class

Compile the generated classes (javac *.java).

Running the Parser

After successful compilation you are ready to test a sample file. For testing the example adds a class having main function to the package.

Collapse
 /*for testing the parser class*/ 
public class ParseDemoTest {
public static void main(String[] args) {
try{SqlParserparser = new SqlParser(new FileReader(FilePath));
parser.initParser () ;}

catch (Exception ex)
{ex.printStackTrace() ;}
}

Creating the parser object, you can specify a reader as the constructor argument. Note that the generated parser codecontains constructor that accepts reader.InitParser method initializes the parsing. Now you can build and run “ParseDemoTest”.Exception is thrown if the file specified in the given file path is not confined to the grammar. As we have discussed an overall idea of JavaCC operation, now we adds some more items in the file to parse. In this example we will extract the table specifications provided in the sql file (note that the example is only for illustrative purpose and hence the grammar doesn’t comply with all sql syntax).The new sql file is in following format

Collapse
##JavaccParserExample##### 
CREATE TABLE STUDENT
(
StudentName varchar (20),
Class varchar(10),
Rnum integer,
)

CREATE TABLE BOOKS
(
BookName varchar(10),
Edition integer,
Stock integer,
)
CREATE TABLE CODES
(
StudentKey varchar(20),
StudentCode varchar(20),
)

It is clear from the file that the file can have multiple create statement and each create statement may contain multiple columns. Also there are comments enclose in character "#".For getting the table spec, we need a structure which contains a list of tables and their details. Create a class in the package for representing a table

Collapse
public class TableStruct { 
String TableName;
HashMap<String,String> Variables = new HashMap<String, String> (); }

This class represents the table with table name along with the Colum names mapped to their data type. Below is the modified grammar file for the new file.

On examining the new grammar file you can see a SPECIAL_TOKEN.As mentioned earlier Special Tokens are those which don’t contribute any meaning but still informative such as comments .Here the comment is defined by the special token.Lexer identifies and passes the special tokens to the parser There is no BNF notations for the special tokens. You can see that all the variable declarations and other java codes associated with a BNF notations are enclosed in '{}'.An expression may contain another expressions such as TType = DType().It is a good practice to identify the reusable expressions and specifythem in separate. In the BNF notation for the Variables

Collapse
( TName = <TNAME> 
TType = DType()
<COMMA>
{var.put(TName.image,TType.image);}
)*

The "*" specifies the m,eaning that any number of occurrence of the token sequence is possible. For running and testing this grammar file change your main class as follows

Collapse
public class ParseDemoTest { 
public static void main(String[] args) {
try{
SqlParser parser = new SqlParser (new FileReader("D:\\sqltest.txt"));
ArrayList<TableStruct> tableList = parser.initParser();
for(TableStruct t1 : tableList)
{ System.out.println("--------------------------");
System.out.println("Table Name :"+t1.TableName);
System.out.println("Field names :"+t1.Variables.keySet());
System.out.println("Data Types :"+t1.Variables.values());
System.out.println("--------------------------");
}
}catch (Exception ex)
{ex.printStackTrace() ;}
}
}

Complile grammar file and run the appilication.For the sql test file you will get the output as

Collapse
--------------------------
Table Name :STUDENT
Field names :[StudentName, Rnum, Clas
--------------------------
--------------------------
Table Name :BOOKS
Field names :[Stock, Edition, BookName]
Data Types :[integer, integer, varchar]
--------------------------
--------------------------
Table Name :CODES
Field names :[StudentCode, StudentKey]
Data Types :[varchar, varchar]
--------------------------

Conclusion

JavaCC is a widely used tool for lexical and parser component generation which follows regular expression and BNF notation syntax for lex and parser specifications. Creating a parser is needs iterative step. Never expect to get desired output in one pass. After creating your first parser in the example try to modify it and add other possibilities to the input file.A step by step approach is desired while dealing with complex file. Also try to make expressions common and reusable as far as possible. Once you are comfortable with .jj files you can use more advanced tools like JJTree and JTB with javacc to automate the augmentation of a grammar.

Collapse
/* demo grammar.jj*/
options
{
STATIC = false ;
}
PARSER_BEGIN (SqlParser)
package sqlParserDemo;
import java.util.ArrayList;
import java.util.HashMap;
class SqlParser {
ArrayList<TableStruct> initParser()throws ParseException, TokenMgrError
{ return(init()) ; }
}
PARSER_END (SqlParser)

SKIP: { "\n" | "\r" | "\r\n" |"\\"|"\t"|" "}

TOKEN [IGNORE_CASE]:
{
<CTCMD :("
Create Table")>
|<NUMBER :(["
0"-"9"])+ >
|<TNAME:(["
a"-"z"])+ >
|<OBRA:("
(")+>
|<CBRA:("
)")+>
|<COMMA:("
,")>
}

SPECIAL_TOKEN : {<COMMENT:("
#")+(<TNAME>)+("#")+>}

ArrayList<TableStruct> init():
{
Token T;
ArrayList<TableStruct> tableList = new ArrayList<TableStruct>();
TableStruct tableStruct;
}
{
(
<CTCMD>
T =<TNAME>
{ tableStruct = new TableStruct ();
tableStruct.TableName = T.image ;}
<OBRA>
tableStruct.Variables = Variables()
<CBRA>
{tableList.add (tableStruct) ;}
)*
<EOF>
{return tableList;}
}

HashMap Variables():
{
Token TName;
Token TType;
HashMap<String,String> var = new HashMap<String, String>();
}

(
TName = <TNAME>
TType = DType()
<COMMA>
{var.put(TName.image,TType.image);}
)*
{return var;}
}
Token DType():
{
Token TDType;
}
{
TDType=<TNAME>
[<OBRA><NUMBER><CBRA>]
{return TDType;}
}Create Table test
(
)
返回顶部
顶部