API for implementing a parser generator
My software engineering class is working on generating XML from controlled English for programmatic processing of software resource specifications. To do this we need a grammar and a parser for controlled English. A grammar and an accompanying parser have already been written in Visual Basic, but that implementation is of limited use in a server setting, so a more general solution is desired. This project is to create an object-oriented parser generator.
The inputs to the system will be:
- A file containing the grammar for the language. This file also describes a set of semantic actions to perform as different rules are activated
- An input file to validate against the grammar and generate semantic actions
From these inputs a parser should be created and the accompanying file processed. In reality the parser will likely be initialized once and then several documents will be processed by it.
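One possible shape for that build-once, parse-many interface is sketched below in Python. Everything in it is an assumption made for illustration: the class name `Parser`, the in-memory grammar dictionary (a hand-written stand-in for whatever the grammar file loader would produce), the convention that quoted names are token codes, and the simple backtracking strategy. Semantic actions are plain callables keyed by rule name, and they are fired only after a document has been accepted.

```python
class Parser:
    """Hypothetical parser object: built once from a grammar, reused per document."""

    def __init__(self, grammar, actions):
        self.grammar = grammar  # rule name -> list of alternatives (lists of symbols)
        self.actions = actions  # rule name -> callable run when that rule is activated

    def _symbol(self, symbol, codes, pos):
        # Yield (next position, parse tree) for every way `symbol` matches at `pos`.
        if symbol.startswith("'"):  # quoted symbol: compare with the token code
            if pos < len(codes) and codes[pos] == symbol[1:-1]:
                yield pos + 1, codes[pos]
            return
        for alternative in self.grammar[symbol]:
            for end, children in self._sequence(alternative, codes, pos):
                yield end, (symbol, children)

    def _sequence(self, symbols, codes, pos):
        if not symbols:  # an empty alternative matches without consuming input
            yield pos, []
            return
        for mid, first in self._symbol(symbols[0], codes, pos):
            for end, rest in self._sequence(symbols[1:], codes, mid):
                yield end, [first] + rest

    def _fire(self, tree):
        if isinstance(tree, str):  # a token code has no action of its own
            return
        rule, children = tree
        for child in children:  # children first, so actions run bottom-up
            self._fire(child)
        if rule in self.actions:
            self.actions[rule](children)

    def parse(self, start, codes):
        """Return True and run the semantic actions if `codes` matches `start`."""
        for end, tree in self._symbol(start, codes, 0):
            if end == len(codes):
                self._fire(tree)
                return True
        return False


# Assumed in-memory form of the two `verb phrase` rules.
GRAMMAR = {
    "verb phrase": [["'adverb'", "'verb'"], ["'verb'"]],
}
activated = []
parser = Parser(GRAMMAR, {"verb phrase": lambda children: activated.append(children)})

# The same parser instance handles several documents (given here as token codes).
results = [parser.parse("verb phrase", doc)
           for doc in (["verb"], ["adverb", "verb"], ["adverb"])]
```

Here `results` records which of the three documents were accepted, and `activated` collects the arguments passed to the `verb phrase` action for the accepted ones.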
The format for the grammar will be ISO EBNF. The current grammar in that format is:
- requirements = sentence, rest of requirements;
- requirements = ;
- rest of requirements = '.', requirements;
- rest of requirements = 'conjunction', abbreviation, rest of requirements;
- sentence = 'if', condition, 'then', sentence, executor;
- sentence = noun phrase, predicate;
- abbreviation = noun phrase;
- abbreviation = predicate;
- noun phrase = 'determiner', adjective list, noun;
- noun phrase = noun, prepositional phrase;
- adjective list = 'adjective', adjective list;
- adjective list = ;
- noun = 'noun', phrase head;
- phrase head = 'noun', phrase head;
- phrase head = ;
- predicate = verb phrase, preposition, prepositional phrase;
- predicate = 'w (I don't know what w is)', prepositional phrase;
- condition = predicate, predicate adjective;
- predicate adjective = 'adjective', adjective list;
- predicate adjective = noun phrase;
- verb phrase = 'adverb', 'verb';
- verb phrase = 'verb';
- executor = 'execution', sentence;
- executor = ;
- prepositional phrase = 'preposition', noun phrase, prepositional phrase;
- prepositional phrase = ;
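Because every rule in this form is just `name = symbol, symbol, ...;`, reading the grammar into memory takes very little code. The following Python sketch is only an illustration: the function name `load_rules` and the dictionary layout are assumptions, and it handles only this flat subset of ISO EBNF (no repeats, choices, or comments, and no commas or semicolons inside quoted terminals).

```python
def load_rules(text):
    """Map each rule name to a list of alternatives; each alternative is a list of symbols.

    Quoted symbols (token codes) keep their quotes so they stay distinguishable
    from non-terminals.
    """
    rules = {}
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        name, _, body = chunk.partition("=")
        # An empty body (e.g. "adjective list = ;") becomes an empty alternative.
        symbols = [s.strip() for s in body.split(",") if s.strip()]
        rules.setdefault(name.strip(), []).append(symbols)
    return rules


SAMPLE = """
verb phrase = 'adverb', 'verb';
verb phrase = 'verb';
adjective list = 'adjective', adjective list;
adjective list = ;
"""
rules = load_rules(SAMPLE)
```

Rules that share a left-hand side end up as separate alternatives under one key, and an empty right-hand side is stored as an empty list.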
This format is easier to parse because it has no repetition, but I think the grammar would be easier to read with it. (A repeat of {n,m} means the item must occur at least n times but no more than m times; {n} means at least n times with no upper limit; so ? = {0,1}, * = {0}, + = {1}.) With repetition and choice symbols, the grammar would look like:
- requirements = sentence*;
- sentence = 'if', condition, 'then', sentence, executor?, '.';
- sentence = noun phrase, predicate, ('conjunction', (noun phrase | predicate))*, '.';
- noun phrase = 'determiner', adjective list, noun list;
- noun phrase = noun list, prepositional phrase;
- adjective list = 'adjective'*;
- noun list = 'noun'+;
- predicate = verb phrase, preposition, prepositional phrase;
- predicate = 'w (I don't know what w is)', prepositional phrase;
- condition = predicate, predicate adjective;
- predicate adjective = 'adjective', adjective list;
- predicate adjective = noun phrase;
- verb phrase = 'adverb'?, 'verb';
- executor = 'execution', sentence;
- prepositional phrase = 'preposition', noun phrase, prepositional phrase;
- prepositional phrase = ;
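The repetition suffixes are a shorthand: each one can be rewritten mechanically into recursive rules of the kind used in the first listing. The Python sketch below shows one such rewriting for `?`, `*` and `+` only. The function name `expand_repeat` and the generated helper rule name (`... tail`) are assumptions for illustration, not part of ISO EBNF.

```python
def expand_repeat(name, symbol, suffix):
    """Return repeat-free rules (name, symbols) equivalent to `name = symbol<suffix>;`."""
    if suffix == "?":  # {0,1}: the symbol once, or nothing
        return [(name, [symbol]), (name, [])]
    if suffix == "*":  # {0}: zero or more, via right recursion
        return [(name, [symbol, name]), (name, [])]
    if suffix == "+":  # {1}: one occurrence followed by zero or more
        tail = name + " tail"  # hypothetical helper rule name
        return [(name, [symbol, tail]), (tail, [symbol, tail]), (tail, [])]
    raise ValueError("unsupported repeat suffix: " + suffix)


def show(rules):
    """Render rules in the flat `name = a, b;` form."""
    return [f"{name} = {', '.join(symbols)};" for name, symbols in rules]


star = show(expand_repeat("adjective list", "'adjective'", "*"))
plus = show(expand_repeat("noun list", "'noun'", "+"))
```

`star` reproduces the two `adjective list` rules of the first grammar, and `plus` has the same shape as its `noun` / `phrase head` pair.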
This grammar differs slightly from most in that its "terminals" are not really terminal: they are simply the lowest level that this grammar deals with. The lexer will return tokens with the appropriate code. The parser generator can generate the non-terminals itself, but it will need access to the lexer to get the appropriate code for each terminal.
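To make that dependency concrete, here is a small Python sketch of a lexer that hands out codes and a helper that resolves the grammar's quoted names against it. The `Lexer` class, its `code_for` and `tokenize` methods, the integer codes, and the sample lexicon are all hypothetical; the real lexer's interface may look quite different.

```python
class Lexer:
    """Hypothetical lexer: classifies each word and returns its category code."""

    def __init__(self, lexicon):
        self.lexicon = lexicon  # word -> category name, e.g. "the" -> "determiner"
        # Assign one integer code per category, in sorted order for repeatability.
        self.codes = {cat: i for i, cat in enumerate(sorted(set(lexicon.values())))}

    def code_for(self, category):
        return self.codes[category]

    def tokenize(self, text):
        # Returns (code, word) pairs; an unknown word raises KeyError.
        return [(self.codes[self.lexicon[w]], w) for w in text.lower().split()]


def terminal_table(rules, lexer):
    """Look up the lexer code for every quoted symbol used in the grammar."""
    table = {}
    for alternatives in rules.values():
        for symbols in alternatives:
            for symbol in symbols:
                if symbol.startswith("'"):
                    table[symbol] = lexer.code_for(symbol.strip("'"))
    return table


lexer = Lexer({"the": "determiner", "large": "adjective", "server": "noun"})
# Assumed in-memory form of a few rules from the grammar.
rules = {
    "noun phrase": [["'determiner'", "adjective list", "noun"]],
    "adjective list": [["'adjective'", "adjective list"], []],
    "noun": [["'noun'", "phrase head"]],
    "phrase head": [["'noun'", "phrase head"], []],
}
table = terminal_table(rules, lexer)
tokens = lexer.tokenize("The large server")
```

With a table like this the generated parser can compare the code on each incoming token with the code it expects, without knowing anything about individual words.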