Grammar Notation

Umple uses its own extended EBNF syntax for its grammar, together with its own parsing tool, because we wanted several advanced features in the grammar and parser.

Samples of syntax in this user manual are generated directly from the input grammar files used to parse Umple code.

Our syntax offers a very simple mechanism to define a new language, as well as extend an existing one. We will be using examples to help explain our syntax.

Example of a simple rule

Let's start with a simple assignment rule (not part of Umple, just an example):

assignment : [name] = [value] ;

Above, the rule name is "assignment". An assignment is composed of a non-terminal called "name", then the equals symbol ("="), a non-terminal called "value", and finally a semicolon symbol (";").

By default, a non-terminal such as the ones shown above matches a sequence of non-whitespace characters that ends at the next symbol expected by the grammar. In the above case, the non-terminal "name" consists of the characters leading up to a space, a newline, or an equals sign ("=").

Here are a few examples that satisfy the assignment rule above:

key = "one";
wasSet=true;
numberOfItems =7;
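
To make the matching behaviour concrete, here is a minimal Python sketch that approximates the assignment rule with a regular expression. It is not the Umple parser; the function name parse_assignment and the regular expression are assumptions made only for illustration.

```python
import re

# Hypothetical stand-in for the rule:  assignment : [name] = [value] ;
# [name] ends at whitespace or "="; [value] ends at whitespace or ";".
ASSIGNMENT = re.compile(r"\s*([^\s=]+)\s*=\s*([^\s;]+)\s*;\s*")

def parse_assignment(text):
    """Return (name, value) if text satisfies the rule, otherwise None."""
    match = ASSIGNMENT.fullmatch(text)
    return (match.group(1), match.group(2)) if match else None

for sample in ['key = "one";', "wasSet=true;", "numberOfItems =7;"]:
    print(parse_assignment(sample))
```

Each of the three samples yields a (name, value) pair, and whitespace around the symbols makes no difference to the result.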

Rules that refer to other rules

Let us now consider nesting sub-rules within a rule. A reference to a sub-rule is written as the rule's name between [[ and ]]. Again, the following is illustrative and is not part of Umple itself:

directive- : [[facadeStatement]] | [[useStatement]]

facadeStatement- : [=facade] ;

useStatement : use [=type:file|url] [location] ;

Above, we have three rules, "directive", "facadeStatement", and "useStatement". A "directive" is either a "facadeStatement" or a "useStatement" (the "or" expression is defined by the vertical bar "|"). As indicated earlier, to refer to a rule within another rule, we use double square brackets ("[[" and "]]").

By default, rule names are added to the tokenization string (the result of parsing the input, as in the examples below). However, some rules act more like placeholders that help modularize the grammar and promote reuse. To exclude a rule name (just the name; the rule itself is still evaluated and tokenized as required), add a minus ("-") at the end of its name.

Above, the rule names "directive" and "facadeStatement" are not added to the tokenization string because of their trailing minus signs.

For example, the text "facade;" is tokenized as follows:
[facade:facade]

Without the ability to exclude rule names, that same text would be tokenized with the following additional (and unnecessary) text:
[directive][facadeStatement][facade:facade]
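
The effect of the trailing minus can be sketched in a few lines of Python. This is only an illustration of the naming convention, not how the Umple parser is implemented; the function tokenization_string and its arguments are hypothetical.

```python
def tokenization_string(rule_path, tokens):
    """Build a bracketed tokenization string.

    rule_path: rule names as written in the grammar, outermost first.
    tokens: (name, value) pairs produced while matching the innermost rule.
    """
    # Rule names ending in "-" are left out; their tokens are still kept.
    parts = [f"[{rule}]" for rule in rule_path if not rule.endswith("-")]
    parts += [f"[{name}:{value}]" for name, value in tokens]
    return "".join(parts)

print(tokenization_string(["directive-", "facadeStatement-"], [("facade", "facade")]))
print(tokenization_string(["directive", "facadeStatement"], [("facade", "facade")]))
```

The first call prints [facade:facade]; the second, where no rule name carries a minus, prints [directive][facadeStatement][facade:facade].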

Terminal symbols and constants

Symbols (i.e. terminals), such as "=" and ";", are used in the analysis phase of parsing (to decide which parsing rule to invoke), but they are not added to the resulting tokenization string for later processing.

If we want to tokenize symbols, we can create a constant using the notation

[=name]

In the earlier example we see that a "facadeStatement" is represented by the sequence of characters "facade" (i.e. a constant).

To support lists of potential matches we use a similar notation

[=name:list|separated|by|bars]

In the earlier example, we see that the "type" non-terminal can be the constant string sequence "file" or "url".

Hence, here are a few examples that satisfy the earlier rules:

facade;
use file Parser.ump;
use url http://cruise.site.uottawa.ca/Parser.ump;
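
As a rough illustration of a constant with a list of alternatives, the following Python sketch approximates the "useStatement" rule with a regular expression. The name parse_use and the expression itself are assumptions for this example and are not taken from the Umple parser.

```python
import re

# Hypothetical stand-in for:  useStatement : use [=type:file|url] [location] ;
# "use" and ";" are plain symbols; "type" must be one of the listed constants.
USE_STATEMENT = re.compile(r"\s*use\s+(file|url)\s+([^\s;]+)\s*;\s*")

def parse_use(text):
    """Return the tokens of a use statement as a dict, or None if no match."""
    match = USE_STATEMENT.fullmatch(text)
    if match is None:
        return None
    return {"type": match.group(1), "location": match.group(2)}

print(parse_use("use file Parser.ump;"))
print(parse_use("use url http://cruise.site.uottawa.ca/Parser.ump;"))
```

Note that the symbols "use" and ";" do not appear in the result, whereas the constant "type" does, because it is written with the [=name:...] notation.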

Optionality and repeating

Parentheses can be used to group several elements in the grammar into a single element for the purposes of the following special treatment.

An asterisk * means that zero or more of the preceding elements may occur.

A plus sign + means that one or more of the preceding elements must occur.

A question mark ? means that the preceding element may occur.
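
Readers familiar with regular expressions may find an analogy helpful, since the same three operators exist there with the same meaning. The Python sketch below uses the standard re module purely as an analogy; the helper name matches and the sample pattern "ab" are made up for the example.

```python
import re

def matches(pattern, text):
    """True if the whole text satisfies the pattern."""
    return re.fullmatch(pattern, text) is not None

# Parentheses group "ab" so that the operator applies to the whole group.
print(matches(r"(ab)*", ""))        # zero occurrences are allowed
print(matches(r"(ab)*", "ababab"))  # as are many
print(matches(r"(ab)+", ""))        # at least one occurrence is required
print(matches(r"(ab)?", "ab"))      # one occurrence is allowed
print(matches(r"(ab)?", "abab"))    # but not two
```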

Avoiding consumption of whitespace

Normally when parsing, all whitespace (spaces, carriage returns, tabs, etc.) around tokens is ignored, and the tokens output by the parser do not contain it. However, if a # symbol is found after the rule name (on the left-hand side), then all whitespace is preserved. This is useful where the whitespace is significant, such as in Umple's templates.

Comments and arbitrary input

The grammar syntax supports a simple mechanism for non-terminals that can include whitespace (e.g. comments). Text matched by a [**name] element, such as [**arbitraryStuff], is not parsed. However, if the name is followed by a colon, as in [**templateTextContent:<<(=|#|/[**], then a pattern can be specified for the kinds of brackets whose contents are parsed internally. The example above says to ignore everything except content introduced by <<=, <<# or <</*, which will be processed.

Let us consider the rules to define inline and multi-line comments.

inlineComment- : // [*inlineComment]

multilineComment- : /* [**multilineComment] */

The [*name] (e.g. [*inlineComment]) non-terminal will match everything until a newline character. The [**name] (e.g. [**multilineComment]) non-terminal will match everything (including newlines) until the next character sequence is matched. In the case above, a "multilineComment" will match everything between "/*" and "*/".

Here are a few examples that satisfy the comment rules above:

// remove all references to "x" once complete
/* This class will help calculate
your overdue library fees */
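
The difference between [*name] and [**name] can be sketched as two small Python functions. These are illustrative only; the names match_inline and match_until are hypothetical, and leading and trailing whitespace is left in place so that the caller can decide whether to strip it.

```python
def match_inline(text, start):
    """[*name]: capture from start up to, but not including, the next newline."""
    end = text.find("\n", start)
    end = len(text) if end == -1 else end
    return text[start:end], end

def match_until(text, start, terminator):
    """[**name]: capture from start, newlines included, up to the terminator."""
    end = text.find(terminator, start)
    if end == -1:
        return None  # the closing sequence never appears
    return text[start:end], end + len(terminator)

source = ('// remove all references to "x" once complete\n'
          "/* This class will help calculate\nyour overdue library fees */")

inline, pos = match_inline(source, len("//"))
block, _ = match_until(source, pos + 1 + len("/*"), "*/")
print(inline.strip())
print(block.strip())
```

Here the inline comment stops at the first newline, while the multi-line comment spans the newline and stops only at "*/".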

Special matching cases

By default, [name] matches identifiers that can include underscores and certain other symbols. To match alphanumeric identifiers only, use

[~name]

To match based on a regular expression (such as a sequence of one or more digits in this case):

[!bound:\d+]
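
A regular-expression non-terminal behaves much like an anchored match at the current input position. The Python sketch below shows the idea using the standard re module; the function match_regex_token and the sample inputs are assumptions made for illustration and do not come from the Umple parser.

```python
import re

def match_regex_token(name, pattern, text, start=0):
    """Try pattern at position start.

    Returns ((name, matched_text), next_position), or None if nothing matches.
    """
    match = re.compile(pattern).match(text, start)
    if match is None or match.end() == start:
        return None
    return (name, match.group(0)), match.end()

# Hypothetical inputs for a rule containing [!bound:\d+]
print(match_regex_token("bound", r"\d+", "42..*"))
print(match_regex_token("bound", r"\d+", "many"))
```

The first call captures the digits "42" and reports where parsing should continue; the second returns None because the input does not start with a digit.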

To match one or more identifiers, with some of them optionally omitted, use this notation:

[attr_qualifier,attr_type,attr_name>2,1,0]

This states that there will be one, two, or three identifiers. The priority of inputs is attr_name, then attr_type, then attr_qualifier.
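
One way to read this notation is sketched below in Python. The sketch assumes that a lower number means the name is filled first and that the identifiers found are assigned in left-to-right grammar order; both points are our interpretation of the description above, and the function assign_by_priority and the sample identifiers are hypothetical.

```python
def assign_by_priority(names, priorities, values):
    """Assign the identifiers found to names, honouring the priorities.

    names: non-terminal names in grammar order.
    priorities: one number per name; lower numbers are filled first (assumed).
    values: the identifiers actually found, in input order.
    """
    if not 1 <= len(values) <= len(names):
        return None
    # Choose the len(values) names with the lowest priority numbers...
    by_priority = sorted(range(len(names)), key=lambda i: priorities[i])
    chosen = sorted(by_priority[:len(values)])
    # ...then hand out the values in grammar (left-to-right) order.
    return {names[i]: value for i, value in zip(chosen, values)}

names = ["attr_qualifier", "attr_type", "attr_name"]
priorities = [2, 1, 0]
for found in (["count"], ["Integer", "count"], ["const", "Integer", "count"]):
    print(assign_by_priority(names, priorities, found))
```

With one identifier only attr_name is set, with two attr_type and attr_name are set, and with three all of the names receive a value.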