Last Updated: | 27 August 2006 |
Latest Version: | Jpattern 1.1 |
Minimum JDK Level: | JDK 1.5 |
The goal of this project is to make Snobol4-style pattern matching available as a package for the Java programming language. Rather than build such a package from scratch, the existing Ada-based Gnat Spitbol Patterns code was converted to Java. The result is generally consistent with that Ada package, although some changes were made to conform to the capabilities and limitations of the Java language. In light of the derived nature of this code, the original GNAT license is assumed to cover this code. The compiler, however is licensed under a BSD license. See the License Section for more details.
Description | URL |
---|---|
Source Version 1.1: | jpattern-1.1.jar |
jpattern.jar
.
It is assumed that the JDK 1.5 bin directory
is in the PATH environment variable.
jpattern.jar
.
A set of tests is provided in the directory name tests
.
A make Makefile and an ant build.xml file are both provided.
The output of the tests is captured and compared to the expected output. Any discrepancies are reported.
This pattern could be construct using the following sequence of Pattern functions.("abc" | "ab") & ("de" | "cd")
Pattern pe1 = Pattern.Alt("abc","ab");
Pattern pe2 = Pattern.Alt("de","cd");
Pattern pe3 = Pattern.Concat(pe1,pe2);
As a rule, the variable name should be preceded by the defer operator ("+") or an assignment operator ("." "$" "*" "**") or a replacement operator ("="). The leading "+" is not required unless the variable name is also a keyword.
As a rule, any elementary pattern operator that takes an argument may have a variable as its argument.
During the match, if a reference to a variable is required, then it searched for in the VarMap passed to the Match() function. The value associated with the variable in the map can be of the following types.
Whitespace (" \t\n\r...") and comments may appear in expressions where desired. The comments are either the "//...\n" type or the "/*...*/" type. Be careful to not put an end-of-line escape before any comments on that line. That is, do not writepattern := alt; alt := cat | cat '|' alt; cat := replace | replace '&' cat | replace cat; replace := assign | assign '=' variable; assign := element | element '*' variable | element '**' variable | element '.' variable | element '$' variable; element := operator | '(' pattern ')' | defer | INT | STRING; operator := | "+" variable | "abort" | cancel | "any" "(" svarg ")" | "arb" "(" svarg ")" | "arbno" "(" svarg ")" | "bal" | bal "(" svarg ")" | "break" "(" svarg ")" | "breakx" "(" svarg ")" | "fence" | fence "(" svarg ")" | "fail" | "len" "(" nvarg ")" | "notany", "(" svarg ")" | "nspan" "(" svarg ")" | "pos" "(" nvarg ")" | "rem" | "rest" | "rpos", "(" nvarg ")" | "rtab" "(" nvarg ")" | "setcur" "(" defer ")" | "span" "(" svarg ")" | "succeed" | "tab" "(" nvarg ")" ; svarg := STRING | defer; nvarg := INT | defer; defer := variable | "+" variable; variable := ID; # Lexical Elements ID := [_a-zA-Z][_a-zA-Z0-9]* that is not a keyword. INT := [-]?[0-9]+ | "0x" [0-9a-fA-F]+ STRING := '"''"'
@len(5) \ //comment 1 bal()@but rather write this.
@len(5) //comment 1 \ bal()@
The expression "A | B" means try to match A and if that fails, then try to match B starting at the same place where the attempt to match A began.
There is full backtracking, which means that if a given pattern element fails to match, then previous alternatives are matched. For example if we have the pattern:
First we attempt to match A, if that succeeds, then we go on to try to match C, and if that succeeds, we go on to try to match E. If E fails, then we try F. If F fails, then we go back and try matching D instead of C. Let's make this explicit using a specific example, and introducing the simplest kind of pattern element, which is a literal string. The meaning of this pattern element is simply to match the characters that correspond to the string characters. Now let's rewrite the above pattern form with specific string literals as the pattern elements:(A | B) (C | D) (E | F)
The following strings will be attempted in sequence:("ABC" | "AB")("DEF" | "CDE")("GH" | "IJ")
The entire match fails only when every possible starting point has been attempted. As an example, suppose that we had the subject string
matched using the pattern in the previous example:"ABABCDEIJKL"
This would succeed, after two anchor point moves:("ABC" | "AB") & ("DEF" | "CDE") & ("GH" or "IJ")
where the "^^^^^^^" indicates the matched section."ABABCDEIJKL" ^^^^^^^
This mode of pattern matching is called the unanchored mode. It is also possible to put the pattern matcher into anchored mode by calling the function Match.setAnchorMode(true) This will cause all subsequent matches to be performed in anchored mode, where the match is required to start at the first character.
We will also see later how the effect of an anchored match can be obtained for a single specified anchor point if this is desired.
Abort/Cancel | |
---|---|
Pattern Function: | Abort() or Cancel() |
Compiler Syntax: | abort or cancel |
Description: | The Abort (Cancel) operator immediately aborts the entire pattern match, signalling failure. This is a specialized pattern element, which is useful in conjunction with some of the special pattern elements that have side effects. Note: the original spitbol name was Abort, but the GNU version changed it to Cancel because Abort is a reserved Ada word. |
Alternation | |
---|---|
Pattern Function: | Alternate(Pattern A, Pattern B) |
Compiler Syntax: | A | B |
Description: | The alternation operator means first attempt to match A, and then if that does not succeed, match B. |
Any | |
---|---|
Pattern Function: | Any(String s) |
Compiler Syntax: | any(<string>) |
Description: | The Any(S) operator where S is a string, matches a single character that is any one of the characters in S. Fails if the current character is not one of the given set of characters. |
Arb | |
---|---|
Pattern Function: | Arb() |
Compiler Syntax: | arb |
Description: | The Arb operator matches any string. First it matches the null string, and then on a subsequent failure, matches one character, and then two characters, and so on. It only fails if the entire remaining string is matched. |
Arbno | |
---|---|
Pattern Function: | Arbno(String S) or Arbno(Pattern P) |
Compiler Syntax: | arbno(<pattern>) |
Description: | The Arbno(P) operator where P is any pattern, matches any number of instances of the pattern, starting with zero occurrences. It is thus equivalent to ("" | (P & ("" | (P & ("" ....))))). The pattern P may contain any number of pattern elements including the use of alternatiion and concatenation. |
Bal | |
---|---|
Pattern Function: | Bal() or Bal(String parens) |
Compiler Syntax: | bal or bal(<string>) |
Description: | The Bal operator matches a non-empty string that is parentheses balanced with respect to ordinary () characters. Examples of balanced strings are "ABC", "A((B)C)", and "A(B)C(D)E". Bal matches the shortest possible balanced string on the first attempt, and if there is a subsequent failure, attempts to extend the string. Optionally, the Bal operator can be given a two character string to indicate the parentheses characters. For example Bal("[]") will match balanced square brackets. |
Break | |
---|---|
Pattern Function: | Break(String S) |
Compiler Syntax: | break(<string>) |
Description: | The Break(S) operator where S is a string, matches a string of zero or more characters up to but not including a break character that is one of the characters given in the string S. Can match the null string, but cannot match the last character in the string, since a break character is required to be present. |
BreakX | |
---|---|
Pattern Function: | BreakX(String S) |
Compiler Syntax: | breakx(<string>) |
Description: | The BreakX(S) operator where S is a string, behaves exactly like Break(S) when it first matches, but if a string is successfully matched, then a susequent failure causes an attempt to extend the matched string. |
Concatenation | |
---|---|
Pattern Function: | Concat(Pattern A, Pattern B) |
Compiler Syntax: | A & B |
Description: | The concatenation operator means match A followed immediately by matching B. |
Fail | |
---|---|
Pattern Function: | Fail() |
Compiler Syntax: | fail |
Description: | The Fail operator is equivalent to the null alternation. Matches no possible strings, so it always signals failure. This is a specialized pattern element, which is useful in conjunction with some of the special pattern elements that have side effects. |
Fence | |
---|---|
Pattern Function: | Fence() or Fence(Pattern p) |
Compiler Syntax: | fence or Fence(<pattern>) |
Description: | The Fence operator
matches the null string at first, and then if a failure
causes alternatives to be sought, aborts the match (like
a Cancel). Note that using Fence at the start of a pattern
has the same effect as matching in anchored mode.
The Fence(P) operator, where P is a pattern, attempts to match the pattern P including trying all possible alternatives of P. If none of these alternatives succeeds, then the Fence pattern fails. If one alternative succeeds, then the pattern match proceeds, but on a subsequent failure, no attempt is made to search for alternative matches of P. The pattern P may contain any number of pattern elements including the use of alternatiion and concatenation. |
Len | |
---|---|
Pattern Function: | Len(int n) |
Compiler Syntax: | len(<integer>) |
Description: | The Len(N) operator where N is a natural number, matches the given number of characters. For example, Len(10) matches any string that is exactly ten characters long. |
NotAny | |
---|---|
Pattern Function: | NotAny(String S) |
Compiler Syntax: | notany(<string>) |
Description: | The NotAny(N) operator where S is a string, matches a single character that is not one of the characters of S. Fails if the current characer is one of the given set of characters. |
NSpan | |
---|---|
Pattern Function: | NSpan(String S) |
Compiler Syntax: | nspan(<string>) |
Description: | The NSpan(S) operator where S is a string, matches a string of zero or more characters that is among the characters given in the string. Always matches the longest possible such string. Always succeeds, since it can match the null string. |
Pos | |
---|---|
Pattern Function: | Pos(int n) |
Compiler Syntax: | pos(<integer>) |
Description: | The Pos(N) operator where N is a natural number, matches the null string if exactly N characters have been matched so far, and otherwise fails. |
Rem/Rest | |
---|---|
Pattern Function: | Rem() or Rest() |
Compiler Syntax: | rem or rest |
Description: | The Rem (Rest) operator matches from the current point to the last character in the string. This is a specialized pattern element, which is useful in conjunction with some of the special pattern elements that have side effects. Note: the original spitbol name was Rem, but the GNU version changed it to Rest because Rem is a reserved Ada word. |
Rpos | |
---|---|
Pattern Function: | Rpos(int n) |
Compiler Syntax: | rpos(<integer>) |
Description: | The Rpos(N) operator where N is a natural number, matches the null string if exactly N characters remain to be matched, and otherwise fails. |
Rtab | |
---|---|
Pattern Function: | Rtab(int n) |
Compiler Syntax: | rtab(<integer>) |
Description: | The Rtab(N) operator where N is a natural number, matches characters from the current position until exactly N characters remain to be matched in the string. Fails if fewer than N unmatched characters remain in the string. |
Span | |
---|---|
Pattern Function: | Span(String S) |
Compiler Syntax: | span(<string>) |
Description: | The Span(S) operator where S is a string, matches a string of one or more characters that is among the characters given in the string. Always matches the longest possible such string. Fails if the current character is not one of the given set of characters. |
Succeed | |
---|---|
Pattern Function: | Succeed() |
Compiler Syntax: | succeed |
Description: | The Succeed operator repeatedly matches the null string (it is equivalent to the alternation ("" | "" | "" ....)). This is a special pattern element, which is useful in conjunction with some of the special pattern elements that have side effects. |
Tab | |
---|---|
Pattern Function: | Tab(int n) |
Compiler Syntax: | tab(<integer>) |
Description: | The Tab(N) operator where N is a natural number, matches characters from the current position until exactly N characters have been matched in all. Fails if more than N characters have already been matched. |
Defer | |
---|---|
Pattern Function: | Defer(Variable B) |
Compiler Syntax: | [+]<name>/td> |
Description: |
The Defer operator (+V) where V is a variable,
obtains at match time
a value for V from the VarMap argument to the Match function.
It then proceeds to match that value as if it had been present
in the original pattern at that point. The leading "+" is optional
unless the variable name matches a keyword.
See also the Variables discussion.
A deferred pattern can be used to create a recursive pattern that will, at pattern matching time, follow the pointer to obtain the referenced pattern, and then match this pattern. Consider for example, the following Java code. By placing the pattern P into the map with the name P (the name can be anything consistent), we cause P to recurse through the map. The Examples section provides illustrations of recursive patterns.Pattern P = Pattern.Alternate("A", Pattern.Concat("B", Pattern.Defer("P"))); VarMap vars = new VarMap(); vars.put("P",P); On the first attempt to match, this pattern attempts to match the string "A". If this fails, then the alternative matches a "B", followed by an attempt to match the value of "P", which is of source the Pattern P. This second attempt now first attempts to match "A", and so on. The result is a pattern that will match a string of B's followed by a single A. This particular example could simply be written as nspan("B") & "A", but the use of recursive patterns in the general case can construct complex patterns which could not otherwise be built. |
Deferred Assignment | |
---|---|
Pattern Function: | Assign(Pattern P, Variable v) |
Compiler Syntax: | <pattern> * <variable> or <pattern> . <variable> |
Description: | The deferred assignment operator takes an arbitrary pattern P and a variable V that will be set to the substring matched by P. The assignment is deferred to the end of the match. If the entire match is successful, and if the pattern P was part of the successful match, then at the end of the matching operation the assignment to S of the string matching P is performed. |
Immediate Assignment | |
---|---|
Pattern Function: | IAssign(Pattern P, Variable v) |
Compiler Syntax: | <pattern> ** <variable> or <pattern> $ <variable> |
Description: | The immediate assignment operator takes an arbitrary pattern P P a variable V that will be set to the substring matched by P. This assignment happens during pattern matching, so if P matches more than once, then the assignment happens more than once. |
Set Cursor | ||
---|---|---|
Pattern Function: | Setcur(Variable v) |
|
Compiler Syntax: | setcur(<variable>) | |
Description: | The cursor assignment operation assigns the current cursor position to the variable N. The cursor position is defined as the count of characters that have been matched so far (including any start point moves). |
Deferred Replacement | ||
---|---|---|
Pattern Function: | Replace(Pattern P, Variable v) |
|
Compiler Syntax: | <pattern>=<variable> | |
Description: |
Deferred replacement is similar to deferred assignment.
With assignment, a variable is associated with a pattern and
when matching completes, the substring matched by that pattern
is assigned to the variable.
With deferred replacement, after a successful match, the substring matched by the pattern is replaced by the value of the variable (at the end of matching). In order to access the modified subject string, the MatchResult argument must used. |
Option | Description |
-pat '<pattern>' | Specify a pattern expression. |
-subject="<string>" | Specify a subject string. |
-var "var=<value>" | Specify an initial variable assignment.; may be repeated. |
-tag "<string> | Specify a string that identifies pattern lines in the standard input. The same tag is used to mark the beginning and the end of the string. |
-xmltag "<string> | Specify an xml style opening tag that identifies pattern lines in the standard input.
The closing tag is the same as the open tag, but with the standard "/".
For example, -xmltag <pattern> would mark strings with the tags
<pattern>...</pattern> . |
-squote | Indicate that pattern expressions are using single quotes instead of the usual double quotes. |
-debug | Turn on lowest level of debug output. |
-debugn <N> | Set the debug level to N |
-v | Print out the parameters. |
-f <filename> | Take input from this file |
-o <filename> | Send non-error output to this file. |
-err <filename> | Send error output to this file. |
java -classpath jpattern.jar jpattern.compiler.Main
-pat "arb & len(5)=x" -subject "1234567" -var "x=xyz"
java -classpath jpattern.jar jpattern.compiler.Main
-tag "@" < javacode.pat > javacode.java
The second case reads lines from standard input ("javacode.pat" in this case) and writes them to standard output ("javacode.java" in this case). As each line is read, it is searched for paired occurrences of the java tag string (the argument to the -tag flag or -xmltag flag). The text between the occurrences is assumed to be a pattern expression. Multiple patterns per line is disallowed. That expression is compiled to equivalent Java code to construct the Pattern and is written to standard output in place of the tagged text. Patterns may span multiple lines by ending each line with a backslash character ("\").
Suppose the first example was run with the following input file.
public class Test0 { static void makePattern() { Pattern Lnum = @pos(0) (Digs ".") span(" ")@; Pattern Digs = @span("0123456789")@; } }It would produce the following output file.
public class Test0 { static void makePattern() { Pattern Lnum = Pattern.Concat( Pattern.Pos(0), Pattern.Concat( Pattern.Concat( Pattern.Defer(Variable.create("Digs")), Pattern.StringPattern(".") ), Pattern.Span(" ") ) ); Pattern Digs = Pattern.Span("0123456789"); } }
You can examine the files Test1.pat thru Test5.pat in the test directory to see other examples.
// Instantiate the compiler, a matcher, // a MatchResult, and a VarMap for // definiing variable names. Compiler compiler = new Compiler(); Match matcher = new Match(); MatchResult result = new MatchResult(); VarMap vars = new VarMap(); Pattern p = compiler.compile("arb & len(5)=x"); String subject = "1234567"; boolean ok = matcher.Match(subject,p,result,vars); stdout.print(ok?"succeed: "+result.Subject:"fail.");
Pattern Digs = @span("0123456789")@; Pattern Lnum = @pos(0) & (+Digs & ".")=nil & span(" ")@;
Now to use this pattern we simply do a match with a replacement:
which replaces the line number by the null string.VarMap vars = new VarMap(); vars.put("Digs",Digs); vars.put("nil",""); boolean ok = matcher.Match(Line, Lnum, vars);
The style we use here -- defining constant patterns and then using them -- is typical. It is possible to build up patterns dynamically, but it is usually more efficient to build them in pieces in advance using constant declarations. Note in particular that although it is possible to construct a pattern directly as an argument for the Match routine, it is much more efficient to preconstruct the pattern as we did in this example.
Example 2. Now let's look at the use of pattern assignment to break a string into sections. Suppose that the input string has two unsigned decimal integers, separated by spaces or a comma, with spaces allowed anywhere. Then we can isolate the two numbers with the following pattern:
Pattern B = @nspan(" ")@; Pattern N = @span("0123456789")@; Pattern T = @nspan(" ") +N*Num1 span(" ,") +N*Num2@;
The match operation
would assign the string 124 to Num1 and the string 257 to Num2.Match(" 124, 257 ", T)
Example 3. Now let's see how more complex elements can be built from the set of primitive elements. The following pattern matches strings that have the syntax of Ada 95 based literals:
A match against Bnum will now match the desired strings, e.g. it will match "16#123_abc#", but not "a#b#". However, this pattern is not quite complete, since it does not allow colons to replace the pound signs. The following is more complete:VarMap vars = new VarMap(); vars.put("DecDigits","0123456789"); vars.put("HexDigits","0123456789abcdefABCDEF"); Pattern Digs = @span(+DecDigits)@; Pattern UDigs = @+Digs arbno("_" +Digs")@; Pattern Hdig = @span(+HexDigits)@; Pattern UHdig = @+Hdig arbno("_" +Hdig)@; Pattern Bnum = @+UDigs "#" +UHdig "#"@; vars.put("Digs",Digs); vars.put("UDigs",UDigs); vars.put("Hdig",Hdigs); vars.put("UHdig",UHdigs); vars.put("Bnum",Bnum);
But that is still not quite right, since it allows # and : to be mixed, and they are supposed to be used consistently. We solve this by using a deferred match.Pattern Bchar = @any("#:")@; vars.put("Bchar",Bchar); Pattern Bnum = @+UDigs +Bchar +UHdig +Bchar@;
Here the first instance of the base character is stored in Temp, and then later in the pattern we rematch the value that was assigned.Pattern Bnum = @+UDigs Bchar*Temp UHdig +Temp@;
Example 4. For an example of a recursive pattern, let's define a pattern that is like the built in Bal, but the string matched is balanced with respect to square brackets or curly brackets.
The language for such strings might be defined in extended BNF as
Here we use "{}" to indicate zero or more occurrences of a term, as is common practice in extended BNF. Now we can translate the above BNF into recursive patterns as follows:ELEMENT ::= <any character other than [] or {}> | '[' BALANCED_STRING ']' | '{' BALANCED_STRING '}' BALANCED_STRING ::= ELEMENT {ELEMENT}
Note the important use of + here to refer to a pattern not yet defined. Note also that we use assignments precisely because we cannot refer to as yet undeclared variables in initializations.Pattern Element = @notany("[]{}") \ | ("[" Balanced_String "]") \ | ("{" Balanced_String "}")@; Pattern Balanced_String = @Element arbno(Element)@; VarMap vars = new VarMap(); vars.put("Balanced_String",Balanced_String);
Now that this pattern is constructed, we can use it as though it were a new primitive pattern element. For example, the match:
will generate the output:Match("xy[ab{cd}]", @+Balanced_String*Current_Output fail@);
Note that the function of the fail here is simply to force the pattern Balanced_String to match all possible alternatives. Studying the operation of this pattern in detail is highly instructive.x xy xy[ab{cd}] y y[ab{cd}] [ab{cd}] a ab ab{cd} b b{cd} {cd} c cd d
Example 5. Finally we give a rather elaborate example of the use of deferred matching. The following declarations build up a pattern which will find the longest string of decimal digits in the subject string.
As we see from the comments here, complex patterns like this take on aspects of sequential programs. In fact they are sequential programs with general backtracking. In this pattern, we first use a pattern assignment that matches null and assigns it to Max, so that it is initialized for the new match. Now BreakX scans to the next digit. Arb would do here, but BreakX will be more efficient. Once we have found a digit, we scan out the longest string of digits with Span, and assign it to Cur. The deferred call to GtS tests if the string we assigned to Cur is the longest so far. If not, then failure is signalled, and we seek alternatives (this means that BreakX will extend and look for the next digit string). If the call to GtS succeeds then the matched string is assigned as the largest string so far into Max and its location is saved in Loc. Finally Fail forces the match to fail and seek alternatives, so that the entire string is searched.VarMap vars = new Varmap(); Function GTS = new Function() { Object get(VarMap vars) { String Cur = vars.getString("Cur",""); String Max = vars.getString("Max",""); return Boolean.valueOf(Cur.length() > Max.length()); }} vars.put("GTS",GTS); vars.put("Digit","0123456789"); Pattern Digs = @span(+Digit)@; Pattern Find = @""*Max fence \ // initialize Max to null & breakx(+Digit) \ // scan looking for digits & ((span(+Digit)*Cur \ // assign next string to Cur & (+GTS) \ // check size(Cur) > Size(Max) & setcur(+Loc)) \ // if so, save location * Max) \ // and assign to Max & fail@; // seek all alternatives
If the pattern Find is matched against a string, the variable Max at the end of the pattern will have the longest string of digits, and Loc will be the starting character location of the string. For example, Match("ab123cd4657ef23", Find) will assign "4657" to Max and 11 to Loc (indicating that Subject.substring(0,11) has been matched.
Note that this list of changes is a partial summary taken from the more complete list in the Changelog file.
Minor version levels are indicated in parentheses.
Author: | Dennis Heimbigner |
Software Engineering Research Laboratory | |
Computer Science Department | |
University of Colorado | |
Boulder, Colorado 80309-0430 USA | |
dennis.heimbigner@colorado.edu |