com.codename1.util.regex.RE

public class RE extends Object

RE is an efficient, lightweight regular expression evaluator/matcher class. Regular expressions are pattern descriptions which enable sophisticated matching of strings. In addition to being able to match a string against a pattern, you can also extract parts of the match. This is especially useful in text parsing! Details on the syntax of regular expression patterns are given below.

To compile a regular expression (RE), you can simply construct an RE matcher object from the string specification of the pattern, like this:

 RE r = new RE("a*b");

Once you have done this, you can call any of the RE.match methods to perform matching on a String. For example:

 boolean matched = r.match("aaaab");

will cause the boolean matched to be set to true because the pattern "a*b" matches the string "aaaab".

If you were interested in the number of a's which matched the first part of our example expression, you could change the expression to "(a*)b". Then when you compiled the expression and matched it against something like "xaaaab", you would get results like this:

 RE r = new RE("(a*)b");                  // Compile expression
 boolean matched = r.match("xaaaab");     // Match against "xaaaab"

 String wholeExpr = r.getParen(0);        // wholeExpr will be 'aaaab'
 String insideParens = r.getParen(1);     // insideParens will be 'aaaa'

 int startWholeExpr = r.getParenStart(0); // startWholeExpr will be index 1
 int endWholeExpr = r.getParenEnd(0);     // endWholeExpr will be index 6
 int lenWholeExpr = r.getParenLength(0);  // lenWholeExpr will be 5

 int startInside = r.getParenStart(1);    // startInside will be index 1
 int endInside = r.getParenEnd(1);        // endInside will be index 5
 int lenInside = r.getParenLength(1);     // lenInside will be 4
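
The offsets in the comments above can be cross-checked on a stock JDK. The sketch below is a stand-in under stated assumptions: it uses java.util.regex (which is not part of this package) in place of RE, and the class name ParenOffsetsSketch is hypothetical. Matcher.find, start and end play the roles that match, getParenStart and getParenEnd play in the RE example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in: java.util.regex replaces RE so the sketch runs on a plain JDK.
class ParenOffsetsSketch {
    // Returns {start, end, length} of the given group for the first match of
    // regex in input, or null when there is no match.
    static int[] offsets(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        int start = m.start(group); // counterpart of r.getParenStart(group)
        int end = m.end(group);     // counterpart of r.getParenEnd(group), exclusive
        return new int[] { start, end, end - start };
    }

    public static void main(String[] args) {
        int[] whole = offsets("(a*)b", "xaaaab", 0);
        int[] inside = offsets("(a*)b", "xaaaab", 1);
        System.out.println(whole[0] + " " + whole[1] + " " + whole[2]);    // 1 6 5
        System.out.println(inside[0] + " " + inside[1] + " " + inside[2]); // 1 5 4
    }
}
```

Note that the end index is exclusive, which is why the length equals end minus start.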

You can also refer to the contents of a parenthesized expression within a regular expression itself. This is called a 'backreference'. The first backreference in a regular expression is denoted by \1, the second by \2 and so on. So the expression:

 ([0-9]+)=\1

will match any string of the form n=n (like 0=0 or 2=2).
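
To see what the backreference buys you, the following sketch checks a few inputs against the same pattern. It leans on java.util.regex as a runnable stand-in for RE (an assumption, not this package's API). The helper name and the ^ and $ anchors are additions for illustration, so that the whole input must have the n=n form.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in: java.util.regex replaces RE so the sketch runs on a plain JDK.
class BackreferenceSketch {
    // Anchored with ^ and $ (an addition for this sketch) so the whole input must be n=n.
    private static final Pattern SAME_ON_BOTH_SIDES = Pattern.compile("^([0-9]+)=\\1$");

    static boolean sameOnBothSides(String input) {
        return SAME_ON_BOTH_SIDES.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(sameOnBothSides("0=0"));   // true
        System.out.println(sameOnBothSides("42=42")); // true
        System.out.println(sameOnBothSides("42=43")); // false: \1 must repeat "42" exactly
    }
}
```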

The full regular expression syntax accepted by RE is described here:

 **Characters**

   *unicodeChar*   Matches any identical unicode character
   \                    Used to quote a meta-character (like '*')
   \\                   Matches a single '\' character
   \0nnn                Matches a given octal character
   \xhh                 Matches a given 8-bit hexadecimal character
   \uhhhh               Matches a given 16-bit hexadecimal character
   \t                   Matches an ASCII tab character
   \n                   Matches an ASCII newline character
   \r                   Matches an ASCII return character
   \f                   Matches an ASCII form feed character

 **Character Classes**

   [abc]                Simple character class
   [a-zA-Z]             Character class with ranges
   [^abc]               Negated character class

NOTE: Incomplete ranges will be interpreted as "starts from zero" or "ends with last character".

I.e. [-a] is the same as [\u0000-a], and [a-] is the same as [a-\uFFFF], [-] means "all characters".

 **Standard POSIX Character Classes**

   [:alnum:]            Alphanumeric characters.
   [:alpha:]            Alphabetic characters.
   [:blank:]            Space and tab characters.
   [:cntrl:]            Control characters.
   [:digit:]            Numeric characters.
   [:graph:]            Characters that are printable and are also visible.
                        (A space is printable, but not visible, while an
                        `a' is both.)
   [:lower:]            Lower-case alphabetic characters.
   [:print:]            Printable characters (characters that are not
                        control characters.)
   [:punct:]            Punctuation characters (characters that are not letter,
                        digits, control characters, or space characters).
   [:space:]            Space characters (such as space, tab, and formfeed,
                        to name a few).
   [:upper:]            Upper-case alphabetic characters.
   [:xdigit:]           Characters that are hexadecimal digits.

 **Non-standard POSIX-style Character Classes**

   [:javastart:]        Start of a Java identifier
   [:javapart:]         Part of a Java identifier

 **Predefined Classes**

   .         Matches any character other than newline
   \w        Matches a "word" character (alphanumeric plus "_")
   \W        Matches a non-word character
   \s        Matches a whitespace character
   \S        Matches a non-whitespace character
   \d        Matches a digit character
   \D        Matches a non-digit character

 **Boundary Matchers**

   ^         Matches only at the beginning of a line
   $         Matches only at the end of a line
   \b        Matches only at a word boundary
   \B        Matches only at a non-word boundary

 **Greedy Closures**

   A*        Matches A 0 or more times (greedy)
   A+        Matches A 1 or more times (greedy)
   A?        Matches A 1 or 0 times (greedy)
   A{n}      Matches A exactly n times (greedy)
   A{n,}     Matches A at least n times (greedy)
   A{n,m}    Matches A at least n but not more than m times (greedy)

 **Reluctant Closures**

   A*?       Matches A 0 or more times (reluctant)
   A+?       Matches A 1 or more times (reluctant)
   A??       Matches A 0 or 1 times (reluctant)

 **Logical Operators**

   AB        Matches A followed by B
   A|B       Matches either A or B
   (A)       Used for subexpression grouping
  (?:A)      Used for subexpression clustering (just like grouping but
             no backrefs)

 **Backreferences**

   \1    Backreference to 1st parenthesized subexpression
   \2    Backreference to 2nd parenthesized subexpression
   \3    Backreference to 3rd parenthesized subexpression
   \4    Backreference to 4th parenthesized subexpression
   \5    Backreference to 5th parenthesized subexpression
   \6    Backreference to 6th parenthesized subexpression
   \7    Backreference to 7th parenthesized subexpression
   \8    Backreference to 8th parenthesized subexpression
   \9    Backreference to 9th parenthesized subexpression

All closure operators (+, *, ?, {m,n}) are greedy by default, meaning that they match as many elements of the string as possible without causing the overall match to fail. If you want a closure to be reluctant (non-greedy), you can simply follow it with a '?'. A reluctant closure will match as few elements of the string as possible when finding matches. {m,n} closures don't currently support reluctancy.
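
The difference between greedy and reluctant closures is easiest to see on one input matched two ways. This is a sketch using java.util.regex as a stand-in (an assumption made so that it runs on a plain JDK; RE is not invoked), and the patterns and inputs are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in: java.util.regex replaces RE so the sketch runs on a plain JDK.
class ClosureSketch {
    // Returns the text of the first match, or null if the pattern does not match.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("a+", "aaaa"));        // greedy: aaaa
        System.out.println(firstMatch("a+?", "aaaa"));       // reluctant: a
        System.out.println(firstMatch("<.+>", "<b>x</b>"));  // greedy: <b>x</b>
        System.out.println(firstMatch("<.+?>", "<b>x</b>")); // reluctant: <b>
    }
}
```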

Line terminators

A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

- A newline (line feed) character ('\n'),
- A carriage-return character followed immediately by a newline character ("\r\n"),
- A standalone carriage-return character ('\r'),
- A next-line character ('\u0085'),
- A line-separator character ('\u2028'), or
- A paragraph-separator character ('\u2029').

RE runs programs compiled by the RECompiler class. For efficiency, the RE matcher class does not include the regular expression compiler itself. You can construct a single RECompiler object and reuse it to compile each expression. Similarly, you can change the program run by a given matcher object at any time. However, RE and RECompiler are not thread-safe (for efficiency reasons, and because thread safety is deemed to be a rare requirement for these classes), so you will need to construct a separate compiler or matcher object for each thread (unless you do thread synchronization yourself). Once an expression has been compiled into an REProgram object, that REProgram can be safely shared across multiple threads and RE objects.
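
The sharing rule in the paragraph above (one compiled program shared, one matcher per thread) can be sketched with JDK classes that follow the same split. This is an analogy, not RE code: java.util.regex.Pattern stands in for a compiled REProgram and java.util.regex.Matcher for an RE matcher. Both substitutions, and the class name, are assumptions made so the sketch runs on a stock JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical analogy: Pattern plays the role of REProgram, Matcher the role of RE.
class PerThreadMatcherSketch {
    // Compiled once and shared by all threads.
    private static final Pattern PROGRAM = Pattern.compile("a*b");

    // One mutable matcher per thread, created lazily from the shared program.
    private static final ThreadLocal<Matcher> MATCHER =
            ThreadLocal.withInitial(() -> PROGRAM.matcher(""));

    static boolean matches(String input) {
        // reset() points this thread's matcher at the new input before searching.
        return MATCHER.get().reset(input).find();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable job = () -> System.out.println(
                Thread.currentThread().getName() + ": " + matches("aaaab"));
        Thread t1 = new Thread(job, "worker-1");
        Thread t2 = new Thread(job, "worker-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```

The design choice is the same one the text recommends: pay the compilation cost once, and keep the mutable matching state private to each thread instead of locking around a shared matcher.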

ISSUES:

com.weusours.util.re is not currently compatible with all standard POSIX regcomp flags
com.weusours.util.re does not support POSIX equivalence classes ([=foo=] syntax) (I18N/locale issue)
com.weusours.util.re does not support nested POSIX character classes (definitely should, but not completely trivial)
com.weusours.util.re does not support POSIX character collation concepts ([.foo.] syntax) (I18N/locale issue)
Should there be different matching styles (simple, POSIX, Perl etc?)
Should RE support character iterators (for backwards RE matching!)?
Should RE support reluctant {m,n} closures (does anyone care)?
Not all possibilities are considered for greediness when backreferences are involved (as POSIX suggests should be the case). The POSIX RE "(ac*)c*d[ac]*\1", when matched against "acdacaa" should yield a match of acdacaa where \1 is "a". This is not the case in this RE package, and actually Perl doesn't go to this extent either! Until someone actually complains about this, I'm not sure it's worth "fixing". If it ever is fixed, test #137 in RETest.txt should be updated.
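
The behaviour described in the last issue can be reproduced with a backtracking matcher. The sketch below uses java.util.regex as a stand-in (an assumption; it does not call RE, and whether RE reports exactly the same spans is not verified here). It shows the first-found match that a greedy backtracking engine reports, as opposed to the longest match acdacaa that the POSIX reading asks for.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in: java.util.regex replaces RE so the sketch runs on a plain JDK.
class BackrefGreedinessSketch {
    // Returns {whole match, group 1} for the first match, or null if nothing matches.
    static String[] firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new String[] { m.group(0), m.group(1) } : null;
    }

    public static void main(String[] args) {
        String[] r = firstMatch("(ac*)c*d[ac]*\\1", "acdacaa");
        // Group 1 greedily takes "ac"; [ac]* then has to give back everything it
        // consumed so that \1 can match, and the engine stops at this first success
        // instead of retrying with group 1 = "a" to find a longer overall match.
        System.out.println(r[0] + " / " + r[1]); // acdac / ac
    }
}
```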

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final int | MATCH_CASEINDEPENDENT | Flag to indicate that matching should be case-independent (folded) |
| static final int | MATCH_MULTILINE | Newlines should match as BOL/EOL (^ and $) |
| static final int | MATCH_NORMAL | Specifies normal, case-sensitive matching behaviour. |
| static final int | MATCH_SINGLELINE | Consider all input a single body of text - newlines are matched by . |
| static final int | REPLACE_ALL | Flag bit that indicates that subst should replace all occurrences of this regular expression. |
| static final int | REPLACE_BACKREFERENCES | Flag bit that indicates that subst should replace backreferences |
| static final int | REPLACE_FIRSTONLY | Flag bit that indicates that subst should only replace the first occurrence of this regular expression. |

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| RE() | Constructs a regular expression matcher with no initial program. |
| RE(REProgram program) | Construct a matcher for a pre-compiled regular expression from program (bytecode) data. |
| RE(REProgram program, int matchFlags) | Construct a matcher for a pre-compiled regular expression from program (bytecode) data. |
| RE(String pattern) | Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler. |
| RE(String pattern, int matchFlags) | Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| int | getMatchFlags() | Returns the current match behaviour flags. |
| String | getParen(int which) | Gets the contents of a parenthesized subexpression after a successful match. |
| int | getParenCount() | Returns the number of parenthesized subexpressions available after a successful match. |
| final int | getParenEnd(int which) | Returns the end index of a given paren level. |
| final int | getParenLength(int which) | Returns the length of a given paren level. |
| final int | getParenStart(int which) | Returns the start index of a given paren level. |
| REProgram | getProgram() | Returns the current regular expression program in use by this matcher object. |
| String[] | grep(Object[] search) | Returns an array of Strings, whose toString representation matches a regular expression. |
| protected void | internalError(String s) | Throws an Error representing an internal error condition probably resulting from a bug in the regular expression compiler (or possibly data corruption). |
| boolean | match(CharacterIterator search, int i) | Matches the current regular expression program against a CharacterIterator, starting at a given index. |
| boolean | match(String search) | Matches the current regular expression program against a String. |
| boolean | match(String search, int i) | Matches the current regular expression program against a String, starting at a given index. |
| protected boolean | matchAt(int i) | Match the current regular expression program against the current input string, starting at index i of the input string. |
| protected int | matchNodes(int firstNode, int lastNode, int idxStart) | Try to match a string against a subset of nodes in the program |
| void | setMatchFlags(int matchFlags) | Sets match behaviour flags which alter the way RE does matching. |
| protected final void | setParenEnd(int which, int i) | Sets the end of a paren level |
| protected final void | setParenStart(int which, int i) | Sets the start of a paren level |
| void | setProgram(REProgram program) | Sets the current regular expression program used by this matcher object. |
| static String | simplePatternToFullRegularExpression(String pattern) | Converts a 'simplified' regular expression to a full regular expression |
| String[] | split(String s) | Splits a string into an array of strings on regular expression boundaries. |
| String | subst(String substituteIn, String substitution) | Substitutes a string for this regular expression in another string. |
| String | subst(String substituteIn, String substitution, int flags) | Substitutes a string for this regular expression in another string. |

Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- MATCH_NORMAL
  public static final int MATCH_NORMAL
  
  Specifies normal, case-sensitive matching behaviour.
  
  See Also:
  
  Constant Field Values
- MATCH_CASEINDEPENDENT
  public static final int MATCH_CASEINDEPENDENT
  
  Flag to indicate that matching should be case-independent (folded)
  
  See Also:
  
  Constant Field Values
- MATCH_MULTILINE
  public static final int MATCH_MULTILINE
  
  Newlines should match as BOL/EOL (^ and $)
  
  See Also:
  
  Constant Field Values
- MATCH_SINGLELINE
  public static final int MATCH_SINGLELINE
  
  Consider all input a single body of text - newlines are matched by .
  
  See Also:
  
  Constant Field Values
- REPLACE_ALL
  public static final int REPLACE_ALL
  
  Flag bit that indicates that subst should replace all occurrences of this regular expression.
  
  See Also:
  
  Constant Field Values
- REPLACE_FIRSTONLY
  public static final int REPLACE_FIRSTONLY
  
  Flag bit that indicates that subst should only replace the first occurrence of this regular expression.
  
  See Also:
  
  Constant Field Values
- REPLACE_BACKREFERENCES
  public static final int REPLACE_BACKREFERENCES
  
  Flag bit that indicates that subst should replace backreferences
  
  See Also:
  
  Constant Field Values
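
How flag bits of this kind combine can be sketched as follows. The snippet is a stand-in under stated assumptions: it uses the java.util.regex flags Pattern.CASE_INSENSITIVE, Pattern.MULTILINE and Pattern.DOTALL as rough analogues of MATCH_CASEINDEPENDENT, MATCH_MULTILINE and MATCH_SINGLELINE, because the numeric values of the RE constants are not given here and RE is not on a stock JDK classpath. The mapping is an analogy based on the field descriptions above, not a statement about how RE implements them.

```java
import java.util.regex.Pattern;

// Hypothetical analogy: java.util.regex flags stand in for the RE.MATCH_* flags.
class MatchFlagsSketch {
    // True if regex is found anywhere in input when compiled with the given flag bits.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "A\nB\nC";
        // No flags: ^ and $ do not match next to the interior newlines, and case matters.
        System.out.println(found("^b$", 0, text)); // false
        // Flags are bit masks, so several behaviours are combined with bitwise OR.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
        System.out.println(found("^b$", flags, text)); // true
        // Analogue of "newlines are matched by .".
        System.out.println(found("A.B", 0, text));              // false
        System.out.println(found("A.B", Pattern.DOTALL, text)); // true
    }
}
```
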
Constructor Details
- RE
  public RE(String pattern) throws RESyntaxException
  
  Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler. If you will be compiling many expressions, you may prefer to use a single RECompiler object instead.
  
  Parameters
  
  pattern: The regular expression pattern to compile.
  
  Throws
  
  RESyntaxException: Thrown if the regular expression has invalid syntax.
  
  See also
  
  RECompiler
  
- RE
  public RE(String pattern, int matchFlags) throws RESyntaxException
  
  Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler. If you will be compiling many expressions, you may prefer to use a single RECompiler object instead.
  
  Parameters
  
  pattern: The regular expression pattern to compile.
  
  matchFlags: The matching style
  
  Throws
  
  RESyntaxException: Thrown if the regular expression has invalid syntax.
  
  See also
  
  RECompiler
  
- RE
  public RE(REProgram program, int matchFlags)
  
  Construct a matcher for a pre-compiled regular expression from program (bytecode) data. Permits special flags to be passed in to modify matching behaviour.
  
  Parameters
  
  program: Compiled regular expression program (see RECompiler)
  
  matchFlags: One or more of the RE match behaviour flags (RE.MATCH_*):
  
  - MATCH_NORMAL // Normal (case-sensitive) matching
  - MATCH_CASEINDEPENDENT // Case folded comparisons
  - MATCH_MULTILINE // Newline matches as BOL/EOL
  
  See also
  
  RECompiler
  
  REProgram
- RE
  public RE(REProgram program)
  
  Construct a matcher for a pre-compiled regular expression from program (bytecode) data.
  
  Parameters
  
  program: Compiled regular expression program
  
  See also
  
  RECompiler
- RE
  
  public RE()
  
  Constructs a regular expression matcher with no initial program. This is likely to be an uncommon practice, but is still supported.
Method Details
- simplePatternToFullRegularExpression
  public static String simplePatternToFullRegularExpression(String pattern)
  
  Converts a 'simplified' regular expression to a full regular expression
  
  Parameters
  
  pattern: The pattern to convert
  
  Returns
  
  The full regular expression
- getMatchFlags
  public int getMatchFlags()
  
  Returns the current match behaviour flags.
  
  Returns
  
  Current match behaviour flags (RE.MATCH_*).
  
  - MATCH_NORMAL // Normal (case-sensitive) matching
  - MATCH_CASEINDEPENDENT // Case folded comparisons
  - MATCH_MULTILINE // Newline matches as BOL/EOL
  
  See also
  
  setMatchFlags(int)
- setMatchFlags
  public void setMatchFlags(int matchFlags)
  
  Sets match behaviour flags which alter the way RE does matching.
  
  Parameters
  
  matchFlags: One or more of the RE match behaviour flags (RE.MATCH_*):
  
  - MATCH_NORMAL // Normal (case-sensitive) matching
  - MATCH_CASEINDEPENDENT // Case folded comparisons
  - MATCH_MULTILINE // Newline matches as BOL/EOL
- getProgram
  public REProgram getProgram()
  
  Returns the current regular expression program in use by this matcher object.
  
  Returns
  
  Regular expression program
  
  See also
  
  setProgram(REProgram)
- setProgram
  public void setProgram(REProgram program)
  
  Sets the current regular expression program used by this matcher object.
  
  Parameters
  
  program: Regular expression program compiled by RECompiler.
  
  See also
  
  RECompiler
  
  REProgram
- getParenCount
  
  public int getParenCount()
  
  Returns the number of parenthesized subexpressions available after a successful match.
  
  Returns
  
  Number of available parenthesized subexpressions
- getParen
  public String getParen(int which)
  
  Gets the contents of a parenthesized subexpression after a successful match.
  
  Parameters
  
  which: Nesting level of subexpression
  
  Returns
  
  String
- getParenStart
  public final int getParenStart(int which)
  
  Returns the start index of a given paren level.
  
  Parameters
  
  which: Nesting level of subexpression
  
  Returns
  
  String index
- getParenEnd
  public final int getParenEnd(int which)
  
  Returns the end index of a given paren level.
  
  Parameters
  
  which: Nesting level of subexpression
  
  Returns
  
  String index
- getParenLength
  public final int getParenLength(int which)
  
  Returns the length of a given paren level.
  
  Parameters
  
  which: Nesting level of subexpression
  
  Returns
  
  Number of characters in the parenthesized subexpression
- setParenStart
  protected final void setParenStart(int which, int i)
  
  Sets the start of a paren level
  
  Parameters
  
  which: Which paren level
  
  i: Index in input array
- setParenEnd
  protected final void setParenEnd(int which, int i)
  
  Sets the end of a paren level
  
  Parameters
  
  which: Which paren level
  
  i: Index in input array
- internalError
  protected void internalError(String s) throws Error
  
  Throws an Error representing an internal error condition probably resulting from a bug in the regular expression compiler (or possibly data corruption). In practice, this should be very rare.
  
  Parameters
  
  s: Error description
  
  Throws:
  
  Error
- matchNodes
  protected int matchNodes(int firstNode, int lastNode, int idxStart)
  
  Try to match a string against a subset of nodes in the program
  
  Parameters
  
  firstNode: Node to start at in program
  
  lastNode: Last valid node (used for matching a subexpression without matching the rest of the program as well).
  
  idxStart: Starting position in character array
  
  Returns
  
  Final input array index if match succeeded. -1 if not.
- matchAt
  protected boolean matchAt(int i)
  
  Match the current regular expression program against the current input string, starting at index i of the input string. This method is only meant for internal use.
  
  Parameters
  
  i: The input string index to start matching at
  
  Returns
  
  True if the input matched the expression
- match
  public boolean match(String search, int i)
  
  Matches the current regular expression program against a String, starting at a given index.
  
  Parameters
  
  search: String to match against
  
  i: Index to start searching at
  
  Returns
  
  True if string matched
- match
  public boolean match(CharacterIterator search, int i)
  
  Matches the current regular expression program against a CharacterIterator, starting at a given index.
  
  Parameters
  
  search: CharacterIterator to match against
  
  i: Index to start searching at
  
  Returns
  
  True if string matched
- match
  public boolean match(String search)
  
  Matches the current regular expression program against a String.
  
  Parameters
  
  search: String to match against
  
  Returns
  
  True if string matched
- split
  public String[] split(String s)
  
  Splits a string into an array of strings on regular expression boundaries. This function works the same way as the Perl function of the same name. Given a regular expression of "[ab]+" and a string to split of "xyzzyababbayyzabbbab123", the result would be the array of Strings "[xyzzy, yyz, 123]".
  
  Please note that the first string in the resulting array may be an empty string. This happens when the very first character of the input string is matched by the pattern.
  
  Parameters
  
  s: String to split on this regular expression
  
  Returns
  
  Array of strings
- subst
  public String subst(String substituteIn, String substitution)
  
  Substitutes a string for this regular expression in another string. This method works like the Perl function of the same name. Given a regular expression of "a*b", a String to substituteIn of "aaaabfooaaabgarplyaaabwackyb" and the substitution String "-", the resulting String returned by subst would be "-foo-garply-wacky-".
  
  Parameters
  
  substituteIn: String to substitute within
  
  substitution: String to substitute for all matches of this regular expression.
  
  Returns
  
  The string substituteIn with zero or more occurrences of the current regular expression replaced with the substitution String (if this regular expression object doesn't match at any position, the original String is returned unchanged).
- subst
  public String subst(String substituteIn, String substitution, int flags)
  
  Substitutes a string for this regular expression in another string. This method works like the Perl function of the same name. Given a regular expression of "a*b", a String to substituteIn of "aaaabfooaaabgarplyaaabwackyb" and the substitution String "-", the resulting String returned by subst would be "-foo-garply-wacky-".
  
  It is also possible to reference the contents of a parenthesized expression with $0, $1, ... $9. Given a regular expression of "http://[\.\w\-\?/~_@&=%]+", a String to substituteIn of "visit us: http://www.apache.org!" and the substitution String "<a href="$0">$0</a>", the resulting String returned by subst would be "visit us: <a href="http://www.apache.org">http://www.apache.org</a>!".
  
  Note: $0 represents the whole match.
  
  Parameters
  
  substituteIn: String to substitute within
  
  substitution: String to substitute for matches of this regular expression
  
  flags: One or more bitwise flags from REPLACE_*. If the REPLACE_FIRSTONLY flag bit is set, only the first occurrence of this regular expression is replaced. If the bit is not set (REPLACE_ALL), all occurrences of this pattern will be replaced. If the flag REPLACE_BACKREFERENCES is set, all backreferences will be processed.
  
  Returns
  
  The string substituteIn with zero or more occurrences of the current regular expression replaced with the substitution String (if this regular expression object doesn't match at any position, the original String is returned unchanged).
- grep
  public String[] grep(Object[] search)
  
  Returns an array of Strings, whose toString representation matches a regular expression. This method works like the Perl function of the same name. Given a regular expression of "a*b" and an array of String objects of [foo, aab, zzz, aaaab], the array of Strings returned by grep would be [aab, aaaab].
  
  Parameters
  
  search: Array of Objects to search
  
  Returns
  
  Array of Strings whose toString() value matches this regular expression.
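
The split, subst and grep examples quoted in the method descriptions above can be replayed on a stock JDK. The sketch below is a cross-check under assumptions, not RE code: String.split, Matcher.replaceAll, Matcher.replaceFirst and Matcher.find from the JDK stand in for RE.split, RE.subst and RE.grep, and the class and method names are hypothetical. It only confirms that the documented inputs and outputs are consistent with Perl-style semantics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in: JDK regex calls replace RE so the sketch runs on a plain JDK.
class SplitSubstGrepSketch {
    // Stand-in for RE.split: cut the input wherever the pattern matches.
    static String[] split(String regex, String input) {
        return input.split(regex);
    }

    // Stand-in for RE.subst: replace every match, or only the first one.
    // In the replacement string, $0 refers to the whole match.
    static String subst(String regex, String input, String replacement, boolean firstOnly) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return firstOnly ? m.replaceFirst(replacement) : m.replaceAll(replacement);
    }

    // Stand-in for RE.grep: keep the elements whose string form contains a match.
    static String[] grep(String regex, Object[] search) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (Object o : search) {
            String s = String.valueOf(o);
            if (p.matcher(s).find()) {
                hits.add(s);
            }
        }
        return hits.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String in = "aaaabfooaaabgarplyaaabwackyb";
        System.out.println(String.join(", ", split("[ab]+", "xyzzyababbayyzabbbab123")));
        System.out.println(subst("a*b", in, "-", false)); // -foo-garply-wacky-
        System.out.println(subst("a*b", in, "-", true));  // -fooaaabgarplyaaabwackyb
        System.out.println(subst("http://[\\.\\w\\-\\?/~_@&=%]+",
                "visit us: http://www.apache.org!", "<a href=\"$0\">$0</a>", false));
        System.out.println(String.join(", ",
                grep("a*b", new Object[] { "foo", "aab", "zzz", "aaaab" })));
    }
}
```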
