Regular Expressions

Regular Expression Searching

When you turn on the "Regex search (egrep)" option in the Search Text window, you turn on the full power of Perl Regular Expressions. Most strings that you enter will work just as before, matching exactly what you typed in, but some characters have special meaning, as described below.

To guarantee that a non-alphanumeric character does not have a special meaning, precede it with a backslash (\). This also works for backslash itself, so \\ matches a single backslash.
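
The effect of the backslash can be tried outside the Search Text window. The sketch below uses Python's re module as a stand-in for the program's regex engine; that substitution is an assumption made only for illustration, and the two agree on this part of the syntax.

```python
import re

# Unescaped, the period is special and matches any single character.
unescaped = re.findall(r"a.c", "abc a.c")

# Escaped, the period matches only a literal period.
escaped = re.findall(r"a\.c", "abc a.c")

# Two backslashes in the pattern match one backslash in the text.
backslash = re.findall(r"\\", "C:\\temp")

print(unescaped)   # ['abc', 'a.c']
print(escaped)     # ['a.c']
print(backslash)   # ['\\']  (one backslash, shown in Python's escaped form)
```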

Note that since Perl regular expressions are incredibly powerful, this section only provides a summary of the most useful options.

Matching characters

. (period)

Matches any single character.

a.c can match abc or axc but not axyc.

[ ] (square brackets)

This defines a "character class" and matches any single character out of the characters listed between the brackets (e.g. [abc] can match any one of a, b, or c), with the exception of the following:

- (dash): The most common use of character classes is to specify ranges of characters in the ASCII table by using a dash, as in [0-9], which can match any character between 0 and 9. Note that, since it is based on the ASCII table, [A-z] will match A through Z, [, \, ], ^, _, `, and a through z. To match any alphabetic character, regardless of case, use [A-Za-z] or turn on the "Ignore case" checkbox in the Search window. To include a dash as a regular character to be matched, place it first (but after any leading caret), last, or as the second endpoint of a range. ([.-.] allows a dash to be the first character in a range.)
^ (caret): This is only special when it is the first character after the open bracket. It then means "match any single character other than what follows." As an example, [^abc] can match d. Note that, since it is based on the ASCII table, [^0-9] will match any ASCII character other than 0 through 9, including newlines and weird binary noise!

Character classes cannot be nested, so [ is not a special character.

Since ] normally ends the list, it must be listed first (but after any leading caret) in order to include it as a regular character.
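
The character class rules above can be checked with a short experiment. This is a sketch that uses Python's re module in place of the program's own engine (an assumption for illustration); the sample strings are made up.

```python
import re

# [A-z] spans the ASCII codes between Z and a, so it also matches "_" and "^".
wide = re.findall(r"[A-z]", "a_Z^1")

# [A-Za-z] matches letters only.
letters = re.findall(r"[A-Za-z]", "a_Z^1")

# A negated class matches anything that is not listed, including a newline.
non_digits = re.findall(r"[^0-9]", "4\n2x")

# "]" listed first and "-" listed last are ordinary members of the class.
literal = re.findall(r"[]x-]", "a]b-cx")

print(wide)        # ['a', '_', 'Z', '^']
print(letters)     # ['a', 'Z']
print(non_digits)  # ['\n', 'x']
print(literal)     # [']', '-', 'x']
```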

Types of characters

\t: Matches a horizontal tab character.
\n: Matches a newline character.
\w: Matches a single "word" character: an alphanumeric character or connector punctuation, e.g., underscore.
\W: The opposite of \w.
\s: Matches a single whitespace character.
\S: The opposite of \s.
\d: Matches a single decimal digit. Equivalent to [0-9].
\D: The opposite of \d. Matches any single character other than a decimal digit. Equivalent to [^0-9].
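
A quick way to see what each of these escapes matches is to list the matching characters in a sample string. The sketch below uses Python's re module as a stand-in engine (an assumption); the re.ASCII flag is passed so that \w and \d are limited to ASCII, as in the descriptions above.

```python
import re

text = "id_7\tok\n"

# Letters, digits, and the underscore are word characters.
word_chars = re.findall(r"\w", text, re.ASCII)

# The tab and the newline are the only whitespace in the sample.
spaces = re.findall(r"\s", text, re.ASCII)

# \d and [0-9] pick out the same characters.
digits = re.findall(r"\d", text, re.ASCII)
digits_explicit = re.findall(r"[0-9]", text)

# \D is the complement of \d.
non_digits = re.findall(r"\D", "a1b2", re.ASCII)

print(word_chars)  # ['i', 'd', '_', '7', 'o', 'k']
print(spaces)      # ['\t', '\n']
print(digits)      # ['7']
print(non_digits)  # ['a', 'b']
```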

Suffixes that specify the number of matches

These suffixes are called "quantifiers" and act on the sub-expression that they follow.

? (question mark): Matches zero or one occurrence.
ab? can match both a and ab
* (asterisk): Matches zero or more occurrences.
ab* can match a, ab, abb, abbb, etc.
+ (plus): Matches one or more occurrences.
ab+ can match ab, abb, abbb, etc.
{min,max}: Matches at least min occurrences and at most max occurrences.
ab{2,4} can match abb, abbb, and abbbb
{n}: Matches exactly n occurrences.
{min,}: Matches at least min occurrences.

Note: A single open brace ({) that is not part of one of the above expressions is treated as an ordinary character.
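
The quantifier examples above can be verified by testing each pattern against a list of candidate strings. This is a sketch using Python's re module as a stand-in engine (an assumption); the helper name accepted is hypothetical.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["a", "ab", "abb", "abbb", "abbbb", "abbbbb"]

optional = accepted(r"ab?", candidates)       # zero or one b
any_number = accepted(r"ab*", candidates)     # zero or more b's
one_or_more = accepted(r"ab+", candidates)    # at least one b
bounded = accepted(r"ab{2,4}", candidates)    # two to four b's
exact = accepted(r"ab{3}", candidates)        # exactly three b's
at_least = accepted(r"ab{4,}", candidates)    # four or more b's

# An open brace that does not start a quantifier is an ordinary character.
lone_brace = re.fullmatch(r"a{b", "a{b") is not None

print(optional)  # ['a', 'ab']
print(bounded)   # ['abb', 'abbb', 'abbbb']
print(exact)     # ['abbb']
print(at_least)  # ['abbbb', 'abbbbb']
```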

Combining expressions

| (vertical bar): Either one expression or the other
a[xy]|b[xy] can match ax, ay, bx, and by

Grouping

Parentheses define a subexpression. They act as grouping symbols, so that quantifiers and the vertical bar act on the complete expression rather than on a single character or character class. For example, (ab){1,2} can match ab and abab, while ab{1,2} can match ab and abb.
Parentheses also "capture" what they match so that it can later be used in backreferences (see below) and the replacement pattern (see Regular Expression Replacement, below). For example, ([ab])([cd]) matches exactly the same strings as [ab][cd], except that what each character class matched can be extracted individually and used in a backreference or the replacement pattern.
Parentheses can also be named, so that it is easier to refer to what they captured. The syntax is (?P<name>...), where name can contain only letters, numbers, and underscores. Each name can be used only once inside the regex. For example, (?P<first>[ab])(?P<second>[cd]) matches exactly the same strings as [ab][cd], except that what each character class matched can be extracted individually by name and used in a backreference or the replacement pattern.
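
The difference between grouping, numbered capture, and named capture can be seen in a short experiment. The sketch below uses Python's re module as a stand-in engine (an assumption); it accepts the same (?P<name>...) syntax.

```python
import re

samples = ["ab", "abab", "abb"]

# With parentheses the quantifier repeats "ab"; without them it repeats "b".
grouped = [s for s in samples if re.fullmatch(r"(ab){1,2}", s)]
ungrouped = [s for s in samples if re.fullmatch(r"ab{1,2}", s)]

# Numbered captures are retrieved by position, counting from 1.
m = re.fullmatch(r"([ab])([cd])", "bc")
numbered = (m.group(1), m.group(2))

# Named captures are retrieved by name.
m = re.fullmatch(r"(?P<first>[ab])(?P<second>[cd])", "ad")
named = (m.group("first"), m.group("second"))

print(grouped)    # ['ab', 'abab']
print(ungrouped)  # ['ab', 'abb']
print(numbered)   # ('b', 'c')
print(named)      # ('a', 'd')
```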

Backreferences

Backreferences allow you to construct an expression which says, in effect, "Match something, and then match what was previously matched." The syntax is \N, where N is an integer that does not start with a zero and is the index of a previously encountered subexpression. For example, ([ab])\1 can match aa and bb.
Named backreferences allow easier access to a previously encountered subexpression. The syntax is (?P=name). For example, (?P<x>[ab])(?P=x) can match aa and bb.
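
A backreference matches the text that the group captured, not the group's pattern, so mixed pairs are rejected. The sketch below checks this with Python's re module as a stand-in engine (an assumption); the group name x is arbitrary.

```python
import re

words = ["aa", "bb", "ab", "ba"]

# \1 must repeat exactly what the first group matched.
numbered = [w for w in words if re.fullmatch(r"([ab])\1", w)]

# The same test with a named group and a named backreference.
named = [w for w in words if re.fullmatch(r"(?P<x>[ab])(?P=x)", w)]

# Without the backreference, any pair of a's and b's is accepted.
no_backref = [w for w in words if re.fullmatch(r"[ab][ab]", w)]

print(numbered)    # ['aa', 'bb']
print(named)       # ['aa', 'bb']
print(no_backref)  # ['aa', 'bb', 'ab', 'ba']
```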

Matching special positions (anchors)

^ (caret): Forces the following expression to match only at the beginning of a line.
^a can match the a at the beginning of a line
$ (dollar sign): Forces the preceding expression to match only at the end of a line.
a$ can match the a at the end of a line
\b: Forces the following expression to match only at the beginning or end of a word, between a character matching \W and a character matching \w.
\ba can match the a in art, but not the a in bad, in the string bad art.
\B: Forces the following expression to match only when not at the beginning or end of a word.
\Ba can match the a in bad, but not the a in art, in the string bad art.
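
Anchors match positions rather than characters, so the clearest check is to print where each pattern matches. The sketch below uses Python's re module as a stand-in engine (an assumption). In Python, ^ and $ match at line boundaries only when the re.MULTILINE flag is given, so that flag is passed here to mimic the line-based behavior described above.

```python
import re

def positions(pattern, text, flags=0):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text, flags)]

text = "bad art"   # indices: b=0 a=1 d=2 space=3 a=4 r=5 t=6

word_start = positions(r"\ba", text)   # only the a that begins "art"
inside = positions(r"\Ba", text)       # only the a inside "bad"

lines = "ab\nba\nac"
line_start = positions(r"^a", lines, re.MULTILINE)  # lines "ab" and "ac"
line_end = positions(r"a$", lines, re.MULTILINE)    # line "ba"

print(word_start)  # [4]
print(inside)      # [1]
print(line_start)  # [0, 6]
print(line_end)    # [4]
```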

Look-behind and look-ahead assertions

(?<=y)x: Forces the expression x to match only if preceded by the expression y.
(?<=a)b can match only the second occurrence of b in the string cbabc
(?<!y)x: Forces the expression x to match only if not preceded by the expression y.
(?<!a)b can match only the first occurrence of b in the string cbabc
x(?=y): Forces the expression x to match only if followed by the expression y.
b(?=a) can match only the first occurrence of b in the string cbabc
x(?!y): Forces the expression x to match only if not followed by the expression y.
b(?!a) can match only the second occurrence of b in the string cbabc
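
The four examples above all search the string cbabc, in which the first b is at index 1 and the second b is at index 3. The sketch below prints the index that each assertion selects, using Python's re module as a stand-in engine (an assumption).

```python
import re

def positions(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

text = "cbabc"   # indices: c=0 b=1 a=2 b=3 c=4

preceded_by_a = positions(r"(?<=a)b", text)      # second b
not_preceded_by_a = positions(r"(?<!a)b", text)  # first b
followed_by_a = positions(r"b(?=a)", text)       # first b
not_followed_by_a = positions(r"b(?!a)", text)   # second b

# The assertion itself is not part of the match: only "b" is returned.
matched_text = re.search(r"(?<=a)b", text).group(0)

print(preceded_by_a, not_preceded_by_a)      # [3] [1]
print(followed_by_a, not_followed_by_a)      # [1] [3]
print(matched_text)                          # b
```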

Modifying options inside an expression

(?i): The rest of the enclosing group (or the rest of the expression, if the option is not inside parentheses) will ignore case, even if the option is not set in the Search Text dialog.
(?-i): The rest of the enclosing group (or the rest of the expression, if the option is not inside parentheses) will not ignore case, even if the option is set in the Search Text dialog.
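
The idea of switching case sensitivity inside the pattern can be tried in Python's re module, used here as a stand-in engine (an assumption). Python differs from the syntax described above in one respect: a bare (?i) is accepted only at the very start of the pattern, and changes in mid-pattern must use the scoped forms (?i:...) and (?-i:...). The re.IGNORECASE flag plays the role of the "Ignore case" checkbox.

```python
import re

# (?i) at the start of the pattern ignores case for the whole pattern.
whole = re.search(r"(?i)abc", "xABCx").group(0)

# Scoped form: only the b inside the group ignores case.
scoped_hit = re.fullmatch(r"a(?i:b)c", "aBc") is not None
scoped_miss = re.fullmatch(r"a(?i:b)c", "ABc") is not None

# (?-i:...) turns the option off inside the group even though the
# IGNORECASE flag is set for the rest of the pattern.
off_hit = re.fullmatch(r"(?-i:a)b", "aB", re.IGNORECASE) is not None
off_miss = re.fullmatch(r"(?-i:a)b", "AB", re.IGNORECASE) is not None

print(whole)                    # ABC
print(scoped_hit, scoped_miss)  # True False
print(off_hit, off_miss)        # True False
```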

Special expressions

There are several "shorthands" that can be used inside character classes:

[:alnum:]    [A-Za-z0-9]
[:alpha:]    [A-Za-z]
[:blank:]    space or tab
[:cntrl:]    any control character
[:digit:]    [0-9] or \d
[:graph:]    any printable character except space
[:lower:]    [a-z]
[:print:]    any printable character including space
[:punct:]    any printable character except [ A-Za-z0-9]
[:space:]    space, tab, newline, carriage return, form feed, vertical tab
[:upper:]    [A-Z]
[:word:]     any word character, i.e., same as \w
[:xdigit:]   any hexadecimal digit: [0-9A-Fa-f]

As an example, [^abc[:digit:]] can match any character other than a, b, c, and 0 through 9. You can also place a caret after the first colon to negate the class, e.g., [:^digit:].
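
The equivalences in the table can be made concrete by rewriting a class name into the explicit ranges it stands for. Python's re module, used here as a stand-in engine (an assumption), does not implement the [:name:] shorthands, so the hypothetical helper expand_posix below substitutes the ranges from the table before matching. It covers only a few of the names and does not handle the negated [:^name:] form.

```python
import re

# Explicit equivalents copied from the table above.
POSIX_CLASSES = {
    "alnum": "A-Za-z0-9",
    "alpha": "A-Za-z",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}

def expand_posix(pattern):
    """Replace each [:name:] with the explicit ranges it abbreviates."""
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_CLASSES[m.group(1)], pattern)

expanded = expand_posix("[^abc[:digit:]]")
matches = re.findall(expanded, "a1b2xyz")

print(expanded)  # [^abc0-9]
print(matches)   # ['x', 'y', 'z']
```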

Paraphrased from the PCRE man page and Mastering Regular Expressions, by Jeffrey Friedl, O'Reilly, 1997

Regular Expression Replacement

When you check the "Regex replace" checkbox in the Search Text window, you get more flexibility for specifying how to perform replacement after a successful search. The syntax is much simpler than for searching. Every character other than backslash (\) and dollar ($) acts the same as before.

When a backslash is followed by one of the following characters, it is translated just like in the search pattern:

\\: backslash
\$: dollar
\t: horizontal tab
\n: newline

When a dollar is followed by a positive integer N, this is converted into the string that the Nth pair of parentheses in the search string matched. $0 represents the entire match, even if it was not enclosed in parentheses. As an example:

ab(c[ef])gh: This can match abcegh and abcfgh.
wx$1yz: This converts the above matches as follows:
abcegh -> wxceyz
abcfgh -> wxcfyz

When a dollar is followed by a name in curly braces, this is converted into the string that the named pair of parentheses in the search string matched. As an example:

ab(?P<x>c[ef])gh: This can match abcegh and abcfgh.
wx${x}yz: This converts the above matches as follows:
abcegh -> wxceyz
abcfgh -> wxcfyz
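
The replacement rules described so far (backslash escapes, $N, and ${name}) can be sketched as a small expansion function. This is an illustration only, not the program's implementation: it uses Python's re module as a stand-in engine, the names expand and replace_all are hypothetical, and leaving unknown escapes untouched is an assumption.

```python
import re

# Backslash escapes listed above.
ESCAPES = {"\\": "\\", "$": "$", "t": "\t", "n": "\n"}

# One token of a replacement pattern: \c, $N, or ${name}.
TOKEN = re.compile(r"\\(.)|\$(\d+)|\$\{(\w+)\}", re.DOTALL)

def expand(replacement, match):
    """Expand a replacement pattern against one successful match."""
    def substitute(token):
        escaped, number, name = token.groups()
        if escaped is not None:
            # Unknown escapes are left as typed (an assumption of this sketch).
            return ESCAPES.get(escaped, token.group(0))
        key = int(number) if number is not None else name
        # A group that took no part in the match expands to nothing.
        return match.group(key) or ""
    return TOKEN.sub(substitute, replacement)

def replace_all(pattern, replacement, text):
    """Replace every match of pattern in text with the expanded replacement."""
    return re.sub(pattern, lambda m: expand(replacement, m), text)

print(replace_all(r"ab(c[ef])gh", "wx$1yz", "abcegh"))          # wxceyz
print(replace_all(r"ab(?P<x>c[ef])gh", "wx${x}yz", "abcfgh"))   # wxcfyz
print(replace_all(r"c[ef]", "<$0>", "abcegh"))                  # ab<ce>gh
```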

Without named parentheses, complicated regexes with nested parentheses can make it very difficult to determine the correct value of N. There are two features that alleviate this problem.

If the opening parenthesis is followed by ?:, then the contents are not counted when determining N. As an example, if the regex is ((?:red|white) (king|queen)), then the replacement pattern $2 will produce "king" or "queen" rather than "red" or "white."
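
The effect of ?: on group numbering can be checked by comparing the same regex with and without it. The sketch below uses Python's re module as a stand-in engine (an assumption); it numbers groups the same way, by the position of the opening parenthesis.

```python
import re

text = "red king"

# With ?:, the color group is not counted, so group 2 is the rank.
m = re.fullmatch(r"((?:red|white) (king|queen))", text)
with_noncapturing = (m.group(1), m.group(2))
group_count = len(m.groups())

# Without ?:, the color becomes group 2 and the rank moves to group 3.
m = re.fullmatch(r"((red|white) (king|queen))", text)
all_capturing = (m.group(1), m.group(2), m.group(3))

print(with_noncapturing)  # ('red king', 'king')
print(group_count)        # 2
print(all_capturing)      # ('red king', 'red', 'king')
```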

For the hackers out there, you can also use negative integers after a dollar. $-1 is converted into the string that the last pair of parentheses in the search string matched, $-2 is converted into the string that the next to last pair of parentheses in the search string matched, etc. This can be useful if you want to count from the end of the search string.
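
Counting groups from the end corresponds to negative indexing into the list of captured strings. The sketch below illustrates the idea with Python's re module as a stand-in engine (an assumption); the helper group_from_end is hypothetical, and ordering groups by their opening parenthesis is also an assumption about how the program counts.

```python
import re

def group_from_end(match, n):
    """Return what $-n would refer to: the n-th group counted from the end."""
    return match.groups()[-n]

m = re.fullmatch(r"(a)(b)(c)", "abc")

last = group_from_end(m, 1)          # same as $3 here
next_to_last = group_from_end(m, 2)  # same as $2 here

print(last)          # c
print(next_to_last)  # b
```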

For the experts...

The best book on the subject is:

Mastering Regular Expressions, by Jeffrey Friedl, O'Reilly, 1997.

The PCRE man pages are the final authority on how regular expressions behave. They describe many features that are too complicated and/or obscure to be included here, e.g., conditional subpatterns and subpattern recursion.

Matching ambiguities (the longest of the leftmost)

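It is easy to write a regular expression that can match a given string in many ways at many positions. The rule that resolves such ambiguities is that the match that starts closest to the beginning of the string wins. If more than one match is possible at this position, each quantifier is "greedy" and matches as many occurrences as it can, so the longest match generally wins, unless you follow the quantifier with ? to make it non-greedy, e.g., a+? or a*?. Alternatives separated by | are tried from left to right, so the first alternative that leads to a match wins, even if a later one could match more.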
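
Both parts of the rule can be observed directly: the leftmost starting position wins first, and the quantifier then decides how much is consumed. The sketch below uses Python's re module as a stand-in engine (an assumption); the sample strings are made up.

```python
import re

# Leftmost wins: "a" starts at index 1, before the longer run of b's.
leftmost = re.search(r"b+|a", "xabbb").group(0)

# Greedy: .+ consumes as much as it can while still allowing a match.
greedy = re.search(r"<.+>", "<b>bold</b>").group(0)

# Non-greedy: .+? stops at the first position that allows a match.
lazy = re.search(r"<.+?>", "<b>bold</b>").group(0)

print(leftmost)  # a
print(greedy)    # <b>bold</b>
print(lazy)      # <b>
```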

Operator precedence

When more than one operator can apply to a subexpression, they are applied in the following order of precedence:

(highest precedence, applied first)

() [] ^ $ \
* + ? {,}
catenation
|

(lowest precedence, applied last)

As an example, a|b* is equivalent to (a)|(b*) and matches a single a or any number of b's rather than being equivalent to (a|b)* and matching any number of consecutive a's and b's.
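
The precedence example can be confirmed by testing which whole strings each form accepts. This is a sketch using Python's re module as a stand-in engine (an assumption); the helper name accepted is hypothetical.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["a", "bbb", "", "ab", "abab"]

# The quantifier binds more tightly than the vertical bar.
implicit = accepted(r"a|b*", candidates)
explicit = accepted(r"(a)|(b*)", candidates)

# Parentheses make the quantifier apply to the whole alternation.
grouped = accepted(r"(a|b)*", candidates)

print(implicit)  # ['a', 'bbb', '']
print(explicit)  # ['a', 'bbb', '']
print(grouped)   # ['a', 'bbb', '', 'ab', 'abab']
```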