When you turn on the "Regex search (egrep)" option in the Search Text window, you turn on the full power of Perl Regular Expressions. Most strings that you enter will work just as before, matching exactly what you typed in, but some characters have special meaning, as described below.
To guarantee that a non-alphabetic character does not have a special meaning, precede it with a backslash (\). This also works for backslash itself, so \\ matches a single backslash.
Note that since Perl regular expressions are incredibly powerful, this section only provides a summary of the most useful options.
A period (.) matches any single character: a.c can match abc or axc, but not axyc.
[abc] can match any one of a, b, or c. Inside the brackets, characters stand for themselves, with the exception of the following:
A dash (-) specifies a range, e.g., [0-9], which can match any character between 0 and 9. Note that, since it is based on the ASCII table, [A-z] will match A through Z, [, \, ], ^, _, `, and a through z. To match any alphabetic character, regardless of case, use [A-Za-z] or turn on the "Ignore case" checkbox in the Search window. ([.-.] allows a dash to be the first character in a range.)
A leading caret (^) negates the list: [^abc] can match d. Note that, since it is based on the ASCII table, [^0-9] will match any ASCII character other than 0 through 9, including newlines and weird binary noise!
Character classes cannot be nested, so [ is not a special character.
Since ] normally ends the list, it must be listed first (but after any leading caret) in order to include it as a regular character.
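The character class rules above can be checked with a short script. This is only a sketch: it uses Python's `re` module as a stand-in for the application's regex engine, which is an assumption, although these particular rules behave the same way in both.

```python
import re

# The six ASCII characters that sit between 'Z' and 'a'.
between = "[\\]^_`"

# [A-z] spans the whole ASCII range from 'A' to 'z', so it accepts them all.
range_hits = [ch for ch in between if re.fullmatch(r"[A-z]", ch)]

# [A-Za-z] accepts letters only, so none of them match.
letter_hits = [ch for ch in between if re.fullmatch(r"[A-Za-z]", ch)]

# A negated class matches anything not listed, including a newline.
newline_matches = re.fullmatch(r"[^0-9]", "\n") is not None

# ] listed first is an ordinary member of the class ...
bracket_in_class = re.fullmatch(r"[]a]", "]") is not None
# ... and after a leading caret it is an ordinary excluded character.
bracket_excluded = re.fullmatch(r"[^]a]", "]") is None

print(range_hits, letter_hits, newline_matches, bracket_in_class, bracket_excluded)
```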
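\w matches any single word character (a letter, digit, or underscore).
\s matches any single whitespace character.
\d matches any single decimal digit. Equivalent to [0-9].
\D matches any single character other than a decimal digit. Equivalent to [^0-9].
The following suffixes are called "quantifiers" and act on the sub-expression that they follow.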
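? matches zero or one occurrence: ab? can match both a and ab.
* matches zero or more occurrences: ab* can match a, ab, abb, abbb, etc.
+ matches one or more occurrences: ab+ can match ab, abb, abbb, etc.
{min,max} matches at least min occurrences and at most max occurrences: ab{2,4} can match abb, abbb, and abbbb.
{n} matches exactly n occurrences.
{min,} matches at least min occurrences.
Note: A single open brace ({) that is not part of one of the above expressions is treated as an ordinary character.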
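As a quick way to see which strings each quantifier accepts, the sketch below runs the example patterns against a fixed list of candidates. It assumes Python's `re` module behaves like the application's engine for these constructs; `matches` is a helper name introduced here for illustration.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["a", "ab", "abb", "abbb", "abbbb", "abbbbb"]

optional = matches(r"ab?", candidates)      # zero or one b
star = matches(r"ab*", candidates)          # zero or more b's
plus = matches(r"ab+", candidates)          # one or more b's
bounded = matches(r"ab{2,4}", candidates)   # two to four b's
exact = matches(r"ab{3}", candidates)       # exactly three b's
at_least = matches(r"ab{2,}", candidates)   # two or more b's

# A brace that does not form a quantifier is an ordinary character.
lone_brace = re.fullmatch(r"a{", "a{") is not None

print(optional, bounded, exact, lone_brace)
```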
| separates alternatives: a[xy]|b[xy] can match ax, ay, bx, and by.
Parentheses group a subexpression: (ab){1,2} can match ab and abab, while ab{1,2} can match ab and abb.
Parentheses also capture what they match: ([ab])([cd]) matches exactly the same strings as [ab][cd], except that what each character class matched can be extracted individually and used in a backreference or the replacement pattern.
A subexpression can be named with (?P<name>...), where name can contain only letters, numbers, and underscores. Each name can only be used once inside the regex. (?P<first>[ab])(?P<second>[cd]) matches exactly the same strings as [ab][cd], except that what each character class matched can be extracted individually by name and used in a backreference or the replacement pattern.
A backreference is written \N, where N is an integer that does not start with a zero and is the index of a previously encountered subexpression. ([ab])\1 can match aa and bb.
A named backreference is written (?P=name). (?P<x>[ab])(?P=x) can match aa and bb.
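The grouping, naming, and backreference examples can be tried out as follows. This is a sketch using Python's `re` module, which happens to share the (?P<name>...) and (?P=name) syntax; that it matches the application's engine in every detail is an assumption.

```python
import re

# Numbered groups: each parenthesized part can be extracted separately.
numbered = re.fullmatch(r"([ab])([cd])", "ad").groups()

# Named groups: the same parts, extracted by name.
named = re.fullmatch(r"(?P<first>[ab])(?P<second>[cd])", "bc").groupdict()

pairs = ["aa", "ab", "ba", "bb"]
# \1 must repeat exactly what group 1 matched.
doubled = [s for s in pairs if re.fullmatch(r"([ab])\1", s)]
# (?P=x) does the same for the group named x.
doubled_named = [s for s in pairs if re.fullmatch(r"(?P<x>[ab])(?P=x)", s)]

# A quantifier applies to the whole group, or only to the last character.
words = ["ab", "abab", "abb"]
grouped = [s for s in words if re.fullmatch(r"(ab){1,2}", s)]
ungrouped = [s for s in words if re.fullmatch(r"ab{1,2}", s)]

print(numbered, named, doubled, grouped, ungrouped)
```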
^a can match the a at the beginning of a line.
a$ can match the a at the end of a line.
\b matches at the boundary between a character matching \W and a character matching \w. In the string bad art, \ba can match the a in art, but not the a in bad.
\B matches anywhere that is not such a boundary. In the string bad art, \Ba can match the a in bad, but not the a in art.
(?<=a)b can match only the second occurrence of b in the string cbabc.
(?<!a)b can match only the first occurrence of b in the string cbabc.
b(?=a) can match only the first occurrence of b in the string cbabc.
b(?!a) can match only the second occurrence of b in the string cbabc.
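To make the positions in these examples concrete, the sketch below reports the offset at which each pattern matches. It assumes Python's `re` module as a stand-in for the application's engine; `re.MULTILINE` is used so that ^ and $ work per line, and `starts` is a helper name introduced here.

```python
import re

def starts(pattern, text, flags=0):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text, flags)]

# Two lines: "xa" and "ab". Offsets: x=0, a=1, newline=2, a=3, b=4.
lines = "xa\nab"
line_start = starts(r"^a", lines, re.MULTILINE)  # the a that begins line 2
line_end = starts(r"a$", lines, re.MULTILINE)    # the a that ends line 1

# Offsets: b=0, a=1, d=2, space=3, a=4, r=5, t=6.
phrase = "bad art"
at_boundary = starts(r"\ba", phrase)   # the a in "art"
not_boundary = starts(r"\Ba", phrase)  # the a in "bad"

# Offsets: c=0, b=1, a=2, b=3, c=4.
s = "cbabc"
after_a = starts(r"(?<=a)b", s)      # b preceded by a
not_after_a = starts(r"(?<!a)b", s)  # b not preceded by a
before_a = starts(r"b(?=a)", s)      # b followed by a
not_before_a = starts(r"b(?!a)", s)  # b not followed by a

print(line_start, line_end, at_boundary, not_boundary)
print(after_a, not_after_a, before_a, not_before_a)
```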
There are several "shorthands" that can be used inside character classes:
[:alnum:]   [A-Za-z0-9]
[:alpha:]   [A-Za-z]
[:blank:]   space or tab
[:cntrl:]   any control character
[:digit:]   [0-9] or \d
[:graph:]   any printable character except space
[:lower:]   [a-z]
[:print:]   any printable character including space
[:punct:]   any printable character except [ A-Za-z0-9]
[:space:]   space, tab, newline, carriage return, form feed, vertical tab
[:upper:]   [A-Z]
[:word:]    any word character, i.e., same as \w
[:xdigit:]  any hexadecimal digit: [0-9A-Fa-f]
As an example, [^abc[:digit:]] can match any character other than a, b, c, and 0 through 9. You can also place a caret after the first colon to negate the class, e.g., [:^digit:].
Paraphrased from the PCRE man page and Mastering Regular Expressions, by Jeffrey Friedl, O'Reilly, 1997
When you check the "Regex replace" checkbox in the Search Text window, you get more flexibility for specifying how to perform replacement after a successful search. The syntax is much simpler than for searching. Every character other than backslash (\) and dollar ($) acts the same as before.
When a backslash is followed by one of the following characters, it is translated just like in the search pattern:
When a dollar is followed by a positive integer N, this is converted into the string that the Nth pair of parentheses in the search string matched. $0 represents the entire match, even if it was not enclosed in parentheses. As an example:
The search pattern matches abcegh and abcfgh, and the replacement produces:
abcegh -> wxceyz
abcfgh -> wxcfyz
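The search and replacement patterns behind this example are not shown here, so the pair below is an assumption chosen to reproduce the listed results. The sketch uses Python's `re` module, which writes a numbered reference as \g<1> (or \1) where this document's syntax uses $1.

```python
import re

# Hypothetical search pattern: group 1 captures "ce" or "cf".
search = r"ab(c[ef])gh"
# Python spelling of a replacement that would read wx$1yz in dollar syntax.
replace = r"wx\g<1>yz"

results = {s: re.sub(search, replace, s) for s in ["abcegh", "abcfgh"]}

# \g<0> plays the role of $0: the entire match, parentheses or not.
whole = re.sub(search, r"[\g<0>]", "abcegh")

print(results, whole)
```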
When a dollar is followed by a name in curly braces, this is converted into the string that the named pair of parentheses in the search string matched. As an example:
The search pattern matches abcegh and abcfgh, and the replacement produces:
abcegh -> wxceyz
abcfgh -> wxcfyz
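A named version of the same replacement might look like the sketch below. The group name `x` and both patterns are assumptions made for illustration, and Python's `re` module writes the reference as \g<x> where this document's syntax uses a dollar followed by the name in curly braces.

```python
import re

# Hypothetical search pattern with a group named x.
search = r"ab(?P<x>c[ef])gh"
# Python spelling of a replacement that refers to the group by name.
replace = r"wx\g<x>yz"

named_results = {s: re.sub(search, replace, s) for s in ["abcegh", "abcfgh"]}
print(named_results)
```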
Without named parentheses, complicated regexes with nested parentheses can make it very difficult to determine the correct value of N. There are two features that alleviate this problem.
If the opening parenthesis is followed by ?:, then the contents are not counted when determining N. As an example, if the regex is ((?:red|white) (king|queen)), then the replacement pattern $2 will produce "king" or "queen" rather than "red" or "white."
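The effect of ?: on group numbering can be checked directly. This is a sketch using Python's `re` module, which numbers groups the same way for this pattern; \g<2> is Python's spelling of $2.

```python
import re

pattern = r"((?:red|white) (king|queen))"

# Only two groups are counted: the outer pair and (king|queen).
groups = re.fullmatch(pattern, "white queen").groups()

# Group 2 is therefore the title, not the color.
title = re.sub(pattern, r"\g<2>", "red king")

print(groups, title)
```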
For the hackers out there, you can also use negative integers after a dollar. $-1 is converted into the string that the last pair of parentheses in the search string matched, $-2 into the string that the next-to-last pair matched, etc. This can be useful if you want to count from the end of the search string.
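Python's `re` module has no negative group references, but the idea can be sketched by converting a count from the end into an ordinary group number. The helper `group_from_end` is hypothetical, and it assumes that "last" means the pair whose opening parenthesis comes last in the pattern.

```python
import re

def group_from_end(match, k):
    """Return what the k-th capturing group from the end matched (k=1 is the last)."""
    # match.re.groups is the total number of capturing groups in the pattern.
    return match.group(match.re.groups + 1 - k)

m = re.fullmatch(r"(a)(b)(c)", "abc")
last = group_from_end(m, 1)          # plays the role of $-1
next_to_last = group_from_end(m, 2)  # plays the role of $-2

print(last, next_to_last)
```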
Mastering Regular Expressions, by Jeffrey Friedl, O'Reilly, 1997.
The PCRE man pages are the final authority on how regular expressions behave. Therein are described many features which are too complicated and/or obscure to be included here, e.g., lookahead and lookbehind assertions, conditional subpatterns, and subpattern recursion.
It is easy to write a regular expression that can match a given string in many ways at many positions. The rule that avoids such ambiguities is that the match that starts closest to the beginning of the string wins. If more than one match is possible at this position, each quantifier matches as much as it can while still letting the overall pattern match, unless you explicitly append ? to a quantifier to make it non-greedy.
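The leftmost rule and the difference between greedy and non-greedy quantifiers can be seen in a few lines. This is a sketch using Python's `re` module, assumed to follow the same Perl-style matching rules; the sample strings are made up for illustration.

```python
import re

# The leftmost match wins, even though a longer run of b's comes later.
leftmost = re.search(r"b+", "abbcbbbb").span()

text = "<b>one</b> <b>two</b>"
# Greedy: .* runs as far as it can, so the match spans both tags.
greedy = re.search(r"<b>.*</b>", text).group()
# Non-greedy: .*? stops at the first place the rest of the pattern fits.
lazy = re.search(r"<b>.*?</b>", text).group()

print(leftmost, greedy, lazy)
```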
When more than one operator can apply to a subexpression, they are applied in the following order of precedence:
(highest precedence, applied first) () [] ^ $ \ * + ? {,} catenation | (lowest precedence, applied last)
As an example, a|b* is equivalent to (a)|(b*) and matches a single a or any number of b's rather than being equivalent to (a|b)* and matching any number of consecutive a's and b's.
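The precedence example can be confirmed by comparing the two readings against the same candidates. This is a sketch using Python's `re` module, assumed to apply the same precedence; `full` is a helper name introduced here.

```python
import re

def full(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["a", "b", "bbb", "ab", "abba", ""]

# * binds tighter than |, so this is "a single a" or "any number of b's".
as_written = full(r"a|b*", candidates)
same_thing = full(r"(a)|(b*)", candidates)

# Explicit grouping changes the meaning: any mix of a's and b's.
regrouped = full(r"(a|b)*", candidates)

print(as_written, regrouped)
```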