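A character without special meaning in regex matches itself: regex x matches substring 'x'; regex 9 matches '9'; regex = matches '='; and regex @ matches '@'.
These characters have special meaning in regex: ., +, *, ?, ^, $, (, ), [, ], {, }, |, \.
To match a character that has special meaning, prefix it with a backslash (\). E.g., \. matches '.'; regex \+ matches '+'; and regex \( matches '('.
Use \\ to match '\' (back-slash).
Regex also recognizes escape sequences such as \n for newline, \t for tab, \r for carriage-return, \nnn for an octal number of up to 3 digits, \xhh for a two-digit hex code, \uhhhh for a 4-digit Unicode, \uhhhhhhhh for an 8-digit Unicode.
Saturday matches 'Saturday'. The matching, by default, is case-sensitive, but can be set to case-insensitive via a modifier.
four|4 accepts strings 'four' or '4'.
[aeiou] matches 'a', 'e', 'i', 'o' or 'u'.
[0-9] matches any digit; [A-Za-z] matches any uppercase or lowercase letter.
[^0-9] matches any non-digit.
Inside a bracket list, only ^, -, ] and \ require an escape sequence.
+ means one or more (1+), e.g., [0-9]+ matches one or more digits such as '123', '000'.
* means zero or more (0+), e.g., [0-9]* matches zero or more digits. It accepts all those in [0-9]+ plus the empty string.
? means zero or one, e.g., [+-]? matches an optional '+', '-', or an empty string.
{m,n} means m to n times (both inclusive); {m} means exactly m times; {m,} means m or more (m+).
. matches any single character except newline, same as [^\n]; \d matches a digit, same as [0-9]; \w matches a word character, same as [a-zA-Z0-9_]; \s matches a whitespace, same as [ \n\r\t\f].
^ and $ match the start-of-line and end-of-line, e.g., ^[0-9]+$ matches a numeric string.
\b matches a word boundary, e.g., \bcat\b matches the word 'cat' in the input string. \B is the inverse of \b.
\< and \> match the start-of-word and end-of-word in flavors that support them. E.g., \<cat\> matches the word 'cat' in the input string.
Use parentheses ( ) to create a back reference.
Use $1, $2, .. (Java, Perl, JavaScript) or \1, \2, .. (Python) to retrieve the back references in sequential order.
The lazy forms of the repetition operators are *?, +?, ??, {m,n}?, {m,}?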
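As a quick illustration of the summary above, here is a minimal sketch in Python using the standard re module. The sample text and variable names are made up for this example; they do not come from the tutorial.

```python
import re

# Hypothetical sample text, chosen only to exercise the summary items.
text = "The cat sat; concat 42 cats, 4 or four?"

# \bcat\b: the whole word 'cat', not the 'cat' inside 'concat' or 'cats'.
whole_word = re.findall(r"\bcat\b", text)

# four|4: alternation. Note that 4 also matches the '4' inside '42'.
alternatives = re.findall(r"four|4", text)

# [0-9]+: one or more digits.
numbers = re.findall(r"[0-9]+", text)

# [0-9]+? is the lazy form: it stops after the shortest possible match.
lazy_first = re.search(r"[0-9]+?", text).group()

print(whole_word, alternatives, numbers, lazy_first)
```

Raw strings (r"...") are used so that the backslash in \b reaches the regex engine unchanged.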
This regex consists of two sub-expressions: [0-9] and +.
The [..], known as character class (or bracket list), encloses a list of characters. It matches any SINGLE character in the list. In this example, [0-9] matches any SINGLE character between 0 and 9 (i.e., a digit), where the dash (-) denotes the range.
The +, known as occurrence indicator (or repetition operator), indicates one or more occurrences (1+) of the previous sub-expression. In this case, [0-9]+ matches one or more digits.
If the input is 'abc123xyz', it matches substring '123'.
If the input is 'abcxyz', it matches nothing.
If the input is 'abc00123xyz456_0', it matches substrings '00123', '456' and '0' (three matches).
Take note that this regex also matches numbers with leading zeros, such as '000', '0123' and '0001', which may not be desirable.
You can also write \d+, where \d is known as a metacharacter that matches any digit (same as [0-9]). There is more than one way to write a regex! Take note that many programming languages (C, Java, JavaScript, Python) use backslash \ as the prefix for escape sequences (e.g., \n for newline), so inside an ordinary string literal you need to write '\\d+' instead.
See 'Python's re module for Regular Expression' for full coverage.
Python supports regex via the module re. Python also uses backslash (\) for escape sequences (i.e., you need to write \\ for \, \\d for \d), but it supports raw strings in the form of r'..', which ignore the interpretation of escape sequences - great for writing regex.
Java supports regex via the package java.util.regex.
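In Perl, a regex is delimited by a pair of forward slashes, in the form of /regex/. You can use built-in operators such as m/regex/, where the leading m is optional.
In Perl, you can use a single-quoted string '..' to write regex to disable interpretation of backslash (\) by Perl.
In JavaScript, a regex is also delimited by a pair of forward slashes /../. There are two sets of methods, issued via a RegExp object or a String object.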
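To make the notes on backslashes and raw strings concrete, here is a small Python sketch. It assumes the regex under discussion is [0-9]+ (equivalently \d+) and reuses the sample input 'abc00123xyz456_0'; the variable names are illustrative only.

```python
import re

text = "abc00123xyz456_0"

# Character-class form: no backslash involved.
by_class = re.findall("[0-9]+", text)

# Metacharacter form in an ordinary string: the backslash must be doubled.
by_escaped = re.findall("\\d+", text)

# Same regex as a raw string: the backslash is passed through untouched.
by_raw = re.findall(r"\d+", text)

# Both spellings produce the same two-character-plus pattern text.
same_pattern = ("\\d+" == r"\d+")

print(by_class, by_escaped, by_raw, same_pattern)
```

All three calls return the same three matches, which is why the raw-string form is usually preferred for readability.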
In ^[0-9]+$, the leading ^ and the trailing $ are known as position anchors, which match the start and end positions of the line, respectively. As a result, the entire input string must be matched fully, instead of a portion of the input string (substring).
This regex matches any non-empty numeric string, e.g., '0' and '12345'. It does not match '' (empty string), 'abc', 'a123', 'abc123xyz', etc. However, it also matches '000', '0123' and '0001' with leading zeros.
[1-9] matches any character between 1 and 9; [0-9]* matches zero or more digits. The * is an occurrence indicator representing zero or more occurrences. Together, [1-9][0-9]* matches any number without a leading zero.
| represents the OR operator, which is used to include the number 0.
This expression matches '0' and '123'; but does not match '000' and '0123' (but see below).
You can replace [0-9] by metacharacter \d, but not [1-9].
No position anchors ^ and $ are used in this regex. Hence, it can match any part of the input string. For example:
If the input string is 'abc123xyz', it matches the substring '123'.
If the input string is 'abcxyz', it matches nothing.
If the input string is 'abc123xyz456_0', it matches substrings '123', '456' and '0' (three matches).
If the input string is '0012300', it matches substrings '0', '0' and '12300' (three matches)!
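The difference between an anchored and an unanchored regex can be checked with a short Python sketch. It assumes the two regexes under discussion are ^[0-9]+$ and [1-9][0-9]*|0 (assembled from the description above); the helper name is_numeric is hypothetical.

```python
import re

FULL_NUMERIC = re.compile(r"^[0-9]+$")        # anchored: whole input must be digits
POSITIVE_INT = re.compile(r"[1-9][0-9]*|0")   # unanchored: may match substrings

def is_numeric(s):
    """Return True if s consists of one or more digits only."""
    return FULL_NUMERIC.search(s) is not None

# Anchored regex: accepts leading zeros, rejects empty or mixed input.
numeric_checks = [is_numeric(s) for s in ["0", "12345", "000", "", "abc123xyz"]]

# Unanchored regex: finds every integer-like substring.
found_mixed = POSITIVE_INT.findall("abc123xyz456_0")
found_zeros = POSITIVE_INT.findall("0012300")

print(numeric_checks, found_mixed, found_zeros)
```

The last call shows the surprise mentioned above: each leading zero is matched on its own by the |0 alternative.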
[+-] matches either a + or a - sign. ? is an occurrence indicator denoting 0 or 1 occurrence, i.e., optional. Hence, [+-]? matches an optional leading + or - sign.
We have now covered three occurrence indicators: + for one or more, * for zero or more, and ? for zero or one.
You can use metacharacter \w for a word character [a-zA-Z0-9_]. Recall that metacharacter \d can be used for a digit [0-9].
The position anchors ^ and $ match the beginning and the ending of the input string, respectively. That is, this regex shall match the entire input string, instead of a part of the input string (substring).
\w+ matches one or more word characters (same as [a-zA-Z0-9_]+).
\. matches the dot (.) character. We need to use \. to represent . as . has special meaning in regex. The \ is known as the escape code, which restores the original literal meaning of the following character. Similarly, *, +, ? (occurrence indicators), ^, $ (position anchors) have special meaning in regex. You need to use an escape code to match these characters.
(gif|png|jpg|jpeg) matches either 'gif', 'png', 'jpg' or 'jpeg'. The | denotes the 'OR' operator. The parentheses are used for grouping the selections.
The modifier i after the regex specifies case-insensitive matching (applicable to some languages like Perl and JavaScript only). That is, it accepts 'test.GIF' and 'TesT.Gif'.
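The pieces described above can be put together in a small Python sketch. The full regex ^\w+\.(gif|png|jpg|jpeg)$ is assembled here from those pieces, and the function name is_image_filename is hypothetical. Python has no trailing i modifier, so the sketch passes the re.IGNORECASE flag instead.

```python
import re

# Assembled from the sub-expressions discussed above (an assumption of this sketch).
IMAGE_FILE = re.compile(r"^\w+\.(gif|png|jpg|jpeg)$", re.IGNORECASE)

def is_image_filename(name):
    """Return True if name looks like word-characters, a dot, and an image extension."""
    return IMAGE_FILE.match(name) is not None

results = {name: is_image_filename(name)
           for name in ["test.GIF", "TesT.Gif", "photo.jpeg", "photo.bmp", "my photo.png", ".gif"]}

# The parentheses also capture the extension as group 1.
extension = IMAGE_FILE.match("photo.jpeg").group(1)

print(results, extension)
```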
The position anchors ^ and $ match the beginning and the ending of the input string, respectively. That is, this regex shall match the entire input string, instead of a part of the input string (substring).
\w+ matches 1 or more word characters (same as [a-zA-Z0-9_]+).
[.-]? matches an optional character . or -. Although dot (.) has special meaning in regex, in a character class (square brackets) any character except ^, -, ] or \ is a literal, and does not require an escape sequence.
([.-]?\w+)* matches 0 or more occurrences of [.-]?\w+.
The sub-expression \w+([.-]?\w+)* is used to match the username in the email, before the @ sign. It begins with at least one word character [a-zA-Z0-9_], followed by more word characters or . or -. However, a . or - must be followed by a word character [a-zA-Z0-9_]. That is, the input string cannot begin with . or -; and cannot contain '..', '--', '.-' or '-.'. An example of a valid string is 'a.1-2-3'.
The @ matches itself. In regex, all characters other than those having special meanings match themselves, e.g., a matches a, b matches b, etc.
Again, the sub-expression \w+([.-]?\w+)* is used to match the email domain name, with the same pattern as the username described above.
The sub-expression \.\w{2,3} matches a . followed by two or three word characters, e.g., '.com', '.edu', '.us', '.uk', '.co'.
(\.\w{2,3})+ specifies that the above sub-expression could occur one or more times, e.g., '.com', '.co.uk', '.edu.sg' etc.
Another representation of an email address, left for the reader to interpret: ^[\w\-.+]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9]{2,4}$.
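The email walkthrough above can be tried out in Python. This sketch assumes the regex being described is ^\w+([.-]?\w+)*@\w+([.-]?\w+)*(\.\w{2,3})+$ (assembled from the sub-expressions listed); the function name looks_like_email is hypothetical. It is a teaching pattern, not a complete validator for real-world addresses.

```python
import re

# Assembled from the sub-expressions discussed above (an assumption of this sketch).
EMAIL = re.compile(r"^\w+([.-]?\w+)*@\w+([.-]?\w+)*(\.\w{2,3})+$")

def looks_like_email(s):
    """Return True if s matches the tutorial-style email pattern."""
    return EMAIL.match(s) is not None

checks = {s: looks_like_email(s) for s in [
    "a.1-2-3@example.com",   # '.' and '-' each followed by a word character
    "user@example.co.uk",    # (\.\w{2,3})+ allows several dotted parts
    ".user@example.com",     # cannot begin with '.'
    "a..b@example.com",      # cannot contain '..'
    "user@example",          # needs at least one '.xx' or '.xxx' part
]}

print(checks)
```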
The ^ and $ match the beginning and ending of the input string, respectively.
The \s (lowercase s) matches a whitespace (blank, tab \t, and newline \r or \n). On the other hand, the \S+ (uppercase S) matches anything that is NOT matched by \s, i.e., non-whitespace. In regex, the uppercase metacharacter denotes the inverse of the lowercase counterpart, for example, \w for word character and \W for non-word character; \d for digit and \D for non-digit.
Parentheses () have two meanings in regex: to group sub-expressions, e.g., (abc)*; and to provide a back-reference for capturing and extracting matches.
The parentheses in (\S+), called parenthesized back-reference, are used to extract the matched substring from the input string. In this regex, there are two (\S+), which match the first two words, separated by one or more whitespaces \s+. The two matched words are extracted from the input string and typically kept in special variables $1 and $2 (or \1 and \2 in Python), respectively.
To swap the two words, you can access the special variables and print '$2 $1' (via a programming language); or use the substitute operator 's/(\S+)\s+(\S+)/$2 $1/' (in Perl).
Python keeps the parenthesized back references in \1, \2, .. Also, \g<0> keeps the entire match.
Java keeps the parenthesized back references in $1, $2, ..
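Swapping two words via back-references can be sketched in Python as follows. The regex ^(\S+)\s+(\S+)$ is assumed from the description above, and the function name swap_words is hypothetical. In a Python replacement string the groups are written \1 and \2, and the entire match is \g<0>.

```python
import re

SWAP = re.compile(r"^(\S+)\s+(\S+)$")

def swap_words(s):
    """Swap the two words of s; return s unchanged if it is not exactly two words."""
    return SWAP.sub(r"\2 \1", s)

swapped = swap_words("apple orange")
squeezed = swap_words("apple   orange")    # several blanks collapse to one
untouched = swap_words("one two three")    # three words: the regex does not match

# The same groups are available on the match object.
m = SWAP.match("apple orange")
groups = (m.group(0), m.group(1), m.group(2))

# \g<0> refers to the entire match in a replacement string.
bracketed = SWAP.sub(r"[\g<0>]", "a b")

print(swapped, squeezed, untouched, groups, bracketed)
```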
The regex begins with http://. Take note that you may need to write / as \/ with an escape code in some languages (JavaScript, Perl).
It is followed by \S+, one or more non-whitespaces, for the domain name.
It is followed by (\/\S+)*, zero or more '/..', for the sub-directories.
It is followed by (\/)?, an optional (0 or 1) trailing /, for directory request.
A regex consists of a sequence of characters, metacharacters (such as ., \d, \D, \s, \S, \w, \W) and operators (such as +, *, ?, |, ^). They are constructed by combining many smaller sub-expressions.
Most characters, including all letters (a-z and A-Z) and digits (0-9), match themselves. For example, the regex x matches substring 'x'; z matches 'z'; and 9 matches '9'.
Non-alphanumeric characters without special meaning in regex also match themselves. For example, = matches '='; @ matches '@'.
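These characters have special meaning in regex:
metacharacter: dot (.)
bracket list: [ ]
position anchors: ^, $
occurrence indicators: +, *, ?, { }
parentheses: ( )
or: |
escape and metacharacter: backslash (\)
To match these characters literally, prepend a backslash (\), known as an escape sequence. For example, \+ matches '+'; \[ matches '['; and \. matches '.'.
Regex also recognizes common escape sequences such as \n for newline, \t for tab, \r for carriage-return, \nnn for an octal number of up to 3 digits, \xhh for a two-digit hex code, \uhhhh for a 4-digit Unicode, \uhhhhhhhh for an 8-digit Unicode.
The regex Friday matches the string 'Friday'. The matching, by default, is case-sensitive, but can be set to case-insensitive via a modifier.
You can provide alternatives using the OR operator, denoted by a vertical bar '|'. For example, the regex four|for|floor|4 accepts strings 'four', 'for', 'floor' or '4'.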
A bracket expression is a list of characters enclosed by [ ], also called character class. It matches ANY ONE character in the list. However, if the first character of the list is the caret (^), then it matches ANY ONE character NOT in the list. For example, the regex [02468] matches a single digit 0, 2, 4, 6, or 8; the regex [^02468] matches any single character other than 0, 2, 4, 6, or 8.
Instead of listing all characters, you can use a range expression: two characters separated by a hyphen (-). It matches any single character that sorts between the two characters, inclusive. For example, [a-d] is the same as [abcd]. You could include a caret (^) in front of the range to invert the matching. For example, [^a-d] is equivalent to [^abcd].
Most of the special regex characters lose their meaning inside a bracket list, and can be used as they are; except ^, -, ] or \.
To include a ], place it first in the list, or use escape \].
To include a ^, place it anywhere but first, or use escape \^.
To include a -, place it last, or use escape \-.
To include a \, use escape \\.
No escape is needed for the other characters such as ., +, *, ?, (, ), {, }, etc., inside the bracket list.
You can also include metacharacters such as \w, \W, \d, \D, \s, \S inside the bracket list.
Named (POSIX) character classes are pre-defined for use within bracket expressions, in flavors that support them:
[:alnum:], [:alpha:], [:digit:]: letters+digits, letters, digits.
[:xdigit:]: hexadecimal digits.
[:lower:], [:upper:]: lowercase/uppercase letters.
[:cntrl:]: control characters.
[:graph:]: printable characters, except space.
[:print:]: printable characters, including space.
[:punct:]: printable characters, excluding letters and digits.
[:space:]: whitespace.
For example, [[:alnum:]] means [0-9A-Za-z]. (Note that the square brackets in these class names are part of the symbolic names, and must be included in addition to the square brackets delimiting the bracket list.)
The metacharacter dot (.) matches any single character except newline \n (same as [^\n]). For example, ... matches any 3 characters (including letters, digits, whitespaces, but except newline); the.. matches 'there', 'these', 'the  ' (the followed by two blanks), and so on.
\w (word character) matches any single letter, digit or underscore (same as [a-zA-Z0-9_]). The uppercase counterpart \W (non-word-character) matches any single character that is not matched by \w (same as [^a-zA-Z0-9_]).
\d (digit) matches any single digit (same as [0-9]). The uppercase counterpart \D (non-digit) matches any single character that is not a digit (same as [^0-9]).
\s (space) matches any single whitespace (same as [ \t\n\r\f], blank, tab, newline, carriage-return and form-feed). The uppercase counterpart \S (non-space) matches any single character that is not matched by \s (same as [^ \t\n\r\f]).
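The bracket lists and metacharacters described above can be explored with a short Python sketch. The sample strings and variable names are invented for illustration only.

```python
import re

# Hypothetical sample line ending with a newline.
sample = "Tel: 555-0142, room B_7\n"

even_digits = re.findall(r"[02468]", sample)     # any ONE of the listed digits
not_a_to_d = re.findall(r"[^a-d]", "abcdef")     # any ONE character outside a-d
digit_runs = re.findall(r"[0-9-]+", sample)      # '-' placed last is a literal hyphen
words = re.findall(r"\w+", sample)               # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample)               # blanks and the trailing newline
five_chars = re.findall(r"the..", "there these the  end")  # 'the' plus any two characters

print(even_digits, not_a_to_d, digit_runs, words, spaces, five_chars)
```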
Regex uses backslash (\) for two purposes:
for metacharacters such as \d (digit), \D (non-digit), \s (space), \S (non-space), \w (word), \W (non-word).
to escape special regex characters, e.g., \. for ., \+ for +, \* for *, \? for ?. You also need to write \\ for \ in regex to avoid ambiguity.
Regex also recognizes \n for newline, \t for tab, etc.
Take note that in many programming languages (C, Java, Python), backslash (\) is also used for escape sequences in strings, e.g., '\n' for newline, '\t' for tab, and you also need to write '\\' for \. Consequently, to write the regex pattern \\ (which matches one \) in these languages, you need to write '\\\\' (two levels of escape!). Similarly, you need to write '\\d' for regex metacharacter \d.
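The two levels of escape can be seen directly in Python. This is a minimal sketch with made-up variable names; the sample path is hypothetical.

```python
import re

path = "C:\\temp"          # the 7-character string C:\temp (one backslash)

plain_pattern = "\\\\"     # ordinary literal: four typed backslashes give the regex \\
raw_pattern = r"\\"        # raw literal: two typed backslashes give the same regex

same_regex = (plain_pattern == raw_pattern)

# The regex \\ matches exactly one backslash in the input.
backslashes = re.findall(raw_pattern, path)
parts = re.split(raw_pattern, path)

# Likewise '\\d' and r'\d' denote the same metacharacter.
same_digit = ("\\d" == r"\d")

print(len(path), same_regex, backslashes, parts, same_digit)
```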
Writing these double escapes is cumbersome and error-prone!
A regex sub-expression may be followed by an occurrence indicator (repetition operator):
?: The preceding item is optional and matched at most once (i.e., occurs 0 or 1 times).
*: The preceding item will be matched zero or more times, i.e., 0+.
+: The preceding item will be matched one or more times, i.e., 1+.
{m}: The preceding item is matched exactly m times.
{m,}: The preceding item is matched m or more times, i.e., m+.
{m,n}: The preceding item is matched at least m times, but not more than n times.
For example, the regex xy{2,4} accepts 'xyy', 'xyyy' and 'xyyyy'.
In languages such as Perl, you can attach modifiers after a regex, in the form of /../modifiers.
In Java, you apply modifiers when compiling the regex Pattern.
The commonly-used modifiers are:
i: case-insensitive matching for letters.
g: match all instead of the first match.
m (multiline): affects ^, $, \A and \Z. In multiline mode, ^ matches start-of-line or start-of-input; $ matches end-of-line or end-of-input; \A matches start-of-input; \Z matches end-of-input.
s: dot (.) will match all characters, including newline.
x: allow and ignore embedded comments starting with # till end-of-line (EOL).
The repetition operators are greedy: by default they grasp as many characters as possible. For example, the regex xy{2,4} tries to match 'xyyyy', then 'xyyy', and then 'xyy'.
You can put an extra ? after the repetition operators to curb their greediness (i.e., stop at the shortest match).
If a regex reaches a state where a match cannot be completed, it backtracks by unwinding one character from the greedy match. For example, if the regex z*zzz is matched against the string 'zzzz', the z* first matches 'zzzz'; unwinds to match 'zzz'; unwinds to match 'zz'; and finally unwinds to match 'z', such that the rest of the pattern can find a match.
You can put an extra + after the repetition operators (possessive quantifiers) to disable backtracking, even if it may result in match failure, e.g., z++z will not match 'zzzz'.
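Greedy matching, lazy matching and backtracking can be observed with a few calls in Python. This is a sketch with illustrative inputs; the capturing group around z* is added here only to expose how much the greedy part kept after backtracking.

```python
import re

# Greedy: y{2,4} grabs as many 'y' as allowed (4 of the 5 available).
greedy = re.search(r"xy{2,4}", "xyyyyy").group()

# Lazy: y{2,4}? stops at the minimum (2).
lazy = re.search(r"xy{2,4}?", "xyyyyy").group()

# A common case: greedy .+ runs to the last '>', lazy .+? stops at the first.
greedy_tags = re.findall(r"<.+>", "<a><b>")
lazy_tags = re.findall(r"<.+?>", "<a><b>")

# Backtracking: z* first takes all four 'z', then gives back three
# so that the trailing zzz can still match.
kept_by_star = re.fullmatch(r"(z*)zzz", "zzzz").group(1)

print(greedy, lazy, greedy_tags, lazy_tags, kept_by_star)
```

Possessive quantifiers such as z++ are left out of the sketch because Python's re module only accepts them from version 3.11 onwards.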
Possessive quantifiers might not be supported in some languages.
The ^ matches the start-of-line. The $ matches the end-of-line excluding newline, or end-of-input (for input not ending with newline). These are the most commonly-used position anchors.
The \b matches the boundary of a word (i.e., start-of-word or end-of-word); and \B matches the inverse of \b, or non-word-boundary.
\< and \>: The \< and \> match the start-of-word and end-of-word, respectively (compared with \b, which can match both the start and end of a word).
The \A matches the start of the input. The \Z matches the end of the input. They differ from ^ and $ when it comes to matching input with multiple lines. In multiline mode, ^ matches at the start of the string and after each line break, while \A only matches at the start of the string. $ matches at the end of the string and before each line break, while \Z only matches at the end of the string.
Parentheses ( ) serve two purposes in regex:
Firstly, parentheses ( ) can be used to group sub-expressions for overriding the precedence or applying a repetition operator. For example, (abc)+ (accepts abc, abcabc, abcabcabc, ..) is different from abc+ (accepts abc, abcc, abccc, ..).
Secondly, parentheses provide back-references. For example, the regex (\S+) creates one back-reference (\S+), which contains the first word (consecutive non-spaces) of the input string; the regex (\S+)\s+(\S+) creates two back-references: (\S+) and another (\S+), containing the first two words, separated by one or more spaces \s+.
The back-references are stored in special variables $1, $2, .. (or \1, \2, .. in Python), where $1 contains the substring matched by the first pair of parentheses, and so on. For example, (\S+)\s+(\S+) creates two back-references which match the first two words. The matched words are stored in $1 and $2 (or \1 and \2), respectively.
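The position anchors described earlier in this section can be compared side by side in Python. This is a minimal sketch on a made-up two-line input; it uses re.MULTILINE for the multiline mode, and it uses \b for word boundaries since Python's re module does not provide \< and \>.

```python
import re

text = "one two\nthree four"   # hypothetical two-line input, no trailing newline
M = re.MULTILINE

line_starts = re.findall(r"^\w+", text, M)    # ^ matches after each line break
input_start = re.findall(r"\A\w+", text, M)   # \A matches only at the very start
line_ends = re.findall(r"\w+$", text, M)      # $ matches before each line break
input_end = re.findall(r"\w+\Z", text, M)     # \Z matches only at the very end

# \b: 'cat' as a whole word; \B: 'cat' strictly inside a longer word.
whole = re.findall(r"\bcat\b", "cat concat cats cat.")
inner = re.findall(r"\Bcat\B", "cat concat cats locate")

print(line_starts, input_start, line_ends, input_end, whole, inner)
```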
(?=pattern) is known as positive lookahead. It performs the match, but does not capture the match, returning only the result: match or no match. It is also called an assertion as it does not consume any characters in matching. For example, in a regex used by AngularJS to match email addresses, the first positive lookahead ^(?=.{1,254}$) sets the maximum length to 254 characters. The second positive lookahead ^(?=.{1,64}@) sets a maximum of 64 characters before the '@' sign for the username.
(?!pattern) is the negative lookahead, the inverse of (?=pattern): it matches if pattern is missing. For example, a(?=b) matches 'a' in 'abc' (not consuming 'b'); but not 'acc'. Whereas a(?!b) matches 'a' in 'acc', but not 'abc'.
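The lookahead examples above translate directly to Python. In this sketch, the length-limited pattern SHORT_ADDRESS is a hypothetical, much simplified stand-in (limit of 10 characters) that only illustrates the ^(?=.{1,N}$) idea; it is not the AngularJS regex.

```python
import re

# Positive lookahead: 'a' only when followed by 'b'; the 'b' is not consumed.
pos_hit = re.search(r"a(?=b)", "abc")
pos_miss = re.search(r"a(?=b)", "acc")

# Negative lookahead: 'a' only when NOT followed by 'b'.
neg_hit = re.search(r"a(?!b)", "acc")
neg_miss = re.search(r"a(?!b)", "abc")

# Hypothetical pattern: the lookahead caps the total length at 10 characters
# before the rest of the regex is tried.
SHORT_ADDRESS = re.compile(r"^(?=.{1,10}$)\w+@\w+$")
short_ok = SHORT_ADDRESS.match("ab@cd") is not None
too_long = SHORT_ADDRESS.match("abcdefgh@ijkl") is not None

print(pos_hit.group(), pos_hit.end(), pos_miss, neg_hit.group(), neg_miss, short_ok, too_long)
```

The match for a(?=b) ends at position 1, which shows that the 'b' was checked but not included in the match.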
To disable capturing, use ?: inside the parentheses in the form of (?:pattern). In other words, ?: disables the creation of a capturing group, so as not to create an unnecessary capturing group.
A capturing group can also be given a name, and referenced later by that name.
The metacharacters \w, \W (word and non-word character), \b, \B (word and non-word boundary) recognize Unicode characters.
'Python's re module for Regular Expression'
java.util.regex Package @ https://docs.oracle.com/javase/10/docs/api/java/util/regex/package-summary.html (JDK 10).