- Learn what a regular expression is.
- Learn how to use the regex notation.
A regular expression (or regex for short) is a sequence of characters that forms a search pattern. These expressions are helpful when we want to validate input from a user! For example, if we want the user to create a strong password that includes uppercase characters, lowercase characters, numbers, and special characters and also be at least 8 characters long, then we could easily verify a strong password using regular expressions!
A regular expression describes a pattern in text. A pattern could be looking for a specific word, a grouping of words, punctuation, and white space. In a general sense, we can look for certain formats using regex. For example, looking for a phone number with the format (XXX) XXX-XXXX. Before we even incorporate regex into some Java code, we will look at some syntax along with a useful tool to help us further our understanding!
Regular expressions aren't always the easiest to learn and can be hard to read even for a seasoned programmer. Click on the link below to take you to tool called Regex 101:
When you get to the page, you will see a text box for you to enter your regular expression and another text box where you can enter a test string. On the right-hand side, you will see an explanation box, a match information box, and even some quick references of various tokens that can be used with regular expressions. This is an extremely helpful tool when you want to write regular expressions, and you'd like to test them first before even writing them in your code. It is advised you use this tool as you go through these next few lessons to help you understand exactly what the regex evaluates to and how to improve your own regular expressions when writing them in the future.
To get started using Regex 101, go ahead and change the "Flavor" to Java 8 since we will eventually be using regular expressions with Java. You can also go ahead and click on the "gm" in the far right of the "Regular Expression" text box and turn off the "multi line" flag for now as well.
Your screen should now look like this:
Again, try to follow along with the rest of this lesson using Regex 101 to help understand the syntax a little bit better.
In its simplest form, a regex can look for an exact literal piece of text. For example, if given the sentence:
Twinkle twinkle little star, how I wonder what you are.
If we were to provide a regular expression of "star", it would find the literal
word star
in the above sentence. If we decided to look for the pattern of
"ar", then it would find the characters "ar" in star
and are
.
Pretty simple, huh? Let's look at some metacharacters now to start showing just how powerful regular expressions can be.
Remember how we learned about escape sequences in the last module in the Characters and Strings lesson? Well guess what - our friend, the backslash, '', is coming back to help us with regular expressions! Consider the table below of predefined metacharacters:
Metacharacter | Description |
---|---|
. |
Matches any character |
\s |
Matches any whitespace character |
\S |
Matches any non-whitespace character |
\d |
Matches any digit |
\D |
Matches any non-digit |
\w |
Matches any word character |
\W |
Matches any non-word character |
As we can see from the table above, case is crucial - so make sure if you are looking for a digit that it is a lowercase 'd' and not an uppercase 'D'!
Let's look at some examples to showcase these metacharacters. In the Regex 101 tool, provide the following test string:
September 15th is National Online Learning Day!
The special character .
will match any single character. Let's enter the
pattern "."
in the "Regular Expression" text box in Regex 101. Notice how
every single character, including the whitespace and punctuation is highlighted!
Time to be a little more specific and narrow our search using the .
. In the
"Regular Expression" text box in Regex 101, let's change the pattern to ".s"
and see what happens now.
is
We only have one match now! The pattern ".s"
evaluates to "Find me all the
places in the string where there is exactly 1 character before the character s."
Given our example test string, it will only match the pattern with "is".
Let's change it up one more time before we move onto the next metacharacter. In
the "Regular Expression" text box in Regex 101, change the pattern to "..a"
.
Na
ona
Lea
Da
This time we have 4 matches. The pattern "..a"
evaluates to "Find me all the
places in the string where there is exactly 2 characters before the character
a." Given our example test string, the pattern will match with the four
substrings. Notice how it even captured the space character in the two matches!
The special character \s
will match a literal space character ' ', a tab
(\t
), a newline character (\n
), or a carriage return (\r
). The special
character \S
does the inverse and will match any character except
whitespace. Let's enter the pattern "\s"
in the "Regular Expression" text box
in Regex 101. Notice how all the spaces are highlighted. Now change the regular
expression to "\S"
and see every other character except the spaces
highlighted.
Now modify the test string to this:
September 15th
is National Online Learning Day!
Now enter in "\s"
again into the regular expression text box and see how the
pattern matches with the carriage return character!
The special character \d
will match any digit 0-9 whereas the special
character \D
does the inverse and will match any character except a digit.
Let's return to our Regex 101 tool and enter the pattern "\d"
in the
"Regular Expression" text box. Notice how the digits 1
and 5
are
highlighted. Now change the regular expression to "\D"
and see every other
character except the two digits.
Say we want to find a two-digit number now. We could enter "\d\d"
into the
"Regular Expression" text box then. Now instead of two matches, we only get one
match of 15
since we are specifying in our pattern that it must be a two-digit
number.
The special character \w
will match a "word" character. A "word" character is
considered an uppercase letter (A-Z), a lowercase letter (a-z), any digit 0-9,
and the underscore character (_). The special character \W
does the inverse
and will match any character_except_a word character. Let's enter the pattern
"\w"
in the "Regular Expression" text box in Regex 101. Notice how all the
characters except the whitespace characters are highlighted. Now change the
regular expression to "\W"
to see all the whitespace characters highlighted.
Maybe we want to match specific characters in a string though. Perhaps we want
to just find the vowels. For this, we can use the square brackets []
notation.
For example, the pattern "[aeiou]"
will match any single character that is
part of that character set. Let's try entering that in Regex 101 as the pattern.
Notice that only the characters between the square brackets is highlighted as a match.
Now let's say we want to specify a range of characters. For example, maybe we
only want to find the characters that are letters. We can use the -
between
two characters to represent a character range. Try entering the following
pattern into Regex 101 as a regular expression: "[A-Za-z]"
.
The pattern [A-Za-z]
defines a character set containing all uppercase and
lowercase letters using the -
to specify a range. If we want only uppercase
letters, we could change the pattern to be [A-Z]
then or if we only want
lowercase letters, we could use the pattern [a-z]
. We could even use this
syntax with numbers! [0-9]
would be a pattern we could use to get any digit
between 0-9, which would be the same as using the shorthand \d
character.
When looking at the metacharacters above, we saw there was usually an inverse.
(Like \d
representing digits and \D
representing every other character
except for digits.) Can we do the same thing with character sets? The answer is
yes, and we can use the carat, ^
, inside the square bracket notation to help
us out! Try entering the pattern "[^A-F]"
into the "Regular Expression" text
box. This will match any character except for uppercase letters A, B, C, D, E,
and F. This includes spaces, numbers, just any character as long as it is not in
the character set we specified. In the example above, we can see that every
character seems to be highlighted except for the D
character.
An anchor in regex allows us to specify the relative location of a pattern
match. The anchors we will look at are the carat symbol, ^
, and the dollar
sign symbol, $
.
To help us truly look at how anchors work, let's change our "Test String" in Regex 101 to the following:
A rose is a rose is a rose
Now let's enter the pattern "[Aa] rose"
into the "Regular Expression" text box
and see what happens.
We get three matches since we are looking for either the character A
or a
followed by a space and then the word rose
:
A rose
a rose
a rose
Let's modify the pattern now and enter "^[Aa] rose"
into the "Regular
Expression" text box. Notice how only the first part of the test string is now
a match! The ^
in this context is an anchor that says "Only check to see if
the pattern exists at the start of the string." Since it does, it will only
return A rose
and not the other two matches.
Try to use the ^
anchor again by typing in "^is"
in the "Regular Expression"
text box now. Even though the character combination of is
exists in the
string, it will not show a match since it is not in the beginning of the test
string. If we were to remove the anchor, ^
, we would then have two matches.
The other anchor, $
, will only check to see if the pattern exists at the end
of the string - so the opposite of the ^
anchor! Let's put this to use and
enter the pattern "[Aa] rose$"
into the "Regular Expression" text box. Again,
see how we only have one match, and it is matching with the last a rose
in the
test string this time! This is because it is at the end of the string.
The last syntax we will discuss is quantifiers. Quantifiers are regex operators that can count the number of occurrences of a character. Consider the table below:
Quantifier | Description |
---|---|
* |
Matches zero or more occurrences |
+ |
Matches one or more occurrences |
? |
Matches zero or one occurrence |
Let's look at some examples to showcase these quantifiers. In the Regex 101 tool, replace the test string with the following:
Hmm that is interesting that hummingbirds can fly backwards
The "any" quantifier is the asterisk, *
, quantifier. If we place a *
after
any character or character set, we will match with zero or more occurrences of
that character. Let's look at an example to help explain this more. In the
"Regular Expression" text box, enter the pattern "[Hh]m"
.
We get only one match since we are looking for either the character H
or h
followed by a space and the character m
:
Now let's update the pattern to have an asterisk, *
, follow the character m
:
"[Hh]m*"
.
Notice now that there are now 4 matches! The pattern [Hh]m*
has been modified
to say "Find me all occurrences where there is either an H
or h
followed by
zero or more occurrences of the character m
." This resulted into finding all
the lowercase h
characters in the test string along with the character
sequence of Hmm
. Since Hmm
has two occurrences of m
, the *
quantifier
will pick up both m
characters.
The "some" quantifier is represented by the +
symbol. If we place a +
after
any character or character set, we will match with one or more occurrences of
that character. Let's try entering the pattern "[Hh]m+"
now into the "Regular
Expression" text box:
There is now only one match and that is the character sequence of Hmm
. Notice
how the +
quantifier picked up both m
characters. This is because we are
now requiring the m
character to be present at least once. It will grab any
other m
characters following it as well since it fits the definition of the
"some" quantifier.
The "optional" quantifier, ?
, allows exactly zero or one occurrence of a
character or character set. Let's see it in action! Enter the pattern "[Hh]m?"
into the "Regular Expression" text box:
There is, again, 4 matches of the above pattern! This may look very similar to
the resulting matches from when we used the *
quantifier, but look closer.
The rule is now to find all occurrences where there is either an H
or h
followed by zero or one occurrence of the character m
. This found all the
lowercase h
characters again in the test string but only picked up the
sequence Hm
instead of Hmm
this time. Since Hmm
has two occurrences of
m
, the ?
quantifier will only pick up the first m
character.