Using RegEx (Regular Expressions) in Java

This article explains how to use regular expressions in Java. It does not cover how to write regular expressions themselves, only how to use them from Java code. Each regex feature is illustrated with a small Java program.

Table of Contents

  1. What is a Regular Expression?
  2. Why should we use Regular Expressions?
  3. How to use Regular Expressions in Java?
  4. Using Regular Expressions with String methods
  5. Regex matching with String.matches()
  6. Regex Splitting and Replacing with String.split() and String.replaceAll()
  7. String.replaceAll() using groups
  8. Using Pattern and Matcher classes in Java
  9. Difference between Matcher.matches() and Matcher.find() methods
  10. Extracting the matched characters
  11. Extracting matched groups (characters belong to regex groups)
  12. Replacing matched characters using Matcher.replaceXXX() methods
  13. Replacing matched Regex Groups in Java
  14. Replacing matched regex groups by creating two extra groups to the right and left of the group to be replaced
  15. Replacing matched regex groups using “Look Ahead” and “Look Behind” expressions
  16. Regex Options (Flags) Available in Java
  17. Using the Pattern.CASE_INSENSITIVE option
  18. Using the Pattern.DOTALL option
  19. Using Pattern.MULTILINE option
  20. Using Pattern.COMMENTS option
  21. Using Pattern.LITERAL option
  22. Using Pattern.UNIX_LINES option
  23. Using Pattern.UNICODE_CASE option
  24. Using Pattern.CANON_EQ option

What is a Regular Expression?

A regular expression is a pattern of characters. The following are some examples of patterns:

  1. A word starting with the letter ‘A’ and ending with the letter ‘Z’
  2. A line which does not contain spaces
  3. A word which contains only the digits 0-9

Using regular expressions, each of these patterns can be expressed as a sequence of characters and used for the following purposes:

  1. searching inside strings or text files
  2. finding matches
  3. replacing the matching string with another string
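
As a minimal sketch, the three example patterns listed earlier could be written as the regexes below. The class and constant names are hypothetical, and each regex is only one of several possible ways to express the pattern.

```java
import java.util.regex.Pattern;

class PatternExamples {
    // A word starting with 'A' and ending with 'Z'
    static final Pattern A_TO_Z_WORD = Pattern.compile("A\\w*Z");
    // A line which does not contain spaces
    static final Pattern NO_SPACES = Pattern.compile("[^ ]*");
    // A word which contains only the digits 0-9
    static final Pattern DIGITS_ONLY = Pattern.compile("[0-9]+");

    public static void main(String[] args) {
        System.out.println(A_TO_Z_WORD.matcher("ABCZ").matches());          // true
        System.out.println(NO_SPACES.matcher("no-spaces-here").matches()); // true
        System.out.println(NO_SPACES.matcher("has a space").matches());    // false
        System.out.println(DIGITS_ONLY.matcher("2024").matches());         // true
        System.out.println(DIGITS_ONLY.matcher("20x4").matches());         // false
    }
}
```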

Why should we use Regular Expressions?

In most programming languages, regular expressions are executed by a dedicated regex engine. For a simple literal search, indexOf() and substring() are usually at least as fast, but once the search pattern becomes more complex than a fixed string, a single regular expression can replace many lines of hand-written scanning code. In Java, a regular expression can be compiled once into a “Pattern” object and reused for many matches, which avoids recompiling the regex each time when you process large text or repeat the same text manipulation many times.

Another advantage of regular expressions is that your code will be neater, cleaner and smaller.
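
The idea of compiling a regex once and reusing it can be sketched as follows. The class name and the countNumeric() helper are hypothetical and exist only for illustration; the point is that the Pattern is created a single time and shared by every call.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class CompiledPatternReuse {
    // Compiled once and reused; Pattern instances are immutable and thread-safe
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");

    // Hypothetical helper: counts the inputs that consist only of digits
    static int countNumeric(List<String> inputs) {
        int count = 0;
        for (String input : inputs) {
            // A new Matcher is cheap; the costly compile step is not repeated
            if (DIGITS.matcher(input).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNumeric(Arrays.asList("123", "abc", "42", "4x2"))); // prints 2
    }
}
```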

How to use Regular Expressions in Java?

We can use regular expressions in two ways. They are

  1. using the String.matches(), String.split() and String.replaceAll() methods
  2. using the Pattern and Matcher objects

Using Regular Expressions with String methods

The String methods provide only basic matching functionality. If you want advanced functions, you can use the Pattern and Matcher objects (explained below).

String Method | Description
String.matches("regex") | Matches the regex against the WHOLE string. Returns true if the string matches the regex, and false otherwise
String.split("regex") | Splits the string around matches of the regex
String.replaceAll("regex", "replacement") | Searches the string and replaces every portion that matches the regex with the replacement string (String.replace() treats its arguments as plain text, not as a regex)

Regex matching with String.matches()

This method returns true if the regex pattern matches the WHOLE string, and false otherwise. The method is limited: it takes no flags argument, so by default the matching is case sensitive and \w matches only ASCII word characters. If you want to do matching with more options, you may use the Pattern and Matcher classes (explained below).

package com.easyprograming.regex;

public class TestClass
{
    public static void main(String[] args)
    {
        boolean result;

        result = "Where are you?".matches("are");
        // result will be false, because "are" is matched against the whole string.
        // To find such substring matches, use the Pattern and Matcher classes

        result = "Where are you?".matches(".*are.*");
        // result will be true, because .* matches any characters
        // before and after "are"; so the full string matches

        result = "Where ARE you?".matches("where are you\\?");
        // result will be false, because matching is case sensitive by default.
        // For a case insensitive match, use Pattern with the CASE_INSENSITIVE flag

        result = "Where".matches("\\w+");
        // result will be true, because "Where" contains only ASCII letters
        // and \w matches each character of the word

        result = "Español".matches("\\w+");
        // result will be false, because by default \w counts only
        // ASCII characters as word characters. To match words with Unicode
        // characters, use Pattern with the UNICODE_CHARACTER_CLASS flag (Java 7+)
    }
}
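
String.matches() has no flags parameter, but Java also accepts embedded flag expressions inside the regex itself, such as (?i) for case insensitive matching and (?U) for Unicode-aware character classes (the latter needs Java 7 or later). The following is a small sketch of that option; the class and method names are made up for the example.

```java
class InlineFlagsDemo {
    // (?i) switches on case insensitive matching for the rest of the regex
    static boolean caseInsensitive() {
        return "Where ARE you?".matches("(?i)where are you\\?");
    }

    // (?U) makes \w match Unicode letters as well as ASCII ones
    // (\u00F1 is the letter n with tilde, as in "Espanol" spelled in Spanish)
    static boolean unicodeWord() {
        return "Espa\u00F1ol".matches("(?U)\\w+");
    }

    public static void main(String[] args) {
        System.out.println(caseInsensitive()); // true
        System.out.println(unicodeWord());     // true
    }
}
```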

Regex Splitting and Replacing with String.split() and String.replaceAll()

The Java String class offers two more methods to split and replace strings using regex. These methods have the same capabilities and limitations as String.matches(): they take no flags argument, so by default matching is case sensitive and character classes such as \w cover only ASCII characters. The following small Java programs demonstrate the split() and replaceAll() methods.

public class TestClass
{
    public static void main(String[] args)
    {
        String[] splitResult;

        splitResult = "Hello what's up?".split("\\s");
        //String will be split on \s (white space character)
        //Result will be three strings viz. "Hello", "what's" and "up?"

        splitResult = "Hello what's up?".split("\\s", 2);
        //The second argument 2 (limit) decides how many times the
        //split will be done. If you give a positive limit n, the split will
        //be done at most (n-1) times. Here the split will be done 1 (2-1) time.
        //The result will be two strings viz. "Hello" and "what's up?"

        splitResult = "Hello what's up?".split("\\s", 1);
        //Here the split will be done 0 (1-1) times, i.e. no split will be done
        //The result will be a single string "Hello what's up?"
    }
}

 

public class TestClass
{
    public static void main(String[] args)
    {
        String result = "Hello what's up?".replaceAll("\\s", "-");
        //Here every occurrence of a white space character will be replaced with a dash (-)
        //The result will be "Hello-what's-up?"
    }
}

String.replaceAll() using groups

If you have groups in your regex, you can reference the matched groups in the replacement text. For example, the regex -(@+)- matches one or more @ characters between dashes. The group, specified by (), contains the @ characters alone. The following Java program shows how to replace -@-, -@@-, -@@@- etc. with @, @@ and @@@ respectively (i.e. strip out the dashes).

public class TestClass
{
    public static void main(String[] args)
    {
        String result = "Hello -@- what's -@@@- up?".replaceAll("-(@+)-", "$1");
        //Here -@- and -@@@- will be matched against the regex.
        //In the first match (-@-), @ is the first group $1.
        //So -@- will be replaced with @
        //In the second match (-@@@-), @@@ is the first group $1.
        //So -@@@- will be replaced with @@@

        System.out.println(result);
        //Result will be "Hello @ what's @@@ up?"
    }
}

Using Pattern and Matcher classes in Java

In Java, the Pattern class represents a compiled regex pattern. Compiling is a relatively costly operation, so compiling the regex once into a Pattern and reusing it pays off when you do repeated matches. The Matcher class provides the methods for regex matching. It has two methods which tell whether the input string matches the regex pattern. They are

  1. Matcher.matches() – This method will try to match the entire input string. This method is similar to String.matches() method

  2. Matcher.find() – This method will try to match for a substring inside the input string

 

Difference between Matcher.matches() and Matcher.find() methods

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestClass
{
    public static void main(String[] args)
    {
        Pattern pattern = Pattern.compile("\\s");
        Matcher matcher = pattern.matcher("Hello what's up?");

        //Since the matches() method tries to match against the whole string,
        //the input string and pattern won't match here
        if(matcher.matches())
        {
            System.out.println("Matching with matches() method");
        }
        else
        {
            System.out.println("Doesn't match with matches() method");
        }

        //Since the find() method tries to find a match anywhere inside
        //the input string, it will find the match
        if(matcher.find())
        {
            System.out.println("Matching with find() method");
        }
        else
        {
            System.out.println("Doesn't match with find() method");
        }
    }
}

// The result printed will be
// Doesn't match with matches() method
// Matching with find() method

Extracting the matched characters

The matched characters can be extracted using Matcher.group() method. Following program shows how to extract the matched characters

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestClass
{
    public static void main(String[] args)
    {
        String regex = "hello";

        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher("hello what's up?");

        if(matcher.find())
        {
            //group() method will return the matched characters
            String word = matcher.group();
            System.out.println(word);
        }
    }
}
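
Matcher.find() can be called repeatedly, and each successful call moves on to the next match, so a while loop can collect every match in the input. The sketch below assumes a hypothetical findAll() helper; it is not part of the Java API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllMatches {
    // Hypothetical helper: returns every substring of input that matches regex
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        // Each successful find() continues after the end of the previous match
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("[0-9]+", "Call 555 or 1234 before 9"));
        // prints [555, 1234, 9]
    }
}
```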

 

Extracting matched groups (characters belong to regex groups)

We can use the Matcher.group(<group number>) method to get the characters of specific groups. Matcher.group(0) will be same as Matcher.group() with no arguments. Group 0 will represent the complete matched characters. If you have defined groups in your regex, you can access them using indices starting from 1. Following example shows how to use groups in your regex.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestClass
{
    public static void main(String[] args)
    {
        String word = "";

        //regex matches a word containing a single quote
        String regex = "(\\S*)'(\\S*)";

        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher("hello what's up?");

        if(matcher.find())
        {
            //group() method will return the matched characters
            word = matcher.group();
            System.out.println(word);

            word = matcher.group(0);
            System.out.println(word);
            //group(0) will print the same result as group() with no arguments

            word = matcher.group(1);
            System.out.println(word);
            //This will print the first group, i.e. characters before the single quote

            word = matcher.group(2);
            System.out.println(word);
            //This will print the second group, i.e. characters after the single quote
        }
    }
}
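
When the number of groups is not known in advance, Matcher.groupCount() reports how many capturing groups the pattern defines (group 0 is not included in that count). A small sketch, with a hypothetical groups() helper that lists group 0 followed by every numbered group of the first match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupLister {
    // Hypothetical helper: group 0 plus all capturing groups of the first match
    static List<String> groups(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (matcher.find()) {
            // groupCount() excludes group 0, hence the <= in the loop condition
            for (int i = 0; i <= matcher.groupCount(); i++) {
                result.add(matcher.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(groups("(\\S*)'(\\S*)", "hello what's up?"));
        // prints [what's, what, s]
    }
}
```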

 

Replacing matched characters using Matcher.replaceXXX() methods

To replace the matched characters inside the input string, there are two methods available

  1. replaceFirst(<replacement string>): This method will replace only the first match
  2. replaceAll(<replacement string>): This method will replace all the matches in the input string

 In the replacement string we can use $<group number> to access the value of the group. Following example shows you how to replace matches

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestClass
{
    public static void main(String[] args)
    {
        String inputString = "All the king's horses and all the king's men " +
                             "Couldn't put Humpty together again!";

        //regex matches a word containing a single quote
        String regex = "(\\S*)'(\\S*)";

        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(inputString);

        if(matcher.find())
        {
            String result = matcher.replaceAll("$1$2(*)");
            // $ can be used in the replacement string to refer to groups
            // $1 refers to the first regex group and $2 refers to the second

            System.out.println(result);
            //The result printed will be
            //All the kings(*) horses and all the kings(*) men Couldnt(*) put Humpty together again!
        }
    }
}
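
For comparison, here is a short sketch of replaceFirst() next to replaceAll(), using the same regex and replacement string on a shorter input. The class and method names are assumptions made for the example.

```java
import java.util.regex.Pattern;

class ReplaceFirstDemo {
    // Matches a word containing a single quote, as in the program above
    private static final Pattern QUOTED_WORD = Pattern.compile("(\\S*)'(\\S*)");

    static String replaceFirstOnly(String input) {
        // Only the first match is rewritten
        return QUOTED_WORD.matcher(input).replaceFirst("$1$2(*)");
    }

    static String replaceEvery(String input) {
        // Every match is rewritten
        return QUOTED_WORD.matcher(input).replaceAll("$1$2(*)");
    }

    public static void main(String[] args) {
        String input = "the king's horses and the king's men";
        System.out.println(replaceFirstOnly(input));
        // prints: the kings(*) horses and the king's men
        System.out.println(replaceEvery(input));
        // prints: the kings(*) horses and the kings(*) men
    }
}
```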

Replacing matched Regex Groups in Java

By default there is no direct way to replace matched regex groups in Java. But we can achieve the same result using one of the following methods

  1. Create two extra groups, to the left and right of the group to be replaced
  2. Using “Look Ahead” and “Look Behind” constructs
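
Besides the two approaches listed above, a group can also be replaced by hand, because Matcher.start(group) and Matcher.end(group) return the position of a group inside the input. The following is only a sketch: replaceGroup() is a hypothetical helper, it handles just the first match, and it assumes the requested group took part in the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceGroupByIndex {
    // Hypothetical helper: replaces one group of the first match with replacement
    static String replaceGroup(String regex, String input, int group, String replacement) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.find()) {
            return input; // no match, nothing to replace
        }
        // start(group) and end(group) are offsets into the original input
        // (they are -1 if the group did not participate; not handled here)
        return new StringBuilder(input)
                .replace(matcher.start(group), matcher.end(group), replacement)
                .toString();
    }

    public static void main(String[] args) {
        String result = replaceGroup("[a-z]+@([a-z]+)\\.[a-z]+", "username@gmail.com", 1, "xxxx");
        System.out.println(result); // prints username@xxxx.com
    }
}
```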

Replacing matched regex groups by creating two extra groups to the right and left of the group to be replaced

In the following example, I want to replace the domain in the email with xxxx, i.e. username@gmail.com will become username@xxxx.com. The regex is ([a-z]+@)([a-z]+)(\.[a-z]+), and I want to replace group-2 with ‘xxxx’. To do that, I have created two extra groups: group-1 to the left and group-3 to the right of group-2 (which has to be replaced with ‘xxxx’). All these groups can be referred to in the replacement string as $1, $2 and $3.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestClass
{
    public static void main(String[] args)
    {
        String inputString = "username@gmail.com";
        String pattern = "([a-z]+@)([a-z]+)(\\.[a-z]+)";

        Pattern p = Pattern.compile(pattern);
        Matcher m = p.matcher(inputString);

        if(m.find())
        {
            //Use only $1 and $3 (equivalent to replacing $2)
            String resultString = m.replaceAll("$1xxxx$3");
            System.out.println(resultString);
            //output will be username@xxxx.com
        }
    }
}

 

Replacing matched regex groups using “Look Ahead” and “Look Behind” expressions

In Java, a look ahead expression is written as (?=regex) and a look behind expression as (?<=regex). The specialty of look ahead and look behind expressions is that the characters they match do not become part of the overall match. Let's see how to solve the above problem, replacing the domain in the email with xxxx so that username@gmail.com becomes username@xxxx.com, using look ahead and look behind expressions. The regex we are going to use is (?<=@)[a-z]+(?=\.). In this regex the look behind is (?<=@) (it looks for @ before the matched characters) and the look ahead is (?=\.) (it looks for a dot after the matched characters); the @ and the dot are not part of the matched characters. The following example demonstrates this.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestClass
{
    public static void main(String[] args)
    {
        String inputString = "username@gmail.com";
        String pattern = "(?<=@)[a-z]+(?=\\.)";

        Pattern p = Pattern.compile(pattern);
        Matcher m = p.matcher(inputString);

        if(m.find())
        {
            String resultString = m.replaceAll("xxxx");
            System.out.println(resultString);
            //output will be username@xxxx.com
        }
    }
}

 

Regex Options (Flags) Available in Java

In Java, the following regex options are available. These options can be given while compiling a regex into a Pattern object. Multiple options can be combined using the bitwise OR operator (|).

  1. Pattern.CASE_INSENSITIVE
  2. Pattern.DOTALL
  3. Pattern.MULTILINE
  4. Pattern.COMMENTS
  5. Pattern.LITERAL
  6. Pattern.UNIX_LINES
  7. Pattern.UNICODE_CASE
  8. Pattern.CANON_EQ
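
Combining flags with the bitwise OR operator, as described above, might look like the following sketch. The finds() helper and the sample input are invented for the example; the flags are passed as the second argument of Pattern.compile().

```java
import java.util.regex.Pattern;

class CombinedFlagsDemo {
    // Hypothetical helper: compiles "^hello$" with the given flags and searches input
    static boolean finds(int flags, String input) {
        return Pattern.compile("^hello$", flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "first line\nHELLO\nlast line";

        // Either flag alone is not enough for this input
        System.out.println(finds(Pattern.CASE_INSENSITIVE, input)); // false
        System.out.println(finds(Pattern.MULTILINE, input));        // false

        // Both flags combined with | let ^hello$ match the second line
        System.out.println(finds(Pattern.CASE_INSENSITIVE | Pattern.MULTILINE, input)); // true
    }
}
```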

Using the Pattern.CASE_INSENSITIVE option

This flag is used to specify that matching should be case insensitive. Following program shows the usage of this flag

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestClass
{
    public static void main(String[] args)
    {
        String regex = "hello"; //pattern is lower case

        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher("HELLO what's up?");

        boolean result = matcher.find();
        //result will be true, because "hello" will match "HELLO"
    }
}

 

Using the Pattern.DOTALL option

We know that in regex, dot (.) matches any character. But in Java, by default, dot (.) will not match line terminators such as newline (\n) and carriage return (\r). If we specify the Pattern.DOTALL option, dot (.) will match newline (\n) and carriage return (\r) characters also. The following example shows the difference.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestClass
{
    public static void main(String[] args)
    {
        String regex = "Michael.*Jackson";

        //input string has newline and carriage return characters
        String inputString = "How are you Michael \n\r" +
                             "Jackson?";

        Pattern pattern1 = Pattern.compile(regex);
        Matcher matcher1 = pattern1.matcher(inputString);

        boolean result = matcher1.find();
        //result will be false, because .* will not match \n and \r

        Pattern pattern2 = Pattern.compile(regex, Pattern.DOTALL);
        Matcher matcher2 = pattern2.matcher(inputString);

        result = matcher2.find();
        //result will be true, because in DOTALL mode .* will match \n and \r
    }
}

 

Using Pattern.MULTILINE option

The Pattern.MULTILINE flag is useful if you are using ^ (start of string) or $ (end of string) in your regex. By default, when you match a regex pattern with ^ and $ against a multiline string, Java matches ^ only at the start and $ only at the end of the whole string (or just before a final line terminator). If the Pattern.MULTILINE flag is set, ^ will also match at the start of every line and $ at the end of every line in the input string. The following program shows the usage of the MULTILINE flag.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestClass
{
    public static void main(String[] args)
    {
        String regex = "^This is line-1$";

        //input string has newline characters
        String inputString = "This is line-1\n" +
                             "This is line-2\n" +
                             "This is line-3";

        Pattern pattern1 = Pattern.compile(regex);
        Matcher matcher1 = pattern1.matcher(inputString);

        boolean result = matcher1.find();
        //result will be false, because ^ and $ are matched
        //against the whole string

        Pattern pattern2 = Pattern.compile(regex, Pattern.MULTILINE);
        Matcher matcher2 = pattern2.matcher(inputString);

        result = matcher2.find();
        //result will be true, because ^ and $ are matched for every line
        //Since the first line matches the regex pattern, result will be true
    }
}
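
Because MULTILINE makes ^ and $ work per line, it combines naturally with a find() loop to pick out whole lines from a larger text. A minimal sketch, assuming a hypothetical matchingLines() helper and an invented sample input:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineLines {
    // With MULTILINE, ^ and $ anchor at the start and end of each line
    private static final Pattern LINE =
            Pattern.compile("^This is line-[0-9]$", Pattern.MULTILINE);

    // Hypothetical helper: collects every complete line that matches LINE
    static List<String> matchingLines(String input) {
        List<String> lines = new ArrayList<>();
        Matcher matcher = LINE.matcher(input);
        while (matcher.find()) {
            lines.add(matcher.group());
        }
        return lines;
    }

    public static void main(String[] args) {
        String input = "This is line-1\nThis is line-2\nsomething else\nThis is line-3";
        System.out.println(matchingLines(input));
        // prints [This is line-1, This is line-2, This is line-3]
    }
}
```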

 

Using Pattern.COMMENTS option

The sole purpose of this flag is to help the programmer write more readable regex patterns. Adding some extra spaces and a comment inside a regex pattern can improve its readability. If you provide the Pattern.COMMENTS flag, all white space and comments in your regex pattern are ignored unless escaped with a backslash. (To match white space in such “more readable” regex patterns, use \<space> or \s.) Comments are added using #: everything from the # character to the end of the line in the pattern is ignored while matching. The following example shows how to use this flag.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestClass
{
    public static void main(String[] args)
    {
        //less readable regex pattern
        String regex1 = "\\d{3}-\\d{3}-\\d{4}";

        //same regex pattern with more readability
        String regex2 = "\\d {3} - \\d {3} - \\d {4} #Phone number pattern";

        String inputString = "345-456-9876";

        Pattern pattern1 = Pattern.compile(regex1);
        Matcher matcher1 = pattern1.matcher(inputString);

        boolean result = matcher1.find();
        //result will be true because the pattern matches the input string

        Pattern pattern2 = Pattern.compile(regex2);
        Matcher matcher2 = pattern2.matcher(inputString);

        result = matcher2.find();
        //result will be false because the Pattern.COMMENTS flag is not set

        Pattern pattern3 = Pattern.compile(regex2, Pattern.COMMENTS);
        Matcher matcher3 = pattern3.matcher(inputString);

        result = matcher3.find();
        //result will be true because, if the Pattern.COMMENTS flag is set,
        //all the spaces and comments (#...) in the regex will be ignored

        // Escaping spaces in a more readable regex pattern

        //regex pattern with more readability and escaped spaces (\s)
        String regex4 = "\\d {3} \\s - \\s \\d {3} \\s - \\s \\d {4} #Phone number pattern";

        //input string has spaces, which should be matched
        String inputString4 = "345 - 456 - 9876";

        Pattern pattern4 = Pattern.compile(regex4, Pattern.COMMENTS);
        Matcher matcher4 = pattern4.matcher(inputString4);

        result = matcher4.find();
        //result will be true. \s (escaped spaces) inside the regex will
        //match the spaces in the input string
    }
}

 

Using Pattern.LITERAL option

If we specify the Pattern.LITERAL flag, meta characters (like ^, $, {3,4}) and escape sequences (like \d, \w, \\) WILL NOT be considered as meta characters and escape sequences. All of their special meanings will be lost. In LITERAL mode, only the Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags are applicable. Even if you specify other flags, they won’t make any difference while matching. The following example shows how to use this flag.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestClass
{
    public static void main(String[] args)
    {
        String regex = "\\d{3}-\\d{3}-\\d{4}";

        String inputString1 = "345-456-9876";
        Pattern pattern1 = Pattern.compile(regex);
        Matcher matcher1 = pattern1.matcher(inputString1);

        boolean result = matcher1.find();
        //result will be true because the pattern matches the input string

        Pattern pattern2 = Pattern.compile(regex, Pattern.LITERAL);
        Matcher matcher2 = pattern2.matcher(inputString1);

        result = matcher2.find();
        //result will be false, because in LITERAL mode, \d won't be
        //considered as a digit. \d will only match against the text \d

        String inputString3 = "\\d{3}-\\d{3}-\\d{4}";
        Pattern pattern3 = Pattern.compile(regex, Pattern.LITERAL);
        Matcher matcher3 = pattern3.matcher(inputString3);

        result = matcher3.find();
        //result will be true, because the regex pattern and input string are
        //exactly the same text (no special meanings for any character)
    }
}

 

Using Pattern.UNIX_LINES option

We know that in Unix/Linux a new line means the \n (Line Feed) character, but in Windows a new line means \r\n (Carriage Return + Line Feed). By default, Java recognizes both Unix and Windows style newlines as line terminators, so a multi-line string with \n, \r or \r\n will not make a difference. If we enable the Pattern.UNIX_LINES flag, only \n will be considered a line terminator. In Pattern.UNIX_LINES mode, if we match against a multi-line string with \r\n, only \n is treated as the newline and \r becomes part of the line. The following program shows how to use the Pattern.UNIX_LINES mode. (Pattern.UNIX_LINES is relevant only if you use ^, $ or dot (.) in your regex pattern.)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestClass
{
    public static void main(String[] args)
    {
        //find() is used throughout, because matches() would require the
        //single-line regex to match the whole multi-line string

        //Multi line string with \r\n (Windows newline)
        String inputString = "This is line-1\r\n"
                           + "This is line-2\r\n";

        String regex = "^This is line-[0-9]$";

        Pattern pattern1 = Pattern.compile(regex, Pattern.MULTILINE);
        Matcher matcher1 = pattern1.matcher(inputString);

        boolean result = matcher1.find();
        //Result will be true, because in default MULTILINE mode \r\n is
        //treated as a line terminator

        //Multi line string with \n alone (Unix newline)
        String inputString2 = "This is line-1\n"
                            + "This is line-2\n";

        Pattern pattern2 = Pattern.compile(regex, Pattern.MULTILINE);
        Matcher matcher2 = pattern2.matcher(inputString2);

        result = matcher2.find();
        //Result will be true, because in default MULTILINE mode \n is also
        //treated as a line terminator

        //Multi line string with \r\n (Windows newline)
        String inputString3 = "This is line-1\r\n"
                            + "This is line-2\r\n";

        //Enabling the UNIX_LINES mode
        Pattern pattern3 = Pattern.compile(regex, Pattern.MULTILINE | Pattern.UNIX_LINES);
        Matcher matcher3 = pattern3.matcher(inputString3);

        result = matcher3.find();
        //result will be false, because in UNIX_LINES mode, only \n is treated
        //as a line terminator. So the regex is matched against
        //"This is line-1\r", which will not match

        //Multi line string with \n alone (Unix newline)
        String inputString4 = "This is line-1\n"
                            + "This is line-2\n";

        //Enabling the UNIX_LINES mode
        Pattern pattern4 = Pattern.compile(regex, Pattern.MULTILINE | Pattern.UNIX_LINES);
        Matcher matcher4 = pattern4.matcher(inputString4);

        result = matcher4.find();
        //Result will be true, because in UNIX_LINES mode, \n is treated
        //as a line terminator
    }
}

 

Using Pattern.UNICODE_CASE option

The Pattern.UNICODE_CASE flag makes case insensitive matching work on Unicode strings (strings containing non-ASCII characters). By default, case insensitive matching (Pattern.CASE_INSENSITIVE) considers only US-ASCII characters, so without this flag it will not work properly on non-ASCII letters. The following example shows the usage of this flag.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestClass
{
    public static void main(String[] args)
    {
        String inputString = "ESPAÑOL";
        String regex = "español";

        Pattern pattern1 = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Matcher matcher1 = pattern1.matcher(inputString);

        boolean result = matcher1.find();
        //result is false because without UNICODE_CASE,
        //case insensitive matching covers only ASCII letters

        Pattern pattern2 = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher2 = pattern2.matcher(inputString);

        result = matcher2.find();
        //result is true, since UNICODE_CASE is specified
    }
}

Using Pattern.CANON_EQ option

This flag enables canonical equivalence while matching. According to the Java docs, if this flag is specified, the expression "a\u030A" will match the string "\u00E5" (å). Even though both look the same, they are represented in two ways, using different code points. If we specify the Pattern.CANON_EQ flag, two characters are considered to match if their full canonical decompositions match. Enabling this flag may impose a performance penalty. The following program tries to match a regex against input strings, both containing the character ä. This character can be written in Java source in three ways

  1. ä
  2. a\u0308
  3. \u0061\u0308

The first is a single precomposed character (\u00E4); the second and third both stand for the letter a followed by a combining diaeresis. When this flag is enabled the match will be done correctly.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestClass
{
    public static void main(String[] args)
    {
        //ä here is the single precomposed character \u00E4
        String inputString1 = "ääää";
        String inputString2 = "a\u0308a\u0308a\u0308a\u0308";
        String inputString3 = "\u0061\u0308\u0061\u0308\u0061\u0308\u0061\u0308";

        String regex = "ä+";
        boolean result;

        Pattern pattern1 = Pattern.compile(regex);
        Matcher matcher1 = pattern1.matcher(inputString1);
        result = matcher1.find();
        //result will be true, since both characters are the same

        Pattern pattern2 = Pattern.compile(regex);
        Matcher matcher2 = pattern2.matcher(inputString2);
        result = matcher2.find();
        //result will be false, since canonical equivalence
        //will not be considered

        Pattern pattern3 = Pattern.compile(regex);
        Matcher matcher3 = pattern3.matcher(inputString3);
        result = matcher3.find();
        //result will be false, since canonical equivalence
        //will not be considered

        Pattern pattern4 = Pattern.compile(regex, Pattern.CANON_EQ);
        Matcher matcher4 = pattern4.matcher(inputString1);
        result = matcher4.find();
        //result will be true, since both characters are the same

        Pattern pattern5 = Pattern.compile(regex, Pattern.CANON_EQ);
        Matcher matcher5 = pattern5.matcher(inputString2);
        result = matcher5.find();
        //result will be true, since the Pattern.CANON_EQ flag is specified
        //canonical equivalence will be considered while matching

        Pattern pattern6 = Pattern.compile(regex, Pattern.CANON_EQ);
        Matcher matcher6 = pattern6.matcher(inputString3);
        result = matcher6.find();
        //result will be true, since the Pattern.CANON_EQ flag is specified
        //canonical equivalence will be considered while matching
    }
}