Pattern matching with PHP

Pattern matching with PHP: Part II

We have already examined asterisk (*) quantifier. This meta character operates on the previous atom and matches zero or more instances of that atom. For example, the RE 'ab*' will match 'a', 'ab', 'abbbb' and so on; the RE '(ab)*' will match '' (nothing), 'ab', 'abab', etc.

In many applications, however, one wants to match one or more instances of a pattern. A regular expression to match one or more instances of 'ab' might look like this: "ab(ab)*". This is slightly cumbersome, however. For this reason, regular expressions can have patterns followed by a quantifier plus ("+"). Unlike the asterisk, this matches one or more instances of the previous atom. So the previous RE could be modified to '(ab)+'.

There is another option, however. Some applications might require matching zero or one instance of a pattern! For this reason, a question mark (?) quantifier is supported. The RE '(ab)?' will match only '' (nothing) and 'ab'.

Notice that current syntax supports matching zero, one or an infinite number of patterns. If an application was required to match between one and three instances of 'ab', it could not be done.

It might at first appear that to match between one and three instances of a pattern 'ab', you could use the RE '(ab)*(ab)*(ab)*. This is not correct, since the first pattern atom will match all three instances of 'ab' in 'ababab'. You can test this with the following script:

01 <?
02 $str = "ababababababab test abab";
03 $pat = "(ab)*(ab)*(ab)*";
04 $regs = array();
05 if(ereg($pat,$str,$regs)) {
06 echo "match: {$regs[0]}\n";
07 }
08 ?>

Line 02 contains a string, $str, with several instances of 'ab'. On line 05, the script tests the pattern against $str and reports the first matched pattern on line 06. This is the whole initial 'ab' segment.

As such, to support a specific range of matches, REs support a quantifier called a 'bound'. Bounds determine how many instances of a pattern should be matched. For example, the following RE matches one to three instances of 'ab': '(ab){1,3}'. The bound is denoted by curly brackets ({}). The first value is the lower bound and the second is the upper bound. If the second value is omitted, only a lower bound is used when matching patterns.

Complex patterns

The wild card meta character '.' matches any character except a new line. The reason it doesn't match a new line as well is due to one of the most important rules in the RE syntax: the largest pattern is always matched. That is, the RE 'a.*b', when used against the string 'avbab', matches 'avbab', not 'avb' or 'ab'. If '.' matched new lines as well, many REs would match an entire string immediately (since '.' matches all characters in it). It is also convenient to match REs line by line - a point which will become more and more apparent with increased usage.

This begs the question, then, of how to match new lines with REs. There are two meta characters for new lines: circumflex/hat ('^') and dollar sign ('$'). The first denotes the beginning of a line, the second denotes the end of a line. The following RE matches a whole line: '^.*$'.

This syntax allows much more complex REs. For example, to match all lines beginning with 'Hello', the following RE would be used: '^Hello.*'. To match all lines ending with 'World' the following RE would be used: '.*World$'.

What if, however, an application was required to match either lines beginning with 'Hello' or ending with 'World'? The pipe ('|') meta character provides an alternation mechanism by which a string matches one pattern or another. Using '|', the following RE matches the above requirement: '(^Hello|World$)'.

In some circumstances, applications are required to match one of a number of characters. For example, you may be required to match a space, a tab or a lower case character. Using our existing syntax, the RE would look something like this: '^( |\t|a|b|c|....)' where '\t' is a meta character for a tab and '....' represents the rest of the alphabet in lower case joined by pipes. This is cumbersome and very inefficient from a performance perspective. The solution to this is 'bracket expressions'. A bracket expression is a list of one or more characters which are treated, character by character, as a pattern. The previous RE could be rewritten as: '^[ \tabc....]'.

It is still cumbersome, however, to type out the whole alphabet in lowercase. For this reason, bracket expressions support ranges of characters. Ranges include 'a-z', which matches all lower case alphabet characters; 'A-Z', which matches all upper case alphabet characters; and 0-9, matching the digits 0 to 9. This means we can finally create a basic RE which matches all possible patterns we're interested in: '^[ \ta-z]'.

On the next page, we will take a detailed look at back references in conjunction with regular expression substitutions in PHP.

Join the newsletter!


Sign up to gain exclusive access to email subscriptions, event invitations, competitions, giveaways, and much more.

Membership is free, and your security and privacy remain protected. View our privacy policy before signing up.

Error: Please check your email address.
Keep up with the latest tech news, reviews and previews by subscribing to the Good Gear Guide newsletter.

Gavin Sherry

PC World
Show Comments

Brand Post

Most Popular Reviews

Latest Articles


PCW Evaluation Team

Jack Jeffries


As the Maserati or BMW of laptops, it would fit perfectly in the hands of a professional needing firepower under the hood, sophistication and class on the surface, and gaming prowess (sports mode if you will) in between.

Taylor Carr


The MSI PS63 is an amazing laptop and I would definitely consider buying one in the future.

Christopher Low

Brother RJ-4230B

This small mobile printer is exactly what I need for invoicing and other jobs such as sending fellow tradesman details or step-by-step instructions that I can easily print off from my phone or the Web.

Aysha Strobbe

Microsoft Office 365/HP Spectre x360

Microsoft Office continues to make a student’s life that little bit easier by offering reliable, easy to use, time-saving functionality, while continuing to develop new features that further enhance what is already a formidable collection of applications

Michael Hargreaves

Microsoft Office 365/Dell XPS 15 2-in-1

I’d recommend a Dell XPS 15 2-in-1 and the new Windows 10 to anyone who needs to get serious work done (before you kick back on your couch with your favourite Netflix show.)

Maryellen Rose George

Brother PT-P750W

It’s useful for office tasks as well as pragmatic labelling of equipment and storage – just don’t get too excited and label everything in sight!

Featured Content

Product Launch Showcase

Don’t have an account? Sign up here

Don't have an account? Sign up now

Forgot password?