Pattern matching with PHP

Pattern matching with PHP: Part II

We have already examined the asterisk ('*') quantifier. This meta character operates on the previous atom and matches zero or more instances of that atom. For example, the RE 'ab*' will match 'a', 'ab', 'abbbb' and so on; the RE '(ab)*' will match '' (nothing), 'ab', 'abab', etc.

In many applications, however, one wants to match one or more instances of a pattern. A regular expression to match one or more instances of 'ab' might look like this: 'ab(ab)*'. This is slightly cumbersome. For this reason, regular expressions support the plus ('+') quantifier. Unlike the asterisk, it matches one or more instances of the previous atom, so the previous RE can be shortened to '(ab)+'.

There is a third case. Some applications need to match zero or one instance of a pattern, which makes the pattern optional. For this, the question mark ('?') quantifier is supported. The RE '(ab)?' will match only '' (nothing) and 'ab'.
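The three quantifiers can be compared side by side with a short script. This is a minimal sketch: it uses preg_match() rather than ereg(), since ereg() is no longer available in current PHP versions, and the helper name matches_fully() is made up for this illustration. The helper anchors the pattern so that it must account for the whole subject string.

```php
<?php
// Hypothetical helper (not part of PHP): true only if $pattern
// matches the whole of $subject. preg_match() stands in for ereg().
function matches_fully(string $pattern, string $subject): bool
{
    // \A and \z pin the match to the very start and end of the subject.
    return preg_match('/\A(?:' . $pattern . ')\z/', $subject) === 1;
}

$subjects = array('', 'ab', 'abab');
foreach (array('(ab)*', '(ab)+', '(ab)?') as $pattern) {
    foreach ($subjects as $subject) {
        $result = matches_fully($pattern, $subject) ? 'match' : 'no match';
        echo "$pattern against '$subject': $result\n";
    }
}
```

Only '(ab)*' accepts all three subjects; '(ab)+' rejects the empty string and '(ab)?' rejects 'abab'.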

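Notice that the syntax covered so far can only express 'zero or more', 'one or more' and 'zero or one'. If an application were required to match between one and three instances of 'ab', the only option would be to spell out each repetition by hand, as in 'ab(ab)?(ab)?', which quickly becomes unmanageable for larger ranges.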

It might at first appear that to match between one and three instances of a pattern 'ab', you could use the RE '(ab)*(ab)*(ab)*'. This is not correct: the first '(ab)*' is free to match every consecutive instance of 'ab', however many there are, so the RE places no upper limit on the match. (It also matches zero instances.) You can test this with the following script:

01 <?php
02 $str = "ababababababab test abab";
03 $pat = "/(ab)*(ab)*(ab)*/"; // ereg() was removed in PHP 7; preg_match() needs delimiters
04 $regs = array();
05 if (preg_match($pat, $str, $regs)) {
06     echo "match: {$regs[0]}\n";
07 }
08 ?>

Line 02 contains a string, $str, with several instances of 'ab'. On line 05, the script tests the pattern against $str, and line 06 reports the first match. The output is 'match: ababababababab': the whole initial run of seven 'ab' instances, not just three of them.

To support a specific range of matches, REs provide a quantifier called a 'bound'. Bounds determine how many instances of a pattern should be matched. For example, the following RE matches one to three instances of 'ab': '(ab){1,3}'. The bound is denoted by curly brackets ('{}'). The first value is the lower bound and the second is the upper bound. If the second value is omitted but the comma is kept, as in '(ab){2,}', there is no upper bound and the RE matches two or more instances. A single value with no comma, as in '(ab){2}', matches exactly that many instances.
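A bound is easy to check in practice. The sketch below assumes preg_match() in place of ereg() (which current PHP no longer provides) and first anchors the RE so that the whole string must consist of one to three instances of 'ab', then runs the unanchored RE against the test string used earlier.

```php
<?php
// The bound {1,3} allows one, two or three repetitions of the group.
$pattern = '/^(ab){1,3}$/';

$results = array();
foreach (array('', 'ab', 'abab', 'ababab', 'abababab') as $subject) {
    $results[$subject] = preg_match($pattern, $subject) === 1;
    echo "'$subject': " . ($results[$subject] ? 'match' : 'no match') . "\n";
}

// Without anchors, the bound caps how much a single match can consume.
preg_match('/(ab){1,3}/', 'ababababababab test abab', $regs);
echo "unanchored match: {$regs[0]}\n";
```

The unanchored match is 'ababab': the RE stops after three instances even though more follow.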

Complex patterns

The wild card meta character '.' matches any character except a new line. The reason it doesn't match a new line as well follows from one of the most important rules in the RE syntax: a quantifier matches as much as it can, so the longest possible match is returned. That is, the RE 'a.*b', when used against the string 'avbab', matches 'avbab', not 'avb' or 'ab'. If '.' matched new lines as well, many REs containing '.*' would run on across the entire string. It is also convenient to match REs line by line, a point which becomes more apparent with increased usage.
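Both behaviours can be seen with two calls. This sketch assumes preg_match() with its default options, standing in for ereg(); the variable names are chosen for illustration.

```php
<?php
// The quantifier takes as much as it can: the match runs to the last 'b'.
preg_match('/a.*b/', 'avbab', $greedy);
echo "single line: {$greedy[0]}\n";

// With the default options '.' stops at a new line,
// so the match stays on the first line.
preg_match('/a.*b/', "avb\nab", $first_line);
echo "two lines: {$first_line[0]}\n";
```

The first call reports 'avbab'; the second reports 'avb', because '.*' cannot cross the new line to reach the final 'b'.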

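This raises the question of how to work with line boundaries in REs. Two meta characters serve this purpose: circumflex/hat ('^') and dollar sign ('$'). Neither matches a character. They are anchors: the first matches the position at the beginning of a line, the second the position at the end of a line. The following RE matches a whole line: '^.*$'. Note that in PHP's preg functions, '^' and '$' anchor to the start and end of the whole string unless the 'm' (multi-line) modifier is given.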

This syntax allows much more complex REs. For example, to match all lines beginning with 'Hello', the following RE would be used: '^Hello.*'. To match all lines ending with 'World' the following RE would be used: '.*World$'.

What if, however, an application were required to match lines either beginning with 'Hello' or ending with 'World'? The pipe ('|') meta character provides an alternation mechanism by which a string matches one pattern or another. Using '|', the following RE meets the above requirement: '(^Hello|World$)'.
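The anchors and the alternation can be tried out together. The sketch below is one way to do it, assuming PHP's preg functions rather than ereg(); the sample text is invented for the example. It first filters an array of lines with preg_grep(), then counts matches over the whole text with and without the 'm' modifier.

```php
<?php
$text = "Hello there\nGoodbye World\nNothing to see\nHello World";

// Line by line: each array element is tested on its own,
// so ^ and $ line up with the start and end of each line.
$lines = explode("\n", $text);
$hits = preg_grep('/^Hello|World$/', $lines);
print_r(array_values($hits));

// Whole text: the 'm' modifier makes ^ and $ match at every
// line boundary, not only at the ends of the string.
$with_m = preg_match_all('/^Hello|World$/m', $text);
$without_m = preg_match_all('/^Hello|World$/', $text);
echo "with m: $with_m, without m: $without_m\n";
```

Three of the four lines pass the filter. Over the whole text, the 'm' version finds four matches (the last line matches both alternatives), while the version without it finds only the 'Hello' at the very start and the 'World' at the very end.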

In some circumstances, applications are required to match one of a number of characters. For example, you may be required to match a space, a tab or a lower case character. Using our existing syntax, the RE would look something like this: '^( |\t|a|b|c|....)' where '\t' is a meta character for a tab and '....' represents the rest of the alphabet in lower case joined by pipes. This is cumbersome to write and inefficient to match. The solution is the 'bracket expression': a list of one or more characters enclosed in square brackets, which matches any single character from the list. The previous RE could be rewritten as: '^[ \tabc....]'.

It is still cumbersome, however, to type out the whole alphabet in lower case. For this reason, bracket expressions support ranges of characters. Ranges include 'a-z', which matches all lower case alphabet characters; 'A-Z', which matches all upper case alphabet characters; and '0-9', which matches the digits 0 to 9. This means we can finally write a compact RE which matches every character we're interested in: '^[ \ta-z]'.
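The final RE can be checked against a handful of strings. This is a small sketch, again assuming preg_match() in place of ereg(); the sample subjects are made up for the example.

```php
<?php
// Matches when the first character is a space, a tab or a lower case letter.
// Single quotes leave \t for the regex engine, which reads it as a tab.
$pattern = '/^[ \ta-z]/';

$results = array();
foreach (array('hello', ' indented', "\ttabbed", 'Hello', '9 lives') as $subject) {
    $results[$subject] = preg_match($pattern, $subject) === 1;
}
var_dump($results);
```

The first three subjects match; 'Hello' and '9 lives' do not, because an upper case letter and a digit fall outside the listed characters and ranges.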

On the next page, we will take a detailed look at back references in conjunction with regular expression substitutions in PHP.

Gavin Sherry

PC World