Pattern matching with PHP: Part II

We have already examined the asterisk (*) quantifier. This meta character operates on the previous atom and matches zero or more instances of that atom. For example, the RE 'ab*' will match 'a', 'ab', 'abbbb' and so on; the RE '(ab)*' will match '' (nothing), 'ab', 'abab', etc.

In many applications, however, one wants to match one or more instances of a pattern. A regular expression to match one or more instances of 'ab' might look like this: 'ab(ab)*'. This is slightly cumbersome, however. For this reason, a pattern can be followed by the plus (+) quantifier. Unlike the asterisk, this matches one or more instances of the previous atom. So the previous RE could be rewritten as '(ab)+'.

There is another option, however. Some applications might require matching zero or one instance of a pattern! For this reason, a question mark (?) quantifier is supported. The RE '(ab)?' will match only '' (nothing) and 'ab'.
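The three quantifiers can be compared side by side with a short script. This is only a sketch: it uses preg_match() with '/' delimiters as a stand-in, since the ereg() family was removed in PHP 7, and the helper name matches_fully() is invented for illustration. The '^' and '$' it wraps around each pattern force the whole string to match, so a quantifier cannot succeed on a substring alone.

```php
<?php
// Hypothetical helper: true if $re matches the WHOLE of $s.
// The ^ and $ anchors stop the engine from matching only part of $s.
function matches_fully(string $re, string $s): bool
{
    return preg_match('/^' . $re . '$/', $s) === 1;
}

$subjects = ['', 'ab', 'abab'];
foreach (['(ab)*', '(ab)+', '(ab)?'] as $re) {
    foreach ($subjects as $s) {
        $result = matches_fully($re, $s) ? 'match' : 'no match';
        echo "$re against '$s': $result\n";
    }
}
```

Only '(ab)+' rejects the empty string, and only '(ab)?' rejects 'abab'.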

Notice that the syntax covered so far supports matching zero or one, zero or more, or one or more instances of a pattern. If an application was required to match between one and three instances of 'ab', none of these quantifiers could express that directly.

It might at first appear that to match between one and three instances of a pattern 'ab', you could use the RE '(ab)*(ab)*(ab)*'. This is not correct: each '(ab)*' is unbounded, so the first one alone will match all three instances of 'ab' in 'ababab', and just as happily match more. You can test this with the following script:

01 <?php
02 $str = "ababababababab test abab";
03 $pat = "/(ab)*(ab)*(ab)*/"; // '/' delimiters are needed by preg_match()
04 $regs = array();
05 if (preg_match($pat, $str, $regs)) { // ereg() was removed in PHP 7
06     echo "match: {$regs[0]}\n";
07 }
08 ?>

Line 02 contains a string, $str, with several instances of 'ab'. On line 05, the script tests the pattern against $str, and line 06 prints the text that was matched. This is the whole initial run of 'ab' (all seven instances), not a maximum of three.

To support a specific range of matches, REs provide a quantifier called a 'bound'. Bounds determine how many instances of a pattern should be matched. For example, the following RE matches one to three instances of 'ab': '(ab){1,3}'. The bound is denoted by curly brackets ({}). The first value is the lower bound and the second is the upper bound. If the second value is omitted but the comma is kept, as in '(ab){2,}', only a lower bound is used when matching patterns. A single value with no comma, as in '(ab){3}', matches exactly that many instances.
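A bound is easy to check by generating strings with a known number of repetitions. The sketch below assumes preg_match() (PCRE) in place of the removed ereg(), and matches_bound() is a hypothetical helper. It anchors the pattern with '^' and '$' so that the bound has to account for the whole string.

```php
<?php
// Hypothetical helper: does '(ab)' followed by $bound accept exactly
// $count repetitions of 'ab' and nothing else?
function matches_bound(string $bound, int $count): bool
{
    $subject = str_repeat('ab', $count);
    return preg_match('/^(ab)' . $bound . '$/', $subject) === 1;
}

foreach ([0, 1, 3, 4] as $n) {
    $result = matches_bound('{1,3}', $n) ? 'match' : 'no match';
    echo "$n x 'ab' against (ab){1,3}: $result\n";
}
```

Without the anchors, '(ab){1,3}' would still succeed against four or more repetitions, because it would simply match the first three.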

Complex patterns

The wild card meta character '.' matches any character except a new line. The reason it doesn't match a new line as well is tied to one of the most important rules in the RE syntax: the longest possible match is always taken. That is, the RE 'a.*b', when used against the string 'avbab', matches the whole of 'avbab', not just 'avb' or 'ab'. If '.' matched new lines as well, many REs containing '.*' would swallow an entire multi-line string in a single match. It is also convenient to match REs line by line, a point which will become more apparent with increased usage.
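The longest-match rule and the new line exception can both be seen with 'a.*b'. The following is a minimal sketch that assumes preg_match() with its default PCRE settings rather than ereg(); first_match() is a name made up for this example.

```php
<?php
// Hypothetical helper: returns the text matched by $re in $s,
// or null when there is no match.
function first_match(string $re, string $s): ?string
{
    return preg_match($re, $s, $m) === 1 ? $m[0] : null;
}

// '.*' grabs as much as it can, so the match runs to the last 'b'.
echo first_match('/a.*b/', 'avbab'), "\n";

// '.' will not cross the new line, so the match ends on the first line.
echo first_match('/a.*b/', "avb\nab"), "\n";
```

The first call prints 'avbab' and the second prints 'avb'.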

This raises the question of how to deal with line boundaries in REs. There are two meta characters for this: circumflex/hat ('^') and dollar sign ('$'). Neither matches a character; the first anchors the match at the beginning of a line, the second at the end of a line. The following RE matches a whole line: '^.*$'.

This syntax allows much more complex REs. For example, to match all lines beginning with 'Hello', the following RE would be used: '^Hello.*'. To match all lines ending with 'World' the following RE would be used: '.*World$'.
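To try these anchored patterns against several lines at once, one option is preg_match_all() with the PCRE 'm' (multiline) modifier. That modifier is an assumption of this sketch, not something the REs above require: by default PCRE treats '^' and '$' as the start and end of the whole string, and 'm' makes them apply to each line. The helper name matching_lines() is hypothetical.

```php
<?php
// Hypothetical helper: returns every match of $re in $text, with ^ and $
// applied per line thanks to the 'm' modifier.
function matching_lines(string $re, string $text): array
{
    preg_match_all('/' . $re . '/m', $text, $m);
    return $m[0];
}

$text = "Hello there\nGoodbye World\nSay Hello\nHello World";

print_r(matching_lines('^Hello.*', $text)); // lines beginning with 'Hello'
print_r(matching_lines('.*World$', $text)); // lines ending with 'World'
```

'Say Hello' appears in neither result, since 'Hello' is not at the start of that line and 'World' is not at its end.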

What if, however, an application was required to match either lines beginning with 'Hello' or ending with 'World'? The pipe ('|') meta character provides an alternation mechanism by which a string matches one pattern or another. Using '|', the following RE matches the above requirement: '(^Hello|World$)'.
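As a quick check of the alternation, the sketch below wraps the RE in a small function. It assumes preg_match() as a replacement for ereg(), and hello_or_world() is an illustrative name only. Each subject is a single line, so no multiline handling is needed.

```php
<?php
// Hypothetical helper: true if the line begins with 'Hello'
// or ends with 'World'.
function hello_or_world(string $line): bool
{
    return preg_match('/(^Hello|World$)/', $line) === 1;
}

$lines = ['Hello there', 'Goodbye World', 'Say Hello to the World!'];
foreach ($lines as $line) {
    echo $line, ': ', hello_or_world($line) ? 'match' : 'no match', "\n";
}
```

The last line contains both words but fails, because 'Hello' is not at the beginning and 'World' is not at the end.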

In some circumstances, applications are required to match one of a number of characters. For example, you may be required to match a space, a tab or a lower case character. Using our existing syntax, the RE would look something like this: '^( |\t|a|b|c|....)' where '\t' is a meta character for a tab and '....' represents the rest of the alphabet in lower case joined by pipes. This is cumbersome and very inefficient from a performance perspective. The solution to this is 'bracket expressions'. A bracket expression is a list of one or more characters enclosed in square brackets ([]); it matches any single character from that list. The previous RE could be rewritten as: '^[ \tabc....]'.

It is still cumbersome, however, to type out the whole alphabet in lowercase. For this reason, bracket expressions support ranges of characters. Ranges include 'a-z', which matches all lower case alphabet characters; 'A-Z', which matches all upper case alphabet characters; and '0-9', which matches the digits 0 to 9. This means we can finally create a basic RE which matches all possible patterns we're interested in: '^[ \ta-z]'.
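The final RE can be tried against a few strings as follows. This sketch assumes preg_match() in place of ereg(), and the function name is invented for the example. The pattern is written in single quotes so that PHP passes '\t' through untouched and the regular expression engine interprets it as a tab.

```php
<?php
// Hypothetical helper: true if $s begins with a space, a tab
// or a lower case letter.
function starts_with_space_tab_or_lower(string $s): bool
{
    return preg_match('/^[ \ta-z]/', $s) === 1;
}

$subjects = ['abc', ' abc', "\tabc", 'Abc', '9abc'];
foreach ($subjects as $s) {
    $result = starts_with_space_tab_or_lower($s) ? 'match' : 'no match';
    echo json_encode($s), ": $result\n"; // json_encode() makes the tab visible
}
```

The first three strings match; 'Abc' and '9abc' do not, since neither upper case letters nor digits are in the bracket expression.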

On the next page, we will take a detailed look at back references in conjunction with regular expression substitutions in PHP.

Gavin Sherry

PC World
