Pattern matching with PHP: Part II

We have already examined the asterisk (*) quantifier. This meta character operates on the previous atom and matches zero or more instances of that atom. For example, the RE 'ab*' will match 'a', 'ab', 'abbbb' and so on; the RE '(ab)*' will match '' (nothing), 'ab', 'abab', etc.

In many applications, however, one wants to match one or more instances of a pattern. A regular expression to match one or more instances of 'ab' might look like this: 'ab(ab)*'. This is slightly cumbersome, however. For this reason, a pattern can be followed by the plus ('+') quantifier. Unlike the asterisk, this matches one or more instances of the previous atom. So the previous RE could be shortened to '(ab)+'.

There is another option, however. Some applications might require matching zero or one instance of a pattern! For this reason, a question mark (?) quantifier is supported. The RE '(ab)?' will match only '' (nothing) and 'ab'.
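The three quantifiers can be compared side by side. The following is a minimal sketch, assuming PHP's PCRE function preg_match() (ereg() is not available in PHP 7 and later); the helper first_match() and the sample strings are made up for illustration.

```php
<?php
// Hypothetical helper: return the text of the first match, or null if none.
function first_match(string $pattern, string $subject): ?string
{
    // preg_match() returns 1 on a match and stores the matched text in $m[0].
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

$run      = "ababab";    // starts with three instances of 'ab'
$prefixed = "xxababab";  // the same run, preceded by 'xx'

$starRun      = first_match('/(ab)*/', $run);      // zero or more: "ababab"
$plusRun      = first_match('/(ab)+/', $run);      // one or more:  "ababab"
$optionalRun  = first_match('/(ab)?/', $run);      // zero or one:  "ab"

// At the start of "xxababab" there is no 'ab', so * and ? succeed
// immediately with an empty match, while + has to search further along.
$starPrefixed = first_match('/(ab)*/', $prefixed); // ""
$plusPrefixed = first_match('/(ab)+/', $prefixed); // "ababab"

var_dump($starRun, $plusRun, $optionalRun, $starPrefixed, $plusPrefixed);
```

The empty match on the second string shows why '+' is usually the better choice when at least one instance is required.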

Notice that the syntax covered so far can only express zero or more, one or more, or zero or one instances of a pattern. If an application were required to match between one and three instances of 'ab', there would be no direct way to say so.

It might at first appear that to match between one and three instances of the pattern 'ab', you could use the RE '(ab)*(ab)*(ab)*'. This is not correct: each asterisk is unbounded, so the first '(ab)*' alone will consume every consecutive 'ab' it finds, however many there are. You can test this with the following script:

01 <?php
02 $str = "ababababababab test abab";
03 $pat = "/(ab)*(ab)*(ab)*/"; // PCRE pattern with delimiters; ereg() was removed in PHP 7
04 $regs = array();
05 if (preg_match($pat, $str, $regs)) {
06     echo "match: {$regs[0]}\n";
07 }
08 ?>

Line 02 contains a string, $str, with several instances of 'ab'. On line 05, the script tests the pattern against $str and reports the first match on line 06. This is the whole initial run of 'ab' ('ababababababab', seven instances), not a maximum of three.

As such, to support a specific range of matches, REs support a quantifier called a 'bound'. Bounds determine how many instances of a pattern should be matched. For example, the following RE matches one to three instances of 'ab': '(ab){1,3}'. The bound is denoted by curly brackets ({}). The first value is the lower bound and the second is the upper bound. If the second value is omitted but the comma is kept, as in '(ab){2,}', only a lower bound applies. If the comma is omitted as well, as in '(ab){2}', exactly that number of instances is matched.
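The three forms of a bound can be tried out as follows. This is a sketch only, assuming PCRE via preg_match(); the sample string is invented, and the last pattern borrows the '^' and '$' anchors to tie the match to the whole string.

```php
<?php
$subject = "ababababab"; // five instances of 'ab'

// {1,3}: between one and three instances; the match stops after three.
preg_match('/(ab){1,3}/', $subject, $m);
$range = $m[0];   // "ababab"

// {2,}: at least two instances, with no upper limit.
preg_match('/(ab){2,}/', $subject, $m);
$atLeast = $m[0]; // "ababababab"

// {2}: exactly two instances.
preg_match('/(ab){2}/', $subject, $m);
$exact = $m[0];   // "abab"

// Without anchors, {1,3} still finds a match inside a longer run.
// Anchoring the pattern rejects a string holding more than three instances.
$wholeString = preg_match('/^(ab){1,3}$/', $subject); // 0 (no match)

var_dump($range, $atLeast, $exact, $wholeString);
```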

Complex patterns

The wild card meta character '.' matches any character except a new line. It excludes the new line because of one of the most important rules in the RE syntax: the longest possible match is always preferred. That is, the RE 'a.*b', when used against the string 'avbab', matches 'avbab', not 'avb' or 'ab'. If '.' matched new lines as well, many REs would swallow an entire multi-line string in a single match (since '.' would match every character in it). It is also convenient to match REs line by line, a point which will become more and more apparent with increased usage.

This raises the question, then, of how to deal with line boundaries in REs. There are two meta characters for this: circumflex/hat ('^') and dollar sign ('$'). They are anchors: they match a position rather than a character. The first denotes the beginning of a line, the second denotes the end of a line. The following RE matches a whole line: '^.*$'.
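Both behaviours can be checked with a short sketch, assuming PCRE via preg_match() with its default settings; the second sample string is invented for illustration.

```php
<?php
// Longest match: '.*' takes as much as it can while still letting 'b' match.
preg_match('/a.*b/', "avbab", $m);
$greedy = $m[0];    // "avbab", not "avb" or "ab"

// '.' does not cross a new line, so the match stays on the first line.
preg_match('/a.*b/', "avb\nab", $m);
$firstLine = $m[0]; // "avb"

var_dump($greedy, $firstLine);
```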

This syntax allows much more complex REs. For example, to match all lines beginning with 'Hello', the following RE would be used: '^Hello.*'. To match all lines ending with 'World' the following RE would be used: '.*World$'.
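The sketch below tries these anchored REs on a small multi-line string. It assumes PCRE via preg_match_all(), where '^' and '$' only match at line boundaries when the 'm' (multiline) modifier is given; without it they match at the start and end of the whole string. The sample text is made up.

```php
<?php
$text = "Hello there\nSay Hello\nBig World\nHello World";

// Every whole line: '^.*$' with the multiline modifier.
preg_match_all('/^.*$/m', $text, $all);
$lineCount = count($all[0]); // 4

// Lines beginning with 'Hello'.
preg_match_all('/^Hello.*/m', $text, $starts);
$helloLines = $starts[0];    // ["Hello there", "Hello World"]

// Lines ending with 'World'.
preg_match_all('/.*World$/m', $text, $ends);
$worldLines = $ends[0];      // ["Big World", "Hello World"]

print_r($helloLines);
print_r($worldLines);
```

Note that 'Say Hello' is not reported by the first anchored pattern, because its 'Hello' is not at the beginning of the line.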

What if, however, an application was required to match either lines beginning with 'Hello' or ending with 'World'? The pipe ('|') meta character provides an alternation mechanism by which a string matches one pattern or another. Using '|', the following RE matches the above requirement: '(^Hello|World$)'.
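As a rough illustration, the alternation can be applied to a list of lines with preg_grep(), which returns the array elements that match a PCRE pattern. The use of preg_grep() and the sample lines are assumptions made for this sketch.

```php
<?php
$lines = [
    "Hello there",  // begins with 'Hello'
    "Say Hello",    // 'Hello' is not at the beginning
    "Big World",    // ends with 'World'
    "World peace",  // 'World' is not at the end
    "Hello World",  // satisfies both alternatives
];

// preg_grep() keeps the original array keys, so re-index the result.
$matched = array_values(preg_grep('/(^Hello|World$)/', $lines));

print_r($matched); // "Hello there", "Big World", "Hello World"
```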

In some circumstances, applications are required to match one of a number of characters. For example, you may be required to match a space, a tab or a lower case character. Using our existing syntax, the RE would look something like this: '^( |\t|a|b|c|....)' where '\t' is the escape sequence for a tab and '....' represents the rest of the alphabet in lower case joined by pipes. This is cumbersome and inefficient to match. The solution is the 'bracket expression': a list of one or more characters enclosed in square brackets, which matches any single character from that list. The previous RE could be rewritten as: '^[ \tabc....]'.

It is still cumbersome, however, to type out the whole alphabet in lowercase. For this reason, bracket expressions support ranges of characters. Ranges include 'a-z', which matches all lower case alphabet characters; 'A-Z', which matches all upper case alphabet characters; and '0-9', which matches the digits 0 to 9. This means we can finally create a basic RE which matches all the characters we're interested in: '^[ \ta-z]'.
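The final RE can be exercised against a few sample strings. This is a sketch assuming PCRE via preg_match(), which returns 1 for a match and 0 otherwise; the sample strings are invented.

```php
<?php
// Single quotes keep '\t' intact so that the regex engine reads it as a tab.
$pattern = '/^[ \ta-z]/';

$samples = [
    "hello",      // begins with a lower case letter
    " indented",  // begins with a space
    "\tTabbed",   // begins with a tab
    "Hello",      // begins with an upper case letter
    "9 lives",    // begins with a digit
];

$results = [];
foreach ($samples as $sample) {
    $results[] = preg_match($pattern, $sample);
}

print_r($results); // 1, 1, 1, 0, 0
```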

On the next page, we will take a detailed look at back references in conjunction with regular expression substitutions in PHP.

Gavin Sherry

PC World