Pattern matching with PHP

Pattern matching with PHP: Part III

Building on the previous page on pattern matching with regular expressions (REs), this time we will investigate substitutions and back references.

Substitutions and REs

PHP provides a framework for matching strings with REs and also for replacing substrings based on those REs. Say your application was required to strip all HTML tags from some text, except line breaks (<br>); you could use REs as follows:

01 <?
02 $pat = "(<[^b>]?[^r>]*>)";
03 $str = "<p>a <b>test</b><br>html string</p>";
04 echo eregi_replace($pat,"",$str);
05 ?>

On line 02, we define a pattern which matches a less than sign (<) that marks the start of an HTML tag. We then match any character other than 'b' and 'r'. We also want to avoid matching '>' since the second atom matches all characters up to the final '<'.

On line 05, we call eregi_replace() to match instances of the pattern. The second argument is our replacement text. In this case, it is "" (an empty string). As such, instances of the pattern in $str will be removed. To get familiar with this code, try replacing "" with "test" to see where substrings are removed. We also use eregi_replace() because it matches case insensitively and HTML is case insensitive. The case-sensitive equivalent to eregi_replace() is ereg_replace() (note the missing 'i' after 'ereg').

Understanding back references

A back reference is the text that has been matched by a sub-pattern in the pattern string. Back references allow the user to refer to parts of a matched string. The following example illustrates how they are used:

01 <?
02 $str = "01/03/2004";
03 $pat = "([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})";
04 $regs = array();
05 if(ereg($pat,$str,$regs)) {
06 echo "match: day: {$regs[1]}, month: {$regs[2]}, year: {$regs[3]}\n";
07 }
08 ?>

On line 02, we define a string that is a human-readable form of the date for 1 March. On line 03, we define a pattern to match the structure of this date. The first atom, '([0-9]{1,2})', is designed to match either one or two numerical characters. The second atom is identical. The third matches four numeric characters.

The use of parentheses not only allows us to form sub-patterns in our pattern string, but also to 'save' the text of the matched sub-pattern and recall it later. On line 05, we call ereg() and tell it to store the matched patterns in $regs. On line 06, we can output a broken-down date string. (Readers interested in extending their RE skills should attempt to modify the pattern on line 03 to validate dates; currently, $pat would match a string such as 99-99-2004.)

Parentheses nested in another set of parentheses can also be back referenced. For example, the pattern "(a (string))" allows two back references. Reference one, which would be stored in $regs[1] if called in conjunction with ereg(), would be 'a string'. Back reference two would be 'string'.

Combining back references and substitutions

By using back references in conjunction with substitutions we can design small scripts to perform very complex tasks. Consider the following problem: replace European dates of the form 'dd/mm/yyyy' or 'dd-mm-yyyy' with the ISO 8601 format of 'yyyy-mm-dd' in a text file.

01 <?
02 $pat = "([0-9]{1,2})[/-]([0-9]{1,2})[/-]([0-9]{4})";
03 $repl = "\\3-\\2-\\1";
04 $str = join('',file("test.txt"));
05 echo ereg_replace($pat,$repl,$str);
06 ?>

Place the following string in a file called test.txt: "this is a date 1/2/2004 and this is another 20-3-2003". Running the script will convert this text to: "this is a date 2004-2-1 and this is another 2003-3-20".

On line 02, we match a European date of the form defined above and isolate three different atoms. For the purpose of back referencing, the sub-patterns are number 1 through 3 from left to right. On line 03 we rearrange the order of the date by reversing the order of the atoms we are back referencing.

Join the PC World newsletter!

Error: Please check your email address.
Rocket to Success - Your 10 Tips for Smarter ERP System Selection
Keep up with the latest tech news, reviews and previews by subscribing to the Good Gear Guide newsletter.

Gavin Sherry

PC World
Show Comments

Most Popular Reviews

Latest Articles

Resources

PCW Evaluation Team

Matthew Stivala

HP OfficeJet 250 Mobile Printer

The HP OfficeJet 250 Mobile Printer is a great device that fits perfectly into my fast paced and mobile lifestyle. My first impression of the printer itself was how incredibly compact and sleek the device was.

Armand Abogado

HP OfficeJet 250 Mobile Printer

Wireless printing from my iPhone was also a handy feature, the whole experience was quick and seamless with no setup requirements - accessed through the default iOS printing menu options.

Azadeh Williams

HP OfficeJet Pro 8730

A smarter way to print for busy small business owners, combining speedy printing with scanning and copying, making it easier to produce high quality documents and images at a touch of a button.

Andrew Grant

HP OfficeJet Pro 8730

I've had a multifunction printer in the office going on 10 years now. It was a neat bit of kit back in the day -- print, copy, scan, fax -- when printing over WiFi felt a bit like magic. It’s seen better days though and an upgrade’s well overdue. This HP OfficeJet Pro 8730 looks like it ticks all the same boxes: print, copy, scan, and fax. (Really? Does anyone fax anything any more? I guess it's good to know the facility’s there, just in case.) Printing over WiFi is more-or- less standard these days.

Ed Dawson

HP OfficeJet Pro 8730

As a freelance writer who is always on the go, I like my technology to be both efficient and effective so I can do my job well. The HP OfficeJet Pro 8730 Inkjet Printer ticks all the boxes in terms of form factor, performance and user interface.

Michael Hargreaves

Windows 10 for Business / Dell XPS 13

I’d happily recommend this touchscreen laptop and Windows 10 as a great way to get serious work done at a desk or on the road.

Featured Content

Latest Jobs

Don’t have an account? Sign up here

Don't have an account? Sign up now

Forgot password?