Pattern matching with PHP: Part III

Building on the previous page on pattern matching with regular expressions (REs), this time we will investigate substitutions and back references.

Substitutions and REs

PHP provides a framework for matching strings with REs and also for replacing substrings based on those REs. Say your application was required to strip all HTML tags from some text, except line breaks (<br>); you could use REs as follows:

01 <?php
02 $pat = "(<[^b>]?[^r>]*>)";
03 $str = "<p>a <b>test</b><br>html string</p>";
04 echo eregi_replace($pat,"",$str);
05 ?>

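On line 02, we define a pattern that begins by matching a less-than sign (<), which marks the start of an HTML tag. The first bracket expression, [^b>]?, optionally matches one character other than 'b' or '>'. The second, [^r>]*, matches any run of characters other than 'r' or '>'. Both exclude '>' so that the match cannot run past the end of the tag, which the final '>' in the pattern then matches. Because no 'r' is allowed inside the tag, <br> is left alone. Note that, as written, the pattern also leaves in place any other tag that contains an 'r', such as <tr> or <strong>.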

On line 04, we call eregi_replace() to replace every instance of the pattern. The second argument is the replacement text; in this case it is "" (an empty string), so every instance of the pattern in $str is removed. To get familiar with this code, try replacing "" with "test" to see where substrings are removed. We use eregi_replace() because it matches case-insensitively, and HTML tag names are case-insensitive. The case-sensitive equivalent of eregi_replace() is ereg_replace() (note the missing 'i' after 'ereg').
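The ereg family of functions was deprecated in PHP 5.3 and removed in PHP 7.0, so the listing above will not run on a current interpreter. The following is a minimal sketch of the same substitution using preg_replace() instead, assuming the PCRE extension is available. The pattern is the one from line 02 wrapped in ~ delimiters, with the i modifier standing in for the case-insensitivity of eregi_replace(). The variable $stripped is introduced here for illustration only.

```php
<?php
// Assumption: preg_replace() is swapped in for the article's eregi_replace().
// Same pattern as line 02; ~ is the delimiter, i makes it case-insensitive.
$pat = '~<[^b>]?[^r>]*>~i';
$str = '<p>a <b>test</b><br>html string</p>';

// Replace every match of the pattern with an empty string.
$stripped = preg_replace($pat, '', $str);
echo $stripped, "\n"; // a test<br>html string

// PHP's built-in strip_tags() does the same job for this input;
// its second argument lists the tags to keep.
echo strip_tags($str, '<br>'), "\n"; // a test<br>html string
```

For real-world HTML, strip_tags() is usually the safer choice, since the pattern skips every tag that contains an 'r'.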

Understanding back references

A back reference is the text that has been matched by a sub-pattern in the pattern string. Back references allow the user to refer to parts of a matched string. The following example illustrates how they are used:

01 <?php
02 $str = "01/03/2004";
03 $pat = "([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})";
04 $regs = array();
05 if(ereg($pat,$str,$regs)) {
06 echo "match: day: {$regs[1]}, month: {$regs[2]}, year: {$regs[3]}\n";
07 }
08 ?>

On line 02, we define a string that is a human-readable form of the date 1 March 2004, written as dd/mm/yyyy. On line 03, we define a pattern to match the structure of this date. The first atom, '([0-9]{1,2})', is designed to match either one or two numerical characters. The second atom is identical. The third matches exactly four numeric characters. The atoms are separated by literal slashes.

The use of parentheses not only allows us to form sub-patterns in our pattern string, but also to 'save' the text matched by each sub-pattern and recall it later. On line 05, we call ereg() and tell it to store the matched text in $regs: $regs[0] holds the entire match, and $regs[1] through $regs[3] hold the text matched by each sub-pattern. On line 06, we output the date broken down into its parts. (Readers interested in extending their RE skills should attempt to modify the pattern on line 03 to validate dates; currently, $pat would match a string such as 99/99/2004.)
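One possible approach to the validation exercise is sketched below. Rather than tightening the pattern itself, it keeps the pattern from line 03 (anchored with ^ and $) and hands the captured numbers to PHP's checkdate() for the range check. It assumes preg_match() in place of ereg(), which is no longer available from PHP 7.0 onward. The function name is_valid_european_date() and the variable $m are hypothetical names chosen for this example.

```php
<?php
// Assumption: preg_match() stands in for ereg(); $m plays the role of $regs.
// The pattern is line 03's, anchored so the whole string must be a date.
$pat = '~^([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})$~';

// Hypothetical helper: true only for a real dd/mm/yyyy calendar date.
function is_valid_european_date(string $str, string $pat): bool
{
    if (!preg_match($pat, $str, $m)) {
        return false; // wrong shape altogether
    }
    // $m[1] is the day, $m[2] the month, $m[3] the year.
    // checkdate() expects month, day, year, in that order.
    return checkdate((int) $m[2], (int) $m[1], (int) $m[3]);
}

var_dump(is_valid_european_date('01/03/2004', $pat)); // bool(true)
var_dump(is_valid_european_date('99/99/2004', $pat)); // bool(false)
var_dump(is_valid_european_date('29/02/2003', $pat)); // bool(false), 2003 is not a leap year
```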

Parentheses nested in another set of parentheses can also be back referenced. For example, the pattern "(a (string))" allows two back references. Reference one, which would be stored in $regs[1] if called in conjunction with ereg(), would be 'a string'. Back reference two would be 'string'.
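The numbering of nested sub-patterns can be checked with a short script. This sketch assumes preg_match() as a stand-in for ereg(); the test string 'this is a string' is made up for the example. Sub-patterns are numbered by the position of their opening parenthesis, so the outer pair comes first.

```php
<?php
// Assumption: preg_match() replaces ereg(); $regs is filled the same way.
$regs = [];
if (preg_match('~(a (string))~', 'this is a string', $regs)) {
    echo $regs[0], "\n"; // a string  (the entire match)
    echo $regs[1], "\n"; // a string  (outer parentheses)
    echo $regs[2], "\n"; // string    (inner parentheses)
}
```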

Combining back references and substitutions

By using back references in conjunction with substitutions we can design small scripts to perform very complex tasks. Consider the following problem: replace European dates of the form 'dd/mm/yyyy' or 'dd-mm-yyyy' with the ISO 8601 format of 'yyyy-mm-dd' in a text file.

01 <?php
02 $pat = "([0-9]{1,2})[/-]([0-9]{1,2})[/-]([0-9]{4})";
03 $repl = "\\3-\\2-\\1";
04 $str = join('',file("test.txt"));
05 echo ereg_replace($pat,$repl,$str);
06 ?>

Place the following string in a file called test.txt: "this is a date 1/2/2004 and this is another 20-3-2003". Running the script will convert this text to: "this is a date 2004-2-1 and this is another 2003-3-20".

On line 02, we match a European date of the form defined above and isolate three different atoms. For the purpose of back referencing, the sub-patterns are numbered 1 through 3 from left to right. On line 03, we rearrange the date by referring to the three sub-patterns in reverse order: \\3 is the year, \\2 the month and \\1 the day.
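Since ereg_replace() is not available from PHP 7.0 onward, here is a rough equivalent using preg_replace(), where back references in the replacement string are written $1, $2 and $3. The sketch also shows a variant with preg_replace_callback() that zero-pads the day and month, because strict ISO 8601 dates use two digits for each. As assumptions for illustration, the input is held in a string rather than read from test.txt, and the names $plain and $padded are invented for this example.

```php
<?php
// Assumption: preg_* functions replace the article's ereg_replace().
// Same pattern as line 02, wrapped in ~ delimiters.
$pat = '~([0-9]{1,2})[/-]([0-9]{1,2})[/-]([0-9]{4})~';
$str = 'this is a date 1/2/2004 and this is another 20-3-2003';

// Direct counterpart of line 03: year, month, day via back references.
$plain = preg_replace($pat, '$3-$2-$1', $str);
echo $plain, "\n";
// this is a date 2004-2-1 and this is another 2003-3-20

// Variant: a callback receives the same sub-pattern matches in $m
// and can zero-pad them before they are written back.
$padded = preg_replace_callback($pat, function (array $m): string {
    return sprintf('%04d-%02d-%02d', (int) $m[3], (int) $m[2], (int) $m[1]);
}, $str);
echo $padded, "\n";
// this is a date 2004-02-01 and this is another 2003-03-20
```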

Gavin Sherry

PC World