Pattern matching with PHP: Part III

Building on the previous page on pattern matching with regular expressions (REs), this time we will investigate substitutions and back references.

Substitutions and REs

PHP provides functions for matching strings against REs and for replacing the substrings those REs match. Suppose your application needs to strip all HTML tags from some text, except line breaks (<br>); you could use REs as follows:

01 <?php
02 $pat = "(<[^b>]?[^r>]*>)";
03 $str = "<p>a <b>test</b><br>html string</p>";
04 echo eregi_replace($pat,"",$str);
05 ?>

On line 02, we define a pattern that begins with a less-than sign (<), which marks the start of an HTML tag. The first atom, '[^b>]?', optionally matches one character other than 'b' or '>'. The second atom, '[^r>]*', matches any run of characters other than 'r' or '>'. Both classes exclude '>' so that the match cannot run past the closing '>' of the tag. Together they stop '<br>' from matching, although as a side effect the pattern also leaves alone any tag with an 'r' after its first character (such as <strong>).

On line 04, we call eregi_replace() to replace every instance of the pattern. The second argument is our replacement text. In this case, it is "" (an empty string), so every instance of the pattern in $str is removed. To get familiar with this code, try replacing "" with "test" to see where substrings are removed. We use eregi_replace() because it matches case-insensitively and HTML tag names are case-insensitive. The case-sensitive equivalent to eregi_replace() is ereg_replace() (note the missing 'i' after 'ereg').
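
The ereg family of functions was deprecated in PHP 5.3 and removed in PHP 7.0, so on a current interpreter the same substitution has to go through the PCRE functions instead. The following is a minimal sketch of one way to do that with preg_replace(). The negative lookahead, the '#' delimiters and the variable names $stripped and $stripped2 are choices made for this illustration, not part of the original listing.

```php
<?php
// Sketch: strip every HTML tag except <br>, using PCRE instead of ereg.
// (?!br\s*/?>) is a negative lookahead: the match is refused when the tag
// is <br>, <br/> or <br />. Everything else up to the next '>' is consumed.
// The trailing "i" modifier makes the match case-insensitive, which is the
// PCRE counterpart of choosing eregi_replace() over ereg_replace().
$pat = '#<(?!br\s*/?>)[^>]*>#i';

$str = "<p>a <b>test</b><br>html string</p>";
$stripped = preg_replace($pat, "", $str);
echo $stripped, "\n"; // a test<br>html string

// Tag names that contain an 'r' are stripped as well; <BR> is kept.
$stripped2 = preg_replace($pat, "", "<strong>bold</strong><BR>next");
echo $stripped2, "\n"; // bold<BR>next
```

Because the lookahead names the one tag to keep instead of excluding individual letters, other tags can be preserved by extending the alternation inside it.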

Understanding back references

A back reference is the text that has been matched by a sub-pattern in the pattern string. Back references allow the user to refer to parts of a matched string. The following example illustrates how they are used:

01 <?php
02 $str = "01/03/2004";
03 $pat = "([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})";
04 $regs = array();
05 if(ereg($pat,$str,$regs)) {
06       echo "match: day: {$regs[1]}, month: {$regs[2]}, year: {$regs[3]}\n";
07 }
08 ?>

On line 02, we define a string that is a human-readable form of the date 1 March 2004. On line 03, we define a pattern to match the structure of this date. The first atom, '([0-9]{1,2})', matches either one or two digits. The second atom is identical. The third matches exactly four digits.

The use of parentheses not only allows us to form sub-patterns in our pattern string, but also to 'save' the text matched by each sub-pattern and recall it later. On line 05, we call ereg() and tell it to store the matched sub-patterns in $regs. On line 06, we output the date broken down into its parts. (Readers interested in extending their RE skills should attempt to modify the pattern on line 03 to validate dates; currently, $pat would match a string such as 99/99/2004.)
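
As a sketch of how the capture-and-validate idea might look on current PHP versions, where ereg() is no longer available, the example below uses preg_match() from the PCRE extension and hands the captured numbers to the built-in checkdate(). The helper name parse_european_date() is hypothetical and exists only for this illustration. Checking the calendar in code rather than in the pattern is one possible answer to the exercise, not the only one.

```php
<?php
// Sketch: capture the parts of a dd/mm/yyyy date, then validate them.
// ^ and $ anchor the pattern so the whole string must be a date.
$pat = '#^([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})$#';

// Hypothetical helper: returns [day, month, year] as integers,
// or null when the string is not a valid date.
function parse_european_date(string $str, string $pat): ?array
{
    if (!preg_match($pat, $str, $regs)) {
        return null; // the string does not have the dd/mm/yyyy shape
    }
    // $regs[1..3] hold the text matched by the three parenthesised groups.
    [$day, $month, $year] = [(int) $regs[1], (int) $regs[2], (int) $regs[3]];
    // checkdate() takes month, day, year and rejects impossible dates.
    return checkdate($month, $day, $year) ? [$day, $month, $year] : null;
}

var_dump(parse_european_date("01/03/2004", $pat)); // day 1, month 3, year 2004
var_dump(parse_european_date("99/99/2004", $pat)); // NULL
```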

Parentheses nested in another set of parentheses can also be back referenced. Sub-patterns are numbered by the position of their opening parenthesis, counting from the left. For example, the pattern "(a (string))" allows two back references. Back reference one, which would be stored in $regs[1] if called in conjunction with ereg(), would be 'a string'. Back reference two would be 'string'.
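
The numbering of nested groups can be checked with a few lines of code. This sketch uses preg_match() rather than ereg(), since the latter is absent from PHP 7 and later; the input string and the variable name $matches are made up for the illustration.

```php
<?php
// Sketch: groups are numbered by the position of their opening parenthesis.
$matches = [];
preg_match('/(a (string))/', "this is a string", $matches);

// $matches[0] is the whole match, [1] the outer group, [2] the inner group.
echo $matches[1], "\n"; // a string
echo $matches[2], "\n"; // string
```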

Combining back references and substitutions

By using back references in conjunction with substitutions we can design small scripts to perform very complex tasks. Consider the following problem: replace European dates of the form 'dd/mm/yyyy' or 'dd-mm-yyyy' with the ISO 8601 format of 'yyyy-mm-dd' in a text file.

01 <?php
02 $pat = "([0-9]{1,2})[/-]([0-9]{1,2})[/-]([0-9]{4})";
03 $repl = "\\3-\\2-\\1";
04 $str = join('',file("test.txt"));
05 echo ereg_replace($pat,$repl,$str);
06 ?>

Place the following string in a file called test.txt: "this is a date 1/2/2004 and this is another 20-3-2003". Running the script will convert this text to: "this is a date 2004-2-1 and this is another 2003-3-20". Because the back references are copied verbatim, single-digit days and months are not zero-padded, so the output is not yet strict ISO 8601, which would require 2004-02-01.

On line 02, we match a European date of the form defined above and isolate three different atoms. For the purpose of back referencing, the sub-patterns are numbered 1 through 3 from left to right. On line 03, we rearrange the date by listing the back references in reverse order.
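
For readers on PHP 7 or later, where ereg_replace() no longer exists, the same rearrangement can be sketched with preg_replace(), in which '$1' to '$3' play the role of '\\1' to '\\3'. A second variant uses preg_replace_callback() with sprintf() to zero-pad the day and month. To keep the example self-contained, the text is held in a string rather than read from test.txt, and the variable names $plain and $padded are assumptions made for this illustration.

```php
<?php
// Same pattern as the ereg version, wrapped in '#' delimiters for PCRE.
$pat = '#([0-9]{1,2})[/-]([0-9]{1,2})[/-]([0-9]{4})#';
$str = "this is a date 1/2/2004 and this is another 20-3-2003";

// Direct translation: reorder the three captured groups.
$plain = preg_replace($pat, '$3-$2-$1', $str);
echo $plain, "\n";
// this is a date 2004-2-1 and this is another 2003-3-20

// Variant: a callback receives the groups in $m and can reformat them,
// here zero-padding day and month to two digits.
$padded = preg_replace_callback(
    $pat,
    function (array $m): string {
        return sprintf('%04d-%02d-%02d', $m[3], $m[2], $m[1]);
    },
    $str
);
echo $padded, "\n";
// this is a date 2004-02-01 and this is another 2003-03-20
```

A plain replacement string can only copy what was captured; the callback form is the usual route when the captured text has to be transformed on the way out.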

Gavin Sherry

PC World
