What will we cover? |
---|
|
Regular expressions are groups of characters that describe a larger group of characters. They describe a pattern of characters for which we can search in a body of text. They are very similar to the concept of wildcards used in file naming on most operating systems, whereby an asterisk (*) can be used to represent any sequence of characters in a file name. So *.py means any file ending in .py. In fact filename wildcards are a very small subset of regular expressions.
Regular expressions are extremely powerful tools and most modern programming languages either have built in support for using regular expressions or have libraries or modules available that you can use to search for and replace text based on regular expressions. A full description of them is outside the scope of this tutor. Indeed there is at least one whole book dedicated to regular expressions, and if your interest is roused I recommend that you investigate the O'Reilly book.
One interesting feature of regular expressions is that their structure resembles that of programs. Regular expressions are patterns constructed from smaller units. These units are: characters, wildcard characters, ranges or sets, and groups (which are surrounded by parentheses).
Note that because groups are a unit, you can have groups of groups and so on to an arbitrary level of complexity. We can combine these units in ways reminiscent of a programming language using sequences, repetitions or conditional operators. We’ll look at each of these in turn. To try out the examples you will need to import the re module and use its functions. For convenience I will assume you have already imported re in most of the examples shown.
As ever, the simplest construct is a sequence and the simplest regular expression is just a sequence of characters:
red
This will match, or find, any occurrence of the three letters ‘r’, ‘e’ and ‘d’ in order, in a string. Thus the words red, lettered and credible would all be found because they contain ‘red’ within them. To provide greater control over the outcome of matches we can supply some special characters (known as metacharacters) to limit the scope of the search:
Metacharacters used in sequences | ||
---|---|---|
Expression | Meaning | Example |
^red | only at the start of a line | red ribbons are good |
red$ | only at the end of a line | I love red |
\Wred | only at the start of a word | it’s redirected by post |
red\W | only at the end of a word | you covered it already |
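The metacharacters above are known as anchors because they fix the position of the regular expression within a sentence or word. (Strictly speaking, \W is not a pure anchor: it matches one non-alphanumeric character, such as a space, next to the word, so that character must actually be present and becomes part of the match.) There are several other anchors defined in the re module documentation, such as \b for a word boundary, which we don’t cover in this topic.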
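Here is a small sketch that tries the table entries out. It uses re.search(), which is described later in this topic, so that the pattern may be found anywhere in the string. The helper name found is my own invention for this illustration, and the sample strings are the ones from the table.

```python
import re

def found(pattern, text):
    """Return True if pattern occurs anywhere in text."""
    return re.search(pattern, text) is not None

# ^ ties the pattern to the start of the string
start_ok = found(r'^red', 'red ribbons are good')
start_bad = found(r'^red', 'I love red')

# $ ties the pattern to the end of the string
end_ok = found(r'red$', 'I love red')
end_bad = found(r'red$', 'red ribbons are good')

# \W needs a non-word character (here a space) beside 'red'
word_start = found(r'\Wred', "it's redirected by post")
word_end = found(r'red\W', 'you covered it already')

print(start_ok, start_bad, end_ok, end_bad, word_start, word_end)
```

The raw string prefix (r'...') simply stops Python from treating the backslash as an escape character before the re module sees it.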
Sequences can also contain wildcard characters that can substitute for any character. The wildcard character is a period. Try this:
>>> import re
>>> re.match('be.t', 'best')
<_sre.SRE_Match object at 0x01365AA0>
>>> re.match('be.t', 'bess')
The message in angle brackets tells us that the regular expression ‘be.t’, passed as the first argument, matches the string ‘best’, passed as the second argument. ‘be.t’ will also match ‘beat’, ‘bent’, ‘belt’, etc. The second example did not match because 'bess' didn’t end in t, so no MatchObject was created. Try out a few more matches to see how this works. (Note that match() only matches at the front of a string, not in the middle. We can use search() for that, as we shall see later!)
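To make the difference between match() and search() concrete, here is a minimal sketch. The sample strings are made up for the illustration, and group() is a method of the match object that returns the text that was actually matched.

```python
import re

# match() only succeeds if the pattern fits at the very start of the string
at_start = re.match('be.t', 'best wishes')
in_middle = re.match('be.t', 'the best')     # no match, so this is None

# search() scans along the string for the first place the pattern fits
anywhere = re.search('be.t', 'the best')

print(at_start.group())
print(in_middle)
print(anywhere.group())
```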
The next unit is a range or set. This consists of a collection of letters enclosed in square brackets and the regular expression will search for any one of the enclosed letters.
>>> re.match('s[pwl]am', 'spam')
<_sre.SRE_Match object at 0x01365AD8>
This would also match 'swam' or 'slam' but not 'sham' since 'h' is not included in the regular expression set.
By putting a ^ sign as the first element of the set we can say that it should look for any character except those listed, thus in this example:
>>> re.match('[^f]ool', 'cool')
<_sre.SRE_Match object at 0x01365AA0>
>>> re.match('[^f]ool','fool')
we can match ‘cool’ and ‘pool’ but we will not match ‘fool’ since we are looking for any character except 'f' at the beginning of the pattern.
Finally we can group sequences of characters, or other units, together by enclosing them in parentheses, which is not particularly useful in isolation but is useful when combined with the repetition and conditional features we look at next.
We can also create regular expressions which match repeated sequences of characters by using some more special characters. We can look for a repetition of a single character or group of characters using the following metacharacters:
Metacharacters used in repetition | ||
---|---|---|
Expression | Meaning | Example |
‘?’ | zero or one of the preceding character. Note the zero part there since that can trip you up if you aren’t careful. | pythonl?y matches: pythony pythonly |
‘*’ | looks for zero or more of the preceding character. | pythonl*y matches both of the above, plus: pythonlly pythonllly etc. |
‘+’ | looks for one or more of the preceding character. | pythonl+y matches: pythonly pythonlly pythonllly etc. |
{n,m} | looks for n to m repetitions of the preceding character. | fo{1,2} matches: fo or foo |
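The table entries can be checked with a short sketch like the one below. The helper name matching is hypothetical, and I have added a '$' to each pattern so that a match has to run right to the end of the word.

```python
import re

words = ['pythony', 'pythonly', 'pythonlly', 'pythonllly']

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full.

    The '$' is appended so the match must reach the end of the word.
    """
    return [w for w in candidates if re.match(pattern + '$', w)]

zero_or_one = matching('pythonl?y', words)
zero_or_more = matching('pythonl*y', words)
one_or_more = matching('pythonl+y', words)
one_to_two = matching('pythonl{1,2}y', words)

for name, result in [('?', zero_or_one), ('*', zero_or_more),
                     ('+', one_or_more), ('{1,2}', one_to_two)]:
    print(name, result)
```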
All of these repetition characters can be applied to groups of characters too. Thus:
>>> re.match('(.an){1,2}s', 'cans')
<_sre.SRE_Match object at 0x013667E0>
The same pattern will also match: ‘cancans’ or ‘pans’ or ‘canpans’ but not ‘bananas’ since there is no character before the second 'an' group. (How could we modify the search to work with bananas as well? Hint: Look at the other repeat specifiers - and don't forget the extra 'a' at the end of bananas)
There is one caveat with the {n,m} form of repetition, which is that it does not limit the match to only m units. Thus the example in the table above, fo{1,2}, will successfully match fooo because it matches the foo at the beginning of fooo. So if you want to limit how many characters are matched you need to follow the multiplying expression with an anchor or a negated range. In our case fo{1,2}[^o] would prevent fooo from matching since it says match 1 or 2 ‘o’s followed by anything other than an ‘o’ - but it must be followed by something, so now 'foo' doesn't match! This illustrates the fickle nature of regular expressions. They can be very difficult to get just right and you need to be very careful to test them thoroughly! The actual pattern needed to allow 'foo' and 'foobar' but not 'fooo' is: 'fo{1,2}[^o]*$'. That is, 'fo' or 'foo' followed by zero or more non-'o' characters and then the end of the line. (In fact even this is not completely foolproof, but we need to cover a few more elements before we can really nail it!)
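The three patterns discussed above can be compared side by side with a sketch along these lines. The helper name accepted and the list of candidate strings are my own choices for the illustration.

```python
import re

candidates = ['fo', 'foo', 'fooo', 'foobar']

def accepted(pattern):
    """List the candidate strings for which re.match() finds a match."""
    return [s for s in candidates if re.match(pattern, s)]

loose = accepted('fo{1,2}')            # matches the front of every candidate
strict = accepted('fo{1,2}[^o]')       # insists on a following non-'o' character
anchored = accepted('fo{1,2}[^o]*$')   # zero or more non-'o's, then end of line

print(loose)
print(strict)
print(anchored)
```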
Regular expressions are said to be greedy. What that means is that the matching and searching functions will match as much as possible of the string rather than stopping at the first complete match. Normally this doesn’t matter too much but when you combine wildcards with repetition operators you can wind up grabbing more than you expect.
Consider the following example. If we have a regular expression like a.*b that says we want to find an a followed by any number of characters up to a b then the match function will search from the first a to the last b. That is to say that if the searched string includes more than one 'b' all but the last one will be included in the .* part of the expression. Thus in this example:
re.match('a.*b','abracadabra')
The MatchObject has matched all of abracadab, not just the first ab. Overlooking this greedy matching behaviour is one of the most common errors made by new users of regular expressions.
To prevent this ‘greedy’ behaviour simply add a ‘?’ after the repetition character, like so:
re.match('a.*?b','abracadabra')
which will now only match ‘ab’.
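You can see exactly what each version grabbed by asking the match object for its text. This is a minimal sketch using the match object's group() method, which returns the matched portion of the string.

```python
import re

text = 'abracadabra'

greedy = re.match('a.*b', text)    # .* takes as much as it can
lazy = re.match('a.*?b', text)     # .*? takes as little as it can

print(greedy.group())
print(lazy.group())
```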
The final piece in the jigsaw is to make the regular expression search for optional elements or to select one of several patterns. We’ll look at each of these options separately:
You can specify that a character is optional using the zero or one repetition metacharacter (?):
>>> re.match('computer?d?', 'computer')
<re.MatchObject instance at 864890>
will match compute, computer or computed. However, it will also match computerd, which we don’t want.
By using a range within the expression we can be more specific. Thus:
>>> re.match('compute[rd]$','computer')
<re.MatchObject instance at 874390>
will select only computer and computed but reject the unwanted computerd.
And if we add a ? after the range we can also allow compute to be selected but still avoid computerd.
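As a quick check of that last claim, here is a sketch that tries the range followed by ? against all four words. The variable names are just for this illustration.

```python
import re

pattern = 'compute[rd]?$'
words = ['compute', 'computer', 'computed', 'computerd']

# True where the pattern matches, False where it does not
results = {w: re.match(pattern, w) is not None for w in words}

for word in words:
    print(word, results[word])
```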
In addition to matching options from a list of characters we can also match based on a choice of sub-expressions. We mentioned earlier that we could group sequences of characters in parentheses, but in fact we can group any arbitrary regular expression in parentheses and treat it as a unit. In describing the syntax I will use the notation (RE) to indicate any such regular expression grouping.
The situation we want to examine here is the case whereby we want to match a regular expression containing (RE)xxxx or (RE)yyyy where xxxx and yyyy are different patterns. Thus, for example we want to match both premature and preventative. We can do this by using a selection metacharacter (|):
>>> regexp = 'pre(mature|ventative)'
>>> re.match(regexp,'premature')
<re.MatchObject instance at 864890>
>>> re.match(regexp,'preventative')
<re.MatchObject instance at 864890>
>>> re.match(regexp,'prelude')
Notice that when defining the regular expression we had to include the full text of both options inside the parentheses, rather than just (e|v) as in prematur(e|v)entative, otherwise the option would have been restricted to prematureentative or prematurventative. In other words only the letters e and v would have formed the options, not the full length groups.
Now, using this technique we can come back to the example above where we want to capture 'fo' or 'foo', plus whatever comes after, but not 'fooo'. We left it with a regular expression consisting of: fo{1,2}[^o]*$. The problem with this one is that if the string following the 'fo' or 'foo' contains an 'o' anywhere, the match fails. By using a choice of expressions we can get round that. We want the match to work where our pattern is followed either by the end of the line or by any non-'o' character. That looks like: fo{1,2}($|[^o]). And that finally gives us what we wanted. Remember, when using regular expressions, always test thoroughly to ensure you are not catching more than you want, and are catching all that you want.
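In the spirit of testing thoroughly, here is a small sketch that runs the final pattern against a handful of strings. The sample strings are my own, chosen to include one with an 'o' later in the text.

```python
import re

pattern = 'fo{1,2}($|[^o])'
samples = ['fo', 'foo', 'fooo', 'foobar', 'foo boo']

# keep only the samples that the pattern matches from the start
matched = [s for s in samples if re.match(pattern, s)]
print(matched)
```

Notice that 'foo boo' is now accepted even though it contains further 'o's, which the earlier fo{1,2}[^o]*$ pattern would have rejected.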
The re module has many features which we don't discuss here so it is worth studying the module documentation. One area I'd like to draw to your attention is the set of flags that you can use when compiling expressions with the re.compile() function. These flags control things like whether the pattern matches across lines, or ignores case etc.
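As a taste of what the flags do, here is a minimal sketch using two of them, re.IGNORECASE and re.MULTILINE. The patterns and sample strings are invented for the illustration.

```python
import re

# re.IGNORECASE lets the letters in the pattern match either case
img_tag = re.compile('< *img ', re.IGNORECASE)

upper = img_tag.search('<IMG SRC="logo.gif">')
mixed = img_tag.search('<p>text</p> < Img src="x.png">')
absent = img_tag.search('<p>no picture here</p>')

# re.MULTILINE makes ^ and $ work at the start and end of every line
line_starts = re.compile('^red', re.MULTILINE)
count = len(line_starts.findall('red sky\nblue sea\nred rose'))

print(upper.group(), mixed.group(), absent, count)
```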
Another feature that you can find in the standard Python distribution is a regular expression testing tool. It allows you to type in an expression then try different values against it to see if they match. You can find this in the Tools/Scripts folder and the file is redemo.py. Unfortunately there is a small bug in the version that ships with Python v3.1. The import statement at the top needs to be changed from
from TKinter import *
to
from tkinter import *
If you make that small change it should work fine, and by the time you read this it should have been fixed in the distribution too. Have fun!
We’ve seen a little of what regular expressions look like but what can we do with them? And how do we do it in Python? To take the first point first, we can use them as very powerful search tools in text. We can look for lots of different variations of text strings in a single operation, and we can even search for non printable characters such as blank lines using some of the metacharacters available. We can also replace these patterns using the methods and functions of the re module. We’ve already seen the match() function at work. There are several other functions, some of which are described below:
re Module functions and methods | |
---|---|
Function/Method | Effect |
match(RE,string) | if RE matches the start of the string it returns a match object |
search(RE,string) | if RE is found anywhere within the string a match object is returned |
split(RE, string) | like string.split() but uses the RE as a separator |
sub(RE, replace, string) | returns a string produced by substituting replace for every non-overlapping occurrence of RE in string; an optional count argument limits the number of substitutions. Note this function has several additional features, see the documentation for details. |
findall(RE, string) | finds all non-overlapping occurrences of RE in string, returning a list of the matching strings (or of the group contents if RE contains groups). Use finditer() if you want match objects instead. |
compile(RE) | produces a regular expression object which can be reused for multiple operations with the same RE. The object has all of the above functions as methods, but with the RE argument implied, and can be more efficient than the function versions when the same RE is used many times. |
Note that this is not a full list of re’s methods and functions and that those listed have some optional parameters that can extend their use. The listed functions are the most commonly used operations and are sufficient for most needs.
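Here is a short sketch exercising the functions in the table that we have not yet used. The sample strings and variable names are made up for the illustration. Note that sub() replaces every match unless you give it a count, and that findall() hands back the matching strings themselves.

```python
import re

text = 'one, two;three  four'

# split() using any run of commas, semicolons or spaces as the separator
parts = re.split('[,; ]+', text)

# sub() replaces every match unless a count is supplied
all_replaced = re.sub('o', '0', 'foo boo')
first_only = re.sub('o', '0', 'foo boo', count=1)

# findall() returns a list of the strings that matched
words_with_o = re.findall('[a-z]*o[a-z]*', text)

# compile() builds a reusable object with the same operations as methods
vowel = re.compile('[aeiou]')
vowel_free = vowel.sub('', 'regular')

print(parts, all_replaced, first_only, words_with_o, vowel_free)
```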
As an example of how we might use regular expressions in Python let’s create a program that will search an HTML file for an IMG tag that has no ALT section. If we find one we will add a message to the owner to create more user friendly HTML in future!
import re

# detect 'IMG' in upper/lower case allowing for
# zero or more spaces between the < and the 'I'
img = '< *[iI][mM][gG] '

# allow any character up to the 'ALT' or 'alt' before >
alt = img + '.*[aA][lL][tT].*>'

# open file and read it into list
filename = input('Enter a filename to search ')
inf = open(filename,'r')
lines = inf.readlines()

# if the line has an IMG tag and no ALT inside
# add our message as an HTML comment
for i in range(len(lines)):
    if ( re.search(img,lines[i]) and not
         re.search(alt,lines[i]) ):
        lines[i] += '<!-- PROVIDE ALT TAGS ON IMAGES! -->\n'

# Now write the altered file and tidy up.
inf.close()
outf = open(filename,'w')
outf.writelines(lines)
outf.close()
Notice two points about the above code. First we use re.search instead of re.match because search finds the patterns anywhere in the string whereas match only looks at the start of the string. Secondly we put an outer pair of parentheses around the two tests. These are not strictly necessary but they allow us to break the test into two lines which are easier to read, especially if there are many expressions to be combined.
This code is far from perfect because it doesn’t consider the case where the IMG tag may be split over several lines, but it illustrates the technique well enough for our purposes. Of course such wanton vandalism of HTML files shouldn’t really be encouraged, but then again anyone who doesn’t provide ALT tags probably deserves all they get!
Finally, regular expressions have limitations, and for formally defined data structures like HTML there are often other tools, known as parsers, that are more effective, more reliable and easier to use correctly than regular expressions. But for complex searches in free text regular expressions can solve a lot of problems. Just be sure to test thoroughly.
We’ll see regular expressions at work again in the Grammar Counter case study. In the meantime, experiment with them and check out the other methods in the re module. We really have just scratched the surface of what’s possible using these powerful text processing tools.
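To give a flavour of the parser approach, here is a rough sketch using the HTMLParser class from Python's standard html.parser module to find IMG tags with no ALT. The class name ImgAltChecker and the sample page are my own inventions, and this is only one possible way of doing it, not a finished tool.

```python
from html.parser import HTMLParser

class ImgAltChecker(HTMLParser):
    """Record the position of every IMG tag that has no ALT attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []      # (line, column) of each offending tag

    def handle_starttag(self, tag, attrs):
        # the parser passes tag and attribute names in lower case
        if tag == 'img' and 'alt' not in dict(attrs):
            self.missing.append(self.getpos())

# a made-up page: the first IMG tag is split over two lines
page = '''<p>Hello</p>
<IMG SRC="a.gif"
     WIDTH="10">
<img src="b.gif" alt="A picture">
'''

checker = ImgAltChecker()
checker.feed(page)
checker.close()
print(checker.missing)
```

Because the parser works on tags rather than lines, it copes with an IMG tag that is split over several lines, which is exactly where the line by line regular expression version falls down.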
JavaScript has good support for regular expressions built into the language. In fact the string search operations we used earlier are actually regular expression searches; we simply used the most basic form, a simple sequence of characters. All of the rules we discussed for Python apply equally to JavaScript except that regular expressions are surrounded by slashes (/) instead of quotes. Here are some examples to illustrate their use:
<Script type="text/javascript">
var str = "A lovely bunch of bananas";
document.write(str + "<BR>");
if (str.match(/^A/)) {
   document.write("Found string beginning with A<BR>");
}
if (str.match(/b[au]/)) {
   document.write("Found substring with either ba or bu<BR>");
}
if (!str.match(/zzz/)) {
   document.write("Didn't find substring zzz!<BR>");
}
</Script>
The first two succeed; the third doesn't, hence the negative test.
VBScript does not have built in regular expressions like JavaScript but it does have a Regular Expression object that can be instantiated and used for searches, replacement etc. It can also be controlled to ignore case and to search for all instances or just one. It is used like this:
<Script type="text/vbscript">
Dim regex, matches
Set regex = New RegExp

regex.Global = True
regex.Pattern = "b[au]"

Set matches = regex.Execute("A lovely bunch of bananas")
If matches.Count > 0 Then
   MsgBox "Found " & matches.Count & " substrings"
End If
</Script>
That's all I'll cover here, but there is a wealth of subtle sophistication in regular expressions and we have only just touched on their power in this short topic. Fortunately there is also a wealth of online information about their use, plus the excellent O'Reilly book mentioned at the start. My advice is to take it slowly and get accustomed to their vagaries as well as their virtues.
Points to remember |
---|
|
If you have any questions or feedback on this page
send me mail at:
alan.gauld@yahoo.co.uk