|What will we cover?|
Handling text is one of the most common things that programmers do. As a result there are lots of specific tools in most programming languages to make this easier. In this section we will look at some of these and how we might use them in performing typical programming tasks.
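Some of the most common tasks that we can do when working with text are splitting lines of text into words, searching for strings of text, replacing strings of text, and changing case.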
Python takes a slightly ambiguous approach to processing text as of version 2.3. This is because in early versions of Python all string manipulation was done via a module full of functions and useful constants. In Python version 2.0 string methods were introduced which duplicated the functions in the module, but the constants were still there. This position has remained through to version 2.3, but work is underway to remove the need for the old string module completely. In this topic we will only look at the new object oriented approach to string manipulation. If you do want to try out the module, feel free to read the Python module documentation.
The first task we consider is how to split a string into its constituent parts. This is often necessary when processing files since we tend to read a file line by line, but the data may well be contained within segments of the line. An example of this is our Address Book example, where we might want to access the individual fields of the entries rather than just print the whole entry.
The Python method we use for this is called split() and it is used like this:
>>> aString = "Here is a (short) String"
>>> print aString.split()
['Here', 'is', 'a', '(short)', 'String']
Notice we get a list back containing the words within aString with all the spaces removed. The default separator for ''.split() is whitespace (i.e. tabs, newlines and spaces). Let's try using it again but with an opening parenthesis as the separator:
>>> print aString.split('(')
['Here is a ', 'short) String']
Notice the difference? There are only two elements in the list this time and the opening parenthesis has been removed from the front of 'short)'. That's an important point to note about ''.split(), that it removes the separator characters. Usually that's what we want, but just occasionally we'll wish it hadn't!
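To tie this back to the Address Book idea, here is a minimal sketch of splitting one entry into its fields. The record layout (name, address and phone number separated by commas) and the variable names are assumptions made for illustration, not the format used elsewhere in the tutorial. The strip() method, which trims whitespace from both ends of a string, is used to tidy each field. print is written with parentheses around a single argument so that the sketch runs under both older and current versions of Python.

```python
# Hypothetical address book entry: fields separated by commas
entry = "Fred, 9 Some St, 555-1234"

# split on the comma; the separator itself is discarded
fields = entry.split(',')

# strip() removes the space left at the start of each field
clean = []
for field in fields:
    clean.append(field.strip())

name = clean[0]
phone = clean[2]
print(name)
print(phone)
```

Because the separator is thrown away, only the field contents remain in the list, which is usually what we want when processing records like this.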
There is also a ''.join() method which can take a list (or indeed any other kind of sequence) of strings and join them together. One confusing feature of ''.join() is that it uses the string on which we call the method as the joining characters. You'll see what I mean from this example:
>>> lst = ['here','is','a','list','of','words']
>>> print '-+-'.join(lst)
here-+-is-+-a-+-list-+-of-+-words
>>> print ' '.join(lst)
here is a list of words
It sort of makes sense when you think about it, but it does look weird when you first see it.
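One place where split and join work well together is tidying up irregular spacing. The following is only a sketch with made-up sample text: it splits on the default whitespace and then joins the words back together with a single space. print is called with parentheses so that the code also runs under current Python.

```python
messy = "too    many   spaces\tand a tab"

words = messy.split()       # default split discards every run of whitespace
tidy = ' '.join(words)      # rebuild with exactly one space between words
print(tidy)

# the same list joined with a comma and a space instead
listed = ', '.join(words)
print(listed)
```

Note that the joining string appears only between the elements, never before the first or after the last.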
Let's revisit that word counting program I mentioned in the functions topic. Recall the Pseudo Code looked like:
def numwords(aString):
    list = split(aString) # list with each element a word
    return len(list) # return number of elements in list

for line in file:
    total = total + numwords(line) # accumulate totals for each line
print "File had %d words" % total
Now that we know how to get the lines from the file, let's consider the body of the numwords() function. First we want to create a list of words in a line. That's nothing more than applying the default ''.split() method. Referring to the Python documentation we find that the builtin function len() returns the number of elements in a list, which in our case should be the number of words in the string, which is exactly what we want.
So the final code looks like:
def numwords(aString):
    lst = aString.split() # split() is a method of the string object aString
    return len(lst) # return number of elements in the list

inp = file("menu.txt","r")
total = 0 # initialize to zero; also creates variable
for line in inp:
    total = total + numwords(line) # accumulate totals for each line
print "File had %d words" % total
inp.close()
That's not quite right of course because it counts things like an ampersand character as a word (although maybe you think it should...). Also, it can only be used on a single file (menu.txt). But it's not too hard to convert it to read the filename from the command line (argv) or via raw_input() as we saw in the Talking to the user section. I leave that as an exercise for the reader.
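One possible way to stop a lone ampersand being counted is sketched below. It is an assumption about what "a word" should mean, not the only reasonable answer: a token counts as a word only if it contains at least one letter or digit, tested with the string method isalnum(). The helper name is_word and the sample lines, which stand in for the contents of a file, are invented for the example.

```python
def is_word(token):
    # treat a token as a word only if it contains a letter or digit
    for ch in token:
        if ch.isalnum():
            return True
    return False

def numwords(aString):
    count = 0
    for token in aString.split():
        if is_word(token):
            count = count + 1
    return count

# sample lines standing in for the lines read from a file
lines = ["Fish & chips\n", "Tea - 2 sugars\n", "\n"]
total = 0
for line in lines:
    total = total + numwords(line)   # accumulate totals for each line
print("File had %d words" % total)
```

With this rule the "&" and "-" tokens are skipped while "2" still counts, so you may want to adjust the test to suit your own definition of a word.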
The next common operation we will look at is searching for a sub-string within a longer string. This is again supported by a Python string method, this time called ''.find(). Its basic use is quite simple: you provide a search string and, if Python finds it within the main string, it returns the index of the first character of the substring. If it doesn't find it, it returns -1:
>>> aString = "here is a long string with a substring inside it"
>>> print aString.find('long')
10
>>> print aString.find('oxen')
-1
>>> print aString.find('string')
15
The first two examples are straightforward: the first returns the index of the start of 'long' and the second returns -1 because 'oxen' does not occur inside aString. The third example throws up an interesting point, namely that find only locates the first occurrence of the search string. So what do we do if the search string occurs more than once in the original string?
One option is to use the index of the first occurrence to chop the original string into two pieces and search again. We keep doing this until we get a -1 result. Like this:
aString = "Bow wow says the dog, how many ow's are in this string?"
temp = aString[:] # use slice to make a copy
count = 0
index = temp.find('ow')
while index != -1:
    count += 1
    temp = temp[index + 1:] # use slicing
    index = temp.find('ow')
print "We found %d occurrences of 'ow' in %s" % (count, aString)
Here we just counted occurrences, but we could just as well have collected the index results into a list for later processing.
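As a sketch of that idea, the chopping loop can be wrapped in a function that records where each match starts. The function name find_all and the offset variable are invented for this example. Because each search runs on a shortened copy, we have to add back the number of characters already chopped off to get an index into the original string.

```python
def find_all(aString, sub):
    # return a list of every index at which sub starts in aString
    found = []
    offset = 0                  # characters already chopped off the front
    temp = aString
    index = temp.find(sub)
    while index != -1:
        found.append(offset + index)   # convert to an index in the original
        offset = offset + index + 1
        temp = temp[index + 1:]        # chop and search again
        index = temp.find(sub)
    return found

aString = "Bow wow says the dog, how many ow's are in this string?"
positions = find_all(aString, 'ow')
print(positions)
```

The length of the returned list gives the same count as before, and each entry can be used to slice the original string.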
The find() method can speed this process up a little by using one of its extra optional parameters, namely a start location within the original string:
aString = "Bow wow says the dog, how many o's are in this string?"
count = 0
index = aString.find('ow') # use default start
while index != -1:
    count += 1
    start = index + 1
    index = aString.find('ow', start) # set new start
print "We found %d occurrences of 'ow' in %s" % (count, aString)
This solution removes the need to create a new string each time, which can be a slow process if the string is long. Also, if we know that the substring will definitely only be within the first so many characters (or we aren't interested in later occurrences) we can specify both a start and stop value, like this:
# limit search to the first 20 chars
aString = "Bow wow says the dog, how many ow's are in this string?"
print aString.find('ow',0,20)
To complete our discussion of searching there are a couple of nice extra methods that Python provides to cater for common search situations, namely ''.startswith() and ''.endswith(). From the names alone you probably can guess what these do. They return True or False depending on whether the original string starts with or ends with the given search string, like this:
>>> print "Python rocks!".startswith("Perl")
False
>>> print "Python rocks!".startswith('Python')
True
>>> print "Python rocks!".endswith('sucks!')
False
>>> print "Python rocks!".endswith('cks!')
True
Notice the boolean result. After all, you already know where to look if the answer is True! Also notice that the search string doesn't need to be a complete word, a substring is fine. You can also provide a start and stop position within the string, just like ''.find() to effectively test for a string at any given location within a string. This is not a feature that is used much in practice.
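Here is a brief sketch of both uses. The first test is the sort of thing these methods are most often used for, checking a file name's ending; the second uses the optional start position to test for a string at a known location. The file name and variable names are made up for the example.

```python
filename = "menu.txt"
if filename.endswith(".txt"):
    print("text file")

# test for 'rocks' at a known position using the optional start argument
message = "Python rocks!"
at_seven = message.startswith("rocks", 7)   # look from index 7 onwards
print(at_seven)
at_start = message.startswith("rocks")      # look at the very beginning
print(at_start)
```

The first test prints True because 'rocks' begins at index 7, and the second prints False because the string as a whole starts with 'Python'.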
And finally for a simple test of whether a substring exists anywhere within another string you can use the Python in operator, like this:
>>> if 'foo' in 'foobar': print 'True'
True
>>> if 'baz' in 'foobar': print 'True'
>>> if 'bar' in 'foobar': print 'True'
True
That's all I'll say about searching for now. Let's look at how to replace text next.
Having found our text we often want to change it to something else. Again the Python string methods provide a solution with the ''.replace() method. It takes two arguments: a search string and a replacement string. The return value is the new string as a result of the replacement.
>>> aString = "Mary had a little lamb, its fleece was dirty!"
>>> print aString.replace('dirty','white')
Mary had a little lamb, its fleece was white!
One interesting difference between ''.find() and ''.replace() is that replace, by default, replaces all occurrences of the search string, not just the first. An optional count argument can limit the number of replacements:
>>> aString = "Bow wow wow said the little dog"
>>> print aString.replace('ow','ark')
Bark wark wark said the little dog
>>> print aString.replace('ow','ark',1) # only one
Bark wow wow said the little dog
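Since replace() returns a new string rather than changing the one it is called on, the original is still available afterwards, and the result can be fed straight into another replace() call. The short sketch below illustrates both points. The variable names are invented for the example, and print is called with parentheses so that it also runs under current Python.

```python
aString = "Bow wow wow said the little dog"

first_two = aString.replace('ow', 'ark', 2)   # replace at most two
print(first_two)
print(aString)          # the original string is unchanged

# replace() returns a string, so calls can be chained
chained = aString.replace('ow', 'ark').replace('little', 'big')
print(chained)
```

If you want to keep the changed text you must assign the result to a variable, as done here.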
It is possible to do much more sophisticated search and replace operations using something called a regular expression, but they are much more complex and get a whole topic to themselves in the "Advanced" section of the tutorial.
One final thing to consider is converting case from lower to upper and vice-versa. This isn't such a common operation but Python does provide some helper methods to do it for us:
>>> print "MIXed Case".lower()
mixed case
>>> print "MIXed Case".upper()
MIXED CASE
>>> print "MIXed Case".swapcase()
mixED cASE
>>> print "MIXed Case".capitalize()
Mixed case
>>> print "TEST".isupper()
True
>>> print "TEST".islower()
False
Note that ''.capitalize() treats the string as a whole: it capitalizes only the first letter and makes the rest lower case, rather than capitalizing each word within it. Also note the two test functions (or predicates) ''.isupper() and ''.islower(). Python provides a whole bunch of these predicate functions for testing strings. Other useful tests include ''.isdigit(), ''.isalpha() and ''.isspace(). The last checks for all whitespace, not just literal space characters!
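The predicates are handy for deciding what kind of text we have been given. The function below is a small sketch. The name describe and the sample strings are invented for the example; it simply tries each predicate in turn.

```python
def describe(text):
    # classify a string using the predicate methods
    if text.isdigit():
        return "digits"
    elif text.isalpha():
        return "letters"
    elif text.isspace():
        return "whitespace"
    else:
        return "mixed"

for sample in ["2023", "Python", " \t\n", "Python3", ""]:
    print("%r is %s" % (sample, describe(sample)))
```

Each predicate is True only if every character in the string passes the test, so "Python3" is neither all digits nor all letters. They also all return False for an empty string, which is why the last sample falls through to "mixed".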
We will be using many of these string methods as we progress through the tutorial, and in particular the Grammar Counter case study uses several of them.
Because VBScript descends from BASIC it has a wealth of builtin string handling functions. In fact in the reference documentation I counted at least 20 functions or methods, not counting those that are simply there to handle Unicode characters.
What this means is that we can pretty much do all the things we did in Python using VBScript too. I'll quickly run through the options below:
We start with the Split function:
<script type="text/vbscript">
Dim s
Dim lst
s = "Here is a string of words"
lst = Split(s) ' returns an array
MsgBox lst(1)
</script>
As with Python you can add a separator value if the default whitespace separation isn't what you need.
Also as with Python there is a Join function for reversing the process.
Searching is done with InStr, short for "In String", obviously.
<script type="text/vbscript">
Dim s,n
s = "Here is a long string of text"
n = InStr(s, "long")
MsgBox "long is found at position: " & CStr(n)
</script>
The return value is normally the position within the original string at which the substring starts. If the substring is not found then zero is returned (this isn't a problem because VBScript starts its indices at 1, so zero is not a valid index). If either string is a Null, a Null is returned, which makes testing error conditions a bit more tricky.
As with Python we can specify a sub range of the original string to search, using a start value, like this:
<script type="text/vbscript">
Dim s,n
s = "Here is a long string of text"
n = InStr(6, s, "long") ' start at position 6
MsgBox "long is found at position: " & CStr(n)
</script>
Unlike Python, we can also specify whether the search should be case-sensitive or not. The default is case-sensitive.
Replacing text is done with the Replace function. Like this:
<script type="text/vbscript">
Dim s
s = "The quick yellow fox jumped over the log"
MsgBox Replace(s, "yellow", "brown")
</script>
We can provide an optional count argument specifying how many occurrences of the search string should be replaced; the default is all of them. We can also specify a start position, as for InStr above.
Changing case in VBScript is done with UCase and LCase. There is no equivalent of Python's capitalize method.
<script type="text/vbscript">
Dim s
s = "MIXed Case"
MsgBox LCase(s)
MsgBox UCase(s)
</script>
And that's all I'm going to cover in this tutorial. If you want to find out more, check the VBScript help file for the list of functions.
In JavaScript, splitting text is done using the split method:
Once again the search string argument is actually a regular expression so the searches can be very sophisticated indeed. Notice, however, that there is no way to restrict the range of the original string that is searched by passing a start position (although this can also be simulated using regular expression tricks).
To do a replace operation we use the replace() method.
And once again the search string can be a regular expression; you can begin to see the pattern, I suspect! Be aware that when the search argument is a plain string, replace() changes only the first occurrence. To replace every occurrence, use a regular expression with the global (g) flag; leaving the flag off restricts the replacement to the first match.
Changing case is performed by two functions: toLowerCase() and toUpperCase().
That concludes our look at text handling. Hopefully it has given you the tools you need to process any text you encounter in your own projects. One final word of advice: always check the documentation for your language when processing text, since there are often powerful tools included for this most fundamental of programming tasks.
|Things to remember|