Handling Files and Text

What will we cover?
  • How to open a file
  • How to read and write to an open file
  • How to close a file
  • Building a word counter

    Handling files often poses problems for beginners, although the reason for this puzzles me. Files in a programming sense are no different from files that you use in a word processor or other application: you open them, do some work and then close them again.

    The biggest difference is that in a program you normally access the file sequentially, that is, you read one line at a time starting at the beginning. In practice the word processor often does the same; it just holds the entire file in memory while you work on it and then writes it all back out when you close it. The other difference is that you normally open the file as either read-only or write-only. You can write by creating a new file from scratch (or overwriting an existing one) or by appending to an existing one.

    One other thing you can do while processing a file is go back to the beginning and start reading again.

    Files - Input and Output

    Let's see that in practice. We will assume that a file exists called menu.txt and that it holds a list of meals:

    spam & eggs
    spam & chips
    spam & spam
    

    Now we will write a program to read the file and display the output - like the 'cat' command in Unix or the 'type' command in DOS.

    # First open the file to read(r)
    inp = open("menu.txt","r")
    # read the file into a list then print
    # each item
    for line in inp.readlines():
        print line
    # Now close it again
    inp.close()
    

    Note 1: open() takes two arguments. The first is the filename (which may be passed as a variable or a literal string, as we did here). The second is the mode. The mode determines whether we are opening the file for reading(r) or writing(w), and also whether it's for ASCII text or binary usage - by adding a 'b' to the 'r' or 'w', as in: open(fn,"rb")
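    The modes described in Note 1 can be tried out with a small experiment. This is only a sketch: the function name write_then_read is made up for the illustration, and so is any file name you pass to it.

```python
# A small experiment with the mode argument of open().
# write_then_read is a made-up name used only for this illustration.

def write_then_read(filename):
    # "w" creates the file, or empties it if it already exists
    outp = open(filename, "w")
    outp.write("spam & eggs\n")
    outp.close()

    # "r" opens the same file for reading as text
    inp = open(filename, "r")
    text = inp.read()
    inp.close()

    # "rb" opens it for reading in binary mode
    inp = open(filename, "rb")
    raw = inp.read()
    inp.close()

    return text, raw
```

    Calling write_then_read("modes_demo.txt") returns the same data twice, once read in text mode and once in binary mode. On some systems the two differ slightly in how the end of a line is represented, which is one reason the 'b' option exists.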

    Note 2: We read and close the file using functions preceded by the file variable. This notation is known as method invocation and is our first glimpse of Object Orientation. Don't worry about it for now, except to realize that it's related in some ways to modules. You can think of a file variable as being a reference to a module containing functions that operate on files and which we automatically import every time we create a file type variable.

    Consider how you could cope with long files. First of all you would need to read the file one line at a time (in Python by using readline() instead of readlines()). You might then use a line_count variable which is incremented for each line and then tested to see whether it is equal to 25 (for a 25 line screen). If so, you ask the user to press a key (Enter, say) before resetting line_count to zero and continuing. You might like to try that as an exercise...
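    If you want something to compare your own attempt with, here is one possible sketch of that idea. The names show_paged, wait_for_enter, page_size and pause are invented for this example; only line_count comes from the description above. It uses sys.stdout.write() rather than print so that it should behave the same on older and newer versions of Python.

```python
import sys

def wait_for_enter():
    # Ask the user to press Enter, then read (and discard) what they typed
    sys.stdout.write("-- press Enter to continue --")
    sys.stdout.flush()
    sys.stdin.readline()

def show_paged(filename, page_size=25, pause=wait_for_enter):
    inp = open(filename, "r")
    line_count = 0
    line = inp.readline()          # readline() returns "" at end of file
    while line != "":
        sys.stdout.write(line)     # the line already ends with a newline
        line_count = line_count + 1
        if line_count == page_size:
            pause()                # screen is full: wait for the user
            line_count = 0         # then start counting the next screenful
        line = inp.readline()
    inp.close()
```

    Calling show_paged("menu.txt") displays the file and stops after every 25 lines. Passing the pause function in as a parameter is just a convenience: it lets you try the loop out without having to sit and press Enter.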

    Really that's all there is to it. You open the file, read it in and manipulate it any way you want to. When you're finished you close the file. To create a 'copy' command in Python, we simply open a new file in write mode and write the lines to that file instead of printing them. Like this:

    # Create the equivalent of: COPY MENU.TXT MENU.BAK
    
    # First open the files to read(r) and write(w)
    inp = open("menu.txt","r")
    outp = open("menu.bak","w")
    
    # read the file into a list then copy to
    # new file
    for line in inp.readlines():
        outp.write(line)
    
    print "1 file copied..."
    
    # Now close the files
    inp.close()
    outp.close()
    

    Did you notice that I added a print statement just to reassure the user that something actually happened? This kind of user feedback is usually a good idea.
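    The feedback can be made a little more informative by counting the lines as they are copied. The sketch below is a variation on the copy program rather than a replacement for it: the function name copy_file is made up, and it assumes a Python version in which a file can be used directly in a for loop, so that only one line is held in memory at a time.

```python
def copy_file(source, target):
    # Copy source to target line by line and return the number of lines
    inp = open(source, "r")
    outp = open(target, "w")
    count = 0
    for line in inp:          # the file hands over one line per loop
        outp.write(line)
        count = count + 1
    inp.close()
    outp.close()
    return count
```

    A call such as copy_file("menu.txt", "menu.bak") returns the line count, which can then be shown to the user in place of the fixed "1 file copied..." message.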

    One final twist is that you might want to append data to the end of an existing file. One way to do that would be to open the file for input, read the data into a list, append the new data to the list and then write the whole list out to a new version of the old file. If the file is short that's not a problem, but if the file is very large, maybe over 100MB, then you may simply run out of memory to hold the list. Fortunately there's another mode, "a", that we can pass to open() which allows us to append directly to an existing file just by writing. Even better, if the file doesn't exist it will open a new file just as if you'd specified "w".

    As an example, let's assume we have a log file that we use for capturing error messages. We don't want to delete the existing messages so we choose to append the error, like this:

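    def logError(msg):
       # "a" appends, so messages already in the log are kept
       err = open("Errors.log","a")
       # msg should end with "\n" so each error gets its own line
       err.write(msg)
       err.close()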
    

    In the real world we would probably want to limit the size of the file in some way. A common technique is to create a filename based on the date. When the date changes we automatically create a new file, and it is easy for the maintainers of the system to find the errors for a particular day and to archive old error files that are no longer needed.
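    A minimal sketch of the date-based approach might look like this. The names dated_log_name and log_error, the "Errors-YYYY-MM-DD.log" naming pattern and the directory parameter are all assumptions made for the example; the only library calls used are time.localtime(), time.strftime() and os.path.join().

```python
import os
import time

def dated_log_name(when=None):
    # when is a time tuple such as time.localtime() returns;
    # None means "use today's date"
    if when is None:
        when = time.localtime()
    return "Errors-" + time.strftime("%Y-%m-%d", when) + ".log"

def log_error(msg, directory=".", when=None):
    path = os.path.join(directory, dated_log_name(when))
    err = open(path, "a")      # "a" creates the file for the day's first error
    err.write(msg + "\n")      # one message per line
    err.close()
    return path
```

    Because the file name is worked out afresh on every call, the first error logged after midnight simply lands in a new file, with no extra code needed to switch over.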

    Counting Words

    Now let's revisit that word counting program I mentioned in the previous section. Recall that the pseudo-code looked like:

    def numwords(s):
        list = split(s) # list with each element a word
        return len(list) # return number of elements in list
    
    for line in file:
        total = total + numwords(line) # accumulate totals for each line
    print "File had %d words" % total
    

    Now that we know how to get the lines from the file, let's consider the body of the numwords() function. First we want to create a list of the words in a line. By looking at the Python reference documentation for the string module we see there is a function called split which splits a string into a list of fields separated by whitespace (or by any other separator we specify). Finally, by again referring to the documentation we see that the built-in function len() returns the number of elements in a list, which in our case should be the number of words in the string - exactly what we want.

    So the final code looks like:

    import string
    def numwords(s):
        # 'words' rather than 'list', which would hide Python's built-in list
        words = string.split(s) # need to qualify split() with string module
        return len(words) # return number of elements in the list
    
    inp = open("menu.txt","r")
    total = 0  # initialise to zero; also creates variable
    
    for line in inp.readlines():
        total = total + numwords(line) # accumulate totals for each line
    print "File had %d words" % total
    
    inp.close()
    

    That's not quite right of course, because it counts the '&' character as a word (although maybe you think it should...). Also, it can only be used on a single file (menu.txt). But it's not too hard to convert it to read the filename from the command line (sys.argv[1]) or via raw_input() as we saw in the 'Talking to the user' section. I leave that as an exercise for the reader.
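    If you would like something to check your own version against, the following is one possible sketch that deals with both points. It makes some assumptions of its own: a field only counts as a word if it contains at least one letter or digit, the helper names is_word and count_words are invented here, and it uses the split() method of the string itself rather than the string module so that it should also run on newer versions of Python.

```python
def is_word(field):
    # Treat a field as a word only if it contains a letter or digit,
    # so a lone "&" is not counted
    for ch in field:
        if ch.isalnum():
            return True
    return False

def numwords(s):
    count = 0
    for field in s.split():    # split the line on whitespace
        if is_word(field):
            count = count + 1
    return count

def count_words(filename):
    inp = open(filename, "r")
    total = 0
    for line in inp.readlines():
        total = total + numwords(line)
    inp.close()
    return total
```

    To take the filename from the command line, import sys at the top of the script and call count_words(sys.argv[1]), then print the result using the same "File had %d words" message as before.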

    BASIC and Tcl

    BASIC and Tcl provide their own file handling mechanisms. They aren't too different to Python so I'll simply show you the 'cat' program in both and leave it at that.

    BASIC Version

    BASIC uses a concept called streams to identify files. These streams are numbered, which can make BASIC file handling tedious. This can be avoided by using a handy function called FREEFILE which returns the next free stream number. If you store this in a variable you never need to get confused about which stream/file has which number.

    INFILE = FREEFILE
    OPEN "TEST.DAT" FOR INPUT AS INFILE
    REM Check for EndOfFile(EOF) then 
    REM read line from input and print it
    WHILE NOT EOF(INFILE)
        LINE INPUT #INFILE, theLine
        PRINT theLine
    WEND
    CLOSE #INFILE
    

    Tcl Version

    By now the pattern should be clear. Here is the Tcl version:

    set infile [open "Test.dat" r]
    while { [gets $infile line] >= 0} {
         puts $line
         }
    close $infile
    
    Things to remember
  • Open files before using them
  • Files can usually only be read or written but not both at the same time
  • Python's readlines() method reads all the lines in a file into a list, while readline() reads only one line at a time, which may help save memory.
  • Close files after use.


    If you have any questions or feedback on this page send me mail at: alan.gauld@btinternet.com