File Handling

Files in a programming sense are really not very different from files that you use in a word processor or other application: you open them, do some work and then close them again. The biggest difference is that in a program you access the file sequentially, that is, you read one line at a time starting at the beginning. (In practice the word processor often does the same; it just holds the entire file in memory while you work on it and then writes it all back out when you close it.) The other difference is that, when programming, you normally open the file as read only or write only. You can write by creating a new file from scratch (or overwriting an existing one) or by appending to an existing one.

One other thing you can do while processing a file is go back to the beginning.

Files - Input and Output

Let's see that in practice. We will assume that a file exists called menu.txt and that it holds a list of meals:

Now we will write a program to read the file and display the output - like the 'cat' command in Unix or the 'type' command in Windows CMD shells.
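
A minimal sketch of such a program follows. The three meals written to menu.txt are placeholder contents invented for the example, and the chdir() into a scratch folder is there only so that the sketch does not touch your own files; neither is part of the technique itself.

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())   # work in a scratch folder (assumption for the demo)

# Create a sample menu.txt so there is something to read (placeholder contents).
outp = open("menu.txt", "w")
outp.write("spam & eggs\nspam & chips\nspam & spam\n")
outp.close()

# Open the file for reading, display each line in turn, then close it.
inp = open("menu.txt", "r")
lines_shown = 0
for line in inp:
    print(line)
    lines_shown += 1
inp.close()
```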

Note 1: open() takes two arguments. The first is the filename (which may be passed as a variable or a literal string, as we did here). The second is the mode. The mode determines whether we are opening the file for reading (r) or writing (w), and also whether it's for text or binary usage - by adding a 'b' to the 'r' or 'w', as in:
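
As a small sketch of the difference, the code below reads the same file once in text mode and once in binary mode. The name data.txt and its contents are made up for the example, and the scratch folder is only there to keep the demo away from real files.

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())   # scratch folder (assumption for the demo)

outp = open("data.txt", "w")   # "w": text mode, writing
outp.write("spam\n")
outp.close()

text_file = open("data.txt", "r")     # "r": text mode, read() gives a str
text_data = text_file.read()
text_file.close()

binary_file = open("data.txt", "rb")  # "rb": binary mode, read() gives bytes
binary_data = binary_file.read()
binary_file.close()

print(type(text_data), type(binary_data))
```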

Note 2: We read the file in a for loop. Recall that Python's for loop acts as a foreach loop on a collection. It returns each element of the collection. A text file can be considered a collection of lines and so the loop reads each line in turn.

Note 3: We close the file using a function preceded by the file variable. This notation is known as method invocation as described in the Modules and Functions topic. You can, if it helps, think of a file variable as being a reference to a module containing functions that operate on files and which we automatically import every time we create a file type variable.

In Python, files are automatically closed at the end of the program but it is good practice to get into the habit of closing your files explicitly. Why? Well, the operating system may not write the data out to the file until it is closed (this can boost performance). What this means is that if the program exits unexpectedly there is a danger that your precious data may not have been written to the file! So the moral is: once you finish writing to a file, close it.

Note 4: We have not specified the full path to the file in the code above so the file will be treated as being in the current folder. However, we can pass a full path name to open() instead of just the file name. There is a wrinkle when using Windows however, because the \ character used to separate folders in a Windows path has a special meaning inside a Python string. So, when specifying paths in Python it is best to always use the / character instead and that will work on any operating system including Windows. (There is more information in a box below.)

Now, consider how you could cope with long files. You couldn't display all of the file on a single screen so we need to pause after each screenful of text. You might use a line_count variable which is incremented for each line and then tested to see whether it is equal to 25 (for a 25 line screen). If so, you request the user to press a key (enter, say) before resetting line_count to zero and continuing. You might like to try that as an exercise...

Another way of reading a file is to use a while loop and a method of the file object called readline(). The advantage of this is that we can stop processing the file as soon as we find the data we need, which can greatly speed things up when processing long files. However, it is a little bit more complex, so let's look at the previous example again using a while loop:
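
One way the while loop version might look is sketched here. As before, the menu contents are placeholders and the scratch folder is an assumption made only so that the example runs on its own.

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())   # scratch folder (assumption for the demo)

outp = open("menu.txt", "w")
outp.write("spam & eggs\nspam & chips\nspam & spam\n")
outp.close()

inp = open("menu.txt", "r")
lines_shown = 0
while True:
    line = inp.readline()
    if not line:       # readline() returns an empty string at end of file
        break
    print(line)
    lines_shown += 1
inp.close()
```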

Note: we use the break technique mentioned in the branching topic to exit the loop if the line is empty. (Recall that an empty string counts as false in Boolean terms. readline() only returns an empty string at the end of the file, because even a blank line in the file still contains its newline character.) Otherwise we print the line and go around the loop again. Finally, after exiting the while loop, we close the file. If we wanted to stop at a certain point in the file we would introduce a branch condition inside the while loop and, when it detected the stop condition, simply call break there too so that the loop terminates.

Really that's all there is to it. You open the file, read it in and manipulate it any way you want to. When you're finished you close the file. However there is one little niggle you may have noticed in the previous example: the lines read from the file have a newline character at the end, so you wind up with blank lines using print() (which adds its own newline). To avoid that Python provides a string method called strip() which will remove whitespace, or non-printable characters, from both ends of a string. (It also has cousins, called rstrip and lstrip, which strip from one end only.) If we replace the print(line) call above with print(line.rstrip()), the extra blank lines disappear.
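
The effect is easy to see on a single string. The line below is a stand-in for one read from the file, so no file is needed for this sketch.

```python
line = "spam & eggs\n"       # a line as it would arrive from the file

print(line)                  # prints the text followed by a blank line
print(line.rstrip())         # trailing newline removed, no blank line

stripped = line.rstrip()
both_ends = "  spam  \n".strip()   # strip() works on both ends
```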

To create a 'copy' command in Python, we simply open a new file in write mode and write the lines to that file instead of printing them. Like this:
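
A sketch of the copy program, under the same assumptions as the earlier examples (placeholder menu contents, a scratch folder so nothing real is overwritten). The output name menu.bak is also just a name chosen for illustration.

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())   # scratch folder (assumption for the demo)

outp = open("menu.txt", "w")
outp.write("spam & eggs\nspam & chips\nspam & spam\n")
outp.close()

# Open the original for reading and the copy for writing.
inp = open("menu.txt", "r")
outp = open("menu.bak", "w")

# Write each line we read straight to the new file.
for line in inp:
    outp.write(line)

inp.close()
outp.close()
print("1 file copied...")
```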

Did you notice that I added a print() statement at the end, just to reassure the user that something actually happened? This kind of user feedback is usually a good idea.

Because we wrote out the same line that we read in there were no problems with newline characters here. But if we had been writing out strings which we created, or which we had stripped earlier, we would have needed to add a newline to the end of the output string, like this:
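
For example, writing a list of strings that have no newlines of their own might look like the sketch below. The list of meals is invented for the example.

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())   # scratch folder (assumption for the demo)

meals = ["spam & eggs", "spam & chips"]   # no newline on the end of these

outp = open("menu.txt", "w")
for meal in meals:
    outp.write(meal + "\n")    # add the newline ourselves
outp.close()
```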

Let's look at how we might incorporate that into our copy program. Instead of simply copying the menu we will add today's date to the top. That way we can easily generate a daily menu from the easily modified text file of meals. All we need to do is write out a couple of lines at the top of the new file before copying the menu.txt file, like this:
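
A sketch of the daily menu program. The output name menu.prn, the title wording and the date format "%A %B %d" are choices made for this illustration, and the menu contents are placeholders again.

```python
import os
import tempfile
import time

os.chdir(tempfile.mkdtemp())   # scratch folder (assumption for the demo)

outp = open("menu.txt", "w")
outp.write("spam & eggs\nspam & chips\nspam & spam\n")
outp.close()

inp = open("menu.txt", "r")
outp = open("menu.prn", "w")

# Build the date string: seconds -> time tuple -> formatted text.
today = time.localtime(time.time())
theDate = time.strftime("%A %B %d", today)

# Title line plus one blank line, then the menu itself.
outp.write("Menu for %s\n\n" % theDate)
for line in inp:
    outp.write(line)

inp.close()
outp.close()
print("Menu created for %s..." % theDate)
```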

Note that we use the time module to get today's date (time.time()) and convert it into a tuple of values (time.localtime()) which are then used by time.strftime() (check the documentation for time.strftime to see what else it can do) to produce a date string, which is then inserted into a title message using string formatting.

Although we added two '\n' characters at the end of the string there is only one blank line printed. That's because one of them is the newline at the end of the title itself. Managing the creation and removal of newline characters is one of the more irritating aspects of handling text files.

Some Operating Systems Gotchas

Operating systems handle files in different ways. This introduces some niggles into our programs if we want them to work on multiple operating systems. There are two niggles in particular which can catch people out and we'll look at them here:

Newlines

The whole subject of newlines and text files is a murky area of non standard implementation by different operating systems. These differences have their roots in the early days of data communications and the control of mechanical teleprinters. Basically there are 3 different ways to indicate a new line:

A Carriage Return (CR) character ('\r')
A Line Feed (LF) character ('\n')
A CR/LF pair ('\r\n').

All three techniques are used in different operating systems. MS DOS (and therefore Windows) uses method 3. Unix (including Linux) uses method 2. Apple in its original MacOS used method 1, but now uses method 2 since MacOS X is really a variant of Unix.

So how can the poor programmer cope with this multiplicity of line endings? In many languages she just has to do lots of tests and take different action per OS. In more modern languages, including Python, the language provides facilities for dealing with the mess for you. In Python the assistance comes from text mode itself: when you read a file opened in text mode, whichever of the three line endings is present is translated to a single '\n', and when you write, each '\n' is translated to the line ending of the current operating system. (That native line ending is available as the variable linesep in the os module should you ever need it, for example when writing in binary mode.) The rstrip() method removes both '\r' and '\n' along with any other trailing whitespace, so really the simple way to stay sane, so far as newlines are concerned, is: always use rstrip() to remove newlines from lines read from a file and always add '\n' to strings being written to a text file. Do not add os.linesep in text mode, because on Windows the '\n' inside it would be translated a second time and you would end up with '\r\r\n'.

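That still leaves the awkward situation where a file is created on one OS and then processed on another, incompatible, OS. For reading, Python's text mode already copes, because it recognises all three line endings regardless of the OS it is running on. If you need to know which ending a file actually uses, open it with newline='' (an empty string), which switches the translation off so that you can inspect the end of each line yourself.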
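
The following sketch shows how Python's text mode treats a file that mixes all three line endings. The file name mixed.txt and its contents are invented for the example. The file is written in binary mode so that the bytes land on disk exactly as given.

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())   # scratch folder (assumption for the demo)

# One line ending of each kind: CR/LF, CR alone, LF alone.
outp = open("mixed.txt", "wb")
outp.write(b"one\r\ntwo\rthree\n")
outp.close()

# Default text mode: every line ending arrives as '\n'.
inp = open("mixed.txt", "r")
translated = inp.readlines()
inp.close()

# newline='' turns translation off so the original endings are visible.
inp = open("mixed.txt", "r", newline="")
untouched = inp.readlines()
inp.close()

print(translated)
print(untouched)
print(repr(os.linesep))   # the native ending on this machine
```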

Specifying Paths

This is more of an issue for Windows users than others although MacOS 9 users (are there any left?) may bump into it occasionally too. As above, each OS specifies paths to files using different characters to separate the drives, folders and files. The generic solution for this is again to use the os module which provides the os.sep variable to define the current platform's path separator character. In practice you won't need to use this very often since the path will likely be different for every machine anyway! So instead you will just enter the full path directly in a string, possibly once for each OS you are running on. But there is one big gotcha hiding in wait for Windows users...

You saw in the previous section that Python treats the string '\n' as a newline character. That is, it takes two characters and treats them as one. In fact there is a whole range of these special sequences beginning with a backslash (\) including:

\n - A new line
\r - A carriage return
\t - A horizontal tab
\v - A vertical tab (sometimes means a new page)
\b - A backspace
\0nn - Any arbitrary octal character code. e.g. the code \033 is the escape character (ESC)

This means that if we have a data file called test.dat and want to open it in Python by specifying a full Windows path we might expect this to work:

>>> f = open('C:\test.dat')

But Python will see the \t pair as a tab character and complain it cannot find a file called:

C:<TAB>est.dat (where <TAB> stands for a single tab character)

So how do we get round this inconvenience? There are three solutions:

Put an 'r' in front of the string. This tells Python to treat backslashes as ordinary characters, making it a "raw" string.
Use forward slashes (/) instead of backslashes. Python and Windows will between them sort out the path for you. This has the added advantage of making your code portable to other operating systems too.
Use a double backslash (\\), since a double backslash is seen by Python as a single backslash character!

Thus any of the following will open our data file correctly:

>>> f = open(r'C:\test.dat')
>>> f = open('C:/test.dat')
>>> f = open('C:\\test.dat')

Note that this is an issue only for literal strings you type into your program code. If the path strings are read from a file or from a user, Python will not interpret the \ characters and you can use them as-is with no worries about separators.
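
You can see what Python makes of each literal without opening any file at all. This sketch just builds the strings and compares them.

```python
tab_path = 'C:\test.dat'       # \t becomes a single tab character
raw_path = r'C:\test.dat'      # raw string: the backslash is kept
slash_path = 'C:/test.dat'     # forward slash: nothing to escape
double_path = 'C:\\test.dat'   # \\ becomes one backslash

print(repr(tab_path))
print(len(tab_path), len(raw_path))   # the tab version is one character shorter
print(raw_path == double_path)
```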

Appending data

One final twist in file processing is that you might want to append data to the end of an existing file. One way to do that would be to open the file for input, read the data into a list, append the data to the list and then write the whole list out to a new version of the old file. If the file is short that's not a problem but if the file is very large, maybe over 100Mb, then you could run out of memory to hold the list, and it would take quite a long time. Fortunately there's another mode "a" that we can pass to open() which allows us to append directly to an existing file just by writing. Even better, if the file doesn't exist it will open a new file just as if you'd specified "w".

As an example, let's assume we have a log file that we use for capturing error messages. We don't want to delete the existing messages so we choose to append the error, like this:
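
A minimal sketch of such a logging function. The function name logError and the two messages are made up for the example. The file name Errors.log matches the one used later in this topic.

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())   # scratch folder (assumption for the demo)

def logError(msg):
    # "a" appends to the file, creating it first if it does not exist.
    err = open("Errors.log", "a")
    err.write(msg + "\n")
    err.close()

logError("Disk full")
logError("File not found")     # added after the first message, not over it
```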

In the real world we would probably want to limit the size of the file in some way. A common technique is to create a filename based on the date, thus when the date changes we automatically create a new file and it is easy for the maintainers of the system to find the errors for a particular day and to archive away old error files if they are not needed. (Remember, from the menu example above, that the time module can be used to find out the current date.)
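
One possible way to build such a filename is sketched below. The function name logFileName and the pattern Errors-YYYYMMDD.log are assumptions for illustration, not a standard.

```python
import time

def logFileName(when=None):
    """Return a name like Errors-20240131.log for the given time tuple."""
    if when is None:
        when = time.localtime()    # default to today's date
    return time.strftime("Errors-%Y%m%d.log", when)

print(logFileName())
```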

With a twist

Python provides another, more convenient, way of working with files, particularly when iterating over their contents. This uses a construct known as with. It looks like this:

with open('Errors.log',"r") as inp:
    for line in inp:
        print( line )

Notice that in this code we do not use close(). The with statement guarantees that the file is closed at the end of the with block, even if an error occurs inside it. This construct makes file handling a little bit more reliable and is the recommended way of opening files in Python v3. I have tended to use the older, more explicit, open/close because this is how most programming languages work, but if you are using Python exclusively then try using with.

The Address Book Revisited

You remember the address book program we introduced during the Raw Materials topic and then expanded in the Talking to the User topic? Let's start to make it really useful by saving it to a file and, of course, reading the file at start-up. We'll do this by writing some functions. So in this example we pull together several of the strands that we've covered in the last few topics.

The basic design will require a function to read the file at start-up and another to write the file at the end of the program. We will also create a function to present the user with a menu of options and a separate function for each menu selection. The menu will allow the user to:

Add an entry
Remove an entry
Find an entry
Quit the program

Loading the Address Book
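
A sketch of what the loading function might look like. The file name addbook.dat, the function name readBook and the sample entry are assumptions made for this example. The data file is taken to hold two lines per entry, the name followed by the address details.

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())   # scratch folder (assumption for the demo)

filename = "addbook.dat"       # module level, shared by loading and saving

def readBook(book):
    if os.path.exists(filename):
        store = open(filename, "r")
        for line in store:
            name = line.rstrip()
            entry = next(store).rstrip()   # the address is on the following line
            book[name] = entry
        store.close()

# Demo: create a one-entry file, then load it into a dictionary.
store = open(filename, "w")
store.write("Fred\n9 Some St, Anytown\n")
store.close()

theBook = {}
readBook(theBook)
print(theBook)
```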

Note 1: We import the os module which we use to check that the file path actually exists before opening the file.

Note 2: We defined the filename as a module level variable so we can use it both in loading and saving the data.

Note 3: We use rstrip() to remove the new-line character from the end of the line. Also notice the next() function to fetch the next line from the file within the loop. This effectively means we are reading two lines at a time as we progress through the loop.

The next function is actually part of a feature of Python called an iterator. I don't discuss iterators in this tutorial since they are quite Python specific. All Python collections as well as files and a few other things are considered iterators (or iterable types). You can read more about iterators in the Python documentation.

Saving the Address Book
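
A matching sketch for saving, under the same assumptions as the loading example (file name addbook.dat, two lines per entry, a made-up sample entry).

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())   # scratch folder (assumption for the demo)

filename = "addbook.dat"

def saveBook(book):
    store = open(filename, "w")
    for name, entry in book.items():
        store.write(name + "\n")     # first line: the name
        store.write(entry + "\n")    # second line: the address details
    store.close()

saveBook({"Fred": "9 Some St, Anytown"})
```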

Notice we need to add a newline character ('\n') when we write the data. Also note that we write two lines for each entry; this mirrors the fact that we processed two lines when reading the file.

Getting User Input
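
One way the menu function might be written is sketched here. The name getChoice and the prompt wording are assumptions. The optional ask parameter is an addition of this sketch, not part of the design described in the text: it defaults to input() but lets the demo supply scripted replies instead of waiting at the keyboard.

```python
def getChoice(menu, length, ask=input):
    print(menu)
    prompt = "Select a choice (1-%d): " % length
    choice = int(ask(prompt))        # note: int() fails on non-numeric input
    while not 1 <= choice <= length:
        print("Please choose a number between 1 and %d" % length)
        choice = int(ask(prompt))
    return choice

theMenu = """
1) Add Entry
2) Remove Entry
3) Find Entry
4) Quit and save
"""

# Scripted demo: the first reply is out of range, the second is accepted.
replies = iter(["9", "3"])
choice = getChoice(theMenu, 4, ask=lambda prompt: next(replies))
print("You chose", choice)
```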

Note: We receive a length parameter which tells us how many menu entries there are. This allows us to create a prompt that specifies the correct number range.

Adding an Entry
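
A sketch of an add function, assuming the book is a dictionary keyed by name. The prompts are invented, and the ask parameter is again this sketch's own device for scripting the replies.

```python
def addEntry(book, ask=input):
    name = ask("Enter a name: ")
    entry = ask("Enter street, town and phone number: ")
    book[name] = entry

# Scripted demo with made-up details.
theBook = {}
replies = iter(["Fred", "9 Some St, Anytown"])
addEntry(theBook, ask=lambda prompt: next(replies))
print(theBook)
```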

Removing an entry
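
Removing might be sketched like this, with the same assumptions (dictionary keyed by name, hypothetical ask parameter, invented entries). Checking that the name exists first avoids an error for unknown names.

```python
def removeEntry(book, ask=input):
    name = ask("Enter a name: ")
    if name in book:
        del book[name]
    else:
        print("Sorry, no entry for:", name)

# Scripted demo.
theBook = {"Fred": "9 Some St, Anytown", "Rose": "11 Nother St, SomePlace"}
removeEntry(theBook, ask=lambda prompt: "Fred")
print(theBook)
```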

Finding an entry
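
And a sketch of the find function under the same assumptions. Returning the entry as well as printing it is a choice made here so that the result can be checked.

```python
def findEntry(book, ask=input):
    name = ask("Enter a name: ")
    if name in book:
        print(name, book[name])
    else:
        print("Sorry, no entry for:", name)
    return book.get(name)     # None if the name is not in the book

# Scripted demo.
theBook = {"Fred": "9 Some St, Anytown"}
found = findEntry(theBook, ask=lambda prompt: "Fred")
missing = findEntry(theBook, ask=lambda prompt: "Nobody")
```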

Quitting the program

Actually I won't write a separate function for this, instead I'll make the quit option the test in my menu while loop. So the main program will look like this:
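
The sketch below shows how the main program might fit together. So that it runs on its own it repeats condensed stand-ins for the functions from the sections above. All the names, the file name addbook.dat and the ask parameter are assumptions of this sketch rather than a definitive version.

```python
import os
import tempfile

filename = "addbook.dat"

def readBook(book):
    if os.path.exists(filename):
        store = open(filename, "r")
        for line in store:
            book[line.rstrip()] = next(store).rstrip()
        store.close()

def saveBook(book):
    store = open(filename, "w")
    for name, entry in book.items():
        store.write(name + "\n" + entry + "\n")
    store.close()

def getChoice(menu, length, ask=input):
    print(menu)
    return int(ask("Select a choice (1-%d): " % length))

def addEntry(book, ask=input):
    name = ask("Enter a name: ")
    book[name] = ask("Enter street, town and phone number: ")

def removeEntry(book, ask=input):
    name = ask("Enter a name: ")
    if name in book:
        del book[name]

def findEntry(book, ask=input):
    name = ask("Enter a name: ")
    print(name, book.get(name, "not found"))

def main(ask=input):
    theMenu = """
1) Add Entry
2) Remove Entry
3) Find Entry
4) Quit and save
"""
    theBook = {}
    readBook(theBook)
    choice = getChoice(theMenu, 4, ask)
    while choice != 4:               # the quit option is the loop test
        if choice == 1:
            addEntry(theBook, ask)
        elif choice == 2:
            removeEntry(theBook, ask)
        elif choice == 3:
            findEntry(theBook, ask)
        else:
            print("Invalid choice, try again")
        choice = getChoice(theMenu, 4, ask)
    saveBook(theBook)

# Scripted run in a scratch folder: add Fred, look him up, then quit and save.
os.chdir(tempfile.mkdtemp())
answers = iter(["1", "Fred", "9 Some St, Anytown", "3", "Fred", "4"])
main(ask=lambda prompt: next(answers))
```

In normal use you would simply call main() and type the replies yourself.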

Now the only thing left to do is call the main() function when the program is run, and to do that we use a bit of Python magic like this:
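
The idiom itself is only two lines. In this sketch main() is a trivial stand-in so that the example runs by itself.

```python
def main():
    print("Running as a program")

# Only call main() when this file is run directly, not when it is imported.
if __name__ == "__main__":
    main()
```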

This mysterious bit of code allows us to use any python file as a module by importing it, or as a program by running it. The difference is that when the program is imported, Python sets the internal variable __name__ to the module name but when the file is run as a program, the value of __name__ is set to "__main__". This means the main() function only gets called if the file is run as a program, but not when the file is imported. Sneaky, eh?

Now if you type all that code into a new text file and save it as addressbook.py, you should be able to run it from an OS prompt by typing python addressbook.py.

Or just double click the file in Windows Explorer and it should start up in its own CMD window, and the window will close when you select the quit option.

This 60 odd line program is typical of the sort of thing you can start writing for yourself. There are a couple of things we can do to improve it which I'll cover in the next section, but even as it stands it's a reasonably useful little tool.

VBScript and JavaScript

Neither VBScript nor JavaScript has native file handling capabilities. This is a security feature to ensure nobody can read your files when you innocently load a web page, but it does restrict their general usefulness. However, as we saw with reusable modules there is a way to do it using Windows Script Host. WSH provides a FileSystem object which allows any WSH language to read files. We will look at a JavaScript example in detail then show similar code in VBScript for comparison, but as before the key elements will really be calls to the WScript objects.

Before we can look at the code in detail it's worth taking time to describe the FileSystem Object Model. An Object Model is a set of related objects which can be used by the programmer. The WSH FileSystem object model consists of the FSO object and a number of File objects, including the TextFile object which we will use. There are also some helper objects, the most notable of which, for our purposes, is the TextStream object. Basically we will create an instance of the FSO object, then use it to create our TextFile objects, and from these in turn create TextStream objects which we can read text from or write text to. The TextStream objects themselves are what we actually use to read from or write to the files.

Type the following code into a file called testFiles.js and run it using cscript as described in the earlier introduction to WSH.

Opening a file

To open a file in WSH we create an FSO object then create a TextFile object from that:

Reading and Writing a file

Closing files

And in VBScript

Or alternatively, put the bit between the script tags into a file called testFile.vbs and run that instead. The .wsf format allows you to mix JavaScript and VBScript code in the same file by simply using multiple script tags, should you want to...

Handling Non-Text Files

Handling text is one of the most common things that programmers do, but sometimes we need to process raw binary data too. This is very rarely done in VBScript or JavaScript so I will only be covering how Python does it.

Opening and Closing Binary Files

The key difference between text files and binary files is that a text file is composed of octets, or bytes, which represent characters, with the lines separated by newline sequences. A binary file contains arbitrary binary data, so none of those assumptions hold. When Python opens a file in text mode it decodes the bytes into characters and translates newline sequences as it reads. Applied to arbitrary binary data, that decoding may fail with an error or quietly alter the data. (On some older systems text mode also treated one particular byte value as an end of file marker, or eof, so reading could stop early.) The end result of this is that when we open a binary file in Python we must specify that it is being opened in binary mode. The way we do this in Python is to add a 'b' to the mode parameter, like this:

The only difference from opening a text file is the mode value of "rb". You can use any of the other modes too, simply add a 'b': "wb" to write, "ab" to append.

Closing a binary file is no different to a text file, simply call the close() method of the open file object:

Because the file was opened in binary mode there is no need to give Python any extra information, it knows how to close the file correctly.
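
The opening and closing steps described in this section can be sketched together as follows. The file name data.bin and the byte values are invented for the example.

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())   # scratch folder (assumption for the demo)

# "wb": write raw bytes, including values that text mode might alter.
binfile = open("data.bin", "wb")
binfile.write(bytes([0, 10, 13, 26, 255]))
binfile.close()

# "rb": read the bytes back exactly as they were written.
binfile = open("data.bin", "rb")
data = binfile.read()
binfile.close()                # closing is the same as for a text file

print(list(data))
```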

Binary encoding of data

Before we discuss how to access the data within a binary file we need to consider how data is represented and stored on a computer. All data is stored as a sequence of binary digits, or bits. These bits are grouped into sets of 8 or 16 called bytes or words respectively. (A group of 4 is sometimes called a nibble!) A byte can be any one of 256 different bit patterns and these are given the values 0-255.

The information we manipulate in our programs, strings, numbers etc must all be converted into sequences of bytes. Thus the characters that we use in strings are each allocated a particular byte pattern. There were originally several such encodings, but the most common was ASCII (American Standard Code for Information Interchange). Unfortunately pure ASCII only caters for 128 values which is not enough for non-English languages. A new encoding standard known as Unicode has been produced, which can use data words instead of bytes to represent characters, and allows for over a million characters. These characters can then be encoded into a more compact data stream.

One of the most common encodings is called UTF-8 and it corresponds closely to the earlier ASCII coding such that every valid ASCII file is a valid UTF-8 file, although not necessarily the other way around. Unicode provides a number of different encodings each of which defines which bytes represent each Unicode numerical value (or code point in Unicode terms). If you are thinking that this is complicated you are right! It is the cost of building a global computer network that must work in lots of different languages. The good news if you are an English speaker is that for the most part you can ignore it! (Although you should know that, as of version 3 Python strings are actually Unicode strings.) The exception is when reading data from a binary file, when you do need to know which encoding has been used to be able to interpret the binary data successfully.

Python fully supports Unicode text. A string of encoded characters is considered to be a byte string and has the type bytes whereas a string of unencoded text has the type str. The default encoding is usually UTF-8 (but, in theory at least, could be different!). I will not be covering the use of non UTF-8 encodings in this tutorial but there is an extensive "How-To" document on the Python web site.

The key thing to realize in all of this is that a binary stream of encoded Unicode text is treated as a string of bytes and Python provides functions to convert (or decode) bytes into str values.
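
A short sketch of the conversion in both directions, using the str method encode() and the bytes method decode(). The sample word is arbitrary. It contains one non-ASCII character so that the difference in length is visible.

```python
text = "caf\u00e9"             # 4 characters, the last is e-acute
raw = text.encode("utf-8")     # str -> bytes

print(raw)
print(len(text), len(raw))     # the accented letter needs 2 bytes in UTF-8

back = raw.decode("utf-8")     # bytes -> str
```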

In the same way numbers need to be converted to binary codings too. For small integers it is simple enough to use the byte values directly, but for numbers larger than 255 (or negative numbers, or fractions) some additional work needs to be done. Over time various standard codings have emerged for numerical data and most programming languages and operating systems use these. For example, the Institute of Electrical and Electronics Engineers (IEEE) has defined a number of codings for floating point numbers.

The point of all of this is that when we read a binary file we have to interpret the raw bit patterns into the correct type of data for our program. It is perfectly possible to interpret a stream of bytes that were originally written as a character string as a set of floating point numbers. Of course the original meaning will have been lost but the bit patterns could represent either. So when we read binary data it is extremely important that we convert it into the correct data type.
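
To make the point concrete, this sketch takes four arbitrary bytes and interprets them first as text and then as a single integer, using the built-in int.from_bytes(). Both readings are equally valid as far as the bits are concerned.

```python
raw = b"ABCD"                               # the same four bytes throughout

as_text = raw.decode("ascii")               # read as four characters
as_number = int.from_bytes(raw, "little")   # read as one integer

print(as_text)
print(as_number, hex(as_number))
```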

The struct Module

To encode/decode binary data Python provides a module called struct, short for structure. struct works very much like the format strings we have been using to print mixed data. We provide a string representing the data we are reading and apply it to the byte stream that we are trying to interpret. We can also use struct to convert a set of data to a byte stream for writing, either to a binary file or even to a communications line!

There are many different conversion format codes but we will only use the integer and string codes here. (You can look up the others in the Python documentation for the struct module.) The codes for integer and string are i and s respectively. The struct format strings consist of sequences of codes with numbers prepended to indicate how many of the items we need. The exception is the s code, where the prepended number means the length of the string. For example 4s means a string of four characters (note 4 characters not 4 strings!).

Let's assume we wanted to write the address details, from our Address Book program above, as binary data with the street number as an integer and the rest as a string (This is a bad idea in practice since street "numbers" sometimes include letters!). The format string would look like:
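
As an illustration, take a made-up address whose text part is 16 characters long. The format is then one integer followed by a 16 character string, as in this sketch.

```python
import struct

number = 10
rest = b"Some St, Anytown"     # 16 bytes of text (hypothetical address)

fmt = "i16s"                   # one integer, then a 16 byte string
data = struct.pack(fmt, number, rest)

print(struct.calcsize(fmt))    # total size in bytes
print(data)
```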

To cope with multiple address lengths we could write a function to create the binary string like this:
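
A sketch of such a function follows. It assumes the address is a single string that starts with the street number, and the sample address is invented.

```python
import struct

def formatAddress(address):
    fields = address.split()                    # split on whitespace
    number = int(fields[0])                     # first field is the number
    rest = bytes(" ".join(fields[1:]), "utf8")  # the remainder, as bytes
    fmt = "i%ds" % len(rest)                    # e.g. "i31s"
    return struct.pack(fmt, number, rest)

data = formatAddress("10 Some St, Anytown, 0171 234 8765")
print(data)
```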

So we used a string method - split() - (more on them in the next topic!) to split the address string into its parts, extract the first one as the number and then use another string method, join to join the remaining fields back together separated by spaces. We also need to convert the string into a bytes array because that's what the struct module uses. The length of that string is the number we need in the struct format string so we use the len() function in conjunction with a normal format string to build a struct format string. Phew!

formatAddress() will return a sequence of bytes containing the binary representation of our address. Now that we have our binary data let's see how we can write that to a binary file and then read it back again.

Reading & Writing Using struct

Let's create a binary file containing a single address line using the formatAddress() function defined above. We need to open the file for writing in 'wb' mode, encode the data, write it to the file and then close the file. Let's try it:
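
A sketch of those steps. formatAddress() is repeated here in the form assumed above so that the example runs on its own, and the address is still a made-up one.

```python
import os
import struct
import tempfile

os.chdir(tempfile.mkdtemp())   # scratch folder (assumption for the demo)

def formatAddress(address):
    fields = address.split()
    number = int(fields[0])
    rest = bytes(" ".join(fields[1:]), "utf8")
    return struct.pack("i%ds" % len(rest), number, rest)

f = open("address.bin", "wb")  # binary mode, writing
data = formatAddress("10 Some St, Anytown, 0171 234 8765")
f.write(data)
f.close()

print(data[:4])                # the packed integer
```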

You can check that the data is indeed in binary format by opening address.bin in Notepad. The characters will be readable but the number will not look like 10! In fact it has disappeared! If you have an editor which can read binary files (e.g. vim or emacs) and use that to open address.bin you will see that the start of the file has 4 bytes. The first of these may look like a newline character and the rest are zeros. Now it turns out that, just coincidentally, the numerical value of newline is 10! As we can show using Python:
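
A quick check at the Python prompt, sketched here as a tiny script:

```python
newline_value = ord("\n")      # numeric value of the newline character

print(newline_value)
print(hex(newline_value))      # the same value in hexadecimal
```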

The ord() function simply returns the numeric value of a given character. So the first 4 bytes are 10,0,0,0 in decimal (or 0xA,0x0,0x0,0x0 in hexadecimal, the system usually used to display binary data - since it is much more concise than using pure binary).

On a 32 bit computer an integer takes up 4 bytes. So the integer value '10' has been converted by the struct module into the 4 byte sequence 10, 0, 0, 0. On Intel microprocessors the convention is to store the least significant byte first, so reading the sequence in reverse gives us the true "binary" value: 0,0,0,10.

That is the integer value 10 expressed as 4 bytes, each written in decimal. The rest of the data is basically the original text string and so appears in its normal character format.
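
The struct module lets us ask for either byte order explicitly by putting < (least significant byte first) or > (most significant byte first) in front of the format code. This sketch packs the value 10 both ways and compares them with the machine's native order.

```python
import struct
import sys

little = struct.pack("<i", 10)   # least significant byte first
big = struct.pack(">i", 10)      # most significant byte first
native = struct.pack("i", 10)    # whatever this machine uses

print(list(little))
print(list(big))
print(sys.byteorder, list(native))
```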

Be sure not to save the file from within Notepad: although Notepad can load some binary files it cannot save them as binary. It will try to convert the binary to text and can corrupt the data in the process! It is worth pointing out here that the file extension .bin that we used is purely for our convenience; it has no bearing on whether the file is binary or text format. Some operating systems use the extension to determine what programme they will use to open the file, but you can change the extension by simply renaming the file. The content will not change: it will still be binary or text, whichever it was originally. (You can prove this by renaming a text file in Windows to .exe, whereupon Windows will treat the file as an executable, but when you try to run it you will get an error because the text is not really executable binary code! If you now rename it back to .txt the file will open in Notepad exactly as it did before. The content has not been altered at all. In fact you could even have opened the file in Notepad while it was named as a .exe and it would have worked just as well!)

To read our binary data back again we need to open the file in 'rb' mode, read the data into a sequence of bytes, close the file and finally unpack the data using a struct format string. The question is: how do we tell what the format string looks like? In general we would need to find the binary format from the file definition (there are several web sites which provide this information - for example Adobe publish the definition of their common PDF binary format). In our case we know it must be like the one we created in formatAddress(), namely 'iNs' where N is a variable number. How do we determine the value of N?

The struct module provides some helper functions that return the size of each data type, so by firing up the Python prompt and experimenting we can find out how many bytes of data we will get back for each data type:
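
The helper in question is struct.calcsize(), which takes a format string and returns the number of bytes it describes. A small sketch of the experiment:

```python
import struct

int_size = struct.calcsize("i")    # bytes used by one integer
char_size = struct.calcsize("s")   # bytes used by a one character string
five_chars = struct.calcsize("5s") # bytes used by a five character string

print(int_size, char_size, five_chars)
```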

Ok, we know that our data will comprise 4 bytes for the number and one byte for each character. So N will be the total length of the data minus 4. Let's try using that to read our file:
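
A sketch of reading the file back. So that it runs by itself it first recreates address.bin with the same made-up address used earlier. The variable names are assumptions too.

```python
import os
import struct
import tempfile

os.chdir(tempfile.mkdtemp())   # scratch folder (assumption for the demo)

# Recreate the file from the writing step.
text = b"Some St, Anytown, 0171 234 8765"
f = open("address.bin", "wb")
f.write(struct.pack("i%ds" % len(text), 10, text))
f.close()

# Read all the bytes back in binary mode.
f = open("address.bin", "rb")
data = f.read()
f.close()

# 4 bytes for the integer, the remainder is the string.
fmtString = "i%ds" % (len(data) - 4)
number, rest = struct.unpack(fmtString, data)

rest = rest.decode()           # bytes -> str so that join() will accept it
address = " ".join((str(number), rest))
print("Address after restoring data:", address)
```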

Note: We had to convert rest to a string using the decode() method since Python considered it to be of type bytes (see the sidebar above) which won't work with join().

And that's it on binary data files, or at least as much as I'm going to say on the subject. As you can see using binary data introduces several complications and unless you have a very good reason I don't recommend it. But at least if you do need to read a binary file, you can do it (provided you know what the data represented in the first place of course!)

Random Access to Files

The last aspect of file handling that I'll consider is called random access. Random access means moving directly to a particular part of the file without reading all the intervening data. Some programming languages provide a special indexed file type that can do this very quickly but in most languages it's built on top of the normal sequential file access that we have been using up till now.

The concept used is that of a cursor that marks the current position within the file, literally how many bytes we are from the beginning. We can move this cursor relative to its current position or relative to the start of the file. We can also ask the file to tell us where the cursor is currently.

By using a fixed line length (perhaps by padding our data strings with spaces or some other character where necessary) we can jump to the start of a particular line by multiplying the length of a line by the number of lines that come before it. This is what gives the impression of random access to the data in the file.

Where am I?

To determine where we are in a file we can use the tell() method of a file. For example if I open a file and read three lines, I can then ask the file how far into the file I am.

Let's look at an example, first I will create a file with 5 lines of text all the same length (the equal length business isn't strictly necessary but it does make life easier!). Then I'll read three lines back and ask where we are. I'll then go back to the beginning, read one line then jump to the third line and print it, jumping over the second line. Like this:
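
A sketch of that experiment. The file name testfile.txt and the variable names are assumptions. One caveat: Python's documentation only promises that, in text mode, seek() accepts zero or a value previously returned by tell(). The computed offset used here works for a plain ASCII file like this one, but for anything else binary mode is the safer choice.

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())   # scratch folder (assumption for the demo)

# Create 5 lines, each 20 characters long.
testfile = open("testfile.txt", "w")
for i in range(5):
    testfile.write(str(i) * 20 + "\n")
testfile.close()

# Read three lines, then ask where we are.
testfile = open("testfile.txt", "r")
for n in range(3):
    print(testfile.readline().strip())
position = testfile.tell()
print("At position:", position, "bytes")

# Back to the start, read one line and measure how long a line really is.
testfile.seek(0)
first_line = testfile.readline().strip()
print(first_line)
lineLength = testfile.tell()

# Jump over the second line straight to the third.
testfile.seek(2 * lineLength)
third_line = testfile.readline().strip()
print(third_line)
testfile.close()
```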

Note the use of the seek() function to move the cursor. The default operation is to move it to the byte number specified, as shown here, but extra arguments can be provided that change the indexing method used. Also note that the value printed by the first tell() depends on the length of a newline on your platform. On my Windows 10 PC it printed 66, indicating that the newline sequence is 2 bytes long. But since this is a platform specific value and I want to make my code portable I've used tell() again, after reading one line, to work out how long each line really is. These kinds of "cunning ploys" are often necessary when dealing with platform specific issues!
