Web Client Programming

What will we cover?

  1. How to fetch web pages from a program using urllib.request
  2. How to extract links and data from HTML using the html.parser module
  3. How to send data to a server in a GET request
  4. How to detect and handle error codes returned by the server

What is Web client programming

Before going any further we need to clarify what is meant by web client programming. There are basically two different definitions:

  1. Using JavaScript to manipulate the content of a web page within the browser. This could involve changing the colours of elements, moving panels around, making things appear or disappear, and fetching bits of raw data from the server, for example in a "live search" scenario. The programming is usually not too complex, but it does require a deep knowledge of HTML and CSS and, as such, is beyond the scope of this tutorial. If you need to know more about this kind of web client programming there are many web tutorials available, as well as several JavaScript libraries and frameworks worth researching, the best known being jQuery.
  2. Creating a program that runs on a PC but accesses a web server as if it were a browser. Such programs may simply retrieve some data for further analysis, or they may follow links from one site to another searching for specific topical data. The latter variety is often called a web spider; the former is usually called a bot (short for robot). These are the types of web client program we will discuss here.

Learning by Example

Unlike most of the topics in the tutorial I'm going to use a worked example to demonstrate how to write a web client program. This will start very simply then add features to show some of the more complex issues. It will still be a very simple program when we have finished but it should give you a feel for the kinds of things you can do.

The objective of this exercise is to build a program that will generate a new web page containing a collection of all of the "What will we cover?" lists from the top of each topic.

To do this we need to accomplish several things:

  1. We need to read the html content of the index file
  2. We need to parse the contents page and extract all the links on the left hand pane into a list.
  3. For each link fetch the html content of that file
  4. Extract the list of topics inside the "What will we cover" box at the top of the page and store in another list.
  5. Finally, we need to generate a new HTML page using the data we have collected.

Retrieving the contents

You already saw how to fetch a simple HTML page from a server using urllib.request:

import urllib.request as url
site = url.urlopen('http://www.alan-g.me.uk/l2p2/index.htm')
page = site.read()

You now have all the HTML for the top level page (strictly speaking, read() returns bytes, which we will decode to a string before parsing). You could print it out, but it's a bit hard to read, so instead use the view source option in your web browser to look at the HTML. There you will see that the table of contents is contained within a pair of nav tags (short for navigation). Each section's list of topics is an unordered list, signified by ul tags, with each list item marked by an li tag. The list items are all hyperlinks to the topic files and, as such, are surrounded by anchor (<a>) tags.

So our next task is to extract all the <a> items within the nav pane of the page.

Extracting tag content

We mentioned in the introduction that you can use simple text searches to find tags etc. in a web page but that it is usually better to use a proper HTML parser. We will follow that advice and use the HTMLParser class found in the standard library module: html.parser.

This is an abstract, event-driven parser which we must subclass to provide the specific features we need. The key point is that, as it works through the HTML, it calls a handler method for each thing it finds: handle_starttag() for every opening tag and handle_endtag() for every closing tag (there are others, which we will meet later). We must provide our own versions of those methods that act appropriately when the tags we are interested in are found.
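Before building the real parser, it may help to see the event mechanism in action on a tiny fragment. This is purely illustrative: a minimal subclass that simply records every event it receives (including handle_data(), which reports the text between tags and which we will use again later):

```python
import html.parser

class EchoParser(html.parser.HTMLParser):
    """Records every parser event so we can see the callbacks fire."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, name, attributes):
        self.events.append(('start', name, attributes))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_endtag(self, name):
        self.events.append(('end', name))

p = EchoParser()
p.feed('<p class="intro">Hello</p>')
print(p.events)
# [('start', 'p', [('class', 'intro')]), ('data', 'Hello'), ('end', 'p')]
```

Notice that the attributes arrive as a list of (name, value) tuples, which is exactly the form we loop over below.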

In our case we want to find all the links inside the nav panel. To do that we set a flag, which we will name is_nav, whenever an opening nav tag is found. When we find an a tag and the flag is True, we save the href attribute value into a list. (The attributes are passed in to the method as a list of key/value tuples.)

Finally, we need to detect the closing /nav tag and reset the flag to False to ensure we don't collect any links from outside the contents pane.

To start with we'll set up the parser and ensure it can identify the three tags that we are interested in by using print statements. It looks like this:

import urllib.request as url
import html.parser

class LinkParser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []  #list to store the links
        self.is_nav = False  # the flag

    def handle_starttag(self, name, attributes):
        if name == 'nav':
            print("We found a nav tag")
        elif name == 'a':
            print("We found an anchor tag")

    def handle_endtag(self, name):
        if name == 'nav':
            print("We found a closing nav tag")

site = url.urlopen('http://www.alan-g.me.uk/l2p2/index.htm')
page = site.read().decode('utf8') # convert bytes to str

parser = LinkParser()
parser.feed(page)

If you run that it should fetch the page and parse it, printing out a message for each tag found. (Apart from proving the parser works, it also reveals that we really do need the is_nav flag, since there is a link outside the nav panel.) We can now replace the print statements with the actual code we want:

import urllib.request as url
import html.parser

class LinkParser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []  #list to store the links
        self.is_nav = False  # the flag

    def handle_starttag(self, name, attributes):
        if name == 'nav':
            self.is_nav = True
        elif name == 'a' and self.is_nav:
            for key,val in attributes:
                if key == "href":
                    self.links.append(val)

    def handle_endtag(self, name):
        if name == 'nav':
            self.is_nav = False

site = url.urlopen('http://www.alan-g.me.uk/l2p2/index.htm')
page = site.read().decode('utf8') # convert bytestring to str

parser = LinkParser()
parser.feed(page)
print(parser.links)

The result of that should look like:

['tutintro.htm', 'tutneeds.htm', 'tutwhat.htm', 'tutstart.htm', 'tutseq1.htm', 
'tutdata.htm', 'tutseq2.htm', 'tutloops.htm', 'tutstyle.htm', 'tutinput.htm', 
'tutbranch.htm', 'tutfunc.htm', 'tutfiles.htm', 'tuttext.htm', 'tuterrors.htm', 
'tutname.htm', 'tutregex.htm', 'tutclass.htm', 'tutevent.htm', 'tutgui.htm', 
'tutrecur.htm', 'tutfctnl.htm', 'tutcase.htm', 'tutpractice.htm', 'tutdbms.htm', 
'tutos.htm', 'tutipc.htm', 'tutsocket.htm', 'tutweb.htm', 'tutwebc.htm', 
'tutwebcgi.htm', 'tutflask.htm', 'tutrefs.htm']
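Notice that the links come back relative to the index page. Later we will simply prepend the site root by string concatenation, but it is worth knowing that the standard library's urllib.parse.urljoin resolves a relative link against a base URL correctly, including awkward cases like parent-directory references:

```python
from urllib.parse import urljoin

base = 'http://www.alan-g.me.uk/l2p2/index.htm'

# a sibling page in the same folder
print(urljoin(base, 'tutstart.htm'))   # http://www.alan-g.me.uk/l2p2/tutstart.htm

# a link that climbs out of the folder
print(urljoin(base, '../other.htm'))   # http://www.alan-g.me.uk/other.htm
```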

Extracting the bullets from a topic

Having got our list of topics we want to create a function that can extract the bullet points at the top of each page. Using the browser's View Source feature once more, we can examine the HTML for a topic frame. This time we discover that we are looking for a set of list items inside a div with the class attribute set to "todo". That's conceptually very similar to what we did for the links. Let's create a function that can take an HTML string and return a list of li strings. (We will just hard code the file name for now as 'tutstart.htm'.)

import urllib.request as url
import html.parser

class BulletParser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_todo = False
        self.is_bullet = False
        self.bullets = []

    def handle_starttag(self, name, attributes):
        if name == 'div':
           for key, val in attributes:
               if key == 'class' and val == 'todo':
                   self.in_todo = True
        elif name == 'li':
            if self.in_todo:
               self.is_bullet = True
            
    def handle_data(self, data):
        if self.is_bullet:
            self.bullets.append(data)
            self.is_bullet = False  # reset the flag
            
    def handle_endtag(self, name):
        if name == 'div':
            self.in_todo = False

topic_url = "http://www.alan-g.me.uk/l2p2/tutstart.htm"

def get_bullets(aTopic):
    site = url.urlopen(aTopic)
    topic = site.read().decode('utf8')
    topic_parser = BulletParser()
    topic_parser.feed(topic)
    return topic_parser.bullets

print( get_bullets(topic_url) )

Notice that this time we have an extra event handler method (handle_data()) to override. That's because we want to extract the data inside the <li> tags rather than the tag itself or its attributes.

Other than that it is very similar to the previous example. We set a flag (in_todo) to indicate when we are inside the box, and a second (is_bullet) to indicate we have found a list item inside that box. We reset the is_bullet flag as soon as we have read the data, and reset in_todo when we leave the box (</div>).

We can now merge our two programs by adding the new class and function to our previous file. All that remains is to write a for loop that iterates over the links from the first parser and passes them to the get_bullets() function, accumulating the results in a dictionary keyed by topic. We will also tidy up a little by creating a get_topics() function similar to get_bullets(), add a little error handling, and convert the whole thing to module format:

import urllib
import urllib.request as url
import html.parser

###### Link handling code  ####

class LinkParser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []  #list to store the links
        self.is_nav = False  # the flag

    def handle_starttag(self, name, attributes):
        if name == 'nav':
            self.is_nav = True
        elif name == 'a' and self.is_nav:
            for key,val in attributes:
                if key == "href":
                    self.links.append(val)

    def handle_endtag(self, name):
        if name == 'nav':
            self.is_nav = False

def get_topics(aSite):
    try:
        site = url.urlopen(aSite)
        page = site.read().decode('utf8') # convert bytestring to str
        link_parser = LinkParser()
        link_parser.feed(page)
        return link_parser.links
    except urllib.error.HTTPError:
        return []


##### Bullet handling code

class BulletParser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_todo = False
        self.is_bullet = False
        self.bullets = []

    def handle_starttag(self, name, attributes):
        if name == 'div':
           for key, val in attributes:
               if key == 'class' and val == 'todo':
                   self.in_todo = True
        elif name == 'li':
            if self.in_todo:
               self.is_bullet = True
            
    def handle_data(self, data):
        if self.is_bullet:
            self.bullets.append(data)
            self.is_bullet = False  # reset the flag
            
    def handle_endtag(self, name):
        if name == 'div':
            self.in_todo = False

def get_bullets(aTopic):
    try:
        site = url.urlopen(aTopic)
        topic = site.read().decode('utf8')
        topic_parser = BulletParser()
        topic_parser.feed(topic)
        return topic_parser.bullets
    except urllib.error.HTTPError:
        return []

#### driver code  ####
if __name__ == "__main__":
    summary = {}
    site_root = "http://www.alan-g.me.uk/l2p2/"

    the_topics = get_topics(site_root+'index.htm')

    for topic in the_topics:
        topic_url = site_root + topic
        summary[topic] = get_bullets(topic_url)

    print(summary['tutdata.htm'])

We can tidy things even further by taking the two parser classes and their associated utility functions out into separate modules. Try that for yourself. Create two modules linkparser.py and bulletparser.py and import them into the remaining code which you can save as topic_summary.py

The latter file should look a lot like:

import linkparser as LP
import bulletparser as BP

if __name__ == "__main__":
    summary = {}
    site_root = "http://www.alan-g.me.uk/l2p2/"

    the_topics = LP.get_topics(site_root+'index.htm')

    for topic in the_topics:
        topic_url = site_root + topic
        summary[topic] = BP.get_bullets(topic_url)

    print(summary['tutdata.htm'])

Now isn't that a bit nicer?

Creating a summary web page

The only task left in our project is to take the accumulated data and create a web page to display it. This is little more than the basic file and text handling we used for the menu example back in the Basics section. The only complication is the need to use HTML tags to format the data. We write out some static HTML header material, then loop over the topic data writing out each topic and its bullets, and finish with a static HTML footer before closing the file. Job done. You can then open the file in your browser to see the final page. It looks like this:

import linkparser as LP
import bulletparser as BP
import time

def create_summary(filename,data):
    with open(filename, 'w') as outf:
        # Write out the header
        outf.write('''<!DOCTYPE html>
<html><body>
<h1>Summary of tutor topics</h1>
<dl>''')

        # Write each topic name...
        for topic in data:
            outf.write('<dt>%s</dt>' % topic)
            # ...and its bullets
            for bullet in data[topic]:
                outf.write("<dd>%s</dd>" % bullet)
                
        # Write out the footer
        outf.write('''
</dl>
</body></html>''')
    
if __name__ == "__main__":
    summary = {}
    site_root = "http://www.alan-g.me.uk/l2p2/"
    summary_file = './topic_summary.htm'
    
    the_topics = LP.get_topics(site_root+'index.htm')

    for topic in the_topics:
        topic_url = site_root + topic
        summary[topic] = BP.get_bullets(topic_url)
        time.sleep(1) # stop server treating as DoS attack

    create_summary(summary_file, summary)
    print('OK')

If you run that it should produce an HTML file called topic_summary.htm that you can open in your browser to view the finished summary page.

Sending data in the request

Sometimes we need to do more than just fetch a static file (or files) from the server. For example, we may need to log in by sending a user name and password or similar credentials, or we may be using a search facility to bring back some links. There are two ways to send data to a server, depending on the HTTP request type expected.

As you saw in the introduction topic, HTTP has several request types we can send, the most common being GET and POST. GET sends all its data in the address line that you see in the browser. You have probably noticed strange hieroglyphics with lots of question marks and equals signs; these are the data values in a GET request.
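Rather than assembling those ?key=value strings by hand, urllib.parse.urlencode will do the escaping for you. The parameter names and URL below are made up for illustration; real sites expect their own names:

```python
from urllib.parse import urlencode

params = {'q': 'python web client', 'type': 'code'}
query = urlencode(params)          # escapes spaces and special characters
print(query)                       # q=python+web+client&type=code

full_url = 'https://example.com/search?' + query
```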

Sending a search string to Github

Quite a lot of real world web client programming is part science and part trial and error. You can learn an awful lot about your target web site by visiting it using your favourite browser and looking closely at both the address bar and the source of the pages. If you visit the open source code repository Github and do a search for 'python' you will see on the returned page address bar that the address looks like this:

https://github.com/search?utf8=%E2%9C%93&q=python&type=

The utf8=%E2%9C%93 part is just a percent-encoded Unicode character (a tick) confirming that the search uses UTF-8. We can ignore it, along with the empty type= at the end, and just send:

http://github.com/search?q=python

And it will work fine.
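You can verify what such percent escapes mean with urllib.parse.unquote, which reverses the encoding:

```python
from urllib.parse import unquote

print(unquote('%E2%9C%93'))   # the tick character, U+2713
print(unquote('q%3Dpython'))  # q=python
```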

By viewing the page source and, if necessary, viewing the element properties using the browser's inspection tools we can discover what kind of HTML structure we can expect to get back. In the case of Github we can do a text search of the page source screen and find that the first result (which for me was geekcomputers/Python) looks like:

<div class="repo-list-item d-flex flex-justify-start py-4 public source">
  <div class="col-8 pr-3">
    <h3>
      <a href="/geekcomputers/Python" class="v-align-middle">geekcomputers/<em>Python</em></a>
    </h3>
...

And you can hopefully see how we could extract the links with only slightly more difficulty than we did in the topic_summary.htm example above.

Of course life is full of disappointments and you will often find sites that have mechanisms in place to prevent non-browsers from accessing their information. Or the site uses advanced JavaScript techniques to generate the display and simple HTML parsing won't work. That's tough. But if you look a little deeper such sites often offer an alternative, more robust API that can be accessed. If it is a commercial site they may require some kind of license payment of course, but that's how they fund the site so if you have a genuine need for the data don't be mean and pay up.
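One of the simplest blocking mechanisms is checking the User-Agent header (urllib identifies itself as Python-urllib by default). Where the site's terms of use allow it, you can supply your own headers by wrapping the address in a Request object before calling urlopen(). The header value here is just an example:

```python
import urllib.request as url

req = url.Request('http://www.alan-g.me.uk/l2p2/index.htm',
                  headers={'User-Agent': 'Mozilla/5.0 (compatible; topic-summary)'})

# urlopen() accepts a Request object in place of an address string,
# and sends our headers with the request:
# site = url.urlopen(req)

print(req.get_header('User-agent'))   # Mozilla/5.0 (compatible; topic-summary)
```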

There are several other gotchas waiting to bite you but I won't cover those here. Things you might look out for, and have to do some research on, are handling login prompts, using cookies and handling encrypted https connections. All of these are possible with a bit of effort but beyond the scope of this tutorial.
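As a starting point for that research: cookie support lives in the standard library's http.cookiejar module, which plugs into urllib through an opener object. A minimal sketch (no network access happens here; it just builds the machinery):

```python
import urllib.request
import http.cookiejar

# A CookieJar stores cookies the server sends us...
jar = http.cookiejar.CookieJar()

# ...and an opener built with an HTTPCookieProcessor sends them back
# automatically on subsequent requests (e.g. a session ID after login).
opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))

# Use opener.open(address) wherever you would have used urlopen(address).
```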

By combining various permutations of these techniques it is possible to extract most pieces of information from an HTML page using this parser. However, you can get caught out by badly written (or badly formed) HTML. In those cases you may need to write custom code to correct the HTML or, if it is extensively malformed, write the HTML to a text file and use an external checker like HTML Tidy to clean it up before trying to parse it. Alternatively, investigate the third party package Beautiful Soup, which can cope with most of the common problems in HTML.

Detecting error codes

One other thing you need to be able to detect is the errors returned by the web server. In a browser these are displayed as "Page not Found" or similar, relatively friendly, error strings. However, if you are fetching data from the server yourself, the error comes back as a status code in the HTTP response, which urllib.request converts to a urllib.error.HTTPError exception. We know how to catch exceptions using a normal try/except construct, so we can catch these errors quite easily:

import urllib.request as url
import urllib.error
try:
    site = url.urlopen("http://www.alan-g.me.uk/no-such-page.html")
except urllib.error.HTTPError as e:
    print(e.code)    # the status number, e.g. 404

(Note that a completely unreachable address raises the more general urllib.error.URLError instead, so you may want to catch that too.)

The value in urllib.error.HTTPError.code comes from the first line of the web server's HTTP response, just before the headers begin (e.g. "HTTP/1.1 200 OK" or "HTTP/1.1 404 Not Found"), and is the status number. The standard HTTP status codes are defined in the HTTP specification; the most interesting are those starting with 4 (client errors) or 5 (server errors).

The most common error codes you will encounter are:

  400 Bad Request - the server could not understand the request
  401 Unauthorized - authentication is required and has failed
  403 Forbidden - the server refuses to supply the resource
  404 Not Found - the requested resource does not exist
  407 Proxy Authentication Required - you must authenticate with a proxy first
  500 Internal Server Error - the server hit an error processing the request
  503 Service Unavailable - the server is temporarily overloaded or down
  504 Gateway Timeout - an upstream server did not respond in time

Some of these (e.g. 503 and 504) simply require a retry, possibly after a short delay, others (e.g. 407) require significantly more work on your part to access the web page!
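For the transient errors a simple retry loop is often enough. This helper is illustrative rather than part of the project code; the attempt count and delay are arbitrary:

```python
import time
import urllib.request as url
import urllib.error

def fetch_with_retry(address, attempts=3, delay=2):
    """Retry on transient server errors (503/504); re-raise everything else."""
    for attempt in range(attempts):
        try:
            return url.urlopen(address).read()
        except urllib.error.HTTPError as e:
            if e.code in (503, 504) and attempt < attempts - 1:
                time.sleep(delay)   # give the server a moment, then try again
            else:
                raise               # permanent error, or out of attempts
```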

OK, now let's look at the other side of the coin. What happens when our requests reach a web server? How do we create our own web server? That's the subject of the next topic.

Things to remember

  1. Use urllib.request.urlopen() to fetch pages and decode() the bytes before parsing
  2. Subclass html.parser.HTMLParser and override the handler methods to extract tags and data
  3. GET requests send their data as key=value pairs after a ? in the address
  4. Catch urllib.error.HTTPError to detect server error codes
  5. Pause between requests so the server doesn't mistake your program for an attack
