What will we cover?
In this topic we will look at how data can be stored over time and manipulated via a database package. We have already seen how to use files to store small quantities of data such as our personal address book but the complexity of using files increases dramatically as the complexity of the data increases, the volume of data increases, and the complexity of the operations (searching, sorting, filtering etc). To overcome this, several database packages exist to take care of the underlying file management and expose the data in an abstract form which is easy to manipulate. Some of these packages are simple code libraries that simplify the file operations we have already seen, examples include the pickle and shelve modules that come with Python. In this topic, we will concentrate on more powerful packages, that are designed to handle large volumes of complex data.
The specific package I will be looking at is SQLite, an open-source freeware package that is simple to install and use but capable of handling the data needs of most beginning and intermediate programmers. Only if you have very large data sets - millions of records - do you need to consider a more powerful package and, even then, almost all you know from SQLite will transfer to the new package.
The SQLite package can be downloaded from here and you should fetch the command-line (aka tools) package appropriate for your platform. (There are some useful IDEs for SQLite that you might like, but they aren't necessary for this tutorial.) Follow the instructions on the web site to install the packages and you should be ready to start.
The basic principle of a relational database is very simple. It's simply a set of tables where a cell in one table can refer to a row in another table. Columns are known as fields and rows as records.
A table holding data about employees might look like:
EmpID | Name | HireDate | Grade | ManagerID |
---|---|---|---|---|
1020304 | John Brown | 20030623 | Foreman | 1020311 |
1020305 | Fred Smith | 20040302 | Labourer | 1020304 |
1020307 | Anne Jones | 19991125 | Labourer | 1020304 |
Notice a couple of conventions here:
We are not restricted to linking data within a single table. We could create another table for Salary. These could be related to Grade and so we get a table like:
SalaryID | Grade | Amount |
---|---|---|
000010 | Foreman | 60000 |
000011 | Labourer | 35000 |
Now we can look up the grade of an Employee, such as John, and find that they are a Foreman, then by looking at the Salary table we can see that a Foreman is paid $60000.
It is this ability to link table rows together in relationships that gives relational databases their name. Other database types include network databases, hierarchical databases and flat-file databases. Relational databases are by far the most common. (Although, a recent trend in the handling of extremely large data sets is to use NoSQL - "Not only SQL" - databases, which are often based on network or hierarchical structures.)
We can do much more sophisticated queries too and we will look at how this is done in the next few sections. But before we can do that we had better create a database and insert some data.
The Structured Query Language or SQL (pronounced as either Sequel or 'S' 'Q' 'L') is the standard tool for manipulating relational databases. In SQL an expression is often referred to as a query (even if it does not actually retrieve any data!).
SQL comprises two parts, the first is the Data Definition Language, or DDL. This is the set of commands used to create and alter the shape of the database itself, its structure. DDL tends to be quite database specific with each database supplier having a slightly different syntax for their DDL set of SQL commands.
The other part of SQL is the Data Manipulation Language or DML. DML is much more highly standardised between databases and is used to manipulate the data content of the database. You will spend the majority of your time using DML rather than DDL.
We will only look briefly at DDL, just enough to create (with the CREATE command) and destroy (with the DROP command) our database tables so that we can move onto filling them with data and then retrieving that data in interesting ways using the DML commands (INSERT, SELECT, UPDATE, DELETE etc).
Another feature of SQL worth highlighting is that, unlike Python and Javacript, it is not case sensitive. Thus you can use 'CREATE' or 'Create' or 'create' or even 'CrEaTe' if you wish. The SQL interpreter will not care. However, SQL programmers often follow a style whereby SQL keywords are entered as upper case and variables and table or field names are entered in lowercase. I will generally follow that style but always be aware that SQL doesn't really care.
One aspect of SQL which makes it strikingly different from our other languages is that it is designed to express the required outcome rather than instruct the computer how to achieve that. In other words we tell the interpreter what we want, not how we want it done. We leave the mechanics to the interpreter (Advanced database programmers (and administrators) can influence how the interpreter does things by defining or modifying an execution plan but that is definitely guru level SQL programming!).
To create a table in SQL we use the CREATE command. It is quite easy to use and takes the form:
CREATE TABLE tablename (fieldName, fieldName,....);
Note that SQL statements are terminated with a semi-colon. Also SQL, as well as not being case-sensitive does not care about white-space or indentation levels. As you will see there is a certain style convention that is used but it is not rigidly adhered to and SQL itself cares not a jot!
Let's try creating our Employee and Salary tables in SQLite. The first thing to do is start the interpreter, which is simply a case of calling it with a filename as an argument. If the database exists it will be opened, if it doesn't it will be created. Thus to create an employee database we will start SQLite like so:
E:\PROJECTS\SQL> sqlite3 employee.db
That will create an empty database called employee.db and leave us at the sqlite> prompt ready to type SQL commands. So let's create some tables:
sqlite> CREATE TABLE Employee ...> (EmpID,Name,HireDate,Grade,ManagerID); sqlite> CREATE TABLE Salary ...> (SalaryID, Grade,Amount); sqlite>.tables Employee Salary sqlite>
Note that I moved the list of fields into a separate line, which simply makes it easier to see them. The fields are listed by name but have no other defining information such as data type. This is a peculiarity of SQLLite and most databases require you to specify the type along with the name. We can do that in SQLLite too and we will look at this in more detail a little later in the tutorial.
Also note that I tested that the CREATE statements had worked by using the .tables command to list all the tables in the database. SQLite has several of these dot commands that we will use to find out about our database. .help provides a list of them.
There are lots of other things we can do when we create a table. As well as declaring the types of data in each column, we can also specify constraints as to the values (for example NOT NULL means the value is mandatory and must be filled in - usually we would make the Primary Key field NOT NULL and UNIQUE.) We can also specify which field will be the PRIMARY KEY. We will look more closely at these more advanced creation options later on.
For now we will leave the basic table definition as it is and move on to the more interesting topic of manipulating the data itself.
The first thing to do after creating the tables is fill them with data! This is done using the SQL INSERT statement. The basic structure is very simple:
INSERT INTO tablename ( column1, column2... ) VALUES ( value1, value2... );
There is an alternate form of INSERT that uses a query to select data from elsewhere in the database, but that's rather too advanced for us here so I recommend you read about that in the SQLite manual.
So now, to insert some rows into our employees table, we do the following:
sqlite> INSERT INTO Employee (EmpID, Name, HireDate, Grade, ManagerID) ...> VALUES ('1020304','John Brown','20030623','Foreman','1020311'); sqlite> INSERT INTO Employee (EmpID, Name, HireDate, Grade, ManagerID) ...> VALUES ('1020305','Fred Smith','20040302','Labourer','1020304'); sqlite> INSERT INTO Employee (EmpID, Name, HireDate, Grade, ManagerID) ...> VALUES ('1020307','Anne Jones','19991125','Labourer','1020304');
And for the Salary table:
sqlite> INSERT INTO Salary (SalaryID, Grade,Amount) ...> VALUES('000010','Foreman','60000'); sqlite> INSERT INTO Salary (SalaryID, Grade,Amount) ...> VALUES('000011','Labourer','35000');
And that's it done. We now have created two tables and populated them with data corresponding to the values described in the introduction above. Now we are ready to start experimenting with the data.
Data is extracted from a database using the SELECT command of SQL. SELECT is the very heart of SQL and has the most complex structure. We will start with the most basic form and add additional features as we go along. The most basic SELECT statement looks like this:
SELECT column1, column2... FROM table1,table2...;
So to select the names of all employees we could use:
sqlite> SELECT Name FROM Employee;
And we would be rewarded with a list of all of the names in the Employee table. In this case that's only three, but if we have a big database that's probably going to be more information than we want. To control the output we need to be able to refine our search somewhat and SQL allows us to do this by adding a WHERE clause to our SELECT statement, like this:
SELECT col1,col2... FROM table1,table2... WHERE condition;
the condition is an arbitrarily complex boolean expression and, as we shall see, can include nested SELECT statements within it.
Let's use a WHERE clause to refine our search of names. We will only look for names of employees who are labourers:
sqlite> SELECT Name ...> FROM Employee ...> WHERE Employee.Grade = 'Labourer';
Now we only get two names back. We could extend the condition using boolean operators such as AND, OR, NOT etc. Note that using the = condition the case of the string is important, testing for 'labourer' would not have worked! We'll see how to get round that limitation later on.
Notice that in the WHERE clause we used dot notation to signify the Grade field. In this case, it was not really needed since we are only working with a single table but where multiple tables are specified we need to make it clear which table the field belongs to. As an example let's change our query to find the names of all employees paid more than $50,000. To do that we will need to consider data in both tables:
sqlite> SELECT Name, Amount FROM Employee, Salary ...> WHERE Employee.Grade = Salary.Grade ...> AND Salary.Amount > '50000';
Notice the use of whitespace to tidy the layout and that this time I put the FROM clause on the first line. Remember, SQL doesn't care, it's just a style decision to aid readability.
As expected we only get one name back - that of the foreman. But notice that we also got back the salary, because we added Amount to the list of columns selected. Also note that we have two parts to our WHERE clause combined using an and boolean operator. The first part links the two tables together by ensuring that the common fields are equal, this is known as a join in SQL. (The whole topic of joining tables can get quite complex and SQL regulars tend to prefer a more explicit form of join that we won't get into here.
Note 1: Because the fields that we are selecting come from two tables we have to specify both of the tables from which the result will come. The order of the field names is the order in which we get the data back but the order of the tables doesn't matter so long as the specified fields appear in those tables.
Note 2: We specified two unique field names. If we had wanted to display the Grade as well, which appears in both tables, then we would have had to use dot notation to specify which table's Grade we wanted, like this:
sqlite> SELECT Employee.Grade, Name, Amount ...> FROM Employee, Salary etc/...
The final feature of Select that I want to cover (although there are several more which you can read about in the SQL documentation for SELECT) is the ability to sort the output. Databases generally hold data either in the order that makes it easiest to find things or in the order in which they are inserted, in either case that's not usually the order we want things displayed! To deal with that we can use the ORDER BY clause of the SELECT statement.
SELECT columns FROM tables WHERE expression ORDER BY columns;
Notice that the final ORDER BY clause can take multiple columns, this enables us to have primary, secondary, tertiary and so on sort orders.
Let's use this to get a list of names of employees sorted by HireDate:
sqlite> SELECT Name FROM Employee ...> ORDER BY HireDate;
And that's really all there is to it, now you can't get much easier than that! The only thing worthy of mention is that we didn't use a WHERE clause. If we had used one it would have had to come before the order by clause. So although SQL doesn't mind if you drop the clause, it does care about the sequence of the clauses within the statement.
That's enough about extracting data, let's now see how we can modify our data.
There are two ways that we can change the data in our database. We can alter the contents of a single record, or a set of records or, more drastically, we can delete records or even a whole table. Changing the content of an existing record is the more common case and we do that using the UPDATE SQL command.
The basic format is:
UPDATE tablename SET column = value WHERE condition;
We can try it out on our sample database by changing the salary of a Foreman to $70,000.
sqlite> UPDATE Salary ...> SET Amount ='70000' ...> WHERE Grade = 'Foreman';
One thing to notice is that up until now all of the data we've inserted and selected has been string types. SQLite internally stores its data as strings but actually supports quite a few different types of data, including numbers. So we could have specified the salary in a numeric format which would make calculations easier. We'll see how to do that in the next section.
The big gotcha here is that SQL will, by default, modify as many rows as match the condition. If you only want to modify a single row you must ensure the WHERE clause selects one and only one row. It is a common beginner's error to accidentally change a field for every row in the table, or a subset thereof! This can be a tricky one to fix so aways be extra careful when using modification commands. In fact it's a good idea to check the WHERE clause by putting it in a SELECT statement first to see what comes back.
The other form of fairly drastic change we can make to our data is to delete a row, or set of rows. This uses the SQL DELETE FROM command, which looks like:
DELETE FROM Tablename WHERE condition
So if we wanted to delete Anne Jones from our Employee table we could do this:
sqlite> DELETE FROM Employee WHERE Name = 'Anne Jones';
If more than one row matches our condition then all of the matching rows will be deleted. Remember, SQL always operates on all the rows that match our query, it's not like using a sequential search of a file or string using a regular expression.
To delete an entire table and its contents we would use the SQL DROP command, but we will see that in action a little later. Obviously destructive commands like DELETE and DROP must be used with extreme caution!
We mentioned linking data across tables earlier, in the section on SELECT. However this is such a fundamental part of database theory that we will discuss it in more depth here. The links between tables represent the relationships between data entities that give a Relational Database such as SQLite its name. The database maintains not only the raw data about the entities but information about the relationships too.
The information about the relationships is stored in the form of database constraints which act as rules dictating what kind of data can be stored as well as the valid set of values. These constraints are applied when we define the database structure using the CREATE statement.
We normally express the constraints on a field by field basis so, within the CREATE statement, where we define our columns, we can expand the basic definition from:
CREATE TABLE Tablename (Column, Column,...);
To:
CREATE TABLE Tablename ( ColumnName Type Constraint, ColumnName Type Constraint, ...);
And the most common constraints are:
NOT NULL PRIMARY KEY [AUTOINCREMENT] UNIQUE DEFAULT value
NOT NULL is fairly self explanatory, it indicates that the value must exist and not be NULL! And a NULL value is simply one that has no specified value. Thus NOT NULL means that a value must be given for that field, otherwise an error will result and the data will not be inserted.
PRIMARY KEY tells SQLite to use this column as the main key for lookups (in practice this means it will be optimized for faster searches). The AUTOINCREMENT means that an INTEGER type value will automatically be assigned on each INSERT and the value automatically incremented by one. This saves a lot of work for the programmer in maintaining separate counts. Note that the AUTOINCREMENT "keyword" is not actually used, rather it is implied from a type/constraint combination of INTEGER PRIMARY KEY. This is a not so obvious quirk of the SQLite documentation that trips up enough people for it to appear at the top of the SQLite FAQ list!
UNIQUE means that the value must be unique within the column. If we try to insert an existing value into a column with a UNIQUE constraint then an error results and the row will not be inserted. UNIQUE is often used for non INTEGER type PRIMARY KEY columns.
DEFAULT is always accompanied by a value. The value is what SqlLite will insert into that field if the user does not explicitly provide one. The effect of this is that columns with a DEFAULT constraint are in practice very rarely NULL, to create a NULL value you would need to explicitly set NULL as the value.
We can see a quick example showing the use of DEFAULT here:
sqlite> CREATE TABLE test ...> (id Integer PRIMARY KEY, ...> Name NOT NULL, ...> Value Integer DEFAULT 42); sqlite> INSERT INTO test (Name, Value) VALUES ('Alan',24); sqlite> INSERT INTO test (Name) VALUES ('Heather'); sqlite> INSERT INTO test (Name,Value) VALUES ('Linda', NULL); sqlite> SELECT * FROM test; 1|Alan|24 2|Heather|42 3|Linda| sqlite>
Notice how the entry for Heather has the default value set? And also that the value for Linda is non existent, or NULL. That is an important difference between NOT NULL and DEFAULT. The former will not allow NULL values either by default or explicitly. The DEFAULT constraint prevents unspecified NULLs but does not prevent deliberate creation of NULL values.
You might have noticed that I used the shorthand character * in place of a field list in the last SELECT statement. This is a simple way to fetch all the fields in the table. It is great for experimenting like this but should not be used in production programs because any change in the data structure could change the data results and break any code that relies on the specific number, or order, of fields retrieved.
There are also constraints that can be applied to the table itself but we will not be discussing those in any depth in this tutorial.
The other kind of constraint that we can apply, as already mentioned, is to specify the column type. This is exactly like the concept of types in a programming language and the valid set of types in SQLite are:
These should be self evident with the possible exceptions of NUMERIC and BLOB. NUMERIC allows the storage of floating-point numbers as well as integers. BLOB is for storing binary data such as images or other non-textual documents. (It is, however, often better to store such items as separate files and only hold a reference to them in the database.) NULL is not really a type but simply indicates that, as we did above, you don't need to specify a type at all. Most databases come with a much wider set of types including, crucially, a DATE type, however as we are about to see, SQLite has a somewhat unconventional approach to types which renders such niceties less relevant.
The SQL standard has been created by a committee formed from all of the various database vendors. as such the list of types is extensive to cater for all the different types the vendors allow (and were defined before the standard). SQLite supports many of these at a syntactical level but in practice, they are simply aliased to the best native equivalent type. Thus, for example, an Oracle VARCHAR type is an alias for a SQLite TEXT type. The idea is that existing SQL scripts can be ported to SQLite with the minimum of changes.
Most databases strictly apply the types specified. However, SQLite employs a more dynamic scheme, where the type specified is more like a hint, and any type of data can be stored in the table. When data of a different type is loaded into a field then SQLite will use the declared type to try and convert the data, but if it cannot be converted it will be stored in its original form. Thus if a field is declared as INTEGER but the TEXT value '123' is passed in, SQLite will convert the string '123' to the number 123. But if the TEXT value 'Freddy' is passed in the conversion will fail so SQLite will simply store the string 'Freddy' in the field! This can cause some strange behaviour if you are not aware of this foible. Most databases treat the type declaration as a strict constraint and will fail if an illegal value is passed.
So how do these constraints help us to model data and, in particular, relationships? Let's look again at our original simple two-table database:
EmpID | Name | HireDate | Grade | ManagerID |
---|---|---|---|---|
1020304 | John Brown | 20030623 | Foreman | 1020311 |
1020305 | Fred Smith | 20040302 | Labourer | 1020304 |
1020307 | Anne Jones | 19991125 | Labourer | 1020304 |
SalaryID | Grade | Amount |
---|---|---|
000010 | Foreman | 60000 |
000011 | Labourer | 35000 |
The ID value should be of INTEGER type and have a PRIMARY KEY constraint, the other columns, with the possible exception of the ManagerID should be NOT NULL. ManagerID should also be of type INTEGER.
For the Salary table we see that again the SalaryID should be an INTEGER with PRIMARY KEY. The Amount column should also be an INTEGER and we will apply a DEFAULT value of 10000. Finally the Grade column will be constrained as Unique since we don't want more than one salary per grade! (Actually, this is a bad idea since normally salary varies with things like length of service as well as grade, but we'll ignore such niceties! In fact, in the real world, we probably should call this a Grade table and not Salary...)
The modified SQL looks like this:
sqlite> CREATE TABLE Employee ( ...> EmpID INTEGER PRIMARY KEY, ...> Name NOT NULL, ...> HireDate NOT NULL, ...> Grade NOT NULL, ...> ManagerID INTEGER ...> ); sqlite> CREATE TABLE Salary ( ...> SalaryID INTEGER PRIMARY KEY, ...> Grade UNIQUE, ...> Amount INTEGER DEFAULT 10000 ...> );
You can try out these constraints by attempting to enter data that breaks them to see what happens. Hopefully, you see an error message!
One thing to point out here is that the INSERT statements we used previously are no longer adequate. We previously inserted our own values for the ID fields but these are now autogenerated so we can (and should!) miss them out of the inserted data. But this gives rise to a new difficulty. How can we populate the managerID field if we don't know what the EmpID of the manager is? The answer is we can use a nested SELECT statement. I've chosen to do this in two stages using NULL fields initially and then using an update statement after creating all the rows.
To avoid a lot of repeat typing I've put all of the commands in a couple of files which I called employee.sql for the table creation commands and employee.dat for the insert statements. (This is just the same as creating a python script file ending in .py to save typing everything at the >>> prompt.)
The employee.sql file looks like this:
DROP TABLE IF EXISTS Employee; CREATE TABLE Employee ( EmpID INTEGER PRIMARY KEY, Name NOT NULL, HireDate NOT NULL, Grade NOT NULL, ManagerID INTEGER ); DROP TABLE IF EXISTS Salary; CREATE TABLE Salary ( SalaryID INTEGER PRIMARY KEY, Grade UNIQUE, Amount INTEGER DEFAULT 10000 );
Notice that I drop the tables before creating them. The DROP TABLE command, as mentioned earlier, simply deletes the table and any data within it. This ensures the database is in a nice clean state before we start creating our new table. We also added an IF EXISTS clause which simply prevents us from trying to drop a table that has already been dropped.
The employee.dat script looks like this:
INSERT INTO Employee (Name, HireDate, Grade, ManagerID) VALUES ('John Brown','20030623','Foreman', NULL); INSERT INTO Employee (Name, HireDate, Grade, ManagerID) VALUES ('Fred Smith','20040302','Labourer',NULL); INSERT INTO Employee (Name, HireDate, Grade, ManagerID) VALUES ('Anne Jones','19991125','Labourer',NULL); UPDATE Employee SET ManagerID = (SELECT EmpID FROM Employee WHERE Name = 'John Brown') WHERE Name = 'Fred Smith' OR Name = 'Anne Jones'; INSERT INTO Salary (Grade, Amount) VALUES('Foreman','60000'); INSERT INTO Salary (Grade, Amount) VALUES('Labourer','35000');
Notice the use of the embedded SELECT statement in the UPDATE command and also the fact that I've used a single UPDATE to modify both employee rows by using a boolean OR condition. By extending this OR I can easily add more employees with the same manager.
This is typical of the problems you can have when populating a database for the first time. You need to plan the order of the statements carefully to ensure that for every row that needs to contain a reference value to another table that you have already provided the data for it to reference! It's a bit like starting at the leaves of a tree and working back to the trunk. Always create/insert the data with no references first, then the data that references that data and so on. If you are adding data after the initial creation you will need to use queries to check the data you need already exists, and add it if it doesn't. At this point a scripting language like Python becomes invaluable!
Finally, we can run these from the sqlite prompt like this:
sqlite> .read employee.sql sqlite> .read employee.dat
Make sure you have the path issues sorted out though: either run SQLite from wherever the SQL scripts live (as I've done above) or provide the full path to the script.
Now we'll try a query to check that everything is as it should be:
sqlite> SELECT Name FROM Employee ...> WHERE Grade IN ...> (SELECT Grade FROM Salary WHERE amount >50000) ...> ; John Brown
That seems to have worked, John Brown is the only employee earning over $50000. Notice that we used an IN condition combined with another embedded SELECT statement. This is a variation on a similar query that we performed above using a cross table join. Both techniques work but usually the join approach will be faster.
One scenario we haven't discussed is where two tables are linked in a many to many relationship. That is, a row in one table can be linked to several rows in a second table and a row in the second table can at the same time be linked to many rows in the first table.
Consider an example. Imagine we are writing a database to support a book publishing company. They need lists of authors and lists of books. Each author will write one or more books. Each book will have one or more authors. How do we represent that in a database? The solution is to represent the relationship between books and authors as a table in its own right. Such a table is often called an intersection table or a mapping table. Each row of this table represents a book/author relationship. Now each book has potentially many book/author relationships but each relationship only has one book and one author, so we have converted a many to many relationship into two one to many relationships. And we already know how to build those using IDs. Let's see it in practice:
DROP TABLE IF EXISTS author; CREATE TABLE author ( ID INTEGER PRIMARY KEY, Name TEXT NOT NULL ); DROP TABLE IF EXISTS book; CREATE TABLE book ( ID INTEGER PRIMARY KEY, Title TEXT NOT NULL ); DROP TABLE IF EXISTS book_author; CREATE TABLE book_author ( bookID INTEGER NOT NULL, authorID INTEGER NOT NULL ); INSERT INTO author (Name) VALUES ('Jane Austin'); INSERT INTO author (Name) VALUES ('Grady Booch'); INSERT INTO author (Name) VALUES ('Ivar Jacobson'); INSERT INTO author (Name) VALUES ('James Rumbaugh'); INSERT INTO book (Title) VALUES('Pride & Prejudice'); INSERT INTO book (Title) VALUES('Emma'); INSERT INTO book (Title) VALUES('Sense & Sensibility'); INSERT INTO book (Title) VALUES ('Object Oriented Design with Applications'); INSERT INTO book (Title) VALUES ('The UML User Guide'); INSERT INTO book_author (BookID,AuthorID) values ( (SELECT ID FROM book WHERE title = 'Pride & Prejudice'), (SELECT ID FROM author WHERE Name = 'Jane Austin') ); INSERT INTO book_author (BookID,AuthorID) VALUES ( (SELECT ID FROM book WHERE title = 'Emma'), (SELECT ID FROM author WHERE Name = 'Jane Austin') ); INSERT INTO book_author (BookID,AuthorID) VALUES ( (SELECT ID FROM book WHERE title = 'Sense & Sensibility'), (SELECT ID FROM author WHERE Name = 'Jane Austin') ); INSERT INTO book_author (BookID,AuthorID) VALUES ( (SELECT ID FROM book WHERE title = 'Object Oriented Design with Applications'), (SELECT ID FROM author WHERE Name = 'Grady Booch') ); INSERT INTO book_author (BookID,AuthorID) VALUES ( (SELECT ID FROM book WHERE title = 'The UML User Guide'), (SELECT ID FROM author WHERE Name = 'Grady Booch') ); INSERT INTO book_author (BookID,AuthorID) VALUES ( (SELECT ID FROM book WHERE title = 'The UML User Guide'), (SELECT ID FROM author WHERE Name = 'Ivar Jacobson') ); INSERT INTO book_author (BookID,AuthorID) VALUES ( (SELECT ID FROM book WHERE title = 'The UML User Guide'), (SELECT ID FROM author WHERE Name = 'James Rumbaugh') );
Now we can try some queries to see how it works. Let's see which Jane Austin books we publish:
sqlite> SELECT title FROM book, book_author ...> WHERE book_author.bookID = book.ID ...> AND book_author.authorID = (SELECT ID FROM author ...> WHERE Name = "Jane Austin");
It's getting a wee bit more complex but if you sit and work through it you'll get the idea soon enough. Notice you need to include both of the referenced tables, book and book_author, in the table list after the SELECT. (The third table author is not listed there because it is listed against its own embedded SELECT statement.) Let's try it the other way around, Let's see who wrote 'The UML User Guide':
sqlite> SELECT Name FROM author, book_author ...> WHERE book_author.authorID = author.ID ...> AND book_author.bookID = (SELECT ID FROM book ...> WHERE title = "The UML User Guide");
If you look closely you will see that the structure of the two queries is identical, we just swapped around the table and field names a little.
That's enough for that example, I'm now going to return to our Address Book example that we last considered in the topic on file handling. You might want to review that before reading on to see how we convert it from file-based storage to a full database.
In the file-based version of the address book we used a dictionary with the contact name as key and the address as a single data item. That works fine if we always know the name we want or we always want the full address details. But what if we want all of our contacts in a particular town? Or all the people called 'John'? We could write specific Python code for each query but as the number of special queries rises the amount of effort gets to be a serious disincentive. This is where a database approach pays dividends with the ability to create queries dynamically using SQL.
So what does our address book look like as a database? Basically it is a single table. We could split the data into address and person and link them - after all you may have several friends living in the same house, but we will stick with our original design and use a simple table.
One thing that we will do is split the data into several fields. Rather than a simple name and address structure we will split the name into first and last names, and the address into its constituent parts. There has been a lot of study into the best way to do this and no definitive answer, but the one thing everyone agrees on is that single field addresses are a bad idea - they are just too inflexible. Let's list the fields of our database table and the constraints that we want to apply:
Field Name | Type | Constraint |
---|---|---|
First Name | String | Primary Key |
Last Name | String | Primary Key |
House Number | String | NOT NULL |
Street | String | NOT NULL |
District | String | |
Town | String | NOT NULL |
Post Code | String | NOT NULL |
Phone Number | String | NOT NULL |
Some points to note:
Going back to the first point, that we have two primary keys. This is not allowed in SQL but what we can do is take two columns and combine them into what is called a composite key which allows them to be treated as a single value so far as identifying a row is concerned. Thus we could add a line at the end of our create table statement which combined FirstName and LastName as a single Primary Key. It would look something like this:
CREATE TABLE address ( FirstName NOT NULL, LastName NOT NULL, ... PhoneNumber NOT NULL, PRIMARY KEY (FirstName,LastName) );
Notice the last line which lists the columns we want to use as the composite key. (This is actually an example of a table based constraint.)
However, thinking about this, it isn't really such a good idea since, if we know two people with the same name, we could only store one of them. We'll deal with this by defining an integer primary key field to uniquely identify our contacts, even though we will rarely if ever use it in a query.
We know how to declare an INTEGER PRIMARY KEY constraint, we did that for our employee example.
We can turn that straight into a SQLite data creation script, like this:
-- drop the tables if they exist and recreate them afresh
-- use constraints to improve data integrity
DROP TABLE IF EXISTS address;
CREATE TABLE address (
ContactID INTEGER PRIMARY KEY,
First NOT NULL,
Last NOT NULL,
House NOT NULL,
Street NOT NULL,
District,
Town NOT NULL,
PostCode NOT NULL,
Phone NOT NULL
);
The first two lines are simply comments. Like the # symbol in Python, anything following a double dash (--) is considered a comment in SQL.
Notice that I have not defined the type because TEXT is the default in SQLite, if we needed to convert, or port in computer speak, this schema, or table layout, to some other database we would probably need to go back and add the type information.
The next step is to load some data into the table ready to start performing queries. I'll leave that as an exercise for the reader (using the insertion code above as a template), but I will be using the following data set in the following examples:
First | Last | House | Street | District | Town | PostCode | Phone |
---|---|---|---|---|---|---|---|
Anna | Smith | 42 | Any Street | SomePlace | MyTown | ABC123 | 01234 567890 |
Bob | Builder | 17 | Any Street | SomePlace | MyTown | ABC234 | 01234 543129 |
Clarke | Kennit | 9 | Crypt Drive | Hotspot | Metropolis | ABC345 | 01234 456459 |
Dave | Smith | 42 | Any Street | SomePlace | MyTown | ABC123 | 01234 567890 |
Dave | Smith | 12A | Double Street | AnyTown | DEF174 | 01394 784310 | |
Debbie | Smith | 12A | Double Street | AnyTown | DEF174 | 01394 784310 |
Now we have some data let's play with it and see how we can use the power of SQL to extract information in ways we couldn't even dream of with our simple file-based Python dictionary.
This is a fairly straightforward SQL query made simple by the fact that we have broken our address data into separate fields. If we had not done that we would have had to write string parsing code to extract the street data which is much more complex. The SQL query we need looks like this:
sqlite> SELECT First,Last FROM Address ...> WHERE Street = "Any Street";
Again this is a fairly straightforward SELECT/WHERE SQL expression:
sqlite> SELECT First,Last FROM Address ...> WHERE Last = "Smith";
Again a straightforward query except that we get multiple results back.
sqlite> SELECT First,Last, Phone FROM Address ...> WHERE First LIKE "Dav%";
Notice we used LIKE in the WHERE clause. This uses a wild card style comparison and ignores case. (Notice that the SQL wild card is a percent symbol (%) rather than the more common asterisk (*).) As a result it is a looser match than equality which requires an exact match. Notice that if we had only used D% as the wildcard pattern we would also have selected Debbie.
This is a more complex query. We will need to select the entries which occur more than once. This is where the unique ContactID key comes into play:
sqlite> SELECT DISTINCT A.First, A.Last ...> FROM Address AS A, Address AS B ...> WHERE A.First = B.First ...> AND A.Last = B.Last ...> AND NOT A.ContactID = B.ContactID;
We use a few new features here. The first thing is that we provide alias names, A and B, to the tables in the FROM clause. (We could also provide aliases for our result values too, to save typing.) Also we use those aliases when referring to the result fields using the familiar dot notation. You can use an alias in any query but we are forced to do it here because we use the same table, Address, both times (thus joining it to itself!) so we need two aliases to distinguish the two instances in our where clause. We also introduce the DISTINCT keyword which results in us eliminating any duplicate results.
In summary, the query searches for rows with the same first and last names but different ContactIDs, it then eliminates duplicate results prior to displaying them.
As with the Python interactive prompt the SQLite interactive prompt is a powerful tool when developing more complex queries like this. You can start with a simple query and then build in the complexity as you go. For example, the last part of the query that I added was the DISTINCT keyword, even though it's the second word in the final query!
SQLite provides an Application Programmers Interface or API consisting of a number of standard functions that allow programmers to perform all the same operations that we have been doing using the interactive SQL prompt. The SQLite API is written in C but wrappers have been provided for other languages, including Python.
The first thing you need to work with a database is a connection. The name stems from the fact that most databases are server programs running on a central computer somewhere on your network and you need to connect to it (usually including logging in with a username and password)). SQLite is only a file on your filesystem but, for consistency, the API requires that you connect to it. In other words we open the database file.
When accessing a database from within a program one important consideration is how to access the multiple rows of data potentially returned by a SELECT statement. The answer is to use what is known in SQL as a cursor. A cursor is like a Python sequence in that it can be accessed one row at a time. Thus by extracting our data into a cursor and then using a loop to access the cursor we can process large collections of data.
It is important to understand that the cursor does not store all of the result data rather it just stores a reference to the result data which is held in a temporary table within the database. This is less significant for SQLite, which tends to be on the same computer as the program, but is important when dealing with a client/server network-attached database. The cursor allows you to fetch the data in small quantities and is thus potentially much faster and more memory efficient than copying all of the result data into your program at once. This is especially important when working with very large databases of many GB in size. However, it also means that if you are only going to process some of the result data at a time you should not execute any more queries using the same cursor or you will lose access to the original result set. You need to create a new cursor for the new query. In small programs, like the ones in this tutorial, you can safely just copy all of the data into a Python data structure and the problem goes away, but you need to be aware that in bigger projects you may need multiple cursors.
The documentation for the latest version of the Python DB API is found in the Database Topic Guide on the Python website. You should read this carefully if you intend doing any significant database programming using Python.
import sqlite3
I am not going to cover all of the DBI features just enough for us to connect to our database and execute a few queries and process the results. We'll finish off by rewriting our address book program to use the address book database instead of a text file.
>>> db = sqlite3.connect('address.db')
>>> cur = db.cursor() >>> cur.execute('SELECT * FROM address')
>>> print( cur.fetchall() )
And the results look like this:
[(1, 'Anna', 'Smith', '42', 'Any Street', 'SomePlace', 'MyTown', 'ABC123', '01234 567890'), (2, 'Bob', 'Builder', '17', 'Any Street', 'SomePlace', 'MyTown', 'ABC234', '01234 543129'), (3, 'Clarke', 'Kennit', '9', 'Crypt Drive', 'Hotspot', 'Metropolis', 'ABC345', '01234 456459'), (4, 'Dave', 'Smith', '42', 'Any Street', 'SomePlace', 'MyTown', 'ABC123', '01234 567890'), (5, 'Dave', 'Smith', '12A', 'Double Street', '', 'AnyTown', 'DEF174', '01394 784310')]
As you can see the cursor.fetchall() returns a list of tuples. This is very similar to what we started off with back in the raw materials topic! And we could simply use this list in our program as if we had read it from a file, using the database merely as a persistence mechanism. However, the real power of the database lies in its ability to perform sophisticated queries using SELECT.
I'm now going to present our address book example one more time. It's far from polished and is still command line based. You might like to add a GUI, remembering to refactor the code to separate function from presentation. (Later in the tutorial we will add a web interface too.)
I won't explain every detail of the code, by now it should mostly be self evident if you read through it. I will however, highlight a few points in bold and discuss them at the end.
############################### # Addressbook.py # # Author: A J Gauld # ''' Build a simple addressbook using the SQLite database and Python DB-API. ''' ############################### # set up the database and cursor import sqlite3 dbpath = "D:/DOC/Homepage/Tutor2/sql/" def initDB(path): try: db = sqlite3.connect(path) cursor = db.cursor() except sqlite3.OperationalError: print( "Failed to connect to database:", path ) db,cursor = None,None raise return db,cursor # Driver functions def addEntry(book): first = input('First name: ') last = input('Last name: ') house = input('House number: ') street = input('Street name: ') district = input('District name: ') town = input('City name: ') code = input('Postal Code: ') phone = input('Phone Number: ') query = '''INSERT INTO Address (First,Last,House,Street,District,Town,PostCode,Phone) VALUES (?,?,?,?,?,?,?,?)''' try: book.execute(query,(first, last, house, street, district, town, code, phone)) except sqlite3.OperationalError: print( "Insert failed" ) raise return None def removeEntry(book): name = input("Enter a name: ") names = name.split() first = names[0]; last = names[-1] try: book.execute('''DELETE FROM Address WHERE First LIKE ? AND Last LIKE ?''',(first,last)) except sqlite3.OperationalError: print( "Remove failed" ) raise return None def findEntry(book): validFields = ('first','last','house','street', 'district','town','postcode','phone') field = input("Enter a search field: ") value = input("Enter a search value: ") if field.lower() in validFields: query = '''SELECT first,last,house,street,district,town,postcode,phone FROM Address WHERE %s LIKE ?''' % field else: raise ValueError("invalid field name") try: book.execute(query, (value,) ) result = book.fetchall() except sqlite3.OperationalError: print( "Sorry search failed" ) raise else: if result: for line in result: print( line ) else: print("No matching data") return None def testDB(database): database.execute("SELECT * FROM Address") print( database.fetchall() ) return None def closeDB(database, cursor): try: cursor.close() database.commit() database.close() except sqlite3.OperationalError: print( "problem closing database..." ) raise # User Interface functions def getChoice(menu): print( menu ) choice = input("Select a choice(1-4): ") return choice def main(): theMenu = ''' 1) Add Entry 2) Remove Entry 3) Find Entry 4) Test database connection 9) Quit and save ''' try: theDB, theBook = initDB(dbpath + 'address.db') while True: choice = getChoice(theMenu) if choice == '9' or choice.upper() == 'Q': break if choice == '1' or choice.upper() == 'A': addEntry(theBook) elif choice == '2' or choice.upper() == 'R': removeEntry(theBook) elif choice == '3' or choice.upper() == 'F': try: findEntry(theBook) except ValueError: print("No such field name") elif choice == '4' or choice.upper() == 'T': testDB(theBook) else: print( "Invalid choice, try again" ) except sqlite3.OperationalError: print( "Database error, exiting program." ) # raise finally: closeDB(theDB,theBook) if __name__ == '__main__': main()
Note 1: We use a try/except clause to capture any database errors. Notice too that the error is a custom type defined within the sqlite3 module so we need to prefix it with the module name.
Note 2: We use the keyword raise after printing the error message. This results in the original exception being raised to the next level, which in our case is main where it is caught and another message printed. (More on that in a moment)
Note 3: In the query strings we use ? placeholders to hold the data variables. This is similar to the % markers used in string formatting but the insertion of the values is done as part of the execution of the query by book.execute where we pass the tuple of inserted values as a second argument. This has the advantage that some security checking of the input values is done and greatly improves the security of our code. If this were not done, or if we used standard string formatting, a nefarious (or very unlucky) user could type code as an input value, which would then get inserted in the query and, potentially, corrupt the database. This is known as an injection attack in security circles and is one of the more common forms of security violation on the web today.
Note 4: In the findEntry function we insert both the field name and search value as insertion fields which makes our search function much more versatile. Without this trick we would need a search function for every possible search criteria, which would be tedious, to put it mildly! There is a gotcha however, because SQLite's parameter mechanism only works for values, not for SQL elements such as field or table names. To get around this we need to use Python string formatting to insert the field name. To ensure that the query is safe we need to validate that the field name is one of the defined names before inserting it into the query. If the field is not valid we raise a standard Python exception of type ValueError. We then need to catch this in the main() function.
Also note that the query once again uses the LIKE search expression since it allows us to leverage SQL's % wildcard option in our search string.
Note 5: The closeDB function includes a call to commit. This forces the database to write all the changes in the current session back to the file, it can be thought of as being a little like the file.flush method. (Strictly speaking it terminates a transaction but the whole topic of transaction management is a bit beyond our scope.)
Note 6: The main function wraps everything in a try/except/finally construct. The except clause catches the exceptions raised by the lower level functions as mentioned above. But notice it also has a raise statement which has been commented out. The reason for this is that the full error traceback is a very useful debugging tool (although not very user friendly) and so, during development, we can uncomment the top level raise and we get the full traceback on screen. Then, when all the bugs have been dealt with, simply comment out that one line to restore the more user friendly display of the final program. This trick is not specific to databases, obviously, it can be used in any program where there are lots of potential errors being raised that we don't want to show our users but do want to see as developers.
Note 7: The finally clause in the main function is used to make sure that we always close the database cleanly, regardless of whether or not we encountered any errors. This minimises the risk of database corruption.
We mentioned above (in Note 3) that the use of the DB API ? markers instead of using regular % string formatting was a security device. You could use regular string formatting and the code would work and you may be tempted to try. However, in a world where security is ever more under attack I would urge you not to. Good habits like this are easily adopted and they will do a lot to keep your code secure. It's not a complete or perfect solution and you do lose visibility of the actual SQL query submitted to the database (which can be useful when debugging) but if there is any possibility of users entering rogue data this is the best way to defeat it.
We have used SQLite for our examples because it is freely available, easy to install and use, and fairly forgiving of mistakes. However, this simplicity means that many of the more advanced features found in more powerful packages are not available. In particular the text processing capabilities and the range of constraints available are quite limited. If you do find yourself confronted with a database like Oracle or IBM's DB2 it is well worthwhile taking the time to read the reference documentation, using the features of the database to best effect can significantly reduce the amount of custom code you need to write and often improves performance too.
Some of the advanced features you should look out for are described in the box below:
Most databases (including SQLite) feature Foreign Keys. These allow us to specify linkages or relationships between tables. The data consists of primary keys of other tables. In some cases the keys are allowed to be in other databases as well as from other tables in the current database. We haven't discussed cross-table joins in much detail here but this is one of the most commonly used features of relational databases as applications get larger.
Referential integrity is the ability to only allow data values in a column if they exist in another location. For example, in our employee database, we could have restricted the value in the Employee.Grade field to only allow values already defined in the Salary.Grade table. This is a very powerful tool in maintaining the consistency of data across the database and is especially important where, as with the grade columns, the values are used as keys for joining two tables. SQLite supports a limited form of referential integrity as a form of constraint but it requires some special processing and is less comprehensive than more powerful databases.
These are functions written in a proprietary programming language provided by the database vendor and stored in the database. The advantage of using a stored procedure is that these are compiled and so tend to be much faster than using the equivalent SQL commands. In addition, they save bandwidth by only requiring the function name and arguments to be sent from the client program. By being built into the server they allow us to build common behaviour, such as complex business rules, into the database where it can be shared by all applications consistently. The disadvantage is that they are proprietary and if we want to change database vendor then all stored procedures will need to be rewritten, whereas standard SQL will work almost unchanged on any database. SQLite does not support any form of stored procedure.
These are virtual tables composed of data from real tables. They could be a subset of the data in another table, to simplify browsing or, more commonly, a few columns from one table and a few from another joined by some key. You can think of them as being a SQL query permanently being executed and the result stored in the view. The view will change as the underlying data changes. Some databases only allow you to read data in a view but most will permit updates as well. Views are often used to implement data partitioning such that one user can only see the subset of data that is relevant to her. (Note that SQLite does support views, we just didn't cover them above)
If a cascaded delete is set up between two data items it means that when the master item is deleted all subordinate items will also be deleted. One of the most common examples is an order. An order will normally comprise several ordered items plus the order itself. These will typically be stored in two separate tables. If we delete the order we also want to delete the order items. Cascaded deletes are normally configured in the database DDL statements used to create the database schema. They are a type of constraint. SQLite does not support cascaded deletes.
Triggers are somewhat like events. A trigger is a piece of SQL that is executed when a specified action occurs, for example, when a new row is inserted in a given table. Triggers are very useful for checking data validity beyond simple type and values. Triggers are especially useful when linked to stored procedures. The downside is that triggers can be abused and become a real resource hog and slow down the database dramatically, so use with care. SQLite supports very basic, pure SQL, triggers.
Some databases permit a wide variety of data types to be stored. In addition to the usual number, character, date and time data, there may be network addresses, Binary Large Objects (known as BLOBs) for image files etc. Another common type of data is the fixed precision decimal type used for financial data, this avoids the errors from rounding found with traditional floating point numbers. SQLite does support BLOBs but not the other advanced types.
Finally, if you do want to explore some more sophisticated uses of SQLite there is an excellent tutorial here. With the foundation material above imprinted on your mind you should have no problem following along.
Points to remember