

Thursday, May 19, 2011

More news from the shed…

[Image: CWACResults2011 - ward map of the Cheshire West and Chester election results]

In the month of May I seem to find myself playing with maps and numbers.

To the uninvolved this may appear rather similar to my earlier “That’s nice dear” post; however, the technology involved here is quite different.

This post is about extracting the results of the local elections held on 5th May from the Cheshire West and Chester website and displaying them as a map. I could have manually transcribed the results from the website, which would probably have been quicker, but where’s the fun in that?

The starting point for this exercise was noticing that the results pages have a little icon at the bottom saying “OpenElectionData”. This was part of an exercise to make local election results more easily machine-readable in order to build a database of results from across the country; somewhat surprisingly, there is no public central record of local council election results. The technology used to provide machine access to the results is known as RDF (Resource Description Framework), a way of giving web pages “meaning” that machines can understand - it is related to talk of the semantic web. The good folks at Southampton University have provided a browser which allows you to inspect the RDF contents of a webpage; I used this to get a human sight of the data I was trying to read.

RDF content ultimately amounts to triples of information: “subject”, “predicate”, “object”. In the case of an election, one triple has a subject of “specific ward identifier”, a predicate of “a list of candidates” and an object of “candidate 1; candidate 2; candidate 3…”. Further triples specify whether a candidate was elected, how many votes they received and the party to which they belong.

I’ve taken to programming in Python recently, in particular using the Python(x,y) distribution which packages together an IDE with some libraries useful to scientists. This is the sort of thing I’d usually do with Matlab, but that costs (a lot) and I no longer have access to it at home.

There is a Python library for reading RDF data, called RDFlib; unfortunately most of the documentation is for version 2.4 while the working version I downloaded is 3.0. Searching for documentation for the newer version normally leads to other sites where people are asking where the documentation for version 3.0 is!
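A minimal sketch of reading triples with RDFlib looks something like the following - the ward URL and the predicate are made-up placeholders rather than the real OpenElectionData identifiers, so treat this as an illustration of the API rather than working scraping code:

from rdflib import Graph, URIRef

g = Graph()
# Parse the RDF published for a ward results page (placeholder URL).
g.parse("http://example.org/election/ward/upton", format="xml")

# Every statement in the graph is a (subject, predicate, object) triple.
for subject, predicate, obj in g:
    print(subject, predicate, obj)

# Or pull out just the triples that use one predicate of interest.
CANDIDATE = URIRef("http://example.org/openelectiondata/candidate")
for ward, _, candidate in g.triples((None, CANDIDATE, None)):
    print(ward, candidate)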

The base maps come from the Ordnance Survey, specifically the Boundary Line dataset which contains administrative boundary data for the UK in ESRI Shapefile format. This format is widely used in geographical information work; I found the PyShp library from GeospatialPython.com to be a well-documented and straightforward way to read it, and the site also has some nice usage examples. I did look for a library to display the resulting maps, but after a brief search I adapted the simple methods here for drawing maps using matplotlib.
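Reading and drawing the ward polygons only takes a few lines with PyShp and matplotlib; the sketch below assumes a shapefile name from the Boundary Line download and handles multi-part polygons only crudely:

import shapefile                 # the PyShp library
import matplotlib.pyplot as plt

sf = shapefile.Reader("district_borough_unitary_ward_region.shp")   # assumed filename

for shape in sf.shapes():
    # shape.parts holds the index at which each ring of the polygon starts.
    parts = list(shape.parts) + [len(shape.points)]
    for start, end in zip(parts[:-1], parts[1:]):
        xs = [point[0] for point in shape.points[start:end]]
        ys = [point[1] for point in shape.points[start:end]]
        plt.fill(xs, ys, facecolor="lightgrey", edgecolor="black")

plt.axis("equal")
plt.show()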

The Ordnance Survey Open Data site is a treasure trove for programming cartophiles: along with maps of the UK of various types there’s a gazetteer of interesting places, topographic information and location data for UK postcodes.

The map at the top of the page uses the traditional colour-coding of red for Labour and blue for Conservative. Some wards elect multiple candidates; in those where the elected councillors are not all from the same party, purple is used to show a Labour/Conservative combination and orange a Labour/Liberal Democrat combination.

In contrast to my earlier post on programming, the key element here is the use of pre-existing libraries and data formats to achieve an end result. The RDF component of the exercise took quite a while, whilst the mapping part was the work of a couple of hours; this largely comes down to the quality of the documentation available. Python turns out to be a compact language for this sort of work: it’s all done in 150 or so lines of code.

It would have been nice to have pointed my program at a single webpage and for it to find all the ward data from there, including the ward names, but I couldn’t work out how to do this – the program visits each ward in turn and I had to type in the ward names. The OpenElectionData site seemed to be a bit wobbly too, so I encoded party information into my program rather than pulling it from their site. Better fitting of the ward labels into the wards would have been nice too (although this is a hard problem). Obviously there’s a wide range of analysis that could be carried out on the underlying electoral data.

Footnotes

The Python code to do this analysis is here. You will need to install the rdflib and PyShp libraries and download the OS Boundary Line data. I used the Python(x,y) distribution, but I think it’s just the matplotlib library which is required. The CWac.py program extracts the results from the website and writes them to a CSV file; the Mapping.py program makes a map from them. You will need to adjust file paths to suit your installation.

Sunday, April 03, 2011

Obsession

This is a short story about obsession: with a map, four books and some numbers.

My last blog post was on Ken Alder's book "The Measure of All Things" on the surveying of the meridian across France, through Paris, in order to provide a definition for a new unit of measure, the metre, during the period of the French Revolution. Reading this book I noticed lots of place names being mentioned, and indeed the core of the whole process of surveying is turning up at places and measuring the angles to other places in a process of triangulation.

To me places imply maps, and whilst I was reading I popped a few of the places into Google Maps but this was unsatisfactory to me. Delambre and Mechain, the surveyors of the meridian, had been to many places. I wanted to see where they all were. Ken Alder has gone a little way towards this in providing a map: you can see it on his website but it's an unsatisfying thing: very few of the places are named and you can't zoom into it.

In my investigations for the last blog post, I discovered the full text of the report of the surveying mission, "Base du système métrique décimal", was available online and flicking through it I found a table of all 115 triangles used in determining the meridian. So a plan is formed: enter the names of the stations forming the 115 triangles into a three column spreadsheet; determine the latitude and longitude of each of these stations using the Google Maps API; write these locations out into a KML file which can be viewed in Google Maps or Google Earth.
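Writing the KML itself is the straightforward part of the plan; a bare-bones sketch (with made-up station coordinates, purely for illustration) might look like this:

stations = [("Dunkerque", 51.03, 2.37),
            ("Paris, Pantheon", 48.85, 2.35)]   # (name, latitude, longitude) - illustrative values only

with open("meridian_stations.kml", "w") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<kml xmlns="http://www.opengis.net/kml/2.2"><Document>\n')
    for name, lat, lon in stations:
        # KML coordinates are longitude,latitude,altitude
        f.write("<Placemark><name>%s</name>"
                "<Point><coordinates>%f,%f,0</coordinates></Point>"
                "</Placemark>\n" % (name, lon, lat))
    f.write("</Document></kml>\n")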

The problem is that place names are not unique and things have changed in the last 200 years. I have spent hours transcribing the tables, hunting down the names of obscure places in rural France and hacking away with Python, and I have loved every minute of it. Cassini's earlier map of France is available online but the navigation is rather clumsy so I didn't use it, although now I come to writing this I see someone else has made a better job of it.

Beside three entries in the tables of triangles are the words: "Ce triangle est inutile" - "This triangle is useless". Instantly I have a direct bond with Delambre, who wrote those words 200 years ago - I know that feeling: in my loft is a sequence of about 20 lab books I used through my academic career, and I know that beside an (unfortunately large) number of results the word "Bollocks!" is scrawled for very similar reasons.

The scheme with the Google Maps API is that your program provides a place name, "Chester, UK" for example, and the API provides you with the latitude and longitude of the point requested. Sometimes this doesn't work, either because there are several places with the same name or because the place name is not in the database.
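A hedged sketch of the lookup, using the current Google Geocoding web service (which wants an API key, and is not necessarily the API as it was when this was written):

import json
import urllib.parse
import urllib.request

def geocode(place, api_key):
    params = urllib.parse.urlencode({"address": place, "key": api_key})
    url = "https://maps.googleapis.com/maps/api/geocode/json?" + params
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    if data["status"] != "OK":        # e.g. ZERO_RESULTS for an obscure hamlet
        return None
    location = data["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]

print(geocode("Chester, UK", "YOUR_API_KEY"))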

I did have a genuine Eureka moment: after several hours trying to find missing places on the map I had a bath, and whilst there I had an idea: Google Earth supports overlay images on its maps. At the back of "Base du système métrique décimal" there is a set of images showing where the stations are as a set of simple line diagrams. Surely I could overlay the images from Base onto Google Earth and find the missing stations? I didn't leap straight from the bath, but I did stay up deep into the night overlaying images onto maps. It turns out the diagrams are not at all bad for finding missing stations. This manual fiddling to sort out errant stations is intellectually unsatisfying, but some things are just quicker to do by hand!

You can see the results of my fiddling by loading this KML file into Google Earth; if you're really keen, this is a zip file containing the image overlays from "Base du système métrique décimal" - they match up pretty well given they are photocopies of diagrams subject to limitations in the original drawing and distortion by scanning.

What have I learned in this process?
  • I've learnt that although it's possible to make dictionaries of dictionaries in Python, it is not straightforward to pickle them.
  • I've enjoyed exploring the quiet corners of France on Google Maps.
  • I've had a bit more practice using OneNote, Paint .Net, Python and Google Earth so when the next interesting thing comes along I'll have a head start.
  • Handling French accents in Python is a bit beyond my wrangling skills.
You've hopefully learnt something of the immutable mind of a scientist!





Saturday, March 19, 2011

Inordinately fond of bottles...

J.B.S. Haldane, when asked “What has the study of biology taught you about the Creator, Dr. Haldane?”, replied:
“I’m not sure, but He seems to be inordinately fond of beetles.”
The National Museum of Science & Industry (NMSI) has recently released a catalogue of its collection in easily readable form; you can get it here. The data includes descriptions, types of object, date made, materials, sizes and place made - although not all objects have data for all of these items. Their intention was to give people an opportunity to use the data - now who would do such a thing?

The data comes in four 16 MB CSV files plus a couple of other smaller ones covering the media library (pictures) and a small "events" library; I've focussed on the main catalogue. You can load these files individually into Microsoft Excel, but each one has about 65536 rows so they're a bit of a pain to use; alternatively you can upload them to a SQL database, which turns out to be exceedingly whizzy! I wrote a few blog posts about SQL a while back as I learnt about it, and this is my first serious attempt to use it. Essentially SQL allows you to ask nearly-human-language questions of big datasets, like this:

USE sciencemuseum;
SELECT collection,
       COUNT(collection)
FROM   sciencemuseum.objects
GROUP  BY collection
ORDER  BY COUNT(collection) DESC
LIMIT  0, 11000; 



This gets you a list of all the collections inside the Science Museum's catalogue (there are 162) and tells you how many objects are in each of them. Collections have names like "SRM - Acoustics" and "NRM - Railway Timepieces"; the NMSI incorporates the National Railway Museum (NRM) and the National Media Museum (NMEM) as well as the Science Museum (SCM) - hence the first three letters of the collection name. I took the collection data and fed it into Many Eyes to make a bubble chart:


The size of each bubble shows how many objects are in a particular collection; you can see that a majority of the major collections are medical-related.

So what's in these collections? As well as longer descriptions, many objects are classified into a more limited number of types. This bubble chart shows the number of objects of each type:


This is where we learn that the Science Museum is inordinately fond of bottles (or jars, or specimen jars, or albarellos, or "shop rounds"). There are also a lot of prints and posters, from the National Railway Museum. This highlights a limitation of this type of approach: the fact that there are many of an object tells you little. It perhaps tells you how pervasive medicine has been in science - it is the visible face of science and has been for many years.

I have also plotted when the objects in the collection were made:

This turns out to be slightly tricky since, over the years, different curators have had different ideas about how *exactly* to describe the date when an object was made. Unsurprisingly, in the 19th century they probably didn't consider that a computer would be able to process 200,000 records in a quarter of a second but simultaneously be unable to understand that circa 1680, c. 1680, c1680, ca 1680 and ca. 1680 all mean the same thing. The chart shows a number of objects in the first few centuries AD, followed by a long break and a gradual rise after 1600 - the period of the Scientific Revolution. The pace picks up once again at the beginning of the 19th century.
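The cleaning step is the sort of thing a short function handles; a sketch that deals with just the "circa" variants mentioned above might look like this:

import re

def extract_year(date_made):
    """Pull a year out of a free-text 'date made' field, ignoring 'circa' prefixes."""
    if not date_made:
        return None
    text = date_made.lower()
    text = re.sub(r"\b(circa|ca\.?|c\.?)\s*", "", text)   # circa 1680, c. 1680, c1680, ca 1680, ca. 1680
    match = re.search(r"\b(\d{3,4})\b", text)
    return int(match.group(1)) if match else None

for raw in ["circa 1680", "c. 1680", "c1680", "ca 1680", "ca. 1680", "1857-1860"]:
    print(raw, "->", extract_year(raw))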

I also made a crack at plotting where all the objects originating in the UK came from. On a PC this is a live, zoomable Google Map; beneath the red bubbles are disks sized in proportion to the number of objects from that location:

From this I learnt that there was a Pilkingtons factory in St Asaph, and a man in Chirk made railway models. To me this is the value of programming: the compilers of the catalogue made decisions as to what they included, but once it is in my hands I can look into the catalogue according to my own interests. I can explore in my own way; if I were a better programmer I could perhaps present you with a slick interface to do the same.

Finally for this post, I tried to plot when the objects arrived at the museum. This was a bit tricky: for about 60% of the objects the reference number contains the year of acquisition as its first four characters, so I just have the data for those - the chart is below.
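A sketch of that extraction step follows; the column name "object_number" is a guess at the catalogue's CSV header rather than the real one:

import csv
from collections import Counter

years = Counter()
with open("objects.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        prefix = (row.get("object_number") or "")[:4]   # first four characters of the reference number
        if prefix.isdigit() and 1857 <= int(prefix) <= 2011:
            years[int(prefix)] += 1

for year in sorted(years):
    print(year, years[year])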


The Science Museum started in 1857; the enormous spike in 1889 is due to the acquisition of the collection of Sir John Percy on his death, which I discovered on the Science Museum website. Actually, I'd like to commend the whole Science Museum site to you - it's very nice.

I visited the Science Museum a number of times in my childhood; I must admit to preferring it to the Natural History Museum, which seemed overwhelmingly large. The only record I have of these visits is this picture from a German exchange visit to the museum in 1985:

I must admit to not being a big fan of museums and galleries: they make my feet ache, and either I can't find what I'm looking for, or I don't know what I'm looking for, or there never seems to be enough information on the things I'm looking at. This adventure into the data is my way of visiting a museum; I think I'll spend a bit more time wandering around it.

I had an alternative title for people who had never heard of J.B.S. Haldane: "It's full of jars"

Footnote
If the Many Eyes visualisations above don't work, you can see them in different formats from my profile page.

Saturday, February 19, 2011

Photographs, videos and GPS

[Image: 02 February Westendorf]

This post is in part a memory aid, but it may be interesting to other amateur photographers and organisational obsessives.

My scheme for holidays and walks out is to take cameras (Canon 400D, Casio Exilim EX-S10), sometimes a video camera (Canon Legaria FS200) and a Garmin GPS 60 which I use to provide information for geotagging photos rather than navigation, although I once used it as an altimeter to find the top of a cloud covered Lake District mountain. Geotagging is the process of labelling a camera image with the location at which it was taken.

I save images as JPEG; I should probably use RAW format on the SLR, but the workflow is more complicated and I rarely do anything particularly advanced with images after I’ve taken them other than cropping, straightening and a little fiddling with contrast. Once home I save all the images from a trip to a directory whose name is as follows:

Z:\My Pictures\[year]\[sequence number] – [description] – [date]

So for my recent skiing trip:

Z:\My Pictures\2011\003 - Hinterglemm – 29jan11

I leave the image file names unaltered. Padding the sequence number with zeroes helps with sorting. The idea of this is that I can easily find photos of a particular trip just using the “natural” ordering of the file system; I don’t rely on 3rd party software and I’m fairly safe from the file system playing sneaky tricks with creation dates. The Z: drive on my system is network attached storage, so it can be accessed from both my desktop and laptop computers. I back this up to the D: drive on my desktop PC using Syncback, and I also copy it periodically to a portable drive which I keep at work. Syncback synchronises the files in two directories; I use this in preference to “proper” backup because it doesn’t leave my files in a big blob of an opaque backup format (I got burnt by this when using NTbackup in the past). The drawback is that I can’t go back to a snapshot in time, but I’ve never felt the need to do this.

In addition to the images, I also save the GPS track in GPX format to the same directory; this is downloaded and converted using Mapsource, which is Garmin’s interfacing software. GPX is a format based on XML, so it is easy to read programmatically and even by humans. I do little inside Mapsource other than converting and, for a multi-session trip, stitching all the tracks together into a single file. Another handy tool in this area is GPSBabel, which converts GPS data between a multitude of formats.
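GPX really is easy to read programmatically; a small sketch with the standard library (the filename is just an example, and GPX 1.1 keeps everything in the namespace shown):

import xml.etree.ElementTree as ET
from datetime import datetime

NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

def read_track(path):
    """Return a list of (timestamp, latitude, longitude) tuples from a GPX track file."""
    points = []
    root = ET.parse(path).getroot()
    for trkpt in root.findall(".//gpx:trkpt", NS):
        lat = float(trkpt.get("lat"))
        lon = float(trkpt.get("lon"))
        time_text = trkpt.find("gpx:time", NS).text      # e.g. 2011-01-29T10:15:03Z
        points.append((datetime.strptime(time_text, "%Y-%m-%dT%H:%M:%SZ"), lat, lon))
    return points

track = read_track("hinterglemm.gpx")   # example filename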

I use Picasa for photo viewing and labelling: it’s free, it has basic editing functions, it allows labelling and geotagging of photos in a fairly open manner and it does interesting stuff like face recognition too. As well as all this it links to Google’s web albums, so I can share photos, and it talks nicely to Google Earth.

Both geotagging and labelling images use EXIF (Exchangeable image file format), a way of adding metadata to images; it is nice because it’s a standard and the data goes in the image file, so it can’t get lost. EXIFtool is a very useful command-line tool for reading and writing EXIF data, and it can be integrated into your own programs. Software like Picasa, and websites such as Flickr, are EXIF aware, so data saved in this format is visible in a range of applications.

It is possible to geotag photos manually with Picasa via Google Earth, but I’ve collected a GPS track so this is not necessary. There are free software packages to do this but I’ve written my own for fun. The process is fairly simple: the GPS track has a timestamp associated with each location point and the photos from the camera each have a timestamp. All the geotagging software has to do is find the GPS point with the timestamp closest to that of the photo and write that location data to the image file in the appropriate EXIF fashion. The only real difficulty is matching up the offset between image time and GPS time - for this I take a picture of my GPS showing what time it thinks it is and label this “GPS”.
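A sketch of the matching step - it assumes track points in the form produced by the GPX sketch above, and hands the result to EXIFtool on the command line; the offset handling and paths are illustrative:

import bisect
import subprocess
from datetime import timedelta

def nearest_point(points, photo_time, offset_seconds=0):
    """Find the track point closest in time to a photo, after correcting the camera clock."""
    target = photo_time + timedelta(seconds=offset_seconds)
    times = [t for t, lat, lon in points]            # points must be sorted by time
    i = bisect.bisect_left(times, target)
    candidates = [c for c in (i - 1, i) if 0 <= c < len(points)]
    return points[min(candidates, key=lambda c: abs(times[c] - target))]

def write_gps(image_path, lat, lon):
    """Write the location into the image's EXIF GPS fields using the EXIFtool command-line program."""
    subprocess.call(["exiftool",
                     "-GPSLatitude=%f" % abs(lat),
                     "-GPSLatitudeRef=" + ("N" if lat >= 0 else "S"),
                     "-GPSLongitude=%f" % abs(lon),
                     "-GPSLongitudeRef=" + ("E" if lon >= 0 else "W"),
                     image_path])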

In fact I usually label photos after they have been geotagged: photos can be exported from Picasa as a Google Earth compatible KMZ file and then uploaded into Google Earth along with the GPS track in GPX format, making it possible to see where you were when you took each photo, which makes labelling easier.

I use www.gpsvisualizer.com to create images of GPS tracks on top of satellite images; this is a bit more flexible than just using Google Earth, although I must admit to being a bit bewildered by the range of options available. Below is an example where height is coded with colour.

[Image: GPSTrackHinterglemm - GPS track over a satellite image, height coded by colour]

As I go around I sometimes take sets of images to make a panorama. The final step is to stitch these multiple images together into single, panoramic views; I now use Microsoft Image Composite Editor to do this - it preserves the EXIF data of the input images and does a nice auto-crop. My geotagging program flags up images that were taken close together in time as prospective panoramic images. The image below is a simple two-image panoramic view (from Hinterglemm).

Panorama towards Schattberg West from below Schattberg Ost
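The panorama flagging mentioned above only needs the photo timestamps (for example from the EXIF DateTimeOriginal field); a sketch:

from datetime import timedelta

def panorama_groups(photos, gap_seconds=5):
    """photos is a list of (timestamp, filename) tuples sorted by timestamp;
    returns lists of filenames taken within gap_seconds of each other."""
    groups, current = [], []
    for stamp, name in photos:
        if current and stamp - current[-1][0] > timedelta(seconds=gap_seconds):
            if len(current) > 1:
                groups.append([n for _, n in current])
            current = []
        current.append((stamp, name))
    if len(current) > 1:
        groups.append([n for _, n in current])
    return groups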

I mentioned video in the title: at the moment I’m still a little bemused by video. I use the same directory structure for storing videos as I do for pictures but I haven’t found album software I’m happy with or a reliable way of labelling footage – Picasa seems promising although the playback quality is a bit poor. ffmpeg looks like a handy programming tool. Any suggestions welcome!

Monday, August 02, 2010

A set of blog posts on SQL

This is a roundup post for a rather specialist set of posts I wrote on SQL (Structured Query Language), a computer language for creating and querying databases. Basically the posts are my notes on the language, which I'm learning because a couple of programming projects I have in mind will need it. The main source for these notes is the Head First SQL book. I've used another book in this series (Head First Design Patterns) and I quite like the presentational style. The code in the posts is formatted for display using this SQL to HTML formatter.

Topics covered:
Some notes on SQL: 1 - Creation
Some notes on SQL: 2 - Basic SELECT
Some notes on SQL: 3 - Changing a table
Some notes on SQL: 4 - Advanced SELECT
Some notes on SQL: 5 - Database design
Some notes on SQL: 6 - Multi-table operations
Some notes on SQL: 7 - Subqueries and views

Of course you can find SQL cheatsheets elsewhere.

The Head First SQL book also has material on transactions and security, if I get a renewed bout of enthusiasm I will add a post on these items.

I used MySQL via its command-line client to do the exercises, because it's about as straightforward as you can get. Notepad++ recognises SQL as a language and will do syntax highlighting, so I type my commands into it and copy them into the MySQL command-line client. MySQL is pretty straightforward to install. I also installed Microsoft SQL Server Express 2008, which turned out to be a bit more of a struggle, but on the plus side its integration with C# .NET, which is what I normally program in, looks better than MySQL's.

I've been using SQL Server via SQL Server Management Studio (a graphical interface to databases) on the general election data compiled by The Guardian. First contact with actual data, as opposed to learning exercises, has proved interesting! A lot of things that are fiddly to do in a spreadsheet are straightforward using SQL.

SQL was designed in the early 1970s, with commercial implementations appearing towards the end of the decade. Its influence is visible in more modern languages, such as the LINQ extensions to C# (this influence is pretty explicitly stated). Some of the ideas of database design (normalisation) seem relevant to object-oriented programming.

It's been an interesting learning experience: my scientific background in programming has me stuffing pretty much any sort of data into an array in the first instance. SQL and a database look like a much better solution for many situations.



Some notes on SQL: 7 - Subqueries, combining selections and views

This is the seventh, and final, in a series of blog posts on SQL, the first covered creating a database, the second selecting information from a database, the third commands to modify the structure and contents of an existing database, the fourth, advanced selection. The fifth post covered database design. The sixth post covered multi-table database operations. This post covers subqueries and views. No claim of authority is made for these posts, they are mainly intended as my notes on the topic.

This is largely a wrapping up blog post to cover a couple of items.

As an alternative to the joins described in a previous post, "subqueries" can often be used. A subquery is essentially an entire query embedded in another query. Subqueries can be used with UPDATE, INSERT and DELETE statements, whilst joins cannot; however, joins can be used to bring together columns from multiple tables. There are no special keywords involved in creating a subquery. There is some discussion of the pros and cons of subqueries and joins here on Stackoverflow.


In an uncorrelated subquery the so-called inner query can be evaluated with no knowledge of the contents of the outer query. This expression contains an uncorrelated subquery:

UPDATE raw_candidate_data
SET    gender = 'f'
WHERE  firstname IN (SELECT firstname
                     FROM   firstname_gender_lookup
                     WHERE  gender = 'f');


The inner query is the clause in brackets; in this instance it is a shorthand way of building a list for an "IN" comparison. Often an inner query returns a single value, e.g. an average or maximum of a list.

This expression contains a correlated subquery:

UPDATE raw_candidate_data
SET    party_id = (SELECT party_id
                   FROM   parties
                   WHERE  raw_candidate_data.party = parties.party_name);


The inner query requires information from the outer query to work; this expression acts as a lookup.

Complex queries can be given aliases using the VIEW keyword, for example:

CREATE VIEW web_designers AS
SELECT mc.first_name, mc.last_name, mc.phone, mc.email
FROM   my_contacts mc
       NATURAL JOIN job_desired jd
WHERE  jd.title = 'Web Designer';


It can subsequently be used like this:

SELECT * FROM   web_designers; 


The view web_designers can be treated just as any other table.

The results of multiple SELECT commands can be combined using the keywords UNION, UNION ALL, INTERSECT and EXCEPT. UNION returns the distinct union of the results of all the SELECTs, i.e. with no repeats; UNION ALL includes the repeats. INTERSECT returns items that are common to both SELECTs, and EXCEPT returns those items that are returned by the first SELECT but not the second. The general form for these combination operations is as follows:

SELECT title FROM   job_current
UNION
SELECT title FROM   job_desired
UNION
SELECT title FROM   job_listings
ORDER  BY title; 


Each SELECT statement must return the same number of columns and each column must be coercible to the same datatype. Only a single ORDER BY statement for the set of SELECTs can be used.

Keywords: UNION, UNION ALL, INTERSECT, EXCEPT, VIEW

Sunday, July 25, 2010

Some notes on SQL: 6 - Multi-table operations


This is the sixth in a series of blog posts on SQL, the first covered creating a database, the second selecting information from a database, the third commands to modify the structure and contents of an existing database, the fourth, advanced selection. The fifth post covered database design. This post covers multi-table database operations. No claim of authority is made for these posts, they are mainly intended as my notes on the topic.

Good database design leads us to separate information into separate tables, but the information we require from a SELECT statement may reside in multiple tables. There are keywords and methods in SQL to help with extracting data from multiple tables. To assist with clarity, aliases, indicated using the AS keyword, allow tables to be given shorter, or clearer, names temporarily. Various JOIN keywords enable lookups between tables; as with other aspects of SQL there are multiple ways of achieving the same results - in this case 'subqueries'.

The AS keyword can be used to populate a new table with the results of a SELECT statement, or it can be used to alias a table name. In its aliasing guise the keyword itself can be dropped as a shorthand. This is AS being used in table creation:
CREATE TABLE profession
  (
     id         INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
     profession VARCHAR(20)
  ) AS
SELECT profession
FROM   my_contacts
GROUP  BY profession
ORDER  BY profession;

The following two forms are equivalent; the first uses AS to alias, the second uses an implicit alias:
SELECT profession AS mc_prof
FROM   my_contacts AS mc
GROUP  BY mc_prof
ORDER  BY mc_prof;


SELECT profession mc_prof
FROM   my_contacts mc
GROUP  BY mc_prof
ORDER  BY mc_prof;



The following examples use two tables boys which is a three column table {boy_id, boy, toy_id} and toys a two column table {toy_id, toy}.
boy_id  boy     toy_id
1       Davey   3
2       Bobby   5
3       Beaver  2
4       Richie  1

toy_id  toy
1       Hula hoop
2       balsa glider
3       Toy soldiers
4       Harmonica
5       Baseball cards

Cross join, cartesian join and comma join are all names for the same, relatively little-used operation which returns every row from one table crossed with every row from a second table; that's to say two 6-row tables will produce a result with 36 rows. Although see here for an application.
SELECT t.toy,
       b.boy
FROM   toys AS t
       CROSS JOIN boys AS b;

Notice the use of the period and aliases to reference columns; this query will produce a 20-row table.

Inner join combines the rows from two tables using comparison operators in a condition: an equijoin returns rows where the compared columns are equal, a non-equijoin returns rows where they differ. These are carried out with the same keywords; only the condition is different. This is an equijoin:
SELECT boys.boy,
       toys.toy
FROM   boys
       INNER JOIN toys
         ON boys.toy_id = toys.toy_id;

The ON and WHERE keywords can be used interchangeably. In this instance we do not use aliases; furthermore, since the key columns in the two tables (toys and boys) have the same name, we could use a natural join:
SELECT boys.boy,
       toys.toy
FROM   boys
       NATURAL JOIN toys;



Natural join is a straightforward lookup operation: a key from one table is used to extract a matching row from a second table, where the key column has the same name in each table. Both of these versions produce the following table:

boy     toy
Richie  Hula hoop
Beaver  balsa glider
Davey   Toy soldiers
Bobby   Baseball cards


A non-equijoin looks like this:
SELECT boys.boy,
       toys.toy
FROM   boys
       INNER JOIN toys
         ON boys.toy_id<>toys.toy_id
ORDER  BY boys.boy;

The result in this instance is four rows for each boy, containing the four toys he does not have.

Outer joins are quite similar to inner joins, with the exception that they can return rows when no match is found, inserting a null value. The following query
SELECT b.boy,
       t.toy
FROM   toys t
       LEFT OUTER JOIN boys b
         ON b.toy_id = t.toy_id;

produces this result

boy     toy
Richie  Hula hoop
Beaver  balsa glider
Davey   Toy soldiers
NULL    Harmonica
Bobby   Baseball cards

That's to say each row of the toys table is taken and matched to the boys table; where there is no match (for toy_id = 4, the harmonica) a null value is inserted in the boy column. Both LEFT OUTER JOIN and RIGHT OUTER JOIN are available, but the same effect can be achieved by swapping the order in which the tables appear in the query.


In some instances a table contains a self-referencing foreign key, which is the primary key of the same table. An example might be a three-column table of clowns, clown_info, with columns {id, name, boss_id}, where each id refers to a clown name and the bosses identified by boss_id are simply other clowns in the same table. To resolve this type of key a self-join is required; this uses two aliases of the same table.
SELECT c1.name,
       c2.name AS boss
FROM   clown_info c1
       INNER JOIN clown_info c2
         ON c1.boss_id = c2.id; 



Notice both c1 and c2 alias to clown_info.

Keywords: AS, ON, INNER JOIN, NATURAL JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, UNION, INTERSECT, EXCEPT

Tuesday, July 20, 2010

Some notes on SQL: 5 - database design

This is the fifth in a series of blog posts on SQL, the first covered creating a database, the second selecting information from a database, the third commands to modify the structure and contents of an existing database, the fourth on advanced selection. This post covers database design, as such it is a little lighter on the code examples. No claim of authority is made for these posts, they are mainly intended as my notes on the topic. These notes are based largely on Head First SQL.

The goal of database design is to produce a database which is straightforward and efficient to search. This is done by splitting data into a set of tables, with lookups between those tables used to build the desired output results.

Efficient database design is normally discussed with reference to "normal forms", the goal being to reach the highest order normal form. In practice, pragmatism is applied which means it may be sensible to hold back a little on this.

First normal form - each row of data must contain atomic values, and each row of data must have a unique identifier, known as a Primary Key. "Atomic values" are essentially the smallest pieces into which data can be sensibly divided; this may depend on the application. So, for example, in some cases a house address may be kept as a single text field whilst in others it might be divided into Number, Street, Town etc. Furthermore, to be "atomic" data should not be repeated (i.e. a table containing interests should not contain columns "interest_1", "interest_2"...).

The Primary Key may be a single column of 'synthetic' numbers (i.e. they don't have any other purpose), or it may be a pre-existing column in the table, or it may be a combination of columns, in which case it is called a Composite Key. Primary and Composite Keys are indicated using the PRIMARY KEY keyword:
CREATE TABLE customer
  (
     sid        INTEGER,
     last_name  VARCHAR(30),
     first_name VARCHAR(30),
     PRIMARY KEY (sid)
  );

For a composite key, this form is used:
PRIMARY KEY (column_1,column_2,column_3)

Second normal form - the table is in first normal form and, in addition, contains no 'partial functional dependencies'; this happens naturally with synthetic primary keys. A partial functional dependency means that a non-key column is dependent on some, but not all, of the columns in a composite primary key.

Third normal form - the table is in second normal form and, in addition, contains no 'transitive dependencies'. A transitive functional dependency is when any non-key column is related to any of the other non-key columns. This page has a nice example: if we have a table with columns {Project_id, manager_name, manager_address} then manager_address and manager_name are transitively dependent - change the manager name and we change the manager address. To address this in third normal form we split the table into two tables, {Project_id, manager_name} and {Manager_name, manager_address}. As the author writes:
In a normalised relation a non-key field must provide a fact about the key, the whole key and nothing but the key.
Relationships between tables in a database are indicated like this:

CREATE TABLE orders
  (
     order_id     INTEGER,
     order_date   DATE,
     customer_sid INTEGER,
     amount       DOUBLE,
     PRIMARY KEY (order_id),
     FOREIGN KEY (customer_sid) REFERENCES customer(sid)
  );


(Example borrowed from here). PRIMARY KEY and FOREIGN KEY are examples of 'constraints': primary keys must be unique, and a foreign key value cannot be used in a table if it does not exist as a primary key in the referenced table. The CONSTRAINT keyword is used to give a name to a constraint (a constraint being one of NOT NULL, UNIQUE, CHECK, Primary Key, Foreign Key). CHECK is not supported in MySQL.


Keywords: PRIMARY KEY, FOREIGN KEY, REFERENCES, CONSTRAINT

Wednesday, July 14, 2010

Some notes on SQL: 4 - Advanced select

This is the fourth in a series of blog posts on SQL, the first covered creating a database, the second selecting information from a database, the third commands to modify the structure and contents of an existing database. This post covers more advanced commands for selecting information from a database and ways of manipulating the results returned. No claim of authority is made for these posts, they are mainly intended as my notes on the topic.

SQL supports CASE statements, similar to those found in a range of programming languages; they are used to write multiple comparison sequences more compactly:


UPDATE my_table
SET    new_column = CASE
                      WHEN column1 = somevalue1 THEN newvalue1
                      WHEN column2 = somevalue2 THEN newvalue2
                      ELSE newvalue3
                    END;


The CASE statement can also be used in a SELECT:


SELECT title,
       price,
       budget = CASE
                  WHEN price > 20.00 THEN 'Expensive'
                  WHEN price BETWEEN 10.00 AND 19.99 THEN 'Moderate'
                  WHEN price < 10.00 THEN 'Inexpensive'
                  ELSE 'Unknown'
                END
FROM   titles;


(This second example is from here)

The way in which results are returned from a SELECT statement can be controlled by the ORDER BY keyword with the ASC (or ASCENDING) and DESC (or DESCENDING) modifiers. Results can be ordered by multiple keys. The sort order is numbers before letters, and uppercase letters before lowercase letters.

SELECT title,purchased
FROM   movie_table
ORDER  BY title ASC, purchased DESC;


ASCENDING order is assumed in the absence of the explicit keyword.

There are various functions that can be applied to the set of rows returned by a query to produce a single value; these include MIN, MAX, AVG, COUNT and SUM. The functions are used like this:

SELECT SUM(sales)
FROM   cookie_sales
WHERE  first_name = 'Nicole';


This returns a sum of all of the "sales" values returned by the WHERE clause. Related is DISTINCT which is a keyword rather than a function so the syntax is slightly different:

SELECT DISTINCT sale_date
FROM   cookie_sales
ORDER  BY sale_date;


This returns a set of unique dates in the sale_date column.

The GROUP BY keyword is used to facilitate the use of functions such as SUM, which aggregate multiple rows into a single output, or to reduce a list to its distinct elements (in that circumstance it is identical to the DISTINCT keyword, but execution may be faster). The format for GROUP BY is shown by example below:


SELECT first_name, SUM(sales)
FROM   cookie_sales
GROUP  BY first_name;


This will return a sum of the "sales" for each person identified by "first_name". A final keyword used to control the output of a SELECT statement is LIMIT, which can take one or two parameters; the behaviour of the two forms is quite different. One parameter form:

SELECT * FROM your_table LIMIT  5;


This returns the first five results from a SELECT. Two parameter form:

SELECT * FROM your_table LIMIT  5, 5;


This returns results 6,7,8,9 and 10 from the SELECT. The first parameter is the index of the first result to return (starting at 0 for the first position) and the second parameter is the number of results to return.


Keywords: CASE, WHEN, THEN, ELSE, ORDER BY, ASC, DESC, DISTINCT, MIN, MAX, AVG, COUNT, SUM, GROUP BY, LIMIT