South Shields is an island

In an earlier post, we talked about how we can use geographic information about constituencies to improve our estimates of public opinion in those constituencies. That is, we’re trading on the idea that constituencies that are close together are going to be quite similar to each other in many respects, and more similar than constituencies that are far apart.

We neglected to mention how we get data on constituency geographies. The answer lies in something called shapefiles, and specifically ESRI shapefiles. (ESRI is dangerously close to being to computerized map information what Hoover is, or was, to vacuum cleaners: a trademark used as a generic term.)

Shapefiles are… uh, quite tedious to work with. First, they don’t fit easily with the way most quantitative social science research works. Most social scientists are really happy with information which is stored in a rectangular block: each variable maps on to a column, and each case maps on to a row.

Shapefiles aren’t like that. They’re just collections of points, with some shapes in the file having very few points and others having hundreds of thousands. So statistical packages which work with shapefiles have to handle them using special object types, and that very quickly generates idiosyncrasies.
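To make the mismatch concrete, here is a minimal Python sketch. It is not the code used in this project: the two toy shapes and the helper names `points_per_shape` and `to_long_rows` are made up for illustration, and a real shapefile holds far more than bare coordinate lists. It shows that the shapes are ragged (a different number of points each), and that one way to force them into the rectangular layout is a "long" table with one row per point.

```python
# Hypothetical toy data: each "constituency" is just an ordered list of
# (x, y) boundary points. Real shapes can have hundreds of thousands.
shapes = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (3, 0), (3, 1), (2, 1), (1, 1)],
}


def points_per_shape(shapes):
    """Count the boundary points in each shape (the 'ragged' part)."""
    return {name: len(points) for name, points in shapes.items()}


def to_long_rows(shapes):
    """Flatten the shapes into rectangular rows: (shape, order, x, y)."""
    rows = []
    for name, points in shapes.items():
        for order, (x, y) in enumerate(points):
            rows.append((name, order, x, y))
    return rows


counts = points_per_shape(shapes)
rows = to_long_rows(shapes)
```

The long table is rectangular, but a row is now a point rather than a constituency, which is why it doesn't sit comfortably next to the usual one-row-per-case dataset.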

Second, shapefiles are used for lots of purposes where detail is often required. Typically, one of the first things we do in our analysis is throw out half of the points in the shapefile, because these shapefiles are at a level of detail we don’t need. We’re not investigating planning applications for the council: most of the time, we’re just interested in working out whether two constituencies are next to one another or not. (Unfortunately, in the process of throwing out information, more gremlins creep in: in the shapefiles we’re using, we managed to make South Shields an island. We still don’t know why this happened, and we had to manually patch the adjacency matrix we produced.)
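The sketch below illustrates the general idea in Python with toy data. It is not the project's code, and it is not a diagnosis of what happened to South Shields. It assumes two shapes count as adjacent when they share at least two boundary points, thins each shape by keeping every other point, and stores adjacency as a dictionary of neighbour sets rather than a matrix. All names (`thin`, `adjacency`, `islands`, `patch`) are hypothetical. In this toy case, thinning each shape on its own leaves the two squares with no boundary points in common, so both turn into islands and need a manual patch.

```python
from itertools import combinations

# Two toy squares sharing the edge x = 2. The rings start at different
# places, so naive thinning keeps different points along the shared edge.
shapes = {
    "A": [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)],
    "B": [(2, 1), (2, 0), (3, 0), (4, 0), (4, 1), (4, 2), (3, 2), (2, 2)],
}


def thin(points, keep_every=2):
    """Naive thinning: keep every `keep_every`-th boundary point."""
    return points[::keep_every]


def adjacency(shapes, min_shared=2):
    """Assumed rule: adjacent if two shapes share >= min_shared points."""
    adj = {name: set() for name in shapes}
    for a, b in combinations(sorted(shapes), 2):
        if len(set(shapes[a]) & set(shapes[b])) >= min_shared:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def islands(adj):
    """Names of shapes that ended up with no neighbours at all."""
    return sorted(name for name, nbrs in adj.items() if not nbrs)


def patch(adj, a, b):
    """Manually record that a and b are neighbours (symmetric)."""
    adj[a].add(b)
    adj[b].add(a)


full = adjacency(shapes)
thinned = adjacency({name: thin(pts) for name, pts in shapes.items()})
lost = islands(thinned)   # shapes that lost all their neighbours
patch(thinned, "A", "B")  # the manual fix
```

Comparing `islands(...)` before and after thinning is a cheap sanity check, since a constituency with no neighbours is usually a sign that something went wrong.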

Third, shapefiles are often not freely distributable. We can’t distribute the shapefiles we use in this project, because they’re only available from the Ordnance Survey (at this link: scroll down to Boundary-Line) under licence. That licence is fairly permissive — but it still means that the results we produce depend on an external file over which we don’t have control — and that limits reproducibility.

Anyway, enough grumbling: here’s the link to the code we use to find out which constituencies are adjacent to each other.
