When we discussed Mr. P before, we talked, rather blithely, about being able to build up a tally of the number of voters of particular types residing in each constituency. Those types depend on the particular model that we use, but for modelling vote choice, we might want to build a tally of all voters of (1) a particular gender, belonging to (2) a particular age-group, in (3) either rented or owned accommodation, working in (4) either the private or the public sector; who are (5) married or single; and who have (6) a particular type of education, and (7) a particular social grade (AB, C1, C2, or DE).
That’s quite detailed information. For eminently sensible reasons, the census authorities do not generally release such detailed information at the constituency level. You can get some cross-tabs (ideally from Nomisweb) — you can find out, say, the number of people with a particular educational attainment for each gender and age-group. And you can get oodles and oodles of univariate statistics — raw counts of people according to type of housing tenure; raw counts of people according to type of employment, that kind of thing. But any information which is sufficiently detailed for our post-stratification stage is also going to be sufficiently detailed to risk compromising the anonymity of the census.
So in order to get the detailed breakdowns by constituency that we need, we need to rely on some special census data release — the Sample of Anonymised Records. This is a sample of records from the census, with information on all of the questions that the census asks about. It’s been anonymised in clever ways — some of the data has had some random noise added to it, not enough to ruin anyone’s results, but enough to prevent jigsaw identification.
Using this sample — which, at one hundred thousand records, is a pretty huge sample by anyone’s reckoning — we can get information on the relationship between the different variables which feature in our voter types. We can see that, say, people working in the public sector tend to be slightly more likely to have a university degree than people working in the private sector. So we can create a table for the whole country, with all the information we need about the relationships between these variables.
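As a rough illustration of what building that national table involves, here is a minimal sketch that tallies a handful of records by two of the variables. The records are made up for the example, and the variable names (`sector`, `education`) are placeholders rather than the real field names in the Sample of Anonymised Records.

```python
from collections import Counter

# Made-up toy records standing in for the anonymised sample:
# each record is (sector, education). These are NOT real census values.
records = [
    ("public", "degree"),
    ("public", "degree"),
    ("public", "no degree"),
    ("private", "degree"),
    ("private", "no degree"),
    ("private", "no degree"),
    ("private", "no degree"),
]

# Count how many records fall into each combination of categories.
counts = Counter(records)

# Turn the counts into shares of the whole sample.
total = sum(counts.values())
shares = {cell: n / total for cell, n in counts.items()}

for cell, share in sorted(shares.items()):
    print(cell, round(share, 3))
```

With all seven variables the idea is the same; the table simply has one cell for every combination of categories.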
In order to turn that information into information at the level of the constituency, we do something called ‘raking’, also known by its Sunday name of ‘iterative proportional fitting’. We start by assuming that everyone in the constituency has the same probability of landing up in any one of our types, and we gradually chip bits off and reshape this table until it starts looking more like the relationships we know are there (i.e., those in the Sample of Anonymised Records). It’s a bit like taking a block of marble and chipping bits off until a beautiful sculpture comes forth.
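To make the chipping-away a little more concrete, here is a minimal sketch of iterative proportional fitting on a two-way table. It shows one common formulation, which may differ in detail from the procedure used for these estimates: take a starting table, then alternately rescale its rows and columns until the totals match the known counts for the constituency. The function name `rake`, the starting table, and the constituency counts are all made up for illustration, and the sketch assumes every cell of the starting table is strictly positive and that both sets of totals add up to the same number.

```python
import numpy as np

def rake(seed, row_totals, col_totals, tol=1e-9, max_iter=1000):
    """Rescale `seed` until its row and column sums match the targets."""
    table = np.asarray(seed, dtype=float).copy()
    row_totals = np.asarray(row_totals, dtype=float)
    col_totals = np.asarray(col_totals, dtype=float)
    for _ in range(max_iter):
        # Scale each row so that the row sums match the known row totals.
        table *= (row_totals / table.sum(axis=1))[:, None]
        # Scale each column so that the column sums match the column totals.
        table *= col_totals / table.sum(axis=0)
        # Columns now match exactly; stop once the rows match as well.
        if np.abs(table.sum(axis=1) - row_totals).max() < tol:
            break
    return table

# Made-up starting table: rows are sector (private, public),
# columns are education (no degree, degree).
start = np.array([[50.0, 20.0],
                  [15.0, 15.0]])

# Made-up univariate counts for one constituency (both sum to 1000).
sector_counts = np.array([700.0, 300.0])
degree_counts = np.array([600.0, 400.0])

fitted = rake(start, sector_counts, degree_counts)
print(fitted.round(1))
```

Because each step only multiplies whole rows or whole columns by a constant, the fitted table keeps the association between the variables (the odds ratio) that was in the starting table, while its totals agree with the constituency counts. Passing a table of ones as the starting point corresponds to assuming every type is equally likely. The same idea extends to more than two variables by cycling through each set of known totals in turn.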
This raking procedure works quite well: we know that it recovers the right figures in the limited cases where census cross-tabs are available for two or three variables. It means that we can answer quite detailed questions.
If, inspired by the Beyoncé song, we wished to identify all the single ladies (all the single ladies) in a particular age-group, in a particular constituency of the UK, we could do so (and in fact you’ll see that map plotted below: click to embiggen).
Or, conversely, if we wanted to identify the constituencies with the lowest number of males in employment, we could do that too, though perhaps ignoring some obvious problems.
All this is going to be very useful — indeed, essential — for our post-stratification stage.