Cutting through the bullshit.

Sunday 16 June 2013

Below average

The other day, this meme cropped up on Facebook.

I surmise that the intent is to ridicule the innumeracy of those who would consider such a revelation shocking when in reality it is a perfectly banal observation. What could be more obvious than that it's the nature of an 'average' (presumably the median – see here if you were absent the day they taught averages) that half will be above it and half below?

Or is it?

First of all, the assertion assumes that we know what 'intelligence' means. In reality, the only rigorous definition I've encountered is in terms of performance on intelligence tests. It may be circular, but that's what I'll assume the meme is talking about.

In a population with an odd number of observations, the median is the one exactly in the middle of the distribution when ranked from highest to lowest. So in a population of five with intelligence test scores ('IQ') of 120, 110, 105, 90, 85, the median is 105, and clearly 50% are not 'below average' – 40% are. In such populations, the proportion 'below average' is always going to be less than 50%, even if there are over 316 million observations and the difference is slight.

When the number of observations is even, however, the median is the arithmetic mean of the two observations in the middle. So in a population of four, with IQs of 110, 100, 90, 85, the two central observations – 100 and 90 – average to 95 and in this case, exactly 50% really are below average. But what if their scores were 110, 100, 100, 90? The two observations to average are both 100, so the median is 100 and only 25% are 'below average'. Since IQ scores are deliberately calculated to conform to a normal distribution, it is, if I'm not mistaken, inevitable that there will be a cluster exactly in the middle of the range and there will never be 50% below average because a proportion, probably a plurality, will always be 'average'.
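For anyone who wants to check the arithmetic, here is a minimal sketch that computes the median and the share of scores strictly below it for the three small populations above. The function name `share_below_median` is my own invention for illustration; `statistics.median` is the standard library routine, which averages the two middle values when the count is even.

```python
from statistics import median


def share_below_median(scores):
    """Return (median, fraction of scores strictly below the median)."""
    m = median(scores)
    below = sum(1 for s in scores if s < m)
    return m, below / len(scores)


# The three example populations from the text.
populations = [
    [120, 110, 105, 90, 85],  # odd count
    [110, 100, 90, 85],       # even count, distinct middle values
    [110, 100, 100, 90],      # even count, tied middle values
]

for scores in populations:
    m, share = share_below_median(scores)
    print(f"{scores}: median {m}, {share:.0%} below")
```

The comparison is strict (`s < m`), so anyone sitting exactly on the median counts as 'average' rather than 'below average', which is the whole point of the argument.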

Even if there were some possibility that not one of the 316,044,000 Americans had an IQ of exactly 100, it transpires that intelligence test scores are grouped into ranges and the range 90-109 (sometimes 85-115) is, uncoincidentally, denominated 'Average'. About 50% of the population fall into the 90-109 range and 68% into the 85-115 range, so only 25% in the first case and 16% in the second would be 'below average'.
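Those proportions can be roughly reproduced from a normal curve. The sketch below assumes a mean of 100 and a standard deviation of 15, which is a common convention for IQ scales but is my assumption rather than something stated above. It also assumes scores are reported as whole numbers, so that '90-109' covers the interval from 89.5 to 109.5.

```python
from statistics import NormalDist

# Assumption: IQ scaled to mean 100, standard deviation 15.
iq = NormalDist(mu=100, sigma=15)


def share_between(low, high):
    """Proportion of the population with scores between low and high."""
    return iq.cdf(high) - iq.cdf(low)


within_85_115 = share_between(85, 115)      # one SD either side of the mean
below_85 = iq.cdf(85)

within_90_109 = share_between(89.5, 109.5)  # whole-number scores 90..109
below_90 = iq.cdf(89.5)

print(f"85-115: {within_85_115:.1%} 'Average', {below_85:.1%} below")
print(f"90-109: {within_90_109:.1%} 'Average', {below_90:.1%} below")
```

Under those assumptions the figures land close to the ones quoted: about two thirds in the wider band with about a sixth below it, and about half in the narrower band with about a quarter below it.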

If it were true that a study showed 50% of Americans 'have below-average intelligence', then, that really would be a shocker.

Thursday 13 June 2013

What's 'metadata'?

[Note: a few additions, 2013 06 16]

When I first read about the NSA collecting 'metadata' about telephone calls, I thought it was an unconventional use of the term. But on reflection, if you regard the actual substance of a phone call as the data of interest, then the details pertaining to the call are, in fact, metadata.

But this is deceptive. The 'metadata' the NSA collect are crucial to understanding the data they pertain to – the actual content of phone calls and messages that they are purportedly not capturing. A message stating 'The bomb goes off in 48 hours' would be useless if you didn't know when it was sent, among other things. Under the circumstances, they are treating the 'metadata' as data and analysing them as such.

At another level of analysis, those metadata are themselves data and are only meaningful when viewed in light of their own set of metadata. That is what I think of as metadata.

The record of a telephone call is probably represented as a string of characters. The metadata would specify what they mean. First of all, they would tell you that the unit of enumeration – what each record describes – is a 'Phone call', which would be defined to include or exclude SMSs, MMSs, video calls, etc. If it included more than one type of call, there would be a 'Call type' field with a code indicating what kind of call it was. The first 20 characters of the record, then, might be a sequential call record identifier. The metadata would then specify that the field is named 'Call record id' and comprises 20 numeric characters. There might then be a code for the date of the call, and the metadata would tell you that it is an eight-character numeric field and link it to a particular time zone. Similarly, the metadata for 'Time of call' would say that the field is four numeric characters, state the applicable time zone, and define its content as the time the call is answered (or perhaps when dialling was completed). The next three characters might represent the country code of the originating phone, in a field that would be called, for example, 'Originating country code', be three numeric characters long and be associated with a table of valid country codes (a classification) linking each code with the relevant country; and so on. The 'classification' associated with the fields for actual phone numbers would effectively be a reverse telephone directory. Other fields could include codes for the phone towers, satellites and nodes the call passes through, with times, details of the receiving number, and so forth.
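To make that concrete, here is a small sketch of how a layout like the one just described could drive the reading of a record. Everything in it is hypothetical: the field name 'Date of call', the YYYYMMDD and HHMM formats, the sample record and the helper `parse_record` are my own illustrations, not anything the NSA or a phone company is known to use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str         # what the field is called
    length: int       # how many characters it occupies
    description: str  # what the content means


# Hypothetical layout: this list *is* the metadata for the record.
LAYOUT = [
    Field("Call record id", 20, "sequential identifier, numeric"),
    Field("Date of call", 8, "numeric, assumed YYYYMMDD in a stated time zone"),
    Field("Time of call", 4, "numeric, assumed HHMM, time the call is answered"),
    Field("Originating country code", 3, "numeric, see table of country codes"),
]


def parse_record(record, layout=LAYOUT):
    """Split a fixed-width record into named fields using the layout."""
    expected = sum(field.length for field in layout)
    if len(record) != expected:
        raise ValueError(f"expected {expected} characters, got {len(record)}")
    values = {}
    position = 0
    for field in layout:
        raw = record[position:position + field.length]
        if not raw.isdigit():
            raise ValueError(f"{field.name!r} must be numeric, got {raw!r}")
        values[field.name] = raw
        position += field.length
    return values


# A made-up record: id 42, 13 June 2013, 09:15, country code 061.
record = "42".zfill(20) + "20130613" + "0915" + "061"
parsed = parse_record(record)
for name, value in parsed.items():
    print(f"{name}: {value}")
```

Without `LAYOUT`, the record is just 35 digits. That is the sense in which the 'metadata' the NSA collect are data with metadata of their own.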

The metadata pertaining to survey data would include definitions of the concepts purportedly collected, the wording and sequence of the questions asked to elicit data supporting those concepts, how such data are classified, a definition of the units being measured, the time the data apply to, the sampling methodology, and so forth. So a record in a labour force survey would include data like 'Dwelling id', 'Household number', 'Person number', 'State' or other geographic indicators, 'Sex of person', 'Age of person in years', 'Marital status of person', 'Labour force status of person', 'Hours actually worked by person during reference period', 'Hours usually worked by person', 'Status in employment of person in main job', 'Occupation of person in main job', 'Industry of person in main job', additional comparable fields for second, third, fourth... jobs, 'Duration of unemployment of person', etc. The metadata for records like that would name and define each field, stipulate its length and the type of characters allowed, and associate it with any relevant classification. The classification could be quite simple. A classification of 'Sex of person' might look like this:
0    Undetermined
1    Female
2    Male
3    Other
while classifications of 'Occupation' and 'Industry' fill large tomes. Sometimes, of course, a number is just a number, 'Age in years', for example. But even these can be classified by grouping them into ranges. For the record, it's never a good idea to collect age in ranges, as that can make it impossible to compare the data with other data collected or output in different ranges. So if you are interested in the population aged 18-22 years, for example, and ask respondents whether they are in that range, you could never compare your data with data collected in standard age ranges: 15-19, 20-24... If you're developing a survey, I urge you to collect age last birthday in single years.
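A short sketch of why single years matter. The ages are made up and `group_ages` is a hypothetical helper; the code table is the 'Sex of person' classification shown above. The point is only that ages recorded in single years can be regrouped into any ranges you like afterwards, while answers recorded as '18-22' cannot be split between 15-19 and 20-24.

```python
from collections import Counter

# The simple classification from the text: code -> category.
SEX_OF_PERSON = {0: "Undetermined", 1: "Female", 2: "Male", 3: "Other"}


def group_ages(ages, ranges):
    """Count the ages falling into each inclusive (low, high) range."""
    counts = Counter()
    for age in ages:
        for low, high in ranges:
            if low <= age <= high:
                counts[(low, high)] += 1
                break
    return counts


# Made-up ages, collected as age last birthday in single years.
ages = [17, 18, 19, 20, 21, 22, 23, 24]

# The same raw data supports both the standard ranges and a custom one.
standard = group_ages(ages, [(15, 19), (20, 24)])
custom = group_ages(ages, [(18, 22)])

print(SEX_OF_PERSON[1])
print(dict(standard))
print(dict(custom))
```

Had the survey only asked 'Are you aged 18-22?', the `custom` count would be all you had, and there would be no way to work back to the `standard` counts.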

To digress, it's worth pointing out that there's more than one way to define a concept. For most purposes – when defining a word, for example – it's probably best to identify the core of the concept and acknowledge that speakers will vary in how far from that core something can be before they'll call it something else. When defining statistical concepts, however, it's crucial to demarcate the outline of the concept so that there is little or no possibility of ambiguity. Each unit either fits into a category or not. I hasten to add that statisticians are not always as careful as they need to be about defining terms and that standard definitions may not accurately reflect the concept as it is actually collected.

Similarly, not all classifications are the same. In a statistical classification, categories have to be defined to be mutually exclusive so that every unit in the population to be enumerated fits into one and only one category. And the classification must be exhaustive, so there's a category to accommodate every unit to be classified, even if the category is just 'Other'. Again, not all statistical classifications are entirely 'fit for purpose' or even internally coherent. And even when they are, they are often applied in contexts they were not designed for. So a classification of industries, for instance, that may (or may not) be perfectly serviceable in classifying the commodities each industry produces can obscure similarities and differences among industries when applied to workplace safety. 'Agriculture, forestry and fishing' captures the industries that produce food and timber, but the hazards encountered in abalone diving are quite different to those in the dairy cattle farming industry, which in turn might be similar to those in the beef cattle farming industry.
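The two formal requirements, mutually exclusive and exhaustive, are easy to express as a check. In this sketch the age groups and the function `classify` are hypothetical, chosen only because ranges are simple to test; a real classification of occupations or industries would need far more than a few comparisons.

```python
def classify(value, categories):
    """Return the one category the value fits, or complain.

    categories maps a label to a function that says whether a value belongs.
    """
    matches = [label for label, belongs in categories.items() if belongs(value)]
    if not matches:
        raise ValueError(f"not exhaustive: no category for {value!r}")
    if len(matches) > 1:
        raise ValueError(f"not mutually exclusive: {value!r} fits {matches}")
    return matches[0]


# Hypothetical, well-formed classification of age in years.
AGE_GROUPS = {
    "0-14": lambda age: 0 <= age <= 14,
    "15-64": lambda age: 15 <= age <= 64,
    "65 and over": lambda age: age >= 65,
}

# A deliberately broken version: 15 fits two categories, 70 fits none.
BROKEN_GROUPS = {
    "0-15": lambda age: 0 <= age <= 15,
    "15-64": lambda age: 15 <= age <= 64,
}

print(classify(15, AGE_GROUPS))
for age in (15, 70):
    try:
        classify(age, BROKEN_GROUPS)
    except ValueError as error:
        print(error)
```

A check like this only catches formal faults. It says nothing about whether the categories are fit for the purpose they are being put to, which is the harder problem.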

Finally, a generalisation like 'The mean number of completed calls originating from phones within the 50 states is 7.8 per day' is not really metadata. It's just another way of presenting the data. The applicable metadata would include definitions of 'Completed call' (including or excluding SMS, etc.) and 'Within the 50 states' (including or excluding diverted calls, global roaming, etc.).