Cutting through the bullshit.
Sunday, 16 June 2013
Thursday, 13 June 2013
[Note: a few additions, 2013 06 16]
When I first read about the NSA collecting 'metadata' about telephone calls, I thought it was an unconventional use of the term. But on reflection, if you regard the actual substance of a phone call as the data of interest, then the details pertaining to the call are, in fact, metadata.
At another level of analysis, those metadata are themselves data and are only meaningful when viewed in light of their own set of metadata. That is what I think of as metadata.
The record of a telephone call is probably represented as a string of characters. The metadata would specify what those characters mean. First of all, they would tell you that the unit of enumeration (what each record describes) is a 'Phone call', which would be defined to include or exclude SMSs, MMSs, video calls, etc. If it included more than one type of call, there would be a 'Call type' field with a code indicating what kind of call it was. The first 20 characters of the record, then, might be a sequential call record identifier. The metadata would specify that the field is named 'Call record id' and comprises 20 numeric characters. There might then be a code for the date of the call, and the metadata would tell you that it is an eight-character numeric field and link it to a particular time zone. Similarly, the metadata for 'Time of call' would say that the field is four numeric characters, state the applicable time zone, and define its content as the time the call is answered (or perhaps when dialling was completed). The next three characters might represent the country code of the originating phone. The metadata would name that field, for example, 'Originating country code', specify that it is three numeric characters long, and associate it with a table of valid country codes (a classification) linking each code with the relevant country; etc. The 'classification' associated with the fields for actual phone numbers would effectively be a reverse telephone directory. Other fields could include codes for the phone towers, satellites and nodes the call passes through, with times, details of the receiving number, and so forth.
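To make that concrete, here is a minimal sketch in Python of how such metadata could drive the reading of a record. Everything in it is an assumption for illustration: the layout (20 + 8 + 4 + 3 characters) follows the hypothetical fields described above, the names FieldSpec, LAYOUT and parse_record are invented, the country codes are zero-padded to three digits, and the sample record is made up. It is not how any actual carrier or agency stores call records.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    """Metadata for one fixed-width numeric field (hypothetical structure)."""
    name: str
    length: int
    classification: Optional[dict] = None  # code -> meaning, if the field is coded


# A tiny, illustrative classification; codes zero-padded to 3 digits by assumption.
COUNTRY_CODES = {
    "001": "North American Numbering Plan",
    "044": "United Kingdom",
    "061": "Australia",
}

# The record layout described in the text, in order.
LAYOUT = [
    FieldSpec("Call record id", 20),
    FieldSpec("Date of call", 8),
    FieldSpec("Time of call", 4),
    FieldSpec("Originating country code", 3, COUNTRY_CODES),
]


def parse_record(record: str, layout=LAYOUT) -> dict:
    """Split a raw character string into named fields using the metadata."""
    expected = sum(field.length for field in layout)
    if len(record) != expected:
        raise ValueError(f"expected {expected} characters, got {len(record)}")
    parsed = {}
    pos = 0
    for field in layout:
        raw = record[pos:pos + field.length]
        if not (raw.isascii() and raw.isdigit()):
            raise ValueError(f"{field.name!r} must be numeric, got {raw!r}")
        if field.classification is None:
            parsed[field.name] = raw
        elif raw in field.classification:
            # Keep both the code and what the classification says it means.
            parsed[field.name] = (raw, field.classification[raw])
        else:
            raise ValueError(f"{raw!r} is not a valid code for {field.name!r}")
        pos += field.length
    return parsed


# A made-up record: id 42, 13 June 2013, 13:02, originating in Australia.
sample = "00000000000000000042" + "20130613" + "1302" + "061"
fields = parse_record(sample)
```

Without LAYOUT, the sample is just 35 digits. With it, each slice has a name, a length and, where relevant, a classification that says what the code stands for.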
The metadata pertaining to survey data would include definitions of the concepts purportedly collected, the wording and sequence of the questions asked to elicit data supporting those concepts, how such data are classified, a definition of the units being measured, the time the data apply to, the sampling methodology, and so forth. So a record in a labour force survey would include data like 'Dwelling id', 'Household number', 'Person number', 'State' or other geographic indicators, 'Sex of person', 'Age of person in years', 'Marital status of person', 'Labour force status of person', 'Hours actually worked by person during reference period', 'Hours usually worked by person', 'Status in employment of person in main job', 'Occupation of person in main job', 'Industry of person in main job', additional comparable fields for second, third, fourth... jobs, 'Duration of unemployment of person', etc. The metadata for records like that would name and define each field, stipulate its length and the type of characters allowed, and associate it with any relevant classification. The classification could be quite simple. A classification of 'Sex of person' might look like this:
[Table: Sex]

while classifications of 'Occupation' and 'Industry' fill large tomes. Sometimes, of course, a number is just a number, 'Age in years', for example. But even these can be classified by grouping them into ranges. For the record, it's never a good idea to collect age in ranges, as that can make it impossible to compare the data with other data collected or output in different ranges. So if you are interested in the population aged 18-22 years, for example, and ask respondents whether they are in that range, you could never compare your data with data collected in standard age ranges: 15-19, 20-24... If you're developing a survey, I urge you to collect age last birthday in single years.
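The point about single years can be illustrated with a short Python sketch. The ages are made up and the function name group_ages is my own; the only thing it is meant to show is that ages collected in single years can be grouped into whatever ranges you like after the fact.

```python
from collections import Counter


def group_ages(ages, ranges):
    """Count single-year ages falling in each inclusive (low, high) range."""
    counts = Counter()
    for age in ages:
        for low, high in ranges:
            if low <= age <= high:
                counts[(low, high)] += 1
                break
    return dict(counts)


# Made-up ages, collected as age last birthday in single years.
ages = [17, 18, 19, 20, 21, 22, 23]

STANDARD_RANGES = [(15, 19), (20, 24)]
RANGE_OF_INTEREST = [(18, 22)]

# The same single-year data supports both groupings.
standard = group_ages(ages, STANDARD_RANGES)
of_interest = group_ages(ages, RANGE_OF_INTEREST)
```

Going the other way doesn't work. If all you had recorded was how many respondents fell in 18-22, there would be no way to tell how many of them belong in 15-19 and how many in 20-24.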
Finally, a generalisation like 'The mean number of completed calls originating from phones within the 50 states is 7.8 per day' is not really metadata. It's just another way of presenting the data. The applicable metadata would include definitions of 'Completed call' (including or excluding SMS, etc.) and 'Within the 50 states' (including or excluding diverted calls, global roaming, etc.).
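As a rough illustration of why the definitions matter, here is a small Python sketch in which the same made-up records yield different means depending on whether 'Completed call' is defined to include SMS. The record structure, the field names and the function mean_calls_per_day are all hypothetical.

```python
def mean_calls_per_day(records, included_types):
    """Mean number of records per day, counting only the included call types."""
    days = {r["day"] for r in records}
    counted = [r for r in records if r["call_type"] in included_types]
    return len(counted) / len(days)


# Made-up records covering two days.
records = [
    {"day": "2013-06-12", "call_type": "voice"},
    {"day": "2013-06-12", "call_type": "voice"},
    {"day": "2013-06-12", "call_type": "sms"},
    {"day": "2013-06-13", "call_type": "voice"},
    {"day": "2013-06-13", "call_type": "sms"},
    {"day": "2013-06-13", "call_type": "sms"},
]

# Same data, two definitions of 'Completed call'.
voice_only = mean_calls_per_day(records, {"voice"})
voice_and_sms = mean_calls_per_day(records, {"voice", "sms"})
```

Here the mean is 1.5 a day under one definition and 3.0 under the other, so the figure can't be interpreted until the metadata say which definition applies.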
Posted by Ernie Halfdram at 13:02