Data Acquisition Doesn’t Work Without Subject-matter Experts: Primary Keys
I rarely have expertise in the data sources I automate for clients. I have expertise in the means I use to collect and store that data, but I don’t necessarily know what any of the data physically means. That’s why I need the help of subject-matter experts when designing databases to store their data – they know what it means, what matters, what varies – I can only guess, and I prefer not to guess.
One of the first challenges that comes up is “primary keys”: I need to know which columns in a dataset uniquely define that row.
For example if I were to store the daily temperature for my town in a database, that data would look something like this:
If we don’t set a primary key then the database won’t know if a row we try to insert is new or if it should update existing data. For example, if we added a new temperature reading for 2022-11-11 without setting a primary key for the above table, the table would look like this:
What was the temperature on 2022-11-11? We have two answers, which can be confusing. A subject-matter expert in this data that uses the data for a particular purpose would be able to tell me what to do with this. For example:
- they could be fine with this – they know the data comes out at a fixed interval and they are going to take the average for the day
- they might want to retain these intra-day readings but might want it clearly stated when they came in, so they’d ask me to store the time as well as the date
- or they might only care about the average, final, or highest temperature for the data – in which case I’d need to do some data processing
For the first case I’d actually recommend storing the time as well; and therefore for all of these cases we still want to set a primary key: the date, or the date+time. Setting a primary key on the date for the above table means “if a new piece of data comes in with the same date, then this new data replaces the old one”. This is part of why we use databases: they ensure data integrity – they ensure that the data makes sense.
Often we need to specify multiple primary keys. Here’s an example of a table that would use multiple primary keys:
For the above data table we would set the following columns to be primary keys: Datetime, Station ID, and Variable. If we got a new data point for “2022-11-09 13:22” for station ID “Halifax-01” and variable “Temperature”, this configuration would replace the existing data point, making updates and corrections possible. If we did not set these primary keys we would get two values for the exact same station, variable, and time – which is confusing.