I obtain data generated by cheap, low powered devices that make mistakes. Sometimes the data simply will not match the specification because the firmware mucked up (a bit flip perhaps) and there’s nothing anybody can do to prevent it.
So what should I do about it? Rejecting it leads to a clean data warehouse but is unrealistic - we’ve lost information about the fact a transaction happened even if we don’t know exactly what transaction. Fixing the data is also unrealistic, breaking the ability to reload the data warehouse and obtain the same state.
The solution I’ve come up with is simple but required quite a bit of coding. A GUI is put over the Operational Data Store section of the data warehouse allowing changes. Every time anybody changes the data there a trigger kicks off a few actions:
The original source file is backed up.
The original data is deleted from the data warehouse
A file that simulates the generated data is created
The simulated file is loaded
As far as I can see, this allows any authorised user to fix data in the data warehouse using either a table editing GUI or SQL, while maintaining the data warehouse’s integrity. However I’m still sitting on the idea and seeing if there are any flaws in it - drop me an email if you can think of any.