Let's begin with a basic question: what is data?
Facts or statistics, known or assumed, collected together manually or automatically, for reference or analysis. Since data is collected, measured, reported and analysed, it transforms from raw, unorganized bits of values or statements into patterns, images or graphs that reveal information.
Thus data, for example, could be a set of values recording the temperature of a room at different hours of the day over a number of days. It could also be a compilation of "go/no go" reports, such as "pass" or "fail" statements for the batches produced over a specified period of time.
What is integrity?
In a person it is the quality of being honest and having strong moral principles. In data it is the state of being whole (undivided) and incorruptible over a specified period of validity. If we apply the yardstick of honesty to data, it translates into truthfulness and reliability.
Thus data integrity could be defined as a truthful and reliable set of values, images, reports or graphs, collected over a specified time period and generated through the operation of a process.
In the computer world, data is facts and figures stored in and processed by an electronic device. In the world of statistics, data is facts and figures from which conclusions can be drawn. Data that has been recorded, classified, organized, related or interpreted within a framework, so that a meaning emerges, is known as information. Statistics is a type of information obtained through mathematical operations on numerical data.
Basically, it emerges that data is a collection of facts and figures for onward processing through sifting, organizing, analyzing and concluding, to yield the information needed for decision making. A decision can only be as sound as the data it rests on. Herein lies the crux of data integrity.
The dangers to data integrity are data corruption and data loss. Data corruption refers to errors that occur during the collection, storage, transmission and processing of data. It is significant to note that interpretation of data is not included here: interpretation is subjective, and the same data can be interpreted differently by different individuals to meet different goals. Errors can occur through hardware or software malfunction. Here is how Wikipedia describes data corruption:
Data corruption is the production of unintended results; the consequence can be anything from a minor loss of data to a total system failure.
Corruption of data comes in two shades: undetected and detected. Undetected corruption is also referred to as silent data corruption; because there is no indication that the data is incorrect, it is the most dangerous type. Detected corruption may lead to permanent loss of data, or to a temporary malfunction that is auto-corrected through in-built (software) mechanisms and does not result in data loss. An example is the auto-save function, which prevents data loss if the battery drains while using a laptop.
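The difference between silent and detected corruption comes down to whether anyone recorded a fingerprint of the data to check against. A minimal sketch of such a check, using a SHA-256 checksum (the data values here are invented for illustration):

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return a SHA-256 digest that fingerprints the data."""
    return hashlib.sha256(data).hexdigest()

# Record the checksum when the data is first stored.
original = b"batch 42: pass"
stored_digest = checksum(original)

# Later, re-verify: any silent change to the bytes alters the digest.
received = b"batch 42: fail"  # imagine this was corrupted in transit
print(checksum(received) == stored_digest)  # False: corruption detected
print(checksum(original) == stored_digest)  # True: data intact
```

Without the stored digest, the change from "pass" to "fail" would be silent corruption; with it, the corruption is detected and the data can be re-fetched from a good copy.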
Data corruption can occur at any stage: acquisition, transmission, storage, processing or retrieval. The causes of data loss are hardware-oriented or software-related; an interruption during data transmission can also cause data loss. Cosmic rays from increased solar activity, cloud cover, microwaves, electrical fluctuations, loose connections, background radiation, external disturbances such as loud sound, aging, viruses and human error can all lead to data loss or corruption.
Data can be secured in a number of ways. Multiple layers of defense mechanisms can protect data should there be a deliberate or accidental breach, whether internal or over the net. Programmable logical controls such as levels of authority, passwords, authentication (for example biometrics), firewalls, anti-virus, anti-spyware, anti-malware and encryption are some methods that may be employed. Physically, restrict access through locks on servers and storage cabinets.
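One small but concrete piece of the password layer is never storing passwords in plain text. A sketch of salted, slow password hashing with Python's standard library (the password strings and the iteration count of 100,000 are illustrative choices, not a recommendation from the original text):

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Derive a salted, deliberately slow hash so stored credentials resist brute force."""
    salt = os.urandom(16)  # a fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Recompute the hash and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password("s3cret")
print(verify_password("s3cret", salt, digest))  # True
print(verify_password("guess", salt, digest))   # False
```

The point of the slow derivation (many iterations) is that even if the stored digests leak, guessing passwords against them is expensive.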
Access control and access records are further methods to discourage espionage, hacking and pilfering of data. Access control should be made robust by changing key codes and lock combinations regularly. Changing passwords and restricting access are obvious choices, but it is in the obvious areas that carelessness creeps in.
Data transfer should be done over safe networks and discreetly, which translates into keeping a low profile on data handling and storage. Maintain a log of who accessed the data, where and when, and keep control by periodically questioning the purpose behind each access.
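The who/where/when log described above can be as simple as an append-only file with one row per access. A minimal sketch, assuming a CSV file named `access_log.csv` (the file name, dataset name and purpose string are hypothetical):

```python
import csv
import getpass
import socket
from datetime import datetime, timezone

LOG_FILE = "access_log.csv"  # hypothetical log location

def record_access(dataset: str, purpose: str) -> None:
    """Append one when/who/where/what/why row to the access log."""
    row = [
        datetime.now(timezone.utc).isoformat(),  # when
        getpass.getuser(),                       # who
        socket.gethostname(),                    # where
        dataset,                                 # what was accessed
        purpose,                                 # why (to be questioned periodically)
    ]
    with open(LOG_FILE, "a", newline="") as f:
        csv.writer(f).writerow(row)

record_access("batch_results", "monthly quality review")
```

Recording the stated purpose alongside each access is what makes the periodic questioning practical: the log itself becomes the audit trail.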
Keep the security system transparent so that it becomes an effective auto-check mechanism, but don't let obsessive security affect productivity. Ensure online and offline back-ups, and confirm through periodic, controlled and duly recorded access that the back-ups are neither corrupted nor inaccessible. Validate data recovery processes.
Who can help manage data integrity? The obvious choice would be vendors of storage hardware. Exchange notes with software vendors on how they maintain their data and secure their operations.
Data management also includes eliminating redundant data, much like archiving old documents. ScanDisk and similar programmes keep the system robust and improve the detectability of impending malfunction, so that appropriate steps may be taken before damage occurs. This is preventive and predictive maintenance.
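One common form of the redundancy elimination mentioned above is finding duplicate files by content rather than by name. A sketch of that idea, grouping files under a folder by their SHA-256 hash (the folder path is whatever you point it at; this reads whole files into memory, so it is only a sketch for modest file sizes):

```python
import hashlib
from pathlib import Path

def find_duplicates(folder: str) -> dict[str, list[Path]]:
    """Group files by content hash; any group with more than one entry is redundant."""
    by_hash: dict[str, list[Path]] = {}
    for path in Path(folder).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash.setdefault(digest, []).append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

As with archiving paper documents, the duplicate groups should be reviewed by a person before anything is deleted; the hash only establishes that the contents are identical.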
Why do we need data management and data integrity? Primarily because it is a legal requirement to have an accurate history with a proven audit trail; secondly, because organizational culture can be evaluated on the basis of data integrity.
India as a nation has suffered at the hands of a handful of companies that could not and did not secure their data. That the world took pot shots at its integrity was a heavy price the nation paid, and is still paying!