I am researching the area of data integrity and security. I would like to know basically what these concepts mean...
and some of the most common methods of implementing strategies to maintain good security controls on data and means of keeping integrity.
The concept of data integrity from a security perspective deals with making sure that data is not subject to unauthorized alteration. If I can ensure data integrity, I can verify that the data hasn't been changed by some attacker. Data integrity is hugely important; in many applications, data integrity outweighs confidentiality. For example, for most users, an attacker's changing the balance of their bank accounts is of far greater concern than a bad guy merely seeing their balance. Although, I suppose whether the balance is changed upward or downward has some impact on the user's concerns. ; )
Data integrity can be guarded by a variety of mechanisms. First, by keeping the attackers off of the system holding the information, data has some level of protection. Hardening systems, applying patches and utilizing host- and network-based intrusion-detection systems all help to keep the bad guys away from the data.
Another process that helps protect integrity is the regular backup. Systems, applications and stored data should all be backed up on a regular basis; for many systems, this means weekly or even daily. If the data gets altered, backups are critical in restoring a trusted state.
However, these are only baby steps to ensuring real data integrity. To really protect integrity, cryptographic algorithms can help to create a secure digital fingerprint of the data. Integrity checking is often accomplished using hash functions, such as Message Digest 5 (MD5). On a Linux system (and some other flavors of UNIX), you can use the md5sum program to create the hash. A hash function takes an arbitrary amount of data (1k to 1 Meg, or more) and crunches it down to a fixed-length result (usually on the order of a hundred bits or so; MD5 produces 128 bits). The hash function has a one-way nature. This means that given only the hash result, it's computationally very difficult to figure out what the original data was. Furthermore, it's extremely hard to find another set of data that has the same hash result.
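These properties are easy to see in action with Python's standard hashlib module. A minimal sketch, using MD5 to match the md5sum example (worth noting: practical collisions have since been demonstrated for MD5, so newer applications generally prefer the SHA-2 family, such as SHA-256):

```python
import hashlib

# Hash a piece of data; the digest is always 128 bits (16 bytes) for MD5,
# no matter how large the input is.
data = b"The quick brown fox jumps over the lazy dog"
digest = hashlib.md5(data).hexdigest()
print(digest)  # 9e107d9d372bb6826bd81d3542a419d6

# Changing even one character of the input yields a completely
# different fingerprint.
altered = hashlib.md5(b"The quick brown fox jumps over the lazy cog").hexdigest()
print(digest == altered)  # False
```

Swapping `hashlib.md5` for `hashlib.sha256` gives a 256-bit digest with the same one-way, collision-resistant behavior.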
I can use a hash function to create a fingerprint for each piece of data I want to protect. Then, I'll store the hash results on a safe medium (such as a write-protected floppy disk, CD-ROM or another server). Periodically, I'll recalculate the hashes of the active data to make sure they still match the stored values. If a recalculated hash doesn't match, I know the data was altered. Then, I can roll it back to the trusted copy in my backup.
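This fingerprint-and-verify cycle can be sketched in a few lines of Python. The `hashes.json` store and the helper names here are illustrative assumptions, not any particular product's format; real tools like Tripwire keep a more elaborate database:

```python
import hashlib
import json
from pathlib import Path

def fingerprint(path):
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot(paths, store="hashes.json"):
    """Record the current hash of each file; the store should then be
    moved to read-only media so an attacker can't rewrite it."""
    Path(store).write_text(json.dumps({p: fingerprint(p) for p in paths}))

def verify(store="hashes.json"):
    """Re-hash each recorded file and return the paths whose digest
    no longer matches -- i.e., the files that have been altered."""
    stored = json.loads(Path(store).read_text())
    return [p for p, h in stored.items() if fingerprint(p) != h]
```

Run `snapshot()` on a known-good system, then schedule `verify()` periodically; any path it returns is a candidate for restoration from backup.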
Now, we don't want an attacker to manipulate our stored hash results. If they alter both the data and the hash result, we won't be able to check the integrity of the data. Therefore, some applications apply the concept of a "keyed hash" or even a digital signature. For a keyed hash, the data is fingerprinted using a hash function that includes a secret key. If attackers manipulate the data and the hash, but don't have the secret key, they will not be able to calculate the proper hash. A digital signature uses public key cryptography to digitally sign the hash so it can't be altered undetectably. In a sense, both of these techniques are used to ensure the integrity of the hash itself.
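A keyed hash along these lines can be sketched with Python's standard hmac module, which implements the HMAC construction. The secret and messages below are made-up values for illustration:

```python
import hashlib
import hmac

secret = b"shared-secret-key"  # known only to the legitimate checker
data = b"balance=1000"

# A keyed hash: the digest depends on both the data AND the secret key.
tag = hmac.new(secret, data, hashlib.md5).hexdigest()

# An attacker who alters the data can recompute a plain hash, but
# without the secret key they cannot produce the correct keyed hash.
forged = hmac.new(b"guessed-key", b"balance=9999", hashlib.md5).hexdigest()
print(hmac.compare_digest(tag, forged))  # False
```

Note the use of `hmac.compare_digest` for the comparison; it runs in constant time, which avoids leaking information through timing differences.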
There are many products that implement these ideas. As stated above, the md5sum program calculates simple hashes. The Tripwire and AIDE tools are used to provide file system integrity checking using MD5 hashes. The PGP tools can be used to digitally sign (and also encrypt) any type of data, including e-mail and files.