I have an application where I want to use a reliable message digest algorithm, such as SHA-1 or MD5. Both of these are implemented in Java 1.4, and I have some sample code and results from the IBM developerWorks site. However, when I compile and run the code on a Sun box, the message digest doesn't match the expected results. It appears to be a code-page issue.
Can these Java message digest algorithm implementations be used in such a manner as to generate the same results across platforms and control for code-page differences?
The problem isn't the hash algorithms; it's what we call "text canonicalization." A hash is computed over bytes, not characters, so the same text encoded under two different code pages produces two different digests. You have to account for code-page differences before hashing, either by translating the text into some known "canonical form" or by remembering *not* to do any translation at all, so that the bytes are hashed exactly as they were received. Either approach is acceptable, as long as every platform computes the hash over the same actual data.
OpenPGP (for which I'm a spec author) specifies that all text is encoded as UTF-8 Unicode.
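As a minimal sketch of the canonical-form approach in Java: the class name `CanonicalDigest` and the helper `sha1Hex` are made up for illustration, and choosing UTF-8 as the canonical form is an assumption borrowed from the OpenPGP convention. The point is to pass an explicit charset when turning the string into bytes, rather than calling `getBytes()` with no argument, which uses the platform default. This version uses `StandardCharsets`, which requires Java 7 or later; on Java 1.4 the equivalent would be `text.getBytes("UTF-8")`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class CanonicalDigest {

    // Hypothetical helper: encode the text with an explicit charset, then hash the bytes.
    static String sha1Hex(String text, Charset charset) {
        try {
            byte[] encoded = text.getBytes(charset);   // never the platform default
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(encoded);

            // Render the digest as lowercase hex.
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9";   // contains one non-ASCII character

        // Same text, different encodings, therefore different bytes and different digests.
        System.out.println(sha1Hex(text, StandardCharsets.UTF_8));
        System.out.println(sha1Hex(text, StandardCharsets.ISO_8859_1));
    }
}
```

If every platform encodes with the same charset before hashing, the digests should match regardless of the local code page. Pure ASCII text often hides the problem, because many code pages encode those characters identically.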