Text canonicalization in encryption

I have an application where I want to use a reliable message digest algorithm, such as SHA-1 or MD5. Both of these are implemented in Java 1.4, and I have some sample code and results from the IBM developerWorks site. However, when I compile and run the code on a Sun box, the message digest doesn't match the expected results. It appears to be a code-page issue.

Can these Java message digest algorithm implementations be used in such a manner as to generate the same results across platforms and control for code-page differences?

The problem isn't the hash algorithms; it is what we call "text canonicalization." A hash is computed over bytes, not characters, so the same text encoded under two different code pages yields two different digests. You have to account for code-page differences before hashing by translating the text into some known "canonical form," or else remember *not* to do any translation before hashing. Either approach is an acceptable way to solve the problem, as long as the hash is computed over the actual data and both sides hash exactly the same bytes.
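The "no translation" option can be sketched in Java as follows. This is a minimal illustration, not the sample code from the question: the class name RawBytesDigest and its helper methods are hypothetical, and it assumes a Java release with java.nio.file (Java 7 or later) rather than 1.4. The point is that the data is read as bytes and handed to MessageDigest untouched, so the platform's default code page never gets involved.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: hashes bytes exactly as received, with no charset decoding.
final class RawBytesDigest {

    // Returns the SHA-1 digest of the given bytes as lowercase hex.
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            return toHex(md.digest(data));
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a standard algorithm name, so this is not expected.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Formats each byte as two hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Read the file as raw bytes, not through a Reader or a String,
        // so no code-page conversion can alter the data before hashing.
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(sha1Hex(raw));
    }
}
```

Run with a file path as the only argument. Two machines that hold byte-identical copies of the file should print the same digest, whatever their default encodings are.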

OpenPGP (for which I'm a spec author) specifies that all text is in the UTF-8 encoding of Unicode.
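The canonical-form option might look like the sketch below, assuming UTF-8 is chosen as the canonical encoding. The class name CanonicalTextDigest and its methods are hypothetical, and the code covers only the character-encoding step discussed here; it is not an implementation of OpenPGP's text handling. It uses StandardCharsets, which needs Java 7 or later; on Java 1.4 the equivalent call would be text.getBytes("UTF-8"). The pitfall to avoid is the no-argument String.getBytes(), which uses the platform default encoding and is a likely source of digests that differ between machines.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: converts text to one fixed encoding before hashing.
final class CanonicalTextDigest {

    // Encodes the text as UTF-8, independent of the platform default charset.
    static byte[] canonicalBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Hashes the UTF-8 bytes of the text with the named algorithm
    // ("SHA-1" or "MD5") and returns the digest as lowercase hex.
    static String digestHex(String algorithm, String text) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            byte[] digest = md.digest(canonicalBytes(text));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest algorithm: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // contains one non-ASCII character
        System.out.println("SHA-1: " + digestHex("SHA-1", text));
        System.out.println("MD5:   " + digestHex("MD5", text));
    }
}
```

Because the encoding is named explicitly, the same string should produce the same digest on any platform, provided every party agrees on UTF-8 as the canonical form.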

For more information on this topic, visit these other SearchSecurity.com resources:
Ask the Expert: Clarification of encryption keys
Ask the Expert: Using MD5 in Java
WhatIs Definition: canonicalization

This was first published in August 2002
