Text canonicalization in encryption

I have an application where I want to use a reliable message digest algorithm, such as SHA-1 or MD5. Both of these are implemented in Java 1.4, and I have some sample code and results from the IBM developerWorks site. However, when I compile and run the code on a Sun box, the message digest doesn't match the expected results. It appears to be a code-page issue.

Can these Java message digest algorithm implementations be used in such a manner as to generate the same results across platforms and control for code-page differences?

The problem isn't the hash algorithms; it is what we call "text canonicalization." A hash is computed over bytes, not characters, so the same text encoded under two different code pages can yield two different digests. You have to account for code-page differences before hashing by translating the text into some known "canonical form," or else remember *not* to do any translation and hash the original bytes exactly as they were produced. Either approach is acceptable, as long as every platform computes the hash over the same actual data.

OpenPGP (for which I'm a spec author) specifies that all text is encoded in UTF-8, the Unicode encoding.
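
As a rough illustration of the canonical-form approach, here is a minimal sketch in Java that always converts text to UTF-8 bytes before hashing, so the platform's default code page never enters the picture. The class name `CanonicalDigest` and the helper `sha1Hex` are made up for this example, and it is written for a modern JDK; on Java 1.4 the equivalent conversion would be `text.getBytes("UTF-8")`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, for illustration only.
class CanonicalDigest {

    // Hashes text over its UTF-8 bytes and returns the SHA-1 digest as lowercase hex.
    static String sha1Hex(String text) throws NoSuchAlgorithmException {
        // Name the encoding explicitly. A bare text.getBytes() uses the
        // platform default charset, which is where digests start to differ.
        byte[] canonical = text.getBytes(StandardCharsets.UTF_8);

        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] digest = md.digest(canonical);

        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Standard SHA-1 test vector for "abc":
        // a9993e364706816aba3e25717850c26c9cd0d89d
        System.out.println(sha1Hex("abc"));
    }
}
```

If the data starts out as a file rather than a Java string, the "no translation" option is to read the raw bytes and pass them straight to `MessageDigest`, without ever decoding them into characters.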


This was last published in August 2002
