
Text canonicalization in encryption

I have an application in which I want to use a reliable message digest algorithm, such as SHA-1 or MD5. Both are implemented in Java 1.4, and I have some sample code and results from the IBM developerWorks site. However, when I compile and run the code on a Sun box, the message digest doesn't match the expected results. It appears to be a code-page issue.

Can these Java message digest algorithm implementations be used in such a manner as to generate the same results across platforms and control for code-page differences?

The problem isn't the hash algorithms; it is what we call "text canonicalization." A hash is computed over bytes, not characters, so the same text encoded under two different code pages produces two different digests. You have to account for code-page differences before hashing by translating the text into some known "canonical form," or else remember *not* to do any translation at all and hash the original bytes exactly as they were produced. Either approach is acceptable, as long as every party computes the hash over the same actual data.

OpenPGP (for which I'm a spec author) specifies that all text is in the UTF-8 encoding of Unicode.
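
As a rough illustration of the canonical-form approach, the sketch below converts the text to bytes with an explicitly named charset before hashing, rather than relying on `String.getBytes()` with no argument, which uses the platform's default code page. The class name `CanonicalDigest` and its methods are hypothetical names chosen for this example. It assumes Java 7 or later for `StandardCharsets`; on Java 1.4 the equivalent call would be `text.getBytes("UTF-8")`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class for illustration only.
final class CanonicalDigest {

    // Hash the text after encoding it with an explicitly chosen charset.
    // Naming the charset is what makes the result platform-independent.
    static String digestHex(String algorithm, String text, Charset charset) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            byte[] hash = md.digest(text.getBytes(charset));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // "SHA-1" and "MD5" are required on every Java platform,
            // so this is only reached for an unsupported algorithm name.
            throw new IllegalStateException(e);
        }
    }

    // Convenience wrapper: UTF-8 is the assumed canonical form here.
    static String sha1Hex(String text) {
        return digestHex("SHA-1", text, StandardCharsets.UTF_8);
    }

    static String md5Hex(String text) {
        return digestHex("MD5", text, StandardCharsets.UTF_8);
    }
}
```

Calling `CanonicalDigest.sha1Hex("abc")` should give the same digest on any platform. By contrast, hashing a string containing non-ASCII characters under two different charsets (for example UTF-8 and ISO-8859-1) gives two different digests, which is the mismatch described in the question.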


This was last published in August 2002
