How to prevent input validation attacks
How does canonicalization work? What should I do to prevent this input validation attack?

    Requires Free Membership to View

    SearchSecurity.com members gain immediate and unlimited access to breaking industry news, virus alerts, new hacker threats, highly focused security newsletters, and more -- all at no cost. Join me on SearchSecurity.com today!

    Michael S. Mimoso, Editorial Director

    By submitting your registration information to SearchSecurity.com you agree to receive email communications from TechTarget and TechTarget partners. We encourage you to read our Privacy Policy which contains important disclosures about how we collect and use your registration and other information. If you reside outside of the United States, by submitting this registration information you consent to having your personal data transferred to and processed in the United States. Your use of SearchSecurity.com is governed by our Terms of Use. You may contact us at webmaster@TechTarget.com.

Canonicalization is a fancy word for a fairly simple concept. It stems from the fact that there are more than 47 gazillion ways to encode characters for the Web today. Some of the most popular are UTF-8, UTF-16, and so on (which are described in detail in RFC 2279) A single character, such as a dot (.), may be represented in many different ways, such as ASCII 2E, Unicode C0 AE and many others. The problem is, with all of these different ways of encoding user input, a Web application's filters can be easily confused if they're not carefully built. For example, if you wanted to filter dots, but only remove ASCII 2E, it is possible for someone to use an alternative Unicode format with C0 AE and squeeze a dot past your filter. Web applications often filter user input for evil characters, like quotes that might be part of SQL injection attacks or script tags (like < and >) that might be part of a cross-site scripting attack. However, if the Web application filter only searches for UTF-8 input an attacker could use another encode, like UTF-16, to code the evil characters and bypass the filter.

Canonicalization, means converting something into a simpler, more fundamental form. Web sites should have code that converts user input from a variety of different encoding forms to a single simple form that everything after will utilize, like UTF-8. The filters and all subsequent processes are applied after canonicalization, so everything has the same impression of what the user input will mean. This conversion process is called canonicalization. As a user, there is nothing you can do about canonicalization issues on the Web sites you use. But, as a Web developer, you'll want to make sure that you appropriately canonicalize the data you receive from users. There's a wonderful Web application developer's guide at the Open Web Application Security Project (OWASP) that describes the details of the code you'd need to write to canonicalize data.

More on preventing Web application attacks

  • Visit our Web Application Attacks Learning Guide for Web application security tools and tactics to protect against these specific attack types.
  • Learn ten dos and dont's for secure coding.
  • This was first published in August 2006