Wednesday, April 2, 2014

What does the term “canonical form” or “canonical representation” in Java mean?

Wikipedia points to the term Canonicalization.
A process for converting data that has more than one possible representation into a "standard" canonical representation. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.
The Unicode example made the most sense to me:
Variable-length encodings in the Unicode standard, in particular UTF-8, have more than one possible encoding for most common characters. This makes string validation more complicated, since every possible encoding of each string character must be considered. A software implementation which does not consider all character encodings runs the risk of accepting strings considered invalid in the application design, which could cause bugs or allow attacks. The solution is to allow a single encoding for each character. Canonicalization is then the process of translating every string character to its single allowed encoding. An alternative is for software to determine whether a string is canonicalized, and then reject it if it is not. In this case, in a client/server context, the canonicalization would be the responsibility of the client.
In summary, a standard form of representation for data. From this form you can then convert to any representation you may need.

No comments:

Post a Comment