Friday, August 22, 2014

Character Set Special Characters

Is iso-8859-1 a proper subset of utf-8?
The character repertoire of ISO-8859-1 (the first 256 characters of Unicode) is a proper subset of that of UTF-8 (every Unicode character).
However, the characters U+0080 to U+00FF are encoded differently in the two encodings.
  • ISO-8859-1 assigns each of these characters a single byte from 80 to FF.
  • UTF-8 encodes the same characters as two-byte sequences C2 80 to C3 BF.
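The difference is easy to see from Java (a minimal standalone sketch; the class name is arbitrary): the same character produces one byte under ISO-8859-1 and two under UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String s = "é"; // U+00E9, LATIN SMALL LETTER E WITH ACUTE

        // ISO-8859-1: a single byte, 0xE9
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.printf("ISO-8859-1: %d byte(s): %02X%n",
                latin1.length, latin1[0]);

        // UTF-8: a two-byte sequence, 0xC3 0xA9
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.printf("UTF-8:      %d byte(s): %02X %02X%n",
                utf8.length, utf8[0], utf8[1]);
    }
}
```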
What about iso-8859-n?
These are 15 different encodings that contain a total of 614 distinct characters. Some of these characters occur in multiple "parts" of ISO 8859, and some don't. You'll have to be more specific.
I see that your question is tagged ISO-8859-2. The characters that are in -2 that aren't in -1 are:
Ă㥹ĆćČčĎďĐđĘęĚěĹ弾ŁłŃńŇňŐőŔŕŘřŚśŞşŠšŢţŤťŮůŰűŹźŻżŽžˇ˘˙˛˝
What about windows-1252?
Windows-1252 is just like ISO-8859-1 except that it replaces the rarely used control characters in the 0x80-0x9F range with printable characters. The characters that are in windows-1252 but not in ISO-8859-1 are:
ŒœŠšŸŽžƒˆ˜–—‘’‚“”„†‡•…‰‹›€™
http://stackoverflow.com/questions/10021594/character-set-special-characters

Character encoding, entity references and UTF-8

There have been a lot of questions in the forums recently all of which touch on a very important but often misunderstood part of building websites - character encoding, or how a document stores and displays different characters on a page.
The basics of character encoding - US-ASCII
In the beginning there was binary - all information is stored as a series of ones and zeros, or "on" and "off" - the heart of computing and electronics. In order to display alphanumeric characters, a standard was created which defined which binary sequence represented which character. This was the American Standard Code for Information Interchange, or ASCII. There were a few variants, the most well-known by far being US-ASCII, still in widespread use today.
With ASCII, each character is represented by a single octet: one byte, one letter. The biggest weakness of US-ASCII is that it only includes characters used in English, excluding accented letters and regional variations such as the German eszett (ß).
Stage two - the ISO standards
To fulfil the demands of users who required more than the basic a-z / A-Z sequence, extensions to ASCII were developed and approved by the ISO. The best known are the ISO-8859 series, which used the same sequences as ASCII but added extra characters for accented letters and regional variations. ISO-8859-1 is for most western European languages such as English, French, Italian...
ISO-8859-1 versus windows-1252
ISO-8859-1 became the standard encoding for most Unix and Unix-like systems. However, when Microsoft developed Windows, it used a slight variation on ISO-8859-1 commonly known as windows-1252. The differences between the two boil down to 27 characters (including the euro symbol, certain angled quote marks, the ellipsis and the conjoined oe or "oe ligature") which windows-1252 uses in place of 27 control characters in ISO-8859-1. Within Windows, ISO-8859-1 is silently replaced by windows-1252, which often means that copy/pasting content from, say, a Word document leaves the web page with validation errors. Many web authors incorrectly assume that the fault is with the characters themselves and that entity references are the only way to use accented characters. In fact, if you are using a western European language and ISO-8859-1, most accented characters such as é è û î etc. can be used without resorting to entities such as &eacute; or similar. (ISO-8859-1 does not include an oe ligature for a rather bizarre historical reason, but that's another story! You must therefore use &oelig; instead.)
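You can see the mismatch directly in Java (a small standalone sketch; the class name is arbitrary): a byte in the 0x80-0x9F range decodes to a printable character under windows-1252 but to an invisible control character under ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodePageDemo {
    public static void main(String[] args) {
        // 0x93 and 0x80 fall in the 0x80-0x9F range the two encodings disagree on
        byte[] bytes = { (byte) 0x93, (byte) 0x80 };

        // windows-1252: a left curly quote (U+201C) and the euro sign (U+20AC)
        String cp1252 = new String(bytes, Charset.forName("windows-1252"));
        System.out.println(cp1252);

        // ISO-8859-1: the very same bytes map to invisible C1 control characters
        String latin1 = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println((int) latin1.charAt(0)); // 147 = U+0093, a control character
    }
}
```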
Character encoding on the web - HTML entity references
In order to get around character encoding problems on the web, a method was introduced to "encode" non-ASCII characters in HTML without having to change charsets away from the widely supported US-ASCII. Accented characters such as é (e acute) can be encoded as &eacute; and the user agent will "translate" that into the appropriate character. These entity references or character entities are defined within the HTML document type definition (DTD) - in HTML 4.0, for example, there are over two hundred different entity references defined.
There are several weaknesses with the entity references approach. Firstly, they are excessively verbose - in ISO-8859-1 an e acute takes up one byte of space, whereas the entity reference &eacute; takes up 8 bytes. The second problem is that they are only useful in the context of a parsed HTML document - read the source code as plain text and the result can end up verging on gibberish, especially if you are using a language which relies heavily on accents, such as Polish. Even in French, if you want to write the phrase à côté, it ends up as &agrave; c&ocirc;t&eacute;.
HTML entities are a tag soup solution to a tag soup problem, and this is seen clearest with the third problem with entity references - XML.
HTML entity references, RSS and XML
The entity references "solution" falls down once you start working with XML. Unlike HTML 4.0 or XHTML 1.0, which have DTDs which define the entity references, most XML does not have a doctype declaration, so none of those entity references are valid. What's worse, as XML doesn't share HTML's liberal error-handling, using undefined entities will break the document.
XML actually has only five defined entity references, the bare minimum required for functionality. They are: &amp; &apos; &quot; &lt; and &gt;. There are various hacks and methods to add extra entity references to your XML, but the only real solution is to avoid their use entirely.
The most popular use of XML on the web at the moment is RSS and syndication. RSS is an XML format, so if you are using entity references, for example held in a database, then you will have difficulty producing a valid RSS feed. What's more, encoding directly in, say, ISO-8859-1 doesn't completely solve your problem as you are limited in the characters you can use. Want to add a copyright notice in your feed? In HTML you can use &copy;, but in RSS that entity just produces a parsing error (and many characters you might want, such as the oe ligature or the euro sign, are missing from ISO-8859-1 entirely).
One encoding for every language - Unicode and UTF-8
In order to overcome the hodge-podge of incomplete, conflicting and ageing standards (the ISO-8859 series dates from the early 1980s), Unicode was developed. The differing versions of the ISO-10646 standard (Unicode has been approved by the ISO) are beyond the scope of this very brief introduction, but the important difference with Unicode is that it offers one single character set for all of the world's languages. The second difference is that its encodings can use more than one byte per character, rather than a simple one-byte-per-character representation.
By far the most important Unicode encoding on the web is UTF-8. This standard has numerous advantages, the most important of which is that it remains compatible with the much earlier US-ASCII standard. In fact, all of the single-byte ASCII characters are represented in exactly the same way in UTF-8. Only extended characters are different, made from multi-byte sequences defined for each character, whether an e acute, an oe ligature, or characters from Arabic, Russian, Urdu or Japanese.
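A quick Java sketch makes that backwards compatibility concrete (the class name is arbitrary): pure ASCII text produces identical bytes under both encodings, while extended characters need extra bytes only in UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiCompat {
    public static void main(String[] args) {
        String ascii = "Hello, world!";

        // Pure ASCII text is byte-for-byte identical in US-ASCII and UTF-8
        byte[] a = ascii.getBytes(StandardCharsets.US_ASCII);
        byte[] u = ascii.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(a, u)); // true

        // Extended characters take multiple bytes in UTF-8
        System.out.println("é".getBytes(StandardCharsets.UTF_8).length); // 2
        System.out.println("日".getBytes(StandardCharsets.UTF_8).length); // 3
    }
}
```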
UTF-8 is especially important for XML as it is the default encoding for all XML documents. And as you can't use HTML entity references and the earlier ISO-8859 standards are incomplete, UTF-8 is the only logical choice when dealing with XML formats such as RSS or Atom which, even if you are only using English, are more than likely to eventually need more than the basic ASCII charset can offer.
UTF-8 is incredibly useful in HTML/XHTML too - no more entity references, the possibility to use extended characters such as curly quotes or long dashes, the possibility of using one charset across a multi-lingual site.
The downsides to UTF-8
There remain a few hurdles to UTF-8 acceptance, most of which can be minimised or overcome.
- Browser support is excellent, with IE 5.x and up supporting UTF-8 fully, as do Mozilla/Firefox, Opera, Safari, Konqueror, etc. However, earlier browsers such as IE4 and NN4 have problems, and IE3/NN3 and earlier lack support. Bear in mind that documents using markup older than HTML 4.0 cannot use UTF-8.
- The scripting language PHP (and some others) can have problems with multi-byte strings. See an excellent earlier WebmasterWorld thread by ergophobe: UTF-8, ISO-8859-1, PHP and XHTML [webmasterworld.com]. However if you check out how beautifully the PHP-driven WordPress handles UTF-8 content, it is clear that UTF-8 and PHP can successfully mix.
- Just because you can add content in, say, traditional Chinese to your site doesn't mean that the end-user has an appropriate font to display it - you still need to test and ensure compatibility when it comes to defining font families and such for your target audience.
How to implement UTF-8 on your site
If your site's language is English, simply swapping your ISO-8859-1 meta tags to UTF-8 gives the impression that you have succeeded. However, there is a little more to it than that. You still need to ensure that any non-ASCII content is correctly encoded. Users of other languages will almost certainly need to convert their files to UTF-8.
Most modern text and wysiwyg editors handle UTF-8 perfectly - in most cases, it is simply a case of going to "Save As" and choosing "UTF-8" or "Unicode" from the options. From then on, you can tidy up any entity references and start using the true characters. One useful tip is to copy/paste from a word-processing program such as Word which automagically replaces, for example, straight quotes with the appropriate "curly" opening and closing quotes.
If you are using a Linux or similar Unix-like server or desktop, you can use iconv to batch-convert many files at once.
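If you'd rather script the conversion yourself, the same job can be done in a few lines of Java (a sketch: the file names here are hypothetical, and it assumes the source really is ISO-8859-1):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ToUtf8 {
    public static void main(String[] args) throws IOException {
        Path in = Path.of("page-latin1.html");  // hypothetical input file
        Path out = Path.of("page-utf8.html");   // hypothetical output file

        // Read the bytes as ISO-8859-1, then write them back out as UTF-8
        String text = Files.readString(in, StandardCharsets.ISO_8859_1);
        Files.writeString(out, text, StandardCharsets.UTF_8);
    }
}
```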
Conclusion
If you are serious about standards, character encoding matters - even if you are just producing content in English. UTF-8 offers huge advantages, and you have everything to gain by moving to UTF-8 for new content.
Further reading
If you want a better or more detailed introduction to Unicode and character encoding in general, try some of these links:
  • A tutorial on character code issues [cs.tut.fi] (Jukka Korpela) 
  • On the Goodness of Unicode [tbray.org] (Tim Bray) 

  • What Is Unicode? [unicode.org] (Unicode Consortium) 

  • http://www.webmasterworld.com/forum21/11176.htm

    Monday, August 18, 2014

    How do I compare strings in Java?

    == tests for reference equality.
    .equals() tests for value equality.
    Consequently, if you actually want to test whether two strings have the same value you should use .equals() (except in a few situations where you can guarantee that two strings with the same value will be represented by the same object, e.g. string interning).
    == is for testing whether two strings are the same object.
    // These two have the same value
    new String("test").equals("test") // --> true 
    
    // ... but they are not the same object
    new String("test") == "test" // --> false 
    
    // ... neither are these
    new String("test") == new String("test") // --> false 
    
    // ... but these are because literals are interned by 
    // the compiler and thus refer to the same object
    "test" == "test" // --> true 
    
    // concatenation of string literals happens at compile time,
    // also resulting in the same object
    "test" == "te" + "st" // --> true
    
    // but .substring() is invoked at runtime, generating distinct objects
    "test" == "!test".substring(1) // --> false
    
    // interned strings can also be recalled by calling .intern()
    "test" == "!test".substring(1).intern() // --> true
    It is important to note that == is a bit cheaper than equals() (a single reference comparison instead of a method call), so in situations where it is applicable (i.e. you can guarantee that you are only dealing with interned strings) it can offer an important performance improvement. However, these situations are rare.
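One related gotcha worth noting (not in the original answer): if either string may be null, calling .equals() on it throws a NullPointerException. The standard library's Objects.equals() handles that case, as this small sketch shows:

```java
import java.util.Objects;

public class NullSafeCompare {
    public static void main(String[] args) {
        String a = null;
        String b = "test";

        // b.equals(a) is fine, but a.equals(b) would throw NullPointerException.
        // Objects.equals() checks for null before delegating to .equals():
        System.out.println(Objects.equals(a, b));           // false
        System.out.println(Objects.equals(null, null));     // true
        System.out.println(Objects.equals(b, "te" + "st")); // true
    }
}
```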

    Saturday, August 16, 2014

    What is the use of hashcode in Java?

    hashCode() is used for bucketing in hash-based implementations such as HashMap, Hashtable and HashSet. The value returned by hashCode() is used as the bucket number for storing elements of the set/map; this bucket number is the address of the element inside the set/map. When you call contains(), it takes the hash code of the element, then looks in the bucket the hash code points to. If more than one element is found in the same bucket (multiple objects can have the same hash code), it uses the equals() method to evaluate whether the objects are equal, and then decides whether contains() is true or false, or whether the element can be added to the set or not.
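A minimal sketch of the contract this implies (the Point class is invented for illustration): any class used as a hash key must override equals() and hashCode() together, since the hash code picks the bucket and equals() confirms the match.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class HashDemo {
    // A simple key class overriding both equals() and hashCode()
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }
        // Equal objects must produce equal hash codes
        @Override public int hashCode() { return Objects.hash(x, y); }
    }

    public static void main(String[] args) {
        Set<Point> set = new HashSet<>();
        set.add(new Point(1, 2));
        // hashCode() finds the bucket, equals() confirms the element is there:
        System.out.println(set.contains(new Point(1, 2))); // true
    }
}
```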

    http://stackoverflow.com/questions/3563847/what-is-the-use-of-hashcode-in-java

    Friday, August 15, 2014

    What's the meaning for object's monitor in java? Why use this word?

    but I am puzzled why the term "the object's monitor" is used instead of "the object's lock"?
    See ulmangt's answer for links that explain the term "monitor" as used in this context.
    Why use the term "monitor" rather than "lock"? Well strictly speaking, the terms do mean different things ... especially if you use them in the way that they were originally intended to be used.
    • A "lock" is something with acquire and release primitives that maintain certain lock properties; e.g. exclusive use or single writer / multiple reader.
    • A "monitor" is a mechanism that ensures that only one thread can be executing a given section (or sections) of code at any given time. This can be implemented using a lock, but it is more than just a lock. Indeed, in the Java case, the actual lock used by a monitor is not directly accessible. (You just can't say "Object.lock()" to prevent other threads from acquiring it ... like you can with a Java Lock instance.)
    In short, if one were to be pedantic "monitor" is actually a better term than "lock" for characterizing what Java is providing. But in practice, both terms are used almost interchangeably.
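The distinction shows up in Java code as follows (a sketch; the class and field names are invented): the intrinsic monitor is entered via synchronized and released implicitly, while a java.util.concurrent.locks.Lock is an ordinary object you acquire and release by hand.

```java
import java.util.concurrent.locks.ReentrantLock;

public class Counters {
    private int a, b;
    private final Object monitor = new Object();            // intrinsic lock (monitor)
    private final ReentrantLock lock = new ReentrantLock(); // explicit Lock object

    void incrementA() {
        synchronized (monitor) { // monitor entered and exited implicitly
            a++;
        }
    }

    void incrementB() {
        lock.lock();             // explicit acquire
        try {
            b++;
        } finally {
            lock.unlock();       // explicit release, even on exception
        }
    }

    int getA() { return a; }
    int getB() { return b; }
}
```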

    Difference between wait() and sleep()

    wait can be "woken up" by another thread calling notify on the monitor which is being waited on, whereas a sleep cannot. Also, a wait (and notify) must happen in a block synchronized on the monitor object, whereas sleep does not:
    Object mon = ...;
    synchronized (mon) {
        mon.wait();
    } 
    At this point the currently executing thread waits and releases the monitor. Another thread may do
    synchronized (mon) { mon.notify(); }
    (on the same mon object), and the first thread (assuming it is the only thread waiting on the monitor) will wake up.
    You can also call notifyAll if more than one thread is waiting on the monitor - this will wake all of them up. However, only one of the threads will be able to grab the monitor (remember that the wait is in a synchronized block) and carry on - the others will then be blocked until they can acquire the monitor's lock.
    Another point is that you call wait on Object itself (i.e. you wait on an object's monitor) whereas you call sleep on Thread.
    Yet another point is that you can get spurious wakeups from wait (i.e. the thread which is waiting resumes for no apparent reason). You should always wait whilst spinning on some condition as follows:
    synchronized (mon) {
        while (!condition) { mon.wait(); }
    }
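Putting the pieces together, a minimal mailbox (class and method names invented for illustration) shows the full wait/notify pattern: a condition, the guard loop against spurious wakeups, and notifyAll inside the synchronized block.

```java
public class Mailbox {
    private final Object mon = new Object();
    private String message; // the condition: a message is available when non-null

    public void put(String m) {
        synchronized (mon) {
            message = m;
            mon.notifyAll(); // wake any threads waiting on mon
        }
    }

    public String take() throws InterruptedException {
        synchronized (mon) {
            while (message == null) { // guard against spurious wakeups
                mon.wait();           // releases mon while waiting
            }
            String m = message;
            message = null;
            return m;
        }
    }
}
```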
    http://stackoverflow.com/questions/1036754/difference-between-wait-and-sleep