JustOnlyJohn
UTF-8 "compatibility"
The authors of UTF-8 seem to have had something very different in mind when they spoke of UTF-8 compatibility. They mainly meant that if a program parses a string for the presence of certain ASCII characters (such as NUL, LF, space, forward slash, etc.), but does not change the byte sequences that fall between these demarcation characters, and instead simply passes on such string fragments to some other interface unaltered, then the program will work just as well (for the purposes of such string parsing) with UTF-8 data as it does with ASCII data. Obviously such a program won't "understand" Unicode, but it can still be compatible... AnonMoos (talk) 00:04, 17 November 2008 (UTC)
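The delimiter-based parsing described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: Python `bytes` objects stand in for C char arrays, and the sample path and the helper name `split_and_rejoin` are hypothetical. It relies on the UTF-8 property that bytes in the range 0x00-0x7F never occur inside a multi-byte sequence, so a split at `/` cannot land in the middle of a character.

```python
# Sketch of an "ASCII-only" parser: it splits at the ASCII '/' byte and treats
# every fragment as an opaque atom. The helper name and sample path are
# hypothetical, chosen only for illustration.

def split_and_rejoin(path: bytes, prefix: bytes) -> bytes:
    """Split at '/', prepend an ASCII prefix to each fragment, and rejoin."""
    fragments = path.split(b"/")  # byte-level split, no decoding involved
    return b"/".join(prefix + fragment for fragment in fragments)

# A path containing non-ASCII characters, encoded as UTF-8.
path = "usr/caf\u00e9/\u65e5\u672c\u8a9e.txt".encode("utf-8")
result = split_and_rejoin(path, b"x-")

# Every byte of a non-ASCII character is >= 0x80, so none can be mistaken
# for the '/' delimiter (0x2F).
non_ascii_bytes = "\u00e9\u65e5\u672c\u8a9e".encode("utf-8")
delimiter_safe = all(b >= 0x80 for b in non_ascii_bytes)

print(result.decode("utf-8"))  # the fragments survive unaltered
print(delimiter_safe)          # True
```

The function never decodes its input, which is the sense in which it does not "understand" Unicode yet still produces valid UTF-8 output.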
- I agree completely, and that's exactly where the misconceptions arise. UTF-8 strings are only compatible with null-terminated C strings in very isolated, specific cases, for specific programs, that is, programs which do not actually handle strings or characters in the logical sense, and only see raw bytes. In other words, programs which do no string processing in the first place. Yes, some programs may function unmodified, but that's a gamble, and is not what people expect "backwards compatible" to mean. I think it is important to clarify that the compatibility of the encoding means that UTF-8 data is backwards compatible with old (C string) data, but not with old C string programs. In other words, new programs can read old data, but old programs can not read new data. This practical matter is often glossed over in discussions of UTF-8.
- How about if we include your explanation of "compatibility" above the list of misconceptions? That way, there would be some context in the article explaining how the misconceptions might arise. JustOnlyJohn (talk) 02:19, 17 November 2008 (UTC)
- I'm not sure we do agree -- such programs do in fact parse strings (though only with respect to fixed ASCII characters such as NUL, LF, space, "/" etc.), and they do do string processing in a sense (since they can add substrings together, add prefixes to substrings, add suffixes to substrings, etc.). The only restriction is that they must treat each of the substrings derived by splitting at ASCII delimiters as an unanalyzable atom, and not attempt to change it internally (though they can add ASCII characters to the beginning of such a substring, add ASCII characters to the end of such a substring, or concatenate two substrings together). This limited "compatibility" in fact allows a large number of originally non-Unicode-aware programs which are not mainly text-oriented to handle UTF-8 data either unaltered, or with relatively minor and trivial coding changes. That's why it only took a few days to do the basic work of converting Plan 9 to use UTF-8.[1] AnonMoos (talk) 07:14, 17 November 2008 (UTC)
- The fact that you think strlen(s) should return some number that has something to do with "the number of characters", or that string[x] should return a "character", indicates that you have not studied this very carefully and I am glad your edits have been removed. I recommend you actually try writing some software that uses UTF-8, because apparently the fact that programs DO NOT CARE how many "characters" are between two pointers is very difficult to get across. This is due to the fact that in single-byte-encoded strings it is trivial to convert between a pointer difference and a character count, leading to people using the terms interchangeably. A challenge for you is to design an actual useful case where a program wants the N'th Unicode code point in a string, and does NOT ever look at the 0 through N-1 code points of the same string. I don't believe there are any such real programs that serve any useful function, but you are free to try to design one and prove me wrong.
- Programs can indeed process UTF-8 without "understanding" it. Your argument is like saying a word processor does not work with English because it does not really understand English and just sees sequences of characters. Spitzak (talk) 17:20, 18 November 2008 (UTC)
- The challenge is upon you to prove that ALL existing ASCII-based C-string programs will behave identically with UTF-8 input as they do with ASCII input. If you can prove that, then you can claim that UTF-8 is "backwards compatible" with ASCII. Of course, nobody can prove any such thing. You can only produce hypothetical scenarios where it's POSSIBLE that existing programs may work unmodified.
- You are focused upon the specific cases which ARE backwards compatible. For example, programs which never perform random access on a C-string, which never inspect the N'th char in a C-string, which never replace the N'th char in a C-string, which never truncate a C-string based on a byte length, which never split a C-string, which never iterate across a C-string byte by byte, etc.
- It is immaterial whether performing those operations is "useful" or "correct" in Unicode terms. The point is that many, many existing ASCII C-string programs perform such operations and may be incorrect when suddenly given UTF-8 input.
- It's odd that you don't see the difference between "MIGHT BE" compatible, and "IS ALWAYS" compatible. This isn't a debate about whether SOME programs MAY work unmodified. That point has been made abundantly clear by UTF-8 advocates. This is a debate about clarifying what is meant by "backwards compatible". If everyone took the time to understand UTF-8 clearly, then we wouldn't need to tell them that it is "backwards compatible" with ASCII in the first place -- they would discover the limitations themselves and decide that UTF-8 is not backwards compatible in any practical sense of the term.
- I'm disappointed that the stubborn combativeness of one person will perpetuate this misinformation about UTF-8, affecting many people. I also see that you have a long history of personal attacks, and deleting others' changes in Wikipedia... so consider me out of this.
- Such absolute exceptionless 100% compatibility is not something that has been claimed or sought by the creators and promoters of UTF-8. What they meant by compatibility was 1) All ASCII characters have the same byte representation in UTF-8 and ASCII. 2) Only those characters in ASCII are represented using bytes in the 0-127 range in UTF-8. 3) Programs expecting ASCII inputs which were written using good coding practices, and which do only limited text parsing and processing, often will accept UTF-8 inputs unaltered (or if not, they can usually be converted to accept UTF-8 inputs with only relatively minor and trivial coding fixes). Obviously, this last point does not apply to programs which do extensive text manipulation, use fonts to display characters from multiple languages, etc. etc.
- The point of the limited compatibility in UTF-8 is to make the simple cases of handling multilingual text as simple and painless as possible relative to how things had previously been done (ASCII-only). It was not to formally prove absolute 100% exceptionless anything. AnonMoos (talk) 14:38, 1 December 2008 (UTC)
- You are showing your ignorance of how software really works. Your examples *do* work with UTF-8. Random access works; that is why byte-oriented search algorithms work without rewrite in UTF-8. Inspecting the N'th character works, if what is being done is "compare it to some ASCII character" (and I challenge you to find an actual example where any other action is done). Replacing the N'th character works if both the original and the replacement are ASCII (and again I challenge you to find a counterexample). Truncation to a fixed-size buffer? That is done by every file copy and compression and search program, and they all work fine with UTF-8! And iterating across a C string is *exactly* what all the interfaces like printf that work perfectly with UTF-8 do! Spitzak (talk) 07:29, 8 January 2009 (UTC)
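The byte-level operations argued over in this thread (length, search, truncation, single-byte replacement) can be checked concretely. The following is a minimal sketch, not a statement about any particular program: Python `bytes` stand in for C char arrays, `len` on bytes plays the role of `strlen`, and the sample string and the helper `is_valid_utf8` are hypothetical.

```python
# Hypothetical sample: 10 code points, two of them non-ASCII (U+00EF, U+00E9).
s = "na\u00efve caf\u00e9".encode("utf-8")

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 1. Byte length (what strlen would report) vs. code point count.
byte_len = len(s)                   # 12
cp_count = len(s.decode("utf-8"))   # 10

# 2. A byte-oriented search for an ASCII substring finds the right place,
#    but the result is a byte offset, not a code point index.
byte_offset = s.find(b"caf")                  # 7
cp_index = s.decode("utf-8").find("caf")      # 6

# 3. Truncating at an arbitrary byte offset can cut a multi-byte sequence.
cut_inside = s[:3]    # ends right after the lead byte of U+00EF
cut_between = s[:4]   # ends on a code point boundary

# 4. Overwriting one byte is safe when both old and new bytes are ASCII,
#    and corrupts the data when the old byte is part of a multi-byte sequence.
ascii_swap = bytearray(s)
ascii_swap[0] = ord("N")   # 'n' -> 'N'
bad_swap = bytearray(s)
bad_swap[2] = ord("i")     # overwrites the lead byte of U+00EF

print(byte_len, cp_count, byte_offset, cp_index)
print(is_valid_utf8(cut_inside), is_valid_utf8(cut_between))
print(is_valid_utf8(bytes(ascii_swap)), is_valid_utf8(bytes(bad_swap)))
```

Cases 1 and 2 show that byte-oriented code keeps working as long as offsets are treated as byte offsets. Cases 3 and 4 show where a program that cuts or overwrites at an arbitrary byte position can produce invalid UTF-8. Whether that matters depends on what consumes the result: a file copy is unaffected, while anything that later decodes the truncated piece on its own is not.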
Welcome
Hello, JustOnlyJohn, and welcome to Wikipedia! Thank you for your contributions. I hope you like the place and decide to stay. Unfortunately, one or more of your edits to the page UTF-8 have not conformed to Wikipedia's verifiability policy, and have been reverted. Wikipedia articles should refer only to facts and interpretations that have been stated in print or on reputable websites or other forms of media. Always remember to provide a reliable source for quotations and for any material that is likely to be challenged, or it may be removed. Wikipedia also has a related policy against including original research in articles.
There is a page about the verifiability policy that explains the policy in greater detail, and another that offers tips on the proper ways of citing sources. If you are stuck and looking for help, please come to the New contributors' help page, where experienced Wikipedians can answer any queries you have! Or, you can just type {{helpme}} on your user page, and someone will show up shortly to answer your questions. Here are a few other good links for newcomers:
- The five pillars of Wikipedia
- How to edit a page
- Help pages
- Tutorial
- How to write a great article
- Manual of Style
I hope you enjoy editing here and being a Wikipedian! Please sign your name on talk pages using four tildes (~~~~); this will automatically produce your name and the date. If you have any questions, check out Wikipedia:Where to ask a question or ask me on my talk page. Again, welcome! -- Collectonian (talk · contribs) 02:40, 17 November 2008 (UTC)