This article is rated Stub-class on Wikipedia's content assessment scale.
no more simple "grep 不"
Oh no. Now I will have to grep for two characters instead of one. Say where they keep the master list of them.
Mention whether color/colour is like X, Y, or Z variants.
—Preceding unsigned comment added by 210.201.31.246 (talk • contribs) 16:34, 15 March 2006
- It's fundamentally impossible to unify Chinese so everyone will be happy. Like everything else in real life text processing, code is going to have to be sensitive to the issues, just like code is sensitive to casing issues. The Unihan file from the Unicode ftp site has some of this information.
- There have been enough bad analogies already. Making superficial comparisons to English isn't going to clear anything up; to understand the issue, you'll have to understand some of the concepts behind Chinese writing.--Prosfilaes 22:15, 15 March 2006 (UTC)
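For the specific 不 case raised in this thread, one possible workaround is to normalize both the search string and the text before comparing. The sketch below is only an illustration: `find_lines` and the sample strings are hypothetical, and the approach covers only characters that have a canonical decomposition (such as CJK compatibility ideographs), not variants in general.

```python
import unicodedata


def find_lines(needle: str, lines: list[str]) -> list[str]:
    """Return the original lines that contain needle after NFC normalization."""
    folded = unicodedata.normalize("NFC", needle)
    return [ln for ln in lines if folded in unicodedata.normalize("NFC", ln)]


# Hypothetical sample: U+4E0D, then U+F967 (compatibility ideograph), then no match.
sample = ["\u4e0d\u53ef", "\uf967\u53ef", "\u53ef"]
hits = find_lines("\u4e0d", sample)
```

Here `hits` contains the first two sample lines, because NFC replaces U+F967 with U+4E0D before the comparison.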
x-variant?
We have
- The X-axis represents differences in semantics; for example, the Latin capital A (U+0041 A) and the Greek capital alpha (U+0391 Α) are represented by two distinct codepoints in Unicode, and might be termed “X-variants”
But I think that's a bad example. In my opinion, Latin A versus Greek A are Y-variants or Z-variants or no variants at all, distinct only because of the source separation rule. I'm still not sure I understand the term "x-variant" correctly, but I think a better example might be U+00C5 Latin Capital Letter A with Ring Above Å versus U+212B Angstrom Sign Å, or U+03BC Greek Small Letter Mu μ versus U+00B5 Micro Sign µ, or maybe even U+0041 Latin Capital Letter A versus U+0042 Latin Capital Letter B. —Steve Summit (talk) 21:01, 17 May 2006 (UTC)
- The fact that every Latin/Greek character set that wasn't trying to cram both into 7 bits has encoded Latin and Greek separately shows that Greek users feel it's not merely source separation. Conflating the two brings up too many problems. U+00C5 and U+212B are the same thing in Unicode; they aren't variants of any kind.--Prosfilaes 01:47, 18 May 2006 (UTC)
- It's still a bad example. I find it confusing, and I already understand the topic from reading many other sources. This example will certainly confuse anybody who does not already understand the topic and is searching for clarification here. Unicode itself does not use a confusing example in its definition.
- Here are some purely Latin alphabet examples: for the x-axis "a" vs "b", for the y-axis "a" vs "A" or "a" vs "ɑ", and for the z-axis "ş" vs "ș". Not that I would include these as the primary examples in the article since there is no Latin Unification in Unicode and many people will not be familiar with these characters anyway. — Hippietrail (talk) 00:12, 28 January 2009 (UTC)
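The pairs mentioned in this thread can be compared under Unicode normalization with Python's standard `unicodedata` module. This is only a rough check of how the encoded characters relate, not a test of X/Y/Z variant status; the helper `same_after` is a name made up for this sketch.

```python
import unicodedata


def same_after(form: str, a: str, b: str) -> bool:
    """True if a and b become identical under the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Angstrom sign vs. A with ring above: canonically equivalent.
angstrom_nfc = same_after("NFC", "\u212b", "\u00c5")
# Micro sign vs. Greek small mu: only compatibility equivalent.
micro_nfc = same_after("NFC", "\u00b5", "\u03bc")
micro_nfkc = same_after("NFKC", "\u00b5", "\u03bc")
# Latin capital A vs. Greek capital alpha: never folded together.
alpha_nfkc = same_after("NFKC", "\u0041", "\u0391")
```

Under these calls, `angstrom_nfc` and `micro_nfkc` are true, while `micro_nfc` and `alpha_nfkc` are false, which fits the remark above that U+00C5 and U+212B are the same thing in Unicode while Latin A and Greek alpha remain distinct.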
Confusion?
There might be some confusion because it's not always obvious how to apply Unicode's definitions precisely: looser definitions, like the one at wikt:z-variant, are sometimes used, and Unicode itself isn't perfectly rigorous. But whatever the confusion is about, the section § Confusion seems only to add to it and leave the reader more confused. It shows two pairs of Unicode characters, neither of which is a pair of z-variants by Unicode's definitions, and there is nothing in the referenced Unihan database that would suggest otherwise. Yet it erroneously claims they are both examples of z-variants.
Code points U+4E0D 不 CJK UNIFIED IDEOGRAPH-4E0D and U+F967 不 CJK COMPATIBILITY IDEOGRAPH-F967 are compatibility variants, which means they do not differ on the Z-axis at all; they are identical on all of the X, Y, and Z axes. They are essentially duplicate encodings of the same variant that exist only for round-trip compatibility with KS X 1001, where K0-5D55 (61-53) has the reading 불 and maps to F967, while K0-5C74 (60-84) reads 부 and maps to 4E0D; both have exactly the same reference glyph. But 4E0D has both readings, and F967 canonically decomposes to it. In many places it's not even possible to use F967 without it getting converted to 4E0D. On the other hand, because the distinction exists only in Korean, the compatibility character F967 unambiguously refers to the Korean variant, while 4E0D unifies all the other variants in addition to this one, so it might be displayed slightly differently from F967 when a font or language setting other than Korean is used. There is a standardized variation sequence 4E0D FE00 that can be used to select the F967 variant using 4E0D as a base character (your fontage may vary). Calling them font variants is certainly imprecise (not a Unicode term anyway), but you can see where it comes from.
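The canonical decomposition described above can be observed with Python's standard `unicodedata` module. This is a minimal sketch; the variable names are illustrative, and whether the variation sequence actually renders differently depends on the font, which the code cannot check.

```python
import unicodedata

compat = "\uf967"   # CJK COMPATIBILITY IDEOGRAPH-F967
unified = "\u4e0d"  # CJK UNIFIED IDEOGRAPH-4E0D

# The decomposition property is a singleton canonical mapping to U+4E0D.
decomp = unicodedata.decomposition(compat)

# Both canonical normalization forms replace F967 with 4E0D.
nfc = unicodedata.normalize("NFC", compat)
nfd = unicodedata.normalize("NFD", compat)

# The variation sequence mentioned above: base character plus U+FE00.
variation_sequence = unified + "\ufe00"
# Normalization leaves the variation selector in place.
survives = unicodedata.normalize("NFC", variation_sequence) == variation_sequence
```

Because `nfc` and `nfd` both come out as U+4E0D, any pipeline that normalizes text will lose the compatibility character, whereas the variation sequence passes through normalization unchanged.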
Code points U+5154 兔 CJK UNIFIED IDEOGRAPH-5154 and U+514E 兎 CJK UNIFIED IDEOGRAPH-514E are semantic variants which differ on the Y-axis; their abstract shapes are not unifiable in the first place, so they cannot be z-variants. This pair is even explicitly mentioned in UAX #38 (Unihan) as an example of y-variants. The fact that the referenced draft mentioned these two as examples of "zVariant" [sic] shows how confusing this can be, but does it matter, given that it never made it into the final version?
Can we get rid of the section? It makes no sense whatsoever in its current shape, and I don't see how it could be improved. – MwGamera (talk) 14:45, 18 August 2022 (UTC)
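By contrast with the compatibility pair, normalization does nothing for 兔 and 兎, since both are ordinary unified ideographs with no decomposition. The sketch below illustrates this; the `VARIANTS` table and the `fold` helper are hypothetical, written by hand for this example rather than taken from the Unihan database.

```python
import unicodedata

rabbit_a = "\u5154"  # CJK UNIFIED IDEOGRAPH-5154
rabbit_b = "\u514e"  # CJK UNIFIED IDEOGRAPH-514E

# Even the most aggressive normalization form keeps the two apart.
folded_by_nfkc = (
    unicodedata.normalize("NFKC", rabbit_a) == unicodedata.normalize("NFKC", rabbit_b)
)

# Hypothetical hand-written variant table (assumption, not Unihan data).
VARIANTS = {rabbit_b: rabbit_a}


def fold(text: str) -> str:
    """Map each character to its chosen representative, if the table has one."""
    return "".join(VARIANTS.get(ch, ch) for ch in text)


matches_after_fold = fold(rabbit_b) == fold(rabbit_a)
```

Code that wants to treat such pairs as equivalent therefore needs an explicit mapping of some kind, for instance one derived from the variant fields of the Unihan database.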