This article is rated C-class on Wikipedia's content assessment scale.
Broken link
http://www.iijnet.or.jp/iriz/irizhtml/multling/codes.htm has gone 404. I wasn't able to find a replacement. -- Anon, July 2003
- Thanks. Link fixed by going to http://www.iijnet.or.jp/iriz & follow the new site. --Menchi 20:17 21 Jul 2003 (UTC)
So, what *is* Big5, i.e., what characters does it map and how does it map them? The other pages on encodings seem to at least list what character ranges they can represent, and for the smaller (8-bit) ones a complete table.
... I Googled for a bit, but couldn't turn up a spec. I did find out it uses 8- or 16-bit characters. Playing with iconv seems to indicate it's a superset of ASCII. Can anybody do better?
I have added a description of the coding structure. - Gniw 04:41, 6 Feb 2005 (UTC)
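The iconv observation above (that Big5 behaves as a superset of ASCII) can be checked with a small sketch. It assumes that Python's built-in `big5` codec is representative of Big5 as commonly implemented; that choice of codec is an assumption made for illustration, not something stated in the article, and other implementations may treat single bytes differently.

```python
# Check that every 7-bit ASCII code point passes through unchanged.
# Assumption: Python's built-in "big5" codec stands in for "Big5 in practice".
def ascii_is_preserved(codec: str = "big5") -> bool:
    for code in range(0x80):
        ch = chr(code)
        encoded = ch.encode(codec)
        # The encoded form must be the same single byte, and must decode back.
        if encoded != bytes([code]) or encoded.decode(codec) != ch:
            return False
    return True


print(ascii_is_preserved())
```

This only shows what one codec does with single bytes below 0x80; it says nothing about whether the Big5 standard itself specifies them.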
Much of the content in zh:大五碼 should be translated into English, and some of the content here should be translated into that Chinese article too. -Hello World! 04:28, 11 August 2005 (UTC)
Not encoding a full semantic unit is a property of DBCS?
From the article:
- What Big5 encodes are particular graphical representations of characters or part of characters that happen to fit in the space taken by two monospaced ASCII characters. This is a property of double-byte character sets (…).
I don't think this is correct. DBCS only describes a class of encodings of alphabets (in the mathematical sense). It would be perfectly possible to conceive a DBCS encoding that associates a code with either half or a full ellipsis. Conversely, these issues are not unique to DBCS either. For instance, "…" = [ '…' ] may be the normal way to represent an ellipsis now, but in the past often dots (in this context effectively third ellipses) were used: "..." = [ '.' , '.' , '.' ]. The same may apply to combinations of punctuation and a space, or others. Also characters that have to combine with other characters are not unique to DBCS. For instance when using the de facto standard Quenya or Sindarin encoding (SBCS) some codes represent tehtar; these are only intended to be used in combination with a tengwa. Shinobu 08:40, 6 October 2005 (UTC)
- What I wrote in the article is correct. You need to understand the way the term "DBCS" is used by the Chinese/Japanese/Korean people (as opposed to the Wikipedian understanding, which might as well be historically true but meaningless for day-to-day use, at least for CJK).
- When DBCS is used for CJK, one double-byte character is always double-width (全角 or 全形, "full width"), and one single-byte character is always single-width (半角 or 半形, "half width"). This was historically due to the way the characters are displayed on a computer/terminal screen; only DBCS character sets exhibiting such a property can be correctly handled by programs not specifically designed for CJK (i.e., users can buy American-made programs to process CJK only if single-byte characters are single-width and double-byte characters are double-width).
- It is for this reason that the ellipsis or dash cannot be represented by a single DBCS character: the ellipsis or dash intrinsically spans two double-width positions.
- Certainly what you wrote is theoretically right (and Wikipedia's DBCS article will agree with you), but it is not the way DBCS is actually used in practice.
- Now, of course, if you use an operating system that never had a fixed-pitch legacy behind it, it is free to adopt any width to any character, single- or double-byte. Pre-OSX MacOS was such an OS.
- I'll try to change the article to clarify what I wrote.
- —Wing 00:30, 7 October 2005 (UTC)
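The width property described above (one byte per single-width column, two bytes per double-width column) can be illustrated with a rough sketch. It assumes Python's built-in `big5` codec and uses Unicode's East Asian Width property as a stand-in for terminal column width; both are assumptions made for illustration, and the function names are hypothetical.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Count fixed-pitch columns: wide/fullwidth characters take two, others one."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def big5_byte_length(text: str) -> int:
    """Number of bytes the text occupies when encoded with the big5 codec."""
    return len(text.encode("big5"))


sample = "Big5 大五碼"
# For ASCII plus Chinese characters, the two counts should agree.
print(display_columns(sample), big5_byte_length(sample))
```

The sample deliberately uses only ASCII and Chinese characters. Characters that Unicode classifies as ambiguous-width (for example some Greek letters and box-drawing characters) are two bytes in Big5 but would be counted as one column by this simple function, so the correspondence holds only under the display conventions described above.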
Thanks for the clarification. Your addition "as normally used in CJK computing" was just what was needed and the historical perspective is nice. By the way, any character set containing a mixture of single-byte and double-byte characters is called MBCS, if I'm not mistaken. Given the way Big5 works, with lead bytes and standalone bytes, it is an MBCS, I think. Shinobu 08:34, 7 October 2005 (UTC)
- Yes, you are right. However, such systems have always been called DBCS in the context of Chinese language processing on the PC (i.e., the need for a paired SBCS was always implicitly understood, and which SBCS should be used is also unspecified); in fact I had never heard the term "MBCS" until about a month ago, on Wikipedia. (I'm not sure about Japanese and Korean.) If you say MBCS, I dare say most Chinese programmers would not know what you are talking about; in spite of what Wikipedia's "DBCS" article says, the term "DBCS" has a very specific meaning for Chinese programmers and is not ambiguous at all. (This might be true for all MS-DOS programmers as well, as recently as 10 years ago; I just need to find my Microsoft-published manual, if I have not thrown it away. I am tempted to say that Wikipedia's DBCS article might be a little flaky in terms of historical accuracy as far as PC-based computing goes.)
- If you are familiar with Unix or X11 at all, you might be aware of such confusion elsewhere: We (CJK users) always perceive fonts exhibiting the property I described as "monospaced" (for the practical reason described above), but the non-CJK people were confused and insist that such fonts are "bi-width"; this misunderstanding has been causing font display problems and is just starting to be corrected.
- I hope I'm not too long-winded :D — Wing 03:06, 8 October 2005 (UTC)
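The "lead bytes and standalone bytes" behaviour discussed in this thread can be sketched as a simple splitter. The lead-byte range 0x81 to 0xFE is an assumption here (it is the range commonly given for Big5 and its extensions), and `split_big5` is a hypothetical helper, not part of any library.

```python
from typing import List

# Assumed Big5 lead-byte range; vendor variants may use a narrower range.
LEAD_MIN, LEAD_MAX = 0x81, 0xFE


def split_big5(data: bytes) -> List[bytes]:
    """Split a byte string into one-byte and two-byte units by lead byte."""
    units = []
    i = 0
    while i < len(data):
        if LEAD_MIN <= data[i] <= LEAD_MAX and i + 1 < len(data):
            # A lead byte consumes the following byte as its trail byte.
            units.append(data[i:i + 2])
            i += 2
        else:
            # Standalone byte (or a truncated lead byte at the very end).
            units.append(data[i:i + 1])
            i += 1
    return units


units = split_big5("A中B文".encode("big5"))
print([u.decode("big5") for u in units])
```

The sketch does not validate trail bytes, and a lone lead byte at the end of the input is passed through as a one-byte unit rather than reported as an error.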
Not at all :-) It's just that I have a thing for programming, typography and, just recently, Japanese (and lots of other things of course).
DBCS's meaning is quite unambiguous: "Double Byte Character Set". If Big5 itself specifies only two-byte combos, then it's a DBCS; before your edits it stated that single bytes were interpreted as ASCII, thereby specifying them, making it an MBCS. (The MBCS article doesn't exist yet, which is odd since it's in general use on this side of the continent.)
I also have two other questions about the article; I'll assume Big5 doesn't specify what to do with the single bytes.
- "In practice, Big5 cannot be used without a matching SBCS; this is mostly to do with a compatibility reason."
Sure you can: just don't use single bytes. One uses a matching SBCS because you want to use characters encoded by single bytes, no?
- "However, as in the case of other CJK DBCS character sets, the SBCS to use has never been specified."
This follows from the fact that it's a DBCS; if it would specify what to do with single bytes (like Shift_JIS does) it would be an MBCS.
@We (CJK users) always perceive fonts exhibiting the property I described as "monospaced" (for the practical reason described above), but the non-CJK people were confused and insist that such fonts are "bi-width";: Perhaps the Chinese were confused when they (mis)applied the term. The term monospaced has only one meaning. If a font contains both full-width and non-full-width characters, it's bi-width, or even proportional. In text that's not grid-aligned Japanese kana sometimes use less space than kanji, and often Latin characters are proportional.
@this misunderstanding has been causing font display problems and is just starting to be corrected.: I know of no specific instances. Could you name a few specific examples? At least I know for a fact that for most people this hasn't been a problem since at least 1995, and possibly even earlier.
Also note that you can, strictly speaking, not use every SBCS in conjunction with Big5: Latin-1 for instance needs all possible byte values. Maybe I'll read up a bit on the different versions of Big5. Shinobu 10:59, 8 October 2005 (UTC)
- You can try to google for “global advance” and “fontconfig”. This should return some discussion about the font display problem.
- Regarding the accompanying SBCS, we (Big5 users) did not bother whether the whole SBCS is usable, though for Latin1 probably not much is left after subtracting Big5 lead bytes so you’re absolutely right :D Back in the old DOS days, we regularly used Code Page 437, with a big chunk of it frequently missing (not necessarily always missing, depending on the Chinese system used and depending on what follows the SBCS characters). —Wing 15:59, 18 October 2005 (UTC)
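The point about Latin-1 can be made concrete with a short sketch that lists which printable high-half Latin-1 bytes would still be usable as single bytes. It assumes the Big5 lead-byte range is 0x81 to 0xFE; that range and the helper name `usable_high_latin1` are assumptions for illustration only.

```python
# Assumed Big5 lead-byte range 0x81..0xFE (range() excludes the upper bound).
LEAD_BYTES = range(0x81, 0xFF)


def usable_high_latin1() -> str:
    """Latin-1 printable characters (0xA0..0xFF) whose byte is not a lead byte."""
    return "".join(
        bytes([b]).decode("latin-1")
        for b in range(0xA0, 0x100)
        if b not in LEAD_BYTES
    )


print(repr(usable_high_latin1()))
```

Under this assumption only byte 0xFF survives from the printable high half of Latin-1. A variant with a narrower lead-byte range would leave more bytes free, which matches the remark that how much of the SBCS was usable depended on the Chinese system in use.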
Why is it called "Big5" at all?
What's the story behind the "Big5" name? Is this an acronym? -- 20:10, 1 June 2007 (UTC)
- I put in a sentence under "History" to answer that, as best as I could find out. Corrections welcomed. --Motley Fool 22:18, 17 August 2007 (UTC)
External links modified
Hello fellow Wikipedians,
I have just modified one external link on Big5. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
- Added archive http://web.archive.org/web/20060904023817/http://input.foruto.com:80/jptxt/ to http://input.foruto.com/jptxt/
When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).
This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).
- If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
- If you found an error with any archives or the URLs themselves, you can fix them with this tool.
Cheers.—cyberbot IITalk to my owner:Online 12:18, 30 May 2016 (UTC)