I noticed someone has Chinese fonts working in Debian. I've been looking around the meta wikis trying to find which packages I should install to get Chinese Unicode fonts in Debian. A great topic for this page would be how-tos on getting Unicode fonts and getting Unicode functioning on different platforms, IMO. --dikaiopolis


Someone is going through and adding Unicode characters for various symbols in the martial arts articles. I have mixed feelings about this. Is it a good idea?

In my browser, this doesn't look good or useful at all. It just puts question marks there.

I would imagine that for other people it is useful and nifty.

Opinions?

(And this chit-chat should be removed soon to make way for a real Unicode article!)


I'm in the process of downloading fonts so I can see the characters. I don't think people should have to download fonts to see the characters. --Koyaanis Qatsi


This is an encyclopedia. It is intended for people researching various topics and terminology. If a term originated in a foreign language, it is helpful to include the term in its original writing, such as Chinese, Hebrew, Arabic, etc. This has become possible only recently because of the Unicode standard. The original writing helps researchers communicate with people who know the language. It also helps resolve confusion among variant transliterations, e.g. Qi vs. Chi vs. Ki. When you copy down the original writing, you get help from native speakers more easily than by trying to explain what exactly you are talking about. I have gone through some articles and put in the Unicode for the title terms only. I have refrained from touching the rest of those articles for exactly the reason you complain about. In my opinion, the Unicode additions are an important resource for researchers. If no one appreciates this contribution, I can stop right away.

If you believe that Unicode is the standard encoding for HTML in the future, you should upgrade your browser to the point where you can see all text from around the world, regardless of whether you can read it or not.

Rebuttal?


Well, I don't really have a rebuttal, exactly. I think it's a toss-up. On the one hand, it is probably true that over the next couple of years, people will naturally and accidentally find their computers using Unicode so that they can just magically see all these nice characters. On the other hand, I don't know how many people can see them now.

It isn't really about me. I'm a computer-savvy guy; I probably should upgrade my browsing setup to Unicode. It's really about less computer-savvy people. We shouldn't be snobbish and demand that everyone in the world upgrade to the latest browser, should we?

--User:Jimbo Wales


Does this mean Unicode in articles like Hebrew alphabet is not welcome here either? It all shows up as question marks and a lot of numbers. However, I believe some people somewhere are able to read that page if they care to set up their computers with the fonts. The missing wiki links are shown as ? too; to me, they also look very ugly. Can we get rid of them too?

Jimbo, I thought you owned this project. You have enough authority to decide one way or the other.

Actually, the Unicode does not really affect the operation of the encyclopedia. Users just need to know that the ?'s are for those who care to see the foreign characters; everyone else can simply ignore the ???s.


As somebody whose system can display Japanese/Chinese Unicode characters (Mozilla 0.9.5 on Debian), they certainly look way cool, but I'm not convinced they add much to the article (then again, I'm not a scholar of Asian languages). Still, having (???) there shouldn't be so bad.

Would it be practical to have a routine that converts characters specified in Unicode into GIFs and embeds them in the page for the interim period, until the majority of people have Unicode browsers and sufficient font sets to view Asian-language glyphs? --User:Robert Merkel

Hey! I'm using Debian; do you know which packages you installed to get Japanese/Chinese Unicode characters working? That's useful information! (I'm still sifting through apt-cache search chinese.) --User:Dikaiopolis


Most people with unsophisticated browsing arrangements (read: Windows 98/ME/2000/XP with Internet Explorer 5+) will either see those symbols fine as it is, or will get an automatic prompt from Explorer to download the appropriate fonts if they want to see them.

I strongly agree with native-language rendering in the case of important names; I think it only adds to the articles. We might want to consider adding some kind of standard one-line disclaimer to articles that feature a lot of Unicode characters likely to demand additional fonts from the user, something that would link to a special page explaining the situation, where to download the fonts if necessary, etc. --User:AV


The need to include native writing is obvious, as the comments above point out. You can see the same approach in other articles, such as Munich, where the first paragraph says

Munich is a city in and the state capital of Bavaria in Germany. Its German name is München.

The only difference here is that the German text has the luxury of being displayable in the ISO Latin-1 code page, i.e. it is visible to most Western users, while Asian, Polish, and Arabic text is not visible without special browser setup. Nevertheless, the need is there regardless of the font issue. On the other hand, most Chinese and Japanese users who use an Asian code page will not be able to see extended European characters in their browsers, so it is an even game. If your argument for banning foreign characters is based purely on the displayability of the text, then you must ban all non-English characters, including German and French.

Another approach is to specify the code point and instructions for how to look up the Unicode character, as in Chinese numerals. My opinion is that this is not as convenient as in-line Unicode.


ISO Latin-1 characters should be converted into Unicode HTML entities (this is especially true for all accented characters). We've discussed this before, and all the evidence is for it. I'll fix Munich right away. --User:AV

Changing the Latin-1 character into a &uuml; code does not help much. The character is still shown as a ? on a Chinese browser. So, as I pointed out earlier, the font problem is in the browser, not in whether Unicode is used. If you don't like Asian text showing as ??? on your English browser, I don't like German text showing as ??? on my Chinese browser either. If you really want to please everyone, you can only use the lowest common denominator, which is pure English. Why can't we just tolerate each other's ???, as long as everyone knows that someone else is able to see those ??? as native text?
Changing the Latin character does help much: it allows browsers to display the text correctly without identifying the encoding. If your Chinese browser still shows ?, then your Chinese browser is deficient, and it is the problem, not Wikipedia. The article now correctly gives it all the information about the character; and if you use Windows, accented European characters are available in the default fonts of all Windows versions, including Asian ones. In fact, the Chinese characters show up perfectly correctly in my English Windows and browser.
I'm not trying to please everyone; I'm trying to make it possible for everyone to see the right text, which is only possible by using Unicode. I'm not speaking against Chinese characters, quite the contrary. My opinion is that short names and crucial concepts should be given in their native rendering, but sentence fragments or complete sentences should be in English. --User:AV
Actually, AV, ISO-8859-1 characters do not need to be converted, and in fact they're more likely to display correctly on some older (pre-HTML4) browsers if they aren't. The Bomis server sends a Content-Type header with the pages that specifies ISO-8859-1, so all of those characters will come across with no problems unless the browser reading the text is broken (and in that case, the entities probably won't help). It's only characters outside the Latin-1 range that have to be specified as entities. It is unfortunate that present software and standards bodies are often out of sync, but that's the way it is. --LDC

LDC, the HTML source doesn't contain any specification of the encoding (just view the source of any Wikipedia page). Maybe the HTTP headers contain one which specifies ISO-8859-1 (I haven't checked), but even if they do, it's not enough: too many browsers, including for example this IE 5+ I'm using, don't use that information to automatically set the page's encoding. When I visit a Wikipedia page which uses even Latin-1 8-bit characters, I often see them as random Cyrillic characters (owing to the fact that I often visit Russian websites prior to visiting Wikipedia); the browser isn't smart enough to switch the encoding.
Yes, I'm talking about HTTP headers, not HTML source. Here they are exactly:
 Date: Tue, 23 Oct 2001 01:56:47 GMT
 Server: Apache/1.3.19 (Unix) PHP/4.0.4pl1 mod_fastcgi/2.2.10
 Connection: close
 Transfer-Encoding: chunked
 Content-Type: text/html; charset=iso-8859-1

Yes, some modern browsers will ignore that header; most older browsers don't even know it exists. But that's precisely my point: they will assume 8859-1 because that's the native character set (or Windows code page 1252, which is close enough), and do the right thing for the wrong reason. Those same older browsers won't know what &uuml; means, because that's a recent HTML-ism. All I'm saying is that the browsers that get it wrong when not encoded and get it right when encoded are few (you seem to have one), but the ones that will get it right when not encoded are many. And I'm not saying we should actually do the wrong thing just because it works on some browsers: what I recommend is technically correct. Do you have your IE5 set to "Auto-select" encoding? That's the View->Encoding->Auto-select menu item. It should switch back and forth correctly; mine does.
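For those following the header debate above, here is a minimal sketch (modern Python, purely illustrative; the function name is mine) of how a client could extract the charset parameter from a Content-Type header like the one quoted, falling back to ISO-8859-1, the traditional HTTP default for text types:

```python
# Pull the charset out of an HTTP Content-Type header value.
# Falls back to ISO-8859-1 when no charset parameter is present,
# which is what older browsers effectively assumed anyway.
def charset_from_content_type(header_value):
    for part in header_value.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset":
            return value.strip().strip('"').lower()
    return "iso-8859-1"  # historical default for text/* media types

print(charset_from_content_type("text/html; charset=iso-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # iso-8859-1
```

This illustrates the point at issue: a browser that ignores the parameter (or a server that omits it) leaves the reader guessing at the encoding.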

The encoding information should be inside the HTML document, and the Wiki software currently doesn't do this. There are many good reasons why the HTTP header isn't enough, not just practical ones: for instance, HTML files should strive to be XHTML-compliant, and this absolutely requires encoding information inside the HTML stream. If the Wiki software is modified to insert the appropriate attribute into all outgoing streams, I agree that converting Latin-1 characters won't be needed for correct display; I still think it's better to present all characters outside 7-bit ASCII as entities. --User:AV
I would like to see the encoding inside the HTML as well. Perhaps after a while, when old pre-HTML4 browsers have been phased out even overseas, the entity refs will clearly be the better thing to do. But I think you underestimate the number of IE2, IE3, and Netscape 3.x machines in the world that won't know what a &uuml; is.
Are you sure Netscape 3.x won't understand &uuml;? Anyway, one way to check would be to ask the Bomis administrators to analyze the Wikipedia traffic and give us statistics on the browsers used to access Wikipedia now. --User:AV
Instead of ignoring very old browsers, another way could be to translate things like &uuml; into their 8-bit equivalents when needed, e.g. when a user selects a certain flag, or even better, automatically when an older browser is detected (as all pages are dynamic anyway). --User:AV

One compromise would be a policy that only the subject title can include the native writing, and that it must be enclosed in () so that it does not interrupt the flow of the sentence. I also agree that one line pointing out what the ??? are would be helpful for the clueless.



As long as an article doesn't depend on the characters, and they are merely parenthetical extra information (as I rendered them in the Ang Lee article), then they can be nothing but a benefit. Also, this page should be deleted and moved to something like "Wiki special characters/ChineseTalk". --User:Lee Daniel Crocker


Though the Chinese characters started this discussion, I think the rule can extend to other languages. We have a lot of terms in this Wikipedia that are based on Arabic (e.g. Al-Qaida) and Hebrew (e.g. Yahweh). It would be beneficial for future scholars to actually see which text was transliterated when there are multiple versions of the English spelling.


Agreed; we can certainly represent eth and thorn in Anglo-Saxon with Unicode so that AS verse is rendered as is to all browsers. User:sjc

One problem with Hebrew and Arabic is that the text reads from right to left. The text will be displayed wrongly if the browser does not handle bi-directional writing. On the other hand, people who use the wrong browsers will see the text as ???, which reads the same either way. :-)

I like the idea of having Unicode as the base encoding of the Wikipedia instead of ASCII. This adds considerably to the quality and eases the description of many things, because you can use specific characters and symbols.

Alan Wood has a comprehensive site on Unicode fonts for the various operating systems. [1] In particular, he indicates where to download Unicode fonts so that Internet Explorer 5.0 (two years old) and above can display them.

User:Hannes Hirzel


One danger of using Unicode is that people start to overdo it. Since Unicode supports all languages of the world, there is a tendency to add too much non-English text into an English encyclopedia.

I would suggest that Wikipedia adopt a policy limiting foreign text to specifying the original writing of any transliterated English word, such as Tao, Al-Qaida, and Yahweh. I have no objection to including quotations in the original language, provided they are set in a block that does not affect the flow of the reading, assuming some people will just see ???s if they don't care about the original text.


I can see that including native text may not be as important for alphabet-based languages, because there is usually a one-to-one mapping from the English alphabet/syllable back to the foreign alphabet, so including the Unicode does not add much. However, for non-alphabetic languages such as Japanese and Chinese, the English transliteration seldom maps back to the original text correctly. For example, in Chinese, one pronunciation can map to over 100 different characters. On the other hand, in Japanese, one Kanji character (e.g. the character for one) can map to over 100 different pronunciations depending on the context it is in. Including the native text is the only way to solve such problems.


In several of the Physics articles I help work on, there is a need to display h-bar (Planck's constant divided by 2π) in the various formulae. In Unicode, it is as simple as entering &#8463; (ℏ). A hack is to use a struck-through "h", which unfortunately looks terrible. My gut feeling is to use Unicode, but are there people who feel strongly against it? --CYD

For the love of all that is good, yes, use a Unicode character reference and not some ugly hack! (My opinion, anyway.) If the specific Planck's-over-2π character (&#8463; = ℏ) isn't sufficiently widely viewable (works for me), try an italicized lowercase h-with-stroke (&#295; = ''ħ''), which is fairly close and should be in the basic fonts of most recent OSs, being in ISO-8859-3 and the Unicode Latin Extended-A section rather than the math symbols section. --Brion VIBBER

I'm pretty sure I already know the answer, since I've already looked at the Unicode and HTML pages, but I'm asking anyway. Is there any consistent way to indicate dot-under characters in the wiki, and in HTML in general? I know dot-unders are not part of ISO Latin-1, but they're an important part of transliterating Persian. It's not uncommon to ignore them, and that's what I have done, but I've tried _hard_ to get the orthography correct on the various Bahá'í stuff I've done, and the lack of the dot-unders is annoying to me. -- thanks in advance (and yes, I know that sometimes the answer is "no"). Rick Boatright 05:05 Feb 24, 2003 (UTC)

The dot-under characters are mostly in the "Latin Extended Additional" area of Unicode. Using number codes should do the trick. They fall in the hex range 1E00-1EFF. Thus hex 1EA1 = decimal 7841 gives "&#7841;" (ạ), hex 1E05 = decimal 7685 gives "&#7685;" (ḅ), etc. Eclecticology 07:49 Feb 24, 2003 (UTC)
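The hex-to-decimal step above is easy to get wrong by hand. As a sanity check, here is a tiny Python illustration (the helper name is mine) of producing the decimal numeric character references from the hex code points:

```python
# Turn a Unicode code point into the decimal numeric character
# reference described above, e.g. U+1EA1 -> &#7841;
def ncr(codepoint):
    return f"&#{codepoint};"

print(ncr(0x1EA1))  # &#7841;
print(ncr(0x1E05))  # &#7685;
```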

Hmmmm. Browsers can send an HTTP request header saying which languages they accept. Perhaps we could strip out Unicode characters that encode language X unless language X is listed among the accepted languages. This wouldn't work for browsers that understand Unicode but don't send the Accept-Language header, but how many such browsers are there? -- Khym Chanur 11:21, Oct 28, 2003 (UTC)



Greek letters, anyone? The instruction to use the "Symbol" font does not work reliably on Mozilla. However, when I use Unicode, Mozilla, IE, and others all manage to handle Greek letters with the default fonts on my system; no special fonts needed. Dogface 17:07, 5 Dec 2003 (UTC)

Unicode question

This may be the wrong place to ask, or it may be answered elsewhere, but can anyone tell me if and when the English wiki will be changed over to UTF-8? I ask because it's hugely inconvenient to work with text that's full of &#347;'s, but for some topics (Sanskrit and associated languages and subjects, in my case), there is no adequate alternative to using Unicode characters. This is true even if I eschew Devanagari and work in Roman script, because standardized Roman transliteration requires characters with diacritics that aren't available in Latin-1. कुक्कुरोवाच 20:51, 31 Mar 2004 (UTC)

I'm assuming that moniker is in Tamil, because my Mozilla 1.6 is totally fazed by it. -Phil | Talk 14:, Apr 1, 2004 (UTC)
No, regular everyday Sanskrit, in Devanagari.
As I understand it: Until recently, the general prognosis was "never", but the French Wikipedia recently converted, and I believe it was mostly successful. So if the remaining problems highlighted by that conversion get ironed out, there may be a possibility that the English 'pedia could make the switch as well if the desire is there. - IMSoP 22:08, 31 Mar 2004 (UTC)
What are the pros and cons? (I am sure this conversation has been had before, so a pointer will be plenty.) Pete/Pcb21 (talk) 22:30, 31 Mar 2004 (UTC)
The pros are that people who edit pages using special characters or non-Roman alphabets can just enter the characters as normal, and it'll just "work", instead of having to encode the characters using a somewhat opaque numerical code. For example, the characters in Kukkurovaca's name above must currently be encoded as &#2325;&#2369;&#2325;&#2381;&#2325;&#2369;&#2352;&#2379;&#2357;&#2366;&#2330;
I'm not sure of all of the cons, but one is that some older browsers don't support Unicode, in input if not in output; the database back end that Wikipedia uses may not support it either, in which case there would have to be a layer of code that would convert the Unicode-encoding text into something the database can handle when it is stored, and convert that text back into Unicode when it is retrieved. Also, special characters which are already on many pages currently in Wikipedia could go glitchy due to the change. Garrett Albright 22:41, 31 Mar 2004 (UTC)
Those older browsers are not able to browse half the WWW by now. — Jor (Talk) 12:21, 1 Apr 2004 (UTC)
The masses clamor for Unicode! I'm surprised something so standards-oriented as Wikipedia isn't using it already... Garrett Albright 22:23, 31 Mar 2004 (UTC)
The main reason it isn't Unicode is because the original version of the software didn't support it, and conversion is difficult. It'll require some downtime. There were worries about corruption of the database in various ways, but we have a fairly good handle on that problem now thanks to the recent conversion of the French Wikipedia. I think conversion of the English Wikipedia would be a good idea, some time during the next few months. -- Tim Starling 00:04, Apr 1, 2004 (UTC)
The only Mac browsers able to use Unicode are Safari, Opera, etc. on Mac OS X, as far as I know, and it is not possible to edit Unicode pages with IE. A switch to Unicode would be very problematic for many Mac users. Ertz 00:12, 1 Apr 2004 (UTC)
OS 9 has Unicode support; not quite as slick as OS X, no, but it's there. Either way, the number of people still using OS 9 is dwindling rapidly, and will continue to do so. Garrett Albright 02:43, 1 Apr 2004 (UTC)
Which masses have you polled? Unicode would be largely impossible to edit. RickK | Talk 02:45, 1 Apr 2004 (UTC)
How so? I mean, what are the specific drawbacks, other than for the users of older Macs? कुक्कुरोवाच 03:10, 1 Apr 2004 (UTC)
If I were trying to edit a page, and came across something looking like कुक्कुरोवाच, I would have NO idea what to do with it. RickK | Talk 03:35, 1 Apr 2004 (UTC)
RickK: Just work around it and don't touch it. :)
Judging by your <nowiki> tags, do you mean "something looking like &#2325;&#2369;&#2325;&#2381;&#2325;&#2369;&#2352;&#2379;&#2357;&#2366;&#2330;"? In which case, I'm not sure I see your point. We already use such character entities extensively in articles. The idea of UTF-8 is to allow unicode characters to be inserted without resorting to such ugly constructions. Also, switching en to UTF-8 will make it easier to implement some proposed interwiki features, such as merging the meta recent changes (which is UTF-8) with the local wiki recent changes. -- Tim Starling 03:49, Apr 1, 2004 (UTC)
Doesn't work in Safari, at least not whatever particular language that is. I see the same character (a box surrounding a char I don't recognize) repeated for each character in your sig. Other languages work fine: Japanese, Chinese, Greek, some Cyrillic, but there's one Cyrillic-alphabet-based language that also doesn't work (not sure which it is). That's the problem: support is spotty. If user A enters text in Japanese natively, what happens when user B, who doesn't have Unicode support, saves the page? I'm pretty sure the characters would change to little boxes (or whatever the browser displays when it doesn't understand a character) in the textarea; the user would save the page and then everybody would see the "little boxes". I think it could be a problem waiting to happen. RADICALBENDER 05:02, 1 Apr 2004 (UTC)
The web browser does *not* rewrite the characters to "little boxes" when editing -- they are simply shown that way by whatever display mechanism the browser uses. silsor 05:29, Apr 1, 2004 (UTC)
RB: Next time you (re)install OS X, make sure to let it install every language file it can. I'm running Safari on OS X, and I see the characters just fine. Garrett Albright 05:34, 1 Apr 2004 (UTC)
I have no idea how to do that. And how many other random Wikipedia editors would? RickK | Talk 04:14, 1 Apr 2004 (UTC)
The whole point is that if the software were switched over to UTF, you wouldn't need to interact with these strings or know anything about them at all. They would just work as regular characters.
I'm at an utter loss. How would I possibly be able to insert a character that isn't on my keyboard? RickK | Talk 04:56, 1 Apr 2004 (UTC)
Rick, if you're using Windows, then the Character Map applet is your friend. Find the character you want and it will either tell you how to enter it from the keyboard or allow you to copy+paste it. You'll need some nice Unicode fonts, like Junicode, but newer versions of Windows come with Lucida Sans Unicode anyway. --Phil | Talk 14:, Apr 1, 2004 (UTC)
In most Windows applications, left Alt + the numeric keypad types (decimal) Unicode. Alt+0549 is ȥ, for example. — Jor (Talk) 12:21, 1 Apr 2004 (UTC)
The prefixed 0 is important, by the way: otherwise the Windows encoding is used instead, which wraps around (alt+256 = alt+0). — Jor (Talk) 12:25, 1 Apr 2004 (UTC)
Actually, with or without the 0 you don't get Unicode, but the system's ANSI and OEM codepages respectively. You can use WordPad (or anything else which uses a Richedit control): type the hex number for the Unicode character, then type Alt-x. Pjacobi 08:43, 14 Jul 2004 (UTC)
With a compose key, maybe, or with copy-and-paste. I keep a set of characters I need which I don't have on my keyboard on my userpage on cy:, and c+p them when I need them in articles. Marnanel 05:01, Apr 1, 2004 (UTC)
People who use the languages in question know how to type in them. Someone who studies Sanskrit needs to be aware of how to produce the relevant Unicode characters. Similarly, someone who writes mathematical articles may need to learn TeX, and someone who works in science may need to produce diagrams. You contribute what you know; it's not necessary to be an encyclopedist to contribute to an encyclopedia. That said, there's a good resource at http://www.alanwood.net/unicode/ . If you go to the test pages, you'll see a list of characters which can be copied and pasted into an edit box. -- ɫɪɱ ʂɫɒɼʅɪɳɠ 05:10, Apr 1, 2004 (UTC)
If you were going to work with Sanskrit (or other languages in its family), I would suggest http://www.aczone.com/itrans/online/. Other tools apply for other languages (there's also http://www.emeld.org/tools/charwrite.cfm for IPA in Unicode, which offers pan-linguistic functionality of a certain kind). Of course, it's entirely possible you'll never need to deal with nonstandard characters (in which case it shouldn't make the least difference to you which encoding the site uses, as your keyboard will suffice in either), but for those who contribute to articles that necessarily involve terms from languages not representable with the characters that go into English, there's a basic need here. कुक्कुरोवाच 05:42, 1 Apr 2004 (UTC)

Switching the entire project over to UTF-8 or leaving things in ISO-8859-1 are not the only two choices. It would be straightforward to add a user option for "Edit in UTF-8". When a logged-in user with this option set requests to edit a page, the server translates HTML character references to their UTF-8 equivalents. When the user submits their edit, the server translates non-ASCII (or non-ISO-8859-1) characters back to HTML character references for storage in the database. Users who don't set this option would see no difference. See my Editing in UTF-8 feature request. — Gdr 12:33 2004-04-01.

For complex scripts, this is a nontrivial operation. It would require the server to change all characters above code point 255 to numeric entities when converting to ISO-8859-1, and likewise to convert all entities back to direct characters when converting to UTF-8. Let alone the problem of combining diacritics and RTL/LTR! — Jor (Talk) 12:41, 1 Apr 2004 (UTC)
I don't see the difficulty. Numeric character references are trivial to translate since HTML &#x1234; turns into Unicode U+1234 and vice versa. Named character references like &ouml; and &rarr; can be looked up in a table. There's no need to do anything with diacritics and bidirectional text. Just store and transmit the text as it was written and leave it up to the browser to render it. — Gdr 13:52 2004-04-01 (UTC)
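As an illustration of how mechanical the translation described here is, consider this sketch in modern Python (anachronistic relative to this 2004 discussion, and the variable names are mine): the standard library already handles both directions, resolving numeric and named references one way and emitting numeric references the other.

```python
import html

# Entities -> Unicode text: html.unescape resolves both numeric
# references (&#x2192;, &#8594;) and named ones (&uuml;).
decoded = html.unescape("M&uuml;nchen &#x2192; &#8594;")
print(decoded)  # München → →

# Unicode text -> pure ASCII with numeric references, suitable for
# storage in a Latin-1/ASCII database as the proposal suggests.
encoded = decoded.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(encoded)  # M&#252;nchen &#8594; &#8594;
```

Note the round trip is lossless for the text itself; only the on-the-wire representation changes, which is exactly Gdr's point.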
I agree with the last part. But that, if anything, is an argument for UTF-8 only rather than for a server-side ISO-8859-1/UTF-8 conversion. Just for argument's sake, browsers that can't handle Unicode won't be affected as UTF-8 is identical to ISO-8859-1 in the first 256 characters. Any chars above that probably will not display correctly for people using archaic browsers anyway. — Jor (Talk) 17:43, 1 Apr 2004 (UTC)
I think you misunderstand. The point of having an "edit in UTF-8" option has nothing to do with display. Pages display just fine with the current system. The point is to make it easy to enter international text in browsers other than Mozilla. If the editing page is transmitted in UTF-8, I can type international characters directly into the edit box in many browsers, including Opera, Safari, and Internet Explorer. With the current system (editing page transmitted in ISO-8859-1), I have to convert international characters into the corresponding HTML character entity references. This is tedious. — Gdr, 11:44 2004-04-02.
Hehe, even early versions of Mozilla are more advanced than IE, and not only when it comes to UTF-8. IE4 has patchy support, NS4 as well. Nobody editing pages in languages where UTF-8 is important uses these browsers, though. A check that the posted text validates as UTF-8 makes sense, IMO; throw an error otherwise. Somebody just has to write it. Volunteers? -- Gabriel Wicke 13:24, 2 Apr 2004 (UTC)
I guess using Opera made me lazy. I just type non-West-European chars like Ł or 匥, and Opera does the conversion to the HTML entity for me if the page is in a non-Unicode charset. :) Thanks for clarifying! — Jor (Talk) 19:55, 2 Apr 2004 (UTC)

Hi! I am a user from the French Wikipédia. I know that some of you were interested in the conversion to UTF-8. As you will perhaps want to test on your personal wiki before considering the switch, here is the software to convert the MySQL dump: http://mboquien.free.fr/wikiconvert/ . It converts:

  • html entities, for instance &szlig; => ß, excluding on purpose &gt;, &lt;, &nbsp; and &amp;
  • unicode entities (decimal or hexadecimal), for instance &#223; => ß
  • all other characters valid in your encoding are converted properly

What it doesn't do :

  • badly formatted entities are not converted, typically an entity that doesn't end with ;
  • Windows-1252 characters are also not converted. To have them corrected before the conversion, you can ask Looxix on the French wiki. He has a very good bot for this kind of task, if you don't already have one.

This is a rewritten version of the one we used (which was really dirty) to convert the French wiki. I rewrote it this afternoon and tested it on an old cur dump of the French wiki; everything seems to work as expected. As for the details, it depends on Qt (no trolling about the toolkit used, please) and I ran it on Mandrake 10.0. I was told that it also compiles out of the box on Slackware. If you use another distribution, you may need to tweak the Makefile to get the correct path for Qt (you should set QTDIR correctly before trying to compile). Needless to say, you need the Qt development packages installed. Using it is quite easy. The Makefile produces a wikiconvert executable. To convert, you just need to write: ./wikiconvert < dump > converteddump (if you don't use ISO 8859-1, there is one line to change in wikiconvert.cpp, as explained in the source). On my computer (an Athlon XP 2000+ underclocked to 1.5 GHz), converting a 90 MB dump of cur takes about 100 seconds. You should ask for an uncompressed dump of cur for your test, since the compressed dumps available at http://download.wikipedia.org/ are not suitable for conversion: once converted, MySQL can't load the dump completely (a problem of lines being too long, apparently, last time I tried).

I'd be very happy to get some feedback, and I would gladly accept patches to make the program faster/better. :) If you have any questions, you can reach me on #fr.wikipedia on Freenode or on my discussion page (French or English only, please). Med 09:41, 4 Apr 2004 (UTC)

I think the ironic thing is that Wikipedia is already using Unicode. Tagging the pages as ISO-8859-1 and forcing users to use HTML entities just takes up more bandwidth and makes editing slower.

-浪人

update: By now the Spanish and German Wikipedias have been converted successfully to UTF-8. Only Dutch, Danish, Swedish, and English still use 8859-1.


While the whole Unicode debate is going on, you might find a little tool I wrote useful. Just go to my user page for the source and a link to a "runnable" version. All it does is convert all non-ASCII Unicode characters you type into it into the &#0000; format. I didn't know if there was something like this already out there, so I just spent 25 minutes writing my own. --Aramgutang 06:46, 8 Aug 2004 (UTC)
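I haven't seen Aramgutang's tool, but the conversion it describes is a near one-liner in modern Python (a sketch under that assumption; the function name is mine):

```python
# Replace every non-ASCII character with its decimal &#...; numeric
# reference, leaving plain ASCII untouched -- the same job the tool
# above describes.
def to_entities(text):
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_entities("Planck ℏ and München"))  # Planck &#8463; and M&#252;nchen
```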

Greek unicodes


I have placed a set of Greek alphabet unicodes at the foot of my User page for anyone who works on Greek-related articles and shares my inability to memorise them. Adam 03:12, 23 Apr 2004 (UTC)

Wouldn't it be best to use HTML entities, for backwards compatibility? Dysprosia 10:28, 23 Apr 2004 (UTC)
Plus they are a lot easier to remember. theresa knott 11:01, 23 Apr 2004 (UTC)
HTML entities are hard to edit and look ugly in the editing window, not to mention that they are SGML-only, whereas Unicode text can simply be copied and pasted in any text editor. — Jor (Talk) 12:21, 23 Apr 2004 (UTC)
What was wrong with the Unicode tables in the Greek alphabet article? Gdr 11:56, 2004 Apr 23 (UTC)
There is nothing inherently wrong with Unicode, but most people who are on non-Unicode compliant systems can't see Unicode glyphs. Dysprosia 12:05, 23 Apr 2004 (UTC)
But people using those archaic systems won't be able to access most non-US-ASCII websites anyway. Why punish everyone to cater to a very small minority which probably has no interest in reading Greek in the first place? — Jor (Talk) 12:21, 23 Apr 2004 (UTC)


That doesn't mean we should actively seek to prevent users on different, non-Unicode-compatible systems from reading the text. I was somewhat sure that Windows 9x versions were not natively Unicode compatible, but [2] seems to suggest that this is the case.
In any case, how are the HTML entities "punishment" in comparison to the Unicode glyphs? One would think that the numerical Unicode entity would be more painful to enter than the slightly more intuitive HTML text-based entity... Dysprosia 12:53, 23 Apr 2004 (UTC)
You can't save unicode characters into articles on en, the encoding is ISO 8859-1. If you paste in a unicode character, or type it somehow, most browsers will automatically convert it to a numeric character entity. You can type in unicode if you wish, but it means that numeric character entities will be saved (e.g. &#945;) rather than the more readable named character entities, e.g. &alpha;. Unicode support in browsers is irrelevant. -- Tim Starling 01:15, Apr 24, 2004 (UTC)
I don't think the named entities are really necessary for typing Greek text: they exist mostly as a historical accident, because Greek letters are used as symbols in many other fields. We type Cyrillic using the numeric entities, for example, because that's the only way to do it, and doing the same for Greek doesn't seem any worse. Furthermore, it is not possible to write correct Greek text using only the named entities, because no entities are provided for accented characters, and nearly every Greek word has at least one accent (spelling it without the accent is not correct). Writing a word with named entities except for one numeric entity in the middle would be rather odd. --Delirium 02:50, Apr 28, 2004 (UTC)
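Delirium's point can be illustrated with Python's standard html module (a sketch; the helper name here is mine): plain Greek letters have HTML 4 named entities, but the accented polytonic ones do not, so any entity-encoded accented Greek word is forced into the mixed form described above.

```python
import html
from html.entities import codepoint2name

def greek_entity(ch: str) -> str:
    """Use a readable named entity where HTML 4 defines one,
    otherwise fall back to a numeric character reference."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else f"&#{ord(ch)};"
```

For example, plain alpha comes out as the named `&alpha;`, while the perispomeni-accented upsilon in a word like Ἰησοῦς has no named form and can only be written numerically; browsers unescape both forms to the same characters.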

Which Unicode characters can/should we use?


I started a few weeks ago changing various Greek language entries (e.g. in the top line of Jesus, I put Greek Ἰησοῦς Χριστός Iēsoûs Khristós) to display the proper accent marks. This displays fine in Mozilla. But when I try to display the same pages in Microsoft Internet Explorer all I get is little squares not Greek letters.

Is there an official Wikipedia policy on which Unicode characters we should and should not use? m.e. 10:58, 24 Jun 2004 (UTC)/m.e. 08:12, 9 Jul 2004 (UTC)

I can see a few question marks within the Aramaic spelling, and I have the fairly complete MS Arial Unicode font installed. The different display in Mozilla and IE might be a font selection problem; maybe you have set your Mozilla to use a different default font? I am not aware of any official policy on Unicode, only that we should limit ourselves to the original and the English spelling, as there is not much point in having the Cyrillic spelling of some place in Greece. If it displays better in most cases, you can try it without the accent marks, and maybe put the correct version enclosed in an HTML comment after it. andy 11:33, 24 Jun 2004 (UTC)
My Mozilla is on Linux and my MSIE is on XP, so I'm not surprised to get different results. I know that some users will be reading Wikipedia using Mosaic on Windows 1.0 and some will have the complete Unicode everything installed. I'm not sure how to strike a compromise in between. m.e. 12:02, 24 Jun 2004 (UTC)
On the same machine, with the same operating system and the same font, IE displays only a subset of the Unicode characters that Mozilla displays. I think this is because Mozilla has a better-developed character code mapping table (it's had three years' more development). Mr. Jones 14:07, 24 Jun 2004 (UTC)
You might find this page on meta useful. theresa knott 14:02, 24 Jun 2004 (UTC)   — thank you, Theresa, I've read it now; I have been creating the characters using &#xffff;, I was wondering which characters I should and should not use. m.e. 10:45, 25 Jun 2004 (UTC)
I suggest that MS Arial Unicode is perhaps the worst font for page compatibility tests because, although it is probably the most complete Unicode font commonly available, it is limited to only those who have a Microsoft product like MS Office 2000 or later installed on their MS Windows IBM-compatible computer. Even though this probably includes more than half the computer user population of the world, it leaves out a huge minority as well. (Personally, I've never gone beyond Office 97, having no compelling reason to pay the huge expense.) Microsoft doesn't seem to offer it as a separately downloadable font, even for a price. (Just another of the thousands of little ways it encourages everyone to buy its major software products.) -- Jeff Q 21:09, 24 Jun 2004 (UTC)
Are there any good alternative fonts that are more widely available? Also, is the En wiki ever going to go UTF-8 like all the others? -- कुक्कुरोवाच|Talk‽ 21:13, 24 Jun 2004 (UTC)
Alan Wood's Unicode Resources page is an excellent resource for Unicode font issues. His "Introduction" section includes a set of links in the line reading: "Lists of fonts for Windows, Mac OS 9, Mac OS X 10 and Unix, with the Unicode ranges they support, and where to obtain them." -- Jeff Q 11:22, 25 Jun 2004 (UTC)
On the basis of this, it appears that IE is rendering Greek but not Greek Extended. According to the Alan Wood pages you referred to, Arial Unicode MS should render both Greek and Greek Extended correctly. Does the Wikipedia CSS force IE to another font that lacks Greek Extended? Also, I notice that Wikipedia pages have charset=iso-8859-1 in the header, but I presume this doesn't matter, as I am coding my characters as &#x0000; codes rather than inserting the characters directly.
I suppose this means we need a rule that says only use the characters supported by Arial???? m.e. 10:03, 27 Jun 2004 (UTC)
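The Greek versus Greek Extended distinction is just a matter of Unicode code-point ranges, which is easy to check mechanically. A small sketch (function names are mine, purely illustrative) that writes characters in the hexadecimal `&#x...;` form described above and reports which block each character falls in:

```python
def hex_ncr(text: str) -> str:
    """Encode non-ASCII characters as hexadecimal numeric character
    references (&#x1F38;); ASCII characters pass through unchanged."""
    return "".join(c if ord(c) < 128 else f"&#x{ord(c):X};" for c in text)

def greek_block(ch: str) -> str:
    """Rough block check: unaccented Greek letters live in the Greek
    block (U+0370–U+03FF), while polytonic accented letters live in
    Greek Extended (U+1F00–U+1FFF)."""
    cp = ord(ch)
    if 0x0370 <= cp <= 0x03FF:
        return "Greek"
    if 0x1F00 <= cp <= 0x1FFF:
        return "Greek Extended"
    return "other"
```

So a browser or font that covers only the basic Greek block will show the plain letters of Ἰησοῦς but draw boxes for the accented ones, which matches the IE behaviour reported above.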
Font rendering is an incredibly complex, multidimensional problem that is far from being adequately solved, especially for a global Web resource like Wikipedia. You can't really speak of what IE will render; you've got to specify what version it is, what platform you're running on, what fonts you have installed (by manufacturer name, not style), how your browser is configured to render certain types of fonts, what language it's set to, and so on. (I can see that you, m.e., know much of this already, but I state it here explicitly for other folks reading this.) Most of these settings are done very differently for different browsers and even between versions of the same browser. Frankly, I don't understand a good bit of it myself. Just when I think I've got everything configured properly for my Opera browser, something weird happens and I have to delve back into this confusion. From what little I've seen, MSIE is simpler to configure but more difficult to customize properly. One thing to keep in mind is that simply finding a font that renders your desired characters isn't sufficient, since you can't expect anybody to have done this for their browsers. Any Wikipedia page that displays nicely in your customized browser will be useless to the vast majority. I have no good answer for this annoyance. It seems to require a commitment to robust Unicode font inclusion in browser installations and preconfigurations AND cooperation between the mercilessly-competitive platform, browser, and font vendors that just doesn't exist yet. -- Jeff Q 14:47, 28 Jun 2004 (UTC)

I suppose someone should jump in and write a policy saying which characters one should and should not use? Where would it go? Who should write it? Would it go through some sort of acceptance test before it reaches 'production'? I'd think it would be somewhat contextual; in some (more specialised) contexts you might go for the 'real' characters and accept that they might not display for everyone.

Also, could we solve this by using the TeX option? Can we use the TeX display mode, normally used for mathematics, to display non-Latin characters? ... TeX mode doesn't seem to work for this, as it throws you straight into math mode, and it seems to recognise only a limited subset of TeX commands; is this true? m.e. 09:22, 29 Jun 2004 (UTC)

I think the policy should be: use any Unicode characters you think right for the article. Writing excellent encyclopedia articles is more important than worrying too much about browser and operating system capabilities. Browsers and operating systems will catch up (some are pretty good already). To cater for people who can't see some characters, the right thing to do is to present the same information in several forms. For example, many articles give pronunciation indications in both IPA and ASCII-IPA. Gdr 19:12, 2004 Jul 3 (UTC) — that's a point; I suppose we should work on the principle that Wikipedia will still be around in 10^n years, and we should write for then as well as for now. m.e. 09:53, 5 Jul 2004 (UTC)

Little conversion tool


I posted this on the article page; I thought I'd post it here as well, so that more people know about it.

While the whole Unicode debate is going on, you might find a little tool I wrote useful. Just go to my user page for the source and a link to a "runnable" version. All it does is convert all non-ASCII Unicode characters you type in it into the &#0000; format. I didn't know if there was something like this already out there, so I just spent 25 minutes writing my own. --Aramgutang 06:50, 8 Aug 2004 (UTC)
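For anyone curious, the conversion Aramgutang describes is nearly a one-liner in Python via the ascii codec's xmlcharrefreplace error handler (this is my own sketch, not Aramgutang's actual code, and it emits plain decimal references rather than zero-padded ones):

```python
def to_numeric_entities(text: str) -> str:
    """Replace each non-ASCII character with a decimal numeric
    character reference (e.g. &#945;); ASCII passes through."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

The resulting string is pure ASCII, so it can be pasted safely into an ISO-8859-1 edit box.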

Resin identification code


The Resin identification code Unicode symbols don't work (in Firefox, anyway). Does anyone here know how to fix them?

Duk 16:00, 10 Oct 2004 (UTC)

They work here in Firefox on Windows XP. It's all about whether you have appropriate fonts installed. DopefishJustin (・∀・) 17:16, Nov 11, 2004 (UTC)