Wikipedia:Naming conventions (Unicode) (draft)

Right now this page just contains info on special characters that are likely to be interesting/useful for en. Hopefully it can be expanded into a policy on their use.

Unicode provides an international standard which has the goal of providing the means to encode the text of every document people want to store on computers. This includes all scripts in active use today, many scripts known only by scholars, and symbols which do not strictly represent scripts, like mathematical, linguistic and APL symbols.

ASCII

These are our bread and butter and have been dealt with exhaustively elsewhere. Plugwash 28 June 2005 20:37 (UTC)

Latin-1

Same as above pretty much Plugwash 28 June 2005 20:41 (UTC)

The degree sign and the masculine ordinal indicator are often conflated, and are sometimes misused in place of the raised "o" in approximations of the numero sign. The feminine ordinal indicator is also occasionally misused as a superscript "a". Such approximations pass only when the output medium is visual (screen, print), and only with certain fonts at certain sizes. In other contexts, the semantics of the text are changed when the wrong characters are used. Therefore, these characters should only ever be used for their intended purposes: use the degree sign to mean degrees, and the ordinal indicators as appropriate for the languages that need them.
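
As a rough illustration of why these look-alikes are not interchangeable, the following Python sketch looks up the code point, Unicode name and general category of each character mentioned above, using the standard unicodedata module. The choice of characters and the names LOOKALIKES and describe are made up for this example.

```python
import unicodedata

# Characters that can look alike in some fonts but carry different meanings.
LOOKALIKES = {
    "degree sign": "\u00b0",
    "masculine ordinal indicator": "\u00ba",
    "feminine ordinal indicator": "\u00aa",
    "numero sign": "\u2116",
}

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

for label, ch in LOOKALIKES.items():
    print(label, ch, *describe(ch))
```

Software that does not render glyphs (search, sorting, speech output) sees only these distinct code points, which is where a visually passable substitute stops working.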

Latin extended A

These are mostly letters with less common diacritics. One or two of them are already in article titles because of conversions from windows-1252 (most browsers interpret ISO-8859-1 as windows-1252, which is how those characters got into article names here). The extra diacritics are probably a good thing as long as redirects are in place from the diacritic-less names, and these characters are likely to be widely supported, so I think it's pretty safe to have them in article titles. Plugwash 28 June 2005 20:47 (UTC)

Example: Paul Erdős. Arbor 6 July 2005 12:59 (UTC)
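
For the redirects from diacritic-less names mentioned above, one possible way to derive the plain form is to decompose the title and drop the combining marks. This is only a sketch: strip_diacritics is a hypothetical helper, not an existing MediaWiki feature, and it only handles letters that Unicode decomposes into a base letter plus marks.

```python
import unicodedata

def strip_diacritics(title):
    """Drop nonspacing marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", title)
    # Category "Mn" covers the combining accents produced by decomposition.
    kept = [ch for ch in decomposed if unicodedata.category(ch) != "Mn"]
    return unicodedata.normalize("NFC", "".join(kept))

print(strip_diacritics("Paul Erd\u0151s"))
```

Letters such as Ł or ø have no decomposition and pass through unchanged, so a real redirect scheme would need an extra table for them.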

It is erroneous to refer to these as "less common" diacritics. Many of these are diacritics used for Eastern European languages, including Polish, Czech, Slovak, Croatian, Slovenian, Serbian in Latin script, Hungarian, Turkish, Romanian, Latvian, Lithuanian, etc., and since the introduction of MediaWiki 1.5 they are already extensively used in article titles. The fact that Latin-1 is limited to Western European languages is just an artifact of the Cold War. -- Curps 15:05, 2 August 2005 (UTC)
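
The windows-1252 route described at the top of this section can be seen by decoding the same bytes both ways. The sketch below uses Python's built-in cp1252 and latin-1 codecs; the sample bytes and the helper in_latin_extended_a are chosen for this example.

```python
import unicodedata

# Bytes in 0x80-0x9F are control codes in ISO-8859-1,
# but most of them are printable characters in windows-1252.
SAMPLE = bytes([0x8A, 0x9A, 0x8C, 0x9C, 0x8E, 0x9E, 0x9F])

def in_latin_extended_a(ch):
    """True if ch lies in the Latin Extended-A block (U+0100..U+017F)."""
    return 0x0100 <= ord(ch) <= 0x017F

as_cp1252 = SAMPLE.decode("cp1252")
as_latin1 = SAMPLE.decode("latin-1")

for c, l in zip(as_cp1252, as_latin1):
    print(f"{c} U+{ord(c):04X} {unicodedata.name(c)} (ISO-8859-1 reading: U+{ord(l):04X})")
```

Every one of these windows-1252 characters lands in Latin Extended-A, which is consistent with a few such letters having reached article titles before any deliberate decision was made.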

Latin extended B

Here, U+018F is uppercase schwa, used in Azerbaijani. Note the lowercase schwa is U+0259 in the IPA section; the character U+01DD is "Latin small letter turned e", whose uppercase is U+018E, which is used in pan-Nigerian alphabets.

Also, U+01A0 and U+01A1 are "o with horn", and U+01AF and U+01B0 are "u with horn", used in Vietnamese.

Also, U+01CD through U+01D4 are a, i, o, u with caron, used in Chinese pinyin for the third tone, and U+01D5 through U+01DC are u-umlaut with each of the four tone marks (macron, acute, caron, grave). Notice that Ě/ě (e with caron) is U+011A/U+011B and is in Latin extended A, since it is also a letter of the Czech alphabet.

Also, U+0218 through U+021B are "s with comma below" and "t with comma below", used in Romanian. Fonts are sometimes not available for these, so "s with cedilla" and "t with cedilla" from the Latin extended A section are sometimes used instead. See Special Romanian Unicode characters -- Curps 21:10, 24 August 2005 (UTC)

Out of interest, do you know if the Romanian WP has a policy on using comma below or cedilla? Plugwash 22:46, 24 August 2005 (UTC)
I'm pretty sure they use the cedilla forms, simply because my browser can't display the comma versions at Special Romanian Unicode characters, and when I go to http://ro.wikipedia.org/ I can see the letters. The comma versions only display in the above table because the whole table is set up to use special Unicode fonts rather than the default fonts. I edited the above table to show the default-font version alongside the Unicode font version. -- Curps 03:39, 25 August 2005 (UTC)
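
If both spellings end up in use, a redirect or search layer could map one set of Romanian letters onto the other. The following is a minimal sketch of such a mapping; the table and function names are hypothetical, and it should only be applied to text known to be Romanian, since Turkish uses s with cedilla as a letter in its own right.

```python
# (cedilla form in Latin Extended-A, comma-below form in Latin Extended-B)
PAIRS = [
    ("\u015e", "\u0218"),  # capital S
    ("\u015f", "\u0219"),  # small s
    ("\u0162", "\u021a"),  # capital T
    ("\u0163", "\u021b"),  # small t
]

CEDILLA_TO_COMMA = str.maketrans({ced: com for ced, com in PAIRS})
COMMA_TO_CEDILLA = str.maketrans({com: ced for ced, com in PAIRS})

def to_comma_below(text):
    """Replace the cedilla forms with the comma-below forms."""
    return text.translate(CEDILLA_TO_COMMA)

def to_cedilla(text):
    """Replace the comma-below forms with the cedilla forms."""
    return text.translate(COMMA_TO_CEDILLA)

print(to_comma_below("Timi\u015foara"), to_cedilla("Constan\u021ba"))
```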

Latin extended additional

The range U+1EA0 through U+1EF9 is used for Vietnamese. -- Curps 21:10, 24 August 2005 (UTC)

IPA extensions

Used for IPA in body text (with a special template to persuade IE to render them right), but probably not appropriate for article titles here. Plugwash 28 June 2005 20:48 (UTC)

Note however that schwa (U+0259) is also a letter in the Azerbaijani alphabet. Lowercase schwa is in the IPA section, but uppercase (U+018F) is in the Latin extended B section. -- Curps 15:05, 2 August 2005 (UTC)
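
The split of the schwa case pair across two blocks, and the separate pairing of "turned e", can be checked against the case mappings that Python ships with. This is just a sketch; case_partner is a helper invented for the example.

```python
import unicodedata

def case_partner(ch):
    """Return the other-case form of ch using Python's built-in case mapping."""
    return ch.lower() if ch.isupper() else ch.upper()

for ch in ("\u0259", "\u018f", "\u01dd", "\u018e"):
    partner = case_partner(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} <-> "
          f"U+{ord(partner):04X} {unicodedata.name(partner)}")
```

The two lowercase letters look nearly identical in most fonts, so a title typed with the wrong one would capitalize to a different uppercase letter.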

Spacing modifier letters

I am not formatting this correctly, but there is at least one interesting item in this code block:

  • MODIFIER LETTER TURNED COMMA (U+02BB): used in Hawaiian, where it is called ʻOkina
Unfortunately, this character does not seem to be available in either the default or Unicode fonts on Windows. -- Curps 21:38, 4 September 2005 (UTC)
Works fine on mine, although I couldn't tell you precisely which font is being so obliging. —Phil | Talk 14:14, 14 October 2005 (UTC)
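
One reason to prefer U+02BB over the quotation marks that resemble it is that Unicode classifies it as a letter rather than as punctuation. The sketch below compares the three candidates using the standard unicodedata module; is_letter is a helper made up for this example.

```python
import unicodedata

CANDIDATES = ["\u02bb", "\u2018", "\u0027"]

def is_letter(ch):
    """True if the Unicode general category of ch is one of the letter categories."""
    return unicodedata.category(ch).startswith("L")

for ch in CANDIDATES:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          unicodedata.category(ch), is_letter(ch))
```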

Combining diacritical marks

Notice how the marks show up now they have something to combine with… —Phil | Talk 10:49, 24 October 2005 (UTC)
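
To illustrate what "something to combine with" means, the sketch below attaches a combining mark to a base character. Using U+25CC DOTTED CIRCLE as a neutral base for display is a common convention, and is an assumption about how the table on this page might be built, not a description of it.

```python
import unicodedata

DOTTED_CIRCLE = "\u25cc"  # neutral base often used to display a lone mark

def show_mark(mark):
    """Return the mark attached to a dotted circle so that it has a base to sit on."""
    return DOTTED_CIRCLE + mark

# A base letter followed by a combining acute composes to one character under NFC.
composed = unicodedata.normalize("NFC", "e\u0301")

print(show_mark("\u0301"), composed, f"U+{ord(composed):04X}")
```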

Greek

Already used heavily for math type stuff but probably not too appropriate for article titles in English. Plugwash 28 June 2005 20:38 (UTC)

Well, here are some English articles that could use those letters in the title. Some of these articles claim to have the “wrong title due to technical limitations”: Pi, C omega, Omega constant, Chi-squared distribution, Gamma function, Cronbach's alpha, Beta particle, Beta distribution, and many, many more. Arbor 6 July 2005 19:07 (UTC)

There are also some star names, such as α And, which could serve as a redirect for Alpha Andromedae. But here we run into the "initial letter is capitalized" issue, which applies to Greek letters too. -- Curps 04:44, 25 August 2005 (UTC)
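
The capitalization issue can be sketched as follows. The function capitalize_initial is a simplified, assumed model of the "initial letter is capitalized" rule, not the actual MediaWiki code.

```python
def capitalize_initial(title):
    """Uppercase only the first character and leave the rest unchanged."""
    return title[:1].upper() + title[1:]

star = capitalize_initial("\u03b1 And")
print(star, f"U+{ord(star[0]):04X}")
```

Under this model the lowercase alpha becomes GREEK CAPITAL LETTER ALPHA (U+0391), which looks like a Latin "A" in most fonts but is a different character, so the title would read as "A And" without matching it.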

Greek extended

Cyrillic

Cyrillic supplement

Armenian

Hebrew

Note that the mathematical symbol for Aleph (e.g. Aleph-0) is a separate character, U+2135, in the Letterlike symbols section. -- Curps 21:10, 24 August 2005 (UTC)

See also #Alphabetical presentation forms for forms used by Yiddish. -- Curps 00:56, 11 September 2005 (UTC)
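
The difference between the Hebrew letter and the mathematical aleph noted above shows up in their Unicode properties. The sketch below uses the standard unicodedata module; the constant names and the summary helper are chosen for this example.

```python
import unicodedata

HEBREW_ALEF = "\u05d0"
ALEF_SYMBOL = "\u2135"

def summary(ch):
    """Return (name, bidirectional class, NFKC form) for one character."""
    return (unicodedata.name(ch),
            unicodedata.bidirectional(ch),
            unicodedata.normalize("NFKC", ch))

for ch in (HEBREW_ALEF, ALEF_SYMBOL):
    print(f"U+{ord(ch):04X}", *summary(ch))
```

The symbol is laid out left to right like other mathematical notation, while the letter is right to left. Compatibility normalization (NFKC) folds the symbol into the letter, so a title search that normalizes this way would treat the two as the same.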

Arabic

Actually I'm getting better coverage with the default font here. Phil | Talk 12:30, 24 October 2005 (UTC)

Arabic supplement

Not getting anything here with either option: is there a better alternative for displaying Arabic? —Phil | Talk 12:45, 24 October 2005 (UTC)

Syriac

Thaana

Devanagari

Bengali

Gujarati

Tamil

Telugu

Kannada

Malayalam

Thai

Georgian

General punctuation

Pages that already use such letters:

Candidate pages for use:

  • RIGHT SINGLE QUOTATION MARK (’): Mother's day, St. John's Cathedral, and thousands more that use a possessive. Also T'Pol and many more
  • LEFT SINGLE QUOTATION MARK (‘): 'Okina, Hawaii, Ayin and others, but all of those are more correctly spelt with MODIFIER LETTER TURNED COMMA (U+02BB in the "Spacing modifier letters" section)
  • EN DASH (–): Hasse-Minkowski theorem and hundreds more
  • LEFT and RIGHT DOUBLE QUOTATION MARK (“ and ”): Knights who say Ni and others that use a quotation in the title
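
If candidate pages like those listed above were moved to typographically correct titles, each would presumably need a redirect from the plain keyboard spelling. The following sketch derives that spelling; the mapping table and the name ascii_redirect_title are hypothetical.

```python
# Typographic characters from this section mapped to their keyboard stand-ins.
ASCII_FALLBACK = str.maketrans({
    "\u2019": "'",   # right single quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u02bb": "'",   # modifier letter turned comma
    "\u2013": "-",   # en dash
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def ascii_redirect_title(title):
    """Return the title with typographic punctuation replaced by ASCII."""
    return title.translate(ASCII_FALLBACK)

print(ascii_redirect_title("Hasse\u2013Minkowski theorem"))
```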

Superscripts and subscripts

Probably not relevant for titles, and in body text it's probably safer to use HTML subscript/superscript.

Besides requiring some heavy-duty fonts to be rendered correctly, the lesser-known of these characters (all but ² and ³, really) tend to appear too small to be legible, at least compared with <sub>. The policy in place for minor planets (whose designations often include subscripts) is 1) to use <sub> systematically in the article and in the wrongtitle template, 2) use non-subscripts in the title, and 3) put in place a redirect from the subscripted title. Urhixidur 14:51, 2005 August 2 (UTC)
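
The three-part policy for minor planets can be sketched as two conversions from a designation written with Unicode subscript digits: one to the plain title, one to the <sub> form used in the article. The function names are invented for this example, and the designation is only sample input.

```python
import re

# U+2080..U+2089 are the subscript digits zero to nine.
SUBSCRIPT_DIGITS = str.maketrans("\u2080\u2081\u2082\u2083\u2084"
                                 "\u2085\u2086\u2087\u2088\u2089",
                                 "0123456789")

def plain_title(designation):
    """Non-subscripted form, as used for the article title."""
    return designation.translate(SUBSCRIPT_DIGITS)

def html_form(designation):
    """Wrap each run of subscript digits in <sub>...</sub> for body text."""
    return re.sub("[\u2080-\u2089]+",
                  lambda m: "<sub>" + m.group().translate(SUBSCRIPT_DIGITS) + "</sub>",
                  designation)

subscripted = "1999 RQ\u2083\u2086"
print(plain_title(subscripted), html_form(subscripted))
```

The subscripted string itself would then be the title of the redirect.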

Currency symbols

The Euro (U+20AC) is heavily used within articles and should be OK to use in an article title (though it's hard to think of a plausible title that would use it... maybe if some book or movie title incorporates the symbol). -- Curps 21:08, 24 August 2005 (UTC)

How about €2 commemorative coins? ;) Nightstallion (?) 12:48, 28 July 2006 (UTC)
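
For a title like this one, the euro sign would typically appear in the page URL as its percent-encoded UTF-8 bytes. The sketch below uses urllib.parse from the Python standard library and assumes the usual wiki convention of writing spaces as underscores in URLs.

```python
from urllib.parse import quote, unquote

title = "\u20ac2 commemorative coins"

# Spaces become underscores, then non-ASCII characters are percent-encoded as UTF-8.
encoded = quote(title.replace(" ", "_"))
decoded = unquote(encoded).replace("_", " ")

print(encoded)
```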

Letterlike symbols

Miscellaneous symbols

  • MUSIC SHARP SIGN (U+266F), ♯. Needed for Sharp-P and Sharp-P-complete. Some textbooks use NUMBER SIGN (#) instead and refer to the complexity class as Number-P. I prefer that symbol myself, but that’s not the point here. Also needed for C sharp. Arbor 7 July 2005 07:18 (UTC)

Alphabetical presentation forms

Arabic presentation forms-A

Arabic presentation forms-B

Halfwidth and fullwidth forms
