User:Dohn joe/diacritics

This is meant to be a place to develop a comprehensive WP:RfC on the issue of diacritics usage on the English-language WP. The top sections will explore the issues, with a middle section for discussion, and eventually the language of a proposed RfC at the end.

Background

Why is this such a contentious issue?

What WP policies and guidelines are affected?

"Accuracy" - why is "Warsaw" accurate, but not "Lodz"

Reliable sources - historically, English-language sources were unable (technically and/or profitably) to use some/all foreign diacritics. Now, a much larger range of diacritics is available, cheaply. Usage has dramatically increased, but many (most?) publications still do not use some/all diacritics. Is it a choice? How does that affect our choice?

Do we draw a line? If so, where? (By language: before or after Vietnamese, before or after Icelandic, before or after Polish....) (By notability of subject: before or after Slobodan Milošević, before or after Zoran Žižić....)

Ultimately, this is about our readers. Does universal usage of diacritics help or hurt accessibility of the content we provide? What other factors are there?

Specific diacritics questions to consider

Readability

non-Latin characters (Icelandic, etc.)
ligatures (Danish, etc.)
strikethrough or other physically altered letters (Polish, etc.)
non-altered letters (accents, umlauts, etc.)
multiple diacritics per letter (Vietnamese, etc.)

Familiarity

of language (Spanish, French versus Maltese, Vietnamese)
of particular subject (well-known politician, athlete, place name)
exonym status
native English diacritics (dieresis, etc.)

Other factors

citizenship (of English-speaking country versus non)
usage by subject (in authorized books, websites, etc.)

Discussion

Diacritics vs. different spellings and non-Latin characters

I think it's necessary to be very clear as to the issue, which I think should be purely the use or non-use of diacritical marks. I believe that we shouldn't stray into different spellings of names (e.g. Brussels–Bruxelles–Brussel or Warsaw–Warszawa) regardless of whether one form has diacritics or not. Nor should we stray into non-Latin letters (e.g. eth ð or thorn þ in otherwise Latin alphabets, or different alphabets or writing systems).
For me the deciding factor is whether a reader who isn't familiar with diacritics or doesn't know how to produce them can easily work out a form without diacritics which can be used as a redirect. At one end of the scale, Rhône is easily converted to Rhone. At the other end of the scale, I do find Đổi Mới difficult to read in the font face and size that I use when looking at Wikipedia, and it wasn't initially obvious to me that Doi Moi was the "base form". Our concern should always be readers, not editors or experts in the subject. Peter coxhead (talk) 19:25, 13 June 2014 (UTC)
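
One rough, mechanical way to check the "base form" point above is to ask whether Unicode decomposition alone yields a plain form that could serve as a redirect. The sketch below is only an illustration: the helper name strip_marks is hypothetical, and using Python's standard unicodedata module as the test is an assumption made here, not anything Wikipedia does or that has been proposed in this thread. It decomposes each character (NFD) and drops the nonspacing marks.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop nonspacing combining marks (category Mn).

    Hypothetical helper for illustration only.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

for name in ("Rhône", "Đổi Mới", "Łódź", "Alþingi"):
    print(name, "->", strip_marks(name))
```

Under these assumptions, accented vowels reduce to their base letters, while Đ, Ł and þ pass through unchanged because Unicode treats them as letters in their own right rather than as a base letter plus a combining mark. That lines up, loosely, with the distinction drawn in this thread between diacritics and altered or non-Latin letters.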

I think somewhere in Icelandic territory it starts to stray away from diacritics and into a different language. Since there are relatively simple replacements for Icelandic letters, I think we should consider using those replacements (similar to the German ss/ß). Otherwise, I don't think we should be biased against one set of diacritics vs. others, since specialist sources in the English language still use these diacritics.--Obi-Wan Kenobi (talk) 00:47, 16 June 2014 (UTC)

I agree with you about Icelandic. On your other point, what specialist sources do has been discounted in other decisions. User:SMcCandlish has elsewhere argued that the specialist style fallacy doesn't apply here, but I think the general points in this essay are highly relevant. What general sources of an encyclopedic standard do is what matters here, not what specialist sources do. Peter coxhead (talk) 02:19, 16 June 2014 (UTC)

I think diacritics are different from other issues, because the use of diacritics, especially in older books, required special additional typesetting costs and additional editorial oversight. Hence the comparison I've made many times with using a black and white book to describe the color of Picasso's paintings: we should not use a low-MOS source to decide on the proper spelling with diacritics. It all comes down to cherry picking sources, ultimately; for every source you bring me that uses Vietnamese diacritics, I can find one which doesn't, and vice versa. For example, even with Icelandic, we take a source like the Guardian, and it doesn't eschew Icelandic diacritics or even Icelandic special letters: [1]. Every source has to make its own decision, ultimately, about diacritics, and I think just saying "Well, do what MOST other sources do" is not a good option here, since we are rather unique, and we certainly don't apply "do what most other sources do" to almost any other decision here; for example, the whole MOS is our invention. Other sources are not crowdsourced, and we can, at basically zero cost, get the diacritics right, so I'm rather confused that we would not want to.--Obi-Wan Kenobi (talk) 02:31, 16 June 2014 (UTC)

I taught computational linguistics and insisted that students get to grips with the IPA, so personally I entirely sympathize with the desire to "get things right". But for the same reason I'm aware how difficult monolingual English speakers, even highly intelligent graduate students, find it to grasp unfamiliar orthography. So it does concern me if Wikipedia adopts a more "difficult" style than say, National Geographic. Many of the subtleties of the MOS pass most readers by (en-dashes versus hyphens for example). Heavy use of diacritics won't. However, this is for the RfC. Here we need to work on the wording. Peter coxhead (talk) 02:58, 16 June 2014 (UTC)

I think we should get some evidence of readers who have posted on talk pages or at the help desk saying "I can't understand why this article about an obscure Turkish politician spells his name using Turkish accents". Is there actually evidence of readers being confused? We even use Chinese characters in some of our articles (not in article text, but they're there nonetheless).--Obi-Wan Kenobi (talk) 03:28, 16 June 2014 (UTC)

I concur that the Brussels/Bruxelles and Icelandic letters and Danish ligatures issues are distinct from the diacritics issue (and distinct from each other, which incidentally means that consensus on one does not necessarily preemptively indicate consensus on another). — SMcCandlish ☺ ☏ ¢ ≽^ʌⱷ҅_ᴥⱷ^ʌ≼ 04:05, 22 June 2014 (UTC)

Multiple diacritics per letter

Peter coxhead, you said: "I do find Đổi Mới difficult to read in the font face and size that I use when looking at Wikipedia, and it wasn't initially obvious to me that Doi Moi was the 'base form'." That seems to be an argument for you making some font adjustments (or for WP doing so at the CSS level, if you're using the default fonts in whatever browser and OS you have, and they're poor at rendering Vietnamese). Under Mac OS X in Firefox and in Chrome and Safari (and in both sans-serif, as you wrote, and serif, as quoted in {{tq}}), it's trivially easy to make out the "Doi Moi" in Đổi Mới; I only have to slow down parsing for the hundredth of a second or whatever it takes me to recognize that as probably Vietnamese, even less than the amount of time it takes me to mentally flag "Andrés" as probably Spanish, not French or Italian, since the more complex nature of the Vietnamese diacritics actually helps identify them. I would have to slow down more to actually sound out the Vietnamese name in my head, but that's only because I personally have only a passing understanding of Vietnamese diacritics, while I know Spanish and am reasonably familiar with both French and Italian at the introductory level. For the more than one million Vietnamese-speaking Americans, the exact opposite would be the case: they'd recognize and know how to pronounce Đổi Mới without even consciously thinking about it, but many might be quite confused about how to pronounce "Andrés", and would not be likely to immediately recognize it as Spanish vs. something else, unless they individually also happened to be familiar with Spanish.

There isn't anything intrinsically more difficult about "ổ" or "ớ" vs. "ó" or "ö"; they're just symbols. I'm unaware of any research anywhere suggesting that multi-stroke characters like "x" and "M" are notably more difficult for children (or adult illiterates) to learn than single-stroke ones like "l" and "O". Similarly, some Chinese characters are very simple while some have many more lines in them, but native and secondary students of Chinese don't seem to have their heads explode upon contact with more visually complex ones; once they know them, they know them, even if it takes more practice to write them correctly. More importantly we know that children learn to read and write English and Chinese at about the same rate, despite the greatly increased detail complexity of Chinese vs. English script. I.e., complexity of individual characters (including any diacritics they may have) is demonstrably a red herring. Besides, "ö" actually is a character with multiple diacritic marks, just like the two Vietnamese ones, so we'd have to ban the umlaut, too. — SMcCandlish ☺ ☏ ¢ ≽^ʌⱷ҅_ᴥⱷ^ʌ≼ 04:05, 22 June 2014 (UTC)

Familiarity of language

Every one of these points could have an entire projectpage devoted to debate about it (and why is this in userspace, BTW?), but I'll start with just one:

"Familiarity of language" is a red herring. No non-English language is familiar to all English speakers. The familiarity of a non-English language to native English speakers is almost entirely a factor of a) closeness of linguistic relation (Dutch and French are more familiar to English-speakers than Serbian and Tagalog, on average, in this sense), and b) geographical population statistics (Spanish is very, very familiar to English speakers in the American Southwest, but not very familiar at all to South Africans; various Asian languages are increasingly familiar to urban Americans and Britons, but not to their rural counterparts, etc.) No assumption can be made about whether any given Wikipedia reader "is familiar" (a term we can't really define) "enough" (a measurement for which we have no gauge) with any particular language.

The language that most often triggers this questionable argument is Vietnamese, because to British, Australian, etc., readers it seems weird and "foreign". Yet more than one million Americans are Vietnamese speakers, and it's the 6th most-spoken language in the US (spoken even more in some places most people would never think to guess, e.g. it is the third most common language after English and Spanish in Texas, and 4th (after the same plus French) in Arkansas and Louisiana). Most Vietnamese-Americans do actually bother with the diacritics, from what I can tell by the restaurant and other business signs around here (California, where it's language #6). The claim that "Vietnamese diacritics are just too unfamiliar", when argued by a Briton, is just as unsustainable as "Irish and Scottish Gaelic diacritics are just too unfamiliar" when posited by someone from North Dakota.

Secondarily, even if one wishes to rely only upon the linguistic closeness argument, we still have no objective way to determine how close "close enough" is. There is no place to draw a line. If you want to rule out everything further from English than the West Germanic languages (English's family for grammar and basic vocabulary) and Western Romance languages (English's family for extended vocabulary after importing much of Norman French during the Middle English period), you're left with a "line" that doesn't make any sense to draw, because the diacritics used in the other Germanic and Romance languages are mostly the same and serve mostly the same functions. So if you broaden it to all Germanic and Romance languages, same problem: by that point, the diacritics used in Slavic, Finno-Ugric and other languages that use the Latin script cannot really be distinguished clearly. And so on.

A similar analysis can be done for every point on this list. No matter what you come up with, it comes down to a choice between accepting diacritics, as seemingly almost everyone on WP does, or specifically excluding certain languages' (or language families') diacritics on bases that are going to be continually attacked as arbitrary, jingoistic, even racist, no matter what the underlying rationale for fencing them off was. The three most common and obvious targets of this "away with their diacritics" desire appear to be Turkish, Polish and Vietnamese. If I need to even mention why speakers of these languages are liable to see an American-dominated (and to a lesser extent British-oriented) English-language encyclopedia excluding their proper orthography, but permitting that of neighboring (sometimes oppressing, e.g. the French and German) cultures, as patently discriminatory and part of a long history of cultural aggression, then please avail yourselves of some of our fine historical articles, starting with Causes of World War I and Partitioning of the Ottoman Empire, History of Poland (1939–45), and French Indochina and Vietnam War. — SMcCandlish ☺ ☏ ¢ ≽^ʌⱷ҅_ᴥⱷ^ʌ≼ 08:00, 21 June 2014 (UTC)

@SMcCandlish: in what you wrote above, you're anticipating the RfC. The original point of this thread was to try to formulate a clear proposal which could be put forward to a community-wide decision. I'm pretty sure that if the proposal is carefully framed it will be accepted and then there will no longer be grounds for arguing that the use of diacritics in languages like Vietnamese is just a local consensus. Please help to frame such a proposal. Peter coxhead (talk) 09:17, 21 June 2014 (UTC)

Yes, I am anticipating the RfC. Most RfCs on complicated and emotional (but not terribly important) matters like much of what arises at WT:MOS would go much better, with less heat and less likelihood of rehash, if the issues and rebuttals raised by them were anticipated and hashed out better before the RfCs were launched. Thus my post. I have a strong position on the question the RfC would ask, and people rarely consider RfCs I write "neutral enough", so I'll be content for now illustrating some positions to avoid. Or, really, illustrating the sort of difficult-to-refute analysis that can be brought against these positions with very little effort. I'm mostly doing other off-WP stuff right now, and not intending to provide an analysis like this for every point, nor to write an RfC myself that avoids all of them. More work than I'm prepared to do right now, especially given the level of hate mail I get when efforts like that on my part are actually successful. People automatically take it very personally when some style quirk they love/hate doesn't go their way, but I can't absorb all of that animosity all the time. — SMcCandlish ☺ ☏ ¢ ≽^ʌⱷ҅_ᴥⱷ^ʌ≼ 03:00, 22 June 2014 (UTC)

I don't know whether SMcCandlish was getting ahead of themself or not, but I'd say it is a valid thing to raise at this stage in any case. I agree that "familiarity of language is a red herring". It would be very difficult to make practicable, abnormal (IMO) in terms of what is normally done in English and I don't think many people would support it. From that point of view, it is valid to make a case that it is a waste of time to bother much about it, and it should not be included as an option etc in any RFC.

Where I disagree is with the supposition that this only leaves us with the options of accepting diacritics or rejecting them for some languages arbitrarily. This is not the case, because various non-arbitrary means of discrimination can be envisaged, COMMONNAME being the most obvious example.

The supposed heartache of speakers of languages when they see how low they rank in terms of English adoption of their diacritics is irrelevant. Although I'm picturing their poor little faces right now, and make no mistake I do feel sorry for them, the same logic would see Wikipedia introducing new loan words from Nepalese and Catalan so that they can catch up with French and German and make the English language fairer. Desire for language reform should guide us to no extent. Formerip (talk) 10:43, 21 June 2014 (UTC)

COMMONNAME only applies to article titles, so it basically won't have any real effect on the debate, and it might not even end up regulating the titles themselves in such cases, because the COMMONNAME principle is balanced against many other ones. One of them that would obviously come up is whether it makes any sense to have the article at a different spelling than that used in the main article text, when consensus favors that version for reasons that are independent of WP:AT's concerns; the answer to this question is almost always going to be "obviously not".

This has nothing to do with "desire for language reform" at all, other than that "defenders of real English" keep popping up to tell us we "can't" use diacritics in English because "it's wrong" and "not English" and "anti-English" and whatever, never mind the undeniable fact that predominantly English-speaking places do in fact handle names and words from notable minority languages, including their diacritics, regularly and without incident. The "desire for language reform" is coming almost entirely from the anti-diacritics camp. — SMcCandlish ☺ ☏ ¢ ≽^ʌⱷ҅_ᴥⱷ^ʌ≼ 03:00, 22 June 2014 (UTC) PS: The verbal hostility and dismissiveness you exhibit in "The supposed heartache of... their poor little faces" gives the lie to your claim, "make no mistake I do feel sorry for them". It's like telling your children you really love them while snorting and rolling your eyes. I'm not making a case for major-minority language speakers or their position, anyway, only stating as a matter of obvious fact what the PR fallout will be of intentionally coming across as xenophobic jackasses for no real reason, especially if it's targeted at some particular languages. — SMcCandlish ☺ ☏ ¢ ≽^ʌⱷ҅_ᴥⱷ^ʌ≼ 04:46, 22 June 2014 (UTC)

Familiarity of diacritical mark

To get back to the point, it seems to me that point 1 (Familiarity of language) under "Familiarity", above, is not needed. Perhaps missing from the list is "Familiarity of diacritical mark". Formerip (talk) 11:07, 21 June 2014 (UTC)

Half of my point above is that every "excuse to be against diacritics", not just point 1 in a particular section of them, is going to be easy to shoot full of holes. Why would we want to add another bogus rationale of that sort, unless (as I've done with just that one case, to make an example of it) we actually want to anticipate and pre-emptively take apart whatever anti-diacritics arguments we think are likely to be raised? Yet, I get the feeling that some participants even in this drafting phase are not exactly big fans of diacritics, and personally consider some of them "too unfamiliar" or "too complicated" or "too" something else, and do in fact want to include anti-diacritics arguments as advocates of those arguments. — SMcCandlish ☺ ☏ ¢ ≽^ʌⱷ҅_ᴥⱷ^ʌ≼ 03:00, 22 June 2014 (UTC)

To your points above re: familiarity, where do you stand on letters like the following: Þ ð ß œ æ? My personal sense is that œ/æ are OK, since people have seen those at least in formal writing. The German ß is less used and often turned into ss, which works fine in English. The Icelandic letters I'm just not sure about; they aren't diacritics, they are different letters, of an extended Latin alphabet perhaps, but it's perhaps a step too far. That said, I don't really relish the thought of trolling through the wiki to eliminate such Icelandic letters, and I'm sure the people who edit articles about Iceland will care and fight it.--Obi-Wan Kenobi (talk) 03:22, 22 June 2014 (UTC)

I'm skeptical this is a good place for this discussion, but basically I don't think we should use Þ/ð/ß except in the same contexts we'd use Greek or Cyrillic or Japanese characters, while the œ/æ ligatures should only be used for names/terms in languages that require them to be fused, which English does not. I think MOS actually already addresses at least the œ/æ point and ß, but has skirted the Þ/ð issue, probably because neo-pagans, who usually aren't even native speakers of Icelandic and whatever, keep pitching a fit about wanting to use them. They sometimes should actually be used, even for Old English, but again only in the same sort of case where we'd give a name/title in the original Greek or whatever script if the topic in question used that script rather than a Germanic-Latin one with Þ/ð characters. They certainly shouldn't be used in article titles, any more than Delphi should be at the article title Δελφοί (which it might very well have been if Wikipedia had been written in the 1870s). The article Althing, about the Icelandic parliament, handles the matter okay, though I think it would be even better if the lead began "The Althing (Icelandic: Alþingi) is the national parliament..." instead of "The Alþingi (anglicised as Althing or Althingi) is the national parliament...", for at least three reasons only one of which is germane here.

This is distinct from the Brussels/Bruxelles issue, which is a question of very long-established exonyms (cf. Vienna, Munich, Warsaw, Greece, Albania, Navajo, Basque, etc.), and in turn distinct from Bombay/Mumbai which is a politically revisionist renaming issue (both anti-British and pro-Marathi, and not well accepted even by all Indians), in turn distinct from the superficially similar Beijing/Peking question (which is really about whether to use a romanization based on the official Standard Mandarin pronunciation instead of the earlier transliteration of the name in southern Chinese dialects), and in turn distinct from other cases where an anglicization some would prefer is actually wrong (i.e. based on demonstrable linguistic error, as with Hawaii vs. Hawaiʻi). — SMcCandlish ☺ ☏ ¢ ≽^ʌⱷ҅_ᴥⱷ^ʌ≼ 04:39, 22 June 2014 (UTC)