Long-term todo items

  • JavaScript automation to make fixing typos faster and more fun (possibly using JWB or AWB)
    • First pass through all articles was completed on 23 September 2019, in under 1 year. Algorithmic improvements found more typos on the second pass, so it's taking longer.
  • Use Wikipedia categories like Category:Redirects from misspellings and Wiktionary categories like wikt:Category:English misspellings to highlight English misspellings that moss is currently ignoring.
  • Spell-check against only English words in the dictionary (will highlight all non-English words not inside {{lang}} and friends)
  • Spell-check non-English languages (requires {{lang}} or similar to know which language to spell-check against)

Checking specific languages


Just brainstorming... you may have already thought of this and/or it may not be feasible...
If/when language-recognition functionality is added (i.e. a string on WP which is not tagged as belonging to a foreign language will only count as having an entry on en.Wikt if that entry has an ==English== header), it might be useful to also check foreign-language text on en.WP for typos. For French text, one could probably replicate the current setup except with fr.Wikt being checked instead of en.Wikt, since fr.Wikt's coverage of French is as extensive as en.Wikt's coverage of English. Other major languages' Wiktionaries are much less extensive, e.g. de.Wikt has an order of magnitude fewer entries, so in the case of those languages it might make sense to check the corresponding Wikipedia instead. (I.e., if the program finds that the word magdeburgischen is used on en.WP and tagged as German, it checks if the word is used on de.WP, perhaps even requiring it to be used more than once, to help ensure the use on de.WP isn't a typo.) -sche (talk) 16:08, 24 January 2015 (UTC)

Hmm, that's an interesting idea. Certainly it would drive the number of false alarms down, but it would no doubt also increase the number of non-English misspellings we miss if we match against Wikipedias in addition to Wiktionaries. Part of the point is also to encourage creation of dictionary entries that are needed anyway, and it's nice to have humans providing a reliable ground truth for correctness determinations. It's also a bit difficult for me to tell if such a thing is actually working if I don't speak the language, but maybe some fluent speakers of other languages could help out if I ever get around to dealing with non-English stuff in detail. (It's going to be a lot of work just to get English spelling, grammar, punctuation, and style rules coded up completely.) -- Beland (talk) 09:11, 14 April 2018 (UTC)
@-sche: After a few more years of experience, I'd say the main barrier to doing this (other than the huge backlog of English typos) is that most non-English text in the English Wikipedia isn't tagged with {{lang}} or {{transl}} or similar, which is necessary to know which language to spell-check the text against. It's possible to guess the language, but we actually need {{lang}} for all non-English text anyway, so screen readers and search engines and whatnot function properly. In the long run, I will definitely want to spell-check all languages if volunteers are available. In the meantime, the system is prioritizing cleaning up two types of non-English text. The first is words that are not in any Wiktionary, which are the most likely to be actual misspellings and thus the most beneficial to give attention to first. These are found in the "Case notes" of most of the main page listings, and the most important ones are also found in the newly consolidated "Highest-frequency words missing from dictionary" section. The second is the new Wikipedia:Typo Team/moss/not English report, which focuses on articles with the longest passages of non-English text. These are most likely to be in need of translation, but may also represent some long non-English quotations that need tagging before they can be spell-checked against a specific language. -- Beland (talk) 07:15, 3 June 2022 (UTC)
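
The per-language lookup discussed in this thread could be sketched roughly as follows. This is only an illustration in Python, not moss's actual code: the function name, the in-memory `dictionaries` and `corpus_counts` structures, and the `min_corpus_uses` threshold are all assumptions made for the example.

```python
from collections import Counter


def is_probably_valid(word, lang, dictionaries, corpus_counts, min_corpus_uses=2):
    """Return True if a word tagged with language `lang` looks correctly spelled.

    dictionaries:  hypothetical mapping of language code -> set of dictionary words
                   (standing in for that language's Wiktionary).
    corpus_counts: hypothetical mapping of language code -> Counter of word
                   frequencies (standing in for that language's Wikipedia).
    """
    key = word.lower()
    # First choice: a human-curated dictionary entry.
    if key in dictionaries.get(lang, set()):
        return True
    # Fallback: accept the word only if the corpus uses it more than once,
    # so that a single typo in the corpus does not validate the word.
    return corpus_counts.get(lang, Counter())[key] >= min_corpus_uses


dictionaries = {"fr": {"bonjour"}}
corpus_counts = {"de": Counter({"magdeburgischen": 3, "magdeburgishen": 1})}
print(is_probably_valid("magdeburgischen", "de", dictionaries, corpus_counts))
```

As noted in the replies, the corpus fallback trades fewer false alarms for more missed misspellings, and untagged text never reaches this check at all.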

Exclude terms Wiktionary only records as misspellings


Wiktionary gives some misspelling entries (originally only very common ones, but the threshold has been steadily whittled down by various users), where they are labelled as such and readers are directed to the usual spellings. Example: wikt:tassled. I don't know if those "misspelling entries" are already excluded from counting as Wiktionary entries for the purposes of this project; if not, that's something to add to the todo list alongside the goal of excluding non-English Wiktionary entries. Or, actually, perhaps it would be simpler to search a dump of en.WP for all 1756 terms in wikt:Category:English misspellings, and make a separate list of articles that contain them, since they're known misspellings. -sche (talk) 09:06, 26 January 2015 (UTC)

It's not urgent, but it might be useful to periodically check WP for entries which Wiktionary has but categorizes as either misspellings or obsolete spellings/forms (e.g. wikt:laterly, wikt:kinge). -sche (talk) 06:04, 1 March 2019 (UTC)
Also, the ~60 words categorized as wikt:Category:Non-native speakers' English (although in some cases other senses of some of those words are valid English; those could be pruned off when checking for words on en.WP). -sche (talk) 08:19, 11 March 2019 (UTC)
@-sche: This is a good idea which has been in the back of my mind since you made the suggestion. I think some more recent category-processing improvements in service of the experimental grammar checker will make implementing it a lot easier. I'll still have to find a solid chunk of time to work on implementation, but...yes! Especially now as we have cleaned up decades of English typos, it's getting harder to find actual misspellings, and these are an easy target. -- Beland (talk) 07:29, 3 June 2022 (UTC)
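
Searching article text for terms from a known-misspellings category could look something like the sketch below. It is a rough Python illustration under stated assumptions: the function name is hypothetical, the `known` mapping stands in for a list exported from wikt:Category:English misspellings, and the suggested corrections in the sample mapping are placeholders, not taken from Wiktionary.

```python
import re

# Words made of ASCII letters, optionally with internal apostrophes.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def find_known_misspellings(text, known):
    """Return (word_as_written, suggestion) pairs for known misspellings in text.

    known: hypothetical mapping of lower-cased misspelling -> suggested spelling.
    """
    hits = []
    for match in WORD.finditer(text):
        word = match.group(0)
        suggestion = known.get(word.lower())
        if suggestion is not None:
            hits.append((word, suggestion))
    return hits


# Placeholder data for illustration only.
known = {"tassled": "tasseled", "laterly": "latterly"}
print(find_known_misspellings("A tassled cap, worn laterly.", known))
```

Because words with other valid senses can also appear in such categories, hits from a scan like this would still need human review.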

Limit search to article namespace


In the "Likely misspellings by frequency" (etc) sections, can you construct the "find all" links to use a search URL that only looks at namespace 0? I saw a lot of Wikipedia: pages in the list for one of the entries, and I don't think that namespace is a priority for spellcheck. David Brooks (talk) 02:45, 23 September 2018 (UTC)

Whoops, looks like I did that on one code branch and not the other. Fixed; it should show up in the Oct 1 run. -- Beland (talk) 22:39, 29 September 2018 (UTC)
I am also finding errors in templates. When they are transcluded on many pages, the search finds the pages, but if you try to edit, the error is not there! But for some kinds of templates, e.g. DYK nominations, spelling errors are of little importance and should not be fixed. Graeme Bartlett (talk) 03:09, 30 September 2018 (UTC)
@Graeme Bartlett: Did you have an example typo with this problem? I can try to diagnose and find a workaround. -- Beland (talk) 20:53, 1 October 2018 (UTC)
Well I fixed Template:Roman Catholic Diocese of Buffalo, and Template:Reach plc as part of Typo Team/moss. Now that they are fixed you won't see the error in the search. Do you want me to leave the next error that I find so that you can investigate? (That is better than deliberately putting in an error). Graeme Bartlett (talk) 21:40, 1 October 2018 (UTC)
Hmm, yes, that would be helpful. I found a relatively rare correctly-spelled phrase to test with, but I couldn't reproduce this problem. I'm wondering if the namespaces being searched are being controlled by the selections you make, which are stored in your browser's cookies? Do you see a list of namespaces under the search text bar? -- Beland (talk) 14:17, 2 October 2018 (UTC)
I am just using the default. The errors are visible on article pages, so searching articles should show the errors. But when you edit you don't see the text. These kinds of errors do need to be fixed, so I am not complaining. At least these errors are only nested one deep, unlike some other problems I have seen with speedy delete nominations, or unclosed formatting inside templates. So there is no need to search template space. Graeme Bartlett (talk) 23:40, 2 October 2018 (UTC)
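
For reference, a "find all" link restricted to the article namespace might be built along these lines. This is a sketch, not the code moss uses: the function name is made up, and the `profile=advanced` and `ns0=1` query parameters reflect my understanding of how Special:Search URLs select namespace 0, so treat them as assumptions to verify.

```python
from urllib.parse import urlencode


def article_search_url(phrase):
    """Build a search URL for an exact phrase, limited to namespace 0 (articles)."""
    params = {
        "title": "Special:Search",
        "profile": "advanced",     # assumed: lets the URL pick namespaces explicitly
        "fulltext": "1",
        "ns0": "1",                # assumed: namespace 0 only
        "search": f'"{phrase}"',   # quotes request an exact-phrase match
    }
    return "https://en.wikipedia.org/w/index.php?" + urlencode(params)


print(article_search_url("Brittanica"))
```

Setting the namespace in the link itself avoids depending on whatever namespace selection a reader's browser has remembered.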

"Brittanica" -> "Britannica"


I'm wondering why I don't see the above goof in lists here. At more than 200 occurrences it ought to show up somewhere, or eventually. Is it not now in the lists because it would be a 'T2' thing? Shenme (talk) 06:39, 27 May 2019 (UTC)

It would be excluded because of the capital "B". We can still fix it though. AWB is a good tool to do this. Graeme Bartlett (talk) 22:39, 1 June 2019 (UTC)

"A*" mis-spelling


I've been away from MOSS for a while, so apologies if I'm not up to date with current conventions. But I just came across the word Archibishop, which occurs in about 100 pages (I accidentally put it in a search). It doesn't appear in the a-m misspellings by frequency; can I expect it to appear in the next dump? I'm just leaving on a trip, unfortunately, or I'd attack it myself. David Brooks (talk) 17:01, 1 June 2019 (UTC)

Graeme Bartlett partly answered my question immediately above. The majority of uses have a capital A, but there are still about 25 archibishop appearances. Should that be enough to get it on the lists? (yes, I'm on the trip, sitting on Heathrow WiFi after a red-eye :-) ) David Brooks (talk) 06:51, 2 June 2019 (UTC)
The spelling "archibishop" is used in some Wikipedia article titles, so that would exclude it from the typo lists. They are all redirects, for example Catholic Archibishop. It's unclear to me if this is a common misspelling or an alternate spelling. If it's a misspelling, we'd want to add {{R from misspelling}} to all the offending redirects. I could either add code to exclude redirects so tagged, or I could blacklist the bad spelling. -- Beland (talk) 17:57, 11 June 2019 (UTC)
Marking {{R from misspelling}} redirects as not (necessarily) valid sounds like a good general rule to me. I'll let you decide on the priority order when the word also appears in a "regular" title. David Brooks (talk) 20:25, 17 June 2019 (UTC) ETA: Sorry, that wasn't meant to be sardonic, just not properly thought through. Of course plenty of valid words can appear in mis-spellings of multi-word titles. David Brooks (talk) 07:24, 18 June 2019 (UTC)
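
One way the "exclude redirects so tagged" option could work is sketched below in Python. It is an assumption-laden illustration, not the project's code: the function name and the `pages` mapping (title to wikitext) are hypothetical. Words are whitelisted only when they appear in at least one title that is not tagged {{R from misspelling}}, which also covers the case of valid words inside a misspelled multi-word title.

```python
import re

# Matches the start of an {{R from misspelling}} template call.
MISSPELLING_TAG = re.compile(r"\{\{\s*R from misspelling\b", re.IGNORECASE)
# Runs of letters (Unicode-aware), used to split titles into words.
TITLE_WORD = re.compile(r"[^\W\d_]+")


def whitelisted_title_words(pages):
    """Collect lower-cased words from titles that are not tagged as misspellings.

    pages: hypothetical mapping of page title -> page wikitext.
    """
    words = set()
    for title, wikitext in pages.items():
        if MISSPELLING_TAG.search(wikitext):
            continue  # a tagged redirect does not vouch for its own spelling
        words.update(w.lower() for w in TITLE_WORD.findall(title))
    return words


pages = {
    "Catholic Archibishop": "#REDIRECT [[Catholic archbishop]] {{R from misspelling}}",
    "Catholic Church": "The Catholic Church is ...",
}
print(sorted(whitelisted_title_words(pages)))
```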

Congratulations!


With the completion of the "V" subpage in the main listing today, after jumping around a bit, we've completed a first pass through an entire alphabet of articles with misspellings, in under a year. Around fifty thousand thanks are due to the volunteer editors who have been helping out - that's how many spelling and punctuation errors you've fixed! The typo counts seem to indicate we have at least that many left to fix, but they are getting a bit harder to find automatically because they are in articles which also have suspect words which are less likely to be incorrect. To keep us on this roll, I've cooked up some code that unlocks the next layer of typos. As we circle back around to the beginning of the alphabet, you'll see the main listings for A-I will have a lot of missing-spaces-after-period errors (TS+DOT) which we haven't fixed because we only started doing that on the first pass when we got to J. You'll also see more listings where moss has found likely misspellings alongside suspect words it thinks are probably OK. These should be almost entirely in articles we haven't touched yet; my goal is to have to look at each article only once (at least until new errors sneak in). I've also been experimenting with detecting other types of Manual of Style errors beyond spelling issues. I'll be posting some updated advice and a fresh batch of typos shortly. Your feedback is always welcome, and thanks again for making this tremendous accomplishment possible! -- Beland (talk) 04:14, 23 September 2019 (UTC)

Citation errors


Hey, not sure if this is suitable for the project, but there are a few hundred instances of incorrect citation last names ([1] [2]) due to a bug in the Zotero backend that WP:VE and WP:ProveIt use. @Beland: Could you check to see if this is suitable, and add it to the main page if it is? Thanks. Darylgolden(talk) Ping when replying 00:58, 15 April 2020 (UTC)

@Darylgolden: moss code ignores templates, so it would be a bit of a pain to check for these problems and include them in the main listings. But the links you provided work quite well; you can see in some cases I just post links like those to the main project page. Is there a pattern to how to fix them? It looks like we'd need to manually inspect the source to find the author's last name? -- Beland (talk) 08:08, 21 April 2020 (UTC)
@Beland: Unless someone comes up with a better way, it appears that manually inspecting the website would be the best way. Since the project already has listings on tasks that aren't strictly typos (templates and HTML tags), and I can't find anywhere else on Wikipedia to post this kind of task, it would be helpful to include on the main page if you think that it is within scope. Darylgolden(talk) Ping when replying 08:19, 21 April 2020 (UTC)

Example: product(s)


wikt:product(s) for example - I don't think it should be a typo. Is this a Manual of Style thing? How should I handle it? Sct72 (talk) 21:30, 17 October 2020 (UTC)

@Sct72: You can leave those in the case notes for me. I exclude most of those from the listings but I need to make some algorithmic improvements to keep them all out. -- Beland (talk) 00:44, 23 January 2021 (UTC)
Ok cool, thank you! Sct72 (talk) 01:00, 24 January 2021 (UTC)

Bird calls, special ISO codes, empty uses of not a typo, and D&D


Look I know this is many things but this is what I'm like.

While cruising around learning about ISO codes (which I hate), I found that there's a few codes for basically all the stuff that doesn't fit in a code. (quotes are from ISO 639)

  • mis -> for uncoded languages; for example, many Australian languages don't have an ISO code. I'd still use the macrolanguage one, aus, but arguably this isn't appropriate, so I can understand someone choosing instead to use mis.
  • mul -> when a segment has "multilingual content (includes at least two languages in separatable parts)." This isn't super useful for us, it'd typically get used to indicate, for example, that a website as a whole contains multiple languages.
  • und -> "content includes zero, one or many languages, in arbitrary combination". Probably not the most useful for us, unless there's a section of text we simply cannot figure out. If we did that, there would probably need to be a tracking category for it, and it'd probably need to be discussed on the lang talk page.
  • zxx -> "No linguistic information at all". It can also mean not applicable. This has a bunch of uses, idk how many are covered already by other templates though. So I think this one can possibly be used for bird calls. I can't for the life of me remember where I saw it, but at least one non-wiki place recommended the code as an option for how to indicate animal calls.

Another option for bird calls may be Template:Respell, as it's a standardised English language way to represent pronunciation, which is the same as bird calls. It would require learning the rules of how to do it, but hey editing wikipedia has a bunch of learning curves anyway. The only downside I can think of is if there's actually a standardised way that bird calls are written, in which case it just kinda needs its own template.

Then, Template:Not a typo. Something I've noticed (and have done myself at least once when on the visual editor) is that the template has been placed next to the relevant word and doesn't have anything in it. Is there any way to track empty uses of not a typo? (re: bird calls, I'm not super happy with using not a typo to indicate those as it may be confusing to people using screen readers. That's an issue I have generally with the template tbh. But not the current problem.)

I've noticed a fair few articles that are coming up because they mention things from the Dungeons and Dragons games. I separated these out from other entries on the pages I noticed them on, because I think these will keep being a problem. All the time. I explained on Wikipedia:Typo_Team/moss/S#Dungeons_&_Dragons, "Proposal: remove detailed D&D articles and lists from being checked. Reasoning: every single named object in d&d has a ridiculous name, and everyone that plays it is a freak that'd immediately correct typos anyway." Having thought about it since then, there is also the option of using one of those lists as a spellcheck dictionary - a good option for that would be List of Advanced Dungeons & Dragons 2nd edition monsters. Most of the non-English words that people won't capitalise would be under monsters, and it's the most comprehensive/best written of the monsters lists. It's also an old version, so it shouldn't change too much. This method may miss some of the D&D words, but it would probably take less time than finding all the D&D articles and would mean the typos on all those pages won't get ignored automatically.

Also if it's not an issue, I may try to reorder the instructions for editors across the project page and the specific letter pages, bc at the moment trying to find the right instructions is very hard for my atrocious brain (it's not clearly organised by type, I'm guessing it's more done by when the solution was figured out). --Xurizuri (talk) 07:35, 31 January 2021 (UTC)

Oh I forgot to say, one of the special codes (I think it was zxx) can be used for fragments, like suffixes or whatever. --Xurizuri (talk) 09:16, 2 February 2021 (UTC)
@Xurizuri:
  • For D&D terminology, my recommendation would be to make a redirect pointing at whichever article gives the best explanation, even if that's just a list. That will let the spell checker know these are correctly spelled words, and will also help readers who are using search engines find the best article.
  • You can get a live list of empty uses of {{not a typo}} with a search for insource:/\{\{not a typo\}\}/. This does look like it is unfortunately proliferating; I'll add a link to this search as a thing to check up on.
  • Sorry about the disorganized instructions; they have indeed just accreted over time. Is there any particular top-level organizational scheme that might make more sense? It's difficult for me to get a sense of it because I'm too close to the content.
  • Having an ISO code for fragments would actually solve a lot of annoying problems, that's great! I'll have to document that.
  • Is there like an international notation standard for birdcalls or something? That would be handy to track down.
-- Beland (talk) 19:00, 6 February 2021 (UTC)
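
The same check as the insource search above can be run offline against wikitext, for example from a dump. The Python sketch below is an illustration only: the function name is hypothetical, and unlike the search pattern quoted above it also tolerates stray whitespace and different capitalization, which is an assumption about what should count as "empty".

```python
import re

# An empty use: the template name with no parameters before the closing braces.
EMPTY_NOT_A_TYPO = re.compile(r"\{\{\s*not a typo\s*\}\}", re.IGNORECASE)


def count_empty_not_a_typo(wikitext):
    """Count uses of {{not a typo}} that wrap no text at all."""
    return len(EMPTY_NOT_A_TYPO.findall(wikitext))


sample = "x {{not a typo}} y {{not a typo|teh}} z {{Not a typo }}"
print(count_empty_not_a_typo(sample))
```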

TS+EXTRA decimal fractions and file extensions; contractions


(move from main listings section on project page)

    • Beland are we still doing TS+EXTRA+? And can I remove the ones that are correct as is, which fall under these categories, from case notes? --Xurizuri (talk) 02:27, 5 February 2021 (UTC)
      • @Xurizuri: Yes, TS+EXTRA+ are included on reports, assuming there are any for that letter. (There were for J.) If you see some that are correct as written in the article, that might indicate a bug in the code which may or may not have already been fixed. Were there any in particular you were concerned about? I can take a look and come up with a diagnosis. -- Beland (talk) 08:17, 5 February 2021 (UTC)
        • @Beland: I can't always tell the categories apart tbh. But I've been through all the letters multiple times looking for specific types of entries, so there's a few types of false positives I've noticed. There's about 1000000 instances of it picking up bullet calibres, sports scores, and software suffixes (e.g. .jpg): across A, C and D you can see huge lists of those. The case notes also have some issues with "it's"/"she's"/etc. picking up the possessive of longer words (e.g. Kuwait's), and spellings using brackets to indicate singular+plural, which is explicitly okay in MOS. I can't remember there being any huge lists of those, but they were kind of sprinkled throughout. --Xurizuri (talk) 09:36, 5 February 2021 (UTC)
        • @Xurizuri: Ah, yes, so for calibers and sports scores and anything else that looks like a decimal fraction with no leading zero, as of the 2020-05-01 dump those have been reclassified from TS to Z, and I'm not including any Zs in reports until I figure out how to better separate the good from the bad. The upshot is, you can remove any correct-as-is decimal fractions from the case note listings. The report subsections are a bit weird in that all the typos for a given article appear on one line, so this results in some being "promoted" to a potentially incongruous section. (Like if an article has a mix of T1s and T2s, all of those will show up under the T1+ section; that's what the plus is hinting at.) For file extensions, it looks like there was a bug I fixed for those as of the 2020-04-01 dump or thereabouts. If there's an article or redirect for them, they are fine to remove from the listings; otherwise I'd add a redirect. (For example, .jpg is a blue link so those can just be removed, but .uci doesn't exist and should probably be made into a redirect.) Around the same time I also disabled reporting of BWs, because of that issue with contractions you mentioned. If those are OK as-is, they can just be dropped from the listings. (I need to improve the code to handle those more intelligently.) Thanks for your attention to the case notes, by the way. Volunteers have been making such fast progress fixing typos and finding non-typo cases that I haven't been able to keep up with the output, especially for the dumps where the bugs in my code resulted in a very large number. If you happen to see any other patterns, it would be good to know so I can tweak the code as needed. Feel free to ping me. -- Beland (talk) 18:36, 6 February 2021 (UTC)
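
The triage described in this thread for tokens that start with a dot could be sketched as below. This is a hypothetical Python illustration, not moss's classifier: the function name, the returned labels, the `existing_titles` set, and the rule that an extension is a letter followed by one to four letters or digits are all assumptions for the example.

```python
import re

# ".45", ".308": a decimal fraction written without a leading zero.
DECIMAL_NO_LEADING_ZERO = re.compile(r"\.\d+")
# ".jpg", ".uci": a dot followed by a short alphanumeric suffix (assumed shape).
FILE_EXTENSION = re.compile(r"\.[A-Za-z][A-Za-z0-9]{1,4}")


def classify_dot_token(token, existing_titles):
    """Return a rough handling label for a token flagged by the spell checker.

    existing_titles: hypothetical set of lower-cased article/redirect titles.
    """
    if DECIMAL_NO_LEADING_ZERO.fullmatch(token):
        return "decimal"            # e.g. calibres and sports scores
    if FILE_EXTENSION.fullmatch(token):
        if token.lower() in existing_titles:
            return "known extension"  # title exists, safe to drop from listings
        return "needs redirect"       # no title yet, a redirect may be warranted
    return "other"


titles = {".jpg"}
for tok in [".45", ".jpg", ".uci", "Kuwait's"]:
    print(tok, "->", classify_dot_token(tok, titles))
```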

Chemical formulas


I am going through the new section labelled Wikipedia:Typo Team/moss#Chemical formulas. There are very few unformatted chemical formulae in that list. Mostly they are not chemical formulae at all. I would suggest that these elements be removed from the match: Es|Fm|Md|No|Lr|Rf|Db|Sg|Bh|Hs|Mt|Ds|Rg|Cn|Nh|Fl|Mc|Lv|Ts|Og|R as their compounds are hardly known. (And R is not an element). Also if it is just one letter followed by digits, it is very unlikely to be a chemical formula. So I suggest that these are not included. R with digits is likely to be Rand for example. These also include postcodes, page numbers, bus routes, mutation identifiers. More complex ones starting with H can be epigenetic modifications. We do have students over the years writing articles on all of these. Where it is a meaningful chemical formula, I am redirecting. But it would be good if we could get the list of articles with unformatted chemical formula (where it is really one and not a false positive!). Graeme Bartlett (talk) 21:59, 25 March 2021 (UTC)

@Graeme Bartlett: Hmm, it does sound like I need to do a better job distinguishing between anything that could be a chemical formula vs. only the things that actually are a chemical formula. I included "R" because it is used in chemical formulas to represent a substituent. Since it would be helpful to report article names, I think I'll have to script this properly rather than simply grepping what was ignored. (Chemical formulas start with a capital letter, so they are normally assumed to be proper nouns and ignored.) That will also allow me to make use of a lot more context, though I will also try your suggestions. It will probably take me a few days to fine-tune; in the meantime, I will post some articles with a huge number of probably-misformatted chemical formulas. -- Beland (talk) 03:03, 1 April 2021 (UTC)
Thanks, I would like to fix up misformatted formulas. But so far it has mostly only been reference titles that are misformatted. There are also chemical formula fragments that go up to a "(". The formula includes the brackets part too, so the fragment does not make a useful redirect. Graeme Bartlett (talk) 03:37, 1 April 2021 (UTC)
  • Thanks Beland for working on Wikipedia:Typo Team/moss#Known chemical formulas that don't use subscripts. I guess you are onto the issues already - pattern strings showing up in image names, urls, or InChI strings. But it has also highlighted completely non-standard use of <chem> or <ce> html markup. I can fix these, but where <math> is used, it will be for more complex cases (not easily fixed). I have fixed some things in references. But I reckon there will be a lot of unformatted CO2 and H2O around. Graeme Bartlett (talk) 08:20, 29 October 2021 (UTC)
    • @Beland: I have now been through Wikipedia:Typo Team/moss#Known chemical formulas that don't use subscripts and added subscripts for the chemicals. There are still a lot of other things that aren't chemical formulas though. It would be good to get a new list, particularly for CO2, and secondarily for H2O. Some I enclosed with proper name template -- will that keep them off this list? Graeme Bartlett (talk) 10:25, 4 December 2021 (UTC)
      • @Graeme Bartlett: Oh, excellent! I had limited the listings to 25 articles per entry because I thought that cleanup would go relatively slowly. I just posted all the remaining articles for the above-25 entries, so you can tackle as much or as little of that as you care to. The next snapshot will be taken on Dec 20, so I'll try and remember to update this again after that's processed, and feel free to ping me if I forget. I'll take a look at the leftovers now to see what should be done about them. And yes, adding pretty much any template should suppress instances from being reported on this list. -- Beland (talk) 23:58, 4 December 2021 (UTC)
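
The filtering rules suggested in this thread (skip the rarely-compounded heavy elements and "R", and skip a single letter followed only by digits) could be combined roughly as in the Python sketch below. It is not the project's detector: the function name is hypothetical, the element list simply stops at californium so that Es onward is excluded, and requiring at least one digit is an added assumption, since a formula without digits has nothing to subscript.

```python
import re

# Element symbols up to Cf; Es and heavier are left out per the suggestion above.
ELEMENTS = set(
    "H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni "
    "Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I "
    "Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt "
    "Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf".split()
)

FORMULA_SHAPE = re.compile(r"(?:[A-Z][a-z]?\d*)+")   # e.g. H2O, CO2, Fe2O3
SYMBOL = re.compile(r"[A-Z][a-z]?")                  # one element-like symbol
ONE_LETTER_AND_DIGITS = re.compile(r"[A-Z]\d+")      # e.g. R350, K2, B12


def looks_like_unformatted_formula(token):
    """Heuristically decide whether a plain-text token is a formula lacking subscripts."""
    if not FORMULA_SHAPE.fullmatch(token):
        return False
    if not any(ch.isdigit() for ch in token):
        return False  # assumption: no digits means nothing to subscript
    if ONE_LETTER_AND_DIGITS.fullmatch(token):
        return False  # postcodes, bus routes, page numbers and the like
    return all(symbol in ELEMENTS for symbol in SYMBOL.findall(token))


for tok in ["CO2", "H2O", "R350", "K2", "Fm2O3", "NaCl"]:
    print(tok, looks_like_unformatted_formula(tok))
```

A heuristic like this still cannot tell carbon dioxide from an unrelated "CO2" identifier, so flagged tokens would need human review.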

Large group treatment


There has been some discussion on the use of {{notatypo}} and methods to remove some spellings from future pickup. Would be interested in hearing opinion on treating exceptions in larger groups such as 'CO2'? Tag them with 'notatypo' (and obfuscate) or let them ride (on the merry-go-round)? Exceptions can include band name, external titles and such, and the numbers will be increasing. Perhaps use of {{text}} or {{proper name}} when it's other than in a title or similar? Neils51 (talk) 01:52, 3 February 2022 (UTC)

Well "CO2" is a mistake, even if it is not a typo, but due to ineptness or negligence. I asked for more CO2 to appear in the list, but then got distracted and have not corrected many more. CO2 is almost always corrected to CO₂, but not always, as sometimes it is not carbon dioxide. Using AWB it is possible to do some mass editing as you suggest, but checking by a person is required. Graeme Bartlett (talk) 10:56, 3 February 2022 (UTC)
Thanks for responding Graeme Bartlett, however, if a ref/cite title uses 'CO2', albeit a mistake, I am not going to correct it. Thought that I might seek agreement as to how to treat these so they don't appear in future lists, though I'm happy to do my own thing. Can get through these quite quickly with AWB. Neils51 (talk) 13:18, 3 February 2022 (UTC)
I suppose you should just "let them ride". We are volunteers and do not have to do something. I have noticed that quite a few typos that appear on the list should have been fixed the previous time around, so we are not actually correcting everything, but as long as there is improvement, it's going in the right direction. Graeme Bartlett (talk) 21:16, 3 February 2022 (UTC)

Do we still have a backlog?


The quick link takes you to X, where it seems like most of the typos that were actually typos have already been covered. But maybe I'm missing a backlog somewhere else. If that's the case, I'm willing to help out with it. It's been a while since I've helped out here and I miss it, I think it's a really cool concept. Clovermoss (talk) 16:54, 17 May 2022 (UTC)

Most of these are not yet fixed. Some have added comments, but in reality definitions need to be added to Wiktionary, species added to wikispecies, proper names marked, non-standard things made standard etc. Graeme Bartlett (talk) 11:40, 18 May 2022 (UTC)
@Graeme Bartlett: Thanks for getting back to me so quickly. I guess my quick glance wasn't enough. I haven't really done much beyond fixing typos when I was involved with the project before. How simple is moving definitions to Wiktionary? Do articles with these words actually contain definitions a lot of the time or is there something else you have to do? How would you mark proper names? Is there a list somewhere that prevents them from being re-added as typos? Or am I misunderstanding what you meant by that? Clovermoss (talk) 15:12, 18 May 2022 (UTC)
The instructions are at the top of Wikipedia:Typo Team/moss. If the words are not dealt with, then they should be re-added to the list. Otherwise no one will sort them out. The main exception is the species that you can get away with adding to a wanted list. (I think perhaps we need something like this for chemicals, as unlimited chemical names can be constructed. Chemical formulas and maths formulas just need a smarter detection algorithm.) You can also write articles or create redirects. Don't expect to find definitions for Wiktionary in the articles. Take a look at some Wiktionary entries to get an idea of the format. Graeme Bartlett (talk) 21:57, 18 May 2022 (UTC)
@Clovermoss: Oh yes, there is still a huge backlog of English misspellings to fix and other tasks to take care of. I just noticed the "X" listings were getting a bit stale, so I dealt with the remaining few actual misspellings and dumped the rest (which need tagging or investigation) into the "Case notes" section on the X subpage, which is the usual practice. A lot of folks seem to have more fun just fixing actual English misspellings and leave the non-English cleanup and linguistic research to others. Since there's always plenty of easy misspellings ready and waiting, if you're one of those people and ever want fresh listings, feel free to ping me. Recently I've been distracted not only by real life but also by maintaining other moss reports, and forgot to check the main listings. I just posted fresh typos at Wikipedia:Typo Team/moss/Y, so hopefully with your help and that of many others, we'll soon finish our second pass through the alphabet. If you're also still interested in adding Wiktionary entries (so many piled up wanting to go over!), I made Wikipedia:Typo Team/Wiktionary cheat sheet for myself, and you may find that speeds the process. -- Beland (talk) 07:25, 3 June 2022 (UTC)

Nothing wrong with "№"


There’s nothing wrong with using № in place of "No." in table headings. Why is № being targeted? Jeff in CA (talk) 07:13, 29 May 2022 (UTC)

@Jeff in CA: This symbol is contrary to the manual of style, at MOS:NUMERO. -- John of Reading (talk) 07:52, 29 May 2022 (UTC)

Fresh listings


@Jake The Great 908, Puddleglum2.0, Schazjmd, Bradleyagin, Darylgolden, MarkZusab, Amiodarone, Zojomars, Anarhistička Maca, Clovermoss, JaAlDo, Creativecreatr, Voidify, Doghouse09, Spazure, Idell, Fehufanga, Triethylborane, Littleb2009, Normal Name, Amazomagisto, TreeReader, and Alivemussel:

This is your official notice that I've just refreshed most of the sections on the main page, and posted fresh typos to Wikipedia:Typo Team/moss/Y. We're almost done with our second pass through the alphabetical listing of all articles with typos! This update has combined several sections all looking at the most frequent words which are missing from the dictionary. As we've fixed over a decade's worth of existing typos, it's becoming increasingly rare to encounter actual English misspellings that appear many times in Wikipedia, and increasingly hard to automatically distinguish English from non-English words, so I stopped trying to do that and just made one report for the highest-frequency typos.

The backlog of words to move to Wiktionary was getting very long, so I moved that to its own subpage. Folks working on that queue should feel free to rearrange it if it suits them, or let me know if there's anything that I could improve that would speed things along.

Thanks to Jonesey95, we also have a new report, Wikipedia:Typo Team/moss/not English, which finds long passages of non-English text (and other non-English garbage sometimes), if you're interested in helping tag or clean up or translate those articles.

For those working on the chemistry lists (I see Graeme Bartlett charging through a giant pile most recently!) since those had a lot of active edits recently and the new results look very similar to the old ones, I decided it might be easier for everyone not to update those lists quite yet. But if you're encountering stale listings, let me know and I'll be happy to refresh those too.

In fact, anyone should feel free to ping me if updating anything would provide you with a more satisfying experience in your favorite work queue, or if you spot any potential for improvement or have any questions. -- Beland (talk) 08:14, 3 June 2022 (UTC)Reply

Just Curious: Why do so many Indian pages appear on TS+Dot lists?


(Moved from Wikipedia talk:Typo Team/moss/K.)

It seems like a lot. Elfabet (talk) 19:13, 8 March 2019 (UTC)Reply

@Elfabet: Oh, hey, I was just centralizing talk page discussions so I wouldn't miss important questions and comments, and I came across this from a while ago.
That's an interesting question! I can't give you an entirely scientific answer, but based on some casual observations of demographics and typo patterns, my guess would be a combination of factors:
  • India is a large, heavily populated country, so there is a lot to write about.
  • There are hundreds of millions of people in India who speak English, so there is good coverage of India-related topics compared to countries with very few English speakers.
  • It might be that lots of words describing India-related topics begin with the letter "K", because those words are borrowed or transliterated from languages that use that as an initial sound a lot.
  • Punctuation-related errors sometimes occur because of hard-to-read wiki syntax or just because someone who knows the standard English punctuation rules has mistyped. But statistically, I do see punctuation errors disproportionately in articles where there is a reasonably long passage of at least somewhat grammatically incorrect English. I suspect this happens when someone who is not familiar with the rules of standard written English contributes new sentences and paragraphs. This happens to the degree that I actually require at least one punctuation error before listing an article on the Wikipedia:WikiProject Guild of Copy Editors/Database Report (which results in it getting a top-to-bottom copyedit for grammar and punctuation and not just spelling).
  • Hundreds of millions of people in India speak English as a second or third language, with varying levels of proficiency for the spoken and written forms. Interestingly, and unlike vocabulary and grammar, spelling and knowledge of punctuation rules are applicable only to the written form and not the spoken form.
  • Non-native speakers who have not yet become proficient at a new language tend to commit errors in systemic ways because of the language(s) they learned first. For example, if your native language(s) does not have plurals or subject-verb agreement, you will tend to make mistakes in those aspects of English, and probably master those aspects last. Similarly, native English speakers who are learning Mandarin tend to have a lot of trouble with tones because English is not a tonal language. Out of curiosity, I checked around and though it looks like sometimes Hindi is written with periods, it is often not, even in formal settings like newspapers. It would then make sense that text written by native Hindi speakers from India would have more period-related errors (which is what the TS+DOT section is reporting) than native Brazilian Portuguese speakers, where the punctuation rules are mostly the same as English. There are many other widely spoken native languages in India which may have similar differences, though I haven't bothered to catalog them.
-- Beland (talk) 02:04, 4 June 2022 (UTC)Reply

Just add Water


I see that for H2O there are many items (search=167) that relate to H2O: Just Add Water. The article Loreto Kirribilli is one example: Graeme Bartlett added subscripting to the display portion (single entry) in December, so what the current report picks up is the link target. This could be 'fixed' with the following construct, [[{{as written|H|2O}}: Just Add Water|H<sub>2</sub>O: Just Add Water]], to give H2O: Just Add Water; however, perhaps the better approach would be an exclusion? Neils51 (talk) 22:38, 16 June 2022 (UTC)

Based on the article title, the version with the subscript seems to be more correct. -- Beland (talk) 23:04, 17 June 2022 (UTC)Reply
Umm..Beland, OK, let me re-phrase this. Why is Loreto Kirribilli in the current list and what would you do to fix it? Neils51 (talk) 00:01, 18 June 2022 (UTC)Reply
@Neils51: Ah, I see what you mean now! Sorry, I didn't read that as carefully as I needed to, perhaps because I hadn't eaten breakfast. After a hearty meal, I'm now remembering the algorithm for this report actually doesn't even look at link targets, only the display text. So Loreto Kirribilli is already fixed; it doesn't need {{as written}}. The only reason it's still on the list is that it wasn't removed when it was fixed, and I haven't updated the list since the 2021-11-01 dump. Feel free to delete any articles for which the only apparent problem is the link target. I was holding off on updating the list because it's actively being edited, but since it's started to cause confusion, why don't I go ahead and do that with the next dump. 2022-06-20 is coming soon, and it's a fast one so I should be able to sneak it in relatively easily without disrupting the cleanup process. -- Beland (talk) 00:50, 18 June 2022 (UTC)Reply
Thanks @Beland: that makes it clear. Based on what you have stated here I would like to update your "Instructions to Editors" to suggest strikethru prior to edits and delete once edits completed and the reasoning. When you say 'list', I'm assuming that means the published project page. Need to ask as sometimes assumptions......Neils51 (talk) 01:23, 18 June 2022 (UTC)Reply
Yes, I mean the lists on the project page. Mmm, well it seems like a bit of unnecessary work to do both strikethru and delete. As long as one of them happens, it will prevent other editors from attempting to fix something that is already fixed. The instructions recommend strikethru for sections that get updated all at once, which is what happens for the chemistry formula report, just so I can make sure to manually delete items from the new report that got fixed in the day or two between the dump snapshot and when the report is ready. Does that make sense? -- Beland (talk) 02:31, 18 June 2022 (UTC)Reply
I also may have fixed many pages that appear on other lists. Also I have found when I strike some off, then other editors might remove my striking. It does result in more checking than required! Graeme Bartlett (talk) 02:59, 18 June 2022 (UTC)Reply
Yeah, having multiple lists as we do makes unfreshness particularly problematic when they collide. I hope I haven't been the one accidentally undoing your strikethroughs? Sometimes I delete struck-through text when I'm preparing for an update but the snapshot hasn't happened yet, to reduce the amount of cross-checking I have to do later. If strikethrough really isn't working out, we could just say, always delete, and I will try to use page history diffs instead of strikethroughs to prevent duplicate work across updates. I don't want to confuse people who have gotten used to a particular way of doing things, but I'm open to whatever people find the most efficient and pain-free. -- Beland (talk) 08:24, 18 June 2022 (UTC)Reply
My 'concern' was that an item dealt with in December is something I have revisited recently as still in the list. I was suggesting that strikethrough and later delete would prevent that (reappearance) from happening, based on your response. I am not sure how that would work for you however what can we do that will prevent, or assist you to prevent, the ongoing presence, or return, of actioned items? Perhaps I have a misunderstanding as to how you generate the lists? On the project page you say that you run fresh scans against a recent dump to produce new lists. That implies that any item previously dealt with, no longer meeting selection criteria, should disappear, particularly when 6 months or so have elapsed. The 'implication' seems at odds with manual manipulation of items that are flagged as processed on the project page? The 'logic' doesn't make sense to me so what am I missing? The purpose here is to have you provide the best information and the editors using it to do so in an efficient manner. I know that writing a book is not always the best way to go so happy to be involved in a Zoom (or similar) call. Neils51 (talk) 09:50, 18 June 2022 (UTC)Reply
Now reading back through the comments I am getting the notion that you are adding to the lists and probably manually deleting any that have been struck so not a list refresh as such, just an append, as your original base was late last year. Do I have that correct? So some items have been actioned and either not been struck or if struck the strike was at some point removed? If this is the case then maybe the editor who has completed the item(s) should be responsible for their deletion from the list (at some point). Neils51 (talk) 10:11, 18 June 2022 (UTC)Reply
Yeah, in this case, the listings were based on a dump that was six months old. I just did a complete refresh from the 2022-06-20 dump, so staleness shouldn't be a problem anymore. The only manual inspection I really need to do is to prevent items from re-appearing if they have been completed between the time the dump is snapshotted and the time I post the update. For first-of-the-month dumps, that's typically a couple weeks, but for 20th-of-the-month dumps, that's typically a couple of days. I think the easiest thing to do going forward might be to just say "delete things as you finish them to avoid duplicate work" (unless you need to leave a note documenting a weird case) and I will take care of the rest (and try to use 20th-of-the-month dumps to make it easy on myself). -- Beland (talk) 18:58, 22 June 2022 (UTC)Reply
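
To make the "display text only" behaviour described in this thread concrete, here is a minimal sketch. It is not the actual moss code: the regex, the `visible_text` helper, and the `has_unsubscripted_h2o` check are all hypothetical names, and only simple, non-nested wikilinks are handled.

```python
import re

# Matches [[target]] and [[target|display]] (simple, non-nested links only).
# Hypothetical helper for illustration; not taken from the moss scripts.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def visible_text(wikitext: str) -> str:
    """Replace each wikilink with the text a reader would actually see."""
    def repl(match: re.Match) -> str:
        target, display = match.group(1), match.group(2)
        # A piped link shows its display part; an unpiped link shows its target.
        return display if display is not None else target
    return WIKILINK.sub(repl, wikitext)


def has_unsubscripted_h2o(wikitext: str) -> bool:
    """True if a plain 'H2O' would be visible to the reader."""
    return "H2O" in visible_text(wikitext)


# Piped link with a subscripted display part: nothing left to report.
fixed = "[[H2O: Just Add Water|H<sub>2</sub>O: Just Add Water]]"
# Unpiped link: the target doubles as display text, so it is still reported.
unfixed = "[[H2O: Just Add Water]]"
print(has_unsubscripted_h2o(fixed), has_unsubscripted_h2o(unfixed))
```

Under these assumptions, an article like Loreto Kirribilli stops matching as soon as the display part carries the subscript, regardless of what the link target contains.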

"convert special characters found by Wikipedia:Typo Team/moss"


Can anyone please explain the policy behind converting "special characters" to templates? i.e. https://en.wikipedia.org/w/index.php?title=DR_Class_130_family&diff=next&oldid=1095852548 In particular the policy basis for this.

AIUI, for some time the push has been to move "special markup" (i.e. HTML entity references) to unicode characters, i.e. &deg; to ° Is that correct, or has that now changed?

If so, why are other unicode characters like ′ now being replaced with a template?

In particular (@Beland:), why is this being done within links, where it obviously breaks the link, and without any form of checking afterwards? We've been here for years, we already knew that bulk operations by simple regex are a really bad idea, just because they cause this havoc to wikicode. Andy Dingley (talk) 15:00, 12 July 2022 (UTC)Reply

@Andy Dingley: Whoops, my bad. Changing to an equivalent Unicode character or HTML entity works in links, and changing to an equivalent template generally works in the display text part of links, but as it turns out not in the target part of links. I check all these changes manually, and I forgot the distinction between display and target part when deciding not to preview this particular page. Thanks for catching that, and I'll be more careful to double-check links in the future.
Yes, in general, consensus favors converting HTML entities to Unicode characters for ease of use, and that's what I've been doing. There are some exceptions, and in the case of characters that can be visually confused for one another, many editors prefer to have the name of the character in the markup (especially when <math>...</math> markup is used on the same page). My general assumption is that if a template exists, it's preferred over the equivalent HTML entity, because templates can have documentation attached, most editors don't know how to use HTML entities, and wikitext is generally preferred over HTML. I've assembled a partial list of visually confusing characters here, along with some notes for myself on how to sort them out: Wikipedia:Manual of Style/Character Table 2.
The prime character in particular is often misused, either as an apostrophe, single quote mark, ʻokina, or other similar character. Legitimate uses often occur right next to an italic letter, and because wikitext uses apostrophes to italicize, the markup gets hard to read with a raw character. As I've been going through and fixing misuses, I've been converting the raw character to {{prime}} both for clarity and to indicate that this instance has been checked for correctness. Since I can't do that for this article, I've just added an HTML comment to clarify. -- Beland (talk) 19:04, 12 July 2022 (UTC)Reply
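
The distinction between link targets and display text can be sketched as follows. This is only an illustration of the rule described above (templates are safe in display text but break targets); the function name and regex are hypothetical, and the sketch handles only simple, non-nested wikilinks.

```python
import re

PRIME = "\u2032"  # the prime character
# [[target]] or [[target|display]]; group 2 keeps the leading pipe.
LINK = re.compile(r"\[\[([^\[\]|]+)(\|[^\[\]]*)?\]\]")


def templatize_prime(wikitext: str) -> str:
    """Replace raw primes with {{prime}} everywhere except link targets."""
    out = []
    pos = 0
    for m in LINK.finditer(wikitext):
        # Ordinary text before the link: safe to convert.
        out.append(wikitext[pos:m.start()].replace(PRIME, "{{prime}}"))
        target, display = m.group(1), m.group(2)
        if display is None:
            # Unpiped link: the target is also the display text, so a
            # template here would break the link. Leave it untouched.
            out.append(m.group(0))
        else:
            # Piped link: convert only the display part.
            out.append("[[" + target + display.replace(PRIME, "{{prime}}") + "]]")
        pos = m.end()
    # Ordinary text after the last link.
    out.append(wikitext[pos:].replace(PRIME, "{{prime}}"))
    return "".join(out)


print(templatize_prime("x\u2032 and [[Foo\u2032 bar|Foo\u2032 bar]]"))
```

A manual preview is still needed after a pass like this, since templates, tables, and nested markup can contain links this simple pattern does not recognise.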

Manual attention needed

  • 2 - 99th United States Congress - wikt:you, wikt:you. It looks like these two were caught because of another typo, which I corrected. They clearly address the readers, a no-no, but I left this entry as a reminder for now because they're part of a boilerplate paragraph used in articles covering the 38th through 111th Congresses, all of which need the same rewriting. (The other articles starting with '9' didn't have the other typo, and I suspect that's why they didn't show up on this list.) A template paragraph might be better - and easier. Ira Leviton (talk) 17:01, 30 September 2019 (UTC)Reply
User:Ira Leviton, I honestly don't think the paragraph needs to be there at all... it just explains how to click on a link and how to read the document it opens to which I think we can assume denizens of the internet to be capable of without instruction. All of the relevant information on the document is summarised in the article anyway, which is also true of the handful of other random articles I checked within the range you listed. Xurizuri (talk) 10:35, 27 December 2020 (UTC)Reply
@Xurizuri: This is still on my ever-expanding list of things to do, but I had forgotten about it. I think that having the link is useful; it's much tougher to find this information without it. But the paragraph can be rewritten as "Complete lists of members and staff for all House and Senate standing, select, and special committees and subcommittees appear in the annual congressional directory listed at the bottom of the page in the external links section." I'm willing to insert this on all of these pages. (But I'll wait to hear from you, or if you can improve my wording.)
Ira
Ira Leviton (talk) 14:54, 28 December 2020 (UTC)Reply
That sounds solid. I'm not very across MOS for external links so I honestly don't know if there's any issues with that plan. I guess an alternative to directly mentioning the directory in the text is to have "Complete lists of members and staff for all House and Senate standing, select, and special committees and subcommittees as they appear in the annual congressional directory" and cite that to the ACD itself. Either way, seems like an improvement to just be getting that 2nd person language out of there. Xurizuri (talk) 15:27, 28 December 2020 (UTC)
I have created and implemented a template along these lines at Template:List of Congressional Committees instructions. Wording tweaks are, of course, welcome. BD2412 T 02:20, 9 July 2022 (UTC) Sct72 (talk) 01:51, 13 July 2022 (UTC)Reply

Dashes


I wouldn't normally be this pedantic but since this is a typo team project... The project pages consistently use hyphen incorrectly where a dash is required, for example in all the dump lists. Example: Wikipedia:Typo Team/moss/I#Case notes. All those items that start with "1 -" should start with "1 –". GA-RT-22 (talk) 15:49, 30 July 2022 (UTC)Reply

@GA-RT-22: Hmm, the dashes are actually used in the Python and shell scripts a lot to split apart lines and whatnot. It would be difficult to use non-ASCII characters because they are not on my keyboard, it would be a big change that could break some things, and it's difficult to tell the difference between the different kinds. I think I'll leave things as they are. -- Beland (talk) 18:35, 12 August 2022 (UTC)Reply
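
To illustrate why switching the separator would be a breaking change, here is a minimal sketch of a parser that splits on the ASCII " - " delimiter. The line format is an assumption inferred from entries such as "2 - 99th United States Congress - wikt:you, wikt:you" on this page, and `parse_listing` is a hypothetical helper, not code from the moss scripts.

```python
def parse_listing(line: str) -> tuple[int, str, list[str]]:
    """Split 'COUNT - ARTICLE - word, word' into its three parts.

    Assumes the ASCII ' - ' separator; an article title that itself
    contains ' - ' would be split in the wrong place.
    """
    count, article, words = line.split(" - ", 2)
    return int(count), article, [w.strip() for w in words.split(",")]


print(parse_listing("2 - 99th United States Congress - wikt:you, wikt:you"))

# The same entry typed with en dashes no longer splits into three fields,
# so the tuple unpacking raises ValueError.
try:
    parse_listing("2 \u2013 99th United States Congress \u2013 wikt:you, wikt:you")
except ValueError as err:
    print("en-dash line rejected:", err)
```

Any script written this way would have to be updated in step with the page format, which is the kind of coordinated change the reply above is wary of.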

Time for more


Wikipedia:Typo Team/moss/before A is almost completely finished, just case notes really now. Please @Beland: can you post the next batch? Graeme Bartlett (talk) 12:19, 12 August 2022 (UTC)Reply

@Graeme Bartlett: Thanks for the ping! "A" is now posted!
Reading through the recently resolved case notes, I see a lot of the "probably OK" words are actually real typos. It looks like these are coming from the moss "ME" class which are "coMpound English" words. I think we made a lot of progress adding legitimate compounds to Wiktionary, and a lot of what's left are instances where there's a missing hyphen or space, or a misspelling that happens to look like two unrelated words smashed together. So, I'll be posting those after the next run, which should be for "B". Thanks to everyone who has been conscientiously putting those in case notes! -- Beland (talk) 00:59, 13 August 2022 (UTC)

Trivial, not typo


This edit by Beland appears to be trivial, and so to be avoided. It is not a typo. Note that character-by-entity entrance is useful to achieve script support; converting to unsupported script characters is not helpful in this. Is there more background involved? Does it relate to "This page contains script XYZ character"-categorisation & reader help? DePiep (talk) 07:49, 16 October 2022 (UTC)

@DePiep: Greetings! In the spirit of MOS:MARKUP (keep markup simple) and in my past experience, there seems to be general support for replacing numeric HTML entities with the equivalent Unicode characters, or named HTML entities, or templates. There are a few exceptions, such as private use characters, where it's clearly necessary to keep the numeric entity. These changes are not trivial in the sense of WP:COSMETICBOT; they are intended to make it easier for downstream consumers to parse wikitext. For example, simplifying the representation of special characters makes it easier for search engines to find all the relevant pages to a query that contains special characters, and in my case, for my spelling, grammar, and style checker to validate content. Keeping markup simple is also supposed to make life easier for editors, since it makes the wikitext more WYSIWYG. I'm not exactly sure what you mean by achieving "script support"? If you're saying it's easier for you to input wikitext using HTML entities, you're still welcome to do that. I'm hoping the conversion makes it easier for you to read and edit the wikitext subsequently, but if that's not the case, we can discuss and make alternate arrangements. -- Beland (talk) 17:56, 17 October 2022 (UTC)Reply
Thanks, clear now. Basically, the "trivial" was the question (i.e., not an issue). My background is Unicode, so I am more interested in the numerical code :-). Long term, we want to improve script support (support rare scripts). But anyway, I am also working on a character-analysis tool (return character properties), so the edits we are talking about are not a hindrance. DePiep (talk) 20:24, 25 October 2022 (UTC)
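
The conversion discussed in this thread could look roughly like the sketch below. The exception for private-use characters comes from the reply above; keeping control, format, unassigned, whitespace, and markup-significant characters as entities is an extra assumption added here for safety, and `decode_numeric_entities` is a hypothetical name.

```python
import re
import unicodedata

# Decimal (&#176;) or hexadecimal (&#xB0;) numeric character references.
NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

# Categories left as entities: private use (Co, per the discussion) plus
# control, format, surrogate, and unassigned code points (an assumption).
KEEP_CATEGORIES = {"Co", "Cc", "Cf", "Cs", "Cn"}
# Characters that are significant in wikitext (also an assumption).
KEEP_CHARS = set("<>&[]{}|'=")


def decode_numeric_entities(text: str) -> str:
    """Replace numeric entities with the literal character where safe."""
    def repl(m: re.Match) -> str:
        code = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        if code > 0x10FFFF:
            return m.group(0)  # not a valid code point; leave as written
        ch = chr(code)
        if (unicodedata.category(ch) in KEEP_CATEGORIES
                or ch.isspace() or ch in KEEP_CHARS):
            return m.group(0)
        return ch
    return NUMERIC_ENTITY.sub(repl, text)


print(decode_numeric_entities("45&#176; and &#xE000; and a&#160;b"))
```

Leaving whitespace such as &#160; alone is deliberate in this sketch, since an invisible literal character is harder for later editors to spot than the entity.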

Needless italics around quoted speech


This is a problem I see very often around Wikipedia, e.g.

The author later denied the claims: "I never wrote such a thing."

Of course the italics are not necessary (and in fact merely confusing) when we already have quotation marks. Could the fixing of this very common problem be automated, or included here, too? Equinox 12:36, 29 October 2022 (UTC)Reply

That's a great idea! I see that from time to time as well. I already have some regexes that try to change single quotes to double quotes per MOS:DOUBLE. Based on my experience with that, I'd say it's probably easy enough to put in some regexes for relatively simple cases. And I expect to find many thousands of them. -- Beland (talk) 16:37, 16 May 2024 (UTC)Reply
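
For the "relatively simple cases" mentioned above, a regex along the following lines might be a starting point. It is a sketch under stated assumptions, not one of the existing moss regexes: it only handles straight double quotes wrapping a fully italicised passage (wikitext `''...''`), and it deliberately skips anything containing an apostrophe, which also rules out bold (`'''`) markup.

```python
import re

# \x27 is the apostrophe; written this way to keep the pattern readable.
# Matches: " '' text-without-apostrophes-or-quotes '' "
ITALIC_QUOTE = re.compile(r'"\x27\x27([^\x27"\n]+)\x27\x27"')


def strip_italics_in_quotes(wikitext: str) -> str:
    """Drop italic markup that exactly fills a double-quoted passage."""
    return ITALIC_QUOTE.sub(r'"\1"', wikitext)


before = "The author later denied the claims: \"''I never wrote such a thing.''\""
print(strip_italics_in_quotes(before))
```

Quotations containing contractions or nested titles are left untouched by design, so they would still need manual review.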

B finishing up


@Beland: Wikipedia:Typo Team/moss/B is very close to finished. So it could be time to break out the "C" typos. Graeme Bartlett (talk) 06:37, 20 February 2023 (UTC)

@Graeme Bartlett: Thanks for the note! Wikipedia:Typo Team/moss/C is posted. -- Beland (talk) 03:05, 22 February 2023 (UTC)Reply

These pages have become too big to work with conveniently.


Wikipedia:Typo Team/moss/C started at well over 1 million bytes, which makes it slow to load, slow to edit, and hard to find things. Can we start splitting these up into subpages, either by sectioning issues by type, or by first letter combinations? Also, can we perhaps centrally archive the previous case notes somewhere? BD2412 T 03:35, 12 April 2023 (UTC)Reply

I notice it takes quite a few seconds to edit, save, or load these pages. I thought of taking out a section to work on, and merge the updates later, but that risks duplicate effort. But please keep things simple, otherwise we will deter helpers! So updating the one page is simpler. But we also need people to action the case notes. Old case notes in a separate page sounds OK, as they gradually fill up the page. Also splitting sounds good. Graeme Bartlett (talk) 22:25, 12 April 2023 (UTC)Reply
I think the easiest thing would be to split into smaller letter ranges, e.g. Ca-Cd, Ce-Ck, Cl-Co, Cp-Cz (or Că, if we're counting special characters). If we can generate a page by first letter, we can generate pages by first letter ranges. Still split out the old case notes. BD2412 T 22:59, 12 April 2023 (UTC)Reply
I have created Wikipedia:Typo Team/moss/Old case notes, and moved case notes from B to E there so far, with appropriate redirects added to sections of the original pages. BD2412 T 18:43, 30 April 2023 (UTC)Reply
Thanks for fixing that pain point. Just FYI, the moss script automatically downloads the per-letter pages and uses them to suppress typos already triaged. I added /Old_case_notes to the list of downloads when I noticed this move a while ago, so no worries. But if anyone is restructuring these pages in the future, please ping me so I can update the moss scripts. (I've also added this page using the new talk page subscription feature, so hopefully I'll notice new threads here more quickly.) As I plan to develop a new JavaScript-powered UI, I may need to rejigger these pages or add a new more machine-oriented one. A primary goal would be that no one has to edit these pages anymore, and the effort is coordinated behind the scenes or by tagging article pages immediately. -- Beland (talk) 16:45, 16 May 2024 (UTC)Reply
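
The letter-range split suggested earlier in this thread (Ca-Cd, Ce-Ck, Cl-Co, Cp-Cz) can be sketched as a simple bucketing function. The range boundaries are taken from that suggestion; the function name and the "other" bucket for titles whose second character falls outside a-z (such as "Că") are assumptions for illustration.

```python
# Second-letter ranges from the suggestion above.
RANGES = [("a", "d"), ("e", "k"), ("l", "o"), ("p", "z")]


def subpage_for(title: str, ranges=RANGES) -> str:
    """Return a subpage label such as 'Ca-Cd' for an article title."""
    first = title[0].upper()
    second = title[1:2].lower()  # empty string for one-letter titles
    for low, high in ranges:
        if low <= second <= high:
            return f"{first}{low}-{first}{high}"
    # Digits, punctuation, accented letters, or no second character.
    return f"{first} other"


for title in ["Cabbage", "Chemistry", "Cyan", "C\u0103t\u0103lin", "C"]:
    print(title, "->", subpage_for(title))
```

Whatever scheme is chosen, any script that downloads these pages would need the same list of subpage names, which is why restructuring should be announced in advance.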

Maybe this is something we could add to the database dumps if there's a way of automatically determining if an external link is outside the external links section? Clovermoss🍀 (talk) 21:25, 26 April 2023 (UTC)Reply
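
One possible way to detect this automatically is sketched below. It is only an illustration: it assumes the section is headed exactly "External links" at level 2, treats links inside `<ref>` tags as acceptable citations (an assumption, not something stated above), and looks only at bracketed `[https://...]` links. All names are hypothetical.

```python
import re

# Level-2 headings only; "=== Sub ===" stays inside its parent section.
HEADING = re.compile(r"^==[ \t]*([^=].*?)[ \t]*==[ \t]*$", re.MULTILINE)
# Self-closing and paired <ref> tags (simple cases only).
REF = re.compile(r"<ref[^>]*/>|<ref[^>]*>.*?</ref>", re.DOTALL | re.IGNORECASE)
# Bracketed external links such as [https://example.org label].
EXT_LINK = re.compile(r"\[(https?://[^\s\]]+)")


def inline_external_links(wikitext: str) -> list[tuple[str, str]]:
    """List (section, url) pairs for links outside 'External links'."""
    body = REF.sub("", wikitext)  # ignore citations (assumption)
    found = []
    section, pos = "(lead)", 0
    boundaries = [(m.start(), m.end(), m.group(1)) for m in HEADING.finditer(body)]
    boundaries.append((len(body), len(body), ""))  # sentinel for the last chunk
    for start, end, next_section in boundaries:
        if section.lower() != "external links":
            found += [(section, url) for url in EXT_LINK.findall(body[pos:start])]
        section, pos = next_section, end
    return found


sample = (
    "Lead [https://a.example A].<ref>[https://cite.example C]</ref>\n"
    "== History ==\nSee [https://b.example B].\n"
    "=== Sub ===\n[https://c.example]\n"
    "== External links ==\n* [https://d.example D]\n"
)
print(inline_external_links(sample))
```

Infoboxes and citation templates legitimately contain external links too, so a real report would probably need further exclusions before it was useful.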

We should revise the Quickstart and Instructions for Editors sections


The Instructions for Editors section on the individual list pages and, to a lesser extent, the Quickstart section on the main MOSS page need a revision. Some things in these sections are phrased unclearly (like the matter of what constitutes a proper name). The formatting and phrasing of the Instructions page need a copyedit anyway (e.g. the extra bullet point before "For DNA sequences" and the misplacement of the word "titles" in the line about proper names). And there are some matters that aren't listed here, like an example edit summary for adding a proper name tag (which would be the same as for the not a typo tag, but it's annoying to have to edit it after pasting every time). The main issue that led me to write this talk page section is the matter of plurals of the subject of a page. As stated here by Beland, if e.g. "fexprs" is used on a page "fexpr", a redirect should be created. I've come across this situation multiple times and only now have figured out what to do with it. It'd be a good idea to include this in the Quickstart and Instructions sections. 110521sgl (talk) 09:13, 30 April 2023 (UTC)

Harmful automated edits


I consider this edit to Transient execution CPU vulnerability page as vandalism because it breaks the intentional formatting. In particular:

  • &ZeroWidthSpace; HTML entity is used to allow word wrapping on / (Firefox does this automatically, but Chrome doesn't). Using {{wbr}} is an alternative (which I wasn't aware of) and pointless since IE<7 are widely unsupported;
  • a non-breaking space (&nbsp;) can be used to prevent an undesirable line break in arbitrary places as well. In my case they mostly fix the wrapping in ref previews.
    I know that it should be placed as an HTML entity, but doing so hurts the source readability and I don't see any problems from other editors being unaware of them (apart from trying to re-insert them as an HTML entity);
  • I agree with using ... instead of … (I didn't check the MOS and don't remember where I took it from), but removing the &nbsp;s and especially ...s is incorrect;
  • not quite sure about ® and ™ symbols, especially in quotes. Shouldn't we cite things as is?

In general, you should not make functional changes to anything unless it's justified to be incorrect or broken (assume good faith ⇒ intended, especially HTML entities)! EvgenKo423 (talk) 07:38, 16 May 2024 (UTC)Reply

I am guessing that Beland did not poke through the entire page to make those changes, but that the JWB editing tool has automated presets for making such edits, which would reflect the technical opinion of the maintainers of that tool that these edits constitute general improvements to any page where the tool is used. BD2412 T 12:00, 16 May 2024 (UTC)Reply
It's not the maintainers of the JWB tool that have chosen these substitutions. I coded them up myself; there's a special part of the moss database dump scanning process that outputs some JWB config files and article lists for me to fix more efficiently than we currently do with spelling errors. I manually review every edit, and some of the changes made in that edit I put in manually to try to streamline markup and make it more readable.
@EvgenKo423: Responding to your points in order:
  • Ah, I use Firefox, so I didn't see any problems with word wrapping when editing that article. {{wbr}} should be used when necessary instead of &ZeroWidthSpace: MOS:MARKUP says that wiki markup is preferred to HTML when available. Using a template lets us change the breaking technique if needed to keep up with browser evolution. &ZeroWidthSpace is somewhat obscure, and part of the point of wiki markup is to avoid making editors learn HTML syntax. In this specific case, I think it looks better to use a spaced slash, which is actually already done in the "Zen / Zen+" column header. I'll do this with a non-breaking space on the left of the slash per MOS:SLASH.
  • MOS:NBSP requires the use of &nbsp rather than the raw Unicode character, to avoid editor confusion. I agree that it reduces wikitext readability, but apparently that is the consensus. To improve readability, I try to use {{nowrap}} instead of multiple consecutive &nbsps in the same expression. I also tend to remove them if they aren't absolutely necessary. For example, the phrase "Whiskey Lake stepping C" appears on a line by itself under a wide table. It will never be wrapped because the table above it forces horizontal scrolling long before that phrase runs out of horizontal space. Remembering the directive to keep markup simple, I also generally remove non-breaking spaces between normal words. I can see why it's helpful to keep e.g. numerical expressions on a single line, but it seems excessively fussy to keep adjective phrases or infinitives intact, for example in this case where there were non-breaking spaces in "Two mitigation" and "To minimize". If we were to do this consistently, we'd have a huge number of &nbsps in every Wikipedia article, and that's just not our general practice.
  • Ah, I personally prefer unspaced "...", but you are right that MOS:ELLIPSES requires a non-breaking space before and regular space after. I don't think it's necessary to put "..." at the end of a quotation, because it's understood that a quote is not the whole source text. One thing we should probably fix is using "[...]" vs. "..." consistently; according to that MOS guideline, "[...]" is only needed if there is a "..." that is actually part of the source text.
  • MOS:TMRULES explicitly says not to use ® and ™ in article text and citations. Even in quotations, MOS:CONFORM says we alter them to conform to Wikipedia style, because "a quotation is not a facsimile".
  • My intention in making these changes is to make articles comply with the Manual of Style. It's a bit distressing to see a tidy-up effort being equated with a mischievous teenager spray-painting swear words on a building. WP:VANDALISM says "Even if misguided, willfully against consensus, or disruptive, any good faith effort to improve the encyclopedia is not vandalism."
-- Beland (talk) 16:27, 16 May 2024 (UTC)Reply
  • Well, the MOS:MARKUP also says "This makes wikitext easier to understand and edit" and I could argue that &ZeroWidthSpace; is easier to understand, because it tells you literally what it is, whereas it's about the second time I ever hear about the {{wbr}} HTML tag as a programmer, let alone it gives no (clear) clue by its name (at least to non-HTML-developers). As a new user I also find it harder to get any help information about Wikipedia and templates in particular, it's so unintuitive to search for a template that it's simpler for me to use Google or just the direct URL...
    While I think that a spaced slash always looks better, AFAIK it's an incorrect grammar in English when indicating choice between single-word elements (in that case the choice is between the first word only) and it also increases the minimum width, if we are to be pedantic (see other edits below). The "Zen / Zen+" is a leftover from when it was "Zen 1 / Zen 1+";
  • "It seems excessively fussy" to me to search for and clean up the garbage for other users on the whole site in return. As I've explained previously, the &nbsp;s in your last example were used where "absolutely necessary" to prevent the quote from starting at line end with a 2-3-letter word in a ref preview (the mouse hover popup), which is not pretty.
    You prioritize better wikitext over better results which makes little sense;
  • I used ... when the sentence was truncated and [...] when intermediate sentences were omitted completely.
Regarding your other edits:
  • The "overlapping text" issue was already discussed on talk and it works fine for me on a fresh latest Firefox as well as very old Firefox 38 I happen to have. &ZeroWidthSpace; is needed there in Chrome to unify the look of links in a cell.
    So I encourage you to revert at least this change (which took a lot of time to implement with all the alternatives causing wrapping issues in Chrome), solve the problem with your browser installations and explain its causes there;
  • "Ice Lake": makes the table 7px smaller at best, but the wrapping looks ugly: not useful;
  • "Stale Data": no profit, ugly wrapping (depends on zoom level);
  • "Software Recompilation" (1): no profit, ugly wrapping;
  • "Software Recompilation" (2): 2px smaller at best, ugly wrapping for itself and TAA row (depends on zoom level).
I also find it offensive and disrespectful to my efforts when excessive moderation is applied to something that looks OK and whose purpose and implications you don't completely understand. As a programmer you should be familiar with another of its principles: "Don't touch what works" (especially relevant for web markup). In that regard I find placing &nbsp;s as a character to be an advantage against those willing to put their hands on things I don't want them to touch.
Overall, I am not interested in edit wars and I'm not willing to contribute my time to Wikipedia anymore. Thanks for your efforts.
EvgenKo423 (talk) 11:35, 24 May 2024 (UTC)
  • I certainly did not mean to disrespect your contributions, and I apologize if I came in with an excess of bluster. I tend to assume most editors don't care about this kind of markup cleanup, and if someone does have an objection to a change I'm making, I'm always happy to discuss. It's easy to revert any changes that turn out to be undesirable, because the edit history is always available to us, so I don't think of it like stomping on a sand castle on a beach. I'm pretty sure I understood the purpose and implications of the markup I changed; I just found it not to be in compliance with the Wikipedia Manual of Style. Wikipedia is a collaborative project edited by people of lots of different skill levels; entrenching your own preferred markup by deliberately making it harder to see what's happening seems somewhat anti-collaborative. I think the reason for this convention is that if we can all easily see what is happening, we can discuss if we have any differences of opinion, but invisible characters make it more likely that editors will simply make mistakes or argue needlessly because they are confused about what is going on. If you feel strongly about making non-breaking spaces invisible, you can seek consensus on Wikipedia talk:Manual of Style. I can also see the argument for less cluttered wikitext, so I have no particular position on the question and will happily follow whatever consensus emerges. It's possible the MediaWiki UI could be changed to indicate whitespace differences through color or in some other helpful, less-cluttered way.
  • I did not participate in the discussions that came to the consensus that 1.) wiki syntax is preferred over HTML syntax and 2.) simplicity in markup is preferred over cosmetic perfection. I cannot unilaterally change consensus even if I agreed with your position. If you want to change the guidelines, or you think my interpretations are incorrect, you can seek consensus on Wikipedia talk:Manual of Style.
  • Style guides differ as to whether a spaced slash can or should be used in this situation; there is not a universally accepted English grammatical rule. Wikipedia house style is defined by MOS:SLASH. Those guidelines do say spaced slashes are used to join multi-word phrases, implying single-word phrases don't need spaces. However, it's also unclear whether this is a circumstance where a slash is generally allowed, or if the guidelines are mostly aimed at prose and not table headers. MOS:SLASH does say that if a slash is used, "Cascade / Comet / Amber Lake" must have a spaced slash because of the multi-word phrase. I think it would look weird to space for one column and not the other. If you feel strongly that "Zen/Zen+" is correct, I would not object if you change the article to do that instead.
  • I don't see anything wrong with a line ending with the start of a quotation with a two- or three-letter word. I don't see how anyone would be confused by that but wouldn't be confused if it were a seven-letter word. The aesthetic difference is subjective; it does not look ugly to me, and so simplicity of markup seems more important. I mean "fussy" in the sense that it's more complicated than the simplest thing that works or is mandated by the MOS. It requires a higher level of skill and more time and energy to maintain more complicated markup. Those are precious commodities for a volunteer project, which is, I assume, part of why we try to keep things simple. (And why I try not to spend too much time discussing minor changes, this thread being somewhat ironic in that regard.)
  • I just edited the article to use the .../[...] distinction specified by MOS:ELLIPSES rather than the one you had used. It doesn't explicitly specify how sentence-ending ellipses should work, so I interpolated.
  • I replied on the other talk page regarding the overlapping text bug.
-- Beland (talk) 01:41, 25 May 2024 (UTC)

Moribund

I feel like this project has ground nearly to a halt. We have been working on a single letter for well over a year, with no end in sight given the remaining volume of fixes. At this rate, it will take 28.75 years for us to reach the end of the alphabet for this run alone. BD2412 T 12:58, 16 May 2024 (UTC)
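
The 28.75-year figure is not derived in the comment above. One reading that reproduces it, offered only as an assumption, is that the 23 letters after "C" (D through Z) each take about 1.25 years. The function name and both inputs below are hypothetical:

```python
import string


def years_remaining(current_letter: str, years_per_letter: float) -> float:
    """Estimate the time left, assuming every later letter takes equally long."""
    alphabet = string.ascii_uppercase
    # Letters strictly after the current one: for "C" this is D..Z, i.e. 23.
    letters_left = len(alphabet) - alphabet.index(current_letter.upper()) - 1
    return letters_left * years_per_letter


# Assumed pace of 1.25 years per letter; not a number stated in the thread.
print(years_remaining("C", 1.25))
```

The estimate is linear in the per-letter pace, so it moves quickly if later letters have fewer or easier cases than the current one.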

Yeah, I've felt for a while that we've been losing momentum. I have some ideas for a JavaScript-powered UI that would make it much faster to fix and mark-as-correct typos. A Wikimedia Foundation team inquired as to how this might be constructed and I was hoping they'd obviate the need for me to do so, but it seems that building such a feature into the site has not been a priority. (One of the challenges is that they like to do things across all languages.) I will start working on that, but it may take a few weeks as I also have lots of real-world and other wiki cleanup projects I'm trying to get done. But thanks for your help with typos and noticing and caring enough to write a note! -- Beland (talk) 16:32, 16 May 2024 (UTC)
For the letter C we have fixed all the easy stuff, and now really obscure words or spelling errors are left. Graeme Bartlett (talk) 12:13, 22 July 2024 (UTC)
Given how long the harder cases have been up and that folks have expressed interest in working on some easy cases, I've posted a fresh batch for D. -- Beland (talk) 19:54, 22 July 2024 (UTC)
@Beland: I think that is the right thing to do, and going forward we should aim to start the next letter when a certain percentage of the current letter is done. I also think that many of the "hard cases" are hard because they are not misspellings, but correct spellings or formulations of obscure words. The solution in those cases is to create the entry in Wiktionary (as I just did with "nonpalatalization"), but that is far more involved than fixing a typo on a Wikipedia page. BD2412 T 18:57, 23 July 2024 (UTC)

Self-Hosted Moss

I recently went through the GitHub page of the Moss Project. I have seen the README file, which has some instructions on running the project locally, but there is no clear step-by-step guide to how the project works. I understand that there are very few maintainers of the project, but if there were a clear guide on how to run it locally, then one could update the spell check list frequently, allowing for quicker fixing of typos. Please correct me if I am wrong, but if people could also contribute their computing resources, then the spell check lists could be updated monthly, compared to the 1–3-year-old archived lists that exist today. Bunnypranav (talk) 04:57, 20 July 2024 (UTC)

Greetings, and I'm glad you've taken an interest in this project!
The list of misspellings is generated twice a month for the complete alphabet, every time moss runs. Generation of typo lists is not the bottleneck; it's the speed at which volunteers are able to clear typos. It can take more than a year for us to run through the entire alphabet, especially since we have fewer volunteers now than when we were going at peak speed. I found it was not good to post more than one letter at a time, because if people were working on more than one letter at a time, it would take even longer to finish up any given letter. It's a bit awkward to update listings too close to the time when people have been working on them, because of the delay in edits showing up in database dumps. So, my habit was to update the next letter when the current one was nearly done. That's why upload is not automated; I just manually copy and paste from the main listings on disk, choosing the letter of focus manually.
In order to get back up to a reasonable pace, I think the user interface for volunteers needs an upgrade. My idea was to make a JavaScript interface more like JWB, where there would be buttons volunteers could press to replace potential typos with a suggested spelling, or mark them as proper nouns or other not-a-typo categories. To some degree this can be done by changing individual articles, but I thought we might also need to have some storage somewhere to mark things as false positives that really don't need to be marked in the wikitext (i.e. there's a moss bug) or as things people don't know what to do with, so I can take care of them. I was dithering about whether doing that on a special wiki page would make sense, or if this should run in the cloud somewhere and have a proper database. Wikimedia Cloud Services would be the logical place to do that, as you suggested on my user talk page.
We'd have to check and see if Wikimedia servers are hardware-capable of running moss and also if they'd be comfortable with the amount of load it generates - right now I have a dedicated box in my basement for it. The code is parallelized to speed up the results, so it maxes out all 8 CPU cores for 2-3 days. Then sometimes I do test runs or one-off runs to grep for random problems I'm fixing or discussing on talk pages.
I was also in the middle of a rewrite to enhance performance of MOS compliance checks that run separately from the spell check. I should probably merge that into the main branch because it's de facto what I'm running everything off of. -- Beland (talk) 01:20, 22 July 2024 (UTC)
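
For readers wondering what "maxes out all 8 CPU cores" might look like in practice, here is a minimal sketch of spreading a per-article spell check across worker processes. It is not moss's actual code: the toy dictionary, the function names, and the (title, text) input shape are all assumptions made for illustration.

```python
import re
from multiprocessing import Pool

# Hypothetical toy dictionary; a real run would load a full word list.
KNOWN_WORDS = frozenset({"the", "cat", "sat", "on", "mat"})
WORD_RE = re.compile(r"[A-Za-z]+")


def unknown_words(article):
    """Return (title, sorted words not found in the dictionary) for one article."""
    title, text = article
    misses = sorted({w.lower() for w in WORD_RE.findall(text)} - KNOWN_WORDS)
    return title, misses


def check_in_parallel(articles, workers=8):
    """Fan the articles out over `workers` processes and collect the results."""
    with Pool(processes=workers) as pool:
        # Articles are independent, so result order does not matter.
        return dict(pool.imap_unordered(unknown_words, articles, chunksize=64))
```

`check_in_parallel` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Because each article is checked independently, the work divides cleanly across cores, which is consistent with the multi-day full-CPU runs described above.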
Thanks for the elaborate response. I have a thought about the capacity of Wikimedia Cloud Services. As the typo correction by users is taking a lot of time anyway, I do not think it is a problem if the entire script takes even a week or two. I understand the reason for delayed posting, but all I wanted was a simpler and more user-friendly self-hosting guide for the project, as a user like me may not be talented enough to fix complex typos, including the chemistry ones. But I can use my decent enough home lab to self-host my own instance, to fix the simpler, easier-to-understand typos quickly. I understand that this may not be implemented quickly; I respect and admire your dedication to this project. Thanks! Bunnypranav (talk) 12:31, 22 July 2024 (UTC)
I just updated Wikipedia:Typo Team/moss/D with a batch of easy fixes. -- Beland (talk) 19:56, 22 July 2024 (UTC)
Thanks, I will definitely contribute. I would be very grateful if you could upload the updated and more efficient code to GitHub, with (if possible) some instructions on how to run moss with only one letter. Thanks! Bunnypranav (talk) 13:51, 23 July 2024 (UTC)
I have just pushed updates that merge the efficient code into the "main" branch. I have not finished rewriting many of the reports to run in the faster framework, but the main spell check is working fine.
I have also updated the environment setup instructions for Ubuntu and Fedora and tested them locally. Hopefully they will work for you; if not, let me know and I can debug.
It is currently not possible to run the spell check one letter at a time. In fact, it is not even easy to run the spell check without running all the other reports. This is because some reports list articles with a given typo instead of listing typos for a given article. However, it shouldn't be that hard for me to untangle these things and let you run only the spell-check, and only for a single letter if you want. The whole main spell check process takes about 23 hours; for a single letter it will vary a lot because articles are not evenly distributed over the alphabet. I am downloading a new database dump now. That should be done in a day or so, and I will try running the disentangled code locally and let you know when it's ready. -- Beland (talk) 21:37, 24 July 2024 (UTC)
Thanks a lot, will definitely check the updated code out. Thanks again for your sincere contribution to the project. Bunnypranav (talk) 10:27, 25 July 2024 (UTC)
I have just pushed a code update (with new instructions in the README) which should now make it easy to run the main spell check one letter at a time. The download took about 6 hours for me, and spell-checking "Z" took about 10 minutes. I'm sure the larger letters would take over an hour on a comparable machine, but that's a lot better than a week or two. 8) Let me know if you have any problems; I'm doing a full run now to make sure I didn't break anything else. -- Beland (talk) 20:17, 25 July 2024 (UTC)
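
As an illustration of why single-letter run times vary, the sketch below groups article titles by first letter so that one bucket can be processed on its own. This is a guess at the general idea, not the README's actual procedure; the function name and the "#" bucket for titles that do not start with A to Z are hypothetical.

```python
from collections import defaultdict


def bucket_by_first_letter(titles):
    """Group article titles under their first letter (A-Z), or "#" otherwise."""
    buckets = defaultdict(list)
    for title in titles:
        first = title[:1]
        # Only plain ASCII letters get their own bucket in this sketch.
        key = first.upper() if first.isascii() and first.isalpha() else "#"
        buckets[key].append(title)
    return buckets


titles = ["Zebra", "Daikokuten", "Dwain Chambers", "4chan"]
buckets = bucket_by_first_letter(titles)
# Bucket sizes differ by letter, so per-letter run time differs too.
print({letter: len(group) for letter, group in sorted(buckets.items())})
```

Spell-checking only `buckets["Z"]` touches far fewer articles than a large letter would, which matches the roughly 10-minute "Z" run against the hour-plus estimate for larger letters.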

In-universe terminology and other article-specific words

I've come across a lot of articles on movies, books, and similar works of narrative that feature many in-universe terms. From what I can tell, the Instructions for editors section says to either create redirects for these terms, or to tag them as {{not a typo}}. However, in-universe terms usually aren't significant enough to warrant such a redirect; they just aren't prominent enough to characterize the work they're from. The same goes for other kinds of article-specific terms for which a redirect isn't warranted.

According to the instructions, such terms should be tagged as {{proper name}}s if they fit that bill and as {{not a typo}} otherwise. Most in-universe terminology can be written off as proper names in various shapes and forms, but not always. In particular, grammatically reasonable compounds, affixed terms, and morphologically valid inflections don't pass as such. (Think dragonless, cryokinetically, brownstoned (if wikt:brownstone has a verbal use in the story).) Thus, it seems {{not a typo}} fits these cases best.

I just wanted to verify this interpretation and suggest that this use of {{not a typo}} be stated more clearly in the instructions, because the current description portrays {{not a typo}} as being reserved mostly for non-words. If I'm wrong and these words still fall under {{proper name}}'s purview, I suggest specifying what exactly constitutes a proper name or saying so in the instructions. --110521sgl (talk) 08:11, 13 August 2024 (UTC)

A proper name is a proper noun, typically capitalized, but those that use this template sometimes have non-standard capitalization. If an in-universe term is only used in one article, I'd still make a redirect with {{R from subtopic}} and leave it at that, since this would still be the best source of info on that search term. For words that are used across multiple fictional authors, or which are plausibly used in the real world, creating a Wiktionary entry seems like the right way to go. For example, wikt:dragonless exists, and thus should not be flagged as a typo by moss. -- Beland (talk) 08:31, 13 August 2024 (UTC)
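
To make the effect of the two tags concrete, here is a minimal sketch of how a checker could set aside text wrapped in {{not a typo}} or {{proper name}} before spell-checking the rest. How moss actually handles these templates is not described in this thread, so the regular expression and function name are assumptions, and only the simple one-parameter form of each template is handled.

```python
import re

# Matches {{not a typo|...}} and {{proper name|...}} with a single parameter.
EXEMPT_RE = re.compile(
    r"\{\{\s*(?:not a typo|proper name)\s*\|([^{}]*)\}\}",
    re.IGNORECASE,
)


def split_exempt(wikitext):
    """Return (exempt terms, wikitext with the tagged spans blanked out)."""
    exempt = [m.group(1).strip() for m in EXEMPT_RE.finditer(wikitext)]
    remaining = EXEMPT_RE.sub(" ", wikitext)
    return exempt, remaining


sample = "The {{not a typo|cryokinetically}} gifted hero met {{proper name|Zyx}}."
terms, rest = split_exempt(sample)
print(terms)
```

Under this reading, both templates have the same effect on the checker; the choice between them matters mainly to human editors reading the wikitext.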

Absence of a space between words is a recurring source of M+-flagged and TS+DOT+-flagged pages. However, there is a more specific variant of this phenomenon where an HTML "ref" element or image without a trailing space is the source of the typo. See for example these three diffs: https://en.wikipedia.org/w/index.php?title=Daikokuten&diff=1241452942&oldid=1239061808; https://en.wikipedia.org/w/index.php?title=Daganbhuiyan_Upazila&diff=1241452569&oldid=1221680930; https://en.wikipedia.org/w/index.php?title=Dwain_Chambers&diff=1241451215&oldid=1240534676. Might it be possible to identify such situations separately and mark them with a designation of their own so we can pick them out more easily? --110521sgl (talk) 07:42, 21 August 2024 (UTC)

Hmm, might be a bit tricky...there's a lot of code that strips out wiki markup and leaves only visible words, which results in somewhat artificially jammed-together words. I'll take a look and see what's feasible in the short term. Long-term this should be doable given I think I need to track both pre- and post-stripping versions to better automate fixes. -- Beland (talk) 08:03, 7 September 2024 (UTC)
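
One way the pre-stripping text could be used for this, sketched under assumptions: scan the raw wikitext for a "ref" element whose closing is immediately followed by a letter or digit, which is the pattern that produces jammed-together words once markup is removed. This is not moss's code; the pattern and function name are hypothetical, and the image case mentioned above is not handled.

```python
import re

REF_FOLLOWED_BY_WORD = re.compile(
    r"(?:<ref\b[^>]*/>"                             # self-closed: <ref name="x" />
    r"|<ref\b[^>]*(?<!/)>(?:(?!</ref>).)*</ref>)"   # paired: <ref>...</ref>
    r"(?=[^\W_])",                                  # next character is a letter or digit
    re.IGNORECASE | re.DOTALL,
)


def find_jammed_refs(wikitext, context=15):
    """Return (text before, text after) for each ref with no space after it."""
    hits = []
    for m in REF_FOLLOWED_BY_WORD.finditer(wikitext):
        before = wikitext[max(0, m.start() - context):m.start()]
        after = wikitext[m.end():m.end() + context]
        hits.append((before, after))
    return hits


sample = 'He won in 2006.<ref name="bbc" />After that he retired.<ref>Cite</ref> Then more.'
for before, after in find_jammed_refs(sample):
    print(repr(before), "|", repr(after))
```

A hit from a scan like this could be reported under its own designation, separate from ordinary missing-space typos, since the fix is always to add a space after the reference.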