Wikipedia talk:AutoWikiBrowser/Typos/Archive 3

Pubication

I'm trying to patrol pubic and certain other easily confused words using poop patrol, and I can see a few phrases that would suit this software better.

  1. pubication - publication
  Done here. Rjwilmsi 08:21, 24 July 2010 (UTC)
  1. pubic school - public school
  Done here. Shadowjams (talk) 21:04, 1 August 2010 (UTC)
  1. discuss throw - discus throw

ϢereSpielChequers 08:09, 24 July 2010 (UTC)

We should discuss throwing out the remaining suggestion. –xenotalk 14:16, 18 August 2010 (UTC)
OK false positives are theoretically possible though it doesn't exist yet on Wikipedia and there once were many dozens of participants in the Olympic sport of synchronised ventriloquism. I will leave it in Botlaf. However can disolve be added as a typo for dissolve? I went through it manually a year or so back but there are about fifty again. ϢereSpielChequers 18:06, 31 August 2010 (UTC)
The "Diss-" rule already handles it.--BillFlis (talk) 18:26, 31 August 2010 (UTC)
I'm not convinced it does, I've just fixed one from June and I'd have thought AWB would have fixed it by now if it was in AWB. Can we have a specific rule for Disolv - Dissolv please. ϢereSpielChequers 13:01, 18 September 2010 (UTC)
Confirmed that the fix for "disolve" → "dissolve" works in this edit GoingBatty (talk) 19:37, 18 September 2010 (UTC)
Thanks GoingBatty. ϢereSpielChequers 22:33, 26 September 2010 (UTC)
  1. Pubic library - Public library
  2. Pubic domain - public domain

ϢereSpielChequers 22:33, 26 September 2010 (UTC)

  Done here. GoingBatty (talk) 21:39, 28 September 2010 (UTC)

et al.

Could someone please update the rule for "et al." so it won't replace ''et al''. with ''et al.''., as it wants to in Spinosaurus? Thanks! GoingBatty (talk) 21:13, 26 September 2010 (UTC)

The rule doesn't handle the apostrophe italics right now. According to Wikipedia:Manual of Style (abbreviations) it should be italicized. I'll take a shot at it. Shadowjams (talk) 07:00, 27 September 2010 (UTC)
  Done here. Shadowjams (talk) 07:33, 27 September 2010 (UTC)

gollem

Please add rule for "gollem" and "gollems" to "golem" and "golems" respectively. McLerristarr / Mclay1 14:45, 14 October 2010 (UTC)

  Done here GoingBatty (talk) 23:27, 14 October 2010 (UTC)

the exact same

Please add rule to "Incorrect phrases": "the exact same" to "exactly the same". McLerristarr / Mclay1 14:51, 14 October 2010 (UTC)

  Done here GoingBatty (talk) 23:28, 14 October 2010 (UTC)

Hongkong Land

I let AWB do this and had to undo it again. Can the "Hong Kong" rule be expanded? -- John of Reading (talk) 11:44, 20 October 2010 (UTC)

  Done here. Rjwilmsi 12:45, 20 October 2010 (UTC)

tv

Please correct tv to TV. Although it is commonly used, acronyms should be capitalised, otherwise "tv" might be pronounced as a single syllable. McLerristarr | Mclay1 17:01, 24 October 2010 (UTC)

… except when it’s used as the country code top-level domain of Tuvalu (.tv) — Hebrides (talk) 17:47, 24 October 2010 (UTC)
  Done here GoingBatty (talk) 21:34, 24 October 2010 (UTC)
...and expanded here to also fix "Tv". GoingBatty (talk) 23:41, 24 October 2010 (UTC)

Womens > Women's

Should this rule apply to lowercase "womens" only? See Apostrophe#Possessives in names of organizations. -- John of Reading (talk) 17:13, 11 September 2010 (UTC)

I can't find any "womens", capitalized or not, in wikipedia. We can delete the rule altogether.--BillFlis (talk) 06:45, 1 October 2010 (UTC)
Womens Bay, Alaska, Sheffield Wednesday Womens F.C., List_of_WWE_Women's_Champions, University of Pittsburgh Medical Center, Womens Bay, Women in Ancient Rome, Apostrophe, 2009 Adelaide Football Club season. I think you can get the picture! Regards, SunCreator (talk) 00:40, 15 October 2010 (UTC)

Italicise Latin words and phrases

Please italicise Latin words and phrases, the most common being et cetera (or etcetera, et caetera or et cætera), de facto, de jure, id est, ad libitum, circa, floruit and exempli gratia. McLerristarr / Mclay1 07:49, 14 September 2010 (UTC)

I suggested this earlier but it got archived before anything was done about it. Manual archiving, like on Wikipedia talk:AutoWikiBrowser/Feature requests, would be much better. McLerristarr / Mclay1 03:18, 4 October 2010 (UTC)

Rules for "Consider" and "Considered"

I don't agree with the rule for Considered changing "consideres" → "considered", as the proper word could be "considers". (e.g. this edit) I hope you'll reconsider (pun intended) this rule. Speaking of which, adding "(Re)" to the beginning of these rules would be good too. Thanks! GoingBatty (talk) 02:55, 24 September 2010 (UTC)

Rules expanded for Re- prefix. Rjwilmsi 11:21, 24 September 2010 (UTC)
"consideres" could be either -ed or -s, we don't support options so choose the most likely one. Rjwilmsi 11:21, 24 September 2010 (UTC)

False positive

"Diary products" could be legitimate; I nearly committed this edit to "Dairy products" before I noticed. I was too scared to screw up the code to edit it; could someone who knows what they're doing, please? --John (talk) 06:48, 29 September 2010 (UTC)

  Done here. I removed "diary product" but I added some other similar trailing words. Shadowjams (talk) 08:40, 29 September 2010 (UTC)
What does '"Diary products" could be legitimate' mean? Did you actually find it anywhere? It seems way beyond likely to me.--BillFlis (talk) 03:16, 30 September 2010 (UTC)
My initial instinct too. I found 2 examples of it (searching for the phrase finds the two... I don't remember them now). Frankly the typo seems more likely; I'd be fine with it added back (although I added some others too so don't remove those) Shadowjams (talk) 03:43, 30 September 2010 (UTC)
Actually, I just now found an instance of "diary products"! I corrected it to "personal organizers". I think the rule can now be safely restored.--BillFlis (talk) 07:14, 1 October 2010 (UTC)

Profiling heads up for you guys

Hi All, Thanks for the great work.

Little heads up for you. I was poking at AWB doing some profiling, and Regextypofix takes nearly a third of the time whilst processing an article. Most of this is doing match evaluation.

Reedy 17:55, 3 October 2010 (UTC)

Is it possible for you to drill down deeper and see which or what kinds of regexes take the longest? Any ways we can optimize what's here from the rule-writing perspective? Shadowjams (talk) 21:25, 3 October 2010 (UTC)
Not exactly. MaxSem seems to think there was, but we'll have to dig it out. I imagine, there are a lot of rules that won't ever get matched, and are probably just pointless keeping around. I need to do a new TypoScan dump, and if I do it with some extra stats, such as the word/the rule it matched, it might give us a better idea. We have a lot of regexes!! Reedy 21:51, 3 October 2010 (UTC)
Yeah, it's huge. One-third is less than I would have guessed for the typo rules. There was a conversation (I think it's above) about whether using alternation (pipes) or character classes (brackets) was faster, since the latter is significantly faster in some implementations. For AWB it turns out the difference is small, but classes are slightly faster.
While I'm interested in the optimization issues it's mostly academic; I don't personally find the speed right now a serious issue. Even on old hardware I don't have trouble working with anything in AWB. If anything the API for saving changes (gets are quick) is a larger slow-down. If I do large database dump scans that takes a while but even then it's not extraordinarily long, and it's easily batched which is probably a more long-term and cheaper solution (in terms of coding time) than on optimizing everything. That's something I guess you ultimately get to decide, but just my two-cents. Thanks for the info, let me know if I can help speed anything up. Shadowjams (talk) 22:59, 3 October 2010 (UTC)

1/3 of the time seems very good! Rich Farmbrough, 11:01, 7 October 2010 (UTC).

Re: which typo rules are the slowest. We have the 'profile typos' option to run on a particular page, but that is only for a particular page. We also have to be careful that just because a rule doesn't match any pages in a given database dump doesn't mean the rule is useless. Somebody may have fixed 20 typos using that rule the day before the dump. However, the last time I did profile typos on a page there were certain rules that were much slower than others, so we might achieve a reasonable performance improvement by focusing on a handful of rules. Still, I don't think current performance is a problem, the "1/3 of the time" Reedy mentions depends entirely on the page you run against. Rjwilmsi 11:13, 7 October 2010 (UTC)
I have posted the 50 slowest typo rules, based on profiling Tiger Woods. The number at the start is the time (I think this is probably the time in milliseconds to apply the typo 100 times or something), and then the regex of the rule is given. Note that the quickest typo has a time of 2, a typical value for the majority of the rules is around 50. Therefore some rules are 5 or 10 times slower than average. Rjwilmsi 11:31, 7 October 2010 (UTC)
Quick example on the 11th slowest: ($1nally): originally 0.87 seconds using Expresso for 10 iterations on Tiger Woods, using \b([A-Za-z]{2,}[a-mo-z])(?:nalyl|anlly)\b instead is 0.67 seconds. That's about 20% faster with no change to the rule's matching. Rjwilmsi 11:53, 7 October 2010 (UTC)
A lot of these start with "\b(\w+)", which I think can be safely eliminated.--BillFlis (talk) 12:37, 7 October 2010 (UTC)
No, not quite true, we want to match the whole word so the edit summary shows whole words being corrected. Rjwilmsi 13:00, 7 October 2010 (UTC)
Converting \w to [A-Za-z] for performance improvement: that reduced typical typo time on Tiger Woods from average 7.7 seconds to average 6.9 seconds on my laptop, ~10% better. [A-Za-z] may be better as [a-z], I'll see about that. Rjwilmsi 13:36, 7 October 2010 (UTC)
I think \w covers [A-Za-z0-9_] and maybe (depending on the language) extended Latin/Cyrillic characters. Mitigating that though, in most cases those probably aren't intended. Shadowjams (talk) 16:18, 7 October 2010 (UTC)
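
To illustrate the point about what \w covers, here is a minimal sketch in Python. Python's re module stands in for the .NET regex engine that AWB uses, which is an assumption made for illustration only; the two engines differ in details, and Python writes \1 where the typo rules write $1. ASCII_RULE is the rule as quoted in the thread, WORD_RULE is a hypothetical \w-based variant written only for comparison, and the sample words are invented.

```python
import re

# ASCII-letter version of the rule, as quoted in the thread.
ASCII_RULE = re.compile(r"\b([A-Za-z]{2,}[a-mo-z])(?:nalyl|anlly)\b")

# Hypothetical \w-based variant, written only for comparison.
WORD_RULE = re.compile(r"\b(\w{2,}[a-mo-z])(?:nalyl|anlly)\b")


def fix(rule, text):
    """Apply one rule; \\1 is Python's spelling of the $1 in AWB rules."""
    return rule.sub(r"\1nally", text)


def word_chars(sample):
    """Return the characters of `sample` that Python's \\w accepts."""
    return "".join(ch for ch in sample if re.fullmatch(r"\w", ch))


if __name__ == "__main__":
    print(fix(ASCII_RULE, "persoanlly"))
    print(fix(WORD_RULE, "originalyl"))
    # In Python, \w accepts digits, the underscore and non-ASCII letters,
    # none of which [A-Za-z] accepts.
    print(word_chars("aZ9_éя-"))
```

On ordinary English words both forms make the same fix; they differ only on tokens containing digits, underscores or non-ASCII letters, which is why the narrower class is usually safe for these rules.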

2007 Brazilian Grand Prix

Oposta => Opposta wrongly. Rich Farmbrough, 11:01, 7 October 2010 (UTC).

Marking sections so AWB doesn't search for typos?

Is there a way to mark sections of articles that are in foreign languages (e.g. Middle Scots#Sample text) so that AWB won't search them for typos? Thanks! GoingBatty (talk) 00:32, 10 October 2010 (UTC)

Yes. You can enclose them in the language template, like this:
{{lang|es|Mi gato se llama Rebecca.}}
That comes out like this:
Mi gato se llama Rebecca.
It doesn't make the text look different in the article, but AWB doesn't flag typos inside it. --Auntof6 (talk) 03:56, 10 October 2010 (UTC)
Perfect - thanks! GoingBatty (talk) 04:06, 10 October 2010 (UTC)
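
The effect described above can be pictured as "hide the template, fix the rest, put the template back". The following Python sketch shows that idea only; it is not AWB's actual code. The function name, the placeholder scheme and the sample rule (a simplified stand-in for an "innocent" rule) are all assumptions made for illustration.

```python
import re

# Matches a simple, non-nested {{lang|xx|...}} template.
LANG_TEMPLATE = re.compile(r"\{\{lang\|[^{}]*\}\}")
PLACEHOLDER = re.compile(r"\x00(\d+)\x00")


def apply_rule_outside_lang(text, find, replace):
    """Apply one find/replace rule, leaving {{lang}} templates untouched."""
    hidden = []

    def hide(match):
        # Swap the template for a numbered placeholder the rule cannot match.
        hidden.append(match.group(0))
        return "\x00%d\x00" % (len(hidden) - 1)

    masked = LANG_TEMPLATE.sub(hide, text)
    fixed = re.sub(find, replace, masked)
    # Restore the hidden templates in their original positions.
    return PLACEHOLDER.sub(lambda m: hidden[int(m.group(1))], fixed)


if __name__ == "__main__":
    page = "{{lang|es|Día de los Santos Inocentes}} honours the inocent."
    # Hypothetical rule: "inocent..." -> "innocent..."
    print(apply_rule_outside_lang(page, r"\b([Ii])nocent", r"\1nnocent"))
```

Applied directly to the page, the same rule would also rewrite the Spanish word inside the template, which is the kind of false positive the template is being used to avoid.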

Inocentes → Innocentes

[1] Doesn't AWB usually not run typo fixing within quotes? –xenotalk 20:50, 12 October 2010 (UTC)

That's within italics, not quotes, and we've only had hiding of text in italics since rev 7042. Rjwilmsi 21:15, 12 October 2010 (UTC)
My bad, looked like quotes in the diff view. –xenotalk 21:21, 12 October 2010 (UTC)
Please see es:Día de los Santos Inocentes and wikt:inocente. The Spanish word inocente (inocentes in the plural) (meaning "innocent") has only one n before the o.
Wavelength (talk) 00:41, 7 November 2010 (UTC)

Edit summary incorrect when two different sets of duplicated words fixed

In this edit, AWB changed "be be" to "be" and "with with" to "with", but the edit summary automatically created was "typos fixed: be be → be (2)"

Yes, when the same typography rule makes more than one fix, the effect of the rule is summarised as you describe. Imagine how long this edit summary would have been if it hadn't done this. -- John of Reading (talk) 10:38, 14 October 2010 (UTC)
John's explanation is correct, though his example uses AWB find & replace rather than typo fixing, but both do the same edit summary condensing he's explained. Rjwilmsi 11:05, 14 October 2010 (UTC)
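
The condensing described above can be sketched as follows in Python. This is a guess at the logic based only on the behaviour reported in this thread, not AWB's actual code; the function name and the (rule, before, after) tuple format are hypothetical.

```python
from collections import OrderedDict


def summarise(fixes):
    """Build an edit summary from (rule_name, before, after) tuples.

    Each rule is reported once, using its first fix, followed by a count
    when the rule fired more than once.
    """
    per_rule = OrderedDict()
    for rule, before, after in fixes:
        if rule not in per_rule:
            per_rule[rule] = [before, after, 0]
        per_rule[rule][2] += 1

    parts = []
    for before, after, count in per_rule.values():
        label = "%s → %s" % (before, after)
        if count > 1:
            label += " (%d)" % count
        parts.append(label)
    return "typos fixed: " + ", ".join(parts)


if __name__ == "__main__":
    print(summarise([
        ("Duplicated words", "be be", "be"),
        ("Duplicated words", "with with", "with"),
    ]))
```

Because both fixes come from the same rule, only the first one is shown, with the count of 2 standing in for the second.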

Philippino and variants

Please add the following:

  1. Philippino --> Filipino
  2. Philippinos --> Filipinos
  3. Philippinoes --> Filipinos
  4. Philippina --> Filipina
  5. Philippinas --> Filipinas
  6. Filipinoes --> Filipinos

I don't know if there's one out there, in case there aren't please add them. Thanks.--JL 09 q?c 08:11, 16 October 2010 (UTC)

It doesn't "convert" because there is no rule for it here. "Philippina" is a word. E.g., 631 Philippina.--BillFlis (talk) 14:44, 16 October 2010 (UTC)
  Done #1-3 here
  Not done #4-5 per comment above
Will let someone else do #6 to ensure rule isn't expanded to "fix" correct spellings too. GoingBatty (talk) 16:39, 16 October 2010 (UTC)
  Done #6 here. -- JHunterJ (talk) 20:31, 22 October 2010 (UTC)

Could the rule be expanded to cover double and single Ls and Ps? McLerristarr | Mclay1 14:03, 26 October 2010 (UTC)

Sorry, but I don't understand your request. Could you please specify the exact misspellings that you want to be identified and fixed? Thanks! GoingBatty (talk) 02:16, 27 October 2010 (UTC)

Possible State capitalization issue

I have had a few pages lately where AWB is trying to capitalize states that are within a web address and I don't think we want to do that. Here is one example. --Kumioko (talk) 19:52, 22 October 2010 (UTC)

It looks like AWB properly ignored the web address (the part in the brackets that uses the http:// prefix) and only tried to fix the unfortunately worded description of the web address (not in brackets, with no http:// prefix). -- JHunterJ (talk) 20:19, 22 October 2010 (UTC)

Plurals of SI units

Could the typo facility be used without false positives to change 'kms' and 'kgs' to 'km' and 'kg'? Lightmouse (talk) 18:49, 25 October 2010 (UTC)

Please look at this code change:

  • <Typo word="kg/km (kilogram/kilometer)" find="([\d\.]+(?:\s| |-)?)K(g|m)\b" replace="$1k$2" />

to:

  • <Typo word="kg/km (kilogram/kilometer)" find="([\d\.]+(?:\s| |-)?)K(g|m)s\b" replace="$1k$2" />

Would that work? Lightmouse (talk) 23:20, 25 October 2010 (UTC)

Neither of them seem to work for me in the AWB Regex Tester. In particular, although you want to change "kms" and "kgs" (which contain lower case "k"), the regex only has an uppercase "K". GoingBatty (talk) 02:29, 26 October 2010 (UTC)

Good call, thanks. Let me add lower case 'k' as an option:

  • <Typo word="kg/km (kilogram/kilometer)" find="([\d\.]+(?:\s| |-)?)[Kk](g|m)s\b" replace="$1k$2" />

How about that?

I tried the AWB Regex Tester again using your Find and Replace on the text "Kgs and Kms and kgs and kms and kg and km", and it didn't find anything to replace. Hopefully one of the experts can give you a hand with this. Good luck! GoingBatty (talk) 02:20, 27 October 2010 (UTC)
Ah, I see the error of my ways - the rule is set up to look for a number before the symbol. GoingBatty (talk) 17:26, 27 October 2010 (UTC)
If you change that, it will no longer correct "Km" or "Kg", which was the intent of the rule.--BillFlis (talk) 10:51, 27 October 2010 (UTC)
Are you sure BillFlis? It works for me. Lightmouse (talk) 14:27, 27 October 2010 (UTC)
The way the proposed rule is written above, it's looking for a terminal "s", as in "Kgs" or "kms".--BillFlis (talk) 17:01, 27 October 2010 (UTC)
Ah yes! I thought you were saying it wouldn't find an upper case 'K'. Thanks for being patient with me. How about:
  • <Typo word="kg/km (kilogram/kilometer)" find="([\d\.]+(?:\s| |-)?)[Kk](g|m)s?\b" replace="$1k$2" />
That's also going to result in false positives where it tries to fix km and kg. Since we want to fix kms, Kms, Km, kgs, Kgs, Kg - but not km or kg - how about splitting this into two rules:
  • <Typo word="kg (kilogram)" find="([\d\.]+(?:\s| |-)?)(Kgs?|kgs)\b" replace="$1kg" />
  • <Typo word="km (kilometre)" find="([\d\.]+(?:\s| |-)?)(Kms?|kms)\b" replace="$1km" /> GoingBatty (talk) 17:26, 27 October 2010 (UTC)

The one line version should be faster than the two line version. Yes, it does over-write 'km' with 'km' but it has to parse the text anyway and the outcome is unchanged. Lightmouse (talk) 17:44, 27 October 2010 (UTC)

How about one rule: <Typo word="kg/km (kilogram/kilometre)" find="([\d\.]+(?:\s| |-)?)(?:K([gm])s?|[Kk]([gm])s)\b" replace="$1k$2$3" /> Could someone please test this? If two rules are necessary, I'd suggest that one handle the capital "K" error, and the other the terminal "s" error.--BillFlis (talk) 19:09, 27 October 2010 (UTC)

It works for me, Bill. I used the regex tester on:

  • "foo 5 Kg, 6 Kgs, 7 kgs, 8 Km, 9 Kms, 10 kms bar"

and it produced:

  • "foo 5 kg, 6 kg, 7 kg, 8 km, 9 km, 10 km bar"

Thanks. Lightmouse (talk) 19:15, 27 October 2010 (UTC)

I've made the change to the rule. Also, modified the watt rule to correct also "kw" and removed the now-redundant kilowatt rule.--BillFlis (talk) 11:42, 28 October 2010 (UTC)

Thanks. Lightmouse (talk) 11:51, 28 October 2010 (UTC)
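
For anyone who wants to repeat the test outside the AWB Regex Tester, here is a sketch in Python of the combined one-rule version. Python's re module stands in for the .NET engine (an assumption; \1 replaces $1), and the separator group is simplified to whitespace or a hyphen, leaving out the middle alternative that is not legible in the rule as displayed here. In Python, \s on text strings also matches a non-breaking space.

```python
import re

# Combined kg/km rule: capital "K" with optional "s", or either case with "s".
KG_KM = re.compile(r"([\d.]+(?:\s|-)?)(?:K([gm])s?|[Kk]([gm])s)\b")


def fix_units(text):
    # Only one of groups 2 and 3 takes part in any match; Python 3.5+
    # substitutes an empty string for the group that did not participate.
    return KG_KM.sub(r"\1k\2\3", text)


if __name__ == "__main__":
    print(fix_units("foo 5 Kg, 6 Kgs, 7 kgs, 8 Km, 9 Kms, 10 kms bar"))
    print(fix_units("5 kg and 8 km"))  # already correct, left alone
```

The second call shows why the two alternatives are written separately: a lowercase symbol without the trailing "s" does not match, so correct text is not rewritten.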

SI unit spelling: 'gramme' -> 'gram' and 'kilogramme' -> 'kilogram'

I'm trying to add a typo for 'kilogramme' -> 'kilogram'. I think the code is:

  • <Typo word="kilogram" find="\b([Kk]ilog|[Gg])ramme(s?)\b" replace="$1ram$2" />

Is that correct? Lightmouse (talk) 23:18, 25 October 2010 (UTC)

This works for me in the AWB Regex Tester - thanks! GoingBatty (talk) 02:31, 26 October 2010 (UTC)
Adding these to the typo rules would be against WP:ENGVAR. Rjwilmsi 07:21, 26 October 2010 (UTC)
"Gramme" is rarely used in British English. It's an old spelling. But people must also note that the SI spelling of "meter" is "metre" so just basing spelling on SI is not OK. McLerristarr | Mclay1 13:53, 26 October 2010 (UTC)

Quite. I'm just referring to the SI unit of mass. wp:engvar says "Wikipedia tries to find words that are common to all varieties of English." There is an occasionally quoted misconception that British spelling requires 'kilogramme'. The spelling 'kilogramme' merely has the status of an old alternative. Since metrication started in the 1970s, the spelling 'kilogram' started to be adopted and is now the default.

The spelling 'kilogram' has been used in legislation for the last 25 years (e.g. Weights and Measures Act 1985). It's the spelling taught by the Department of Education and in style guides.

If there's any doubt, it would be simple enough to raise it in several forums but it seems clear cut to me. Regards Lightmouse (talk) 14:32, 26 October 2010 (UTC)

Is this in WP:MOSNUM? Rjwilmsi 14:38, 26 October 2010 (UTC)

Wikipedia:Manual of Style (spelling) says "gramme vs gram: gram is the more common spelling; gramme is also possible in British usage." Lightmouse (talk) 14:47, 26 October 2010 (UTC)

I would interpret that to mean that the typo rules shouldn't change it then. Rjwilmsi 14:52, 26 October 2010 (UTC)

OK. Thanks. Lightmouse (talk) 15:06, 26 October 2010 (UTC)

Excess code in "SI unit symbols"

All of the code in SI unit symbols seems excessive to me. For example, the code that will turn '100 kw' into '100 kW' is:

  • find="([\d\.]+(?:\s| |-)?)kw\b" replace="$1kW" />

It looks for a digit string. But I think it could be simplified by looking only for the last digit in the string. Thus:

  • find="(\d(?:\s| |-)?)kw\b" replace="$1kW" />

As far as I can see, that would give the same hit rate and the same false positive rate. The same applies across all 14 SI units. Am I correct? Lightmouse (talk) 14:44, 26 October 2010 (UTC)

Looks OK to me (unless someone writes "25. kw", which is a different error), but I would change the "?" to "*" to catch multiple spaces:
find="(\d(?:\s| |-)*)kw\b" replace="$1kW" /> --BillFlis (talk) 14:52, 26 October 2010 (UTC)
We match the entire number so that the edit summary shows the entire unit to make it easier for editors to understand the change. Rjwilmsi 14:53, 26 October 2010 (UTC)

Ah, good point. I wasn't aware of that. I thought the speed of the code was the deciding factor. Lightmouse (talk) 14:57, 26 October 2010 (UTC)
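
The difference between the two forms shows up in the matched text rather than in the result. A small Python sketch, with Python's re module standing in for the .NET engine and the separator group simplified to whitespace or a hyphen (both assumptions for illustration):

```python
import re

FULL_NUMBER = re.compile(r"([\d.]+(?:\s|-)?)kw\b")  # whole digit string
LAST_DIGIT = re.compile(r"(\d(?:\s|-)?)kw\b")       # final digit only


def matched_text(rule, text):
    """Return the text the rule matched, i.e. what a summary could quote."""
    m = rule.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    sample = "rated at 100 kw"
    print(matched_text(FULL_NUMBER, sample))
    print(matched_text(LAST_DIGIT, sample))
    # Both rules produce the same corrected text.
    print(FULL_NUMBER.sub(r"\1kW", sample) == LAST_DIGIT.sub(r"\1kW", sample))
```

Both rules fix the page identically, but only the first one matches the whole quantity, so only it can give an edit summary that quotes the full number.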

Possible duplicate in "SI unit symbols"

It seems to me that the line for kilowatt could be eliminated by changing:

  • <Typo word="W (watt)" find="([\d\.]+(?:\s| |-)?)([µmMGT])w\b" replace="$1$2W" />

to

  • <Typo word="W (watt)" find="([\d\.]+(?:\s| |-)?)([µmkMGT])w\b" replace="$1$2W" />

Have I missed something? Lightmouse (talk) 15:05, 26 October 2010 (UTC)

Duplicate words section

Since "It is" has its own entry in the Duplicate words section to fix "it it" and "is is", should the specific Duplicate words entry be tightened so it doesn't also look for "it it" and "is is"? GoingBatty (talk) 03:46, 30 October 2010 (UTC)

km² rule

Two questions about the km² rule:

  1. Could someone please expand it so it also fixes "km2" (without the superscript)?
  2. Speaking of superscript, why is the replacement "km<sup>2</sup>" instead of "km²"? GoingBatty (talk) 06:08, 30 October 2010 (UTC)
For the same reasons people still use HTML &ndash instead of the UTF-8 character (which they can get from the little tool strip below the edit window): tradition, recalcitrance, personal preference, obstinacy, obtuseness, drunkenness.--BillFlis (talk) 07:58, 30 October 2010 (UTC)
We use the <sup> tags because it's in the MOS. Rjwilmsi 21:01, 31 October 2010 (UTC)
Thanks for the feedback. So could someone expand the rule so it fixes both "km2" and "km²" (without superscript tags)? GoingBatty (talk) 01:40, 1 November 2010 (UTC)
I would point out that I have my own personal convert template regex rule, and I think there's a bot going around doing similar things. While both mine and the bot's rules could fix all versions, I currently don't and I don't know what the bot does. It pays to have some standardization... but I'm not hell bent to change the MOS rules for something like this. Shadowjams (talk) 06:26, 4 November 2010 (UTC)
Ha, there's a bit of a disconnect here somewhere. If on the "Insert" pull-down menu below the "Save page" button you select "Symbols", it makes available both "m²" and "m³" (with the Unicode exponents, not the <sup> markup).--BillFlis (talk) 12:04, 4 November 2010 (UTC)

in in

This is a recent addition; I've only seen it produce false positives so far. There are many phrases ending in "in", such as "bring in", "buy in", "carry in" and so on, which can legally be followed by another phrase that starts with "in", such as "in many cases", "in 2007", and so on. -- John of Reading (talk) 08:21, 31 October 2010 (UTC)

Hi John - I'm the one who made the addition based on the typo corrected in this edit. Could you please give an example of a grammatically correct sentence that contains "in in"? Thanks! GoingBatty (talk) 14:56, 31 October 2010 (UTC)
I've just done an AWB Google search for "in in". The rule made no correct changes, and was going to damage these:
A search for "in in early" found a roughly even mixture of correct and incorrect fixes. I didn't save anything, so you can try it yourself. -- John of Reading (talk) 20:50, 31 October 2010 (UTC)
I recently corrected an "in in" error by an experienced and usually careful AWB user. I added an extraneous comma to prevent it from happening again. MANdARAX  XAЯAbИAM 17:08, 2 November 2010 (UTC)
Based on John's feedback, I updated the rule here so it looks for a space before the duplicated word, so it won't catch "buy-in in" or "Drive-in in" anymore. GoingBatty (talk) 02:33, 3 November 2010 (UTC)
"I let the dog in in the morning." Two in's is the same situation as two on's. There's no way of getting around it. The typo fixer cannot possibly correct every typo so copyediting still needs to be done regularly. This is another typo that will have to be found the traditional way. McLerristarr | Mclay1 06:42, 3 November 2010 (UTC)
I think it's simply too complicated of a grammatical issue to handle with the typo rules. I'd note that there's absolutely nothing stopping anyone from using their own rules in AWB to identify common types of duplicate words (pretty much pronouns and prepositions), or just identifying duplicate words in any case (this should do it \b(\w+)\b\1\b) and using human judgment to fix them. This is probably better used for words that don't have this error. I don't have enough grammar knowledge to be confident about which words those are, but the usual "the the" examples are a good place to start. Shadowjams (talk) 06:24, 4 November 2010 (UTC)
Based on the discussion, I've reverted my change here. However, I disagree that "There's no way of getting around it."
  • "a player may go all in in exactly the same manner" → "a player may go all in exactly the same way"
  • "The thaw set in in early March." → "The thaw set in early March"
  • "I let the dog in in the morning." → "I let the dog inside in the morning." GoingBatty (talk) 17:41, 4 November 2010 (UTC)
Thanks for the regex suggestion, Shadowjams, but that didn't work for me. While \b(\w+)\s\1\b did work, I found that \s(\w+)\s\1\s helps to avoid the "buy-in in" examples above. GoingBatty (talk)
Even better is \s([a-z]+)\s\1\s to limit it to lowercase words (e.g. avoid fixing Bora Bora) GoingBatty (talk) 02:58, 5 November 2010 (UTC)
As well as avoiding "buy-in in" it could avoid "buy buy-in". I know that's not a good example but I can't think of a real one right now. McLerristarr | Mclay1 08:15, 5 November 2010 (UTC)
GoingBatty, your examples do not really avoid the problem because the typo fixer cannot possibly know what the change should be. McLerristarr | Mclay1 08:17, 5 November 2010 (UTC)
As a postscript I've tackled "in in" using a variety of Google searches ("in in 1857", "born in in", and so on) and a long regexp to skip most of the false positives; 450 fixes from around 2000 candidates. There will be many others that I've missed, I'm sure. -- John of Reading (talk) 19:51, 6 November 2010 (UTC)
Great job, John! I've done quite a few too (but not as many as you!) GoingBatty (talk) 23:38, 6 November 2010 (UTC)
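
The three duplicate-word patterns suggested in this thread can be compared side by side. The sketch below uses Python's re module in place of AWB's .NET engine (an assumption; backreferences behave the same way for these patterns), and the sample sentences are invented from the examples discussed above.

```python
import re

CANDIDATES = {
    "no separator": r"\b(\w+)\b\1\b",
    "word boundaries": r"\b(\w+)\s\1\b",
    "whitespace, lowercase": r"\s([a-z]+)\s\1\s",
}

SAMPLES = [
    "saw the the cat",
    "a buy-in in 2007",
    "went to Bora Bora today",
]


def hits(pattern, text):
    """Return the first match of `pattern` in `text`, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for name, pattern in CANDIDATES.items():
        for sample in SAMPLES:
            print(name, "|", sample, "|", repr(hits(pattern, sample)))
```

The first pattern cannot match anything, because it asks for a word boundary between the captured word and its repeat, and no such boundary exists between two word characters. The last pattern needs whitespace on both sides, so it skips "buy-in in" and "Bora Bora" but would also miss a duplicate at the very start or end of the text.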

Exactly the same

Please expand the "exactly the same" rule:

  • this exact same → exactly the same
  • that exact same → exactly the same
  • those exact same → exactly the same

Thank you. McLerristarr | Mclay1 16:05, 2 November 2010 (UTC)

  Done here GoingBatty (talk) 16:45, 2 November 2010 (UTC)

sq.kms → sq.km → km2

Typo fixing will change "sq.kms" to "sq.km" on the first parse, and then change to "km<sup>2</sup>" in the second parse. (Try Pakhal Lake.) What's the best way to combine the SI unit symbols so this all happens in one parse? GoingBatty (talk) 16:27, 6 November 2010 (UTC)

continguous → contiguous

The extra n in continguous is an error sometimes seen in the phrase "contiguous United States".

  • ([Cc])ontinguous → $1ontiguous
  • ([Cc])ontinguity → $1ontiguity

Continguity appears rarely. I haven't found continguously and continguousness so those might not be worth the trouble. —Mrwojo (talk) 19:14, 6 November 2010 (UTC)

  Done here to cover all of these. GoingBatty (talk) 23:33, 6 November 2010 (UTC)

Other duplicated words

Before starting another controversy, does anyone object to expanding the Duplicated words entry to fix "had had" and "that that"? GoingBatty (talk) 00:21, 7 November 2010 (UTC)

"had had" definitely is not acceptable in the typo list; for sentences like "He had had the apple," that would change the meaning. PleaseStand (talk) 00:37, 7 November 2010 (UTC)
Thanks for the example. Sorry for being dense, but what's the difference between "He had the apple" and "He had had the apple" ? GoingBatty (talk) 00:41, 7 November 2010 (UTC)
The second is used to refer to an action that happened before another (had something before another thing happened), as in "He had had a drinking problem, so he attended an AA meeting." The typo fixer shouldn't change something that is completely correct. PleaseStand (talk) 01:23, 7 November 2010 (UTC)
I agree that the typo fixer shouldn't change something that is completely correct. So does your example mean "He had a drinking problem, so he attended an AA meeting, and he no longer has a drinking problem." ? Thanks! GoingBatty (talk) 01:40, 7 November 2010 (UTC)
Found two more: "more more" and "other other" GoingBatty (talk) 01:40, 7 November 2010 (UTC)
For "had had" see Pluperfect or the splendid article James while John had had had had had had had had had had had a better effect on the teacher; for "that that" consider the sentences "He said that that man was the impostor" or "Not that that made any difference". Please don't add either of these to the automatic list.
"more more" and "other other" look OK to me, though "more more" will run into some false positives with song and TV program titles. (Comment revised after I saw the error in my test regexp) -- John of Reading (talk) 07:52, 7 November 2010 (UTC)
  Thank you for the links. I definitely won't add "had had" or "that that". I hope that the song and TV program titles would be "More More" instead of "more more". GoingBatty (talk) 15:46, 7 November 2010 (UTC)
  Done here so the typo fixer now fixes "more more", "other other" and "become become". GoingBatty (talk) 23:31, 7 November 2010 (UTC)

Does the typo fixer remove duplicate words in different casings (e.g. other Other)? I don't think it should because the capitalised word could be part of a proper name, making the duplication completely correct. McLerristarr | Mclay1 01:04, 9 November 2010 (UTC)

No, this rule has been written to match lowercase text only. -- John of Reading (talk) 07:16, 9 November 2010 (UTC)
There are many other duplicates, though; obviously lupus lupus, bubo bubo etc. are legitimate. The top entries as of the last dump are:
  • solid 17216 "!style="border-style: none none solid solid;"
  • the 16219
  • that 15967
  • new 8773
  • history 7648
  • had 7008
  • in 6213
  • is 3285
  • sortable 3155 (table?)
  • to 3121
  • edit 2988 (?)
  • blah 2690
  • etc 2610 (etc etc should be just etc.)
  • very 2393 (very very is bad style)
  • and 2057
  • on 2050
  • many 1871 (bad style)
  • it 1832
  • of 1672
Full list at User:Rich Farmbrough/temp113. Rich Farmbrough, 16:04, 10 November 2010 (UTC).
Thank you for generating that list - interesting. Why is "history history" so frequent? There are examples at History of Manila and Surviving History, which have [http://www.somewhere.com/history History of Something], but I'm surprised at the 7648 figure. -- John of Reading (talk) 17:40, 10 November 2010 (UTC)
Thanks indeed. The above includes uppercase instances, Rich? --LilHelpa (talk) 17:46, 10 November 2010 (UTC)
Is your list across all namespaces? I think the primary concern should be the article namespace. If someone wants to type "very very" or "blah blah blah" on a talk page, that isn't something we should be correcting. GoingBatty (talk) 17:59, 10 November 2010 (UTC)
Cool list! For comparison, the typo rule is currently fixing the following duplicates: a, am, an, as, at, and, are, become, be, by, could, did, do, for, go, has, he, if, is, it, me, more, no, of, or, other, she, should, the, their, them, then, these, they, this, thus, to, was, were, what, where, when, which, who, whom, why, with, would. GoingBatty (talk) 17:56, 10 November 2010 (UTC)
"her", "him", "how" and "its" seem to fit amongst those words. Could they be added? McLerristarr | Mclay1 07:03, 11 November 2010 (UTC)
"have", "shall", "should", "will"... There are many words that are unlikely to have false positives. McLerristarr | Mclay1 07:05, 11 November 2010 (UTC)
Actually, "will" has two meanings so that one is out. McLerristarr | Mclay1 07:07, 11 November 2010 (UTC)
  Done here, except for "shall" (not on Rich's list) and "should" (already part of typo rule) GoingBatty (talk) 01:30, 12 November 2010 (UTC)
Removed "her her" from list, as there were too many false positives (e.g. "It cost her her life"). GoingBatty (talk) 05:12, 12 November 2010 (UTC)
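
A word-list rule of the kind discussed here can be sketched in Python as below. It is not the actual Duplicated words rule: WORDS is only a subset of the list quoted above, the function name is hypothetical, and Python's re module stands in for the .NET engine. The pattern is case-sensitive, so capitalised pairs are left alone.

```python
import re

# Subset of the words the rule is said to cover; "her", "had" and "that"
# are deliberately absent, per the discussion above.
WORDS = ("a", "an", "and", "be", "become", "is", "it", "more",
         "of", "other", "the", "to", "was", "with")

DUPLICATE = re.compile(r"\b(" + "|".join(WORDS) + r")\s+\1\b")


def fix_duplicates(text):
    """Collapse a listed lowercase word that is immediately repeated."""
    return DUPLICATE.sub(r"\1", text)


if __name__ == "__main__":
    print(fix_duplicates("He went to to the the shop"))
    print(fix_duplicates("It cost her her life"))   # not in the list
    print(fix_duplicates("He had had the apple"))   # not in the list
    print(fix_duplicates("More More"))              # capitalised, not matched
```

Keeping an explicit list means every addition is a deliberate decision, which suits words such as "had" and "that" whose repeats are often correct.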

This rule is getting very long - any speed benefit in breaking it into two rules vs. keeping it as one long rule? GoingBatty (talk) 01:41, 12 November 2010 (UTC)


What's more more problems are caused by including “more more” than omitting it! Please can we remove “more more”? — Hebrides (talk) 08:56, 22 November 2010 (UTC)

"What's more" should be followed by a comma. That's a problem with a lot of these rules; they would be correct if they were separated by a comma. McLerristarr | Mclay1 10:40, 22 November 2010 (UTC)
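For anyone following along, the duplicate-word rule and its false positives can be sketched roughly as below. This is only an illustration in Python rather than AWB's .NET regex engine, and the three-word list and the function name are assumptions made for the example, not the live rule.

```python
import re

# Hypothetical subset of the word list; the real rule covers many more words.
WORDS = ["the", "her", "more"]

# \1 is a backreference: the second word must repeat the first exactly.
DUPLICATE = re.compile(r"\b(" + "|".join(WORDS) + r")\s+\1\b")

def collapse_duplicates(text):
    """Replace 'word word' with a single 'word' for the listed words."""
    return DUPLICATE.sub(r"\1", text)

print(collapse_duplicates("He went to the the shop"))               # genuine typo
print(collapse_duplicates("It cost her her life"))                  # false positive
print(collapse_duplicates("What's more more problems are caused"))  # false positive
```

The second and third calls lose a word that belonged in the sentence, which is the objection raised above to "her her" and "more more".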

Pronomial

Is valid, as is pronominal that AWB wants to change it to. Rich Farmbrough, 04:45, 10 November 2010 (UTC).

  Done here GoingBatty (talk) 01:04, 11 November 2010 (UTC)

Rule didn't change "european" → "European"

In this edit, AWB fixed several typos, but did not change "european" to "European". The "Eur(asia/ope)" rule looks like it should do it, but didn't. GoingBatty (talk) 02:59, 15 November 2010 (UTC)

I think the automatic typo fixes are all turned off inside wikilinks. The only kind of fix that wouldn't break the link is this one, changing the case of the initial letter. -- John of Reading (talk) 07:56, 15 November 2010 (UTC)
You're right - I wouldn't expect AWB to change [[european individualist anarchism]]. However, since AWB changed "And so an european tendency..." to "And so a european tendency...", I expected it to change to "And so a European tendency..." GoingBatty (talk) 13:40, 15 November 2010 (UTC)
I found this in the manual - "If a typo rule is matching a wikilink target, this rule will be ignored on the whole page". So on that page, only, AWB thinks that "european" is allowable. -- John of Reading (talk) 14:12, 15 November 2010 (UTC)
Aha - that explains it! I tried to RTFM before posting this question, but looked in the wrong place. Could this sentence be added to the appropriate place on WP:AWB/T ? Thanks! GoingBatty (talk) 17:25, 15 November 2010 (UTC)
Done -- John of Reading (talk) 17:37, 15 November 2010 (UTC)

It would be really cool if we had some data from these in-link matches. Rich Farmbrough, 04:21, 17 November 2010 (UTC).

Interestingly Creedence at Woodstock Festival does not seem immune. Rich Farmbrough, 12:15, 17 November 2010 (UTC).
Time for someone to look at the source code... -- John of Reading (talk) 12:26, 17 November 2010 (UTC)
Not really. The logic works as described. On Woodstock Festival none of the "Creedence Clearwater..." wikilinks match the "Credence" typo rule, so it is applied. Rjwilmsi 17:53, 17 November 2010 (UTC)
Yes, my mistake. -- John of Reading (talk) 18:34, 17 November 2010 (UTC)
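The behaviour quoted from the manual above ("If a typo rule is matching a wikilink target, this rule will be ignored on the whole page") can be sketched as follows. This is a simplified Python illustration, not AWB's source code; the link-target pattern, the function name and the single "european" rule are all assumptions made for the example.

```python
import re

# Captures the target part of [[target]] or [[target|label]] (simplified).
LINK_TARGET = re.compile(r"\[\[([^\]|]+)")

def fix_typos(text, rules):
    """Apply each (find, replace) rule unless it matches a wikilink target."""
    targets = LINK_TARGET.findall(text)
    for find, replace in rules:
        pattern = re.compile(find)
        if any(pattern.search(target) for target in targets):
            continue  # rule is ignored on the whole page
        text = pattern.sub(replace, text)
    return text

# Hypothetical stand-in for the real "Eur(asia/ope)" rule.
rules = [(r"\beuropean\b", "European")]

with_link = "See [[european individualist anarchism]]. And so a european tendency"
without_link = "And so a european tendency"

print(fix_typos(with_link, rules))     # unchanged: a link target matches the rule
print(fix_typos(without_link, rules))  # "european" is capitalised
```

On a page like Woodstock Festival the same check would let a rule through, because none of the link targets match the rule's pattern.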

Pre-Columbian

Not Pre-Colombian. Rich Farmbrough, 04:20, 17 November 2010 (UTC).

Not sure what you're asking for here. There's already a rule set up to change "Pre-Colombian" to "Pre-Columbian". Are you saying this rule isn't working, or are you suggesting this rule be disabled, or something else? GoingBatty (talk) 04:26, 17 November 2010 (UTC)
My mistake. I was skipping the change on Columbia - reading the warning, not the diff. Rich Farmbrough, 10:04, 17 November 2010 (UTC).

Etc.…

OK I'm finding a lot of these, in variations; "etc. ..." etc. I will try and fix as many as possible but looks like a candidate for a typo rule. Rich Farmbrough, 10:04, 17 November 2010 (UTC).

Do you mean as in a proper etc. and then trailing periods (with or maybe without a space)? The current etc. rule has a kind of complicated negative lookback, so it's probably easier to just make a new rule for properly spaced etc.'s that have that feature. Test this:
Find: ([Ee])tc\.(\s)*\.*([Ee]tc\.?\s*\.*)*
Replace: $1tc.$2
I haven't tested it, that's a first draft attempt though. Shadowjams (talk) 11:00, 17 November 2010 (UTC)
Changing etc to etc. is often not helpful. The added period can read as a full stop, so converting it with AWB is not an automatic process. How about instead converting etc to the full wording etcetera, or otherwise not converting at all? Regards, SunCreator (talk) 11:08, 17 November 2010 (UTC)
I'm not sure the distinction between a full stop and a period... they're effectively the same thing... and I don't understand the issue with the change unless you prefer "etc" remains instead of becoming "etc." If you have an example of where the rule's making a mistake, please provide the diff. The manual of style, however, has long considered the "etc." version correct, as has every other style guide I've ever seen outside of Wikipedia. Shadowjams (talk) 11:45, 17 November 2010 (UTC)
Period and full stop are the same. I was attempting to show the difference between a dot at the end of "etc." and the dot ending a sentence with "etc.". They both look the same, and so it's an issue. Here is a made-up example.
  • "During the succession of lead singers of Jones, Tomson, Harry, Dickson etc Smith's vocals had always been distinguishable."
Now if you change "etc" to "etc." you end up with two sentences. "During the succession of lead singers of Jones, Tomson, Harry, Dickson etc. Smith's vocals had always been distinguishable"
A better way would be to change "etc" to "etc.," to keep the sentence going. Splitting the sentence into two by "etc." is grammatically messy at best. Regards, SunCreator (talk) 23:24, 17 November 2010 (UTC)
We can't possibly account for mistakes. That sentence should be "During the succession of lead singers of Jones, Tomson, Harry, Dickson etc., Smith's vocals had always been distinguishable". If the comma has been omitted, that's not our problem. McLerristarr | Mclay1 06:23, 18 November 2010 (UTC)
You make a good point. Regards, SunCreator (talk) 02:42, 19 November 2010 (UTC)

It's not automatic, but it is complicated. I'm currently using 4 rules

  1. <Typo word="<enter a name>" find="etc\s*.\s*…" replace="etc." />
  2. <Typo word="<enter a name>" find="etc\s*\.\.\.\." replace="etc." />
  3. <Typo word="<enter a name>" find="etc\s*\.\.\." replace="etc." />
  4. <Typo word="<enter a name>" find="etc\. +([A-Z])" replace="etc.. $1" />

Plus of course the built in etc => etc.

  1. Rule 1 deals with the actual ellipsis character.
  2. Rule 2 assumes that four dots represent an abbreviation stop and an ellipsis, and removes the ellipsis.
  3. Rule 3 assumes that three dots represent an ellipsis, and removes the ellipsis, replacing it with a stop.
  4. Rule 4 assumes (very shakily) that a new sentence starts on the next word and inserts an end of sentence stop after the abbreviation stop.

This is, of course, only valid outside quotes, and even then only rules 1-3 can be given a very high positive and low negative hit rate. Rule 4 fails positively on succeeding proper nouns and fails negatively on intervening punctuation, breaks, titles, end of page etc. Rich Farmbrough, 12:25, 17 November 2010 (UTC).
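A rough way to try the four rules above outside AWB is sketched below in Python (AWB itself uses .NET regexes, so `$1` becomes `\1` here). The find strings are copied from the list; the function name and sample sentences are made up for the illustration. Note that order matters: running rule 3 before rule 2 would turn "etc...." into "etc..".

```python
import re

# Rules 1-3 from the list above, in the order given.
ELLIPSIS_RULES = [
    (r"etc\s*.\s*…", "etc."),       # rule 1: actual ellipsis character
    (r"etc\s*\.\.\.\.", "etc."),    # rule 2: stop plus three-dot ellipsis
    (r"etc\s*\.\.\.", "etc."),      # rule 3: three-dot ellipsis only
]

# Rule 4, kept separate because it is the shaky one.
RULE_4 = (r"etc\. +([A-Z])", r"etc.. \1")

def apply_rules(text, rules):
    """Apply each (find, replace) pair in turn."""
    for find, replace in rules:
        text = re.sub(find, replace, text)
    return text

print(apply_rules("cats, dogs, etc. …", ELLIPSIS_RULES))
print(apply_rules("cats, dogs, etc....", ELLIPSIS_RULES))
print(apply_rules("cats, dogs, etc...", ELLIPSIS_RULES))

# Rule 4 misfires when a proper noun follows the abbreviation.
print(apply_rules("Jones, Tomson, Harry, Dickson etc. Smith's vocals", [RULE_4]))
```

The last call inserts a second stop before "Smith's", which is the "fails positively on succeeding proper nouns" case described above.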

Rereading the archived discussions about this rule has been enlightening. When would "Etc." (with a capital "E") be correct? GoingBatty (talk) 18:13, 17 November 2010 (UTC)
There were discussions about that in the archives too. *shrug* Shadowjams (talk) 09:14, 18 November 2010 (UTC)
I just reread the archives and didn't see it. Could you please show me where this was discussed? Thanks! GoingBatty (talk) 00:43, 19 November 2010 (UTC)
Sorry, I may be confused; come to think of it, it may have been regarding e.g. or i.e. or something like that. The discussion I'm thinking of had to do with trailing punctuation I think... In any case I think that issue dealt with some peculiarities of the old rule. So your question raises a good point. Shadowjams (talk) 00:52, 19 November 2010 (UTC)

etc..

I did some searching for etc and found lots of occurrences of "etc..". It seems much more common than "etc" in fact. Regards, SunCreator (talk) 10:45, 18 November 2010 (UTC)

Etc. and etc should be avoided in formal prose, IMO. "Such as ...", and "including ..." are just two subset terms that indicate that a list is incomplete, and avoid the brush-off informality of "etc"

"[number]-fold"

I have just removed the following as not being a typo.

 <Typo word="T(wo/hree/en/welve/wenty/hirty/housand)fold" find="\b([Tt])(wo|hree|en|welve|wenty|hirt(?:y|een)|housand)[-\s]+fold\b" replace="$1$2fold" />
 <Typo word=";(Four/Five/...)fold" find="\b([Ff](our|ive|orty|ift(y|een))|[Ss](ix|even)(teen|ty)?|[Ee](ight(y?|een)|leven)|[Nn]ine(teen|ty)?|[Hh]undred)[-\s]+fold\b" replace="$1fold" />

AFAIK, usage of the -fold suffix (i.e. 'three-fold' as opposed to 'threefold') is an accepted/bona fide variant, and does not fall to be treated as a typo. --Ohconfucius ¡digame! 04:10, 22 November 2010 (UTC)

Oxford Dictionaries Online doesn't list them as variants and I can't find any instances on Google, which thinks it's a typo. Usually hyphenated compound words are British but British usage seems to be no hyphen. McLerristarr | Mclay1 07:24, 22 November 2010 (UTC)

Saavy --> Savvy

A new user recently requested that this typo be fixed by an AWB user. I found the misspelling in 18 articles when I ran the request. --Andrew Kelly (talk) 03:49, 23 November 2010 (UTC)

  Done here GoingBatty (talk) 04:21, 24 November 2010 (UTC)

Tamborine

Please add "tamborine" → "tambourine", but not when capitalised to avoid changing Tamborine, a place in Queensland. McLerristarr | Mclay1 04:33, 23 November 2010 (UTC)

  Done here GoingBatty (talk) 04:21, 24 November 2010 (UTC)

Got a few more

  • persuing --> pursuing
  • persued --> pursued
  • persuit --> pursuit

Thanks! --Andrew Kelly (talk) 23:17, 24 November 2010 (UTC)

  Not done - already part of the typo rules GoingBatty (talk) 00:05, 25 November 2010 (UTC)
The first two could be perusing and perused, respectively. –Schmloof (talk · contribs) 00:47, 25 November 2010 (UTC)

Kilowatt hour - kWh?

A new typo rule was added for kilowatt hour to change typos to "kWh". Reading Kilowatt hour#Symbol and abbreviation for kilowatt hour makes me think that "kW·h" may be better. Thoughts? GoingBatty (talk) 05:05, 28 November 2010 (UTC)

The United States National Institute of Standards and Technology prefers "kW·h" but considers kW h acceptable. It acknowledges that the ISO allows dropping the space if there is no risk of confusion, but NIST disagrees with ISO's position.
My position is that the attention human editors give to reviewing AWB edits is often minimal, so a form that can be confusing, "kWh", should be forbidden for AWB purposes.
Also, since there are two acceptable forms, if examination of an article shows it consistently uses a correct form, together with a few errors, the AWB user must follow the established form for that article. Jc3s5h (talk) 18:00, 28 November 2010 (UTC)
The new rule is set up to change "KWh", "Kwh", or "Kph" → "kWh". RegExTypoFix can't suggest to the user to use one of multiple forms. Should it suggest "kW·h" or "kW h"? GoingBatty (talk) 02:41, 29 November 2010 (UTC)
"Kph"? It should be kW·h as that is the correct form. If people want to use the incorrect form, then that's up to MOS:NUM to decide, but a typo corrector should add the most correct form. It should just not correct kW h. A new rule could be set up to add nbsp between units like that. Although, I'm not sure if that's a typo thing or a general AWB thing. McLerristarr | Mclay1 04:34, 29 November 2010 (UTC)
"Kph" is probably a typo for km/h and is certainly not a typo for kW·h. This is a clear error in the rule which must be fixed. I think the correction for the other typos should be kW·h. Editors who consistently fail to change this to kW h in articles where that form is appropriate should have their permission to use AWB revoked for failure to properly review their edits. Jc3s5h (talk) 13:57, 29 November 2010 (UTC)
I agree that AWB users should review their edits before saving, but I don't see how you would educate AWB users on the level of consistency you desire for the proper abbreviation, especially when the scientific community can't agree. You'd probably have better luck educating the editors who made the original mistakes, so the AWB users won't have to fix anything. GoingBatty (talk) 17:31, 29 November 2010 (UTC)
I would say AWB users should not use it on articles if they lack subject matter knowledge, or they should turn off any options that would make changes that require subject matter expertise to evaluate. Those who cannot be persuaded to limit AWB use to situations they can properly evaluate should have the privilege of using it removed. Jc3s5h (talk) 18:12, 29 November 2010 (UTC)
Please don't put anything in the typo rules that requires expert knowledge. I use AWB to fix thousands of grammatical errors scattered randomly across hundreds of subject areas. If, as I read here, this typo rule is controversial or requires extra-careful review, I will simply turn off the RegExpTypoFix option on any article where this rule kicks in - and that means that other typos in that article won't be fixed. -- John of Reading (talk) 18:22, 29 November 2010 (UTC)

So John, by the same reasoning, you wouldn't want any typo correction for words that are spelled differently in various varieties of English, such as "colour", right? Jc3s5h (talk) 19:35, 29 November 2010 (UTC)

That's correct, we couldn't add a typo rule for "color > colour" or "colour > color", because they would give the wrong results too often. -- John of Reading (talk) 21:09, 29 November 2010 (UTC)
John, I don't think that's the proper analogy. Your example is a rule that is changing one correct version of the word for another. I think a better analogy is that we don't add a typo rule to fix the incorrect "colur", because the correct word could be either "color" or "colour".
In this case, a typo rule was added to fix the incorrect "KWh" or "Kwh", but since the correct abbreviation could be "kW·h" or "kW h" or maybe even "kWh" (depending on which organization you want to follow), I think the safest thing would be to remove the rule and let those with the expert knowledge identify and fix all future errors.
Therefore I removed the "kilowatt hour" rule in this edit. GoingBatty (talk) 01:54, 30 November 2010 (UTC)

Incorrect spelling correction of disiciplinary

disiciplinary is changed incorrectly to dissiciplinary instead of disciplinary. Edit can be seen here - Aeonx (talk) 03:53, 9 December 2010 (UTC)

It was the rule named "Dissi-". I don't know whether it's worth changing the rule, though, since this is such an uncommon typo - an AWB Google search finds just two examples, neither in article space. If I see that RegExpTypoFix has made an incorrect fix, I don't hit "Save"... -- John of Reading (talk) 07:35, 9 December 2010 (UTC)
Perhaps it would be wise to make a temporary rule of "dissiciplinary" to "disciplinary" to fix up the mistakes that the typo finder may have already made? McLerristarr | Mclay1 08:10, 9 December 2010 (UTC)
This search says the only example of "dissiciplinary" is at Wikipedia:WikiProject Death. -- John of Reading (talk) 09:47, 9 December 2010 (UTC)
Fair enough. Considering the typo finder very rarely makes this change, we probably don't need this rule at all. McLerristarr | Mclay1 11:19, 9 December 2010 (UTC)

False positive: Bicep

The "typo" fix here is invalid. The beach is called "Bicep Beach". (I just watched the short to confirm.) So I added {{typo}} around the word. Does AWB honor that? Is that the appropriate action in cases like this? --Mepolypse (talk) 15:43, 11 December 2010 (UTC)

Just a note: {{Typo}} has been moved to {{Not a typo}}. McLerristarr | Mclay1 15:57, 11 December 2010 (UTC)
A quick check with my sandbox - yes, normal typo fixing is disabled inside both {{typo}} and {{Not a typo}}. -- John of Reading (talk) 16:03, 11 December 2010 (UTC)
Thanks. (Agree that {{not a typo}} is a better name.) --Mepolypse (talk) 16:06, 11 December 2010 (UTC)
Wonder why Wikipedia needs both {{not a typo}} and {{sic}}. GoingBatty (talk) 22:37, 11 December 2010 (UTC)
Does {{not a typo}} have the options that {{sic}} has? One thing {{sic}} has is the ability to hide or display the word "sic"; sometimes you just want to tell spell checkers to leave it alone, and sometimes you want "sic" displayed in the article. --Auntof6 (talk) 03:20, 12 December 2010 (UTC)
I see the difference as {{sic}} is to tag a mistake made by someone outside of Wikipedia, e.g. in a quote, whereas {{Not a typo}} is to tag a deliberate mistake made by a Wikipedia editor or something that seems like a mistake but isn't. McLerristarr | Mclay1 04:37, 12 December 2010 (UTC)

KBE

People are not "made a KBE" they are "appointed a KBE". Ditto OBE DBE GBE MBE KCMG MVO LVO KCVO. Kittybrewster 14:07, 12 December 2010 (UTC)

That's not a typo or even incorrect, it is merely a personal preference. It is completely correct grammar to say someone was "made a Knight of the British Empire". The award is often used to refer to the recipient. Ringo Starr has an MBE = Ringo Starr is an MBE. Whether that is correct or not is not for a typo fixer to decide. McLerristarr | Mclay1 14:14, 12 December 2010 (UTC)

False positive: before it's

This edit is wrong. Can we get AWB to not do this? --Mepolypse (talk) 16:12, 13 December 2010 (UTC)

I have removed this fix. It was added on 18th October. -- John of Reading (talk) 17:34, 13 December 2010 (UTC)
I agree that it was wrong to change "before it's too late" to "before its too late". However, per WP:CONTRACTION, this use of "it's" is "informal and should be avoided." GoingBatty (talk) 17:54, 13 December 2010 (UTC)
Should be changed to "it is". Kittybrewster 09:17, 17 December 2010 (UTC)
I set up a "find and replace" run for "before it's too late" > "before it is too late", but quickly abandoned it.
  • Too many of the matches were album/song titles, or were in text that AWB failed to identify as quotes.
  • If an article uses informal contractions such as "it's", it probably needs a full copy-edit, way beyond anything that AWB can do.
-- John of Reading (talk) 09:44, 17 December 2010 (UTC)

Capitalisation of "internet"

I'm not going to revert this edit, but I'm not convinced it's necessary to change internet to Internet. According to Internet capitalization conventions, many publications are now using the common noun (uncapitalised) form. I've raised a query at Wikipedia talk:Manual of Style (capital letters) to see if Wikipedia has any conventions regarding this. In the meantime, I'm skipping making this change. —  Tivedshambo  (t/c) 21:57, 17 December 2010 (UTC)

I've disabled the rule until the discussion is settled. -- John of Reading (talk) 17:18, 18 December 2010 (UTC)

"Baptist_" rule

In this edit the correct fix "baptist" > "Baptist" was not recorded in the edit summary. Presumably the rule fails this guideline, but I don't know enough about regular expressions to fix it. -- John of Reading (talk) 17:57, 23 December 2010 (UTC)

  Done This update will ensure the correction shows up in the edit summary for all cases except the "John the baptist" fix. Rjwilmsi 01:45, 3 January 2011 (UTC)
Baptist should not always be capitalised. A baptist is one who baptises. McLerristarr | Mclay1 03:31, 3 January 2011 (UTC)
Yes, the rule looks at the next word. It only capitalises "baptist church", "baptist minister" and a few similar pairs. -- John of Reading (talk) 07:46, 3 January 2011 (UTC)

Suggestions

<Typo word="Cadillac" find="\b[Cc]ad(dil(l|)|il)ac\b" replace="Cadillac"/>
<Typo word="Be unable" find="\bnot\s+be\s+able\b" replace="be unable"/>
<Typo word="Aberrant" find="\b([Aa])b(b[ae]rr?|[ae]r|arr?)([ae](nce|nt|tes?|tions?)|)\b" replace="$1berr$3"/>
<Typo word="Accelerate" find="\b([Aa])c(cela|[ae]l[ae])rat(e(d|s|)|ing)\b" replace="$1ccelerat$3"/>
<Typo word="Accidentally" find="\b([Aa])cc?id[ae]nt([aei]?(ly))\b" replace="$1ccidentally"/>
<Typo word="across" find= "\bacros\b" replace="across"/>
<Typo word="Adaptation" find="\b([Aa])dapt([ae]|io)n(s?)\b" replace="$1daptation$3"/>
<Typo word="Adaptive" find="\b([Aa])dapt[aei]tive\b" replace="$1daptive"/>
<Typo word="Adultery" find="\b([Aa])d[aeu]lt[au]?ry\b" replace="$1dultery"/>
<Typo word="Anesthe(sia/tic)" find="\b([Aa])n[ai]sth[ae](sia|tics?)\b" replace="$1nesthe$2"/>
<Typo word="(A/E)ffect" find="\b([AaEe])fect(s|ing|)\b" replace="$1ffect$2"/>
<Typo word="affidavit" find="\baf(f[ae]|[aei])(d[ae]v[ie][td](s?))\b" replace="affidavit$3"/><!--To do: catch if start with aff) -->
<Typo word="Affluen(t/ce/cy|tial)" find="\b([Aa])fluen(c[ey]|t(tial)?)\b" replace="$1ffluen$2"/>
<Typo word="(Un)Afflict" find="\b([Uu]na|[Aa])flict(e(d(ly|ness|)|r)|i(ng|ons?(less)?|ve)|less|s|)\b" replace="$1fflict$2"/>
<Typo word="Aggravate" find="\b([Aa])g(gr[eo]|r[aoe])vat(ed?|i(on|ve)|or)\b" replace="$1ggravat$3"/>
<Typo word="Agrees to" find="\bagress\s+to\b" replace="agrees to"/>
<Typo word="Agreement" find="\bagree?[ia]nce\b" replace="agreement"/><!-- Per http://dictionary.reference.com/browse/agreeance agreeance is "considered obsolete and a bastardization of 'agreement' " -->
<Typo word="Aid" find="\b(to|give|provide)\s+aide\b" replace="$1 aid"/><!--Aid vs Aide needs more work-->
<Typo word="Album" find="\balbumn(s?)\b" replace="album$1"/>

Before I add the above typo suggestions (and commit more time towards making the regex), I want to make sure that the above is correct. Could someone familiar with the Typos regex let me know if the above formatting/regex is correct? Smallman12q (talk) 23:01, 2 January 2011 (UTC)

Here's what I changed above:
  • Added missing left bracket to the "Cadillac" rule.
  • Changed the name of the "Unable" rule to "Be unable". What is your source for changing this?
  • Changed the end of the "Abberant" rule to $3.
  • Changed the beginning of "Aggravate" rule to $1.
  • Changed the name of the "Agreeance" rule to "Agreement".
  • Moved the comments immediately after the appropriate rule.
Thanks! GoingBatty (talk) 01:01, 3 January 2011 (UTC)
Anesthe... words are spelt "anaesthe..." in British English. McLerristarr | Mclay1 03:29, 3 January 2011 (UTC)
Miscellaneous comments:
  • Which of these mistakes are common enough to warrant a typo rule?
  • I think the "Be unable" rule goes beyond typo fixing into copy-editing.
  • In the "Anesthe(sia/tic)" rule the "replace" string is empty?
-- John of Reading (talk) 07:56, 3 January 2011 (UTC)

I've fixed the Anesthe(sia/tic) replace. Here are some more suggestions:

<Typo word="Literally" find="\b([Ll])it((t[aeo]r[aei]|[ao]r[aeio]|er[eio])l?|era)ly\b" replace="$1iterally"/>
<Typo word="illiterate" find="\b([Ii])l([aeoi]t[aeio]r[aeio]|l([aeo]tera|it[aio]ra|iter[eio]))t(e?(ly|ness|s|))\b" replace="$1lliterate$5"/>
<Typo word="A lot" find="\balot\b" replace="a lot"/>
<Typo word="Alphabetize" find="\b([Aa])lphabeticalize\b" replace="$1lphabetize"/><!--rare-->
<Typo word="all right" find="\balright\b" replace="all right"/><!--Alright is nonstandard-->
<Typo word="Alternate" find="\b([Aa])lterate\b" replace="$1lternate"/>
<Typo word="Ulterior" find="\balterior\b" replace="ulterior"/><!--rare-->
<Typo word="Although" find="\b([Aa])ltho(?![s'])\b" replace="$1lthough"/><!--either add ' or ugh-->
<Typo word="Ambivalent" find="\bambiv[aeio]late\b" replace="ambivalent"/><!--rare-->
<Typo word="Ambivalen(t/ce/cy)" find="\b([Aa])mb(([aeo]va|ev[eio])lan|ival[aio])n(t|c[ey])\b" replace="$1mbivalen$4"/>

Smallman12q (talk) 13:40, 3 January 2011 (UTC)

Fixed "Literally" so it will find upper and lower case. Are you testing your regexes using AWB's regex tester? GoingBatty (talk) 13:48, 3 January 2011 (UTC)
I'm using RegexBuddy.Smallman12q (talk) 16:08, 3 January 2011 (UTC)

A style suggestion: Set each rule so that the replace field has only $1 and $2, and not jump from $1 to, say, $5. Then if someone later makes a change, they won't have to count whether $5 has to be increased to $6. For example:

<Typo word="Illiterate" find="\b([Ii]l)(?:[aeoi]t[aeio]r[aeio]|l(?:[aeo]tera|it[aio]ra|iter[eio]))te?(ly|ness|s?)\b" replace="$1literate$2"/>

Also, "altho" is not incorrect: http://www.merriam-webster.com/dictionary/altho

Have you checked to see whether all these errors actually occur in wikipedia?

Is "ambivilent" really a word? It's not in my gigantic dead-tree dictionary, and the only two occurrences I find in wikipedia are errors for "ambivalent". Even if it is a real word, I wouldn't create a rule for it, as it is apparently exceedingly rare.--BillFlis (talk) 14:06, 3 January 2011 (UTC)

No, it's not... that's my mistake... should be "ambivalent". I have checked some... "literaly" returns 8 results, "iliterate" returns 2. I also have a more technical question: is the typo scan plugin multi-threaded? Smallman12q (talk) 16:07, 3 January 2011 (UTC)
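To illustrate the style suggestion above about keeping the replace field to $1 and $2: the sketch below compares the originally suggested "illiterate" rule with the rewritten one that uses non-capturing groups. It is written in Python as a stand-in for AWB's .NET regexes (so `$5` becomes `\5`), and the function names and sample words are assumptions chosen for the example.

```python
import re

# Original suggestion: nested capturing groups, so the suffix ends up in $5.
ORIGINAL = re.compile(
    r"\b([Ii])l([aeoi]t[aeio]r[aeio]|l([aeo]tera|it[aio]ra|iter[eio]))t(e?(ly|ness|s|))\b"
)
# Rewritten rule: (?:...) groups do not count, so the suffix is simply $2.
REVISED = re.compile(
    r"\b([Ii]l)(?:[aeoi]t[aeio]r[aeio]|l(?:[aeo]tera|it[aio]ra|iter[eio]))te?(ly|ness|s?)\b"
)

def fix_original(text):
    return ORIGINAL.sub(r"\1lliterate\5", text)

def fix_revised(text):
    return REVISED.sub(r"\1literate\2", text)

for sample in ["iliterate", "Iliterates", "illitarate", "illiterate"]:
    print(sample, "->", fix_original(sample), "/", fix_revised(sample))
```

On these samples the two rules give the same result, and the correctly spelled "illiterate" is left alone by both; only the group numbering differs.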

Here are some more suggestions...I believe I'm done with most of the A's...

<Typo word="Amidst/Amongst/Whilst" find="\b([Aa]m(ong|id)|[Ww]hil)st\b" replace="$1"/><!--archaic-->
<Typo word="Immoral/Immortal" find="\b([Ii])mor(t?)al(s|ity|l?y|)\b" replace="$1mmor$2al$3"/>
<Typo word="Immoral (2)" find="\bammoral" replace="immoral"/> <!--could also be amoral-->
<Typo word="Ampersand" find="\b([Aa])mp(?:ers[eiou]|[aiou]rsa)nd(s|)\b" replace="$1mpersand$2"/>
<Typo word="Anecdote" find="\ban[ia]dote(s|)\b" replace="anecdote$1"/>
<Typo word="Annoyance" find="\bannoyment(s)\b" replace="annoyance$1"/>
<Typo word="Anymore" find="\bany\s+more\b" replace="anymore"/><!--http://dictionary.reference.com/browse/anymore most commonly spelled as one word-->
<Typo word="Anyway" find="\b([Aa])n?nyways\b" replace="$1nyway"/><!--http://dictionary.reference.com/browse/anyways Anyways is not standard-->
<Typo word="Arctic" find="\b([Aa])rtic" replace="$1rctic"/>
<Typo word="As usage (since)" find="\b([Aa])s\s+(al(most|)\b" replace="since"/><!--check if first letter caps-->
<Typo word="As usage (because)" find="\bas\s+there\b" replace="because"/>
<Typo word="Assertion" find="\b([Aa])ss?ertation(s?)\b" replace="$1ssertion$2"/><!-- Assertation obsolete-->
<Typo word="Opinion" find="\bopinionation\b" replace="opinion"/>
<Typo word="Authentication" find="\b([Aa])u?thentification\b" replace="$1uthentication"/><!-- Authentification is incorrect-->
<Typo word="Backward" find="\bba(ckw[eio]|kw[[aeio])rd(s?)\b" replace="backward$1"/>

Smallman12q (talk) 17:52, 3 January 2011 (UTC)

I don't think we should "fix" amidst, amongst, or whilst. These are not marked as archaic in Merriam-Webster. "Ammoral" -> "Amoral" would be a better fix. "Anidote" is more likely "Antidote" than "Anecdote", IMO. Any more is right out -- "are there any more of these?" is perfectly valid. I do not understand "as al" -> "since" or "as almost" -> "since", and is missing a close paren. "As almost" (uppercase) -> "since" (lowercase) would be wrong regardless. -- JHunterJ (talk) 18:15, 3 January 2011 (UTC)
"Immoral" is probably handled by one of the "beginnings" rules. "Artic" without a final \b will damage "Articulated"; with a \b there will still be many false positives - try a search. "Opinionation" occurs only four times, three times as the title of a piece of music and once in the title of an academic work. As a general point, can one of the AWB performance experts please indicate roughly how many hits are needed to make a rule worthwhile? -- John of Reading (talk) 18:22, 3 January 2011 (UTC)
Also, "whilst" (if "fixed") should become "while", not "whil". -- JHunterJ (talk) 18:36, 3 January 2011 (UTC)
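Two of the problems pointed out above are easy to reproduce. The sketch below runs the suggested "Arctic" and "Amidst/Amongst/Whilst" rules through Python's regex engine as a stand-in for AWB's (with `$1` written as `\1`); the sample phrases and variable names are invented for the illustration.

```python
import re

# "Arctic" rule as suggested, with no trailing \b.
ARCTIC = re.compile(r"\b([Aa])rtic")
# Same rule with the trailing word boundary added.
ARCTIC_BOUNDED = re.compile(r"\b([Aa])rtic\b")
# "Amidst/Amongst/Whilst" rule as suggested.
ST_RULE = re.compile(r"\b([Aa]m(ong|id)|[Ww]hil)st\b")

print(ARCTIC.sub(r"\1rctic", "Articulated buses"))          # damages a valid word
print(ARCTIC_BOUNDED.sub(r"\1rctic", "Articulated buses"))  # left alone
print(ARCTIC_BOUNDED.sub(r"\1rctic", "the Artic Circle"))   # intended fix
print(ST_RULE.sub(r"\1", "whilst amongst friends"))         # "whilst" becomes "whil"
```

The bounded version still cannot tell a misspelt "Arctic" from a legitimate use of "Artic", which is the remaining false-positive concern raised above.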

I don't agree with the "Anesthe(sia/tic)" rule. If someone had misspelt it "anasthesia", it could have meant to be "anaesthesia" or "anesthesia". We can't correct that one. McLerristarr | Mclay1 06:54, 4 January 2011 (UTC)

Why not? Picking either valid spelling from two variations (with the same meaning) is an improvement over a misspelling. -- JHunterJ (talk) 15:22, 5 January 2011 (UTC)
But picking the American variation on a British page or vice versa isn't that much of an improvement. The main problem is which one would we pick for the typo finder to use? Either way, it isn't fair on the other. McLerristarr | Mclay1 15:56, 5 January 2011 (UTC)
OTOH, picking a chiefly American variation on a British page (or vice versa) is a big improvement over a misspelling on either type of page (it only "violates" WP style, not English spelling). "Fair" isn't at issue. (Can't speak for all Americans, of course, but I'd rather see the chiefly British variation than the misspelling.) -- JHunterJ (talk) 17:32, 5 January 2011 (UTC)

The only occurrences of "opinionation" I found were 1) in a song title, "My Opinionation", hence a deliberate misspelling or nonce word, and 2) in the title of a journal article, so probably intended as technical jargon. Also, the "Backward" rule has a couple of problems: "[[" and "$1". Also, in the first set, "adaption" could be a typo for "adoption".--BillFlis (talk) 14:24, 6 January 2011 (UTC)

False positive: "suppose to" → "supposed to"

In Perates the following sentence illustrates the fact that suppose can correctly be followed by to:

We do not read elsewhere of any Euphrates but the Stoic philosopher, who lived in the reign of Hadrian, whom we cannot suppose to have been a teacher of Ophite doctrine.

Moreover, in Ethics of eating meat the following sentence shows another instance:

… there are many examples of animal behavior and intelligence that surpass what people would suppose to be the boundary of animal consciousness …

The new rule that changes "suppose to" into "supposed to" is wrong.

Regards, Hebrides (talk) 21:18, 9 January 2011 (UTC)

I am currently working on a "find & replace" run for "suppose to" and have found several other false positives. I am removing the rule. -- John of Reading (talk) 21:33, 9 January 2011 (UTC)
Actually, I see I've just worked through the very same list of articles that you've been working on. Sorry! -- John of Reading (talk) 21:43, 9 January 2011 (UTC)
And you've both been working on the list of articles that are next on my list.  :-) How about readding it to only find "is/was suppose to"? GoingBatty (talk) 21:53, 9 January 2011 (UTC)
Maybe, but most of the bad articles have just been fixed by Hebrides and me. -- John of Reading (talk) 22:03, 9 January 2011 (UTC)
I've just corrected about a hundred more typos, and have more to do. Based on the volume that the three of us have fixed, I feel that this should be added to the list, just more carefully. Sorry I didn't get it right the first time. GoingBatty (talk) 23:50, 9 January 2011 (UTC)
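The narrower "is/was suppose to" idea suggested above could look roughly like the sketch below. It is a Python illustration only, not the rule that was eventually added; the pattern, the function name and the first sample sentence are assumptions, while the second sample is the Perates sentence quoted earlier in this thread.

```python
import re

# Only fire when "suppose to" directly follows "is" or "was".
SUPPOSE = re.compile(r"\b(is|was)\s+suppose\s+to\b")

def fix_suppose(text):
    return SUPPOSE.sub(r"\1 supposed to", text)

print(fix_suppose("He was suppose to arrive at noon"))
print(fix_suppose("whom we cannot suppose to have been a teacher of Ophite doctrine"))
```

The first sentence is corrected and the second is left alone, since "cannot suppose to" is not preceded by "is" or "was".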

tieing -> tying

Since there is already a rule for "dieing -> dying", there should also be one for "tieing -> tying". AWB did not find that typo here. —bender235 (talk) 23:45, 11 January 2011 (UTC)

  Done - I consolidated the "dieing -> dying" and "lieing -> lying" rules, and added "tieing -> tying" too. GoingBatty (talk) 00:39, 12 January 2011 (UTC)

Another false positive: "at at" → "at"

I just had to revert an AWB edit to Derrick Eide (Cubing), having mistakenly clicked OK on a change from:

"participated regularly in a number of tournaments, appearing at at least one each year "

to

"participated regularly in a number of tournaments, appearing at least one each year "

Changing "at at" to "at" is wrong.

Regards, Hebrides (talk) 11:58, 13 January 2011 (UTC)

Let's see if we can add an exception for "at at least". This rule catches quite a lot of typos. Rjwilmsi 12:32, 13 January 2011 (UTC)
And "at at least" should be reworded for clarity. "appearing in at least one each year" would be better. -- JHunterJ (talk) 12:36, 13 January 2011 (UTC)
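One possible shape for the "at at least" exception is a negative lookahead, sketched below in Python as a stand-in for the .NET regex AWB uses. The pattern and function name are assumptions for illustration, not the rule as actually edited.

```python
import re

# Collapse "at at" to "at", unless the next word is "least".
AT_AT = re.compile(r"\b([Aa]t)\s+at\b(?!\s+least\b)")

def fix_at_at(text):
    return AT_AT.sub(r"\1", text)

print(fix_at_at("He arrived at at noon"))
print(fix_at_at("appearing at at least one each year"))
```

The first sentence is fixed; the second, taken from the Derrick Eide example above, is left unchanged.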

Uknown-->unknown

Uknown-->unknown. I've come across it a few times....
find="\b([Uu])known\b" replace="$1nknown"
ManishEarthTalkStalk 18:19, 19 January 2011 (UTC)

  Done by expanding the existing rule. GoingBatty (talk) 02:19, 20 January 2011 (UTC)

then>than

(more|less|lesser|larger|smaller|bigger|longer|shorter|worse|better|higher|lower|younger|older) then>than ManishEarthTalkStalk 05:15, 20 January 2011 (UTC)

  Done, Someone please check if it's done correctly... ManishEarthTalkStalk 05:25, 20 January 2011 (UTC)
This will create way too many false positives. If someone leaves out the comma in "he got a lot worse, then he got better" we'd end up with "he got a lot worse than he got better". McLerristarr | Mclay1 05:27, 20 January 2011 (UTC)
Is "lesser than" proper English, or should those be changed to "less than"? GoingBatty (talk) 05:39, 20 January 2011 (UTC)
I don't think any of these are suitable: "He earned more/less then than he did now", "It was smaller/larger/bigger/etc then than it is now" -- John of Reading (talk) 08:05, 20 January 2011 (UTC)
Addressed that existing problem with the former rule with a negative lookahead. The other false positive of the missing comma is also an existing problem, not a problem with the expansion. Should this rule be deleted entirely? -- JHunterJ (talk) 15:12, 20 January 2011 (UTC)
I think it should. Any rule that would create false positives due to missing commas should not exist. Missing commas are very common. McLerristarr | Mclay1 05:10, 23 January 2011 (UTC)
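The two kinds of false positive raised above can be reproduced with a small Python sketch. The word list is the one proposed at the top of this section; the `(?! than\b)` lookahead is an assumed form of the negative lookahead mentioned, since the saved rule is not quoted:

```python
import re

COMPARATIVES = (
    "more|less|lesser|larger|smaller|bigger|longer|shorter|"
    "worse|better|higher|lower|younger|older"
)
# Assumed shape of the rule: comparative + " then", unless " than" follows.
THEN_RULE = re.compile(rf"\b({COMPARATIVES}) then\b(?! than\b)")

def fix_then(text):
    """Change 'then' to 'than' after a comparative."""
    return THEN_RULE.sub(r"\1 than", text)

print(fix_then("more then ten people"))
print(fix_then("He earned more then than he does now"))
print(fix_then("he got a lot worse then he got better"))
```

The lookahead protects "more then than", but the last sentence (a missing comma) is still rewritten, which is the objection made in this thread.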

Forgeting -> Forgetting

Not sure if this is already covered, but it came up on the VP(M) today. --Tagishsimon (talk) 23:28, 20 January 2011 (UTC)

  Done by expanding the existing "-tting" rule. GoingBatty (talk) 01:34, 21 January 2011 (UTC)

positive lookbehind results in no edit summary?

Using <Typo word="september" find="\b(?<=[Dd]en \d\d? |[Ii] )September\b" replace="september" /> on sv:WP:AWB/T to change September to september in "Den 18 September 2008..." in sv:Arsonist Lodge on svwp, but edit summary is not filled by the expected "typos fixed: September -> september" but comes out blank "typos fixed: ". Can anyone confirm this is a bug (and help me report the bug), or explain to me what I did wrong so I should expect this undesirable result? ~ Dodde (talk) 00:06, 25 January 2011 (UTC)

If the regex does not match the match value ("September" is the match value here, and the regex doesn't match it) then the edit summary can't show it, as I documented. Not a bug. You don't need a lookbehind there, make it a normal group and replace with "$1september" then all will be well. Rjwilmsi 08:45, 25 January 2011 (UTC)
What do you mean? Isn't the full "\b(?<=[Dd]en \d\d? |[Ii] )September\b" referred to as the regex? This regex does match "September" if it is preceded by e.g. "Den 18 ". I don't want "Den 18 " to be part of the match since I don't want it to be part of the edit summary. I don't understand how this can be done without positive lookbehind. ~ Dodde (talk) 16:08, 25 January 2011 (UTC)
Match value is "September", the regex does not match it, so no edit summary is generated. What you want (to hide part of the match logic in the edit summary) is not supported. Rjwilmsi 17:36, 25 January 2011 (UTC)
I believe \b(?<=[Dd]en \d\d? |[Ii] )September\b will match September in "Den 18 September 2008", but the edit summary didn't show that. -- JHunterJ (talk) 18:36, 25 January 2011 (UTC)
In all other cases, what is matched by the regex is shown in the edit summary as "typos fixed: matched value > replace value" (if those are now the correct terms?). How can a .NET-supported regex match suddenly and unexpectedly be "not supported (for showing up in the edit summary)" without being a bug? ~ Dodde (talk) 18:58, 25 January 2011 (UTC)
rev 7571 RETF to generate edit summary as normal for regex using lookarounds whereby regex doesn't match its own match value. I hadn't realized we could support this with a simple change. Rjwilmsi 08:36, 26 January 2011 (UTC)
Thank you. ~ Dodde (talk) 18:16, 26 January 2011 (UTC)
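The capture-group alternative suggested earlier in this thread can be sketched as follows. It is written in Python for illustration; Python's `re` module only accepts fixed-width lookbehinds, so the original .NET lookbehind cannot be copied over directly, and the function name `lowercase_month` is hypothetical:

```python
import re

# Capture the leading context instead of using a lookbehind,
# then put it back with \1 in the replacement.
SEPTEMBER_RULE = re.compile(r"\b([Dd]en \d\d? |[Ii] )September\b")

def lowercase_month(text):
    """Lowercase 'September' after 'Den <day> ' or 'i '."""
    return SEPTEMBER_RULE.sub(r"\1september", text)

match = SEPTEMBER_RULE.search("Den 18 September 2008 ...")
print(repr(match.group(0)))  # the whole match includes the captured context
print(lowercase_month("Den 18 September 2008 ..."))
```

The trade-off is the one discussed above: with a capture group the matched text includes "Den 18 ", so a summary built from the match would show it too.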

Is AWB still ignoring matches in wikilinks? Apparently not: [3][4]

While this is beneficial for this one rule (and what I hoped to happen when I wrote it), it might be harmful for other rules. Can someone confirm this? — Train2104 (talk • contribs • count) 01:57, 25 January 2011 (UTC)

If it was typo fixes, the edit summary would say: "typos fixed: VIA → Via". Since this says "replaced: VIA → Via", that indicates to me that the person is using AWB's find and replace feature without checking the "Ignore external/internal wikilinks" box. Please follow up with the person who made the edits. Good luck! GoingBatty (talk) 02:30, 25 January 2011 (UTC)

-ish

The "-ish" rule tries to be very clever and not damage certain proper names that end in "sih", but it is not yet good enough. There is a false positive "Fasih" at Special Tribunal for Lebanon, and that's not the first I've run into today. How about a simpler rule that only changes "-sih" to "-ish" when the word is entirely lowercase? -- John of Reading (talk) 18:26, 27 January 2011 (UTC)

Such as changing it from:
<Typo word="-ish" find="(?!\b(?:Ber|Sukae|M(?:[ae]s?|adra)|Ne|Ka|Finar)sih\b)\b([A-Za-z]+)sih(ing(ly)?|e[ds]|ers?)?\b" replace="$1ish$2" /><!--Don't match Bersih, Sukaesih, Massih, Masih, Nesih, Kasih, Messih, Finarsih, Madrasih-->
to the much simpler:
<Typo word="-ish" find="\b([a-z]+)sih(ing(ly)?|e[ds]|ers?)?\b" replace="$1ish$2" /><!--Don't match proper names (beginning with capital letters)-->
GoingBatty (talk) 03:45, 28 January 2011 (UTC)
I know it's still being clever, but can we make it safely clever for proper names by checking the letter before the "sih"?
<Typo word="-ish" find="\b([A-Z][a-z]*[^aers]|[a-z]+)sih(ing(?:ly)?|e[ds]|ers?)?\b" replace="$1ish$2" /><!--Don't match proper names with -asih -esih -rsih -ssih -->
(also added the non-capture for the -ly of -ingly). -- JHunterJ (talk) 11:56, 28 January 2011 (UTC)
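A quick way to check the last proposal against the names mentioned in this section is to port it to Python. The pattern below is the one proposed above; the replacement is written as a function so that a missing suffix becomes an empty string, and the name `fix_ish` is hypothetical:

```python
import re

# Pattern from the proposal above: capitalised words are only matched when
# the letter before "sih" is not a, e, r or s.
ISH_RULE = re.compile(
    r"\b([A-Z][a-z]*[^aers]|[a-z]+)sih(ing(?:ly)?|e[ds]|ers?)?\b"
)

def fix_ish(text):
    """Swap 'sih' to 'ish', keeping any matched suffix."""
    return ISH_RULE.sub(
        lambda m: m.group(1) + "ish" + (m.group(2) or ""), text
    )

for word in ["Englsih", "publsihed", "Fasih", "Bersih", "Massih"]:
    print(word, "->", fix_ish(word))
```

In this sketch "Fasih", "Bersih" and "Massih" are left unchanged, while "Englsih" and "publsihed" are corrected.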

certinaly -> certainly

Here AWB replaced "certinaly" with "certinally", but it should be "certainly". Please fix. --bender235 (talk) 15:52, 1 February 2011 (UTC)

Just to clarify the usage, you replaced "certinaly" with "certinally" there, using AWB and its catch-all rule for replacing -aly with -ally. I will expand the "certain" rule to handle "certin", but remember to check the edits you make using AWB. -- JHunterJ (talk) 17:31, 1 February 2011 (UTC)
I know. I immediately fixed it. --bender235 (talk) 23:32, 3 February 2011 (UTC)

Appeared

AWB did not catch "appeard" as misspelling of "appeared". Please add a rule. --bender235 (talk) 23:33, 3 February 2011 (UTC)

  Done GoingBatty (talk) 03:10, 4 February 2011 (UTC)
  Fixed 80+ misspellings. GoingBatty (talk) 03:53, 4 February 2011 (UTC)

More contractions

Is it possible to expand the contractions rule to allow for badly punctuated examples, such as did'nt, doesnt, etc? I don't think there's many of them, but it would be worth picking them up when they occur. Words like cant should be exempt from the rule. —  Tivedshambo  (t/c) 20:56, 1 February 2011 (UTC)

How about leaving the "cannot" rule the way it is, removing "we're", adding question marks to find zero or one apostrophe, and ensuring it doesn't change "hell" and "shell" as follows:
<Typo word="will not" find="\bwon[’'`]?t\b" replace="will not"/><!--don't change uppercase titles-->
<Typo word=" not" find="\b(are|(c|sh|w)ould|d(id|o|oes)|ha([ds]|ve)|is|m(igh|us)t|w(as|ere))n[’'`]?t\b" replace="$1 not"/><!--don't change uppercase titles, can't and won't have separate rules-->
<Typo word=" are" find="\b(they|wh(at|o)|you)[’'`]?re\b" replace="$1 are"/><!--don't change uppercase titles-->
<Typo word=" have" find="\b((c|sh|w)ould|they|wh(at|o)|you)[’'`]?ve\b" replace="$1 have"/><!--don't change uppercase titles-->
<Typo word=" will" find="\b(s?he|they|wh(at|o)|you)[’'`]?ll\b(?<!hell)" replace="$1 will"/><!--don't change uppercase titles or "hell" or "shell"-->
GoingBatty (talk) 02:45, 2 February 2011 (UTC)
Tweaks to avoid matching "wont" and "whore", and more alternatives rather than negative lookbehind (and non-capturing parens where the capture isn't used).
<Typo word="will not" find="\bwon[’'`]t\b" replace="will not"/><!--don't change uppercase titles or "wont"-->
<Typo word=" not" find="\b(are|(?:c|sh)ould|d(?:id|o|oes)|ha(?:d|s|ve)|is|m(?:igh|us)t|w(?:as|ere|ould))n[’'`]?t\b" replace="$1 not"/><!--do not change uppercase titles, can't and won't have separate rules-->
<Typo word=" are" find="\b(who[’'`]|(?:they|what|you)[’'`]?)re\b" replace="$1 are"/><!--do not change uppercase titles or "whore"-->
<Typo word=" have" find="\b((?:c|sh)ould|they|w(?:ould|h(?:at|o))|you)[’'`]?ve\b" replace="$1 have"/><!--do not change uppercase titles-->
<Typo word=" will" find="\b(s?he[’'`]|(?:they|wh(?:at|o)|you)[’'`]?)ll\b" replace="$1 will"/><!--do not change uppercase titles or "hell" or "shell"-->
--JHunterJ (talk) 12:03, 2 February 2011 (UTC)
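The "whore" exclusion can be tried out with a Python sketch of the " are" rule. One detail to watch when porting: if the apostrophe class sits inside the captured group, it is carried into the replacement, so this sketch keeps it outside the captures. The structure and the name `expand_are` are assumptions for illustration, not the rule as saved:

```python
import re

APOS = "[’'`]"
# "who" needs an apostrophe (so "whore" is never touched);
# they/what/you may have lost theirs ("theyre", "youre").
ARE_RULE = re.compile(rf"\b(?:(who){APOS}|(they|what|you){APOS}?)re\b")

def expand_are(text):
    """Expand lowercase who're/they're/what're/you're to '<word> are'."""
    return ARE_RULE.sub(
        lambda m: f"{m.group(1) or m.group(2)} are", text
    )

for sample in ["who're you", "theyre here", "you’re late", "whore", "They're"]:
    print(sample, "->", expand_are(sample))
```

Capitalised forms such as "They're" are left alone, in line with the "don't change uppercase titles" comments on the rules above.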

But see Wikipedia talk:AutoWikiBrowser#Using RegexTypoFix to "fix" things which are not typos. -- JHunterJ (talk) 12:22, 7 February 2011 (UTC)

And this is an example. Please read WT:MOS for the last week; there have been several instances where non-native speakers of English have simply substituted was not for wasn't, when the sentence needed to be recast - and where simple substitution changes the force of the sentence. WP:CONTRACTIONS says, advisedly, generally; there are occasions when the uncontracted form is a violation of idiom; there are occasions when the contraction is being quoted; and there are occasions when the change needed is much more complex than AWB can provide. Septentrionalis PMAnderson 19:40, 8 February 2011 (UTC)

Deletion of the "harmful" contraction section

I have undone this edit by Pmanderson (talk · contribs). The section containing the "harmful contractions", the ones that have provoked so much debate, was disabled by me 36 hours ago; that section is still there but is commented out. The section removed by Pmanderson is an ordinary typo-fixing section that fixes incorrect contractions, for example changing it's downfall to its downfall and was'nt to wasn't. I have restored that section. -- John of Reading (talk) 19:51, 8 February 2011 (UTC)

Thank you for explaining. How does it tell it's downfall from it's down the road or it's downfallen (not desirable, but better than its downfallen)? Septentrionalis PMAnderson 20:00, 8 February 2011 (UTC)
That rule has been written to match only a very specific list of words. The rule does not match "it's down" or "it's downfallen", only "it's downfall". There's documentation on the rule syntax at WP:REGEX, but some of the rules have grown pretty complex. -- John of Reading (talk) 20:24, 8 February 2011 (UTC)

Having slipped once, I will ask you to edit this for me. Do you have a set of commands which enforce WP:ENDASH? if so, please comment them out on the same grounds:

I think those guidelines are so well-established that they are built into the software as General fixes. -- John of Reading (talk) 07:57, 9 February 2011 (UTC)

womens' → women's'

For the article Guide to Life, AWB wants to convert womens' to women's'. Could someone please update the men's rule to fix this? Thanks! GoingBatty (talk) 21:44, 30 September 2010 (UTC)

There's another suggestion about the men/women rule at the top of the page, too -- John of Reading (talk) 06:04, 1 October 2010 (UTC)
Has this been fixed yet? --bender235 (talk) 12:31, 12 February 2011 (UTC)
It looks like it has. Has it not? -- JHunterJ (talk) 12:49, 12 February 2011 (UTC)

Contractions

From Wikipedia:Basic copyediting:

  • Unless part of a quotation, contractions (don't, can't, etc.) should be changed to non-contracted forms (do not, cannot, etc.).

I think an AWB/T rule that covers these things would be nice. —bender235 (talk) 13:40, 20 January 2011 (UTC)

I think we have to be careful when changing contractions not to change proper names that may not be in quotes (e.g. Wouldn't It Be Nice) and contractions with more than one meaning (e.g. "'s" could be "is", "does" or "has"; same with "'d"). How about the following?
<Typo word="can not" find="\bcan't\b" replace="can not"/><!-- do not change uppercase titles -->
<Typo word="will not" find="\bwon't\b" replace="will not"/><!-- do not change uppercase titles -->
<Typo word=" not" find="(?!\b(wo|ca)n't\b)\b([a-z]+)n't\b" replace="$2 not"/><!-- do not change uppercase titles, can't and won't have separate rules -->
<Typo word=" have" find="\b([a-z]+)'ve\b" replace="$1 have"/><!-- do not change uppercase titles -->
<Typo word=" will" find="\b([a-z]+)'ll\b" replace="$1 will"/><!-- do not change uppercase titles -->
GoingBatty (talk) 01:28, 21 January 2011 (UTC)
Added a new proposal for "won't" and changed the " not" proposal. GoingBatty (talk) 03:03, 21 January 2011 (UTC)
Added a new proposal for "can't" and changed the " not" proposal. GoingBatty (talk) 03:16, 21 January 2011 (UTC)
I'll put these in my "Find & Replace" list and report back here. They ought to check for more kinds of single quote: [’'`]. -- John of Reading (talk) 09:46, 21 January 2011 (UTC)
These are working well for me. Two more comments: I much prefer "cannot" over "can not"; and I think the third rule should be rewritten as an explicit list of the contractions it is willing to fix. This will make it safer, and, as I understand it, will stop it being a performance hit. -- John of Reading (talk) 11:01, 21 January 2011 (UTC)

Suggest:

<Typo word="can not" find="\bcan[’'`]t\b" replace="cannot"/><!-- do not change uppercase titles -->
<Typo word="will not" find="\bwon[’'`]t\b" replace="will not"/><!-- do not change uppercase titles -->
<Typo word=" not" find="\b([a-z]+)n[’'`]t\b(?<!\b(?:wo|ca)n[’'`]t\b)" replace="$1 not"/><!-- do not change uppercase titles, can't and won't have separate rules -->
<Typo word=" have" find="\b([a-z]+)[’'`]ve\b" replace="$1 have"/><!-- do not change uppercase titles -->
<Typo word=" will" find="\b([a-z]+)[’'`]ll\b(?<!ya[’'`]ll)" replace="$1 will"/><!-- do not change uppercase titles, ya'll is more likely y'all than ya will -->

based on the comments, my preference for negative lookbehind (evaluated only after a possible match) over negative lookahead (evaluated whether or not there's a match), and an exclusion for ya'll, which is more likely y'all than ya will. -- JHunterJ (talk) 12:55, 21 January 2011 (UTC)

I've edited your third rule. The lookbehind section also needs to check for the three kinds of single quote. -- John of Reading (talk) 15:33, 21 January 2011 (UTC)
dictionary.com confirms that "cannot" is far more common than "can not". Per this list of standard contractions, I've modified the suggestion below. Note that I added an " are" rule and removed the " will" rule, since "he'll" could either be "he will" or "he shall".
<Typo word="cannot" find="\bcan[’'`]t\b" replace="cannot"/><!-- do not change uppercase titles -->
<Typo word="will not" find="\bwon[’'`]t\b" replace="will not"/><!-- do not change uppercase titles -->
<Typo word=" not" find="\b(are|(c|sh|w)ould|d(id|o|oes)|ha([ds]|ve)|is|m(igh|us)t|were)n[’'`]t\b" replace="$1 not"/><!-- do not change uppercase titles, can't and won't have separate rules -->
<Typo word=" are" find="\b(they|w(e|hat|ho)|you)[’'`]re\b" replace="$1 are"/><!-- do not change uppercase titles -->
<Typo word=" have" find="\b(they|w(e|hat|ho)|you)[’'`]ve\b" replace="$1 have"/><!-- do not change uppercase titles -->
<Typo word=" will" find="\b(s?he|they|w(e|hat|ho)|you)[’'`]ll\b" replace="$1 will"/><!-- do not change uppercase titles -->
Changed the " not" rule to find "don't". Please feel free to update the rules (especially to improve performance, if needed). GoingBatty (talk) 02:06, 23 January 2011 (UTC)
I would !vote to include "he'll > he will", and so on. I think that "he shall" is unusual and, if meant, would be spelt out explicitly and not contracted. -- John of Reading (talk) 08:54, 23 January 2011 (UTC)
Isn't "he shall" incorrect grammar anyway? Traditionally, shall is first person, will is second and third person. McLerristarr | Mclay1 10:16, 23 January 2011 (UTC)
There's way too much detail about this at Shall and will. -- John of Reading (talk) 11:46, 23 January 2011 (UTC)
OK, added the " will" proposal above. GoingBatty (talk) 14:38, 23 January 2011 (UTC)
  Added these rules to the Typos page. GoingBatty (talk) 15:49, 23 January 2011 (UTC)
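The explicit-list form of the " not" rule can be exercised with a short Python port. The word list follows the rule quoted above, with the unused captures made non-capturing; the name `expand_not` is hypothetical:

```python
import re

APOS = "[’'`]"
# Explicit list of lowercase contractions; can't and won't are deliberately
# absent because they have their own rules.
NOT_RULE = re.compile(
    rf"\b(are|(?:c|sh|w)ould|d(?:id|o|oes)|ha(?:[ds]|ve)|is|m(?:igh|us)t|were)"
    rf"n{APOS}t\b"
)

def expand_not(text):
    """Expand listed n't contractions to '<verb> not'."""
    return NOT_RULE.sub(r"\1 not", text)

for sample in ["they don't", "she doesn’t", "can't", "won't",
               "Wouldn't It Be Nice"]:
    print(sample, "->", expand_not(sample))
```

Because the list is explicit and lowercase only, "can't", "won't" and the capitalised title pass through unchanged.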
I realize I am a bit late to the party but I am not sure that we should be doing this with AWB. For one, the MOS doesn't state that there is a preference, so this is subjective and I believe it will cause some tensions. I also think that in several cases it's too formal and makes reading more difficult, and I have seen this change being applied to the title in cite web and in quotations, which are both inappropriate IMO. Cheers. --Kumioko (talk) 16:17, 29 January 2011 (UTC)
Hi Kumiko - welcome to the party! WP:MOS#Contractions says "The use of contractions—such as don't, can't, won't, they'd, should've, it's—is informal and should generally be avoided." WP:AWB/T#Description says "Typo fixing is automatically prevented on image names, templates, wikilink targets and quotes." I've seen them being applied in quotations, but the root cause each time was missing/incorrect quotation marks. Could you please provide some examples of where the typo fixes were inappropriate? Thanks! GoingBatty (talk) 20:13, 29 January 2011 (UTC)
What about could've, should've and would've? -- John of Reading (talk) 16:07, 30 January 2011 (UTC)
  Added - thanks! GoingBatty (talk) 16:19, 30 January 2011 (UTC)

There's a list of contractions at Wikipedia:Manual of Style (abbreviations)/List of contractions. McLerristarr | Mclay1 16:22, 30 January 2011 (UTC)

  Added "wasn't" - are there others you think are significant enough to add? GoingBatty (talk) 16:38, 30 January 2011 (UTC)
I see that the " will" rule has been removed again. As I said above, I would prefer to ignore the possibility that "shall" was meant, and expand "'ll" to "will". -- John of Reading (talk) 16:51, 30 January 2011 (UTC)
Mclay1, could you please give examples of articles that contain a contraction ending in "'ll" where "will" would not be correct? GoingBatty (talk) 17:25, 30 January 2011 (UTC)
In modern English, shall and will are interchangeable. I don't believe we should be correcting things to one thing if another thing would also be correct. If we were to correct contractions ending in 'll, we should follow traditional grammar of first person = shall and second and third person = will. McLerristarr | Mclay1 17:28, 30 January 2011 (UTC)
I thought the " will" rule only changed instances of the second and third person. Since the rule doesn't change "I'll", could you please give an example where the " will" rule would incorrectly change a first person contraction into "will"? Thanks! GoingBatty (talk) 17:37, 30 January 2011 (UTC)
It changed "we'll", although as I said that would not necessarily be incorrect in modern English. However, thinking about it, there shouldn't be any first person writing outside of quotes. We should remove "we'll" and the others could be changed to " will". McLerristarr | Mclay1 09:17, 31 January 2011 (UTC)
I re-added the " will" rule but removed changes to "we'll" and "we've". I haven't edited the regex before so I hope I did it right. A problem I've noticed with expanding contractions is, for example, "couldn't that" should be expanded to "could that not" not "could not that". There are many cases where this could be the case. McLerristarr | Mclay1 10:22, 31 January 2011 (UTC)
See also this edit summary, "where'd that come from" [sic], by User:David Fuchs at 19:37, 9 February 2011. The contraction where'd can mean "where had" or "where would" or "where did", because the verb come has the same form as its past participle. Similar contractions are when'd, what'd, who'd, how'd, and why'd. Another contraction is let's for "let us".
Wavelength (talk) 19:31, 11 February 2011 (UTC)
"where'd" -> "where did". "where would" and "where had" are not contracted to "where'd", AFAIK (e.g., no "where'd you gone?" or "if you were to leave, where'd you go?" but only "where'd you go yesterday?"). Similarly for the other interrogatives + "'d". -- JHunterJ (talk) 19:47, 11 February 2011 (UTC)
These contractions were not part of the typo rules added recently. Since the contraction rules have been reverted, I don't see adding more any time soon. GoingBatty (talk) 23:31, 11 February 2011 (UTC)

Exception to "colour" rule

AWB just tried to replace "British Coloumbia" with "British Colourmbia". Please add an exception to the "(Dis)Colour" rule. --bender235 (talk) 10:40, 11 February 2011 (UTC)

  Done Exception added. Rjwilmsi 12:03, 11 February 2011 (UTC)
Why? British Coloumbia isn't a false positive -- it isn't spelled correctly. The AWB user should recognize it and change its fix to the correct fix, British Columbia. Having it changed as a typo was still useful. -- JHunterJ (talk) 12:12, 11 February 2011 (UTC)
Yes, "British Coloumbia" is a misspelling, but AWB applied the wrong rule. That was my point. And by the way, I did change it to the correct fix manually. --bender235 (talk) 12:24, 12 February 2011 (UTC)

New addition - "AUD/CAD/HKD/NZD/USD"

I see that RegExpTypoFix wants to change "USD$587 million" to "USD&nbsp;587 million". Before we run into another "controversial fix" fiasco, may I point out that MOS:CURRENCY doesn't mention this format? Is there a discussion somewhere?

The actual text (at Men in Black (series)) is "USD$587 million worldwide on a $90 million budget" so the fixed version looks a mess - I won't be saving this edit.

Also, as a technical quibble, the rule ends with a lookahead so the change does not show up in the edit summary. I gather that's been fixed, but I'm using the 5.2.0.0 release. -- John of Reading (talk) 16:46, 12 February 2011 (UTC)

I removed the currency rules. We can add them back in if we can figure out how they should work. Thanks! -- JHunterJ (talk) 17:29, 12 February 2011 (UTC)
The discussion is at Wikipedia talk:AutoWikiBrowser/Feature requests#Currencies and numbers. GoingBatty (talk) 22:20, 12 February 2011 (UTC)
It should probably be moved here, if it's to be a typo fix rather than a built-in feature. -- JHunterJ (talk) 22:22, 12 February 2011 (UTC)

Gaddafi

We should think about a rule for Gaddafi, because there seem to be dozens of variants out there. Just to establish some consistency, like we do with several other names. --bender235 (talk) 12:56, 23 February 2011 (UTC)

We don't establish consistency, but we do correct misspellings. Which of the dozens of variants are misspellings, and which are acceptable variants? The misspellings could be corrected to a consistent spelling (e.g., Gaddafi), but the variants should not be. Gaddafi#Name cites http://www.straightdope.com/columns/read/513/how-are-you-supposed-to-spell-muammar-gaddafi-khadafy-qadhafi; I would think those variations should be left unchanged. -- JHunterJ (talk) 13:37, 23 February 2011 (UTC)
But why not? The article's name is Muammar al-Gaddafi, and that should be the only variant on Wikipedia. Just like we have a rule for "Beijing", or "Beirut", although these also have a number of variants. --bender235 (talk) 19:27, 23 February 2011 (UTC)
I think that while we can try to be consistent with our own use, his name has so many documented spellings (as shown in the article), I imagine it would be a nightmare to make exceptions for quotations from commentators, older sources, etc. I agree with JHunter: if someone has used a recognized misspelling (i.e. no one uses it anywhere), revert to our form; otherwise leave other forms as is. 66.80.6.163 (talk) 20:01, 23 February 2011 (UTC) (mercurywoodrose)
Color is also at "color", but we can't "correct" usages of "colour", a valid spelling of the same topic. Similarly, we can't correct "Brontosaurus" to "Apatosaurus"; WP might have a preferred spelling, but the other is not a typo or misspelling. Moammar Qaddafi is (I assume) a valid spelling of the same topic as "Muammar al-Gaddafi", not a typo or misspelling, so we don't "correct" it. If an individual AWB user wants to use it to identify non-controversial replacement opportunities and use the (non-typo) replacement tools, that would work. -- JHunterJ (talk) 21:47, 23 February 2011 (UTC)

palyer --> player

palyers --> players as well. Alex (talk) 05:10, 28 February 2011 (UTC)

A WP search found only six of these typos, which I fixed manually. GoingBatty (talk) 05:53, 28 February 2011 (UTC)

Minster - Minister

I've done enough of the following manually to believe that AWB would be more efficient:

prime minster - prime minister

cabinet minster - cabinet minister

Thanks ϢereSpielChequers 18:08, 28 February 2011 (UTC)

  Done with these two edits. -- JHunterJ (talk) 19:48, 28 February 2011 (UTC)

Speed of 'Pre-'

Will the 'Pre-' regex run acceptably fast now? If not, what metric are you using? (it doesn't hang for me, at least.) PS. Thanks for the enable. – Regregex (talk) 06:33, 4 March 2011 (UTC)

hda/had Error

Hello! Not sure if the regex can be adjusted to actually catch this, but "hda" should not correct to "had" if it's in a path, e.g. "/dev/hda1/foo/" :) Avicennasis @ 03:47, 18 Adar I 5771 / 22 February 2011 (UTC)

Thanks. Which rule triggered that change (found in the Typos tab on AWB)? There might be a way to identify the path in the Wikimarkup as not-English. (Also, additions to Talk pages are not minor edits. WP:MINOR. Editors who are ignoring minor edits won't see your new question.) -- JHunterJ (talk) 11:47, 22 February 2011 (UTC)
All such uses should probably go in code tags. Rjwilmsi 13:55, 23 February 2011 (UTC)
Ah. Must've forgot to uncheck the minor box last time. :)
How about this: "english" to "English" unless in a link, e.g., http://www.israelradio.org/english.html ? I've encountered this as well. Avicennasis @ 03:07, 30 Adar I 5771 / 6 March 2011 (UTC)
I did a search for "http://www.israelradio.org/english.html" and found 5 articles on Wikipedia. I processed them all through AWB, and AWB didn't try to change "english" to "English" on any of the articles. On what article did you encounter this issue? GoingBatty (talk) 03:21, 6 March 2011 (UTC)
That was just an example - I forget the actual link. I encountered this on Wikibooks the other day. Since they don't have any AWB pages, the RegEx is loaded from here. I'll try to rescan and find out exactly what the link was. Avicennasis @ 04:35, 30 Adar I 5771 / 6 March 2011 (UTC)
Found an example on Wikibooks, the page is here, and the "typo" it finds is
  • www.stimulus.virginia.gov > www.stimulus.Virginia.gov
Not sure if that helps at all, or if it can be avoided. But there's an example of what I mean. Avicennasis @ 06:41, 30 Adar I 5771 / 6 March 2011 (UTC)
  Done - updated the "Virginia" rule with this edit. GoingBatty (talk) 14:49, 6 March 2011 (UTC)
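The edit to the "Virginia" rule is not quoted here. As a rough illustration only, an exception of this kind could be written with lookarounds that refuse to match when the word sits between dots, as in a domain name. This is a Python sketch and the pattern and the name `fix_virginia` are assumptions:

```python
import re

# Assumed exception: do not capitalise "virginia" when it is preceded by a
# dot or a word character, or followed by ".<something>" (e.g. ".gov").
VIRGINIA_RULE = re.compile(r"(?<![.\w])virginia\b(?!\.\w)")

def fix_virginia(text):
    """Capitalise standalone 'virginia' but leave domain names alone."""
    return VIRGINIA_RULE.sub("Virginia", text)

print(fix_virginia("www.stimulus.virginia.gov"))
print(fix_virginia("born in virginia."))
```

A sentence-final full stop is still allowed, because the lookahead only blocks a dot that is followed by another word character.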

Spurious endash?

Hello!

This[5] change from "-" to "–" is obviously incorrect, see WP:HYPHEN. I am not sure that it is a bug of AWB, not some hand-made rule of a particular user. I know little about AWB and therefore ask here to help fix the problem, via technical changes or maybe social interaction (I have some negative bias towards automated editors and experience some troubles communicating with them). Incnis Mrsi (talk) 16:49, 8 March 2011 (UTC)

The word "replaced" in the edit summary shows that this change is a "find and replace" rule set up by the user, not something built into AWB. But I think the change is correct. The edit summary refers to WP:ENDASH, and the change seems a correct example of point 5 there, since "World War" contains a space. -- John of Reading (talk) 18:10, 8 March 2011 (UTC)
Point 5? It looks bizarre to me and apparently contradicts WP:HYPHEN, but it exists. Sorry for a false alarm. Incnis Mrsi (talk) 19:17, 8 March 2011 (UTC)

Is "du Pré" rule necessary

Is the "du Pré" rule really necessary? I stumbled across the rule when it wanted to make an incorrect change to blogger "Jacqueline Dupree" in Media in Washington, D.C. Thanks! GoingBatty (talk) 03:25, 9 March 2011 (UTC)

  Fixed with this edit. Kept the rule, but allowed the "Dupree" spelling. -- JHunterJ (talk) 12:19, 9 March 2011 (UTC)

Colege -> College

Alex (talk) 23:44, 12 March 2011 (UTC)

  Done, along with Colegiate → Collegiate - GoingBatty (talk) 00:22, 13 March 2011 (UTC)

"an unusually long period" → "a unusually long period" ?

  Resolved

RegexTypoFix wants to change "an unusually long period" → "a unusually long period" on Fisher Hall and Marcum Center (Miami University), based on the "A ..." rule. This doesn't seem like a correct change to me. Comments? GoingBatty (talk) 03:34, 13 March 2011 (UTC)

It wants to change "an usually" to "a usually". That seems a correct change, although it seems the text may have intended unusually there. -- JHunterJ (talk) 12:09, 13 March 2011 (UTC)
Ah, the text is "an usually"! Once I changed the text to "an unusually long period", then RegexTypoFix doesn't want to change it. Thanks! GoingBatty (talk) 15:17, 13 March 2011 (UTC)

"niger" matching even when part of scientific name

Although the "Niger(ia)" rule is set up to not match scientific names, it tries to change Chlidonias niger to Chlidonias Niger on articles such as List of birds of Oregon. Could someone please see if the rule can be updated? Thanks! GoingBatty (talk) 04:36, 13 March 2011 (UTC)

It did not attempt the change when I just tried it. -- JHunterJ (talk) 12:06, 13 March 2011 (UTC)
It still does for me - I'm using AWB SVN 7634. GoingBatty (talk) 15:10, 13 March 2011 (UTC)
I'm a few builds behind, SVN 7471. -- JHunterJ (talk) 17:07, 13 March 2011 (UTC)

billionaire

AWB did not catch "bilionar" as misspelling of "billionaire" here. Please add a rule. --bender235 (talk) 18:06, 25 February 2011 (UTC)

It does not appear to be a common typo.[6] -- JHunterJ (talk) 20:14, 25 February 2011 (UTC)
Still we could add the rule in case the typo occurs again. Wikipedia is an evolving resource. --bender235 (talk) 19:07, 26 February 2011 (UTC)
"To do: Remove rare words. Note that no matches today does not mean a rule is rare, since another user may have used the rule to fix many articles yesterday." Each rule consumes some resources, and the goal is not to have 100% of possible typos included at the expense of being unable to run the tool to fix any of them. -- JHunterJ (talk) 21:43, 26 February 2011 (UTC)
If a word is rare why not put it into a rolling set for each day of the month? That way you can still run the tool but rare fixes will still be picked up. ϢereSpielChequers 09:57, 2 March 2011 (UTC)
You can do that yourself. The AWB ruleset I run is the basic typo rules listed here, in addition to my own set of more nuanced rules (some of which require more human discretion than is appropriate for AWB). You could do the same. Shadowjams (talk) 07:23, 17 March 2011 (UTC)

"Maintenance" rule does not catch "maintanance"

I fixed about 2 dozen of these by supplying my own Find/Replace. I would like to fix the "Maintenance" rule, but I get dizzy when I look at that one for more than a few seconds. Chris the speller (talk) 17:09, 15 March 2011 (UTC)

  Done with this edit. -- JHunterJ (talk) 17:16, 15 March 2011 (UTC)

I tweaked a rule -- sorry about any confusion that may have followed

I tweaked the "(In)Significant" rule, then reverted it, not because I saw anything wrong with it, but because when I reloaded AWB to retest the whole thing live, I found that no RegEx fixes were working for me. It was as if "Enable RegEx TypoFix" was unchecked. So in a near panic, I reverted the rule change. Well, after the reversion, RegEx fixes are still not working for me. I had saved my AWB settings before shutting it down, and reloaded settings after launching it again, so it's not some setting that I forgot. My own Find/Replace rules work fine, as do General Fixes, so I can work, but it's like sweeping with a smaller broom. Any ideas would be appreciated, as would a report that other editors are successfully using RegEx Typo after reloading AWB. As for the rule tweak, maybe an experienced tweaker can look it over and give an opinion. Chris the speller (talk) 22:05, 17 March 2011 (UTC)

It's working for me. I restarted and tried processing A Rocha, "typos fixed: accomodation → accommodation" -- John of Reading (talk) 22:11, 17 March 2011 (UTC)
Thanks, that's very reassuring. I feel better that it's just me. Of course, it also makes me feel a little paranoid. Chris the speller (talk) 22:21, 17 March 2011 (UTC)
Well, now *some* of the RegEx rules seem to be working for me. I'm going to reboot the whole machine (mine, not WP!), what the heck. After that, I'll retest and evaluate whether the existing two rules for "significant" are actually working better than I thought. Chris the speller (talk) 22:58, 17 March 2011 (UTC)

Only a handful of RegEx Typos work for me now, and only intermittently. I know most of you have more fun things to do, but for a change of pace, if an editor wants to help, try the following:

  1. enable Find and Replace
  2. add a Find and Replace for 'reponsible' to 'responsible'
  3. enable RegEx Typos
  4. put 'User:Chris the speller/Sandbox2' in the page list
  5. Start.

It should fix 'reponsible' if Find and Replace is working. It should also fix about 11 other misspellings on 11 other lines if Regex Typos is working. Don't bother to save it, to allow retesting. When I do it, it doesn't catch the other 11 lines. I'd love to hear how other editors fare with this. Chris the speller (talk) 01:44, 18 March 2011 (UTC)

When I try your test, Find and Replace works, but AWB SVN 7634 doesn't find any typos. However, it does find typos on articles (see my contributions) and User:GoingBatty/Sandbox2. GoingBatty (talk) 03:17, 18 March 2011 (UTC)
I am using AWB SVN 7471, just downloaded last week. Am I already 163 versions behind? How often do I need to update it? Since yours also missed all the typos on my sandbox page, it seems that mine is not the only one misfiring. I have found a few typos, but they seem to come in spurts; it finds typos in 2 or 3 pages in a row, then misses them in dozens and dozens. Thanks for giving it a try. You don't sound too worried, but I have a bad feeling about this. Chris the speller (talk) 04:04, 18 March 2011 (UTC)
I download the latest AWB snapshot as soon as it's available, usually because it fixes a bug or includes a feature request I've submitted. Maybe you should report a bug? Good luck! GoingBatty (talk) 04:16, 18 March 2011 (UTC)
The typos in User:Chris the speller/Sandbox2 are in indented paragraphs. RegExpTypoFix skips those in case the indent is marking a quotation. Take out the colons, and the typos get fixed. (Using SVN 7471). -- John of Reading (talk) 08:08, 18 March 2011 (UTC)
Thanks, John, for providing the clear and simple answer to my problem. And thanks again, Batty, for taking the time to look into it. A humbling experience; I feel like the sorcerer's apprentice. Maybe I should change the name of this talk section to "The wrong way to set up a regression test for changes to AWB Typos" ;-)       Chris the speller (talk) 14:33, 18 March 2011 (UTC)
I added a note on the project page to indicate that typos are not checked in indented paragraphs. Thanks for the info, John! GoingBatty (talk) 01:58, 19 March 2011 (UTC)
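A minimal Python sketch of the behaviour John of Reading describes above, for anyone setting up a similar test page. It is only an illustration, not AWB's actual code: the single rule and the `fix_typos` helper are hypothetical, and the only assumption taken from this thread is that lines beginning with a colon are skipped.

```python
import re

# Hypothetical single rule standing in for the full typo list.
RULE = re.compile(r"\b([Aa])ccomodation\b")

def fix_typos(wikitext: str) -> str:
    """Apply RULE to every line except colon-indented ones."""
    fixed = []
    for line in wikitext.split("\n"):
        if line.startswith(":"):
            # Indented paragraph: may be a quotation, so leave it untouched.
            fixed.append(line)
        else:
            fixed.append(RULE.sub(r"\1ccommodation", line))
    return "\n".join(fixed)

sample = "Cheap accomodation.\n:Cheap accomodation."
result = fix_typos(sample)
```

Under that assumption, only the first line of `sample` is corrected, which matches what happened on the sandbox page.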

-hsi

This seems a fairly common part of a name. Be nice not to try to change it to "-his". Rich Farmbrough, 00:14, 25 March 2011 (UTC).

  Done with this edit. GoingBatty (talk) 00:50, 25 March 2011 (UTC)

bilbliography > bibliography

Fairly common typo: [7].

By the way, why have both Wikipedia:Lists_of_common_misspellings and Wikipedia:AutoWikiBrowser/Typos?

Thanks. 128.138.43.231 (talk) 04:10, 29 March 2011 (UTC)

Yes, there were 10; I fixed 8, was headed off at the pass on 2 of them. I guess these may pop up at a rate of about 4 or 5 a month. I'm not sure what is a good cutoff number to qualify for adding a new typo rule.
"Lists of common misspellings" is for people, who are expected to show good judgment, brush off false positives and decide on changing "guerrila" to either "guerilla" or "guerrilla" based on the predominant spelling in the article, while "Typos" is for AWB and high speed. Even a fairly low rate of false positives makes for a pretty bumpy ride while using AWB. And "Typos" is tuned so that one rule can fix various suffixes and prefixes. Listing every possible variant in a separate rule would really bog it down, or maybe jam it good. Chris the speller yack 05:39, 29 March 2011 (UTC)
Thanks for your explanation. Also, I was confused by the "View (previous 20 | next 20) (20 | 50 | 100 | 250 | 500)" at the bottom of the search results. I assumed there were as many as 500 occurrences of that typo. Sorry. =) 128.138.43.231 (talk) 05:56, 29 March 2011 (UTC)

Capitalization of "earth"

An editor using the AWB has twice made edits to the article "Drummond Matthews" by which the text "the earth" has been capitalized to "the Earth". My contention is as follows: that "earth" should be capitalized only when it is used specifically as a name (e.g. Earth has only one moon, or Earth, Venus and Mercury are the three innermost planets). When used after "the" the word becomes a common noun analogous with "the world" or "the globe" and should not be capitalized. That's my understanding, anyway. Can the software be tweaked to enable it to make this distinction? Godingo (talk) 22:45, 31 March 2011 (UTC)

"the Earth's mantle", etc., are correct. See also the occurrences of "the Earth" on The Earth. -- JHunterJ (talk) 23:26, 31 March 2011 (UTC)
I've disabled the rule for now. -- John of Reading (talk) 06:06, 1 April 2011 (UTC)
Why? -- JHunterJ (talk) 10:52, 1 April 2011 (UTC)
Rightly or wrongly, that's my instinctive reaction when someone raises an objection to a "New addition" and not much discussion has happened. Please re-instate the rule if you are sure that it is correct. You might want to fix note 5 in The Earth as well. Reference 80 is someone else's title, so of course that should stay as a lowercase "e". -- John of Reading (talk) 11:06, 1 April 2011 (UTC)
I fixed Ref 80 as well; someone else's automated case fixing of someone else's title was incorrect. I believe the rule is correct, but I'll see if I'm alone in this. Thanks; I didn't realize it was in the "new additions" section. -- JHunterJ (talk) 12:58, 1 April 2011 (UTC)
I've now found the right section in the manual of style, and looked at the first twenty or so potential corrections found by a database scan for \b[Tt]he\s+earth's\b. I'm happy with this rule. -- John of Reading (talk) 13:29, 1 April 2011 (UTC)
Capitalizing "Earth" on Drummond Matthews looks good to me (and I capitalized one more instance inside a wikilink which AWB won't change), as the article specifically refers to our planet. I would have guessed that the only distinction was "Earth" meant the planet and "earth" meant dirt (which seems to coincide with the MOS), but dictionary.com has other examples of where lowercase "earth" is acceptable. GoingBatty (talk) 18:13, 1 April 2011 (UTC)

Cataloger or cataloguer

AWB corrects cataloger to cataloguer. Cataloger is common US spelling used by the Library of Congress among others. Please adjust your list. Thanks. Dankarl (talk) 13:17, 1 April 2011 (UTC)

  Fixed with this edit. -- JHunterJ (talk) 13:45, 1 April 2011 (UTC)

"Improv(e/ise)" rule goofs up "imprevious"

It changes "imprevious" to "improvious", which is a further step away from the correct "impervious". Anyone want to tweak it to prevent this strange twist? It's probably not worth getting it to actually fix this misspelling, as only Ramalinga Swamigal had an example of it, and it is now extinct in the wild. Chris the speller yack 05:15, 10 April 2011 (UTC)

Performance question

Which runs faster, "(M|m)" or "([Mm])", on the RegEx engine used by AWB? The former is shorter by one character, but if the other construction runs faster, that might be the way to go. The Typos list has the first format at the front of most rules, such as "\b(M|m)imic(ing|ed)\b". There was a discussion in October 2010 Wikipedia talk:AutoWikiBrowser/Typos/Archive 3#Profiling heads up for you guys that said explicit character classes ran faster than shorthand character classes ([A-Za-z] faster than \w), but did not cover explicit character classes versus alternation. A web search of "regex performance character classes alternation" led me to High Performance JavaScript by Nicholas C. Zakas on Google Books, which advises against starting a RegEx with alternation. Would anyone be interested in testing this? Here's the real fun part: if it's worth changing, AWB would be a great way to change the Typo rules! Chris the speller yack 19:27, 11 April 2011 (UTC)
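One way to test this is a small timing harness. The sketch below uses Python's `re` module, so its numbers say nothing definite about the .NET engine that AWB uses; the sample text is made up, and the same comparison would need to be repeated in .NET on real article text before changing any rules.

```python
import re
import timeit

ALTERNATION = re.compile(r"\b(M|m)imic(ing|ed)\b")
CHAR_CLASS = re.compile(r"\b([Mm])imic(ing|ed)\b")

# Hypothetical sample text; real article text would give more meaningful timings.
TEXT = "The parrot mimiced its owner while Mimicing the radio. " * 2000

def count(pattern):
    """Number of matches of pattern in TEXT."""
    return len(pattern.findall(TEXT))

# Both forms should find exactly the same typos.
alt_hits = count(ALTERNATION)
cls_hits = count(CHAR_CLASS)

if __name__ == "__main__":
    for name, pattern in (("(M|m)", ALTERNATION), ("([Mm])", CHAR_CLASS)):
        seconds = timeit.timeit(lambda: pattern.findall(TEXT), number=20)
        print(f"{name}: {seconds:.3f} s for 20 passes")
```

The two patterns are interchangeable as far as matches go, so any switch would be purely a performance decision.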

lineraly → linerally

AWB fixed this one[8], another user said that lineraly is the correct spelling, and cited wikt:linearly (which does not have linerally)... I do not know which one is the correct one. Christian75 (talk) 17:15, 22 April 2011 (UTC)

Someone came along after you and corrected it to "linearly", which I'm sure was intended from the context. From my search, it seems that you found the only occurrence of "lineraly" in wikipedia, so I don't think it deserves a rule here to fix it.--BillFlis (talk) 18:43, 22 April 2011 (UTC)

Provably/probably

Please see Wikipedia talk:WikiProject Mathematics#Provably/probably (permanent link here) and QWERTY.
Wavelength (talk) 16:15, 23 April 2011 (UTC)

OK, I have seen it. What is your point? Have you found that a rule here changes "provably" to "probably"?--BillFlis (talk) 05:38, 24 April 2011 (UTC)
My point was that someone with experience at WP:AWB/T might be able to prevent incorrect changes and/or advise at WP:WTM. John of Reading has given advice there, for which I am grateful. I have not found a rule here which changes provably to probably.
Wavelength (talk) 06:43, 24 April 2011 (UTC)
AWB/T can't prevent incorrect changes that are done outside of AWB. -- JHunterJ (talk) 11:25, 24 April 2011 (UTC)

Removing hyphens after -ly adverbs

Today on WT:MOS, an editor requested a script for automatically removing hyphens after -ly adverbs in compound modifiers. I have put about 140 RegEx rules on User:Chris the speller/Adverbs, along with instructions on using an XML editor to splice them into an AWB settings file. This method misses very few standard -ly adverbs, but completely avoids problems with fly, July, Italy, family and the like. Chris the speller yack 23:35, 12 April 2011 (UTC)

To clarify: I'm not pushing this method for addition to the WP:AWB/Typos list, but this seemed to be the place to let a few daring editors know how they can load Find & Replace rules if they want to take this on. Since the list contains only known standard -ly adverbs, there are very few false-positive hits. The differences still need to be examined, and a hyphen removal does not exactly jump off the difference window. The main things to watch out for are changes to quotations, links (I don't usually have the "Ignore ..." boxes checked), and longer compounds (a slowly-but-surely strategy). If this is the wrong forum to bring this up, please move this discussion or ask me to move it. Chris the speller yack 14:31, 13 April 2011 (UTC)
The relevant guideline is at Wikipedia:Manual of Style#Hyphens, subsection 3, point 4 (shortcut WP:HYPHEN, permanent link here).
The current WT:MOS discussion is in the specific section Wikipedia talk:Manual of Style#"A hyphen is not used after a standard -ly adverb" and a requested exception for articles on New Zealand (permanent link here). [link corrected by Wavelength (talk) 01:21, 13 May 2011 (UTC)]
Some archived discussions about the guideline are the following.
For any software designed for automatically removing hyphens after ly adverbs in compound modifiers, I recommend that the software discriminate at least four different categories of instances.
  • instances where the hyphen is automatically removed
(including those with quotations from websites which omit the hyphen)
  • instances where the hyphen is automatically retained
(after fly, July, Italy, and family; in surnames such as Beverly-Smith; in French-language place names such as Romilly-sur-Seine; in English-language place names such as Ashly-on-Avon; in web addresses; in quotations from websites which also use a hyphen; in only-begotten)
  • instances where the hyphen is automatically retained but the suffix ly (or y in adverbs such as fully) is automatically removed
(http://www.onelook.com/?w=full-*&ls=a, http://www.onelook.com/?w=close-*&ls=a, http://www.onelook.com/?w=loose-*&ls=a, http://www.onelook.com/?w=tight-*&ls=a, http://www.onelook.com/?w=high-*&ls=a, http://www.onelook.com/?w=low-*&ls=a, http://www.onelook.com/?w=deep-*&ls=a, http://www.onelook.com/?w=new-*&ls=a, http://www.onelook.com/?w=old-*&ls=a, http://www.onelook.com/?w=narrow-*&ls=a, http://www.onelook.com/?w=wide-*&ls=a, http://www.onelook.com/?w=open-*&ls=a)
  • instances which the software flags for human inspection
(see Wikipedia talk:Manual of Style/Archive 106#Another kind of exception)
half-hourly, hourly, daily, nightly, weekly, fortnightly, semi-monthly, monthly, quarterly, yearly
early, kindly, likely, only
easterly, northerly, southerly, westerly
After my discussions with User:Noetica at User:Noetica/Archive4#Complications with ly and User:Noetica/Archive4#Specific cases with ly, I have been compiling a list of examples at User:Wavelength/About English (permanent link here), which I reproduce here as follows.
[adverb][space] or [adjective][hyphen] with: minded, named, shaped, forked, sized, colo(u)red, tinted, rooted, charged, dated, banked, spaced
Is it possible to design software which can do all those things?
Wavelength (talk) 20:53, 13 April 2011 (UTC)
From http://www.onelook.com/?w=*ly-*&ls=a and the following pages, I selected these examples for the second category above.
assembly (in assembly-line), belly, billy, butterfly (in butterfly-bush), curly, deadly, earthly, friendly, googly (in googly-eyed), holly, holy, hurly (in hurly-burly), jelly, jolly, kelly, Kimberly (in Kimberly-Clark), lily (in lily-of-the-valley), lonely, lovely, manly, McNally, mealy (in mealy-mouthed), oily (in oily-grain), pearly, pimply, poly (in poly-Bernoulli number), prickly, reply (in reply-paid), roly (in roly-poly), sally, scaly (in Scaly-throated Honeyguide), shilly (in shilly-shally), sickly (in sickly-sweet), silly (in silly-sider), sly (in sly-boot), steely (in steely-eyed), supply (in supply-side economics), tally (in tally-ho), whirly (in whirly-ball), willy (in willy-nilly), woolly (in Woolly-necked Stork), worldly (in worldly-minded)
Included there are these types of cases: radio and television stations (for example, KVLY-TV), Hungarian names (for example, László Moholy-Nagy).
Wavelength (talk) 23:58, 14 April 2011 (UTC)
After chewing on this for a while, my first two conclusions are (1) you have put a lot of effort into this—my hat's off to you. (2) I had incorrectly included "overly" in my list of standard -ly adverbs, and now it has been removed. Chris the speller yack 01:44, 15 April 2011 (UTC)
Why isn't "overly" a standard adverb? "John Smith was overly confident." "Overly" is an adverb and it's not slang; I don't see what's non-standard about it. McLerristarr | Mclay1 15:30, 16 April 2011 (UTC)
At User:Noetica/Archive4#Complications with ly (which I mentioned above), I posted these consecutive statements at 01:46, 18 June 2009 (UTC): "I try to avoid the word overly but I understand that it has gained some degree of acceptance by dictionaries. I see three options: (a) removing the hyphen, (b) removing the ly and leaving the hyphen, and (c) removing both the ly and the hyphen."
Please see Noetica's first reply, at 09:40, 18 June 2009 (UTC), under "1. Overly".
You can find links to dictionary definitions of overly at http://www.onelook.com/?w=overly&ls=a. The link to Wikipedia is redirected to List of English words with disputed usage. The entry there for overly (permanent link here) includes a link to http://en.wiktionary.org/wiki/overly.
There is information about overly at these web pages.
[I am striking out three irrelevant links which I mistakenly included in my first post.]
Wavelength (talk) 19:15, 16 April 2011 (UTC)
You can see my most recent hyphen removals by visiting this page and finding hyphen.
Wavelength (talk) 15:43, 22 April 2011 (UTC)
I am correcting a permanent link in my message of 20:53, 13 April 2011 (UTC).
Wavelength (talk) 01:21, 13 May 2011 (UTC)
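The whitelist approach described at the top of this thread (rules only for known standard -ly adverbs) can be sketched in Python as follows. This is an illustration under stated assumptions, not the rules at User:Chris the speller/Adverbs: the three-word list is a hypothetical subset, and the sketch does nothing about quotations, links, or longer compounds, which still need a human eye.

```python
import re

# Hypothetical three-word subset; the list discussed above has about 140 entries.
STANDARD_ADVERBS = ["highly", "newly", "widely"]

# Match a listed adverb followed by a hyphen and a letter.
FIND = re.compile(
    r"\b(" + "|".join(STANDARD_ADVERBS) + r")-(?=[a-z])",
    re.IGNORECASE,
)

def remove_hyphens(text: str) -> str:
    """Replace the hyphen after a listed adverb with a space."""
    return FIND.sub(r"\1 ", text)

fixed = remove_hyphens("a newly-formed band")
```

Because only listed words are touched, "family-owned" and "July-August" pass through unchanged without needing explicit exceptions.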

distint: distant/distinct

I've recently encountered some instances of "distint", which AWB attempted to change to "distant", but in each case the correct word was "distinct". The rule replaces

(?!\b[Dd]istention\b)\b([Aa]ssi|[Cc]on|[Dd]i|[Ii]n|[DdRr]esi|[Ss]ub)st[ei]n(ci?[ey][ds]?|t[a-z]*)\b

with

$1stan$2

MANdARAX  XAЯAbИAM 02:00, 8 May 2011 (UTC)
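The behaviour reported above can be reproduced outside AWB. The sketch below ports the rule to Python's `re` module (an assumption: the .NET engine should behave the same way for this pattern), with `\1` and `\2` standing in for `$1` and `$2`.

```python
import re

FIND = re.compile(
    r"(?!\b[Dd]istention\b)\b([Aa]ssi|[Cc]on|[Dd]i|[Ii]n|[DdRr]esi|[Ss]ub)"
    r"st[ei]n(ci?[ey][ds]?|t[a-z]*)\b"
)
REPLACE = r"\1stan\2"  # Python spelling of the .NET replacement $1stan$2

def apply_rule(text: str) -> str:
    """Apply the rule quoted above to a piece of text."""
    return FIND.sub(REPLACE, text)

# "distint" is captured as "di" + "st" + "i" + "n" + "t", so it becomes "distant".
distint_result = apply_rule("a distint species")
```

The correctly spelled "distinct" is not matched at all, so the rule has no way to produce it; "distint" would need to be excluded or handled by a separate rule.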

Disith -> Dissith

Some typo rule is making this change, which is not needed, but I haven't found which rule is misfiring. You can check it against List of Last Exile characters. Chris the speller yack 17:59, 22 May 2011 (UTC)

The Typo tab in AWB will identify the rules that fired on a page for you. -- JHunterJ (talk) 18:43, 22 May 2011 (UTC)
Thanks. I never had any use for that tab before, and wasn't aware of it. It didn't immediately answer my question, because my browser had choked on the Typos page and not loaded all of it, and that explains why my searches had found nothing. After finally getting it all to display, a search showed it was 'word="Dissi-" find="\b(D|d)isi([a-ko-rt-z]|m[a-nq-z]|s[a-km-z])([a-z]+)\b" replace="$1issi$2$3"'. I'm not going to mess with that little stinker. I don't know if it's worth adding "Disith" to the list of false positives to avoid. Thanks for the response. Now I know what to watch out for and why. Cheers! Chris the speller yack 21:10, 22 May 2011 (UTC)
  Fixed with this edit -- JHunterJ (talk) 01:48, 23 May 2011 (UTC)
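For reference, the misfire can be reproduced in Python with the rule as quoted above. The second pattern, with a leading negative lookahead, is only one hypothetical way to exclude the name; it is not necessarily what the fixing edit did.

```python
import re

FIND = re.compile(r"\b(D|d)isi([a-ko-rt-z]|m[a-nq-z]|s[a-km-z])([a-z]+)\b")
REPLACE = r"\1issi\2\3"  # Python spelling of $1issi$2$3

# Hypothetical variant: skip the proper name "Disith" before trying the rule.
FIND_WITH_EXCEPTION = re.compile(
    r"\b(?!Disith\b)(D|d)isi([a-ko-rt-z]|m[a-nq-z]|s[a-km-z])([a-z]+)\b"
)

misfire = FIND.sub(REPLACE, "Disith")               # the unwanted change
intended = FIND.sub(REPLACE, "disimilar")           # the kind of typo the rule targets
spared = FIND_WITH_EXCEPTION.sub(REPLACE, "Disith")
```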

Mischieveous > mischievous

An occasional misspelling and mispronunciation of 'mischievous'. I couldn't see it on the list, although it's been a while since I last edited it. Mephtalk 04:42, 23 May 2011 (UTC)

  Done with this edit. -- JHunterJ (talk) 11:38, 23 May 2011 (UTC)

Its + apostrophe

Chris the speller and I have just fixed about 1200 of these in response to this request, and in over 99% of the cases the fix was to remove the apostrophe. I did a database scan for \b[Ii]ts[’'`] (three kinds of single quote), skipping those that matched \b[Ii]ts'' (italic/bold formatting). I don't know how to encode that in a single rule. If anyone thinks this is worth adding, could they add it? -- John of Reading (talk) 16:24, 1 June 2011 (UTC)

How about this:
<Typo word="Its'" find="\b[Ii]ts[’'`]\s" replace="$1ts " />
Notice that the offending apostrophe has to be followed by a whitespace character. This rule will skip bold/italics, and though it might be a little restrictive, I think the vast majority of cases will be followed by a space. Chris the speller yack 21:57, 1 June 2011 (UTC)
Still better would be: <Typo word="Its'" find="\b[Ii]ts[’'`](\s)" replace="$1ts$2" /> Chris the speller yack 22:19, 3 June 2011 (UTC)
Even better would be: <Typo word="Its'" find="\b([Ii])ts[’'`](\s)" replace="$1ts$2" /> GoingBatty (talk) 23:46, 3 June 2011 (UTC)
Thanks for gently correcting my inexcusable blunder. Of course, parens are necessary to capture the first letter. Chris the speller yack 01:02, 4 June 2011 (UTC)
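A quick Python check of the last version above, assuming Python's `re` behaves like the .NET engine for this pattern (`\1` and `\2` stand in for `$1` and `$2`). It also shows why the earlier attempt could not work: that pattern has only one capturing group, so there is no second group for the replacement to refer to.

```python
import re

# Earlier attempt: the first letter is not captured, so only one group exists.
FIRST_ATTEMPT = re.compile(r"\b[Ii]ts[’'`](\s)")

# Final version: group 1 is the first letter, group 2 is the whitespace.
FINAL = re.compile(r"\b([Ii])ts[’'`](\s)")

def fix(text: str) -> str:
    """Drop the stray apostrophe after 'its' when whitespace follows."""
    return FINAL.sub(r"\1ts\2", text)

fixed = fix("The dog wagged its' tail.")
```

Italic markup such as ''its'' is left alone because the apostrophe there is followed by another apostrophe, not by whitespace.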

Why does the Typo rule "-junction" sometimes miss "conjuction"?

To test my outlandish claim, run Typos against Astro (Malaysian satellite television) (please don't save; I will fix it once others have looked and the dust has settled). Yet it fixed "conjuction" in A Boy and His Blob: Trouble on Blobolonia; see this diff. I am using SVN 7734. This is not a show-stopper, as I can run a Find & Replace rule with the "After fixes" box checked. Chris the speller yack 22:15, 3 June 2011 (UTC)

I'm not sure exactly where, but there's some sort of encompassing tag that the wiki engine ignores but AWB respects that's making it not check it. If you strip away all of the text except for that paragraph you'll see that it does fix it. I'll investigate and see if I can find what exactly is doing it. Shadowjams (talk) 00:58, 4 June 2011 (UTC)
Found it. The "’" in U.K.’s is making the AWB engine think it's in quotes, or something, and tripping it out. We incidentally have a MoS about this, and converting "’" to "'" (same with full quote marks) should be common, however it's such a pervasive typo that it's a bore to fix. But this is another example of how the MoS guidance is right on this point. Shadowjams (talk) 01:17, 4 June 2011 (UTC)
Thanks. In a parallel investigation, I found that taking any quotes (straight or curly) from around "live" in the paragraph following the unfixed "conjuction" allowed Typos to nail it. Shows how convoluted this problem is. Now I formally declare war: "Death to curly apostrophes!" and maybe curly double quotes, too. Thanks for the help. Happy editing! Chris the speller yack 02:24, 4 June 2011 (UTC)

Fix for youtube tag

Hi! First of all, thanks for this amazing list! It has been incredibly useful! Though there's a minor error in it. At Wikia, we use <youtube> tags, without capitals. However, AutoWikiBrowser wants to "fix" these tags by changing them into <YouTube>. Could this be fixed? Thanks! 213.93.184.183 (talk) 19:54, 5 June 2011 (UTC)

Oh and, the youtube tag can also hold parameters such as <youtube width=>, so that should be taken into consideration too. 213.93.184.183 (talk) 19:55, 5 June 2011 (UTC)
I've disabled the rule so that it doesn't damage your pages. With luck, one of the regexp experts will be able to fix and re-enable it. -- John of Reading (talk) 20:35, 5 June 2011 (UTC)
  Fixed, I think, with this edit. Would you try it again on an appropriate Wikia page? -- JHunterJ (talk) 22:22, 5 June 2011 (UTC)


(EC) The tag's not case sensitive is it? Isn't this just a cosmetic issue?
Nevertheless, this should fix it: (?<!<)\b(?:Yout|you[Tt])ube\s. It should fix all cases of your tags too because it will just avoid all youtube phrases that begin with <. Shadowjams (talk) 22:31, 5 June 2011 (UTC)
I guess J already did it. That version's a little more expensive, but it's a little clearer to read too. Shadowjams (talk) 22:31, 5 June 2011 (UTC)
Leading with a negative-lookbehind strikes me as more expensive than ending with one. When leading, at every position (not just every occurrence of youtube, but every position) in the text, the parser has to check whether the previous character is not a <, and if not (which is usually), then look for a mal-cased YouTube. With trailing, it only looks back if it has found a mal-cased YouTube, so should be cheaper. The compiler may disagree, but I'd have to see stats. And then I cleaned up the .com check, to look for actual .coms, so it will correctly fix " .... video on Youtube." -- JHunterJ (talk) 02:15, 6 June 2011 (UTC)
Thanks for the fix :)! 213.93.184.183 (talk) 14:55, 6 June 2011 (UTC)
Me again, sorry for not reporting this earlier but I completely forgot: <youtube> now works, but </youtube> (note the slash) is still converted to YouTube. Thanks in advance :). 213.93.184.183 (talk) 22:24, 18 June 2011 (UTC)
Should be   Fixed for that with this edit. -- JHunterJ (talk) 11:47, 19 June 2011 (UTC)
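The leading-lookbehind pattern suggested earlier in this thread can be tried out in Python. Two things in the sketch are assumptions rather than part of the suggestion: the trailing `\s` is wrapped in a group so it can be put back, and the replacement string is my own. It does not reproduce the rule that was actually saved.

```python
import re

# Pattern from the thread, with the trailing \s captured (an adaptation).
FIND = re.compile(r"(?<!<)\b(?:Yout|you[Tt])ube(\s)")

def fix(text: str) -> str:
    """Re-case 'Youtube'/'youtube' unless it directly follows '<'."""
    return FIND.sub(r"YouTube\1", text)

prose = fix("a video on Youtube today")
tag = fix("<youtube width=>")
```

With this pattern the closing tag `</youtube>` also survives, but only because the name is followed by `>` rather than whitespace, not because of the lookbehind.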

What broke?

Re this edit summary. I reloaded the typo list right after saving that and made several fixes. What was breaking? -- JHunterJ (talk) 17:52, 12 June 2011 (UTC)

The error was reading "too many 's" specifically with the Blu-Ray regexp. AWB was refusing to activate Typo correction for me because of it. Stuart.Jamieson (talk) 18:02, 12 June 2011 (UTC)
Odd (esp. since there are no apostrophes in the change). I'm running AWB 5.3.0.0 SVN 7728 on IE 9.0.8112.16421, .NET 2.0.50727.5444, Windows 6.1. Can you tell me which versions if any are different in yours? I wonder if any of my hyphen changes need to be escaped under some environments. Thanks. -- JHunterJ (talk) 18:10, 12 June 2011 (UTC)
.NET is 2.0.50727.4211 and Windows is 6.0 but there was more to the message, I only realised by recreating it - you had missed a closing bracket. Stuart.Jamieson (talk) 18:37, 12 June 2011 (UTC)
Thanks. (I also must have mis-timed my reload of the typo list somehow.) Cheers! -- JHunterJ (talk) 18:40, 12 June 2011 (UTC)

"Communtiy"

I encountered 4 dozen articles with this typo, and suggest adding:

<Typo word="Community" find="\b(C|c)ommuntiy\b" replace="$1ommunity" />
but we already have
<Typo word="Community_" find="\b(C|c)om(?:un|m?unn|m?unn?t)(al(ly)?|ity|ities|ions?|is[mt]s?)\b" replace="$1ommun$2" />
and I don't see a way to combine them. Any takers? Chris the speller yack 18:05, 26 June 2011 (UTC)
I left Jabari Simama as it is for testing purposes. Chris the speller yack 18:16, 26 June 2011 (UTC)
No, not without splitting Communal and Communion and Communism/t from the latter. -- JHunterJ (talk) 20:13, 26 June 2011 (UTC)

  Done by JHunterJ, and a nice job, too. Chris the speller yack 22:26, 26 June 2011 (UTC)
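For the record, a Python check of the two rules quoted above shows why the existing "Community_" rule could not catch this typo: after "communt" it requires one of the listed suffixes, and "iy" is not among them. The sketch assumes Python's `re` matches these patterns the same way the .NET engine does, and it does not show the rule as finally saved.

```python
import re

EXISTING = re.compile(
    r"\b(C|c)om(?:un|m?unn|m?unn?t)(al(ly)?|ity|ities|ions?|is[mt]s?)\b"
)
PROPOSED = re.compile(r"\b(C|c)ommuntiy\b")

def apply_existing(text: str) -> str:
    return EXISTING.sub(r"\1ommun\2", text)

def apply_proposed(text: str) -> str:
    return PROPOSED.sub(r"\1ommunity", text)

missed = EXISTING.search("communtiy")      # None: the existing rule does not fire
caught = apply_proposed("Communtiy")
```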

Inheritence -> inhernheritance?

AWB is correcting "inheritence" into "inhernheritance" for some reason--it happened to me on several articles in a row. Sample: [9] Thanks! -- Khazar (talk) 15:37, 13 July 2011 (UTC)

Fixed -- John of Reading (talk) 16:02, 13 July 2011 (UTC)
Thanks! -- Khazar (talk) 16:19, 13 July 2011 (UTC)

A new typo fix suggestion

I would like to make a typo-correction suggestion that relates to capitalization, specifically of the CamelCase variety. I'm not sure how often these words occur outside of Nickelodeon- and cartoon-related articles, but it's not uncommon to see the character name "SpongeBob SquarePants" from the show by the same name incorrectly written without the capital "B" or "P" in his name. My suggestion would be to correct "Spongebob" to "SpongeBob" and "Squarepants" to "SquarePants" through AWB. --Sgt. R.K. Blue (talk) 23:02, 17 July 2011 (UTC)

I added a rule to fix his name when both the capital "B" and capital "P" are lowercase. Let's see how this works, and maybe others can expand the rule to fix other cases. Enjoy! GoingBatty (talk) 00:20, 18 July 2011 (UTC)
Changed my mind and split this into two rules: one for "SpongeBob" and one for "SquarePants". GoingBatty (talk) 00:41, 18 July 2011 (UTC)
Interesting how the rule doesn't always fix "Spongebob" (e.g. SpongeBob SquarePants (season 3)) GoingBatty (talk) 00:53, 18 July 2011 (UTC)
I hacked away at it for a while. If you take out the Italian inter-wiki line, then the Typo fixes work. Then you can put the inter-wiki line back. Go figure. Sometimes AWB is battier than you are. Chris the speller yack 04:41, 18 July 2011 (UTC)
Thanks for figuring that out, Chris! Guess when the instructions state "Typo fixing is automatically prevented on image names, templates, wikilink targets and quotes", that means interwiki links too. GoingBatty (talk) 04:46, 18 July 2011 (UTC)
Well, it's a start, anyway. Thanks for giving it a go. --Sgt. R.K. Blue (talk) 03:45, 18 July 2011 (UTC)
After fixing over 100 articles, I just updated the rules to also catch "Sponge Bob" and "Square Pants". Enjoy! GoingBatty (talk) 04:52, 18 July 2011 (UTC)
Thanks also for all the SpongeBob-related fixes; I ran AWB for a short time a little while ago and didn't pick up any more, though I'm sure there are still many out there yet to be discovered. Meanwhile, another CamelCase correction also struck me that might be worth adding: I've sometimes seen the company DreamWorks incorrectly written without the capital "W" (Dreamworks). --Sgt. R.K. Blue (talk) 08:47, 18 July 2011 (UTC)
There's still a lot of work to do for the "SpongeBob" fixes for those pages where the Typo rule won't fix it. The SpongeBob SquarePants (season 3) example above indicates that the typo rule won't fix pages that contain an Italian interwiki link with "Spongebob" in the title. Also, the typo rule won't fix any pages that link to Spongebob Squarepants or other spelling variants created as redirects.
I created a typo rule for "DreamWorks", as you requested. GoingBatty (talk) 16:50, 18 July 2011 (UTC)
  Done - I believe all the instances of improper capitalization for "SpongeBob" and "SquarePants" have now been fixed (via typo fixing or find/replace). I'll leave it to you to work with the other wikis to get their SpongeBob-related articles fixed.  :-) GoingBatty (talk) 04:38, 20 July 2011 (UTC)

Regex for SI units

I see that the regex for SI units has:

  • ([\d\.]+(?:\s|&nbsp;|-)?)foobar

The following appears to be identical in effect:

  • ([\d\.](?:\s|&nbsp;|-)?)foobar

Furthermore, I think 7. foobar (note trailing decimal) are not worthy targets. Thus the following code might be tighter:

  • (\d(?:\s|&nbsp;|-)?)foobar

Does that seem reasonable? Lightmouse (talk) 12:09, 18 July 2011 (UTC)

I think that [\d\.]+ is deliberate in these SI rules, so that the edit summary is more informative. I don't feel strongly about the trailing decimal; perhaps wait to see how many false positives get reported. -- John of Reading (talk) 15:02, 18 July 2011 (UTC)

The edit summary doesn't show the regex so that can't be the reason. The [\d\.]+ regex was added by User:BillFlis. Perhaps we should ask him. Lightmouse (talk) 17:25, 18 July 2011 (UTC)

The edit summary doesn't show the regex but can show the string that the regex matched. So if the article said "123.4 foo" and now says "123.4 Foo", then the edit summary will say that instead of just "4 foo -> 4 Foo". -- John of Reading (talk) 18:43, 18 July 2011 (UTC)

Ah. In that case, it should be:

  • ([\d,\.]*\d(?:\s|&nbsp;|-)?)foobar

Consider the very common format "12,000 foobar". Lightmouse (talk) 18:52, 18 July 2011 (UTC)

My 2 cents: I think you all are on the right track as to improving the edit summary by showing as much as possible of the phrase that's being corrected. But, while the naked-decimal-point form "7. foobar" is generally deprecated (I haven't checked the style guide used here), I don't think that means we shouldn't fix the foobar part if that's needed. So we can get away with Lightmouse's last suggestion without requiring the decimal-point-covering digit:
    • ([\d,\.]+(?:\s|&nbsp;|-)?)foobar

--BillFlis (talk) 20:32, 20 July 2011 (UTC)
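To compare the variants discussed in this section, the Python sketch below reports the text each one matches, which is what would end up in the edit summary according to the explanation above. The dictionary labels and the `matched_text` helper are my own, "foobar" is the placeholder unit used in the thread, and Python's `re` is assumed to agree with the .NET engine on these patterns.

```python
import re

SEP = r"(?:\s|&nbsp;|-)?"  # optional separator between number and unit

CANDIDATES = {
    "original": r"([\d\.]+" + SEP + r")foobar",
    "last digit only": r"(\d" + SEP + r")foobar",
    "digits ending in a digit": r"([\d,\.]*\d" + SEP + r")foobar",
    "digits, commas and points": r"([\d,\.]+" + SEP + r")foobar",
}

def matched_text(name: str, text: str):
    """Return the full matched string for the named variant, or None."""
    m = re.search(CANDIDATES[name], text)
    return m.group(0) if m else None

summary_original = matched_text("original", "12,000 foobar")
summary_with_comma = matched_text("digits ending in a digit", "12,000 foobar")
```

The original pattern stops at the comma, so only "000 foobar" is matched for "12,000 foobar"; the two comma-aware variants differ only in whether a trailing decimal point such as "7. foobar" is matched.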

I seem to remember some previous discussion about speeding up the code. It would be simpler just to check for the last digit. Can we make the edit summary repeat the unit name without the numeric value? Lightmouse (talk) 21:11, 20 July 2011 (UTC)

Second question: what does the preceding "fb (foobar)" section of the code do? Lightmouse (talk) 21:28, 20 July 2011 (UTC)

Is this a duplication?

The kg/km code appears to have a duplication. It says:

  • (?:K([gm])'?s?|[Kk]([gm])'?s)

is that the same as:

  • (?:[Kk]([gm])'?s?)

Regards Lightmouse (talk) 12:06, 27 July 2011 (UTC)

No. The strings "kg" or "km" will match the second regex but not the first. The rule mustn't match a correct use such as "3 kg". -- John of Reading (talk) 12:17, 27 July 2011 (UTC)

Ah yes, I see the difference in match. However, I can't see the difference in effect. Lightmouse (talk) 12:25, 27 July 2011 (UTC)

If the rule said
<Typo word="kg/km (kilo(gram/metre))" find="([\d\.]+(?:\s|&nbsp;|-)?)(?:[Kk]([gm])'?s?)\b" replace="$1k$2$3" />
then it would find "3 km" and replace it with "3 km". This goes against the second bullet point here. Without looking at the source code I don't know for sure, but I imagine this could lead to edit summaries saying "typos fixed: 3 km -> 3 km". -- John of Reading (talk) 12:45, 27 July 2011 (UTC)

Ah. I tend not to worry about zero change matches in my regex. But I don't link the summary to the match. Even with that constraint, is the regex as simple as it can be? The upper case 'K' appears to be duplicated. How about:

  • (?:K([gm])'?s?|k([gm])'?s)

Lightmouse (talk) 13:02, 27 July 2011 (UTC)

Yes, I think that has the same effect as (?:K([gm])'?s?|[Kk]([gm])'?s) and is a little shorter. -- John of Reading (talk) 13:14, 27 July 2011 (UTC)
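A small Python check of the three unit patterns from this section, run against a hand-picked sample list (the samples and the `accepted` helper are my own, and whole-string matching is used for simplicity). It is a sanity check on a few strings, not a proof of equivalence.

```python
import re

CURRENT = re.compile(r"(?:K([gm])'?s?|[Kk]([gm])'?s)")
NAIVE = re.compile(r"(?:[Kk]([gm])'?s?)")
SHORTER = re.compile(r"(?:K([gm])'?s?|k([gm])'?s)")

SAMPLES = ["kg", "km", "Kg", "Km", "kgs", "kms", "Kgs", "Kms", "kg's", "Km's"]

def accepted(pattern):
    """Samples that the pattern matches in full."""
    return [s for s in SAMPLES if pattern.fullmatch(s)]

current_hits = accepted(CURRENT)
shorter_hits = accepted(SHORTER)
naive_hits = accepted(NAIVE)
```

On these samples the current and shorter patterns accept the same strings and leave the correct "kg" and "km" alone, while the naive pattern also matches the correct forms.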

I'll leave it to somebody else to check and upload. I'm not confident enough. Lightmouse (talk) 13:19, 27 July 2011 (UTC)

Latin Wikipedia

I don't know where the problem is but AWB is trying to correct typos on the Latin Wikipedia. It's trying to correct Latin words to English. McLerristarr | Mclay1 10:40, 1 August 2011 (UTC)

I guess from this search that the Latin Wikipedia needs to create a placeholder page named la:Vicipaedia:AutoWikiBrowser/Typos. -- John of Reading (talk) 10:56, 1 August 2011 (UTC)
Will an empty page work? McLerristarr | Mclay1 11:03, 1 August 2011 (UTC)
Try it and see? A better answer would need someone with the AWB source code handy. -- John of Reading (talk) 11:12, 1 August 2011 (UTC)

Removing disabled rules?

Would removing disabled rules from the list speed up AWB page processing? If so, should they be archived somewhere so editors can refer to them before adding new rules? Thanks! GoingBatty (talk) 13:22, 31 July 2011 (UTC)

I think it might trim a few milliseconds off the startup time, but would have no effect on the per-page processing time. -- John of Reading (talk) 13:27, 31 July 2011 (UTC)
Disabled rules don't affect processing time. Rjwilmsi 23:00, 3 August 2011 (UTC)

Lookbehind and wikEd

Currently there are some rules such as

<Typo word="Broadly" find="\b([Bb])roadyl?\b(?<!\bBroady\b)" replace="$1roadly" />

using lookbehind, but this is not supported in JavaScript, and as such these rules are not working in wikEd. Would it be possible to change them to something equivalent without using lookbehind? Helder 19:07, 1 August 2011 (UTC)

I don't think it's possible to do generally across all rules. It is possible to hack your way around these in some instances. For instance, doing tricks excluding character classes [^a] or writing more complete rules. In the vast majority of cases instances with a look-behind/ahead are trying to exclude some particular version of a word. I suppose you could ignore rules like that altogether. Maybe someone else knows of some code you could plug into those programs that would do the lookahead/behinds for you, but I can't think of a general, across the board fix, as those features do things that more primitive regex's can't. Shadowjams (talk) 16:48, 2 August 2011 (UTC)
Maybe I'm wildly misunderstanding how wikEd works, but is there a library or plugin for Greasemonkey that would allow this? The JavaScript-only version, I guess, uses the native regex engine... it does actually support lookahead... so perhaps there is a way to convert lookbehinds into lookaheads in an automated way.
Here's a question for computer science people... can all look aheads be made into look behinds or vice versa? If not then I'd suspect there's no elegant way to do it, but if yes... Shadowjams (talk) 16:59, 2 August 2011 (UTC)

<Typo word="Broadly" find="\b(?:(Broad)yl|(broad)yl?)\b" replace="$1$2ly" /> is actually a few characters shorter.--BillFlis (talk) 16:57, 2 August 2011 (UTC)

I think he's trying to do this for all rules and he's using that one as an example. Shadowjams (talk) 16:59, 2 August 2011 (UTC)
The lookbehinds here are generally more efficient -- the number of characters in the pattern is not a good indicator of efficiency. Lookbehinds look for exceptions after a match has been found. Lookaheads look for exceptions at every character and then look for a match. -- JHunterJ (talk) 17:03, 2 August 2011 (UTC)

Also, per Wikipedia talk:AutoWikiBrowser/Typos/Archive 1#The negative lookbehinds used in the regex lists, wikEd's not broken by them, it just skips them. Can a secondary wikEd list be created that reproduces whichever skipped regexps are sought? -- JHunterJ (talk) 17:05, 2 August 2011 (UTC)
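
For the "Broadly" example specifically, the lookbehind rule and the lookbehind-free rewrite can be compared side by side. A minimal sketch in Python follows, on the assumption that its `re` module treats these patterns the same way as the engines discussed here. The replacement for the rewrite uses both capture groups, which is an adjustment assumed for this sketch since only one branch captures on any given match. The word list is illustrative.

```python
import re

# Rule as quoted above: the negative lookbehind protects the surname "Broady".
WITH_LOOKBEHIND = (re.compile(r"\b([Bb])roadyl?\b(?<!\bBroady\b)"), r"\1roadly")

# Lookbehind-free rewrite: the capitalised branch requires the trailing "l",
# so a bare "Broady" never matches.
WITHOUT_LOOKBEHIND = (re.compile(r"\b(?:(Broad)yl|(broad)yl?)\b"), r"\1\2ly")


def apply_rule(rule, text):
    pattern, replacement = rule
    return pattern.sub(replacement, text)


SAMPLES = ["broady", "broadyl", "Broadyl", "Broady", "broadly"]
for word in SAMPLES:
    print(word, "->",
          apply_rule(WITH_LOOKBEHIND, word), "/",
          apply_rule(WITHOUT_LOOKBEHIND, word))
```

This only shows equivalence for one rule; it says nothing about whether every lookbehind in the list can be rewritten this way.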

Simplification

What about changing codes such as ([ab]) to (a|b) to save ~1800 characters (one per rule where this construct is used)? Helder 18:27, 2 August 2011 (UTC)

This was a deliberate edit by Lightmouse (talk · contribs) to improve the per-page processing time. Processing speed is much more important than the size of the page. -- John of Reading (talk) 20:09, 2 August 2011 (UTC)
There was a long discussion about this a few months back and the conclusion was that, while not critical, the character class is quicker and should be preferred, all else being equal. It shouldn't be dogmatically enforced, but the processing speed is important. Shadowjams (talk) 03:12, 3 August 2011 (UTC)
Good to know.
Could anyone tell me how I can test the efficiency of two versions of a given rule? Helder 16:46, 3 August 2011 (UTC)
The debug builds of AWB have a built in typo profiling function. Alternatively you can benchmark regular expressions in general using expresso or JRegexAnalyser or something similar of your choice. Rjwilmsi 22:59, 3 August 2011 (UTC)
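
As a rough illustration of what such a benchmark looks like, here is a minimal sketch in Python using `timeit`. It is an assumption-laden stand-in: Python's regex engine is not the .NET engine AWB uses, so the timings it prints say nothing definite about AWB itself. The two patterns, the sample text and the helper names are made up for illustration.

```python
import re
import timeit

# Two spellings of the same rule: character class versus alternation.
CLASS = re.compile(r"\b([Bb])roadyl?\b")
ALTERNATION = re.compile(r"\b(B|b)roadyl?\b")

# Illustrative text containing two typos per repetition.
SAMPLE = "Broadly speaking, the broady defined term was broadyl applied. " * 200


def matches(pattern):
    # Collect the matched typos so the two patterns can be compared.
    return [m.group(0) for m in pattern.finditer(SAMPLE)]


def time_pattern(pattern, repeats=50):
    # Total seconds for `repeats` full scans of the sample text.
    return timeit.timeit(lambda: pattern.findall(SAMPLE), number=repeats)


print("class:      ", time_pattern(CLASS))
print("alternation:", time_pattern(ALTERNATION))
```

Before comparing timings it is worth confirming, as `matches` allows, that both versions find exactly the same typos.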

D-dropping

G-dropping has a long history in the English language, but d-dropping is relatively new, in my experience. Apparently as a result of young people spending more time at being entertained than at being educated, and the ability of a /d/ sound to disappear between an /s/ sound (or a /z/ sound or a /ʒ/ sound or a /ʃ/ sound) and a /t/ sound, I have been seeing the letter omitted from expressions like those listed below. This can happen when they include verbs in the past tense, but especially when they include past participles (as in is/are/was/were supposed to). Of the words which I have listed here, I found most at http://wordover.com/. It seems to me that some people actually do not know the correct spelling. Even when the past-tense verb or past participle is not followed directly by the word to, sometimes the d is dropped from a word that should have it.

  • supposed to, forced to, advanced to, convinced to, announced to, enticed to, induced to, introduced to, sourced to, reduced to, increased to, decreased to, traced to, ceased to, dispensed to, dispersed to, asked to, passed to (See Note 1)
  • used to ("was accustomed to; utilized to"), pleased to, advised to, authorized to (authorised to), disclosed to, exposed to, proposed to, opposed to, refused to, generalized to (generalised to), advertised to, paused to, poised to, espoused to, surprised to, televised to, closed to (see Note 2)
  • changed to, charged to, enlarged to, acknowledged to, judged to, engaged to, divulged to, diverged to, emerged to, managed to, outraged to, pledged to, surged to, urged to
  • attached to, dispatched to, hitched to, latched to, matched to, pitched to, preached to, reached to, stretched to, switched to, abashed to, banished to, crushed to, dashed to, demolished to, diminished to, embellished to, established to, leashed to, published to, vanished to, wished to

Note 1: The expression passed to ("moved to", transitive or intransitive) should not be confused with the expression past to ("onward to").
Note 2: The expression closed to ("shut to, unopened to; blockaded to") should not be confused with the expression close to ("near to").

This problem can occur also with /l/, /m/, /n/, and /r/: thrilled to, claimed to, inclined to, ordered to. I have not spent time in searching for more examples of these, but I am willing to do so on request.

Can AutoWikiBrowser be used to correct these errors?
Wavelength (talk) 18:44, 23 August 2011 (UTC)
[I am correcting the wikicode in my message.—Wavelength (talk) 18:57, 23 August 2011 (UTC)]

A lot more of these have false positives than the two you point out:
    • Imperative and interrogative moods; e.g., "don't preach to the choir"; "How did you pitch to him?"
    • Noun forms of the same words; e.g., "it's quite a stretch to assume that"; "turn the switch to the left"
These would be difficult to screen out, and would probably have to be done on a case-by-case basis.--BillFlis (talk) 19:56, 23 August 2011 (UTC)
It seems to me that a search limited to expressions with is/are/was/were would find only true positives with most of the expressions which I listed. Here is an example.
  • [is, are, was, were] (not) (very) (actually, always, ever, never, often, only, really, sometimes, truly, usually) suppose to
Exactly one of the auxiliary words is present. At most one expression from each set of parentheses is present.
I have not mastered the AWB code for condensing multiple variables into one line.
Wavelength (talk) 22:30, 23 August 2011 (UTC)
When I said "as in is/are/was/were supposed to" in my first post, I meant that AWB might search for these expressions with is/are/was/were, among other possibilities, but now I am referring to restricting the search to expressions with is/are/was/were.
Wavelength (talk) 22:36, 23 August 2011 (UTC)
I suggest we write up a Find&Replace regular expression, and that you run it on some articles to see how accurate it is in finding genuine grammatical errors. Rjwilmsi 07:50, 24 August 2011 (UTC)
A search of wikipedia for "was please to" (with quotes) returned 2 hits (both errors). Searches for "is please to", "were please to", "am please to", and "are please to" returned no hits.--BillFlis (talk) 13:41, 24 August 2011 (UTC)
There is already a rule for "Supposed to" in "New additions".--BillFlis (talk) 12:49, 24 August 2011 (UTC)
I created a new rule for "(Ex/Op/Pro)posed to", which might have a chance of catching something. If this pans out without problems, it can later be merged with the existing "Supposed to" rule.--BillFlis (talk) 13:54, 24 August 2011 (UTC)
FWIW, that's not “relatively new” at all.[10]
A. di M.plédréachtaí 12:36, 4 September 2011 (UTC)
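
The restricted search described above (exactly one auxiliary, then at most one word from each optional set, then "suppose to") could be condensed into one expression roughly as follows. This is a minimal sketch in Python rather than an actual AWB Find&Replace entry; the names `AUX`, `ADVERBS` and `fix_suppose_to` are hypothetical, and the word lists are copied from the example above.

```python
import re

AUX = r"(?:is|are|was|were)"
ADVERBS = (r"(?:actually|always|ever|never|often|only|really|"
           r"sometimes|truly|usually)")

# One auxiliary, then optionally "not", "very" and one adverb, in that order.
SUPPOSE_TO = re.compile(
    rf"\b({AUX}(?: not)?(?: very)?(?: {ADVERBS})?) suppose to\b"
)


def fix_suppose_to(text):
    # Keep everything before the verb and add the missing "d".
    return SUPPOSE_TO.sub(r"\1 supposed to", text)


for sample in ["He was suppose to go.",
               "They are not really suppose to know.",
               "I suppose to some extent."]:
    print(sample, "->", fix_suppose_to(sample))
```

Requiring the auxiliary is what keeps sentences such as "I suppose to some extent" out of the results.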

Well-known

Can we get rid of "well-known" --> "well known" please? The hyphenated version is perfectly acceptable in my dictionary (Chambers 9th ed.), and seems to be purely a matter of taste. I'm fed up of skipping it.—An  optimist on the run! 19:58, 2 September 2011 (UTC)

Can you provide an example of AWB attempting to make such a change where you thought it should be skipped? That might save us a lot of thrashing around. Chris the speller yack 00:31, 3 September 2011 (UTC)
WP:HYPHEN, subsection 3, point 5, approves of “well-known”.
Wavelength (talk) 20:15, 2 September 2011 (UTC)
I've temporarily disabled the rule while we discuss this topic. Does anyone have a source that states when to use "well-known" versus "well known"? GoingBatty (talk) 22:18, 2 September 2011 (UTC)
I just checked in the Oxford English Dictionary (the enormous 20-volume one). It's given as "well(-)known" denoting that hyphenated and unhyphenated forms are both acceptable. The list of example quotations from literature includes both forms. There is no reason for AWB to change it. — Hebrides (talk) 23:00, 2 September 2011 (UTC)
But WP:HYPHEN, sub-subsection 3, point 5 clearly spells out when the hyphen is needed, and implies that the hyphen is not used predicatively. It would not need to mention different treatments of "well" if either form is acceptable anywhere. Chris the speller yack 00:35, 3 September 2011 (UTC)
The only reason to change it is if "well-known" should be used in some contexts and "well known" should be used in other contexts. GoingBatty (talk) 23:10, 2 September 2011 (UTC)
This is exactly the case. That is why I made the rule to only change the hyphenation in certain contexts. Chris the speller yack 00:38, 3 September 2011 (UTC)
Where is your rule documented? Or do we need to read code to decide whether we think it's right? Dicklyon (talk) 19:07, 4 September 2011 (UTC)
Here's the rule (which I've temporarily disabled):
  • <typo word="(best/well) known" find="\b([Bb]est|[Ww]ell)-known(?<!the (?:best|well)-known)(?= for\b| as\b| by\b| in\b)" replace="$1 known"/>
It means change "best-known" or "well-known" to "best known" or "well known" if it's followed by "for", "as", "by", or "in", but not if it's preceded by "the". GoingBatty (talk) 21:07, 4 September 2011 (UTC)
These links might be useful to editors who wish to evaluate this expression, and I suggest that they be added to Wikipedia:AutoWikiBrowser/Typos as external links for editors to use in evaluating other expressions.
Wavelength (talk) 00:39, 3 September 2011 (UTC)
It has been 9 days since I asked for an example of this rule making a change where it should not have, and none has been provided. I suggest restoring the rule. Chris the speller yack 22:39, 12 September 2011 (UTC)
  Done - rule restored. GoingBatty (talk) 01:25, 13 September 2011 (UTC)
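
For anyone who would rather see the rule's behaviour than read the pattern, here is a minimal sketch in Python. It assumes Python's `re` module handles this lookbehind and lookahead the same way AWB's engine does; the helper name and the sample sentences are illustrative only.

```python
import re

# Pattern copied from the rule quoted above.
WELL_KNOWN = re.compile(
    r"\b([Bb]est|[Ww]ell)-known"
    r"(?<!the (?:best|well)-known)"      # skip "the best-known ..."
    r"(?= for\b| as\b| by\b| in\b)"      # only before for/as/by/in
)


def fix_well_known(text):
    return WELL_KNOWN.sub(r"\1 known", text)


for sample in ["He is well-known for his work.",
               "Best-known as a poet, she also painted.",
               "She is the best-known in the group.",
               "He is a well-known author."]:
    print(sample, "->", fix_well_known(sample))
```

The attributive case ("a well-known author") is left alone because no "for", "as", "by" or "in" follows it.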

If it is used as an adjective before a noun it is hyphenated, see this. So I think it is dangerous and unwise to automatically make that change unless AWB can tell that it is not before a noun. Bubba73 You talkin' to me? 17:21, 7 November 2011 (UTC)

Creedence

The current rule "Credence" tries to change every occurrence of "Creedence/creedence" that is not followed by "Clearwater". This includes "The Ultimate John Fogerty/Creedence Collection", "Creedence drummer ..." and "Creedence-ish". What this rule catches is 100% false positives. It's hard to think up many sentences that start with the common noun "credence", so changing this rule to change only the lower-case word would be a vast improvement. But there are no occurrences of "creedence" in lower case, so I wonder if the rule is badly needed. Am I missing something? Chris the speller yack 15:17, 17 September 2011 (UTC)

disiciple

AWB amends disiciple to dissiciple, but the correct spelling is disciple. Regards, SunCreator (talk) 15:44, 17 September 2011 (UTC)

I updated the "Dissi-" rule so it does not try to change "disiciple", and manually fixed all instances of "disiciple". I'll leave it to others to change the "Disciple" rule to also fix "disiciple" (if needed). GoingBatty (talk) 02:45, 18 September 2011 (UTC)

Advertize

AWB says advertize is a typo and changes it to advertise. While advertise is certainly valid I'm not sure advertize is not i.e. http://www.thefreedictionary.com/advertize but I don't really know as I'm not an American. Regards, SunCreator (talk) 12:04, 11 September 2011 (UTC)

"Advertize" is not preferred by Americans, or any group that I know of. It's an oddball spelling that seems to be tolerated to a fair degree, though most respectable dictionaries do not list it as an alternative to "advertise". I'd rather not see "advertize" used in formal writing, as in WP, but if there were a groundswell of opposition to this Typo rule, I'd be OK with its removal. Chris the speller yack 14:37, 11 September 2011 (UTC)
Funny, I've never thought about that... but I guess that's true. As an American I'll tell you I would never think to write it as advertize (although spell check doesn't seem to mind). I don't mind the rule either way. Shadowjams (talk) 02:54, 16 September 2011 (UTC)
Here is the listing for "advertise" at an authoritative American dictionary (which lists many British variants): http://www.merriam-webster.com/dictionary/advertise No "-ize" variant is given.--BillFlis (talk) 07:43, 30 September 2011 (UTC)

manifactured

manifactured => manufactured. Regards, SunCreator (talk) 20:19, 17 September 2011 (UTC)

  Done by expanding the existing "Manufacture" rule. GoingBatty (talk) 02:50, 18 September 2011 (UTC)

Musial to Musical

The Typo rule "-ical" should not change "Musial" to "Musical", though the lower-case "musial" to "musical" would be fine. In any event, it should not change "Stan Musial". Chris the speller yack 17:55, 18 September 2011 (UTC)

  Done by changing the "-ical" rule so it will change "musial" but not "Musial". (FYI, Stan isn't the only Musial mentioned on Wikipedia.) GoingBatty (talk) 21:42, 18 September 2011 (UTC)

"is is" usually is "is"

In my experience, "is is" should be fixed to "is" at least three quarters of the time; it is currently changed to "it is". MANdARAX  XAЯAbИAM 03:31, 28 September 2011 (UTC)

I agree. A sampling showed probably 90% should be changed to "is" instead of "it is". However, this is examining what is left in WP after years of AWB users clicking "Skip" when the rule does not make an appropriate change, so we can't be sure what the ratio is for new edits. Complicating this is the fact that the rule also tries to change "it it" to "it is". Many times this is caused by a missing period, as in "Fred tried to fly it it got caught in a tree". Again, these would be expected to accumulate as AWB users click "Skip". Something should probably be done, but I think more discussion would be helpful before we change the rule. However, if someone wants to be bold, I won't fight it. Chris the speller yack 15:29, 28 September 2011 (UTC)
I hope users won't click "Skip" when AWB suggests the wrong fix, but instead manually make the correct fix and click "Save". GoingBatty (talk) 16:44, 28 September 2011 (UTC)
Speaking for myself, not usually. --mboverload@ 00:05, 6 October 2011 (UTC)
  Done - Since it's been over a week since Chris' response with no further discussion on Mandarax's proposal, I was bold and made the change. GoingBatty (talk) 04:00, 6 October 2011 (UTC)
Not only are almost all of the "is is" I've just corrected supposed to be "is", but most of them are in the first sentence of the lede! Hope others will help fix these. GoingBatty (talk) 04:30, 6 October 2011 (UTC)

Mediterranian

Could someone with more proficiency in regexes than me please add this fairly common misspelling to the existing (rather complex) rule? Colonies Chris (talk) 19:06, 22 November 2011 (UTC)

  Done with these edits. The problem you were likely facing was that the previous regexp required two typos in the word: one before the "ra" and another one after it. I opted for a negative lookbehind to avoid the null op; the other choice would be to make it two separate rules. -- JHunterJ (talk) 19:23, 22 November 2011 (UTC)
Thanks. Colonies Chris (talk) 23:07, 22 November 2011 (UTC)

Interwiki list

Hi, can someone please add [[hy:Վիքիփեդիա:ԱվտոՎիքիԶննարկիչ/Վրիպակներ]] in the interwiki list? --Vacio (talk) 11:50, 3 December 2011 (UTC)

  Done - though as the page is only semi-protected, you should have been able to do this yourself? -- John of Reading (talk) 11:57, 3 December 2011 (UTC)
Thank you. --Vacio (talk) 12:32, 3 December 2011 (UTC)

well-known again

"Well known" is hyphenated if it is used as an adjective before a noun, see this. So I think it is dangerous and unwise to automatically make that change unless AWB can tell that it is not before a noun. Bubba73 You talkin' to me? 17:21, 7 November 2011 (UTC)

Thanks for providing a source. I've disabled the rule pending further discussion. Shouldn't the hyphenation of "well-known" be a well-known rule? :-) GoingBatty (talk) 17:36, 7 November 2011 (UTC)
The rule only removes the hyphen in cases where "well-known" is immediately followed by "for", "as", "by" or "in". -- John of Reading (talk) 17:41, 7 November 2011 (UTC)
If it uses the rule only if followed by those words, then it should be OK. Yes, it should be a well-known rule. But is the rule well known? (I think those are correct uses.) Bubba73 You talkin' to me? 18:04, 7 November 2011 (UTC)

That's a crowd-sourced answer. Actually language is far more flexible, and not only do words refuse to take on the roles assigned by syntactic rules, but writers use language more flexibly. In this case, for example, Google Books shows that about 10-15% of the uses of "well known for" appear as "well-known for", including in Harper's Magazine and "Foo for Dummies". Rich Farmbrough, 01:16, 12 December 2011 (UTC).

It is perfectly conventional in normal English writing to misuse hyphens 10 to 20% of the time. But I think WP style is to try to get it right. Dicklyon (talk) 01:20, 12 December 2011 (UTC)

Combining collaboration rules

<Typo word="Collaborate" find="\b([Cc])ol(?:(?:abor+|l?aber+|labo)at([a-z]+)(?:\stogether)?|laborat([a-z]+)\stogether)\b" replace="$1ollaborat$2$3" />

(with the $2$3 at the end) would serve to combine those two. But perhaps it's best left separate. I'll let someone else opine. -- JHunterJ (talk) 22:36, 11 December 2011 (UTC)

That looks good to me! Thanks for the creative solution. GoingBatty (talk) 23:02, 11 December 2011 (UTC)

'Publisher', 'work', 'agency' parameters in 'cite' templates

These parameters are frequently wrongly used; for example, "publisher=''The Times''" should be "work=The Times". There are some further complications; the name of the work wrongly declared as 'publisher' is sometimes included in double apostrophes so that it shows italicised (as 'work' automatically does) and sometimes not; the work is sometimes wikilinked and sometimes not; the word 'The' is sometimes part of the name and sometimes not. And quite often organisations that are correctly defined as 'publisher' such as BBC News, will be artificially (and inappropriately) italicised. Also, agencies such as Reuters are sometimes described as 'publisher' where they should be 'agency'. I've developed a set of regexes that (mostly) fix these problems. If there's agreement that this would be a useful RETF task, I'll collect them up and post them here and hopefully people with more regex knowledge than me can simplify them and plug the loopholes. Colonies Chris (talk) 20:46, 23 December 2011 (UTC)

Hi Chris. While I agree with the need for such cleanup, please remember that the Usage section states "When used on AWB, typo-fixing is automatically prevented on image names, templates, wikilink targets and quotes." GoingBatty (talk) 23:02, 23 December 2011 (UTC)
If you'd like me to assist with a bot request in this area let me know. Rjwilmsi 20:15, 30 December 2011 (UTC)

Assitant - Assistant

We have 50 articles with the word assitant. I've manually fixed a few but I think it would be easier to put assitant - assistant into AWB. If someone agrees with me can they make the change?

The "Assist" rule already covers "Assitant" → "Assistant". I'm using AWB to fix these articles now. GoingBatty (talk) 15:43, 30 December 2011 (UTC)
Thanks. If it is already in AWB I'm surprised there were so many, what is the typical lag nowadays between an AWB fixable anomaly going into an article and it being fixed? ϢereSpielChequers 15:53, 30 December 2011 (UTC)
Using Article Blamer on the last 20 "assitant" typos AWB was able to fix, the average lag was 117.5 days (min 17, max 264). GoingBatty (talk) 17:02, 30 December 2011 (UTC)
Thanks GoingBatty. I've linked that answer in to some research I'm doing to challenge a report into our typo fixing standards. Any feedback to User:WereSpielChequers/typo_study would be much appreciated. ϢereSpielChequers 18:22, 30 December 2011 (UTC)
I added my thoughts to User talk:WereSpielChequers/typo study. GoingBatty (talk) 18:50, 30 December 2011 (UTC)

Exceptions needed for -sih --> -ish rule

  • The AWB is "fixing" spellings that don't need to be fixed. Kekasih (Indonesian/Malay word for lover) is turning into kekaish (which has no meaning anywhere) (examples: 1, 2, 3) and Qadarsih (Indonesian/Malay name) --> Qadarish (also no meaning) (examples: 1, 2, 3); the latter would be quite embarrassing if either Indra or Titi Qadarsih get their own articles. Could this be fixed please? I've already fixed the issues in the articles. Crisco 1492 (talk) 09:52, 7 January 2012 (UTC)
  Done Fixed the typo rule, the rule already had the appropriate exceptions but wasn't formatted correctly. Rjwilmsi 10:12, 7 January 2012 (UTC)
I get false positives on this rule a lot too, which I'll try to start listing here so they can be logged as exceptions. Today's is "Tung Tish-shia". Thanks! -- Khazar (talk) 15:35, 21 January 2012 (UTC)

occsionally => occasionally

AWB missed the example here. Regards, SunCreator (talk) 19:41, 9 January 2012 (UTC)

Four other examples of this typo currently exist. Regards, SunCreator (talk) 19:43, 9 January 2012 (UTC)
  Done - Since this typo doesn't happen very "occsionally", I fixed them all manually without adding a typo rule. GoingBatty (talk) 02:14, 10 January 2012 (UTC)

Ngugi wa Thiong'o, Ngugi wa Thiongo, Ngugi Wa Thiong'o => Ngũgĩ wa Thiong'o

Because of the complicated (for Westerners) spelling of his name, almost all of our articles for Kenyan novelist Ngũgĩ wa Thiong'o had one of the above misspellings of his name. I've manually fixed the existing offenders, but would it be possible to autocorrect the above variations in the future, or is this too small a fix? It's only 50 or so instances of this error so far, but it was misspelled in more places than it was properly spelled. Thanks! -- Khazar (talk) 07:14, 15 January 2012 (UTC)

Would all instances of "Ngugi" qualify to be converted to "Ngũgĩ" or is there something distinct about it being followed by "wa"? Would it be possible to generalize the rule for Thiongo to Thiong'o too? Shadowjams (talk) 18:02, 22 January 2012 (UTC)
Honestly, I'm not sure; I happen to know a lot about Ngũgĩ from teaching one of his books, rather than an understanding of Gikuyu language. Probably safer to limit it to him for now. -- Khazar (talk) 18:07, 22 January 2012 (UTC)
Alright, I actually may ask in the language desk in a bit to see if we could generalize the rule more. For the time being, here's a regex that should work. I didn't insert it into the typo list but you can do so if you think it's widespread enough. Or you could use it to scan the database for other instances. I have a dump from a few months ago but you've probably fixed all of them I'd find in that.
<Typo word="Ngũgĩ wa Thiong'o" find="\b[Nn]gugi [Ww]a Th(io|oi)ng['‘’]?o\b" replace="Ngũgĩ wa Thiong'o"/> Shadowjams (talk) 19:38, 22 January 2012 (UTC)
Question asked here. Shadowjams (talk) 19:45, 22 January 2012 (UTC)
Cheers! -- Khazar (talk) 20:09, 22 January 2012 (UTC)
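
A quick way to see which variants the proposed regex catches is sketched below in Python, on the assumption that its `re` module treats this pattern like AWB's engine would. The helper name and the sample spellings are illustrative.

```python
import re

# Pattern and replacement copied from the proposed rule above.
NGUGI = re.compile(r"\b[Nn]gugi [Ww]a Th(io|oi)ng['‘’]?o\b")
CORRECT = "Ngũgĩ wa Thiong'o"


def fix_ngugi(text):
    return NGUGI.sub(CORRECT, text)


for sample in ["Ngugi wa Thiong'o", "Ngugi Wa Thiongo",
               "Ngugi wa Thoing’o", "Ngũgĩ wa Thiong'o"]:
    print(sample, "->", fix_ngugi(sample))
```

The correctly spelled name is untouched because "ũ" and "ĩ" do not match the plain "u" and "i" in the pattern.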

Double-hyphen is converted to an em dash (except between digits, when it becomes an en dash). However, if the word to either side of the hyphens is linked, the conversion is prevented. This means that where the dashes occur in pairs, often only one of them will be fixed.

An example is here,[11] where only the second "--" was autofixed in

included [[Eugene Ormandy]]--who later turned to conducting--and [[Eugene Lehner]].

(I did the other by hand.) — kwami (talk) 05:25, 23 January 2012 (UTC)

Hi Kwami - I believe this is one of AWB's general fixes, not a typo fix. I suggest you create a new bug report for this issue. Good luck! GoingBatty (talk) 17:32, 23 January 2012 (UTC)

Okay. This one should be more straightforward: the template {{ndash}} is not supposed to have a leading space. It includes a nbsp, and adding another messes up the spacing. Should be covered under typos? Deleting the space seems to be okay even when adjacent to a ref, as here. — kwami (talk) 01:26, 25 January 2012 (UTC)

WP:AWB/T#Usage states "When used on AWB, typo-fixing is automatically prevented on image names, templates, wikilink targets and quotes (including indented paragraphs)." Seems like a feature request would be the way to go on this one. Thanks! GoingBatty (talk) 03:13, 25 January 2012 (UTC)

Honshu

User:Ryulong suggested that the "Honshu → Honshū" rule should be removed, since it was decided in December 2010 to refer to the island as Honshu. --bender235 (talk) 19:12, 28 January 2012 (UTC)

  Done GoingBatty (talk) 23:14, 28 January 2012 (UTC)

populary known

How about

populary known => popularly known?

19 instances, no false positives. Cheers, Khazar (talk) 22:55, 29 January 2012 (UTC)

"Populary" isn't a word... is it used correctly anywhere though? Because we could just fix that and leave out the known... although thinking about it all of the popularly .... phrases are going to include past participles right? Off the top of my head that seems right... anybody weigh in on that?
As for the regex: <Typo word="Popularly" find="\b(P|p)opular([ia]l)?y\b" replace="$1opularly" />
You can tack "known" onto the end if there's false positives with the above. Feel free to insert it if it checks out. Shadowjams (talk) 23:29, 29 January 2012 (UTC)
Yeah, my only concern is that it might be a typo for popularity as well as popularly. But let me give it a run later tonight or tomorrow and see what I get. Cheers, Khazar (talk) 23:47, 29 January 2012 (UTC)
Ran it and it looked good. Caught about twenty more "popularly"s, 1 "popularity", 3 "popular"s, but no articles where "populary" was in fact the correct spelling. Inserted the text above. -- Khazar (talk) 01:02, 30 January 2012 (UTC)
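
For reference, the behaviour of the inserted pattern on a few spellings can be sketched in Python as below, assuming its `re` module matches AWB's engine for this simple pattern. The helper name and word list are illustrative.

```python
import re

# Pattern and replacement copied from the rule above.
POPULARLY = re.compile(r"\b(P|p)opular([ia]l)?y\b")


def fix_popularly(text):
    return POPULARLY.sub(r"\1opularly", text)


for sample in ["populary known", "Populary", "popularily",
               "popularly", "popularity"]:
    print(sample, "->", fix_popularly(sample))
```

Note that the rule always produces "popularly", so the few cases where "populary" stood for "popularity" or "popular" still need a manual check.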

Analgous, Analagous => Analogous

How about modifying the rule for "Analogous" to also catch "Analagous" and "Analgous"? I'm having trouble deciphering what's already there for this one--I'm still trying to teach myself how to do this--and I don't want to muck it up. Thanks! -- Khazar (talk) 17:55, 1 February 2012 (UTC)

The rule as it is right now requires there to be either a double n or a double l in the first part of the word, and it allows for other errors, like the two you specify. The biggest thing we try to avoid is rules that will match correct spelling, and so these rules get ever more elaborate, and clever, in trying to ensure that anything that matches the rule is somehow wrong. That's what's going on with this rule.
Two options to incorporate your error. Either we can regroup the rule as it is, making it more complex but keeping it in one rule, or we break out a second rule. The second choice may seem simpler, but it may take longer to run 2, and it may also make long-term maintenance of the rule hard.
Here's my shot at changing the rule that's there now. Maybe someone can weigh in on which is faster (I don't have a good way to benchmark it). I'll insert this for now, but if there's something wrong about it let me know.
<Typo word="Analogous" find="\b([Aa])(?:(?:n(?:n?al|nal?)l(?:oge|[oa]g))|(?:nn?all?a?g))(y|ies|ous)\b" replace="$1nalog$2" />. Shadowjams (talk) 22:00, 1 February 2012 (UTC)
So my changes do one of two matches... either the word's got an extra n or l (first version), in which case we correct it, or it's fine but uses an "a" or misses the o, in which case we correct that too. But it's either or. Before it would not correct for double l or n if the o was correct. Again, please let me know if this breaks anything. I tested it some and it seemed fine though. Shadowjams (talk) 22:05, 1 February 2012 (UTC)
Thanks for tackling this one. I'm glad to see I wasn't the only one intimidated by it; I'm trying to teach myself how to write these rules on my own, but this clearly wasn't the place for me to start. Cheers Khazar (talk) 22:26, 1 February 2012 (UTC)
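
Since this rule is hard to read, a small check of what it does and does not match may help. The following is a minimal sketch in Python, assuming its `re` module agrees with AWB's engine for this pattern; the helper name and the word list are illustrative.

```python
import re

# Pattern and replacement copied from the revised rule above.
ANALOGOUS = re.compile(
    r"\b([Aa])(?:(?:n(?:n?al|nal?)l(?:oge|[oa]g))|(?:nn?all?a?g))(y|ies|ous)\b"
)


def fix_analogous(text):
    return ANALOGOUS.sub(r"\1nalog\2", text)


for sample in ["analagous", "analgous", "annalogous", "anallogous",
               "Analagy", "analogous", "analogy"]:
    print(sample, "->", fix_analogous(sample))
```

The two correct spellings at the end of the list are the important cases: neither branch of the alternation matches them, so the rule never reports a zero-change fix.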

Inforamtion --> information

There appear to be multiple cases of the above misspelling on Wikipedia. Alex (talk) 20:39, 1 February 2012 (UTC)

I have added a rule to handle this and a few other misspellings of it, and of "informational". Chris the speller yack 21:43, 1 February 2012 (UTC)

Removing spaces before a comma

Can someone with regexp skills improve the rule named " ,"? If the text is "foo ,bar ,baz" then the rule changes it to "foo,bar,baz" which I think is worse. The correct fix would be "foo, bar, baz", but if that can't be managed then can the rule disable itself when there is no space after the comma?

Ideally I'd like this to be part of the general fixes, like all the other fiddly rules that shuffle spaces and punctuation, as this rule leads to clutter in the edit summary - I always tick "Add replacements to edit summary". But that's not an issue for this page. -- John of Reading (talk) 14:25, 4 February 2012 (UTC)

I've definitely encountered this problem, too, but my experience has been that for every "foo ,bar" I run into, I encounter 2-3 "foo , bar"s--in other words, that the rule autocorrects correctly more than it's wrong. That's just a subjective impression, though. Khazar (talk) 17:15, 4 February 2012 (UTC)
  Done with this edit. GoingBatty (talk) 19:39, 4 February 2012 (UTC)
Nice work, Batty! I was just looking into it. BTW, I am the one who created this "fiddly" rule (I know John didn't mean any offense). I agree that it would be better in the general fixes. Should one of us make a request? Chris the speller yack 20:01, 4 February 2012 (UTC)
Agreed that a feature request would be appropriate. You may want to include more comma rules than this Typo rule fixes. (e.g. adding a space after a comma, except when there's a digit before & after the comma). Thanks! GoingBatty (talk) 20:40, 4 February 2012 (UTC)
You may want to expand on my existing feature request. GoingBatty (talk) 20:42, 4 February 2012 (UTC)
Thank you, both, that was a very neat change to the rule. I'll update your feature request if no-one else does, but not tonight (UK time). -- John of Reading (talk) 21:45, 4 February 2012 (UTC)
It was working splendidly until I ran typos on Five Major Types of Hydrocarbons of Interest to Petroleum Exploration, where the line after the heading "The lower the specific gravity, the higher the API gravity" causes rather strange behavior. Maybe we should change the first "\s" to " ", or do we really need to find other than a plain old blank? Chris the speller yack 04:04, 6 February 2012 (UTC)
Don't want the typo rule making a replacement that doesn't change how an article is displayed, so I updated the rule. Thanks! GoingBatty (talk) 04:22, 6 February 2012 (UTC)
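The behaviour asked for in this thread can be sketched in Python as follows. This is only an illustration, not the rule that was actually saved to the typo list: the pattern, the name `fix_space_before_comma`, and the choice to require letters (not digits) on both sides are all assumptions. It uses a literal space rather than "\s", in line with the last two comments, so a comma at the start of a new line is left alone.

```python
import re

# Hypothetical pattern, NOT the actual AWB typo rule.
# Matches a literal space before a comma that sits between two letters,
# optionally followed by one space. [^\W\d] means "word character that is
# not a digit", so numbers such as "1 ,000" are left for a human to check.
SPACE_COMMA = re.compile(r"(?<=[^\W\d]) ,[ ]?(?=[^\W\d])")


def fix_space_before_comma(text: str) -> str:
    """Move a stray space from before a comma to after it."""
    return SPACE_COMMA.sub(", ", text)


print(fix_space_before_comma("foo ,bar ,baz"))
print(fix_space_before_comma("foo , bar"))
```

Because the replacement always ends in a space, "foo ,bar" no longer collapses to "foo,bar", which was the original complaint.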

lack of spaces with punctuation

I don't know if anything much can be done, but I've noticed that a lot of articles written by Indian editors (generally on specifically Indian topics) lack spaces after punctuation, and before it in the case of parentheses. Accounting for all the exceptions may make it impractical to address, though. — kwami (talk) 04:09, 6 February 2012 (UTC)

It's too tricky to attack with AWB, I think, but this RegEx code works pretty well with wikEd (select it on "Gadget" page in your preferences), stepping through an article one (missing) space at a time:
Find:   ([,\.\?\)])(?!\d)(?!com\b|gov\b|jpg\b|net\b|org\b|pdf\b|png\b|svg\b|dll\b|php\b|htm|in\b|asp|\w+=)(\w)(?![-a-zA-Z0-9\.]+(?:.com\b|.net\b|.org\b|.gov\b|.in\b))
Repl:   $1 $2
The biggest hangup is URLs, which the code avoids to a degree. In wikEd you can click a button to hide the references, and that gets quite a few of the false positives out of the way. The other things to watch out for are dotted acronyms (e.g. S.P.E.C.T.R.E.) and "e.g."!!!, and unspaced initials (e.g. M.K.Gandhi), especially within links. If you find a paragraph or two that is really plain text and visually lacks these pitfalls, highlight the paragraph(s) and click the replace button to change 'em all at once. Have fun! Chris the speller yack 05:31, 6 February 2012 (UTC)
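For anyone who wants to try the pattern outside wikEd, here is a rough port to Python's `re` module. The regular expression is copied unchanged from the Find line above; the only adaptation is that the replacement "$1 $2" is written as `\1 \2` in Python. The function name `add_missing_spaces` is made up for this sketch, and, unlike stepping through matches in wikEd, it replaces every match at once, so the same cautions about URLs, acronyms and initials apply.

```python
import re

# The Find pattern from the comment above, copied as-is.
MISSING_SPACE = re.compile(
    r"([,\.\?\)])(?!\d)"
    r"(?!com\b|gov\b|jpg\b|net\b|org\b|pdf\b|png\b|svg\b|dll\b|php\b|htm|in\b|asp|\w+=)"
    r"(\w)"
    r"(?![-a-zA-Z0-9\.]+(?:.com\b|.net\b|.org\b|.gov\b|.in\b))"
)


def add_missing_spaces(text: str) -> str:
    """Insert a space after , . ? ) when a word character follows directly."""
    # "$1 $2" in wikEd / .NET syntax corresponds to r"\1 \2" in Python.
    return MISSING_SPACE.sub(r"\1 \2", text)


print(add_missing_spaces("Indian topics.The city,which is old"))
```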
I don't know whether you were just busy, or if the length of that, or unfamiliarity with wikEd scared you off. Never mind, there is hope. I have F&R rules for AWB that do a pretty good job and produce few false positives.
Find:   \.([A-Z][a-z]+\b)(?<= [A-Z]?[a-z]+\.[A-Z][a-z]+)(?= )
Repl:   . $1
Find:   ,(?<!inline,)([A-Z]?[a-z]+\b)(?<= [A-Z]?[a-z]+,[A-Z]?[a-z]+)(?=[,\s])
Repl:   , $1
I would like you and other AWB users to run this through a few hundred articles and give feedback here. I think it can be considered for inclusion as a WP/Typo rule, but leaving out the "(?<!inline,)" lookbehind, which should not be necessary for a Typo rule, since Typo rules leave templates alone. That exception is to prevent changing "display= inline,title" in Coord templates. You may look at two edits of Haripad that I made today, the first with the "mild" AWB F&R rules (fixed 32 periods and 15 commas), and the second with selective use of the rough-edged wikEd RegEx code I posted at the top of this discussion. This gives a pretty good feel of what each can accomplish. Chris the speller yack 17:21, 7 February 2012 (UTC)
Perhaps I should explain what the "mild" rules are trying to fix. The first fixes space-word-period-uppercaseword-space, and the second fixes space-word-comma-word-space (or left square bracket instead of the ending space). Chris the speller yack 17:31, 7 February 2012 (UTC)
I have similar rules in my AWB. I've noticed the problem as well. I agree though that it's too error prone for AWB to have built in. I used the Find and Replace and clean up as I go. There are instances though, chemistry formulas are what I think of offhand, where there shouldn't be a space after a comma, for instance. It's things like that that make it difficult for AWB's defaults. Shadowjams (talk) 21:55, 6 March 2012 (UTC)

ecspecialy --> ecspecially

Just noticed a slight mistake here: [12] -download ׀ message 04:51, 23 February 2012 (UTC)

That was the only case of "ecspecialy" in all of Wikipedia, and it doesn't seem to warrant a change to the typo rules. It was probably the least of the problems that had beset that article, which is now much improved. Chris the speller yack 08:44, 23 February 2012 (UTC)

Violante

The rule word="-en(ce/t)" too often tries to change the given name "Violante" to "Violente". Perhaps someone who has experience with this rule could give it a good tweak. Chris the speller yack 18:50, 2 March 2012 (UTC)

  Done in this edit. GoingBatty (talk) 19:28, 2 March 2012 (UTC)

swahili -> Swahili

I noticed that AWB autocorrected english => English and french => French in a document that I was just working on, but not swahili => Swahili. Perhaps that could be added as well? Glancing around the Swahili articles (culture, people, language), all appear to use a capital letter. Khazar2 (talk) 18:17, 14 April 2012 (UTC)

  Done -GoingBatty (talk) 18:21, 14 April 2012 (UTC)

including-->includeing error

See diff: [13] -download ׀ message 01:54, 28 March 2012 (UTC)

There's a soft hyphen U+00AD hiding between the "includ" and the "ing", which the software is treating as a word break. An obvious fix is to delete and retype the word, but I'll leave it for now in case anyone has a more general solution. -- John of Reading (talk) 07:13, 28 March 2012 (UTC)
That's very odd. I may run through a database dump and try to get rid of those soft hyphens if there are others lurking around. Shadowjams (talk) 16:21, 15 April 2012 (UTC)
Whoa, there are a lot of these. I looked through the WP:MoS pages and there doesn't seem to have ever been express guidance about this. I can imagine some areas where this would make sense, but it would seem that most of the time it's being introduced by someone copy-pasting from a word-processor.
I'm going to run through and see if any patterns emerge. I think in most cases though, where this occurs in the middle of a paragraph, it does nothing to help improve the formatting, but only makes AWB and other text searching less effective. Shadowjams (talk) 16:42, 15 April 2012 (UTC)
I'm starting a discussion about soft-hyphens at Wikipedia talk:Manual of Style#Soft hyphens. Shadowjams (talk) 19:16, 15 April 2012 (UTC)
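A small Python sketch of how a soft hyphen (U+00AD) can be located and stripped from a piece of text. It is only an illustration of the problem described in this thread, not how AWB or any database scan actually does it; the function names are invented for the example.

```python
SOFT_HYPHEN = "\u00ad"  # invisible in most renderings, but splits the word for searches


def find_soft_hyphens(text: str) -> list[int]:
    """Return the index of every soft hyphen in the text."""
    return [i for i, ch in enumerate(text) if ch == SOFT_HYPHEN]


def strip_soft_hyphens(text: str) -> str:
    """Remove all soft hyphens, leaving the visible text unchanged."""
    return text.replace(SOFT_HYPHEN, "")


word = "includ\u00ading"
print(find_soft_hyphens(word))
print(strip_soft_hyphens(word))
```

Whether stripping is appropriate everywhere is exactly the question raised at the Manual of Style discussion, so a scan that only reports positions may be the safer first step.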

Deleting "Todo" --> "To do"

I've come across this at least a dozen times and not once was it warranted... it's always the Spanish word. FWIW I don't focus on Hispanospheric articles. I'd like to delete it from the list but am thinking I should discuss it first... has anyone else had this problem? PhnomPencil talk contribs 20:14, 20 April 2012 (UTC)

Delete I've noticed this too, though haven't kept a count. -- John of Reading (talk) 08:44, 21 April 2012 (UTC)
Delete 20+ Times today alone. Never once been 'to do'. ChrisGualtieri (talk) 06:36, 25 April 2012 (UTC)
I've removed it. -- John of Reading (talk) 06:53, 25 April 2012 (UTC)
Ah, was just about to... thanks, John of Reading. PhnomPencil talk contribs 18:38, 25 April 2012 (UTC)

British English and 'manoeuvre'

While going through an article I found the word "manoeuverability", which AWB changed to "maneuverability", the American version. Then I got a message saying I had changed it from the UK version to the American version... which is odd because it doesn't typically trigger on it. So I looked up the matter in MacMillan. [14] MacMillan states it should be 'manoeuvrable', which means 'manoeuvrability', not 'maneuverability'. The spelling in the original article is wrong, American or British. A lot of articles seem to use 'manoeuVERAbility' rather than 'manoeuVRAbility', and given what I assume, could someone verify and update the typo list to that effect? ChrisGualtieri (talk) 06:56, 25 April 2012 (UTC)

Here's the rule as it stands now:
<Typo word="Maneuverable" find="\b([Mm])anoeuverab(ility|le)\b" replace="$1aneuverab$2" />
My "Concise Oxford" confirms "manoeuvre", "manoeuvrability" as the British spellings and "maneuver" as the US spelling. I've also seen the entry for "manoeuverability" in Wikipedia:Lists of common misspellings/M, where it is listed with two possible fixes, "maneuverability [American], manoeuvrability". So I'm going to delete this rule from the automatic list and add "manoeuverable" to Wikipedia:Lists of common misspellings/M. This misspelling is too hard to fix automatically, as only a human editor can decide which fix is appropriate. -- John of Reading (talk) 10:15, 25 April 2012 (UTC)
A good call for the time being. I'm sure we could make a rule for it, but I'm not sure how to do so. ChrisGualtieri (talk) 14:35, 25 April 2012 (UTC)
I've worked through them by hand. Definitely too hard for an automatic spelling rule. -- John of Reading (talk) 16:35, 26 April 2012 (UTC)
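To make the ambiguity concrete, here is a Python sketch. `OLD_RULE` is the find pattern from the rule quoted above, with "$1"/"$2" written as `\1`/`\2`; `candidate_fixes` is a hypothetical helper, not part of AWB, that simply lists both spellings so a human can choose.

```python
import re

# Find pattern taken from the quoted <Typo word="Maneuverable" ... /> rule.
OLD_RULE = re.compile(r"\b([Mm])anoeuverab(ility|le)\b")


def old_rule_fix(text: str) -> str:
    """What the removed rule did: always pick the US spelling."""
    return OLD_RULE.sub(r"\1aneuverab\2", text)


def candidate_fixes(word: str) -> list[str]:
    """Hypothetical helper: offer the US and the British fix for the misspelling."""
    m = OLD_RULE.fullmatch(word)
    if m is None:
        return []
    first, suffix = m.groups()
    return [f"{first}aneuverab{suffix}", f"{first}anoeuvrab{suffix}"]


print(old_rule_fix("manoeuverability"))
print(candidate_fixes("Manoeuverable"))
```

Since a regex replacement can only return one string, it cannot know which variety of English the article uses, which is why the misspelling was moved to the manually worked list.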

Bicicleta -> Bicycleta

Not sure why AWB keeps trying to go from Bicicleta to Bicycleta. Bicicleta seems to be valid, but I've hit on 10+ false positives so I was wondering what everyone else thinks about this change. As it is not English, I'd opt for its removal if in doubt. ChrisGualtieri (talk) 18:35, 26 April 2012 (UTC)

I tweaked the "-cycle" rule to leave "bicicleta" alone. Chris the speller yack 01:47, 27 April 2012 (UTC)

"audion" vs "audition"

AWB incorrectly tried to change "audion" to "audition" in the context of the Oscillation article. diff. Turns out "audion" is actually a word. See this search. Maybe this is one of those cases that we just have to check that our fix is proper, but perhaps the regex can be adjusted somehow. Just wanted to point it out anyway. Jesse V. (talk) 19:40, 20 April 2012 (UTC)

  Done in this edit. GoingBatty (talk) 22:38, 28 April 2012 (UTC)

Is it 'French fries' or 'french fries'?

This has come up several times as well. Do we capitalize the 'f' in french fries? Seems like AWB wants to mark it because of the nation. Possible link to 'French horn' as well? ChrisGualtieri (talk) 18:37, 26 April 2012 (UTC)

Every reputable dictionary that I have checked has "French fries" and "French horn", so it looks like AWB/Typos is working correctly. Chris the speller yack 01:31, 27 April 2012 (UTC)
Just double checking. Thank you! ChrisGualtieri (talk) 13:13, 27 April 2012 (UTC)

Nobel, Pulitzer Price => Prize

One common misspelling I come across is "price" for "prize". While obviously we can't get all of these with Regex, how about the following fixes to catch at least some of the more common?

  • Nobel Price => Nobel Prize (40ish results)
  • Pulitzer Price => Pulitzer Prize (20 or so)
  • Peace Price => Peace Prize (40 or so)
  • literary price => literary prize (6)

Ideas for other famous prices we might add?

-- Khazar2 (talk) 23:45, 29 April 2012 (UTC)

  Done - added a rule for the famous prizes. We can fix "literary price" manually. GoingBatty (talk) 00:24, 30 April 2012 (UTC)
Partially undone - don't want to incorrectly change the name of the Sesame Street character Dr. Nobel Price. GoingBatty (talk) 02:15, 1 May 2012 (UTC)

twitter -> Twitter

Is having a definition for 'twitter -> Twitter' possible? We currently have Facebook and Myspace, but I've noticed twitter doesn't flag. The word 'twitter' is comparatively rare to 'Twitter'. ChrisGualtieri (talk) 16:03, 30 April 2012 (UTC)

I added rules for Facebook, Myspace, and Twitter a year ago, but the rule for Twitter was reverted since "twitter is a word".
So I use a find & replace rule instead: "\stwitter([\s,])" → " Twitter$1". GoingBatty (talk) 01:41, 1 May 2012 (UTC)
How about a rule that looks at the next word:
<Typo word="twitter" find="\btwitter(\s+)(accounts?|feeds?|hashtags?|pages?|profiles?|sites?|that)\b" replace="Twitter$1$2" />
-- John of Reading (talk) 07:34, 1 May 2012 (UTC)
Not sure about "that", but it seems reasonable that the rest would only pertain to Twitter. GoingBatty (talk) 23:10, 1 May 2012 (UTC)
About 100 of these 835 articles have a lowercase "twitter". -- John of Reading (talk) 06:59, 2 May 2012 (UTC)
Willing to try "twitter that", but if there are false positives, we may need to have another rule like "(announced|posted)\s+(on|via)\s+(his|her|their)\s+twitter". GoingBatty (talk) 16:26, 2 May 2012 (UTC)
I've added a version of my rule, using a lookahead so that the second word doesn't end up in the edit summaries. -- John of Reading (talk) 17:32, 2 May 2012 (UTC)
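The lookahead idea can be sketched in Python like this. It is a guess at the shape of the rule, not the version actually added to the typo list: the word list is taken from the proposal above, "that" is left out here because of the doubt expressed about it, and the function name is invented.

```python
import re

# Hypothetical lookahead variant of the proposed rule. Only "twitter" itself
# is matched; the following word is checked but not consumed, so it would
# not appear in a "twitter -> Twitter" edit summary.
TWITTER = re.compile(
    r"\btwitter(?=\s+(?:accounts?|feeds?|hashtags?|pages?|profiles?|sites?)\b)"
)


def capitalise_twitter(text: str) -> str:
    """Capitalise 'twitter' only when the next word suggests the website."""
    return TWITTER.sub("Twitter", text)


print(capitalise_twitter("posted on his twitter account"))
print(capitalise_twitter("the birds twitter loudly"))
```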

Universtity

Could some kind regexpert add this to the 'University' line? It's so complex already that I'm wary of tampering with it. Colonies Chris (talk) 10:19, 7 May 2012 (UTC)

  Done. Rjwilmsi 20:50, 7 May 2012 (UTC)
Thanks. Colonies Chris (talk) 08:55, 8 May 2012 (UTC)

More questions/suggestions

Another odd notice I just discovered. We have womens to women's, but not mens to men's. Kinda strange when viewing sports teams to have one correct itself and the other avoided. ChrisGualtieri (talk) 02:49, 1 May 2012 (UTC)

See Wikipedia talk:AutoWikiBrowser/Typos/Archive 2#mens → men's. GoingBatty (talk) 03:41, 1 May 2012 (UTC)
Ah. I figured something was preventing it. At the risk of not making another section for every issue I'll just rename this and move to my next point. Many articles have this -{{okina}}ie which triggers the 'ie to i.e.' Is it possible to create an exception to the okina matter? I've seen like a dozen of these and the use of the okina is always in words like, "Lāʻie" Or is this a matter of it existing but looking for the character for the okina? Again, I really do not understand the Typoscan database, otherwise I'd try to find my own answer. ChrisGualtieri (talk) 03:54, 1 May 2012 (UTC)
I just added a rule to only fix "mens" if it is a sports phrase: "mens basketball", "mens lacrosse", "mens sports", "mens team", "mens tennis" or "mens and womens". Hope this helps! GoingBatty (talk) 04:26, 1 May 2012 (UTC)
Precisely the thing I was looking for! Thanks. ChrisGualtieri (talk) 04:35, 1 May 2012 (UTC)
As for -{{okina}}ie, that's a tough one. I noticed it does not try to fix "Lāʻie", so if you create a find and replace rule to change {{okina}} to ʻ (and don't check the After fixes box), that should fix the problem. Maybe someone else can come up with a more elegant solution. GoingBatty (talk) 04:42, 1 May 2012 (UTC)
Some years ago I remember changing an Okina and getting a sound drubbing for doing so. I'll dig in my archives. Rich Farmbrough, 19:11, 10 May 2012 (UTC).
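A rough Python sketch of the context-restricted "mens" fix described above. The phrase list comes from that comment, but the patterns and the function name are assumptions, not the rule as saved.

```python
import re

# Hypothetical patterns; the phrase list is from the comment above.
MENS_AND_WOMENS = re.compile(r"\b([Mm])ens and womens\b")
MENS_SPORT = re.compile(
    r"\b([Mm])ens(?=\s+(?:basketball|lacrosse|sports|team|tennis)\b)"
)


def fix_mens(text: str) -> str:
    """Add the apostrophe to 'mens' only in the listed sports phrases."""
    text = MENS_AND_WOMENS.sub(r"\1en's and women's", text)
    return MENS_SPORT.sub(r"\1en's", text)


print(fix_mens("the mens basketball team"))
print(fix_mens("mens rea"))
```

Restricting the match to known phrases is what keeps Latin terms such as "mens rea" untouched.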

"having being"

I have fixed quite a few of these (usually to "having been"), but there are still hundreds left. There are so many articles already afflicted that just fixing them all now probably won't solve the problem, and new cases will be introduced often enough to warrant a Typo rule, but I don't want to do that if there will be many false positives. I have been using a Find & replace rule "\bhaving being (\w+)ed\b" --> "having been $1ed", and it works rather well, because most of the errors to be fixed are of the form "having being relegated" or "having being diagnosed". But this misses "having being sold", "having being built" and "having being previously named". I already fixed a lot of similar cases: "have being" and "had being", but there are definitely false positives for those, such as "who existed when nothing else had being, and who created that which exists after she had come into being". Anyone have any ideas for Typo rules to fix most of these without an unacceptably high rate of false positives? Chris the speller yack 03:11, 19 May 2012 (UTC)
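The Find & replace rule quoted above behaves as follows when ported to Python ("$1" becomes `\1`; the function name is invented for the sketch). The examples show both what it catches and the cases it is said to miss.

```python
import re

# The F&R pattern from the comment above, unchanged.
HAVING_BEING = re.compile(r"\bhaving being (\w+)ed\b")


def fix_having_being(text: str) -> str:
    """Change 'having being <word>ed' to 'having been <word>ed'."""
    return HAVING_BEING.sub(r"having been \1ed", text)


for sample in (
    "having being relegated",
    "having being sold",
    "having being previously named",
):
    print(sample, "->", fix_having_being(sample))
```

Only the first sample changes; "sold" does not end in "ed", and in the third sample the word directly after "being" is "previously".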

Hongkonger

  • typos fixed: Hongkongers → Hong Kongers using AWB

"Hongkonger" is not a typo; just a variant spelling. It should not be autocorrected to Hong Konger. (The accepted terminology would be "Hong Kong people" anyway.) Deryck C. 20:11, 26 May 2012 (UTC)

Hi Deryck! Do you have a source for "Hongkonger" being a word? It's not listed at dictionary.com. Thanks! GoingBatty (talk) 00:17, 27 May 2012 (UTC)
SCMP uses "Hongkonger", eg. [15]. For more examples just Google "Hongkonger" to see 300000+ results. Deryck C. 10:21, 27 May 2012 (UTC)
I'll take SCMP as a reliable source - the change is   Done. GoingBatty (talk) 14:38, 27 May 2012 (UTC)

Buddha and Buddah

Getting quite a few false positives with Buddah Records releases. Fair enough to say I haven't had one misspelling yet. Perhaps this can be disabled for now? ChrisGualtieri (talk) 10:42, 28 May 2012 (UTC)

By chance, one of the typos I am currently working on is "Buddah > Buddha" from Wikipedia:Lists of common misspellings/B. So far I've made just one fix and found dozens of false positives. So yes, I'm going to remove this typo rule. -- John of Reading (talk) 11:59, 28 May 2012 (UTC)

Cleaning up New additions section

How long should rules be in the New additions section before they get moved into the appropriate section? Thanks! GoingBatty (talk) 15:02, 28 May 2012 (UTC)

merged together => merged

There's about 500 "merged together"s in Wikipedia. This is a commonly listed redundancy (e.g., [16]) that appears to be avoided by major media organizations. (The New York Times used the word "merged" 1000 times this year, with zero uses of the phrase "merged together", for example.) I'd suggest adding a fix replacing merged together with merged; I gave this a trial run and made about 100 replacements without finding any false positives. Khazar2 (talk) 20:17, 28 May 2012 (UTC)

  Added along with "merge", "merges", and "merging". GoingBatty (talk) 21:00, 28 May 2012 (UTC)

Where RETF rules are applied

The documentation says "When used on AWB, typo-fixing is automatically prevented on image names, templates, wikilink targets and quotes (including indented paragraphs). If a typo rule matches a wikilink target, this rule will be ignored on the whole page." I understand the reasoning behind this, but it means that many corrections I add to the list frequently have no effect, and I have to duplicate them in my personal AWB find-and-replace list to get them to work for me. Could we have an AWB option to apply all RETF changes everywhere except within image/file names, URLs, and quotations? Colonies Chris (talk) 14:48, 22 May 2012 (UTC)

Great idea! I have also wished for this many times. Chris the speller yack 12:31, 23 May 2012 (UTC)
If AWB had an option to fix redirects from misspellings before running RETF, I think that would solve the issue. For example, if AWB changed links from Shelly Winters to Shelley Winters, then the typo rule would fix the rest. Thoughts? GoingBatty (talk) 17:31, 23 May 2012 (UTC)
GB, I've been reading and rereading your contribution without understanding it, but now I see your 'source code', I get it - I've fixed the category link in your comment so that the text appears. Now it makes sense to me. Good solution. To take it to the next level, that option could automatically apply the same correction to all other (non-link) occurrences of that string in the article, and then there would be no need to manually add this type of misspelling to the typo list at all. And there could be similar options for Category:Redirects from other capitalisations and Category:Redirects from titles without diacritics. This would be an excellent way of making use of all the work that other editors have put into creating and categorising redirects. Colonies Chris (talk) 21:41, 23 May 2012 (UTC)
Thanks for fixing my typo and expanding on the idea. I have submitted a feature request for this. Thanks! GoingBatty (talk) 02:21, 24 May 2012 (UTC)
I can't see how we could put this into AWB without doing an API call for each wikilink to see if its target is in one of the categories, clearly performance would be very bad. However, Wikipedia:Bots/Requests for approval/RjwilmsiBot 8 will cover diacritics, from using a database dump. If/when that task is approved I will file a similar task for Category:Redirects from misspellings. Rjwilmsi 19:18, 30 May 2012 (UTC)
Nice solution Rjwilmsi - thanks! GoingBatty (talk) 01:24, 31 May 2012 (UTC)

It it

'It it' is currently being auto-corrected to 'it is', but there are cases where it can be correct. For example, 'If you set fire to it it burns with a blue flame' - yes, you might put a comma in (inserting handy x-ref for lots of relevant info on comma styles), but a common journalistic style is to omit commas, so it isn't strictly wrong to leave it out. If the word after the second 'it' is a verb or adverb then you've probably encountered such a case.

For sentences where the comma is definitely incorrect, rather than merely debatable, see cases such as 'Is it him? Is it her? Is it it?', or 'Was it it that did that?'

As I'm not a Wikipedia regular I'm flagging the mistake here rather than diving into the AWB typo regex list and just changing it. --82.69.54.207 (talk) 11:24, 30 May 2012 (UTC)

This strikes me as not worth changing the Typo rule (at least not for the few cases where "it it" is correct). I would just change each case to {{Not a typo|it it}}. There are also cases where "it it" should be changed to "if it" or "it" or "is it", but I think we should also leave the Typo rule alone so that at least it finds these cases, and then we can fix them in the edit box. Chris the speller yack 12:50, 30 May 2012 (UTC)

Derrick Caracter

Derrick Caracter's last name goes to 'Character' automatically. I've hit this twice so far, but I was wondering if there was some way to deal with it. It's minor, but I just put an invis tag next to one. ChrisGualtieri (talk) 18:36, 4 June 2012 (UTC)

Every article that mentions him should now contain a link to his article, which will prevent the incorrect typo fixing. GoingBatty (talk) 01:42, 5 June 2012 (UTC)

Widley/Widely and Enbil/Embil

Both of these definitions alter names, and quite frequently in the case of Enbil. I have found only false positives with both of these definitions, and I do not make that statement lightly as I've done several thousand of them now. What purpose does this definition serve? Widley might just be past its proper use; otherwise, I never see it as a typo for widely. Nor do I understand what is going on with Enbil, as in the names Jorge Oteiza (Enbil) and Plácido Domingo (Embil). ChrisGualtieri (talk) 20:10, 5 June 2012 (UTC)

In the case of "Widley/Widely", what you see (many false positives) may be an indicator that the rule is working. As editors fix the true hits and skip the false positives, the false positives will begin to predominate. I found 3 articles that need fixing, and about 30 false positives. Why not wrap a "Not a typo" template around the false positives, and then they will trouble you no more? As for Enbil, I have no idea what it's good for; I have observed it a couple of times and just hit the "Skip" button. Chris the speller yack 00:28, 6 June 2012 (UTC)
Another thought — the false positives tend to be capitalized, with true hits tending to be in lower case, so maybe the rule can be adjusted. Chris the speller yack 00:32, 6 June 2012 (UTC)
True... after doing several thousand typos one's sense of 'why is this rule in existence?' starts to peak. I had a string of ones which referred to the place (naturally capitalized) rather than 'Widley used' or some such typo. I'm not particularly ready to go altering every other case, so long as my question is answered and I understand why such awkward rules come up. A lot of the Spanish and Italian typo words get me simply because I do not know if they are incorrect, leading to many skips. As more typos get removed from Wikipedia the more of these false positives there will be. I will finish the backlog off by the end of June, I assure you of that. ChrisGualtieri (talk) 04:40, 6 June 2012 (UTC)

'HongKong' vs 'Hong Kong' Egyptian Bank.

HSBC Bank Egypt has the text which cites the original name as 'Hongkong Egyptian Bank' AWB wants to change it to 'Hong Kong', while I think this may be a typo. A first look at the HSBC site clearly states otherwise. [17] "With HSBC Bank Egypt was established in 1982 as Hong Kong Egyptian Bank." So I think this is a case of a misunderstanding (thought I was wrong at first), but I'd still like a comment from others on this. ChrisGualtieri (talk) 00:23, 6 June 2012 (UTC)

Guess the official website is wrong. Seems to affect Hongkong Bank of Canada, Hongkong Bank of Australia, Hongkong Bank Malaysia Berhad as well. Can anything be done to make sure it doesn't ping these? Or is it best left with invisi-tags? ChrisGualtieri (talk) 03:02, 6 June 2012 (UTC)

André de Toth or Andre de Toth

I've had an editor make a post on my talk page about correcting the name 'Andre de Toth' to 'André de Toth' in accordance with the definition. From what I know of the director, the name and biographical article is André de Toth and 'Andre de Toth' is the redirect. Of his movies, he is credited more often as 'André de Toth' than 'Andre de Toth', 16 to 4 it seems. Of the 4, 3 of them go directly to the redirect and only Play Dirty has 'Andre de Toth', but links directly to André de Toth. Hate to bring up the IMDB argument, but all of those movies list him as André de Toth, as does our André de Toth biography. The editor insists that where he was billed as 'Andre de Toth' it should stay 'Andre de Toth', even though it is not a pseudonym and the lack of the accent appears to be a technical matter, as the director's name is André de Toth. Just wanting some input on this. ChrisGualtieri (talk) 17:42, 6 June 2012 (UTC)

If it were up to me, and I owned Wikipedia, I think I would show the accent in all cases. But there are other editors involved, and (since IMDB seems to keep track of which movies use an accent on his name) it seems like a valid method to use an accent or no accent in any movie article, according to the way each movie gives credit. The complaint that "you can't go around" doing your thing rubs me the wrong way; it implies that you don't have enough to do, or are not making thoughtful edits. In any case, since there is some resistance, you might want to avoid going at the accents with hammer and tongs in this case. Chris the speller yack 01:30, 7 June 2012 (UTC)
I will try and discuss the matter with him to see if we can't work something out. Stephen King represents a classic case with the Richard Bachman pseudonym, but Andre de Toth seems to be a localization matter that went with the movies. Though he did have variants such as 'Andre De Toth | Andre DeToth | André DeToth | Tóth Endre | Endre Tóth | Andre de Toth' it seems. I'd opt for temporary removal of the spelling rule at this time until we sort this matter out. Seems like much of the work is also absent on Wikipedia as well. I'm not going to force my preference on anyone, after all I care more about fixing errors. ChrisGualtieri (talk) 18:47, 7 June 2012 (UTC)

Aministration => Administration

There were 23 occurrences of this until recently, when User:Topbanana corrected them. Regards, SunCreator (talk) 22:12, 7 June 2012 (UTC)

  Added to typo rules - thanks!

et al.

I think we should keep our eyes on this discussion on whether "et al." should be italicized or not. One of the typo fixes is to change "et al." to "''et al.''". GoingBatty (talk) 03:15, 6 June 2012 (UTC)

I commented, I do not believe italics should be used in accordance with the MLA style, but I have seen it listed on some publishers. I think it may just be a hold out of the 'Latin requires italics' even though it is very common and MLA does not require italics on common Latin phrases. Might as well put all the 'etc.' in italics then. ChrisGualtieri (talk) 04:35, 6 June 2012 (UTC)
Thanks for commenting there and here, Chris. I've also created a new topic MOS talk page to request that the MOS documents be made consistent. I've also taken the conservative approach and disabled the typo rule. Thanks! GoingBatty (talk) 22:46, 6 June 2012 (UTC)
Until this 2010 edit, the rule didn't change the italicisation, and only concerned itself with getting the dots right - no dot after "et", and a dot after "al". I've re-instated that version of the rule. -- John of Reading (talk) 20:25, 9 June 2012 (UTC)

e.g. and i.e.

Wikipedia:Manual of Style/Abbreviations#Latin abbreviations states: The initialisms "e.g." and "i.e." should not be followed by a comma. However, the rules for "e.g." and "i.e." preserve the comma (and other punctuation). Could someone please update these rules so they follow MOS? (For example, both "e.g," and "e.g.," are replaced with "e.g."). Thanks! GoingBatty (talk) 00:05, 7 June 2012 (UTC)

Once updated properly I'll ask for a new dump to be sorted so I can begin wiping those out. I must admit I've let a few of these go by due to being unfamiliar with the rule, and the natural pause in speech did seem to require a comma: "He made an itemized list for his shopping trip which included eggs, milk, bread, etc., but he forgot the pasta anyways." A lame example I just created, though I wonder if it would be valid. The whole "etc., but" matter is probably rare, yet still questionable for me. If it is awkward yet valid, could the rule be modified to include it? ChrisGualtieri (talk) 18:51, 7 June 2012 (UTC)
I've removed the "not" from Wikipedia:Manual of Style/Abbreviations#Latin abbreviations and noted why on the talk page there. -- JHunterJ (talk) 20:06, 7 June 2012 (UTC)
FYI: It appears your edit was reverted. GoingBatty (talk) 03:59, 8 June 2012 (UTC)
Yes, and also one from February which softened the prohibition. There's a discussion going on there. -- JHunterJ (talk) 14:27, 8 June 2012 (UTC)

Anothet => Another

Example. Regards, SunCreator (talk) 23:40, 10 June 2012 (UTC)

You fixed one, I've just fixed the only other example reported by the search box. So this doesn't warrant a typo rule. -- John of Reading (talk) 06:51, 11 June 2012 (UTC)

Higlights => Highlights

  Resolved
Looks like AWB is fooled by the unbalanced italic marker in the preceding picture caption. AWB doesn't fix any typos in italicised text, because they might be quotations. If you run AWB on the current version of User:John of Reading/Sandbox, it will fix the two "Higlights" near the top of the page. If you remove the two apostrophes after "Old man and cow" in the AWB edit window and then press F5 to have the text re-processed, the typo fixer fixes the third "Higlights". So this one is a bug. -- John of Reading (talk) 15:49, 11 June 2012 (UTC)

Louisianian or Louisianan

Possible false positive. Which is correct? http://en.wiktionary.org/wiki/Louisianan or http://en.wiktionary.org/wiki/Louisianian? Current rule moves to 'Louisianan'. Regards, SunCreator (talk) 17:48, 11 June 2012 (UTC)

If anyone has a spare half hour check out the article Louisiana Tigers, it's got so many typos and so much quoting I eventually decided not to save. Regards, SunCreator (talk) 07:30, 12 June 2012 (UTC)
(e/c) I don't know; there seems to be plenty of evidence for both spellings. Just to save anyone else the trouble of looking, the rule was added in December 2007. -- John of Reading (talk) 07:51, 12 June 2012 (UTC)

Typo in subheading is not corrected

  Resolved

For Donald_"Duck"_Dunn_discography#with_Manhatten_Transfer the correct spelling for Manhatten is Manhattan. AWB has corrected the words inline but not the heading. Can this be fixed? Any ideas why this happens? Regards, SunCreator (talk) 22:35, 11 June 2012 (UTC)

I assume that typo-fixing within headings is disabled because it might break incoming links from other articles. -- John of Reading (talk) 07:08, 12 June 2012 (UTC)
Makes sense. Thanks. Regards, SunCreator (talk) 10:22, 13 June 2012 (UTC)

False positive

The search box only finds The Battle of Sherramuir, and that's a false positive. So unless someone saw this post, found lots of examples, and fixed them all, it's not worth worrying about. -- John of Reading (talk) 07:18, 12 June 2012 (UTC)
I don't see this name mentioned in History of Newcastle upon Tyne. You could ask on the talk page, perhaps? -- John of Reading (talk) 07:18, 12 June 2012 (UTC)
Asked on articles Talk:History_of_Newcastle_upon_Tyne and Talk:Newcastle_upon_Tyne#Newcastle-on-Tyne but no reply yet. Regards, SunCreator (talk) 10:24, 15 June 2012 (UTC)

frindship => friendship

  Resolved

example. Regards, SunCreator (talk) 13:48, 11 June 2012 (UTC)

Only a handful of examples. I've put it on my to-do list. -- John of Reading (talk) 07:09, 12 June 2012 (UTC)
Existing typo rule updated. Rjwilmsi 12:28, 16 June 2012 (UTC)

unkown => unknown

  Resolved

37 occurrences of unkown. Regards, SunCreator (talk) 08:56, 13 June 2012 (UTC)

The existing "unknown" rule already covers this typo. Rjwilmsi 12:02, 16 June 2012 (UTC)
Nope. Test on Confidant from the Batlló House if you don't believe me. Regards, SunCreator (talk) 12:20, 16 June 2012 (UTC)
That typo occurs within double quotes, so AWB won't apply the typo fixes to it. Rjwilmsi 12:23, 16 June 2012 (UTC)
Okay. Better example with typo is final sentence from Jonas_Bronck#Relations_with_Lenape_tribes. Regards, SunCreator (talk) 12:45, 16 June 2012 (UTC)
That page has a sic tag. The typo rule works. Rjwilmsi 13:22, 16 June 2012 (UTC)
Okay, I learnt something! Regards, SunCreator (talk) 13:36, 16 June 2012 (UTC)

diesease => disease

  Resolved

15 occurrences. Regards, SunCreator (talk) 08:58, 13 June 2012 (UTC)

Rule added. Thanks Rjwilmsi 12:20, 16 June 2012 (UTC)

Expand rules to accept more suffixes and prefixes

Un prefix

  Resolved

Existing rules to have (Un) added as a prefix

  • (Un)appealing, to handle unappealling/unapealling etc. Existing rule Appealing
  • (Un)assume, to handle unasuming/unasume/unasuming/unassuming/unasumed/unasumingly etc. Existing rule Assume
  • (Un)acceptable. Existing rule Acceptable Regards, SunCreator (talk) 15:17, 15 June 2012 (UTC)
Rules updated Rjwilmsi 12:33, 16 June 2012 (UTC)

roman => Roman

Surprised this was not discussed before. Regards, SunCreator (talk) 10:26, 16 June 2012 (UTC)

Nouveau roman would cause some false positives. I may try a database scan to see what else turns up. -- John of Reading (talk) 10:44, 16 June 2012 (UTC)
There would be a great many false positives if every "roman" got replaced, but can't the rule handle it the way many rules do with a suffix, e.g. the "Embarrass" rule with "Embarras River"? Regards, SunCreator (talk) 12:31, 16 June 2012 (UTC)
After my trial database scan, I'm now working through a list of 250 articles that have "roman" followed by (amphitheatres?|aqueducts?|archaeology|[Bb]asilica|calendar|candles?|city|coins?|emperor|empire|farmhouses?|forts?|roads?|towns?|villas?). I think this is too messy to be a typo rule. A naive rule would damage articles about French literature (Nouveau roman) and typography (roman type). -- John of Reading (talk) 14:17, 17 June 2012 (UTC)
Okay, thanks for looking into it. Regards, SunCreator (talk) 14:22, 17 June 2012 (UTC)
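
For anyone wanting to repeat that kind of scan outside AWB, here is a minimal sketch in Python. The noun list is copied from the post above; the function name and the word boundaries around the phrase are assumptions added for illustration, not the query that was actually run.

```python
import re

# Noun list taken from the scan described above; everything else is illustrative.
ROMAN_SCAN = re.compile(
    r"\broman (amphitheatres?|aqueducts?|archaeology|[Bb]asilica|calendar|candles?"
    r"|city|coins?|emperor|empire|farmhouses?|forts?|roads?|towns?|villas?)\b"
)

def find_lowercase_roman(text):
    """Return each lower-case 'roman <noun>' phrase found in text."""
    return [m.group(0) for m in ROMAN_SCAN.finditer(text)]

# "Nouveau roman" and "roman type" are left alone because neither is followed
# by one of the listed nouns.
print(find_lowercase_roman("The roman villa stood near a Roman road."))
print(find_lowercase_roman("Nouveau roman and roman type"))
```

A whitelist of following nouns like this only finds candidates for manual review; it does not show that a blanket rule would be safe.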

anglo-saxon => Anglo-Saxon

Capitalisation. The current rule only capitalises the 'S'. Regards, SunCreator (talk) 09:36, 17 June 2012 (UTC)

  Added rule for "anglo" → "Anglo". GoingBatty (talk) 13:59, 17 June 2012 (UTC)

False positive button

  Resolved

There is a 'False' button if you enable it in the View->Display false positive button. Does anyone use it? It says "Add to false positive file"; is that a local or central file? Regards, SunCreator (talk) 19:03, 17 June 2012 (UTC)

Per the AWB manual: "The list is saved in False positives.txt in AWB directory (where the AutoWikiBrowser.exe that you are executing is located)." GoingBatty (talk) 19:11, 17 June 2012 (UTC)
Thanks! RTFM. Regards, SunCreator (talk) 09:08, 18 June 2012 (UTC)

The current rule changes Advertized to Advertised, this is a false positive according to the Collins English Dictionary. Regards, SunCreator (talk) 09:07, 18 June 2012 (UTC)

According to the OED "Unlike most verbs ending in -ise, advertise cannot be spelled with an -ize ending.". That's why we have the typo rule. Rjwilmsi
The source given is for World English(British English) and so does not cover American English. In addition to Collins English Dictionary, you can find advertize here, here and here. Regards, SunCreator (talk) 13:36, 18 June 2012 (UTC)
The full OED says the same, and covers all forms of English. We could decide that the typo rule is too strict, but as it stands it meets the OED guidance. Rjwilmsi 14:28, 18 June 2012 (UTC)
Every American knows what "advertise" means, and some dictionaries of American spellings only recognize that spelling, omitting "advertize" completely, not even acknowledging it as an alternate spelling. I think Wikipedia should use the form that will not offend or surprise users of British English or American English. WP:SPELLING says "In both British English and American English, many words have variant spellings, but most of the time one variant is preferred over the other." Since "advertise" is the preferred spelling in both cases, it's clearly the preferred spelling. Of course, there's no need to flame over an editor who inserts "advertize", but it's OK to quietly change it to the preferred spelling. This will also tend to take care of an article where both variants appear. Chris the speller yack 15:14, 18 June 2012 (UTC)
Most helpful. Thanks. Regards, SunCreator (talk) 17:08, 18 June 2012 (UTC)

ebay.com

Changing ebay.com to eBay.com is a false positive. URLs are lower case. Regards, SunCreator (talk) 08:12, 19 June 2012 (UTC)

  Done -- JHunterJ (talk) 11:20, 19 June 2012 (UTC)

Jonah Krakow

Jonah Krakow is an animation/video reviewer for IGN (http://uk.tv.ign.com/articles/846/846145p1.html). So Jonah Krakow => Jonah Kraków is a false positive. His name occurs in at least 25 articles. Regards, SunCreator (talk) 08:26, 19 June 2012 (UTC)

  Done. -- JHunterJ (talk) 11:18, 19 June 2012 (UTC)

Combining rules

Would it be reasonable to combine the new "well received" rule with the "(best/well) known" rule? GoingBatty (talk) 23:54, 6 May 2012 (UTC)

I'm disinclined to combine them, because "best-received" is generally not a problem, and the prepositions listed in the lookahead are somewhat different. But if you want to merge them, go ahead. Chris the speller yack 14:20, 7 May 2012 (UTC)
Isn't there about 5000 of the 'well-received' typos? I'd love to help but for some reason my AWB cannot connect to the tool server to load up new lists. Even for the CHECKWIKI project. I'm stuck doing assessments until it comes back online. Something with a new version maybe? So no typoscan for me. Otherwise I'd try to do some of them. ChrisGualtieri (talk) 14:26, 7 May 2012 (UTC)
Hi ChrisG - sorry you're having trouble loading new lists - I'm not having that problem with SVN 8062. If you post on the AWB talk page to see if anyone has a solution for you, I hope it's well-received. (Sorry - couldn't resist.)
The "well-received" rule won't be picked up on Wikipedia:WikiProject TypoScan unless there's another typo on the page or until a new database dump is processed.
Good luck! GoingBatty (talk) 16:48, 7 May 2012 (UTC)
Not even the URL to the toolserver loads for me. It's working for you? ChrisGualtieri (talk) 17:23, 7 May 2012 (UTC)
Yes, I can view http://toolserver.org/~awb/typoscan/ -GoingBatty (talk) 02:35, 9 May 2012 (UTC)
I'm scanning the March dump for "well received"; there seem to be many thousands. Maybe I'll do some as my swan song. Rich Farmbrough, 07:47, 8 May 2012 (UTC).
Wikipedia:Teahouse/Questions#well-received? would suggest that the hyphenated version is correct. Rich Farmbrough, 13:00, 10 May 2012 (UTC).
I have commented on that forum, pointing out that they have misunderstood the MoS. Chris the speller yack 18:20, 10 May 2012 (UTC)
Today, I finished the remaining 500 or so occurrences of 'well-received'. Only one revert, on Meddle, so far. Regards, SunCreator (talk) 22:36, 10 June 2012 (UTC)
Why does the well received rule want a space (or a full stop?) following? What is different about commas and semi-colons? Regards, SunCreator (talk) 21:51, 24 June 2012 (UTC)
If you are asking whether the rule could be expanded to fix "well-received" followed by a semicolon, the answer is that it could, but I imagine that there are very few such cases. If you can provide a couple of dozen such cases, I will expand the rule. If you are asking whether the rule could be expanded to fix "well-received" followed by a comma, the answer is that I can imagine false-positive cases, such as "Schulmklopfer's first effort was a well-received, completely sold-out play." I'm not anxious to go that far. Chris the speller yack 22:50, 24 June 2012 (UTC)
I can't tell the number, but I encountered The Town of No Return with a semi-colon. Perhaps someone with a database copy can answer the question. Regards, SunCreator (talk) 23:19, 24 June 2012 (UTC)
Why doesn't the rule work in some cases, e.g. Major Tom (Coming Home), Antonio Reparaz, and Tracey_Lee_(female_impersonator), which all have "well-received "? Regards, SunCreator (talk) 23:28, 24 June 2012 (UTC)
The rule is working very well. There is nothing wrong with "well-received" when it precedes the noun that is being modified; this is the case in all three examples. The hyphen is not needed when "well received" is used predicatively, or when an intensifier is used, as in "He wrote several very well received books". Chris the speller yack 03:51, 25 June 2012 (UTC)

-men's

This rule doesn't work properly:

<Typo word="-men's" find="\b([Cc]hair|[Ff]ore|[Gg]entle|[Ww]o)men(?:s['′’]s?['′’]?|;?s['′’]?)\b" replace="$1men's" />

If I've understood this correctly, the first half of the complicated bit tries to match <<s'>>, <<s's>>, <<s''>> or <<s's'>>, and the second half tries to match <<;s>>, <<;s'>>, <<s>> or <<s'>>. But if you run it on Food security or on User:John of Reading/Sandbox, you'll see that it actually changes <<Womens'>> to <<Women's'>>. I think the problem is the behaviour of the \b character after an apostrophe.

At least for my test cases, this is a possible fix:

<Typo word="-men's" find="\b([Cc]hair|[Ff]ore|[Gg]entle|[Ww]o)men(?:s['′’]s?['′’]?|;?s['′’]?)([^\w])" replace="$1men's$2" />

Is this a safe fix? Can anyone find a neater one? (And is it a good idea for a regexp to match a pair of consecutive apostrophes?) -- John of Reading (talk) 09:07, 22 June 2012 (UTC)
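
The behaviour described above can be reproduced outside AWB. This is a minimal sketch in Python; AWB itself runs these rules with the .NET regex engine, so it assumes (reasonably, but untested here) that `\b`, `\w` and optional groups behave the same way for this pattern. The two patterns are copied from the rules quoted above; the function names are made up for the demonstration.

```python
import re

# Existing rule: ends with \b.
OLD = re.compile(
    r"\b([Cc]hair|[Ff]ore|[Gg]entle|[Ww]o)men(?:s['′’]s?['′’]?|;?s['′’]?)\b"
)
# Proposed rule: ends by consuming one non-word character and putting it back.
NEW = re.compile(
    r"\b([Cc]hair|[Ff]ore|[Gg]entle|[Ww]o)men(?:s['′’]s?['′’]?|;?s['′’]?)([^\w])"
)

def apply_old(text):
    return OLD.sub(r"\1men's", text)

def apply_new(text):
    return NEW.sub(r"\1men's\2", text)

# There is no word boundary between an apostrophe and a space, so the old rule
# backtracks, matches only "Womens", and leaves the original apostrophe behind.
print(apply_old("Womens' rights"))
print(apply_new("Womens' rights"))
```

One design consequence of the proposed form: because it needs a following non-word character, it cannot match when the word is the very last thing in the text.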

  Fixed with this edit. Further improvements are probably possible, in particular as you noted with the match possibilities for double apostrophes. I accounted for them in my addition; I'll try to account for them in the existing match as well. Soon. -- JHunterJ (talk) 23:57, 24 June 2012 (UTC)

Church

Why is a Catholic church a Catholic Church? But a Protestant church is a Protestant church? Regards, SunCreator (talk) 20:43, 22 June 2012 (UTC)

In most cases, the rule works. In "Pope Benedict XVI is the leader of the Catholic church", the last word needs to be capitalized. There is no parallel for "the Protestant church", because there is no equivalent organization. When I see AWB trying to change "the town's Catholic church was built in 1894", I just nix that change and proceed. You could wrap a "Not a typo" template around "church" to prevent further capitalization attempts, but that's probably heavy-handed Chris the speller yack 21:29, 22 June 2012 (UTC)

philippines => Philippines

The Philippine rule doesn't capitalise this. Regards, SunCreator (talk) 23:03, 23 June 2012 (UTC)

  Fixed GoingBatty (talk) 12:40, 24 June 2012 (UTC)

slavic => Slavic

  Resolved

Slavic language of Slavic people. Am I missing something? Regards, SunCreator (talk) 23:08, 10 June 2012 (UTC)

Seems sensible! Added. -- John of Reading (talk) 13:35, 11 June 2012 (UTC)
Also variation slavonic => Slavonic. Regards, SunCreator (talk) 18:00, 3 July 2012 (UTC)
  Done -GoingBatty (talk) 00:11, 4 July 2012 (UTC)

Capitalisation of mumbai

mumbai => Mumbai. I think this would be false-positive free. Regards, SunCreator (talk) 13:53, 2 July 2012 (UTC)

  Added -GoingBatty (talk) 01:20, 3 July 2012 (UTC)

Thomas De Quincey

There is a rule which changes Thomas De Quincey => Thomas de Quincey but the article has moved after discussion to Thomas De Quincey. Regards, SunCreator (talk) 14:32, 3 July 2012 (UTC)

  Done I've switched all the D/d's in the rule so that it should convert "de" to "De" now. -- John of Reading (talk) 15:32, 3 July 2012 (UTC)

georaphical => geographical

example. Regards, SunCreator (talk) 13:26, 11 June 2012 (UTC)

According to the search box, you just fixed the only example of "georaphical" in the whole of Wikipedia. So it's not worth adjusting the typo rules to fix it. -- John of Reading (talk) 13:30, 11 June 2012 (UTC)
How many occurrences before you are interested in the typo? Regards, SunCreator (talk) 13:52, 11 June 2012 (UTC)
I've had a look through the archives and can't find any guidance on this. Since the list is already so large, I wouldn't like to see it expanded to cover rare typos. 25, maybe? Opinions, anyone? -- John of Reading (talk) 16:20, 11 June 2012 (UTC)
I had just been thinking about this, and had decided on 2 dozen. We can split the difference—24.5 seems about right. Chris the speller yack 17:24, 11 June 2012 (UTC)
 
The long tail of typo occurrences
I think such a high number is inappropriate. It seems to be saying only fix common typos and leave the others. Is there such a downside to adding more rules? Regards, SunCreator (talk) 15:32, 15 June 2012 (UTC)
Yes, because each new rule slows down the processing of each page. If a typo does not appear on many pages, it is probably simpler just to fix them. To do this with AWB, use "Wiki search (text)" and a "Find & Replace" rule - or just fix it by hand in the edit box. -- John of Reading (talk) 15:47, 15 June 2012 (UTC)
Is there a list of such words that are typos but rejected from AWB rules that we can go through to correct in the way you describe? Regards, SunCreator (talk) 20:52, 15 June 2012 (UTC)
Common misspellings can be listed at WP:LCM; some of those are covered by AWB rules and some are not. I'm not aware of any place to list "uncommon misspellings". I just keep a list of any I find in a file on my computer, and every few weeks I take a break from my other projects and fix those typos instead. -- John of Reading (talk) 08:08, 16 June 2012 (UTC)
Not sure if this discussion is stale... and I'm undecided about if we need to cut out a lot of the typo fixes. I do find the searching to be slower than is ideal, but then again I'm running it on a slow computer. There's a danger in declaring which rules are rare because a lot of rules are getting fixed because they're in the list. Only if we had detailed stats about which rules hit the most could we really know which ones are rare.
And even if they are rare, they are often ones that people don't notice and correct on their own. So at the very minimum, we should put the "deleted" rules into another list, such as a secondary AWB list. That way someone could run through a database dump in batch every so often and correct these orphan typos. Shadowjams (talk) 22:55, 15 July 2012 (UTC)
This thread was about adding a new rule, not deleting an existing rule. But I agree 100% that we shouldn't delete any existing rules without collecting proper statistics on how many times each rule is used. That could be a feature request, perhaps? -- John of Reading (talk) 14:57, 16 July 2012 (UTC)

friend

Could the friend rule be adjusted so it does NOT change Frindall to Friendall. There are many occurrences of the individual Bill Frindall. Regards, SunCreator (talk) 19:19, 21 July 2012 (UTC)

  Done GoingBatty (talk) 23:59, 21 July 2012 (UTC)
Thanks. This was causing a lot of false positives. Regards, SunCreator (talk) 22:47, 22 July 2012 (UTC)

Fun with sports and hyphens

A few suggested sports fixes:

  • game winning goal => game-winning goal (at least 300-400 occurrences)
  • walkoff => walk-off (at least 50; this appears to fix "walk-off" in the sense of a 9th inning baseball win as well as its occasional use for striking workers)
  • game winning home => game-winning home (40-50)
  • game winning hit => game-winning hit (30)

I've given each of these substitutions a test run and didn't see any significant false positives. Thanks as always for your efforts Khazar2 (talk) 21:46, 21 July 2012 (UTC)

  Added "game-winning" and "walk-off" rules. GoingBatty (talk) 01:34, 23 July 2012 (UTC)

"Under-development"

I reverted the "Overdevelopment" rule to its former state that does not treat "under-development". There are a number of articles that use "under-development" attributively, such as Grupo Alexander Bain. It's an ugly construct, and I would rather see "an under-development campus" changed to "a campus that is under development", but we can't change "an under-development campus" to "an underdevelopment campus", which has a different, and pejorative, meaning. Chris the speller yack 12:53, 22 July 2012 (UTC)

Qur'an rule

I'm a bit uneasy with this rule given the subject's article is called Quran. Regards, SunCreator (talk) 13:08, 22 July 2012 (UTC)

Quran states: "The Quran...also transliterated Qur'an, Koran, Al-Coran, Coran, Kuran, and Al-Qur'an, is the central religious text of Islam". Since there are apparently several acceptable spellings, I wouldn't want the rule to change "Quran" to "Qur'an" or vice versa. Which replacements does the typo rule make that you are concerned about? Thanks! GoingBatty (talk) 01:29, 23 July 2012 (UTC)
This. Just me being unfamiliar it seems. Regards, SunCreator (talk) 02:09, 23 July 2012 (UTC)
It does seem that there are many ways to write it. That specific one seems okay from some websites; the Qu'ran has more than half a dozen 'okay' ways to write it. Koran, Coran, Quran, Qur'an, Qur’ān and al-Qur’ān are some of the most popular ones. Though it seems to be due to a shift in political correctness and accuracy of the religious text for transcription. The evolution of it is still ongoing. ChrisGualtieri (talk) 14:58, 23 July 2012 (UTC)

paraguayan => Paraguayan

People of Paraguay are Paraguayan. Regards, SunCreator (talk) 13:29, 22 July 2012 (UTC)

  Done. Chris the speller yack 14:24, 22 July 2012 (UTC)
Thanks. Regards, SunCreator (talk) 22:46, 22 July 2012 (UTC)

Number of article pages with an AWB typo

To give a rough estimate of the number of article pages with an AWB typo I took a sample of 1000 mainspace articles (using random in AWB). AWB reported that 17 had typos after a pre-parse mode scan. After checking manually, two contained false positives and were dismissed; the remaining 15 were saved (although 8 were cosmetic issues). 15 in 1000 scaled up for the 3,975,490 articles on Wikipedia is 59,632 typo pages to go. Regards, SunCreator (talk) 14:20, 17 June 2012 (UTC)

I sampled another thousand with 20 found, of which 2 were false positives. So 18 in 1000 is 71559. Will try and check again in a month's time. Regards, SunCreator (talk) 16:44, 17 June 2012 (UTC)
TypoScan lists at least 80,000 left to go (depending on my own activities), but it should be clear that the first pass has a user error ratio 3x higher than what it should be on the 'skips' so I believe the actual number is 135,000 to 150,000 that WILL be hit upon by the rules contained herein. Also since we are not running 100% detection of typos with the rules, the actual number of typos on articles could be much higher. ChrisGualtieri (talk) 15:10, 24 June 2012 (UTC)
Sampled another two thousand. 42 were typos, 7 were false positives, leaving 35 (btw 12 were white space typos). So a typo rate of 35 in 2000 for 4,011,244 articles works out as 70187. Regards, SunCreator (talk) 23:21, 26 July 2012 (UTC)
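
The scaling used in these estimates is simple proportion: confirmed typo pages in the sample, divided by sample size, multiplied by the article count. A small sketch in Python, using only the figures quoted in this thread (the function name is made up for illustration):

```python
def extrapolate(confirmed_hits, sample_size, total_articles):
    """Scale a sample hit count up to the whole article count, rounded to a page."""
    return round(confirmed_hits * total_articles / sample_size)

# (confirmed typo pages, sample size, article count) as quoted in the posts above
samples = [
    (15, 1000, 3_975_490),
    (18, 1000, 3_975_490),
    (35, 2000, 4_011_244),
]
for hits, size, total in samples:
    print(hits, size, extrapolate(hits, size, total))
```

With the article count as quoted, the third sample comes out at about 70,197 rather than 70187, so the posted figure may have been worked out from a slightly different count. Either way, with only 15 to 35 hits per sample the estimate is rough and the difference is well inside the sampling noise.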

April Fools Day

  • <Typo word="April Fool('s/s') Day" find="\b[Aa]pril\s+[Ff]ool('s|s')\s+day\b" replace="April Fool$1 Day" />
  • <Typo word="April Fools' Day" find="\b[Aa]pril\s+[Ff]ools\s+[Dd]ay\b" replace="April Fools' Day" />

Seems considerable duplication here. Why not one rule? Regards, SunCreator (talk) 22:58, 19 July 2012 (UTC)

While the second rule adds the apostrophe, the first rule doesn't change "Fool's" to "Fools'" since both are used. GoingBatty (talk) 01:13, 23 July 2012 (UTC)
Got it thanks, both are used, that's what I missed. Regards, SunCreator (talk) 23:43, 26 July 2012 (UTC)
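
A quick way to see the division of labour between the two rules is to run them in sequence on a few inputs. The sketch below uses Python's `re` with the two patterns copied from above (AWB uses .NET regex and `$1` instead of `\1`; the `fix` helper is only for illustration).

```python
import re

RULE_1 = (re.compile(r"\b[Aa]pril\s+[Ff]ool('s|s')\s+day\b"), r"April Fool\1 Day")
RULE_2 = (re.compile(r"\b[Aa]pril\s+[Ff]ools\s+[Dd]ay\b"), "April Fools' Day")

def fix(text):
    """Apply both rules in list order."""
    for pattern, replacement in (RULE_1, RULE_2):
        text = pattern.sub(replacement, text)
    return text

for sample in ("april fool's day", "April fools' day", "April Fools Day", "April Fool's Day"):
    print(repr(sample), "->", repr(fix(sample)))
```

The first rule only fixes capitalisation and keeps whichever apostrophe placement the editor used; the second only handles the form with no apostrophe at all. Merging them would need a replacement that sometimes keeps and sometimes inserts the apostrophe, which is probably why they are separate.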

Morissette rule

There are three 'Bill Morrisette'. I think rule should be refined to avoid them. Regards, SunCreator (talk) 22:45, 22 July 2012 (UTC)

Bill isn't the only Morrisette with a Wikipedia article. I've taken the conservative approach and changed the rule so it will only fix misspellings of Alanis Morissette. As always, other ideas are appreciated. GoingBatty (talk) 01:45, 23 July 2012 (UTC)
Thanks. Regards, SunCreator (talk) 22:08, 26 July 2012 (UTC)

Guerilla -> Guerrilla

Several dictionaries list "Guerilla" as an alternate spelling. Macmillan and oxforddictionaries.com are a couple. I don't like the single-"r" spelling at all, but there it is. Sorry, I'm going to remove the rule. Chris the speller yack 14:52, 25 July 2012 (UTC)

BTW, you might feel better after seeing that someone once tried to go the other way with this (see talk Archive 1). I also commented in talk Archive 3 that research on each article is needed before choosing "r" or "rr". Chris the speller yack 16:48, 25 July 2012 (UTC)

Umayyad entry

Not sure why this is happening, but the entry seems to be going for any loose 'd' and attempting to change it to Umayyad. Even with the shortening for 'd.' for 'died' in articles like Abdullah Al-Refai. I am not disabling it yet, but I've had 6+ false positives in the last 10 minutes. ChrisGualtieri (talk) 12:28, 26 July 2012 (UTC)

I fixed it; it had one too many vertical bars. Chris the speller yack 15:01, 26 July 2012 (UTC)
Thanks for fixing this - I'd just noticed the same problem myself. Colonies Chris (talk) 15:26, 26 July 2012 (UTC)
Thank you! I was wondering why it was doing that, I've only corrected 2000 typos and had it come up so many times. I don't fully understand the rules and how they operate. I hate to say it, but it had a good detection on contractions like 'they'd' and 'she'd' which bug me. ChrisGualtieri (talk) 16:13, 26 July 2012 (UTC)
What a Chris team! One Chris to create it, one Chris to test it, one Chris to fix it! I'm sure we'll be swapping roles in the future. Chris the speller yack 17:05, 26 July 2012 (UTC)
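
For anyone else puzzled by how one extra vertical bar can do this: a trailing `|` inside a group adds an empty alternative, so the group can match nothing at all. The sketch below is in Python and uses a made-up, simplified pattern; it is not the actual Umayyad rule, only an illustration of the mechanism.

```python
import re

# Hypothetical, simplified patterns; the real rule is more elaborate.
BUGGY = re.compile(r"\b(?:Ummaya|Umaya|)d\b")   # note the stray "|" before ")"
FIXED = re.compile(r"\b(?:Ummaya|Umaya)d\b")

for text in ("d. 1999", "they'd", "the Ummayad caliph"):
    print(repr(text), "->", repr(BUGGY.sub("Umayyad", text)), "/", repr(FIXED.sub("Umayyad", text)))
```

With the empty alternative, the pattern reduces to `\bd\b`, which matches any free-standing "d", including the one after the apostrophe in contractions.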

Servey -> Survey

I constructed the following rule after encountering misspellings of "survey" (and other forms of the word):

<Typo word="Survey" find="\b([Ss])[^u]rvey([a-z]*)\b" replace="$1urvey$2" />

However, in testing, I found less than 20 pages in a wikitext search for "servey", "serveyed", and "serveying". Several of those turned out to be false positives, matching on people whose names were actually "Servey".

Might this rule be too risky to add to the RETF list? Maybe it would be better to have it match only forms with a suffix?

<Typo word="Survey" find="\b([Ss])[^u]rvey([a-z]+)\b" replace="$1urvey$2" />

Input is appreciated. Thanks, Tuvok[T@lk/Improve] 08:40, 29 July 2012 (UTC)

If the false positives are people called "Servey" then amending the rule to avoid those starting with an uppercase 'S' would likely improve the rule considerably. Regards, SunCreator (talk) 09:27, 29 July 2012 (UTC)
This seems to be a case where there are too few hits to justify a new Typo rule. There should be at least a few dozen errors, with very few false positives, before adding a new rule. Chris the speller yack 12:55, 29 July 2012 (UTC)
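
To make the trade-offs in this thread concrete, here is a small Python sketch comparing the two proposed patterns with a lower-case-only variant along the lines suggested above. The first two patterns are copied from the proposal; the third is a hypothetical reading of the lower-case suggestion, and none of them has been tested against a database dump.

```python
import re

ANY_FORM = re.compile(r"\b([Ss])[^u]rvey([a-z]*)\b")     # first proposal
SUFFIX_ONLY = re.compile(r"\b([Ss])[^u]rvey([a-z]+)\b")  # second proposal
LOWERCASE_ONLY = re.compile(r"\bs[^u]rvey([a-z]*)\b")    # hypothetical variant

text = "They serveyed the site; the servey was led by John Servey."
print(ANY_FORM.sub(r"\1urvey\2", text))
print(SUFFIX_ONLY.sub(r"\1urvey\2", text))
print(LOWERCASE_ONLY.sub(r"survey\1", text))
```

The suffix-only form protects the surname but misses the bare noun; the lower-case form catches the noun but would miss a misspelling at the start of a sentence. Note also that `[^u]` accepts any character in that position, not just "e".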

Extra spaces left by one or more rules that move punctuation to before <ref> tags

Sometimes, a period is moved from after a closing </ref> to before the starting <ref>, but instead of being simply moved its old location is filled with a space, resulting in two spaces between sentences. Not the end of the world, but certainly unnecessary.

This also happens at the end of a line sometimes, which seems to bypass the rules that trim trailing spaces. (Obviously that's related to running the rules in a particular order.) Tuvok[T@lk/Improve] 08:44, 29 July 2012 (UTC)

This isn't part of the typo rules, this is one of AWB's general rules. Even though this wouldn't change how the article is presented to the reader, you may wish to create a bug report. GoingBatty (talk) 03:11, 30 July 2012 (UTC)
Thanks, GoingBatty. Apparently I've taken your username to heart in advance of meeting you, thanks to the complexity of AWB. I'll take this elsewhere, and thanks again for correcting my heading. Cheers, Tuvok[T@lk/Improve] 04:37, 30 July 2012 (UTC)

.i.e. rule for Irish websites

<Typo word="i.e." find="\bi(?:\.?e|e\.)(['\s,:;\)&-])(?<!\.ie.|'ie')" replace="i.e.$1" /><!--don't generalize to capital Ie; avoid matching website.ie; avoid matching 'ie' used as syllable -->
This rule was changed to avoid Irish .ie domains. I just noticed it's not working, and still changes .ie. to .i.e. for example on Irish poetry. Can someone with Regex wizardry take a look at correcting the issue. Regards, SunCreator (talk) 22:36, 29 July 2012 (UTC)
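
A sketch of why this happens, in Python (AWB uses .NET regex, but Python's `re` accepts this pattern because both lookbehind alternatives are four characters wide). The trailing lookbehind inspects the four characters before the end of the match. For "example.ie," those are ".ie," and the guard works. For "example.ie. " the match is "ie. ", so the four characters are "ie. " and the guard never sees the leading dot. The `SKETCH` pattern below is only one possible adjustment, an assumption not tested in AWB: it adds a lookbehind in front of the "i" so that anything directly preceded by a dot is skipped.

```python
import re

# Pattern copied from the rule above.
CURRENT = re.compile(r"\bi(?:\.?e|e\.)(['\s,:;\)&-])(?<!\.ie.|'ie')")
# Hypothetical adjustment: refuse to match when the "i" follows a dot.
SKETCH = re.compile(r"(?<!\.)\bi(?:\.?e|e\.)(['\s,:;\)&-])(?<!\.ie.|'ie')")

for text in ("See example.ie. More", "See example.ie, more", "fruit, ie apples"):
    print(repr(text), "->", repr(CURRENT.sub(r"i.e.\1", text)), "/", repr(SKETCH.sub(r"i.e.\1", text)))
```

The extra lookbehind would also skip a genuine "ie" that happens to follow a full stop with no space, so it trades a few missed fixes for fewer damaged domain names.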

Etc. => etc.

Only in the exception article Etc. could this start a sentence, so couldn't it be made into lowercase? Like the "i.e." rule it would seem appropriate to use only lowercase "etc." Regards, SunCreator (talk) 21:39, 29 July 2012 (UTC)

There are several uppercase examples at the disambiguation page ETC, such as Etc... (a Czech rock band), Etc. (the b-sides and rarities album of the influential punk band Jawbreaker), and Etc. (a bonus disc accompanying the Pet Shop Boys' 2009 release Yes.) GoingBatty (talk) 03:22, 30 July 2012 (UTC)
Thanks. Those possibilities seem to cover a small number of topics that could be individually resolved with {{Not a typo}}. I'm encouraged to think this could be a workable rule. Regards, SunCreator (talk) 16:02, 30 July 2012 (UTC)

More French loanwords

I see there are typo rules for some French loanwords. Should we also add rules for bête noire, bourrée, château(x?), passé, and séance? (Potential rules for château and séance should not include capital letters - see their disambiguation pages.) Thanks! GoingBatty (talk) 03:17, 31 July 2012 (UTC)

Not for "chateau" or "seance". "Chateau" is the English word, which allows "château" as an alternate spelling (in some dictionaries). Same goes for "seance/séance". The Château page has been roughly handled by a group of editors suffering from fairly bad cases of hyperforeignism. This is the English Wikipedia, and the standard for spellings is a good English dictionary. Chris the speller yack 04:26, 31 July 2012 (UTC)
OK, I've updated the château and séance articles to indicate the unaccented versions are acceptable (just like fête). Thanks! GoingBatty (talk) 04:37, 31 July 2012 (UTC)

False positive 2

humourous => humorous. Humourous is the British English spelling. Regards, SunCreator (talk) 15:44, 2 July 2012 (UTC)

What's your source? The OED disagrees with you. Rjwilmsi 17:02, 2 July 2012 (UTC)
Interesting. Will check some sources nearer the end of the week. Regards, SunCreator (talk) 14:29, 3 July 2012 (UTC)
I went to a bookshop and checked out some dictionaries but erroneously somehow thought the word to check was anonymous. Darn! Regards, SunCreator (talk) 15:19, 1 August 2012 (UTC)

Working/upper/middle/lower-class

Should this rule also cover combinations like upper-middle-class and lower-middle-class? Regards, SunCreator (talk) 13:15, 21 July 2012 (UTC)

Upper middle class contains "upper middle class individuals" in the lead. What's the proper hyphenization? Thanks! GoingBatty (talk) 00:08, 22 July 2012 (UTC)
I fixed it (and "white-collar professionals"). "Upper-middle-class individuals", because "upper" modifies "middle-class", not "individuals". Chris the speller yack 00:38, 22 July 2012 (UTC)
I don't think any change has been made to the rule. Was that an oversight? Regards, SunCreator (talk) 22:50, 22 July 2012 (UTC)
I think it's waiting for someone to decide that such a change is reasonable, doable and worthwhile; the rule is already somewhat clunky. Maybe some other editor will comment; it's only been a day since the issue came up. Chris the speller yack 00:29, 23 July 2012 (UTC)
FWIW I agree on the correction, but lack the programming chutzpah to make the change myself. Khazar2 (talk) 00:32, 23 July 2012 (UTC)
  Added new rule for "(Upper/lower)-middle-class". GoingBatty (talk) 01:22, 23 July 2012 (UTC)
Also ending with home(s) i.e upper middle class homes on Huntley's far west side => upper-middle-class homes on Huntley's far west side. Regards, SunCreator (talk) 12:14, 29 July 2012 (UTC)
Have added homes to this Working/upper/middle/lower-class rule. Not sure about "home" so left it for now. Perhaps Chris the speller could comment on that. Regards, SunCreator (talk) 22:05, 1 August 2012 (UTC)

Capitalisation of egyptian => Egyptian(s)

Seems like a good idea. No obvious false positives at Egyptian. Regards, SunCreator (talk) 21:43, 28 July 2012 (UTC)

Doesn't the existing rule in the Geographical proper names section cover this? GoingBatty (talk) 03:03, 30 July 2012 (UTC)
Good question. I wasn't aware of that rule and had to manually correct this. So I guess the answer is the existing rule doesn't cover it. But I'm not sure why. Regards, SunCreator (talk) 03:57, 30 July 2012 (UTC)
My guess was there were unbalanced quotation marks in the article causing AWB to skip that section of the article, but I didn't see that. GoingBatty (talk) 04:14, 30 July 2012 (UTC)
No, as it corrected the word after; allready => already. See the previous edit. It appears the problem is with the rule. I will test it later. Regards, SunCreator (talk) 04:45, 30 July 2012 (UTC)
The text is in User:John of Reading/Sandbox. A typo rule is disabled if it matches any wikilink in the article. By experiment, I find that this test is fooled by "links" to the File namespace. So, because the article contains [[File:First 3 egyptian pilots.jpg|thumb|upright|left|First three Egyptian pilots]], the "Egypt" rule is turned off. I'll log a bug. -- John of Reading (talk) 07:04, 30 July 2012 (UTC)
What a strange error! Thank you John. Regards, SunCreator (talk) 12:19, 30 July 2012 (UTC)
Rjwilmsi (talk · contribs) is happy to make the change if we can agree that it will do more good than harm. But, on reflection, I think it will be very difficult to work out if this change would be an improvement. Using the current code, some typos are not getting fixed - but it took a sharp-eyed AWB user to notice one of them and raise it here. Using the proposed new code, these typos would be fixed - but there would probably be some new false positives. I have no idea whether the extra fixes would outnumber the extra false positives, and it would take a serious amount of work to find out. -- John of Reading (talk) 05:43, 1 August 2012 (UTC)
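
To illustrate the behaviour described in this thread, here is a rough Python model of the check. It is not AWB's code: the function names, the link-target regex and the lower-case "egyptian" pattern are all assumptions made for the example. It only models the description given above, namely that a rule is switched off for a page if it matches any wikilink target, and what would change if File/Image targets were excluded from that check.

```python
import re

# Captures the target part of [[target|...]]; a simplification of real wikitext.
WIKILINK_TARGET = re.compile(r"\[\[([^\]|]+)")

def rule_disabled_naive(rule, wikitext):
    """Rule is disabled if it matches any link target, files included."""
    return any(rule.search(t) for t in WIKILINK_TARGET.findall(wikitext))

def rule_disabled_skip_files(rule, wikitext):
    """Same check, but ignoring File: and Image: targets (hypothetical change)."""
    targets = (
        t for t in WIKILINK_TARGET.findall(wikitext)
        if not t.lstrip().lower().startswith(("file:", "image:"))
    )
    return any(rule.search(t) for t in targets)

RULE = re.compile(r"\begyptian")  # stand-in for the lower-case part of the rule
ARTICLE = (
    "[[File:First 3 egyptian pilots.jpg|thumb|upright|left|"
    "First three Egyptian pilots]] An egyptian pilot."
)
print(rule_disabled_naive(RULE, ARTICLE), rule_disabled_skip_files(RULE, ARTICLE))
```

As noted above, whether excluding file links is a net improvement depends on how many new false positives it lets through, which this toy model cannot answer.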

Text with a lot of typos

here is some text with a lot of typos corrected. Could any of these be good for typo fixing? Regards, SunCreator (talk) 23:58, 31 July 2012 (UTC)

I would say at least three: Conservative, Successor and student but possibly others also. Kumioko (talk) 00:42, 1 August 2012 (UTC)
There are less than a dozen articles with "stuent", and even fewer cases of the other misspellings. I would say that this is so rare as to be slightly below the threshold for adding a Typo rule. Please read the section above, "georaphical => geographical", for other ways to deal with rare misspellings. Chris the speller yack 03:23, 1 August 2012 (UTC)
I suspect only the capitalisation of 'panjab' would meet the previously discussed 24 or 25 occurrence level. It just goes to highlight that the majority of typos are low volume and therefore the current AWB typo strategy misses them. Regards, SunCreator (talk) 14:14, 1 August 2012 (UTC)

Distinct rule

Distinct rule converts 'Distict' => 'Distinct' but many times the correct word is 'District'. Regards, SunCreator (talk) 12:35, 3 August 2012 (UTC)

There is a District rule that converts 'Distict' => 'District' also, but it seems in practice the Distinct rule gets it first. Regards, SunCreator (talk) 14:04, 3 August 2012 (UTC)
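
When two rules accept the same misspelling, whichever runs first rewrites the text and the second never sees it. A minimal Python sketch of that ordering effect follows; the two patterns are simplified stand-ins for illustration, not the actual Distinct and District rules.

```python
import re

# Simplified, hypothetical versions of the two overlapping rules.
DISTINCT = ("Distinct", re.compile(r"\b([Dd])istict\b"), r"\1istinct")
DISTRICT = ("District", re.compile(r"\b([Dd])istict\b"), r"\1istrict")

def apply_rules(text, rules):
    """Apply rules in order; return the new text and the names of rules that fired."""
    fired = []
    for name, pattern, replacement in rules:
        text, count = pattern.subn(replacement, text)
        if count:
            fired.append(name)
    return text, fired

print(apply_rules("the Distict of Columbia", [DISTINCT, DISTRICT]))
print(apply_rules("the Distict of Columbia", [DISTRICT, DISTINCT]))
```

Reordering only moves the error to the other word; an ambiguous misspelling like this probably needs context in the pattern or removal from one of the rules.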

Lifelong false positive

She sacrificed her life long ago. => she sacrificed her lifelong ago. Regards, SunCreator (talk) 00:05, 4 August 2012 (UTC)

Plus "and a way of life long gone", "ended her life long before they reached her", "a mode of life long since defunct" and "of a life long-lived on one side". Regards, SunCreator (talk) 02:30, 4 August 2012 (UTC)
Adjusted the rule to handle those situations. Regards, SunCreator (talk) 02:37, 4 August 2012 (UTC)
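
One way such cases can be excluded is a negative lookahead after the match. The Python sketch below shows the idea on the phrases listed above; both patterns are hypothetical and simplified, and the actual adjustment made to the rule may differ.

```python
import re

# Hypothetical patterns for illustration only.
NAIVE = re.compile(r"\b([Ll])ife long\b")
GUARDED = re.compile(r"\b([Ll])ife long\b(?!-|\s+(?:ago|gone|before|since)\b)")

samples = [
    "She sacrificed her life long ago.",
    "a way of life long gone",
    "of a life long-lived on one side",
    "a life long friend",
]
for text in samples:
    print(repr(NAIVE.sub(r"\1ifelong", text)), "/", repr(GUARDED.sub(r"\1ifelong", text)))
```

A word list in a lookahead only covers the cases someone has already reported, so further false positives with other following words remain possible.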

Sports vocabulary

I propose that the following misspellings be corrected.

  • athalet(e,ic) —> athlet(e,ic)
  • (bi,tri,pent,hept,dec)athalon —> (bi,tri,pent,hept,dec)athlon
  • cycle(ing,ist) —> cycl(ing,ist)
  • parapaleg(ia,ic) —> parapleg(ia,ic)
  • quadrupaleg(ia,ic) —> quadrupleg(ia,ic)
  • Ukarainian —> Ukrainian

I have been seeing some of those errors on external pages.
Wavelength (talk) 17:35, 5 August 2012 (UTC)

I'm all for correcting these, but there are very few, too few to merit the creation of Typo rules. Chris the speller yack 18:24, 5 August 2012 (UTC)
Thank you for considering my proposal.
Wavelength (talk) 18:51, 5 August 2012 (UTC)
No problem; if you find a misspelling that occurs in a couple of dozen articles or more, let us know. With that many to chew on, we'll try to give AWB a rip at them. Chris the speller yack 19:13, 5 August 2012 (UTC)
There is already a -athalon rule for handling (bi,tri,pent,hept,dec)athalon. Regards, SunCreator (talk) 05:15, 6 August 2012 (UTC)

homonomy => homonymy

I think this change could be a false positive per http://dictionary.reference.com/browse/homonomy, http://www.thefreedictionary.com/Homonomy but they could be mistakes. Oxford English online doesn't have the word homonomy. Regards, SunCreator (talk) 06:38, 6 August 2012 (UTC)

AWB avoids too many areas that contain typos

I'm fairly new to typo correction with AWB. In my testing of regex additions/changes, I find that AWB skips a substantial portion of the typos that would match because they're 1) in references, 2) in text indented with a colon, 3) seemingly many other areas. None of this is well documented. I don't quite understand this: we're expected to review changes anyway, so why have so many areas ignored? Here's an example: my target "origional" did not hit here [18], but an unrelated typo hit (I had edit summary trouble here, ignore that). So I manually and temporarily removed the indentation ":" that I presumed was blocking the typo fix, within AWB in that edit. Then, parsing the article again, AWB corrected the typo I wanted [19], so it was the colon causing the problem (and I manually replaced the colon). It kneecaps the project to have so many textual areas excluded from correction. I wouldn't mention it if I hadn't had about 40% or more of target typos ignored by AWB so far. Riggr Mortis (talk) 02:48, 5 August 2012 (UTC)

Wikipedia:AutoWikiBrowser/Typos#Usage states "When used on AWB, typo-fixing is automatically prevented on image names, templates, wikilink targets and quotes (including indented paragraphs). If a typo rule matches a wikilink target, this rule will be ignored on the whole page." GoingBatty (talk) 03:42, 5 August 2012 (UTC)
I've seen that; I said well documented. "Indented paragraphs": there are many ways to do that. So "Joe's Journal of Psychaitry" doesn't get corrected because it has an asterisk in front of it: pointless. Templates: the template name itself (obviously), or its parameters too? In any case, the substantive point remains. Riggr Mortis (talk) 03:54, 5 August 2012 (UTC)
It's the entire template. What article contains/contained "Joe's Journal of Psychaitry" with an asterisk? GoingBatty (talk) 04:32, 5 August 2012 (UTC)
I think you're taking me rather literally; but in fact, AWB is ignoring no less than seven instances of "psychatric", which relates to a regex I added the other day. Try it. An article with a bullet point and "psychatric" is List of oldest buildings and structures in Toronto (it's also contained with a link, but not the URL, so who cares—all regular text is susceptible to typos, regardless of what wikicode it's wrapped in.) Riggr Mortis (talk) 05:07, 5 August 2012 (UTC)
I think it's good that AWB does not fix "psychatric" in the reference in Manpreet Singh, since the source actually uses "Psychatric". That's an example why AWB is conservative in its corrections, and does not make changes to the other six articles where "psychatric" is in a reference or external link. GoingBatty (talk) 22:45, 5 August 2012 (UTC)
I don't agree to be honest unless people are using AWB sloppily. I won't make such a change unless I could validate it somehow. Either way, Chris's solution of setting level of exclusion is the way to go. Regards, SunCreator (talk) 22:53, 5 August 2012 (UTC)
(edit conflict) I agree with Riggr. It would seem appropriate to work towards having less content ignored. It's not ignored in wikEd, and I imagine that in the future that will be the common editing method. It might take some reading to find out the reasons behind these exclusions in the past, but I'm open to having more content covered even if that means more cleaning up in terms of image renaming, marking more {{not a typo}} etc. Regards, SunCreator (talk) 04:37, 5 August 2012 (UTC)
I often put a Typo rule into my Find & Replace rules and then search for that error and run them all to ground. But it might be a better move to add options in AWB to allow Typo fixes in indented paragraphs, Wikilink targets, etc. This would let each AWB user choose his or her own comfort level with how many hits will need to be skipped, how much extra examination will be needed, and how much risk they want to take. Chris the speller yack 12:50, 5 August 2012 (UTC)
Some options to set level of exclusion seems a great first step. Regards, SunCreator (talk) 13:01, 5 August 2012 (UTC)
The number of typos Regex can get is also very limited. So it won't get them all, or even half of all typos on a page in which they may hide. Aside from loading every page with a built in checker (instead of AWB) we will continue to miss many simply by using AWB loaded with Regex. ChrisGualtieri (talk) 05:39, 6 August 2012 (UTC)
So you'd prefer that we not maximize the value of all the work that's been done here over years, because perfection can't be obtained? Not a strong argument in any context, really. Riggr Mortis (talk) 23:43, 6 August 2012 (UTC)
I'd agree with you if it wasn't for the fact that I've corrected more than 40,000 articles worth of typos with Typoscan. I'm in the boat of 'Regex is good', but I cannot bypass the sheer force of a modern spellchecker that offers options but retains a 97-99% detection rate or higher. Regex is limited for many reasons, but its limitations cover important typos. ChrisGualtieri (talk) 00:34, 7 August 2012 (UTC)

I have submitted a feature request to add an option(s) to allow Typo fixing in more of these areas. Chris the speller yack 15:27, 6 August 2012 (UTC)

Thank you Chris! Riggr Mortis (talk) 23:43, 6 August 2012 (UTC)
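For readers trying to picture the exclusions quoted from the Usage section, here is a rough sketch of the "protect, then fix the rest" idea. It is not AWB's implementation: the `PROTECTED` pattern is a hypothetical approximation covering only simple templates, wikilink targets, colon-indented lines and double-quoted text, and it does not model the "ignore the rule on the whole page" behaviour.

```python
import re

# Hypothetical approximation of regions where typo fixing is skipped.
PROTECTED = re.compile(
    r"\{\{[^{}]*\}\}"   # simple, non-nested templates
    r"|\[\[[^\]|]*"     # wikilink target, up to the pipe or closing brackets
    r"|^:.*$"           # paragraphs indented with a colon
    r'|"[^"\n]*"',      # double-quoted text on one line
    re.MULTILINE,
)

def fix_outside_protected(text, pattern, replacement):
    """Apply one typo rule only to the text between protected regions."""
    pieces = []
    pos = 0
    for match in PROTECTED.finditer(text):
        pieces.append(pattern.sub(replacement, text[pos:match.start()]))
        pieces.append(match.group(0))  # protected text is copied unchanged
        pos = match.end()
    pieces.append(pattern.sub(replacement, text[pos:]))
    return "".join(pieces)

rule = re.compile(r"\b([Oo])rigional\b")
print(fix_outside_protected("The origional plan.\n:The origional quote.", rule, r"\1riginal"))
```

An option for the "level of exclusion", as suggested above, would amount to letting the user drop alternatives from a pattern like `PROTECTED`.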

Invoke the RETF option

  • "AWB loads directly from this list whenever someone invokes the RETF option."

How does one invoke the RETF option other than closing AWB and restarting it? Regards, SunCreator (talk) 20:51, 6 August 2012 (UTC)

Do you mean refreshing the typo list without restarting? If so, "File->Refresh status/Typos". Riggr Mortis (talk) 23:34, 6 August 2012 (UTC)
Yes, that is what I meant. Thank you, it was not obvious. Regards, SunCreator (talk) 23:56, 6 August 2012 (UTC)
I always thought that meant when someone checks the Enable RegexTypoFix box. GoingBatty (talk) 00:58, 7 August 2012 (UTC)

Redundant units of currency

How common are errors involving redundant units of currency, such as "$10 dollars" and "£10 pounds"? Additional units and their symbols are mentioned in the article "Currency sign".
Wavelength (talk) 22:07, 6 August 2012 (UTC)
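One way to estimate how common these are would be to scan article text with a pattern along the following lines. This is only a sketch: the symbol-to-unit table and the pattern are assumptions, they cover just the two examples given, and forms such as "$10 million dollars" are not detected.

```python
import re

# Hypothetical lookup of the unit word that would be redundant after each symbol.
UNIT_FOR_SYMBOL = {"$": "dollar", "£": "pound"}

PATTERN = re.compile(r"([$£])(\d[\d,]*(?:\.\d+)?)\s+(dollars?|pounds?)\b")

def find_redundant_currency(text):
    """Return phrases where a currency symbol is repeated by its unit word."""
    hits = []
    for match in PATTERN.finditer(text):
        symbol, unit = match.group(1), match.group(3)
        if unit.startswith(UNIT_FOR_SYMBOL[symbol]):
            hits.append(match.group(0))
    return hits

print(find_redundant_currency("It cost $10 dollars, not £10 pounds or $5."))
```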

née

It seems that there are up to seven ways that people spell their own name when it contains a variation of "nee", and Regex wants to change every single one of them to née. It accounts for maybe 1/5 of the "typos" that Regex picks up in my filtered searches. Is there any way we could change, or even better, eliminate this rule? hajatvrc @ 20:02, 4 August 2012 (UTC)

In fact I've never seen it make a correct change with this rule. hajatvrc @ 20:04, 4 August 2012 (UTC)
Examples please, I'll look into it. Regards, SunCreator (talk) 20:19, 4 August 2012 (UTC)
This edit is an example of a correct change. GoingBatty (talk) 20:40, 4 August 2012 (UTC)
Was looking for examples of false positives. Here are some [20] [21] correct changes. Regards, SunCreator (talk) 20:49, 4 August 2012 (UTC)
I was responding to Hajatvrc's statement saying "I've never seen it make a correct change with this rule.". This edit and this edit are two more correct changes. I hope Hajatvrc can provide examples of false positives, per your request. GoingBatty (talk) 20:53, 4 August 2012 (UTC)
Here [22] [23] are some correct ones that hajat did. Regards, SunCreator (talk) 20:57, 4 August 2012 (UTC)

Generally, it is when it is not used to say "born as" but is their actual name. I am searching for examples I've come across. But in the meantime I'm curious whether the uses are correct in: Petra Taylor, Annabelle Collins (Brookside), Jackie Corkhill, etc. hajatvrc @ 21:03, 4 August 2012 (UTC)

Forgive me for saying "never"; it was an inappropriate hyperbole. hajatvrc @ 21:10, 4 August 2012 (UTC)
No problem. Regards, SunCreator (talk) 21:16, 4 August 2012 (UTC)
They look okay to me. Are you saying the née change is questionable as it may not be her maiden family name? I'm somewhat confused at what the issue is. Regards, SunCreator (talk) 21:16, 4 August 2012 (UTC)
I feel like there was one category of people from a certain ethnicity where nearly every woman had that as their actual name, but I'm trying to remember which one it was! hajatvrc @ 21:19, 4 August 2012 (UTC)
The rule is case sensitive so would successfully avoid Watchman Nee, John Nee, Lim Nee Soon and similar names. Regards, SunCreator (talk) 22:02, 4 August 2012 (UTC)
And the point with the three that I linked is they spelled it with a grave accent on the second e. So that is correct also? hajatvrc @ 21:23, 4 August 2012 (UTC)
neè seems incorrect, so changing neè => née would appear to be good. Regards, SunCreator (talk) 22:02, 4 August 2012 (UTC)
Based on exact-phrase Google searches, there appear to be countless women who spell it "neè" and countless women who spell it "née". I had never encountered the former until I started using TypoScan a few days ago. The problem is, I can't find a reputable source that says née is or is not the only way to spell it. Do you know of one? hajatvrc @ 22:07, 4 August 2012 (UTC)
Google News shows no English language result for "neè". "neè" is not in my Collins dictionary or online on the Oxford dictionary. Tell me what you are looking at in Google? All I see is social media and Facebook typos. Regards, SunCreator (talk) 22:31, 4 August 2012 (UTC)
"neè" -facebook -twitter -youtube hajatvrc @ 22:34, 4 August 2012 (UTC)

Then on the other hand, "neè" site:en.wikipedia.org only produces four articles. hajatvrc @ 22:44, 4 August 2012 (UTC)

I suppose I could change those and see if anyone gets angry... hajatvrc @ 22:46, 4 August 2012 (UTC)
I found one that neither the wiki nor the Google search found. Regards, SunCreator (talk) 23:48, 7 August 2012 (UTC)

Off- and oficial

<Typo word="Off-" find="\b([Oo])f(?:|ff)(er(?:ed|ings?)|ice(?:r?|holder)s?|icia(l(?:s?|ly|dom|ism)|te[ds]?|ting))\b" replace="$1ff$2" />
Many rules try to avoid 'oficial' because of common foreign language usage. The above rule does change it although the comment implies otherwise. Can we amend this so oficial is left unchanged. Regards, SunCreator (talk) 14:29, 5 August 2012 (UTC)
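The behaviour can be checked outside AWB. The sketch below runs the rule's pattern in Python, where the replacement `$1ff$2` is written `\1ff\2`; this assumes Python's `re` treats this particular pattern the same way as AWB's regex engine. `OFF_RULE_KEEP_OFICIAL` is a hypothetical amendment, not a tested rule: it adds a lookahead so that the exact word "oficial" is skipped.

```python
import re

# Pattern copied from the rule above.
OFF_RULE = re.compile(
    r"\b([Oo])f(?:|ff)(er(?:ed|ings?)|ice(?:r?|holder)s?"
    r"|icia(l(?:s?|ly|dom|ism)|te[ds]?|ting))\b"
)

# Hypothetical amendment: refuse to match when the whole word is "oficial".
OFF_RULE_KEEP_OFICIAL = re.compile(
    r"\b([Oo])(?!ficial\b)f(?:|ff)(er(?:ed|ings?)|ice(?:r?|holder)s?"
    r"|icia(l(?:s?|ly|dom|ism)|te[ds]?|ting))\b"
)

REPLACEMENT = r"\1ff\2"

print(OFF_RULE.sub(REPLACEMENT, "Diario Oficial"))
print(OFF_RULE_KEEP_OFICIAL.sub(REPLACEMENT, "Diario Oficial"))
print(OFF_RULE_KEEP_OFICIAL.sub(REPLACEMENT, "an oficer offfered"))
```

The lookahead only exempts the single word "oficial"; forms such as "oficially" are still corrected.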

Please do so! That and differencia or whatever it is. Same with whatever changes Enpippi to Empippi, anything which sets En to Em. These rules constantly hit articles with foreign languages, and the chances of finding an actual correction seem very low. ChrisGualtieri (talk) 05:42, 6 August 2012 (UTC)
-Emp now disabled. I just hit it with a false positive "Enpl." minutes after reading the above post. I may re-enable and tune it at another time. Regards, SunCreator (talk) 06:20, 6 August 2012 (UTC)
Also, foreign language texts should be flagged with appropriate {{lang}} templates. Then all of the typo rules will ignore the text. -- JHunterJ (talk) 12:52, 6 August 2012 (UTC)
What if you don't know what language it is? Regards, SunCreator (talk) 21:26, 6 August 2012 (UTC)
See Wikipedia:Language recognition chart and its list of external links.
Wavelength (talk) 21:44, 6 August 2012 (UTC)
So what language is "Interlingue" or "Sillaba votz es literals" or "La Diferencia"? Sometimes you only get a word, and Wiki articles deal with everything including the most unusual ancient languages. Labelling text is not only time consuming to research but, if incorrect, misleading to those that later edit the article. So unless it is obvious I use {{Not a typo}}. Regards, SunCreator (talk) 22:11, 6 August 2012 (UTC)
Just discovered another solution: just leave the language empty, i.e. {{lang||foreign words}}. Regards, SunCreator (talk) 02:51, 8 August 2012 (UTC)

Departement => French?

Current rule:<Typo word="Département(al)" find="\b([Dd])epartement(ale?)?\b" replace="$1épartement$2" />
I don't understand this rule, which changes Departement => Département (the French word for department). Why not go with Departement => Department, the English spelling? On the English Wikipedia even Departments of France has the spelling departments. Regards, SunCreator (talk) 13:10, 29 July 2012 (UTC)

See false positive here. Maybe the rule can be made more specific, i.e. to change to the French spelling only if it is preceded by le or des, or followed by au or des. Regards, SunCreator (talk) 13:22, 29 July 2012 (UTC)
I've already done the ones which specifically mention the french variant and ignore all others for a great many pages, the false positives vastly outnumber the real ones. ChrisGualtieri (talk) 14:40, 29 July 2012 (UTC)
So I take it you're all for disabling the rule then. BTW, a false positive from earlier today. Regards, SunCreator (talk) 19:24, 29 July 2012 (UTC)
I disagree with that. The correct term for a French department is département. Department and département are not the same. So when referring to the specific département, as in the French département of Côtes-d'Armor I would expect to use the correct term. Seems to be a matter that was previously dealt with back in 2006 and never again. Why use an english word when the french term is there. ChrisGualtieri (talk) 03:34, 1 August 2012 (UTC)
In one way that is correct, because it depends on context. But the rule currently has no context and thus blindly recommends changing every departement typo to the French when the English may be correct. It's the same as the 'distict' typo that could be either 'distinct' or 'district'. Regards, Sun Creator (talk) 10:38, 9 August 2012 (UTC)
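To make the "no context" point concrete, the sketch below runs the current rule's pattern in Python (with `$1`/`$2` written as `\1`/`\2`) next to a hypothetical context-restricted variant. The cue words in `CONTEXT_RULE` (le, du, des, French) are an assumption for illustration, not an agreed rule.

```python
import re

# Current rule, as quoted at the top of this section.
CURRENT_RULE = re.compile(r"\b([Dd])epartement(ale?)?\b")

# Hypothetical variant: only fire after a word that suggests French context.
CONTEXT_RULE = re.compile(r"\b((?:le|du|des|French)\s+)([Dd])epartement(ale?)?\b")

def fix_current(text):
    return CURRENT_RULE.sub(r"\1épartement\2", text)

def fix_with_context(text):
    return CONTEXT_RULE.sub(r"\1\2épartement\3", text)

print(fix_current("the Health Departement"))
print(fix_with_context("the Health Departement"))
print(fix_with_context("the French departement of Côtes-d'Armor"))
```

The variant leaves an English-context "Departement" alone, so a separate rule (or manual review) would still be needed to turn it into "Department".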

Qaran → Qur'an

I'm concerned about AWB changing Qaran → Qur'an in edits like [24] and [25]. Qaran clearly is used in these cases as a placename, and searches indicate that such a place exists (see, for example, here). People are using AWB to turn such usage into nonsense. — Hebrides (talk) 12:55, 8 August 2012 (UTC)

Sorry for that edit; I should not have saved it. I'll adjust the rule not to change the place name Qaran. Regards, Sun Creator (talk) 13:04, 8 August 2012 (UTC)
Thanks. Also, how do I search for all instances where AWB has changed Qaran → Qur'an so that I can decide whether to change them back? This is vital. — Hebrides (talk) 13:06, 8 August 2012 (UTC)
Not sure, that's difficult. Maybe get a database dump (or find someone who has one) from before the rule was added (Feb 28, 2010), find articles with the 'Qaran' spelling and check they are still okay? Regards, Sun Creator (talk) 13:28, 8 August 2012 (UTC)
I wish there was a way, I'll ask around about searching edit summaries. Because an edit summary search tool would bring this one up with the way AWB works, it won't catch 100% if the typo changes are numerous, but I bet it would grab a majority. ChrisGualtieri (talk) 13:43, 8 August 2012 (UTC)
Thanks. We really do need a way of undoing the trail of damage that a rogue regex can leave in its wake. — Hebrides (talk) 09:30, 9 August 2012 (UTC)

New space after a full stop

For the new full stop rule, please report any false positives here. I've run it through several thousand of the most difficult articles, domain stuff mainly, but it's conceivable that it has a blind spot and I don't know where to look. So any reports of false positives would be useful; even one would be great. Regards, Sun Creator (talk) 13:25, 8 August 2012 (UTC)

Preparing for 2000 article check with TypoScan. Will respond after I run the test. ChrisGualtieri (talk) 13:41, 8 August 2012 (UTC)
Question. I assume it is meant to fix errors such as this, "public.Among" -> "public. Among" in Kairos Future, right? It is not adding the space to this and other articles. I haven't taken it on a test drive in the 'India section' of Wikipedia, where such sentences have higher than normal errors and lack of spacing. ChrisGualtieri (talk) 14:59, 8 August 2012 (UTC)
I think a mistimed rule edit, invoke the RETF. Try again. Let me know the article if the problem persists. Regards, Sun Creator (talk) 15:08, 8 August 2012 (UTC)
Still continues. The only reason it hits the page with Regex is because of an actual typo from before, but it is not catching the spacing matter. ChrisGualtieri (talk) 15:14, 8 August 2012 (UTC)
What is the article name? Regards, Sun Creator (talk) 15:16, 8 August 2012 (UTC)
Kairos Future as noted above. :) ChrisGualtieri (talk) 15:20, 8 August 2012 (UTC)
The rule worked. Not sure why it doesn't for you. Regards, Sun Creator (talk) 15:24, 8 August 2012 (UTC)
Huh. That is unusual, I'll try it again later on and report back. ChrisGualtieri (talk) 15:56, 8 August 2012 (UTC)
Works now. Odd why it didn't take on the refresh before. ChrisGualtieri (talk) 03:46, 9 August 2012 (UTC)

I've disabled this. It's a great rule but many computer articles have valid 'Somevariable.Somefunction' or 'Somesoftware.Someproduct' used in them. I don't feel that adding {{not a typo}} to many articles is productive at this point. Regards, Sun Creator(talk) 16:58, 9 August 2012 (UTC)
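The disabled rule is not quoted on this page, so the following is only a guess at its general shape: a lowercase letter, a full stop, then a capitalised word. Under that assumption the sketch shows both the intended fix and the kind of false positive described above.

```python
import re

# Hypothetical approximation of a "space after full stop" rule.
FULL_STOP = re.compile(r"([a-z])\.([A-Z][a-z])")

def add_space_after_full_stop(text):
    """Insert a space between a sentence-ending full stop and the next word."""
    return FULL_STOP.sub(r"\1. \2", text)

print(add_space_after_full_stop("open to the public.Among its clients"))
print(add_space_after_full_stop("call Somevariable.Somefunction here"))
print(add_space_after_full_stop("see example.com for details"))
```

Lowercase domain names are left alone, but nothing in the pattern can tell a run-on sentence from a dotted identifier.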

Rule tuning

Before fine tuning the existing rules I'd like to establish the purpose clearly and ideally get consensus on the general intent of the rules.

Degree of precision

At one end you can have blunt rules with many false positives; at the other you can have precise rules which deal with specific variations of a word that have yet to occur. Some options on this spectrum may be:

  1. Basic word, anything goes, no consideration of variants
  2. Check the most common related forms
  3. Check variants in several dictionaries including related forms
  4. Check variants in several dictionaries including related forms ignoring stuff not in the wild
  5. Check variants in several dictionaries including related forms and related forms of related forms etc
  6. Check variants in several dictionaries including related forms and related forms of related forms etc ignoring stuff not in the wild
  7. No false positive is acceptable, disable any rule that produces any false positives
Most rules today appear to be a 2, occasionally some are 4. It's also to be noted that precision is related to length of root letters. I'd like to see rules become more precise, ideally a 6. Regards, Sun Creator (talk) 15:07, 9 August 2012 (UTC)

Exceptions

How much should a rule deal with exceptions? A rule should:

  1. Ignore exceptions
  2. Handle the most obvious exceptions
  3. Handle common exceptions found or reported
  4. Handle common exceptions occurring in Wikipedia
  5. Handle common exceptions occurring on the internet
  6. Handle reoccurring exceptions in Wikipedia
  7. Handle reoccurring exceptions on the internet
  8. Handle all exceptions in the wild (probably technically impossible)

URL options

Regardless of the rule, you could add a begin and end part to avoid website URLs and domain names, but it would result in a longer rule and an occasional miss of a typo. Is this a desired option?

Splitting up existing rules

In some cases splitting a rule into two would result in more precision, especially if a rule doesn't deal with a single typo. If precision is the aim, is it okay to split a rule?

Multiple possibilities

Many typos have multiple possibilities. 'distict' could be corrected to 'district' or 'distinct' or simply ignored. Maybe in the future a disambiguation option like a spell checker could be available but for now we have a more limited choice. Many of our current false positives are as a result of a rule picking the incorrect choice out of multiple possibilities.

Should the purpose be to correct with multiple rules, correct to the most likely word with only one rule or leave it alone entirely?

Documentation

In order to tune a rule you have to first work out what you want it to correct and what to avoid, and once a rule is created, to know its pitfalls. It would seem appropriate to leave separate documentation showing the typos fixed along with false positive information, so that others could check or adjust a rule at a later time. Would individual /Typos/Rulename pages for each rule be welcomed?

Feedback appreciated. Regards, Sun Creator (talk) 14:52, 9 August 2012 (UTC)

Edit summary

Please check Wikipedia_talk:AutoWikiBrowser/Feature_requests#Improve_edit_summary_for_.22typos_fixed.22. Do you think we should implement this for enwiki? -- Magioladitis (talk) 19:04, 1 August 2012 (UTC)

Good idea, so that this part of the edit summary is self-contained. -- John of Reading (talk) 19:43, 1 August 2012 (UTC)
Sounds like a good idea to me!  :-) GoingBatty (talk) 03:08, 2 August 2012 (UTC)
(Aside) And if the general fixes added "[[AWB/GF|general fixes]]" if and only if the general fixes did anything, I wouldn't have to pick one of my two edits summaries before saving each edit. -- John of Reading (talk) 20:19, 1 August 2012 (UTC)

rev 8255 done for en-wiki Wikimedia projects. Rjwilmsi 08:25, 13 August 2012 (UTC)

C# code → C#code ???

Why is AWB changing "C# code" to "C#code"? I haven't tried any tests, but several other programming languages also end with # and might be caught by the same unfortunate rule. – Hebrides (talk) 10:26, 10 August 2012 (UTC)

I think that's because AWB has logic to remove the space after # for the external links sections. Maybe the code needs to be refined a little. Kumioko (talk) 11:08, 10 August 2012 (UTC)
What article? I tested on List of numerical libraries, and it was fine. Regards, Sun Creator(talk) 15:31, 10 August 2012 (UTC)
Sorry, I was just AWBing through 500 new articles and when I spotted it wanted to change [[C# code]] to [[C#code]] I just clicked Skip for that article. So I'm sorry I have no idea which of the 500 it was. A few articles later I decided I'd better flag up this problem here. I don't have AWB on the computer I'm using this evening, or I'd test it out by putting [[C# code]] into a sandbox. — Hebrides (talk) 21:12, 10 August 2012 (UTC)
The next time AWB tries to make a questionable change, the first thing to do is hit the "Typos" tab, and it will show you what Typo rule fired on that article. Chris the speller yack 21:16, 10 August 2012 (UTC)

rev 8253 Exception for C# code etc. in genfixes function FixLinkWhitespace. Rjwilmsi 06:36, 13 August 2012 (UTC)

Thanks, Rjwilmsi, but you seem to have included only C# and F# in your exception. Probably worth catering for A# and J# too. Cheers — Hebrides (talk) 11:57, 13 August 2012 (UTC)
rev 8256. Rjwilmsi 13:03, 13 August 2012 (UTC)
Good. Thanks. — Hebrides (talk) 13:12, 13 August 2012 (UTC)
So this is genfixes related and not directly about typos. Regards, Sun Creator(talk) 13:33, 13 August 2012 (UTC)
I just built and tested rev 8258 and confirmed this problem is now fixed. Thanks Rjwilmsi. — Hebrides (talk) 06:17, 14 August 2012 (UTC)

Womens and Mens

Why is Womens always converted to Women's with the "-men's" rule but Mens is not? I don't understand the rule or maybe the exceptions. Regards, Sun Creator(talk) 12:06, 10 August 2012 (UTC)

'Womens' does not yield as many false positives as 'Mens', which will hit phrases such as "mens rea" and "Mens sana in corpore sano". Chris the speller yack 13:23, 10 August 2012 (UTC)
Oh yes. It's obvious when you point that out! Regards, Sun Creator(talk) 15:45, 10 August 2012 (UTC)
According to Apostrophe#Possessives in names of organizations (version of 20:17, 31 July 2012), "[s]ometimes the apostrophe is omitted in the names of clubs, societies, and other organizations, even though the standard principles seem to require it".
Wavelength (talk) 14:43, 10 August 2012 (UTC)
Not compelling. The only 'womens' exception is an organisation without any mention on Wikipedia except the Apostrophe page. Regards, Sun Creator(talk) 15:45, 10 August 2012 (UTC)

"long time" hyphenation

An uncertain suggestion for discussion: is it possible or wise to establish a rule hyphenating "long time" before (and only before) a noun? I've been manually cleaning up some by searching phrasing like "his long time" or "her long time", but this won't catch phrases like "Jane Jones, a long-time opponent of birth control," etc. On the other hand, a rule of "long time [noun]" to "long-time [noun]" would create some false positives from "a long time period" or a "a long time capsule". Khazar2 (talk) 20:41, 11 August 2012 (UTC)

As an update to this, I've now corrected several hundred instances of "long time friend" to "long-time friend" with AWB. If it's not possible to make a more general rule about this, perhaps one could be crafted simply by looking for common phrases like "long time friend", "rival", "boyfriend", etc. Khazar2 (talk) 23:59, 12 August 2012 (UTC)
What should not be overlooked is that most dictionaries indicate that "longtime" should be closed, not hyphenated. If you prefer the hyphenated form (allowed in some dictionaries), the most proper way to fix these is to make two passes: 1) Skipping pages that contain "longtime", changing "long time" to "long-time"; 2) Skipping pages that do not contain "longtime", changing "long time" to "longtime". This way the changes will conform to the style of each article. My preference is "longtime", but Macmillan (usually the best reference on hyphenation) and Cambridge specify "long-time", so I won't change that to the closed form. Chris the speller yack 14:57, 13 August 2012 (UTC)
Thanks, Chris. I'll follow your suggestion. Khazar2 (talk) 15:03, 13 August 2012 (UTC)
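The two-pass approach can be folded into one function that picks the style already used on the page. This is a sketch under assumptions: the noun list is a hypothetical sample taken from the phrases mentioned in this thread, and a real run would still need each change reviewed.

```python
import re

# Hypothetical noun list; only "long time" directly before one of these is changed.
LONG_TIME = re.compile(r"\b([Ll])ong time (?=(?:friend|rival|boyfriend|opponent)s?\b)")
CLOSED_FORM = re.compile(r"\blongtime\b", re.IGNORECASE)

def fix_long_time(article_text):
    """Use "longtime" if the article already does, otherwise "long-time"."""
    joiner = "" if CLOSED_FORM.search(article_text) else "-"
    return LONG_TIME.sub(lambda m: f"{m.group(1)}ong{joiner}time ", article_text)

print(fix_long_time("his long time friend"))
print(fix_long_time("a longtime ally and long time rival"))
print(fix_long_time("over a long time period"))
```

Restricting the rule to a list of nouns avoids "a long time period" and "a long time capsule" at the cost of missing nouns that are not listed.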

-ound- rule

The "-ound-" rule now no longer matches further endings yet still has a $2, what is the rule now supposed to be doing? Rjwilmsi 06:20, 13 August 2012 (UTC)

Oops, I've removed the $2; it's not needed and the rule was tested without it. The word's ending (if there is one) is left the same, as this rule deals with the earlier "uond" part, so now both "Gruond"=>"Ground" and "Suondproof"=>"Soundproof" work. Regards, Sun Creator(talk) 09:40, 13 August 2012 (UTC)
Though now it won't meet the convention that typo rules match at least a whole word, so that the edit summary shows entire words? Rjwilmsi 21:39, 13 August 2012 (UTC)
Wasn't aware of any such convention. Don't see that written anywhere, but I'll go adjust it to give a pretty edit summary. Regards, Sun Creator(talk) 21:58, 13 August 2012 (UTC)
The edit summary now shows the middle and end of the word. Is it convention to show the word in the edit summary in full? This rule doesn't look like it has ever shown the word in full. It's possible to do that of course, but it's a few more cycles to do it that way. Regards, Sun Creator(talk) 22:19, 13 August 2012 (UTC)
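A small sketch of the trade-off, assuming the edit summary is built from the text each rule matched and its replacement (an assumption about AWB, not a confirmed detail). Both patterns are hypothetical simplifications of the "-ound-" rule.

```python
import re

# Hypothetical partial-word form: matches only the transposed letters.
PARTIAL = re.compile(r"uond")

# Hypothetical whole-word form: captures the letters before and after.
WHOLE = re.compile(r"\b(\w*)uond(\w*)\b")

def summarise(pattern, replacement, text):
    """List "matched -> replaced" pairs, as an edit summary might show them."""
    return [f"{m.group(0)} -> {m.expand(replacement)}" for m in pattern.finditer(text)]

print(summarise(PARTIAL, "ound", "Suondproof walls"))
print(summarise(WHOLE, r"\1ound\2", "Suondproof walls"))
```

The whole-word form gives a readable summary, but the extra `\w*` groups cost some backtracking on every word, which is the "few more cycles" mentioned above.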

Extra rules with false positives

What do we do with rules that naturally have lots of false positives but are still useful when used with care? I have some in my find and replace. Do we want to throw them in the standard rules? Probably not, but shall we have a separate list for anyone who wants additional find and replaces? Regards, Sun Creator(talk) 14:00, 13 August 2012 (UTC)

Good idea. The separate list should allow plenty of room for a description of what to watch out for. Chris the speller yack 14:42, 13 August 2012 (UTC)
I agree--good idea. Khazar2 (talk) 14:51, 13 August 2012 (UTC)
Have made a start at Wikipedia:AutoWikiBrowser/Typos/Extra. Feel free to change the formatting; I have no real idea what the best layout is for this. Regards, Sun Creator(talk) 16:33, 13 August 2012 (UTC)

Superbowl -> Super Bowl

The American Super Bowl may well always be spelt this way but il Superbowl (the Italian equivalent) is not and this has now twice been corrected on this page and perhaps on others. Please could this error be corrected? mgSH 12:12, 18 August 2012 (UTC)

I found three pages where this has happened, and corrected them all and wrapped a "Not a typo" template around them. This should prevent both AWB users and manual editors from changing them. This is the best way to handle such a rare occurrence, rather than monkeying with AWB. Chris the speller yack 15:09, 18 August 2012 (UTC)
Ah, thanks; I didn't know this was possible. mgSH 18:18, 18 August 2012 (UTC)

Edit request on 23 August 2012

Request to add fix of spelling error jewellary into regex. Despite the article redirect, the error seems to persist: http://en.wikipedia.org/w/index.php?search=jewellary&fulltext=Search Chrishelenius (talk) 18:14, 23 August 2012 (UTC)

  Done. I don't see many occurrences, but expanding the "Jewellery" rule slightly should incur very little additional cost. Chris the speller yack 20:38, 23 August 2012 (UTC)

Suggestion regarding "New" additions

Some of the "New" additions have been there for a very long time and some even duplicate typo fixes found further below. What is the procedure if any for moving them down? How long do we leave them there before they are no longer new?

Also, some, such as some names seem unnecessary and relatively low impact. Some such as Sam Elliot would probably be better IMO if we just took a few at a time and ran them as tasks, removed them from the list and add them to a subpage showing they were there and what we did about them. Kumioko (talk) 01:13, 28 August 2012 (UTC)

One thing that might help editors to answer those questions is a mechanism for recording, for each listed item, the date and time of its addition to the list, the date and time of its removal from the list, and the number of true-positive corrections made because of its presence on the list.
Wavelength (talk) 01:22, 28 August 2012 (UTC)
Each edit is recorded in the wiki, so you can find out when something is added or deleted. But how would you suggest capturing the number of "true-positive corrections"? GoingBatty (talk) 03:57, 28 August 2012 (UTC)
The revision history does show many of the details that I mentioned, but some searching is required if one wishes to find the date and time of the addition or removal of a particular item. I had in mind a separate list for compiling additions and removals, which now I suggest can be a sortable wikitable with columns for "item", "date and time of addition", and "date and time of removal".
The AutoWikiBrowser might record the number of revisions (supposed "corrections") that it makes for each item listed at Wikipedia:AutoWikiBrowser/Typos. Those numbers might be compiled in one place, possibly in a fourth column in the previously mentioned sortable wikitable. Human editors who revert "false-positive" corrections might record corresponding numbers in a fifth column there. Human editors might also record, in a sixth column, the difference between the numbers in columns 4 and 5. Human editors might also record, in a seventh column, the value of each number in column 5 as a percentage of the corresponding value in column 4. Spreadsheets might help with the calculations.
Wavelength (talk) 20:47, 28 August 2012 (UTC)
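The arithmetic for the proposed sixth and seventh columns is simple enough to show directly. The function name and the sample counts below are made up for illustration.

```python
def rule_stats(corrections, reverted):
    """Return (net true positives, false positives as a percentage of corrections).

    corrections: number of changes made for a rule (column 4)
    reverted:    number of those changes reverted as false positives (column 5)
    """
    net = corrections - reverted                                  # column 6
    pct = 100 * reverted / corrections if corrections else 0.0    # column 7
    return net, pct

# Hypothetical counts: 200 changes made, 10 of them reverted.
print(rule_stats(200, 10))
```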

What is the scope of a typo rule

What is the default scope of a typo rule in AWB? I mean, does it search in interlanguage links, inside <!-- commented out text -->, inside <syntaxhighlight=code> here</syntaxhighlight>, <ref>references</ref> and "quoted text"? Some rules don't apply in some cases; for example, some consider that grammar fixes should not be made in quotes but spelling fixes can be. Perhaps an option can be added to each rule to define its scope. Regards, Sun Creator(talk) 13:46, 29 August 2012 (UTC)

I believe that the typo rules in general skip the following things: Comments, templates, and the area next to sic templates. I'm not sure about Source code or other HTML tags. Kumioko (talk) 14:46, 29 August 2012 (UTC)

Ultra-high-definition television

I think that "Ultra-high-definition television" looks fine and proper. On the other hand, I wouldn't be brokenhearted if the hyphen after "ultra" were dropped, because there is really no chance that a reader would stumble over it by thinking that "ultra" was modifying "definition" or "television", which is the driving reason for using hyphens in compound modifiers; "ultra television" would not be understood. But "ultra high-definition television" looks strange with just the one hyphen, as "high definition" is so pervasive that it does not really need a hyphen even when used adjectivally. So I would prefer two hyphens, or no hyphens, to a single hyphen. As for "Ultra-high definition television", now that could be a stumbling block for readers. Chris the speller yack 12:44, 31 August 2012 (UTC)

Well, now that I vented, I see that there is a lively discussion about a proposed renaming on Ultra-high-definition television, where the choice is between good punctuation and the punctuation chosen by the industry's engineers and advertising folks; not surprisingly, they have chosen the worst option of the four possibilities for hyphenating (or not hyphenating). That talk page is a better venue for this discussion than the AWB/Typos talk page. — Preceding unsigned comment added by Chris the speller (talkcontribs) 18:42, 31 August 2012

Louisianian or Louisianan

An update on this subject that got archived. The official name of a native of Louisiana is Louisianian; however, from December 2007 until today the typo rule had been changing the correct spelling to an incorrect one. Regards, Sun Creator(talk) 22:33, 31 August 2012 (UTC)

"full-time" and "part-time" false positives

I've been playing with searches for "full-time" and "part-time", and these rules seem to generate an unfortunate number of false positives--or perhaps a better way to put it would be unnecessary positives. My understanding is that the phrase "full-time work" must always be hyphenated, but "work full time" may or may not be. Quick searches of the LA Times [26] and NYT [27] show that their style guides allow both usages, so the hyphenated/non-hyphenated distinction appears to be a null issue. Would it be possible to reset this rule to only cases where the words "full time" or "part time" precede the noun? Khazar2 (talk) 14:51, 30 August 2012 (UTC)

Macmillan Dictionary (which I have found to be very specific and very dependable on hyphenation issues) lists the adjective "full-time" with the notation "usually before noun" – "It is hard to combine study with a full-time job." And it lists the adverb "full-time" – "Her youngest child is in daycare full-time." Is there a case where a sentence is better because "full time" is unhyphenated? I can't think of a case where the hyphen could confuse a reader, and it sure is going to make the fixing of the adjective more difficult if the Typo rule has to list all possible nouns that could possibly follow "full-time", or adjective-noun phrases, such as "a full-time, permanent job". WP:HYPHEN says "Consult a good dictionary", but not "Consult a big newspaper". The punctuation in most Wikipedia articles stinks; how is it ever going to get better if more obstacles are placed in front of editors and tools are taken away? Chris the speller yack 02:34, 31 August 2012 (UTC)
I share your concern for Wikipedia spelling and punctuation, of course. But I'm also wary of setting AWB to auto-correct things that appear to be legitimate variation, and this rule generates a tremendous number of neutral edits. An equal case could be made that having tens of thousands of valid sentences like "he worked full time" flagged for review and correction is itself an obstacle, due to the slowdown it creates in other work. (And it does seem to me that newspaper style guides can be considered at least a legitimate variant here; at the very least, if the New York Times is also employing it, this is not a usage that's begging for correction.)
I'm a big fan of your work generally, though, so having said my piece, I'm happy to yield to your judgement if no one else objects. Cheers, and thanks for all your work, Khazar2 (talk) 03:11, 31 August 2012 (UTC)
I prefer the exclusive use of the hyphenated form for the technical reasons explained by Chris the speller, and technical reasons have been invoked at WT:MOS and WP:MOS. To forestall complaints by subsequent editors, the edit summary can mention "technical reasons". Also, I recommend that this be discussed at WT:MOS, but please wait until User:Noetica is again available.
Wavelength (talk) 03:40, 31 August 2012 (UTC)
I can't think of a single instance where full time shouldn't be hyphenated, whether before or after the noun it qualifies: she worked full-time; they were resource-constrained. Sure, there's slightly less imperative to hyphenate after the noun than before, but some items have the hyphen ingrained wherever they are. Here's a grammatical twist that would be a false positive: the stadium was full time after time. Probably vanishingly rare. Tony (talk) 03:51, 31 August 2012 (UTC)
I think doing such things automatically is always dangerous, and people who can't think of where it's not right are not being very imaginative. See for example the usage here.
But speaking of hyphens, what do people think of hyphens as I put them in Ultra-high-definition television? There's an RM to remove at least one... Dicklyon (talk) 06:00, 31 August 2012 (UTC)
I have added the section below to discuss the Ultra-high-definition television question.
But AWB Typo rules are not applied automatically; a human editor looks at every change to see if it is appropriate. Chris the speller yack 12:48, 31 August 2012 (UTC)

Sorry to return to this one again, but I've encountered another "full time" situation that I wanted to check in on. When one says "full time" to mark the completion of a rugby or association football match, should this be hyphenated? (As in, "a few minutes before full time, ...") I've run into a few dozen of these in football articles so far, and wanted to check before changing any. I note that Wiktionary has this listed at "full time" (unhyphenated), but my American dictionaries don't cover this usage. Khazar2 (talk) 21:55, 1 September 2012 (UTC)

I ground to a halt over the same issue, but then started up again after seeing that Association football#Duration and tie-breaking methods had it hyphenated. I put no stock in Wiktionary. Hope this helps. Chris the speller yack 01:12, 2 September 2012 (UTC)
That's good enough for me--thanks. Khazar2 (talk) 04:22, 2 September 2012 (UTC)
Agree with Chris: no trust in Wiktionary from me. On the football term, half-time seems more likely to demand the hyphen. I'm unsure, but wouldn't be upset if the term weren't hyphenated predicatively (after the noun). But before the noun, like full-time score, it would be needed. Tony (talk) 08:39, 2 September 2012 (UTC)

Double letters

I notice a lot of the words on the typo list have double letters like TT, SS, RR, PP, etc., but we aren't using any logic to catch typos where people misspell them: Mississipi rather than Mississippi, for example. I realize that this won't work for every one, but I think there are a lot for which it could. Kumioko (talk) 20:44, 31 August 2012 (UTC)

It's a good idea to check for a single occurrence when a double occurrence is expected. I added this to the format(t) rule a while back to handle formating instead of formatting etc. Regards, Sun Creator(talk) 22:43, 31 August 2012 (UTC)
Good news - the Mississippi rule already fixes "Mississipi". GoingBatty (talk) 00:35, 3 September 2012 (UTC)
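A minimal sketch of the single-versus-double-letter check discussed above, written in Python rather than as actual AWB rules. Both patterns are assumptions made for illustration and are not copied from the Typos page; the negative lookahead keeps the Mississippi pattern from matching the correct spelling.

```python
import re

# Hypothetical pattern: each expected double letter may appear once or
# twice, but the fully correct spelling is excluded by the lookahead.
MISSISSIPPI = re.compile(r"\b(?!Mississippi\b)Miss?iss?ipp?i\b")

# Hypothetical pattern: "formating"/"formated" with a single "t".
FORMAT = re.compile(r"\b([Ff]ormat)(ing|ed)\b")


def fix_double_letters(text: str) -> str:
    """Restore missing double letters in the two sample words."""
    text = MISSISSIPPI.sub("Mississippi", text)
    return FORMAT.sub(r"\1t\2", text)


print(fix_double_letters("Mississipi uses odd formating."))
```

Correctly spelled text passes through unchanged, because "formatting" has a "t" where the pattern expects "ing" or "ed".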

New rule "On board"

I have had many complaints and questions in the past about the difference between "on board" and "onboard", so I will lay it out here and reference this discussion in a comment attached to the rule.

The adjective "onboard" (or "on-board", according to a few dictionaries) is attributive, and is always followed by a noun (or another adjective and noun):
"They brought their own sandwiches, as the onboard food was usually tasteless".
"He hoped there was enough power for the on-board electrical devices."
The prepositional phrase or idiom "on board" indicates that something is located or installed in a train, airplane or vessel:
"Everyone was on board, so he shut the door."
"She was glad to see that there was a toaster on board the lifeboat."

The Typo rule fixes many cases where "onboard" is followed by something other than a noun or adjective (such as punctuation, an article or an adverb), indicating that it is not used attributively, so it knows that "on board" should be substituted. The rule certainly misses many misuses of "onboard", but after much testing it has produced next to zero false positives, and there are a ton of these to be fixed. Chris the speller yack 01:15, 23 July 2012 (UTC)
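The approach described above can be sketched roughly as follows in Python. This is not the actual Typo rule: the pattern below is a simplified assumption that only treats punctuation and the articles "the", "a" and "an" as signs of non-attributive use.

```python
import re

# Simplified, hypothetical version of the idea: change "onboard" only
# when the next token cannot be the noun or adjective it would modify.
ONBOARD = re.compile(r"\b([Oo])nboard(?=[.,;:]|\s+(?:the|an?)\b)")


def fix_onboard(text: str) -> str:
    """Split "onboard" into "on board" where it is not attributive."""
    return ONBOARD.sub(r"\1n board", text)


print(fix_onboard("Everyone was onboard, so he shut the door."))
print(fix_onboard("There was a toaster onboard the lifeboat."))
print(fix_onboard("The onboard food was usually tasteless."))
```

The third sentence is left alone because "onboard" is followed directly by a noun.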

I haven't seen any false positives yet and I've corrected quite a few so far. ChrisGualtieri (talk) 15:02, 23 July 2012 (UTC)

Sorry, you are making a basic grammatical error here. Attributive adjectives are not always immediately followed by the noun, although they are usually. The important fact when considering the adjective when it occurs after the noun is whether or not there is a linking verb between the noun and the adjective. If there is no such verb then the adjective is still attributive. - Nick Thorne talk 15:01, 31 July 2012 (UTC)

I was trying to keep things simple enough that most AWB editors (and AWB critics) can get a handle on what the rule is trying to accomplish without spending a whole afternoon on a grammar refresher course. The point is that the rule does a good job of avoiding changes where "onboard" is an attributive (that's spelled correctly, BTW) adjective. If you have seen cases where the rule has changed an actual attributive case of "onboard" to "on board", please let us know. I think you'll have a hard time finding even one or two cases of the attributive use of "onboard" that is not followed immediately by a noun or another adjective. If you can't find such cases, what is the point of making this discussion more complicated? Our purpose here is to improve and maintain Wikipedia, not to display our knowledge of the fine points of grammar. Creating AWB Typo rules is largely a game of controlling the odds, and this rule seems to be ahead of the game at this point. Chris the speller yack 19:30, 31 July 2012 (UTC)
Sorry about the spelling mistake in my first use of the word, now corrected. (I always try to keep my spelling correct.) I take your point about trying to keep things simple, but I question whether that is always a good thing when dealing with subtle points of grammar. As for an example, the reason I raise this whole issue was this edit of an article on my watch list. I think that bots are not best suited to making grammatical changes on the less well understood points of grammar, not least because there are always exceptions, usually contextual in nature, that make it hard or impossible to codify every possible situation. - Nick Thorne talk 22:45, 31 July 2012 (UTC)
I'm skeptical that the edit you flag here is a false positive. Googling NYT and BBC (to make sure ENGVAR isn't an issue), "people on board" outnumbers "people onboard" by about 150:1. "Personnel onboard" has a smaller sample size but equivalent results. Clearly the former is the preferred usage. Khazar2 (talk) 23:17, 31 July 2012 (UTC)
Khazar2 is right: in the example provided by Nick Thorne, "onboard" was not used attributively, and should be two words; it is a prepositional phrase, the equivalent of "aboard". While writing that last sentence, I suddenly realized that there is a simple test to help decide whether "onboard" or "on board" should be used: if "aboard" could be substituted, then "on board" is correct; otherwise, "onboard" should be used. Using the above example, "Everyone was aboard, so he shut the door." makes as much sense as "Everyone was on board". Another point: AWB is not a bot; editors are looking at each change to verify its correctness. Chris the speller yack 03:10, 1 August 2012 (UTC)
Your example fails because there is a linking verb between the noun and the adjective, an important point. In the Nias article it said the aircraft had 11 people onboard. This could have been written the aircraft had 11 onboard people with no change in meaning, it just seems a little unnatural which is why the adjective follows the nouns in this case. The word onboard in both cases is being used attributively - it is attributing the property of location to the people. As a former Fleet Air Arm officer, I watch many pages related to naval aviation and nautical matters. It was because of this that the subject came to my attention. The word onboard is perhaps not very common in everyday speech, but in aviation and nautical discussions it has a particular meaning which is not quite the same as on board. One of the things that disappoints me about Wikipedia is that sometimes well-intentioned people make changes to articles that indicate an incomplete understanding of the particular subject. It is a form of unintentional dumbing down of the encyclopedia. I would have thought that one of the purposes of the encyclopedia is to educate people. If part of that is making sure that obscure points of grammar are attended to then IMO that is no bad thing. This is not a criticism of your work, on the contrary, fixing up spelling and grammar mistakes in the encyclopedia is a great service to the community. In this case however, I think you're missing a subtle shade of meaning. In any case I don't plan to keep on about this. If you decide to change the article back I will of course be happy about that. If not, well let's face it, it's not the most pressing issue on Wikipedia is it? - Nick Thorne talk 23:20, 1 August 2012 (UTC)
I know perfectly well what a prepositional phrase is. The most pressing issue on WP is accuracy, but spelling, grammar and punctuation are important. I will continue to correct those aspects as well. Chris the speller yack 02:56, 2 August 2012 (UTC)

Australian English any different? 'come on-board' TransACT. Regards, Sun Creator(talk) 13:58, 3 September 2012 (UTC)

No, not different, though there is one editor who claims that it is whenever his personal hyphenation style is at variance with every modern dictionary. Chris the speller yack 01:31, 4 September 2012 (UTC)

Misspelling of "government"

I recently corrected more than 30 misspellings of "government", which I found by searching for "goverment". Along the way, I found many occurrences of that misspelling in web addresses and Wikipedia file names. Is there a practical method for correcting the misspelling in those file names?
Wavelength (talk) 00:25, 1 September 2012 (UTC)

Request the file(s) to be renamed on Wikimedia Commons, assuming that is where the files reside. Regards, Sun Creator(talk) 01:08, 1 September 2012 (UTC)
Thank you for your reply.
Wavelength (talk) 02:08, 1 September 2012 (UTC)
I have not yet edited on Wikimedia Commons, and I am not yet ready to monitor a watchlist there, but I have started a list at User:Wavelength/About Wikipedia/File namespace#Misspellings in titles. In the future, I might comment at Wikimedia Commons, and provide a link to that list. Meanwhile, other Wikimedians are welcome to monitor that list and to mention it or its contents at Wikimedia Commons.
Wavelength (talk) 21:07, 1 September 2012 (UTC)
Great idea, in process of moving File:Goverment_surplus_or_deficit_(EU-USA-OECD).png and File:Goverment.jpg. Regards, Sun Creator(talk) 22:20, 1 September 2012 (UTC)
I have changed the heading—User:Wavelength/About Wikipedia/File namespace#Files with misspellings—because misspellings can occur in the body of a file.
Wavelength (talk) 16:29, 3 September 2012 (UTC)

Question about "a vs. an" rule

In Eden Springs Europe, AWB wants to change "securing a EUR 150 million credit facility" to "securing an EUR 150 million credit facility". Seems this change is incorrect whether you would pronounce this "a hundred-fifty million euro", "a one hundred-fifty million euro" or "a euro 150 million". Thoughts? GoingBatty (talk) 20:07, 3 September 2012 (UTC)

Should be "a EUR 150 million" but "an EUR 80 million", I've updated the rule to leave it unchanged, please reload it. I think there will be other currencies with the same thing, USD is ok because capital U is pronounced 'yoo'. Regards, Sun Creator(talk) 20:39, 3 September 2012 (UTC)
Confirmed the fix - thanks for the quick service! GoingBatty (talk) 20:48, 3 September 2012 (UTC)

Avoid having a rule detect a correct spelling

Writing typo rules says : Avoid having a rule detect a correct spelling

  • Is the above a rule or a guide? If it's a rule then both the lifetime break it. There may be others.
  • It seems, however, that it's better to create a single rule that detects correct spelling than multiple ones, but perhaps I'm missing something. Regards, SunCreator (talk) 08:58, 6 August 2012 (UTC)
  Fixed with this edit. It's a rule, otherwise you end up with edit summaries like "Typos fixed: lifetime ban -> lifetime ban". -- JHunterJ (talk) 12:50, 6 August 2012 (UTC)
  1. The point really was not about one rule in particular but about the logic behind the guidance, to see whether it is still required. There are long-standing rules that match correct spelling
    • "New Hampshire", "Rhode Island" and "Uninhabited" rules match correct spelling to name but three
  2. Rules that detect a correct spelling don't seem to cause anyone a problem
  3. The writing of rules to correct more than one problem is compromised by the avoiding of correct spelling. When you have two or more possible things to correct you have to pick the lesser one to ignore in order to avoid a self match.
    e.g. The "Cayman Islands" Capitalisation rule corrects "Cayman islands" but "cayman Islands" is left untouched; it's a compromise by the rule writer to avoid the correct spelling.
  4. The example given in Wikipedia:AutoWikiBrowser/Typos#Writing_typo_rules of ("false positive": for example, a rule that fixes housa to house must not detect thousand or house) does not avoid detection of correct spelling rather it avoids a false positive. Regards, Sun Creator(talk) 02:26, 5 September 2012 (UTC)
"New Hampshire" and "Rhode Island" are not long standing rules; they were added on August 26, 2012. I've fixed the Rhode Island rule so it doesn't match the correct spelling. I'll wait until we get consensus on my discussion below about the "New England" rule before changing the "New Hampshire" rule.
The "Uninhabited" rule appears to only match "Unihabited" (missing the second "n"). When I run AWB on articles that contain "Uninhabited", it's not identifying a typo fix for me. Do you have an example? Thanks! GoingBatty (talk) 04:39, 5 September 2012 (UTC)
Okay, but the examples are kind of getting away from the point: having a rule detect a correct spelling does no harm (it seems), but writing the rule to avoid it does harm, because the rule is made to skip a part that it could otherwise correct. Regards, Sun Creator(talk) 05:10, 5 September 2012 (UTC)
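The trade-off in this thread can be illustrated with a small Python sketch. It assumes a hypothetical tool that drops no-op matches from its edit summary; nothing here describes how AWB itself builds summaries, and the pattern is invented for the example.

```python
import re

# Hypothetical broad rule: matches every capitalisation variant,
# including the already-correct "Cayman Islands".
BROAD = re.compile(r"\b[Cc]ayman [Ii]slands\b")
CORRECT = "Cayman Islands"


def fix_with_summary(text: str):
    """Apply the broad rule, recording only matches that really changed."""
    changes = []

    def repl(match):
        if match.group(0) != CORRECT:
            changes.append(f"{match.group(0)} -> {CORRECT}")
        return CORRECT

    return BROAD.sub(repl, text), changes


fixed, summary = fix_with_summary(
    "the cayman Islands, the Cayman islands and the Cayman Islands"
)
print(fixed)
print(summary)
```

Without the `!=` check, the third match would produce a summary entry like the "lifetime ban -> lifetime ban" case mentioned above, which is the problem the guidance is trying to prevent.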

a high > an high

False positive at China Railways CRH380A, "on the Zhengzhou - Xi'a high-speed line" -- John of Reading (talk) 07:55, 4 September 2012 (UTC)

That is with the old 'A h-' rule. Now fixed with this change. Regards, Sun Creator(talk) 09:57, 4 September 2012 (UTC)

New England/Mexico rules?

It seems that the New England (& New Mexico) rules may encounter four capitalization varieties:

  1. "New England" (correct, should not be changed)
  2. "New england" (incorrect, should be changed to "New England")
  3. "new England" (may be correct, should not be changed)
  4. "new england" (incorrect, "England" should be capitalized, but "new" should not based on rule above)

Therefore, if my premise is correct, the only thing this rule should be doing is capitalizing "England", why not just have an "england" --> "England" rule? Thanks! GoingBatty (talk) 04:12, 5 September 2012 (UTC)

Ah, this has caused me hours of thinking already. 4 is undetermined and could be "New England" or "new England", but chances are it is the former, I think; either way, the England part is surely capitalized. If you decide the 'new' in 4 should not be capitalized (as doing so could be a FP), then yes, just capitalize England(ers?), and the same with the Mexico rule. New York rule also? New Jersey could all be false positives, i.e. do no capitalization; I had FPs with this a few times: "The team wore the new jersey." Regards, Sun Creator(talk) 05:00, 5 September 2012 (UTC)
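The four cases listed at the top of this section can be checked with a short Python sketch. The pattern is an assumption for illustration, not the rule on the Typos page; it capitalises only the place name and never touches "new", and it deliberately leaves "new jersey" out.

```python
import re

# Hypothetical rule: capitalise only the second word, keep "new"/"New".
NEW_PLACE = re.compile(r"\b([Nn]ew) (england|mexico)\b")


def fix_new_place(text: str) -> str:
    """Capitalise "england"/"mexico" after "new" without changing "new"."""
    return NEW_PLACE.sub(
        lambda m: f"{m.group(1)} {m.group(2).capitalize()}", text
    )


for sample in ["New England", "New england", "new England", "new england"]:
    print(repr(sample), "->", repr(fix_new_place(sample)))
```

Because the pattern requires a lower-case "england" or "mexico", variants 1 and 3 never match, so the rule cannot detect a correct spelling.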

Profiling the typos

I've just noticed from profiling.txt that it takes my computer 27 seconds to run RegExTypoFix on List of Doctor Who universe creatures and aliens. On the one hand, we've been adding more and more clever rules; on the other, some users are noticing that the latest versions of the program are slower than the older versions.

I've run AWB's "profile typos" option and have posted the results in User:John of Reading/Sandbox (permanent link). They are sorted by CPU time. Right at the top of the page is the new "a to an" rule, but there are many others not far behind.

I'm tempted to turn the RegExTypoFix option off. -- John of Reading (talk) 20:37, 5 September 2012 (UTC)

Where do you get the profiling.txt? Some of those can be optimised; I have already spent some time looking at the speed of the "a to an" rule in regex. It's effectively 5+ rules in one, so it doesn't surprise me. I will look into optimising some of the rules with the profiling.txt once I can enable/find it. Regards, Sun Creator(talk) 21:45, 5 September 2012 (UTC)
Updated the "a to an" rule to be speedy. Can you let me know what its new profiling time is? Regards, Sun Creator(talk) 00:32, 6 September 2012 (UTC)
Of the 267 endings, 31 take over a third of the total processing time. They use [A-Za-z]+ at the beginning every time, with no prechecking, in order to make the edit summary pretty.
Endings can be made fast by removing the beginning check:
  • e.g. for the '-itely' rule \b([A-Za-z]+[lnst])itly\b (currently 443ms) => ([A-Za-z][lnst])itly\b (73ms), although the edit summary would not be 'pretty' and would say, for example, 'litly => litely' instead of 'impolitly => impolitely'.
Nevertheless, 31 multiplied by a saving of around 370ms per rule is over 10 seconds. So of John's 27 seconds, over 10 seconds can be saved by amending those 31 endings. Regards, Sun Creator(talk) 01:48, 6 September 2012 (UTC)
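To make the edit-summary trade-off concrete, here is a small Python sketch using the two patterns quoted above. Python's `re` module is not the .NET engine that AWB uses, so timings are not comparable; the sketch only shows what each pattern captures. The helper name `summaries` and the sample sentence are invented.

```python
import re

# Patterns as quoted in the discussion (with a single trailing \b).
WITH_PREFIX = re.compile(r"\b([A-Za-z]+[lnst])itly\b")
WITHOUT_PREFIX = re.compile(r"([A-Za-z][lnst])itly\b")
REPLACEMENT = r"\1itely"


def summaries(pattern, text):
    """Return 'old => new' strings for every match of the pattern."""
    return [
        f"{m.group(0)} => {m.expand(REPLACEMENT)}"
        for m in pattern.finditer(text)
    ]


sample = "He spoke impolitly."
print(summaries(WITH_PREFIX, sample))
print(summaries(WITHOUT_PREFIX, sample))
print(WITH_PREFIX.sub(REPLACEMENT, sample) == WITHOUT_PREFIX.sub(REPLACEMENT, sample))
```

Both patterns produce the same corrected article text; only the matched span, and therefore the summary, differs. The shorter pattern reports just a fragment of the word.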
The benefit outweighs the confusion; anyone can see the change in the diff anyway. If it speeds the process up by a considerable %, then by all means go for it. Even rules like 'a' to 'an' do not show the word which follows; they note the change and not the reason for the change, because it would be very time-consuming to explain it. If anyone DOES have a problem with it, they can see the diff or come here. I'm also planning to build a new list of Regex typos from the database dump; if amending these rules increases speed by 30%, then that should be enough of a reason to have just the change highlighted in this case. ChrisGualtieri (talk) 16:26, 6 September 2012 (UTC)
Some great work here. Don't change anything just yet though, I might be able to make a code change to the way the rules are processed to improve the speed of the endings rules without affecting the edit summary or having to change the rules themselves. Rjwilmsi 17:21, 6 September 2012 (UTC)
As a followup to a comment I made a while back. I think that there are a number of typo corrections in the existing list that are low numbers so it might be beneficial to move those to another "Inactive typos" list or something that can be run periodically instead of every time. I had checked a few in the past and couldn't find some of them at all and others only had 1 or 2 articles affected so it seems of little value for them to be on the "active" list. Kumioko (talk) 17:51, 6 September 2012 (UTC)
Sun Creator, you're not comparing apples with apples with your 31x, 370 ms analysis. Firstly, the typos.txt output is for 1,000,000/article length iterations (so 3 for the Dr Who test case), so any numbers should be divided by 3. There are 35 rules starting "\b([A-Za-z]", they total 6975 ms out of a total of 113889 ms for the Dr Who test case (on my PC, where AWB typos time is around 13.5 s, so John must have a slower CPU). That's 6.1%, so even if we rewrote them to take zero time, they would only give about 1.6 s improvement for John. Secondly, and you couldn't have known this, the typos profiling does not profile the way in which AWB actually runs the typo rules. The profiling just does IsMatch for each typo rule for n iterations and returns the summed time per rule. The actual typo fixing puts typos into larger typos in groups of 20 (this is faster, the sum profiling time / iterations for me is 113889/3 i.e. around 37 seconds, but around 13 seconds at runtime) and does IsMatch against the grouped one. 
(An example group is \b(([Ss])ea-(board?|foods?|m[ae]n|ports?|planes?|wards?|weeds?|worth(?:y|iness))|um([dntv][a-z]+)|([Uu])(?:n|nnn)(amed|atural[a-z]*|avigable|ecessar(il)?y|eeded|otice[a-z]*|umber[a-z]*)|([Ww])(ere(?:abouts|by)|isker(?:s|ed)|istl(?:er?s?|ed|ing))|([Xx])yph([io][a-z]+)|([IiUu]n)?([Aa]ccept|[Aa]rgu|[Cc]ap|[Cc]onfigur|[Ff]orgiv|[Hh]ospit|[Mm]istak|[Nn]ot|[Oo]ppos|[Ss]cal|[Tt]ranslat|[Uu]s|[Vv]alu|[Vv]ulner)(?:ea?|[eiu]a?)b(l[ey]|ilit(?:y|ies))|((?:[IiUu]n)?[Dd]e)(bat|cid|fin|form|grad|[lt]ect|not|pend|plor|p?riv|sir|spi[cs])(?:ea|i)bl([ey])|((?:[IiUu]n)?[Rr]e)(ad|ason|charge|cogni[sz]|concil|cover|cycl|deem|mark|mov|new|pai?r|pea[lt]|place|put|view|voc)(?:ea?|[eiu]a?)b(l[ey]|ility)|([BbFfHhJjmNnRrSsTtw]?|[Tt]r)aill(ed|ing)|([Mm]is|[Rr]e)?([BbFfMmRrTtWw]|[LlPp]e|[BbCcFfWw]re|[Ss](?:[hlnot]|[np]e|[ct]re))kaing(s)?|([DdQq]u|[Ee]qu|[FfNn]at|[FfNn]orm|[LlRr]eg|[Ll]oc|[Rr]e|[Tt]o[nt]|[Vv]it)all+it(y|ies)|([Ff]il|[Ll]ig|[Tt]est|[Tt]ourn)ia?ment(s?|ary)|((?:[Pp]?[Rr]e)?[Aa]rr|(?:[Ee]x|[Ii]nter|[Ss]hort|[Uu]n)?[Cc]h|[Dd]er|R|r)an(?:gei|egi)?ng|([Bb]ot|[Mm]ech|[Pp]urit|[Ss]at)annical(s?|ly)|([Aa]dam|[Aa](?:bu|tte)nd|(?:[Dd]is|[Rr]e)?[Aa]ppear|(?:[Rr]e)?[Cc]ogni[sz]|(?:[Aa]s|[Cc]on|[Dd]is)son|[Dd]efend|[Ii]gnor|[Mm]erch|[Oo]xid|[Ss]erv|[Vv]ac)(?:en|and)(ts?|tly|ci?es?|cy)|([Aa](?:ccep|cqu(?:ain|it)|dmit)|[Bb]la|(?:[Nn]on)?[Cc]omba|[Ee]xpec|(?:[Ii]n)?[Hh](?:ab|e[rs])i|[Ii]mp[ao]r|[Mm]ili|[Pp]it|[Rr]e(?:luc|mit|pen))t[ei]n((?:c[eiy]|t(?<!\b[Rr]emittent))[a-z]*)|([Aa]ssi|[Cc]on|[Ii]ncon|[Dd]i|[Ii]n|[Rr]esi)st(?:atn|ent)(s?|ly)|([Ee]dw|[Hh]ow|[Rr]ich)rad((son)?s?|ians?)|([Bb]ound|[Dd]iction|[Ll]egend|[Pp]rim|[Ss](?:al|econd)|[Tt]ern)e?r(y|ies)|([Aa]br|[Ee]v|[Ii]nv|[Oo]cc|[Pp]ersu)ation(s?|al(ly)?))\b though every time the AWB rules page changes the groups may change as well). 
Last weekend I looked at the group size number (20, and also the fact that the typos are compiled regexes so add 10 seconds to the typo time on the first run in the AWB session) and could not find a number that gave better performance. However, all may not be lost: 26 of the 205 groups take 6 of the 13 seconds. Rjwilmsi 20:12, 6 September 2012 (UTC)
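The grouping idea described above (combine rules into one large alternation, test that first, and only run the individual rules when the group matches) might look roughly like this in Python. This is a sketch of the general technique, not AWB's code: the function names, the three sample rules and the group size used in the demo are all assumptions.

```python
import re


def build_groups(rules, size=20):
    """Bundle (pattern, replacement) rules into groups of `size`.

    Each group carries one combined regex used purely as a prefilter.
    """
    groups = []
    for i in range(0, len(rules), size):
        chunk = rules[i:i + size]
        combined = re.compile("|".join(f"(?:{p})" for p, _ in chunk))
        compiled = [(re.compile(p), r) for p, r in chunk]
        groups.append((combined, compiled))
    return groups


def apply_rules(text, groups):
    """Run a group's individual rules only if its combined regex matches."""
    for combined, compiled in groups:
        if not combined.search(text):  # one test instead of `size` tests
            continue
        for pattern, replacement in compiled:
            text = pattern.sub(replacement, text)
    return text


# Invented sample rules; real typo rules are far more elaborate.
RULES = [
    (r"\bgoverment\b", "government"),
    (r"\bpubication\b", "publication"),
    (r"\bdisolv", "dissolv"),
]
GROUPS = build_groups(RULES, size=2)
print(apply_rules("The goverment will disolve it.", GROUPS))
```

On text with few typos most groups fail the prefilter, so most individual rules are never run, which is presumably where the saving comes from.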
That does not make any sense at all. You are saying "profiling.txt", which you call "typos.txt", has figures that are three times too big? Well, it doesn't matter, because I worked out my own baseline. The List of Doctor Who universe creatures and aliens article takes 27 seconds to process on my PC also, the same as John's. I used the Regex tester repeatedly to get an accurate figure for the rule times both with and without the "[A-Za-z]+" pre-code. How can it take 113889 ms for you in total (that's 113 seconds) when you have a faster PC? Perhaps you have made some miscalculations? Regards, Sun Creator(talk) 21:18, 6 September 2012 (UTC)
Grab the 5.4.0.1 snapshot and use Tools->Profile typos. Profiling.txt only has the runtime typos summary (13 s for me, 27 s for you). Typos.txt has the detail (113889 ms total for me), but as I've said it's not detailing exactly what AWB actually runs, so you cannot directly compare one and the other. Rjwilmsi 21:38, 6 September 2012 (UTC)
So profile.txt reports a figure in milliseconds about 8 to 9 times the actual time taken; that could be filed as an AWB bug. The Regex tester gives similar figures to profile.txt, so it also has a time-reporting bug. Recalculating the '-itely' rule, it currently takes 50ms and without the pretty edit summary would take around 9ms. The saving from removing 40ms from each of the 31 pretty edit summary endings would total around 1.2 to 1.3 seconds, not as good as first thought. Regards, Sun Creator(talk) 23:13, 6 September 2012 (UTC)
There's no bug. Profiling.txt and Typos.txt are measuring different things. With the regex tester you are replicating the Typos.txt method; 1.2 to 1.3 s on Typos.txt total is probably about the saving that changing the endings rules would achieve, roughly in agreement with what I measured when I changed \b([A-Za-z]+ to ( during the typo load to simulate the rules change. However, when it's applied to the actual method (as measured by Profiling.txt), it's not going to be the same time saving. As the AWB grouping behaviour already makes the typo rules about 3x faster overall (c.f. my 37 seconds Typos.txt to 13.5 s Profiling.txt), you could estimate that any time saving will be 1/3 as much. On Dr Who I only measured about a 200 ms improvement on 13.5 s for Dr Who in profiling.txt. What I think this means is that finding large improvements in performance is going to be hard, if not very hard. On the other hand, the Dr Who list is the most extreme example I've seen, typical large articles (featured articles etc.) run (Profiling.txt) in 2 to 3 seconds for me, which seems reasonable. Rjwilmsi 06:14, 7 September 2012 (UTC)
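The arithmetic behind the figures quoted in this thread can be reproduced with a few lines of Python. All inputs are numbers stated by the participants above; the variable names are invented, and the results are only as good as those quoted measurements.

```python
# Figures quoted in the discussion above.
ENDINGS_MS = 6975       # Typos.txt total for rules starting \b([A-Za-z]
TOTAL_MS = 113889       # Typos.txt total for the Dr Who test case
ITERATIONS = 3          # iterations behind the Typos.txt totals
RUNTIME_FAST_S = 13.5   # Profiling.txt runtime on the faster PC
RUNTIME_SLOW_S = 27.0   # Profiling.txt runtime on the slower PC

# Share of profiled time spent in the endings rules.
share = ENDINGS_MS / TOTAL_MS

# Best-case saving on the slower PC if those rules took zero time.
best_case_saving_s = RUNTIME_SLOW_S * share

# Ungrouped time for one pass, and the speed-up attributed to grouping.
per_pass_s = TOTAL_MS / ITERATIONS / 1000
grouping_speedup = per_pass_s / RUNTIME_FAST_S

print(f"share of time in endings rules: {share:.1%}")
print(f"best-case saving: {best_case_saving_s:.2f} s")
print(f"ungrouped single pass: {per_pass_s:.1f} s")
print(f"grouping speed-up: {grouping_speedup:.1f}x")
```

The results line up with the "about 3x" and "6.1%" figures given above, which is why a saving measured with the Typos.txt method shrinks when it is applied to the grouped runtime.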
Thank you for an attempt to improve this. An idea(!) to speed this up. Could a new type of replace variable be added to the replace= that is only used in the edit summary but not in the actual article? If so then faster rules can be made that use a post look behind when matching to find cycle consuming pre-text for the edit summary. The find for the 'itely' rule would be "([A-Za-z][lnst])itly\b(?<=([A-Za-z]*)[A-Za-z][lnst]itly)\b" and the replace would be "%2$1itely" where %2 is the contents of $2 but %2 is not applied to the article only the edit summary. Regards, Sun Creator(talk) 22:19, 7 September 2012 (UTC)
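A rough Python sketch of the same goal: match with the cheap pattern, then recover the rest of the word only for the edit summary. It does not use the look-behind from the proposal, because Python's standard `re` module accepts only fixed-width look-behind; instead it walks left from the match. The `%2` replace variable is a proposal in this thread, not an existing AWB feature, and the function name is invented.

```python
import re

# The cheap pattern from the proposal, without the look-behind part.
FAST = re.compile(r"([A-Za-z][lnst])itly\b")


def fix_with_full_word_summary(text: str):
    """Fix '-itly' endings; build summaries that show the whole word."""
    summaries = []

    def repl(match):
        # Walk left from the match to find where the word really starts.
        start = match.start()
        while start > 0 and text[start - 1].isascii() and text[start - 1].isalpha():
            start -= 1
        prefix = text[start:match.start()]
        fixed = match.group(1) + "itely"
        # The prefix is used for the summary only, never for the article.
        summaries.append(f"{prefix}{match.group(0)} => {prefix}{fixed}")
        return fixed

    return FAST.sub(repl, text), summaries


print(fix_with_full_word_summary("He spoke impolitly."))
```

The extra work happens only for text that actually matches, which is the property the proposal is after.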
The combining of 20 or so together appears very clever, especially varying it according to the rules. How does it know what to replace with? You could manually combine many rules if the replace restriction was removed, not that manually combining would necessarily be any better than the automatic way. Regards, Sun Creator(talk) 11:12, 7 September 2012 (UTC)

"Involved" with "Revolved" added

I'm not convinced that this is a good idea. There were only 4 cases of "revovled/es/ing" (one of which was in a title that Typos wouldn't fix), and I corrected them. Not worth the extra cycles for so few hits. Same thing for the upper-case "Invovled"; vanishingly rare, except in a couple of titles, where Typo rules won't touch them. Chris the speller yack 20:24, 7 September 2012 (UTC)

Plaque

The plaque rule just changed placque to plaque but http://www.merriam-webster.com/medical/placque says placque is a variant. That didn't seem a productive rule so it's disabled. Regards, Sun Creator(talk) 16:19, 8 September 2012 (UTC)

Is "placque" just a variant in medical applications (e.g. dental plaque) or in all variations (e.g. commemorative plaque)? Thanks! GoingBatty (talk) 16:42, 8 September 2012 (UTC)
All variants unless this is a typo. Regards, Sun Creator(talk) 16:55, 8 September 2012 (UTC)

Avoid match when part of scientific name

In the article 2012 in paleontology is written: The type species is Bicentenaria argentina. The Regex rule is find="\bargentin(a|e(an)?s?)\b(?!'')" and it avoids matching in the Regex tester, yet when AWB looks at the article it matches and wants to capitalise argentina to Argentina. Can someone look and see what the problem is in this case? Regards, Sun Creator(talk) 10:35, 14 September 2012 (UTC)

I tweaked the rule, changing the order of \b and the negative lookahead, just on a hunch, but no help. This looks like a bug, as it works fine as an F&R rule. Chris the speller yack 13:22, 14 September 2012 (UTC)
Italic text should automatically be hidden from typo fixing, so not sure what's happened here. I'll investigate later. Rjwilmsi 14:15, 14 September 2012 (UTC)
Thank you. I added backslashes before each single quote, for "\bargentin(a|e(an)?s?)\b(?!\'\')" as that works in Regex and worked in the typo rule also. Worth knowing that '' does not work as it appears. Regards, Sun Creator(talk) 14:17, 14 September 2012 (UTC)
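The intended effect of the negative lookahead can be checked outside AWB. The following is a minimal sketch using Python's re module rather than the .NET engine AWB runs on, so it says nothing about the AWB behaviour discussed above. The replacement string is an assumption (the thread quotes only the find pattern), and both sample sentences are made up for illustration.

```python
import re

# Find pattern as quoted in the thread: the trailing (?!'') rejects a match
# that is immediately followed by two apostrophes (wiki italic markup).
FIND = re.compile(r"\bargentin(a|e(an)?s?)\b(?!'')")


def fix_argentina(text: str) -> str:
    # Assumed replacement: capitalise the first letter and keep the ending.
    return FIND.sub(r"Argentin\1", text)


# Illustrative inputs (not taken from any article revision).
italic = "The type species is ''Bicentenaria argentina''."
plain = "The fossil was found in argentina in 2012."

print(fix_argentina(italic))
print(fix_argentina(plain))
```

In this sketch the italicised species name is left alone because the closing '' follows the word directly, while the plain-text occurrence is capitalised.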

Regex testing

You can use the AWB find and replace to test new typo rules. I just found this out and feeling like a n00bie, so sharing in case others might not know of this excellent way of testing new rules. Regards, SunCreator (talk) 12:55, 8 August 2012 (UTC)

You can also use the AWB Regex tester. GoingBatty (talk) 14:48, 15 September 2012 (UTC)

Capitalisation of Jain, Jains and Jainism

This seems a good possibility. Main article Jainism. Regards, Sun Creator(talk) 23:57, 15 September 2012 (UTC)

Rule suggestion: "self employed" => "self-employed"

Confirmed with Merriam-Webster as well as NYT, LAT, BBC, and Guardian that the latter is the correct usage. I've run about a hundred of these in AWB with only one false positive so far, the unusual phrasing "he found himself employed by... " Khazar2 (talk) 19:15, 4 September 2012 (UTC)

Many other words beginning with "self-" are listed at http://www.onelook.com/?w=self-*&ls=a.
Wavelength (talk) 19:27, 4 September 2012 (UTC)
I have been running this F&R for months, and see very few false positives:
find:
\b([Ss])elf(?<!inner self) (\w+)(ed|ing)\b
replace:
$1elf-$2$3
Something with features like this and/or a list like Wavelength's suggestion might be a good starting point. Chris the speller yack 23:07, 4 September 2012 (UTC)
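For anyone who wants to try the find-and-replace pair above outside AWB, here is a minimal sketch in Python. It assumes Python's re module behaves like the .NET engine for this particular pattern (the lookbehind is fixed-width, so both accept it); the sample sentences are invented.

```python
import re

# Find pattern from the comment above. (?<!inner self) is checked right after
# "elf", so "inner self" followed by an -ed/-ing word is left untouched.
SELF_RULE = re.compile(r"\b([Ss])elf(?<!inner self) (\w+)(ed|ing)\b")


def hyphenate_self(text: str) -> str:
    # .NET "$1elf-$2$3" becomes r"\1elf-\2\3" in Python syntax.
    return SELF_RULE.sub(r"\1elf-\2\3", text)


for sample in (
    "She was self employed for years.",
    "It is a self regulating body.",
    "His inner self longed for peace.",
    "He found himself employed by the firm.",
):
    print(hyphenate_self(sample))
```

The last sample is unchanged because \b requires a word boundary before "self", which "himself" does not provide.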
Both sound logical to me, though the list Wavelength links includes words that are comparatively rare ("self-aligned") along with more common ("self-abuse"). I'm not savvy enough on the programming side to know whether it's worth winnowing that list down to only common errors. Khazar2 (talk) 23:29, 4 September 2012 (UTC)
In my copy of The New Merriam-Webster Pocket Dictionary (1965), the main entry "self-" is followed by this list of 96 derivative entries (which I have divided into nine groups of 10 words each, and one group of 6 words):
  • self-abasement, self-accusation, self-acting, self-addressed, self-adjusting, self-administered, self-advancement, self-analysis, self-appointed, self-assertion
  • self-assertive, self-assurance, self-assured, self-awareness, self-betrayal, self-closing, self-command, self-complacent, self-conceit, self-concerned
  • self-condemned, self-confidence, self-confident, self-constituted, self-contradiction, self-control, self-created, self-criticism, self-cultivation, self-deception
  • self-defeating, self-defense, self-delusion, self-denial, self-denying, self-destruction, self-determination, self-discipline, self-distrust, self-doubt
  • self-driven, self-educated, self-employed, self-employment, self-esteem, self-evident, self-examination, self-explaining, self-explanatory, self-expression
  • self-forgetful, self-giving, self-governing, self-government, self-help, self-importance, self-important, self-imposed, self-improvement, self-induced
  • self-indulgence, self-inflicted, self-interest, self-limiting, self-love, self-luminous, self-mastery, self-perpetuating, self-pity, self-portrait
  • self-possessed, self-possession, self-preservation, self-propelled, self-propelling, self-protection, self-regard, self-regulating, self-reliance, self-reliant
  • self-reproach, self-respect, self-respecting, self-restraint, self-sacrifice, self-satisfaction, self-satisfied, self-seeking, self-service, self-starting
  • self-sufficiency, self-sufficient, self-supporting, self-sustaining, self-taught, self-will
Next, the main entry "self-centered" is followed by the derivative entry "self-centeredness".
Next, the main entry "self-conscious" is followed by the derivative entries "self-consciously" and "self-consciousness".
Next, the main entry "self-contained" is without a derivative entry.
Next, the main entry "self-effacing" is without a derivative entry.
The next two main entries are unhyphenated.
Next, the main entry "self-made" is without a derivative entry.
Next, the main entry "self-righteous" is without a derivative entry.
Altogether, this part of the dictionary has 105 words beginning with the combining form "self-".
Wavelength (talk) 16:38, 5 September 2012 (UTC)

I did a Wikipedia search for ~"self complacent", and only hit 7 pages, all of them correctly hyphenated. This exercise, repeated 95 times, will probably indicate which ones are worthwhile candidates for inclusion in a Typo rule. A search of ~"self inflicted" came up with 913 pages; a pre-parse AWB run changing "elf inflicted" to "elf-inflicted" would be a fairly quick way to count how many of the 913 are not hyphenated. If any search finds a couple dozen unhyphenated cases, it should probably be included. If this makes sense, I'll search a group of 10 words, and someone can volunteer for another group or two. Chris the speller yack 18:08, 5 September 2012 (UTC)

Sounds good to me. Since I'm already doing self-employed, I'll start with this row and will check back in in 24-48 hours:
  • self-driven, self-educated, self-employed, self-employment, self-esteem, self-evident, self-examination, self-explaining, self-explanatory, self-expression Khazar2 (talk) 18:17, 5 September 2012 (UTC)
Ok, first set of results. The number in parentheses is the total corrections I made, followed by a plus sign if I didn't finish the list: self-driven (3 corrections), -educated (26), -education (14), -employment (40-50), -employed (200+), -esteem (50+), -evident (30+), -explaining (1), -examination (40-50), -explanatory (50-60), -expression (30+).
I'll work on self-forgetful through -induced next. Khazar2 (talk) 15:42, 6 September 2012 (UTC)
Once you've gone 4 letters in with "self" the cycle time usage will be negligible, so you may as well add all the "self-" variants you want. Regards, Sun Creator(talk) 01:12, 9 September 2012 (UTC)
In that case, how should we proceed from here? Include the full list above, or check each first to make sure none creates some large number of unforeseen false positives? I'll continue correcting words from the list individually for the time being. Khazar2 (talk) 18:44, 11 September 2012 (UTC)
Worked through another set this week, which I'll just mark yes or no this time for whether or not it is worth including (using a threshold of ~30 results): self-forgetful (no), self-giving (no), self-governing (yes), self-government (no), self-help (yes), self-importance (no), self-important (no), self-imposed (yes), self-improvement (yes), self-induced (yes).
Would anyone watching this thread like to include some or all of this list? I'm also still curious to hear what other editors think of the possibility per Sun Creator that we simply add the full list to AWB. -- Khazar2 (talk) 16:59, 18 September 2012 (UTC)
Each individual one doesn't need to be occurring 25 times, once is fine as the self- rule in total will exceed 25 easily. What is important is that no known false positives happen. Regards, Sun Creator(talk) 17:18, 18 September 2012 (UTC)
I'll keep running them individually then to test; meanwhile, all twenty from those lines can safely be added. Khazar2 (talk) 17:56, 18 September 2012 (UTC)

Oxford has many more than the Merriam-Webster Pocket Dictionary of 1965, without going into variant words. See full list here. Regards, Sun Creator(talk) 18:17, 18 September 2012 (UTC)

I posted the words from The New Merriam-Webster Pocket Dictionary (1965) after I posted a link to a list at http://www.onelook.com/?w=self-*&ls=a and after Khazar2 suggested "winnowing that list down to only common errors".
Wavelength (talk) 18:35, 18 September 2012 (UTC)
Well, I can check over the Webster's list, but I've noticed that even some from that list only had 20-30 total occurrences in the correct form in WP; I'll probably pass on checking the OED ones, too. Obviously I've no objections if you want to test and include a longer list, though, Sun. Khazar2 (talk) 20:05, 18 September 2012 (UTC)
Adding Oxford(OED) because it is usually considered the most authoritative source. I notice that OED does not have self-explaining or self-giving. Would it be okay to remove those? Regards, Sun Creator(talk) 20:28, 18 September 2012 (UTC)
That's fine; each had almost no results anyway. Khazar2 (talk) 21:10, 18 September 2012 (UTC)
Complete OED list in one rule
<Typo word="Self-" find="\b([Ss])elf\s+(?<=\s[Ss]elf\s+)\b(abandon|abandoned|abandonment|abasement|abnegating|abnegation|absorbed|absorption|abuse|accusation|accusatory|acting|activating|actualisation|actualization|addressed|adhesive|adjusting|adjustment|admiration|admirer|admiring|advancement|advertiser|adverting|advertisement|advocacy|affirmation|aggrandisement|aggrandizement|aggrandizing|alienation|aligning|analysis|analysing|analyzing|annihilation|appointed|approbation|approval|approving|approvingly|assemble|assembly|asserting|assertion|assertive|assertiveness|assessment|assurance|assured|assuredly|aware|awareness|balancing|betrayal|build|builder|canceling|cancelling|catering|censorship|centered|centeredly|centeredness|centred|centredly|centredness|certificate|certification|certify|cleaning|closing|cocking|colored|coloured|command|compatible|conceit|conceited|concept|condemnation|condemned|condemning|confessed|confessedly|confession|confessional|confidence|confident|confidently|congratulation|congratulatory|conscious|consciously|consciousness|consistent|consistency|contained|containment|contempt|contemptuous|contradicting|contradiction|contradictory|control|controlled|correct|correcting|correction|created|creating|creation|critical|criticism|deceit|deceiver|deceiving|deceptive|deception|defeating|defence|defensive|defense|defensive|definition|delight|delusion|denial|denying|dependence|deprecating|deprecatingly|deprecation|deprecatory|depreciation|depreciating|depreciatory|despair|destroying|destruct|destruction|destructive|destructively|determination|development|devotion|diagnose|diffusion|directed|direction|directing|discipline|disciplined|discovery|disgust|doubt|dramatisation|dramatization|drive|educated|education|effacing|effacement|effacingly|employed|employment|enclosed|esteem|evaluation|evident|evidence|evidently|examination|excited|executing|existent|explanatory|expression|expressive|faced|feeder|feeding|fertile|fertility|fertilisation|fertilization|fertilized|fertilizing|financed|financing|flagellation|flattering|flattery|forgetful|forgetfulness|fulfilling|fulfillment|fulfilment|generating|glorification|governed|governing|government|gravitation|guided|harm|harmer|hate|hatred|help|identification|identity|image|immolation|importance|important|importantly|imposed|improvement|incompatible|incompatibility|induced|inductance|induction|inductive|indulgence|indulgent|indulgently|inflicted|insurance|interest|interested|involved|involvement|justification|justificatory|justifying|knowledge|knowing|limiting|liquidating|loader|loading|loathing|locking|love|loving|made|management|managing|mastery|medicate|medication|mockery|mocking|mockingly|mortification|motion|moving|motivated|motivating|motivation|murder|murderer|mutilation|neglect|observation|obsessed|obsession|opinion|opinionated|parodic|parody|parodying|perpetuating|perpetuation|pity|pitying|pityingly|policing|pollinated|pollinating|pollination|pollinator|portrait|portraiture|possessed|possession|praise|preservation|proclaimed|promoter|promoting|promotion|propagating|propagation|propelled|propelling|protection|protective|proving|publish|published|publisher|raising\sflour|rating|realisation|realization|reference|referential|referentiality|referentially|reflection|reflective|reflexive|regard|regarding|regulating|regulation|regulatory|reliance|reliant|reliantly|renewal|renewing|renunciation|report|reproach|reproachful|respect|respecting|restraint|restrained|revealing|revelation|revelatory|righteous|righteously|righteousness|righting|rising\sflour|rule|sacrifice|sacrificial|sacrificing|satisfaction|satisfied|sealing|seed|seeder|seeker|seeking|select|selecting|selection|service|serving|shifter|similar|similarity|slaughter|sow|sown|starter|starting|sterile|sterility|stick|stimulation|storage|styled|subsistent|sufficiency|sufficient|sufficiently|suggestion|supporting|support|surrender|sustaining|sustained|system|tailing|tanner|tanning|tapping|taught|timer|titled|torture|tracking|transcendence|understanding|will|willed|winding|worth)\b" replace="$1elf-$2" />
While you might think the above will be slow because it deals with 398 words, it actually executes very fast and is quicker than the (In)significant rule. Regards, Sun Creator(talk) 00:38, 19 September 2012 (UTC)
I could not think of a reason not to make the 398 OED words live so I've updated the rule to include them. Regards, Sun Creator(talk) 02:19, 19 September 2012 (UTC)
An unexpected false positive in Self_(band) "Self made available the track ...", not self-made. Hopefully not many crazy matches like that about. Regards, Sun Creator(talk) 03:33, 19 September 2012 (UTC)
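The "Self made available" case is easy to reproduce with a cut-down version of the rule. This is only a sketch: the word list is a small hand-picked subset of the 398, and the variable-width lookbehind from the live rule is dropped because Python's re module does not accept it, so the pattern below is an adaptation and not the rule AWB runs.

```python
import re

# Hand-picked subset of the word list, for illustration only.
WORDS = ["control", "employed", "esteem", "insurance", "made", "timer"]

# Same overall shape as the live rule, minus its (?<=\s[Ss]elf\s+) lookbehind.
SELF_RULE = re.compile(r"\b([Ss])elf\s+(" + "|".join(WORDS) + r")\b")


def fix_self(text: str) -> str:
    # The whitespace between "self" and the word is replaced by a hyphen.
    return SELF_RULE.sub(r"\1elf-\2", text)


print(fix_self("He was a self made man."))
print(fix_self("The band Self made available the track."))
```

The second line shows the false positive: the band name followed by "made" is indistinguishable from the compound as far as the pattern is concerned.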
We have articles for Self timer, Self insurance and Self control. It seems logical that these article titles should be hyphenated. The lack of sourcing found in the articles points in that direction also. Regards, Sun Creator(talk) 03:37, 19 September 2012 (UTC)
I moved two of these articles: Self-timer and Self-control. Sun Creator moved Self-insurance. Chris the speller yack 13:33, 19 September 2012 (UTC)

Candidate rules for removal

While randomly looking at the rules:

  • <Typo word="His" find="\bIhs\b" replace="His" />
Just 11 matches in my current database dump, and all are false positives in web addresses, image names and such like. I'm sure this one can go. -- John of Reading (talk) 07:49, 17 September 2012 (UTC)

Proposal for a fuzzy rule

I have investigated a very "fuzzy" rule to replace the current "Individual" rule. The old rule effectively fixes two misspellings (idividual and indvidual) with two more possible endings (individuals and individually). The new rule fixes about a thousand possible misspellings (though many of these would be double or triple typos, and so would be very unlikely) and any number of endings, such as "individualistically", "individualized" and "individualism". The new rule has actually caught some words with double typos (e.g. "indvidiuals") and doubled syllables (e.g. "individidual"). I have used it to fix almost 100 articles with misspellings that the old rule doesn't find.

<Typo word="Individual*" find="\b([Ii])n?i?di?a?vi?a?d?i?d?u?a?la?(?<!ndividual)([a-z]*)" replace="$1ndividual$2"/>

It finds any word of the form "i__d__v__l__" that has other letters from "individual" in just about any order.
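A quick way to see what the fuzzy pattern does is to run it over a few of the misspellings mentioned above. The sketch below uses Python's re module, on the assumption that it treats this pattern the same way as the .NET engine (the lookbehind is fixed-width); the test words other than those quoted above are made up.

```python
import re

# Pattern from the proposed rule. (?<!ndividual) rejects the correctly
# spelled stem, so only misspellings are rewritten.
INDIVIDUAL = re.compile(
    r"\b([Ii])n?i?di?a?vi?a?d?i?d?u?a?la?(?<!ndividual)([a-z]*)"
)


def fix_individual(text: str) -> str:
    # .NET "$1ndividual$2" becomes r"\1ndividual\2" in Python syntax.
    return INDIVIDUAL.sub(r"\1ndividual\2", text)


for word in ("idividual", "indvidual", "indvidiuals", "individidual",
             "individualism"):
    print(word, "->", fix_individual(word))
```

Correctly spelled words such as "individualism" pass through unchanged, because the only way to reach the "l" leaves "ndividual" immediately before the lookbehind.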

Are there any problems with this approach? I think it will run about as fast as the old rule. I have not found any false positives in English text, though there may have been one or two in some Romanian text (these should have "Not a typo" templates slapped on them, anyway). I would be pleased if some of you would try it out and comment. I think this technique has promise for a number of other words that have many possible misspellings. Chris the speller yack 03:55, 20 September 2012 (UTC)

The concept is a good one for longer words where fixtures (fixed parts) like the "i__d__v__l" make it unique and exclude false positives; I used it on the 'Wiki(p/m)edia' rule recently.
The drawback is that you have to do extra false-positive and foreign-spelling research. For false positives in this example, 'Id' is short for Identification and 'val' is short for value, so you could have 'Idval' as a false positive; DVD is a product and Apple has/had an idvd that is almost a match for this rule; Idvallo and Idaville are places, and those are false positives on the above rule. Useful tools to check are Wikipedia search preview, Google, multilingual. Once a well-researched rule is written you can pick up many more typos than a normal rule.
Further refinement of the rule is possible. Things that are excluded at the beginning of the rule tend to help its speed; for that reason I would have a fixed second letter, the 'n', but add 'm' because it's adjacent on the keyboard, thus making the structure "i(n/m)d__v__l", which is more robust to false positives. Vowels are often substituted in misspellings, so I'd consider allowing any vowel as well as a missing one, so 'Induvidual' is quite plausible, with typos occurring on both Wikipedia and Google; the same goes for 'Indavidual', 'Indivadual', 'indevidual', 'indivedual', 'Indivadaul', which Google has as occurring somewhere. A check on the vowels after the 'l' does not find anything helpful: 'Individual(e/i)' are foreign, 'Individuala' could be also, and 'Individual(o/u)' doesn't occur on Google. End result:
<Typo word="Individual*" find="\b([Ii])[nm]i?d[aeiou]?v[aeiou]?d?[aeiou]?d?[aeiou]{0,3}l(?<!ndividual)([a-z]*)" replace="$1ndividual$2"/>
Regards, Sun Creator(talk) 12:47, 20 September 2012 (UTC)
The above still has a minor issue with the foreign(?) endings 'Individuale' and 'Individuali', so it is easiest to add a lookahead excluding them, plus allow for hyphenated words and a pretty edit summary for misspellings of individual's.
<Typo word="Individual*" find="\b([Ii])[nm]i?d[aeiou]?v[aeiou]?d?[aeiou]?d?[aeiou]{0,3}l(?<!ndividual)(?![aei]\b)([a-z-\']*)" replace="$1ndividual$2"/>
Regards, Sun Creator(talk) 13:41, 20 September 2012 (UTC)
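To compare the first fuzzy pattern with the refined one, both can be run over the place names and misspellings mentioned in this thread. This is a sketch in Python's re module, assuming it agrees with the .NET engine on these patterns; the words "imdividual" and "indivduale" are invented test inputs.

```python
import re

# First proposal in this thread.
FIRST = re.compile(
    r"\b([Ii])n?i?di?a?vi?a?d?i?d?u?a?la?(?<!ndividual)([a-z]*)"
)

# Refined proposal: fixed second letter [nm], any vowel allowed in the vowel
# slots, and a lookahead that skips words ending in a, e or i after the "l".
REFINED = re.compile(
    r"\b([Ii])[nm]i?d[aeiou]?v[aeiou]?d?[aeiou]?d?[aeiou]{0,3}l"
    r"(?<!ndividual)(?![aei]\b)([a-z-\']*)"
)


def apply(rule: re.Pattern, text: str) -> str:
    return rule.sub(r"\1ndividual\2", text)


for word in ("Idaville", "Idvallo", "Induvidual", "imdividual",
             "indivadual's", "indivduale"):
    print(word, "| first:", apply(FIRST, word), "| refined:", apply(REFINED, word))
```

In this sketch the place names are mangled by the first pattern but left alone by the refined one, since neither has "n" or "m" as its second letter.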
Thanks. I met you about halfway, and put it into production. It fixes a ton of misspellings, and finds very few false positives. I have excluded "Individuel", which is a fairly common misspelling, but also a French word. If a few dozen in French phrases get wrapped with "Not a typo" templates, we could remove the exclusion. Chris the speller yack 18:34, 20 September 2012 (UTC)

Request change to "-ment" rule

What's the best way to fix the "-ment" rule so it doesn't try to fix the surname "Segman"? (e.g. Try Chirplet transform) Thanks! GoingBatty (talk) 01:20, 24 September 2012 (UTC)

Use lookbehind to exclude this one case, it's very fast. So add the following to the end of the rule: (?<!Segman) Regards, Sun Creator(talk) 02:45, 24 September 2012 (UTC)
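The effect of appending such a lookbehind can be sketched as follows. The actual "-ment" rule is not quoted in this thread, so the base pattern below is a hypothetical stand-in that merely produces the same kind of false positive; only the trailing (?<!Segman) comes from the suggestion above.

```python
import re

# Hypothetical stand-in for the "-ment" rule: rewrites "...gman" to "...gment".
WITHOUT_EXCLUSION = re.compile(r"\b([A-Za-z]*g)man\b")

# Same stand-in with the suggested exclusion appended at the end of the rule.
WITH_EXCLUSION = re.compile(r"\b([A-Za-z]*g)man\b(?<!Segman)")

text = "The method of Segman splits each segman of the signal."

print(WITHOUT_EXCLUSION.sub(r"\1ment", text))
print(WITH_EXCLUSION.sub(r"\1ment", text))
```

Because the lookbehind is case-sensitive, the capitalised surname is skipped while the lower-case misspelling is still corrected.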
  Done - Thanks! GoingBatty (talk) 01:07, 25 September 2012 (UTC)

Decade apostrophe

The recent change brings to end most possibilities with this rule. What was the issue before? Example would be nice. 01:25, 25 September 2012 (UTC) — Preceding unsigned comment added by Sun Creator (talkcontribs)

False positives were the issue before, such as on these: http://en.wikipedia.org/w/index.php?search=%222010%27s+iron+man%22&title=Special%3ASearch -- JHunterJ (talk) 01:37, 25 September 2012 (UTC)

Consider new rule for "the so called" -> "the so-called"

I think we could have a new typo rule for "the so called" -> "the so-called", but would value input/investigation by others. Thanks Rjwilmsi 08:48, 22 September 2012 (UTC)

  Added without "the" -GoingBatty (talk) 22:42, 23 September 2012 (UTC)
But now that I've added it, I'm having second thoughts. Should it fix both "the so called" and "it is so called because"? GoingBatty (talk) 23:12, 23 September 2012 (UTC)
No, just the former case. See www.macmillandictionary.com, which is head and shoulders above the other dictionaries for hyphenation; it says "adjective [only before noun]". Chris the speller yack 23:56, 23 September 2012 (UTC)
Then I'm glad I changed the rule to only fix "a so called", "the so called", and "these so called". Thanks! GoingBatty (talk) 00:01, 24 September 2012 (UTC)
Why do we need a regex lookbehind for this rule? Rjwilmsi 17:35, 25 September 2012 (UTC)
Are you asking whether its function can be accomplished without the lookbehind? Yes, but it would be slower to look for "A|a|by|of|The|the|These|these|Her|her|Their|their|This|this|His|his" first and "so called" second, or so we believe. Or are you asking whether we can skip the lookbehind and just hyphenate all cases of "so called"? The comment after the rule explains one type of sentence where it should not be hyphenated, as in that case it is not an adjective that precedes a noun. Chris the speller yack 17:54, 25 September 2012 (UTC)

On my current computer using the article List of Doctor Who universe creatures and aliens

Without a lookbehind:

<Typo word="so-called_" find="\b([Aa]|by|of|[Tt]he|[Tt]hese|[Hh]er|[Tt]heir|[Tt]his|[Hh]is)\sso\scalled\b" replace="$1 so-called" />
time taken for the above rule according to AWB Regex tester is 89ms

With a lookbehind:

<Typo word="so-called_" find="\bso\scalled\b(?<=\b([Aa]|by|of|[Tt]he|[Tt]hese|[Hh]er|[Tt]heir|[Tt]his|[Hh]is)\sso\scalled)" replace="so-called" />
time taken for this according to AWB Regex tester is 1ms

An order of magnitude different. Regards, Sun Creator(talk) 18:39, 25 September 2012 (UTC)
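The two forms of the rule can be checked for equivalent output with a small sketch. Python's re module does not accept the variable-width lookbehind used in the second rule, so the version below is an adaptation that gives each preceding word its own fixed-width lookbehind; it does not attempt to reproduce the AWB timings, which depend on the .NET engine and the article text.

```python
import re

# Preceding words taken from the two rules above.
PRECEDERS = ["[Aa]", "by", "of", "[Tt]he", "[Tt]hese", "[Hh]er",
             "[Tt]heir", "[Tt]his", "[Hh]is"]

# Form 1: match the preceding word first, then "so called".
WITHOUT_LOOKBEHIND = re.compile(
    r"\b(" + "|".join(PRECEDERS) + r")\sso\scalled\b"
)

# Form 2 (adapted): anchor on "so called" and verify the preceding word
# with one fixed-width lookbehind per alternative.
LOOKBEHINDS = "|".join(rf"(?<=\b{p}\s)" for p in PRECEDERS)
WITH_LOOKBEHIND = re.compile(rf"(?:{LOOKBEHINDS})so\scalled\b")


def fix_without(text: str) -> str:
    return WITHOUT_LOOKBEHIND.sub(r"\1 so-called", text)


def fix_with(text: str) -> str:
    return WITH_LOOKBEHIND.sub("so-called", text)


for sample in ("the so called expert", "These so called rules",
               "It is so called because of its shape."):
    print(fix_without(sample), "|", fix_with(sample))
```

Both forms hyphenate the adjective use and leave "it is so called because" alone, which is the behaviour agreed earlier in this thread.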

Percentage

Why does AWB remove spaces before the "%" sign ? Thanks, — Racconish Tk 19:42, 23 September 2012 (UTC)

Because this is the recommendation at MOS:PERCENT. -- John of Reading (talk) 20:01, 23 September 2012 (UTC)
Removing spaces before the "%" sign is one of AWB's new general fixes, not a typo fix. GoingBatty (talk) 21:31, 23 September 2012 (UTC)
Belated thanks. — Racconish Tk 06:10, 1 October 2012 (UTC)

Jewelry/Jewellery

Jewellery is the British version, Jewelry is the American version but how do we know what to do in a typo rule? Regards, Sun Creator(talk) 00:15, 9 October 2012 (UTC)

Warwick Tarboton

Typo fixing suggests changing his surname to Tributon. False positive. Thanks. --Breno talk 06:25, 9 October 2012 (UTC)

The rule was changed 3 days ago to prevent this. You only need to "Refresh status/typos" from the File menu (Ctrl+R). Chris the speller yack 12:55, 9 October 2012 (UTC)
Thanks! I just learned a new AWB tip :) --Breno talk 09:07, 12 October 2012 (UTC)

Couple of proposals

They (talk) 03:46, 24 October 2012 (UTC)

Nice ideas. Previous discussion has put a requirement of 25 occurrences before it's worth a rule, due to each new rule slowing down the checking. In each of these cases it appears fewer than 25 are available after the false positives of initial are removed. On a related note, I've started some work on an improved -ably and -ally rule but they are far from finished; see User:Sun Creator/-ally and User:Sun Creator/-ably. Regards, Sun Creator(talk) 04:22, 24 October 2012 (UTC)
I noticed there is already a character rule. That rule could be amended to fix the above typo but in reading of the rule it appears the rule would already fix it. Regards, Sun Creator(talk) 04:31, 24 October 2012 (UTC)
"Intially" already has a Typo rule; the cases you see are on pages that have not had AWB Typos run on them lately. I haven't checked the other two suggestions yet. Chris the speller yack 04:34, 24 October 2012 (UTC)
Same story with "Charecter"; the existing rule works. I modified the "Notably" rule to also fix 'notabily'. Chris the speller yack 16:22, 24 October 2012 (UTC)
So, in conclusion: more people are required to clean the typos that existing rules already fix. I noted a week or so back that for the 'a to an' rule alone there are 17,000 plus articles with the grammar/typo. Regards, Sun Creator(talk) 19:46, 24 October 2012 (UTC)

Thanks. OK, a new proposal: retireved → retrieved. I just did 74 of those. They (talk) 11:34, 1 November 2012 (UTC)

Now added to retrieve rule. Regards, Sun Creator(talk) 14:28, 1 November 2012 (UTC)
Thanks! They (talk) 06:19, 2 November 2012 (UTC)

Articles with typos

Many articles with existing typos. You can add them to AWB by using "Source: Links on page (blue links only)". Regards, Sun Creator(talk) 20:20, 24 October 2012 (UTC)

We're keeping track of who is running what list at Wikipedia talk:WikiProject TypoScan#Articles with typos. Join in the typo fixing fun today! :) --Breno talk 11:28, 3 November 2012 (UTC)

Broken regex

Incorrectly changing devided → dedivi$2. Thanks :) --Breno talk 10:37, 3 November 2012 (UTC)

I think I've fixed it. -- John of Reading (talk) 10:47, 3 November 2012 (UTC)

Potential Lua-based Template:CheckMOS to spot errors

Editors in this project may wish to see User talk:Jimbo Wales/Archive 119#Potential Lua-based Template:CheckMOS to spot errors (version of 15:44, 9 November 2012).
Wavelength (talk) 17:03, 9 November 2012 (UTC)

Wikipedia:Database reports/Linked misspellings

Wikipedia:Database reports/Linked misspellings is a new report at Wikipedia:Database reports.
Wavelength (talk) 16:32, 6 December 2012 (UTC)

Interesting! The report would be more useful, though, if it only included unpiped links where the wrong spelling is visible to the reader - like the three articles which link to Tennesseee. Both articles which link to Natural satelite do so through a piped link, so those don't need to be fixed. -- John of Reading (talk) 17:13, 6 December 2012 (UTC)
It's not completely clear whether fixing "piped typos" is inappropriate. There's further discussion here. --MZMcBride (talk) 19:54, 12 December 2012 (UTC)

Request for change to "A to An"

Could someone please change the "A to An" rule so it won't try to change "a i7 chord" as in "Penny Lane"? (See the edit summary in this edit) Thanks! GoingBatty (talk) 15:44, 16 December 2012 (UTC)

The change seems fine. Vowel sounds have "an" before them, and 'i' is pronounced as a vowel sound. Why would it not be an i7, an i6, an i8, an iPhone, or an iPad? Regards, Sun Creator(talk) 22:24, 16 December 2012 (UTC)
I'm interpreting the edit summary I linked above to mean that "i7" isn't pronounced "eye-seven". GoingBatty (talk) 22:35, 16 December 2012 (UTC)
Right, a web search indicates that it's pronounced as "one-seven" (or wuhn-sehvehn).   Fixed with this edit -- JHunterJ (talk) 22:45, 16 December 2012 (UTC)
So use of Roman numerals in a chord. Regards, Sun Creator(talk) 23:01, 16 December 2012 (UTC)
Thanks to both of you! GoingBatty (talk) 00:19, 17 December 2012 (UTC)

One typo to another?

Implicted to implicited?? Both are mistakes since implicit is an adjective and not a verb. Implicate or imply might be workable replacements. Thank you. Mr T(Talk?) (New thread?) 14:49, 11 December 2012 (UTC)

  Fixed with these edits. -- JHunterJ (talk) 13:55, 17 December 2012 (UTC)

Request for change to "Sculpt(ure/or)"

Could someone please change the new "Sculpt(ure/or)" rule so it doesn't try to fix "Scutts" in Mick Jagger? Thanks! GoingBatty (talk) 15:35, 16 December 2012 (UTC)

  Fixed with this edit. -- JHunterJ (talk) 22:49, 16 December 2012 (UTC)
Thank you! GoingBatty (talk) 00:18, 17 December 2012 (UTC)
I also edited the rule to avoid "scuttle" and derivatives. Rjwilmsi 08:20, 18 December 2012 (UTC)

Interaction between typo-fixing and the names of linked images

I've been asked to start a new discussion here, following on from a stalled discussion on this page four months ago and a stalled discussion on the "Bugs" page.

Let's suppose that the word foo is almost always an incorrect spelling of fob, and that someone has added a typo-fixing rule to do the fix.

Current behaviour

If an article contains a link [[foo]] or [[abc foo def]], then the typo-fixer has been coded to turn off that rule for the entire article. It won't change foo to fob in the body text of the article. Sometimes this makes sense, as the existence of a [[foo]] link is a sign that the word has a special meaning when used in the article. Sometimes this merely means that a typo goes unfixed - the link is itself a typo, possibly an unintended red link or a link to something in Category:Redirects from misspellings. As I understand it, AWB cannot efficiently discover whether each link is a red or blue link, an article or a redirect.

It seems very difficult to assess whether this is the best behaviour. It takes a sharp-eyed AWB user to notice that a foo has been left unchanged. Questions, then:

  • Does anyone have a feel for how many correct fixes are missed because of this behaviour?
  • Does anyone have a feel for how many incorrect fixes are avoided because of this behaviour?

Proposed change

I propose that a link to an image should never cause a typo-fixing rule to be turned off. Images are frequently uploaded with non-English names, or with typos in their names, or with names that don't conform to our picky hyphenation or capitalisation rules; and since image names aren't displayed in the articles there is no great incentive to rename them. I think, then, that the existence of a link to [[File:abc foo def.jpg]] should not be taken as a sign that the word foo has a special meaning in the article, and that the "foo to fob" rule should be allowed to run normally.

One concrete example to put this in perspective: the use of the image File:First 3 egyptian pilots.jpg in the article Egyptian Air Force should not be taken as a sign that it is OK to spell "egyptian" with a lowercase "e" within that article. -- John of Reading (talk) 12:52, 5 December 2012 (UTC)

Support – 100%. Chris the speller yack 15:58, 5 December 2012 (UTC)
Support - If it's an important enough word, it should also appear in the body of the article. GoingBatty (talk) 18:03, 5 December 2012 (UTC)
Support - Image names are meaningless as those on commons can be named in another language. Regards, Sun Creator(talk) 18:40, 7 December 2012 (UTC)
Support - As far as Portuguese is concerned, names in images are misspelled most of the time (usually people do not use diacritics), and the misplacement of diacritics is one of the most common errors in pt.wikipedia. As AWB is to be operated by people (not bots), I can think of two possible solutions:
  • Keep the feature (external links disable a Typo rule) but giving an Alert similar to 'sic tag/template - "Contains matching external links"
  • Remove the feature (external links do not disable a Typo rule) but giving an Alert as above --JotaCartas (talk) 00:57, 12 December 2012 (UTC)

Second proposed change

There is a related bug report here which I have verified in my sandbox - a typo-fixing rule is currently disabled if it matches an interwiki link. This looks dubious, as these links are in arbitrary non-English languages most of the time. I propose that an interwiki link should never cause a typo-fixing rule to be turned off. -- John of Reading (talk) 22:43, 7 December 2012 (UTC)

Some doubts - As above, as far as Portuguese is concerned, most names in interwikis to foreign languages do not use diacritics, even if the name is Portuguese. On the other hand, if the article is about a foreign subject, the existence of the interwiki may prevent a false-positive correction. A 'balance' must be struck, but I have no data on which to base an opinion.

rev 8834: the restriction that a typo rule is not applied if it matches a link target will now apply only to wikilinks, not image/interwiki/category links. Rjwilmsi 19:17, 21 December 2012 (UTC)
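The behaviour described in this thread (a rule is switched off for the whole article when it matches a wikilink target, but image, interwiki and category links no longer count) can be sketched roughly as below. This is not AWB's code: the link regex, the prefix list, the interwiki test and the function name are all assumptions made for illustration.

```python
import re

# [[target]] or [[target|label...]]; group 1 is the link target.
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

# Link types that are ignored (assumed prefix list).
IGNORED_PREFIXES = ("file:", "image:", "category:")

# Assumed interwiki test: a short lower-case language code and a colon.
INTERWIKI_RE = re.compile(r"[a-z]{2,3}:")


def rule_is_suppressed(text: str, rule: re.Pattern) -> bool:
    """Return True if any plain wikilink target in text matches the rule."""
    for link in LINK_RE.finditer(text):
        target = link.group(1).strip()
        if target.lower().startswith(IGNORED_PREFIXES):
            continue  # image and category links no longer disable the rule
        if INTERWIKI_RE.match(target):
            continue  # neither do interwiki links
        if rule.search(target):
            return True
    return False


foo_rule = re.compile(r"\bfoo\b")
print(rule_is_suppressed("See [[abc foo def]] for the foo.", foo_rule))
print(rule_is_suppressed("[[File:abc foo def.jpg|thumb]] A foo.", foo_rule))
print(rule_is_suppressed("A foo. [[de:foo]]", foo_rule))
```

Only the first call reports the rule as suppressed; in the other two the "foo to fob" rule would be free to run on the body text.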

"full time" again

I undid my edit at 1960 Norwegian Football Cup and went to look at the rule to see how it could be tweaked. To my surprise, it's a very simple rule that does not try to distinguish "full time" used as a noun from "full-time" used as an adjective or adverb. My printed Concise Oxford makes this distinction, as does Collins (Onelook.com).

I've read this archived discussion but I'm not convinced. At Association football#Duration and tie-breaking methods there is the statement "The end of the match is known as full-time", but the stated reference for this only mentions "half-time" - and the COD and Collins both agree that half-time should always be hyphenated. -- John of Reading (talk) 09:58, 20 December 2012 (UTC)

en-au use of nee vs née

I have received a complaint from an Australian editor that "the Macquarie Dictionary allows for both with nee being listed first". Is this the death knell for our Typo rule? Chris the speller yack 12:13, 20 December 2012 (UTC)

No no! Trust me: the Macquarie is not a good guide in these matters. And even if it were an accurate index of substandard Australianisms, it wouldn't matter. Please, let's continue to prefer née, which is internationally accepted and understood. (What next, phenomena and criteria are accepted as singular forms, because some dictionaries list them as such?)
Major British and dictionaries prefer née. Rationally.
NoeticaTea? 03:51, 30 December 2012 (UTC)

Macquarie Dictionary

Is there any critical discussion of the Macquarie Dictionary to back up your comment? It is not reflected in the article on the dictionary. Anyway, the bot does not appear to pick up all instances of nee. Paul foord (talk) 10:28, 30 December 2012 (UTC)

Is there a typo fixing bot? GoingBatty (talk) 12:47, 30 December 2012 (UTC)
Paul, I don't know if there is any published critical discussion that makes the exact point that I make above. I do know that Wikipedia is for an international readership, and that internationally accepted forms are therefore preferable. Macquarie is based on a third-rate US dictionary, but it bends over backward to distinguish itself and to justify itself as dinky-di Australian. For example, over two decades it gave shockingly ignorant pronunciations of many foreign terms, apparently on the ground that they can indeed be heard here and there. Corrected in recent times, mercifully. NoeticaTea? 23:50, 30 December 2012 (UTC)
Sounds like the Macquarie will never be good enough in your view - how is this different from cultural cringe? I also understood that en-au was an internationally accepted variety and accepted on Australian pages. -- Paul foord (talk) 20:59, 1 January 2013 (UTC)
Is there a dictionary for en-au you would recommend? -- 21:05, 1 January 2013 (UTC)
The one I use is the spellchecker for the browser. Using English (Australian, American), nee is not flagged while née is flagged as misspelled, but using English (British), neither is flagged as misspelled. Apteva (talk) 19:49, 3 January 2013 (UTC)

False positive

This seems to be a false positive of Regex Typo fix. Please fix this. Thanks! Forgot to put name 13:27, 29 December 2012 (UTC)

There were only two cases of this particular misspelling in all of the millions of pages in en.wikipedia (both now fixed), and it is not a false positive, as it did not try to change a properly spelled word. It seems to me that it is not worth messing with the Typo rules to handle this extremely rare misspelling that might only pop up once every few years. The typo rules are not intended to catch every possible mistake, just somewhat common ones. As the first sentence of the Typo rules says, "These regular expressions find and fix common misspellings". At least the current rule brought the misspelling to your attention. Chris the speller yack 13:57, 29 December 2012 (UTC)

"A " or "a " → "An" or "an" when following vowel preceded by "[["

User:GoingBatty suggested asking about this here.

AWB typo fixing recognises "a" or "A" before a vowel and corrects it to "an" or "An". All good. However, it misses cases where the vowel is separated from the article by link syntax, e.g. "A [[iron...". An enhancement to recognise this would be good.

Examples at this old revision of Glossary of meteoritics.

-Arb. (talk) 15:53, 3 January 2013 (UTC)
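
To make the request above concrete, here is a minimal Python sketch of an "a" to "an" rule that tolerates link syntax. The pattern and the name `fix_article` are hypothetical and are not the actual AWB rule; the sketch also tests for vowel letters only, so words such as "unit" or "European" would need exceptions in any real rule.

```python
import re

# Hypothetical pattern (not the real AWB rule): "a"/"A", a space, an optional
# "[[" link opener, then a word that starts with a vowel letter.
A_BEFORE_VOWEL = re.compile(r"\b([Aa]) (\[\[)?(?=[aeiouAEIOU])")

def fix_article(text: str) -> str:
    """Turn "a"/"A" into "an"/"An", keeping any "[[" that follows the article."""
    return A_BEFORE_VOWEL.sub(
        lambda m: f"{m.group(1)}n {m.group(2) or ''}", text
    )

print(fix_article("A [[iron meteorite]] is"))
```

Because the optional group captures the "[[", the same rule covers both the linked and the unlinked case.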

ISBN fixing?

  • Find: ISBN-(10|13)(-|:| )
  • Replace: ISBN (with a space at the end)

This will enable MediaWiki magic. Should I add this rule? -- Magioladitis (talk) 08:29, 4 January 2013 (UTC)

Instead, I suggest ISBN-(10|13)(-|:)?\s?(\d+) → ISBN $3 - try both on Wikipedia:ISBN. GoingBatty (talk) 00:56, 9 January 2013 (UTC)
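
For comparison, the two proposals can be tried side by side in Python. This is only a sketch: the patterns are copied from the comments above, but AWB uses .NET regular expressions, so `$3` is written here as `\3`, and the function names and sample strings are invented for illustration.

```python
import re

# First proposal: replace the "ISBN-10"/"ISBN-13" prefix plus one separator.
FIRST = re.compile(r"ISBN-(10|13)(-|:| )")
# Second proposal: also require digits, and swallow an optional space.
SECOND = re.compile(r"ISBN-(10|13)(-|:)?\s?(\d+)")

def apply_first(text: str) -> str:
    return FIRST.sub("ISBN ", text)

def apply_second(text: str) -> str:
    return SECOND.sub(r"ISBN \3", text)

for sample in ("ISBN-13: 9780306406157", "ISBN-13 is a format"):
    print(repr(apply_first(sample)), repr(apply_second(sample)))
```

On these samples the first pattern leaves a double space after a colon and also rewrites prose that merely mentions "ISBN-13", while the second only fires when a number follows.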

hyphens and dashes

Instead of rules for "Wade–Giles" and "McCune–Reischauer" and more, should we have a rule that changes all "Foo-Bar" to "Foo–Bar"? Thanks! GoingBatty (talk) 00:39, 9 January 2013 (UTC)

No. For example, hyphenated names such as Stanton Macdonald-Wright and words such as blue-green should not be changed. MANdARAX  XAЯAbИAM 02:24, 9 January 2013 (UTC)

You shouldn't have to allot a lot of time for this one

I've been encountering the typo "alot" a lot. The typo rule for "Allo-" under "Beginnings" changes it to "allot", which is incorrect for every instance I've come across. In general I don't like the unencyclopedic tone of "a lot" in articles, but just for typo fixing, the typo should be changed to "a lot" rather than "allot". MANdARAX  XAЯAbИAM 03:49, 12 January 2013 (UTC)

  Done - I changed the "Allo-" rule so it won't change "alot" to "allot". Instead, the "A lot" rule will change "alot" to "a lot". Thanks! GoingBatty (talk) 15:40, 12 January 2013 (UTC)

Wilma Doesnt

What she doesn't do I don't know, but she certainly doesn't have an apostrophe in her name. She's listed in 147 articles at present. Could a regex expert make her an exception to the apostrophe rule please. Thanks.  An optimist on the run! 22:59, 19 January 2013 (UTC)

  Done. The next step will be to remove the apostrophe in each instance of "Wilma Doesn't". GoingBatty (talk) 03:55, 21 January 2013 (UTC)
Thanks - I've already done the next step.  An optimist on the run! 06:31, 21 January 2013 (UTC)
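
An exception of this kind is commonly written as a negative lookbehind. The following Python sketch shows the idea; the pattern and the name `fix_doesnt` are assumptions for illustration and not the actual AWB apostrophe rule.

```python
import re

# Hypothetical pattern: fix "doesnt"/"Doesnt" unless it directly follows "Wilma ".
DOESNT = re.compile(r"(?<!Wilma )\b([Dd])oesnt\b")

def fix_doesnt(text: str) -> str:
    return DOESNT.sub(r"\1oesn't", text)

print(fix_doesnt("He doesnt know Wilma Doesnt."))
```

The lookbehind checks the preceding text without consuming it, so the rest of the rule stays unchanged.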

Request for change to french

Can somebody modify the rule to prevent french fries from being capitalized? Thanks. sumone10154(talk) 03:46, 19 January 2013 (UTC)

Also include french fry and french fried as well. sumone10154(talk) 04:11, 19 January 2013 (UTC)
But major dictionaries (Macmillan's, Collins, Cambridge) indicate that "French fries" should be capitalized. Chris the speller yack 14:28, 19 January 2013 (UTC)
Capitalization seems consistent with Wikipedia's French fries article. GoingBatty (talk) 20:08, 19 January 2013 (UTC)
According to wiktionary, french fry is not capitalized in American usage. sumone10154(talk) 04:28, 20 January 2013 (UTC)
But Wiktionary is not a respectable or dependable dictionary. Chris the speller yack 16:49, 20 January 2013 (UTC)
Ok, how about Merriam-Webster? sumone10154(talk) 03:50, 21 January 2013 (UTC)

You're right, Webster's says it can go either way. Anyone else care to comment? We usually allow variant spellings that are mentioned in any decent dictionary. Chris the speller yack 03:45, 22 January 2013 (UTC)

See http://www.onelook.com/?w=french+fr*&ls=a.
Wavelength (talk) 04:08, 22 January 2013 (UTC)
I see that, but it just provides a list of miscapitalized phrases. I don't see how that helps us decide whether "french fries" should be changed. Chris the speller yack 05:28, 22 January 2013 (UTC)
Each phrase listed is linked to a list of links to definitions in various dictionaries. This is in regard to your comment "We usually allow variant spellings that are mentioned in any decent dictionary."
Wavelength (talk) 17:17, 22 January 2013 (UTC)

Request for change to Massachusetts

Please add Massachusetss → Massachusetts. Thanks. sumone10154(talk) 20:40, 22 January 2013 (UTC)

  Done, although someone may wish to consolidate the two "Massachusetts" rules. I am also fixing each page that contains "Massachusetss". GoingBatty (talk) 04:18, 23 January 2013 (UTC)

Nbsp

To avoid situations like this one, where the page had to be re-processed to add nbsp between the Kg and the number, the typo rules should be updated to include nbsp in the fix. (This problem just appeared because yesterday we moved typo fixing AFTER general fixes.) -- Magioladitis (talk) 12:35, 12 February 2013 (UTC)

  Done. GoingBatty (talk) 15:03, 12 February 2013 (UTC)

"Croation" to "Croatian"

I found many articles needing corrections from "Croation" to "Croatian".
Wavelength (talk) 03:31, 18 February 2013 (UTC)

  Done - someone else can merge the new rule with the other Croatia rule if needed. GoingBatty (talk) 04:27, 18 February 2013 (UTC)
  Fixed the typo in about 87 articles. GoingBatty (talk) 05:35, 18 February 2013 (UTC)

"Proprietarial"

I don't know whether this is the right place but... I noticed two users recently changing proprietorial to "proprietarial" in the same article using AWB. (These edits here and here) Is it something to do with AWB? AFAIK, "proprietarial" doesn't exist in any English variety. I only care because I watch the article in question and don't want to have to keep correcting this if AWB users are going to repeatedly make this edit. DeCausa (talk) 20:45, 18 February 2013 (UTC)

  Fixed - thanks for letting us know! GoingBatty (talk) 21:08, 18 February 2013 (UTC)

webiste

There are a hundred or so webistes; I've changed a dozen to website, but it would be nice to add this to AWB. ϢereSpielChequers 21:20, 1 March 2013 (UTC)

  Done by updating one of the two existing "Website" rules. Maybe someone can combine them into one rule. GoingBatty (talk) 23:27, 1 March 2013 (UTC)

Been

Hi! Even though English is not my native language, I found one misspelling: have/had/has/having bee or have/had/has/having bene → have/has/had/having been. Do you see any false positives? Because English is not my native language, I didn't add it; I'm only suggesting. Thanks. Matt S. (talk | cont. | cs) 14:10, 7 March 2013 (UTC)

Hi Matt - thanks for the suggestions! For the first rule, I'm concerned that there would be false positives, such as "having bee hives" (see Lake Isle of Innisfree). For the second rule, most of the matches are within quotations from hundreds of years ago. I'll manually fix the few instances of these instead of adding rules. Thanks for the suggestions! GoingBatty (talk) 04:32, 8 March 2013 (UTC)

request for change to french

I suggested earlier that french fry not be automatically capitalized, but the rule hasn't been added yet. Merriam Webster says that either upper case or lower case is acceptable. Were there any other objections? sumone10154(talk) 02:28, 8 March 2013 (UTC)

  Done - thanks for the reminder! GoingBatty (talk) 04:17, 8 March 2013 (UTC)

Didn't change 1990's -> 1990s

This edit changed 1980's -> 1980s, however it didn't change 1990's -> 1990s. It does change 1990's if it is the only date change in an article. Problem if two date changes on the same line? Bgwhite (talk) 22:28, 10 February 2013 (UTC)

The "Decade apostrophe" typo rule is set up to look for the word "the". So in your example, it changed "the late 1980's" to "the late 1980s", but didn't touch "and early 1990's" because there is no "the". GoingBatty (talk) 23:02, 10 February 2013 (UTC)
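
As a rough illustration of a rule that only fires after "the", here is a Python sketch. The pattern is an assumption made for this example and is not the actual "Decade apostrophe" rule.

```python
import re

# Hypothetical pattern: "the", an optional "early"/"mid"/"late", then a decade
# written with an apostrophe, e.g. "the late 1980's".
DECADE = re.compile(r"\b([Tt]he\s+(?:(?:early|mid|late)\s+)?\d{3}0)'s\b")

def fix_decade(text: str) -> str:
    return DECADE.sub(r"\1s", text)

print(fix_decade("the late 1980's and early 1990's"))
```

The second decade is left alone because no "the" precedes it, which matches the behaviour described above.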
Once again you witnessed why I went into math and computers instead of written/spoken professions. Thank you Batty. Bgwhite (talk) 05:51, 11 February 2013 (UTC)
Note that the apostrophe can be correct, when initiating something belonging to 1990. Rich Farmbrough, 19:33, 24 March 2013 (UTC).
"1990's" can be correct, such as "In 1990's Die Hard 2...", which is why the rule is designed to look for the word "the". Could you please provide an example where the typo rule is suggesting an invalid change? Thanks! GoingBatty (talk) 19:45, 24 March 2013 (UTC)

Polinativelama → Pollinativelama

No.   Rich Farmbrough, 19:29, 24 March 2013 (UTC).

  Fixed - Since List of Other Backward Classes is the only article in Wikipedia that contains "Polinativelama", I wrapped the word with {{not a typo}} so AWB won't try to change it again. In the future, could you please provide the article in question? It's helpful to run the article through AWB again (without saving the edit) to see which rule is making the change. GoingBatty (talk) 19:41, 24 March 2013 (UTC)

Enborne is not a typo

This. Twice. It's a waste of my time finding my watchlist clogged with unnecessary and inconsequential edits in violation of WP:AWB#Rules of use item 4, but when those go hand in hand with incorrect "typo" fixes, it just annoys the hell out of me. I shouldn't need to clean up bad AWB edits. I suppose that I now need to search for those instances of "Emborne" on pages which I don't have watchlisted, and check that they're not also bad "typo" fixes that should have been left as "Enborne". --Redrose64 (talk) 20:08, 22 March 2013 (UTC)

Well, that didn't take long. --Redrose64 (talk) 20:14, 22 March 2013 (UTC)
  Done. I have changed the "Emb-" rule to avoid creating "Emborne". If you see other shortcomings in the rules, please bring them to our attention, but the sharp tone is not necessary. We are all volunteers here. BTW, the "Emborne" change does not violate item 4 of the "Rules of use"; it is not inconsequential. If you wish to complain to the person who approved and saved the changes that the Typo rule suggested, I think you are on solid ground, but please be kind. Chris the speller yack 14:57, 23 March 2013 (UTC)
Belated apologies for my incorrect edit from 2 years ago. That's the only instance of that incorrect fix that I could find in my edit summary search. I hope that I'm more careful now. GoingBatty (talk) 01:58, 18 April 2013 (UTC)

Opponnent - Opponent

There seem to be more than 80 of them, could you code it into AWB? ϢereSpielChequers 21:55, 17 April 2013 (UTC)

  Done - thanks for the suggestion! GoingBatty (talk) 01:50, 18 April 2013 (UTC)

Sufi

Could Sufism and Sufi be added to the Capitalization section. Bgwhite (talk) 06:24, 15 May 2013 (UTC)

  Added -GoingBatty (talk) 23:18, 15 May 2013 (UTC)

Template capitalization

Could a section be added something like:

==== Template names ====
<syntaxhighlight lang="xml">
<Typo word="Templates" find="^[\{]{2}([a-z])(.*?)[\}]{2}$" replace="{{{{SUBST:UC:$1}}$2}}" />

</syntaxhighlight>

This would make all templates start with an uppercase letter. Although that does nothing for functionality, lowercase template names are a pet peeve of mine, and the capital letter actually helps me notice a template more quickly and easily. I know the MediaWiki core really doesn't care, but there is no reason for templates not to match their respective page titles. If this is considered unnecessary "fluff", which I kind of expect people to say, could AWB's general fixes at least be modified to capitalize the templates it injects? Thank you. Technical 13 (talk) 16:35, 22 May 2013 (UTC)

Per WP:AWB/Typos#Usage, AWB doesn't make any typo fixes within templates. If you want to request a change to the general fixes, then Wikipedia talk:AutoWikiBrowser might be a better place for that discussion. GoingBatty (talk) 02:42, 23 May 2013 (UTC)
Before you begin that discussion, I recommend that you read through these ANI archives. -- John of Reading (talk) 07:16, 23 May 2013 (UTC)
@John of Reading: - I don't support changing the capitalization of existing templates, but I think discussing whether AWB should add {{orphan}} or {{Orphan}} is reasonable, if done in the right place. GoingBatty (talk) 23:55, 23 May 2013 (UTC)
I am actually going to start that discussion. I don't see any reason there can't be a per-user preference to capitalize templates or at least set the capitalization of templates that AWB injects/updates. I was just busy with RL stuff today starting a new semester in school this week and I had lots of related errands I needed to run. I'll likely open the discussion tomorrow. I'm personally opposed to using AWB to solely capitalize templates, but don't see it as an issue if people are there fixing multiple other things at the same time. I personally feel that it is the better option considering $wgCapitalLinks forces capital links anyways. Technical 13 (talk) 00:03, 24 May 2013 (UTC)

Humourous

The software says the word "humourous" should be changed to "humorous". On wikt:humourous it says it is "uncommon, nonstandard", and "Nowadays, this spelling is much less common than humorous, even in regions where the spelling humour is overwhelmingly preferred." Even though it is non-standard, it is still an alternate spelling. So should "humourous" be getting changed to "humorous"? Inks.LWC (talk) 03:14, 23 April 2013 (UTC)

Wiktionary entries should always be cross-checked against respected dictionaries. "humourous" isn't listed at all in my Concise Oxford or Shorter Oxford; it only has poor-quality hits at OneLook.com (compare humorous at OneLook.com). So I think the rule is correct; the word is so uncommon that it could distract readers who notice it. -- John of Reading (talk) 10:34, 23 April 2013 (UTC)
I apologize for making the revert, which I made in haste after doing a Google search to confirm that the word existed with that spelling. I think for Canadian spelling, we should go with the Hansard style guide (http://www.hansard.ca/styleguide.pdf), which says "humour, but humorous". Although a quick search of actual Hansard entries will find both spellings common. -- Earl Andrew - talk 13:26, 23 April 2013 (UTC)

regualy -> regually ?

Why? --John (talk) 21:37, 11 June 2013 (UTC)

@John: - This unusual misspelling was caught by the "-ally" rule in error. GoingBatty (talk) 23:46, 11 June 2013 (UTC)

Dael

AWB did this strange edit; I think it is based on some issue with the Regex template. I saved the edit as mere proof that AWB does it; the page is Dale and should hit for anyone else wanting to test the page. ChrisGualtieri (talk) 15:07, 18 June 2013 (UTC)

It was a broken Typo rule, "Deal", which is now fixed. I also added a 'Not a typo' template to the article. Chris the speller yack 15:44, 18 June 2013 (UTC)
@ChrisGualtieri: - Saving a bad edit makes it harder to duplicate the error. Instead, as long as you report the page with the problem, someone else can run AWB to see which rule is causing the error and then fix the rule. Thanks! GoingBatty (talk) 22:33, 20 June 2013 (UTC)
I undid it, so it can be checked, as you see in the history. But noted for future events. Thanks. ChrisGualtieri (talk) 22:40, 20 June 2013 (UTC)

There seems to be a problem with Neighbor

Per this edit there seems to be a problem with the typo logic for Neighbor. Kumioko (talk) 19:01, 28 June 2013 (UTC)

I tightened up the rule; reload the Typos. Chris the speller yack 21:40, 28 June 2013 (UTC)
Thanks. Kumioko (talk) 12:54, 29 June 2013 (UTC)

Capitalise Twitter

When typo checking I noticed it was capitalising 'facebook' to 'Facebook'. The phrase was something like "...twitter and facebook..." to begin with but it only auto-corrected the facebook. Could this be added please — Preceding unsigned comment added by Jamesmcmahon0 (talkcontribs) 09:12, 19 July 2013 (UTC)

The problem is that when birds twitter, they do it with a lowercase "t". There is a "twitter > Twitter" rule, but it only operates when the next word makes it clear that the web site is meant. See Wikipedia talk:AutoWikiBrowser/Typos/Archive 3#twitter -> Twitter -- John of Reading (talk) 09:25, 19 July 2013 (UTC)
Is it possible to use "Facebook and Twitter" or "Twitter and Facebook" as a catch, since those two are frequently used together and Facebook definitely doesn't chirp? --TKK! bark with me if you're my dog! 19:40, 26 July 2013 (UTC)
That's possible, but I don't think there are enough of either of these to make it worth making the rule more complicated. -- John of Reading (talk) 20:45, 26 July 2013 (UTC)
@Tikuko: - If you don't edit articles about ornithology, then I suggest making a find and replace rule in AWB for "twitter" to "Twitter", and always check your edits before saving. GoingBatty (talk) 18:36, 27 July 2013 (UTC)
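
The "Facebook and Twitter" catch suggested above could look something like the following Python sketch. The pattern and the name `fix_pair` are hypothetical; no such rule is confirmed to exist in the typo list.

```python
import re

# Hypothetical pattern: only touch "twitter" when it is paired with Facebook.
PAIR = re.compile(r"\b(?:[Ff]acebook and twitter|twitter and [Ff]acebook)\b")

def fix_pair(text: str) -> str:
    # Capitalise both site names inside the matched phrase only.
    return PAIR.sub(
        lambda m: m.group(0).replace("twitter", "Twitter").replace("facebook", "Facebook"),
        text,
    )

print(fix_pair("follow us on twitter and facebook"))
```

A sentence about birds that twitter is not matched, since the pairing with Facebook is required.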

mid- (New Zealand spelling)

Grutness (talk · contribs) posted on my talk page after I used the typo fixer to change a mid 1870s to mid-1970s. He says that the New Zealand convention is not to use the hyphen. Can anything be done to change AWB's behaviour on this, or should something like {{bots|deny=AWB}} be added to affected pages? The latter would obviously exclude them from lots of other useful corrections. Jamesmcmahon0 (talk) 09:53, 19 July 2013 (UTC)

You are correct, {{bots|deny=AWB}} is too drastic; if a rule is good except in a handful of cases, the proper workaround is to mark them with {{Not a typo}}. But I think the best way forward in this case is for the rule to be disabled while we discuss it. I've done that, so if you restart AWB or reload the typo list (Ctrl+R) you won't make any more of these changes. The rule was added by GoingBatty (talk · contribs) [ping!]. -- John of Reading (talk) 10:29, 19 July 2013 (UTC)
Slight correction - it's optional in New Zealand, and you will find both forms. Generally though, the variety without the hyphen is far more common. So if it can't be fixed, it will be just a minor annoyance rather than anything drastic. Grutness...wha? 11:19, 19 July 2013 (UTC)
Seems that The New Zealand Herald uses "mid-1980s" (e.g. this article). GoingBatty (talk) 12:29, 19 July 2013 (UTC)
If there are just a handful of cases, John's suggestion is correct, or consider wrapping "As written" templates around them. It's a rarely used alias of "Not a typo", but it seems well suited for a case where neither the presence or absence of a hyphen is wrong. Chris the speller yack 13:46, 19 July 2013 (UTC)
GoingBatty, If you look at the examples I posted on James's talk page you'll see they didn't include the Herald. They only included the NZ Government, the national library, the national encyclopedia... Grutness...wha? 00:41, 20 July 2013 (UTC)
However, each of those web sites also has results when you search for "mid-1980s": stuff.co.nz, The Encyclopedia of NZ, National Library, NZ History Online, NZ government. GoingBatty (talk) 06:06, 20 July 2013 (UTC)
Well, as I said, it's optional and you'll see both forms. In general, though, you'll find the hyphenless form more often. Grutness...wha? 01:21, 21 July 2013 (UTC)

Misspelling of "publicly"

My search for "publically" in article namespace reported 505,557 results. Please add it to your list of misspellings to be corrected.
Wavelength (talk) 23:42, 9 June 2013 (UTC)

@Wavelength: - When I searched for "publically" with quotation marks, I got only 398 results. One of those results is Pleonasm#Morphemic_pleonasm, which states: "'publically' vs. 'publicly' – both spellings are considered correct/acceptable in American English". GoingBatty (talk) 00:41, 10 June 2013 (UTC)
Thank you for your reply. I am surprised that both spellings are considered correct in American English.
Wavelength (talk) 00:56, 10 June 2013 (UTC)
I think many (perhaps most) Americans consider "publically" a marginally acceptable spelling, substandard at best. Not all American dictionaries accept it; I don't know why Merriam-Webster stooped so low in this case. Chris the speller yack 01:15, 10 June 2013 (UTC)
We shouldn't accept "publically". --John (talk) 21:38, 11 June 2013 (UTC)
Has this actually been rectified? I've just found an instance of the use of "publically" here. The fact that proof readers at Merriam-Webster fell asleep on the job doesn't justify a blatant misspelling. The fact that the item is little more than blatant self-promotion is an issue unto itself. --Iryna Harpy (talk) 05:57, 22 July 2013 (UTC)
AWB should not be the judge on this, since a major dictionary allows it. If you can get a consensus on the talk page of WP:SPELLING, we can follow. Chris the speller yack 14:00, 22 July 2013 (UTC)
The full OED does not allow "publically", see OED, indeed it is listed on its common misspellings page. Rjwilmsi 15:39, 22 July 2013 (UTC)
It's fine to correct "publically" in articles that are established as using British English; at this point we can't add a rule to change it in articles that use American English. Chris the speller yack 16:01, 22 July 2013 (UTC)
Considering that it's an American entry, I'll have to (reluctantly) abide by that judgement. Iryna Harpy (talk) 23:02, 30 July 2013 (UTC)
I have started a discussion at Wikipedia talk:Manual of Style/Spelling#Misspelling of "publicly" (version of 16:32, 22 July 2013).
Wavelength (talk) 16:39, 22 July 2013 (UTC)

two fold -> twofold

In the phrase "two fold-out maps", AWB mistakenly combines two and fold to form "twofold-out maps". Harryboyles 10:36, 16 July 2013 (UTC)

  Fixed in this edit. Thanks! GoingBatty (talk) 00:10, 17 July 2013 (UTC)

@GoingBatty: This rule is also suggesting three folds -> threefolds; I can't think of a case where that would be right. For an example, run it against Ipad. Jamesmcmahon0 (talk) 15:00, 5 August 2013 (UTC)

I've had a go at fixing this. Test data at User:John of Reading/X2. -- John of Reading (talk) 16:53, 5 August 2013 (UTC)
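
One way to avoid both false positives reported in this thread is a word boundary plus a negative lookahead for the hyphen. The Python sketch below uses a hypothetical pattern; it is not the rule as actually fixed in AWB.

```python
import re

# Hypothetical pattern: join "two fold"/"three fold" unless "fold" continues
# as a longer word ("folds") or is followed by a hyphen ("fold-out").
FOLD = re.compile(r"\b([Tt]wo|[Tt]hree) fold\b(?!-)")

def fix_fold(text: str) -> str:
    return FOLD.sub(r"\1fold", text)

print(fix_fold("The aim was two fold."))
```

With this pattern, "two fold-out maps" and "three folds" are both left unchanged.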

Double pounds

Since the 'double dollar' rule (by Chris the speller (talk · contribs)), which finds instances of '$100 million dollars' (and similar), was added, I have corrected many instances of it and have yet to see a false positive. Would it be possible to expand it or make a new rule to catch 'double pounds', i.e. instances of '£100 million pounds' etc.? Jamesmcmahon0 (talk) 17:19, 29 July 2013 (UTC)

Sterling idea! (pun intended)   Done. It also fixes "£100 million pounds sterling" and such. Reload Typos and let 'em have it! Chris the speller yack 18:35, 29 July 2013 (UTC)
I've been making my way through the 1500+ 'Double dollars' and 500+ 'Double pounds'. I just committed this edit; notice that in "They also donated $50 million in the $100 million dollar cost for the new 14-story" the rule missed the second 'dollar'. I assume this is because it's not plural. Is it possible to catch these without getting too many false positives? Jamesmcmahon0 (talk) 15:07, 2 August 2013 (UTC)
Do you have an example where changing "$100 million dollar" to "$100 million" would be a false positive? Thanks! GoingBatty (talk) 03:58, 13 August 2013 (UTC)
None come to mind. Jamesmcmahon0 (talk) 13:17, 13 August 2013 (UTC)
  Done. It will now fix "He decided to pay $1 dollar for a donut" and "to finance the $3 million dollar cost of painting the bridge". Chris the speller yack 02:12, 15 August 2013 (UTC)
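
For readers who want to experiment, here is a minimal Python sketch of rules in the spirit of "double dollars" and "double pounds". The patterns and the name `fix_double_currency` are assumptions; the actual typo rules may be written quite differently.

```python
import re

# Hypothetical patterns: a currency symbol and amount, an optional scale word,
# then a redundant currency name (singular or plural).
SCALE = r"(?: (?:thousand|million|billion|trillion))?"
DOUBLE_DOLLAR = re.compile(r"(\$\d[\d,.]*" + SCALE + r") dollars?\b")
DOUBLE_POUND = re.compile(r"(£\d[\d,.]*" + SCALE + r") pounds?(?: sterling)?\b")

def fix_double_currency(text: str) -> str:
    # Keep the amount (group 1) and drop the repeated currency word.
    text = DOUBLE_DOLLAR.sub(r"\1", text)
    return DOUBLE_POUND.sub(r"\1", text)

print(fix_double_currency("to finance the $3 million dollar cost of painting the bridge"))
```

Making the final "s" optional is what lets the sketch handle the singular "dollar" case raised above.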

At bat

"At bat" is perfectly appropriate. It should not be automatically changed to "at-bat."--Epeefleche (talk) 20:10, 29 July 2013 (UTC)

Is the expression used attributively ("the at-bat player") or predicatively ("the player [who is/was] at bat")?
Wavelength (talk) 20:45, 29 July 2013 (UTC)
See at bat ... it is used generally as "in his first at bat", or "in 2000 he had 333 at bats".--Epeefleche (talk) 21:23, 29 July 2013 (UTC)
In your examples, it is used substantively. If you visit http://www.onelook.com/?w=at+bat&ls=a, you will find a list of links to definitions. Some entries have the expression hyphenated.
Wavelength (talk) 21:48, 29 July 2013 (UTC)
The unhyphenated expression "at bat" is listed at "Glossary of baseball".
Wavelength (talk) 21:54, 29 July 2013 (UTC)
Both spellings are mentioned at http://www.thefreedictionary.com/at+bat. The unhyphenated spelling "at bat" is listed at wikt:Appendix:Glossary of baseball jargon (A). The expression "At Bat" is listed at http://baseball.epicsports.com/baseball-glossary.html, but the lexical category (part of speech) is not indicated. The hyphenated spelling "AT-BATS" is listed at http://www.baseball-statistics.com/Leaders/glossary.htm. The equivalence "AB = At-Bats" is listed at http://baseballtips.com/glossary/index.html/. The expression "At-bats" is mentioned at http://www.baseball-almanac.com/teamstats/glossary.shtml. The expression "At-bat" is mentioned at http://baseball.about.com/od/termstatglossar1/a/statsglossary.htm. The spelling "At-bat" is listed at http://www.baseballscorecard.com/glossary.htm. The expression "At bat" is listed at http://www.webball.com/cms/page3205.cfm?presto_view=A. The equivalency "AB=At bats not including walks or sacrifices" is listed at http://www.gamerisms.com/baseball-games-glossary.html. The equivalency "AB - At Bats" is listed at http://www.sportspool.com/baseball/terms.php.
Wavelength (talk) 22:19, 29 July 2013 (UTC)
Just to clear up any confusion, Epeefleche is responding to some edits I was making on my own initiative to try to address the awkward mix of "at bats" and "at-bats" that we often have in the same article. This isn't currently in the typo list. So far as I can see, the MLB official rules, MLB.com, the New York Times, Sports Illustrated, ESPN, The Associated Press, LA Times, Chicago Tribune, etc., all use only "at-bats" for the plural (though it can be either "at bat" or "at-bat" in its singular form); I was unable to find any contemporary publication that used "at bats". Epeefleche has stated that MLB publications of the '80s and '90s were less consistent in this, however, so I've agreed not to standardize any more of these. -- Khazar2 (talk) 22:47, 29 July 2013 (UTC)
  • Apologies -- I thought from the edit summary, which pointed to AWB, that it was listed as a typo in AWB. If not, then no matter. I pointed to the Official Baseball Rules (as codified and adopted by the Professional Baseball Official Playing Rules Committee), published by The Sporting News through at least 2005, and the Official Rules of Major League Baseball, published by the commissioner's office through Triumph Books in the late 1990s, which never used the hyphen, the Macmillan Baseball Encyclopedia (MLB's official encyclopedia through the 1990 edition) which always used "at bats", from the first edition (1969) through the final tenth (1995); Total Baseball (MLB's official encyclopedia beginning with the 1995 edition) used "at-bats" beginning with the first edition (1989), but dropped the hyphen beginning with the sixth edition (1997); the final seventh edition was in 1999; and the official American League Red Book and National League Green Book which did not use the hyphen at all (from the late 1940s) until the AL began using it in 1987; the NL book never used it. There is certainly an inconsistency across sources -- and even within some sources. But I can't see sufficient support for the notion, given the above, for asserting that "at bat" is incorrect ... though there is a discussion, to be sure, as to whether at-bat is "correct" as well.--Epeefleche (talk) 00:02, 31 July 2013 (UTC)


Rescind rule needs looking at

I just committed this edit before I realised it had wrongly corrected residential using the rescind rule. Jamesmcmahon0 (talk) 12:59, 7 August 2013 (UTC)

I found only one other similar misspelling, and have fixed it. Considering the rarity and the fact that the rule did not harm a correctly spelled word, I don't think there is much to be gained by messing with the rule. It brought a misspelling to your attention, and that is a point in its favor. Chris the speller yack 13:39, 7 August 2013 (UTC)
Ok, cheers for looking at it Jamesmcmahon0 (talk) 13:52, 7 August 2013 (UTC)

progidy -> prodigy

Another common one. If someone can add it please, thanks. -- œ 20:32, 8 September 2013 (UTC)

  Done - the rule will also fix "progidal" and "progidies". GoingBatty (talk) 01:50, 9 September 2013 (UTC)

Honourary

This is valid in Canadian English according to Wiktionary. At least, I can confirm that while I was fixing this word back in March, I found it in many Canadian articles, and took care not to change it. Any other opinions out there? -- John of Reading (talk) 20:14, 14 September 2013 (UTC)

I've put the rule back again after seeing User talk:Mogism#Honourary to Honorary. - looks like I wasted a lot of time in March carefully skipping the Canadian articles. -- John of Reading (talk) 20:21, 14 September 2013 (UTC)

Sea-going

The Oxford English Dictionary gives this spelling and only this spelling of this word. Please remove it from the list of typos. Thanks. DrKiernan (talk) 16:18, 19 September 2013 (UTC)

How old is your OED? oxforddictionaries.com lists only 'seagoing', and not 'sea-going'. And why did you remove the fix to 'sea-faring'? Chris the speller yack 19:35, 19 September 2013 (UTC)
The online version is dated June 2012. Print version was last updated in 1989. Both "seafaring" and "sea-faring" are given in the OED. The hyphen isn't a typo. DrKiernan (talk) 20:22, 19 September 2013 (UTC)
I have to take your word for it. Most folks in the US do not have free access to either online or print OED. Even my neighborhood library has only a Compact OED in print. I consider your request to remove the Typo rule satisfied, since you removed it yourself. Happy editing! Chris the speller yack 20:39, 19 September 2013 (UTC)

Unspace em dashes

Per MOS:EMDASH: some words — some words → some words—some words.

Could be general fixes, too. --bender235 (talk) 00:58, 8 October 2013 (UTC)

When fixing a spaced em dash, how would AWB know whether to change it to an unspaced em dash or a spaced en dash? GoingBatty (talk) 01:26, 22 October 2013 (UTC)
I think it would always be correct to change a spaced em dash to an unspaced em dash. Then if another editor thought the spaced en dash looked better, they could always change it. That would be the end of the matter. Chris the speller yack 02:05, 22 October 2013 (UTC)

Tbit and others

There's some more which could be added to the SI Unit section. Such as Terabit and more in the table on that page. --Danrok (talk) 16:20, 17 October 2013 (UTC)

What's the best way to tell which other issues happen often enough that they're important enough to add the extra rules? GoingBatty (talk) 01:29, 22 October 2013 (UTC)
I see your point! I suspect these words are probably not that frequently used at this time, but may become more frequent in the future. --Danrok (talk) 03:36, 28 October 2013 (UTC)

qualy → qualification

There's this unpleasant habit of abbreviating qualification as qualy, like for instance here or here. AWB tries to correct it to qually, which is obviously false. Could someone please add a rule that replaces it with qualification? --bender235 (talk) 17:30, 31 October 2013 (UTC)

And qualies to qualifications too? GoingBatty (talk) 18:16, 31 October 2013 (UTC)
I've changed the "-ally (1)" rule to exclude "qualy". GoingBatty (talk) 18:21, 31 October 2013 (UTC)
I've changed the "-ally (2)" rule to exclude "Qualy", but not adding a new rule to change "qualy"/"qualies" per Glossary of tennis terms. GoingBatty (talk) 00:11, 1 November 2013 (UTC)
Agree, "qualy" is also used in motorsport. Probably should not be used in text on wp except in quotes as it's slang language, but also not something I think we can/should deal with in typo rules. Rjwilmsi 08:24, 1 November 2013 (UTC)
I agree also including qualies to qualifications. --bender235 (talk) 12:43, 3 November 2013 (UTC)

establisihing → establishing

Here AWB fixed establisihing to establiishing, when it should've been establishing. Please fix that error. Thanks. --bender235 (talk) 12:41, 31 October 2013 (UTC)

Looks like it might be the "-ish" rule in the Wikipedia:AWB/T#Endings section. GoingBatty (talk) 17:33, 31 October 2013 (UTC)
Don't have AWB in this PC, but I wonder if changing the rule from "\b([A-Za-z]+)sih..." to "\b([A-Za-z]+)i?sih..." will fix it. GoingBatty (talk) 17:35, 31 October 2013 (UTC)
Nope - that didn't do it. GoingBatty (talk) 00:00, 1 November 2013 (UTC)
It's that rule. Need to change + to +? for your added i? to work. Rjwilmsi 10:31, 1 November 2013 (UTC)
  Done - thanks! GoingBatty (talk) 20:30, 3 November 2013 (UTC)
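The greedy-versus-lazy point in Rjwilmsi's reply can be reproduced with a small JavaScript sketch. The real "-ish" rule is only quoted in truncated form above, so the patterns below are simplified, hypothetical stand-ins (a fixed "ing" ending; the names `greedyRule`, `lazyRule` and `fixWith` are invented for illustration), not the rule stored in the typo list.

```javascript
// Hypothetical, simplified stand-ins for the "-ish" rule (the real rule is not quoted in full).
const greedyRule = /\b([A-Za-z]+)i?sih(ing)\b/g;  // greedy +: group 1 swallows the stray "i"
const lazyRule = /\b([A-Za-z]+?)i?sih(ing)\b/g;   // lazy +?: group 1 stops early, so i? consumes the "i"

// Apply one rule with the replacement "$1ish$2".
function fixWith(rule, text) {
  return text.replace(rule, "$1ish$2");
}

console.log(fixWith(greedyRule, "establisihing")); // establiishing (the reported bug)
console.log(fixWith(lazyRule, "establisihing"));   // establishing
console.log(fixWith(lazyRule, "finsihing"));       // finishing
```

With the greedy quantifier the optional `i?` never gets a chance to match, which would explain why adding `i?` alone appeared to change nothing.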

Ballyhealy false positive

Just for info. It's Ballyhealy not Ballyheally as incorrectly changed here. Not sure which rule (not much time to check), but likely the "-ally (2)" rule. Regards, Sun Creator(talk) 14:00, 1 November 2013 (UTC)

  Fixed the "-ally (2)" rule. Thanks! GoingBatty (talk) 20:36, 3 November 2013 (UTC)

Correction: now days → nowadays

My search for now days (instead of nowadays) reported 115,885 results. I suggest the inclusion of wikt:nowadays in the edit summaries. That is for a convenient reference, and not because of any special reliability of Wiktionary.
Wavelength (talk) 17:16, 3 November 2013 (UTC)

However a search for the phrase "now days" (with the quotation marks) reported only 87 results. I expanded the existing "Nowadays" rule to also fix "now days" and "now-days". Feel free to use whatever edit summary you wish when fixing these. GoingBatty (talk) 20:46, 3 November 2013 (UTC)
Thank you for reminding me about the quotation marks, and for expanding the rule.
Wavelength (talk) 23:08, 3 November 2013 (UTC)

suggested rule: "of the of the" -> "of the"

I've been gradually cleaning out some "of the of the"s lately (e.g., [28], [29]), but there seem to be enough left to justify adding it as a regular expression. I've cleaned up 100 of these or so without encountering any false positives for the rule. Would someone be willing to add it? -- Khazar2 (talk) 12:25, 5 November 2013 (UTC)

  Done. Rule "of xxx of xxx" also fixes 'of his of his' and anything else. I happened to hit a bunch like this last week, including about 50 'for the for the'. For now, let's see how this rule goes. Chris the speller yack 17:40, 5 November 2013 (UTC)
Anyone who's brave enough to run a general rule for these cases can use:
find   "\b([a-z]+) ([a-z]+) \1 \2\b"
replace   "$1 $2"
The only false positive I have found so far is "calling a spade a spade", but there are probably many others, so this is not a good candidate for a Typo rule. Chris the speller yack 17:54, 5 November 2013 (UTC)
By coincidence I'm currently working on a list from a database scan for a regex very like that one. I'm skipping about 60% of the list and saving only 40%. Some of the 60% are articles where the problem has already been fixed, since my database dump is from September. But consider phrases such as "arm in arm", "side by side", "smaller and smaller", "back to back" - all these are used in contexts where the preposition occurs again just before or after the phrase. Still, if you look at my last few hundred contributions you will see the fixes. -- John of Reading (talk) 18:17, 5 November 2013 (UTC)
The article "Repetition (rhetorical device)" might be helpful. I found that page and others like it by doing a Google search for phrases with repeated words.
Wavelength (talk) 19:57, 5 November 2013 (UTC)
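To make the false positives mentioned in this thread concrete, here is a minimal JavaScript sketch that applies the find/replace pair quoted above. The function name `collapseDoubled` and the sample sentences are illustrative assumptions; only the pattern and the replacement string come from the discussion.

```javascript
// The general find/replace pair quoted in the thread, applied to every match.
const doubledPhrase = /\b([a-z]+) ([a-z]+) \1 \2\b/g;

function collapseDoubled(text) {
  return text.replace(doubledPhrase, "$1 $2");
}

// Intended fix:
console.log(collapseDoubled("the roof of the of the house"));          // the roof of the house
// False positives of the kind described above:
console.log(collapseDoubled("calling a spade a spade"));               // calling a spade
console.log(collapseDoubled("they walked side by side by the river")); // they walked side by the river
```

The third sample shows why phrases such as "side by side" are only at risk when the preposition happens to occur again immediately afterwards.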

losing on penalties

I try to keep on top of Loosing - Losing among the ones I search for, but there's one subset that I think could go to AWB as it doesn't get false positives. "loosing on penalties" should always be "losing on penalties". ϢereSpielChequers 19:17, 11 November 2013 (UTC)

  Added. GoingBatty (talk) 20:36, 11 November 2013 (UTC)

Especially for attention...

Hi all, I was running a regex check on a list earlier and it found espicialy and offered me espicially, presumably using an ending rule; would there be a case for adding this to the especially rule? Baldy Bill (sharpen the razor|see my reflection) 20:07, 11 November 2013 (UTC)

Hi Bill, since you seem to have found the only instance of "espicialy" in Wikipedia, I don't think there would be a case. Happy editing! GoingBatty (talk) 20:38, 11 November 2013 (UTC)
Ah, I see, I'm still getting used to the workings of AWB. I'll do a bit more research next time... thanks for reading my ramblings! Baldy Bill (sharpen the razor|see my reflection) 22:45, 11 November 2013 (UTC)

parisitic → parasitic

Please add, thanks. (Note that "parisite" is a rare mineral) -- œ 15:46, 13 November 2013 (UTC)

It seems that you fixed the only instance of a rare typo, so it doesn't seem to warrant a new rule. GoingBatty (talk) 01:13, 14 November 2013 (UTC)
Yes I noticed it's rare on Wikipedia, but I do see it quite frequently everywhere else, which prompted me to search for it here in the first place. But alright. -- œ 01:27, 14 November 2013 (UTC)
Is a proactive approach to typo fixing not worth it? Is there a cost to adding new rules? Just curious. -- œ 01:29, 14 November 2013 (UTC)
The cost is that each rule makes the typo checking take a bit longer to run. GoingBatty (talk) 03:12, 14 November 2013 (UTC)
I understand, thanks. -- œ 05:11, 14 November 2013 (UTC)

Various namespaces

Applying the rules to all subject namespaces (see the chart at Wikipedia:Namespace) would further help to prevent editors and readers from being wrongly influenced by incorrect usage. (Some subject pages in the "Wikipedia" namespace are actually used as talk pages, so care needs to be taken to avoid them.) Wikipedia:Database reports/Most-watched pages by namespace seems to be a good place to start.
Wavelength (talk) 20:29, 16 November 2013 (UTC)

@Wavelength: - Since your comment refers to AWB functionality (and not to the individual rules), this might be a better discussion for Wikipedia talk:AutoWikiBrowser or Wikipedia:AutoWikiBrowser/Bugs. GoingBatty (talk) 21:44, 16 November 2013 (UTC)
Thank you for your reply. I have posted a message at Wikipedia talk:AutoWikiBrowser.
Wavelength (talk) 21:51, 16 November 2013 (UTC)

Incorrectly changes Phycology to Psychology

[30] -- Magioladitis (talk) 07:29, 20 November 2013 (UTC)

  Fixed -- John of Reading (talk) 08:29, 20 November 2013 (UTC)

Do not change if....

Do not change "Lachlan Nieboer" to "Lachlan Neighbor". This is a name. Please make an opt-out or add this to a white/blacklist(?) -(tJosve05a (c) 23:04, 23 November 2013 (UTC)

  Done. Chris the speller yack 06:00, 24 November 2013 (UTC)

Carsten Niebhur Institute

Don't change "Carsten Niebhur Institute" or "Carsten Niebhur" to "Carsten Neighbour Institute" or "Carsten Neighbour". (found in Lorenzo Nigro). -(tJosve05a (c) 22:30, 2 December 2013 (UTC)

  Done. I have changed the rule to leave "Niebhur" alone. Chris the speller yack 23:32, 2 December 2013 (UTC)

Jornal de Sciencias Mathematicas

Do not change Jornal de Sciencias Mathematicas to Jornal de Sciencias Mathematics. -(tJosve05a (c) 18:27, 5 December 2013 (UTC)

hingham → Hingham

  Done If someone could insert hingham → Hingham in the RegEx-list, that would be great. -(tJosve05a (c) 19:37, 5 December 2013 (UTC)

Only found one occurrence that is not in a url and it's recently been corrected. Regards, Sun Creator(talk) 20:24, 5 December 2013 (UTC)

Intrust Super Cup

Do not change 'Intrust Super Cup' to 'Entrust Super Cup'. -(tJosve05a (c) 20:37, 5 December 2013 (UTC)

Fiance → Fiancé

Fiancee is apparently a film so maybe that shouldn't be changed to Fiancée but the male equivalent should be Jamesmcmahon0 (talk) 14:56, 27 November 2013 (UTC)

Films and songs should be inside italics or quotation marks (and wikilinked, if possible), so typo fixing shouldn't change them. GoingBatty (talk) 16:34, 27 November 2013 (UTC)
@GoingBatty: would you be able to add the diacritics rule(s) to change Fiance to Fiancé and Fiancee to Fiancée please. I've got better at RegEx but definitely don't trust myself to add a typo rule yet! Jamesmcmahon0 (talk) 12:06, 2 December 2013 (UTC)
@Jamesmcmahon0: - I'm currently travelling without AWB access, so I'd like to wait a few days until I get back to AWB. I'll be happy to add this rule, but I want to immediately see whether "fiance" should always be changed to "fiancé", or if some should be changed to "finance". However, if another editor wants to do this, please go ahead without me. GoingBatty (talk) 05:19, 3 December 2013 (UTC)
@Jamesmcmahon0: -   Doing.... GoingBatty (talk) 00:42, 6 December 2013 (UTC)
@Jamesmcmahon0: - dictionary.com seems to indicate that "fiance" is also acceptable, so I've reverted the rule. GoingBatty (talk) 01:07, 6 December 2013 (UTC)
Cool, ok, thanks for looking into it Jamesmcmahon0 (talk) 10:21, 6 December 2013 (UTC)

1990's

It did not change 1990's to 1990s here. I had to make the edit manually. -(tJosve05a (c) 18:09, 6 December 2013 (UTC)

  Added - I added a new rule to cover this. Thanks! GoingBatty (talk) 05:16, 7 December 2013 (UTC)

Febuary → February needs looking at

I just committed this edit as you can see, AWB caught one Febuary but missed a couple of others of exactly the same typo. Jamesmcmahon0 (talk) 14:06, 9 December 2013 (UTC)

AWB does not check inside quotes. Many quotes can be found on that page. Regards, Sun Creator(talk) 14:20, 9 December 2013 (UTC)
@Jamesmcmahon0: -   Fixed the mismatched quotation marks and typos on List of UK Rock Chart number-one singles of 2000. GoingBatty (talk) 16:12, 9 December 2013 (UTC)

Numbers with "year" and "old"

WP:HYPHEN (sub-subsection 3, point 3) says the following.

  • Many compounds that are hyphenated when used attributively (adjectives before the nouns they qualify: a light-blue handbag, a 34-year-old woman) or substantively (as a noun: she is a 34-year-old), are usually not hyphenated when used predicatively (descriptive phrase separated from the noun: the handbag was light blue, the woman is 34 years old). Where there would otherwise be a loss of clarity, a hyphen may optionally be used in the predicative usage as well (hand-fed turkeys, the turkeys were hand-fed).

When "year" and "old" are modified by the word "one" or the figure "1", then a semantic understanding of the context is necessary for deciding whether hyphens are required: "one year old" or "one-year-old". Otherwise, plural numbers (as words or as figures) with the singular form "year" indicate that hyphens are required: "244-year-old" and "ninety-nine-year-old".
Presumably, the author has not absent-mindedly omitted the plural suffix "s" from places where it should be, and has not followed the pattern of some foreign languages, such as Russian, where numbers ending in the digit "1" are used with singular nouns: "двадцать один год", where "двадцать один" means "twenty-one" and "год" means "year" (singular). (See http://www.russianlessons.net/lessons/lesson11_main.php and http://learnrussian.rt.com/speak-russian/russian-numbers.)
We need to be careful about omitting or adding a space in multiple-digit numbers: "25-year-old" for "2 5-year-old" or vice versa, or "480-year-old" for "4 80-year-old", or vice versa. Also, we need to avoid confusion among "twenty-one year-old" and "twenty-one-year-old" and "twenty one-year-old", or among "five hundred-year-old" and "five-hundred-year-old" and "five hundred year-old".
Please add a rule that would find numbers (except "one" and "1") followed by "year old", and insert the missing hyphens, whether the numbers are expressed as words or as figures. Please include all multiple-digit numbers ending in "1" or "one", for example, "21", "twenty-one", "321", and "eight hundred forty-one". Occurrences of "one year old" and "1 year old" would have to be checked in a different process. Mentioning "WP:HYPHEN (sub-subsection 3, point 3)" in edit summaries would be helpful.
(All of the previous examples are possible for a context about trees in a park, or buildings in a community.)
Also, if the plural suffix "s" is attached to the word "old" ("five-year-olds"), then the expression is used substantively, and even expressions with "one" or "1" have the plural suffix "s" and should have hyphens. Please do include "one" and "1" in hyphenating these expressions.
(These expressions usually refer to people, but could also refer to animals.)
Wavelength (talk) 03:36, 2 December 2013 (UTC) and 03:42, 2 December 2013 (UTC) and 05:49, 2 December 2013 (UTC) and 06:19, 2 December 2013 (UTC) and 16:21, 2 December 2013 (UTC)

This rule will do much of what you request, and will find very few false positives:
Find: " year old(s?)\b(?<=\b(?:\d+|[Tt]wo|[Tt]hree|[Ff]our|[Ff]ive|[Ss]ix|[Ss]even|[Ee]ight|[Nn]ine|[Tt]en) year olds?)(?<!\b1 year olds?)"
Replace: "-year-old$1"
To fix "a group of one year olds", a second rule would be needed.
This rule will not change "a group of twenty one year olds", and I'm glad it won't.
It will not fix all cases of spelled-out numbers that are higher than 10, but WP:MOSNUM recommends those that require more than two words be expressed as numerals. I change "eight hundred forty-one year old" to "841-year-old" when I run across these. Teens and "-ty"s like "fifteen", "Thirty", etc. could be added to this rule, but not "thirty-one". Well, not easily.
Mentioning "WP:HYPHEN (sub-subsection 3, point 3)" in edit summaries is not doable through the Typo list, and would have to be specified by an AWB user who has selected a list of articles whose main shortcoming is this lack of hyphenation. Not sure how I would do that.
I would like to hear from other rulewrights before proceeding. Chris the speller yack 23:27, 2 December 2013 (UTC)
Thank you for your reply. What do you think of this rule (modified from the one presented above)?
Find: " year old(s?)\b(?<=\b(?:\d+|[Tt]wo|[Tt]hree|[Ff]our|[Ff]ive|[Ss]ix|[Ss]even|[Ee]ight|[Nn]ine|[Tt]en|[Hh]undred|[Tt]housand|[Mm]illion||[Bb]illion|[ 0123456789][ 0123456789][0123456789]) year old[ s]?)(?<!\b1 year old[ s]?)"
Replace: "-year-old$1"
Wavelength (talk) 01:45, 3 December 2013 (UTC)
There is a double vertical bar before "[Bb]illion", and that matches everything, which is not what we want. I'm not sure what you intend "[ 0123456789][ 0123456789][0123456789]" to do; it allows "he was   1 year old when" to be changed to "he was   1-year-old when" (note extra spaces).
The construction with "old[ s]?)" causes it to miss "a six year old, well-fed boy".
How about this:
Find: " year old(s?)\b(?<=\b(?:\d+|[Tt]wo|[Tt]hree|[Ff]our|[Ff]ive|[Ss]ix|[Ss]even|[Ee]ight|[Nn]ine|[Tt]en|[Ee]leven|[Tt]welve|[A-Za-z][a-z]+teen|[Tt]wenty|[Tt]hirty|[Ff]orty|[Ff]ifty|[Ss]ixty|[Ss]eventy|[Ee]ighty|[Nn]inety|[Hh]undred|[Tt]housand|[MmBb]illion) year olds?)(?<!\b1 year olds?)"
Replace: "-year-old$1"
Chris the speller yack
That looks good, as far as I can tell. The double vertical bar in my attempted rule was a typographical error. By "[ 0123456789][ 0123456789][0123456789]", I was hoping to accommodate numbers in figures from 1 to 999, but my understanding of the coding is very rudimentary, and I am not sure about how to manage null quantities in leading "hundreds" positions and "tens" positions. Maybe it should be "[ 123456789][ 0123456789][0123456789]". Also, I do not know how to make it exclude "1" itself, in the case of predicative expressions, and also "0" itself. Also, maybe it would be simpler to have one rule for attributive and substantive expressions, where "one" and "1" are included, and another rule for predicative expressions, where "one" and "1" are excluded. Incidentally, where can I best learn the coding?
In retrospect, I realize that perhaps your rules (the first and the second) are intended to apply to both "old" and "olds", for attributive and substantive expressions having numbers greater than "one". Is that the case? In that case, a second rule would be needed for substantive expressions using "one", such as "a group of one year olds", as you indicated in your first reply. (Maybe a diagram would help me to keep my thoughts organized.)
Wavelength (talk) 17:23, 3 December 2013 (UTC) and 19:42, 3 December 2013 (UTC)
This is more complex than I visualized when I started this discussion. Maybe I will study it more thoroughly in the future, and start another discussion.
Wavelength (talk) 03:13, 10 December 2013 (UTC)
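For anyone who wants to experiment with the last rule proposed in this thread, the sketch below writes that Find/Replace pair as a JavaScript regular expression. It assumes a JavaScript engine with lookbehind support (ES2018 or later); the function name `hyphenateYearOld` and the sample sentences are illustrative, and AWB itself runs the rule in its own regex engine, not in JavaScript.

```javascript
// The last Find/Replace pair from the thread, written as a JavaScript regex literal.
// The positive lookbehind requires a listed number before " year old(s)";
// the negative lookbehind skips a bare "1 year old(s)".
const yearOld = / year old(s?)\b(?<=\b(?:\d+|[Tt]wo|[Tt]hree|[Ff]our|[Ff]ive|[Ss]ix|[Ss]even|[Ee]ight|[Nn]ine|[Tt]en|[Ee]leven|[Tt]welve|[A-Za-z][a-z]+teen|[Tt]wenty|[Tt]hirty|[Ff]orty|[Ff]ifty|[Ss]ixty|[Ss]eventy|[Ee]ighty|[Nn]inety|[Hh]undred|[Tt]housand|[MmBb]illion) year olds?)(?<!\b1 year olds?)/g;

function hyphenateYearOld(text) {
  return text.replace(yearOld, "-year-old$1");
}

console.log(hyphenateYearOld("a six year old, well-fed boy"));    // a six-year-old, well-fed boy
console.log(hyphenateYearOld("a 21 year old tree"));              // a 21-year-old tree
console.log(hyphenateYearOld("he was 1 year old when"));          // unchanged
console.log(hyphenateYearOld("a group of twenty one year olds")); // unchanged, as noted above
```

Because "one" is not in the word list, both "one year old" and "twenty one year olds" are left for manual review, which matches the behaviour described in the replies.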

perpertrated → perpetrated

..and variants. Please add. Thanks. -- œ 07:52, 11 December 2013 (UTC)

@OlEnglish: -   Done - GoingBatty (talk) 00:11, 12 December 2013 (UTC)

"also know as"

"also know as " and "also knows as " can both be added to AWB as typos of "also known as ". I've fixed enough over the last year or so manually to be confident it would be a good test for AWB. ϢereSpielChequers 22:34, 11 December 2013 (UTC)

@WereSpielChequers: - The "known as" rule already picks up "also know as". I tweaked it so it also fixes "also knows as". Thanks! GoingBatty (talk) 00:13, 12 December 2013 (UTC)
I also added other variations including "Also know as" and "also knowed as". GoingBatty (talk) 03:02, 12 December 2013 (UTC)

Incorrect fix of et al.→et al..

Since the Latin phrase "et al." is often italicized (et al.), RegExTypoFix should fix the incorrectly punctuated, non-italicized version (et al→et al.) or the incorrectly punctuated, italicized version (''et al''→''et al''., which yields "et al.") but should ignore the correctly punctuated, italicized version (''et al''.) so that it's not creating incorrect double punctuation (''et al''.→''et al.''., which yields "et al.."). Ninjatacoshell (talk) 16:17, 13 December 2013 (UTC)

@Ninjatacoshell: - You're right about the logic. Could you please give an example of an article where an incorrect fix occurs, and which tool suggests the incorrect fix? (e.g. AWB, WPCleaner, wikEd). Thanks! GoingBatty (talk) 18:24, 13 December 2013 (UTC)
@GoingBatty: I noticed the problem on List of Astragalus species. The tool was AWB. Thanks! Ninjatacoshell (talk) 22:55, 13 December 2013 (UTC)
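The three cases Ninjatacoshell lists can be sketched in JavaScript as follows. Both patterns are hypothetical illustrations written for this note (the thread does not quote the actual "et al." rule), and the names `naive`, `guarded`, `fixNaive` and `fixGuarded` are invented.

```javascript
// Hypothetical patterns for illustration; neither is the rule stored in the typo list.
const naive = /\bet al\b(?!\.)/g;                  // only checks the very next character
const guarded = /\bet al\b(?!\.|''\.)((?:'')?)/g;  // also skips the already-correct ''et al''.

function fixNaive(text) { return text.replace(naive, "et al."); }
// Keeps any closing italic markup ('') and puts the period after it.
function fixGuarded(text) { return text.replace(guarded, "et al$1."); }

console.log(fixNaive("''et al''."));          // ''et al.''.  (the double punctuation reported above)
console.log(fixGuarded("''et al''."));        // ''et al''.   (left alone)
console.log(fixGuarded("Smith et al found")); // Smith et al. found
console.log(fixGuarded("''et al'' found"));   // ''et al''. found
```

The guarded version looks past the closing apostrophes before deciding that the period is missing, which is the check the naive version lacks.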

Tom Kealy

Please don't change Tom Kealy to Tom Keally. -(tJosve05a (c) 11:06, 18 December 2013 (UTC)

  Done -- John of Reading (talk) 11:20, 18 December 2013 (UTC)

Numbers

Would it be possible to change 1,2,3,...9 to one, two,... , nine? As per MOS:NUMERAL. Maybe by looking for plurals such as 2 things? There's probably a couple of other tricks to get low false positives... Also changing pronounceable fractions such as 1⁄4 to written words such as: 1/4 yd to a quarter of a yd Jamesmcmahon0 (talk) 13:08, 18 December 2013 (UTC)

This would be quite challenging - I wonder if there would be false positives such as "January 2 concerts" to "January two concerts". Also MOS:NUMERAL states:
  • "there are frequent exceptions to these rules."
  • "Comparable quantities should be all spelled out or all figures: we may write either 5 cats and 32 dogs or five cats and thirty-two dogs, not five cats and 32 dogs."
  • "Common fractions for which the numerator and denominator can be expressed in one word are usually spelled out, e.g. a two-thirds majority; use figures if they occur with an abbreviated unit, e.g. 14 yd and not a quarter of a yd."
Maybe some of the exceptions could be turned into rules instead? GoingBatty (talk) 14:36, 18 December 2013 (UTC)

facilites→facilities

Change facilites to facilities as in this edit Jamesmcmahon0 (talk) 16:47, 18 December 2013 (UTC)

sq km → km<sup>2</sup>

Please change sq km → km<sup>2</sup> to km² (using the symbol ² instead.) -(tJosve05a (c) 15:08, 21 December 2013 (UTC)

Or...maybe not...I am not really sure about this...-(tJosve05a (c) 15:10, 21 December 2013 (UTC)
@Josve05a:   Not done - Wikipedia:Manual of Style/Dates and numbers#Unit symbols states:
GoingBatty (talk) 15:47, 21 December 2013 (UTC)

Xi'an University

Please don't change Xi'an University of Architecture and Technology to Xi'a University of Architecture and Technology. (tJosve05a (c) 18:55, 26 December 2013 (UTC)

@Josve05a:   Done - thanks for reporting this! GoingBatty (talk) 20:18, 26 December 2013 (UTC)

E. Neighbour

On Winner-take-all AWB wants to change * L. Itti, C. Koch and E. Niebur[...]
to
* L. Itti, C. Koch and E. Neighbour[...]
Maybe adding this page to a whitelist or something, since making a change in the RegEx for this case might be difficult. (tJosve05a (c) 01:19, 27 December 2013 (UTC)

@Josve05a: One option is that you could put the reference inside a citation template, which would prevent AWB from "fixing" the typo. (It might also help to get the references in one consistent style.) GoingBatty (talk) 01:28, 27 December 2013 (UTC)
@Josve05a:   Fixed the rule in this edit. GoingBatty (talk) 01:31, 27 December 2013 (UTC)

Might not be common but...

add this to the RegEx:
her/his confident to her/his confidence.

(tJosve05a (c) 19:26, 27 December 2013 (UTC)

@Josve05a:   Not done - it's also likely that it should be her/his confidant. GoingBatty (talk) 19:41, 27 December 2013 (UTC)
Also, "his confident" seems correct in articles such as Michael Maze, Kashif Mehmood, and Batty Langley. GoingBatty (talk) 19:43, 27 December 2013 (UTC)
I just went through the 70 or so articles that contained her/his confident. While most were correct, I did change 12 to her/his confidant and only one to her/his confidence. GoingBatty (talk) 20:06, 27 December 2013 (UTC)
GoingBatty thanks! I'm glad that this little post did fix (thanks to you) some articles that needed help! -(tJosve05a (c) 20:11, 27 December 2013 (UTC)

A few names

Please don't change Staf Dobbelaere to Staff Dobbelaere . -(tJosve05a (c) 22:11, 27 December 2013 (UTC)

Please don't change Maureen Bonnano to Maureen Bonanno. (tJosve05a (c) 22:28, 27 December 2013 (UTC)

Please don't change Chris Loosley to Chris Loosely. (tJosve05a (c) 22:40, 27 December 2013 (UTC)

@Josve05a:   Done as follows:
  • Updated the "Staff" rule to not change "Staf Dobbelaere"
  • Disabled the "Bonanno" rule
  • Updated the "-ely" rule to not change "Loosley"
You can also use the {{not a typo}} template to designate text that should not be changed by typo rules. Happy editing! GoingBatty (talk) 03:39, 28 December 2013 (UTC)
Thanks! I know that I can use {{Not a typo}}, but I feel more confident if it gets added here so that no other page with it in might get changed by mistake.
Here is another one:
Please don't change Daryl Missen to Daryl Mizzen
(tJosve05a (c) 05:30, 28 December 2013 (UTC)
  Done GoingBatty (talk) 06:05, 28 December 2013 (UTC)

A few suggestions

All of the → All the

web site → website

light weight → lightweight

Jamesmcmahon0 (talk) 21:49, 28 December 2013 (UTC)

  Not done - "website" is acceptable per The American Heritage Dictionary GoingBatty (talk) 00:28, 29 December 2013 (UTC)
  Not done - "light weight" is acceptable - see Bowstring for an example. GoingBatty (talk) 00:32, 29 December 2013 (UTC)
See also Light Weight Kernel Threads and light-weight process. GoingBatty (talk) 00:40, 29 December 2013 (UTC)
  Not done - "all of the" is acceptable per Grammar-monster.com. GoingBatty (talk) 00:35, 29 December 2013 (UTC)
Grammar monster seems to agree that 'all of the' should not be used? (example 3 and 4 in the second set) Jamesmcmahon0 (talk) 11:04, 29 December 2013 (UTC)
@Jamesmcmahon0: - However, the NOT A RULE BUT A GUIDELINE section states "For many, all soldiers seems less clumsy than all of the soldiers, but both are acceptable." GoingBatty (talk) 16:41, 29 December 2013 (UTC)
Ah, sorry I completely overlooked that. Jamesmcmahon0 (talk) 16:56, 29 December 2013 (UTC)

Language tool suggestions

Moved from Wikipedia:AutoWikiBrowser/Tasks

I read in the signposts about the Language tool available from http://community.languagetool.org/wikiCheck/. While it is mostly false positive suggestions, the following simplifications seem appropriate:

  • "with the exception of " => "except for "
  • "as to whether " => "whether "
  • "as to whether or not " => "whether " — Preceding unsigned comment added by Sun Creator (talkcontribs) 11:47, 29 December 2013 (UTC)
  • "all of the " => "all the "
  • "so as to" => "to "
  • "several different " => "different "
  • "many different " => "different "

As long as it's not in quotes I can't see a reason not to apply those replacements. Regards, Sun Creator(talk) 15:12, 28 December 2013 (UTC)

These would be better suggested at Wikipedia talk:AWB/T, they seem good to me though. Jamesmcmahon0 (talk) 18:25, 28 December 2013 (UTC)
I didn't consider grammar to be typos, but anyway, section moved. Regards, Sun Creator(talk) 11:41, 29 December 2013 (UTC)
  Added, plus another suggested by the tool: "continue to remain" → "remain". GoingBatty (talk) 16:54, 29 December 2013 (UTC)
LanguageTool looks like something interesting to play with, although you need to review your changes carefully before saving to ensure you're not changing text inside quotes. If you want to add it to your Tools menu, you can add this to your Custom JavaScript file:
// Add LanguageTool launcher in the toolbox on left
addOnloadHook(function () {
 addPortletLink(
  "p-tb",
  "http://community.languagetool.org/wikiCheck/index?url=" + wgPageName,
  "LanguageTool"
)});
GoingBatty (talk) 17:10, 29 December 2013 (UTC)
After further consideration, I'm going to change the rules to be "several different" → "several" and "many different" → "many". GoingBatty (talk) 17:28, 29 December 2013 (UTC)
That was NOT what was suggested above. Regards, Sun Creator(talk) 23:10, 29 December 2013 (UTC)
Sorry I didn't make it clear that I understood your request, implemented your request, and then changed it. It's all been disabled anyway, based on the conversation below. GoingBatty (talk) 02:32, 30 December 2013 (UTC)
Can I suggest removing "many different"/"several different", which aren't synonyms of "many" and "several". ("The warehouse contains many books" and "the warehouse contains many different books" have very different meanings.) I foresee a lot of bad feeling arising from the false positives this rule will generate. Mogism (talk) 17:44, 29 December 2013 (UTC)
  Done - thanks for the example! GoingBatty (talk) 18:05, 29 December 2013 (UTC)
I've disabled all of the new grammar rules per User talk:GoingBatty#Grammar checks - let's discuss further. GoingBatty (talk) 18:13, 29 December 2013 (UTC)

Per my comment on GoingBatty's talk, I would consider the other three rules ("with the exception of", "so as to", "as to whether") all to be correct usage in British English (and probably in derivative versions such as Indian, Australian etc, although I can't say for certain), and in at least a good proportion of cases the suggested alternatives appear inappropriately informal in British English use. As a test I've just dropped five random British-topic FA's (Great Fire of London, Royal Assent, Queen Victoria, Brill Tramway, William Shakespeare) into LanguageTool and in three it's found at least one of these "errors" - while FAs aren't perfect, they've all been through review processes by multiple editors who are normally very picky about spelling and grammar, none of whom have flagged "so as to" etc as an issue. If the grammar changes do go ahead, can there be a way to opt-out of them and just apply the typo-fix list rather than the full list - I'd estimate that with these rules in place my false-positive rate has gone from 5-10% to around 90%. (And all this is aside from the backlash that will ensue from people having their text flagged as a "typo".) Mogism (talk) 18:34, 29 December 2013 (UTC)

New Testamant --> New Testament

Something to add to the code? Changing "New Testamant" --> "New Testament"? I have seen this spelling a few times when reading pages, like this one. (tJosve05a (c) 20:20, 29 December 2013 (UTC)

  Not done I get no results when I search for "New Testamant" or "Old Testamant", and the two results for "Testamant" look like German titles. Also, this typo list won't fix typos within wikilinks. You can find these by adding {{R from misspelling}} to redirects such as New Testamant and then work on the items on Wikipedia:Database reports/Linked misspellings. There are no articles that currently link to New Testamant. GoingBatty (talk) 05:10, 30 December 2013 (UTC)

£45 vs 45£

Since the currency mark should be put at the beginning of the number, I suggest that this be added to the RegEx. (Change 45£ to £45). (tJosve05a (c) 12:13, 29 December 2013 (UTC)

  Added with € and $ too. GoingBatty (talk) 17:16, 29 December 2013 (UTC)
"In French-speaking Canada, the dollar symbol usually appears after the number" (Dollar sign) -- John of Reading (talk) 09:04, 30 December 2013 (UTC)
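As a rough illustration of the kind of rule being discussed, the JavaScript sketch below moves a trailing £, € or $ in front of the number. The pattern and the name `moveCurrencySign` are assumptions made for this note; the rule actually added to the typo list is not quoted in the thread and may differ.

```javascript
// Hypothetical pattern: a number (optionally with , or . separators) followed by a currency sign.
const trailingSign = /\b(\d+(?:[.,]\d+)*)([£€$])(?!\w)/g;

function moveCurrencySign(text) {
  return text.replace(trailingSign, "$2$1");
}

console.log(moveCurrencySign("a fee of 45£ was charged")); // a fee of £45 was charged
console.log(moveCurrencySign("priced at 1,200.50€"));      // priced at €1,200.50
console.log(moveCurrencySign("il coûte 45$"));             // il coûte $45 (the French-Canadian case noted above)
```

The last sample shows the concern raised by John of Reading: a pattern like this cannot tell a misplaced sign from a regional convention.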

Ursula Oppens

Please don't change Ursula Oppens to Ursula Opens. -(tJosve05a (c) 18:39, 2 January 2014 (UTC)

I've added wikilinks to the seven articles which mentioned Ursula Oppens without linking to her article. That will stop the typo-fixer damaging those articles. -- John of Reading (talk) 21:17, 2 January 2014 (UTC)

Dates

Instead of changing 13th of september 1931 to 13th of September 1931 please change it to 13 September 1931. (tJosve05a (c) 00:52, 30 December 2013 (UTC)

What exceptions would there be, such as 4th of July? Maybe Ohconfucius could help here, as the MOSNUM scripts exclude some of these. GoingBatty (talk) 02:36, 30 December 2013 (UTC)
@Josve05a: This may already be one of AWB's general fixes - see this edit. Could you please give an example where AWB doesn't make this change (and it's not text within a quotation)? GoingBatty (talk) 03:05, 30 December 2013 (UTC)
@GoingBatty: See this edit. It changed september to September, but did not remove "th of". -(tJosve05a (c) 12:57, 30 December 2013 (UTC)
@Josve05a: The general fixes says "Removes ordinals from full dates per WP:DATESNO; does not alter on the 3rd November 2008 (i.e. the plus ordinal) to avoid introducing bad grammar", which is why this didn't change. We don't want to introduce any bad grammar via typo rules either. GoingBatty (talk) 15:25, 30 December 2013 (UTC)
@GoingBatty: I can't think of an example of how applying WP:DATE standards to "on the 3rd November 2008" can cause bad grammar; I fix these all the time with AWB. "Wilbur hit the lottery on the 3rd November 2008 and quit his job" should be changed to "Wilbur hit the lottery on 3 November 2008 and quit his job", right? Where's the problem? Chris the speller yack 18:30, 30 December 2013 (UTC)
Rjwilmsi added the "bad grammar" comment 3 years ago in this edit - maybe Rjwilmsi can elaborate for us. GoingBatty (talk) 20:45, 30 December 2013 (UTC)
That correction could fail if there's another noun straight after the date. "Wilbur hit the mainmast on the 3rd November 2008 ferry and sank the boat". Since "the" goes with "ferry" here, it mustn't be removed. A contrived example, of course. -- John of Reading (talk) 21:25, 2 January 2014 (UTC)

AWB changes 5th of july (with lower case 'j') to 5th of July (with upper case 'J'). It then takes a second run to change 5th of July (with upper case 'J') to 5 July (with upper case 'J'). See this edit- (tJosve05a (c) 04:29, 4 January 2014 (UTC)

A set of rules such as the following can fix the format and the capitalization at the same time; this rule was built with the restriction that the date must be preceded by "on". This rule fixes 3 months; another rule could fix March and May, one could fix April and August, and one rule would be needed for each of February, September, October, November and December. This rule also removes a comma after the month. But it would damage the "ferry" example that John of Reading has provided. Such cases, if found, could be placed inside a "Not a typo" template, but I don't think this kind of rule is ready for general rollout. Many AWB users could manage to be careful enough, but there would be some slip-ups and subsequent complaints.
Find: \b([Oo])n (?:the )?([123]?\d)(?:st|nd|rd|th)? (?:of )?(?:[Jj])(anuary|une|uly)(?:,)?(?<!\b[Oo]n \d+ J[a-z]+) ([12]\d{3})\b
Replace: $1n $2 J$3 $4
Chris the speller yack 18:28, 4 January 2014 (UTC)
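For anyone who wants to try the rule outside AWB, here is a minimal sketch in Python. It is an approximation, not the rule itself: Python's `re` module only accepts fixed-width look-behind, so the `(?<!...)` group (which appears to skip text already in the target form) is dropped, and the name `fix_date` is made up for illustration. Without the look-behind, already-correct text is matched and rewritten to itself, which is harmless here but would count as a "fix" in a typo tool.

```python
import re

# Simplified port of the Find/Replace pair above for Python's `re` module.
# Assumption: the original look-behind is omitted because `re` requires
# fixed-width look-behind; the replacement output is otherwise the same.
DATE_RULE = re.compile(
    r"\b([Oo])n (?:the )?([123]?\d)(?:st|nd|rd|th)? (?:of )?"
    r"[Jj](anuary|une|uly),? ([12]\d{3})\b"
)


def fix_date(text: str) -> str:
    """Drop 'the', the ordinal suffix, 'of' and the comma; capitalize the month."""
    return DATE_RULE.sub(r"\1n \2 J\3 \4", text)


if __name__ == "__main__":
    # Format and capitalization are fixed in one pass.
    print(fix_date("He was born on the 5th of july, 1931 in Oslo."))
    # The "ferry" problem: "the" belongs to "ferry", yet it is removed.
    print(fix_date("Wilbur hit the mainmast on the 3rd July 2008 ferry."))
```

The second call shows the false positive John of Reading described: the sentence comes out as "on 3 July 2008 ferry", which is why the rule is not suggested for general rollout.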

togo → Togo

It's far more likely they meant to go. 101.160.15.107 (talk) —Preceding undated comment added 15:13, 4 January 2014 (UTC)

  Fixed in this edit so it won't change "togo" - thanks! GoingBatty (talk) 23:33, 4 January 2014 (UTC)

Minna no Ie

Please don't change Minna no ie to Minna no i.e.. -(tJosve05a (c) 02:14, 4 January 2014 (UTC)

@Josve05a: Could you please give an example of a page where the typo fixer wants to make this change and it is not appropriate to add a wikilink to Minna no Ie? Thanks! GoingBatty (talk) 23:41, 4 January 2014 (UTC)

Problem with "Arose" rules

Hi,

WPCleaner uses the rules to detect typos and suggest replacements. The two "Arose" rules, rewritten in December, take a very long time to apply to articles with the Java regular expression engine (about 2 minutes each for List of tone rows and series on my laptop). Could they be rewritten to move the (?<! and (?<= parts later in the regular expression? --NicoV (Talk on frwiki) 16:41, 16 January 2014 (UTC)

@NicoV:   Done. It's unusual to have two rules with the same name. OK by me if someone wants to change one of the names. Remember, "Arose" by any other name would smell as sweet. Chris the speller yack 23:22, 16 January 2014 (UTC)
Thanks a lot! It's now a lot faster. --NicoV (Talk on frwiki) 02:55, 17 January 2014 (UTC)

Inocente vs Innocente

Please don't change Javier Inocente Pérez Torres to Javier Innocente Pérez Torres. -(tJosve05a (c) 01:30, 18 January 2014 (UTC)

Assuming Javier Valcárcel is the page in question, no-one has ever made that change. Mogism (talk) 01:38, 18 January 2014 (UTC)
Mogism, I was 'about' to make that change, but didn't save. I post these kinds of false results here instead of using {{not a typo}} because I feel more secure getting a second opinion and I don't know how many pages have the same string of words. (tJosve05a (c) 02:04, 18 January 2014 (UTC)
@Josve05a: If you search for "Javier Inocente Pérez Torres" (with the quotation marks), you'll see that Javier Valcárcel is the only page with that string of words. However, there are over 400 pages with "Inocente", so maybe someone wants to update the "Inn-" rule to exclude it? GoingBatty (talk) 01:50, 19 January 2014 (UTC)

signed a contact

I've changed a few "signed a contact" to "signed a contract"; you can have the rest and any future ones for AWB. ϢereSpielChequers 07:10, 18 January 2014 (UTC)

@WereSpielChequers:   Added rule and fixed the remaining 15 articles with this typo. Thanks! GoingBatty (talk) 02:01, 19 January 2014 (UTC)
There are lots of text strings like "he signed his first professional contact with ...", and they can be fixed with:
Find: \bsigned ([a-z]+ )?([a-z]+ )?([a-z]+ )?([a-z]+ )?contact
Replace: signed $1$2$3$4contract
This allows up to 4 intervening words, and, though I have yet to see a false positive, I think it is better for a few brave (and attentive) souls to run this as their own F&R rule, but not as a Typo rule. Really brave souls may easily expand the rule to fix cases with 5 or 6 intervening words. I was inspired by the possibilities provided by the new CirrusSearch back end for the Special:search page:
en.wikipedia.org/w/index.php?search="signed+contact"~2&button=&title=Special%3ASearch&srbackend=CirrusSearch
The "~2" after the target specifies up to 2 intervening words, and this can be expanded as needed. AWB does not have this capability (yet?), so a text search for AWB has to be very wide, like the words "signed" and "contact" (without the quotation marks, just the two words). If you also put the Find rule in the Skip tab in the "Doesn't contain:" box, checking the "Regex" and "Case sensitive" boxes, you will speed up the processing by very quickly skipping pages where the two words don't appear that close to each other. Chris the speller yack 02:42, 20 January 2014 (UTC)
@Chris the speller: You might want to tweak the rule to include cases with multiple spaces between words:
Find: \bsigned\s+([a-z]+\s+)?([a-z]+\s+)?([a-z]+\s+)?([a-z]+\s+)?contact
Hope this helps! GoingBatty (talk) 04:12, 20 January 2014 (UTC)
Good tip. I always forget this, and you always remember! Chris the speller yack 04:20, 20 January 2014 (UTC)
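A quick way to check the `\s+` version of the rule before running it as a personal F&R rule is to port it to Python. This is only a sketch: the function name `fix_signed_contact` is invented, and Python writes the replacement groups as `\1` rather than `$1`.

```python
import re

# The Find pattern from above, with \s+ between words as suggested.
SIGNED_CONTACT = re.compile(
    r"\bsigned\s+([a-z]+\s+)?([a-z]+\s+)?([a-z]+\s+)?([a-z]+\s+)?contact"
)


def fix_signed_contact(text: str) -> str:
    """Change 'contact' to 'contract' when it follows 'signed' within 4 words."""
    # Optional groups that did not take part in the match expand to an
    # empty string in the replacement (Python 3.5 and later).
    return SIGNED_CONTACT.sub(r"signed \1\2\3\4contract", text)


if __name__ == "__main__":
    print(fix_signed_contact("he signed his first professional contact with the club"))
    print(fix_signed_contact("she signed  a contact"))  # double space is tolerated
    print(fix_signed_contact("he signed up and later made brief contact"))  # 5 words: untouched
```

Note that a sentence such as "signed the form and made contact" has exactly four intervening words and would be changed, which is a good reason to keep this as an attended F&R rule rather than a Typo rule.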

Responsible being changed to respondsible

Greetings, I was using AWB to fix some typos and formatting over at Wikia military and found a possible error. Just wanted to let you know. Here is an example of the one I ran into. Reguyla (talk) 18:58, 22 January 2014 (UTC)

@Reguyla: Thanks for posting here, but I don't think we can do much about this one. There's a soft hyphen hiding inside the word, so the typo fixer thought it was working on two words "respon" and "sible". -- John of Reading (talk) 19:06, 22 January 2014 (UTC)
Oh ok, thanks. I wonder if it would be better to not have a typo check for that word then. It's a pretty common word so I could see this being a common problem. I wonder how many have already been changed. Reguyla (talk) 19:14, 22 January 2014 (UTC)
@Reguyla: Could you please elaborate on how often the hidden soft hyphens are used? I changed the three instances of "respon­sible" to "responsible" on the English Wikipedia. Thanks! GoingBatty (talk) 23:10, 22 January 2014 (UTC)
I honestly don't know. I've seen it a half dozen times at Military and a couple of times in other wikis at Wikia. I'm not sure exactly how many though and I honestly don't even know how to find out. Reguyla (talk) 23:17, 25 January 2014 (UTC)
We could replace all of the regexps' \bs at the ends of words with (?:\b|\u00AD). Or the AWB parser could do it automagically (as a code change). -- JHunterJ (talk) 11:47, 26 January 2014 (UTC)
Sorry I don't know what that means. I understand basic Regex but some of the typo's code is above my level. Reguyla (talk) 00:36, 27 January 2014 (UTC)
@JHunterJ: That wouldn't be enough; it would have to be an AWB code change. Otherwise we'd have to adjust every letter of every regex, so that, for example, "Establishement" was still corrected to "Establishment" even though there were soft hyphens at arbitrary points within the word. Very messy. -- John of Reading (talk) 07:29, 27 January 2014 (UTC)
Do we know how many pages (on en-wiki) contain these soft hyphens? And how many pages legitimately need these soft hyphens (suppose a few pages on Unicode characters etc.)? We may be able to do a cleanup / make it a CHECKWIKI error. Rjwilmsi 08:13, 27 January 2014 (UTC)
About 4000 in the December database dump, based on a scan of the first 5%. -- John of Reading (talk) 08:23, 27 January 2014 (UTC)
IMO, any soft hyphen is legitimate, if correctly placed in a word. -- JHunterJ (talk) 11:01, 27 January 2014 (UTC)
@John of Reading: Well, it wouldn't be enough to make sure that we changed all of the possible misspellings, true. But we're missing possible misspellings now. It would be enough to make sure that we avoid "fixing" things that aren't the misspellings we have regexps for, such as the respon-sible example. And that's one of the precepts of AWB/T, that we don't break correct words, even if that means we aren't able to fix all incorrect words. -- JHunterJ (talk) 11:00, 27 January 2014 (UTC)
Oh, I see. If I've understood you correctly, for that you'd need something like (?!\u00AD)\b [not tested], to peek ahead at the word delimiter and make sure it wasn't a soft hyphen. "Change respon to respond unless there's a soft hyphen coming up" -- John of Reading (talk) 11:08, 27 January 2014 (UTC)
Ah, yep, my regexp was wrong way round. But yours is what I was intending, yes. -- JHunterJ (talk) 11:12, 27 January 2014 (UTC)
Or Java could fix (change) their regexp parser to recognize that a soft hyphen isn't a word boundary. Probably soft hyphen should be included in the \w word character set. :-) -- JHunterJ (talk) 11:15, 27 January 2014 (UTC)
@Rjwilmsi: A different idea - could "HideMore" be taught to hide any word containing an embedded soft hyphen? Then these words would be exempt from typo-fixing without us having to change any regexps. -- John of Reading (talk) 07:51, 28 January 2014 (UTC)
Yes, should be doable, much better than trying to change every regex. Though that's prevention rather than a cure isn't it: still seems to me that we should clarify at WP:MOS whether soft hyphen is allowed/encouraged/disallowed, as there may still be a need for some cleanup? Rjwilmsi 10:44, 28 January 2014 (UTC)
They are already covered at MOS:HYPHEN; they are allowed "in special cases". -- John of Reading (talk) 11:17, 28 January 2014 (UTC)
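The soft-hyphen behaviour discussed above can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the "respon → respond" rule is a stand-in for whichever real rule fired, the names `NAIVE`, `GUARDED` and `SOFT_WORD` are invented, and Python's `\b` is used in place of the .NET engine that AWB runs on (both treat U+00AD as a non-word character).

```python
import re

SOFT_HYPHEN = "\u00ad"

# Hypothetical rule "respon -> respond", used only to reproduce the report.
NAIVE = re.compile(r"\brespon\b")
# Same rule with the suggested guard: refuse to match when the word
# delimiter that follows is a soft hyphen.
GUARDED = re.compile(r"\brespon(?!\u00AD)\b")
# Sketch of the "hide" idea: find whole words with embedded soft hyphens
# so that a caller could exempt them from typo fixing.
SOFT_WORD = re.compile(r"\w+(?:\u00AD\w+)+")


def apply_rule(rule: re.Pattern, text: str) -> str:
    """Apply one find/replace rule to the text."""
    return rule.sub("respond", text)


if __name__ == "__main__":
    broken = "respon" + SOFT_HYPHEN + "sible"
    print(repr(apply_rule(NAIVE, broken)))    # word is damaged
    print(repr(apply_rule(GUARDED, broken)))  # word is left alone
    print(SOFT_WORD.findall("a " + broken + " person"))
```

The guard only prevents damage to correct words; as noted above, it does nothing to catch a real misspelling that happens to contain a soft hyphen.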

WT:FAC#Word usage

Just a pointer. This is a little different than what you guys do, but it overlaps. (Also, I've been an admirer of your work here for years, though I don't use AWB myself.) - Dank (push to talk) 19:36, 31 January 2014 (UTC)

Indian rubber

"india rubber" and "india-rubber" should be exceptions to the cap fix india → India. (Cap'n varies, but it's not a typo, and l.c. is found in the OED.) — kwami (talk) 20:17, 4 February 2014 (UTC)

Talk page discussion about it. Bgwhite (talk) 21:25, 4 February 2014 (UTC)
  Fixed - thanks! GoingBatty (talk) 01:37, 5 February 2014 (UTC)

Experimental typos

Thought I would post this here to hear your thoughts before going to feature requests...

Status   New
Description Add two options to AWB to enable new and experimental typos; this would involve creating a new section at AWB/T for the experimental ones. The new typos would include everything under 'New additions' and would be enabled by default. The change would be the addition of a section on that page for experimental typos and an option in AWB, disabled by default, to use them. This experimental section could include typo fixes that are a work in progress, or possibly ones that will never 'graduate' due to the high false positives. It would mean that these types of typo fixes could be more collaboratively worked on and used by anyone who knows what they're getting into. Jamesmcmahon0 (talk) 11:16, 12 February 2014 (UTC)
Added in revision

Jamesmcmahon0 (talk) 11:16, 12 February 2014 (UTC)

I'm in favor, I'd like to suggest some fairly straightforward usage regex rules, but I'd prefer that typo fixers have the option of opting in or out. - Dank (push to talk) 12:35, 12 February 2014 (UTC)

Québécois rule

I have disabled the "Québécois" rule after seeing Oreo Priest revert an edit of mine here. Dictionary.com indicates that "Québecois" and "Quebecois" are acceptable spellings. Is there any need to have a more limited version of this rule? Thanks! GoingBatty (talk) 20:22, 16 February 2014 (UTC)

High-water mark

The fact that the WP article is mishyphenated does not determine our direction. The fact that Collins, education.yahoo.com, Merriam-Webster show it hyphenated is sufficient reason for having the rule. Chris the speller yack 21:11, 16 February 2014 (UTC)

@Chris the speller: Then should High water mark be moved to High-water mark? GoingBatty (talk) 21:41, 16 February 2014 (UTC)
@GoingBatty: Yes, the article should have a hyphen in its name. The exception (and source of much confusion) is that the US Government does not use a hyphen in "ordinary high water mark", which is a term that has legal uses. (You didn't really expect them to use correct punctuation, did you?) Chris the speller yack 04:53, 17 February 2014 (UTC)

Currency

A recent rule-change (not sure what, but I've only just started noticing this so I assume in the last few days) is "correcting" the appearance of the dollar sign in the middle of a string of numbers (so 25$00 becomes $2500 and so on). However, this is the correct format for the pre-Euro Portuguese escudo (and assorted other currencies in the former Portuguese empire), and literally every incidence I've found of this is a false-positive in an article on a Portuguese (or Brazilian, Macanese etc) topic. If it's not going to break something else, could consideration be given to disabling this one? Mogism (talk) 01:49, 18 January 2014 (UTC)

@Mogism: Could you please give an example of an article where this false correction could happen? Thanks! GoingBatty (talk) 01:52, 19 January 2014 (UTC)
Empire of Brazil, Cape Verdean escudo, Portuguese real, Frechas, Campo de Benfica, Lighthouse of Ponta das Contendas, History of Póvoa de Varzim, List of fish on stamps of Cape Verde, Candelária (Ponta Delgada), José Maurício Nunes Garcia, Church of the Jesuit College, Economy of the Empire of Brazil, Port of Póvoa de Varzim, Henrique Mitchell de Paiva Couceiro, or any other article on Portugal or its colonies prior to it joining the Euro in 1999 (or on Cape Verde to the present day, as they still use the escudo) that happens to mention the price of anything, as the way to format prices in escudos and reis is escudos$centavos (i.e. 20$00, not $20.00). We're not talking a couple of obscure pages, we're talking entire categories. Mogism (talk) 16:15, 19 January 2014 (UTC)
@Mogism:   Fixed! I apologize that I didn't make it clear above that I was asking for one example so I could fix the problem, not because I doubted the problem existed. Thanks for reporting it! GoingBatty (talk) 21:32, 19 January 2014 (UTC)
Thanks for that. I gave a range of articles to show that this was standard practice, as opposed to a couple of obscure articles using an archaic formatting style, or a few edge-cases. For instance, the new "first debuted" rule is wrong in some instances ("NBC were so confident in Friends that they commissioned a second series before the first debuted"), but I wouldn't argue for removing that rule as it's clearly a correct fix in most cases. Mogism (talk) 21:33, 19 January 2014 (UTC)
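To illustrate the escudo exception, here is a sketch in Python of a "move the trailing currency symbol" rule that leaves 20$00 alone. It is not the rule that was actually added to the typo list (that text is not quoted in this thread); the pattern and the names `TRAILING_SYMBOL` and `move_symbol` are assumptions made for the example.

```python
import re

# Hypothetical rule: a number directly followed by a currency symbol.
#   (?<![\d$£€])  do not start in the middle of a number or after a symbol
#   (?!\d)        do not fire when digits follow the symbol (escudo style)
TRAILING_SYMBOL = re.compile(r"(?<![\d$£€])(\d+(?:[.,]\d+)*)([£€$])(?!\d)")


def move_symbol(text: str) -> str:
    """Move a trailing currency symbol in front of its number."""
    return TRAILING_SYMBOL.sub(r"\2\1", text)


if __name__ == "__main__":
    print(move_symbol("It cost 45£ at the time."))
    print(move_symbol("priced at 1,250.50$"))
    print(move_symbol("a 20$00 banknote"))  # escudos$centavos: untouched
```

The look-ahead is what protects the escudos$centavos format; the French-Canadian usage mentioned earlier on this page would still be changed by a rule like this.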

2014-02-21

It did not change 600.000 € in Amir Abrashi. (tJosve05a (c) 12:15, 21 February 2014 (UTC)

double 'minus'

Hi could a rule be added to fix things of the type "minus - 4°C" such as in this edit Jamesmcmahon0 (talk) 11:02, 19 February 2014 (UTC)

We could, but should it be changed to −4 °C (&minus;4&nbsp;°C) instead? (I just submitted a request to have AWB genfixes add the non-breaking space.) GoingBatty (talk) 02:51, 20 February 2014 (UTC)
@GoingBatty: Not sure what fix would be right. WP:MOSNUM mentions non-breaking spaces between the number and symbol but doesn't mention a convention for minus vs - vs − or anything about non-breaking spaces. Jamesmcmahon0 (talk) 10:18, 21 February 2014 (UTC)
@Jamesmcmahon0: - see WP:MOSNUM#Minus sign and WP:MOSNUM#Specific units. GoingBatty (talk) 14:39, 21 February 2014 (UTC)
  • But wait, is that an accidental double or an intentional one? Perhaps it should change "minus - 4°C" to "+4 °C" (or just "4 °C") instead, since a double negative makes a positive? — {{U|Technical 13}} (tec) 03:07, 20 February 2014 (UTC)
@Technical 13: In the case I saw it was definitely a mistake and not a weird use of double negatives... whilst that is a possibility I would think that it's also very unlikely that the editor meant +4 °C but typed minus -4 °C. GoingBatty (talk) 14:39, 21 February 2014 (UTC)
@Technical 13: Do you have an example where an intentional double negative would be used in Wikipedia? GoingBatty (talk) 16:15, 22 February 2014 (UTC)

Sainte-Adresse

Please don't change Sainte-Adresse to Sainte-Addresse. Sainte-Adresse is a place in France. (tJosve05a (c) 19:24, 22 February 2014 (UTC)

Same question as above: Can this be solved by wikilinking the first instance in each article? GoingBatty (talk) 20:20, 22 February 2014 (UTC)
Hm...Maybe. But in this case it has a dash in it making it harder, since sometimes it might be written as -, –, {{ndash}} or &ndash;. It could become hard to find every instance of it and wikilink it properly. (tJosve05a (c) 20:43, 22 February 2014 (UTC)
And if it is in a disambig it might say something like:
* [[NAME with existing article]], a place in Sainte-Adresse, France
According to WP:MOSDAB you should not include 2 bluelinks on one row, making wikilinking every instance of it...wrong! (tJosve05a (c) 20:47, 22 February 2014 (UTC)
If there are different ways of spelling it, then there would need to be different solutions to fix the rule. For "Sainte-Adresse" and "Sainte—Adresse", there was one I could fix with a wikilink, and 2009–10 Coupe de France 1st round could use your attention for multiple fixes. Are there lots of other pages where incorrect fixes are being made? GoingBatty (talk) 21:30, 22 February 2014 (UTC)

Question about Etc. rule

The "Etc." rule was designed to avoid /etc, the Unix file directory. What's the best way to have it also avoid etc/ (e.g. Environment Modules (software))? Thanks! GoingBatty (talk) 18:43, 23 February 2014 (UTC)
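One possible approach, sketched in Python, is to pair the existing look-behind for a leading slash with a look-ahead for a trailing one. The actual "Etc." rule is not quoted here, so the pattern below is a hypothetical stand-in that simply adds a missing period after "etc"; the names `ETC_RULE` and `fix_etc` are invented for the example.

```python
import re

# Hypothetical stand-in for the "Etc." rule: add the missing period.
#   (?<!/)    skip "/etc"  (leading slash, Unix directory)
#   (?![./])  skip "etc."  (already correct) and "etc/" (path component)
ETC_RULE = re.compile(r"(?<!/)\b([Ee])tc\b(?![./])")


def fix_etc(text: str) -> str:
    """Add a period after a bare 'etc' unless it looks like part of a path."""
    return ETC_RULE.sub(r"\1tc.", text)


if __name__ == "__main__":
    print(fix_etc("apples, pears, etc and more"))
    print(fix_etc("edit /etc/passwd"))
    print(fix_etc("stored under etc/modulefiles"))
```

Both look-arounds are a single character wide, so the same idea should carry over to the .NET and Java engines without the look-behind length problems reported elsewhere on this page.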

Intrust Super Cup

Please don't change Intrust Super Cup to Entrust Super Cup. (tJosve05a (c) 12:04, 21 February 2014 (UTC)

Is this an issue that can be solved by changing the first instance of Intrust Super Cup to [[Intrust Super Cup]] in each article? GoingBatty (talk) 14:41, 21 February 2014 (UTC)
Same with Intrust Bank Arena? GoingBatty (talk) 14:42, 21 February 2014 (UTC)
Yes it could, but what if the article has two mentions of Intrust Super Cup? Should both be wikilinked, or should a {{not a typo}} be placed? (tJosve05a (c) 14:44, 21 February 2014 (UTC)
Wikipedia:AutoWikiBrowser/Typos#AutoWikiBrowser (AWB) states "If a typo rule matches a wikilink target, this rule will be ignored on the whole page", so it's not necessary to overlink or add {{not a typo}}. Wikipedia:AutoWikiBrowser/Typos#WPCleaner states that typo rules are ignored within internal links, but doesn't specify how it handles another unlinked mention in the same article. GoingBatty (talk) 17:22, 21 February 2014 (UTC)
@Josve05a: @GoingBatty: Don't forget the option of protecting it with a "Proper name" template. Chris the speller yack 16:00, 24 February 2014 (UTC)
Well, duh! That's just a redirect to Not a typo. Chris the speller yack 16:00, 24 February 2014 (UTC)

Misspelt file names and URLs

I suggest that AWB should make sure the typo is not part of a link.

I had a couple of occasions recently where AWB did a typo fix which was perfectly valid but the correct action should have been ignoring the typo. Specifically, if the spelling mistake was within a URL or an image file name. Unfortunately, I have realised this is a bug hours after seeing the problem and the two actual examples are lost somewhere in my edits. Periglio (talk) 07:08, 16 March 2014 (UTC)

Were you using "Find & Replace" expressions that you'd set up yourself, or had you just ticked "Regex typo fixing" on the "Options" tab? Either way, the developers will need to see the diffs before they can comment with any precision. -- John of Reading (talk) 07:50, 16 March 2014 (UTC)
This was with the "Regex typo fixing" box ticked. I just happened to notice a couple during my edits which I did not save so there are no diffs. As I said, I did not think about it until long after the event, so I am unable to find the article where this took place. I have also been unable to recreate it on my user page so I apologise for assuming it was a general oversight! If I see it happen again, I will be back. Periglio (talk) 10:32, 16 March 2014 (UTC)
@Periglio: One of the items in Wikipedia:AutoWikiBrowser/Typos#To do shows one way we can update the typo rules to ignore URLs, but there are others. If you notice this behavior in the future, please post here (don't save a bad edit), so we can see how we can adjust the article and/or the typo rule. Thanks! GoingBatty (talk) 21:27, 16 March 2014 (UTC)

New rule "।"

I've just seen this new rule suggest a change at Kali, removing spaces before each "।" character - that's Unicode \u0964, not an ordinary pipe. Could Wikiuser13 (talk · contribs) or any other editor explain for me why this is a typo that needs fixing, as it's not obvious. -- John of Reading (talk) 10:33, 18 March 2014 (UTC)

Hello @John of Reading:, I know that it is a purnaviram, not a pipe. In Devanagari, there is no space before a purnaviram. If you know Hindi, you can see hi:विकिपीडिया:चौपाल#। (पूर्णविराम) का कौन-सा रूप सही है? at Hindi Wikipedia.--Wikiuser13 (talk | contribs) 10:39, 18 March 2014 (UTC)
Thank you, that's clearer. -- John of Reading (talk) 10:51, 18 March 2014 (UTC)

Bug with diacritics?

Sometimes when I am typo fixing, I will be presented with a page that has no changes automatically applied. However, I have the "skip if no typos are found" setting checked, so surely it should skip these pages automatically? This seems to happen on pages that use diacritics in the title (I haven't fully confirmed that). For an example, try Demographics of São Paulo; you should also notice that the edit summary is given as fixing Sao Paulo to São Paulo. Jamesmcmahon0 (talk) 13:42, 28 March 2014 (UTC)

A second example: Diane de France has no changes but doesn't skip, and has the edit summary "Typo fixing, typo(s) fixed: Angouleme → Angoulême". Jamesmcmahon0 (talk) 13:49, 28 March 2014 (UTC)
This has come up before; see Wikipedia talk:AutoWikiBrowser/Bugs#TypoFix undoes general fixes, causing a confusing edit summary. -- John of Reading (talk) 15:08, 28 March 2014 (UTC)
Thanks, good to know the reason behind it, has been bugging me for ages. Jamesmcmahon0 (talk) 15:22, 28 March 2014 (UTC)

"an another"

I was doing a quick typo run-through and it changed "a another" to "an another". 74.248.63.109 (talk) 17:48, 28 March 2014 (UTC)

Correctly so. Regards, Sun Creator(talk) 23:33, 28 March 2014 (UTC)
That was very unhelpful. 74.248.62.60 (talk) — Preceding undated comment added 17:37, 29 March 2014 (UTC)
I can't easily create a sentence where "a another" would be valid. As we have often pointed out in other discussions, the Typo rules are meant to correct common minor mistakes in text that somewhat resembles good English. Typo rules can't fix every possible mistake, and "an another" makes no less sense than "a another", so it's not really doing any harm. When I see a Typo rule tripped up like that, I fix the sentence and continue on my merry way. Chris the speller yack 04:34, 31 March 2014 (UTC)
I'm not sure why but the BBC use "an another" a fair few times. https://www.google.co.uk/search?q=%22an+another%22+site:bbc.co.uk Regards, Sun Creator(talk) 22:38, 31 March 2014 (UTC)

debut

@Ohconfucius: My Concise Oxford lists "debut" as a valid spelling, without the accent. -- John of Reading (talk) 07:36, 5 April 2014 (UTC)

  • Many editors gnome and make that change so I thought I would put it in. But I'm fine with it not going in if it's not considered a typo. I'm always suspicious of Oxford, they allow all sorts of funny spelling ;-) -- Ohc ¡digame! 07:40, 5 April 2014 (UTC)
@Ohconfucius: On about 20 pages, when I used it, it changed 'debut' to 'Noel Coward'. OccultZone (Talk) 07:54, 5 April 2014 (UTC)
Ah, I see what it is. My bad, and I apologise. The first appearance doesn't do anything. -- Ohc ¡digame! 07:59, 5 April 2014 (UTC)

WARNING: Incorrect pattern syntax for..

Using WPCleaner I get this message in the Java console. I don't know what any of it means, but I think it has something to do with this page...

WARNING: Incorrect pattern syntax for [\b([Dd]is|[IiUu]ndis|[Ee]x)tin?[gq]i?ui?sh?((?:ab[li]|e[drs]|ing|ment)[a-z]*)?\b(?<!tinguish[a-z]*)]: Look-behind group does not have an obvious maximum length near index 97
\b([Dd]is|[IiUu]ndis|[Ee]x)tin?[gq]i?ui?sh?((?:ab[li]|e[drs]|ing|ment)[a-z]*)?\b(?<!tinguish[a-z]*)
                                                                                                 ^
Apr 19, 2014 6:48:08 PM org.wikipediacleaner.api.data.Suggestion createSuggestion
WARNING: Incorrect pattern syntax for [\b([Pp])rei?v(?:[eious]+(?<=s[eiou]*)|iou)e?l+e?y(?<!reviously)\b]: Look-behind group does not have an obvious maximum length near index 35
\b([Pp])rei?v(?:[eious]+(?<=s[eiou]*)|iou)e?l+e?y(?<!reviously)\b
                                   ^
Apr 19, 2014 6:48:08 PM org.wikipediacleaner.api.data.Suggestion createSuggestion
WARNING: Incorrect pattern syntax for [\b([Aa])  ?([Aa](?!nd\b|AA?T?|s\b|ldo|lguien\b|pagar\b|probat\b|rtelor\b|tahualpa\b|ustriei\b|\b|ED|FN|LL|MD|NG|OA|RS|UD|WG|ZN)[A-Za-z0-9]{0,99}|[Ee](?!u|dil\b|mpezar\b|ncore\b|nse[nñ]ar|ntenderse\b|sa\b|spa[nñ]|st(a\b|é|e\b)|vrop|w[abei]|\b|GP|RN|TB|URO?)[A-Za-z0-9]{0,99}|h(?:aut[besu]|eir|our|ones|onou?r|ors\sd)[A-Za-z0-9]{0,99}|[Ii](?![0-9]|[nst]\b|[IiVvXx]\b|[Ii]|greja|nglat|nstitucí|mmagini\b|ts\b|ure\b|\b|DR|LS|NR|QD|RR|SK)[A-Za-z0-9]{0,99}|[Oo](?!ax|bra|cho|d\b|f\b|ggi|kol[íi]e?\b|[Nn][Cc][Ee]|[Nn][Ee](\b|[A-Fa-fHhJ-Qj-qS-Zs-z0-9]|r[a-np-z])|rfu\b|opa|ra?ului|ra[s?]ului|ui|MR)[A-Za-z0-9]{0,99}|u(?!b[aio]|[ef]|ga[ln]|in|itz|k|lu|n(\s|:)|na(\b|n|r)|nes|ni([^m]|mo|\b)|[rst][aeiou]|rl\b|v[aeiru]|\b)[a-z]{0,99})(?<=\b[A-Za-z]{2,99}(?<!:|\btoda|\bpara|\b[Ii]nterpreta|\b[Vv]olta|\bva|\bund|\brecibe|\bde|[Vv]eche|\bque|\b[Rr]oi|\b[Ii]l|\scom|\bllevan|\btren|\b[Vv]olver|\be[nst]|\bnous)(?:\.\s?\s[Aa] |\,?\;?\sa ) ?\2)]: Look-behind group does not have an obvious maximum length near index 927
\b([Aa])  ?([Aa](?!nd\b|AA?T?|s\b|ldo|lguien\b|pagar\b|probat\b|rtelor\b|tahualpa\b|ustriei\b|\b|ED|FN|LL|MD|NG|OA|RS|UD|WG|ZN)[A-Za-z0-9]{0,99}|[Ee](?!u|dil\b|mpezar\b|ncore\b|nse[nñ]ar|ntenderse\b|sa\b|spa[nñ]|st(a\b|é|e\b)|vrop|w[abei]|\b|GP|RN|TB|URO?)[A-Za-z0-9]{0,99}|h(?:aut[besu]|eir|our|ones|onou?r|ors\sd)[A-Za-z0-9]{0,99}|[Ii](?![0-9]|[nst]\b|[IiVvXx]\b|[Ii]|greja|nglat|nstitucí|mmagini\b|ts\b|ure\b|\b|DR|LS|NR|QD|RR|SK)[A-Za-z0-9]{0,99}|[Oo](?!ax|bra|cho|d\b|f\b|ggi|kol[íi]e?\b|[Nn][Cc][Ee]|[Nn][Ee](\b|[A-Fa-fHhJ-Qj-qS-Zs-z0-9]|r[a-np-z])|rfu\b|opa|ra?ului|ra[s?]ului|ui|MR)[A-Za-z0-9]{0,99}|u(?!b[aio]|[ef]|ga[ln]|in|itz|k|lu|n(\s|:)|na(\b|n|r)|nes|ni([^m]|mo|\b)|[rst][aeiou]|rl\b|v[aeiru]|\b)[a-z]{0,99})(?<=\b[A-Za-z]{2,99}(?<!:|\btoda|\bpara|\b[Ii]nterpreta|\b[Vv]olta|\bva|\bund|\brecibe|\bde|[Vv]eche|\bque|\b[Rr]oi|\b[Ii]l|\scom|\bllevan|\btren|\b[Vv]olver|\be[nst]|\bnous)(?:\.\s?\s[Aa] |\,?\;?\sa ) ?\2)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               ^
Apr 19, 2014 6:48:08 PM org.wikipediacleaner.api.data.Suggestion createSuggestion
WARNING: Incorrect pattern syntax for [\b(?<!{)[Ss][Qq][-.\s]+[Kk][Mm][Ss]?\b]: Illegal repetition near index 5
\b(?<!{)[Ss][Qq][-.\s]+[Kk][Mm][Ss]?\b
     ^

(t) Josve05a (c) 16:53, 19 April 2014 (UTC)

Add typo rule for "at he end"

Could a rule be created for typos of the form "at he (end|beginning|corner|side|etc...)" to be changed to "at the ..." Jamesmcmahon0 (talk) 12:34, 16 May 2014 (UTC)

There are only a handful of these. A new Typo rule is usually only considered when it would catch about two dozen or more errors. If you had just fixed 50 of these this week, that would be a different story. But the general problem you have brought to our attention is interesting. When 'he' follows any preposition, it is frequently an error of some kind. I found cases where "with he" should have been "with him", and cases where "with" was just an extraneous word. I'm having fun with this angle, but a new Typo rule is probably not the best way to deal with it. Chris the speller yack 16:33, 16 May 2014 (UTC)
This is the sort of thing that's really useful to run against a database dump. If you have the RAM/disk space/processing power to do it, get the most recent dump and scan for these typos. Then either fix them on the spot, or make a subpage of them in your own userspace to work through slowly. This ensures a human looks at each edit, and once the bulk of the fixes are done, new errors will build up slowly. Shadowjams (talk) 06:53, 17 May 2014 (UTC)
If you need a couple of database scans, just give me a yell. I've always got the current dump file. If it is a common problem, you can make a request for it to be added at WP:FIX. Bgwhite (talk) 07:20, 17 May 2014 (UTC)
@Shadowjams: IMHO, you don't need a database dump to do this. You could do this in AWB by making a list and using find/replace rules to do your testing. GoingBatty (talk) 00:29, 19 May 2014 (UTC)
Well, you gotta get the list. Shadowjams (talk) 06:58, 20 May 2014 (UTC)
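
For illustration, the scan-and-review approach suggested in this thread could be sketched in Python roughly as follows. This is a minimal sketch, not an AWB rule: the noun list is an assumption (the request names only end, beginning, corner and side), and the helper names `find_at_he` and `suggest_fix` are hypothetical.

```python
import re

# Assumed noun list; the request mentions only these four plus "etc."
NOUNS = ("end", "beginning", "corner", "side")

# Matches "at he <noun>" and keeps the case of the leading "a".
AT_HE = re.compile(r"\b([Aa])t he (" + "|".join(NOUNS) + r")\b")


def find_at_he(text):
    """Return (offset, matched phrase) pairs so a human can review each hit."""
    return [(m.start(), m.group(0)) for m in AT_HE.finditer(text)]


def suggest_fix(text):
    """Propose 'at the <noun>' for each hit; nothing is saved automatically."""
    return AT_HE.sub(r"\1t the \2", text)
```

Feeding the hits to a worklist rather than saving them directly keeps a human in the loop, as suggested above.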

Counter-attack

Copied from User talk:Mogism (with some parts of the comments removed):

I do not know what the standard advice is in American-specific dictionaries, but the international OED says:

ˈcounter-, prefix
...
In those compounds which we have taken from French or Italian, the consolidation of the word is usually greater than in those formed in English, and they are regularly written as single words, as counterbalance, counterfeit, countermand, countermarch, though sometimes with the hyphen. The stress is normally, in verbs and their derivatives, on the root, in nouns and their derivatives, on the prefix: cf. to underˈgo, ˈundertone. But there are exceptions, esp. where the noun stress is taken by a verb of the same form, as in to ˈcounterfeit. In words formed in English the two elements are in looser union, both accentually and in writing. In verbs the rhetorical or antithetical stress on the prefix may be equal to, or even for the nonce stronger than, that normally on the root, as in to plan and ˈcounter-ˈplan (ˈcounter-ˌplan), and the two parts are properly hyphened. In nouns, when the counter- word is contrasted explicitly or implicitly with the simple word (as in 2b – 2d), the predominant stress of the prefix is strongly marked, as in ˈcounter-cheer, ˈcounter-aˌnnouncement. These are properly written with the hyphen (now rarely as a single word, but occasionally in two separate words). When such a contrast is not distinctly present (as in 2e, 2h), the predominance of the prefix is less marked, and the root-element may receive an equal or greater stress; in such case there is a growing tendency to write the prefix as a separate qualifying word, and in fact to treat it as an adjective. Thus counter-side, counter-truth, become counter side, counter truth: see counter adj.
All permanent compounds in counter-, with some of the more important of the looser combinations, are given in their alphabetical order; of the casual combinations (many of them nonce-words) of obvious meaning, examples here follow.
...
[2]b.
(a) Done, directed, or acting against, in opposition to, as a rejoinder or reply to another thing of the same kind already made or in existence. (The stress is on the prefix; in long words there is a secondary stress on the accented syllable of the root-word.)
...
counter-attack n.

-- PBS (talk) 19:20, 17 May 2014 (UTC)

Wikipedia's policy on this is intentionally vague - "There is a clear trend to join both elements in all varieties of English (subsection, nonlinear), particularly in American English. British English tends to hyphenate when the letters brought into contact are the same (non-negotiable, sub-basement) or are vowels (pre-industrial), or where a word is uncommon (co-proposed, re-target) or may be misread (sub-era, not subera). American English reflects the same factors, but is more likely to close up without a hyphen. Consult a good dictionary, and see National varieties of English above.", but definitely leans towards the removal of hyphens.
This is in direct opposition to the OED as regards the "counter" prefix, but as with much of the OED, take their rules with an extreme pinch of salt - they have a famously loose relationship to standard English of the type actually written anywhere in the world, to the extent that Oxford English has its own language tag (en-GB-oed) to differentiate it from British English. British style guides don't agree with each other; the Guardian is firmly in the single non-hyphenated word camp, the Economist firmly in the "hyphenate" camp, and the Times is mute on the matter. All major US guides (as far as I can see) oppose the hyphen, including the CMOS hyphenation guide which the overwhelming majority of North American sources follow.
[It is] the source regex which drives the automated typo fixing. Mogism (talk) 20:06, 17 May 2014 (UTC)
(adding) Pinging @Colonies Chris:, who added the "counter-" rule back in 2012 and may know if it was discussed anywhere. Mogism (talk) 20:14, 17 May 2014 (UTC)
This is a tightly limited rule that removes the space or hyphen from only counter-attack, counter-part and counter-point. (I wouldn't think the last two are at all controversial.) "Counterattack" corresponds to the spelling used in the WP article of that name. I don't think there was any discussion on the subject - it was accepted as uncontroversial. But a key characteristic of these rules is that they should only be making changes that are considered uncontroversial so by all means raise the question at WT:AWB/T. Colonies Chris (talk) 22:55, 17 May 2014 (UTC)
"counter-part" and "counter-point" are both spelt "counterpart" and "counterpoint" in the OED (and are not shown as spelling mistakes with the British English spelling checker I am using). "counterattack" show as a spelling mistake and is spelt "counter-attack" in the OED (see above), therefore I recommend removing counter-attack from the list and also not automatically "fixing" counter attack unless the type of English used on a page can be ascertained automatically.-- PBS (talk) 09:53, 18 May 2014 (UTC)
I think "counterattack" and "counter-attack" should be left alone by AWB, but "counter attack" is not acceptable in American or British English. Changing the open spelling to the closed spelling produces "counterattack", which is always correct in American English and often correct in British English, and is at least more easily grasped than "counter attack". If, later on, an editor wants to insert a hyphen in an article that uses British English, fine. But it is not being kind to WP's readers to leave it open. Chris the speller yack 04:17, 19 May 2014 (UTC)
The statement by Colonies Chris about "tightly limited rule that removes the space or hyphen from only counter-attack, counter-part and counter-point" is not true of the current rule. A space will be removed from "counter attack", but a hyphen in "counter-attack" will be left alone. Chris the speller yack 04:32, 19 May 2014 (UTC)
I disagree that "counter attack" as two unhyphenated words is never acceptable. "Martello Towers were built along the British coast to counter attacks from the sea". 80.43.188.197 (talk) 11:39, 19 May 2014 (UTC)
A word boundary after "attack" deals with that. It's true that people sometimes write "to counter attack from the sea", and we don't want to change that to "counter-attack", but we do want to catch it and change it to something better (counter attacks, or better yet, defend against attacks). - Dank (push to talk) 11:59, 19 May 2014 (UTC)
@Chris the speller: it seems that the OED disagrees with you. They write about the words in their range 2b to 2d (of which counter-attack is one): "... These are properly written with the hyphen (now rarely as a single word, but occasionally in two separate words)". So why do you think that "counterattack" is more appropriate than "counter attack", and why do you claim that "'counter attack' is not acceptable in ... British English"? -- PBS (talk) 19:53, 19 May 2014 (UTC)
I get the conflict here, PBS. It probably doesn't make sense to make a lot of automated edits to a new article that strike the primary writer as "nitpicky" ... it might give them the impression that we don't value their contributions, or that we're focused on the wrong things. OTOH, most writers, most of the time, actually appreciate good advice, and "counter attack" isn't good advice ... it isn't in SOED, Oxford Dictionaries, Cambridge Dictionaries, or any style guide I've seen. Where to draw the line at WP:RETF is not my call, but I'm working on copyediting software aimed at writers who have actually asked for writing advice, and for them, I can't recommend "counter attack" in any variety of English. - Dank (push to talk) 20:38, 19 May 2014 (UTC)
And Dank's example above, "to counter attack from the sea" (not optimal but valid), where "counter" is a verb and "attack" is a noun, illustrates why we need to close them up or hyphenate when "counterattack" or "counter-attack" is either a verb or a noun. Chris the speller yack 21:22, 19 May 2014 (UTC)
I agree with Dank above - nitpicky fixes, even when correct, can make Wikipedia appear a snooty and nitpicky place to people unfamiliar with it. As I've said before here, I really wish we could find an alternative default edit summary than "typos fixed", since not all the changes made at WP:AWB/T are typos. Regulars know that these summaries are just an artefact of the software, but for a new editor, it must look like an accusation or a questioning of their competence, when someone makes an ultra-trivial change like amending the apostrophe in "Guns N' Roses" from curly to straight, but labels it as a typo. (Oh, I know it's not good practice to make a change this trivial unless you're changing something more substantial as well, but it does happen.)
As regards PBS's original point, as I said on my talk I wouldn't argue if someone removed "counter-attack" from the list. While I personally think we should standardise on the non-hyphenated form, leaving the hyphens in place won't cause any harm - it's not as if they change the meaning of sentences, or render something confusing - and if there's a reasonable argument to be made for keeping them then it's not an uncontroversial change.
I do repeat what I said on my talk about treating the OED as canonical - they make some very weird calls which are out of keeping with the way English is actually used, most famously their refusal to accept the "-ise" suffix and insistence on the serial comma, and even hyper-conservative style guides like The Times now reject them. Mogism (talk) 16:08, 20 May 2014 (UTC)
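
As an illustration of the behaviour described in this thread (the space is closed up, the hyphenated form is left alone, and a word boundary after "attack" protects "counter attacks"), here is a minimal Python sketch. The pattern is an assumption reconstructed from the comments above, not the actual rule text on the typo list, and `close_up` is a hypothetical helper name.

```python
import re

# Assumed pattern: only the open form "counter attack" is matched.
# The trailing \b keeps "counter attacks" and "counter attacked" untouched.
COUNTER_ATTACK = re.compile(r"\b([Cc])ounter attack\b")


def close_up(text):
    """Close up the open spelling; leave the hyphenated spelling alone."""
    return COUNTER_ATTACK.sub(r"\1ounterattack", text)


print(close_up("launched a counter attack at dawn"))
print(close_up("towers built to counter attacks from the sea"))
print(close_up("a fierce counter-attack"))
print(close_up("to counter attack from the sea"))
```

The last call still closes up the verb-plus-noun case raised in the thread, which is why such hits need a human to rephrase them rather than accept the automated change.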

Issue

I was just wondering: if AWB finds multiple typos in an article but one of them is actually the correct spelling (e.g. a surname), how do you tell it to ignore that one, so that only the genuine typo is changed?--Mjs1991 (talk) 08:51, 26 May 2014 (UTC)

@Mjs1991: You could make the correct spelling a wikilink (e.g. [[Liev Schreiber]]) or use the {{not a typo}} template. GoingBatty (talk) 13:29, 26 May 2014 (UTC)

50€ → €50 needs looking at

On the article Harry (TV gameshow) AWB tries to 'correct' a number of instances of 1 500€ (etc.) to 1 €500 i.e. it misses the leading 1 which is obviously important in the context of the typo. Jamesmcmahon0 (talk) 16:34, 28 March 2014 (UTC)

I don't even know what "241 00€" is supposed to represent; how can you expect a Typo rule to know what's going on? If the article contained properly formatted numbers instead of garbage, I think the Typo rules would work fine. Chris the speller yack 04:23, 29 March 2014 (UTC)
I agree that the article is poorly written. I assume the 241 00€ is supposed to be 241 000€, but obviously a rule can't be expected to fix that. I couldn't find any guidelines in MOS:CURRENCY on how to lay out large amounts, so I would think that £1 000 000 is as acceptable as £1,000,000. Could the rule be changed to look for groups of three digits separated by either a space or a comma and fix accordingly? Jamesmcmahon0 (talk) 18:38, 29 March 2014 (UTC)
I think WP:MOSNUM#Grouping of digits would apply, which indicates that commas should be used as the separators. GoingBatty (talk) 21:52, 29 March 2014 (UTC)
Yep, I would agree with that, thanks. Since that has come up, is there any way to correct 123 456 etc. to 123,456 without hideous amounts of false positives? Jamesmcmahon0 (talk) 22:53, 30 March 2014 (UTC)
I think there would be lots (by our standards) of false positives, such as "the plane dropped 3 500 pound bombs on the target", which is poorly formatted and unhyphenated, but understandable by a human reader. The correction you suggest would change its meaning. I'm not in favor of risking that kind of damage. Chris the speller yack 04:23, 31 March 2014 (UTC)

This is a common problem with text translated from other languages, particularly if it is done by a person translating from their mother tongue into English. Other languages use other delimiters in large amounts (for example, German uses points where English uses commas and commas where English uses points), so a German "100.000,99" is "100 000,99" in French and "100,000.99" in English. For those who are unsure (or French ;-) ), leaving out the delimiters or not converting them is the simplest solution when translating numbers. Hence incorrectly formatted numbers will be more of a problem with certain currency symbols than with others, and the Euro is going to be one of them because some (most?) continental European languages place the currency symbol after the number. Also, according to this paper, different dialects of English do or do not use spaces between the currency symbol and the numeric amount. -- PBS (talk) 10:26, 18 May 2014 (UTC)

  • I've been running into the same problem, and my suggestion would be to use the trailing currency symbol as the litmus test for reformatting a number. So "5 500$" -> "$5,500", but "5 500" goes unchanged. I've also run into other separators, like a single quote, so maybe it just needs to match a space or any punctuation. I think the way to deal with a "241 00€" is to go with "€241.00", i.e. treating a trailing group of other than 3 digits as decimals. I think this could be the most robust approach, and it requires significantly less editor intervention than punting on any non-comma separator. VanIsaacWScont 08:40, 13 June 2014 (UTC)
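
To make the trailing-symbol suggestion concrete, here is a minimal Python sketch. It assumes a small separator set (space, point, comma, single quote), three currency symbols, and a final group of two or three digits; none of this is the existing AWB rule, and the names `SEP`, `TRAILING_SYMBOL` and `normalise_trailing_currency` are hypothetical.

```python
import re

SEP = r"[ .,']"  # assumed separators: space, point, comma, single quote

# A number in separated groups, immediately followed by a currency symbol.
# Without the trailing symbol nothing matches, so "3 500 pound bombs" is safe.
TRAILING_SYMBOL = re.compile(
    r"(?<![\d.,'])(\d{1,3}(?:" + SEP + r"\d{3})*" + SEP + r"\d{2,3})\s?([$€£])"
)


def _rewrite(match):
    groups = re.split(SEP, match.group(1))
    symbol = match.group(2)
    decimals = ""
    # A final group that is not 3 digits long is treated as decimals.
    if len(groups[-1]) != 3:
        decimals = "." + groups.pop()
    return symbol + ",".join(groups) + decimals


def normalise_trailing_currency(text):
    """Move a trailing currency symbol to the front and regroup with commas."""
    return TRAILING_SYMBOL.sub(_rewrite, text)
```

Under these assumptions "5 500$" becomes "$5,500" and "241 00€" becomes "€241.00", while a bare "5 500" is left unchanged. Whether "241 00€" really means 241.00 rather than a mistyped 241 000 still needs a human to judge.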

Dependant vs. dependent

Wondering if we could craft a rule to change "dependant" to "dependent" when necessary. My understanding is:

  • In British English:
    • "Dependent" means reliant on.
    • A "dependant" is a person (usually a child or a spouse).
  • In American English, you can use "dependent" for both. (copied from grammar-monster.com)

I'm thinking we could try either:

  1. change "dependant on" (but not "a dependant on") to "dependent on", OR
  2. change "is/are now/highly/very/mostly dependant on" to "is/are now/highly/very/mostly dependent on" (more adverbs as needed)

Thoughts? GoingBatty (talk) 20:04, 30 May 2014 (UTC)

1: seems to fail in BritEng for say "any dependant on some form of assistance"
2: I'm not seeing the difference in the before and after, GB. - Dank (push to talk) 10:59, 16 June 2014 (UTC)
@Dank: Oops! Fixed my typo in option 2. Thanks! GoingBatty (talk) 11:03, 16 June 2014 (UTC)
Thanks. I think #2 now fixes the false positive problem. - Dank (push to talk) 11:31, 16 June 2014 (UTC)
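
Option 2 could be sketched in Python as below. This is a minimal sketch that reads the proposal literally: the adverb is required, and the adverb list is just the four named above ("more adverbs as needed"). The name `fix_dependant` is hypothetical and this is not a rule from the typo list.

```python
import re

# Adverbs named in the proposal; extend as needed.
ADVERBS = ("now", "highly", "very", "mostly")

# Requires "is"/"are" plus one of the adverbs directly before "dependant on".
DEPENDANT_ON = re.compile(
    r"\b((?:is|are) (?:" + "|".join(ADVERBS) + r") )dependant on\b"
)


def fix_dependant(text):
    """Change 'dependant on' to 'dependent on' only in the adjectival frame."""
    return DEPENDANT_ON.sub(r"\1dependent on", text)
```

Because the verb and adverb must both be present, the British English noun in "any dependant on some form of assistance" is not touched.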

Casted -> Cast

Cast is an irregular verb, i.e. the past tense is also cast, not casted.

I would add it to the list myself, but I don't seem to be able to edit the page.

<Typo word="Cast" find="\b([Cc])asted\b" replace="$1ast"/>

-- Niightblade (talk) 08:26, 16 June 2014 (UTC)

Works for me, regex looks good. - Dank (push to talk) 11:00, 16 June 2014 (UTC)
But Collins English Dictionary says "casted" is an adjective that means "belonging to a caste". I also found "protect the patient's casted foot". Maybe check for "casted" that is preceded by he, she, they, was, is, be, being, etc.? Chris the speller yack 14:01, 16 June 2014 (UTC)
Thanks for catching that. I don't have a preference. - Dank (push to talk) 16:27, 16 June 2014 (UTC)
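
To compare the proposed rule with the narrower check suggested above, here is a minimal Python sketch. `ORIGINAL` mirrors the proposed find/replace, with `$1` written as `\1` for Python. `RESTRICTED` is an assumption that uses only the preceding words listed in the reply (he, she, they, was, is, be, being), and both function names are hypothetical.

```python
import re

# The proposed rule, translated from the <Typo> line above.
ORIGINAL = re.compile(r"\b([Cc])asted\b")

# Assumed narrower variant: only fire after a listed pronoun or auxiliary.
RESTRICTED = re.compile(
    r"\b((?:[Hh]e|[Ss]he|[Tt]hey|was|is|be|being) )casted\b"
)


def fix_original(text):
    return ORIGINAL.sub(r"\1ast", text)


def fix_restricted(text):
    return RESTRICTED.sub(r"\1cast", text)
```

The original pattern turns "the patient's casted foot" into "the patient's cast foot", whereas the restricted one leaves the adjective alone and still fixes "She casted a spell".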

Langage / Language

Do not replace langage (French) with language when it is immediately preceded by (le|les|du|des). Example page: Maurice Merleau-Ponty. 138.251.14.34 (talk) 12:46, 19 June 2014 (UTC)

The right way to protect foreign-language text from English spell checkers is to use the "lang" template. I have employed it in your example page. Chris the speller yack 13:42, 19 June 2014 (UTC)

Enmao -> Emmao

In this edit, Enmao, as in Wang Enmao, was changed to Emmao. I think the <Typo word="Emm-" line is the culprit. Could this be fixed? Bgwhite (talk) 04:49, 20 June 2014 (UTC)

  Done -- John of Reading (talk) 05:37, 20 June 2014 (UTC)

Including

I'm seeing AWB trying to 'correct' including to includeing, giving the rule includ --> include as the reason. Thank you, BethNaught (talk) 13:33, 24 June 2014 (UTC)

@BethNaught: On what page? I suspect there may be an invisible control character between "includ" and "ing", causing the software to see it as two separate words. -- John of Reading (talk) 15:24, 24 June 2014 (UTC)
Sorry, I forgot to mention it earlier. I believe, from digging in my contribs, that the page is IEC EN 61000-3-2. BethNaught (talk) 15:35, 24 June 2014 (UTC)
I retyped the words, and that has fixed it. -- John of Reading (talk) 16:59, 24 June 2014 (UTC)

european -> European in Infoboxes

In National Front (France), AWB changed the infobox parameter in {{Infobox political party}} from european to European. This caused the parameter not to show up in the article. Bgwhite (talk) 23:22, 24 June 2014 (UTC)

@Bgwhite: I can't reproduce the issue. Since AWB won't make typo changes inside templates, and your edit to the article was to remove some extra braces, my guess is that AWB was confused about where the infobox ended, and didn't realize that the |european= parameter was part of the template. Thanks for fixing the article and not saving a bad edit! GoingBatty (talk) 02:18, 25 June 2014 (UTC)
GoingBatty, ahhh, thank you for the explanation. Bgwhite (talk) 03:58, 25 June 2014 (UTC)

full time → full-time in soccer articles

I've been running across scattered instances of "full time" → "full-time" in soccer articles fairly regularly, where the phrase refers to the end of the game, e.g. "the score was tied 1-1 at full time". From my memory of the contexts, I think this can be resolved by simply retaining "full time" when it is preceded by "at". If no one has a counterexample, could someone better with regexp program this in? VanIsaacWScont 05:57, 3 July 2014 (UTC)

@Vanisaac:   Done - thanks! GoingBatty (talk) 21:51, 4 July 2014 (UTC)
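
The "at" exclusion could look something like the following Python sketch. It shows only the negative lookbehind idea; the real rule's other conditions are not reproduced here, and the pattern and the name `hyphenate_full_time` are assumptions.

```python
import re

# Skip "full time" when the preceding word is "at" (or "At").
# The \b inside the lookbehind stops "that full time" from being skipped.
FULL_TIME = re.compile(r"(?<!\b[Aa]t )\bfull time\b")


def hyphenate_full_time(text):
    return FULL_TIME.sub("full-time", text)
```

With this pattern "a full time job" becomes "a full-time job", while "tied 1-1 at full time" is left as written.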

i.e. rule doesn't skip URLs

Although the "i.e." rule was designed to skip URLs, AWB wants to change www.currahaparish.ie. to www.currahaparish.i.e. in Curraha. I tried fixing the rule, but my fix didn't work. Could someone else please take a look at this? Thanks! GoingBatty (talk) 14:42, 5 July 2014 (UTC)
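
One possible way to keep such a rule away from domain names, sketched in Python: refuse to match when "ie." is directly preceded by a dot or a word character. This is an assumption for illustration only, not the text of the actual "i.e." rule or of the attempted fix, and `fix_ie` is a hypothetical name.

```python
import re

# Match "ie." only when it is not glued to a preceding dot or word character,
# so the ".ie." at the end of a domain name is skipped.
IE_RULE = re.compile(r"(?<![\w.])([Ii])e\.")


def fix_ie(text):
    return IE_RULE.sub(r"\1.e.", text)
```

Under this assumption "the capital, ie. Dublin" becomes "the capital, i.e. Dublin", while "www.currahaparish.ie." is left alone.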

Name caught in adverb spelling web

The last name "Dealy" keeps getting caught in the adverb suffix ly -> lly rule. It wouldn't get an erroneous skip of "idealy" if you match to the capitalized spelling only. VanIsaacWScont 08:17, 6 July 2014 (UTC)

@Vanisaac:   Done by fixing "-ally (2)". Thanks for the report! GoingBatty (talk) 13:47, 6 July 2014 (UTC)

goverment

Is it possible to add "goverment"? Poveglia (talk) 01:00, 12 July 2014 (UTC)

@Poveglia: The rule already exists - see this edit I just did. Do you see any articles where AWB isn't fixing the typo (when the typo isn't in a protected area, such as a URL or template?) Thanks! GoingBatty (talk) 03:29, 12 July 2014 (UTC)
I feel stupid now. I think I must've made a typo myself after pressing Ctrl-F. I fixed the "goverment" typos manually. Thanks, Poveglia (talk) 05:36, 13 July 2014 (UTC)

new jersey

I'm not sure how often this would happen, but AWB/T capitalised "new jersey", thinking it was a place name, when in fact the context was "a new jersey" of a hockey team. [31] --Breno talk 11:32, 6 August 2014 (UTC)

@Breno: I added the invisible template {{not a typo}} so it won't happen again. GoingBatty (talk) 01:13, 9 August 2014 (UTC)

Lousiville → Louisville

I recently had to correct around 40 of these. While these may be honest typos, I have seen the misspelling used in a pejorative sense, so it is susceptible to use by vandals. Not a biggie, but it would be nice if I didn't have to have it set up as a find/replace when I do cleanups. Stevie is the man! TalkWork 11:13, 20 August 2014 (UTC)

@Stevietheman:   Added - thanks! GoingBatty (talk) 02:37, 23 August 2014 (UTC)
Thank you! Stevie is the man! TalkWork 13:16, 23 August 2014 (UTC)

"guerrilla" rule

@Colonies Chris: We've been down this road before. Sorry, but several major dictionaries allow one 'r', so I have removed the rule. I don't like the spelling with one 'r' any more than you do. Chris the speller yack 18:45, 16 September 2014 (UTC)

Armenian Wikipedia and AWB

Hello, please help us with one issue: with AWB on the Armenian Wikipedia, we cannot use the Wikisearchtext and Wikisearchtitle commands. Thanks in advance. --ERJANIK (talk) 13:54, 6 September 2014 (UTC)

Wikisearchtext is now working, but Wikisearchtitle is not. --ERJANIK (talk) 06:41, 16 September 2014 (UTC)
Right now, Wikisearchtext and Wikisearchtitle are both not working again. --ERJANIK (talk) 07:16, 17 September 2014 (UTC)