User:TomTheHand/AWB regular expressions

In mid-June 2006, Bobblewik posted instructions at WP:SHIPS on using his unit and date formatter tools. These tools add easy-to-use tabs to the top of your browser window that use regular expressions (regexes) to apply consistent unit formatting and remove unnecessary date links. I used them and found them very handy in my cleanup of ship history articles.

Recently, I began using AWB, which supports regular expressions. There are small differences between the JavaScript regexes used in Bobblewik's tool and the .NET regexes used by AWB, but I read a few tutorials and converted Bobblewik's regexes. Listed below are the regexes that I use in AWB. Bobblewik is the original author of the vast majority of them, though I've tweaked some and rearranged their order a little. I've also written a couple myself.

I'm writing this up to help AWB users and to get advice on improving the regexes I use. My regexes focus primarily on formatting units and adding nbsp's where appropriate. I don't have all of Bobblewik's unit formatter functionality, and I don't have ANY of his date formatter functionality, but I'm adding to it little by little. As I'm very much a beginner with regexes, in some cases I use a simplified version of one of Bobblewik's regexes, so my code may have bugs that his does not. I'll fix these as I find them. Still, I think what I have now will be useful to AWB users.

A few important notes:

  • Please post questions and comments on the talk page! I'm interested in hearing what people have to say.
  • Copy and paste directly from this page, rather than copying from the page source. I had to use a Wiki trick to get &nbsp; to show up as text instead of a space, and if you copy from the page source you'll copy my trick instead of a regular &nbsp;.
  • I run in case-sensitive mode, so some of my regexes look weird because I want specific parts of them to be case insensitive.
  • DOUBLE CHECK the results of these regexes before saving the page! They are far from perfect.
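
The case-sensitivity note above can be illustrated with a minimal sketch. Python's `re` module is used here purely as a stand-in (an assumption for demonstration, since AWB has its own regex engine), and the function name is hypothetical. Like AWB's case-sensitive mode, Python matches case sensitively by default, so a bracket expression such as `[Cc]` is what makes just one letter accept either case.

```python
import re

def fix_celsius_spelling(text: str) -> str:
    # Matching is case sensitive by default, so [Cc] lets only the
    # first letter be upper or lower case; the rest must be lower case.
    return re.sub(r"[Cc]elcius", "Celsius", text)
```

With this rule, "celcius" and "Celcius" are both corrected, while an all-caps "CELCIUS" is left alone.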

Also, I'd like to say thanks very much to Bobblewik for writing most of these regexes.

Unit formatting

Find    Replace with

Fix common spelling errors
  celsius    Celsius
  [Cc]elcius    Celsius

Fix common naming error (be careful with this one)
  [Cc]entigrade    Celsius

Convert degree symbols into ° symbol, ensure preceding space
  &deg;    °
  º    °
  °\s?([CF])    °$1
  °\s?(Celsius)    °C
  (\d)\s?(°[CF])    $1&nbsp;$2

Convert &sup2; and &sup3; entities into superscript ² and ³ symbols
  &sup2;    ²
  &sup3;    ³

Convert the word ohm(s) or the HTML entity into the actual Ω symbol (Omega, not the actual ohm symbol Ω) and make sure it is spaced
  (\d)\s?(Y|Z|E|P|T|G|M|k|K|h|da|d|c|m|µ|μ|µ|n|p|f|a|z|y)?\s?(&Omega;|ohm|Ohm)s?([\s,.;:\)\(\\/)])    $1&nbsp;$2Ω$4

Convert various micro symbols into the actual micro symbol, make sure it's spaced
  (\d)\s?(μ|μ|µ)(g|s|m|A|K|mol|cd|rad|sr|Hz|N|J|W|Pa|lm|lx|C|V|Ω|F|Wb|T|H|S|Bq|Gy|Sv|kat|M)([\s,.;:\)\(\\/)])    $1&nbsp;µ$3$4

Convert capital K to lowercase k in units
  (\d)[\s\-]?K(g|s|m|A|K|mol|cd|rad|sr|Hz|N|J|W|Pa|lm|lx|C|V|Ω|F|Wb|T|H|S|Bq|Gy|Sv|kat|M)([\s,.;:\)\(\\/)])    $1&nbsp;k$2$3
  (\d)&nbsp;K(g|s|m|A|K|mol|cd|rad|sr|Hz|N|J|W|Pa|lm|lx|C|V|Ω|F|Wb|T|H|S|Bq|Gy|Sv|kat|M)([\s,.;:\)\(\\/)])    $1&nbsp;k$2$3

Hertz
  (\d)[\s\-]?(Y|Z|E|P|T|G|M|k|K|h|da|d|c|m|µ|μ|µ|n|p|f|a|z|y)?hz    $1&nbsp;$2Hz

Fix kilometers
  (\d)[\s-]?kms?([\s,.;:\)\(\\/)])    $1&nbsp;km$2
  (\d)&nbsp;kms?([\s,.;:\)\(\\/)])    $1&nbsp;km$2

Fix sq. km and sq. m
  (\d)\s?sq\.?\s?kms?    $1&nbsp;km²
  sq\.?\s?kms?    km²
  (\d)\s?sq\.?\s?m([^i])    $1&nbsp;m²$2
  m²\.\)    m²)

Standardize km/h
  km\/hr    km/h
  kph    km/h

Standardize 'per second'
  (\d)\s?ft\/sec(ond)?    $1&nbsp;ft/s
  (\d)\s?m\/sec(ond)?    $1&nbsp;m/s
  (\d)\s?km\/sec(ond)?    $1&nbsp;km/s

Standardize horsepower
  (\d)\s?hp([^A-Za-z0-9])    $1&nbsp;hp$2
  (\d)\s?HP([^A-Za-z0-9])    $1&nbsp;hp$2
  (\d)\s?shp([^A-Za-z0-9])    $1&nbsp;shp$2
  (\d)\s?SHP([^A-Za-z0-9])    $1&nbsp;shp$2
  (\d)\s?bhp([^A-Za-z0-9])    $1&nbsp;bhp$2
  (\d)\s?BHP([^A-Za-z0-9])    $1&nbsp;bhp$2
  (\d)\s?ihp([^A-Za-z0-9])    $1&nbsp;ihp$2
  (\d)\s?IHP([^A-Za-z0-9])    $1&nbsp;ihp$2

Miles per hour
  m\.p\.h\.    mph
  (\d)[\s-]?mph    $1&nbsp;mph

Standardize symbol for pounds
  (\d\+?)[\s-]?lbs?    $1&nbsp;lb

Standardize symbol for newton metres
  N•m    N·m

Standardize symbol for foot pounds
  ft[\s\-.·•\/]lbf?s?    ft·lbf

  (\d)\s?feet    $1&nbsp;feet
  (\d)\s?ft    $1&nbsp;ft
  (\d)\s?foot    $1&nbsp;foot
  (\d)\s?miles    $1&nbsp;miles
  (\d)\s?mi    $1&nbsp;mi
  (\d)\s?(\[\[)?nautical\smile    $1&nbsp;$2nautical mile
  (\d)\s?nm    $1&nbsp;nm
  (\d)\s?nmi    $1&nbsp;nmi
  (\d)\s?mph    $1&nbsp;mph
  (\d)\s?kt    $1&nbsp;kt
  (\d)\s?(\[\[)?knot    $1&nbsp;$2knot
  (\d)\s?lb    $1&nbsp;lb
  (\d)\s?lbs    $1&nbsp;lb
  (\d)\s?tons    $1&nbsp;tons
  (\d)\s?psi    $1&nbsp;psi
  ([^a-zA-Z])(\d)[\s-]?(G|M|k|K|h|da|d|c|m|µ|n)?(g|m|Hz|N|W|Pa|V|Ω|F)(\.|;|\s|,|\)|/)    $1$2&nbsp;$3$4$5
  ([^0-9;])([0-9.]+)[\s-]?(in[\s.;ch/,])    $1$2&nbsp;$3
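
To see how a few of these rules behave when run one after another, here is a minimal sketch in Python. Several things are assumptions for illustration only: Python's `re` module stands in for AWB's engine, backreferences are written `\1` where AWB uses `$1`, the replacements insert the literal text `&nbsp;`, and the names `RULES` and `apply_rules` are hypothetical. Only a handful of the rules are included.

```python
import re

NBSP = "&nbsp;"  # assumed: the replacement inserts the entity as literal text

# (find, replace) pairs, applied in order, taken from a few of the unit rules.
RULES = [
    (r"°\s?([CF])", r"°\1"),                       # "° C" -> "°C"
    (r"(\d)\s?(°[CF])", r"\1" + NBSP + r"\2"),     # "100 °C" -> "100&nbsp;°C"
    (r"(\d)\s?sq\.?\s?kms?", r"\1" + NBSP + "km²"),  # "12 sq. km" -> "12&nbsp;km²"
    (r"km\/hr", "km/h"),                           # "km/hr" -> "km/h"
    (r"(\d)\s?hp([^A-Za-z0-9])", r"\1" + NBSP + r"hp\2"),  # "5000hp." -> "5000&nbsp;hp."
]

def apply_rules(text: str, rules=RULES) -> str:
    # Each rule sees the output of the one before it, so order matters.
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text
```

The order of the first two rules matters: "100 ° C" is first tightened to "100 °C", and only then does the second rule find a digit followed by "°C" and insert the non-breaking space.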

Typos

These are basically just simple find-and-replaces for some common mistakes I find. Feel free to use them.

Find    Replace with
([Mm])achinegun    $1achine gun
([Cc])omission    $1ommission
ammount    amount
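
The first two typo rules capture the leading letter so that its case survives the replacement. A minimal Python sketch of that idea follows; Python's `re` and its `\1` backreference syntax (in place of `$1`) are assumptions for illustration, and `TYPO_RULES` and `fix_typos` are hypothetical names.

```python
import re

TYPO_RULES = [
    (r"([Mm])achinegun", r"\1achine gun"),  # group 1 keeps "M" or "m"
    (r"([Cc])omission", r"\1ommission"),    # group 1 keeps "C" or "c"
    (r"ammount", "amount"),                 # plain find-and-replace
]

def fix_typos(text: str) -> str:
    # Apply every typo rule in turn to the running text.
    for pattern, replacement in TYPO_RULES:
        text = re.sub(pattern, replacement, text)
    return text
```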

One of my interests is naval history. I use the following expressions for formatting naval history articles. They won't be directly useful on other types of articles, but you can use them if you share a similar interest or modify them to serve other purposes.

Find    Replace with
TF\s(\d)    TF&nbsp;$1
TG\s(\d)    TG&nbsp;$1
DesDiv\s(\d)    DesDiv&nbsp;$1

The regexes above are very simple: they add an nbsp between TF, TG, or DesDiv and that force/group/division's number. For example, TF 58, the famous late-WWII Fast Carrier Task Force, becomes TF&nbsp;58.
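
A minimal Python sketch of these three rules, assuming Python's `re` as a stand-in for AWB and `\1` in place of `$1`; the names `PREFIXES` and `add_nbsp` are hypothetical, and the replacement writes the entity as the literal text `&nbsp;`.

```python
import re

PREFIXES = ("TF", "TG", "DesDiv")

def add_nbsp(text: str) -> str:
    # One substitution per prefix: exactly one whitespace character
    # followed by a digit is required, mirroring prefix\s(\d).
    for prefix in PREFIXES:
        text = re.sub(prefix + r"\s(\d)", prefix + r"&nbsp;\1", text)
    return text
```

Because the pattern requires a whitespace character, text such as "TF58" with no space is left unchanged.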

Find    Replace with
class\sdestroyers[|]([A-Za-z .]+)(\d+)\]\]    class destroyers|$1(DD-$2)]]

Above is a specialized regex used to format the category for destroyers. Many of the ships had their category entries like this:

  • [[Category:(name) class destroyers|{ship name} {hull number}]]

I prefer this:

  • [[Category:(name) class destroyers|{ship name} (DD-{hull number})]]

This regex simply looks for ships formatted the first way, and reformats them in the second fashion. You may be able to adapt this to your purposes, but it won't be of any use as-is. I intend to format all destroyer articles this way and then there won't be any more reason to run it!
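
The category rewrite can be tried out with a minimal Python sketch. Assumptions: Python's `re` stands in for AWB, `\1` and `\2` replace `$1` and `$2`, the name class is written as `[A-Za-z .]` (letters, spaces, and periods only), and `CATEGORY_RULE`, `reformat_category`, and the sample category line are hypothetical illustrations.

```python
import re

# Group 1: ship name (letters, spaces, periods); group 2: hull number.
CATEGORY_RULE = re.compile(r"class\sdestroyers[|]([A-Za-z .]+)(\d+)\]\]")

def reformat_category(text: str) -> str:
    # Wrap the hull number as (DD-number) and keep the rest of the link.
    return CATEGORY_RULE.sub(r"class destroyers|\1(DD-\2)]]", text)
```

An entry that already uses the (DD-number) form no longer matches, because "(" is not allowed in the ship-name group, so running the rule twice does no harm.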