Wikipedia:Bots/Requests for approval/DumZiBoT 2

The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was

Approved.

DumZiBoT 2

Operator: NicDumZ ~

Automatic or Manually Assisted: Automatic, supervised

Programming Language(s): Python

Function Summary: You do know what DumZiBoT was doing, right? :p This extends its behavior to deal with duplicate references.

Edit period(s) (e.g. Continuous, daily, one time run): Same as before, on each XML dump

Already has a bot flag (Y/N): Y

Function Details:

This is pywikipedia reflinks.py, which I maintain and extend. As always, the latest stable version is pushed to the repository.
What has already been running without a formal new BRFA is a small improvement to merge identical references. This means that
<ref>A</ref>

<ref name="duplicate">A</ref>

<ref>B</ref>

<ref name=booh>A</ref>
has been merged over the last days to

<ref name="duplicate">A</ref>

<ref name="duplicate" />

<ref>B</ref>

<ref name="duplicate" />
This is not complex functionality; it was quite easily written and tested. When no ref name is available in the page, "autogenerated#" is used.
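As a rough illustration of the merge described above, here is a minimal sketch in Python. It is not the actual reflinks.py code: `REF_RE` and `merge_identical_refs` are hypothetical names, and the sketch assumes that refs carry at most a `name` attribute and that no name is reused with different contents. It also does not update self-closing uses of a name that gets replaced (e.g. `<ref name=booh />`), which a real implementation would need to handle.

```python
import re

# Matches <ref>...</ref> and <ref name=...>...</ref>.
# Self-closing refs such as <ref name="x" /> are deliberately not matched.
REF_RE = re.compile(
    r'<ref(?:\s+name\s*=\s*(?P<quote>["\']?)(?P<name>[^"\'<>/]+?)(?P=quote))?\s*>'
    r'(?P<content>.*?)</ref>',
    re.IGNORECASE | re.DOTALL,
)


def merge_identical_refs(text):
    """Collapse refs with identical content into one named ref plus short refs."""
    matches = list(REF_RE.finditer(text))

    # Group the refs by their exact content.
    groups = {}
    for m in matches:
        groups.setdefault(m.group('content'), []).append(m)

    used_names = {m.group('name') for m in matches if m.group('name')}
    chosen = {}  # content -> name shared by all refs with that content
    counter = 1
    for content, group in groups.items():
        if len(group) < 2:
            continue  # nothing to merge
        # Prefer the first name already present in the page.
        name = next((m.group('name') for m in group if m.group('name')), None)
        if name is None:
            while 'autogenerated%d' % counter in used_names:
                counter += 1
            name = 'autogenerated%d' % counter
            used_names.add(name)
        chosen[content] = name

    seen = set()

    def replace(m):
        content = m.group('content')
        name = chosen.get(content)
        if name is None:
            return m.group(0)  # unique ref, leave untouched
        if content in seen:
            return '<ref name="%s" />' % name
        seen.add(content)
        return '<ref name="%s">%s</ref>' % (name, content)

    return REF_RE.sub(replace, text)


sample = 'x<ref>A</ref> y<ref name="duplicate">A</ref> z<ref>B</ref> w<ref name=booh>A</ref>'
merged = merge_identical_refs(sample)
```

Run on the four-ref example above, the sketch keeps the first occurrence in full under the existing name "duplicate" and turns the later identical refs into short `<ref name="duplicate" />` tags.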
The only case where this heuristic fails, however (and you can find it by reading my talk page), is when there are duplicate reference names, i.e. the same name used with different contents, e.g. <ref name=dup>A</ref>...<ref name=dup>B</ref>. For some reason the Cite extension does not complain about it, and in long articles it happens often, causing confusion, because you have to know which reference content is associated with the name.
The idea here is to detect those duplicate reference names while crawling the page for identical references / bare references that need a title, but to leave them alone and to add a message on the attached talk page. At first I tried automatically resolving duplicate names, but that's actually a hard job. (Consider <ref name=dup>A</ref>...<ref name=dup /> ... <ref name=dup>B</ref> ... <ref name=dup />: was it caused by a copy-paste, or is it intended? Is the fact that the first <ref name=dup /> uses content "A" intended, or is it a side effect of an unwanted operation?) I say, let the users deal with that, it's too damn complicated for a little bot's head :)
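The detection step can be sketched roughly as follows. This is only an illustration under simple assumptions, not the bot's code: `NAMED_REF_RE` and `find_conflicting_names` are hypothetical names, and only refs whose sole attribute is `name` are considered. The function reports names that are defined more than once with different contents, so that they can be left alone and flagged on the talk page.

```python
import re
from collections import defaultdict

# Matches named refs that carry content; short refs like <ref name=dup /> are skipped.
NAMED_REF_RE = re.compile(
    r'<ref\s+name\s*=\s*(?P<quote>["\']?)(?P<name>[^"\'<>/]+?)(?P=quote)\s*>'
    r'(?P<content>.*?)</ref>',
    re.IGNORECASE | re.DOTALL,
)


def find_conflicting_names(text):
    """Return the sorted ref names that are defined with more than one content."""
    contents_by_name = defaultdict(set)
    for m in NAMED_REF_RE.finditer(text):
        contents_by_name[m.group('name')].add(m.group('content').strip())
    return sorted(name for name, contents in contents_by_name.items()
                  if len(contents) > 1)


page = ('<ref name=dup>A</ref> x <ref name=dup /> y <ref name=dup>B</ref> '
        '<ref name="ok">C</ref> <ref name="ok">C</ref>')
conflicts = find_conflicting_names(page)
```

Here "dup" is reported because it is defined with both "A" and "B", while "ok" is not, since both of its definitions are identical.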
Example, on a test copy of Casualties of the Iraq War, which has been pointed out to me as a very reference-buggy article:
- Page edit (not supposed to be anything else than the previous task; we just have to check that no duplicate reference is ever being touched, to avoid conflicts)
- associated talk page message (Text is a draft, looking forward to any comments)

If you look at the older edits of DumZiBoT, you'll probably catch one of the minor bugs that have been reported on my talk page and fixed since (and pushed to SVN). Do not rely on older edits; ask me to edit again if needed ;)

Discussion

Let's see what could go wrong ^[1] ^[2] ^[2] ^[3] ^[4] ^[4] ^[5] ^[6] ^[6] ^[7] ^[8] ^[9] Cite error: The opening <ref> tag is malformed or has a bad name (see the help page).

^ Example 1: named:casingref
^ ^a ^b Example 2: name:CasingRef and Tag casing
^ Example 3: Quoting (note that mixed quote ' " don't work, also note how space translates to an underscore)
^ ^a ^b Cite error: The named reference quoting space was invoked but never defined (see the help page).
^ bug">Example 4: ref is named "quote>bug", but is null
^ ^a ^b Example 5: Characters that can make a ref name, excluding <, >, and "
^ Example 6: special characters translate to anchor encoded character, so ! and .21 are equivalent
^ Cite error: The named reference bang.21 was invoked but never defined (see the help page).
^ Example 7: Empty names are apparently valid names (BUG!)

I suggest taking the end space out of <ref name=""/>; it's only required in XHTML for backwards compatibility, and we could save a few bytes on the encyclopedia.
Instead of using autogenerated, use the domain name. Regex: \w+://+([^/]*?)(?P<domain>[A-Za-z0-9\-]+)\.(?P<tld>[A-Za-z\.]{2,6})(?P<port>:\d+)?/ which should grab "domain" from http://subdomain-org.division.domain.org.uk/
Is merging <ref>[URL title] added fact blah blah blah</ref> with <ref>URL</ref> too complicated?
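To make the domain-name suggestion concrete, here is a small Python sketch. It assumes the domain and TLD groups are meant as plain (non-negated) character classes, and `DOMAIN_RE` and `extract_domain` are hypothetical names chosen for the example. Note that the pattern requires a `/` after the host, so a bare `http://example.com` would not match.

```python
import re

# Lazy prefix, then the registrable domain, the TLD (possibly two-part), an optional port.
DOMAIN_RE = re.compile(
    r'\w+://+([^/]*?)(?P<domain>[A-Za-z0-9\-]+)\.(?P<tld>[A-Za-z\.]{2,6})(?P<port>:\d+)?/'
)


def extract_domain(url):
    """Return the domain part of a URL, or None if the pattern does not match."""
    m = DOMAIN_RE.search(url)
    return m.group('domain') if m else None


name = extract_domain('http://subdomain-org.division.domain.org.uk/')
```

With the example URL from the suggestion, the lazy prefix swallows the subdomains and the `domain` group ends up holding "domain", with "org.uk" as the TLD.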

safe_map = {}  # byte value -> encoded form, generated when first used

def escapeId(s):
	""" Anchor encode routine: URL encode, then use '.' instead of '%' """
	return escapeUrl(s).replace('%', '.')

def escapeUrl(s):
	""" URL encode routine: percent encodes non-safe characters, spaces become '_' """
	if not safe_map:
		# generate when first used
		safe = '-.0123456789:ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz'
		for i in range(256):
			c = chr(i)
			safe_map[i] = c if c in safe else '%%%02X' % i

	# encode to UTF-8 first so that characters above U+00FF are handled too
	res = map(safe_map.__getitem__, s.replace(' ', '_').encode('utf-8'))
	return ''.join(res)

— Dispenser 17:03, 25 July 2008 (UTC)

Ah, Dispenser again. Thanks :)

Well, I see no big problem with the above examples. No, my bot does not support all of them, and no, I do not think I can do so :)

I did not know about reference names being converted the same way anchors are converted; I just implemented support for it (using the url2unicode from pywikipedia ;) ). After adding a few identical references... the bot removes the identical ones. Adding an encoded identical reference works the same way: [1] (And yes, if some processing is needed, the decoded version will always be preferred.)
Empty named references are just being ignored by my regex (i.e. not taken into account, ever. No merges, no duplicates flagging, and so on)
I just modified my regex to ignore "quote>bug". Previously, it would match "quote", which was wrong, and since I don't think there's a simple way to handle those very specific references, I'll just ignore them
I personally really don't care about the XHTML spec, nor about removing 1 space from an article. I'll just leave it as it is; I don't think that having one more space will strain the servers, knowing that most of the time DumZiBoT is removing duplicates, hence reducing the overall text size.
For that second task, I'm not working only on references containing a link. I work on all references. This means that fetching a domain name won't work all the time, as some references are just plain text. Also, I'd prefer acting very dumb, saying I'm not able to guess a proper name, so that editors look into the reference if they want to find a proper name. "autogenerated" is plain and simple: a bot has inserted it. A guessed name, when not relevant, can be confusing, right?
While I appreciate your efforts to keep the program content-agnostic, most users do not rename references. A nice way to keep it content-agnostic while still selecting the parts to use for the name is to have a configurable regex, like \|\s*last\s*=\s*(?P<refname>[\w\s]+)|http://+([^/]*?)(?P<refname>[A-Za-z0-9\-]+)\.([A-Za-z\.]{2,6})(:\d+)?/ which will (hopefully) capture the last name of the author; failing that, it will use the domain of the website. And don't take my earlier attempt at this (take the longest word and hope authors have long names) as anything that took more than 5 minutes to do. — Dispenser 04:01, 30 July 2008 (UTC)
As for reference merges when not perfectly identical... well... no, it's not that easy, I'm afraid :( I'd prefer to keep it rather simple, as the overall code is getting bigger and bigger...
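For readers weighing the two positions, the configurable-regex idea could look roughly like the Python sketch below. It is an assumption-laden illustration, not part of reflinks.py: `NAME_PATTERNS` and `guess_ref_name` are hypothetical, and because Python's `re` module does not allow the same group name twice in one pattern, the alternatives are kept as a list tried in order. When nothing matches, the sketch falls back to the plain "autogenerated" name.

```python
import re

# Hypothetical, configurable list; each pattern must define a "refname" group.
NAME_PATTERNS = [
    # Author's last name from a citation template parameter such as |last=Smith
    re.compile(r'\|\s*last\s*=\s*(?P<refname>[^|}\s][^|}]*?)\s*[|}]'),
    # Otherwise, the domain of the first URL found in the reference
    re.compile(r'\w+://+[^/]*?(?P<refname>[A-Za-z0-9\-]+)\.[A-Za-z\.]{2,6}(?::\d+)?/'),
]


def guess_ref_name(content, fallback='autogenerated'):
    """Return the first name captured by a configured pattern, else the fallback."""
    for pattern in NAME_PATTERNS:
        m = pattern.search(content)
        if m:
            return m.group('refname')
    return fallback


from_author = guess_ref_name('{{cite book |last=Smith |first=John |title=T}}')
from_domain = guess_ref_name('[http://subdomain-org.division.domain.org.uk/page Title]')
from_nothing = guess_ref_name('plain text reference')
```

The fallback keeps the "dumb" behavior for plain-text references, so a guessed name is only used when one of the configured patterns actually finds something.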

NicDumZ ~ 18:31, 25 July 2008 (UTC)

I'd love to see this go through a couple thousand pages until it gets perfect, so that I can steal the tried-and-true code for my bot. Seriously though, I advocate its formal approval here. – Quadell (talk) 23:21, 25 July 2008 (UTC)

I see cautions and ideas above, but no objections. NicDumZ has always been diligent about checking for errors and fixing them promptly. I'm confident that this task will be performed carefully and responsibly. – Quadell (talk) 12:25, 31 July 2008 (UTC)

Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. MBisanz (talk) 12:40, 31 July 2008 (UTC)

Okay, already spotted this kind of buggy edit. I'm adding nowiki tags around ref contents to solve the problem :)

NicDumZ ~ 13:16, 1 August 2008 (UTC)

Also fixed this kind of edit (the URL wasn't parsed correctly because of the spaces in the title) :) NicDumZ ~ 15:41, 1 August 2008 (UTC)

Here are 50-ish edits... :) NicDumZ ~ 12:11, 2 August 2008 (UTC)

Approved. Looks harmless, useful, and appropriate. (You have the honor of my first RfBA closing.) – Quadell (talk) 13:39, 6 August 2008 (UTC)

The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.

