Wikipedia:Bots/Requests for approval/Hazard-Bot 11
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
Operator: Hazard-SJ (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 01:55, Tuesday May 29, 2012 (UTC)
Automatic, Supervised, or Manual:
Programming language(s): Python
Source code available: Pywikipedia
Function overview: Fixing ISBNs
Links to relevant discussions (where appropriate): Wikipedia:BON#Rich_Farmbrough.27s_bots
Edit period(s): Occasional
Estimated number of pages affected: N/A
Exclusion compliant (Yes/No): Yes
Already has a bot flag (Yes/No): Yes
Function details: This bot will go through a list of pages which have possible ISBN errors and attempt to fix them. This "list" will be made by my scanning a database dump. I have already tested the script to ensure it works, twice in the sandbox (here and here) and three times in mainspace (here, here, and here). Yes, it may need a few tweaks to become more efficient, but as Helpful Pixie Bot can do this no more, I'd like my bot to try to help out. Hazard-SJ ✈ 01:55, 29 May 2012 (UTC)[reply]
Discussion
There were a very large number of complaints when Helpful Pixie Bot resumed doing this shortly before the Arb case caused it to be blocked. How will you avoid a similar situation? Also, what specifically are the errors the bot will fix? Anomie⚔ 02:45, 29 May 2012 (UTC)[reply]
- It is unclear what the bot would fix; my advice is to try to focus on errors that break the link to Special:BookSources that the ISBN number should automatically create. One example I ran across recently is the use of "ISBN-10" or "ISBN-13" to prefix the number; the software doesn't seem to recognize this. Robert Skyhawk (T C) 02:52, 29 May 2012 (UTC)[reply]
Do you intend to treat lack of dashes as an ISBN error, as WP:COSMETICBOT, or as a matter of editor style choice and hence ignore? Personally, I would prefer the bot only to action this if the article was already internally inconsistent. SpinningSpark 09:08, 29 May 2012 (UTC)[reply]
- The changes are:
- Removal of the colon:
(r'ISBN: (\d+)', r'ISBN \1')
- Removal of the word "number" or a variant thereof:
(r'ISBN( number| no\.?| No\.?|-Nummer|-Nr\.):? (\d+)', r'ISBN \2')
- Replacement of en dash, em dash etc. with hyphen-minus, for ISBN-13:
(ur'ISBN (978|979) *[\- −\.‐-―] *(\d+) *[\- −\.‐-―] *(\d+) *[\- −\.‐-―] *(\d+) *[\- −\.‐-―] *(\d)(?!\d)', r'ISBN \1-\2-\3-\4-\5')
and for ISBN-10:
(ur'ISBN (\d+) *[\- −\.‐-―] *(\d+) *[\- −\.‐-―] *(\d+) *[\- −\.‐-―] *(\d|X|x)(?!\d)', r'ISBN \1-\2-\3-\4')
- Addition of a space between "ISBN" and the number if the space is absent, and replacement of a non-breaking space with the proper space:
(r'ISBN(| | )((\d(-?)){12}\d|(\d(-?)){9}[\dXx])', r'ISBN \2')
- I hope this helps. Hazard-SJ ✈ 02:25, 30 May 2012 (UTC)[reply]
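As a rough illustration of how the prefix fixes listed above behave, here is a minimal Python 3 sketch. The first two patterns are copied from the list; the third is an assumed reconstruction in which the non-breaking space is spelled out as the `&nbsp;` entity or the literal U+00A0 character. The name `fix_isbn_prefixes` is hypothetical and is not taken from the bot's source.

```python
import re

# Patterns 1 and 2 are copied from the discussion. Pattern 3 is an
# assumption: it spells the non-breaking space as "&nbsp;" or U+00A0.
FIXES = [
    (r'ISBN: (\d+)', r'ISBN \1'),
    (r'ISBN( number| no\.?| No\.?|-Nummer|-Nr\.):? (\d+)', r'ISBN \2'),
    (r'ISBN(|&nbsp;|\u00a0)((\d(-?)){12}\d|(\d(-?)){9}[\dXx])', r'ISBN \2'),
]


def fix_isbn_prefixes(text):
    """Apply each substitution in turn and return the rewritten text."""
    for pattern, replacement in FIXES:
        text = re.sub(pattern, replacement, text)
    return text


if __name__ == '__main__':
    for sample in ('ISBN: 0306406152',
                   'ISBN-Nr. 3161484100',
                   'ISBN&nbsp;978-1-4000-4360-6'):
        print(sample, '->', fix_isbn_prefixes(sample))
```

An ISBN that already has a normal space after "ISBN" matches none of the three patterns, so it is left untouched.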
- With my limited understanding of regex, that seems fine to me. SpinningSpark 16:03, 30 May 2012 (UTC)[reply]
{{BAGAssistanceNeeded}}
Could I take a trial please? Hazard-SJ ✈ 21:03, 2 June 2012 (UTC)[reply]
- All of this looks good to me except changing a non-breaking space to a space; is there consensus for this change? I can see why you would want to make this change but I can also see why, aesthetically, someone may wish to prevent a line break between "ISBN" and the number. — madman 22:29, 5 June 2012 (UTC)[reply]
- If you want the wikimagic to automatically link the ISBN, you have to use a normal space: compare ISBN 978-1-4000-4360-6 (normal) to ISBN 978-1-4000-4360-6 (&nbsp;) or ISBN 978-1-4000-4360-6 (literal U+00A0 character). To prevent a linebreak, you'd have to use a piped link Special:BookSources explicitly (or use a template that does this, if there is one), or use something like {{nowrap}}. Speaking of which, what will this bot do if it sees the ISBN in the text portion of a wikilink, e.g. [[Special:BookSources/9781400043606|ISBN 978-1-4000-4360-6]]? Anomie⚔ 10:49, 6 June 2012 (UTC)[reply]
- I think madman's point was that a non-breaking space would probably have been inserted by an editor consciously making this compromise and it is not for a bot to revert a human editorial choice. It would only have been inserted if there was actually a problem or the editor believed a problem was possible. Piping does not work by the way, it results in a redlink, but the nowrap template does work. SpinningSpark 12:46, 6 June 2012 (UTC)[reply]
- Actually, I withdraw my objection to that change; thanks for the information, Anomie. I think allowing for the magic link is more important than the editorial choice of a breaking space versus a non-breaking space. Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. — madman 13:24, 6 June 2012 (UTC)[reply]
- Piping seems to work fine to me: [[Special:BookSources/9781400043606|ISBN 978-1-4000-4360-6]]. Note that certain browsers might still decide to break the piped text at the dashes, though. Anomie⚔ 17:40, 6 June 2012 (UTC)[reply]
- Right, I thought you meant pipe from the wikimagiced ISBN. SpinningSpark 00:14, 7 June 2012 (UTC)[reply]
- Just to be clear: HPB would break ISBNs when it didn't recognize them, and there was no way to prevent it from continuously rebreaking the ISBN every time it ran (see the second paragraph down). As long as this bot doesn't do that, I'm fine. If it does do that then you have my strongest possible objection. This is also Sven Manguard 18:48, 8 June 2012 (UTC)[reply]
- Not to get offtopic, but it looks like that ISBN in your diff should be 9780521243315 rather than 9780521243319. At least according to Amazon. Anomie⚔ 19:45, 8 June 2012 (UTC)[reply]
- Actually, it's {{cite book}} that's broken. It should (and used to) accept {{Please check ISBN}} in the isbn= field. As Anomie notes above, that ISBN is wrong. Mr Stephen (talk) 21:22, 12 June 2012 (UTC)[reply]
- From the most recent (June 2) dump, using a (when compressed) 98.2M file, I got 22 edits. I'll either need a bigger dump or a SQL scan for more. Hazard-SJ ✈ 04:50, 9 June 2012 (UTC)[reply]
- Trial complete. I used the dump hosted on Toolserver to complete my trial (I actually did extras). Hazard-SJ ✈ 02:06, 12 June 2012 (UTC)[reply]
The bot has edited WP space, which seems inadvisable. The value of editing WP space is less, and the risk of "correcting" an intentional error in an instructional page is greater. Jc3s5h (talk) 03:18, 12 June 2012 (UTC)[reply]
- As Jc3s5h notes, this task should be restricted to namespace 0. Also, it has much too high of an error rate. This edit absolutely should not have happened; it should be expected that ISBN 13 may appear in a citation instead of ISBN-13, and a simple sanity check would have found that the newly-formed ISBN was sixteen digits long. This edit broke an image. This edit and this edit changed ISBNs in external links, which may not be edits that break anything but that are certainly of dubious value. Finally, fourteen edits changed spaces to dashes, which is a purely cosmetic change, one that should have been discussed above in reply to SpinningSpark's question. My understanding is that the goal of this task is to allow wikimagic to occur where it was not before.
- I'd like to get a second opinion on this from another BAG member, but I think that given the relatively small number of articles needing edits (if the above is true) and the contextual problems involved, this should be denied as an automated task. Thanks, — madman 21:46, 12 June 2012 (UTC)[reply]
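The "simple sanity check" mentioned above could look something like the following Python 3 sketch, which rejects any candidate that does not have exactly 10 or 13 digits or whose check digit is wrong. It uses the standard ISBN-10 (mod 11) and ISBN-13 (mod 10, weights 1 and 3) checksums; the function name `is_valid_isbn` is hypothetical and nothing here is taken from the bot's actual code.

```python
import re


def is_valid_isbn(candidate):
    """Return True if candidate has 10 or 13 digits and a correct check digit."""
    # Drop the separators that may legitimately appear inside an ISBN.
    digits = re.sub(r'[\s-]', '', candidate).upper()

    if re.fullmatch(r'\d{9}[\dX]', digits):
        # ISBN-10: weights 10 down to 1, 'X' counts as 10, sum divisible by 11.
        total = sum((10 - i) * (10 if ch == 'X' else int(ch))
                    for i, ch in enumerate(digits))
        return total % 11 == 0

    if re.fullmatch(r'\d{13}', digits):
        # ISBN-13: alternating weights 1 and 3, sum divisible by 10.
        total = sum(int(ch) * (3 if i % 2 else 1)
                    for i, ch in enumerate(digits))
        return total % 10 == 0

    # Any other length (for example "13" fused onto an ISBN-13) is rejected.
    return False
```

A bot could run such a check on the number produced by a substitution and skip the edit when the result is not a valid ISBN. The same check also separates the two numbers discussed earlier: 9780521243315 passes, while 9780521243319 does not.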
- Why not also correct ISBNs in WP:AFC/WT:AFC space? I see no reason to exclude submissions... mabdul 21:51, 12 June 2012 (UTC)[reply]
- That's a fair point, and the reason for these discussions. I see no problem with including AFC submissions (if the task were to meet the requirement of the bot policy that it be harmless, and I don't think it does at this time). — madman 22:33, 12 June 2012 (UTC)[reply]
- Spaces work for the wikimagic? Test: ISBN 978 1 4000 4360 6 ← I guess so. And the problems with links and images would certainly need to be sorted out. Anomie⚔ 02:45, 13 June 2012 (UTC)[reply]
- (edit conflict) I can limit it to mainspace (tested). As for AFCspace, I can't use dumps directly for that. I'd have to generate a list of those pages separately. As for changes to external links, that shouldn't be a problem (see the edit in https://en.wikipedia.org/wiki/?diff=497333259 where I added &thisdoesnotexistbuttest=ISBN 978-87-988154-5-7 to the URL, but it wasn't changed). I'll be addressing the other issues soon. Hazard-SJ ✈ 02:48, 13 June 2012 (UTC)[reply]
- ref tags are now ignored. As for the space-to-dash issue, I changed:
(ur'ISBN (978|979) *[\- −\.‐-―] *(\d+) *[\- −\.‐-―] *(\d+) *[\- −\.‐-―] *(\d+) *[\- −\.‐-―] *(\d)(?!\d)', r'ISBN \1-\2-\3-\4-\5'), # ISBN-13
(ur'ISBN (\d+) *[\- −\.‐-―] *(\d+) *[\- −\.‐-―] *(\d+) *[\- −\.‐-―] *(\d|X|x)(?!\d)', r'ISBN \1-\2-\3-\4'), # ISBN-10
- to
(ur'ISBN (978|979)[\-−\.‐-―](\d+)[\-−\.‐-―](\d+)[\-−\.‐-―](\d+)[\-−\.‐-―](\d)(?!\d)', r'ISBN \1-\2-\3-\4-\5'), # ISBN-13
(ur'ISBN (\d+)[\-−\.‐-―](\d+)[\-−\.‐-―](\d+)[\-−\.‐-―](\d|X|x)(?!\d)', r'ISBN \1-\2-\3-\4'), # ISBN-10
(ur'ISBN (978|979) +[\- −\.‐-―] +(\d+) +[\- −\.‐-―] +(\d+) +[\- −\.‐-―] +(\d+) +[\- −\.‐-―] +(\d)(?!\d)', r'ISBN \1 \2 \3 \4 \5'), # ISBN-13
(ur'ISBN (\d+) +[\- −\.‐-―] +(\d+) +[\- −\.‐-―] +(\d+) +[\- −\.‐-―] +(\d|X|x)(?!\d)', r'ISBN \1 \2 \3 \4'), # ISBN-10
- so the issue should be resolved (it is according to tests I just made). Hazard-SJ ✈ 03:27, 13 June 2012 (UTC)[reply]
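The effect of the revised ISBN-13 dash pattern can be sketched in Python 3 as follows. This is a simplified rewrite rather than the bot's code: the separator class is written with explicit escapes (hyphen-minus, full stop, U+2212 minus sign, and the U+2010 to U+2015 range that includes the en and em dashes), and the names `SEP`, `ISBN13_DASHES` and `normalise_isbn13_dashes` are hypothetical.

```python
import re

# Dash-like separators only; a plain space is deliberately NOT included,
# so space-separated ISBNs are no longer rewritten with hyphens.
SEP = r'[-.\u2212\u2010-\u2015]'

ISBN13_DASHES = re.compile(
    rf'ISBN (978|979){SEP}(\d+){SEP}(\d+){SEP}(\d+){SEP}(\d)(?!\d)')


def normalise_isbn13_dashes(text):
    """Replace dash-like separators inside an ISBN-13 with hyphen-minus."""
    return ISBN13_DASHES.sub(r'ISBN \1-\2-\3-\4-\5', text)
```

With this pattern an ISBN written with en dashes is normalised, while "ISBN 978 1 4000 4360 6" and a citation that begins "ISBN 13 978-..." are both left as they are.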
- The file issue has been fixed now (see this sandbox edit).
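One way to keep substitutions out of file names and URLs is to apply them only to the text between such spans. The Python 3 sketch below illustrates that idea and is not the fix the operator actually made. The `PROTECTED` pattern and the function name `sub_outside_protected` are assumptions, and the pattern only covers `[[File:...]]`/`[[Image:...]]` targets and bare `http(s)` URLs.

```python
import re

# Spans that should never be edited: the target part of a file link
# (up to the first "|" or "]") and bare URLs. This is an assumed,
# deliberately narrow definition of "protected" text.
PROTECTED = re.compile(
    r'\[\[(?:File|Image):[^\]|]*|https?://[^\s\]<>|]+',
    re.IGNORECASE)


def sub_outside_protected(pattern, repl, text):
    """Run re.sub(pattern, repl, ...) only on text outside protected spans."""
    pieces = []
    pos = 0
    for match in PROTECTED.finditer(text):
        # Rewrite the ordinary text before the protected span...
        pieces.append(re.sub(pattern, repl, text[pos:match.start()]))
        # ...and copy the protected span through unchanged.
        pieces.append(match.group(0))
        pos = match.end()
    pieces.append(re.sub(pattern, repl, text[pos:]))
    return ''.join(pieces)
```

For example, `sub_outside_protected(r'ISBN: (\d+)', r'ISBN \1', text)` leaves a file name such as `[[File:ISBN: 123.jpg|thumb]]` intact while still fixing an "ISBN: " prefix elsewhere on the page.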
{{BAGAssistanceNeeded}}
Is there anything I left out? If not, can I go for another trial please? Hazard-SJ ✈ 03:38, 13 June 2012 (UTC)[reply]
- What do you mean by ref tags now being ignored? Do you mean their attributes (e.g. name, which I was thinking may be a problem, but it's not one that showed up in the trial)? Or do you mean their contents as well? (In which case a lot of the good edits made in the last trial would not have happened.) — madman 13:19, 13 June 2012 (UTC)[reply]
- There was a problem with ISBN fixes in the last trial that "broke" a ref by inserting dashes between the ISBN and a date. However, I believe I can unignore ISBNs in <ref> tags since the dash issue is now resolved. Hazard-SJ ✈ 22:13, 13 June 2012 (UTC)[reply]
- Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. — madman 23:36, 14 June 2012 (UTC)[reply]
- Trial complete. Hazard-SJ ✈ 23:44, 19 June 2012 (UTC)[reply]
- Looks like 97 edits, actually, almost twice as many as were approved. Please change the task so it will be restricted to namespace 0 in the future and another member of the BAG or I will review the edits as soon as possible (convenience link: [1]). Thanks, — madman 01:36, 20 June 2012 (UTC)[reply]
- (More convenient link) Yes, I forgot to use -namespace:0, and noticed near the end of the trial, and since the bot took nearly forever to start (reading an entire enwiki dump first), I didn't stop it to fix that (not to say that it won't be done in the future!). Hazard-SJ ✈ 23:07, 20 June 2012 (UTC)[reply]
- Approved. – The results of the trial were correct in all cases. Please do restrict this task to namespace 0. Thanks! — madman 19:32, 21 June 2012 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.