Wikipedia:Bots/Requests for approval/Polbot 8
- The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.
Operator: – Quadell (talk) (random)
Automatic or Manually Assisted: Automatic, but with supervision
Programming Language(s): Perl, using Perlwikipedia
Function Summary: Fixes to references and external links
Edit period(s): In small batches (one category at a time)
Already has a bot flag: Yes
Function Details: The most prominent thing this bot will do is name external links, similar to the way that DumZiBoT does. There are a few differences: DumZiBoT runs on all pages using a database dump, while Polbot will run on one category's worth of articles at a time and will use live data. Also, DumZiBoT changes "[http://www.example.com/subpage]" to "[http://www.example.com/subpage A Subpage]", whereas Polbot will instead change it to "[http://www.example.com/subpage A Subpage] at www.example.com" per feedback on my talk page. And DumZiBoT can handle PDFs, but I can't figure out how to do that, so Polbot will skip PDFs.
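As an illustration of the link-naming format described above, here is a minimal Python sketch. It is not the bot's actual code (Polbot is written in Perl); the function name `name_link` is hypothetical, the page title is passed in rather than fetched, and returning `None` for a skipped PDF is an assumption about how "skip" might be represented.

```python
from urllib.parse import urlsplit


def name_link(url, title):
    """Build '[url Title] at host' wikitext, or return None to skip the link."""
    parts = urlsplit(url)
    # Assumption: a PDF is recognized by its file extension and skipped.
    if parts.path.lower().endswith(".pdf"):
        return None
    # Square brackets would break external-link syntax; collapse whitespace too.
    clean = " ".join(title.replace("[", "(").replace("]", ")").split())
    return f"[{url} {clean}] at {parts.netloc}"


print(name_link("http://www.example.com/subpage", "A Subpage"))
```

Keeping the host name outside the brackets means the reader can see where the link leads even when the fetched page title is uninformative.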
But wait, there's more! Polbot's 8th function will also make other improvements, to wit:
- Convert the <references /> tag to the {{reflinks}} template, as AWB does in its auto-fixes.
- If there are <ref> tags but no {{reflist}}, it'll add a {{reflist}}.
- Where? In Notes or References sections if they exist. If not, then just before the earliest of External links, Sources, Further reading, See also, or just before the categories, or at the end if all else fails.
- Fix the double bracketing like [[http://www.example.com]], also inspired by AWB auto-fixes.
- Change [http://en.wikipedia.org/wiki/Example] to [[Example]] and [http://pt.wikipedia.org/wiki/Example this] to [[:pt:Example|this]].
- When there are bare links to IMDB, such as [http://www.imdb.com/title/tt0099700/], change them to use the various IMDB templates, such as {{imdb title|0099700|Gremlins 2: The New Batch (1990)}}.
- Turn previously-unnamed bare external links (BELs) to references by putting them in <ref> tags. This is so that pages don't intersperse numbered reference links[1] with numbered BELs (like [1]), when both are used as citations and have contradictory numbering schemes. It will ignore BELs in HTML comments or in the {{PDFlink}} template. It will also ignore BELs that are the only thing on a line. This is because...
- If any two ref tags have the exact same content, it will merge them. E.g. "<ref>Example</ref>...<ref>Example</ref>" will become "<ref name=botgen1>Example</ref>...<ref name=botgen1 />"
- If there have been any changes needed due to the above, then Polbot will perform misc other cleanup tasks at no extra charge.
- Fix miscapitalized headers (e.g. "See Also")
- Fix mislinked dates and years (e.g. "[[20th Century]]")
- Change <i> to '' and <b> to '''
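The ref-merging step in the list above can be sketched as follows. This is a hedged Python illustration, not the bot's Perl source: the function name `merge_duplicate_refs` is hypothetical, only unnamed `<ref>` tags are considered, and the `botgenN` names follow the example given in the list.

```python
import re

# Matches only unnamed refs; refs that already carry a name are left alone.
REF_RE = re.compile(r"<ref>(.*?)</ref>", re.DOTALL)


def merge_duplicate_refs(wikitext):
    """Name the first copy of each repeated ref and collapse later copies."""
    contents = REF_RE.findall(wikitext)
    names = {}
    for content in contents:
        if contents.count(content) > 1 and content not in names:
            names[content] = f"botgen{len(names) + 1}"

    seen = set()

    def replace(match):
        content = match.group(1)
        if content not in names:
            return match.group(0)  # unique ref, unchanged
        if content in seen:
            return f"<ref name={names[content]} />"
        seen.add(content)
        return f"<ref name={names[content]}>{content}</ref>"

    return REF_RE.sub(replace, wikitext)


print(merge_duplicate_refs("A<ref>Example</ref> B<ref>Example</ref>"))
```

The sketch compares ref contents as exact strings, matching the "exact same content" condition in the list; refs that differ only in whitespace would not be merged.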
References for examples
Discussion
- Suggestions? Cautions? Encouragement? I'm all ears. – Quadell (talk) (random) 02:04, 4 July 2008 (UTC)
- As the Template:reflist documentation says, there is no consensus to use it in place of <references/> in all occasions. See [2].
- I have it only replace divs if they match up on either side of the references tag, with nothing else in there. So that bug shouldn't be a problem. – Quadell (talk) (random) 03:32, 4 July 2008 (UTC)
- If there is no "references," "notes," or "sources" section and it needs to add a reflist, it should create a "References" section.
- Great minds think alike. :-) It does this. – Quadell (talk) (random) 03:32, 4 July 2008 (UTC)
- How will it generate a reference name when it creates a named ref? Mr.Z-man 02:23, 4 July 2008 (UTC)
- It'll use "botgen1", "botgen2", etc. – Quadell (talk) (random) 03:32, 4 July 2008 (UTC)
- Actually, I'm having trouble getting this part to work correctly (and it is infrequently needed anyway). I may end up skipping this function. – Quadell (talk) (random) 01:52, 6 July 2008 (UTC)
- I think it is only necessary to do number 4 if you are doing some other change, since it is pretty trivial. Wouldn't there be uses of BELs (see number 6) where this would break how the article reads? If an article needs cleanup it might say 'The company's website can be found at [3].' and task 6 would break this. Otherwise I think the idea of combining bot jobs is great! Want to check the order of sections and date the maintenance tags while you are at it? :-P -- maelgwn - talk 02:49, 4 July 2008 (UTC)
- Your example about #6 breaking the article is an interesting one. What it would do is change:
- The company's website can be found at [4].
- ...to:
- The company's website can be found at [1].
- In my opinion, this would change a badly formatted article from using one poor choice to another. Both before and after, it's a number used awkwardly as a noun. Both before and after, the sentence would have to be totally reworded to make it standard. I do see your point, though. I could have the bot check for " at [link]" or " in [link]", and skip those. By the way, I'm very willing to change section order and date maint tags while I'm at it -- is there a place I can see the standard for that, possibly including already-existing regular expressions? – Quadell (talk) (random) 03:32, 4 July 2008 (UTC)
- Dating maintenance tags is probably not really feasible because you won't be editing those articles within the month that the tag was placed, if you are using categories ... I only know of SmackBot, and it uses AWB. And I don't know about the standard; it applies to several templates like {{cleanup}}. -- maelgwn - talk 07:43, 4 July 2008 (UTC)
- Comment: Here are some examples of the bot's changes to local copies of Paducah, Texas, Paducah, Kentucky, Adyashanti, Íngrid Betancourt, Internal combustion engine, and Peace movement. – Quadell (talk) (random) 22:08, 4 July 2008 (UTC)
- I have no idea about bots and code and won't pretend that I can assess potential downsides. To me, everything about it sounds great and I would love to see these functions go live. user:Everyme 06:14, 8 July 2008 (UTC)
Hey, BAGgers! It's been close to a week with no objections, and I've answered all questions. I hate to be a nag, but I'm going on vacation soon and I'd like to run a trial before I go, if possible. – Quadell (talk) (random) 14:31, 10 July 2008 (UTC)
Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. BJTalk 19:20, 10 July 2008 (UTC)
Thanks! Okay, the trial is still ongoing, but I already found a few opportunities for improvement. First, there were way too many changes like this, which seems a tad piddlin', so I changed the code to only save changes if it has something more substantial to change. This edit incorrectly changed a book title, so I took that out; now it only changes complete links such as 20th Century. This was a flat-out bug. I fixed it. And this edit created a bot-generated title of "File Not Found", when the server neglected to return a status of 404; I have changed the bot to read "File Not Found" as a dead link. Developing... – Quadell (talk) (random) 22:31, 10 July 2008 (UTC)
- This edit? Definitely something I want to avoid. Fixed. Also, what do I do about links that want me to sign in, like in this diff? Do I mark it as a dead link, or ignore it, or what? – Quadell (talk) 22:56, 10 July 2008 (UTC)
- Done. Okay, the test run of 50 is done. I fixed the obvious errors I saw. See any subtler errors, or suggestions for improvement? – Quadell (talk) 23:15, 10 July 2008 (UTC)
Current status
Trial is complete. Found errors have been fixed. There is still one open question: what to do about links that want me to sign in or register? If no one has any better ideas, I'll simply ignore them. Anyway, I guess I'm just waiting for final approval, or any further suggestions on improvements. All the best, – Quadell (talk) 02:47, 14 July 2008 (UTC)
- One thing before I approve this: replacing <references /> with {{reflist}} was wrong last time I checked, but replacing <div class="references-small"><references /></div> with {{reflist}} was fine. BJTalk 21:52, 17 July 2008 (UTC)
- What the code does is this: First it checks to see if there is a <references /> surrounded by two <span>s (or <div>s, or one of each), with no other text inside those wrappers. If this is the case, the bot replaces the whole "<div (something)><div (something)><references /></div></div>" with {{reflist}} (or {{reflist|2}}, or whatever, depending on the div parameters). If a doubly-divved <references /> doesn't exist, it looks for a singly-divved one and does the same. And if that isn't found, it looks for a non-divved <references />, and replaces it with {{reflist}}. This way, formatting isn't removed from manually-entered references. For instance, this wikicode:
<div class=references-small>
<references />
* A manual reference
</div>
- ...is converted to:
<div class=references-small>
{{reflist}}
* A manual reference
</div>
- ...so formatting is preserved. This is the same logic that AWB now uses, since the AWB folks addressed the complaints on the issue. I haven't seen Polbot introduce any mal-formatting so far with this, but I'll keep my eyes open.
- Having said this, this is a very minor part of the task, and I don't mind leaving that out if it would help the BAGgers sleep peacefully at night. – Quadell (talk) 15:09, 15 July 2008 (UTC)
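The three-step fallback described in this reply (doubly wrapped, then singly wrapped, then bare) can be sketched in Python as below. This is an illustration only, not the bot's Perl code: the names `PATTERNS` and `references_to_reflist` are hypothetical, and the sketch always emits a plain {{reflist}} rather than choosing {{reflist|2}} from the wrapper's parameters.

```python
import re

WRAP_OPEN = r"<(?:div|span)\b[^>]*>"
WRAP_CLOSE = r"</(?:div|span)>"
REFS = r"<references\s*/>"

# Ordered from most to least specific. A wrapper only matches when it
# contains nothing but whitespace and the <references /> tag itself.
PATTERNS = [
    re.compile(rf"{WRAP_OPEN}\s*{WRAP_OPEN}\s*{REFS}\s*{WRAP_CLOSE}\s*{WRAP_CLOSE}", re.I),
    re.compile(rf"{WRAP_OPEN}\s*{REFS}\s*{WRAP_CLOSE}", re.I),
    re.compile(REFS, re.I),
]


def references_to_reflist(wikitext):
    """Replace the first <references /> (and any empty wrappers) with {{reflist}}."""
    for pattern in PATTERNS:
        new_text, count = pattern.subn("{{reflist}}", wikitext, count=1)
        if count:
            return new_text
    return wikitext


sample = "<div class=references-small>\n<references />\n* A manual reference\n</div>"
print(references_to_reflist(sample))
```

Because the wrapped patterns refuse to match when other text sits inside the wrapper, the sample above falls through to the bare pattern and keeps its div, which is the formatting-preserving behavior the reply describes.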
Questions from Dispenser
Some questions from the DumZiBoT BRfA
- How will the bot handle dead links? Will it append {{dead link}} to the link?
- How good is the bot's soft 404 Detection?
- Will the bot have blacklisted titles? Will the bot be able to correctly handle DumZiBoT's test cases?
- Why not just extend the code from DumZiBoT?
Unfortunately, I'll be at the H.O.P.E. conference over the next few days and won't be able to answer your responses.
- I'm on vacation too, just popping in. Thanks for those links and comments! In brief: it does append {{dead link}} to those links it happens to be looking up anyway to find the titles. The bot's soft-404 detection is rudimentary, and I expect to be adding to it as I review its changes. Its only blacklisted titles are the really obvious soft-404 detection, e.g. "file not found". That link to DumZiBoT's test cases is great! I'll get on that when I'm back in town. Also, the difference between this and DumZiBoT is that DumZiBoT only plows through database dumps, and it doesn't auto-cite books or magazines or journals, so it really has a different scope. There's some overlap in what they do, but really, Polbot#8 will affect different pages in different ways. We're collaborating though.
- Okay, I'm diving back into vaca-mode now. Enjoy H.O.P.E.! I have several friends who are going, and are really looking forward to it. All the best, – Quadell (talk) 17:22, 17 July 2008 (UTC)
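A rudimentary soft-404 check of the kind described in this reply might look like the Python sketch below. It is an assumption-laden illustration, not the bot's Perl code: the names `SOFT_404_TITLES`, `is_dead_link`, and `mark_dead` are hypothetical, and only the "file not found" entry comes from the discussion; the other entry is an invented placeholder.

```python
# Titles that suggest a "soft 404": the server answered 200 but the page is gone.
SOFT_404_TITLES = {
    "file not found",   # the case named in the discussion
    "page not found",   # hypothetical extra entry, for illustration only
}


def is_dead_link(status_code, title):
    """Treat a real 404, or a blacklisted page title, as a dead link."""
    if status_code == 404:
        return True
    normalized = " ".join((title or "").lower().split())
    return normalized in SOFT_404_TITLES


def mark_dead(link_wikitext):
    """Append the {{dead link}} template after the link."""
    return link_wikitext + " {{dead link}}"


print(is_dead_link(200, "File Not Found"))
print(mark_dead("[http://www.example.com/gone]"))
```

Matching on the whole normalized title, rather than on a substring, keeps a legitimate page whose title merely mentions "not found" from being tagged.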
Update: HOPE is over, my vacation is over, and I'm going through DumZiBoT's test cases. I'll let you know when I'm ready here. – Quadell (talk) 15:02, 24 July 2008 (UTC)
- Awe, ready for the second volley. Numbers correspond to the answers above.
- Does it also detect the wrappers that are sometimes around it, like the two columns?
- Fine.
- Fine.
- Watch out to make sure you aren't doing [http://en.wikipedia.org/wiki/Category:Firearms Linked category] -> [[Category:Firearms|Linked image]]. And would the first example that you give change a number [1] to example? See also Wikipedia:AWB/FR#External to Interwiki.
- Is there a wiki page where users could see and add to the list of templates? Mind you it doesn't need to pull it from the page.
- Could I see the source already? The one I use doesn't work when the link is in a ref on the same line.
- Same comment as on the DumZiBoT page: I would like to see better ref names, as editors ultimately don't change these names.
- This probably isn't useful, but if you wanted more you could always have AWB run the Perl script.
- Did you rewrite all of DumZiBoT's code in Perl? If you did, wouldn't it have been easier to call the program? When will you be releasing the source code? – Dispenser 21:15, 25 July 2008 (UTC)
I appreciate your questions, and I hope I'm understanding them all correctly.
- #1: Yes, it keeps font and column formatting.
- #4: It prepends colons for categories and images, so http://en.wikipedia.org/wiki/Image:William_Gibson_by_FredArmitage.jpg becomes [[:Image:William Gibson by FredArmitage.jpg]] and [http://en.wikipedia.org/wiki/Category:Internet_history linktext] becomes [[:Category:Internet history|linktext]].
- #5: the way the code deals with templates such as {{cite news}} and {{imdb}} is rather complicated -- but people have left requests on my talk pages, and whenever they have, I have implemented the requested templates.
- #6: The source code is at User:Polbot/source/Reffix.pl. But basically, this function turns a line like this:
These[http://www.example.com/1] are examples[http://www.example.com/2].
- ...into:
These<ref>[http://www.example.com/1 Title 1]</ref> are examples.<ref>[http://www.example.com/2 Title 2]</ref>
- Is there something the code's not doing that you think it should?
- #7: I've given up on this for now. It doesn't come up often, and NickDumZ is having a bugger of a time getting it to work. So I may go back to that challenge later. Or not.
- #8: Well, the version I'm working on now is over 1000 lines, including XML-parsing packages, HTML-following packages, and strange character-encoding packages, so I don't know that that sort of thing could (or should) be integrated into AWB. I did not use DumZiBoT's code directly -- I basically took his idea, changed it considerably, and created a Perl bot from scratch. Although the two bots do a few things similarly, the differences really are enormous. – Quadell (talk) 22:45, 25 July 2008 (UTC)
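The conversion described in answer #4 above (Wikipedia URLs to wikilinks, with a leading colon for categories and images) can be sketched in Python as follows. This is not taken from User:Polbot/source/Reffix.pl; the names `WP_LINK`, `to_wikilink`, and `convert_wikipedia_links` are hypothetical, and the sketch only handles bracketed links.

```python
import re
from urllib.parse import unquote

# [http://LANG.wikipedia.org/wiki/PAGE optional label]
WP_LINK = re.compile(
    r"\[http://([a-z\-]+)\.wikipedia\.org/wiki/([^\s\]]+)(?: ([^\]]+))?\]"
)
# Namespaces that need a leading colon so the link is shown, not applied.
COLON_NAMESPACES = ("Category:", "Image:")


def to_wikilink(match):
    lang, page, label = match.group(1), match.group(2), match.group(3)
    target = unquote(page).replace("_", " ")
    if lang != "en":
        target = f":{lang}:{target}"   # interlanguage link, e.g. :pt:Example
    elif target.startswith(COLON_NAMESPACES):
        target = ":" + target          # link to the category/image page itself
    return f"[[{target}|{label}]]" if label else f"[[{target}]]"


def convert_wikipedia_links(wikitext):
    return WP_LINK.sub(to_wikilink, wikitext)


print(convert_wikipedia_links(
    "[http://en.wikipedia.org/wiki/Category:Internet_history linktext]"
))
```

Without the leading colon, `[[Category:Internet history|linktext]]` would place the page in the category instead of rendering a visible link, which is the failure Dispenser warned about.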
I think I've answered all the questions. I believe this bot is ready for Wikipedia, if Wikipedia is ready for this bot. – Quadell (talk) 13:12, 29 July 2008 (UTC)
Approved. BJTalk 13:17, 29 July 2008 (UTC)
- The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.