User talk:ST47/perlwikipedia

Unix

I'm running ActivePerl on a Windows box, and perlwikipedia doesn't seem to work. Am I correct that perlwikipedia assumes a Unix environment? For instance, perlwikipedia.pm says:

system("test -s \".perlwikipedia-$editor-cookies\"");

Is "test" a Unix command? – Quadell (talk) (random) 17:44, 24 May 2007 (UTC)

I have a similar issue, and others have reported it as well. Shadow doesn't get too happy when we talk about Windows around him, though, so I suppose it won't be fixed. I think it's an encoding error, and I've said several times that I was going to look at it, but I never got around to it. --ST47Talk 18:04, 25 May 2007 (UTC)
Oops! Yeah, "test" is a Unix command; with -s it checks that a file exists and is not empty. I'll write a better handler for this code ASAP and commit it to SVN. Shadow1 (talk) 18:58, 25 May 2007 (UTC)
OK, the code should now work on Windows/ActiveState if you update your working copy. Thanks for reminding me that not everyone uses Linux! Shadow1 (talk) 19:08, 25 May 2007 (UTC)
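
For reference, a minimal sketch of a portable replacement for the shell call quoted above. It is not the change that was committed to SVN; it only shows Perl's built-in -s file test, which does the same check as `test -s` without spawning a Unix command. The helper name cookie_file_has_data and the temporary file are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: true when $path exists and has a non-zero size,
# which is the condition the shell command `test -s` checks.
sub cookie_file_has_data {
    my ($path) = @_;
    return (-s $path) ? 1 : 0;
}

# Demonstrate on a temporary file instead of a real cookie file.
my ($fh, $cookie_file) = tempfile(UNLINK => 1);
my $empty = cookie_file_has_data($cookie_file);      # file exists but is empty

print {$fh} "cookie data\n";
close $fh or die "Cannot close $cookie_file: $!";
my $filled = cookie_file_has_data($cookie_file);     # file now has content

my $missing = cookie_file_has_data("$cookie_file.does-not-exist");

print "empty=$empty filled=$filled missing=$missing\n";
```

Because -s is part of the language, the same check behaves identically under ActivePerl on Windows and under Perl on a Unix system.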

Thanks for the quick turnaround! Now it still errors out, but at a different location. When I try to log in, I get:

Error requesting Special%3AUserlogin: 403 Forbidden

When I turn on debug, just before it dies it tells me

Retrieving http://en.wikipedia.org/w/index.php?title=Special%3AUserlogin&action=edit

Of course I can enter this URL in my browser and not get a 403 error. Is this an incompatibility with Windows, or something else? Any ideas? – Quadell (talk) (random) 19:55, 25 May 2007 (UTC)

That's because the default user agent is blocked. You need to change it to something specific to your bot if you want to do anything. --ST47Talk 23:57, 25 May 2007 (UTC)
I'm not sure I understand. I'm just running login.pl from here. I don't see anywhere to set the user agent. Is this related to the line "my $editor = Perlwikipedia->new('Bot/WP/EN/E/ExtranetBot');" in this code you wrote? – Quadell (talk) (random) 02:32, 26 May 2007 (UTC)
I believe the command is $editor->{mech}->agent('w/e'); --ST47Talk 11:25, 27 May 2007 (UTC)

List?

Another question: Is there a list (or category) of bots using perlwikipedia? – Quadell (talk) (random) 17:54, 24 May 2007 (UTC)

I just created Category:Perlwikipedia bots. Shadow1 (talk) 18:58, 25 May 2007 (UTC)
Thanks! If my bot gets approved, I'll add it. – Quadell (talk) (random) 19:56, 25 May 2007 (UTC)

More tech support

Hi. Perlwikipedia looks like a great tool, and I'd love to use it, but I can't get it to work. The supposed test script, login.pl, does not seem to work as-is. (I get "Error requesting Special%3AUserlogin: 403 Forbidden".) ST47, above, suggested I add the line "$editor->{mech}->agent('w/e');" to specify the user agent. When I do that, I get this error: "There is no form named "userlogin" at C:/Perl/lib/Perlwikipedia.pm line 102. Died at C:/Perl/lib/WWW/Mechanize.pm line 1684."

If I can't get this to work, I'll have to find some other way to interface with Wikipedia. Any help anyone could provide would be greatly appreciated. (I'm using ActivePerl on a Windows box, by the way.) Thanks, – Quadell (talk) (random) 14:21, 28 May 2007 (UTC)

First, 'w/e' means 'whatever', so replace that with something descriptive. I usually use Bot/WP/EN/ST47/BotName. I don't know what that error means, but make sure you have the latest version of perlwikipedia and its dependencies. --ST47Talk 14:31, 28 May 2007 (UTC)
Unless you are using the passwordless login method that I described on the Google Code wiki, there is no reason you should need to use login.pl. It's a script designed to fetch the login data for your bot's account and place it into a file so that your bot can log into Wikipedia without using a password in cleartext. From what I've seen, the source code you're using should work perfectly fine if you insert the bot's password into the right place in the login() call. Shadow1 19:06, 30 May 2007 (UTC)
It may be that my modules (LWP, Mechanize, etc.) were not installed correctly. I'm investigating. – Quadell (talk) (random) 12:53, 31 May 2007 (UTC)
That was it. With LWP and Mechanize reinstalled, it works fine. Huzzah! – Quadell (talk) (random) 14:42, 31 May 2007 (UTC)

New problem. It logs in fine, but when attempting to get_text, on a Windows system, it puts itself in an endless loop. (It works fine on a *nix system.) My code looks like this:

use Perlwikipedia;
use strict;
my $pw=Perlwikipedia->new();
$pw->{debug} = 1;
$pw->{mech}->agent('Bot/WP/EN/Quadell/polbot');
my $login_status=$pw->login('Polbot','(my password)');
die "I can't log in." unless ($login_status eq 'Success');
my $html = $pw->get_text('User:Polbot');

The output on a Windows system (with debug on) is as follows:

Retrieving http://en.wikipedia.org/w/index.php?title=Special%3AUserlogin&action=edit
Login as "Polbot" succeeded.
Retrieving http://en.wikipedia.org/w/index.php?title=User%3APolbot&action=edit&oldid=&section=
Retrieving http://en.wikipedia.org/w/index.php?title=&action=edit
Retrieving http://en.wikipedia.org/w/index.php?title=&action=edit
Retrieving http://en.wikipedia.org/w/index.php?title=&action=edit
Retrieving http://en.wikipedia.org/w/index.php?title=&action=edit
. . .

It continues trying to load a page with no title specified until I cancel the program. This seems to be because m/var wgAction = "edit"/ doesn't match, so the until condition is never met. While debugging, I tried to print $res->content from within the get_text definition, and it seems to be complete gobbledygook. Is there an encoding problem, maybe? – Quadell (talk) (random) 15:58, 31 May 2007 (UTC)

Install the module Compress::Zlib. For some reason, the servers like to return gzip-compressed content, so installing this module should fix the last of your problems. Shadow1 (talk) 16:22, 31 May 2007 (UTC)
I installed Compress::Zlib, but it does the same thing. – Quadell (talk) (random) 17:35, 31 May 2007 (UTC)
The only other problem I can think of is that something is wrong with your installation of ActiveState/WWW::Mechanize, so that it does not properly decode the content. In the actual Perlwikipedia.pm file, change

use WWW::Mechanize;

to

use WWW::Mechanize::Gzip;

and

WWW::Mechanize->new( cookie_jar => {}, onerror => \&Carp::carp );

to

WWW::Mechanize::Gzip->new( cookie_jar => {}, onerror => \&Carp::carp );

Other than that, I really can't help you much more. Shadow1 (talk) 19:23, 1 June 2007 (UTC)

Actually, no, never mind that. The author of WWW::Mechanize recently removed support for decoding Gzipped content via content(), so make sure you're using the latest version of the module. It should be version 1.30. Update the module and you should be fine. Shadow1 (talk) 13:14, 2 June 2007 (UTC)
I have the latest WWW::Mechanize, v1.30. It's not a problem with Mechanize. The following code works as expected:
my $agent = WWW::Mechanize->new('polbot');
$agent->get("http://en.wikipedia.org/w/index.php?title=Main_page&action=view");
print ($agent->{content});
But this code hangs forever:
my $pw = Perlwikipedia->new();
$pw->{mech}->agent('Bot/WP/EN/Quadell/polbot');
print ($pw->get_text('Main page'));
I've reproduced this error on a different Windows box with a fresh ActivePerl and Mechanize install. As of now, it looks to me like Perlwikipedia does not work on ActivePerl on Windows. – Quadell (talk) (random) 02:12, 6 June 2007 (UTC)
OK, change all instances of "->content" in get_text to "->decoded_content" and see if that works. If it does, then it's some sort of odd problem with ActiveState's Mechanize, although I just tested Perlwikipedia on my Windows machine and it worked fine. Shadow1 (talk) 19:40, 6 June 2007 (UTC)
That worked! I'm befuddled as to why I have this problem and you don't, but I'm certainly glad to have a fix. Thanks for all your help! – Quadell (talk) (random) 19:53, 6 June 2007 (UTC)
I'm guessing that there are code differences between ActiveState's Mechanize and CPAN's, but that's water under the bridge. Thanks for helping to resolve this issue; I'll change the code accordingly and commit it to SVN. Shadow1 (talk) 20:33, 6 June 2007 (UTC)
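
The symptom in this thread (the wgAction pattern never matching, and the printed content looking like gobbledygook) is consistent with the pattern being applied to a body that is still gzip-compressed. A minimal sketch of that effect follows. It simulates a compressed response with core modules only and does not call WWW::Mechanize; the sample page text and variable names are made up for illustration.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Made-up page body containing the marker that get_text looks for.
my $page = qq{<script>var wgAction = "edit";</script>\n}
         . ("<p>filler text</p>\n" x 20);

# Simulate what a server sends with Content-Encoding: gzip.
gzip(\$page => \my $raw) or die "gzip failed: $GzipError";

# Matching against the raw (compressed) bytes fails...
my $raw_matches = ($raw =~ /var wgAction = "edit"/) ? 1 : 0;

# ...while matching against the decompressed text succeeds.
gunzip(\$raw => \my $decoded) or die "gunzip failed: $GunzipError";
my $decoded_matches = ($decoded =~ /var wgAction = "edit"/) ? 1 : 0;

print "raw body matches:     $raw_matches\n";
print "decoded body matches: $decoded_matches\n";
```

A loop that waits for the pattern to match will never terminate if it only ever sees the compressed bytes, which would explain the repeated "Retrieving" lines.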

New sub I created

Hey. I created a new sub that I use in my Perlwikipedia.pm. You might want to consider adding it to the official release. You pass in an image name, it returns an array of all articles that include the image (from the "File links" list).

=item get_file_links($pagename)

Returns array containing the pages that link to an image or other media.

=cut

sub get_file_links {
    my $self     = shift;
    my $pagename = shift;
    my $res      = $self->_get( $pagename, 'view' );
    unless ($res) { return; }
    unless ( $res->decoded_content =~ m/\(pages on other projects are not listed\):<\/div><\/p>\n<ul>(.*?)\n<\/ul>/s ) { return; }
    my $linklist = $1;
    my @articles = split( /\n/, $linklist );
    my @return;
    foreach my $article (@articles) {
        if ( $article =~ m/<li><a href=\"[^"]*\" title=\"([^"]*)\">/ ) {
            push( @return, $1 );
        }
    }
    return @return;
}

The behavior of what_links_here seems problematic to me. It is currently returning not only pages that link to the specified page, but also pages that link to redirects to the specified page. However, it doesn't return the first page that links to a redirect.

For example, look at Jill Gascoine and Jill Gascoigne. Jill Gascoigne is a redirect to Jill Gascoine. If I compare the what_links_here results for both pages, the results for Jill Gascoine include all of those for Jill Gascoigne except Morecambe and Wise, which is missing.

It seems to me that what_links_here should only return pages that actually link to the requested page. Returning links to redirects doesn't seem that useful as I would rather specifically request what_links_here on the redirect if that's what I want, but perhaps I'm overlooking something.

So, I recommend either that:

  1. what_links_here be fixed to return the first page linking to a redirect; or
  2. what_links_here's Special:Whatlinkshere screen-scrape be replaced with a call to api.php which only returns direct links.

The benefit of the second is that api.php also supports filtering by namespace which would be convenient in some applications.

If there is interest in the api.php approach, I am willing to write the patch for that. -- JLaTondre 19:41, 4 August 2007 (UTC)
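
A rough sketch of what the api.php request in option 2 could look like. It only builds the query URL and does not fetch anything, and it is not the patch offered above. The helper names backlinks_url and escape_param are hypothetical, and the bllimit and format values are illustrative choices; list=backlinks, bltitle, blnamespace and bllimit are parameters of the MediaWiki API's backlinks query.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode a value as UTF-8 bytes.
sub escape_param {
    my ($value) = @_;
    my $bytes = "$value";
    utf8::encode($bytes);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# Hypothetical helper: build a backlinks query for $title, optionally
# restricted to one namespace number.
sub backlinks_url {
    my ($api, $title, $namespace) = @_;
    my @params = (
        action  => 'query',
        list    => 'backlinks',
        bltitle => $title,
        bllimit => 500,        # illustrative batch size
        format  => 'xml',      # illustrative output format
    );
    push @params, (blnamespace => $namespace) if defined $namespace;

    my @pairs;
    while (my ($key, $value) = splice(@params, 0, 2)) {
        push @pairs, "$key=" . escape_param($value);
    }
    return "$api?" . join('&', @pairs);
}

my $url = backlinks_url('http://en.wikipedia.org/w/api.php', 'Jill Gascoine', 0);
print "$url\n";
```

Passing a namespace number covers the filtering benefit mentioned above; leaving it undefined omits the blnamespace parameter.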

It should work now. Shadow1 (talk) 13:44, 25 August 2007 (UTC)
Thanks. -- JLaTondre 23:56, 26 August 2007 (UTC)

CPAN

I started using this module and it looks fine. Of all the Perl bot frameworks I tried, this is the first that I could install and make do the right thing pretty quickly. Thanks for your work!

A question: Is there a reason you don't host this module on CPAN? CPAN is the natural place to look for Perl code, but anyone who searches CPAN for "MediaWiki" today finds the module of that name, which has impressive documentation, but appears to be unmaintained. Finding your framework on Wikipedia wasn't so trivial. --Amir E. Aharoni (talk) 15:47, 2 June 2008 (UTC)

Unicode

I'm running Ubuntu with Perl 5.8 in an all-UTF-8 environment. Perlwikipedia (today's SVN) seems to assume the terminal runs Latin-1. --LA2 (talk) 22:12, 11 July 2008 (UTC)

Categories

The function get_pages_in_category() seems to retrieve a web page and follow the "next 200" link. This link of course has a different label in other language editions of Wikipedia. The bot should be able to use an API call instead to retrieve the full list of category members. See http://en.wikipedia.org/w/api.php for documentation. --LA2 (talk) 22:15, 11 July 2008 (UTC)
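
A sketch of the language-independent paging this suggests: rather than following a link by its label, keep requesting batches until the response carries no continuation token. The fetcher is injected as a code reference and is faked here, so nothing is downloaded. The function name all_category_members, the batch structure, and the token values are hypothetical; a real fetcher would presumably issue a list=categorymembers query and pass the token back in the API's continuation parameter.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every member title by calling $fetch
# repeatedly until it stops returning a continuation token.
sub all_category_members {
    my ($fetch, $category) = @_;
    my (@titles, $continue);
    do {
        my $batch = $fetch->($category, $continue);
        push @titles, @{ $batch->{titles} };
        $continue = $batch->{continue};
    } while (defined $continue);
    return @titles;
}

# Fake fetcher standing in for the HTTP request: two batches, the first
# of which points at the second through a made-up token.
my @fake_batches = (
    { titles => ['Page A', 'Page B'], continue => 'token-1' },
    { titles => ['Page C'],           continue => undef },
);
my %batch_for = ('' => 0, 'token-1' => 1);
my $fake_fetch = sub {
    my ($category, $continue) = @_;
    my $key = defined $continue ? $continue : '';
    return $fake_batches[ $batch_for{$key} ];
};

my @members = all_category_members($fake_fetch, 'Category:Perlwikipedia bots');
print join("\n", @members), "\n";
```

Since the loop depends only on the token, it does not care how the "next 200" link is labelled on a given wiki.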

German Umlaute

If I use German umlauts in a MediaWiki article, I get page links like Zweidimensionale_H%C3%A4ufigkeitsverteilung_-_Zweidimensionale_H%C3%A4ufigkeitstabellen. If I pass this to get_text, I get empty content back. In a browser the link works; any idea what I can do? I extracted the link with Perl from the HTML of Special:Allpages. -- sigbert 14:50, 21 Aug 2008

I found a solution by modifying Perlwikipedia.pm. In sub new I added $self->{getesc}=0; and in sub _get I replaced the line my $no_escape = shift || 0; with my $no_escape = shift || $self->{getesc};. That lets me control from outside whether uri_escape_utf8 is applied or not. --Sigbert (talk) 12:30, 17 September 2008 (UTC)
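
One plausible reading of this problem (an assumption, not confirmed in the thread) is that a title copied from the HTML is already percent-encoded, so escaping it a second time turns every % into %25 and the request points at a page that does not exist. A minimal sketch of that effect, and of decoding the title first as an alternative to switching escaping off. The two helpers are hand-written stand-ins for illustration; they are not the URI::Escape functions that Perlwikipedia itself uses.

```perl
use strict;
use warnings;

# Stand-in escape: percent-encode everything except unreserved
# characters, working on the UTF-8 bytes of the text.
sub escape_utf8 {
    my ($text) = @_;
    my $bytes = "$text";
    utf8::encode($bytes);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# Stand-in unescape: turn %XX sequences back into bytes, then decode
# those bytes as UTF-8 to get a character string.
sub unescape_utf8 {
    my ($text) = @_;
    my $bytes = "$text";
    utf8::encode($bytes);
    $bytes =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    utf8::decode($bytes);
    return $bytes;
}

my $from_html = 'H%C3%A4ufigkeitsverteilung';   # as extracted from a link

# Escaping the already-escaped title encodes the percent signs again.
my $double = escape_utf8($from_html);

# Decoding first and then escaping yields the original encoded form.
my $single = escape_utf8(unescape_utf8($from_html));

print "double: $double\n";
print "single: $single\n";
```

Decoding the extracted link once before handing it over would avoid the double encoding while leaving the module's normal escaping in place.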

Dagothbot

At one of my wikis, I would like to develop a bot called DagothBot. I have XAMPP installed. Can Perlwikipedia work with the "perl.exe" file in XAMPP or do I need to download Perl from perl.org? I use x10Hosting to host the wiki. Dagoth Ur, Mad God 09:48, 18 September 2008 (UTC)