User:Topbanana/Reports/Scripts/Create Link Analysis Database

Overview

The downloadable database dumps of Wikipedia contain only the current text of articles and, in some cases, their history. The script below analyses the text of articles in these dumps to create a number of tables describing the Wikipedia:wikilinks they contain.

The links tables can then be used to carry out further analyses such as identifying Wikipedia:Orphaned articles.

Preparing a local Wikipedia database

Firstly, download and install a copy of the free, open-source MySQL database. Version 4.1 contains all of the features required to run the script below and can be downloaded from here.

Now download a dump of the Wikipedia database from here - the 'cur' downloads contain all the information needed by the script below. Unzip the downloaded file - you should end up with a single file with a name ending in ".sql".

To load the contents of this file into MySQL, you will first need to create and use a database:

mysql> create database wikipedia;
Query OK, 1 row affected (0.06 sec)
mysql> use wikipedia;
Database changed

Now the downloaded and unzipped database dump can be entered into your database:

mysql> source c:\20040613_cur_table.sql;

The file will take several hours to load. While waiting, log on to wikipedia and improve some articles.

Create required subscripts

The script below makes use of four other scripts to perform repetitive tasks. You will need to create and save these - instructions on how to do this can be found on each of these pages:

extract_links.sql
extract_nowiki.sql
extract_math.sql
extract_HTML_comment.sql

The script below expects to find these in the directory "C:\" - if you save them to a different location, you will need to update your copy of the script accordingly. The lines you will need to modify all begin with the word "SOURCE".

What the script does

Once you have completed these steps, the script below can be cut and pasted into your MySQL client window. It will take about 12 hours to run on a modern desktop PC and requires about 2 GB of disk space.

The script first copies relevant articles from the 'cur' table (created by the database dump file you downloaded) into a new table 'art'. This cuts down the amount of work the script needs to do later, making it faster. It also ensures that only one article exists with any given title. Articles are marked as being 'stubs', 'redirects', 'disambiguation pages' or, if they are none of these, just plain 'articles'.
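The classification heuristics described above can be sketched as follows. This is a hypothetical Python helper, not part of the script itself; it simply mirrors the `locate()` and `LIKE` tests the SQL applies to each article.

```python
def classify_article(title: str, text: str, is_redirect: bool) -> dict:
    """Mirror the script's per-article heuristic flags.

    title is assumed to use underscores in place of spaces, as in the
    'cur' table; is_redirect is taken directly from the dump's
    cur_is_redirect column.
    """
    return {
        "is_stub": "stub}}" in text,          # matches any '...stub}}' template
        "is_redirect": is_redirect,
        "is_disambig": ("{{msg:disambig}}" in text
                        or "{{disambig}}" in text
                        or title.endswith("_(disambiguation)")
                        or title.endswith("_(disambig)")),
        "is_copyvio": "copyvio" in text,
        "is_list": title.startswith("List_of_") or title.startswith("Table_of_"),
    }
```

Any article for which every flag is false is treated as a plain article.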

Next, the text of articles in the 'art' table is analysed. Sections of articles marked as being 'special' - for example being within nowiki tags - are identified and details stored in the 'special_text' table. Once done, the text of the articles is analysed once more and all wikilinks identified. These are stored in the 'raw_links' table.

The raw links are matched back to article titles where possible. A number of different transformations of the links are carried out - for example removing repeated spaces and some special character codes - to ensure as many links as possible are matched. Links for which a target is found are moved into the 'good_links' table. Unmatched links end up in 'bad_links' and any links which could not be processed - for example because they were badly formed or were interlanguage links - end up in 'unhandled_links'.

Once all this is done, summaries of the number and types of links are calculated. Summaries of good links are stored on the 'art' table. Unmatched links (those in the 'bad_links' table) are summarised to 'bad_link_summary'.
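The core extract-and-normalise step can be sketched in Python. This is an illustrative approximation only (the real work is done in SQL below, with the special-text and character-code handling applied separately): find each `[[...]]` pair, drop any `|description`, trim, collapse runs of spaces to single underscores, and capitalise the first letter.

```python
import re

def extract_links(text: str) -> list:
    """Return normalised link targets found in wikitext, roughly as the
    script's substring/replace pipeline does."""
    links = []
    for match in re.finditer(r"\[\[(.*?)\]\]", text, re.DOTALL):
        target = match.group(1).split("|", 1)[0]   # 'Greeting|Hello' -> 'Greeting'
        target = target.strip()                    # ' Green ' -> 'Green'
        target = re.sub(r" +", "_", target)        # runs of spaces -> one underscore
        if target:
            target = target[0].upper() + target[1:]  # capitalise first letter
        links.append(target)
    return links
```

For example, `extract_links("See [[ hello  there ]].")` yields `["Hello_there"]`, which is then looked up against article titles.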


The script


------------------------------------------------------------------------------
--            Script to create Wikipedia reporting database                 --
------------------------------------------------------------------------------
--  This script recreates a set of 'links' tables by analysing the text of  --
--  articles stored in the 'cur' table.  It only deals with articles in the --
--  main 'en' namespace (namespace 0), and so works best with a database    --
--  containing only the 'en cur' Wikipedia database dump                    --
------------------------------------------------------------------------------
--  No warranty express or implied.  Queries, complaints, and praise to     --
--  User_talk:Topbanana please                                              --
------------------------------------------------------------------------------

-- Create a table containing just the articles we're interested in.  This makes
-- later processing slightly more efficient at the expense of disk space and
-- allows us to precalculate a number of interesting properties of articles.

CREATE TABLE art (
    -- Basic article details
    art_id mediumint(7) unsigned NOT NULL,
    art_title varchar(255) binary NOT NULL,
    art_text mediumtext NOT NULL,
    
    -- Precalculated details about this article
    art_len mediumint(6) unsigned NOT NULL,
    art_is_stub tinyint(1) unsigned NOT NULL,
    art_is_redirect tinyint(1) unsigned NOT NULL,
    art_is_disambig tinyint(1) unsigned NOT NULL,
    art_is_copyvio tinyint(1) unsigned NOT NULL,
    art_is_list tinyint(1) unsigned NOT NULL,
    
    -- Summary counts of links
    links_from mediumint(6) unsigned NOT NULL default 0, 
    good_links_from mediumint(6) unsigned NOT NULL default 0, 
    bad_links_from mediumint(6) unsigned NOT NULL default 0, 
    unhandled_links_from mediumint(6) unsigned NOT NULL default 0, 
    links_to mediumint(6) unsigned NOT NULL default 0, 
	
    PRIMARY KEY( art_id ),
    UNIQUE INDEX( art_title ),
    INDEX( art_len ),
    INDEX( art_is_stub ),
    INDEX( art_is_redirect ),
    INDEX( art_is_disambig )
) ENGINE=MyISAM;


-- Locking tables before large inserts is much faster
LOCK TABLES art WRITE, cur READ;

-- It has been the case that the 'cur' table contains a number of articles with
-- different id numbers but identical titles.  This would appear to be incorrect.
-- We need to strip out the bogus records here to avoid getting spurious entries on
-- reports later on.  The 'REPLACE INTO' syntax works just like 'INSERT INTO' only
-- it will overwrite existing records where there is a unique key violation.
--
-- This statement should take ~10 mins to run on a desktop PC.

REPLACE INTO art ( art_id, art_title, art_text, art_len, art_is_stub, 
                   art_is_redirect, art_is_disambig, art_is_copyvio, art_is_list )
SELECT cur_id, cur_title, cur_text, Length( cur_text ),
       if( locate( 'stub}}', cur_text ) > 0, 1, 0),
       cur_is_redirect, 
       if( locate( '{{msg:disambig}}', cur_text ) > 0, 1, 0),
       if( locate( 'copyvio', cur_text ) > 0, 1, 0),
       if( cur_title LIKE 'List_of_%' OR cur_title LIKE 'Table_of_%', 1, 0 )
FROM cur
WHERE cur_namespace = 0;

-- Alternatives for the stub and disambig messages

UPDATE art
SET art_is_disambig = 1
WHERE locate( '{{disambig}}', art_text ) > 0;

UPDATE art
SET art_is_disambig = 1
WHERE art_title LIKE '% (disambiguation)';

UPDATE art
SET art_is_disambig = 1
WHERE art_title LIKE '% (disambig)';


UNLOCK TABLES;

-- Okay, on to links.  This is again a multi-stage process.
--
-- Links tables are:
--   raw_links         - links we are processing
--   unhandled_links   - links we are unable to process automatically
--   good_links        - links we have matched to a target article
--   bad_links         - links we can't find a target article for


CREATE TABLE raw_links (
    pos mediumint(5) unsigned NOT NULL,
    from_id mediumint(7) unsigned NOT NULL,
    to_id mediumint(7) unsigned NOT NULL default 0,
    lstart mediumint(6) unsigned NOT NULL,
    lend mediumint(6) unsigned NOT NULL default 0,
    link varchar(255) binary NOT NULL default '',
    PRIMARY KEY ( pos, from_id )
) ENGINE=MyISAM;

CREATE TABLE unhandled_links (
    pos mediumint(5) unsigned NOT NULL,
    from_id mediumint(7) unsigned NOT NULL,
    to_id mediumint(7) unsigned NOT NULL,
    lstart mediumint(6) unsigned NOT NULL,
    lend mediumint(6) unsigned NOT NULL,
    link varchar(255) binary NOT NULL,
    reason varchar(255) binary NOT NULL,
    PRIMARY KEY ( pos, from_id )
) ENGINE=MyISAM;

CREATE TABLE good_links (
    from_id mediumint(7) unsigned NOT NULL,
    to_id mediumint(7) unsigned NOT NULL,
    count smallint(5) NOT NULL,
    PRIMARY KEY ( from_id, to_id )
) ENGINE=MyISAM;

CREATE TABLE bad_links (
    from_id mediumint(7) unsigned NOT NULL,
    link varchar(255) binary NOT NULL,
    count smallint(5) NOT NULL
) ENGINE=MyISAM;


-- First the position of each link start (indicated by [[ in the article text)
-- is noted.  This is done using repeated queries and a 'position' column to
-- note the number of the link within each page.  To the best of my knowledge
-- MySQL has no loop constructs available in 4.1 which allow this to be done
-- more neatly, but I'm happy to be proven wrong if anyone knows better.

-- This statement and the extract_links script should take about 30 minutes
-- to run on a desktop PC.

LOCK TABLES raw_links WRITE, art READ, raw_links as rl2 READ;

INSERT INTO raw_links (pos, from_id, lstart)
SELECT 1, art_id, locate( '[[', art_text, 1 ) + 2
FROM art
WHERE locate( '[[', art_text, 1 ) > 0;

-- Search for the second and subsequent links.  This script handles up to 10,000 links
-- per article.

SOURCE c:\\extract_links.sql

UNLOCK TABLES;
ANALYZE TABLE raw_links;

-- Okay, now find the end of each link ( the next ]] after each [[ ).  This'll take
-- about 20 minutes on a desktop PC.

LOCK TABLES raw_links WRITE, art WRITE;

UPDATE raw_links, art
SET raw_links.lend = locate( ']]', art.art_text, raw_links.lstart )
WHERE raw_links.from_id = art.art_id;

UNLOCK TABLES;


-- Now carry out further analysis of text to identify blocks enclosed by important tags

CREATE TABLE special_classes (
   special_id tinyint(3) unsigned NOT NULL,
   link varchar(255) binary NOT NULL,
   PRIMARY KEY ( special_id )
) ENGINE=MyISAM; 
   

INSERT INTO special_classes VALUES ( 0, 'nowiki' );
INSERT INTO special_classes VALUES ( 1, 'math' );
INSERT INTO special_classes VALUES ( 2, 'HTML comment' );
INSERT INTO special_classes VALUES ( 3, 'external link' );
INSERT INTO special_classes VALUES ( 4, 'HTML references' );


CREATE TABLE special_text (
    special_id tinyint(3) unsigned NOT NULL,
    art_id mediumint(7) unsigned NOT NULL,
    pos smallint(5) unsigned NOT NULL,	
    start mediumint(6) unsigned NOT NULL,
    end mediumint(6) unsigned NOT NULL default 0,
    PRIMARY KEY ( special_id, art_id, pos )
) ENGINE=MyISAM;


-- Text inside <nowiki> </nowiki> tag pairs

INSERT INTO special_text ( special_id, art_id, pos, start )
SELECT 0, art_id, 1, locate( '<nowiki>', art_text, 1 ) + 8
FROM art
WHERE locate( '<nowiki>', art_text, 1 ) > 0;

SOURCE c:\\extract_nowiki.sql;

UPDATE special_text, art
SET special_text.end = locate( '</nowiki>', art.art_text, special_text.start )
WHERE special_text.art_id = art.art_id
AND   special_id = 0;


-- Text inside <math> </math> tag pairs

INSERT INTO special_text ( special_id, art_id, pos, start )
SELECT 1, art_id, 1, locate( '<math>', art_text, 1 ) + 6
FROM art
WHERE locate( '<math>', art_text, 1 ) > 0;

SOURCE c:\\extract_math.sql;

UPDATE special_text, art
SET special_text.end = locate( '</math>', art.art_text, special_text.start )
WHERE special_text.art_id = art.art_id
AND   special_id = 1;

-- Text inside HTML comments <!--  -->

INSERT INTO special_text ( special_id, art_id, pos, start )
SELECT 2, art_id, 1, locate( '<!--', art_text, 1 ) + 4
FROM art
WHERE locate( '<!--', art_text, 1 ) > 0;

SOURCE c:\\extract_HTML_comment.sql;

UPDATE special_text, art
SET special_text.end = locate( '-->', art.art_text, special_text.start )
WHERE special_text.art_id = art.art_id
AND   special_id = 2;


-- Unclosed tags run to the end of the article

select art_title, special_text.*
from art, special_text
where art.art_id = special_text.art_id
and start > end;

update special_text, art
set end = art_len
where art.art_id = special_text.art_id
and start > end;

ALTER TABLE special_text ADD INDEX( art_id );

-- Okay, back to the link analysis

-- Detect obviously anomalous links and move them into the unhandled links table before
-- proceeding

LOCK TABLES raw_links WRITE, unhandled_links WRITE, special_text READ;

INSERT INTO unhandled_links
SELECT *, 'end <= start'
FROM raw_links
WHERE lend <= lstart;

DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id );

INSERT INTO unhandled_links
SELECT *, 'Length > 255'
FROM raw_links
WHERE ( lend - lstart ) > 255;

DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id );

INSERT INTO unhandled_links
SELECT *, 'All in HTML comment'
FROM raw_links
WHERE EXISTS (
  SELECT 1 
  FROM special_text 
  WHERE special_id = 2
  AND from_id = art_id
  AND lstart > start
  AND lend <= end );

DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id );

INSERT INTO unhandled_links
SELECT *, 'Starts in HTML comment'
FROM raw_links
WHERE EXISTS (
  SELECT 1 
  FROM special_text 
  WHERE special_id = 2
  AND from_id = art_id
  AND lstart > start
  AND lstart <= end );

DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id );

INSERT INTO unhandled_links
SELECT *, 'Ends in HTML comment'
FROM raw_links
WHERE EXISTS (
  SELECT 1 
  FROM special_text 
  WHERE special_id = 2
  AND from_id = art_id
  AND lend > start
  AND lend <= end );

DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id );

INSERT INTO unhandled_links
SELECT *, 'All in nowiki'
FROM raw_links
WHERE EXISTS (
  SELECT 1 
  FROM special_text 
  WHERE special_id = 0
  AND from_id = art_id
  AND lstart > start
  AND lend <= end );

DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id );

INSERT INTO unhandled_links
SELECT *, 'Starts in nowiki'
FROM raw_links
WHERE EXISTS (
  SELECT 1 
  FROM special_text 
  WHERE special_id = 0
  AND from_id = art_id
  AND lstart > start
  AND lstart <= end );

DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id );

INSERT INTO unhandled_links
SELECT *, 'Ends in nowiki'
FROM raw_links
WHERE EXISTS (
  SELECT 1 
  FROM special_text 
  WHERE special_id = 0
  AND from_id = art_id
  AND lend > start
  AND lend <= end );

DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id );

INSERT INTO unhandled_links
SELECT *, 'All in math'
FROM raw_links
WHERE EXISTS (
  SELECT 1 
  FROM special_text 
  WHERE special_id = 1
  AND from_id = art_id
  AND lstart > start
  AND lend <= end );

DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id );

INSERT INTO unhandled_links
SELECT *, 'Starts in math'
FROM raw_links
WHERE EXISTS (
  SELECT 1 
  FROM special_text 
  WHERE special_id = 1
  AND from_id = art_id
  AND lstart > start
  AND lstart <= end );

DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id );

INSERT INTO unhandled_links
SELECT *, 'Ends in math'
FROM raw_links
WHERE EXISTS (
  SELECT 1 
  FROM special_text 
  WHERE special_id = 1
  AND from_id = art_id
  AND lend > start
  AND lend <= end );

DELETE FROM raw_links WHERE EXISTS ( SELECT 1 FROM unhandled_links WHERE unhandled_links.pos = raw_links.pos AND unhandled_links.from_id = raw_links.from_id );


UNLOCK TABLES;

-- Okay now extract the text for the target of each link.  A number of manipulations
-- need to be carried out during this process, including:
--
--   Removing any descriptive text ( 'Greeting|Hello' -> 'Greeting' )
--   Trimming leading and trailing spaces ( ' Green ' -> 'Green' )
--   Collapsing pairs of spaces ( 'hello  there' -> 'hello there' )
--   Converting spaces to underscores
--
-- A selection of other manipulations are carried out later on.
-- This'll take about 20 minutes on a desktop PC

LOCK TABLES raw_links WRITE, art WRITE;

UPDATE raw_links, art
SET LINK = replace( replace( trim( substring_index( mid( art_text, lstart, lend-lstart), '|', 1 )), ' ', '_' ), '__', '_' )
WHERE from_id = art_id;

UNLOCK TABLES;

-- Tidy up section links (links to locations within a certain page).  Where a link is
-- to a specific section within another article, we'll pretend it's a plain link, ie
-- we'll transform 'human_being#anatomy' into just 'human_being'.  Links starting with
-- a hash (#) are to another section within the same article, we'll mark those as
-- unhandleable for now.

INSERT INTO unhandled_links
SELECT *, 'Local section link'
from raw_links
WHERE link LIKE '#%';

DELETE FROM raw_links
WHERE link LIKE '#%';

-- Articles cannot have square brackets in their titles

INSERT INTO unhandled_links
SELECT *, 'Contains a square bracket'
from raw_links
WHERE link LIKE '%[%';

DELETE FROM raw_links
WHERE link LIKE '%[%';

INSERT INTO unhandled_links
SELECT *, 'Contains a square bracket'
from raw_links
WHERE link LIKE '%]%';

DELETE FROM raw_links
WHERE link LIKE '%]%';

-- MediaWiki seems to treat multiple spaces in links as one.  We need to carry out
-- extra processing to handle the few articles that contain more than two spaces here.

UPDATE raw_links SET link = replace( link, '__', '_' );
UPDATE raw_links SET link = replace( link, '__', '_' );
UPDATE raw_links SET link = replace( link, '__', '_' );

-- Capitalize the first letter of each link

UPDATE raw_links
SET link = concat(upper(left( link, 1 )),mid( link, 2 ));

-- The above operations could have caused the raw_links table to become quite
-- fragmented.  Optimising the table here is worth the cost.

OPTIMIZE TABLE raw_links;

-- Okay, now look for article titles matching the text of each link.  This is just a
-- first pass at the process, we'll carry out further manipulations of the text of those
-- links we can't find a match for later.
-- This takes quite a while to run - about 11 hours - on my own database server.  No idea
-- why, there's no data moving involved and a well-ordered full index on art.art_title.

ANALYZE TABLE raw_links;

LOCK TABLES raw_links WRITE, art READ;

UPDATE raw_links
SET to_id = ( select art_id from art where art_title = link );

UNLOCK TABLES;

-- As the above should have handled the vast majority of links, we'll create indexes
-- now and analyze to ensure stats are up to date

ALTER TABLE raw_links 
    ADD INDEX( link(8) ),
    ADD INDEX( to_id ),
    ADD INDEX( from_id );

ANALYZE TABLE raw_links;

-- The next thing to do is convert percent-encoded character codes in links we
-- couldn't match to their ASCII equivalents.  This isn't an exhaustive list by
-- any means, but catches a few hundred links we would otherwise miss.

UPDATE raw_links SET link = replace( link, '%27', '\'') WHERE to_id = 0 AND locate( '%27', link ) > 0;
UPDATE raw_links SET link = replace( link, '%2C', ',')  WHERE to_id = 0 AND locate( '%2C', link ) > 0;
UPDATE raw_links SET link = replace( link, '%3F', '?')  WHERE to_id = 0 AND locate( '%3F', link ) > 0;
UPDATE raw_links SET link = replace( link, '%92', '’')  WHERE to_id = 0 AND locate( '%92', link ) > 0;
UPDATE raw_links SET link = replace( link, '%28', '(')  WHERE to_id = 0 AND locate( '%28', link ) > 0;
UPDATE raw_links SET link = replace( link, '%29', ')')  WHERE to_id = 0 AND locate( '%29', link ) > 0;

UPDATE raw_links, art
SET to_id = art_id
WHERE link = art_title
AND to_id = 0;

-- Now remove interwiki links

INSERT INTO unhandled_links
SELECT *, 'Interwiki'
FROM raw_links
WHERE link LIKE '__:%'
AND to_id = 0;

DELETE FROM raw_links
WHERE link LIKE '__:%'
AND to_id = 0;

-- HTML character codes

UPDATE raw_links SET link = replace( link, '&Aacute;', 'Á') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&aacute;', 'á') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Acirc;', 'Â') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&acirc;', 'â') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&AElig;', 'Æ') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&aelig;', 'æ') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Agrave;', 'À') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&agrave;', 'à') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Aring;', 'Å') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&aring;', 'å') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Atilde;', 'Ã') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&atilde;', 'ã') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Auml;', 'Ä') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&auml;', 'ä') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Ccedil;', 'Ç') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&ccedil;', 'ç') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&divide;', '÷') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Eacute;', 'É') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&eacute;', 'é') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Ecirc;', 'Ê') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&ecirc;', 'ê') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Egrave;', 'È') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&egrave;', 'è') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&ETH;', 'Ð') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&eth;', 'ð') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Euml;', 'Ë') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&euml;', 'ë') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&frac12;', '½') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&frac14;', '¼') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&frac34;', '¾') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Iacute;', 'Í') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&iacute;', 'í') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Icirc;', 'Î') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&icirc;', 'î') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Igrave;', 'Ì') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&igrave;', 'ì') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&iquest;', '¿') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Iuml;', 'Ï') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&iuml;', 'ï') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Ntilde;', 'Ñ') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&ntilde;', 'ñ') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Oacute;', 'Ó') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&oacute;', 'ó') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Ocirc;', 'Ô') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&ocirc;', 'ô') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Ograve;', 'Ò') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&ograve;', 'ò') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Oslash;', 'Ø') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&oslash;', 'ø') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Otilde;', 'Õ') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&otilde;', 'õ') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Ouml;', 'Ö') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&ouml;', 'ö') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&szlig;', 'ß') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&THORN;', 'Þ') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&thorn;', 'þ') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&times;', '×') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Uacute;', 'Ú') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&uacute;', 'ú') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Ucirc;', 'Û') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&ucirc;', 'û') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Ugrave;', 'Ù') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&ugrave;', 'ù') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Uuml;', 'Ü') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&uuml;', 'ü') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&Yacute;', 'Ý') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&yacute;', 'ý') WHERE to_id = 0;
UPDATE raw_links SET link = replace( link, '&yuml;', 'ÿ') WHERE to_id = 0;

UPDATE raw_links, art
SET to_id = art_id
WHERE link = art_title
AND to_id = 0;

-- Show remaining HTML-escaped codes in links
-- SELECT distinct mid( link, locate( '&', link ) + 1, locate( ';', link, locate( '&', link ) ) - locate( '&', link ) - 1 )
-- FROM raw_links
-- WHERE to_id = 0
-- AND link LIKE '%&%;%'
-- ORDER BY 1;

-- Show remaining HTML-escaped codes in article titles
-- SELECT distinct mid( art_title, locate( '&', art_title ) + 1, locate( ';', art_title, locate( '&', art_title ) ) - locate( '&', art_title ) - 1 )
-- FROM art
-- WHERE art_title LIKE '%&%;%'
-- ORDER BY 1;

-- Mark any links still containing HTML-escaped characters as unhandleable

INSERT INTO unhandled_links
SELECT *, 'Link contains an HTML character code'
FROM raw_links
WHERE link like '%&%;%'
AND to_id = 0;

DELETE FROM raw_links
WHERE link like '%&%;%'
AND to_id = 0;

-- Remaining links containing a colon (:) are probably to another namespace, for example
-- 'Wikipedia:What links here'.  Mark these as unhandleable for now.

INSERT INTO unhandled_links
SELECT *, 'Link contains a :'
FROM raw_links
WHERE link like '%:%'
AND to_id = 0;

DELETE FROM raw_links
WHERE link like '%:%'
AND to_id = 0;

-- Remaining links containing a hash (#) are probably section links - flatten them

UPDATE raw_links
SET link = substring_index( link, '#', 1 )
WHERE link LIKE '%#%'
and to_id = 0;

UPDATE raw_links, art
SET to_id = art_id
WHERE link = art_title
AND to_id = 0;

-- Okay, we've matched all we can match.  Pick out good and bad links into nice tidy
-- tables to make reporting easy.

LOCK TABLES good_links WRITE, raw_links READ;

INSERT INTO good_links
SELECT from_id, to_id, count(*)
FROM raw_links
WHERE to_id > 0
GROUP BY from_id, to_id;

UNLOCK TABLES;

ALTER TABLE good_links add INDEX( to_id );
ANALYZE TABLE good_links;

LOCK TABLES bad_links WRITE, raw_links READ;

INSERT INTO bad_links
SELECT from_id, link, count(*)
FROM raw_links
WHERE to_id = 0
GROUP BY from_id, link;

UNLOCK TABLES;


ALTER TABLE bad_links
              ADD INDEX( from_id ),
              ADD INDEX( link(8) );
ANALYZE TABLE bad_links;

-- Now fill in a few summary statistics to aid in the generation of reports

ALTER TABLE unhandled_links ADD INDEX( from_id );
ANALYZE TABLE unhandled_links;


LOCK TABLES good_links READ, art WRITE;

UPDATE art
SET art.links_to = ( SELECT count(*) FROM good_links WHERE to_id = art_id );

UPDATE art
SET art.good_links_from = ( SELECT count(*) FROM good_links WHERE from_id = art_id );

UNLOCK TABLES;

LOCK TABLES bad_links READ, art WRITE;

UPDATE art
SET art.bad_links_from = ( SELECT count(*) FROM bad_links WHERE from_id = art_id );

UNLOCK TABLES;

LOCK TABLES unhandled_links READ, art WRITE;

UPDATE art
SET art.unhandled_links_from = ( SELECT count(*) FROM unhandled_links WHERE from_id = art_id );

UNLOCK TABLES;

LOCK TABLES art WRITE;

UPDATE art
SET links_from = ( good_links_from + bad_links_from + unhandled_links_from );

UNLOCK TABLES;

ALTER TABLE art
  ADD INDEX( links_from ),
  ADD INDEX( good_links_from ),
  ADD INDEX( bad_links_from ),
  ADD INDEX( unhandled_links_from ),
  ADD INDEX( links_to );

-- Summary table of bad links

CREATE TABLE bad_link_summary (
    link varchar(255) binary NOT NULL,
    count smallint(5) NOT NULL,
    PRIMARY KEY ( link ),
    INDEX( count )
) ENGINE=MyISAM;
 

LOCK TABLES bad_link_summary WRITE, bad_links READ;

INSERT INTO bad_link_summary
SELECT link, count(*)
FROM bad_links
GROUP BY link;

UNLOCK TABLES;

OPTIMIZE TABLE unhandled_links;


Category:Wikipedia resources for researchers