User:MZMcBride/climax

climax is the code name for a project that gathers and analyzes a set of attributes of biographies of living people in an attempt to programmatically find problematic biographies.

Attributes

A number of attributes of the pages are collected using Python scripts and are inserted into an SQLite database. The database will be released to the public (with the exception of one attribute—the number of page watchers). Below is a table of the raw attributes collected. Other attributes will be derived from this data.

Attribute	Source	Description
Page ID	Database dump	Numeric page ID
Page title	Database dump	Page title
Page length	Database dump	Length of the page in bytes
Bad words count	Database dump	Total number of instances of bad words
Bad words within 50 bytes of an inline references tag	Database dump	Total number of bad words that are within 50 bytes of an inline <ref> tag
Bad words within 50 bytes of an inline "citation needed" tag	Database dump	Total number of bad words that are within 50 bytes of an inline {{citation needed}} tag
Inline references tag count	Database dump	Total number of instances of "<ref"
Inline "citation needed" tag count	Database dump	Total number of instances of "{{citation needed}}" (and its redirects)
Banner "citation needed" count	Database dump	Total number of instances of "{{BLP unsourced}}" (and its redirects)
External hyperlink count	Database dump	Total number of instances of "http"
Page views	Compiled text file	Total number of page views (in November 2009)
Days since last edit	Replicated database	Number of days since the page was edited most recently
Days since first edit	Replicated database	Number of days since page creation
Revisions count	Replicated database	Total number of revisions for the page
Page watchers count	Replicated database	Number of users with the page on their watchlist
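
The two "within 50 bytes" attributes can be illustrated with a short sketch. It is not the project's actual scanner: the two-entry word list, the helper name count_bad_words_near, case-insensitive matching, and measuring distance between the nearest edges of the two matches are all assumptions made for the example.

```python
import re

# Two patterns copied from the full bad-word list; the real list is much longer.
BAD_WORD_PATTERNS = [rb"fraud\b", rb"\barrested\b"]
BAD_WORDS = re.compile(b"|".join(BAD_WORD_PATTERNS), re.IGNORECASE)
REF_TAG = re.compile(rb"<ref", re.IGNORECASE)


def count_bad_words_near(text, marker, window=50):
    """Count bad-word matches lying within `window` bytes of any marker match.

    Assumption: distance is the byte gap between the nearest edges of the
    bad word and the marker, measured on the UTF-8 encoded text.
    """
    data = text.encode("utf-8")
    marker_spans = [m.span() for m in marker.finditer(data)]
    count = 0
    for word in BAD_WORDS.finditer(data):
        for start, end in marker_spans:
            # Gap is 0 when the two matches touch or overlap.
            gap = max(start - word.end(), word.start() - end, 0)
            if gap <= window:
                count += 1
                break  # count each bad word at most once
    return count


# "arrested" ends 9 bytes before "<ref"; "fraud" is more than 50 bytes past it.
sample = ("He was arrested in 2009.<ref>Source</ref> "
          + "x" * 100 + " Later accused of fraud.")
near_ref = count_bad_words_near(sample, REF_TAG)
```

The same function could be pointed at a pattern for {{citation needed}} and its redirects to produce the other proximity attribute.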

Analysis

The value of this data lies in its analysis. climax will focus on a scoring system; other users may be interested in performing their own analysis to examine particular trends or problem areas.

Technical details

This will be split into a few separate scripts. The dump scanner goes first. The remaining properties then need to be retrieved from a text file (page views) and from the database.

Script statuses — largely deprecated

climax-dump-props.py
- Implemented
  - Total page length
  - Total number of bad words
  - Total number of "<ref"s
  - Total number of "[http"
  - Presence of reference banners
- Not implemented
  - Number of bad words within X bytes of {{cn, etc.
  - Number of bad words within X bytes of <ref / [http
climax-database-props.py
- Implemented
  - none
- Not implemented
  - Date of first edit to page
  - Date of last edit to page
  - Total number of revisions
  - Number of page watchers
climax-views-props.py
- Implemented
  - none
- Not implemented
  - Number of page views from bh.txt (rename this file...)
climax-scorer.py
- Implemented
  - none
- Not implemented
  - Need to devise a proper scoring chart
  - over 9000 :PP
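
As a rough illustration of the simpler properties listed under climax-dump-props.py above, the sketch below computes them for one page's wikitext. The function name dump_props, the dictionary keys, and the banner regex are hypothetical; only {{BLP unsourced}} is matched here, since its redirects are not listed on this page.

```python
import re

# Assumption: only the {{BLP unsourced}} banner itself is matched, not its redirects.
BANNER = re.compile(r"\{\{\s*BLP unsourced\s*[|}]", re.IGNORECASE)


def dump_props(wikitext):
    """Return simple counts for one page's wikitext."""
    return {
        "length": len(wikitext.encode("utf-8")),    # page length in bytes
        "ref_tags": wikitext.count("<ref"),         # instances of "<ref"
        "external_links": wikitext.count("[http"),  # instances of "[http"
        "has_banner": BANNER.search(wikitext) is not None,
    }


sample = ("{{BLP unsourced|date=January 2010}}\n"
          "A singer.<ref>[http://example.org bio]</ref> "
          "See [http://example.com site].")
props = dump_props(sample)
```

Counting the length on the encoded bytes rather than on characters keeps the value comparable with a page length given in bytes.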

Test cases

Bad words

Urgently need to add case sensitivity support here.

Words that definitely need case sensitivity:

dick
evil
traitor
arrested
psycho
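
One possible way to support this is to store a case-sensitivity flag next to each pattern. The sketch below assumes the intent is that capitalized forms (names and titles) should not be counted for the words above; the pairing of flags to words and the helper name count_bad_words are illustrative only.

```python
import re

# (pattern, case_sensitive) pairs. Which words get which flag is an assumption.
PATTERNS = [
    (r"\bdick\b", True),
    (r"\bevil\b", True),
    (r"\bpsycho\b", True),
    (r"\bliar\b", False),
]

COMPILED = [
    re.compile(pattern, 0 if case_sensitive else re.IGNORECASE)
    for pattern, case_sensitive in PATTERNS
]


def count_bad_words(text):
    """Sum the matches of every pattern, honoring each pattern's case flag."""
    return sum(len(regex.findall(text)) for regex in COMPILED)


name_hits = count_bad_words("Dick Cheney appeared in Psycho.")    # capitalized forms
insult_hits = count_bad_words("He is a dick and a LIAR.")         # lowercase insult + any-case word
```

With this arrangement the capitalized "Dick" and "Psycho" are ignored, while "LIAR" is still caught because its pattern stays case-insensitive.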

Words to possibly remove from the bad word list:

steals (a routine baseball statistic)
investigations

Full list of bad words

\babusing\b
\babuse\b
\babused\b
\babducted\b
\babduction\b
\baccuse\b
\baccused\b
\baccusation\b
\ballege\b
\balleged\b
\banus\b
\barrest\b
\barrested\b
\barse\b
\bass\b
assault\b
assaulted\b
asshole
bastard
bitch
bloody
bollocks
\bbribe\b
\bbribes\b
\bbribed\b
bugger
\bcharges\b
child molester
child molestor
child predator
child predater
\bcocks\b
convict\b
convicted\b
\bcorrupt\b
cunt\b
\bdick\b
dumbass
espionage
\bevil\b
fag\b
faggot\b
faggots\b
fags\b
\bfired\b
\bfled\b
\bflee\b
fraud\b
\bfuck\b
\bfucks\b
\bfucked\b
is gay
\bghey\b
guilty
had an affair
\bhates
idiot
\bimpeach
insane
insanity
investigation
jackass
\bkilled\b
\bliar\b
\bliars\b
\blie\b
\blied\b
lol\b
\blying
malpractice
molest\b
molested
molestation
molesting
murder\b
murdered\b
murdering\b
mutant
neglect
neglected
negligent
\bnigger
paedophile
parole
pedophile
psychiatric
\bpedo\b
\bpsycho\b
\bpussy\b
\bracist
\brape\b
\braped\b
\braping
\bscandal\b
sexual assault
sexually assault
\bshit
\bslut\b
\bsluts\b
\bslutty\b
\bsteal
\bstole\b
\bstupid\b
\bretarded\b
\bretard\b
\btheft\b
\btits\b
\btwat\b
\bwanker
your mom
\bcharged\b
\bsentenced\b
in jail\b
\btraitor
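
The entries above mix patterns with and without \b anchors, which changes what they match. The following sketch checks a few of them with Python's re module against made-up sample words; case-insensitive matching is an assumption, and the helper matching_words is hypothetical.

```python
import re

# Patterns copied from the list above, each paired with made-up test words.
SAMPLES = {
    r"\bass\b": ["ass", "assault", "class"],
    r"assault\b": ["assault", "assaulted"],
    r"\bsteal": ["steal", "steals", "stealth"],
    r"lol\b": ["lol", "lollipop", "trolol"],
}


def matching_words(pattern, words):
    """Return the words in which the pattern is found anywhere."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [word for word in words if regex.search(word)]


results = {pattern: matching_words(pattern, words)
           for pattern, words in SAMPLES.items()}
for pattern, matched in results.items():
    print(pattern, matched)
```

A trailing \b is why "assaulted" needs its own entry, while a pattern with only a leading \b, such as \bsteal, also picks up "steals" and unrelated words like "stealth".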

Database schema

As of 02:46, 25 January 2010 (UTC)

sqlite> .schema
CREATE TABLE dump (
    dump_id INTEGER NOT NULL UNIQUE,
    dump_title TEXT NOT NULL,
    dump_length INTEGER,
    dump_bad_words INTEGER,
    dump_bad_words_near_ref INTEGER,
    dump_bad_words_near_citation_needed INTEGER,
    dump_individual_ref_tags INTEGER,
    dump_external_links INTEGER,
    dump_inline_citation_templates INTEGER,
    dump_banner_citation_templates INTEGER
);

CREATE TABLE database (
    db_id INTEGER NOT NULL UNIQUE,
    db_title TEXT,
    db_first_edit INTEGER,
    db_last_edit INTEGER,
    db_revision_count INTEGER,
    db_creator TEXT,
    db_watchers INTEGER
);

CREATE TABLE views (
    views_title TEXT NOT NULL,
    views_value INTEGER NOT NULL,
    views_month TEXT NOT NULL
);

CREATE INDEX vtindex ON views(views_title);
CREATE INDEX vvindex ON views(views_value);
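
For readers who want to run their own analysis, here is a minimal sketch of querying this schema with Python's sqlite3 module. It uses an in-memory database, a trimmed copy of the dump table, and made-up rows; joining dump to views on the page title and the "2009-11" month format are assumptions, not something the schema enforces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dump (
    dump_id INTEGER NOT NULL UNIQUE,
    dump_title TEXT NOT NULL,
    dump_bad_words INTEGER,
    dump_individual_ref_tags INTEGER
);
CREATE TABLE views (
    views_title TEXT NOT NULL,
    views_value INTEGER NOT NULL,
    views_month TEXT NOT NULL
);
CREATE INDEX vtindex ON views(views_title);
""")

# Made-up rows, for illustration only.
conn.executemany("INSERT INTO dump VALUES (?, ?, ?, ?)", [
    (1, "Example_A", 5, 0),
    (2, "Example_B", 2, 10),
    (3, "Example_C", 0, 3),
])
conn.executemany("INSERT INTO views VALUES (?, ?, ?)", [
    ("Example_A", 1200, "2009-11"),
    ("Example_B", 90000, "2009-11"),
    ("Example_C", 300, "2009-11"),
])

# Pages with bad words but no <ref tags, most viewed first.
rows = conn.execute("""
    SELECT dump_title, dump_bad_words, views_value
    FROM dump
    JOIN views ON views_title = dump_title
    WHERE dump_bad_words > 0 AND dump_individual_ref_tags = 0
    ORDER BY views_value DESC
""").fetchall()
conn.close()
```

Against the released database, sqlite3.connect would take the path of the database file instead of ":memory:".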

To-do

case sensitive bad words (e.g., "dick")
add column in the database table for page creator text
add all code to code repo
prefix database names properly
version views columns (e.g., nov_09_views)
test.db includes non-articles
need to add "reference_headers" count
differentiate bad words vs. very bad words
track Google hits?
track incoming page links from other articles?