I've been writing some small scripts in my spare time. All of the tools here are developed and tested on machines running Debian.
XML data dump tools
Most of the tools dealing with Wikipedia database dumps written so far have been in Perl. Personally, I think Perl is a little ugly and sometimes hard to read. Read the minimalist section here to better understand. This example may also help:
# why i don't use perl
perl -e'$_=q#: 13_2: 12/o{>: 8_4) (_4: 6/2^-2; 3;-2^\2: 5/7\_/\7: 12m m::#;y#:#\n#;s#(\D)(\d+)#$1x$2#ge;print'
I keep this example around to remind me.
- This is a Python script that splits the large XML database dump by page nodes and re-archives them on the fly, to keep from using gigabytes of disk space. The script makes system calls and recommends the p7zip-full package, although the -z option can be used to change archivers. It has been tested on this file. It reads input from stdin, so the bz2 file should also work. The script makes multiple (though not multi-volume) archives to avoid extensive disk i/o. A rough sketch of the approach is below.
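For reference, here's a minimal sketch of that approach, not the script itself: it assumes the `<page>` and `</page>` tags sit on their own lines (as they do in the MediaWiki dumps), and the chunk size, output filenames, and the `7z a` invocation are placeholder assumptions.

```python
#!/usr/bin/env python3
# Minimal sketch, not the actual script. Assumes <page>/</page> appear on
# their own lines, and that 7z (from p7zip-full) is the archiver.
import os
import subprocess
import sys

PAGES_PER_CHUNK = 1000        # assumed chunk size; pick to taste

def archive(path):
    """Compress one finished chunk, then delete the plain XML to save disk."""
    subprocess.run(["7z", "a", path + ".7z", path], check=True)
    os.remove(path)

def main():
    chunk_index = 0
    pages_in_chunk = 0
    in_page = False
    out = None
    path = None

    for line in sys.stdin:
        if "<page>" in line:
            in_page = True
            if out is None:
                # start a new chunk file (name is illustrative)
                path = "chunk_%04d.xml" % chunk_index
                out = open(path, "w", encoding="utf-8")
        if in_page and out is not None:
            out.write(line)
        if "</page>" in line:
            in_page = False
            pages_in_chunk += 1
            if pages_in_chunk >= PAGES_PER_CHUNK:
                out.close()
                archive(path)
                out = None
                pages_in_chunk = 0
                chunk_index += 1

    # flush whatever is left in the last partial chunk
    if out is not None:
        out.close()
        archive(path)

if __name__ == "__main__":
    main()
```

Usage would look something like `bzcat pages-articles.xml.bz2 | python3 split_dump.py` (filenames here are illustrative); each chunk becomes its own 7z archive, so no large uncompressed file ever sits on disk for long.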