User:GreenC/software/agrep

Plain-language methodology for approximate (“fuzzy”) matching as done in the noisbn project, for the purpose of record matching of books at Internet Archive with Wikipedia

By User:GreenC

Given a book from a source such as Wikipedia or BWB it includes metadata such as title, author, year of publication.

Extract the author(s) last names only, and the title without subtitle. Make an advanced search on IA like this:

https://archive.org/advancedsearch.php?q=%28<title>%29+AND+creator%3A<last name(s)>+-collection%3Aopensource+mediatype%3Atexts&rows=10&output=xml

This will narrow the universe of possible choices to a max of 10 or whatever number you choose in &rows=. Elasticsearch is powerful and this step is a big gain in finding a match as you only need to check a small number of close matches.

For each IA ID found, for each field (title, publisher etc) create a function that does a combination of approximate matching, and rules.

For approximate matching I settled on the tre-agrep tool (available via apt etc..) because it internally uses multiple algorithms, is very fast, has some key options, and it is better than the Levenshtein distances libraries included in most language standard libraries. More info at the Wikipedia article titled “agrep”.

The interface to tre-agrep is a bit awkward so I wrote an awk wrapper script that can be called from any program (bash, nim, python etc) with standard command-line options. The script dynamically adjusts tre-agrep options based on input string lengths, and is optimized for the short strings of book metadata. The script is available here: https://github.com/greencardamom/WaybackMedic/tree/master/Agrep

The noisbn matching logic is simple, traverse the fields sequentially:

If title match then

If publisher match then

If author match then

If date match then

Found a match

When a match fails, the further into the loop it goes the more interesting because there is a good chance that match might be accurate with some adjustments to the code. I do lots of logging.

Matching rules can be invented to help it be more accurate. For wikipedia purposes it looks something like this, contained in the date matching function:

If wp has no |year, has a |publisher, and no |pages or id, then automatically match the date
If wp has no |year, and no |publisher, and no isbn, then automatically accept
If there is a wp |isbn, check the year associated with it via OpenLibrary (isbn_meta program), then see if it matches with the ia year
Does it have an identifier? Pages? No ':' in title (possible periodical)? Year must be exact match.

These rules are optimized for Wikipedia purposes, we can match strictly or be more liberal based on the availability of metadata. For example if the only metadata available is author and title, then edition doesn’t matter, we can safely match any year and/or publisher. But if there is a page number in the metadata, then we must be more strict about edition so it lands on the correct page.

There is string normalizations for publisher. For example:

sub("(?i)D(orling)?[.]?[ ]?K(indersley)?[.]? Publish(ers|ing)?", "Dorling Kindersley")

This is a regex that converts variants of the publisher (DK, D.K. Pub., etc.) into a normalized string so approximate matching is more accurate. I have about 50-100 of these for the most common publishers, the work is endless discovering publisher strings that need normalizing but getting the major ones goes a long way.

For dates, I allow a 1-year window, so for year 2010 it would match on 2009-2011. Except in cases where it looks like a periodical (no subtitle, existence of volume or number).

For books with multiple volumes check title and metadata for ‘Volume” or “Vol.” however these can be difficult as volume meta info doesn’t always exist, and it can end up matching the wrong volume.

To recap on general methods:

Elasticsearch to narrow field of possible matches at IA.
Normalize strings in publisher
Extract metadata and compare title, publisher, author and date in sequence using agrep
Create more specific rules based on your requirements that can help refine a match