Wikipedia:GLAM/Museum of New Zealand Te Papa Tongarewa/What We've Done/Myosotis Pilot

Te Papa's Myosotis pilot project wanted to find out how we could effectively and sustainably contribute to Wiki projects using our collection images, metadata, and curatorial knowledge. We used OpenRefine to load 355 images of Myosotis specimens native to Aotearoa New Zealand, creating a reusable process that involves adding well-described content, improving and creating articles, and connecting with structured metadata.

The project:

Loaded 355 images of Myosotis specimens
Added the images to articles created by one of our Botany Curators, Stitchbird2
Added and updated Wikidata items for many species and people related to the set
Created a new Commons template, Template:TePapaColl
Created processes to support the selection, export, transformation, and upload for our images and data

If you've got any questions, suggestions, or just want to talk about the project, get in touch with Avocadobabygirl.

This page describes the project goals, how to publish a set like this to Wikimedia Commons, and specifics of how we made it happen.

What’s Wikimedia Commons? How do we use it?

Te Papa wants the collections and knowledge we hold to be accessible and impactful for anyone who wants them. We’re building an ongoing programme of digital outreach that works out the best (most effective, sustainable, enriching…) platforms to publish on, and then makes it happen.

By loading images and metadata to Wikimedia Commons, as well as including the pictures in Wikipedia articles and connecting into Wikidata, we put valuable and up-to-date material right where people go looking for it. Contributing to a scientifically sound article on native forget-me-nots makes Wikipedia itself more complete, but also helps people who see that information when it goes elsewhere on the internet, like iNaturalist or Google search results.

We load not just high-quality and high-resolution images, but also detailed metadata and a link back to the record on our Collections Online site. This helps the image travel with its context: detailed and useful information that makes the images easier to find, use, and interpret in all sorts of ways.

On the front end, we display descriptive metadata that makes it really clear what you’re looking at, as well as extra info that’s useful to Wikipedians and researchers. Behind the scenes we also hook in several structured data statements using Wikidata properties and items, making it easier to computationally interpret the image.

Loading all this material to Wikimedia Commons is done with OpenRefine. We use it to prepare our data, hook into Wikidata, and upload the images in bulk.

Read on to see how we select material, process images and data, and load it to the platform.

Selection criteria

Setting selection criteria and making your actual selections helps keep the size of the following work down.

Establish your criteria, using the following as a basis.

Criteria	Reasoning
Set size is between 300 and 1000 images	Safely within OpenRefine’s bulk upload capacity; manageable amount for some manual processing
Set prioritises New Zealand/Pacific material	Supports Te Papa’s strategic goals; more likely to fill gaps on Wikimedia; plays to our strengths
Images are new to Wikimedia Commons	Avoid duplication of effort
Images are public domain or have a CC BY licence	Open licences are required for Wikimedia Commons/Wikipedia
Images and data are high quality	Small/unclear images and incorrect/inconsistent data don’t support positive audience impact or reflect well on Te Papa

Preferably, images will also have a use case ready to go, like inclusion on specific Wikipedia articles.

It will also be easier to prepare and upload material if the records are all the same type (eg Specimen vs Object), but this isn’t required.

What we did

Our criteria needed to ensure we had a set of records and images that was not too complex and supported Te Papa’s overall goals. Important considerations also included:

The number of images OpenRefine’s experimental bulk upload functionality can safely handle
Subject matter expert interest and availability through the length of the project
Avoiding a wide variety of types so we could focus on supporting a single data model

As a result, our set was:

355 images of specimens
Made up of Aotearoa’s native species
Focused on providing a few high quality images representative of each species, covering field and herbarium views, as well as pollen if available. Some small or unclear images were removed.
To the best of our knowledge, not represented on Wikimedia Commons already
All licensed CC BY. Photos that weren’t taken by Te Papa staff were removed to avoid needing to spend lots of time chasing up rights

Image selection

Because we wanted to restrict our set to a small number of relevant and high quality images, we did a review of all images attached to the records we’d chosen.

What we did

We had previously drilled down in our original report spreadsheet, and flagged the specimens we wanted to include.

We now exported images for those specimens into a set of folders arranged by species and then specimen.

Myosotis images --> 
-- Myosotis capitata --> 
----- SP106542_A --> 
-------- SP106542_A_434684.jpg 
-------- SP106542_A_434691.jpg 
-------- SP106542_A_434695.jpg 
----- SP106542_B --> 
-------- SP106542_B_434706.jpg 
-------- SP106542_B_451340.jpg 
-- Myosotis antarctica subsp antarctica --> 
----- SP102775 --> 
-------- SP102775_389843.jpg 
... and so on

Our Botany Curator reviewed these images using XNView, making selections by dragging each image to another folder. Because each image included both the specimen identifier and the image identifier, we could then run a script over the folder contents that gave us our list of selections.
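A minimal sketch of that kind of script, assuming filenames follow the pattern shown above (specimen identifier, underscore, image identifier) and that the selected images sit together in one folder. The function names and the folder name are illustrative, not the script we actually ran.

```python
from pathlib import Path


def parse_selection(filename):
    """Split a name like 'SP106542_A_434684.jpg' into (specimen, image) identifiers."""
    stem = Path(filename).stem
    # The image identifier is everything after the last underscore
    specimen_id, image_id = stem.rsplit("_", 1)
    return specimen_id, image_id


def list_selections(folder):
    """Return sorted (specimen, image) pairs for every jpg in the folder."""
    return sorted(parse_selection(path.name) for path in Path(folder).glob("*.jpg"))
```

Calling list_selections("selected") (a hypothetical folder name) returns one pair per chosen image, which can then be matched back to the rows of the original report.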

This took us from 1826 specimens and around 8900 images to 157 specimens and 355 images.

Preparing the data for OpenRefine

Create a general list of the kinds of images you want to include. It’s good to do this as a spreadsheet including columns like:

record numbers
titles
species
locations.

Make sure that there is one row in your spreadsheet for each image.

You can now open it in OpenRefine as a new project.

What we did

We created a report in EMu, our collection management system, which exports a ton of data that can be helpful when narrowing down a set of uploads. It includes a row for each image attached to each specimen.

By doing a general search on Myosotis specimens that had openly-licensed images we produced a spreadsheet that could then be processed in OpenRefine.

Making the actual selection was fairly manual, and was possible thanks to the subject matter expertise of the project’s Botany Curator.

First, we exported catalogue records for all Myosotis specimens in a spreadsheet, which included the species. In OpenRefine we faceted the data by species and removed those that were not native to New Zealand. That let us:

See how many species we needed to cover
See which of these had specimens with images, and how many of each
Determine how many specimens we should include for each species, and how many images for each specimen
Initially review the state and structure of the data
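The same counts can be double-checked outside OpenRefine. This is only a sketch: it assumes the export has been read into a list of dicts with one row per image, and the key names species and specimen are placeholders rather than real EMu column names.

```python
from collections import defaultdict


def summarise_by_species(rows):
    """Count distinct specimens and total images for each species.

    Each row is a dict with 'species' and 'specimen' keys, one row per image.
    """
    specimens = defaultdict(set)
    images = defaultdict(int)
    for row in rows:
        specimens[row["species"]].add(row["specimen"])
        images[row["species"]] += 1
    # Map each species to (number of specimens, number of images)
    return {species: (len(specimens[species]), images[species]) for species in specimens}


# Illustrative rows based on the folder listing above
rows = [
    {"species": "Myosotis capitata", "specimen": "SP106542_A"},
    {"species": "Myosotis capitata", "specimen": "SP106542_A"},
    {"species": "Myosotis capitata", "specimen": "SP106542_B"},
    {"species": "Myosotis antarctica subsp. antarctica", "specimen": "SP102775"},
]
summary = summarise_by_species(rows)
```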

Filtering and faceting in OpenRefine

Use OpenRefine’s faceting and filtering tools to remove records you don’t want to include. Each record should relate to a single image. Some useful methods are:

Facet by species
Facet by specimen or catalogue record. Only keep those with multiple images.
Facet on empty fields
Facet on image metadata. For example: minimum longest edge, file type (tif, jpg), file size, creation date, filename (the filename may point to the type of image it is – specimen sheet, field image etc)
Facet on image creator

When you have filtered the records you don’t want to include, you can flag them using the All dropdown menu on the first column, then Edit rows, then Flag rows. When you’re done, you can then remove all flagged records from your project by selecting Remove all matching rows – it’s better to do this at the end, in case you change your mind.

Review your data

After narrowing down to a subset of records, it’s a good time to review your data.

Look out for things like:

Values showing in the correct fields
Consistency – dates, spelling of names, formatting
Missing or additional data that should be added, for example Wikidata QIDs for associated people and taxa
Sensitive information – cultural, personal, location and financial data that shouldn’t be published

Ensure that data supporting image use is correct. For example:

Individual rights statements are consistently applied and meet the requirements of the external platform. For example, Wikimedia Commons requires images to be freely licensed or in the public domain.
Images are already (or queued to be) published on your own platform. This ensures users can verify that an image has in fact been officially published and is reusable.
Images are published at their highest resolution

What we did

We were regularly checking our data throughout many iterations of the export process - some changes happened literally at the last minute.

We made changes in EMu and re-exported the data or, for small fixes, made the change directly in our spreadsheet and then updated the EMu record to match.

Some of the important fixes we made include:

Spellings in qualified names
Inverted coordinates (eg 42 instead of -42)
Associating the correct person as the specimen identifier
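Sign errors like the inverted coordinates are easy to flag automatically. The sketch below assumes every specimen in the set was collected in New Zealand, so any positive latitude is suspect. The registrationNumber and latitude key names and the sample values are illustrative only.

```python
def suspicious_latitudes(rows):
    """Return rows whose latitude is positive (northern hemisphere).

    Assumes all specimens come from New Zealand, where latitudes are
    negative, so a positive value has probably lost its minus sign.
    """
    flagged = []
    for row in rows:
        if float(row["latitude"]) > 0:
            flagged.append(row)
    return flagged


# Made-up values for illustration
sample = [
    {"registrationNumber": "example-1", "latitude": "-42.0"},
    {"registrationNumber": "example-2", "latitude": "42.0"},
]
to_check = suspicious_latitudes(sample)
```

Flagged rows still need a person to confirm the correct value before the source record is changed.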

We also reviewed issues relating to images and fixed them as needed, including:

Ensuring all selected images had access copies (what you get when you download them from Collections Online) at the highest available resolution
Ensuring the rights statements associated with images were correct
Ensuring all selected images were properly published and associated with their specimen’s records

Wikidata prep

OpenRefine lets you reconcile columns of values against Wikidata items, thereby connecting each upload to structured data in all sorts of useful ways.

Reconciliation using OpenRefine

Linking up things like creators, species, what’s depicted in the image, and significant locations covers most of the things people want to know. You might also consider:

type status (both whether the specimen is a type, and what kind of type)
collection/institution it's held in
people involved in collecting or identifying it.

The easiest way to get a definite match is to include Wikidata identifiers – QIDs – in your source data.

Wikidata:Identifiers

What we did

Knowing we wanted to load these images with robust structured metadata, we had to ensure the relevant items existed and were ready to use on Wikidata. For us, the essential items were people involved in collecting or identifying the specimens, and the species themselves.

We created lists of the people and species included in our data spreadsheet, and manually searched Wikidata for each one.

Once those items had been reviewed, improved with extra statements if needed, or added where missing, we added their QIDs into EMu. Including QIDs in the source data meant we could export them and perform OpenRefine reconciliation tasks much more easily later.

We also created new Wikimedia Commons categories to ensure our uploads would be findable. We created a category for each species and associated it with the relevant Wikidata item. We also added a category for material uploaded by Te Papa staff, to aid our own tracking.

For locations, we found there wasn’t a clean way of matching up regions in our source data (the State/Province field) to Wikidata items, as our values in that field were working at a lot of different levels, using different schemas.

Instead, we’re generalising the State/Province values in our data to the 16 administrative regions/districts of New Zealand. It isn’t going to be perfectly accurate, but the verbatim value will still show on the front end of images we upload, and the source data is of course safe in EMu.

Northland region (Q59596)
Auckland region (Q726917)
Waikato region (Q139918)
Bay of Plenty region (Q2192924)
Gisborne district (Q140246)
Hawke’s Bay region (Q251825)
Taranaki region (Q140207)
Manawatū-Whanganui region (Q139907)
Wellington region (Q856010)
Tasman district (Q666142)
Nelson region (Q1995856)
Marlborough district (Q140083)
West Coast region (Q541468)
Canterbury region (Q657004)
Otago region (Q692912)
Southland region (Q864971)
New Zealand (Q664)
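A lookup like the one below is one way to apply this list. It is a sketch under two assumptions: that the State/Province value has already been tidied to a plain region name, and that anything unmatched falls back to New Zealand (Q664). The QIDs are copied from the list above; the function name is illustrative.

```python
REGION_QIDS = {
    "northland": "Q59596",
    "auckland": "Q726917",
    "waikato": "Q139918",
    "bay of plenty": "Q2192924",
    "gisborne": "Q140246",
    "hawke's bay": "Q251825",
    "taranaki": "Q140207",
    "manawatū-whanganui": "Q139907",
    "wellington": "Q856010",
    "tasman": "Q666142",
    "nelson": "Q1995856",
    "marlborough": "Q140083",
    "west coast": "Q541468",
    "canterbury": "Q657004",
    "otago": "Q692912",
    "southland": "Q864971",
}
NEW_ZEALAND_QID = "Q664"


def region_qid(state_province):
    """Return the QID for a region name, falling back to New Zealand."""
    # Normalise case, whitespace, and curly apostrophes before matching
    name = (state_province or "").strip().replace("\u2019", "'").casefold()
    for suffix in (" region", " district"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return REGION_QIDS.get(name, NEW_ZEALAND_QID)
```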

Finding a QID on Wikidata

A lot of things are already on Wikidata, so there’s a good chance of finding a QID for the entity you’re working with. Sometimes, the difficult part is finding the right one.

Wikidata items are supposed to be one-to-one with a specific thing, so finding something that’s close isn’t going to be helpful. Alexander von Humboldt (Prussian naturalist) is not Alexander von Humboldt (boat), and a specimen of Myosotis antarctica subsp. traillii isn’t a specimen of Myosotis antarctica subsp. antarctica.

Start by searching from the box in the top right of Wikidata’s homepage. If the item you want doesn’t show up in the dropdown, hit enter to get a full search results page.
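For longer lists, the same search can be scripted against the Wikidata API’s wbsearchentities action. The sketch below only builds the query URL and reads candidates out of a response that has already been fetched and parsed as JSON. The function names are illustrative, and every candidate still needs a manual check.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def search_url(name, language="en", limit=10):
    """Build a wbsearchentities query URL for a name string."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def candidates(response):
    """Pull (QID, label, description) tuples out of a parsed search response."""
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in response.get("search", [])
    ]
```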

When looking for the right item, think about how you would be sure you’re looking at the right one:

Is the name at the right level of specificity?
Do birth/death dates, locations, associated institutions line up?
Has the name of the entity changed over time, with different ones being used in your data and on Wikidata?

You may find you need to do more research. If available information is scant and you can’t make a confirmed match, it may be safest to leave it out, and just use the entity’s name string instead.

Adding a new item to Wikidata

If there isn’t an item you can match, you can add a new one.

Help:Items tells you how to do that.

Create statements for the item to help make it clear what it is.

For example, a person’s record should include:

Instance of: human
Given name
Family name
Occupation
- If you don’t have more definite information, add a contextually appropriate role here, like ‘botanical collector’
If it’s available, the identifier from your system. For us, this is the Te Papa agent ID

See Heidi Meudt’s Wikidata page for a more filled-in example.

What we did

As far as we’re concerned, if something has a parties, taxon, or place record in EMu, it’s significant enough to add to Wikidata. Some of these items were scant, to the point that the only Occupation we could reliably list was ‘botanical collector’.

Wikimedia Commons prep

Categories in Wikimedia Commons (and Wikipedia) group content together and help make it findable.

When applied to uploads, it’s best to use the most specific applicable category. For example, this specimen upload is a Myosotis, but only has the Myosotis pansa category.

Commons:How to create new categories or subcategories

What we did

As a standard, we apply the following categories:

Uploaded by Te Papa staff
The applicable “… in Te Papa” category under Category:Collections of Te Papa
If relevant, Te Papa type specimens
As applicable, Herbarium specimens or Zoological specimens

The category Uploaded with OpenRefine is automatically applied during upload.

We ended up needing to create categories for most species.

We created the Uploaded by Te Papa staff category to help our own tracking, and we changed several existing categories that referred to “Te Papa Tongarewa” instead of “Te Papa” (the proper short form of our name).

Data mapping and transformation

The data actually required to load images to Wikimedia Commons is very simple – a filename and a license statement. But it’s possible to provide a lot more data.

If including more complex data, you’ll want to use a template. Templates for some object types are much more mature than others.

Naturalis have created a more comprehensive specimen template called Biohist.

What we did

To display our data on Wikimedia Commons (and connect it to structured data on Wikidata), we map our fields to structures that make sense to those platforms.

API

{"identification.qualifiedName": "Myosotis pansa subsp. pansa (L.B.Moore) Meudt et al."}

Wikimedia Commons display view

Qualified name: Myosotis pansa subsp. pansa (L.B.Moore) Meudt et al.

Wikidata statement

This image depicts [P180] the species Myosotis pansa subsp. pansa [Q76375609]

For the display view, we created a variation of the Biohist template that was more suited to an Aotearoa (and Te Papa) context, and which we’ll be able to build on over time.

Template:TePapaColl

For the Wikidata statements, see the section on OpenRefine processing and schema below.

Harvesting data

With your selections and data mapping in place, you can now re-export your data in a format that’s easy to process and upload in OpenRefine.

What we did

For our pilot we exported data directly from EMu, but from now on (because it’s easier and we know all the data is published), we’re exporting from the API.

We harvest from our collections API using a Python script that:

Takes a list of catalogue and image identifiers (irns)
Queries the API for each catalogue record
Harvests, relabels, and transforms the record data as required
Creates a CSV file, writing a new row for each image, populated with processed data
Downloads and renames each image

We will share this script once ready.
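Until then, here is a rough sketch of the row-per-image step only. It is not our harvest script: the record structure, key names, and column names are all hypothetical, and the API query and image download steps are left out.

```python
import csv
import io

FIELDNAMES = ["RecordIrn", "QualifiedName", "ImageIrn", "Filename"]


def rows_for_record(record):
    """Yield one flat row per image attached to a catalogue record."""
    for image in record.get("images", []):
        yield {
            "RecordIrn": record["irn"],
            "QualifiedName": record.get("qualifiedName", ""),
            "ImageIrn": image["irn"],
            # Filename combines the specimen and image identifiers
            "Filename": "{0}_{1}.jpg".format(record["registrationNumber"], image["irn"]),
        }


def write_csv(records, handle):
    """Write a header and one row per image to an open text file handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for record in records:
        writer.writerows(rows_for_record(record))


# Hypothetical record shape, using identifiers from the folder listing above
example_record = {
    "irn": "example-record-irn",
    "registrationNumber": "SP106542_A",
    "qualifiedName": "Myosotis capitata",
    "images": [{"irn": "434684"}, {"irn": "434691"}],
}
buffer = io.StringIO()
write_csv([example_record], buffer)
```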

Processing in OpenRefine

Load the fresh export of data to OpenRefine as a new project, and do a final review of your data.

Ensure the filenames and filepath are correct
Remember that some things may appear to be doubled up, as they’re covering both descriptive and structured metadata

What we did

Most of our data transformation is handled in the harvest script, but we still needed to handle a few more things:

Adding columns for specific categories (such as joining the Genus and Species column to create the string Myosotis_pansa, which would then be used in the Wikitext)
Splitting columns of lists, such as collectors. This could range from 1 to 8 names, and was easier to do in OpenRefine than in a flat CSV
Generating Wikitext to collate data into a single value that calls the TePapaColl template
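For anyone doing these steps in a script instead, the first two might look like the sketch below. The function names are illustrative, and the pipe separator between collector names is an assumption, not a description of our export.

```python
def category_name(genus, species):
    """Join genus and species with an underscore, e.g. 'Myosotis_pansa'."""
    parts = [part.strip() for part in (genus, species) if part and part.strip()]
    return "_".join(parts)


def split_collectors(value, separator="|"):
    """Split a delimited list of collector names into a clean Python list."""
    return [name.strip() for name in value.split(separator) if name.strip()]
```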

Wikitext

Generate Wikitext for each item by transforming the Wikitext column with the following value (adjust as needed, of course):

"== {{int:filedesc}} ==\n" +
"{{TePapaColl\n" +
if(isBlank(cells.BasisOfRecord.value), "", "|BasisOfRecord=" + cells.BasisOfRecord.value + "\n") +
if(isBlank(cells.QualifiedName.value), "", "|QualifiedName=" + cells.QualifiedName.value + "\n") +
if(isBlank(cells.CommonName.value), "", "|MāoriCommonName=" + cells.CommonName.value + "\n") +
if(isBlank(cells.GenusCommonName.value), "", "|GenusCommonName=" + cells.GenusCommonName.value + "\n") +
if(isBlank(cells.MātaurangaMāori.value), "", "|MātaurangaMāori=" + cells.MātaurangaMāori.value + "\n") +
if(isBlank(cells.Family.value), "", "|Family=" + cells.Family.value + "\n") +
if(isBlank(cells.RegistrationNumber.value), "", "|RegistrationNumber=" + cells.RegistrationNumber.value + "\n") +
if(isBlank(cells.InstitutionCode.value), "", "|HerbariumCode=" + cells.InstitutionCode.value + "\n") +
if(isBlank(cells.TypeStatus.value), "", "|TypeStatus=" + cells.TypeStatus.value + "\n") +
if(isBlank(cells.TypeOf.value), "", "|TypeOf=" + cells.TypeOf.value + "\n") +
if(isBlank(cells.Institution.value), "", "|Institution=" + cells.Institution.value + "\n") +
if(isBlank(cells.DateCollected.value), "", "|CollectionDate=" + cells.DateCollected.value + "\n") +
if(isBlank(cells.CollectedBy.value), "", "|CollectedBy=" + cells.CollectedBy.value + "\n") +
if(isBlank(cells.IdentifiedBy.value), "", "|IdentifiedBy=" + cells.IdentifiedBy.value + "\n") +
if(isBlank(cells.Country.value), "", "|Country=" + cells.Country.value + "\n") +
if(isBlank(cells.StateProvince.value), "", "|StateProvince=" + cells.StateProvince.value + "\n") +
if(isBlank(cells.CatalogueRestrictions.value), if(isBlank(cells.PreciseLocality.value), "", "|PreciseLocality=" + cells.PreciseLocality.value + "\n"), "") +
if(isBlank(cells.ElevationMetresFromTo.value), "", "|Elevation=" + cells.ElevationMetresFromTo.value + "\n") +
if(isBlank(cells.DepthMetresFromTo.value), "", "|Depth=" + cells.DepthMetresFromTo.value + "\n") +
if(isBlank(cells.SourceUrl.value), "", "|SourceURL=" + cells.SourceUrl.value + "\n") +
if(isBlank(cells.CreditLine.value), "", "|CreditLine=" + cells.CreditLine.value + "\n") +
"}}\n" +
"=={{int:license-header}}==\n" +
"{{cc-by-4.0}}\n" +
"[[Category:Botany in Te Papa Tongarewa]]\n" +
"[[Category:Uploaded by Te Papa staff]]\n" +
"[[Category:Herbarium specimens]]\n" +
if(isBlank(cells.CategoryScientificName.value), "", "[[Category:" + cells.CategoryScientificName.value + "]]\n") +
if(isBlank(cells.TypeStatus.value), "", "[[Category:Museum of New Zealand Te Papa Tongarewa type specimens]]\n")
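The same skip-if-blank pattern can be tried out in Python before running the expression over a whole project. This sketch covers only a handful of the fields above and leaves out the categories. FIELD_MAP and build_wikitext are illustrative names, not part of OpenRefine or the TePapaColl template.

```python
# (source column, template parameter) pairs, a subset of the expression above
FIELD_MAP = [
    ("BasisOfRecord", "BasisOfRecord"),
    ("QualifiedName", "QualifiedName"),
    ("Family", "Family"),
    ("RegistrationNumber", "RegistrationNumber"),
    ("InstitutionCode", "HerbariumCode"),
    ("DateCollected", "CollectionDate"),
    ("SourceUrl", "SourceURL"),
]


def build_wikitext(row):
    """Build the template call for one row, skipping blank values."""
    lines = ["== {{int:filedesc}} ==", "{{TePapaColl"]
    for column, parameter in FIELD_MAP:
        value = (row.get(column) or "").strip()
        if value:
            lines.append("|{0}={1}".format(parameter, value))
    lines.append("}}")
    lines.append("=={{int:license-header}}==")
    lines.append("{{cc-by-4.0}}")
    return "\n".join(lines) + "\n"


example = build_wikitext({"QualifiedName": "Myosotis glabrescens L.B.Moore", "Family": ""})
```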

Schema

Property	Example item	Qualifier property	Example qualifier item
depicts	Myosotis glabrescens
main subject	Myosotis glabrescens
source of file	file available on the internet	described at URL	https://collections.tepapa.govt.nz/object/470141
		retrieved	10 October 2022
significant event	plant collection	point in time	February 1890
significant person	Donald Petrie	subject has role	botanical collector
country of origin	New Zealand
location	Otago Region
taxon name	Myosotis glabrescens L.B.Moore	taxon author	Lucy Beatrice Moore
		taxon author citation	L.B.Moore
	Boraginaceae
instance of	type specimen
subject has role	holotype	of	Myosotis glabrescens
collection	Museum of New Zealand Te Papa Tongarewa Herbarium
	Museum of New Zealand Te Papa Tongarewa
copyright status	copyrighted
copyright license	Creative Commons Attribution 4.0 International

Reporting and analytics

There are several tools that help gather analytics data about use of Wikipedia articles, Commons images, and more. They tend to provide a quantitative overview, so it’s good to supplement that with qualitative measures as well.

Using Wikimedia’s API to get pageviews

Wikimedia REST API documentation

This API gives you access to pretty much whatever you want to pull from Wikimedia, but what’s useful here is the pageviews data endpoint. This lets you send queries about how much use a given page is getting, customised with several parameters.

We run the following Python script monthly, creating a simple report from a couple of text files that have lists of URLs for the images and articles we want to keep track of.

from requests import get
from urllib.parse import quote, unquote
import json
import csv

headers = {"Accept": "application/json", "User-Agent": "[PUT YOUR LOGIN EMAIL HERE]"}

# Queries the API for each url, called by Report.get_views()
class WikiAPI():
	def __init__(self):
		self.pageviews_base_url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

	def pageviews(self, project, access, agent, article, granularity, start, end):
		# Page titles must be URL-encoded (not HTML-escaped) to be used in the request path.
		# Decoding first avoids double-encoding titles copied from a browser address bar.
		article = quote(unquote(article), safe="")
		slugs = [self.pageviews_base_url, project, access, agent, article, granularity, start, end]
		query_url = "/".join(slugs)

		response = json.loads(get(query_url, headers=headers).text)

		return response

# Takes a list of urls and query parameters, creates API queries, and writes the results to a csv
class Report():
	def __init__(self, mode=None, articles=None, start=None, end=None, granularity=None, project=None, access=None, agent=None):
		self.mode = mode
		self.articles = articles
		self.start = start
		self.end = end
		self.granularity = granularity
		self.project = project
		self.access = access
		self.agent = agent

		self.API = WikiAPI()

		if self.mode == "articles":
			self.report_file = "{start} - {end} wikipedia article views.csv".format(start=self.start, end=self.end)
		elif self.mode == "images":
			self.report_file = "{start} - {end} wikimedia image views.csv".format(start=self.start, end=self.end)

		self.open_file = open(self.report_file, "w", newline="", encoding="utf-8")

		self.write_report()

	def write_report(self):
		self.reportwriter = csv.writer(self.open_file, delimiter=",")
		self.reportwriter.writerow(["wikiUrl", "pageViews"])

		with open(self.articles, 'r', encoding="utf-8") as f:
			lines = f.readlines()
			for line in lines:
				wiki_url = line.split("/")[-1].strip()
				view_count = self.get_views(wiki_url)
				self.reportwriter.writerow([wiki_url, view_count])

		self.open_file.close()

	def get_views(self, article):
		view_count = 0
		response = self.API.pageviews(project=self.project, access=self.access, agent=self.agent, article=article, granularity=self.granularity, start=self.start, end=self.end)

		if "items" in response:
			for day in response["items"]:
				view_count += day["views"]

		return view_count

# Use to set parameters for the report
def run_report(mode=None):
	# Can be daily or monthly
	granularity = "daily"
	# YYYYMMDD or YYYYMMDDHH
	start = "20221001"
	# YYYYMMDD or YYYYMMDDHH
	end = "20221031"

	# Can be all-access, desktop, mobile-app, or mobile-web
	access = "all-access"
	# Can be all-agents, user, automated, or spider
	agent = "user"

	if mode == "articles":
		project = "en.wikipedia.org"
		articles = "tracked_articles.txt"

	elif mode == "images":
		project = "commons.wikimedia.org"
		articles = "tracked_uploads.txt"

	else:
		# Stop early with a clear message instead of failing later on undefined variables
		raise ValueError('mode must be "articles" or "images"')

	Report(mode=mode, articles=articles, start=start, end=end, granularity=granularity, access=access, agent=agent, project=project)

# mode can be "articles" or "images"
run_report(mode="images")
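Because the report runs monthly, the start and end values could be derived from the current date instead of being edited by hand. A small sketch using only the standard library follows. The helper name is illustrative and it is not part of the script above.

```python
import calendar
from datetime import date


def previous_month_range(today=None):
    """Return (start, end) as YYYYMMDD strings for the month before 'today'."""
    today = today or date.today()
    if today.month > 1:
        year, month = today.year, today.month - 1
    else:
        # January rolls back to December of the previous year
        year, month = today.year - 1, 12
    last_day = calendar.monthrange(year, month)[1]
    start = "{:04d}{:02d}01".format(year, month)
    end = "{:04d}{:02d}{:02d}".format(year, month, last_day)
    return start, end
```

The two strings are in the same YYYYMMDD format that run_report() sets for start and end.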

Use of images on Wiki project pages

Other tools let you see how categories of Commons images are used across the Wiki ecosystem, giving you a broad view of how a set of images is being used and also letting you drill down.

We use Glamorous to check the usage of all images under Category:Collections of Te Papa.

Filtering to a date span shows a chart of views by project (such as English-language Wikipedia, Spanish-language Wikipedia, Wikidata) on the Daily views tab.

Usage is also charted on the Global file usage tab.

And the File usage details tab provides a complete breakdown of every image in the category, showing for each one:

Number of uses
Page views across projects
Which pages it’s linked on

Tracking contributions

It can be useful to see how interest by contributors is building, based on how active they are after significant releases or other work.

The Programs and Events Dashboard provides a combined view of multiple users' contributions. Users can be added to the overall campaign or individual events.

We’re using ours to see how staff interest is (hopefully) building as we release more material and publicise the work internally. Staff who are interested in contributing as part of their work are added to the board, and we then look at our collective impact.

Another tool we may use is Herding Sheep - the idea is to ask participants at public edit-a-thons we hold to share their usernames, so we can get an idea of what kind of session or topic inspires the most ongoing activity as an editor.

Qualitative data

Although the available tools mainly focus on raw numbers, the wider Wiki ecosystem does provide good ways to collate qualitative data, which may tell you things like:

What questions people are trying to answer when they go to Wikipedia
What sort of problems you’ve helped them solve
What they think is still missing

We’re keeping an eye on our user Talk pages, as well as those for articles we’ve edited and images we’ve uploaded.

Other existing channels, including our website pop-up survey and high-resolution image download questionnaire, are also being watched for relevant comments. We are currently receiving feedback through emails to individual staff, and may set up a digital outreach address to publicise as an easy point of contact.

The main trick is to actually record these comments as they’re received. Even adding them to our simple monthly reporting spreadsheet is enough to get that information aggregated, analysed, and shared with the right people.

In the future, we’re considering running observational user testing to get qualitative feedback on the specifics of how we’re using these platforms, particularly regarding user experience and content decisions.