Wikipedia talk:Authority control/VIAF
Liaison with VIAF
- Moved from my talk page to facilitate wider discussion. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 20:00, 19 August 2014 (UTC)
Hi Andy!
I'm part of the VIAF team. We'd be pleased to work with someone on feeding the Wikipedia suggestions for changes into our database. We've also gotten a message from Sergey Vladimirov who suggested he'd like to know when we've made corrections to our pages. I sent an email to him a few weeks ago but haven't heard anything since.
I'd like to:
- understand the sources of the correction data so we can figure out how to weight it in our processing
- reach some agreements on the format and location of the suggestions
- reach some agreements on the format and location of any feedback you'd like from us.
Sergey pointed us at this page in wikidata: [1]
I'm looking forward to a successful feedback loop here!
Ralph Ralphlevan (talk) 14:11, 19 August 2014 (UTC)
- Hi, Ralph, and thanks for reaching out. Have you seen our page at Wikipedia:VIAF/errors? We find errors in two ways.
Firstly, as on that page, by observation while researching an article. Most of the errors there are self-evident (two or more numbers for the same person), or are explained by comments from those reporting them. Of course, those are only for articles on the English Wikipedia. Is there a format for reporting which we could use, on this Wikipedia, that would help you?
Secondly, the automatically-generated report at Wikidata covers all languages, and again the majority are self-explanatory as duplicate values for one person (under the '"Single value" violations' heading). Those under the other headings are more likely errors on our part, requiring manual checking and resolution (several have already been fixed).
As to how you could notify us of fixes, you could update the error page on this wiki with a note saying what you have done; optionally, you could update the Wikipedia articles too. You could also update Wikidata entries yourselves. Let me know if you need guidance on that. If that's too onerous, you could publish a list of amendments on your website, preferably with an accompanying RSS feed and/or XML, which we could run a script against, at our end. Most changes should be routine fixes (merging duplicates), but there may be some instances that need annotations, such as when one VIAF covers two names, because one name is a pseudonym for another; or when an identity is split in two, and we need to know which is which; or when we have made false reports. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 14:46, 19 August 2014 (UTC)
There's lots to consider here and I'm afraid I'd spend a week trying to make a comprehensive reply. Instead, I'll pick a couple low-hanging fruit.
On the Wikipedia:VIAF/errors page:
- Is there any chance of a template being used for those reports? They look like free text with good links in them. But, I'd have to infer the action to be taken. It would be lovely if the request was more explicit/consistent.
- Lots of "not VIAF" notes there. Is this our problem or yours? Picking the first one: Michael Peterson (murder suspect) - not VIAF 111595726, it looks like we saw that VIAF ID in your page (we download your pages monthly) and merged it into our cluster. We need for the ID to leave your page for us to correct our cluster. I don't think we're responsible for that link, though perhaps Max's bot put it there. As far as I'm concerned, you should just remove the offending link. I think your editors should just feel free to delete them whenever they like rather than reporting them as errors. Is there something we can provide to help?
- 2 or more IDs for the same article. Are you suggesting that the VIAF cluster records listed should be merged? If a trustworthy person made that suggestion, then we'd probably just take it and merge them. We merge on weaker evidence than that. Again, some indication of unambiguous intent would help us detect those suggestions. The first example: VIAF 38044491 (more detailed) & * VIAF 305133334 = Leslie Baily is clearly correct.
- "VIAF merges different identities" is more of the same. Agreement on clear intent and probably the form of the name or better yet the source of the incorrect form of the name would allow us to split those clusters.
- Parallel VIAF Clusters: How is this not the same as "2 or more IDs for the same article"? Your first example for the Duchess Henrietta is clearly a case of two clusters needing to be brought together.
- Articles about multiple people with a single VIAF ID is clearly your problem. Why not just make the change rather than report it. (I guess this is me trying to understand your process as much as anything.)
- Other errors looks like something we might glance at occasionally.
As we accept these suggestions, how would you like to hear back from us? Can we get a bot authorized to edit that page? Is there some other automated way to provide that feedback? I like marking things as done, but I'd hope that they age away rather than slowly filling the page. Otherwise, we could just make the suggestions go away as we implement them.
All the same questions apply to the wikidata error page. Do you have a suggestion on how we should progress that conversation?
What's the mechanism for including other people into this conversation? I've got three co-workers that would probably like to be involved and I'd like to get Sergey Vladimirov included too. For now, I'm just going to copy all this and mail it to them.
Thanks!
Ralph Ralphlevan (talk) 19:34, 19 August 2014 (UTC)
- @Ralphlevan: I've moved this from my talk page (and numbered your questions) to facilitate a wider discussion. Quick answers; more later:
- Could do, though anything too complex might put people off making reports. What should it include?
- I suspect some editors are reluctant to remove VIAFs because they don't know how to, or they're not sure where the error lies.
- Yes, I read these as requests to merge; I would advise you to sanity-check.
- We may not know incorrect sources.
- I've often wondered the same thing!
- Having researched this further, over the last few hours, I'm not sure it's always our problem (perhaps about 30% are, on a small sample). For example, at the foot of the Wikidata report is VIAF 96139163, which apparently conflates two people, de:Chen Kuen Lee (an architect) and Chen-Yuan Lee (a pharmacologist).
- OK.
I don't think a bot could edit the current page, as it's too unstructured. Your colleagues would be welcome to sign up for accounts and join in here. I agree that resolved issues should be removed - at first to a separate section, for review, then removed completely after, say, a month. Paging User:Maximilianklein and User:Andrew Gray for their views. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 20:19, 19 August 2014 (UTC)
- Quick note to say I've seen this (and excited!) but currently busy - will reply shortly! Andrew Gray (talk) 20:00, 20 August 2014 (UTC)
Way forward
- @Ralphlevan and Pigsonthewing: - some thoughts on ways forward, based on the errors page lists
- 1 - WP not the same as originally linked VIAF identity.
- These are mainly logged to ensure that the data is successfully removed from VIAF (though we probably need to make sure they're not hanging around in WP/Wikidata as well). From what you're saying above, it sounds like some of these links may no longer be in the database - the example given definitely used to be! - but I'm not sure of quite how this mechanism works. If we were to render it down to a simple list of VIAF cluster IDs and named Wikipedia pages that they are known not to match (and confirm that they've all been removed here), would this be suitable for you to remove any lingering matches at your end?
- We could do this in one fell swoop (basically, Andy and I do the checking here, we send you a list of codes, we clear the entire section and archive it somewhere)
- 2 - Two VIAF identities for the same article (ie, same person)
- 4 - Parallel VIAF clusters for one identity
- As far as I can see these are the same thing (but confused now...) - two numbers refer to the same author.
- These need a merge at your end. Most are fairly unambiguous and won't need supporting evidence, I think. Perhaps you could run through these and mark them when done? (eg with strikeouts) and we'll then archive it after checking WP has the updated number.
- 3 - VIAF merging different identities
- From what you say above, splitting is a bit more complicated than merging. Perhaps deal with this after the two above? I'll have a look at it over the next few days and figure out how best to handle it.
- 5 - Wikipedia link inside VIAF is out of date
- All gone now :-). And from what you say above this should be happening automatically on a rolling basis, so we can probably remove the section entirely.
- 6 - Articles about multiple people assigned the VIAF identity for one of them
- 7 - Other errors
- Complicated and may be worth parking for now. (we probably do want to track the individual IDs for single authors who WP deals with as a duo, but not immediately clear how best we can do this - especially since we often have a duo page on some wikis and individual pages on others) — Preceding unsigned comment added by Andrew Gray (talk • contribs) 18:55, 22 August 2014
All looks good to me. Regarding Wikipedia links inside VIAF, it might be worth VIAF starting to think about switching to using Wikidata. Once we've cleared this backlog, we can look at improving the model (templates, etc) for future reporting. I've also done some work to resolve errors from the Wikidata report; but some are ambiguous (e.g. VIAF 145018975, used on both Q752431 and Q1979542, which appear to be the same entity but have different articles in German). Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 18:30, 22 August 2014 (UTC)
- This section is difficult to read, for poor use of left-marginal whitespace and poor identification of participants (employees where, Wikipedians in residence, WikiData movers and shakers).
- Why not extend coverage to our Redirects from individual people? Complete the template {{Authority control}} on those pages. Include those pages in VIAF bundles. Are we waiting for decisions about Wikidata design? For Wikimedia software upgrade? (--presumably one that links from pages such as joint biographies to incoming redirects such as those from people)
- How frequently does VIAF.org update by reference to Wikipedia pages? VIAF 104887034 unusually fits our joint biography Brothers Hildebrandt, whose footer I linked to that bundle in March, using the template. But VIAF.org still includes that page in its Tim Hildebrandt bundle VIAF 49344720.
- Hi P64. This conversation looks a bit odd because it spun out of a conversation Andy and I had with someone from OCLC at Wikimania last week, so not all the backstory is on the page :-)
- Regarding matching VIAFs to redirects, this seems reasonable - though including the template might be a bit odd if we're not expecting readers to actually look at those pages. I know that Wikidata often has three entries (A, B, A+B) so putting the data there might seem the most sensible approach; however, this assumes that A/B exist on another wiki. So perhaps starting with redirect pages is the way to go (and we can always migrate later).
- People represented here by redirects, continued.
- Hi Andrew Gray. I don't believe I have visited any A+B page at Wikidata. I'll check some of my examples.
- I suppose that Wikidata should have a page for everyone with a biography or a personal redirect at some Wikipedia, among other people. Does anyone know where at Wikidata to find discussion of whether to go ahead --to create and expand WD pages for those people, regardless of whether this template is to be displayed at the WP redirects? (Once I found warnings against going ahead and pleas not to go ahead, but I didn't note the location, merely decided not to go ahead.)
- Yes, the template is less valuable at our personal redirects, if we expect no one to visit those pages. The same is true, and moreso, concerning interlanguage links. Yet moreso concerning personal categories such as Illustrators and Writers on the personal redirects for a husband-and-wife illustrator-writer team. If not a software upgrade we "need" See also in the Categories list, in effect (but only a software upgrade would give us See also in the Languages sidebar menu, I presume).
- --P64 (talk) 16:57, 23 August 2014 (UTC)
- Ralphlevan, I've started aggregating the easiest class of entries - VIAF-Wikipedia links originally assigned by the bot which we definitely know to be incorrect - and listing these at Wikipedia:VIAF/errors#Sent_to_OCLC. The format there is basic tab-separated, one column for number and one for the page it doesn't match. Could you run these through your system and check they're all removed from within VIAF, then either delete the lines when you've done or mark them as complete? (I don't know whether it'll be useful to have an archive to consult in future) Andrew Gray (talk) 21:50, 22 August 2014 (UTC)
- If it's a purpose of VIAF.org to improve data integrity at the authorities such as national libraries LC and DNB (US and Germany), some archive may be useful.
- When one of the authority records mixes data concerning two people, or two records divide data concerning one person, it may be no true VIAF "error" that its bundles do not match people, but I feel sure that many such problems have generated listings on our VIAF errors page (as well as "bad" assignments of parameter values in our biography footers that have not been listed there). So "bad" bundles at VIAF may --presumably by hardworking humans-- be mined for errors by the authorities, such as mistaken assignment of books, birth years, or middle names to writers who share first name and last name.
- --P64 (talk) 16:58, 23 August 2014 (UTC)
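For illustration only, the tab-separated list described above (one column for the VIAF number, one for the page it does not match) could be read with a few lines of Python. This is a minimal sketch, not OCLC's or Wikipedia's actual tooling: the function name `parse_mismatch_list` is hypothetical, and the sample row reuses the Michael Peterson example quoted earlier on this page.

```python
def parse_mismatch_list(text):
    """Parse lines of the form '<VIAF number><TAB><Wikipedia page title>'.

    Returns a list of (viaf_id, page_title) tuples. Lines that do not have
    exactly two tab-separated fields, or whose first field is not numeric,
    are skipped rather than guessed at.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 2:
            continue
        viaf_id, page = fields[0].strip(), fields[1].strip()
        if viaf_id.isdigit() and page:
            rows.append((viaf_id, page))
    return rows


# Sample row built from the example discussed earlier in this thread.
sample = "111595726\tMichael Peterson (murder suspect)\n"
for viaf_id, page in parse_mismatch_list(sample):
    print(f"VIAF {viaf_id} should not be linked to: {page}")
```

Keeping the identifiers as strings avoids any accidental reformatting of the numbers before they are compared with VIAF's own records.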
@Ralphlevan, Andrew Gray, and P64: I have moved this to a sub-page to facilitate further discussion; please add to your watch list and stand by! Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 19:00, 24 March 2015 (UTC)
Rebooting the conversation for the non-Wikipedian
I'm sorry I've let this languish.
We've started harvesting from Wikidata and using their links back to Wikipedia. That data is not in production yet, but can be seen in our test environment. Here's a pointer to Harry Harrison's VIAF page: http://rdap02pxdu.dev.oclc.org:8080/viaf/80169724/ Scroll down to the "About" section and you'll see all the Wikipedia links. The Wikidata link is up at the top in the "Preferred Forms" section. The Wikidata icon is hot and links you to the Wikidata page.
How does the conversation here relate to anything that gets done in Wikidata? Do you expect to make changes here and have them migrated to Wikidata or should the reverse happen?
I saw a reference to a Wikidata VIAF error report, but no pointer. (Sorry, non-wikipedian. It's probably obvious to you where that report would be.) Might that ultimately be the place for all VIAF issues?
Whether it is or not, how do you imagine we might let you know when we correct something? If we split a VIAF record based on your feedback, how do you want to learn that? Would you consider letting us correct your pages once we've untangled our data? If the report and response are form based, then we're probably going to want to automate as much of that process as we can.
We recognize that not all of your contributors (like me) are going to know how to do things the "right" way and we're looking into getting staffing in our database quality department to help with stuff that can't be automated.
I'm guessing you guys have exactly the same problems with Wikidata that you do with us: conflated names, multiple names, uncontrolled names, ... What is the process for correcting those situations? Might they apply here?
Ralphlevan (talk) 17:55, 25 March 2015 (UTC)
- Hi Ralph, Thanks for this. The error page is Wikipedia:VIAF/errors. That should keep you busy ;-) Do bear in mind that Wikipedia editors are not infallible, so there may be occasional false error reports. More comments to follow. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 21:52, 25 March 2015 (UTC)
- Actually, I was referring to a Wikidata error report. I found it here https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P214 Ralphlevan (talk) 14:42, 26 March 2015 (UTC)
- Hi again, Ralph. True, WD generates that list of "errors" in the sense that single-value is violated (eg, 2 VIAF IDs for one item) or unique-value is violated (eg, 2 items share 1 VIAF ID). Visit the property discussion, for instance VIAF (Property talk:P214), and read explanations such as this:
- Single value: this property generally contains a single value / Exceptions are possible as rare values may exist. Known exceptions: Jean-Louis Foncine (Q3166765)
- It appears to me that genuine exceptions are likely to be common rather than rare; eg, likely to include every writer whose work some library catalogs under multiple names. This list must be useful for VIAF to work from, if it would merge all bundles such as those three that WD lists for Jean-Louis Foncine.
- The same is true of some Uniqueness violations, including the last entry in the list (9856879) where the gist is that VIAF 9856879 bundles two people, needs dis-integration --not that the two WD items need integration. --P64 (talk) 16:59, 26 March 2015 (UTC)
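To make the two kinds of violation concrete, here is a minimal sketch in Python of how they can be detected from a plain list of (item, VIAF ID) pairs. It is not how the Wikidata report is actually generated; the function name `find_violations` and the item ID `Q_EXAMPLE` are hypothetical, while the other identifiers are ones mentioned elsewhere on this page.

```python
from collections import defaultdict


def find_violations(claims):
    """Split (item_id, viaf_id) pairs into the two violation types.

    single_value: items that carry more than one VIAF ID
                  (candidates for a merge at VIAF, or a clean-up on the item).
    unique_value: VIAF IDs that appear on more than one item
                  (candidates for an item merge, or a split at VIAF).
    """
    viafs_by_item = defaultdict(set)
    items_by_viaf = defaultdict(set)
    for item, viaf in claims:
        viafs_by_item[item].add(viaf)
        items_by_viaf[viaf].add(item)

    single_value = {i: sorted(v) for i, v in viafs_by_item.items() if len(v) > 1}
    unique_value = {v: sorted(i) for v, i in items_by_viaf.items() if len(i) > 1}
    return single_value, unique_value


claims = [
    ("Q752431", "145018975"),    # same VIAF ID on two items (from this page)
    ("Q1979542", "145018975"),
    ("Q_EXAMPLE", "38044491"),   # hypothetical item carrying two VIAF IDs
    ("Q_EXAMPLE", "305133334"),
]
single_value, unique_value = find_violations(claims)
print("Single value violations:", single_value)
print("Unique value violations:", unique_value)
```

The detection itself is mechanical; as noted above, deciding whether a given hit is a genuine exception, a VIAF problem, or a Wikidata problem still needs a human.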
- Hi, Ralph. At Wikidata:
- Regarding multiple names it may be useful to visit Dorothy Rice Sims (Q19547970). Below the heading there is a drop box new-to-me today, "In more languages" (perhaps only because I am now logged in as English-language user). There I added the four English "Also known as". The labels may differ across languages and often do. See also Mark Twain.
- There is a page to request deletion(s) after merging data from two or more items into one of them. d:WD:RFD; d:Special:Permalink/171768343
- There is Help:Merge d:WD:M with a gadget for simple merge of two items.
- There is somewhere to leave notes, regarding perhaps merges that you are not ready or willing to do yourself, but I don't remember where (not the item Discussion page, as I recall). --P64 (talk) 22:54, 25 March 2015 (UTC)
Feedback Loop
I see a couple of loops here:
- I like the Wikidata error report https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P214. It is trivial for me to process automatically. Looking at the first item on that page, it correctly identifies that we've not merged the Japanese record for Northern Ireland into our bigger cluster. We'll get that one fixed. But, how does that get back into a Wikipedia page which then gets harvested by Wikidata which, hopefully, means that the errant VIAF ID goes away? (If that's actually the way Wikidata errors get corrected!)
- Similarly for the Wikipedia VIAF Errors page: how do we report back on the results of our work?
Feedback would be along the lines of:
- We merged the identified records
- We split the identified records
- We don't see anything for us to do here. (Maybe you guys need to split or join pages instead?)
Where would this feedback go? Do we edit the pages ourselves or leave it to the reporter (in the case of the Wikipedia page)? Ralphlevan (talk) 15:16, 26 March 2015 (UTC)
- Edit Wikidata! There are tools for bulk edits, but it might be easier to make changes like this on an ad hoc basis, as you resolve issues at your end.
Wikipedia pages are increasingly pulling VIAF (and other) data from Wikidata.
- Clearly my internal model of what data is flowing where is wrong. I'd have sworn that the flow was from Wikipedia to Wikidata. Ralphlevan (talk) 19:33, 26 March 2015 (UTC)
- Wikipedia pages display some Wikidata data (pardon the expression). But EN.wiki pages are not systematically revised by incorporation of WD data (as far as I know).
- On the other hand, and with some delay or great delay, WD pages are revised by both automatic and systematic manual incorporation of data from wikipedia pages. For instance, here is one automatic upload of a short description about 30 hours ago [2]. The source for zillions of WD data is "upload from: English Wikipedia". --P64 (talk) 02:48, 27 March 2015 (UTC)
Whether or not you do, striking out or replying to messages on the error reporting page here, as you resolve them, would also be useful. I also think a Wikipedia template for reporting - and noting the status of - issues (reported, doing, done, will not fix) would be useful. I'll get onto that later, if no-one beats me to it.
- A template to record the status of those issues would be great! How are you going to feel if we automate the creation/editing of those status reports? I'm a little fuzzy on where you all stand on automated editing. Ralphlevan (talk) 19:33, 26 March 2015 (UTC)
Note that the Wikidata report is automatically generated, and applies to all languages, whereas this Wikipedia is just in English. There may be some duplication (errors reported both automatically and manually). Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 16:46, 26 March 2015 (UTC)
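As a rough illustration of what a structured status report might carry, the sketch below models the four statuses suggested above (reported, doing, done, will not fix) and the three kinds of feedback listed earlier in this section (merged, split, nothing to do at VIAF). Everything here is an assumption for discussion: no such template or format exists yet, and the class names, the `resolve` rule and the output wording are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Status(Enum):
    REPORTED = "reported"
    DOING = "doing"
    DONE = "done"
    WILL_NOT_FIX = "will not fix"


class Outcome(Enum):
    MERGED = "merged"
    SPLIT = "split"
    NO_ACTION = "no action at VIAF"


@dataclass
class Report:
    viaf_ids: Tuple[str, ...]
    status: Status = Status.REPORTED
    outcome: Optional[Outcome] = None

    def resolve(self, outcome):
        """Record the outcome; merges and splits count as done,
        anything else is closed as will-not-fix (an assumed rule)."""
        self.outcome = outcome
        if outcome in (Outcome.MERGED, Outcome.SPLIT):
            self.status = Status.DONE
        else:
            self.status = Status.WILL_NOT_FIX

    def summary(self):
        """One-line text that a person or a bot could post as a reply."""
        ids = ", ".join(self.viaf_ids)
        note = f" ({self.outcome.value})" if self.outcome else ""
        return f"VIAF {ids}: {self.status.value}{note}"


# The Leslie Baily pair mentioned earlier on this page, marked as merged.
report = Report(("38044491", "305133334"))
report.resolve(Outcome.MERGED)
print(report.summary())
```

A fixed vocabulary like this is what would make automated creation or editing of status notes feasible, whichever side ends up doing it.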
OCLC Mechanisms
For the Wikidata report (https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P214) I can see us generating a tool for our staff that presents them with an easy way to view the records in question and then authorize a merge. That's easy for a human to validate and we have a simple process to do that mechanically.
Splitting records is a lot tougher. If the problem goes back to a conflation in one of our sources, then the untangling requires their cooperation. If it's just a matter of us including a source into the cluster that shouldn't have been there, then the mechanics for excluding it are simple. The interface for our staff for doing that will not be as simple as the merging problem. Ralphlevan (talk) 15:33, 26 March 2015 (UTC)
- The reports depend on incremental dumps and lag behind by some hours. Also duplicates ("single value violations") may show up after unsuitable merges of wikidata items and for other reasons (e.g. wrong data imported from somewhere). For instance, the corresponding reports for GND records are actively used for "authority control" on wikidata: if an item shows up in the reports, chances are about 50:50 whether the GND has a duplicate or there is a problem in wikidata to be fixed. Or to be fixed in some wikisource or wikipedia article a wrong number is stemming from... The task for automation on your side would be to present to human decision-makers only those numbers from the automated reports which have persisted for some days.
- Of course these "simple" duplicate reports are also sometimes indicative of complex problems, e.g. two fairly large clusters conflating two distinct persons due to smaller VIAF partners not differentiating the persons...
- Assuming the actual outcome of human intervention in most cases is one formerly independent VIAF cluster being redirected to another one, there are bots on wikidata (namely d:User:KrBot, who is also responsible for the constraint reports) which automatically track redirects (probably based on the monthly dumps, and/or online rechecking on the occasion a wikidata item is "touched": I don't care, tracking redirects is not an urgent matter). Individual Wikipedia instances may have chosen to offload the VIAF numbers to Wikidata completely, or they have the technical means to record differences between the VIAF numbers stored locally and those in wikidata in maintenance categories. Here on en.wikipedia this is tracked in Category:VIAF different on Wikidata and (possibly, if one imagines that VIAF IDs actually get removed on wikidata) Category:VIAF not on Wikidata.
- "unique value violations" very often mean that two wikidata items can be merged (easy, but I'm not sure whether someone is actually performing this sort of authority control in wikidata with respect to VIAF numbers). What remains are indeed the hard cases where VIAF clusters must be split up; probably this can often only be achieved in a persistent way by persuading the partners to fix their problems - "in a timely manner" (i.e. ASAP). Thus again, items and VIAF numbers appearing in the reports need some time (days, weeks?) to settle before it is worth the effort to look at the situation from VIAF's end.
- I perceive a huge symmetry between VIAF's operation and wikidata's when considered as "hubs". When restricted to VIAF numbers stored in wikidata (or Wikidata Q-numbers recorded in VIAF) the situation becomes slightly asymmetric, or rather complementary: in one case merging at VIAF or disambiguating at wikidata are the options (one is right, the other wrong; this has to be decided based on additional data), in other cases it is the other way round. In either case "escalating" the problems should only be done when "local" inspection has been performed without success. One might think about other approaches: since merging is usually cheaper (technically and intellectually) than the splitting up of clusters, one could consider that the "merge end" should have the first try. OTOH having to untangle the results of erroneous merges is the worst case when seen from satellite systems actually using the data, and therefore the technical simplicity of merging should not interfere with the intellectual diligence which is crucial in these situations... -- Gymel (talk) 08:24, 27 March 2015 (UTC)
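The suggestion above, to show human decision-makers only those entries that persist in the reports for some days, can be sketched as a simple intersection over daily snapshots. This is a minimal illustration under stated assumptions: each snapshot is taken to be a set of VIAF IDs extracted from one day's report, the function name `persistent_violations` and the three-day default are hypothetical, and the IDs "111" and "222" are placeholders (9856879 is the cluster discussed on this page).

```python
def persistent_violations(daily_snapshots, min_days=3):
    """Return the entries present in each of the last `min_days` snapshots.

    daily_snapshots: list of sets, oldest first, one set per daily report.
    Entries that appear and disappear within the window are treated as
    transient (for example, fixed meanwhile at Wikidata) and are dropped.
    """
    if min_days < 1:
        raise ValueError("min_days must be at least 1")
    if len(daily_snapshots) < min_days:
        return set()  # not enough history yet to call anything persistent
    recent = daily_snapshots[-min_days:]
    return set.intersection(*recent)


snapshots = [
    {"9856879", "111"},   # day 1: "111" is a placeholder ID
    {"9856879", "222"},   # day 2: "111" resolved, "222" newly reported
    {"9856879", "222"},   # day 3
]
print(persistent_violations(snapshots, min_days=3))
```

How long the window should be (days or weeks) is exactly the open question raised above; the sketch only shows that the filtering step itself is cheap.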
- As I observed yesterday in section 2, the report of VIAF uniqueness violations, too, must be useful for VIAF to work from. Yesterday visiting the first and last reported violations only, I found one of two: the last entry in the list (9856879). The gist is that VIAF 9856879 bundles two people and needs dis-integration --not that the two WD items need integration. This one is not derived from error by one of the authorities; evidently from a mistaken merge at VIAF, as VIAF 23797217 redirects to the other. WD page history [3] shows that d:User:KrBot literally created the violation last week --presumably by tracking the VIAF redirect as Gymel describes (thanks). --P64 (talk) 16:27, 27 March 2015 (UTC)
- Certainly: The bot's edit exchanged the redirected, old cluster number for the current, valid one. Accidentally the erroneous VIAF operation from 2015-03-11 was to transfer the correct (musician's) LCCN into the target cluster, consequently kicking out that of the painter and transferring it to a fresh cluster. Seeing how wildly the constituent records mix birth and death dates it is understandable why the VIAF algorithms failed... But since all Wikidata clusters are created intellectually in a sense, this is regular "authority control": Some operation triggers a violation report which subsequently can be dealt with. And the first step in dealing with that is to identify the realm(s) where a fix is necessary. Unfortunately the options are not good: After having established, as in this case, that wikidata has no means to "repair" the VIAF operation in question, one can either deliberately keep the (mostly) wrong VIAF number in the wikidata item, or delete it. In the latter case a) documentation is lost that there is an issue with that VIAF number and b) the wrong number is likely to reappear soon since e.g. bots will try to establish the VIAF number based on the (correct) LCCN in the item. -- Gymel (talk) 18:26, 27 March 2015 (UTC)
- @Ralphlevan: During the last weeks I systematically sifted through the "Unique value" violations report on wikidata. While I was doing that a bot started uploading huge amounts of VIAF id's from Template:Authority control previously not seen on wikidata, yielding hundreds of additional entries in the report...
- Sorry, I can't find that report. Ralphlevan (talk) 20:28, 16 April 2015 (UTC)
- @Ralphlevan: The relevant reports are updated on a daily schedule at d:Wikidata:Database reports/Constraint violations/P214, which has a separate section for each constraint declared at the property's talk page, namely
- the quite huge "Single value" violations (one WD item carries multiple VIAF ids: possible solutions splitting wikidata items, or deleting nonsensical/inappropriate/leftover-after-moving-sitelinks numbers on wikidata, or merging in VIAF)
- and the fortunately several orders of magnitude smaller section on "Unique value" violations (several WD items list the same VIAF id; possible solutions include merging in wikidata, or shuffling sitelinks between wikidata entries, or - also here - deleting nonsensical/inappropriate/leftover-after-moving-sitelinks numbers on wikidata or splitting up the VIAF cluster in question). -- Gymel (talk) 23:25, 16 April 2015 (UTC)
- The remaining issues are now well below 100, and include quite a number of different cases.
- A few of them will be dealt with in time (merging or deletion is requested in some constituent wikipedia),
- for some there probably never will exist a satisfying solution (people not unequivocally "said to be the same" like VIAF 301733717 (arguably one could try to persuade French wikisource to merge them), or VIAF 4911634 (correctly) standing for d:Q452681 and d:Q16746837 (historic encyclopedic sources well may have distinct entries for them), or VIAF 54284156 (where newer sources distinguish between d:Q657941 and d:Q936686))
- an important subclass of the former are the "pseudepigraphic" identities, like VIAF 106211622 or slightly more general everything where later authors assume the name of a more or less legendary person for their pseudonym (most prominent VIAF 78822416 where many wikipedias make the distinction between John the Apostle and John the Evangelist)
- another case are musical projects - VIAF 78814362 has no choice but to mix "strictly personal" entries and those which take the project name as main entry for a biographical article - perception as a pseudonym or a "proper" band name may have changed over time.
- other cases include wrong identifications already performed by the constituent authority files, e.g. VIAF 44312946 where GND 133981738 already confuses a philosopher and a chemist
- cases like VIAF 100176359 where two GND records are already erroneously clustered together, corresponding to d:Q716458 and d:Q1698776. Or VIAF 21158660 identifying different persons from different authority files (at least the murderer d:Q6166455 and the musician d:Q6167207)
- some cases which are a complex mixture, like VIAF 22156797 where the birth date coincides between the manager and one-work person d:Q2150873 and the prolific scientist d:Q2150870, also described by VIAF 281857969...
- an extremely unsatisfying but currently still rare case is problematic for all authority files and the wikipedias alike: early manuscripts conceived as works or as the real-world artefacts recording them, like VIAF 222517998 with d:Q93304, d:Q12397781. Collections of texts are also a general problem, VIAF 185083399 "is" d:Q19786 and d:Q83367 alike, and it is known that "O.T." means varying selections depending on the individual denomination...
- Geographic entities will become a massive problem, VIAF 234616394 is one geological entity but usually two (separable) terms d:Q206587 and d:Q695368, arguably a part-whole-relationship might be involved, or VIAF 243096161 which may stand for a village or the commune named after the village, or all those islands which as political entity include some more islands in the vicinity...
- We've got some internal inconsistencies with Geographics as well. They slop over very easily into corporates (Paris, for example, is both a corporate body and a place). I'm afraid the conclusion will be to just live with them combined. Ralphlevan (talk) 20:28, 16 April 2015 (UTC)
- There are cases (I can't recollect them unfortunately) where just one entry is assigned the wrong one of two otherwise clearly shaped VIAF clusters and I just deleted that number in the wikidata entry. Others of that kind may show in the "Single value" report which increased in size during the last weeks (it includes probably some amount of false positives, unfortunately I do not have the resources to intellectually patrol > 8000 VIAF entries to spot - I presume only a few - wikidata errors). -- Gymel (talk) 08:24, 13 April 2015 (UTC)
We use the Wikidata records as a source when creating clusters. We are taking the existence of multiple VIAF IDs in a record to be a weak hint that the records should be considered for clustering. When they do get clustered, the redundant IDs will redirect to the correct cluster. That sort of merging will appear tomorrow in production. An example would be the Q76 entry for Barack Obama. Who is going to delete the redundant VIAF ID from the Q76 page? The test would be: if the VIAF link redirects to one of the other links on the page, then it should go away. Is this something you guys would do? Ralphlevan (talk) 20:28, 16 April 2015 (UTC)
- Hi Ralphlevan, d:User:KrBot is deleting false numbers and redirects (VIAF & GND). --Kolja21 (talk) 21:17, 16 April 2015 (UTC)
- @Kolja21: That's great news! How often does it visit a particular page? I notice the d:Q76 page still appears on the constraint violations page. For that matter, how often is that report generated? Ralphlevan (talk) 19:38, 17 April 2015 (UTC)
- I left a note @ d:User talk:Ivan A. Krestinin. --Kolja21 (talk) 00:09, 18 April 2015 (UTC)
- @Ralphlevan: The reports are created around 06:00 UTC every day and usually reflect all changes on wikidata until around 11:00 the previous day (they are based on the "incremental dumps", which deliberately lag behind to give a chance to revert vandalism first). Both VIAF numbers had been inaccessible for most of the day but this is resolved now. Incidentally, Q76 was touched today (in other respects), thus KrBot might also recheck the VIAF numbers of this particular item on its next run or the one after that (depending on when this change becomes visible to it). For obvious reasons it does not check all 550,000+ VIAF numbers noted in wikidata every day, nor does it have a means to continuously harvest all VIAF changes; my suspicion is that it processes your monthly "justlinks" dump to identify VIAF ids which aren't valid any more and subsequently performs a live access for these to determine redirection vs. deletion. -- Gymel (talk) 23:58, 17 April 2015 (UTC)
- Thank you Gymel. KrBot checks all VIAF ids every day. VIAF and Wikidata dumps are used to generate the bot's to-do list. Items on the to-do list are checked using the VIAF and Wikidata APIs. Invalid cluster ids are deleted by the bot. Redirected ids are resolved. Wikidata incremental dumps give a delay of ~1-2 days for Wikidata information. VIAF dumps appear once per month, which creates a notable delay. — Ivan A. Krestinin (talk) 04:35, 18 April 2015 (UTC)
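As an illustration of the clean-up rule discussed here (redirected ids are resolved, invalid ids are deleted, and an id that resolves to a value already on the item goes away), a minimal sketch follows. This is not KrBot's code. The function name and the `redirects` and `deleted` lookups are assumptions standing in for whatever the VIAF dump or API would provide, and the ids in the example are made up.

```python
def clean_viaf_ids(ids, redirects, deleted):
    """Return the VIAF ids an item should keep, in their original order.

    ids:       VIAF ids currently recorded on one item
    redirects: dict old_id -> current cluster id (hypothetical lookup)
    deleted:   set of ids that no longer exist at all (hypothetical lookup)
    """
    kept = []
    for viaf in ids:
        seen = set()
        # Follow redirect chains, guarding against accidental loops.
        while viaf in redirects and viaf not in seen:
            seen.add(viaf)
            viaf = redirects[viaf]
        if viaf in deleted:
            continue              # invalid cluster id: drop it
        if viaf not in kept:
            kept.append(viaf)     # resolved duplicates collapse to one value
    return kept

# Made-up example: "200" now redirects to "100", and "300" was deleted.
print(clean_viaf_ids(["100", "200", "300"], {"200": "100"}, {"300"}))
```

With these inputs only "100" remains, which matches Ralph's test above: a link that redirects to another link on the same page should go away.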
- d:Q76 was updated less than one hour ago: [4] (turning the target number into a duplicate which was subsequently removed from the item). The VIAF dataset 20150416 had become visible at http://viaf.org/viaf/data less than 36 hours ago. -- Gymel (talk) 04:53, 24 April 2015 (UTC)
- Chronology (all times in UTC and slightly rounded)
- 2015-03-30 01:45 wikidata dump production starts. Not only VIAF but probably many more applications "external" to wikimedia projects base their knowledge of "wikidata content" on these dumps.
- 2015-04-02 09:00 generation of wikidatawiki-20150330-pages-articles.xml.bz2 is completed and shortly after available at the wikimedia dumps page at http://dumps.wikimedia.org/
- 2015-04-16 is the name of the files comprising the current VIAF dump. I presume that the monthly reclustering in VIAF was performed at that date and Wikidata information used for the operation based on the previously mentioned dump of 2015-03-30.
- 2015-04-21 22:30 their prevalent timestamp at http://viaf.org/viaf/data/
- before 2015-04-22 20:30 they have been available for download
- 2015-04-23 01:00 wikidata dump production starts for "20150423"
- 2015-04-24 04:00 KrBot at Wikidata starts performing about 9,700 VIAF-related edits in a row, in ascending order of Q-numbers: a change of the property value, often followed by deletion of that value (since it is identical to an already recorded value for that item). Thus for >4,800 VIAF numbers either the number recorded in the wikidata item changes or the constraint violations report will be reduced by one (I did not count the deletions but they can't have been more than half of the bot's edits).
- 2015-04-24 09:20 last of this strain of VIAF related edits of KrBot. During the operation WDQ and related tools like Autolist and BEACON file generation had already been taking on the changes within their usual delay of about 15 minutes.
- 2015-04-24 12:27 Wikidata:Database reports/Constraint violations/P214 is produced based on data from 2015-04-23 11:11
- (estimated) 2015-04-25 09:00 d:Wikidata:Database reports/Constraint violations/P214 is produced based on data from 2015-04-24 11:11 and is thus the first report on the effects of incorporating the 2015-04-16 VIAF merges into wikidata
- (estimated) 2015-04-26 05:00 generation of wikidatawiki-20150423-pages-articles.xml.bz2 is completed and shortly after available at the wikimedia dumps page
- (estimated) 2015-05-15 wikidata dump production starts for the first time with data of the VIAF shuffle 2015-04-16
- (estimated) 2015-05-15 VIAF reclustering is performed
- (estimated) 2015-05-18 generation of wikidatawiki-201505xx-pages-articles.xml.bz2 is completed and available. This is the first wikidata dump which makes available data reflecting the VIAF changes of 2015-04-16 as processed in wikidata at 2015-04-24
- Currently the four-weekly VIAF update interval and the roughly three-to-four-weekly wikidata dump production interval are in a somewhat unfortunate constellation. But as a general rule VIAF cannot expect that the most recent wikidata dump at the time of a VIAF reclustering already reflects the results of the preceding VIAF reclustering: wikidata dumps reflecting VIAF changes become available between two and at least five weeks after the operation at VIAF. -- Gymel (talk) 14:11, 24 April 2015 (UTC)
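The lag in the chronology above can be worked out mechanically. The sketch below picks the first dump whose production starts after the bot's edits were finished, using the dates from the chronology; the 2015-05-15 start is only the estimate given there, and its time of day (midnight) is an assumption. The function name is made up for illustration.

```python
from datetime import datetime

def first_dump_reflecting(applied_at, dump_starts):
    """Return the earliest dump start that is not before applied_at, or None."""
    later = sorted(d for d in dump_starts if d >= applied_at)
    return later[0] if later else None

dump_starts = [
    datetime(2015, 3, 30, 1, 45),
    datetime(2015, 4, 23, 1, 0),
    datetime(2015, 5, 15, 0, 0),   # estimated; time of day assumed
]
krbot_finished = datetime(2015, 4, 24, 9, 20)   # last KrBot edit of the run
viaf_recluster = datetime(2015, 4, 16)          # name of the VIAF dump

dump = first_dump_reflecting(krbot_finished, dump_starts)
lag = dump - viaf_recluster
print(dump.date(), lag.days)   # 2015-05-15 29
```

The dump started on 2015-04-23 misses the bot run by about a day, so the first usable dump starts 29 days after the reclustering; adding the estimated three days until the file is complete gives roughly four and a half weeks.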
Wikidata items in VIAF bundles
VIAF bundles now include Wikidata items, apparently in place of English Wikipedia pages. Perhaps by mass replacement relying on links from those items to EN.wiki pages? By the way, is Wikidata now considered one of the authorities?
Hours ago I discovered bundle VIAF 90666739 that errs only in linking/bundling the wrong Wikidata item, and thus displaying a wrong Wikidata description. I reported at the bottom of section 1.7 Wikipedia:VIAF/errors#Other errors, where it may be educational for those of us who read that section. But more may read here?
As I depart, neither one of the WD items includes that VIAF identifier. Would that help or hurt in such a case? --P64 (talk) 23:30, 10 May 2015 (UTC)
- That is a general problem of VIAF: Name + year of birth (and no other information) = same person. Mike Phillips (illustrator) was a redirect without a WD item. To lead VIAF in the right direction I've created d:Q19890764 for the illustrator. --Kolja21 (talk) 03:20, 11 May 2015 (UTC)
- Sorry, I should have copied both WD identifiers here. The illustrator was at Wikidata, associated with his stub biography pl:Mike Phillips. I merged the new item down there (so it is now a redirect) and added some more statements.
- VIAF 90666739 now points to, and displays the expanded description from, the illustrator at Wikidata--as it should. Corrected because I reported manually or because new VIAF identifier statement(s) at Wikidata triggered the work process described above? --P64 (talk) 18:03, 18 May 2015 (UTC)
- Hi P64, VIAF works like a bot. The only way to correct errors is through Wikidata or de:Wikipedia:GND/F. BTW: We have right now on Wikidata an experiment with GND = "no value" (in the case VIAF merges two persons but only one of them has a GND). --Kolja21 (talk) 19:53, 18 May 2015 (UTC)
- @P64: It's hard to tell. This month VIAF reclustering took place on May 12, i.e. earlier than estimated above but still after your enhancement of d:Q11780639. Thus VIAF having corrected the situation so shortly after your edits is rather a coincidence. Also, differing from the tentative timeline above, the wikidata dump from mid-April stalled and another one hasn't started yet. The "wikidata" description of VIAF 90666739 from May 12 contains something you added only on May 11, thus we now know for certain that VIAF gets its data (also) by means other than monthly dumps - maybe they follow the incremental backups or give those items which have a VIAF number (P214) in Wikidata a special treatment, i.e. download them separately by API access or whatever. Like Kolja I would guess that adding a second forename and birth dates at wikidata (LCNAF, GND and BNF also know the exact-to-the-day dates and at least the BNF provides them in a machine-friendly way) should have been sufficient to provoke re-assignment of the item to the correct cluster. -- Gymel (talk) 06:07, 19 May 2015 (UTC)
Exhaustive VIAF import to Wikidata?
At d:User talk:Ivan A. Krestinin#Outdated VIAF data? I'm trying to summarize the different roles I perceive for the mutual VIAF / Wikidata matchings. There is also a discussion about an imminent one-time import of "excess" mappings from VIAF (VIAF records connections to about 50% more Wikidata items than the number of VIAF ids recorded in Wikidata). Anyone please feel free to comment, especially @Ralphlevan and ThomasBHickey: if this would have a negative impact on VIAF's workflow or if someone sees issues with the licences of the data. -- Gymel (talk) 07:09, 10 June 2015 (UTC)