User:Caeciliusinhorto/Don't rely on Earwig

Earwig's Copyvio Detector[a] is a useful tool to help identify potential copyright/plagiarism issues. However, it is not foolproof: it frequently returns both false negatives (it misses some copyright issues) and false positives (it flags potential issues which are not actually problems). If you are reviewing an article for copyright and plagiarism, simply checking Earwig is not sufficient.

Here are some of the key issues, along with tips on how to account for them:

Earwig only checks web sources


While it seems obvious, it is important to understand that if a source is not available on the web, Earwig cannot check it. This means that text copied from a print source is not going to be detected by Earwig. Less obviously, if the text is available online but not indexed by Google – e.g. scans of newspapers in online databases, or of books on the Internet Archive – Earwig will not detect it.

A corollary of this is that some articles are more effectively covered by Earwig than others. Articles on modern popular culture and sport are likely to be largely based on online sources which Earwig can check; more academic topics as well as ones which predate widespread access to the web are less likely to be effectively checked by Earwig. For example, Anyte is an article about a relatively obscure ancient Greek poet. Earwig can probably check only two of the sources cited in this revision – the Brooklyn Museum and USGS webpages, both of which are used to support only a few words. By contrast, this revision of the article on actress Anya Taylor-Joy is mostly supported by sources that Earwig can probably check.

When checking for copyright issues, therefore, it is important to manually check some of the sources that Earwig will not look at – especially if the article relies heavily on such sources. Check books which are available on Internet Archive; use The Wikipedia Library and the Resource Exchange to find copies of articles. If you have access to a good library where you live, use that to find print sources. If you are doing some sort of formal review, ask the author of the article to supply you with scans of some of their sources. Check the obscure sources, because they are the ones where issues are likely to be missed!

Earwig can miss close paraphrasing


Even when Earwig does check a source, it can miss problems. Close paraphrasing is the superficial modification of material from another source; it can violate Wikipedia's policies on both plagiarism and copyright violation.

An article can rephrase distinctive text in such a way that Earwig will not flag it, but it still violates Wikipedia's rules on copyright and/or plagiarism. For instance, if an author takes (to borrow the example from WP:CLOP) Hilaire Belloc's description of a llama from More Beasts (for Worse Children) as a source:

The Llama is a woolly sort of fleecy hairy goat, with an indolent expression and an undulating throat; like an unsuccessful literary man.

And rewrites it for Wikipedia as:

Llamas are a kind of wooly goat. They have an expression of indolence and a throat which undulates like a failed writer.

Then as well as being guilty of unencyclopedic writing, they are also guilty of close paraphrasing – and yet Earwig is likely to miss this.
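It is easy to see why this kind of rewording slips through if you think about how overlap detection works. The sketch below is a deliberately simplified illustration of n-gram matching – not Earwig's actual algorithm, which is more sophisticated – comparing the word trigrams of Belloc's sentence with those of the paraphrase. Because no three-word sequence survives the rewrite intact, a purely sequence-based comparison scores the two texts as completely dissimilar:

```python
import re

def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard_similarity(a, b, n=3):
    """Jaccard similarity between the word n-gram sets of two texts."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

source = ("The Llama is a woolly sort of fleecy hairy goat, with an "
          "indolent expression and an undulating throat; like an "
          "unsuccessful literary man.")
paraphrase = ("Llamas are a kind of wooly goat. They have an expression "
              "of indolence and a throat which undulates like a failed "
              "writer.")

print(jaccard_similarity(source, paraphrase))  # prints 0.0
```

Nearly every content word of the original survives in some form, yet not a single three-word sequence is shared, so this kind of detector reports zero similarity. A human reader, by contrast, sees the borrowed structure immediately.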

Again, the solution is to spotcheck sources. Earwig is likely to find direct borrowings, but you should be on the lookout for excessive structural similarities between source and text. This applies especially to distinctive rhetorical devices. The most obvious wording of basic facts like "John Smith was born in 1967" is acceptable even if that's the same phrasing used in the sources, but when it comes to communicating more complex ideas, or using figurative language, structural similarities are more of a concern.

Be especially wary of flowery language with similes, especially where the prose looks awkward. This is often a clue that a source has been ineptly and too-closely paraphrased.

Earwig can't translate


If the source is in a different language to the Wikipedia article, Earwig is unable to detect plagiarism or copyright violations. It doesn't know what the text it is looking at means in the same way that a human does. If it compares the beginning of Julius Caesar's Gallic Wars in Latin:

Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.

Against an article which reads:

Gaul is all divided into three parts, of which one the Belgae inhabit, another the Aquitani, the third those who in their own language are called Celts, in ours Gauls.

Then Earwig sees that the two texts look completely different on a word-by-word level, and assumes everything is fine. It does not know that the article is plagiarising Caesar. Careful manual reviewers, however, are likely to notice the unidiomatic sentence structure and check whether the article is guilty of over-literal translation.

If you read multiple languages, foreign-language sources are a good candidate for spot-checking when reviewing, not just because Earwig cannot deal with them but also because most of the article's readers will not understand them; any errors are less likely to be found by other reviewers.

Earwig and false positives


If the same text appears in a Wikipedia article and an online source, Earwig will identify it. It treats the text exactly the same whether it is attributed or unattributed; marked with quotation marks, set in a blockquote, or included in running text without any indication. Similarly, if the Wikipedia article and the source both mention the same proper names, Earwig will flag those. Just because Earwig says something has a high chance of being a copyright violation, it does not mean that there is necessarily anything wrong.

According to Earwig, this revision of the article Mira Bellwether has a 46.2% chance of being a copyvio of this article published by Them. The three longest pieces of text it flags as problematic are all properly quoted and attributed. The fourth-longest is the title of the source itself, included in our article's bibliography! I have seen reviewers complain about high Earwig scores in GA reviews before, with no apparent understanding of what that score means or whether it is actually problematic. Simply having a high text similarity, as calculated by Earwig, is not necessarily an issue. (And if there is an issue, simply fiddling around with the text until Earwig no longer complains does not necessarily solve the problem!)

Earwig also does not know when a source copies from Wikipedia. If it flags something as copyvio, and you determine that unattributed copying has in fact occurred, it is important to check in which direction the copying happened. Checking archived copies of the page on archive.org can help with this; if there are no such copies, it can help to check the revision history of the text at issue in the Wikipedia article. If you can see that it evolved gradually over time, especially with many separate editors contributing, it is likely to have been original to Wikipedia; if it was all added in a single chunk it is more likely to have been a copyvio. (Though remember that even text which was added to a Wikipedia article all in one chunk may have been originally written for Wikipedia and then later copied by an external site!)

Dealing with false positives


Ideally, you shouldn't need to do anything. By definition a false positive is not a problem. However, it is possible that someone reviewing an article (probably as part of the Good Article process) will raise the score given by Earwig as a concern. If this happens to you, simply explain to the reviewer why Earwig's score is not indicative of an actual copyright issue in this case. If they remain unconvinced, open a broader discussion (e.g. at WT:GAN for issues related to Good Article reviews).

Notes

  1. ^ This essay refers specifically to Earwig's tool because it's widely used and I am familiar with it, but these issues are applicable to automatic copyvio detection software more generally.