Wikipedia:Turnitin/Statement by Ocaasi

Hello Wikipedians!

I read about Turnitin in a news article about a year ago and contacted them independently to see if they would be interested in donating some services to us. They liked the idea of helping Wikipedia, and it seemed, despite our different backgrounds, that we shared an interest in supporting "original authorship"--in their case catching plagiarism and in our case copyright violations. They were so enthusiastic that they offered to assess millions of articles on Wikipedia, a possibility that I was really excited to bring back to the community. Just for the record, I'm not paid by Turnitin or otherwise affiliated with them in any way. I think this project might benefit us, but I have tried to consider all objections seriously, and whatever the community decides is best is ultimately what will happen. --Ocaasi^{t | c}

Introduction

With nearly 100 employees, eight global offices, and a headquarters in Oakland, California, Turnitin is a leading provider of plagiarism detection services.

As of 2012, Turnitin serves approximately 10,000 institutions in 126 countries and 13 languages, including 2,500 higher-education organizations and almost 5,000 middle and high schools. Turnitin's system evaluates approximately 40 million academic papers per year for text matches to uncited sources; during final exam periods the site processes 400 new submissions per second.

Turnitin has offered to review every English Wikipedia article for copyright violations, a service worth hundreds of thousands of dollars. This review could revolutionize the way we manage our articles, reveal previously undiscovered infractions, and shore up a core part of our mission -- to remain truly free for others to use, modify, or even sell. Furthermore, if implemented effectively, Turnitin's services could lead to significant reforms or even a complete overhaul of how copyrighted content on Wikipedia is handled.

Turnitin seeks only attribution from Wikipedia. In particular, the off-Wikipedia reports (linked to from article talk pages) would contain minimal branding at the bottom corner such as "Powered By iThenticate", Turnitin's parent company. Turnitin would also like to issue a joint press release with the Wikimedia Foundation and to mention its collaboration with Wikipedia on its website or in other promotional materials. Such attribution, as well as other elements of a collaboration with Turnitin, raises procedural and policy issues, which this page hopefully begins to address.

Attribution/Advertising

Wikipedia is non-corporate and non-commercial, and any threats to this status warrant serious consideration. Should the community see any corporatizing or commercializing impact arising from a collaboration with Turnitin, that impact would have to be weighed against potential losses from turning down such a collaboration. I don't see a particular problem in this area, but I want to address it up front.

Attribution not advertisement. Mentions of Turnitin on Wikipedia would be extremely limited. Aside from the main project page, or typical promotion at relevant community forums, attribution would be posted only on off-Wikipedia report pages, which would be linked to from Wikipedia article talk pages. Talk pages are not outward facing. They serve mainly editors, and many readers have never seen one. Therefore, any attribution to Turnitin would primarily tell other editors where the report came from. The current proposal suggests providing attribution only for articles with suspected violations (those with a high degree of text-matching), not for all English articles. In other words, there would not be 3.8 million attributions, but only, perhaps, five to ten percent of that number, and again, all of those would be off-Wikipedia. It's worth considering that Wikipedia already gives attribution to private companies in limited circumstances. There are links to tens of thousands of for-profit companies in references on actual article pages. There is the Template:Find sources template, which links to Google's various search sites as well as JSTOR. There is also precedent in the Wikipedia Zero program, which gives direct attribution to the mobile company providing the free access. ("It is important to communicate to users when a page is free so that they are assured that no data charges are being incurred and are also encouraged to continue browsing without fear. At the same time, it is important to the operator to communicate the 'freeness' so they receive the marketing benefit from it.") In short, there is precedent for mutually beneficial partnerships.

No unfavorable precedent. There is potential concern that collaborating with Turnitin would set precedent for companies to use Wikipedia to "advertise" in the future. Such a concern might overlook the unique services Turnitin would be providing, a service that facilitates a core site operation of Wikipedia, namely copyright enforcement. References to Turnitin would be solely in connection with that service and limited in the ways discussed above. Should Wikipedia go forward with a collaboration with Turnitin, that collaboration need not provide precedent for advertising on Wikipedia writ large. Indeed, perhaps the broadest precedent to be drawn from such a collaboration would be that if a company can facilitate Wikipedia's mission of remaining a free encyclopedia by providing services integral to its core operations for free, then that entity may be entitled to limited attribution recognizing the donation of those services. Should this limited precedent still be perceived as too broad, the community can always say "no" to further instances of attribution -- we can draw the line wherever we want, and we can do what works for us here without repeating the same at a later time.

Attribution does benefit Turnitin. While attribution is not in itself advertisement, this doesn't mean Turnitin won't benefit from collaborating with Wikipedia. There are indeed many potential benefits to Turnitin from this proposal: (i) advancement of its core values involving education and protection of original authorship; (ii) application of its algorithm and database to a novel use; (iii) affirmation from Wikipedia that it believes Turnitin's services complement and further Wikipedia's core mission; (iv) increased visibility of its reports and services in the Wikipedia community among both editors and some readers; and (v) the opportunity to disclose to its customers a partnership with a reputable encyclopedia known the world over. In short, this collaboration accepts mutual benefit as a feature rather than a bug.

False positives

Even if the attribution/advertising issue is resolved, Turnitin still has a major obstacle to overcome in actually designing a system that works. That, however, is something they are willing to test, develop, and execute completely on their end, with their employees and technical staff, their troubleshooting efforts, and their money. They're willing to invest in making this project work.

Turnitin is well aware of the problem of mirrors and content reuse and has committed to adapting and designing a separate algorithm that works just for Wikipedia. They also realize that no collaboration could be pursued on a wide scale without a well-designed pilot program that rigorously tests the effectiveness of Turnitin's algorithms and reports, followed by a period of feedback, analysis, and refinement to optimize their approach.

In order for Turnitin's reports to be used on Wikipedia, they would have to meet a certain bar of effectiveness and accuracy. False positives would waste editor resources and could lead to alarmism or unnecessarily removed content.

And even then, Turnitin reports would have to be viewed as starting points for investigations rather than conclusions. Editors would have to use their discretion in exploring Turnitin reports and acknowledge that a clear Turnitin report is only proof that Turnitin didn't find any matches. Further investigation would be necessary. If Turnitin does find matches, investigation and confirmation would still be necessary.

Turnitin's reports would do several things to facilitate identifying copyright violations. They would present matching text from an extensive variety of sources. They would identify a precise citation for those matches. They would compute a percentage-matched statistic, which could be used on a Wikipedia project page that ranked articles by that statistic. The highest-ranked articles would be the most likely candidates for investigation, and having the list would allow Suspected copyright violations, Copyright problems, and Copyright cleanup investigations to prioritize their work. For questions about a specific article, editors could use the Turnitin report to speed up and enhance their process in such forums as DYK, GA, FA, and regular content and verification work.

Usage of Turnitin on Wikipedia might provide the additional benefit of operating as a deterrent to those who would otherwise be careless with or intentionally violate our copyright policies. The presence of Turnitin reports may encourage writers to make a more vigorous effort to avoid copying from copyrighted sources or to avoid close paraphrasing. Though not necessarily benevolent in motivation, this arrangement may ultimately result in the contribution of text that is meaningfully original enough that it is no longer an outright copyright violation or even close paraphrasing.

The question is whether having Turnitin's reports gives us another beneficial tool and improves upon our current copyright checking regime. We can test to see if it does, and if it does, then I think there's good reason to use it.

Comparing Turnitin to alternatives

Turnitin is not the only way to approach plagiarism-detection.

There are other companies that charge for their services. In setting up the project it did not appear that any other company had the history, reputation, knowledge, partnerships, database, code, scope, or scale to address Wikipedia's issues.

There are also free alternatives. On inspection, it appears that they only provide one aspect of the plagiarism detection process, which is a webcrawler. This approach lacks one of Turnitin's core strengths, which is its database of millions of books and articles which it has developed through proprietary partnerships with various content providers. Turnitin's webcrawler may also be superior for the purposes of finding plagiarism. It uses a pattern-matching algorithm that has been developed over two decades and which is different from standard keyword-matching algorithms used by search indexes such as Google, Bing and Yahoo. Turnitin's web index is very large, up to 20 billion articles. Turnitin has devoted thousands of hours and hundreds of employees to developing their system, expanding it, and refining it--a process that free alternatives simply can't invest in. Last, free alternatives are unlikely to scale in a systematic and massive way, such as using them to check every single Wikipedia article on some regular basis.

Last, we have our own copyright tools. CorenSearchBot and/or MadmanBot currently check new articles for copyright violations. There are limitations to those bots: they do not check existing Wikipedia content, and they only check articles against web pages, not a content database. It's possible that our bots are not as developed as Turnitin's proprietary code and that their webcrawler is not as sophisticated. Our bots do not generate an itemized report which allows editors to actually see and compare plagiarized sections or identify the various sources which result in the match (for recent Coren's bot reports, see User:CorenSearchBot/manual). And our bots do not have access to a content database like Turnitin's, which contains millions of articles and journals. In addition to the webcrawler, Turnitin has access to 100 million articles from content publishers including library databases, textbooks, digital reference collections, subscription-based publications, homework helper sites, books, dissertations, and theses. Some of these are sources that we've never had the ability to check before, whether manually or with existing tools such as MadmanBot. The question is whether our tools are optimal and sufficient, and whether they would be complemented (or replaced) by Turnitin's system.

To determine which path forward is best, Turnitin needs to explain and demonstrate how they would approach analyzing Wikipedia content. Also, Coren, Madman, or others in the community would have to suggest or propose on-Wiki methods which were comparable. I think it's highly unlikely that an on-Wiki tool could design a system specifically for Wikipedia or have the resources and server capacity to check all of Wikipedia on a regular basis, but it's not impossible and I wouldn't put it past this community to develop such a system if they chose to do that.

Backlog

It may be the case that Turnitin reveals more copyright issues than we currently have capacity to fix. That may be a problem, but I think it is ultimately better than not knowing about those issues at all.

There is an extensive existing copyright detection and cleanup effort at Copyright cleanup investigations, Copyright problems, and Suspected copyright violations. There are already a variety of copyright tags and templates for articles which may have or do have problems. There are also various user warnings for the talk pages of contributors who have violated copyright.

Turnitin could easily integrate with this existing system, through one of the existing copyright bots such as MadmanBot, or through a new bot that posted article talk page notices with links to Turnitin reports. If Turnitin's algorithm was rigorously tested, we could possibly design bots which automatically tagged pages with a high level of text-matching. Last, Turnitin page-matching scores could be placed on a Wikipedia page in order of degree. This would allow copyright investigators to prioritize their efforts.
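A minimal sketch of how such a prioritized list might be built, assuming each report reduces to an article title and a percentage-matched score. The `MatchReport` record, the `prioritize` function, and the 50 percent threshold are all hypothetical names and values chosen for illustration; nothing here reflects Turnitin's actual report format or any existing bot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchReport:
    """Hypothetical record: one article and its text-matching score."""
    title: str
    percent_matched: float  # assumed to be on a 0-100 scale


def prioritize(reports, threshold=50.0):
    """Return reports at or above the threshold, highest score first."""
    flagged = [r for r in reports if r.percent_matched >= threshold]
    return sorted(flagged, key=lambda r: r.percent_matched, reverse=True)


if __name__ == "__main__":
    # Made-up sample scores, purely for illustration.
    sample = [
        MatchReport("Article A", 12.0),
        MatchReport("Article B", 83.5),
        MatchReport("Article C", 57.0),
    ]
    for report in prioritize(sample):
        print(f"{report.percent_matched:5.1f}%  {report.title}")
```

Where to set the threshold would be a question for the trial to answer: set it too low and investigators are buried in false positives, too high and real problems go unlisted.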

Having reports linked on article talk pages would allow non-copyright experts to evaluate articles, effectively spreading the investigation of violations from a small core of users to the whole community. Turnitin reports would do much of the tedious or even impossible work that cleanup investigations currently face. They would make the process more productive, more efficient, better targeted, and more effective.

Proprietary software

As a community which shares many of the open source movement's goals, it may be ideal for Wikipedia to use only open source products. However, it may simply be pragmatic and beneficial for us, at least in the short-to-medium term, to collaborate with those who have the extensive time, capital, resources, and motivation that are frequently (but not exclusively) found in successful private companies.

There are also instances of proprietary software that we use and link to already. Our Template:Find sources links to Google, Google News, Google Books, Google Scholar, and JSTOR. Those are proprietary, but we link to them because they offer a better service than anyone else. Our references regularly link to proprietary content, some of which is behind paywalls. There is also the Wikipedia Zero mobile partnership, which acknowledges the donation of mobile service providers in a banner atop the mobile version of Wikipedia in those areas. These examples are tolerated, if not actively encouraged, because they further our core mission. I don't identify a meaningful difference in principle or degree with Turnitin's situation.

As for Turnitin's code, it's not necessary for us to see it to know if it works and works better than our current systems. We can take into account Turnitin's reputation and history. We can also evaluate the results of a trial program ourselves and check to see what Turnitin catches and what it does not, and with what frequency and reliability. In the end, we can make use of what Turnitin is offering until there are options that can work comparably and can be implemented now. At the moment I don't know of any such options.

Foundation resources

There would be some upfront and ongoing investment from the WMF.

The process of crafting appropriate disclaimers on the talk page notices would be relatively trivial.

Crafting a written agreement, should the Foundation desire one, would involve some legal resources, but that would be a one-time cost and hopefully a collaborative rather than contentious process.

The ongoing use of server resources to tag articles does not seem to me to be prohibitive, particularly in light of all of the bots which are already running. If there are 3.8 million articles, and 10% of them have a high level of text matching, that would be 380,000 edits to talk pages once every few months. I don't think that is beyond our capacity, although it is still considerable. It may also be technically feasible to have Turnitin host their reports at a static link, eliminating the need for subsequent talk page updates except on new articles. That would cut down the number of edits dramatically.
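As a back-of-the-envelope check on that volume, here is a small sketch that assumes one talk page edit per flagged article per pass. The function name `talk_page_edits` is hypothetical, and the 5% and 10% fractions are the rough range suggested on this page, not measured values.

```python
def talk_page_edits(total_articles, flagged_fraction):
    """Estimate talk-page edits per pass: one edit per flagged article."""
    return round(total_articles * flagged_fraction)


if __name__ == "__main__":
    total = 3_800_000  # article count cited in the text
    for fraction in (0.05, 0.10):
        edits = talk_page_edits(total, fraction)
        print(f"{fraction:.0%} flagged -> {edits:,} edits per pass")
```

With static report links, only the first pass would cost this much; later passes would touch only new or newly flagged articles.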

In the end, we have to decide if that outlay of resources is worth having access to those intelligent text-matching reports from a company that specializes in doing so. I personally think the benefits would be worth the costs, but I'll leave that determination to the technical folks.

Legal and media concerns

There are some legal issues I'll largely skip over here, since the Foundation has already reviewed this proposal and crafted terms that they are comfortable agreeing to.

The media risk--of one day waking up to 100 mainstream news articles about how X percent of Wikipedia is plagiarized--is a real concern. Fortunately we can mitigate that concern by rolling out the collaboration very slowly, starting with a trial of only 150 articles. Drawing sweeping conclusions from such a small trial would be inappropriate and methodologically unsound. Moreover, the trial will not be promoted by either Turnitin or the Foundation until its results have been analyzed and further collaboration is deemed desirable. If necessary, the results of the trial could even be kept private off-Wikipedia, although that would obviously sacrifice transparency.

My last thought on this is one I've made above: it's ultimately better for us to know about copyright/plagiarism issues than to continue on in ignorance. If we learn that we have a large problem, we will have to adapt to that, and it will be newsworthy that we intend to do so.

Conclusion

This is the best case I can make for why a collaboration is worth pursuing further. I'm sympathetic to concerns here and if we need to tweak our approach then I am willing to try and facilitate that as well. I think sometimes we can achieve more with the help of others--even a private company--than by acting alone, and this may be one of those cases. Ocaasi^{t | c} 16:48, 22 July 2012 (UTC)