Wikipedia:Turnitin/Technical management

Machine-assisted approaches have proven successful for a wide variety of problems on Wikipedia, most notably vandalism and spam. Wikipedia already uses a rule-based edit filter, a neural-network bot (Cluebot), and a variety of semi-automated programs (Stiki, Huggle, Igloo, Lupin) to determine which edits are least likely to be constructive. The worst of these are reverted automatically without any human oversight, while those that fall into a gray area are queued for manual review, prioritized by suspiciousness based on a number of factors that have historically correlated with problematic edits.

This page explores how we can apply the same approach to copyright. Copyright violations are a major concern and problem on Wikipedia, as the encyclopedia aims to be free for anyone to use, modify, or sell. To maintain copyright compatibility, content on Wikipedia that does not share those permissions is severely limited (as in the case of non-free content/fair use), and that content is always explicitly tagged or labeled as such. Editors who contribute text or images to the encyclopedia as 'free' content, when the material is actually held under someone else's copyright, introduce practical and legal problems for the community. If we cannot trust that the content on Wikipedia is free, then neither can our content re-users. In addition to exposing Wikipedia to liability, we do the same for those who take from our site and assume the content has no strings attached (beyond attribution and share-alike provisions).

Steps for implementation

  • Analyzing the operation and evaluating the effectiveness of our current copyvio detection tools (MadmanBot/CorenSearchBot)
  • Analyzing the operation and evaluating the effectiveness of prospective copyvio detection tools (Turnitin)
  • Developing a corpus of gold-standard copyvios against which to test future tools
  • Integrating metadata-based approaches with our current or prospective tools to improve their accuracy
  • Building and receiving approval for on-Wikipedia bots which automatically revert, flag, tag, or list edits that are most likely to contribute copyvios

CorenSearchBot


How it works

  • the feed of new articles is fed into the Yahoo BOSS API
  • searches for the subject of the article
    • Update: I've found Coren's more recent code (>2010) searches for the subject of the article and for the subject of the article + random snippets of the text. — madman 02:40, 3 September 2012 (UTC)[reply]
  • pulls the top 3 results
  • converts page to text
  • compares the Wikipedia article and each search result with the Wagner–Fischer algorithm and computes a difference score (a word-level sketch follows this list)
    • After trying forever to reproduce CSB's scores, I finally realized that the Wagner-Fischer algorithm is used to compare words as entities, not letters, if that makes sense (distance score is calculated using matching words, inserted words, deleted words). — madman 17:26, 5 September 2012 (UTC)[reply]
  • if the score is high enough, the article is a likely copyvio
  • the page is flagged, the creator is notified, and SCV is updated
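
As a rough illustration of the comparison step above, here is a minimal sketch (not CorenSearchBot's actual code, which is Perl and scores things in a more complicated way, per the notes above) of a Wagner–Fischer edit distance computed over word tokens rather than characters, as madman describes:

```java
// Sketch only: word-level Wagner–Fischer distance and a normalized similarity.
public class WordEditDistance {

    /** Edit distance counting word insertions, deletions, and substitutions. */
    public static int distance(String a, String b) {
        String[] s = a.trim().split("\\s+");
        String[] t = b.trim().split("\\s+");
        int[][] d = new int[s.length + 1][t.length + 1];
        for (int i = 0; i <= s.length; i++) d[i][0] = i;   // delete every word of s
        for (int j = 0; j <= t.length; j++) d[0][j] = j;   // insert every word of t
        for (int i = 1; i <= s.length; i++) {
            for (int j = 1; j <= t.length; j++) {
                int cost = s[i - 1].equalsIgnoreCase(t[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,         // delete a word
                        d[i][j - 1] + 1),        // insert a word
                        d[i - 1][j - 1] + cost); // keep or substitute a word
            }
        }
        return d[s.length][t.length];
    }

    /** 1.0 means identical word sequences; 0.0 means nothing in common. */
    public static double similarity(String a, String b) {
        int maxLen = Math.max(a.trim().split("\\s+").length, b.trim().split("\\s+").length);
        return maxLen == 0 ? 1.0 : 1.0 - (double) distance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        String article = "the quick brown fox jumps over the lazy dog";
        String webPage = "a quick brown fox jumped over the lazy dog";
        System.out.println(similarity(article, webPage));   // ~0.78
    }
}
```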

Weaknesses

  • Only inspects new articles (as opposed to, say, all significant text additions)
  • Searches for the article's topic instead of matching verbatim content. This presumably works well for new articles because they tend to cover relatively obscure topics whose sources are likely to appear in the top 3 search results. It would not scale well to finding section-level copyvios in well-developed articles.

Questions

  • What is CSB's false positive rate? What Wagner-Fischer score is CSB's cutoff?
    • A preliminary review of the corpus says it's about 30%, but I need to do manual coding of about three years of dispositions in order to get a statistically sound answer on this one. About half of those false positives are because the content was licensed appropriately, but the bot had no way of knowing that. — madman 03:49, 27 August 2012 (UTC)[reply]
      • Any characterization of the other half? West.andrew.g (talk)
        • Straight-up false positives. — madman 13:36, 27 August 2012 (UTC)[reply]
          • I note that a lot of these false positives are for content that doesn't meet a threshold of originality necessary for copyright protection, e.g. track listings, lists of actors, etc. The similarity to external content could be 100% but if it were a positive it'd still be a false positive. I hope to ask on Tuesday how iThenticate might or might not address such content. I'd be very surprised if anything other than rough heuristics could be used to cut down on those positives. — madman 06:23, 1 September 2012 (UTC)[reply]
    • Regarding the Wagner-Fischer score, CorenSearchBot's scoring is... complicated. I need to re-read the code and see how the Wagner-Fischer scores map to CorenSearchBot's scores, because CorenSearchBot has a minimum threshold whereas obviously the higher a Wagner-Fischer score, the more two documents differ. — madman 03:49, 27 August 2012 (UTC)[reply]
      • Am I correct to assume the threshold was arbitrarily derived, or was some corpus the basis for empirical derivation? West.andrew.g (talk)
  • I read the CSB source code and did some investigation into the BOSS API. Is there some type of special agreement with them to get free queries? What are the rate limits we respect? About how many queries are made daily? West.andrew.g (talk)
    • The Wikimedia Foundation has a developer account to which Coren, I, and others have access. As far as I know, we don't have any special agreement with Yahoo! and are paying for the queries. About 2,000 queries were made daily when MadmanBot was running. — madman 03:49, 27 August 2012 (UTC)[reply]
  • Can CSB search paywalled sites like NYTimes.com? (Turnitin cannot)
  • Does CSB exclude mirrors?
  • How does CSB decide which url to include in its report at SCV?
    • It's the first one it finds that exceeds the minimum threshold of infringement. — madman 03:49, 27 August 2012 (UTC)[reply]
      • Wouldn't it be more rigorous to include any/all of the 3 if they exceed the threshold? Ocaasi t | c 19:27, 30 August 2012 (UTC)[reply]
        • Actually, if I'm reading the code correctly (Perl is a WORN language – write once, read never), the last match that met the threshold is what is returned. I agree that the URL with the highest score should be returned; I don't know about returning multiple URLs. I don't think it's necessary. — madman 06:20, 1 September 2012 (UTC)[reply]

Copyvio detection approaches


Direct: text comparison

  • Web index (either keyword-based or pattern-based)
  • Content database

Indirect: attribute comparison

  • Metadata
  • Other features
  • Hypothetical features
    • This is not something which is explicitly stored, but it is certainly feasible to compute if the devs wanted to put some code in: how much time elapses between the 'edit' button being pressed and the edit being committed (previews may complicate the process). Someone writes five paragraphs in 30 seconds? That is suspicious! It is not fool-proof (e.g., someone copying content developed in a sandbox), but it would be an interesting data point.

Combined method

  • Use Turnitin or CorenSearchBot, whichever is more accurate and comprehensive, to compute a text-match score
  • Feed that score as one input into a machine-learning algorithm trained on the CorenSearchBot/Suspected copyright violations corpus (see the sketch after this list)
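
As an illustration only, here is a sketch of how such a combined score might be computed. The feature set and weights below are invented placeholders (including the hypothetical edit-composition-time feature suggested earlier); a real classifier would be trained on the SCV/CSB corpus rather than hand-weighted:

```java
// Illustration only: combining a text-match score with editor metadata.
// The weights below are placeholders, not a trained model.
public class CopyvioScorer {

    public static double score(double textMatch,       // 0..1 from Turnitin or CSB
                               boolean registered,      // is the editor logged in?
                               long editCount,          // editor's total edits
                               int priorBlocks,         // blocks on the editor's record
                               double secondsToCompose) // hypothetical edit-timing feature
    {
        double z = -3.0
                + 5.0 * textMatch
                + (registered ? -0.5 : 0.5)
                - 0.3 * Math.log1p(editCount)
                + 0.4 * priorBlocks
                - 0.2 * Math.log1p(secondsToCompose);
        return 1.0 / (1.0 + Math.exp(-z));              // logistic link: probability of copyvio
    }

    public static void main(String[] args) {
        // A 95% text match from an unregistered editor who saved five paragraphs in 30 seconds:
        System.out.println(score(0.95, false, 2, 0, 30));
    }
}
```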

Approaches for developing a corpus

  • create the copyvios yourself by copying copyrighted content directly
  • use one tool's results to train another tool
  • scan data dumps based on edit summary, addition or removal of a copyvio tag/template
  • use the archive of manually cleared copyright cleanup investigations
  • scan history of all CSD G12
  • check SCV moves to CP after page blanking
  • get all versions tagged by CSB (usually the first version)
  • check SCV archives (no / false / cleaned / deleted / relist (deprecated) / history purge / redirect / list (moves to CP))
  • find out what the actual {{templates}} are. Note that CSB was A/B tested by the Foundation and there are multiple templates
  • first edit in RevDelete incidents citing RD1 (access issues for non admins)

Questions from Turnitin


Questions from Chris Harrick


Editors have a decision to make about how much they assist Turnitin with this effort. Some may prefer to optimize Wikipedia's own tools instead, and feel less positive about helping a private company, even one with whom we may be partnering. Do whatever feels right for you.

  1. How hard is it to identify and maintain a list of mirrors?
  2. How confident are you that the mirror list is up-to-date?
  3. Do you have a resident expert on mirror sites who can tell us:
  4. How many mirrors are there?
  5. Are there new mirrors daily, weekly, monthly?
  6. Are there ways to potentially automate the tracking of mirrors?
  7. What constitutes a mirror versus a legitimate copy? (Just the attribution and link back we assume but we will most likely need to filter for copies as well)
  8. How many brand new articles are being created per day?

New mirrors of Wikipedia content are constantly popping up. Some of them reprint sections of articles or entire articles; others mirror the entire encyclopedia. Wikipedia content is completely free to use, reuse, modify, repurpose, or even sell, provided attribution is given and downstream re-users share the content under the same terms. The classic sign of a 'legitimate' mirror is that it acknowledges Wikipedia as the original 'author' and is tagged with one of the compatible licenses: the Creative Commons Attribution/Share-Alike License (CC-BY-SA), the GNU Free Documentation License (GNU FDL or simply GFDL), public domain, etc. The presence of attribution and one of those licenses on the page is de facto compliance with our license and therefore legitimate. Unfortunately, the absence of attribution or of one of those licenses does not mean that the site didn't copy content from Wikipedia; it is simply harder to tell whether Wikipedia or the other website came first. A potential way to check is to compare the dates on which the content was added to Wikipedia with the content present on the website at a given time. Though computationally intensive, that is one approach.

The most comprehensive list we have of known mirrors is at Wikipedia:Mirrors_and_forks/All (approximately 1000). There is also a list of mirrors here (approximately 30). A record of license compliance is maintained here and here. That list should overlap with the mirrors list, but there may be discrepancies between them. There is a Wikipedia category for 'websites that use Wikipedia' here (7 sites). There is a category of 'Wikipedia-derived encyclopedias' here (7 encyclopedias). There is a meta list of mirrors here (approximately 230). There is a list of 'live mirrors' here (approximately 170). In addition to mirrors there are also 'republishers'. These sites package and often sell Wikipedia articles as collections. A small list of known republishers is available here (6 republishers).

Also note that Google maintains a cache of slightly outdated Wikipedia articles: details here.

To get a sense of how well-maintained our main mirror list is, consider that the mirrors beginning with D-E-F were updated 20 times between July 2010 and July 2012. A rough extrapolation across all nine alphabetical sections is that we've updated the complete mirror list about 20*9=180 times in the last two years. That number is likely lower than the actual number of new mirrors in that period, and almost certainly much lower than the number of isolated instances of copying/reprinting/excerpting individual articles.

Statistics for the number of new articles each day are here. A quick review of that table shows approximately 1000 new articles daily. There are approximately 4 million existing articles.

Automation of mirror detection could be pursued to reduce, though not entirely solve, the problem. One option is to look for the license terms: Creative Commons, CC-BY-SA, GNU, GNU FDL, or GFDL. Another option is to look for attribution phrases such as: from Wikipedia, by Wikipedia, via Wikipedia, etc. We can examine known mirrors manually to see what other clues to content reuse exist. One possibility for semi-automating mirror detection is to add a feature to Turnitin reports so that a Wikipedia editor could 'flag' a matched-text source site as a mirror. Those sites could then be added to a list for review to determine whether they are mirrors. This would require an investment in the interface and infrastructure of Turnitin's reports.
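
As a sketch of the term-matching idea above (a starting point only, not a finished detector), a page fetched from a matched URL could be scanned for those license and attribution phrases; a hit would queue the site for human review rather than automatically whitelisting it:

```java
import java.util.List;
import java.util.Locale;

// Rough heuristic sketch: flag a fetched page as a possible mirror or
// compatibly licensed reuse if it carries the license or attribution phrases
// discussed above. A hit should queue the site for human review, not
// automatically whitelist it.
public class MirrorHeuristic {

    private static final List<String> LICENSE_TERMS = List.of(
            "creative commons", "cc-by-sa", "gnu fdl", "gfdl");
    private static final List<String> ATTRIBUTION_TERMS = List.of(
            "from wikipedia", "by wikipedia", "via wikipedia");

    public static boolean looksLikeMirror(String pageHtml) {
        String text = pageHtml.toLowerCase(Locale.ROOT);
        return LICENSE_TERMS.stream().anyMatch(text::contains)
                || ATTRIBUTION_TERMS.stream().anyMatch(text::contains);
    }
}
```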

Questions from Turnitin's CTO


The answers to these questions differ considerably depending on whether we are analyzing new Wikipedia articles or existing (old) Wikipedia articles. When looking at brand new articles, mirrors are irrelevant: assuming we run a report soon enough after the article is posted, there is simply no time for mirror sites to copy the content. Thus, with a new-article analysis, any text match from an external website is likely to be a copyright violation (Wikipedia inappropriately/illegally copying from it). The one exception is content that was legally licensed for reuse, often indicated by a Creative Commons or GNU FDL (GFDL) license tag. We can possibly identify those and screen them out as well. The issue of mirrors only arises when enough time has passed between the addition of content on Wikipedia and the copyright analysis for pages/sites to have copied/mirrored our content in between. That is a much more difficult problem to solve, and certainly not the low-hanging fruit.

A mirror, for our copyright compliance purposes, is any page, collection of pages, subdomain, or entire site which copies content from Wikipedia. (Obviously we get the most leverage from identifying known entire sites that copy from Wikipedia en masse). Copying from Wikipedia is fully permitted by our license, provided attribution is given to Wikipedia and downstream reusers honor the same terms. Thus, if a site has copied verbatim from us and followed the terms, it is de facto not a copyright violation. If we copy verbatim from them, however, it almost definitely is (with the exception of direct quotations, which are permitted within reason).

If an external page contains some content that was copied from Wikipedia (CFW) and some content that was not copied from Wikipedia (NCFW), it's still possible that a copyright violation occurred, if the NCFW content turns up on Wikipedia. My suspicion is that there is frankly no way to know which parts were copied and which were not without intensive manual inspection, so I think we have little choice but to ignore these mixed instances. That said, this is an extreme edge case. The likelihood that a site copying from us is also a site that we copied from seems very slim. Furthermore, just as a point of comparison, no existing tool that we have access to would be able to parse the difference either.

There are thousands if not millions of cases where an author or page has plagiarized Wikipedia. People copy from us all of the time, sometimes with proper attribution ("From Wikipedia", "Wikipedia says:", etc.) and sometimes in such a way that it would get them brought into the Dean's Office or fired. But again, this needs to be seen in the context of Wikipedia's CC-BY-SA license (Creative Commons, with attribution, share-alike). It's fine for others to copy verbatim from Wikipedia, and even when they do so without attribution it's more a problem for them than for us. Meanwhile, it's not fine for Wikipedia to copy verbatim from others unless we give attribution, and even then we must not copy so much that it exceeds what would be appropriate under fair use (we can't "quote" an entire article, for example, only a minimal excerpt).

To determine if content within a Wikipedia page is problematic and worth investigating, we use a number of approaches. The first is automated detection. For example, CorenSearchBot pulls from the feed of new articles and feeds them into the Yahoo Boss API, searching for the title of the Wikipedia article. It then pulls the top 3 results, converting those results to plain text. It then compares that text to the Wikipedia article and computes the Wagner-Fischer algorithm score (i.e., edit distance). If the score is high enough it indicates a likely copyright violation, and the Wikipedia article is flagged while our Suspected Copyright Violations board is notified. From there editors manually inspect the highest matching site. Other approaches utilize the 'smell test'. If an article suddenly appears with perfect grammar, densely researched content, no Wikipedia formatting (markup language), especially from a new editor or an editor known to have issues with plagiarism, then editors will explore further. Sometimes a comparison of the posting date on a website with the version date of the Wikipedia article is a dead giveaway of which one came first. Other times searching around the site reveals that a majority of the content is copied from Wikipedia, allowing a manual determination that it is a mirror. That determination is more easily made if the matching site is authoritative, such as a known newspaper, book, or blog. It's not impossible that a book copied from Wikipedia, but it appears to be more common that the opposite happens.

One of the strengths of Andrew and Sean (Madman)'s involvement is that they are going to be collaborating on data-mining a corpus of known positive and negative findings from our copyright archives. That should number in the several thousands of instances. Among those instances are numerous identified mirrors. In addition to the mirrors we find in the corpus, we have a list of approximately 3000 identified/suspected mirrors already assembled in a spreadsheet. That will give us a good head start on analyzing existing (old) Wikipedia articles for copyright violations. Andrew also intends to use data-mining techniques to determine whether mirror detection can be automated. Thus, it would be desirable if we were able to append to the Turnitin mirror list using API functionality.

In the end, it's only necessary that we reduce the number of false positives to a level that editors might be able to manually evaluate (1-3 might be tolerable but 20-30 would render reports meaningless). Andrew and Sean's work will also allow us to develop a profile of the typical editor who violates copyright, by determining a variety of metadata such as whether or not they are registered, how many edits they have made, how many blocks they have on their record, and about 40 others. That profile could be used in conjunction with Turnitin scores to develop a composite metric of your best evaluation combined with ours.

Questions from Turnitin's Product Manager


I am trying to gain an understanding of how Wikipedia would like to use the iThenticate API. I have been relayed information about the desire to remove mirrored sites from the reporting results and that there are likely thousands of mirrors in existence. Since this is the case, I imagine the folks at Wikipedia would like a quick way to identify the mirrors and remove them from the iThenticate report results. Currently I need to get a better idea of how the API would be used and the best way to identify mirrors in order to bulk upload them to our URL filter...

The GUI interface for the "filter" already exists (as you know) and the add/remove/list functionality it provides is sufficient. We just want API hooks into those actions so we can interact with the list programmatically (and not have to resort to screen-scraping). To be pedantic, here is what I imagined the API calls might look like (sweeping generalizations here; I have not interacted with your API much, though I intend to code a Java library for it this weekend):

EDIT: After delving into the API documentation over the weekend, my PHP-esque examples below are obviously not the XML-RPC that iThenticate uses. Nonetheless, the general input/output/method sentiment remains the same, so I am not going to modify at length below. Thanks, West.andrew.g (talk) 20:46, 20 September 2012 (UTC)[reply]

OUTPUT FILTER LIST


http://www.ithenticate/....?action=filterlist

Response
FID URL
1 http://www.someurl1.com
2 http://www.someurl2.com
3 http://www.someurl3.com
4 http://www.someurl4.com

DELETE FROM LIST


http://www.ithenticate/....?action=filterdelete&fid=1

Response
"deleted FID 1"

ADD TO LIST


http://www.ithenticate/....?action=filteradd&url=http://www.someurl.com

Response
"added as FID [x]"

These IDs could just be the auto-increment on some table? They are not even strictly necessary, but they might make management a bit easier than having to deal with string-matching subtleties (i.e., in the "delete" case). Obviously, we'd also need to encode the "url" parameter in order to pass the special characters via HTTP.

Client-side


All the complicated processing regarding mirrors will be done in our client-side application. You don't have to concern yourselves with this, but regardless, I'll describe it in brief:

  1. We provide Turnitin a document
  2. Per our telecon, the API will be able to provide us the URL matches and their match percentages: e.g., "99% www.someurl1.com; 97% www.someurl2.com".
  3. For matches above a certain threshold, we will then go fetch those URLs and put them through a machine-learning-derived classifier to determine whether: (a) the site is a Wikipedia mirror, or (b) the site is freely licensed. If either of these is true, we will immediately append that URL/domain to the filter list (a rough sketch of this loop follows the list).
  4. Once this is done for the needed URLs in the match list, we will re-submit the content and have the report re-run under the new filter settings. This output will be the one published to users and upon which any actions are based.
  • Mirrors and free content could also be suggested by humans (or via a manually compiled list we already have). It's no issue for us to write a script that batches these in with individual API calls. Such batches will be infrequent; I see no reason to have special functionality for this. More often, it will just be single URL/domain additions.
  • I'd expect "delete" and "list" actions to be relatively rare. I expect we'll also maintain a copy of the list in our databases and we'd only need the "list" action as an occasional sanity-check.

This summarizes my opinion on what is needed. West.andrew.g (talk) 19:45, 14 September 2012 (UTC)[reply]

Notes from Aug. 14, 2012 Turnitin Meeting


Things noted by West.andrew.g (talk):

  • http://www.plagiarism.org/ as a not-super-technical resource
  • Will Lowe is the guy to talk to in ops
  • Report generation takes "45 seconds to a few minutes"
  • Up to 500 matching URLs will be output
  • Throughput of 400-500k daily reports is not out of the question
  • Utilization loads are not just circadian but heavily seasonal
  • One thought: "match scores" are computed at the document level. Imagine an edit commits a blatant copyvio but does so only in one section, say, 33% of the document. This might not raise eyebrows as it should. I noticed in the demo that the "match reports" can be exported in XML format. We might be able to do some feature extraction over that, e.g., "does this 33% all occur in one giant block, or to what extent is it distributed across the entire document?" (see the sketch below)
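
One possible version of that feature, assuming the matched spans have already been parsed out of the exported XML into (start, end) character offsets (that parsing step is not shown and the span format is an assumption):

```java
import java.util.Comparator;
import java.util.List;

// Sketch of the "one giant block?" feature: given matched character spans
// recovered from an exported match report, measure how much of the matched
// text sits in a single contiguous block.
public class MatchConcentration {

    public record Span(int start, int end) {
        int length() { return end - start; }
    }

    /** Fraction of all matched characters falling in the largest run of spans
     *  whose gaps are smaller than gapTolerance characters. */
    public static double largestBlockFraction(List<Span> spans, int gapTolerance) {
        if (spans.isEmpty()) return 0.0;
        List<Span> sorted = spans.stream()
                .sorted(Comparator.comparingInt(Span::start))
                .toList();
        long total = sorted.stream().mapToLong(Span::length).sum();
        long best = 0;
        long run = sorted.get(0).length();
        for (int i = 1; i < sorted.size(); i++) {
            if (sorted.get(i).start() - sorted.get(i - 1).end() <= gapTolerance) {
                run += sorted.get(i).length();   // continues the same block
            } else {
                best = Math.max(best, run);
                run = sorted.get(i).length();    // starts a new block
            }
        }
        return (double) Math.max(best, run) / total;
    }
}
```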

Noted by Ocaasi

  • Turnitin does not use keyword matching but rather 'digital fingerprinting'.
  • Turnitin can detect close paraphrasing (!) by analyzing text for mere word substitutions or added sentences
  • Turnitin can exclude quotations and bibliography sections
  • Turnitin views their system not as a copyright/plagiarism detection tool but as an 'editorial supplement'
  • Technical questions should be directed to api-info [at] ithenticate.com
  • We have the option of setting the default landing view (e.g., the Content Tracking report, minimum 400 words or 40% match)
  • Turnitin can write a script to batch-update the mirror list
    • AGW proposes we ask them for API functionality for mirror list maintenance, given it could easily be in the 10,000+ entry range. Worst-case scenario, we use a programmatic browser to handle the process on our end.
  • We might be able to do pre/post processing to automate mirror detection
  • Turnitin is providing a trial account for multiple editors to use (it will be single-user login)
    • How we give the community report access is a little less clear. The reports aren't "public" in the sense that anyone with the URL can access them. Authentication tokens are needed. However, there was the suggestion these tokens could be obtained simultaneously with wiki login (i.e., single sign-on). Do we give this to all registered editors? Make an explicit user right? We could print metadata about what Turnitin found "i.e., X% match from URL Y" on a talk page, but actual report access is a little more work.
  • Turnitin is going to provide information on their API; there's also an API guide on the website
  • Turnitin typically has excess server time during late evenings and non-pre-finals periods
  • We're planning a conference call with their Chief Technology Officer in about two weeks, and AGWest, Madman, and Coren are welcome to participate.

Turnitin Trial design

  • A double-blind study comparing Turnitin and CSB on new articles only
  • Alter/enhance the reported pages at SCV to include:
    • SCV positives with matching URL
    • Turnitin positives with [highest matching] URL
    • SCV negatives (below threshold) with lowest matching URL
    • Turnitin negatives (below threshold) with lowest matching URL
    • Turnitin positives that are also SCV positives
    • Turnitin negatives that are also SCV negatives
    • Turnitin positives that are also SCV negatives
    • Turnitin negatives that are also SCV positives
    • None of the above
  • Approach would increase the number of reports to SCV by 2-10 times the normal amount
  • Would need approval from Coren and/or BAG
    • If we just did this in the bot's user space instead of SCV, this would not need approval from BAG (and it doesn't need approval from Coren in any case). In fact, we could start running a limited trial immediately to determine viability (though this would probably be better done using the SCV corpus). — madman 23:38, 30 August 2012 (UTC)[reply]
  • Would require having set up integration with Turnitin's API

Example report of 25 or so articles with CSB results and iThenticate results: [1]. Let me know what you think. — madman 16:16, 4 September 2012 (UTC)[reply]

iThenticate trial observations

  • make sure you set it to [exclude quotes] and [exclude bibliography]
  • use the [Content Tracking] mode (top left) not the [Similarity Report] mode
  • are mirror sites shared between users?
  • We exchanged a few words about recrawl latency. If you look at a "Content Tracking" report, it seems to show the last date that resource was crawled (or maybe the last time the resource changed?)
  • Note that when block quotes are used there are no quote marks, and iThenticate will include those sections in the match reports. See Benjamin Franklin. Clearly this is something we should accommodate in our content pre-processing (a rough preprocessing sketch follows this list).
    • This is an excellent point and something I worried about.
  • Ocaasi: How were you getting content from Wikipedia to iThenticate? Just copying and pasting into a plain text document?
    • On the right hand side there is an option to upload via copy and paste. Select the Wikipedia text and you just paste it into the text field. No transfer to an external document is required.
  • Madman: Important: The iThenticate API includes parts information, percentage matches, but does not include the URLs from the similarity report! This seems kind of like a show stopper to me. How can we alert human volunteers to the fact that a new article may be a copyright violation when we can't tell them what the source of the infringing content might be? Per some of the other notes on this page, I glean that we can't give public access to the similarity reports. — madman 06:40, 1 September 2012 (UTC)[reply]
  • Madman: Also, I hope on Tuesday to ask for higher request limits; we simply don't have enough at the moment to build any sort of useful statistical universe, though I'll try. (I wasted a couple when I was working out base 64 encoding in XML-RPC and uploaded some mangled documents.) — madman 06:40, 1 September 2012 (UTC)[reply]
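
Regarding the block-quote observation above, here is a minimal sketch of the kind of pre-processing we might apply to wikitext before submission; real articles use more quotation templates (and nest braces) than these two patterns cover:

```java
import java.util.regex.Pattern;

// Minimal sketch of the pre-processing suggested above: strip block quotations
// from wikitext before submitting it for a report, so that unmarked quotes are
// not counted as matching text.
public class QuotePreprocessor {

    private static final Pattern HTML_BLOCKQUOTE =
            Pattern.compile("(?is)<blockquote[^>]*>.*?</blockquote>");
    private static final Pattern QUOTE_TEMPLATE =
            Pattern.compile("(?is)\\{\\{\\s*(?:quote|blockquote|cquote)\\s*\\|[^{}]*\\}\\}");

    public static String stripBlockQuotes(String wikitext) {
        String withoutHtml = HTML_BLOCKQUOTE.matcher(wikitext).replaceAll(" ");
        return QUOTE_TEMPLATE.matcher(withoutHtml).replaceAll(" ");
    }

    public static void main(String[] args) {
        String text = "Intro prose. {{quote|A long passage quoted from a book.}} More prose.";
        System.out.println(stripBlockQuotes(text));   // the quoted passage is gone
    }
}
```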

In project


SCV Corpus

  • Fix a bug that prevented me from pulling revisions of articles with UTF-8 characters in their title (urgent)
  • See if there's a way to view revdeleted content through the API and if not, file a bug
    • I have confirmed there's no way and have offered to help with a patch, but even if merged it's unlikely to deploy anytime soon so we're just going to have to deal with this limitation for now. — madman 18:53, 29 August 2012 (UTC)[reply]
  • Import the corpus into a bare-bones database and Web interface
  • Run examples from the corpus against the Turnitin API   Done
  • Run examples from the corpus against the iThenticate API

API integration

  • I have coded a Java API with minimal functionality (login, submit-doc, check status, get report URL); a rough sketch of such a client follows this thread. West.andrew.g (talk)
    • An API or an API client? If the latter, I wrote a client about a month ago, which isn't really a problem, but we should perhaps determine what programming language we both wish to use so we're not stepping on each other's toes. My preferred language for bot programming is PHP as I have a very versatile bot framework in that language, but I also exclusively use Java in my day job so I wouldn't oppose settling on that. — madman 21:01, 23 September 2012 (UTC)[reply]
      • Client. All of my machine-learning and wiki integration stuff is in Java -- so I preference that. However, I also understand why PHP is a good choice (and CSB already exists in that language), although it isn't one I write too fluently. Most of my analysis is offline at this point. This question will become more pressing when it needs to come online, though it's not unimaginable that some PHP<->Java communication could just occur via a DB or IRC. West.andrew.g (talk) 21:23, 23 September 2012 (UTC)[reply]
        • Note: CSB is written in Perl. (Believe you me, the bot I'm writing to replace it is not. I have to use Perl occasionally at work and want to tear my hair out whenever I do.) — madman 21:54, 23 September 2012 (UTC)[reply]
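
For reference, a bare-bones sketch of what such a client's core calls could look like over XML-RPC (using the Apache XML-RPC library). The endpoint URL and remote method names below are placeholders, not iThenticate's documented ones; the real names should be taken from their API guide:

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

// Bare-bones sketch of a Java client for an XML-RPC service such as
// iThenticate's. The endpoint and remote method names ("login",
// "document.add", "document.get") are placeholders, not the documented API.
public class IThenticateClientSketch {

    private final XmlRpcClient client = new XmlRpcClient();
    private String sessionId;

    public IThenticateClientSketch(String endpoint) throws Exception {
        XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
        config.setServerURL(new URL(endpoint));
        client.setConfig(config);
    }

    /** Authenticate and cache the session token returned by the service. */
    public void login(String username, String password) throws Exception {
        Map<String, Object> args = new HashMap<>();
        args.put("username", username);
        args.put("password", password);
        Map<?, ?> response = (Map<?, ?>) client.execute("login", new Object[] { args });
        sessionId = (String) response.get("sid");
    }

    /** Submit a document (base64-encoded upstream) and return the raw response. */
    public Object submitDocument(Map<String, Object> documentArgs) throws Exception {
        documentArgs.put("sid", sessionId);
        return client.execute("document.add", new Object[] { documentArgs });
    }

    /** Poll the processing status / report for a previously submitted document. */
    public Object getDocument(Object documentId) throws Exception {
        Map<String, Object> args = new HashMap<>();
        args.put("sid", sessionId);
        args.put("id", documentId);
        return client.execute("document.get", new Object[] { args });
    }
}
```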

Mirror detection

  • Ocaasi provided a manually assembled list of ~2600 mirror domains. I piped these into a SQL table and cleaned up some manual formatting errors. Some ~600 of the entries did not have protocol or TLD components; manual inspection suggested these were section titles and not URLs/domains (but the associated domain was often in an adjacent row). These were not piped in. West.andrew.g (talk) 21:08, 20 September 2012 (UTC)[reply]
  • I have code to fetch HTML/source code for the URLs/domains collected and write to DB. Just waiting on those negative mirror examples. Should be interesting to see how many of these bounce back with HTTP 404 or other nastiness. West.andrew.g (talk) 21:08, 20 September 2012 (UTC)[reply]
  • Given that we only have ~100 examples of mirrored content, I am looking into crawling the Internet in search of more potential ones. It would seem something like the phrase "from Wikipedia" would be a good indicator. However, the goal here is not to complete the mirror list, but to yield more positive examples for training purposes. I am still thinking through the types of bias this will introduce and how to circumvent those. West.andrew.g (talk) 22:13, 11 October 2012 (UTC)[reply]


Low-level observations

  • My above parse of the "examples" from Wikipedia:Mirrors_and_forks/all yielded 641 positive examples (well-formed external URLs).
  • Madman sent me a copy of the corpus. Pulling out the human-reviewed cases labelled in some form as "copyvio" with well-formed external URLs, we began with 3604 negative examples.
    • If you find external URLs that aren't well-formed, give me some examples and I'll try to track them down. I have the feeling that a lot of my contribution to this project is going to be the data collection and the bot implementation, not the analysis, as statistics isn't really my forte. But anything you ask in those realms shall be done.  madman 16:03, 27 September 2012 (UTC)[reply]
  • Ran a script to go out and fetch the HTML source of these examples. Only interested in cases where HTTP 200 was returned
    • Positive examples: 312 (49% of original)
    • Negative examples: 2829 (78% of original)
  • It's been a significant period since many of these reports/observations were made, so it's no surprise that many have since disappeared. This also prompted me to manually inspect a minority of those pages that did return HTTP 200 and source content (positive examples, in particular). Results were not inspiring. I'd estimate that half of the URLs that were once "mirror" sites no longer are. Why? (1) Domain campers/parking, where the site is not an HTTP 404 case but is now just SEO crud. (2) The site exists in some form but no longer appears to be a mirror. Were these image-only cases? Did a Wikipedia user send a warning that was heeded? Is it a new site altogether? Has the site just grown away from its content-scraping format? Who knows.
  • Regardless of the reasoning, all these corpus instances will have to be manually inspected (ugh). There is entirely too much noise for statistical feature derivation to be of any use at this point. Even with this, we could be at as few as 150 positive examples by the end of inspection (need to determine what other sources Jake used in his list, and whether or not they also had "example" URL fields). West.andrew.g (talk) 22:50, 24 September 2012 (UTC)[reply]
    • I don't recall there being any other lists with examples. In any case, all of the sites I pulled from are linked here. Ocaasi t | c 00:11, 25 September 2012 (UTC)[reply]
    • I'm not sure how realistic manually checking all instances will be, though I'll be happy to try to code up something in the Web interface to make doing so as efficient as possible. Your results explain why when I ran CSB against the entire corpus I got no match on some revisions that had clearly been copyright violations. What I'm analyzing right now is instances where CSB says no match but iThenticate does have a match, or vice versa, and what's characteristic of those instances. — madman 16:03, 27 September 2012 (UTC)[reply]
      • Well, it's time-consuming, that's for sure. I already have a little Java program to help me, essentially opening a URL in Firefox, waiting until that process is killed, popping a feedback dialogue, and then proceeding to the next URL in the DB table (a sketch of this helper follows below). I've gotten through most of the positive examples already (~300). We'll see how far along I am when my patience runs dry. West.andrew.g (talk) 19:09, 27 September 2012 (UTC)[reply]
        • Inspected all 300 positive examples. Only 111 of these came back as live sites still containing mirrored data. Remember, this was operating at the "page/article" level. Domain-wise analysis might yield better results; but the point remains that these mirror lists are way out of date. West.andrew.g (talk)
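
A sketch of the review helper described above; reading candidate URLs from, and writing verdicts back to, the database is omitted:

```java
import java.util.List;
import javax.swing.JOptionPane;

// Sketch of the manual-review helper described above: open each candidate URL
// in a browser, wait for the browser process to exit, then record a yes/no
// verdict. Note that waitFor() only blocks if no other Firefox instance is
// already running; the database read/write steps are left out.
public class MirrorReviewHelper {

    public static void review(List<String> urls) throws Exception {
        for (String url : urls) {
            // Launch the browser and block until the user closes it.
            new ProcessBuilder("firefox", url).inheritIO().start().waitFor();

            int answer = JOptionPane.showConfirmDialog(
                    null, "Is this page still a live mirror?\n" + url,
                    "Mirror review", JOptionPane.YES_NO_OPTION);
            boolean stillMirror = (answer == JOptionPane.YES_OPTION);
            System.out.println(url + "\t" + stillMirror);   // stand-in for the DB write
        }
    }
}
```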

Plans for June 15, 2013 progress meeting

Ocaasi
  • Distribute ithenticate federated accounts to active CCI and SCV members
  • Review Turnitin project pages, clean up and streamline for trial
  • Ping Chris and Mark Ricksen at Turnitin with an update
  • Check in with Zad about beta testing
  • Talk with Education project folks at WMF about a trial rollout/pilot
  • Send out Gcal invites for Google Hangout/Skype around June 15
Andrew
Madman
  • Advise of progress on unit tests
  • Have HTML normalization unit tests done, which should be all that is necessary for a BRFA except for some outlying functionality
  • Have BRFA ready to submit if not submitted already
Zach