Talk:Internet Archive/Archive 2

This is an archive of past discussions about Internet Archive. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

Delayed updates

User:124.184.107.76 recently added some information about a "temporary" halt to updates to the Wayback Machine. I cleaned this up and added a link back to the original post in the discussion forums. Apparently, no archives later than April 1, 2005, are currently available. The quoted post indicates that they are making changes and will resume updates soon, but that post was made in mid-August. I have tagged this section as a current event, as I imagine there will be more information forthcoming (either that updates have resumed, or have been discontinued, or whatever the final outcome may be). -- GeoGreg 18:10, 12 September 2006 (UTC)

Judging from this short post and this thread, they apparently have moved their servers and are now working on several updates. I tried www.bbc.com and www.cnn.com and there are 2006 copies of them alright. - Face 09:43, 23 January 2007 (UTC)

European mirror

Does the European mirror mentioned in the article still exist? I used to get redirected to it at least 50% of the time when accessing the collections from a UK IP address, but not any more. http://www.eu.archive.org/ says "Petabox cluster for www.US.Archive.org". I couldn't find any official announcements regarding either temporary failure or permanent withdrawal. 81.77.72.180 07:50, 6 December 2006 (UTC)

What software?

Anyone have any additional information on what software they use to crawl the web? Home grown, commercial?

It's right there in the article: The Wayback machine "is maintained with content from Alexa Internet." Perhaps that could be expanded on somewhat, but I'm not sure how many of the details are public. Xtifr _tälk 23:10, 23 February 2007 (UTC)

Editorial rant by User:82.234.109.34 removed from article

I reverted the addition shown by this diff which was added by User:82.234.109.34. This is an editorial rant and, though parts of it are true, it doesn't read as an encyclopedic addition. Clean it up and add it as proper content if you wish. --User:Ceyockey (talk to me) 22:47, 7 April 2007 (UTC)

Article about scanning books

Hey there. I work for the archive in Toronto and our scanning centre was recently in the Toronto Star. I don't feel confident editing the website myself, so I'll post the link here: http://www.thestar.com/News/article/203437

Thanks! —The preceding unsigned comment was added by 128.100.243.144 (talk) 01:23, 5 May 2007 (UTC).

Useful addition?

I am an employee at the United States Military Academy, an academic institution supported by tax dollars which trains college aged students to become officers in the US Army and also grants bachelor degrees. The website www.archive.org has been blocked by the filtering software for at least 2 months. There has been little progress in getting it unblocked through official channels. Is it possible to include this information in the article here? Blocking archive.org seems to be an egregious occurrence of censorship at an academic institution. 1) Is it appropriate for a (this) wiki article? 2) What would be needed to corroborate this info to make it wiki verifiable? (newb apologizing in advance for any breach of etiquette) 24.44.82.176 03:11, 26 April 2007 (UTC)

If the filtering is written about in a reliable source, we can include it. See WP:RS and WP:V for more. If it has not been published in a reliable source and you'd like to see it included, see if a local journalist will cover the matter, then cite that article here. Jokestress 03:19, 26 April 2007 (UTC)

And actually, it may not be that big a deal. If you want to stop people from visiting http://www.example.org, it rather defeats the purpose if they can just go to http://www.archive.org/web/*/http://www.example.org and effectively bypass the block. (I wonder if Google cache is blocked as well?) The media collections could be an issue too, since, like Wikipedia/Wikimedia, they're not censored; as long as material is public domain or CC-licensed, they'll accept it. But as Jokestress says, the important thing here is reliable sources. Xtifr _tälk 06:30, 28 April 2007 (UTC)

Thanks for the feedback. I'm not sure about what Xtifr means about "it may not be a big deal"; perhaps I am over-reacting. In any event, google cache seems to be censored too. It's hard to know exactly what I'm being "shielded from" since I can't even see the cached pages while trying to use the internet for legit educational purposes. (I feel like I've gone to a library to look at a book but am told I can't see it unless I explain exactly what's in it before the library will show it to me. How can I know, from just the title [URL] what I need from the book [website] without seeing inside it?) I'm not sure what I will/can do about my experience with censorship at USMA. If I do talk to a journalist and it gets covered, I'll report back. Thanks for giving me advice.24.44.82.176 23:25, 30 April 2007 (UTC)

Sorry, I guess that didn't sound quite right. The censorship itself may well be a big deal. But the detail of including the IA in that censorship is not (or should not be) a big deal, and thus, not particularly relevant to this article. It's more like: they (the school) want to ensure that students don't have access to certain books, so they not only monitor student purchases, but also forbid students to visit libraries that carry those books. The Archive/library has nothing to do with your school's decision to restrict your access to this library. The library is perfectly willing to show you the book. Your school just won't let you in the door. This is, basically, an issue about the school, not about the IA (except very tangentially, as one example of what gets censored, and Google cache is probably a more notable example). Xtifr _tälk 00:52, 1 May 2007 (UTC)

I'm not an American and this is OT but in fact your tax dollars only go to people who filter sometimes. See Children's Internet Protection Act for example. You should write to your local Senator and/or Congressional Representative if you're not happy with this I guess, or get involved in organisations like the ACLU. Nil Einne 16:48, 20 May 2007 (UTC)

IA in Suits other than Patents

I note the main article cites 102B patent law timestamps. Is it acceptable here to open another subsection parallel to patents to reference other litigation (e.g., business disputes, product fraud, dissolution, etc.) where IA is in a court document? Can a court document be entered as a reference on the main article, or does it have to be picked up by a journal or media and set in context before it can be referenced on WP? I am new to WP. Nukeh (talk) 16:50, 22 February 2008 (UTC)

Reliability of Archive.org links in Wikipedia articles?

It occurs to me that referencing archive.org for references might create problems in relation to information that was later retracted. Or am I overthinking this? - Richfife (talk) 22:57, 14 March 2008 (UTC)

I think in that case you would end up with a dead link, just as the original url that had been supplanted by the archived one went dead at one time. In other words, no more problem than other dead links that need treatment. --User:Ceyockey (talk to me) 00:06, 15 March 2008 (UTC)

Actually, no. The Archive link would remain active after the original link that it was a duplicate of went dead. Some people (myself included) will sometimes replace dead links with links directly to archive copies of the link to bring them partially back to life. - Richfife (talk) 00:39, 15 March 2008 (UTC)

Only a subset of links are replaceable by Archive links ... in the case of non-replaceable ones (the majority, actually), the death of an Archive link leads to the common irreplaceable link situation. --User:Ceyockey (talk to me) 13:50, 15 March 2008 (UTC)

Use for primary sources

Right now Wikipedia primarily uses Project Gutenberg as the main link for primary sources. But archive.org now has many classic works that PG does not. So we might want to double check for some works. --Gary123 (talk) 23:20, 29 March 2008 (UTC)

This is for discussion on improvements to the article. Changes in Wikipedia policy or recommendation should be suggested in the Wikipedia_talk: namespace.

Unfounded allegations in 'controversies' section

The section 'Controversies' -> 'Missing news from 2001' contains unfounded allegations. On their substance, they are based on a misinterpretation of IA's general consistency of crawling and content coverage. With regard to Wikipedia policies, the allegations appear to be original research with no alternate sources (besides the Flickr screenshots presumably created by the person making the allegation).

Additionally, Brewster Kahle is not quoted accurately, and indeed the quote has been subtly twisted to change its meaning in a crude/sensational direction. (Just check the article quote against the referenced source.) I think this is a serious enough misrepresentation to warrant a reprimand of the anonymous editor.

I will refrain from editing in this matter as I am an Internet Archive employee, but wanted to open a talk record to encourage some corrective action by others. Gojomo (talk) 03:40, 1 February 2008 (UTC)

I agree with Gojomo. I've tried removing the ridiculous material as have others, but two anonymous users are determined to have their say. Maybe the article should be locked from edit by anonymous users for a while. Fmccown (talk) 20:09, 1 February 2008 (UTC)

That there are some days some pages were not archived is true but trivial. It is the implication that they are "missing" and that this is controversial that is problematic. The initially twisted misquote was problematic and indicative of a sloppy and improper agenda. The current allegation that "a gap of this size, of these 15 sites, occurring simultaneously, is special because it happened only once" is also deceptive, as you can find many similar gaps in sites and groups of related sites in the Archive -- the collection process has neither aimed for nor achieved the level of continuous coverage that is being assumed. This section is engaged in paranoid extrapolation from a misunderstanding of what the Archive does manage to provide, plus original research to boot. It should be published on your own website, not as part of Wikipedia. Gojomo (talk) 01:33, 3 February 2008 (UTC)

This guy has been emailing me as well. It is an unsubstantiated crackpot conspiracy theory that doesn't pass the basic sniff test. Even the links he sends to show he is right, thoroughly refute his claims. (As said above, the gaps, while regrettable, are normal for that era of the archive.)--Jimbo Wales (talk) 17:39, 6 February 2008 (UTC)

The links cited below show a gap of this size, of these 15 sites, occurring simultaneously, happened only once.

Jimmy Wales claimed the links "thoroughly refute" that fact, and even claimed the gaps "are normal for that era."

But the links cited below show a gap of this size, of these 15 sites, occurring simultaneously, happened only once.

Brewster Kahle (Internet Archive founder), Gordon Mohr (Internet Archive employee), and Jimmy Wales

admit the gaps exist, their opinion is: this simultaneous failure to cache 15 major news sites is normal.

Opinions can not erase a verified fact which has a reliable source citation.

Each statement written about the Internet Archive's Missing News of 2001

cites the Internet Archive as the reliable source, without giving opinions.

Opinions can be erased from Wikipedia.

A verified fact can NOT be erased from Wikipedia.

Verified Fact: the Internet Archive is missing news in 2001.

—The preceding unsigned comment was added by User:61.204.240.194 06:45, 7 February 2008 (UTC) in this diff. -84user (talk) 10:44, 20 May 2008 (UTC)

Regarding: "a gap of this size, of these 15 sites, occurring simultaneously, happened only once" -- This is false; there are other longer simultaneous gaps even in 2001, never mind other years. (See my comment at [1].) Gojomo (talk) 02:17, 8 February 2008 (UTC)

Gordon does not understand the word simultaneous. Simultaneous means "at the same time." A gap of this size, of these 15 sites, occurring simultaneously, happened only once.

Gordon posted a little summary about one site (CNN), but still Gordon is unable to post a gap of this size, of these 15 sites, occurring simultaneously. Why can't Gordon find a gap happening across all of these 15 sites at the same time? Because it only happened once. Anybody can verify this fact by clicking the citation links (the Internet Archive search results are easily verifiable.)

Also, the information Gordon posted about CNN is false (See his comment within [2].) Gordon claimed the Internet Archive DOES have archives of "deeper pages" published by CNN from August 24th, 2001, to September 10th, 2001, and Gordon even posted a link to these "deeper pages" as proof, but that first link Gordon posted showed no archives of pages published by CNN from August 24th, 2001, to September 10th, 2001.

Did the Internet Archive employee Gordon Mohr unfortunately post that link to unrelated 1996 archives by accident, or on purpose?

Did the Internet Archive webcrawler Alexa unfortunately miss these 15 sites all during the same period by accident, or on purpose?

Speculation about whether they dropped the ball on accident, or on purpose, is irrelevant.

The incontestable fact is: the Internet Archive dropped the ball by missing news from these 15 sites all at the same time, and they can not erase this fact.

The first positive outcome from this fact being posted is: from now on the Internet Archive will try not to miss archiving news from major sites.

The second positive outcome from this fact being posted is: from now on the Internet Archive will prioritize webcrawls of major sites, because failing to archive major news sites (while on the same exact days successfully archiving insignificant sites) is absurd. The Library of Alexandria does not make such errors of priority when choosing which sources to archive for posterity.

Another positive outcome from this fact being posted is: from now on people and organizations will begin archiving news from major sites themselves, as back-up, since the world now realizes the Internet Archive regrettably missed news from 15 major sites during the weeks preceding September 11th, 2001.

—The preceding unsigned comment was added by User:61.204.240.194 06:56, 8 February 2008 (UTC) in this diff. -84user (talk) 10:44, 20 May 2008 (UTC)

You are straining to ignore or even reverse the plain meaning of what I've explained. Interested people should read for themselves at the referenced thread. Gojomo (talk) 08:19, 8 February 2008 (UTC)

Suzanne Shell

Any info on why the Suzanne Shell case even arose? I presume the Internet Archive had an existing way for site owners to have their content removed (& of course she could have just used the robots.txt). Any info on why she didn't take advantage of the service and/or wasn't satisfied? Was she just blowing her own trumpet and/or hoping for some easy cash? Nil Einne 16:43, 20 May 2007 (UTC)

I think she just didn't know of IA's existence until she saw it. The question is why a person would put something public on the internet and regret it afterwards, and whether it is right to remove this record :P SSPecter ☎|✉ ♠ 11:06, 29 September 2007 (UTC).

I have edited the "Healthcare Advocates, Inc." section to reflect a noteworthy detail of the way the Wayback Machine reacts to robots.txt: The Archive crawls and archives all websites, regardless of the presence of a robots.txt file. Content is never removed from the Wayback Machine. Instead, when users attempt to browse archived content, the presence of a robots.txt on the live (not archived) site is tested. If there is a robots.txt on the live site, then it is used to control users' access to the archived content. If at some later point the live site takes down its robots.txt (or if the site itself disappears from the web), all of the archived content becomes accessible again. I have no knowledge of the Suzanne Shell case, but I speculate that she might have been agitating for the actual removal of the archived content from the Archive's servers. TTK (talk) 21:30, 11 April 2009 (UTC)

Variability of availability

I am a frequent researcher on the Internet Archive, and I've noticed that research I conducted a few months ago indicated some of the web sites had been lost to hackers. I painstakingly visited every link to try to find an old wiki that was destroyed by hackers. Thanks to link checking done today by a former administrator of the wiki, I found the information still intact, except for 3 pages the hacker got to before the entire site was destroyed. Now I may be wrong. It's something very hard to prove. If a researcher is interested in the Internet Archive they might be interested enough to find out how they manage to keep all that information online at the same time, or are bits of the archive offline at times? I would be very grateful, and I'm sure others would like to know the mechanics of the process.--Joewski 00:25, 16 December 2006 (UTC)

Please feel free to contact me at ttk@ciar.org for information on this. I worked at the internet archive as a software engineer for four years, and can describe some of the processes. To briefly answer your question, archived content arrives in the datacenter as bundles of ARC files, which are indexed once every few months. When the live site switches to the new index (which takes the form of several CDX files, which are stored in triplicate on several machines in the cluster), all of the indexed information becomes available through the wayback machine. None of the information is deliberately taken offline in the manner you are suggesting, but server outages and errors in the indexing process can render some information unavailable for days, weeks, or months. Such information comes back online as servers are brought back up, or as a re-index indexes previously missed content. TTK (talk) 21:41, 11 April 2009 (UTC)
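To make the "index switch" idea above a little more concrete, here is a rough sketch of looking up captures in a sorted index. The record layout (URL, timestamp, ARC file name, byte offset) and the names `IndexRecord`, `build_index` and `captures_for` are assumptions made up for illustration; this is not the real CDX column set or the Archive's code.

```python
import bisect
from typing import Iterable, List, NamedTuple


class IndexRecord(NamedTuple):
    """One line of a simplified, hypothetical capture index."""
    url: str        # original URL of the captured page
    timestamp: str  # capture time as YYYYMMDDhhmmss
    arc_file: str   # ARC file assumed to hold the capture
    offset: int     # assumed byte offset of the capture in that file


def build_index(records: Iterable[IndexRecord]) -> List[IndexRecord]:
    """Sort records by URL, then timestamp, so lookups can use binary search."""
    return sorted(records)


def captures_for(index: List[IndexRecord], url: str) -> List[IndexRecord]:
    """Return every record for url, oldest first; empty if never indexed."""
    # (url,) sorts before any full record that starts with the same URL.
    start = bisect.bisect_left(index, (url,))
    found = []
    for record in index[start:]:
        if record.url != url:
            break
        found.append(record)
    return found
```

In this picture, content that was crawled but missed by an indexing run simply has no record yet, which would match the description of pages reappearing after a later re-index.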

Funding

It seems to me the Internet Archive would have quite a lot of expenses, yet neither here nor on Archive.org can I find any information on how it's funded. Does anyone know? dimo414 (talk) 05:20, 23 February 2008 (UTC)

The Archive draws from a number of sources for its funding. It receives several grants from various private and government institutions to carry out specific projects; it receives "crawl" contracts, usually from national governments, to download some subset of the web (for instance, the French government might pay it $X million to crawl all of .fr, or the American government might pay it to crawl all .gov just before and after a national election, etc); it receives payments from Archive-It users; it accepts contributions from individuals and companies (some sheerly charitable, and some as part of a collaboration, like the joint IA/Microsoft book-scanning effort); and, finally, there is what we jokingly called "Brewster's Millions". The guy struck it rich in the dot-com boom, and got out at the right time, and contributes millions of his own dollars each year to help make ends meet. Also, in theory The Archive does everything on the cheap -- relies on volunteers as much as possible, underpays its employees, uses free open-source software, uses very inexpensive hardware, and leverages Brewster's own considerable charisma and evangelical zeal to procure very cheap (sometimes even free) network connectivity, facilities, and electricity. Unfortunately in practice The Archive tends to be "penny wise, pound foolish", and wastes much of its means through misguided high-level policies and a lack of good business practices. Despite this, The Archive still exists and operates, and is accomplishing amazing things on a shoestring budget. As long as Brewster keeps pumping it with money and attracting vital and intelligent people to pursue his cause, it should continue to do so. TTK (talk) 04:43, 19 April 2009 (UTC)

Split Wayback Machine

The Wayback Machine is sufficiently notable to warrant an article itself, so I'm splitting it off. Before I started editing, this article spent a considerable amount on just the Wayback Machine while being passed off as general Internet Archive text. I've now arranged it more reasonably, so the former top-level sections that exclusively talked about the Wayback Machine are now subordinate to the Wayback Machine section itself. However, we run into deeply nested headings, the table of contents looks really top-heavy, and we already get warnings about article length. This deserves a split. -- C. A. Russell (talk) 18:15, 31 October 2009 (UTC)

I thought this article was about the Wayback Machine - is it not? Ottawahitech (talk) 01:16, 10 December 2009 (UTC)

The Internet Archive contains a lot more material than just the Wayback Machine. -- Quiddity (talk) 03:01, 10 December 2009 (UTC)

Organizational Information?

This article seems to be missing a great deal of basic information. It does not state, for example, who presently runs this archive, what its administrative structure is, or what its sources of funding are. I'm surprised to see these sorts of critical omissions in an article of such considerable importance. —Preceding unsigned comment added by 76.199.11.220 (talk) 01:47, 11 January 2010 (UTC)

Wayback machine missing websites

I understand that the Wayback Machine archives every single website in the universe, except those that have set their robots.txt file to exclude crawling. I am wondering how one can determine if this is the case? I know of sites that contain public legal records which do not show up on the Wayback Machine, and cannot find an explanation about this in the article. Ottawahitech (talk) 16:28, 13 March 2010 (UTC)

This page is just for discussing the article about the Internet Archive (archive.org) as a whole.

You'll get better replies to questions of a technical nature, about the Wayback Machine, at either Wikipedia talk:Linkrot or Wikipedia talk:Using the Wayback Machine (though that one looks unvisited recently). If those don't bear fruit, try Wikipedia:Reference desk/Computing or Wikipedia:Village pump (technical). HTH. -- Quiddity (talk) 19:39, 13 March 2010 (UTC)

Thanks for the speedy response, Quiddity, and thanks for the links to lots of interesting stuff. I still wonder, however, if the article should not include a section helping readers figure out if a robots.txt was placed on a site (if this is technically possible?). Off to read the links... Ottawahitech (talk) 16:37, 14 March 2010 (UTC)

robots.txt is always in the root directory. So just add that text to the base url to check any site. eg http://www.boingboing.net/robots.txt or http://slashdot.org/robots.txt or http://google.com/robots.txt -- Quiddity (talk) 21:43, 14 March 2010 (UTC)
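As a small illustration of the rule above (robots.txt lives at the root of the site), here is a sketch that derives the robots.txt address from any page URL on the same site. The function name `robots_txt_url` is made up for this example; it only builds the address and does not fetch anything.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {page_url!r}")
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://slashdot.org/some/story?id=1"))
```

Opening the resulting address in a browser shows whether the site publishes a robots.txt and what it excludes.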

Thanks again Quiddity, for taking the time to educate us. I posted a followup question at: Talk:Robots_exclusion_standard#.22Sorry.2C_this_page_no_longer_valid.22 Ottawahitech (talk) 17:32, 18 March 2010 (UTC)

Clarification -- The Wayback Machine actually archives every web page that it can, regardless of the robots.txt, but before it makes pages available to the user it checks the original site for a robots.txt which would have excluded the page from being crawled. Pages thus excluded are silently omitted from the user interface. This means that should the site owner change their robots.txt, the archived pages available through the Wayback Machine change to match. And of course if the original site disappears, there is no robots.txt anymore, and all of the archived pages become available to the public (which was the original point of the Wayback Machine, to preserve sites which disappear from the web). TTK (talk) 16:13, 3 June 2010 (UTC)
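The access-time check described in the comment above could be sketched roughly as follows. This is only an illustration of the described behaviour using Python's standard `urllib.robotparser`, not the Archive's actual code; the function name `is_viewable` and the crawler name `ia_archiver` used as the default agent are assumptions.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Assumed user-agent token for the check; purely illustrative.
ASSUMED_AGENT = "ia_archiver"


def is_viewable(original_url: str,
                live_robots_txt: Optional[str],
                agent: str = ASSUMED_AGENT) -> bool:
    """Decide whether an archived copy of original_url would be shown.

    live_robots_txt is the text of the robots.txt currently on the live
    site, or None if the site (or its robots.txt) no longer exists.
    """
    if live_robots_txt is None:
        # No robots.txt on the live site: nothing restricts the archive.
        return True
    parser = RobotFileParser()
    parser.parse(live_robots_txt.splitlines())
    # The archived page is hidden only if today's rules exclude its URL.
    return parser.can_fetch(agent, original_url)
```

Under this reading, the archived bytes are never deleted; only the answer to `is_viewable` changes when the live site edits or removes its robots.txt.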

I wonder if that's really all true; in the Evangelion articles, we were using some Mainichi Times articles, but the site went down and even though I couldn't find any robots.txt at all for the old domain, we couldn't get at any archived copies. --Gwern (contribs) 20:25 6 June 2010 (GMT)

He worked for The Internet Archive from 2003-2008, according to his resume. So I'd believe him until proven otherwise ;) -- Quiddity (talk) 03:31, 7 June 2010 (UTC)

He could've left out something. For example, maybe the MT sent a specific request to the IA. Or perhaps the IA follows redirects and there was a robots.txt on a domain the MT URLs redirected to. Or something like that. --Gwern (contribs) 13:36 7 June 2010 (GMT)

You're right, Gwern -- what I said above is the "normal" case. There are reasons a site might not be archived. Take-down notices from copyright holders are respected, for instance, and Mainichi Times might have been using technology other than robots.txt to prevent crawling. If you're really curious, you can try contacting Brad Tofel and asking him. He's an engineer at The Archive's web archive department, and gets a kick out of hearing how people use the wayback machine. He also knows more than anyone else there about the Wayback Machine and its metadata, so might be able to make a quick query for you (if he isn't too slammed -- one thing about The Archive is that it's chronically short on manpower, so there's always more than enough work to go around). TTK (talk) 13:01, 18 October 2010 (UTC)

Citation guidance needed

Is there a template for citing a book as retrieved from IA, as opposed to one giving the appearance of citing a hard copy? I think citation guidance should be added to this article, not only for Wikipedia, but also for other scholarly work as well. What I've found to date:

How do I cite Wayback Machine urls in MLA format?[3]: This question is a newer one. We asked MLA to help us with how to cite an archived URL in correct format. They did say that there is no established format for resources like the Wayback Machine, but it's best to err on the side of more information. You should cite the webpage as you would normally, and then give the Wayback Machine information. They provided the following example: McDonald, R. C. "Basic Canary Care." _Robirda Online_. 12 Sept. 2004. 18 Dec. 2006 . _Internet Archive_. < http://web.archive.org/web/20041009202820/http://www.robirda.com/cancare.html>. They added that if the date that the information was updated is missing, one can use the closest date in the Wayback Machine. Then comes the date when the page is retrieved and the original URL. Neither URL should be underlined in the bibliography itself. Thanks MLA!
Wikipedia Citing sources, electronic equivalents: As service providers begin making books available online it will become increasingly useful to cite them in the encyclopedia. Eventually we can begin linking all book citations to their electronic equivalents. Here is an example citation for Google Book Search:

Ballard, Joe N. (1998). Horatio Wright. In The History of the U.S. Army Corps of Engineers. DIANE Publishing. p. 143. ISBN 0-7881-7666-8. Google Book Search. Retrieved on December 23, 2007.

Amazon's search inside the book feature provides less data to non-registered users but is still quite useful. Consider:

Nash, Jay Robert (1994). Encyclopedia of Western Lawmen & Outlaws. Da Capo Press. p. 250. ISBN 030680591X. Amazon Books. Retrieved on April 13, 2005.

--Pawyilee (talk) 03:51, 8 December 2011 (UTC)
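The Wayback Machine URL in the MLA example above carries a 14-digit capture timestamp (YYYYMMDDhhmmss) followed by the original address, so both pieces needed for a citation can be pulled out mechanically. A minimal sketch, assuming the `web.archive.org/web/<timestamp>/<url>` form shown in that example; the function name `split_wayback_url` is made up here.

```python
import re
from datetime import datetime
from typing import Tuple

# Matches the URL form used in the MLA example: .../web/<14 digits>/<original URL>
_WAYBACK = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def split_wayback_url(url: str) -> Tuple[datetime, str]:
    """Return (capture time, original URL) for a Wayback Machine URL."""
    match = _WAYBACK.match(url.strip())
    if match is None:
        raise ValueError(f"not a timestamped Wayback URL: {url!r}")
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured, match.group(2)


if __name__ == "__main__":
    when, original = split_wayback_url(
        "http://web.archive.org/web/20041009202820/http://www.robirda.com/cancare.html"
    )
    print(when.date(), original)
```

The capture date and the original URL can then be dropped into whatever citation style is in use.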

Admission to hosting pirated content

I noticed in reading the Megaupload wiki page that the Internet Archive removed archives of Megaupload so that it wouldn't be held responsible for hosting piracy. This would appear to me as if they have been involved in the same thing Megaupload has been up until removing it, so I wonder what makes it okay for them but not for Megaupload. Woods01 (talk) 02:12, 4 March 2012 (UTC)

Wayback Machine shut down?

Please see [Talk: All of Wayback Machine's archived links are shut down!] As of June 30 it's still down. ERROR: "The New Wayback Machine is having problems. Please try again later." I find no recent forums activity. What do others see?
--68.127.94.194 (talk) 21:17, 30 June 2012 (UTC) Doug Bashford

Torrent support

Although torrent support has been brewing since before June (forum post), on August 7, Brewster Kahle blogged about IA's wide BitTorrent support, as did Eric Klinkert at BitTorrent blog. Technical information about it is scarce:

when did work/testing actually start?
how does it fit in within IA's infrastructure?
to what extent (if any) was Bittorrent (the company) actually involved?

Just curious, --Lexein (talk) 16:39, 8 August 2012 (UTC)

Missing News from 2001 Verified

The Internet Archive's record of missing news from 2001 is verifiable at the Internet Archive (cited, published by the Internet Archive.)

Brewster Kahle's response about missing news from 2001 is verifiable at the Internet Archive (cited, published by the Internet Archive.)

The record published by the Internet Archive clearly shows that a gap of this size, of these 15 sites, occurring simultaneously, is special because it happened only once in the history of the Internet Archive.

^ Archive missing 55 consecutive days of Newsweek in 2001.
^ Archive missing 51 consecutive days of Reuters in 2001.
^ Archive missing 44 consecutive days of Alternet in 2001.
^ Archive missing 43 consecutive days of ABC in 2001.
^ Archive missing 39 consecutive days of Time in 2001.
^ Archive missing 33 consecutive days of MSNBC in 2001.
^ Archive missing 29 consecutive days of ABC News in 2001.
^ Archive missing 20 consecutive days of Nasdaq in 2001.
^ Archive missing 19 consecutive days of Bloomberg in 2001.
^ Archive missing 19 consecutive days of LA Times in 2001.
^ Archive missing 19 consecutive days of Times of India in 2001.
^ Archive missing 18 consecutive days of CNN in 2001.
^ Archive missing 18 consecutive days of UAL in 2001.
^ Archive missing 17 consecutive days of CBS News in 2001.
^ Archive missing 10 consecutive days of NY Times in 2001.
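For anyone who wants to recheck figures like those in the list above, a count of "consecutive missing days" can be recomputed from a list of capture dates. This is a neutral sketch of the arithmetic only; the function name `longest_gap_days` is made up, and the capture dates would have to be collected from the Archive's listing for each site.

```python
from datetime import date
from typing import Iterable


def longest_gap_days(capture_dates: Iterable[date]) -> int:
    """Return the longest run of calendar days with no capture.

    Only gaps between two captures are counted; days before the first
    capture or after the last one are ignored.
    """
    days = sorted(set(capture_dates))
    longest = 0
    for earlier, later in zip(days, days[1:]):
        # Days strictly between two consecutive captures.
        missing = (later - earlier).days - 1
        longest = max(longest, missing)
    return longest
```

Running the same function over other sites and other years is one way to test whether a given gap is unusual.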

The supporters of an organization are entitled to their point of view, but they can not (and should not) attempt to hide an embarrassing fact about the organization, especially when that fact has such a reliable source of verification.

Wikipedia readers who are researching the Internet Archive deserve to know the Internet Archive is a member of the American Library Association, and it is officially recognized by the State of California as a library, and because of its goal of preserving human knowledge and artifacts it has been likened to the Library of Alexandria, and it is missing news from 2001.

—The preceding unsigned comment was added by User:61.204.240.194 04:51, 7 February 2008 (UTC) in this diff. -84user (talk) 10:28, 20 May 2008 (UTC)

Actually, the real issue is whether these facts are notable. Just because it's true (which it obviously is, as you point out), doesn't mean that it's notable. In my opinion, this is trivia to most people, but obviously we could disagree on this, and it's not up to me.

Wikipedia's official policy on notability turns on whether professional journalists have written on the topic, as their jobs and livelihoods depend on providing information that readers consider useful. As the section in question is uncited, I've removed it from the article. Unless you can cite more than one source for this issue, the section should not be restored. See WP:NOTE for more on the notability policy, and see every other section in the "Controversies" section for examples of cited controversies. Pro crast in a tor (talk) 07:21, 8 February 2008 (UTC)

Wikipedia's official policy states:

A fact within an article is NOT required to be notable. WP:NNC

A fact verified by a reliable source can NOT be erased. WP:NNC

Hi 211.7.100.61(talk) (who also seems to be using 61.204.240.194), please sign your comments, as it makes it much easier to track this conversation.

Don't you see that your pet controversy section is unlike all the others in that no professional journalist has written on the topic? That there's not a single article to support your contention that this is, in fact, a topic of interest to the public? Deleted again. Pro crast in a tor (talk) 18:23, 8 February 2008 (UTC)

Wikipedia's official policy states:

A fact within an article is NOT required to be notable. WP:NNC

A fact verified by a reliable source can NOT be erased. WP:NNC

Hi 211.7.100.61(talk), it's a fact that Gojomo works for the Internet Archive, yet this information is not included on the page. Why might that be? Perhaps because it's a non-notable fact? There might be some sort of filter to figure out which facts go onto the web page. Perhaps... oh, I know, a newspaper article, or some other reliable source! If it appears in one of them, then perhaps we should include it in the article. Until then, deleted again. Oh, and for those just tuning in, we had a good quote earlier:

"It is an unsubstantiated crackpot conspiracy theory that doesn't pass the basic sniff test." - Jimbo Wales on this issue [4]

Pro crast in a tor (talk) 00:41, 10 February 2008 (UTC)

Wikipedia's official policy states:

A fact within an article is NOT required to be notable. WP:NNC

A fact verified by a reliable source can NOT be erased. WP:NNC

Proponents of the Internet Archive have gone against Wikipedia policy, by attempting to erase a fact published by the Internet Archive itself.

Any further attempts to delete this verifiable published fact (published by the Internet Archive itself) would be unethical and against Wikipedia policy.

Blah blah blah unsourced, blah blah blah not notable. Obviously we're not agreeing, nor are you attempting to productively resolve this disagreement. Normally I'd say that we should get another editor in here to help us come to an agreement, but there are already two other folks (Gordon and Jimbo) that have agreed with my position that this is utter horsecrap. But... it sure would be nice if someone else chimed in at this point and said that. Pro crast in a tor (talk) 09:30, 10 February 2008 (UTC)

"A fact verified by a reliable source can NOT be erased. WP:NNC" The policy does not say that, or anything like it. Verified facts most certainly can be erased if they are found to be insufficiently notable. "A fact within an article is NOT required to be notable." It isn't required to be notable enough to be the subject of its own article, but it does still have to meet the standard that reliable independent sources have written about it. That doesn't appear to be the case here, which makes this section about the "controversy" a piece of original research. But you know what? It's interesting. And it is verifiable. So I think it would be worthwhile to note in the body of the article itself that IA's historical sampling rates have left gaps in its archive, and cite the pre-9/11/2001 gaps as an example. - JasonAQuest (talk) 14:22, 10 February 2008 (UTC)

WP:NNC = Notability guidelines do not directly limit article content =

"Notability guidelines give guidance on whether a topic is notable enough to be included in Wikipedia as a 'separate article', but do NOT specifically regulate the 'content of articles' (with the exception of lists of people). The particular topics and facts WITHIN an article are NOT each required to meet the standards of the notability guidelines; instead, article content is governed by other policies and guidelines, such as the policy requiring Verifiability and the guidelines covering the use of reliable sources and of trivia sections."

= A fact within an article is NOT required to be notable. WP:NNC

= A fact verified by a reliable source can NOT be erased. WP:NNC

Thanks for quoting the policy; it makes it easier to explain how you're misreading it. You're correct that WP:NOTE doesn't regulate every item within each article. But if you read the second half of the second sentence (start with the semicolon), you'll see that WP:V and WP:RS do regulate every item. WP:NOR applies as well. This information you want to insert appears to fail WP:RS and WP:NOR (specifically WP:SYN). As for the second statement you keep repeating... there is nothing in any policy that says that. If a reliably verified fact violates any WP policy, it can and should be deleted. (By the way, your attempts to delete material from this discussion are futile; our archive has no gaps.) - JasonAQuest (talk) 20:43, 10 February 2008 (UTC)

Hi JasonAQuest, thanks for the feedback. I agree with you that mentioning gaps in the service would be a useful thing to note in the body of the article. Perhaps the second paragraph of the Wayback machine section could read: "Snapshots become available 6 to 12 months after they are archived. The frequency of snapshots is variable, depending upon the website, and there have occasionally been multi-week outages in the page." However, without a citation, this smacks of OR. We could include a link to 07/23/2001 to 09/15/2001 Internet Archive is missing 55 consecutive days of Newsweek.com; however, this seems clumsy. Thoughts? Pro crast in a tor (talk) 22:55, 10 February 2008 (UTC)

It's borderline OR, but at least all it does is dig up a single fact, rather than synthesizing a collection of facts, making an inference from them, and inventing a controversy about them. - JasonAQuest (talk) 00:27, 11 February 2008 (UTC)

WP:NNC "The particular topics and facts WITHIN an article are NOT each required to meet the standards of the notability guidelines;"

This means: the particular topics and facts WITHIN an article are NOT each required to meet the standards of the notability guidelines.

This means notability IS required for SEPARATE ARTICLES, but notability is NOT required for the particular facts WITHIN an article.

This means people trying to erase a fact WITHIN an article using the 'not-notable' excuse, do not yet realize notability is only required for SEPARATE ARTICLES.

WP:NNC "instead, ARTICLE CONTENT is governed by other policies and guidelines, such as the policy requiring VERIFIABILITY and the guidelines covering the use of RELIABLE SOURCES and of trivia sections."

This means since this fact WITHIN an article can be VERIFIED at THE INTERNET ARCHIVE (THE MOST RELIABLE SOURCE OF VERIFICATION) you can NOT erase it using the "not-notable" excuse.

You could try claiming "it's trivia", but even then, the trivia guidelines state: "Such sections should not be categorically removed. This guideline does not suggest removing trivia sections, or moving them to the talk page."

So even if it's "not-notable", a VERIFIED fact with a RELIABLE SOURCE cited WITHIN AN ARTICLE can NOT be erased.

And even if it's "trivia", a VERIFIED fact with a RELIABLE SOURCE cited WITHIN AN ARTICLE can NOT be erased.

"Reliable sources" means independent sources. Like The Washington Post. Or Mother Jones. Or National Review. Or maybe even Fortean Times if you're desperate. The idea is that we base our articles on what other people report, not what Archive.org reports about themselves, and not on what you or I report. That's what "No Original Research" is all about. Please read it. If you think that this is a story that someone should be covering (and I get the impression that you do), convince a journalist to look at it. Stop wasting your time trying to shove it directly into an encyclopedia, because this encyclopedia's editorial policies don't allow that.

Also, I think you've misunderstood what "should not be categorically removed" means. It doesn't mean "should not be removed". That word "categorically" means that sometimes the information should be kept. But only when it meets all the other criteria for inclusion in the Wikipedia. No matter how many times you declare that "a VERIFIED fact with a RELIABLE SOURCE cited WITHIN AN ARTICLE can NOT be erased", that's simply incorrect. Wishing doesn't make it so. - JasonAQuest (talk) 05:52, 11 February 2008 (UTC)

The Internet Archive supporters who claimed, "notability (journalistic coverage) is required for article content" should be ashamed of themselves, because they are either lying about (or are confused about) the simple difference between 2 specific Wikipedia Policies:

1.) The Wikipedia Policy for SEPARATE ARTICLES states notability (journalistic coverage) is required for SEPARATE ARTICLES.
2.) The Wikipedia Policy for ARTICLE CONTENT states notability (journalistic coverage) is NOT required for ARTICLE CONTENT.

Notability is required for SEPARATE ARTICLES.
Journalistic coverage is required for SEPARATE ARTICLES.

Notability is NOT required for ARTICLE CONTENT.
Journalistic coverage is NOT required for ARTICLE CONTENT.

Facts within ARTICLE CONTENT need to be published
by only one Verifiable Reliable Source to be protected:
the Internet Archive is the Verifiable Reliable Source.

The Internet Archive is the Verifiable Reliable Source which has published this fact:
The Internet Archive is missing the headline news of 15 major sites during the weeks preceding September 11th, 2001.

The Internet Archive is the Verifiable Reliable Source which has published this confirmation:
Brewster Kahle admitted that regrettably the Internet Archive is missing the headline news of 15 major sites during the weeks preceding 9-11.

To summarize, if you discover Exxon published on its own homepage an admission that it regrettably has not yet paid the total compensation demanded by Alaskan courts, here is what Wikipedia Policies state about adding that fact:

1.) You can NOT create a SEPARATE ARTICLE about that fact if it lacks journalistic coverage.

2.) You CAN add that fact to the existing Exxon ARTICLE CONTENT, even if it lacks journalistic coverage, as long as you cite Exxon as the Verifiable Reliable Source.

WP:NNC
"The particular topics and facts WITHIN an article are NOT each required to meet the standards of the notability guidelines."

Finally, to all the people above who tried to use lack of notability (journalistic coverage) as an excuse to erase this verified fact published by a Verifiable Reliable Source:

If you truly want Wikipedia to begin requiring notability (journalistic coverage) for ARTICLE CONTENT, you will have to erase the Wikipedia Policy which specifically protects ARTICLE CONTENT published by a Verifiable Reliable Source. WP:NNC

Hi 211.7.100.61, you say "Notability is NOT required for ARTICLE CONTENT". That's incorrect; otherwise, I could add the word "purple" to the page, and argue that "Notability is NOT required for ARTICLE CONTENT". Which, as far as I'm concerned, is what you're doing when you mention that their crawler missed a few web sites in 2001. So? Not only is it not notable, but to label it a "controversy" is WP:OR. Pro crast in a tor (talk) 06:09, 15 February 2008 (UTC)

WP:NNC If you truly want Wikipedia to begin requiring notability (journalistic coverage) for ARTICLE CONTENT, you will have to erase the Wikipedia Policy which specifically protects ARTICLE CONTENT published by a Verifiable Reliable Source. WP:NNC

So, are you saying I can add "Barack Obama is running for president" to every article in Wikipedia? It's a true statement by a verifiable reliable source. Pro crast in a tor (talk) 05:49, 17 February 2008 (UTC)

You can add article content with a verifiable reliable source about Obama: to the Obama page.

You can add article content with a verifiable reliable source about purple: to the purple page.

Article content with a verifiable reliable source is protected by Wikipedia Policy: WP:NNC.

Why couldn't I add it to this page? Pro crast in a tor (talk) 00:16, 18 February 2008 (UTC)

First, Pro crast in a tor erased article content with a verifiable reliable source, repeatedly.

Then, Pro crast in a tor claimed journalistic coverage is required for article content, repeatedly.

Subsequently, Pro crast in a tor learned journalistic coverage is NOT required for article content. WP:NNC

Currently, Pro crast in a tor is facetiously asking why he can't add facts unrelated to the Internet Archive to the Internet Archive page.

Finally, Pro crast in a tor will realize when he encounters a verified fact which doesn't support his rosy POV about the Internet Archive, he can NOT delete ARTICLE CONTENT with a verifiable reliable source. WP:NNC

You can use the "lack of journalism" excuse to delete SEPARATE ARTICLES only.

You can NOT delete ARTICLE CONTENT when it has a verifiable reliable source. WP:NNC

61.204.240.194: Ok, you win, I'm done with this "debate". Of course, I'll still delete it if you re-add a "controversy" section without any journalistic coverage indicating that there is, in fact, a controversy, or some reason that it is, in some way, notable that their spider missed some pages on some days in 2001. Pro crast in a tor (talk) 08:01, 18 February 2008 (UTC)

The Internet Archive is not a public utility. Brewster Kahle founded and funds it out of the goodness of his heart. If he wanted to delete some or all of the collection, he could. It is HIS Internet Archive (I'm oversimplifying to make a point). Instead of complaining about omissions and conspiracies, the unhappy individual above should be grateful that the Internet Archive even exists at all! --FeralOink (talk) 12:43, 23 August 2012 (UTC)

Proposal for Machinima section

Latest comment: 11 years ago · 2 comments · 1 person in discussion

Hi there, I'm currently creating a section for the Machinima archive within the Internet Archive and would like to add it to a section within this article, as its notability won't let it be a stand-alone article. This is for a uni piece, so the article will be well constructed and cited where it needs to be, with links to the relevant pages on the Internet Archive. All progress on the section can be found on

My sandbox

with project due to be finished between Friday 15th and Sunday 17th of Feb 2013.

Let me know your thoughts on my talk page please,

Thanks,

Tomswainson (talk) 12:20, 11 February 2013 (UTC)

Just added the small section regarding the Machinima article. --Tomswainson (talk) 19:50, 17 February 2013 (UTC)

Poor coverage of national web archives

Latest comment: 11 years ago · 2 comments · 2 people in discussion

The present article suffers from very poor coverage of existing website archiving and is poorly focussed.

It makes little or no attempt to record, assess or even reference the extent of national website archiving initiatives around the world. Some examples:

Australia All sites* - PANDORA http://pandora.nla.gov.au/

Austria All sites* http://www.ifs.tuwien.ac.at/~aola/

United Kingdom: Private sites* - Web Archiving Consortium http://www.webarchive.org.uk/ Govt sites -The National Archives at http://www.nationalarchives.gov.uk/p...edwebsites.htm

Commonly, there is some selectivity of private (non-official) sites proposed for inclusion by nomination.

For more countries' web-archive programmes, see the National Library of Australia PADI page http://www.nla.gov.au/padi/topics/92.html

A Digital Library in practice and content is a rather different beast, already accorded a separate wiki article. Unwise to confuse or duplicate the two beyond a brief cross-reference. —Preceding unsigned comment added by 112.213.142.221 (talk) 22:24, 20 September 2010 (UTC)

This article is about the Internet Archive; other initiatives are listed at List of Web archiving initiatives. emijrp (talk) 16:56, 8 June 2013 (UTC)

Removal of Archive web pages

Latest comment: 11 years ago · 1 comment · 1 person in discussion

I would like some mention of how removal of pages is handled, and why pages are removed. — Preceding unsigned comment added by 24.31.246.41 (talk) 21:52, 5 August 2013 (UTC)

How can I detect all the archived links of a website?

Latest comment: 11 years ago · 3 comments · 2 people in discussion

In Romania there was a newspaper named Ziua (www.ziua.ro), quite a good one. My question is how can I detect all the archived links from this website? Here are some links archived from ziua.ro:

etc. — Ark25 (talk) 20:11, 28 August 2013 (UTC)

Late response, but did you try http://web.archive.org/web/*/http://www.ziua.ro/ and all descendants? I wonder if wget could be coerced by some judicious scripting into slavishly recovering all the WA-held ziua pages. That would be a good thing to build and publish, for reconstructing whole entities from archive. --Lexein (talk) 06:44, 17 September 2013 (UTC)

In this case I have to wait for someone good enough at wget to do the task. Trying http://web.archive.org/web/*/http://www.ziua.ro/ doesn't help me since I don't know how to automate things. I think I will try to send an email to the support team of Archive.org, to ask them to facilitate such a tool. — Ark25 (talk) 01:07, 21 September 2013 (UTC)
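A note on the automation question above: one possible approach, offered only as a rough sketch, is to ask the Wayback Machine's CDX index for every capture under a domain and turn each row into a snapshot link. The endpoint and the query parameters (`url`, `matchType`, `output`, `fl`, `collapse`, `limit`) are used here as I understand the public CDX server API, so please check them against the current documentation. The function names `build_cdx_query` and `snapshot_urls` are made up for this illustration.

```python
import json
from urllib.parse import urlencode

# Assumption: this is the public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(site, limit=None):
    """Build a CDX query URL listing captures for a site and its subdomains."""
    params = {
        "url": site,
        "matchType": "domain",       # include subdomains such as www.
        "output": "json",            # JSON rows; the first row is a header
        "fl": "timestamp,original",  # only the fields needed for a link
        "collapse": "urlkey",        # one row per distinct URL
    }
    if limit is not None:
        params["limit"] = str(limit)
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_urls(cdx_json_text):
    """Turn a CDX JSON response body into a list of Wayback snapshot links."""
    rows = json.loads(cdx_json_text)
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    ts = header.index("timestamp")
    orig = header.index("original")
    return ["http://web.archive.org/web/%s/%s" % (r[ts], r[orig])
            for r in records]
```

To use it, fetch the address returned by `build_cdx_query("ziua.ro")` (for instance with `urllib.request.urlopen`), pass the decoded response text to `snapshot_urls`, and write the resulting links to a file. That file could then be handed to `wget -i` for the actual downloading. Nothing here has been run against the live service, so treat it as a starting point rather than a finished tool.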

Merger proposal - Wayback Machine

Latest comment: 11 years ago · 18 comments · 3 people in discussion

I think we should merge Wayback Machine with this article. That article is in fact about the same thing, and the Use in legal evidence section IMO belongs to Internet Archive (see the History section). I think it should be merged into this. --Rezonansowy (talk • contribs) 06:02, 12 September 2013 (UTC)

Oppose. Wayback Machine was (correctly) split off in 2009 due to size and independent notability. As stated at the bold split, the Internet Archive is indeed about much more than the Wayback Machine. Coverage of Wayback Machine should not grow much here, and new details should continue to be added there. --Lexein (talk) 11:16, 12 September 2013 (UTC)

I agree with Lexein's comments. Size and independent notability, in particular. –Quiddity (talk) 20:00, 12 September 2013 (UTC)

I'll try to reopen this dispute. I understand your arguments, but I think that the Controversies and legal disputes and Use in legal evidence sections are in fact about the same thing. Merging them appears logical to me. --Rezonansowy (talk • contribs) 19:46, 2 November 2013 (UTC)

(Well, nobody's really disputing, just opining). But I think we did miss your point about the legal issues sections. I'd suggest,

1. Instead of merging, that the section contents which are misplaced should be moved. For example, the following relate directly to the operation and use of the Wayback Machine and should be moved to Wayback Machine: Internet Archive#Healthcare Advocates, Inc. and Internet Archive#Suzanne Shell. I don't see anything in the Wayback Machine legal section which needs to move to Internet Archive.
2. Alternatively, move all the legal actions to a third article: Internet Archive legal issues, with a redirect Wayback Machine legal issues and make it clear at the top of the article that it includes Wayback Machine legal issues, too. If this is done, it would be important to cite WP:RS to support that "Internet Archive legal issues" is a topic notable enough for an article.

Discuss? Rezonansowy? Quiddity? --Lexein (talk) 01:19, 3 November 2013 (UTC)

I agree with option #1. (with the addition of the Internet Archive#Scientology section, to those 2 you specified).

(side-note: Mention Notifications currently require a 4tilde signature in the same "save" in order to work - I'll reping Rezonansowy here :)) –Quiddity (talk) 07:02, 4 November 2013 (UTC)

Missed the SCN one. (Oh, did not know that MN detail. thx.) --Lexein (talk) 07:38, 4 November 2013 (UTC)

Hmm, I'm not sure which option is best. I only wanted those sections to be in one place. Both options #1 and #2 make sense to me. I hope that I understood correctly: in #1 you propose the inverse of my solution, moving some of Internet Archive's Controversies and legal disputes sections to the Wayback Machine article. If so, #1 should be enough; however, placing all the sections in one place would be much more accessible. --Rezonansowy (talk • contribs) 10:00, 4 November 2013 (UTC)

I'm actually easy with either. Quiddity - I know you expressed a preference for #1, but do you have an objection to #2? I mean, in a way, legal issues with both entities, the parent (Internet Archive) and the project (Wayback Machine), are likely dealt with by the same legal team, and there are certainly enough for a standalone article. If we do go with #2, as I said, I'd really like a reliable source or two discussing the various legal issues of IA and/or WBM. That would avoid any question of WP:synth. One example I've found so far is from the defendant in a lawsuit:

"It's a rather strange one [the lawsuit], too, because Wayback is used every day in trademark law. It's a common tool." "Keeper of Expired Web Pages Is Sued Because Archive Was Used in Another Suit - New York Times". NYTimes.com. July 13, 2005.

--Lexein (talk) 11:24, 4 November 2013 (UTC)

(A) I'm generally a m:mergist, because that makes it easier for readers to find info, and easier for editors to collaborate and watchlist.

(B) When asking myself "can this be a standalone wp:summary style split-off article", I think of the featured article criteria and ask whether the spin-off will be able to meet those standards, eventually. (Note that "length" is not a factor - we have some very short FAs, eg. Tropical Depression Ten (2005)).

Therefore, I lean towards merging, but am open to splitting if you believe that it would be immensely helpful (to both readers and editors). HTH. –Quiddity (talk) 17:58, 4 November 2013 (UTC)

I'm confused. Do you mean remerging both articles? Or do you mean gathering the legal sections into a new, third article (which is what I meant by #2)? --Lexein (talk) 19:34, 4 November 2013 (UTC)

Sorry, it all sounded clearer in my head!

Option #1 is to merge the WBM-specific legal issues from IA into WBM. I lean towards that option, because keeping the quantity of pages down has many benefits.

Option #2 is to split the legal issues from both articles, off into a 3rd article (summary style). I'd hesitantly support that option, if you're both sure that it's better in the short/long run.

Let me know if that's clearer. I can probably rephrase it another way! ;)

(Oh, I added the {split section} tags because there isn't a {split or merge} tag.) –Quiddity (talk) 20:34, 4 November 2013 (UTC)

That's better. How about Option #3: First, do #1 as above, just to resolve confusion of the two entities. Then, prior to doing #2, open up the creation of a new article to wider discussion.

This gets us to a position of two less confusing articles, then, possibly, three strongly internally consistent articles. Hm?--Lexein (talk) 02:44, 5 November 2013 (UTC)

I approve this message! :) –Quiddity (talk) 19:53, 5 November 2013 (UTC)

Rezonansowy ? --Lexein (talk) 22:35, 5 November 2013 (UTC)

Well, I have a dilemma. But why not? This is Wikipedia, and everything can always be discussed again, if necessary. --Rezonansowy (talk • contribs) 13:58, 7 November 2013 (UTC)

Excellent. Step #1 done. --Lexein (talk) 18:42, 7 November 2013 (UTC)

Thinking about it while actually doing the edit, I got the strong sense that separating the demands/legal issues concerning manual uploading (IA), and automatic/manual website archiving (WBM) into their associated articles is helpful to readers. Of course some might want them all together on one page (option #2), but if that's done, I'd still try to strongly distinguish the two. People think "the internet" and "the web" are the same thing; people think IA and WBM are the same thing, but one is subsidiary to the other, and on different servers, too. --Lexein (talk) 20:19, 7 November 2013 (UTC)

Scanning Center Fire

Latest comment: 11 years ago · 2 comments · 2 people in discussion

[5] [6] --emijrp (talk) 18:35, 7 November 2013 (UTC)

In time order:

SFGate
RichmondSFblog.com with video of firefighting and images of damage.
Archive.org blog
The Verge

Not mentioned so far is the delicate issue of insurance. --Lexein (talk) 20:32, 7 November 2013 (UTC)

Too many images

Latest comment: 11 years ago · 10 comments · 3 people in discussion

Is it me, or is this article cluttered with images? It's so bad that they break section headers, and overlap multicolumn lists. And {{clear}} is no help, it just leaves huge ragged gaps. Do we really need the following?

1. Some disconnected "Old storage devices at the Internet Archive" on a shelf?
2. The office plants? "Internet Archive offices"
3. "Another scribe book scanner"?

Since they're freely licensed, can't some of them go in a gallery? Discuss? --Lexein (talk) 03:50, 8 November 2013 (UTC)

A gallery would be good, so that they can be re-merged into the prose if/when the prose is expanded enough for there to be sufficient space. –Quiddity (talk) 20:21, 9 November 2013 (UTC)

Should we keep 1,2,3 in the article at all? They don't contribute to an understanding of IA, IMHO. --Lexein (talk) 20:31, 9 November 2013 (UTC)

Agreed, those 3 could all be removed, the other scanner photo further above is far superior. –Quiddity (talk) 21:35, 9 November 2013 (UTC)

I'll boldly take that action. --Lexein (talk) 00:45, 10 November 2013 (UTC)

Thanks for making the filenames more informative. I've improved your changes a bit, making the images more informative for the article. Some were more relevant in other (old or new) sections, and for instance the old photo of an IA book scanner doesn't show the anatomy of their more recent Scribe workstations that well.

Also, it's not clear to me why you removed the image from the "Open Library" section: I use a small screen but there is plenty of space there for me, so an image of a scanner is IMHO a useful way to show what feeds the website.

As for galleries, we already have one, it's the Commons category. --Nemo 07:53, 10 November 2013 (UTC)

The renaming is per Commons guideline, and I truly wish they had been named sensibly on Flickr. I (and Quiddity?) disagree with you about the scanner image. 1/3 blocked, un-whitebalanced, harshly downlit, farther away, with less detail (especially the glass), and with a big old chair in the foreground, does not a better scanner photo make. My other changes were about trying to reduce large, ugly gaps, and reducing the need for {{clear}} (more gaps). You've added {{clear}}, added back two inferior images against the opinion of three editors (including apparently Morn?), and added back gaps to the article. You've taken the ceramics collection image from a logical empty space next to the collections table, and jammed it back into the already overcrowded prose. The objective is not to jam in a photo for every section of the article. The image of an Apple IIGS with drives on a shelf, not in use, is generic, mostly blurred, and doesn't show their use. What you're doing won't really help get this article to GA status. --Lexein (talk) 12:50, 10 November 2013 (UTC)

So. It took me a few searches through the page history to figure out what exactly was being debated. I think this covers it:

Book scanner.
- Locations to put the image(s) include:
  - Internet Archive#Open Library
  - and Internet Archive#Text collection
  (I'm not sure why the video is currently in the "#Archive-It" section)
  
  (Is the scanning actually an aspect of "Open Library", or a sideproject that just happens to benefit it? If it is, perhaps the "#Text collection" section should be moved up, to be a subsection of "#Open Library"?)
- Image choices include:
  - File:Scribe Book Scanner.jpg
    (Decent long-range shot, but no closeup aspects)
  - File:Internet Archive book scanner 1.jpg
    (Good closeup, but 2008)
  - File:Scribe Machine Acquisition 3.jpg
    (My preferred image)
  - File:Playboy Scanning Prep.jpg and File:A Real Page-Turner.jpg
    (2 decent action-shots)
  - and the excellent video File:Internet-archive-brewster-kahle-2013-0329.webm
    (Best choice overall)
Whether or not to include File:Videocassettes and VCRs at Internet Archive.jpg.
(I agree that it is not a very helpful image)
Whether or not to include File:Apple IIGS and drives at Internet Archive.jpg.
(I agree that it is not a very helpful image)
Ceramic archivists image.
(I agree with moving it out of the "Collections" section, and back into its relevant section - It overlapped with the table, if my browser-window was smaller than fullscreen.)
Whether {{clear}} is needed anywhere.
(This will become clear (!) once we decide on which images to keep, and some basic testing of different browser-window widths. Whitespace is preferable to problems - it might even encourage prose additions!)
Whether a <gallery> is needed anywhere in the article, to hold great leftover images that don't fit currently.
(Again, we'll see later on. Whilst Commons is great as a repository for all images, not many readers click through to the multiple subcategories, and manually sift through the good and bad images collated there. An in-article gallery is often useful, to hold "great images that don't currently fit amongst the article text".)

HTH. Now back to coffee... –Quiddity (talk) 19:52, 10 November 2013 (UTC)

I prefer File:Scribe Machine Acquisition 3.jpg as well, and don't want File:Apple IIGS and drives at Internet Archive.jpg or File:Videocassettes and VCRs at Internet Archive.jpg, because they aren't showing equipment in use, or a mass of reference equipment, and they have depth-of-field and color problems. I think the article needs far fewer images in the body, because there's absolutely not enough prose to carry this many photos: it looks silly from any layout perspective, and is a perfect use case for a bottom gallery. I prefer only one (if any) image per topic. The video of Kahle is fine. I don't want any {{clear}} usage. Odd result: I experienced fewer overlap issues with the image placed next to the Collections table. --Lexein (talk) 00:19, 11 November 2013 (UTC)

I'm not a fan of images and discussing so much about photos makes my head hurt. A number of images (increased or decreased) is never a sensible goal in my opinion; what I'd like this article to have is a set of illustrations clearly showing, together with the text, the very diverse set of activities conducted by the IA. Rather than spending too much time on this goal though, if it gets too difficult and controversial, I'd focus on improving the text. --Nemo 09:36, 16 November 2013 (UTC)

