Talk:Wayback Machine

Latest comment: 5 days ago by Samuel Wiki in topic Pages archived count

Wiki Education Foundation-supported course assignment

  This article was the subject of a Wiki Education Foundation-supported course assignment, between 28 August 2018 and 11 December 2018. Further details are available on the course page. Student editor(s): Chr09.

Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 12:46, 17 January 2022 (UTC)

Old phrasing of disamb template

See Wikipedia:Using the Wayback Machine for information on using the Wayback Machine with Wikipedia.
Title changed by me from "Untitled" --Luhanopi (talk) 09:42, 12 October 2024 (UTC)

Banned in Russia

Not sure whether this is newsworthy enough for the main paragraph, needs a new section, or should be left out. Maybe sites are banned and unbanned all the time.

http://www.themoscowtimes.com/news/article/russia-bans-wayback-machine-internet-archive-over-islamic-state-video/510074.html — Preceding unsigned comment added by RonPaul573e (talk • contribs) 08:50, 27 October 2014 (UTC)

Still reading?

Is it still reading pages? Seems not. 82.163.24.100 (talk) 14:40, 8 June 2010 (UTC)

Did you see the part of the article that reads: "Snapshots become available 6 to 18 months after they are archived."? -- Quiddity (talk) 18:11, 8 June 2010 (UTC)

Where in Europe?

Later in the article it talks about how copyright law in 'Europe' could cause certain effects, but it doesn't mention where in Europe! The Continent? If so, where on the continent? Is it the UK? There is no single copyright law within the region... Just curious!

Presumably this refers to the European Union (not all the countries of the European Peninsula/so-called continent), which has a very important governing role. --Eleanor1944 (talk) 02:55, 11 February 2013 (UTC)

Wayback Machine is Amazingly Slow

What surprises me time and time again is how incredibly slow the WayBackMachine is. Check Google for 'waybackmachine slow' and you'll see other people agree; even called "notoriously slow" by some folks.[1] I wonder if there's a reliable source somewhere so we could mention the service's speed in the article. --82.171.70.54 (talk) 06:07, 19 June 2010 (UTC)


I believe the Wayback Machine compresses everything because there is too much information for its servers otherwise (we are talking about all, or most, of the World Wide Web!), so it takes a long time to decompress all of the related files. - 45.36.173.204 "wellsilver"
Although your comment dates back to 2010, it may still be worthwhile to read IA's Jason Scott's explanation of the performance of the Wayback Machine: [2] MichielN (talk) 13:47, 19 May 2023 (UTC)

Still collecting pages?

I was able to see the www.defenselink.mil page from October 22, 2009 http://replay.waybackmachine.org/20091022164418/http://www.defenselink.mil/

171.64.66.13 (talk) 15:31, 9 September 2010 (UTC)

Why aren't all those archived links in the Wayback Machine working anymore?! Can't someone please fix the Wayback Machine?! --Angeldeb82 (talk) 20:30, 26 January 2012 (UTC)

Would you kindly explain to those of us who are not familiar with the term, what are "archived links"? Thanks in advance. Ottawahitech (talk) 15:55, 3 March 2012 (UTC)
I took "archived links" to mean links to its old, archived pages, its main function.
As of June 30 it's still down. ERR: "The New Wayback Machine is having problems. Please try again later." Seeking help in forums etc., I could find no activity in recent months. I hope this historical treasure comes back, as I see evidence that Winston Smith's memory hole is gaining power, and coincidentally the historical treasure of Google's Usenet archive no longer seems set in stone.
--68.127.94.194 (talk) 17:53, 30 June 2012 (UTC) Doug Bashford
UPDATE to my above: I've since used it, and it's seemingly working fine.
--68.127.90.135 (talk) 16:15, 27 July 2012 (UTC) Doug Bashford

Netbula v. Chordiant Software ? ...Jargon?

That section makes no sense. The first paragraph, which I assume is accurate, is meaningless to an outsider; it probably relies on jargon and/or insider knowledge. Suggest repair or deletion.
--68.127.94.194 (talk) 16:56, 30 June 2012 (UTC) Doug Bashford

Not reliable anymore

A matter of location of the IP? — Preceding unsigned comment added by 201.10.57.86 (talk) 02:05, 6 September 2012 (UTC)

Reliability in retrieving archived material

It would probably be miraculous if the WM could archive everything on the internet, but as an experienced user I know only too well that pages and images are often unavailable not because of robots.txt or legal reasons, but simply because WM failed to retrieve them properly. There is absolutely no mention of this in the article and there should be. Lee M (talk) 02:42, 1 July 2013 (UTC)

I agree. It only archives 10% to 40% of a site's pages, especially if the site has more than 500 pages; for sites with millions of pages or links, it stores 10% at most. --Salem F (talk) 01:12, 7 December 2015 (UTC)

Not well

Section Search engine links:

... began to provide links to other versions of pages archived on the Wayback Machine.

What does that even mean? That they use the Wayback Machine as a caching service? That it is possible to see not only the latest version of a page, but older versions as well? Whatever it is, it ought to be described.

--Mortense (talk) 14:18, 15 February 2014 (UTC)

December 2014

This week it rained in San Francisco and the power immediately blew out. Your tech utopia • The Register

Internet Archive: The big storm in SF has knocked out power to our main data center, so the site will be down for a while. We'll keep you posted here! 7:59 AM - 11 Dec 2014

unintelligible sentence

Under the heading "Origin, growth and storage", this rather odd sentence appears: "This became a threat of abuse the service for hosting malicious binaries." Can anyone make sense of this? It would seem to be missing a few words. Bricology (talk) 06:40, 23 March 2015 (UTC)

  Done. I checked all three references the paragraph cites. I changed the sentence to, "This became a threat of abuse by the service for hosting malicious binaries." The sources support the assertion that potentially malicious executables and PDFs are currently archived at the site.  —Aladdin Sane (talk) 19:06, 25 March 2015 (UTC)

Storage capacity

At present this section is mainly a list of historical capacities. Can anyone add anything about the growth rate and future ability to store information? It would also be good to include information in the section on resilience, i.e. security of the data stored. LookingGlass (talk) 10:14, 12 September 2015 (UTC)

External links modified

Hello fellow Wikipedians,

I have just added archive links to one external link on Wayback Machine. Please take a moment to review my edit. You may add {{cbignore}} after the link to keep me from modifying it, if I keep adding bad data, but formatting bugs should be reported instead. Alternatively, you can add {{nobots|deny=InternetArchiveBot}} to keep me off the page altogether, but should be used as a last resort. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—cyberbot II Talk to my owner: Online 13:36, 31 March 2016 (UTC)

Stanford version of Wayback Machine

I was just wondering if the Stanford version of the Wayback Machine is in any way related to the Internet Archive's Wayback Machine. And the Stanford Wayback Machine has a few pages, some dating to late 1991! So if anyone knows, make sure to reply.

Source(s):
https://swap.stanford.edu/ — Preceding unsigned comment added by 173.73.242.76 (talk) 01:34, 25 April 2016 (UTC)

An error on Storage capabilities

At the start, we claim that in 2009 the site grew by 100 TB per month.

At the bottom, we claim that in 2014 the site grew by 20 TB per week, which is 80 TB per month - less than in 2009.

Is it possible? רן כהן (talk) 13:53, 11 May 2016 (UTC)
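
As a rough check of the arithmetic in this thread, the two quoted figures can be put on the same monthly basis. The sketch below assumes an average month of 52/12 weeks rather than exactly four; the variable names are illustrative only, and the figures are the ones quoted above, not independently verified.

```python
# Put both quoted growth figures on a monthly basis.
# Assumption: an average month is 52/12 weeks (about 4.33), not exactly 4.
WEEKS_PER_MONTH = 52 / 12

growth_2009_tb_per_month = 100  # figure quoted for 2009
growth_2014_tb_per_week = 20    # figure quoted for 2014

growth_2014_tb_per_month = growth_2014_tb_per_week * WEEKS_PER_MONTH

print(f"2014: about {growth_2014_tb_per_month:.1f} TB per month")
print("Slower than 2009:", growth_2014_tb_per_month < growth_2009_tb_per_month)
```

On that assumption the 2014 figure comes to about 86.7 TB per month, which is still below the 100 TB per month quoted for 2009, so the apparent discrepancy remains.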

"Mass deletion of content"

"Beginning in 2015, mass deletions of previously archived content caused a number of critics to question the sincerity of this goal."

The cited sources don't support that assertion. The first source is confused and inaccurate. The second source contains an update to the effect that the problem was specific to that user and has been fixed. Both are essentially self-published blogs. -- GreenC 21:45, 23 September 2016 (UTC)

@Green Cardamom: Then please go ahead and remove it (and put this info into the edit summary). --Fixuture (talk) 17:02, 29 September 2016 (UTC)
I was going to alert people to a seldom-covered fact: the archive's own archive of itself claims to have 502 billion pages saved, not the current 278. However, I later saw that it's just a change in their counting definition. I hope the "bug" in the two sources you mention served as a wake-up call for certain people to get their act together. A site this important should be coded in such a way that bugs are likely to make it display more pages than desired. Connor Behan (talk) 02:53, 13 February 2017 (UTC)

Major problem with robots.txt

Hello, I just noticed that, because the Wayback Machine won't archive pages blocked by robots.txt AND also deletes all previous archives of the webpage from before robots.txt was used, there is a flaw in it:

  • If a website goes defunct and another site later opens at the same URL with a robots.txt, the archive of the previous, defunct website can be deleted, even though the new owner does not technically own the dead website's version of the URL.
  • If a site gets hacked and a robots.txt is applied, the same thing happens: all history is gone.
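
To make the scenario above concrete, here is a minimal sketch of how a crawler that honours robots.txt would treat the same URL before and after a change of ownership. It uses Python's standard `urllib.robotparser`; the user-agent name `ia_archiver`, the example URL, and the two robots.txt bodies are assumptions for illustration, and the sketch says nothing about how the Wayback Machine actually applies exclusions to captures it already holds.

```python
from urllib.robotparser import RobotFileParser

# Assumption: "ia_archiver" stands in for the archive crawler's user agent.
CRAWLER_AGENT = "ia_archiver"

# Hypothetical robots.txt served by the original (now defunct) site: allows everything.
ORIGINAL_SITE_ROBOTS = "User-agent: *\nDisallow:\n"
# Hypothetical robots.txt served by whoever controls the domain now: blocks everything.
NEW_OWNER_ROBOTS = "User-agent: *\nDisallow: /\n"


def is_blocked(robots_txt: str, url: str, agent: str = CRAWLER_AGENT) -> bool:
    """Return True if the given robots.txt text forbids `agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


url = "http://example.com/old-article.html"
print(is_blocked(ORIGINAL_SITE_ROBOTS, url))
print(is_blocked(NEW_OWNER_ROBOTS, url))
```

The same URL is allowed under the first file and blocked under the second, which is the situation the two bullet points describe: the decision depends only on what the current holder of the domain serves.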

Check out a citation of an archive of SpySheriff: the Wayback Machine used to host the website, but now that the site has a robots.txt, the previously archived versions are deleted. I assume hackers adjusted the website under that URL to include that file.

This is another threat to both Wikipedia and the Wayback Machine, as the Wayback Machine has no "protection" for its archive. If things can accidentally vanish through website replacement with robots.txt and through hacked sites, archiving becomes virtually pointless in the long run. Joeleoj123 (talk) 05:12, 15 April 2017 (UTC)

@Joeleoj123: Thank you for bringing this up. Do you have any relevant references? However, from what I can see, there are good reasons to exclude malware-distributing websites, which seems to be the case for "SpySheriff". Also it seems that as of last month they are exploring ignoring robots.txt more broadly (see: Wayback Machine#Website exclusion policy). --Fixuture (talk) 14:26, 18 May 2017 (UTC)
This is very easily solved by using a whois service to check whether the owner changed. 152.62.109.203 (talk) 12:13, 19 December 2022 (UTC)

This is still regularly occurring. As an example, someone unrelated to the original site owners has taken over the expired domain name www.xyzzynews.com and redirected it to a casino site, so that years of archived material that I need to access are no longer available. What this means is that anybody can delete anything they want from the Wayback Machine as long as the domain name is available for purchase. There needs to be some mention on this page that the archived material of sites that don't exist anymore is not safe and can disappear at any time. 116.250.163.80 (talk) 01:28, 20 July 2018 (UTC)

Try https://archive.is/www.xyzzynews.com --GreenC 01:40, 20 July 2018 (UTC)
Also, when blocked by robots.txt, the original HTML can still be accessed by using a non-JavaScript-enabled browser, or simply by doing a wget or curl request to retrieve the HTML and viewing the HTML file locally. The robot blockage mechanism requires JavaScript to work. -- GreenC 13:12, 20 July 2018 (UTC)

Hi, regarding the 'citation needed' on the 2017 policy change mentioned in the main page, I looked into it and found that there indeed was an automated mechanism via robots.txt documented here but the page got removed in 2015. The docs on wbm exclusion since late 2018 just say to write an e-mail. Might have happened even earlier, I did not have time to hunt down the earliest mention across site layout changes. Theultramage (talk) 09:12, 25 July 2020 (UTC)

Problem with only first page of pdf files

I know this is off topic but I don't know a better way to reach Wayback users.

I have an ongoing problem with only the first page of PDFs being supplied:

https://web.archive.org/web/20060912144906/http://www.dbts.edu/journals/1996_1/ACDIXON.PDF https://web.archive.org/web/20160313082813/http://users.ipfw.edu/jehle/deisenbe/cervantes/bowle.pdf

and many others. I am using Safari on iOS, latest versions. Any remedy? Thanks. deisenbe (talk) 11:37, 5 April 2019 (UTC)

@Deisenbe: I don't have this problem; they both download as complete multipage files. Try a different browser or system. -- GreenC 14:01, 5 April 2019 (UTC)
The same problem occurs in Chrome and Dolphin. I was hoping some reader had dealt with this. deisenbe (talk)
Maybe clear the cache? Download the file and open it with a different PDF viewer not attached to the browser? -- GreenC 14:10, 5 April 2019 (UTC)

Self-censorship BY (not of) the Wayback Machine

I'm not able to edit Wikipedia myself, but I believe this case to be notable enough to be included. In August 2016, the Wayback Machine removed an archived page of its own volition and out of pro-homosexual, anti-Nazi bias. Link.--Adûnâi (talk) 11:47, 6 April 2019 (UTC)

Observation: User agent passthrough.

Hello. I have noticed that when using web.archive.org/save/example.org (initially web.archive.org/record/example.org in October 2013, see http://www.digitaljournal.com/article/360776 ), the Wayback Machine forwards the browser's user agent to the archived page.

This explains why archiving a website from a mobile web browser brings up the mobile version of the webpage.

Whether the Wayback Machine keeps a record of that user agent is unknown. --Handroid7 (talk) 14:51, 26 August 2019 (UTC)
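
A minimal sketch of the kind of request described above, assuming the `web.archive.org/save/` URL pattern mentioned in the comment. It only builds the request with a mobile user agent and does not send it; the helper name and the user-agent string are made up for illustration, and whether the service forwards the header is the commenter's observation, not something this code verifies.

```python
from urllib.request import Request

# Assumption: the save endpoint pattern quoted in the comment above.
SAVE_PREFIX = "https://web.archive.org/save/"

# Illustrative mobile user-agent string, not taken from the discussion.
MOBILE_UA = "Mozilla/5.0 (Linux; Android 9; Mobile) ExampleBrowser/1.0"


def build_save_request(target: str, user_agent: str) -> Request:
    """Build (but do not send) a save request carrying the given user agent."""
    return Request(SAVE_PREFIX + target, headers={"User-Agent": user_agent})


request = build_save_request("example.org", MOBILE_UA)
print(request.full_url)
print(request.get_header("User-agent"))
```

Swapping `MOBILE_UA` for a desktop string would be the way to test the observation by hand: if the header is passed through, the two captures should show different versions of a page that serves separate mobile and desktop layouts.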

It would appear that copyright claims against an archive service would be spurious given that there exists an explicit limitation against copyright in the United States which allows for archival of content. See Title 17, United States Code. Sec. 108. https://www.law.cornell.edu/uscode/text/17/108 . Thus, it would appear that any claim for copyright infringement against an archive service such as wayback would be obviously meritless, making the assertion that cases were filed on such grounds highly suspect at best. I would therefore suggest removal of such references to such matters unless it can be shown that case was filed in PACER. http://www.pacer.gov 66.90.153.184 (talk) 23:10, 3 November 2019 (UTC)

I am an Israeli citizen living in Israel. Israeli copyright law says archival for public access is permitted only by specific law, e.g. the law by which the national library of Israel operates, and requires publishers to submit two copies of every book.
So when the Internet Archive scraped my website and made copies of it available to the public, it didn't rely on U.S. law. It relied on an Israeli citizen having a really hard time taking a foreign company with no local offices to court. I call it anarchism.
As long as they respected my request not to make the archive of my site public, and the robots.txt directive not to scrape my site, I was quiet. Then they decided to scrape my site regardless. In four days they consumed as much bandwidth as everyone else does in three months, including people browsing the website, search engines' crawlers, hackers searching for vulnerabilities, and the Library of Congress and the French national library coming every two weeks to archive every image hotlinked by U.S. and French sites. Why? Because they want to archive not just the content under sitemap.xml or linked from the root index.html, but also everything under cPanel & co (including graphics and fonts) for all posterity. Who knows? Maybe I tailored my version of cPanel, and in a hundred years some historian would find it interesting. 152.62.109.203 (talk) 12:12, 19 December 2022 (UTC)

Censorship and other threats

Someone who understands this sentence should rewrite it for clarity: There are known rare cases where online access to content which "for nothing" has put people in danger was disabled by the website.

Perhaps a longer quotation would help. 71.14.76.58 (talk) 22:34, 25 March 2020 (UTC)

Wayback Machine is blocked in India ?

I found two news reports that the Internet Archive was blocked in India in 2017 (both are in Chinese), but I don't know whether the block has since been lifted. Should it be put into the article?

印度政府突然全國封鎖「Wayback Machine」!事前未發出通知 [The Indian government suddenly blocked the "Wayback Machine" across the country! No prior notice] (in Chinese (Hong Kong)). Unwire.hk [zh]. 2017-08-10. Archived from the original on 2017-08-10.

【印度闪电政策再一发】打击盗版还是包庇贪官?印度关闭网路「网页时光机」 [[India's Lightning Policy Repeats Again] Combating Piracy or Covering Corrupt Officials? India shuts down the Internet "Wayback Machine"] (in Chinese). TechOrange. August 11, 2017.

--BlackShadowG (talk) 03:56, 29 June 2020 (UTC)

AFAIK that's old news. The Internet Archive was at times blocked by various authoritarian governments but it usually comes back. Nemo 13:23, 29 June 2020 (UTC)

Site cannot archive pages

For the last few days, the Wayback Machine has not been able to archive web pages. --5.43.102.127 (talk) 15:42, 25 July 2020 (UTC)

Move discussion in progress

There is a move discussion in progress on Talk:WABAC machine which affects this page. Please participate on that page and not in this talk page section. Thank you. —RMCD bot 13:48, 9 October 2020 (UTC)

Wayback Machine blocked in India

The Wayback machine has been blocked in India, possibly due to copyright issues.[1] There will be a message that says "Your requested URL has been blocked as per the directions received from Department of Telecommunications, Government of India. Please contact administrator for more information."

References

  1. ^ "Wayback Machine has been blocked in India". The Verge. Retrieved 15 February 2021.

Oldest cached pages

Whilst the oldest cached pages are reported to have been from the 12th of May 1996, I have found a page that predates them (https://web.archive.org/web/19960511013802/http://www.geocities.com/homestead/), from May 11th, 1996. I don't think it's time zones or anything like that in effect. Should it be added that they started archiving on the 11th, or at least that the earliest (known?) page is from that date? Markymark101 (talk) 17:01, 3 December 2021 (UTC)

Awesome find. Geocities no less. Sure, go ahead and change it; there is no official source for the date, just links to captures people have found. You could re-frame it as "the oldest known archive date". -- GreenC 19:22, 3 December 2021 (UTC)
It's not just you who thinks it's not time zones. When one archives a page at e.g. 12:00:00 on 25 February, 2025 (UTC), no matter where one is, it gets the "20250225120000" timestamp. Alfa-ketosav (talk) 18:12, 22 February 2024 (UTC)
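
To illustrate the timestamp point, here is a small sketch that reads the 14-digit stamp out of the capture URL quoted above and interprets it as UTC, as the reply describes. The function names are hypothetical, and the `/web/<14 digits>/` pattern is inferred from the URLs in this discussion rather than from any official specification.

```python
import re
from datetime import datetime, timezone


def parse_wayback_timestamp(stamp: str) -> datetime:
    """Interpret a 14-digit YYYYMMDDhhmmss capture timestamp as UTC."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def timestamp_from_url(url: str) -> datetime:
    """Extract the capture timestamp from a /web/<14 digits>/ archive URL."""
    match = re.search(r"/web/(\d{14})/", url)
    if match is None:
        raise ValueError("no 14-digit timestamp found in URL")
    return parse_wayback_timestamp(match.group(1))


capture = timestamp_from_url(
    "https://web.archive.org/web/19960511013802/http://www.geocities.com/homestead/"
)
print(capture.isoformat())  # 1996-05-11T01:38:02+00:00
```

Read this way, the Geocities capture falls at 01:38:02 UTC on 11 May 1996, so the date in the URL does not depend on the viewer's local time zone.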

Blacklisting of adservers

It looks like they have recently blacklisted advertisement servers such as tpc.googlesyndication.com and 2mdn.

> This URL is in our block list and cannot be captured. Please email us at "[…]" if you would like to discuss this more. — Preceding unsigned comment added by Okoso (talk • contribs) 00:38, 12 January 2022 (UTC)

Usage tip for Internet Archive Digital Library

As important as Internet Archive is in terms of providing working links, it seems like there should be a page for usage tips.

This tip is specific to the Internet Archive Digital Library:

  • When searching for a book title using the default "search metadata" option, you should put the title in quotes to specify an exact match.
  • However, if the title includes a colon, you need to delete the colon or you will get no match.

Fabrickator (talk) 07:30, 23 February 2023 (UTC)
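
The two tips above can be combined into one small helper. This is only a sketch of the string handling: the function name and the sample title are hypothetical, and it does not call any Internet Archive search interface.

```python
def exact_title_query(title: str) -> str:
    """Drop colons, collapse whitespace, and wrap the title in double quotes."""
    cleaned = " ".join(title.replace(":", " ").split())
    return f'"{cleaned}"'


print(exact_title_query("Example Title: A Subtitle"))  # "Example Title A Subtitle"
```

The resulting string is what one would paste into the default "search metadata" box, according to the tip.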

Crawler?

Which crawler software and user agent name does the Wayback Machine use, anyone know? I'm looking for reliable and recent sources. By the way, I know Heritrix exists, and that it is a project from the Internet Archive, but that doesn't mean they currently use it for their Wayback Machine. Thanks. The reason I'm asking is I'd like to include this information in this article, and possibly other places (e.g. User-Agent header, maybe Heritrix, etc). --2001:1C06:19CA:D600:2BD8:5934:EB69:C9 (talk) 10:33, 12 September 2023 (UTC)

Define "crawl"

I think the article needs to provide a clear definition of the word "crawl" and some of its varied uses. The inexperienced, technically limited reader, like myself, has a glimpse of what it means but a concise definition would be helpful. The source article The Internet Archive Turns 20 contains 84 varied uses of the word. Buster Seven Talk 13:12, 15 June 2024 (UTC)

Website number drop?

While on January 3, 2024, the Wayback Machine was reported to have over 866 billion archived websites, as of 08:22, 22 February 2024 (UTC), the Internet Archive's main page (archive.org) and web.archive.org say 365 billion. Why did this decrease happen? Alfa-ketosav (talk) 20:11, 21 February 2024 (UTC)

Also, as of 08:22, 22 February 2024 (UTC), the dropdown menu appearing on the "Web" part of the menu still says 866 billion archived websites. Alfa-ketosav (talk) 20:20, 21 February 2024 (UTC)

Blocked in Russia?

Is the Wayback Machine still blocked in Russia? This source claims that it was blocked in 2015-2016. Bottle for Bread (talk) 10:27, 13 August 2024 (UTC)

India was formerly listed as well; it appears to be still blocked there, though the block is not entirely enforced, so I would suggest removing Russia and adding India with a note saying that it is not fully enforced and that it depends on the region. Bottle for Bread (talk) 11:46, 13 August 2024 (UTC)

Data Breach

No info on breach? 2603:6080:D841:50F4:8859:19D8:C939:6150 (talk) 13:50, 10 October 2024 (UTC)

Correcting contradictory, apparently erroneous statement in #History section

The history section begins with the following statements:

The Wayback Machine began archiving cached web pages in 1996. One of the earliest known pages was archived on May 10, 1996, at (UTC).[1]

Internet Archive founders Brewster Kahle and Bruce Gilliat launched the Wayback Machine in San Francisco, California,[2] in October 2001,[3][4] primarily to address the problem of web content vanishing whenever it gets changed or when a website is shut down.[5]

How could the Wayback Machine begin archiving pages in 1996 if it was not launched until 2001?

It appears it is supposed to say the Internet Archive began archiving pages in 1996 and then in October 2001 the public-facing Wayback Machine was launched. I am basing that on this statement from reference [2]:

"The original idea for the Internet Archive Wayback Machine began in 1996, when the Internet Archive first began archiving the web. Now, five years later, with over 100 terabytes and a dozen web crawls completed, the Internet Archive has made the Internet Archive Wayback Machine available to the public. The Internet Archive has relied on donations of web crawls, technology, and expertise from Alexa Internet and others. The Internet Archive Wayback Machine is owned and operated by the Internet Archive."

I am going to correct what appears to be an error here and wanted to catalogue my reasoning in case someone more familiar with this topic has another interpretation.

References

  1. ^ PepsiCo, Inc. (May 10, 1996). "PepsiCo Home Page". Internet Archive/Wayback Machine. Archived from the original on May 10, 1996. Retrieved October 8, 2022.
  2. ^ "Wayback Machine General Information". Internet Archive. Archived from the original on December 5, 2019. Retrieved March 2, 2021.
  3. ^ "WayBackMachine.org WHOIS, DNS, & Domain Info – DomainTools". WHOIS. Archived from the original on May 14, 2020. Retrieved March 13, 2016.
  4. ^ "InternetArchive.org WHOIS, DNS, & Domain Info – DomainTools". WHOIS. Archived from the original on May 12, 2020. Retrieved March 13, 2016.
  5. ^ Notess, Greg R. (March–April 2002). "The Wayback Machine: The Web's Archive". Online. 26: 59–61. INIST 13517724.

--MYCETEAE 🍄‍🟫—talk 06:54, 30 October 2024 (UTC)

Making it available to the public does not mean the Wayback Machine didn't exist in 1996 as an internal application. Obviously they created software to do the archiving, and that software had a name. It's a good question, though, when the software was first called the "Wayback Machine". -- GreenC 14:50, 30 October 2024 (UTC)
Yeah, it was certainly confusing and contradictory the way it was written. I made the change (permalink) to say The Internet Archive began archiving cached web pages in 1996 to align with the source. I actually came to this article because I tend to use "Internet Archive" and "Wayback Machine" interchangeably but realized there was a distinction and wanted to clear it up. My takeaway from the sources and articles is that Internet Archive runs many other projects, such as Open Library. My reading of the source is: The Wayback Machine is the public-facing version of the web page archive. The archive dates back to 1996 and the public-facing Wayback Machine was launched in 2001. --MYCETEAE 🍄‍🟫—talk 16:45, 30 October 2024 (UTC)

Broken bar chart

The chart is clearly broken, as shown by the parenthetical 'color' designations on the bars, while the bars remain entirely light blue. Can someone who knows how to format this correctly fix it? Also, it has a note on the citations of "Update me at end of 2021", which clearly hasn't been followed... cheers. anastrophe, an editor he is. 17:56, 4 November 2024 (UTC)

Pages archived count

I removed the table of archived page counts from the Internet Archive article because it was too detailed for a summary. There are also discrepancies between the figures in that table and the bar chart in this article, so I could use someone's help figuring out which numbers are correct. The table also seems to be better sourced; refs can be added to the bar chart after the year. Samuel Wiki (talk) 11:24, 6 November 2024 (UTC)