Poster: Nemo_bis | Date: Jul 7, 2014 10:37am
Forum: faqs | Subject: Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy
The matter is well known:
https://archive.org/about/faqs.php#2
https://archive.org/about/exclude.php
http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html
The Internet Archive doesn't run for free; it has huge costs. Surprisingly low for the level of service it provides, but still huge. When you ask for more access, have you first asked yourself whether *you* would pay for the additional legal costs it might cause?
Shouldn't we instead be happy that resources have been invested in removing the six-month embargo and in allowing on-demand archival of URLs, so that now we can enjoy crawls immediately *and* request our own?
Until the Oakland Archive Policy is superseded, the Internet Archive is not going to change its policies. Is there an alternative standard that one could adopt? If not, who's going to make one? Probably netpreserve.org and IFLA would need to be involved, at least.
If you don't like the current policy, work to create one that better serves the public while remaining a legal defense strong enough to safeguard the Internet Archive...
Some more links for further reading:
https://archive.org/post/407088/honoring-present-instead-of-past-robotstxt-is-illogical
https://archive.org/post/1009682/archived-pages-should-be-unaffected-by-robotstxt-changes
https://archive.org/post/1001794/retroactive-and-permanent
https://archive.org/post/433848/domain-resellers-blocking-waybackmachine
https://archive.org/post/225623/retroactive-robotstxt
https://archive.org/post/188806/retroactive-robotstxt-and-domain-squatters
https://archive.org/post/184024/robotstxt-policy-is-a-failure
https://archive.org/post/62230/retroactive-robotstxt-exclusion-different-domain-owner
https://archive.org/post/8920/cybersquatters-copyright-ownership
https://archive.org/post/602721/remove-archived-webpages-when-domain-was-in-hands-of-previous-owner
https://archive.org/post/557165/will-past-crawls-stay-removed-after-removing-robotstxt
https://archive.org/post/423432/domainsponsorcom-erasing-prior-archived-copies-of-135000-domains
https://archive.org/post/401162/parked-domains-robotstxt-disallows-viewing-of-past-content
https://archive.org/post/406315/archived-sites-being-made-no-longer-available-due-to-current-robotstxt
https://archive.org/post/280486/domain-name-re-sold-robots-problem
Reply
Poster: metaeducation | Date: Mar 24, 2016 11:23am
> yourself if *you* would pay for additional legal costs
> it may happen to cause?
There are various entities I'd hope would be willing to get in the fight if someone were to sue (the EFF, to name one).
Either way, it seems there should be a way to irrevocably greenlight the Internet Archive on content. A license on the content can already do this.
For instance a Creative Commons license: if my blog is entirely CC-BY-SA content, then shouldn't the archive be able to keep it up regardless of some hypothetical later state of robots.txt? There could also be something more selective, an "Internet Archive License", so that even otherwise copyrighted sites could greenlight the archive keeping a copy.
If it has to be an opt-in process, that's unfortunate. But I'd certainly prefer being able to "opt in to future domain squatters not being able to erase my existence" over having no choice at all...
Reply
Poster: Hjulle | Date: Mar 4, 2015 12:50am
This will also become a growing problem as more and more webmasters die (or otherwise become unable to pay for their domains). If a domain switches owners, the new owner should not have any power over the old owner's content.
Reply
Poster: Hjulle | Date: Mar 4, 2015 12:54am
A reasonable compromise would be to make "User-agent: *" affect only the current version, and make "User-agent: ia_archiver" retroactive. That way you wouldn't remove history by mistake, but you could still remove it just as easily, and you wouldn't have to change any of the policy documents.
Also note that "The Robot Exclusion Standard does not mention anything about the '*' character in the Disallow: statement" - https://en.wikipedia.org/wiki/Robots_exclusion_standard#Universal_.22.2A.22_match
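Under that convention, a site owner's robots.txt might look like the sketch below. This is hypothetical: the "future crawls only" vs. "also retroactive" distinction between the two groups is the proposal itself, not how any archive currently interprets the file.

```text
# Blocks future crawling by everyone, but (under the proposed
# convention) would leave already-archived snapshots visible:
User-agent: *
Disallow: /

# Only this group would additionally hide past snapshots:
User-agent: ia_archiver
Disallow: /
```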
Reply
Poster: Nemo_bis | Date: Mar 4, 2015 1:33am
Just think of all the emails or support requests that might come from webmasters confused by the (non-)interpretation of "*": increasing the workload like that would defeat the purpose. I can understand why the IA prefers a conservative (customary?) interpretation for now, and I trust them to switch to a less defensive interpretation whenever that's more sustainable than the opposite.
Reply
Poster: Hjulle | Date: Mar 4, 2015 1:46am
But according to https://archive.org/post/423432/domainsponsorcom-erasing-prior-archived-copies-of-135000-domains they already do that: only "User-agent: ia_archiver" should remove anything, so my point was irrelevant.
I drew my first conclusion from this site https://web.archive.org/web/*/http://www.testblogpleaseignore.com/2012/06/22/the-trouble-with-frp-and-laziness/ not having any archive, while its (new) robots.txt only says "User-agent: *".
Reply
Poster: dolalin | Date: Jan 8, 2020 1:05am
robots.txt should be respected, but only on a per-crawl basis. If people want things removed from the IA, they should be obliged to do at least the bare-minimum action of sending an email to request it.
Reply
Poster: Menelmacar | Date: Apr 2, 2015 4:02pm
That's the thing: there's nothing customary about it. The robots.txt standard was invented to affect the *current* behavior of crawlers. Stopping or limiting current crawling is all it was ever drafted to do. As far as I've seen, it was never proposed that compliant robots would be expected to perform actions elsewhere, such as modifying existing databases.
See:
http://www.robotstxt.org/orig.html
http://www.robotstxt.org/norobots-rfc.txt
http://en.wikipedia.org/wiki/Robots.txt
The "Oakland Archive Policy" that IA defers to ( http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html ) tries to use robots.txt for a purpose it was never designed for. It's a Band-Aid for the fact that there never was (and likely never will be, given the legal tangles involved) a dedicated mechanism for sites to declare whether it's ok for archiving sites to retain permanent copies.
For its part, robots.txt was never even approved by a major standards body as a standard. It's only a de facto one, which one would think (note: IANAL) might make its use in a legal context even more problematic.
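As a minimal sketch of that original design, using Python's standard urllib.robotparser (the robots.txt rules and URLs here are invented for illustration): the only question the protocol answers is whether a crawler may fetch a given URL *now*; nothing in it addresses copies fetched under an earlier robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, as a new domain owner might publish it.
robots_txt = """\
User-agent: ia_archiver
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant crawler asks this at fetch time, before each request:
print(rp.can_fetch("ia_archiver", "https://example.com/old-post.html"))  # False

# Other agents are unaffected by that group:
print(rp.can_fetch("SomeOtherBot", "https://example.com/old-post.html"))  # True
```

Nothing in the file instructs a robot to alter data it fetched earlier; treating it as a retroactive takedown signal is the archive's own policy layered on top.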
It's unfortunate that (to my knowledge) no legal protection similar to what exists for temporary caching ( http://en.wikipedia.org/wiki/Online_Copyright_Infringement_Liability_Limitation_Act#Other_safe_harbor_provisions ) has been enshrined into law for cases where Internet archiving is provided to the public in an essentially unmodified form for no profit. Given the immense value of a resource like the IA to society, ideally something would be worked out to put a site like it on safer footing.
I think the long and the short of the problem is that the IA doesn't have the legal staff, legislated liability protection, or access to standardized authorization protocols that would put it on safer legal ground, nor enough staff to handle enormous volumes of takedown requests, so it feels it has to go to enormous lengths to be cautious.
I do wish they could at least correlate removals against WHOIS records, though. My heart sinks every time this happens. It'll definitely become a worse and worse problem as time goes on.
*Sigh* One more reason to loathe %*&^$*ing domain squatting. (Sorry, "domain parking". Ugh.)
Reply
Poster: Nemo_bis | Date: Apr 2, 2015 11:32pm
Subject: Re: Customary syntax and liability
As for legal protection, you're very right. I wonder if https://www.manilaprinciples.org/ would help.
Reply
Poster: CogDogBlog | Date: Jun 28, 2016 11:55am
So if robots.txt is not found at all, the IA wipes it out? Hardly archival, to my simple mind. The full story: http://cogdogblog.com/2016/06/dont-archive/