Apr 08 2023 TIL: API for saving webpages in the wayback machine

The Internet Archive's Wayback machine is a way to save a copy of a public webpage at a point in time. There is an API to submit URLs for them to crawl, but the documentation I found is fragmented and some is out of date. Here is what I have got working:

First, you need an account with archive.org. Some of the documentation you can find via Google contradicts this, but I've tried with and without credentials, and unauthenticated doesn't work for me. Signing up is free.

Once you have an account, log in. You need to generate S3 credentials (as in AWS? It's unclear to me). You do that at https://archive.org/account/s3.php. There's two parts to the credentials, an access key and secret key. Both are a medium-ish length random string. These will be used for authentication, via a request header later.

Now that you have credentials, you can submit save requests for webpages on the wild internet. You do so by submitting an HTTP request to https://web.archive.org/save/. The full request I used looks like this in cURL:

curl -X POST -H "Accept: application/json" -H "Authorization: LOW <access key>:<secret key>" -d"url=https://example.org/url-to-save&capture_all=1&delay_wb_availability=1&skip_first_archive=1" "https://web.archive.org/save"

Explanation

curl -X POST [...] "https://web.archive.org/save": we are submitting an HTTP POST request to archive.org via the command line, using cURL.

-H "Accept: application/json" -H "Authorization: LOW <access_key>:<secret_key>": Headers to put in the request. The Authorization header is required, as far as I can tell, despite some documentation you can google to the contrary. Replace <access_key> and <secret_key> with your credentials. Note the colon between them. Don't include the literal < or > characters in your request. Specifying a JSON response is not required.

-d"url=https://example.org/url-to-save&capture_all=1&delay_wb_availability=1&skip_first_archive=1": the payload for the request. url is well, the URL you want to save. capture_all=1 tells the crawler you want to capture all HTTP responses, even 4XX or 5XX errors. Without this, only 200 OK responses are saved. delay_wb_availability tells Wayback that you don't need the page saved immediately. If the crawler is overloaded, it will go onto the end of the queue for indexing when load is lower. skip_first_archive tells Wayback to skip checking if this is the first time the URL has been archived. They say it increases archiving speed, so I went with it.

How I got there

This stack exchange answer indicates that the Save Page Now API is how you can programmatically submit URLs for saving. I kind of had to google around for a while, and eventually stumbled on the info that "Save Page Now API" is the magic words. There are other IA API docs, but don't mention anything about Save Page Now that I can find.

That stackexchange answer links to this documentation in Google Docs (curious to me that it's not on an archive.org URL, or the existing IA API docs). That page links to more Google Docs with the actual API specification.

Further research

The page to manually submit a save request to Wayback has an option to "Save also in my web archive". Your user account has a (public!) list of pages you have personally saved.

It seems the form on that page POSTs to the same web.archive.org URL. As best I can tell, after you submit the form, the webpage waits for the indexing request to begin, receives some job ID from Wayback, and then submits another request using that ID and your account info. I have been unable to get this working with the Save Page Now API via any combination of the name field corresponding to that checkbox.

If anyone knows if saving to My Web Archives is possible via API, please get in touch!