Download fails #267
Comments
Same here - guessing that wayback is breaking the connection after a small handful of requests. Mine worked for the first 19 pages, then it began to fail.
This fix does work. It's a bit slow now of course, but the files get downloaded.
archive.org has implemented rate limiting, which is why the delay fixes things. It is unfortunate, and probably breaks multithreaded downloading as well, but it is a free resource after all. https://archive.org/details/toomanyrequests_20191110
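For reference, the delay workaround amounts to sleeping between requests so the rate limiter never trips. A minimal sketch of the idea, not the gem's actual code; the `download_one` helper, the file list, and the 4-second interval are illustrative assumptions:

```ruby
# Sketch of the delay workaround: one request, then a pause, repeated.
# download_one and the 4-second interval are illustrative assumptions.
require 'open-uri'

def download_one(file_url, file_timestamp, dest)
  URI("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}").open do |remote|
    File.binwrite(dest, remote.read)
  end
end

files = [
  { file_url: "example.com/", file_timestamp: "20191110000000", file_id: "index.html" },
]

files.each do |f|
  download_one(f[:file_url], f[:file_timestamp], f[:file_id])
  sleep 4 # roughly 15 requests/minute, under the observed limit
end
```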
Can we get this fix approved and a new release created?
As far as I can tell, archive.org is limiting the number of connections you can make in a short period of time. As mentioned in #264, browsers and wget (which use persistent connections) are not affected by this issue. It should be fixed by using a single persistent connection for all downloads instead of creating a new connection for each download:

```diff
diff --git a/lib/wayback_machine_downloader.rb b/lib/wayback_machine_downloader.rb
index 730714a..199b9dd 100644
--- a/lib/wayback_machine_downloader.rb
+++ b/lib/wayback_machine_downloader.rb
@@ -206,11 +206,15 @@ class WaybackMachineDownloader
     @processed_file_count = 0
     @threads_count = 1 unless @threads_count != 0
     @threads_count.times do
+      http = Net::HTTP.new("web.archive.org", 443)
+      http.use_ssl = true
+      http.start()
       threads << Thread.new do
         until file_queue.empty?
           file_remote_info = file_queue.pop(true) rescue nil
-          download_file(file_remote_info) if file_remote_info
+          download_file(file_remote_info, http) if file_remote_info
         end
+        http.finish()
       end
     end
@@ -243,7 +247,7 @@ class WaybackMachineDownloader
     end
   end
 
-  def download_file file_remote_info
+  def download_file (file_remote_info, http)
     current_encoding = "".encoding
     file_url = file_remote_info[:file_url].encode(current_encoding)
     file_id = file_remote_info[:file_id]
@@ -268,8 +272,8 @@ class WaybackMachineDownloader
         structure_dir_path dir_path
         open(file_path, "wb") do |file|
           begin
-            URI("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}").open("Accept-Encoding" => "plain") do |uri|
-              file.write(uri.read)
+            http.get(URI("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}")) do |body|
+              file.write(body)
             end
           rescue OpenURI::HTTPError => e
             puts "#{file_url} # #{e}"
```
This is an elegant (and working) solution. Nice one!
Connections are limited to 15/minute; more will lead to a "Connection refused" error. Take @ee3e's advice and just use a persistent connection: hartator#267 (comment)
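If reusing one connection is not an option, that 15-per-minute figure also suggests a client-side throttle. A sketch assuming that limit (the figure comes from this thread and may change on archive.org's side; the URL list is an illustrative assumption):

```ruby
# Windowed throttle under the assumed 15-connections-per-minute limit:
# remember recent connection times and wait until the oldest one
# leaves the 60-second window before opening another connection.
require 'net/http'

LIMIT  = 15    # connections per window, as reported in this thread
WINDOW = 60.0  # seconds

recent = [] # monotonic timestamps of recent connection attempts

urls = Array.new(20) { |i| URI("https://web.archive.org/web/20191110000000id_/http://example.com/page#{i}") }

urls.each do |uri|
  now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  recent.reject! { |t| now - t > WINDOW }
  if recent.size >= LIMIT
    sleep(WINDOW - (now - recent.first)) # wait for the oldest slot to expire
    recent.shift
  end
  recent << Process.clock_gettime(Process::CLOCK_MONOTONIC)
  puts "#{uri} -> #{Net::HTTP.get_response(uri).code}" # new connection each time
end
```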
Thank you @ee3e! Similarly, this should be implemented for ... and in ... (Please check my code, but it worked for me to download a very large archive that I've been struggling with for a bit.)
Can't download anything lately. Here's an example:

...

What I get as a result is a bunch of empty folders. Does anyone have a solution?