Download fails #267
Comments
Same here - guessing that wayback is breaking the connection after a small handful of requests. Mine worked for the first 19 pages, then it began to fail.
This fix does work. It's a bit slow now of course, but the files get downloaded.
archive.org has implemented rate limiting, which is why the delay fixes things. It is unfortunate, and probably breaks multithreaded downloading as well, but it is a free resource after all. https://archive.org/details/toomanyrequests_20191110
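For reference, the delay workaround amounts to sleeping between requests so the rate limiter never trips. A minimal sketch of the idea, not the gem's actual code; the `download_one` helper, the file list, and the 4-second interval are illustrative assumptions:

```ruby
# Sketch of the delay workaround: one request, then a pause, repeated.
# download_one and the 4-second interval are illustrative assumptions.
require 'open-uri'

def download_one(file_url, file_timestamp, dest)
  URI("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}").open do |remote|
    File.binwrite(dest, remote.read)
  end
end

files = [
  { file_url: "example.com/", file_timestamp: "20191110000000", file_id: "index.html" },
]

files.each do |f|
  download_one(f[:file_url], f[:file_timestamp], f[:file_id])
  sleep 4 # roughly 15 requests/minute, under the observed limit
end
```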
Can we get this fix approved and a new release created?
As far as I can tell, archive.org is limiting the number of connections you can make in a short period of time. As mentioned in #264, browsers and wget (which use persistent connections) are not affected by this issue. It should be fixed by using a single persistent connection for all downloads instead of creating a new connection for each download:

```diff
diff --git a/lib/wayback_machine_downloader.rb b/lib/wayback_machine_downloader.rb
index 730714a..199b9dd 100644
--- a/lib/wayback_machine_downloader.rb
+++ b/lib/wayback_machine_downloader.rb
@@ -206,11 +206,15 @@ class WaybackMachineDownloader
     @processed_file_count = 0
     @threads_count = 1 unless @threads_count != 0
     @threads_count.times do
+      http = Net::HTTP.new("web.archive.org", 443)
+      http.use_ssl = true
+      http.start()
       threads << Thread.new do
         until file_queue.empty?
           file_remote_info = file_queue.pop(true) rescue nil
-          download_file(file_remote_info) if file_remote_info
+          download_file(file_remote_info, http) if file_remote_info
         end
+        http.finish()
       end
     end
@@ -243,7 +247,7 @@ class WaybackMachineDownloader
     end
   end
 
-  def download_file file_remote_info
+  def download_file (file_remote_info, http)
     current_encoding = "".encoding
     file_url = file_remote_info[:file_url].encode(current_encoding)
     file_id = file_remote_info[:file_id]
@@ -268,8 +272,8 @@ class WaybackMachineDownloader
         structure_dir_path dir_path
         open(file_path, "wb") do |file|
           begin
-            URI("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}").open("Accept-Encoding" => "plain") do |uri|
-              file.write(uri.read)
+            http.get(URI("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}")) do |body|
+              file.write(body)
             end
           rescue OpenURI::HTTPError => e
             puts "#{file_url} # #{e}"
```
This is an elegant (and working) solution. Nice one!
Connections are limited to 15/minute; more will lead to a "Connection refused" error. Take @ee3e's advice and just use a persistent connection: hartator#267 (comment)
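If reusing one connection is not an option, that 15-per-minute figure also suggests a client-side throttle. A sketch assuming that limit (the figure comes from this thread and may change on archive.org's side; the URL list is an illustrative assumption):

```ruby
# Windowed throttle under the assumed 15-connections-per-minute limit:
# remember recent connection times and wait until the oldest one
# leaves the 60-second window before opening another connection.
require 'net/http'

LIMIT  = 15    # connections per window, as reported in this thread
WINDOW = 60.0  # seconds

recent = [] # monotonic timestamps of recent connection attempts

urls = Array.new(20) { |i| URI("https://web.archive.org/web/20191110000000id_/http://example.com/page#{i}") }

urls.each do |uri|
  now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  recent.reject! { |t| now - t > WINDOW }
  if recent.size >= LIMIT
    sleep(WINDOW - (now - recent.first)) # wait for the oldest slot to expire
    recent.shift
  end
  recent << Process.clock_gettime(Process::CLOCK_MONOTONIC)
  puts "#{uri} -> #{Net::HTTP.get_response(uri).code}" # new connection each time
end
```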
Thank you @ee3e! Similarly, this should be implemented for ... and in ... (Please check my code, but it worked for me to download a very large archive that I've been struggling with for a bit.)
Can't download anything lately. Here's an example:

...

What I get as a result is a bunch of empty folders. Does anyone have a solution?