Many moons ago, Internet Archive added some rate limiting that seems to also affect the Wayback Machine. (See the discussion on a similar project here: buren/wayback_archiver#32.)
The scraper scrapes too fast, and gets IP banned for 5 minutes by Wayback Machine.
As a result, all the remaining URLs in the pipeline fail repeatedly; Scrapy gives up on all of them and says "we're done!"
...
2023-11-09 22:09:57 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://web.archive.org/cdx/search/cdx?url=www.example.com/blog/stuff&output=json&fl=timestamp,original,statuscode,digest> (failed 3 times): 429 Unknown Status
2023-11-09 22:09:57 [scrapy.core.engine] INFO: Closing spider (finished)
I see a few issues here:

1. Add a global rate limit. (I don't think the concurrency flag covers this?)
   - 1.b. If we get a 429, increase the delay? (Ideally this should not occur, as the limit appears to be constant? Although this page https://web.archive.org/429.html suggests that the error can occur randomly if Wayback is getting a lot of traffic from other people.)
2. Also, if we get a 429, that seems to mean the IP has been banned for 5 minutes, so we should just pause the scraper for that long? (Making any requests during this time may possibly extend the block?) See the sketch after this list.
3. (Unnecessary if the previous points are handled?) Increase the retry limit from 3 to something much higher? Again, this is probably not needed if we approach scraping with a backoff strategy.
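To illustrate points 1.b/2, here is a rough, untested sketch of a Scrapy downloader middleware that pauses the whole crawl on a 429 and then retries the request. The class name, the `WAYBACK_429_PAUSE` setting, and the 300-second figure are all assumptions based on the apparent 5-minute ban; nothing like this ships in the project today.

```python
# Hypothetical middleware (not in this repo): sleep out the 429 ban instead of
# burning the retry budget on a banned IP.
import time

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class TooManyRequestsPauseMiddleware(RetryMiddleware):
    """Retry middleware that pauses the crawl when Wayback returns 429."""

    def __init__(self, crawler):
        super().__init__(crawler.settings)
        self.crawler = crawler
        # 300 s is an assumption (apparent 5-minute ban); setting name is hypothetical.
        self.pause_seconds = crawler.settings.getint("WAYBACK_429_PAUSE", 300)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        if request.meta.get("dont_retry", False):
            return response
        if response.status == 429:
            # Stop scheduling new downloads, wait out the ban, then retry this request.
            self.crawler.engine.pause()
            time.sleep(self.pause_seconds)  # deliberately blocks everything for the ban window
            self.crawler.engine.unpause()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response


# Enabling it would look roughly like this (replacing the stock RetryMiddleware);
# the "myproject.middlewares" path is a placeholder:
# DOWNLOADER_MIDDLEWARES = {
#     "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
#     "myproject.middlewares.TooManyRequestsPauseMiddleware": 550,
# }
```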
TODO:
- Find out exactly what the rate limit is: may be 5 per minute, or may be 15 per minute? (12 s or 4 s delay respectively.)
  - They seem to have changed it several times. Not sure if there are official numbers. https://archive.org/details/toomanyrequests_20191110
  - That page says it's 15. It only mentions submitting URLs, but it appears to cover retrievals too.
- Find out if this project already does rate limiting. Edit: sorta, but not entirely sufficient for this use case? (e.g. no 5-minute backoff on 429, AutoThrottle does not guarantee <X/minute, etc.)
  - It seems to be using Scrapy's AutoThrottle, so the fix may be as simple as updating the start delay and default concurrency in __main__.py:
    'AUTOTHROTTLE_START_DELAY': 4, # aiming for 15 per minute
  - This doesn't seem to be sufficient to limit to 15/minute though, as I am getting mostly >15/min with these settings (and as high as 29 sometimes). But Wayback did not complain, so it seems the limit is higher than that.
- More work needed. May report back later.
- Edit: the AutoThrottle docs say AUTOTHROTTLE_TARGET_CONCURRENCY represents the average, not the maximum. So if Wayback has a hard limit of X req/sec, setting X as the target would by definition lead to exceeding that limit about 50% of the time. (See the fixed-delay sketch below.)
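To make the average-vs-maximum point concrete, here is a hedged sketch of a fixed-delay ("hard ceiling") configuration. The values are guesses based on the 15/minute figure above, not confirmed Wayback limits, and the dict name is only for illustration.

```python
# Hypothetical Scrapy settings: cap at roughly 15 requests/minute by construction
# instead of asking AutoThrottle to average around that rate.
HARD_CAP_SETTINGS = {
    "AUTOTHROTTLE_ENABLED": False,        # AutoThrottle targets an average, not a maximum
    "DOWNLOAD_DELAY": 4,                  # 60 s / 4 s -> at most ~15 requests/minute
    "RANDOMIZE_DOWNLOAD_DELAY": False,    # default 0.5x-1.5x jitter can briefly exceed the cap
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # be conservative: one in-flight request to web.archive.org
    "RETRY_TIMES": 5,                     # point 3 above; only useful together with backoff
}
```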
Update: added download_delay = 4 to mirror_spider.py. This seems to make auto_throttle unnecessary (?), so I disabled it.
But now I get very low pages/minute: avg 6/min and sometimes only 2-3 per minute! (Edit: just got 1/min...)
I thought this was due to Wayback being very slow sometimes? I haven't measured. Still, 30 s/request seems very fishy.
Edit: the total request count is about double the number of saved pages, which brings the average to >12 requests/min, close to the target of 15/min.
Not sure where all those extra requests are coming from; will need to run this in debug mode.
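For reference, a minimal sketch of that change (only download_delay = 4 and the file name mirror_spider.py come from the comment above; the class name, settings, and stats dump are assumptions), including a stats printout at close time to help track down the extra requests:

```python
import scrapy


class MirrorSpider(scrapy.Spider):
    """Sketch only -- the real mirror_spider.py has the actual crawl logic."""

    name = "mirror"
    # Per-spider delay: ~4 s between requests, i.e. roughly 15 requests/minute.
    download_delay = 4
    custom_settings = {
        "AUTOTHROTTLE_ENABLED": False,        # fixed delay instead of AutoThrottle
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # keep requests to Wayback serialized
    }

    def closed(self, reason):
        # Compare total requests against saved pages to find the "extra" requests
        # (possibly retries, redirects, or CDX lookups vs. snapshot fetches).
        stats = self.crawler.stats.get_stats()
        self.logger.info(
            "requests=%s scraped_items=%s retries=%s",
            stats.get("downloader/request_count"),
            stats.get("item_scraped_count"),
            stats.get("retry/count"),
        )
```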