
Error 429 + Scraper gives up #19

Open · avelican opened this issue Nov 10, 2023 · 2 comments

avelican commented Nov 10, 2023

Many moons ago, Internet Archive added some rate limiting that seems to also affect Wayback Machine.

(See the discussion on a similar project: buren/wayback_archiver#32.)

The scraper scrapes too fast and gets its IP banned for 5 minutes by the Wayback Machine.

As a result, all the remaining URLs in the pipeline fail repeatedly, Scrapy gives up on each of them, and the spider closes as if it had finished successfully:

...
2023-11-09 22:09:57 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://web.archive.org/cdx/search/cdx?url=www.example.com/blog/stuff&output=json&fl=timestamp,original,statuscode,digest> (failed 3 times): 429 Unknown Status
2023-11-09 22:09:57 [scrapy.core.engine] INFO: Closing spider (finished)

I see two things to fix here:

  1. Add a global rate limit (I don't think the concurrency flag covers this?).
    1.b. If we get a 429, increase the delay? (Ideally this should not occur, since the limit appears to be constant, although https://web.archive.org/429.html suggests the error can also happen randomly when Wayback is getting a lot of traffic from other people.)
    Also, a 429 seems to mean the IP has been banned for 5 minutes, so we should simply pause the scraper for that long (making more requests during that window may extend the block). See the middleware sketch below.
  2. (Possibly unnecessary if the previous points are handled.) Increase the retry limit from 3 to something much higher, so requests survive a temporary ban if we retry with a backoff.
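A minimal sketch of what point 1 could look like as a Scrapy downloader middleware, assuming a 5-minute ban window (the class name, cooldown value, and wiring are mine, not part of this project):

import time

class TooManyRequestsPauseMiddleware:
    """Pause the whole crawl when Wayback answers 429, then re-queue the request."""

    COOLDOWN_SECONDS = 300  # assumed length of the Wayback IP ban

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        if response.status != 429:
            return response
        spider.logger.warning("Got 429, pausing crawl for %s s", self.COOLDOWN_SECONDS)
        self.crawler.engine.pause()
        time.sleep(self.COOLDOWN_SECONDS)  # blocks the reactor, so nothing else is sent meanwhile
        self.crawler.engine.unpause()
        retry_request = request.copy()
        retry_request.dont_filter = True  # don't let the dupefilter drop the re-queued request
        return retry_request

It would need to be registered in DOWNLOADER_MIDDLEWARES, and 429 probably has to be removed from RETRY_HTTP_CODES so the built-in RetryMiddleware doesn't burn its 3 attempts during the ban.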

TODO:

  1. Find out exactly what the rate limit is: it may be 5 per minute or 15 per minute (a 12 s or 4 s delay, respectively).
    They seem to have changed it several times, and I'm not sure there are official numbers.
    https://archive.org/details/toomanyrequests_20191110
    This page says it's 15. It only mentions submitting URLs, but it appears to cover retrievals too.
  2. Find out if this project already does rate limiting. Edit: sort of, but it's not sufficient for this use case (e.g. no 5-minute backoff on 429, and AutoThrottle does not guarantee fewer than X requests per minute).

The project seems to be using Scrapy's AutoThrottle, so the fix may be as simple as updating the start delay and default concurrency in __main__.py:

'AUTOTHROTTLE_START_DELAY': 4, # aiming for 15 per minute

and

parser.add_argument('-c', '--concurrency', default=1.0, help=(

This doesn't seem to be sufficient to cap the rate at 15/minute, though: I am mostly seeing >15 requests/minute with these settings (as high as 29 at times). But Wayback did not complain, so the actual limit may be higher than that.

More work needed. May report back later.

Edit: the AutoThrottle docs say AUTOTHROTTLE_TARGET_CONCURRENCY represents the average, not the maximum, which means that if Wayback has a hard limit of X req/s, setting X as the target would by definition exceed that limit roughly half the time.
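If the goal is a hard cap rather than an average, one option (a sketch using standard Scrapy settings, not something the project currently does) is to disable AutoThrottle and rely on a fixed delay:

# Enforce a fixed ~15 requests/minute ceiling instead of an average.
HARD_CAP_SETTINGS = {
    'AUTOTHROTTLE_ENABLED': False,      # the target concurrency is only an average
    'CONCURRENT_REQUESTS': 1,           # one request in flight at a time
    'DOWNLOAD_DELAY': 4,                # 4 s between requests ~= 15 requests/minute
    'RANDOMIZE_DOWNLOAD_DELAY': False,  # otherwise Scrapy waits 0.5x-1.5x the delay and the cap can be exceeded
}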

avelican (author) commented Nov 10, 2023

Update: I added download_delay = 4 to mirror_spider.py. This seems to make AutoThrottle unnecessary (?), so I disabled it.
But now I get very low pages/minute: an average of 6/min and sometimes only 2-3 per minute! (Edit: just got 1/min...)
I assume this is because Wayback is sometimes very slow? I haven't measured. Still, 30 s/request seems very fishy.

Edit: the total request count is about double the number of saved pages, which brings the average to >12 requests/minute, close to the target of 15.
Not sure where all those extra requests are coming from; I'll need to run this in debug mode (see the sketch below).
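One way to track down the extra requests (just a sketch using Scrapy's standard stats and logging, nothing project-specific) is to raise the log level and compare the counters in the end-of-crawl stats dump:

DEBUG_SETTINGS = {
    'LOG_LEVEL': 'DEBUG',  # logs "Crawled (...) <GET ...>" for every request that goes out
}

# Stats keys worth comparing in the "Dumping Scrapy stats" block:
#   downloader/request_count               - every request actually sent
#   downloader/response_status_count/429   - how often Wayback throttled us
#   retry/count                            - requests re-issued by RetryMiddleware
#   item_scraped_count                     - pages saved (assuming the spider yields items)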

@JamesEBall commented

I am getting the same issue. I need to scrape 150k pages to rebuild a website, but I am constantly hitting the rate limit on the Archive's servers.
