Some downloaded files are gzip stream #259

Open
FoxAhead opened this issue Sep 17, 2023 · 4 comments · May be fixed by #262

Comments

@FoxAhead

FoxAhead commented Sep 17, 2023

I used this great tool to download the site http://web.archive.org/web/20230713110210/http://users.tpg.com.au/jpwbeest/. At first glance everything went well, but then I found that some downloaded files, regardless of extension, had been saved as gzip streams while others were fine. The result was the same on repeated downloads. About 30 of the 245 files were "corrupted".

Examples of gzipped files (the first two bytes, 1F 8B, are the gzip magic number, and the third, 08, indicates deflate compression):

[Two screenshots of the affected files]

I would like to know what causes this. Is it a bug, a peculiarity of this site, or something about the Wayback Machine as a whole? Can it be fixed?

So far I've worked around the problem with a simple Python script that scans the files in a directory: if a file looks like a gzip stream, the script decompresses it; otherwise it just copies the file to the output folder.
Thanks!
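
The script itself is not posted in the thread, so the following is only a minimal sketch of the workaround described above, not FoxAhead's actual code. The names `is_gzip` and `fix_tree` are hypothetical, and the sketch assumes the output folder lies outside the source folder. It checks the first three bytes of each file for 1F 8B 08, decompresses matches, and copies everything else unchanged.

```python
import gzip
import shutil
import zlib
from pathlib import Path

# Header described in the issue: magic number 1F 8B, then 08 (deflate).
GZIP_MAGIC = b"\x1f\x8b\x08"


def is_gzip(path: Path) -> bool:
    """Return True if the file starts with the gzip header bytes."""
    with path.open("rb") as f:
        return f.read(3) == GZIP_MAGIC


def fix_tree(src: Path, dst: Path) -> int:
    """Mirror src into dst, decompressing gzip streams along the way.

    Returns the number of files that were decompressed.
    Assumption: dst is not located inside src.
    """
    fixed = 0
    for path in src.rglob("*"):
        if not path.is_file():
            continue
        out = dst / path.relative_to(src)
        out.parent.mkdir(parents=True, exist_ok=True)
        if is_gzip(path):
            try:
                with gzip.open(path, "rb") as fin, out.open("wb") as fout:
                    shutil.copyfileobj(fin, fout)
                fixed += 1
                continue
            except (OSError, EOFError, zlib.error):
                # Header matched by chance or the stream is damaged:
                # fall through and copy the file unchanged.
                pass
        shutil.copy2(path, out)
    return fixed
```

A call such as `fix_tree(Path("websites/example"), Path("fixed/example"))` (both paths are placeholders) would write a cleaned copy and leave the original download untouched. One caveat: a file that is legitimately a `.gz` archive on the original site would be decompressed as well, so it may be worth skipping files with that extension.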

@lihaohong6

The Wayback Machine might have changed how its API works. I'm downloading webpages archived in the past two months, and all of them end up as gzip files. I suspect the change happened sometime during those two months.

@grandpa1946

I am also having this issue.

Forage linked a pull request on Oct 11, 2023 that will close this issue
@Na-x4

Na-x4 commented Nov 2, 2023

I think reverting #34 will solve this problem.

@tudoujunha

> I think reverting #34 will solve this problem.

You are right. This will solve the problem: #267 (comment)
I recommend this: #280 (comment)
