american_literature

Crawling and processing code for the short stories at
https://americanliterature.com/short-story-library.
Details of the preprocessing and the dataset statistics are in Supplementary.pdf.

quickstart

to be written

crawling

Requires Python 3.x, requests, and beautifulsoup4.
We asked for permission to use the content via Twitter (the site's only means of contact) but did not receive a response.
The crawling code waits 1 second after scraping each story to avoid overloading the server.
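
The pacing described above can be sketched as follows. This is a minimal illustration, not the repository's scraper.py: the function names `fetch_story_text` and `crawl`, the 30-second timeout, and the choice to return the page's full visible text are all assumptions made for the example.

```python
import time
from typing import Callable, Iterable, List, Tuple


def fetch_story_text(url: str) -> str:
    """Download one page and return its visible text (hypothetical helper)."""
    # Imported here so the pacing logic below can be exercised
    # without the third-party packages installed.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.get_text(separator="\n", strip=True)


def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_story_text,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, str]]:
    """Fetch each URL in turn, pausing after every story."""
    stories = []
    for url in urls:
        stories.append((url, fetch(url)))
        # Wait after each story so the server is not hit in a tight loop.
        sleep(delay_seconds)
    return stories
```

Passing `fetch` and `sleep` as parameters is only a convenience for trying the loop offline; a real run would use the defaults.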

post processing

sentencizing

We used spaCy to split the stories into sentences.
This takes about 1 day and yields 1.148 million sentences (train/val/test = 8:1:1).
We paired each sentence with the one that follows it, giving the dataset a sentence-to-sentence prediction form.
Pairs never cross story boundaries: the last sentence of one story is not connected to the first sentence of the next.
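
The pairing rule above can be sketched in a few lines. This assumes each story has already been split into a list of sentence strings; the function name `sentence_pairs` is hypothetical and the repository's own scripts may organize this differently.

```python
from typing import Iterable, Iterator, List, Tuple


def sentence_pairs(stories: Iterable[List[str]]) -> Iterator[Tuple[str, str]]:
    """Yield (sentence, next_sentence) pairs without crossing story boundaries."""
    for sentences in stories:
        # zip stops at the shorter input, so the last sentence of a story
        # is never paired with the first sentence of the next story.
        for current, following in zip(sentences, sentences[1:]):
            yield current, following
```

A story with n sentences contributes n - 1 pairs, and a one-sentence story contributes none.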

filtering out

Sentences that are too short (<= 3 tokens) or too long (>= 70 tokens) were filtered out.
This excluded 11.98% of the sentences as outliers (train/val/test = 11.99%, 11.99%, 11.90%).
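
A minimal sketch of the length filter, using the thresholds stated above. Counting tokens by whitespace splitting is an assumption made to keep the example self-contained; the actual tokenization is described in Supplementary.pdf and may differ. The names `count_tokens`, `is_outlier`, and `filter_outliers` are hypothetical.

```python
from typing import List, Tuple

MIN_TOKENS = 3   # sentences with <= 3 tokens are dropped
MAX_TOKENS = 70  # sentences with >= 70 tokens are dropped


def count_tokens(sentence: str) -> int:
    """Stand-in tokenizer: whitespace split (an assumption, not the repo's method)."""
    return len(sentence.split())


def is_outlier(sentence: str) -> bool:
    """True if the sentence is too short or too long to keep."""
    n = count_tokens(sentence)
    return n <= MIN_TOKENS or n >= MAX_TOKENS


def filter_outliers(sentences: List[str]) -> Tuple[List[str], float]:
    """Return the kept sentences and the percentage that was excluded."""
    kept = [s for s in sentences if not is_outlier(s)]
    if not sentences:
        return kept, 0.0
    excluded_pct = 100.0 * (len(sentences) - len(kept)) / len(sentences)
    return kept, excluded_pct
```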

specs

train = 808.34 k sentences
val = 10.11 k sentences
test = 10.11 k sentences
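
The 8:1:1 train/val/test ratio mentioned in the sentencizing notes could be produced by a seeded random split along these lines. This is a sketch under assumptions: the function name `split_8_1_1`, the seed, and the choice of what counts as an item (sentences, pairs, or whole stories) are illustrative and not taken from split_randomly.py.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_8_1_1(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle a copy of the items and cut it into 80% / 10% / 10%."""
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    # Whatever remains after train and val goes to test.
    test = shuffled[n_train + n_val:]
    return train, val, test
```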

