AUSCrawl is a web crawler and scraper that collects data from AUS Banner on every course, instructor, level, and attribute for every semester at AUS since 2005, and saves it in an SQLite database for querying.
Note: There is a WIP Python re-write of this project.
I created this project to practice scraping data at scale with a headless browser while also learning asynchronous code, the Sequelize ORM, and code optimization in general. I also think the dataset this project produces can help others practice data science or build applications that make use of this data.
To run this project, you will need Node.js. I recommend v14 or newer.
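If you are unsure whether your Node.js version is recent enough, a quick check along these lines may help. This is only a sketch and not part of the repository: the helper name `isSupportedNode` and the constant `MIN_MAJOR` are hypothetical, and it assumes v14 itself counts as acceptable.

```javascript
// Minimal sketch: compare the running Node.js major version against a minimum.
// MIN_MAJOR mirrors the v14 recommendation; both names here are hypothetical.
const MIN_MAJOR = 14;

function isSupportedNode(version = process.versions.node, minMajor = MIN_MAJOR) {
  // process.versions.node looks like "18.17.1"; only the major part matters here.
  const major = Number.parseInt(version.split('.')[0], 10);
  return Number.isInteger(major) && major >= minMajor;
}

if (!isSupportedNode()) {
  console.warn(
    `Node.js ${process.versions.node} detected; v${MIN_MAJOR} or newer is recommended.`
  );
}
```

Running `node --version` in a terminal gives the same information without any code.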
- Download the repository:
git clone https://github.com/DeadPackets/AUSCrawl
- Enter the project and download required libraries:
cd AUSCrawl && npm install
- Now, simply run the project:
node crawl.js
- Additionally, if you want verbose output, run the following:
VERBOSE=true node crawl.js
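For readers curious how an environment flag like `VERBOSE` typically controls output in a Node.js script, here is a minimal sketch. It is an assumption about the general pattern, not the actual code in `crawl.js`: the names `isVerbose` and `createLogger` are hypothetical, and the real script may interpret the variable differently.

```javascript
// Hypothetical sketch: gate debug logging on the VERBOSE environment variable.
// crawl.js may read the variable differently; this only illustrates the pattern.
function isVerbose(env = process.env) {
  // Environment variables are always strings, so compare against 'true'.
  return env.VERBOSE === 'true';
}

function createLogger(env = process.env, sink = console.log) {
  const verbose = isVerbose(env);
  return {
    // Always printed.
    info: (msg) => sink(`[info] ${msg}`),
    // Printed only when VERBOSE=true was set for the process.
    debug: (msg) => {
      if (verbose) sink(`[debug] ${msg}`);
    },
  };
}

const log = createLogger();
log.info('Starting crawl');
log.debug('Only printed when VERBOSE=true');
```

Passing the environment and the output function as parameters keeps the sketch easy to try out without touching the real process environment.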
The project uses the following libraries:
- Chalk is used for coloring the console output
- Sequelize is the database ORM used to save the crawled data into SQLite
- Puppeteer is the headless browser library used to browse and crawl the data from Banner
I am planning on writing a blog post soon.
Contributions are welcome! Simply fork the project, add your feature or fix, and open a pull request. I will review it ASAP.