# LSFEventScraper

LSFEventScraper is a multithreaded Python scraper that extracts all events of the current semester from HTW-Berlin.de.

## How it works

Currently, HTW-Berlin.de has a semester-overview page from which the crawler can reach a page for every day of the current semester. These day-overview pages are the source from which every event of the semester can be extracted. The module therefore works in five steps:

- fetching the semester overview
- extracting all day-overview URLs
- fetching all day overviews
- extracting every event
- saving the events to a database
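The following is a minimal, single-threaded sketch of that pipeline, not the module's actual implementation: the overview URL and the CSS selectors are placeholders, and the real scraper distributes the fetches across threads.

```python
# Sketch of the scraping pipeline. URL and selectors are placeholders,
# not the ones LSFEventScraper actually uses.
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

SEMESTER_OVERVIEW_URL = "https://lsf.htw-berlin.de/..."  # placeholder


def fetch_soup(url):
    """Download a page and parse it with BeautifulSoup."""
    with urlopen(url) as response:
        return BeautifulSoup(response.read(), "html.parser")


# 1. Fetch the semester overview and extract one URL per day.
overview = fetch_soup(SEMESTER_OVERVIEW_URL)
day_urls = [
    urljoin(SEMESTER_OVERVIEW_URL, a["href"])
    for a in overview.select("a.day-link")  # placeholder selector
]

# 2. Fetch every day overview and extract its events.
events = []
for url in day_urls:
    day_page = fetch_soup(url)
    events.extend(row.get_text(strip=True) for row in day_page.select("tr.event"))

# 3. The real module would now save `events` to the database.
```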

## Database Configuration

LSFEventScraper can store the events in either MySQL or PostgreSQL. If you want to use it with the corresponding HTWRoomFinder, you need to use PostgreSQL. First, create the appropriate tables. For PostgreSQL:

```
psql -h <your host> <db-name> <user> < RoomDBInit_PSQL.db
```

And for MySQL:

```
mysql -h <your host> -u <user> -p <db-name> < RoomDBInit_MYSQL.db
```

LSFEventScraper needs to connect to your database, so you also have to provide your credentials: add them to db_credentials_PSQL.json if you use PostgreSQL, or to db_credentials_MYSQL.json if you use MySQL.
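The exact keys in these JSON files depend on the module's implementation. As an illustration only, loading a PostgreSQL credentials file and opening a connection might look roughly like this (all field names here are assumptions):

```python
# Illustrative only: the JSON key names below are assumptions,
# not necessarily the ones LSFEventScraper expects.
import json

import psycopg2

with open("db_credentials_PSQL.json") as f:
    creds = json.load(f)

conn = psycopg2.connect(
    host=creds["host"],
    dbname=creds["dbname"],
    user=creds["user"],
    password=creds["password"],
)
```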

## Requirements

The module requires psycopg2, beautifulsoup4, and mysql-python. You can install them with pip:

```
pip install -r requirements.txt
```

## Usage

To reduce dependencies, the whole project is built around the facade pattern. Thus the only class you need to use is LSFEventScraper, which is the interface to the whole functionality.
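The snippets below assume an already-constructed `scraper` instance. The import path and constructor shown here are assumptions based on the class name, so check the module for the exact signature:

```python
# Assumed import path and constructor; check the module for the exact signature.
from LSFEventScraper import LSFEventScraper

scraper = LSFEventScraper()
```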

There are two reasonable ways to use LSFEventScraper:

### Scenario 1: Scrape all events and store them in a database

```python
# Fetches all events from HTW-Berlin.de and stores them in memory.
scraper.scrape_events()

# Sends a TRUNCATE command to the database to delete all current rows.
scraper.db_access.reset()

# Saves all events to the database.
scraper.save_events_to_db()
```
### Scenario 2: Fetch all day overviews and store them as HTML files on disk; scrape the locally stored pages and store the events in a database later

```python
# Fetches all day overviews and stores them as HTML files in ./data_events/
scraper.crawl_day_pages_and_save_to_disk()

# ...later... After you've fetched the pages, you can scrape and store the events.

# Scrapes all locally stored pages and keeps the events in memory.
scraper.scrape_local_sites()

# Sends a TRUNCATE command to the database to delete all current rows.
scraper.db_access.reset()

# Saves all events to the database.
scraper.save_events_to_db()
```

## Test

If all requirements are installed, you can test the scraper with

```
python main.py
```