
DISCLAIMER: I submitted the project in the first attempt itself. Because someone had given a wrong commit ID for a project I had to review, I did not review one of the projects assigned to me. That is why I have made some improvements (added Prefect and revised the README) and submitted again. Please take the time to review my project and give any suggestions to improve my understanding. Also note that, since I added this README after filling in the form for project attempt 2, the commit ID in the Excel sheet will be different from the current commit ID.

Excel sheet commit ID: f812320

Current commit ID: whatever commit ID you can see


de_zoomcamp_2023_capstone

For the final capstone project we use the Citi Bike sharing data to derive some insights and explain them using visualizations. The tools covered in the course have been used to help with the process.

Problem Description

An intro to the dataset:


Citi Bike is the nation's largest bike share program, with 27,000 bikes and over 1,700 stations across Manhattan, Brooklyn, Queens, the Bronx, Jersey City and Hoboken. This dataset covers the Citi Bike bike-sharing service in New York.

The official site is here

The Dataset site is here

Some important info about the dataset from the site:

  • This data has been processed to remove
    1. trips that are taken by staff as they service and inspect the system.
    2. trips that are taken to/from any of our “test” stations (which we were using more in June and July 2013).
    3. any trips that were below 60 seconds in length (potentially false starts or users trying to re-dock a bike to ensure it's secure).

The questions we are trying to answer are:

  1. What is the total number of trips taken by each type of customer (member and casual)?
  2. What is the average trip duration for each type of bike?
  3. What is the average trip duration for each type of member?
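
For reference, these three questions map to simple aggregations over the trips table. The sketch below is a minimal example, assuming a hypothetical table name bike_trips, the standard post-2021 Citi Bike column names (member_casual, rideable_type, started_at, ended_at), SQLAlchemy 1.4+, and the Postgres credentials used later in this README; the actual table and column names in this repo may differ.

    # Minimal sketch: answering the three questions with plain SQL against Postgres.
    # Table name, database name and column names are assumptions, not taken from this repo.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://root:root@localhost:5432/bike_data")

    queries = {
        "total_trips_per_customer_type": """
            SELECT member_casual, COUNT(*) AS total_trips
            FROM bike_trips
            GROUP BY member_casual;
        """,
        "avg_trip_duration_per_bike_type": """
            SELECT rideable_type,
                   AVG(EXTRACT(EPOCH FROM (ended_at - started_at)) / 60.0) AS avg_trip_minutes
            FROM bike_trips
            GROUP BY rideable_type;
        """,
        "avg_trip_duration_per_customer_type": """
            SELECT member_casual,
                   AVG(EXTRACT(EPOCH FROM (ended_at - started_at)) / 60.0) AS avg_trip_minutes
            FROM bike_trips
            GROUP BY member_casual;
        """,
    }

    with engine.connect() as conn:
        for name, sql in queries.items():
            print(name)
            for row in conn.execute(text(sql)):
                print(dict(row._mapping))

Metabase runs equivalent aggregations through its GUI; the SQL is shown here only to make the questions concrete.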

Data Dictionary:


  • Ride ID = unique ID for each ride
  • Rideable type = type of bike used for the ride: classic, docked, or electric
  • Started at and Ended at = timestamps at which the ride starts and ends
  • Start station name = name of the starting station
  • Start station ID = ID of the starting station
  • End station name = name of the ending station
  • End station ID = ID of the ending station
  • Start latitude & Start longitude = lat and long of the starting point
  • End latitude & End longitude = lat and long of the ending point
  • Member or casual ride = whether the rider is a member (has a subscription) or a casual rider (does not have a subscription)
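
The data dictionary above corresponds to the columns of the monthly trip-data CSVs. Below is a hedged mapping to the column names Citi Bike has published since 2021, together with pandas dtypes that can be reused when loading the files; double-check the names against the file you actually download.

    # Hedged mapping of the data dictionary to CSV column names and pandas dtypes.
    # Column names follow the post-2021 Citi Bike trip-data layout; verify against your download.
    TRIP_DTYPES = {
        "ride_id": "string",             # Ride ID
        "rideable_type": "category",     # Rideable type (classic / docked / electric)
        "start_station_name": "string",  # Start station name
        "start_station_id": "string",    # Start station ID
        "end_station_name": "string",    # End station name
        "end_station_id": "string",      # End station ID
        "start_lat": "float64",          # Start latitude
        "start_lng": "float64",          # Start longitude
        "end_lat": "float64",            # End latitude
        "end_lng": "float64",            # End longitude
        "member_casual": "category",     # Member or casual rider
    }
    PARSE_DATES = ["started_at", "ended_at"]  # Started at / Ended at timestamps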

Description of files and folders:

  • ingest_bike_data.py - contains all the code used to transform the data and save it in the required format (a minimal sketch of its core logic is shown after this list)
  • docker-compose.yml - spins up the containers with the Postgres and pgAdmin images together, binding them into a common network
  • Dockerfile - copies the ingest script into the container created in the step above, installs the required packages, and runs the Python script
  • README.md - (the file you are reading) gives a brief intro to the project and contains the instructions on how to reproduce it
  • prefect_flows - the folder containing the Prefect flows used to execute the pipeline
  • dbt - files related to dbt, which helps with analytics engineering; this is currently not implemented, as I am facing issues with setting up the source and querying
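
The actual ingest script lives in the repo; as a rough guide, its core does something like the sketch below (read a monthly CSV, apply light transformations, and load it into Postgres). The file path, table name and connection string here are assumptions, not the repo's real values.

    # Minimal sketch of a Citi Bike ingest script (not the actual ingest_bike_data.py).
    # The CSV path, table name and connection string are assumptions.
    import pandas as pd
    from sqlalchemy import create_engine

    PG_URI = "postgresql://root:root@localhost:5432/bike_data"   # creds / db name assumed
    CSV_PATH = "path/to/monthly-citibike-tripdata.csv"           # local path or downloaded file

    def ingest(csv_path: str, table_name: str = "bike_trips") -> None:
        engine = create_engine(PG_URI)
        # Read in chunks so a large monthly file fits in memory.
        chunks = pd.read_csv(csv_path, parse_dates=["started_at", "ended_at"], chunksize=100_000)
        for i, chunk in enumerate(chunks):
            # Example transformation: trip duration in minutes.
            chunk["trip_duration_min"] = (
                (chunk["ended_at"] - chunk["started_at"]).dt.total_seconds() / 60.0
            )
            chunk.to_sql(table_name, engine, if_exists="replace" if i == 0 else "append", index=False)
            print(f"loaded chunk {i} ({len(chunk)} rows)")

    if __name__ == "__main__":
        ingest(CSV_PATH)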

Cloud

The project has been developed without using any cloud technologies. The whole project runs locally, and the steps to recreate it are described in the Reproducibility section.

Data ingestion (batch)

The data is generated monthly, and from a business perspective it makes more sense to ingest it monthly rather than weekly or daily, so we use batch processing to ingest the data. For the time being, once Postgres and pgAdmin are set up, the docker commands to ingest the data need to be run manually. Prefect is used to create the flow that ingests the data, run locally; a sketch of the flow structure is shown below.
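
The flow in prefect_flows follows roughly this shape (a hedged sketch using Prefect 2.x @task/@flow decorators, not the repo's actual code; task names, parameters and the connection string are assumptions):

    # Hedged sketch of a Prefect 2.x batch-ingestion flow (not the repo's actual flow).
    import pandas as pd
    from prefect import flow, task
    from sqlalchemy import create_engine

    @task(retries=3, retry_delay_seconds=30)
    def extract(csv_path: str) -> pd.DataFrame:
        # Read one monthly Citi Bike file.
        return pd.read_csv(csv_path, parse_dates=["started_at", "ended_at"])

    @task
    def transform(df: pd.DataFrame) -> pd.DataFrame:
        # Drop obviously bad rows and add a duration column.
        df = df[df["ended_at"] > df["started_at"]].copy()
        df["trip_duration_min"] = (df["ended_at"] - df["started_at"]).dt.total_seconds() / 60.0
        return df

    @task
    def load(df: pd.DataFrame, table_name: str = "bike_trips") -> None:
        engine = create_engine("postgresql://root:root@localhost:5432/bike_data")  # assumed
        df.to_sql(table_name, engine, if_exists="append", index=False)

    @flow(name="main_flow")
    def main_flow(csv_path: str = "path/to/monthly-citibike-tripdata.csv"):
        df = extract(csv_path)
        load(transform(df))

    if __name__ == "__main__":
        main_flow()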

Data warehouse

As we are not using GCP for this project, we treat the Postgres database as our warehouse; it holds the data that will be used to gain insights through the Metabase tool.

Dashboard

We use Metabase as our data visualization tool, with Postgres as its data source. As the whole project runs on a local machine, screenshots and GIFs of the dashboard are provided below in The Dashboard section.

Reproducibility

  1. Create a new folder for cloning the repo into.
  2. Clone the project using the command git clone <url-to-the-repo>. If the repo's contents are copied into your new folder, this step was successful.
  3. You can skip this step if you are running the project for the first time. Otherwise, as a sanity check, run sudo rm -rf bike_postgres_data to remove the volume data left over from a previous run.
  4. Check whether Docker is already installed on your system by running docker --version. If not, use the Docker desktop setup link to install Docker Desktop for your OS.
  1. Next, run Postgres + pgAdmin, so you can access the database for quick queries, using the command docker compose up -d.
  2. Access the pgAdmin URL to check whether pgAdmin is working properly. Log in with the creds [ username - [email protected] and password - root ].
    1. Once you are logged in to Postgres through pgAdmin, create a server
      1. with a server name of your choice
      2. hostname = name of the Postgres DB you created while running docker compose up -d
      3. username = root
      4. password = root
      5. Save the config. If the config info is correct, the server of your choice is created without any error.
  3. NOTE: By this stage Postgres should be up and reachable through the pgAdmin GUI with all the required config, before running the docker command below to ingest the data. A quick connectivity check is sketched after this list.
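
Before moving on to the Prefect steps, here is a quick way to confirm that Postgres is reachable from outside pgAdmin (a minimal check; the port and database name are assumptions based on a typical docker-compose setup for this course):

    # Quick sanity check that the containerized Postgres is reachable.
    # Port 5432 and the database name are assumptions; adjust to your docker-compose.yml.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://root:root@localhost:5432/bike_data")
    with engine.connect() as conn:
        print(conn.execute(text("SELECT version();")).scalar())
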
  1. As we are using Prefect for workflow orchestration, the way we ingest the data into the DB is a bit different, as follows:

  2. First create a virtual environment with Python 3.9 (not tested with Python versions above 3.9).

  3. Using the requirements.txt in the prefect folder, install the Prefect dependencies into the environment. Check that Prefect is installed properly by executing the command prefect version in the CLI.

  4. Run the Prefect UI by executing the command prefect orion start.

    (de_zoomcamp_2023_capstone) prabin_nayak@DESKTOP-IUPLGMD:~$ prefect orion start
    
      ___ ___ ___ ___ ___ ___ _____    ___  ___ ___ ___  _  _
      | _ \ _ \ __| __| __/ __|_   _|  / _ \| _ \_ _/ _ \| \| |
      |  _/   / _|| _|| _| (__  | |   | (_) |   /| | (_) | .` |
      |_| |_|_\___|_| |___\___| |_|    \___/|_|_\___\___/|_|\_|
    
      Configure Prefect to communicate with the server with:
    
          prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api
    
      View the API reference documentation at http://127.0.0.1:4200/docs
    
      Check out the dashboard at http://127.0.0.1:4200

    Check out the dashboard at http://127.0.0.1:4200 to confirm you can access the Prefect dashboard.

  5. Now we have to deploy our Prefect flow so that it can be executed from the UI.

    1. Make sure you are in the folder prefect_flows, or else run cd ./prefect_flows.
    2. Run the command prefect deployment build ./ingest_bike_data.py:main_flow -n "Capstone Project Flow", which will create a main_flow-deployment.yaml in the path ./prefect_flows. This deploys the Prefect flow to the server (the local server we have been running through Orion). A programmatic alternative is sketched after this list.
    3. Once the file is created, run the command prefect deployment apply main_flow-deployment.yaml to upload the deployment so that you can see it in the UI.
    4. Now, although you can see the flow, it won't actually run if you trigger it, because there is no agent to pick it up. Select the Work Queues tab from the left side panel; if you open the default queue you will see that the flow is scheduled but there is no agent.
    5. To start an agent, run the command prefect agent start --work-queue "default". The agent will then pick up and start the run.
  6. How the data visualizations were created:

    1. Installing Metabase
      1. Pull the Metabase image from Docker Hub using the command docker pull metabase/metabase:latest.
      2. Run the Metabase container with the command docker run -d -p 3000:3000 --name metabase metabase/metabase.
      3. Access the URL localhost:3000/ to reach the Metabase server running on your local instance.
      4. Check the GIFs and screenshots below for the dashboards created.
  7. To stop the running Postgres + pgAdmin containers, use the command docker compose down.
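
As an aside to step 5: depending on your Prefect 2.x version, the CLI build/apply pair can also be done from Python with Deployment.build_from_flow. This is a hedged sketch; the import path of main_flow assumes the script is run from ./prefect_flows.

    # Hedged alternative to step 5: build and apply the deployment from Python (Prefect 2.x).
    # The module path of main_flow is an assumption; run this from ./prefect_flows.
    from prefect.deployments import Deployment

    from ingest_bike_data import main_flow

    deployment = Deployment.build_from_flow(
        flow=main_flow,
        name="Capstone Project Flow",
        work_queue_name="default",
    )
    deployment.apply()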

Tools used

  • Scripting the data ingest file - Python
  • Containerization - Docker
  • Data warehouse - Postgres
  • Data visualization - Metabase
  • Making the GIFs - Chrome Capture - screenshot & gif tool

The Dashboard


Trips count gif


Trips Count - chrome-capture-2023-2-21

Avg Trip Duration by ride type


Avg Trip dur by Ride Types chrome-capture-2023-2-21

Avg Trip Duration by Customer Type


Avg Trip dur by  Customer Type chrome-capture-2023-2-21

Future Improvements

  1. Currently the transformations are done in the Python script itself. Add dbt transformations to create FACT and DIMENSION tables dynamically.
  2. Add Docker capabilities to the Prefect flow so that the whole flow can be containerized and set up on any host machine with a single command.
  3. Add distributed processing using Spark.
  4. Add stream processing, depending on the feasibility and scope of the project.

Last but not least...

Thanks to DataTalks.Club for creating this Data Engineering course to help anyone and everyone learn Data Engineering at ZERO COST. Other offerings by the club are the ML Zoomcamp and the MLOps Zoomcamp. Do check them out!!
