DISCLAIMER: I submitted the project in the first attempt itself. Because someone had given a wrong commit ID for a project I had to review, I did not review one of the projects assigned to me. That is why I have made some improvements (added Prefect and modified the README) and submitted again. Please do take the time to review my project and give any suggestions to improve my understanding. Also note that, as I added this README after filling in the form for Project attempt 2, the commit ID you saw in the Excel sheet will be different from the current commit ID.
Excel sheet commit ID: f812320
Current commit ID: whatever commit ID you can see
- de_zoomcamp_2023_capstone
Transformations (dbt, spark, etc.) - NOT IMPLEMENTED YET
- Last but not the Least
For the final capstone project we are going to use the Citi Bike sharing data to get some insights and explain them using visualizations. All the tools covered in the course have been used to help with the process.
Citi Bike is the nation's largest bike share program, with 27,000 bikes and over 1,700 stations across Manhattan, Brooklyn, Queens, the Bronx, Jersey City and Hoboken. This dataset is about the bike sharing service Citi Bike in NY.
The official site is here
The Dataset site is here
Some important info about the dataset from the site:
- This data has been processed to remove:
  - trips that are taken by staff as they service and inspect the system;
  - trips that are taken to/from any of our “test” stations (which we were using more in June and July 2013);
  - any trips that were below 60 seconds in length (potentially false starts or users trying to re-dock a bike to ensure it's secure).
The questions we are trying to answer are (a small pandas sketch of these queries follows the column descriptions below):
- What is the total number of trips taken by each type of customer (member and casual)?
- What is the average trip duration for each type of bike?
- What is the average trip duration for each type of member?
- Ride ID = unique ID for each ride
- Rideable type = type of bike available to riders: classic bike, docked bike or electric bike
- Started at and Ended at = times at which the ride starts and ends
- Start station name = name of the starting station
- Start station ID = ID of the starting station
- End station name = name of the ending station
- End station ID = ID of the ending station
- Start latitude & Start longitude = latitude and longitude of the starting point
- End latitude & End longitude = latitude and longitude of the ending point
- Member or casual ride = whether the rider is a member (who has a subscription) or a casual rider (who doesn't have a subscription)
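To make the questions above concrete, here is a minimal pandas sketch (not part of the repo) of how they could be answered, assuming the raw Citi Bike columns use their usual snake_case names (ride_id, rideable_type, started_at, ended_at, member_casual) and a placeholder file name:

```python
import pandas as pd

# Placeholder file name; substitute whichever monthly file you downloaded.
df = pd.read_csv("202301-citibike-tripdata.csv", parse_dates=["started_at", "ended_at"])

# Trip duration in minutes, derived from the start/end timestamps.
df["duration_min"] = (df["ended_at"] - df["started_at"]).dt.total_seconds() / 60

# 1. Total number of trips per customer type (member vs casual).
print(df.groupby("member_casual")["ride_id"].count())

# 2. Average trip duration per bike type.
print(df.groupby("rideable_type")["duration_min"].mean())

# 3. Average trip duration per member type.
print(df.groupby("member_casual")["duration_min"].mean())
```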
- `ingest_bike_data.py` - contains all the code used to transform the data and save it in the required format.
- `docker-compose.yml` - builds the setup used to create the Postgres and pgAdmin containers together, binding them into a common network.
- `Dockerfile` - used to copy the ingest file, along with the required packages, into the Docker container created in the step above and run the Python script.
- `README.md` - (the file you are reading) gives a brief intro about the project and contains the instructions on how to reproduce it.
- `prefect_flows` - the folder containing the Prefect flows that execute the pipeline in the form of a flow.
- `dbt` - files related to dbt, which helps with analytics engineering. This is currently not implemented; I am facing issues with setting up the source and querying.
The project has been developed without using any cloud technologies. The whole project runs locally, and the steps to recreate it are described in the Reproducibility section.
The data is generated monthly, and from the business perspective it makes more sense to ingest it monthly rather than weekly or daily, so we ingest the data in batches. For the time being, once Postgres and pgAdmin are set up, we need to run the Docker commands manually to ingest the data. Prefect is used to create the flow that ingests the data, run locally. A rough sketch of the ingest logic is shown below.
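As a hedged illustration only (the actual logic lives in `ingest_bike_data.py`, which is not reproduced here), the batch ingest idea could look roughly like this; the connection string, database name `bike_data`, and table name `bike_trips` are assumptions, not values taken from the repo:

```python
import pandas as pd
from sqlalchemy import create_engine

# Assumed connection details: root/root credentials from the compose setup,
# the default Postgres port, and a placeholder database name.
engine = create_engine("postgresql://root:root@localhost:5432/bike_data")

# Read the monthly file in chunks so a large month fits in memory,
# appending each chunk into the warehouse table.
for chunk in pd.read_csv(
    "202301-citibike-tripdata.csv",          # placeholder monthly file
    parse_dates=["started_at", "ended_at"],
    chunksize=100_000,
):
    chunk.to_sql("bike_trips", engine, if_exists="append", index=False)
```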
As we are not using GCP for this project, we treat the Postgres database as our warehouse; it holds the data used to gain insights with the Metabase tool.
We use Metabase as our data visualization tool, with Postgres as its data source. As the whole project runs on a local machine, screenshots and GIFs of the dashboard are provided below in The Dashboard section.
- Create a new folder to clone the repo into.
- Clone the project with `git clone <url-to-the-repo> .`. If the repo's contents are copied into your new folder, this step was successful.
- You can skip this step if you are running the project for the first time. Otherwise (or just as a sanity check), run `sudo rm -rf bike_postgres_data` to remove the volume data left on your local machine from a previous run.
- Check whether Docker is already installed on your system by running `docker --version`. If not, use the Docker Desktop setup link to install Docker Desktop for your OS.
- Next, run Postgres + pgAdmin (for quick queries on the database) by using the command `docker compose up -d`.
- Access the pgAdmin URL to check whether pgAdmin is working properly. Log in with the credentials [ username - [email protected] and password - root ].
- Once you are logged in to pgAdmin, create a server:
  - with a server name of your choice
  - hostname = name of the Postgres service you created while running `docker compose up -d`
  - username = `root`
  - password = `root`
  - Save the config. If the config info is correct, you will have a server with the name of your choice, without any error.
- NOTE: by this stage Postgres should be up and accessible through the pgAdmin GUI with all the required config, before running the Docker command below to ingest the data.
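Optionally, you can sanity-check the database connection from Python before moving on. This snippet is not part of the repo; the port and database name are assumptions, so adjust them to match your docker-compose.yml:

```python
from sqlalchemy import create_engine, text

# root/root credentials from the pgAdmin setup above; the port and database
# name are assumptions - check your docker-compose.yml.
engine = create_engine("postgresql://root:root@localhost:5432/bike_data")

with engine.connect() as conn:
    print(conn.execute(text("SELECT version();")).scalar())
```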
As we are using Prefect for workflow orchestration, the way we ingest the data into the DB is a bit different, as follows:
- First create a virtual env with Python version 3.9 (not tested with Python versions above 3.9).
- Using the requirements.txt in the prefect folder, install the Prefect dependencies into the environment. Check that Prefect is installed properly by executing `prefect version` in the CLI.
- Run the Prefect UI by executing `prefect orion start`. The command prints the Prefect Orion banner, a hint to run `prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api`, the API reference documentation link http://127.0.0.1:4200/docs, and the dashboard URL. Open http://127.0.0.1:4200 to check that you can access the Prefect dashboard.
- Now we have to deploy our Prefect flow to be able to execute it from the UI (a sketch of the flow itself follows this list):
  - Make sure you are in the `prefect_flows` folder, or else run `cd ./prefect_flows`.
  - Run `prefect deployment build ./ingest_bike_data.py:main_flow -n "Capstone Project Flow"`, which will create a `main_flow-deployment.yaml` in `./prefect_flows`. This helps us deploy the Prefect flow to the server (the local server we have been running through Orion).
  - Once the file is created, run `prefect deployment apply main_flow-deployment.yaml` to upload the deployment so that you can see it in the UI.
  - Now, although you can see the flow, it won't run even if you trigger it, because there is no agent to pick it up. Select the `Work Queues` tab from the left side panel; if you select the `default` queue, you will see that the flow is scheduled but there is no agent.
  - To start an agent, run `prefect agent start --work-queue "default"`. It will then pick up and start the run.
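For reference, here is a rough, assumed outline of what `prefect_flows/ingest_bike_data.py:main_flow` could look like. The deployment command above points at a flow named `main_flow` in that file; the task breakdown, placeholder file name, and connection/table names below are my own illustration, not code copied from the repo:

```python
import pandas as pd
from prefect import flow, task
from sqlalchemy import create_engine


@task
def extract(url: str) -> pd.DataFrame:
    # Read one monthly Citi Bike file (the path/URL is a placeholder).
    return pd.read_csv(url, parse_dates=["started_at", "ended_at"])


@task
def load(df: pd.DataFrame) -> None:
    # Append the month into the Postgres warehouse table
    # (assumed connection string and table name).
    engine = create_engine("postgresql://root:root@localhost:5432/bike_data")
    df.to_sql("bike_trips", engine, if_exists="append", index=False)


@flow(name="main_flow")
def main_flow(url: str = "202301-citibike-tripdata.csv"):
    # Placeholder default; pass the real monthly file path or URL as a parameter.
    df = extract(url)
    load(df)


if __name__ == "__main__":
    main_flow()
```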
How the data visualizations were created:
- Installing Metabase:
  - Pull the Metabase image from Docker Hub by using the command `docker pull metabase/metabase:latest`.
  - Run the Metabase container with the command `docker run -d -p 3000:3000 --name metabase metabase/metabase`.
  - Access the URL `localhost:3000/` to reach the Metabase server running on your local instance.
  - Check the GIF and screenshots below for the dashboard that was created.
- To stop the running Postgres + pgAdmin containers, use the command `docker compose down`.
- Scripting the data ingest file - Python
- For containerization - Docker
- For the data warehouse - Postgres
- For data visualization - Metabase
- Making the gif - Chrome Capture - screenshot & gif tool
- Currently the transformations are done in the Python script itself. Add dbt transformations to create the FACT and DIMENSION tables dynamically.
- Add Docker capabilities to the Prefect flow so that the whole flow can be containerized and set up on any host machine with a single command.
- Add distributed processing using Spark.
- Add stream processing, depending on the feasibility and scope of the project.
Thanks to Datatalks club for creating this Data Engineering course to help anyone and everyone learn data engineering at ZERO COST. Other offerings by the club are the ML Zoomcamp and the MLOps Zoomcamp. Do check them out!!