Quickly move to section you are interested in by clicking on appropriate link:
This project extracts, transforms, loads and analyzes data from the NYC Taxi and Limousine Commission.
Relevant data is collected, and transformations are carried out to support the analysis of revenue generation by the commission through taxi rides.
Successfully move data from Source, transform, and Visualize in a dashboard. Key Deliverables:
- Create a data pipeline to process data in batches from the NYC Taxi and Limousine Commission data records to a data warehouse.
- Create an analytical engineering model to transform the data for analysis.
- Create a dashboard to present key revenue insights to stakeholders.
The data architecture is an overview of the data pipeline. Key points are:
- A mage ETL pipeline from the NYC Taxi and Limousine Commission data records to a GCS bucket
- A mage ETL pipeline from the GCS bucket to the Big Query data warehouse.
- An analytical engineering model built on top of the Big Query data warehouse using DBT to further transform the data
- A dashboard built with Power BI for Analysis
The project's data source can be accessed here
Two months of data, June and July 2024, were collected from the NYC Taxi and Limousine Commission data records for green and yellow service taxis.
This data was then analyzed to present insights into the revenue of the commission for these two months
Yellow Taxi Dataset:
Field name description
- VendorID - A code indicating the TPEP provider that provided the record
- tpep_pickup_datetime - The date and time when the meter was engaged.
- tpep_dropoff_datetime - The date and time when the meter was disengaged.
- Passenger_count - The number of passengers in the vehicle (this is a driver-entered value).
- Trip_distance - The elapsed trip distance in miles reported by the taximeter.
- PULocationID - TLC Taxi Zone in which the taximeter was engaged.
- DOLocationID - TLC Taxi Zone in which the taximeter was disengaged.
- RateCodeID - The final rate code in effect at the end of the trip.
- Store_and_fwd_flag - Indicates whether the trip record was held in vehicle memory before sending to the vendor (“store and forward”) due to no connection to the server.
- Payment_type - A numeric code signifying how the passenger paid for the trip.
- Fare_amount - The time-and-distance fare calculated by the meter.
- Extra - Miscellaneous extras and surcharges, including the $0.50 and $1 rush hour and overnight charges.
- MTA_tax - $0.50 MTA tax automatically triggered based on the metered rate in use.
- Improvement_surcharge - $0.30 improvement surcharge assessed for trips at the flag drop. The surcharge began being levied in 2015.
- Tip_amount - Tip amount – this field is automatically populated for credit card tips. Cash tips are not included.
- Tolls_amount - Total amount of all tolls paid during the trip.
- Total_amount - The total amount charged to passengers (does not include cash tips).
- Congestion_Surcharge - Total amount collected for NYS congestion surcharge during the trip.
- Airport_fee - $1.25 for pickups only at LaGuardia and John F. Kennedy Airports.
Green Taxi Dataset:
Field name description
- VendorID - A code indicating the LPEP provider that provided the record.
- lpep_pickup_datetime - The date and time when the meter was engaged.
- lpep_dropoff_datetime - The date and time when the meter was disengaged.
- Passenger_count - The number of passengers in the vehicle (this is a driver-entered value).
- Trip_distance - The elapsed trip distance in miles reported by the taximeter.
- PULocationID - TLC Taxi Zone in which the taximeter was engaged.
- DOLocationID - TLC Taxi Zone in which the taximeter was disengaged.
- RateCodeID - The final rate code in effect at the end of the trip.
- Store_and_fwd_flag - Indicates whether the trip record was held in vehicle memory before sending to the vendor (“store and forward”) due to no connection to the server.
- Payment_type - A numeric code signifying how the passenger paid for the trip.
- Fare_amount - The time-and-distance fare calculated by the meter.
- Extra - Miscellaneous extras and surcharges, including the $0.50 and $1 rush hour and overnight charges.
- MTA_tax - $0.50 MTA tax automatically triggered based on the metered rate in use.
- Improvement_surcharge - $0.30 improvement surcharge assessed on hailed trips at the flag drop. The surcharge began being levied in 2015.
- Tip_amount - Tip amount – this field is automatically populated for credit card tips. Cash tips are not included.
- Tolls_amount - Total amount of all tolls paid during the trip.
- Total_amount - The total amount charged to passengers (does not include cash tips).
- Trip_type - A code indicating whether the trip was a street-hail or a dispatch. It is automatically assigned based on the metered rate in use but can be altered by the driver.
- Docker:- Containerization of applications -- build, share, run, and verify applications anywhere — without tedious environment configuration or management.
- Google Cloud Storage GCS - Data Lake for storage
- Google Cloud BigQuery - Data warehouse for analytical purposes
- Mage - Data and workflow orchestration
- Dbt- For analytics engineering via data transformation
- Power BI - Data Visualization
This section will give a thorough breakdown of how to reproduce this project
Two Mage ETL pipeline were built to load data from the NYC Taxi and Limousine Commission data records to a GCS Bucket. One for the green taxi records and another for the yellow taxi records. These pipeline collected data for two months for their respective service concatenated it and loaded it to the GCS Bucket
Yellow taxi records ingestion pipeline
Green taxi records ingestion pipeline
Two more Mage EL pipelines were built to load the data from the GCS Bucket to Big Query. One for the green taxi records and another for the yellow taxi records.
Yellow taxi records staging pipeline
Green taxi records staging pipeline
DBT was used to build and deploy an analytical model to transform the data.
In addition to the dataset gotten from NYC Taxi and Limousine Commission data records, the comission also provides a taxi zone lookup table to correctly identify boroughs. This was added as a seed in the dbt model
Using power BI, the following dashboard was built