This repository contains my work for the Data Engineering Nanodegree on Udacity. The course covers data modeling, data warehousing, data lakes, data pipelines, and ETL scheduling with Airflow.

The following are my learning goals for this nanodegree:
- Data Modeling [DONE]
  - Learn data modeling with a relational DB: Postgres
  - Learn data modeling with a NoSQL DB: Apache Cassandra
- Cloud Data Warehouse [DONE]
  - Learn about cloud infrastructure for data warehousing (AWS) and infrastructure-as-code
  - Implement a star-schema data warehouse on AWS Redshift
- Spark and Data Lakes [DONE]
  - Learn to use Spark to work with massive datasets
  - Implement a data lake on AWS S3 using Spark on AWS EMR
- Data Pipelines [DONE]
  - Learn about data pipelines and data quality
  - Implement a production data pipeline with Apache Airflow, building a data warehouse on AWS Redshift
- Capstone Project [DONE]
  - Apply the knowledge from the lessons above to perform ETL on immigration and demographics data
  - Create a gold dataset for analysts and data scientists to analyze immigration patterns in the US
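To illustrate the star-schema idea from the Cloud Data Warehouse module, here is a minimal, self-contained sketch. The table and column names (`fact_songplay`, `dim_user`, `dim_song`) are illustrative only, not the actual project schema, and `sqlite3` stands in for Redshift so the example runs anywhere:

```python
import sqlite3

# Toy star schema: one fact table referencing two dimension tables.
# Names are hypothetical; sqlite3 is used in place of Redshift.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_user (user_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_song (song_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE fact_songplay (
    play_id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES dim_user(user_id),
    song_id INTEGER REFERENCES dim_song(song_id)
);
""")

cur.execute("INSERT INTO dim_user VALUES (1, 'alice')")
cur.execute("INSERT INTO dim_song VALUES (10, 'Song A')")
cur.execute("INSERT INTO fact_songplay VALUES (100, 1, 10)")

# Typical analytical query: join the fact table back to its dimensions.
row = cur.execute("""
    SELECT u.name, s.title
    FROM fact_songplay f
    JOIN dim_user u ON f.user_id = u.user_id
    JOIN dim_song s ON f.song_id = s.song_id
""").fetchone()
print(row)  # ('alice', 'Song A')
```

The appeal of this layout is that facts stay narrow and numeric while descriptive attributes live in small dimension tables, which keeps analytical joins cheap.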
Each course module is a separate Python module in this repository, with its own README describing how to set up the environment (locally or on AWS).
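As an illustration of the data-quality checks mentioned in the Data Pipelines module, here is a minimal sketch of the kind of validation a pipeline step might run. The function and column names are hypothetical, not taken from the course code:

```python
# Minimal data-quality check: verify an extract is non-empty and that
# key columns contain no NULLs. Names here are illustrative only.
def check_quality(rows, key_columns):
    """Return a list of failed checks for a list of row dicts."""
    failures = []
    if not rows:
        failures.append("table is empty")
    for col in key_columns:
        if any(row.get(col) is None for row in rows):
            failures.append(f"NULL found in column '{col}'")
    return failures

good = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
bad = [{"id": 1, "name": None}]

print(check_quality(good, ["id", "name"]))  # []
print(check_quality(bad, ["id", "name"]))   # ["NULL found in column 'name'"]
```

In an Airflow pipeline, a check like this would typically run as its own task after the load step, failing the run when the returned list is non-empty.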