This sample demonstrates how DevOps principles can be applied to an end-to-end Data Pipeline Solution built according to the Medallion architecture pattern.
The solution pulls near real-time Melbourne Parking Sensor data from a publicly available REST API endpoint and saves it to Azure Data Lake Gen2. It then validates, cleanses, and transforms the data to a known schema using Azure Databricks. A second Azure Databricks job then transforms the data into a Star Schema, which is loaded into Azure Synapse Analytics (formerly SQLDW) using PolyBase. The entire pipeline is orchestrated with Azure Data Factory.
The following shows the overall architecture of the solution.
Sample PowerBI report
The following shows the overall CI/CD process end to end.
See here for details.
It makes use of the following Azure services:
- Azure Data Factory
- Azure Databricks
- Azure Data Lake Gen2
- Azure Synapse Analytics (formerly SQLDW)
- Azure DevOps
- Application Insights
- PowerBI
For a detailed walk-through of the solution and key concepts, watch the following video recording:
The following summarizes key learnings and best practices demonstrated by this sample solution:
- Generally, you want to divide your data lake into three major areas which contain your Bronze, Silver and Gold datasets.
- Bronze - This is the landing area for your raw datasets with no or minimal data transformations applied; these are therefore optimized for writes/ingestion. Treat these datasets as an immutable, append-only store.
- Silver - These are cleansed, semi-processed datasets. These conform to a known schema and predefined data invariants and might have further data augmentation applied. These are typically used by Data Scientists.
- Gold - These are highly processed, highly read-optimized datasets primarily for consumption by business users. Typically, these are structured in your standard Fact and Dimension tables.
- Add data validation between the Bronze and Silver datasets. By validating early in your pipeline, you can ensure all succeeding datasets conform to a specific schema and known data invariants. This also can potentially prevent data pipeline failures in cases of unexpected changes to the input data.
- Data that does not pass this validation stage can be rerouted to a Malformed Record store for diagnostic purposes (a minimal sketch of this pattern is shown below).
- It may be tempting to add validation prior to landing in the Bronze area of your data lake. This is generally not recommended. Bronze datasets are there to ensure you have as close a copy of the source system data as possible. These can be used to replay the data pipeline for both testing (e.g. testing data validation logic) and data recovery purposes (e.g. data corruption is introduced due to a bug in the data transformation code and the pipeline needs to be replayed).
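To make this concrete, here is a minimal PySpark sketch of the validate-and-reroute pattern described above. It is illustrative only: the storage account, paths, column names and invariants are assumptions, not the repository's actual validation code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative lake layout (see the ADLS Gen2 structure later in this README).
base = "abfss://datalake@<storage_account>.dfs.core.windows.net/data"

bronze_df = spark.read.json(f"{base}/lnd/parking_bays/")

# Known data invariants: a bay identifier must be present and the occupancy
# status must be one of the expected values.
is_valid = F.col("bay_id").isNotNull() & F.col("status").isin("Present", "Unoccupied")

# Valid records continue on to the Silver (interim) zone...
bronze_df.filter(is_valid).write.mode("append").parquet(f"{base}/interim/parking_bays/")

# ...while records failing validation are rerouted to a Malformed Record store
# so they can be inspected without breaking the downstream pipeline.
bronze_df.filter(~is_valid).write.mode("append").parquet(f"{base}/malformed/parking_bays/")
```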
- Silver and Gold datasets can get corrupted due to a number of reasons such as unintended bugs, unexpected input data changes, and more. By making data pipelines replayable and idempotent, you can recover from this state through deployment of code fix and replaying the data pipelines.
- Idempotency also ensures data duplication is mitigated when replaying your data pipelines (see the sketch below).
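One common way to achieve this is to write each run's output into a deterministic partition and overwrite only that partition on replay, so re-running a load replaces its previous output instead of appending duplicates. A minimal PySpark sketch, with illustrative paths and column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Only the partitions present in the incoming data are replaced, so replaying a
# run for the same load date replaces its previous output instead of duplicating it.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

silver_df = (
    spark.read.parquet("<interim_path>/parking_bays/")
    .withColumn("load_date", F.to_date(F.col("loaded_on")))  # 'loaded_on' is an assumed audit column
)

(
    silver_df.write
    .mode("overwrite")          # with dynamic partition overwrite, untouched partitions are preserved
    .partitionBy("load_date")
    .parquet("<gold_path>/fact_parking/")
)
```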
- Abstracting away data transformation code from data access code is key to ensuring unit tests can be written against data transformation logic. An example of this is moving transformation code from notebooks into packages.
- While it is possible to run tests against notebooks, by shifting tests left you increase developer productivity by increasing the speed of the feedback cycle.
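For illustration, the sketch below shows the idea: a transformation kept in a plain Python module can be exercised by a fast pytest unit test against a local SparkSession, with no notebook or Azure resources involved. Module, function and column names are hypothetical, and the `spark` fixture is assumed to be provided by the test suite (for example via a conftest.py that builds a local SparkSession).

```python
# transform.py -- hypothetical package module holding pure transformation logic
from pyspark.sql import DataFrame, functions as F

def standardize_parking_bays(df: DataFrame) -> DataFrame:
    """Cleanse raw parking-bay records into the known Silver schema."""
    return (
        df.dropDuplicates(["bay_id"])
          .filter(F.col("bay_id").isNotNull())
          .withColumn("status", F.upper(F.col("status")))
    )

# test_transform.py -- runs locally and in the PR validation (QA) pipeline
def test_standardize_parking_bays(spark):
    raw = spark.createDataFrame(
        [(1, "present"), (1, "present"), (None, "unoccupied")],
        ["bay_id", "status"],
    )
    result = standardize_parking_bays(raw)

    assert result.count() == 1                    # duplicates and null ids removed
    assert result.first()["status"] == "PRESENT"  # values standardized
```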
- Keep all artifacts needed to build the data pipeline from scratch in source control. This includes infrastructure-as-code artifacts, database objects (schema definitions, functions, stored procedures, etc.), reference/application data, data pipeline definitions, and data validation and transformation logic.
- There should also be a safe, repeatable process to move changes through dev, test and finally production.
- Maintain a central, secure location for sensitive configuration such as database connection strings that can be accessed by the appropriate services within the specific environment.
- An example of this is securing secrets in KeyVault per environment, then having the relevant services query KeyVault for the configuration (see the sketch below).
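As a sketch of this pattern (the package names are real, but the vault URL and secret name are placeholders and this is not the repository's exact code), a Python service can resolve a connection string from Key Vault at runtime using azure-identity and azure-keyvault-secrets:

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential works for local development (az login) as well as
# managed identities in Azure, so the same code runs in every environment.
credential = DefaultAzureCredential()
client = SecretClient(
    vault_url="https://<your-keyvault-name>.vault.azure.net/",
    credential=credential,
)

# Placeholder secret name for the environment-specific connection string.
sql_connection_string = client.get_secret("sqldw-connection-string").value
```

Within Databricks notebooks, the same secrets are typically read through a Key Vault-backed secret scope via `dbutils.secrets.get(scope=..., key=...)`.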
- A proper monitoring solution should be in place to ensure failures are identified, diagnosed and addressed in a timely manner. Aside from the base infrastructure and pipeline runs, data quality should also be monitored. A common area that should have data monitoring is the malformed record store.
- As an example, this repository showcases how to use the open-source framework Great Expectations to define, measure and report data quality metrics at different stages of the data pipeline. Captured Data Quality metrics are reported to Azure Monitor for further visualizing and alerting. Take a look at the sample Data Quality report generated with an Azure Monitor workbook. Great Expectations can be configured to generate HTML reports and host these directly as a static site on Azure Blob Storage. Read more on How to host and share Data Docs on Azure Blob Storage.
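As a rough sketch of what an expectation check looks like in code (note that the Great Expectations API differs significantly between versions), the snippet below uses the legacy `SparkDFDataset` interface with illustrative column names; it is not the repository's actual expectation suites.

```python
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset  # legacy API; newer GE versions use a Validator/Fluent API

spark = SparkSession.builder.getOrCreate()
silver_df = spark.read.parquet("<interim_path>/parking_bays/")

# Wrap the DataFrame so expectations can be evaluated against it.
ge_df = SparkDFDataset(silver_df)

results = [
    ge_df.expect_column_values_to_not_be_null("bay_id"),
    ge_df.expect_column_values_to_be_between("parking_duration_min", min_value=0, max_value=1440),
]

# Each result carries a success flag plus metrics (e.g. unexpected counts) which
# can be forwarded to Azure Monitor as custom logs for dashboards and alerting.
failed = [r for r in results if not r.success]
if failed:
    print(f"{len(failed)} data quality expectation(s) failed")
```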
The Build and Release Pipelines definitions can be found here.
- Sandbox and Dev - the DEV resource group is used by developers to build and test their solutions. It contains two logical environments: (1) a Sandbox environment per developer so each developer can make and test their changes in isolation prior to committing to `main`, and (2) a shared Dev environment for integrating changes from the entire development team. "Isolated" Sandbox environments are accomplished through a number of practices depending on the Azure service:
- Databricks - developers use their dedicated Workspace folder to author and save notebooks. Developers can choose to spin up their own dedicated clusters or share a High-concurrency cluster.
- DataLake Gen2 - a "sandbox" file system is created. Each developer creates their own folder within this Sandbox filesystem.
- AzureSQL or SQLDW - A transient database (restored from DEV) is spun up per developer on demand.
- Data Factory - git integration allows developers to make changes in their own branches and debug runs independently.
- Stage - the STG resource group is used to test deployments prior to going to production in a production-like environment. Integration tests are run in this environment.
- Production - the PROD resource group is the final Production environment.
There are eight numbered orange boxes describing the sequence from sandbox development to target environments:
- Developers develop in their own Sandbox environments within the DEV resource group and commit changes into their own short-lived git branches (e.g. <developer_name>/<branch_name>).
- When changes are complete, developers raise a PR to `main` for review. This automatically kicks off the PR validation pipeline which runs the unit tests, linting and DACPAC builds.
- On PR completion, the commit to `main` will trigger a Build pipeline, publishing all necessary Build Artifacts.
- The completion of a successful Build pipeline will trigger the first stage of the Release pipeline. This deploys the published build artifacts into the DEV environment, with the exception of Azure Data Factory*.
- Developers perform a Manual Publish to the DEV ADF from the collaboration branch (`main`). This updates the ARM templates in the `adf_publish` branch.
- On the successful completion of the first stage, this triggers a Manual Approval Gate**. On Approval, the release pipeline continues with the second stage -- deploying changes to the Staging environment.
- Integration tests are run to test changes in the Staging environment.
- ***On the successful completion of the second stage, this triggers a second Manual Approval Gate. On Approval, the release pipeline continues with the third stage -- deploying changes to the Production environment.
Notes:
- This is a simplified Build and Release process for demo purposes based on Trunk-based development practices.
- *A manual publish is required -- currently, this cannot be automated.
- **The solution deployment script does not configure Approval Gates at the moment. See Known Issues, Limitations and Workarounds
- ***Many organizations (including Microsoft) use dedicated Release Branches instead of deploying from `main`. See Release Flow.
More resources:
- Continuous Integration & Continuous Delivery with Databricks
- Continuous integration and delivery in Azure Data Factory
- DevOps for AzureSQL
- Unit Testing - These test small pieces of functionality within your code. Data transformation code should have unit tests, which can be accomplished by abstracting Data Transformation logic into packages. Unit tests along with linting are automatically run when a PR is raised to `main`.
- See here for unit tests within the solution and the corresponding QA Pipeline that executes the unit tests on every PR.
- Integration Testing - These are run to ensure integration points of the solution function as expected. In this demo solution, an actual Data Factory Pipeline run is automatically triggered and its output verified as part of the Release to the Staging Environment.
- See here for the integration tests and the corresponding Release Pipeline Job Definition for running them as part of the Release pipeline.
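The repository relies on the pytest-adf plugin for this; purely as an illustration of the underlying idea, the sketch below uses the azure-mgmt-datafactory management SDK directly to trigger a Data Factory pipeline run and assert on its final status. Resource names are placeholders.

```python
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<stg-resource-group>"
FACTORY_NAME = "<stg-data-factory-name>"

def test_ingest_pipeline_succeeds():
    client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    # Trigger the pipeline, then poll until it reaches a terminal state.
    run = client.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, "P_Ingest_MelbParkingData")
    status = "InProgress"
    while status in ("Queued", "InProgress"):
        time.sleep(30)
        status = client.pipeline_runs.get(RESOURCE_GROUP, FACTORY_NAME, run.run_id).status

    assert status == "Succeeded"
```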
More resources:
- pytest-adf - Pytest helper plugin for integration testing Azure Data Factory
- nutter testing framework - Testing framework for Databricks notebooks.
Observability-as-Code - A few key components of Observability and Monitoring are deployed and configured through Observability-as-Code at the time of Azure resource deployment. This includes a Log Analytics workspace to collect monitoring data from key resources, a central Azure dashboard to monitor key metrics, and alerts to monitor the data pipelines. To learn more about monitoring a specific service, see below.
- Monitoring Azure Databricks with Azure Monitor
- Monitoring Azure Databricks Jobs with Application Insights
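To give a flavor of application-level telemetry (this is one possible approach, not necessarily how the sample wires it up), a Python job can push custom log records to Application Insights with the opencensus-ext-azure log handler; the connection string below is a placeholder.

```python
import logging

from opencensus.ext.azure.log_exporter import AzureLogHandler

# In practice the connection string would come from Key Vault or an environment
# variable rather than being hard-coded.
logger = logging.getLogger("parking_pipeline")
logger.setLevel(logging.INFO)
logger.addHandler(AzureLogHandler(connection_string="InstrumentationKey=<key>"))

# custom_dimensions become queryable fields in Application Insights / Log Analytics.
logger.info(
    "Transform stage completed",
    extra={"custom_dimensions": {"stage": "interim", "records_written": 1234}},
)
```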
This sample requires certain prerequisites for deployment, which will be outlined below. While using the Dev Container is optional, we highly recommend it as it simplifies the setup process, ensures consistency across environments, and makes managing dependencies easier.
Follow the setup prerequisites, permissions, and deployment environment options.
- Github account: If you do not have one already, create a GitHub Account.
- Azure Account: If you do not have one already, create an Azure Account.
- Permissions needed: ability to create and deploy to an Azure resource group, create a service principal, and grant the Collaborator role to the service principal over the resource group.
- Azure DevOps Project: Follow the documentation to create a new project, or use an existing project you wish to deploy these resources to.
- Permissions needed: ability to create service connections, pipelines and variable groups.
As previously mentioned, there are two approaches to setting up the deployment environment, and the prerequisites for each will be detailed below. We recommend opting for the Dev Container, as it streamlines the setup process, ensures consistency, and minimizes configuration effort. Using the Dev Container requires installation and configuration; refer to the documentation linked below for further details.
Dev Container (Recommended)
- Visual Studio Code Remote Development Extension
- It is strongly recommended to use the Dev Container for the deployment to avoid environment-related issues.
- For Windows users, Windows Subsystem For Linux
- az cli 2.6+
- az cli - application insights extension
- To install, run `az extension add --name application-insights`
- Azure DevOps CLI
- To install, run `az extension add --name azure-devops`
- Python 3+
- databricks-cli
- jq
- makepasswd
Set up the environment variables as specified, fork the GitHub repository, and log in to your Azure account before moving forward with the deployment.
IMPORTANT NOTE: As with all Azure Deployments, this will incur associated costs. Remember to tear down all related resources after use to avoid unnecessary costs. See here for the list of deployed resources. NOTE: This deployment was tested using WSL 2 (Ubuntu 18.04) and Debian GNU/Linux 9.9 (stretch).
- Configuration: Variables and Login
- Fork this repository into a new Github repo.
- Set the following required environment variables:
- TENANT_ID - the Entra ID tenant ID, used to log in to a specific tenant with the Azure CLI.
- AZURE_SUBSCRIPTION_ID - Azure subscription id to use to deploy resources. Default: default azure subscription.
- AZDO_PROJECT - Target Azure DevOps project where Azure Pipelines and Variable groups will be deployed
- AZDO_ORGANIZATION_URL - Target Azure DevOps Organization of the Azure DevOps project, in the form `https://dev.azure.com/<organization>/`. Must be in the same tenant as $TENANT_ID.
- GITHUB_REPO - Name of your forked github repo in the form `<my_github_handle>/<repo>` (e.g. "devlace/mdw-dataops-import").
- GITHUB_PAT_TOKEN - a Github PAT token. Generate one here. The token is needed to connect to the GitHub repository. When generating a token, use a fine-grained token, select your repository, and under Repository permissions select Read access to Content and Webhooks. Under Account permissions select Read access to Email.
Optionally, set the following environment variables:
- AZURE_LOCATION - Azure location to deploy resources. Default: `westus`.
- DEPLOYMENT_ID - string appended to all resource names. This is to ensure uniqueness of Azure resource names. Default: random five-character string.
- AZDO_PIPELINES_BRANCH_NAME - git branch where Azure DevOps pipelines definitions are retrieved from. Default: main.
- AZURESQL_SERVER_PASSWORD - Password of the SQL Server instance. Default: random string.
- Additionally - if you are using the Dev Container (the recommended option), follow these steps:
- Rename `.envtemplate` under the ".devcontainer" folder to `.env` and update the values as mentioned above instead of setting those as environment variables.
- Open the project inside the vscode Dev Container (see details here).
- Note that the environment file is only loaded once, during the container build process. If you modify any environment variables after building your Dev Container, you will need to manually reload the new values by running `source .devcontainer/.env`.
- To further customize the solution, set parameters in the `arm.parameters` files located in the `infrastructure` folder.
- To enable Observability and Monitoring components through code (Observability-as-code), set the `enable_monitoring` parameter to `true` in the `arm.parameters` files located in the `infrastructure` folder. This will deploy a Log Analytics workspace to collect monitoring data from key resources, set up an Azure dashboard to monitor key metrics, and configure alerts for ADF pipelines.
Login and Cluster Configuration
- Ensure that you have completed the configuration for the variables described in the previous section, titled Configuration: Variables and Login.
- This configuration will be used during the environment deployment process to facilitate login.
- Create a `cluster.config.json` Spark configuration from the `cluster.config.template.json` file. For the "node_type_id" field, select a SKU that is available in your subscription, which you can check with the following command: `az vm list-usage --location "<YOUR_REGION>" -o table`
The repository provides an example, but you need to make sure that the SKU exists in your region and is available for your subscription.
- Deploy Azure resources
- `cd` into the `e2e_samples/parking_sensors` folder of the repo.
- Run `./deploy.sh`.
- The login process for deployment is interactive. When you run the script deploy.sh, a browser window will open, prompting you to log in to Azure. If there is an open session from a previous deployment, it may log you out and request that you log in again.
- This may take around 30 minutes or more to run end to end. So grab yourself a cup of coffee... ☕ But before you do so, keep the following in mind:
- You might encounter deployment issues if the script attempts to create a Key Vault that conflicts with a previously soft-deleted Key Vault. In such cases, the deployment script may prompt you to confirm the purge of the previously deleted Key Vault.
- There are 3 points in time where you will need to authenticate to the Databricks workspace before the script continues to run. You will find the following message for the deployment of the dev, stage and production environments. Click the link highlighted in green, consent to authenticate to the Databricks workspace, and when the workspace opens successfully, return to the deployment window and press Enter to continue:
- If you encounter the error `cannot execute: required file not found`, verify the line ending settings of your git configuration. This error likely means the lines in the file end with CRLF. Using VSCode, verify that `./deploy.sh` is set to LF only. This can be done using the command palette and typing `>Change End of Line Sequence`. Also, verify that the files in the `scripts` folder are set to LF only.
- After a successful deployment, you will find `.env.{environment_name}` files containing essential configuration information per environment. See here for the list of deployed resources.
- Note that if you are using the Dev Container, you would run the same script but inside the Dev Container terminal.
- As part of the deployment script, the Azure DevOps Release Pipeline YAML definition has been updated to point to your Github repository. Commit and push these changes.
- This will trigger a Build and Release which will fail due to a lacking `adf_publish` branch -- this is expected. This branch will be created once you've set up git integration with your DEV Data Factory and published a change.
- Setup ADF git integration in DEV Data Factory
IMPORTANT NOTE: Only the DEV Data Factory should be setup with Git integration. Do not setup git integration in the STG and PROD Data Factories.
- In the Azure Portal, navigate to the Data Factory in the DEV environment and launch the Data Factory portal.
- On the landing page, select "Set up code repository". For more information, see here.
- Fill in the repository settings with the following:
- Repository type: Github
- Use GitHub Enterprise Server: Unselected, unless you are using GitHub Enterprise Server
- Github Account: your_Github_account
- Git repository (select Use repository link, if forked): forked Github repository url
- Collaboration branch: main
- Root folder: /e2e_samples/parking_sensors/adf
- Import existing resources to repository: Selected
- Import resource into this branch: main
- When prompted to select a working branch, check Use existing and select main
Ensure you Import Existing Data Factory resources to the repository. The deployment script deployed ADF objects with Linked Service configurations in line with the newly deployed environments. Importing existing ADF resource definitions to the repository overrides any default Linked Service values so they are correctly in sync with your DEV environment.
- Trigger an initial Release
- In the DEV Data Factory portal, navigate to Pipelines and open the "P_Ingest_MelbParkingData" pipeline.
- In the top left corner, open the git drop-down and create a Dev branch by clicking "New Branch".
- Once the Dev branch is created, select it from the drop-down list and make a change to the Description field of one of the pipeline tasks.
- Save the pipeline.
- Go to the Git drop down box in the top left corner again, but this time create a Pull request.
- Once the Pull request is merged, the CI and the CD pipelines will be triggered and you will be able to see the description change on your staging and production Data Factory.
- Publishing of the Data Factory changes is implemented automatically in the CI/CD pipelines using the generally available npm package. There is no need to click "Publish" in the Data Factory portal anymore.
- In Azure DevOps, notice a new run of the Build Pipeline (mdw-park-ci-artifacts) off `main`. This will build the Python package and SQL DACPAC, then publish these as Pipeline Artifacts.
- After completion, this should automatically trigger the Release Pipeline (mdw-park-cd-release). This will deploy the artifacts across environments.
- Optional. Trigger the Data Factory Pipelines per environment.
- In the Data Factory portal of each environment, navigate to "Author", then select the `P_Ingest_MelbParkingData` pipeline.
- Select "Trigger > Trigger Now".
- To monitor the run, go to "Monitor > Pipeline runs".
- Currently, the data pipeline is configured to use "on-demand" databricks clusters so it takes a few minutes to spin up. That said, it is not uncommon to change these to point to "existing" running clusters in Development for faster data pipeline runs.
- Optional. Visualize data in PowerBI
This requires PowerBI Desktop App installed.
- Open the provided PowerBI pbix (PowerBI_ParkingSensors.pbix) under the `reports` folder.
- Under Queries, select "Transform Data" > "Data source settings".
- Select "Change Source..." and enter the Server and Database details of your SQL Dedicated Pool. Click "Ok".
You can retrieve these from the Azure Portal under "Connection Strings" of your SQL Dedicated Pool Instance.
- Select "Edit Permissions...". Under "Credentials", select "Edit...". Select the "Database" tab. Enter the User name and password of your SQL Dedicated Pool Instance.
You can retrieve these from the Secrets in your KeyVault instance.
- Close the Data Source tabs.
- Click on Refresh data.
Your Dashboard will initially be empty. You will need your data pipeline to run a few times for the data in your SQL Dedicated Pool to populate.
Congratulations!! 🥳 You have successfully deployed the solution and accompanying Build and Release Pipelines. For next steps, we recommend watching this presentation for a detailed walk-through of the running solution.
If you've encountered any issues, please review the Troubleshooting section. If you are still stuck, please file a Github issue with the relevant error message, error screenshots, and replication steps.
After a successful deployment, you should have the following resources:
- In Azure, three (3) Resource Groups (one per environment) each with the following Azure resources.
- Data Factory - with pipelines, datasets, linked services, triggers deployed and configured correctly per environment.
- Data Lake Store Gen2 and a Service Principal (SP) with Storage Contributor rights assigned.
- Databricks workspace
- notebooks uploaded to the `/notebooks` folder in the workspace
- SparkSQL tables created
- ADLS Gen2 mounted at `dbfs:/mnt/datalake` using the Storage Service Principal
- Databricks KeyVault secrets scope created
- Log Analytics Workspace - including a Kusto query under Query explorer -> Saved queries, to verify results that will be logged by Synapse notebooks (notebooks are not deployed yet).
- Azure Synapse SQL Dedicated Pool (formerly SQLDW) - currently, empty. The Release Pipeline will deploy the SQL Database objects.
- Azure Synapse Spark Pool - currently, empty. Configured to point the deployed Log Analytics workspace, under "Apache Spark Configuration".
- Azure Synapse Workspace - currently, empty.
- Application Insights
- KeyVault with all relevant secrets stored.
- In Azure DevOps
- Four (4) Azure Pipelines
- mdwdops-cd-release - Release Pipeline
- mdwdops-ci-artifacts - Build Pipeline
- mdwdops-ci-qa-python - "QA" pipeline, runs on PR to `main`
- mdwdops-ci-qa-sql - "QA" pipeline, runs on PR to `main`
- Six (6) Variable Groups - two per environment
- mdwdops-release-dev
- mdwdops-secrets-dev*
- mdwdops-release-stg
- mdwdops-secrets-stg*
- mdwdops-release-prod
- mdwdops-secrets-prod*
- Four (4) Service Connections
- Three Azure Service Connections (one per environment) each with a Service Principal with Contributor rights to the corresponding Resource Group.
- mdwdops-serviceconnection-dev
- mdwdops-serviceconnection-stg
- mdwdops-serviceconnection-prod
- Github Service Connection for retrieving code from Github
- mdwdops-github
- Three Azure Service Connections (one per environment) each with a Service Principal with Contributor rights to the corresponding Resource Group.
- Three additional Service Principals (one per environment) with Data Factory Contributor role for running Integration Tests
Notes:
- *These variable groups are currently not linked to KeyVault due to limitations of creating these programmatically. See Known Issues, Limitations and Workarounds
- Environments and Approval Gates are not deployed as part of this solution. See Known Issues, Limitations and Workarounds
This sample comes with an optional, interactive clean-up script which will delete resources with `mdwdops` in their name. It will list the resources to be deleted and will prompt before continuing. IMPORTANT NOTE: As it simply searches for `mdwdops` in the resource name, it could list resources not part of the deployment! Use with care.
ADLS Gen2 is structured as the following:
datalake <- filesystem
/sys/databricks/libs <- contains all libs, jars, wheels needed for processing
/data
/lnd <- Bronze - landing folder where all data files are ingested into.
/interim <- Silver - interim (cleansed) tables
/dw <- Gold - final tables
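As noted in the Deployed Resources section, this filesystem is mounted in the Databricks workspace at `dbfs:/mnt/datalake` using the Storage Service Principal. A typical mount sketch (run once from a Databricks notebook; the deployment script's exact code may differ, and all identifiers below are placeholders) looks like this:

```python
# Runs inside a Databricks notebook, where `dbutils` is available as a global.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<service-principal-app-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<secret-scope>", key="<sp-secret-key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://datalake@<storage_account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)
```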
The following lists some limitations of the solution and associated deployment script:
- Azure DevOps Variable Groups linked to KeyVault can only be created via the UI; they cannot be created programmatically and so were not incorporated in the automated deployment of the solution.
- Workaround: The deployment adds sensitive configuration as "secrets" in Variable Groups, with the downside of duplicated information. If you wish, you may manually link a second Variable Group to KeyVault to pull out the secrets. KeyVault secret names should line up with the required variables in the Azure DevOps pipelines. See here for more information.
- Azure DevOps Environments and Approval Gates can only be managed via the UI; they cannot be managed programmatically and so were not incorporated in the automated deployment of the solution.
- Workaround: Approval Gates can be easily configured manually. See here for more information.
- ADF publishing through the CI/CD pipeline using the npm task still throws an error in the logs due to the missing publish_config.json file, but the pipeline completes successfully.