# Responsible AI 2020
This repository shows how the latest machine learning features in Azure Machine Learning and Microsoft's open-source toolkits can put the Responsible AI principles into practice. These tools empower data scientists and developers to understand ML models, protect people and their data, and control the end-to-end ML process.
To that end, we develop a solution that detects whether a person is suitable for receiving treatment for heart disease. We use a dataset that helps us separate patients who have heart disease from those who don't. Using this example, we show how to ensure ethical, transparent, and accountable use of AI in a medical scenario.
This example illustrates how to put the Responsible AI principles into practice throughout the different stages of a Machine Learning pipeline (preprocessing, training/evaluation, model registration).
The goal of this project is to detect whether a person is suitable for receiving treatment for heart disease. Within this repository, you will find all the resources needed to create and simulate a medical scenario using Azure Machine Learning Service with Responsible AI techniques. With it, you will be able to:
- Understand how Responsible AI techniques work.
- Use Azure Machine Learning Service to create a Machine Learning pipeline with these Responsible AI techniques.
- Prevent data exposure with differential privacy.
- Mitigate model unfairness.
- Interpret and explain models.
Azure Machine Learning Service gives us the ability to apply MLOps techniques, empowering data scientists and app developers to bring ML models to production. These MLOps capabilities let you track, version, audit, certify, and re-use every asset in your ML lifecycle, and provide orchestration services to streamline managing that lifecycle:
**Model reproducibility & versioning**
- Track, snapshot & manage assets used to create the model
- Enable collaboration and sharing of ML pipelines

**Model packaging & validation**
- Support model portability across a variety of platforms
- Certify that model performance meets functional and latency requirements

**Model auditability & explainability**
- Maintain asset integrity & persist access control logs
- Certify that model behavior meets regulatory & adversarial standards

**Model deployment & monitoring**
- Release models with confidence
- Monitor & know when to retrain by analyzing signals such as data drift
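For illustration, registering and versioning a trained model with the Azure ML SDK might look like the following sketch (the model path and name are placeholders, not taken from this repository):

```python
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()  # reads the workspace details from config.json

# Registering a model snapshots the artifact in the workspace registry;
# registering the same name again automatically creates a new version.
model = Model.register(
    workspace=ws,
    model_path="outputs/model.pkl",    # local path to the serialized model (placeholder)
    model_name="heart-disease-model",  # placeholder name
    tags={"area": "responsible-ai", "type": "classification"},
)
print(model.name, model.version)
```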
If you want to know more about how we implemented the machine learning workflow using Azure Machine Learning Studio, please see the following file.
As ML integrates deeply into our day-to-day business processes, transparency is critical. Azure Machine Learning helps not only to understand model behavior, but also to assess and mitigate bias.
Interpretation and model explanation in Azure Machine Learning is based on the InterpretML toolset. It helps developers and data scientists understand a model's behavior and provides explanations for the decisions made during inference. Thus, it provides transparency to customers and business stakeholders.
Use the model interpretation capability to:
- Create accurate ML models.
- Understand the behavior of a wide variety of models, including deep neural networks, during the training and inference phases.
- Perform what-if analysis to determine the impact on model predictions when feature values are changed.
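As a rough sketch of how such explanations can be produced with the interpret-community package (the data and feature names below are placeholders standing in for the preprocessed heart-disease features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from interpret_community import TabularExplainer

# Stand-in for the preprocessed heart-disease features (placeholder data)
feature_names = ["age", "cholesterol", "resting_blood_pressure",
                 "max_heart_rate_achieved", "st_depression"]
X, y = make_classification(n_samples=300, n_features=len(feature_names),
                           random_state=0)

model = RandomForestClassifier(random_state=0).fit(X, y)

# SHAP-based explainer that works with most scikit-learn style models
explainer = TabularExplainer(model, X, features=feature_names)

# Global explanation: which features drive predictions overall
global_explanation = explainer.explain_global(X)
print(global_explanation.get_feature_importance_dict())

# Local explanation: why a single patient got its prediction
local_explanation = explainer.explain_local(X[:1])
```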
Today, a challenge in building artificial intelligence systems is the difficulty of prioritizing fairness. By using Fairlearn with Azure Machine Learning, developers and data scientists can leverage specialized algorithms to ensure fairer outcomes for everyone.
Use the fairness capabilities to:
- Evaluate model bias during training and deployment.
- Mitigate bias while optimizing model performance.
- Use interactive visualizations to compare a set of recommended models that mitigate bias.
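A minimal sketch of this assess-and-mitigate loop with Fairlearn, using synthetic placeholder data and `sex` as the sensitive feature as in this dataset (the MetricFrame API shown here is from newer Fairlearn releases):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

# Synthetic placeholder data; `sex` plays the role of the sensitive feature
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 2, size=300)
sex = rng.integers(0, 2, size=300)  # 0: female, 1: male (as in the dataset)

model = LogisticRegression().fit(X, y)

# Assess: accuracy broken down by the sensitive feature
mf = MetricFrame(metrics=accuracy_score, y_true=y,
                 y_pred=model.predict(X), sensitive_features=sex)
print(mf.by_group)

# Mitigate: retrain under a demographic-parity constraint
mitigator = ExponentiatedGradient(LogisticRegression(),
                                  constraints=DemographicParity())
mitigator.fit(X, y, sensitive_features=sex)
y_mitigated = mitigator.predict(X)
```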
ML is increasingly used in scenarios that involve sensitive information, such as census data and patient medical records. Current practices, such as redacting or masking data, can be limiting for ML. To address this issue, confidential machine learning and differential privacy techniques can be used to help organizations build solutions while maintaining data privacy and confidentiality.
By using the new Differential Privacy Toolkit with Azure Machine Learning, data science teams can build ML solutions that preserve privacy and help prevent the re-identification of an individual's data. These differential privacy techniques were developed in collaboration with researchers from the Institute for Quantitative Social Science (IQSS) and the Harvard School of Engineering.
Differential privacy protects sensitive information by:
- Injecting statistical noise into the data to help prevent the disclosure of private information, without significant loss of accuracy.
- Managing exposure risk by tracking the privacy budget consumed by individual queries and limiting further queries as appropriate.
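The repository uses the opendp-whitenoise toolkit for this; conceptually, the core building block is the Laplace mechanism, sketched here without the library (a simplified illustration, not the toolkit's actual API):

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean of a bounded numeric column
    via the Laplace mechanism."""
    clipped = np.clip(values, lower, upper)
    # Sensitivity of the mean: how much one individual can change the result
    sensitivity = (upper - lower) / len(clipped)
    noise = np.random.laplace(scale=sensitivity / epsilon)
    return clipped.mean() + noise

# Example: a private estimate of patients' mean age with epsilon = 0.5
ages = np.random.randint(29, 78, size=300)  # placeholder ages
print(dp_mean(ages, lower=29, upper=78, epsilon=0.5))
```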
In addition to data privacy, organizations seek to ensure the security and confidentiality of all ML resources.
To enable secure model training and deployment, Azure Machine Learning provides a robust set of network and data protection capabilities. These include support for Azure virtual networks, private links to connect to machine learning workspaces, dedicated compute hosts, and customer-managed keys for encryption in transit and at rest.
On this secure foundation, Azure Machine Learning also enables Microsoft data science teams to build models from sensitive data in a secure environment, without being able to view that data. The confidentiality of all machine learning assets is preserved during this process. This approach is fully compatible with open-source machine learning frameworks and a wide range of hardware options. We are pleased to offer these confidential machine learning capabilities to all developers and data scientists later this year.
To build responsibly, the ML development process must be repeatable, reliable, and accountable to stakeholders. Azure Machine Learning enables decision makers, auditors, and all ML lifecycle members to support a responsible process.
Azure Machine Learning provides capabilities to automatically track lineage and maintain an audit trail of ML assets. Details such as run history, training environment, and data and model explanations are captured in a central log, allowing organizations to meet various audit requirements.
| Technology | Description |
|---|---|
| Azure Machine Learning Service | Cloud service to train, deploy, and manage machine learning models |
| AutoML | Automates the time-consuming, iterative tasks of machine learning model development |
| Differential Privacy | Protects personal information and user identity |
| InterpretML | Interprets a model by using an explainer that quantifies the amount of influence each feature contributes to the predicted label |
| Fairlearn | Python package that empowers developers of AI systems to assess their system's fairness and mitigate any observed unfairness issues |
| DataDrift | Detects data drift, the change in model input data that leads to model performance degradation |
This database contains 76 attributes, but all published experiments refer to a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "goal" field refers to the presence of heart disease in the patient. It is integer-valued: 0 (no presence) or 1 (presence).
Download the original dataset from: http://archive.ics.uci.edu/ml/datasets/Heart+Disease or https://www.kaggle.com/ronitf/heart-disease-uci
- age: age in years
- sex:
- 0: female
- 1: male
- chest_pain_type: chest pain type
- 1: typical angina
- 2: atypical angina
- 3: non-anginal pain
- 4: asymptomatic
- resting_blood_pressure: resting blood pressure (in mm Hg on admission to the hospital)
- cholesterol: serum cholesterol in mg/dl
- fasting_blood_sugar: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- rest_ecg: resting electrocardiographic results
- 0: normal
- 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
- 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- max_heart_rate_achieved: maximum heart rate achieved
- exercise_induced_angina: exercise induced angina (1 = yes; 0 = no)
- st_depression: ST depression induced by exercise relative to rest
- st_slope: the slope of the peak exercise ST segment
- 1: upsloping
- 2: flat
- 3: downsloping
- num_major_vessels: number of major vessels (0-3) colored by fluoroscopy
- thalassemia:
- 3: normal
- 6: fixed defect
- 7: reversible defect
- target: diagnosis of heart disease (angiographic disease status)
- 0: < 50% diameter narrowing
- 1: > 50% diameter narrowing
The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigator responsible for the data collection at each institution. They would be:
- Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
- University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
- University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
- V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
The dataset we use in this repository is a customized one. In the Getting Started section, we explain how it is generated and which transformations were applied to create the new dataset. It is based on the UCI Heart Disease data. The original UCI dataset has 76 columns, but Kaggle provides a version containing 14 columns; we used the Kaggle one. In this notebook, we'll explore the heart disease dataset.
As part of the exploratory analysis and preprocessing of our data, we applied several techniques that help us understand the data. The insights discovered (visualizations) were uploaded to an Azure ML experiment. As part of the study, we analyzed the target variable and checked its interaction with the other variables.
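For example, uploading an exploratory plot to an Azure ML experiment can be done with Run.log_image; a sketch (the experiment name is illustrative):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()
experiment = Experiment(ws, "exploratory-analysis")  # illustrative name
run = experiment.start_logging()

df = pd.read_csv("dataset/complete_patients_dataset.csv")

# Example insight: distribution of the target variable broken down by sex
fig, ax = plt.subplots()
sns.countplot(x="target", hue="sex", data=df, ax=ax)
run.log_image("target-by-sex", plot=fig)  # uploads the figure to the run

run.complete()
```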
The original dataset doesn't have any personal or sensitive data that could be used to identify a person or to assess fairness, apart from sex and age. Therefore, for the purpose of this project, and to show the capabilities of the differential privacy and fairness detection techniques, we created a notebook that generates a custom dataset with the following new columns and schema:
- age (Original)
- sex (Original)
- chest_pain_type: chest pain type (Original)
- resting_blood_pressure (Original)
- cholesterol (Original)
- fasting_blood_sugar (Original)
- rest_ecg (Original)
- max_heart_rate_achieved (Original)
- exercise_induced_angina (Original)
- st_depression (Original)
- st_slope (Original)
- num_major_vessels (Original)
- thalassemia (Original)
- target (Original)
- state (custom)
- city (custom)
- address (custom)
- postal code (custom)
- ssn (social security number) (custom)
- diabetic (custom)
- 0: not diabetic
- 1: diabetic
- pregnant (custom)
- 0: not pregnant
- 1: pregnant
- asthmatic (custom)
- 0: not asthmatic
- 1: asthmatic
- smoker (custom)
- 0: non-smoker
- 1: smoker
- observations (custom)
To generate the information related to state, address, city, and postal code, we downloaded data from https://github.com/EthanRBrown/rrad. From this repository we can get a list of real, random addresses that geocode successfully (tested against Google's Geocoding API service). The address data comes from the OpenAddresses project, and all the addresses are in the public domain. The addresses are deliberately not linked to people or businesses; the only guarantee is that they are real addresses that geocode successfully.
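A hypothetical sketch of how such a generator could combine the Kaggle data with rrad addresses (this is not the repository's dataset-generator notebook; the file paths and the rrad JSON field names are assumptions based on that repository's README):

```python
import json
import urllib.request

import numpy as np
import pandas as pd

# Load the 14-column Kaggle version of the UCI heart-disease data
df = pd.read_csv("dataset/uci_dataset.csv")

# Fetch random-but-real US addresses from the rrad repository
url = ("https://raw.githubusercontent.com/EthanRBrown/rrad/"
       "master/addresses-us-all.json")
addresses = json.loads(urllib.request.urlopen(url).read())["addresses"]

rng = np.random.default_rng(0)
sample = rng.choice(addresses, size=len(df))
df["state"] = [a.get("state") for a in sample]
df["city"] = [a.get("city") for a in sample]
df["address"] = [a.get("address1") for a in sample]
df["postal_code"] = [a.get("postalCode") for a in sample]

# Synthetic binary health flags (illustrative; a real generator would,
# for instance, only mark `pregnant` for rows where sex == 0)
for col in ["diabetic", "pregnant", "asthmatic", "smoker"]:
    df[col] = rng.integers(0, 2, size=len(df))

df.to_csv("dataset/complete_patients_dataset.csv", index=False)
```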
This custom dataset is the one the Azure ML steps will use. We named it "complete_patients_dataset.csv"; you can find it in ./dataset.
The solution has the following structure:

```
.
├── dataset
│   ├── complete_patients_dataset.csv
│   ├── heart_disease_preprocessed_inference.csv
│   ├── heart_disease_preprocessed_train.csv
│   ├── uci_dataset.csv
│   └── uci_dataset.yml
├── docs (documentation images)
├── infrastructure
│   ├── Scripts
│   │   └── Deploy-ARM.ps1
│   ├── deployment.json
│   ├── DeployUnified.ps1
│   └── README.md
├── src
│   ├── automated-ml (notebook to run AutoML)
│   ├── dataset-generator (notebook to generate dataset)
│   ├── deployment
│   ├── detect-fairness
│   ├── differential-privacy
│   ├── installation-libraries (notebook to install dependencies)
│   ├── mlops-pipeline
│   ├── monitoring
│   ├── notebooks-settings
│   ├── preprocessing
│   ├── retrain
│   └── utils
├── .gitignore
└── README.md
```
To run this sample, follow these steps:
- Create infrastructure
- Run notebook to install dependencies
- Run Dataset Generator Notebook
- Publish the pipeline
- Submit pipeline using API
- Activate Data Drift Detector
- Activate re-train Step
We need infrastructure in Azure to run the experiment. You can read more about the necessary infrastructure here.
To simplify creating the infrastructure, you can run the infrastructure/DeployUnified.ps1
script, indicating the Resource Group, the Location, and the Subscription Id.
The final view of the Azure resource group will be like the following image:
Note: The services you see marked with a red line will be created in the next steps. Don't worry about it!
Each notebook has an associated environment.yml file listing all the Python libraries required for the notebook's execution. We recommend using a conda environment.
Here is the basic recipe for using Conda to manage a project-specific software stack:

```
(base) $ cd project-dir
(base) $ conda env create --prefix ./env --file environment.yml
(base) $ conda activate ./env     # activate the environment
(/path/to/env) $ conda deactivate # done working on project (for now!)
```

There are more details below on creating your conda environment.
The following libraries are required:
- pylint
- numpy
- pandas
- ipykernel
- joblib
- sklearn
- azureml-sdk
- azureml-sdk[automl]
- azureml-widgets
- azureml-interpret
- azureml-contrib-interpret
- interpret-community
- azureml-monitoring
- opendp-whitenoise
- opendp-whitenoise-core
- matplotlib
- seaborn
- pandas-profiling
- fairlearn
- azureml-contrib-fairness
- azureml-datadrift
The Notebook will automatically find all Jupyter kernels installed on the connected compute instance. To add a kernel to the compute instance:
Select Open terminal in the Notebook toolbar.
Use the terminal window to create a new environment. For example:
- Create a local environment from environment.yml:

```
conda env create -f environment.yml
```

- Activate the environment after creating it:

```
conda activate <environment_name>
```

- Install pip and ipykernel in the new environment and register a Jupyter kernel for that conda env:

```
conda install pip
conda install ipykernel
python -m ipykernel install --user --name <environment_name> --display-name "Python (<environment_name>)"
```
Any of the available Jupyter kernels can be installed: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels
Use the following to install the libraries: `pip --disable-pip-version-check --no-cache-dir install pylint`
Or, inline within a Jupyter Notebook, use: `!pip install numpy`
You can now execute the notebooks successfully.
Virtual environments to execute Azure Machine Learning notebooks using Visual Studio Codespaces (optional).
This repository contains labs to help you get started with creating and deploying an Azure Machine Learning module.
To complete the labs, you'll need the following:
-
A Microsoft Azure subscription. If you don't already have one, you can sign up for a free trial at https://azure.microsoft.com or a Student Subscription at https://aka.ms/azureforstudents.
-
A Visual Studio Codespaces environment. This provides a hosted instance of Visual Studio Code, in which you'll be able to run the notebooks for the lab exercises. To set up this environment:
- Browse to https://online.visualstudio.com
- Click Get Started.
- Sign in using the Microsoft account associated with your Azure subscription.
- Click Create environment. If you don't already have a Visual Studio Online plan, create one. This is used to track resource utilization by your Visual Studio Online environments. Then create an environment with the following settings:
- Environment Name: A name for your environment - for example, MSLearn-create-deploy-azure-ml-module.
- Git Repository: leestott/create-deploy-azure-ml-module
- Instance Type: Standard (Linux) 4 cores, 8GB RAM
- Suspend idle environment after: 120 minutes
- Wait for the environment to be created, and then click Connect to connect to it. This will open a browser-based instance of Visual Studio Code.
The current Visual Studio Codespaces environment is based on Debian 10, and there are some limitations to the Azure ML SDK on Linux at present. Errors may occur in some notebooks; ensure the correct libraries and versions are installed using !pip install, and please check library dependencies.
- Simply download the folder structure and upload the entire content to Azure Machine Learning Notebooks.
- Create a new compute instance for your notebook environment.
- Install the AML prerequisites on the notebook compute host: open a notebook and select Open terminal.
- In the terminal, install all the requirements using pip install.
Inside the src folder, this project has several directories with Jupyter notebooks that you have to execute to complete the objective of this repository. The src folder contains:
- automated-ml: automated-ml.ipynb and environment.yml
- dataset-generator: dataset-generator.ipynb and environment.yml
- detect-fairness: fairlearn.ipynb and environment.yml
- differential-privacy: differential-privacy.ipynb and environment.yml
- mlops-pipeline:
- explain_automl_model_local.ipynb
- mlops-publish-pipeline.ipynb
- mlops-submit-pipeline.ipynb
- environment.yml
- monitoring: datadrift-pipeline.ipynb and environment.yml
- preprocessing: exploratory_data_analysis.ipynb and environment.yml
We recommend using a dedicated Conda environment for each of the notebooks due to library and version dependencies. If you are running this on a local machine (not a devcontainer), you will need to create the conda environments via the Anaconda Navigator, or execute the Conda installation, before doing anything inside these notebooks.
Run src/dataset-generator/dataset-generator.ipynb
to create the project dataset, built from the UCI Heart-Disease dataset specifically for the Responsible AI steps.
You can see the generated dataset in the ./dataset folder.
Run src/mlops-pipeline/mlops-publish-pipeline.ipynb
to create a Machine Learning service pipeline with Responsible AI steps and MLOps techniques that runs jobs unattended on different compute clusters.
You can see the run in the Pipelines section of the Azure Machine Learning portal.
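Under the hood, publishing a pipeline with the Azure ML SDK looks roughly like this sketch (the name and description are illustrative; the repository's notebook defines the actual steps):

```python
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline

ws = Workspace.from_config()

# `steps` stands for the pipeline steps defined earlier in the notebook
# (preprocessing, differential privacy, training, fairness, registration)
pipeline = Pipeline(workspace=ws, steps=steps)

published_pipeline = pipeline.publish(
    name="responsible-ai-pipeline",  # illustrative name
    description="Heart-disease pipeline with Responsible AI steps",
)
print(published_pipeline.id, published_pipeline.endpoint)
```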
Run src/mlops-pipeline/mlops-submit-pipeline.ipynb
to invoke the published pipeline via its REST endpoint.
You can see the run in the Pipelines section of the Azure Machine Learning portal.
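Submitting a published pipeline through its REST endpoint typically looks like the following sketch (`rest_endpoint` comes from the published pipeline of the previous step; the experiment name is illustrative):

```python
import requests
from azureml.core.authentication import InteractiveLoginAuthentication

# Build the Azure Active Directory header the pipeline endpoint expects
auth = InteractiveLoginAuthentication()
aad_token = auth.get_authentication_header()

# `rest_endpoint` is the URL of the published pipeline (see previous step)
response = requests.post(
    rest_endpoint,
    headers=aad_token,
    json={"ExperimentName": "responsible-ai-run"},  # illustrative name
)
response.raise_for_status()
print("Submitted pipeline run:", response.json().get("Id"))
```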
Run src/monitoring/datadrift-pipeline.ipynb
to create and execute the data drift detector. At the end of this notebook, you will be able to make a request with new data in order to detect drift.
Go to the Models section of the Azure Machine Learning portal. In the Details tab you can now see a new section with the Data Drift detector's status and configuration.
If the data drift coefficient is greater than the configured threshold, an alert is sent to the end user. The user can then execute the retraining pipeline to improve the model's performance, taking the newly collected data into account.
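Creating a dataset-based drift monitor with the azureml-datadrift SDK might look like the following sketch (the dataset names, compute target, feature list, and e-mail address are all assumptions):

```python
from datetime import datetime, timedelta

from azureml.core import Workspace, Dataset
from azureml.datadrift import AlertConfiguration, DataDriftDetector

ws = Workspace.from_config()
baseline = Dataset.get_by_name(ws, "heart-disease-train")    # assumed name
target = Dataset.get_by_name(ws, "heart-disease-inference")  # assumed name

monitor = DataDriftDetector.create_from_datasets(
    ws, "heart-disease-drift", baseline, target,
    compute_target="cpu-cluster",  # an existing compute cluster (assumption)
    frequency="Week",
    feature_list=["age", "cholesterol", "resting_blood_pressure"],
    drift_threshold=0.3,  # alert when the drift coefficient exceeds this
    alert_config=AlertConfiguration(["user@contoso.com"]),
)

# Analyze the last four weeks of collected data for drift
monitor.backfill(datetime.today() - timedelta(weeks=4), datetime.today())
```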
Go to the Pipelines section of the Azure ML portal and click on the latest pipeline version. Then click the Submit button. You should see something like the following image:
First, select an existing experiment or create a new one for this pipeline execution. Finally, in the same view, for the retraining process to run correctly, some parameters have to change (they can also be set from the SDK, as sketched after this list):
- use_datadrift = False
- retrain_status_differential_privacy_step = True
- retrain_status_preprocessing_step = True
- update_deployment_deploy_step = True
Once the parameters are set, we have everything ready to execute the retraining process!
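Equivalently, the published pipeline can be submitted with these parameters from the SDK; a sketch (the pipeline id and experiment name are placeholders):

```python
from azureml.core import Experiment, Workspace
from azureml.pipeline.core import PublishedPipeline

ws = Workspace.from_config()
pipeline = PublishedPipeline.get(ws, id=published_pipeline_id)  # placeholder id

experiment = Experiment(ws, "retrain-experiment")  # illustrative name
run = experiment.submit(
    pipeline,
    pipeline_parameters={
        "use_datadrift": "False",
        "retrain_status_differential_privacy_step": "True",
        "retrain_status_preprocessing_step": "True",
        "update_deployment_deploy_step": "True",
    },
)
run.wait_for_completion(show_output=True)
```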
- Azure Machine Learning(Azure ML) Service Workspace
- Azure ML CLI
- Azure Responsible AI
- Azure ML Samples
- Azure ML Python SDK Quickstart
- Azure ML MLOps Quickstart
- Azure Machine Learning
- Create a development environment for Machine Learning
- AML Python SDK
- AML Pipelines
- Getting started with Auto ML
- Intro to AML – MS Learn
- Automate model select with AML - MS Learn
- Train local model with AML - MS Learn
Tags: Azure Machine Learning Service, Machine Learning, Differential-Privacy, Fairlearn, MLOps, Data-Drift, InterpretML