update for cluster config and template #941

Open · wants to merge 1 commit into main
Conversation

@thesqlpro (Contributor) commented Dec 10, 2024

Type of PR

Template change for Databricks Cluster Configuration

Purpose

Update the cluster configuration and template file: a newer Spark version and auto-termination reduced to 10 minutes.
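For reference, the two changed settings would look roughly like this in a cluster spec. This is a hedged sketch using the Databricks CLI; the cluster name, node type, and worker count are illustrative placeholders, not values from the actual template:

```bash
# Sketch of a cluster spec with the two settings this PR changes:
# the 15.4 LTS runtime key and a 10-minute auto-termination timeout.
# All other fields are hypothetical placeholders.
databricks clusters create --json '{
  "cluster_name": "ddo-sample-cluster",
  "spark_version": "15.4.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 1,
  "autotermination_minutes": 10
}'
```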

Does this introduce a breaking change? If yes, details on what can break

Yes: configurations and notebooks that reference DBFS (the internal Databricks file system). Investigation notes are in the comments below.

Author pre-publish checklist

Issues Closed or Referenced

@thesqlpro self-assigned this Dec 10, 2024
@ydaponte (Contributor) left a comment

This change alone will not work - you need to deploy the sample e2e and test:

  1. The deployment itself (running deploy.sh), because when the deployment tries to install the libraries on the cluster, they will fail on this version: DBFS will no longer be available on the cluster.
  2. The ADF pipelines, which will fail because the libraries will not be properly installed. My suggestion is to deploy the solution with the Spark 15.4 version and start fixing the problems in the interface; once you know what to change, integrate it into the automation, run the deployment e2e again, and make sure it is working before submitting the PR (see the library-status check sketched after this comment).

Also, please update the PR metadata properly: describe the Type of PR (you left all the template bullets in), include the validation steps, and note which issues will be closed or referenced when the PR closes. The body of the PR currently looks just like the template. Thanks!
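One way to check that the libraries actually installed (point 2 above) is a quick status query. This is a sketch, not part of the existing automation; it assumes the Databricks CLI already used by configure_Databricks.sh and the same $cluster_id variable:

```bash
# Query the install status of every library on the cluster;
# a broken wheel upload typically shows up as "status": "FAILED".
# ($cluster_id is assumed to be set by the deployment script.)
databricks libraries cluster-status --cluster-id "$cluster_id"
```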

@thesqlpro (Contributor, Author) commented Dec 13, 2024

Documenting the issues @ydaponte listed.

From configure_Databricks.sh, lines 88-102:
echo "Uploading libs TO dbfs..."
databricks fs cp --recursive --overwrite "./databricks/libs/ddo_transform-localdev-py2.py3-none-any.whl" "dbfs:/ddo_transform-localdev-py2.py3-none-any.whl"

Create JSON file for library installation

json_file="./databricks/config/libs.config.json"
cat < $json_file
{
"cluster_id": "$cluster_id",
"libraries": [
{
"whl": "dbfs:/ddo_transform-localdev-py2.py3-none-any.whl"
}
]
}
EOF

Databricks recommends using Unity Catalog volumes or workspace files instead:
https://learn.microsoft.com/en-us/azure/databricks/init-scripts/cluster-scoped
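A volume-based replacement for the snippet above might look like the following. This is a sketch that assumes the newer Databricks CLI (which addresses volumes as dbfs:/Volumes/...) and a pre-created volume; main/default/libs are placeholder catalog/schema/volume names:

```bash
# Upload the wheel to a Unity Catalog volume instead of the DBFS root
# (catalog/schema/volume names below are hypothetical).
volume_whl="/Volumes/main/default/libs/ddo_transform-localdev-py2.py3-none-any.whl"
databricks fs cp --overwrite \
  "./databricks/libs/ddo_transform-localdev-py2.py3-none-any.whl" \
  "dbfs:$volume_whl"

# Point the library installation config at the volume path.
json_file="./databricks/config/libs.config.json"
cat <<EOF > "$json_file"
{
  "cluster_id": "$cluster_id",
  "libraries": [
    { "whl": "$volume_whl" }
  ]
}
EOF
```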

All of the notebooks reference a DBFS location (still testing whether this will keep working). The backwards-compatibility path is /FileStore/tables/; dbfs:/mnt/ is no longer the standard. The recommendation is to use volumes here as well, for example:

    /Volumes/<catalog>/<schema>/<volume>/files.csv

That way files and init scripts live in Unity Catalog.
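If the sample data itself has to move off the mount, a one-off copy into a volume could look like this. Both paths are hypothetical placeholders, and it again assumes the newer CLI's dbfs:/Volumes/ addressing:

```bash
# One-off migration of notebook input files from the legacy mount
# to a Unity Catalog volume (both paths are hypothetical).
databricks fs cp --recursive \
  "dbfs:/mnt/datalake/data" \
  "dbfs:/Volumes/main/default/landing/data"
```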

@ydaponte (Contributor) commented
@thesqlpro - yes, precisely; that was a temporary solution until we have Unity Catalog and the new Spark version in place. I mentioned this in the very first sprint, which is why I said the DBFS file system will no longer be available with the 15.4 version.

Development

Successfully merging this pull request may close these issues.

Upgrade Spark version to 15.4 LTS