Databricks clusters

Question

You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:

A workload for data engineers who will use Python and SQL.

A workload for jobs that will run notebooks that use Python, Scala, and SQL.

A workload that data scientists will use to perform ad hoc analysis in Scala and R.

The enterprise architecture team at your company identifies the following standards for Databricks environments:

The data engineers must share a cluster.

The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster.

All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists.

You need to create the Databricks clusters for the workloads.

Solution1: You create a High Concurrency cluster for each data scientist, a High Concurrency cluster for the data engineers, and a Standard cluster for the jobs.

Solution2: You create a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a Standard cluster for the jobs.

Does either of these solutions meet the goal? Are any other solutions possible, and why?

Eliana Blake · Answer

Solution2 meets the goal; Solution1 does not. The deciding factors are language support and auto-termination behavior, with cost as a secondary consideration. Let's analyze each solution:

1. Solution1:
- High Concurrency cluster for each data scientist: This is where the solution fails. In the classic cluster modes, High Concurrency clusters support only SQL, Python, and R, so the data scientists could not run their Scala workloads. High Concurrency clusters are also optimized for multiple concurrent users, so giving one to each person raises cost without any benefit, and they do not terminate automatically by default, so the 120-minute limit would have to be configured explicitly.
- High Concurrency cluster for the data engineers: This aligns with the requirement that the data engineers share a cluster, and Python and SQL are both supported. Resource contention is still possible depending on the workload, so size the cluster accordingly.
- Standard cluster for the jobs: Standard clusters run Python, Scala, SQL, and R, so they can run the packaged notebooks.

2. Solution2:
- Standard cluster for each data scientist: Standard clusters are recommended for single users and support Scala and R. By default they also terminate automatically after 120 minutes of inactivity, which matches the requirement, and they help control costs.
- High Concurrency cluster for the data engineers: This aligns with the requirement for the data engineers to share a cluster. High Concurrency clusters handle multiple users efficiently and support Python and SQL.
- Standard cluster for the jobs: Same as in Solution1. The jobs run Scala notebooks, so they need a Standard cluster.

As for other solutions: any variant that places a Scala workload on a High Concurrency cluster fails for the same reason as Solution1. For example, a High Concurrency cluster for the jobs would not work, because the job notebooks use Scala. A Standard cluster for the data engineers would support their languages, but High Concurrency is the better fit for a shared cluster.

Within Solution2, factors like budget constraints and workload intensity still affect cluster sizing, so it is worth observing actual usage patterns and tuning the configurations accordingly. If you have any specific constraints regarding cost, performance, or resource sharing, please let me know so I can provide a more tailored recommendation!
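To make the comparison concrete, here is a minimal Python sketch that checks each solution against the languages every workload needs. It assumes the classic cluster modes, where High Concurrency clusters support only SQL, Python, and R while Standard clusters also support Scala. The names `SUPPORTED_LANGUAGES`, `WORKLOADS`, `SOLUTIONS`, `unsupported_languages`, and `data_scientist_cluster_spec` are hypothetical and exist only for this illustration. The spec helper uses the `cluster_name`, `num_workers`, and `autotermination_minutes` fields of the Databricks Clusters API, and leaves out workspace-specific fields such as the runtime version and node type.

```python
# Languages each classic cluster mode can run.
# Assumption: High Concurrency has no Scala support.
SUPPORTED_LANGUAGES = {
    "Standard": {"python", "scala", "sql", "r"},
    "High Concurrency": {"python", "sql", "r"},
}

# Languages each workload needs, taken from the question.
WORKLOADS = {
    "data_engineers": {"python", "sql"},
    "jobs": {"python", "scala", "sql"},
    "data_scientists": {"scala", "r"},
}

# Cluster mode chosen for each workload in the two proposed solutions.
SOLUTIONS = {
    "Solution1": {
        "data_scientists": "High Concurrency",
        "data_engineers": "High Concurrency",
        "jobs": "Standard",
    },
    "Solution2": {
        "data_scientists": "Standard",
        "data_engineers": "High Concurrency",
        "jobs": "Standard",
    },
}


def unsupported_languages(solution):
    """Return {workload: languages the chosen cluster mode cannot run}."""
    problems = {}
    for workload, mode in solution.items():
        missing = WORKLOADS[workload] - SUPPORTED_LANGUAGES[mode]
        if missing:
            problems[workload] = missing
    return problems


def data_scientist_cluster_spec(owner, num_workers=2):
    """Build a partial, hypothetical cluster spec for one data scientist.

    Workspace-specific fields (runtime version, node type) are omitted
    and must be filled in before the spec could be submitted.
    """
    return {
        "cluster_name": f"ds-{owner}",
        "num_workers": num_workers,
        # Terminate after 120 minutes of inactivity, as the standard requires.
        "autotermination_minutes": 120,
    }


if __name__ == "__main__":
    for name, solution in SOLUTIONS.items():
        problems = unsupported_languages(solution)
        if problems:
            print(f"{name} fails: {problems}")
        else:
            print(f"{name} meets the language requirements")
```

Running the script reports Solution1 as failing because the data scientists' Scala workload lands on a High Concurrency cluster, while Solution2 passes for all three workloads. Setting `autotermination_minutes` explicitly, rather than relying on a default, keeps the 120-minute requirement visible in the configuration.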