Introduction
Databricks Unity Catalog provides a unified governance layer for managing data access and organizing datasets efficiently. One of its core principles is the three-level namespace, which ensures consistent data organization across different workspaces.
In this article, we'll focus on.
- Understanding the Metastore and its role.
- Explaining the three-level namespace structure.
- Querying data using SQL in a Databricks notebook.
Challenges
Before the Unity Catalog, Databricks stores the data within the workspace's Hive Metastore. This led to significant data-sharing challenges across teams and departments.
- Data Silos Across Workspaces: Each Databricks workspace had its own Hive Metastore, meaning data was not accessible across different workspaces.
- Complex and Inefficient Data Sharing: Alternatively, teams resorted to external storage solutions (e.g., mounting an external Data Lake), which required additional permissions and configurations.
- Lack of Centralized Governance: There is no single source of truth for auditing and governance, making it difficult to track who accessed what data.
Unity Catalog Metastore and three-level namespace
![Catalog metastore]()
The top-level container is the Metastore. There are two key differences between Unity Catalog's Metastore and the Hive Metastore.
- This Metastore is defined at the regional level.
- It is the centralized metadata and governance layer for organizing and securing data across multiple workspaces.
Below the Metastore, Unity Catalog structures data in a three-level namespace:
- Catalog: The first-level container. You can define multiple catalogs in your Metastore.
- Schema: The next level is schema, and it's often referred to as Database. You can have multiple schemas within a Catalog.
- For the third level, we have Table, View, Function, and Volume.
- The table stores data in delta format.
- Volume allows you to store data in any format. This ranges from structure data (db table, XML), semi-structure (JSON), or binary file (Zip file).
Let's dive into the article on how to utilize the three-level namespace.
In this article, we will use the NYC Taxi sample data from Databricks.
We will be using Databricks' SQL Query for our demo.
- After you log in to Databricks, click on the 'SQL Editor' menu.
- A New Query will be created automatically.
- You can create additional queries by clicking on the '+' next to the query tab.
![Additional queries]()
Now, we need to create the catalog and schema for our table. In order to create the schema under the catalog, we have to use the "USE CATALOG [catalog_name]" or else the schema will be created under the hive_metastore instead.
-- Create Catalog
CREATE CATALOG IF NOT EXISTS demo_catalog
COMMENT 'My Demo Catalog';
USE CATALOG demo_catalog;
-- Create Schema
CREATE SCHEMA IF NOT EXISTS nyctaxi_trips
COMMENT 'My NYC Taxi trips analysis';
USE SCHEMA nyctaxi_trips;
After running the commands above, you should see the new demo_catalog and nyctaxi_trips created from the Catalog.
![Catlog]()
With the first and second namespace levels created, we can create the table using the sample data. Since we know the catalog and schema, we can use the "USE [catalog].[schema]" command to quickly switch to the proper schema to create our table.
USE demo_catalog.nyctaxi_trips;
CREATE TABLE IF NOT EXISTS trips_2016_02 AS
SELECT *
FROM samples.nyctaxi.trips
WHERE year(tpep_pickup_datetime) = 2016
AND month(tpep_pickup_datetime) = 2;
Let's look at the Catalog section again, and we see the trips_2016_02 table.
![Catalog Section]()
We can select from our table to confirm everything is set up correctly. Instead of using the "USE" command, you can provide the full namespace as well.
SELECT *
FROM demo_catalog.nyctaxi_trips.trips_2016_02;
Results
![Result]()
Finally, let's assign all users read access to this table. In addition to granting user SELECT access to the table, we have to provide access to the catalog and schema as well.
USE demo_catalog.nyctaxi_trips;
GRANT USE CATALOG ON CATALOG demo_catalog TO `account users`;
GRANT USE SCHEMA ON SCHEMA nyctaxi_trips TO `account users`;
GRANT SELECT ON TABLE trips_2016_02 TO `account users`;
Summary
By adopting the Unity Catalog, you are able to manage and provide data access across different workspaces within a region. In addition, the three-level namespace ensures data can be organized efficiently.
Happy Learning!
References