Using Spark to Read and Write Data in Microsoft Fabric Data Warehouse

Introduction

Microsoft Fabric has released the Spark Connector for Microsoft Fabric Data Warehouse, which enables data engineers to read and write data in a Fabric Data Warehouse directly from Spark notebooks. This eliminates the need for complex ETL processes or intermediary storage.

Previously, Lakehouse and Data Warehouses operated in separate environments, requiring additional effort to move data between them. With this new connector, data engineers can now streamline their workflows, reducing complexity and increasing efficiency.

In this article, we will discuss:

  • The challenges of using Spark with the data warehouse.
  • How the Spark Connector for Fabric Data Warehouse streamlines the process.
  • A walkthrough of sample code utilizing the connector.

Challenge: Using Spark and Data Warehouse

Before the Spark Connector for Microsoft Fabric Data Warehouse, data engineers had to take multiple steps to move and transform data between Lakehouse and the data warehouse. A common workflow involved:

  1. Process data in Spark: Ingest, clean, and transform data, then store it in the Lakehouse. Typically, this is done using Spark.
  2. Transform data for the business: Since Fabric Data Warehouse relies on T-SQL, engineers had to write separate SQL scripts to move the cleaned data from the Lakehouse into the dimensional model.
  3. Orchestrate with data pipelines: The process required Data Factory and Data Pipelines to automate the movement and transformation steps, adding complexity and operational overhead.

This setup meant working with multiple languages (PySpark for Spark and T-SQL for the warehouse), integrating different tools, and ensuring data was consistently transferred and transformed across environments. As a result, maintaining these workflows required additional effort, increased latency, and introduced potential points of failure.
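For context, here is a minimal sketch of the pre-connector pattern. The file path, table, and column names are illustrative assumptions; the key point is that the Spark side stops at the Lakehouse, and a separate T-SQL step is still needed to load the warehouse.

    # Pre-connector pattern (illustrative): Spark work ends at the Lakehouse.
    from pyspark.sql import functions as F

    raw = spark.read.format("csv").option("header", "true").load("Files/raw/sales.csv")

    cleaned = (
        raw.dropna(subset=["OrderId"])                       # basic cleaning
           .withColumn("OrderDate", F.to_date("OrderDate"))  # example transformation
    )

    # 1. Land the cleaned data as a Lakehouse table.
    cleaned.write.mode("overwrite").saveAsTable("My_Lakehouse.sample_sales_data")

    # 2. A separate T-SQL script (often orchestrated with a Data Factory pipeline)
    #    was then required to load the warehouse, e.g.:
    #    INSERT INTO dbo.sample_sales_data SELECT ... FROM [My_Lakehouse].[dbo].[sample_sales_data];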

Solution: A Unified Data Engineering Experience

The Spark Connector for Microsoft Fabric Data Warehouse eliminates these pain points by allowing direct read and write operations between Spark and the Fabric Data Warehouse using Notebooks.

Key benefits include:

  • Unified data engineering workflow: Data ingestion, transformation, and enrichment can all be done within Spark notebooks, with no separate T-SQL scripts required.
  • Improved performance: Direct integration removes the need for extra data movement steps, leading to optimized data transfers and lower processing overhead.
  • Enhanced adoption of Microsoft Fabric: This makes Microsoft Fabric a more attractive end-to-end data platform, allowing teams to manage both big data processing and structured analytics in a single environment.
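
To illustrate, here is a minimal sketch of the unified pattern, using the table names from the tutorial below; the added LoadDate column is just an illustrative transformation.

    import com.microsoft.spark.fabric  # registers the synapsesql reader/writer
    from pyspark.sql import functions as F

    # Read from the Lakehouse, transform, and write to the warehouse in one notebook.
    sales = spark.sql("SELECT * FROM My_Lakehouse.sample_sales_data")
    enriched = sales.withColumn("LoadDate", F.current_date())  # example enrichment
    enriched.write.mode("overwrite").synapsesql("My_Warehouse.dbo.sample_sales_data")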

Tutorial: Reading and Writing Data Using the Spark Connector

Let’s dive into a hands-on example demonstrating how to use the Spark Connector for reading and writing data between a notebook and the Fabric Data Warehouse.

  1. We will be using the Lakehouse and data created in the Getting Started with MS Fabric article. Please visit the article to set up your Lakehouse and sample sales data.
  2. To create a new Data Warehouse, we open 'My workspace', click 'New item', and select 'Warehouse'. We name our warehouse 'My_Warehouse' and click 'Create'.
  3. After the warehouse is created, we go back to 'My workspace'. This time, we will create a Notebook.
  4. Let's rename the Notebook to 'Working with DWH'.
  5. The next steps are to attach our Lakehouse and Warehouse to the Notebook.
    • Under 'Explorer', click 'Lakehouses' and then click 'Add'. In the Add lakehouses popup, select 'Existing lakehouse(s) without Schema', click 'My_Lakehouse', and then click 'Connect'.
    • For the data warehouse, we click on 'Warehouses' under Explorer and click 'Add'. In the 'Select warehouse...' popup, select 'My_Warehouse' and click 'Confirm'.
  6. Before we start developing the notebook, we need to ensure the Spark Runtime version is 1.3 or newer. You can find the instructions here: Upgrade to Spark Runtime 1.3. A quick check is shown below.
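    As a quick check (assuming, per the Fabric documentation, that Runtime 1.3 ships with Apache Spark 3.5), you can print the Spark version from a notebook cell:
    # Should report 3.5.x or newer when the workspace is on Runtime 1.3+
    print(spark.version)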
  7. Now we are ready to develop our code in the Notebook. As a first step, we will make sure we can select from our Lakehouse. The following should display 10 rows from our sample_sales_data table.
    # Read the sample sales data from the attached Lakehouse
    df_sales = spark.sql("SELECT * FROM My_Lakehouse.sample_sales_data")

    display(df_sales.limit(10))
    
  8. The next step is to import the com.microsoft.spark.fabric libraries to enable the Spark Connector for the Warehouse.
    import com.microsoft.spark.fabric
    from com.microsoft.spark.fabric.Constants import Constants
  9. Let's write our df_sales DataFrame to the warehouse. This is a simple example of copying the data to the Warehouse using overwrite mode. If you are doing incremental loads, you can use append mode instead, as sketched below.
    df_sales.write.mode("overwrite").synapsesql("My_Warehouse.dbo.sample_sales_data")
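    For example, an incremental (append) load might look like the following sketch; the OrderDate column and the filter value are hypothetical and depend on your sample data.
    # Hypothetical incremental load: append only the newest rows.
    new_rows = df_sales.filter(df_sales.OrderDate >= "2024-01-01")  # illustrative filter
    new_rows.write.mode("append").synapsesql("My_Warehouse.dbo.sample_sales_data")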
  10. To confirm that we wrote to My_Warehouse successfully, let's retrieve the data and display 10 rows again.
    df_dwh_sales2 = spark.read.synapsesql("My_Warehouse.dbo.sample_sales_data")
    display(df_dwh_sales2.limit(10))
  11. In some situations, you will need to read from or write to a Warehouse that is in a different workspace. In this scenario, you can explicitly define the workspaceId. The workspaceId can be found in the URL.
    # https://app.powerbi.com/groups/[workspaceId]
    df_dwh_sales2 = spark.read.option(Constants.WorkspaceId, "[workspaceId]").synapsesql("My_Warehouse.dbo.sample_sales_data")
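    The same option can be set on a write. The sketch below assumes a warehouse named 'Other_Warehouse' in the target workspace; both the warehouse name and the workspaceId are placeholders.
    # Write to a warehouse in a different workspace (names are placeholders).
    df_sales.write.mode("overwrite") \
        .option(Constants.WorkspaceId, "[workspaceId]") \
        .synapsesql("Other_Warehouse.dbo.sample_sales_data")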

Summary

The Spark Connector for Microsoft Fabric Data Warehouse allows data professionals to use Spark for end-to-end data solutions, eliminating the need for multiple tools and languages. This simplifies workflows, reduces complexity, and improves supportability, enabling data teams to deliver insights faster to the business. With a more streamlined approach, organizations can build scalable and efficient data pipelines within Microsoft Fabric.

Happy Learning!
