Migrating from the Databricks Hive Metastore to Databricks Unity Catalog involves several steps to ensure a smooth transition of metadata management and access control. Here’s a general approach to the migration process:
1. Plan and Prepare the Migration
- Understand the Differences: Unity Catalog provides stronger governance and access control than the Hive Metastore, including centralized, fine-grained permissions granted to users and groups across workspaces (the RBAC model used throughout this guide), a three-level namespace (catalog.schema.table), and built-in data lineage tracking.
- Review Documentation: Ensure familiarity with Unity Catalog’s features and limitations by reviewing Databricks’ official documentation and understanding its differences with the legacy Hive Metastore.
- Identify Assets: Audit the Hive Metastore to determine which databases, tables, views, and permissions need to be migrated, and assess any dependencies on the current Hive Metastore (see the inventory sketch after this list).
- Governance & Security Model: Plan a new governance model in Unity Catalog by leveraging RBAC. Define which roles will access which resources based on Unity Catalog’s capabilities.
- Backups: Create backups of Hive Metastore data to prevent data loss in case of any issues during migration.
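A minimal inventory sketch is shown below; it assumes the cluster can still read the legacy hive_metastore catalog, and it simply lists whatever is registered in your workspace:
# List every database and table still registered in the legacy hive_metastore
rows = []
for r in spark.sql("SHOW DATABASES IN hive_metastore").collect():
    db = r[0]
    for t in spark.sql(f"SHOW TABLES IN hive_metastore.`{db}`").collect():
        rows.append((db, t.tableName))
inventory = spark.createDataFrame(rows, ["database", "table"])
display(inventory)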
2. Set Up Unity Catalog
- Enable Unity Catalog: Ensure Unity Catalog is enabled for your Databricks account and workspace; an account admin can enable it from the account console if it is not already.
- Create Metastore in Unity Catalog: A Unity Catalog metastore must be created, and workspaces need to be attached to it.
- Assign Roles & Permissions: Create and assign roles to users and groups based on the governance model defined earlier. Unity Catalog uses fine-grained access control, so this step is crucial (a grant sketch follows this list).
- Configure External Storage Locations: Set up storage locations that Unity Catalog will manage. Unity Catalog supports managing external locations for data, and you will need to define these based on your data architecture.
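As a hedged illustration of the grant step, the snippet below gives a read-only analyst group and a read/write engineering group access to one catalog and schema; the catalog, schema, and group names are placeholders for your own governance model:
# Baseline grants for two example groups (all names are placeholders)
spark.sql("GRANT USE CATALOG ON CATALOG my_catalog TO `data-analysts`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA my_catalog.my_schema TO `data-analysts`")
spark.sql("GRANT ALL PRIVILEGES ON SCHEMA my_catalog.my_schema TO `data-engineers`")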
3. Data Migration
- Migrate Hive Metastore Metadata: Use Databricks tooling to copy or upgrade metadata from the Hive Metastore into Unity Catalog; the SYNC SQL command upgrades external tables and schemas in place, and the Databricks Labs UCX project provides assessment and bulk-migration workflows (see the sketch at the end of this section). You can also manually export and import schema definitions if needed.
- For large-scale migrations, consider automating the process with Databricks Jobs or scripts.
- Map Storage Locations: Ensure that table storage locations from the Hive Metastore match those in Unity Catalog, especially for external tables.
- Migrate Table and View Definitions: Recreate the table and view definitions in Unity Catalog based on the schema you imported from Hive Metastore.
- Handle Delta Tables: If using Delta Lake, re-register the Delta tables in Unity Catalog.
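The sketch below illustrates the two common upgrade paths, assuming my_catalog.my_schema already exists: SYNC for external tables (it registers the existing storage location in Unity Catalog) and DEEP CLONE for managed Delta tables (it copies the data into a Unity Catalog managed table). All catalog, schema, and table names are placeholders:
# Dry-run, then perform, the in-place upgrade of an external schema
spark.sql("SYNC SCHEMA my_catalog.my_schema FROM hive_metastore.my_database DRY RUN")
spark.sql("SYNC SCHEMA my_catalog.my_schema FROM hive_metastore.my_database")

# Managed Hive Metastore Delta tables are copied rather than synced
spark.sql("""
  CREATE TABLE IF NOT EXISTS my_catalog.my_schema.my_table
  DEEP CLONE hive_metastore.my_database.my_table
""")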
4. Test and Validate
- Run Queries: Test querying the migrated tables and views from Unity Catalog to ensure the data and metadata are correctly registered (a row-count check sketch follows this list).
- Validate Permissions: Check if the new RBAC settings are correctly applied by testing with different roles and users.
- Check Data Lineage: Unity Catalog provides data lineage capabilities, so validate that the lineage is being tracked correctly for data transformations and queries.
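A lightweight validation sketch (table names are placeholders) is to compare row counts between each legacy table and its Unity Catalog counterpart while both are still available:
# Compare a migrated table against its legacy source
legacy = spark.table("hive_metastore.my_database.my_table").count()
migrated = spark.table("my_catalog.my_schema.my_table").count()
assert legacy == migrated, f"Row count mismatch: {legacy} vs {migrated}"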
5. Update Workflows & Jobs
- Update Notebooks and Jobs: Update any notebooks, ETL jobs, and workflows to point to Unity Catalog instead of the Hive Metastore (see the default-catalog sketch after this list).
- Update Cluster Configurations: If you have clusters configured to use Hive Metastore, update their configurations to use Unity Catalog instead.
- Update Data Access Patterns: Any custom applications or reporting tools that rely on the Hive Metastore should be reconfigured to use Unity Catalog.
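One low-churn pattern, sketched below, is to set the session’s default catalog and schema at the top of each notebook or job so that existing two-part table names resolve against Unity Catalog instead of the Hive Metastore (catalog and schema names are placeholders):
# Make unqualified and two-part names resolve against Unity Catalog
spark.sql("USE CATALOG my_catalog")
spark.sql("USE SCHEMA my_schema")
df = spark.table("my_table")  # now resolves to my_catalog.my_schema.my_table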
6. Decommission the Hive Metastore
- Disable Hive Metastore: After validating the migration and ensuring that all dependencies are updated, you can decommission the Hive Metastore.
- Clean Up Old Data: Optionally, clean up old metadata or configurations related to the Hive Metastore.
7. Monitor and Optimize
- Monitor Access and Usage: Use Unity Catalog’s audit and access logs to monitor how users are interacting with the new catalog and adjust permissions if necessary (see the audit-query sketch after this list).
- Optimize Performance: Use Unity Catalog’s data discovery features and usage information to understand how data is accessed, and continue to apply standard Delta Lake optimizations where they are needed.
- Ongoing Maintenance: Regularly review data access policies and metadata updates within Unity Catalog to ensure data governance remains effective.
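If the system tables schema is enabled in your account, a query along these lines surfaces recent Unity Catalog activity (a sketch only; the filters may need adjusting to your audit-log setup):
# Recent Unity Catalog audit events (assumes system.access.audit is enabled)
recent = spark.sql("""
  SELECT event_time, user_identity.email AS user, action_name, request_params
  FROM system.access.audit
  WHERE service_name = 'unityCatalog'
    AND event_date >= date_sub(current_date(), 7)
  ORDER BY event_time DESC
""")
display(recent)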
This process ensures a secure and structured migration while leveraging the advanced governance and security capabilities of Unity Catalog.
To complete the migration from Databricks Hive Metastore to Unity Catalog, along with replacing Azure mount paths and updating notebooks, the following additional steps can be taken:
8. Replacing Azure Mount Paths
When moving from the Hive Metastore to Unity Catalog, it’s essential to update the way your data storage is referenced, especially if you are using Azure Data Lake Storage (ADLS) or Blob Storage mount paths in Databricks. Unity Catalog uses external locations and credentials for data access.
Steps to Replace Mount Paths:
- Identify Current Mounts: List all your current mounted paths using the following command in a Databricks notebook:
dbutils.fs.mounts()
- Unmount Existing Paths: For each mount that is no longer needed or is being replaced with Unity Catalog external locations, unmount the path:
dbutils.fs.unmount("/mnt/<mount-name>")
- Set Up External Locations in Unity Catalog: Unity Catalog manages data access using external locations, which are tied to cloud storage paths. You’ll need to define these external locations by:
- Creating a Storage Credential in Unity Catalog (linked to your Azure account).
- Creating an External Location that maps to your ADLS or Blob Storage path.
- Assigning access permissions (RBAC) to roles or users for those external locations.
- Here’s how to create an external location (replace placeholders accordingly):
CREATE EXTERNAL LOCATION my_data_location
URL 'abfss://<container>@<storage-account>.dfs.core.windows.net/<path>'
WITH (STORAGE CREDENTIAL my_storage_credential);
- Update Notebook References to Use Unity Catalog External Locations: Modify all Databricks notebooks that use dbutils.fs.mount() so that they reference Unity Catalog tables or external locations instead.
- Old way, using a mount:
dbutils.fs.mount(source="wasbs://<container>@<storage-account>.blob.core.windows.net", mount_point="/mnt/data")
- New way, using Unity Catalog: Update your notebook logic to access the data directly through Unity Catalog tables or external locations, without mounting, as sketched below.
spark.sql("SELECT * FROM my_catalog.my_schema.my_table")
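Expanding on that, a hedged sketch of the two replacement access patterns in a notebook; the names are placeholders and the abfss path must fall under an external location you created:
# 1) Governed table access through the three-level namespace
df = spark.table("my_catalog.my_schema.my_table")

# 2) Direct path access against cloud storage (no mount), authorised by the
#    external location's storage credential
raw = spark.read.format("parquet").load(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/data"
)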
9. Updating Notebooks
After migrating to Unity Catalog, notebooks that reference Hive Metastore tables or use legacy storage paths need to be updated:
- Update Table References: If notebooks refer to tables stored in the Hive Metastore, update them to Unity Catalog’s three-level naming convention:
- Old (Hive Metastore):
SELECT * FROM my_database.my_table;
- New (Unity Catalog):
SELECT * FROM my_catalog.my_schema.my_table;
- Update Data Access Patterns: If notebooks accessed data via mounted paths created with dbutils.fs.mount(), change those references to direct table access through Unity Catalog or to external location paths (a before/after sketch follows this list).
- Review and Refactor Permissions: Unity Catalog enforces fine-grained access control, so ensure that any references to user permissions or table access in notebooks are aligned with the new RBAC model.
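An illustrative before/after for a typical notebook cell (the mount path and table name are placeholders):
# Before: reading through a workspace mount
# df = spark.read.format("delta").load("/mnt/data/sales")

# After: reading the governed Unity Catalog table
df = spark.table("my_catalog.my_schema.sales")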
10. Review Cluster Configuration and Policies
- Update Clusters to Use Unity Catalog: Ensure that clusters accessing Unity Catalog data run a supported Databricks Runtime and use a Unity Catalog-capable access mode, and that their workspace is attached to the Unity Catalog metastore.
- Cluster Policy Review: Review any cluster policies to ensure they comply with Unity Catalog’s security model. You may need to create new policies for data access through Unity Catalog, as it supports more granular access controls than the Hive Metastore.
11. Review and Update Jobs and Workflows
- Scheduled Jobs: Any jobs that previously referred to Hive Metastore tables need to be updated to use Unity Catalog table names (see the parameterization sketch after this list).
- Workflow Automation: If you have automated workflows (e.g., using Databricks Workflows), check if they depend on legacy mounts or Hive Metastore and update them to align with Unity Catalog’s data access patterns.
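One way to make migrated jobs catalog-aware, sketched below, is to pass the catalog and schema as job parameters rather than hard-coding them (the widget names and defaults are placeholders):
# Parameterize the target catalog and schema for a scheduled notebook
dbutils.widgets.text("catalog", "my_catalog")
dbutils.widgets.text("schema", "my_schema")
spark.sql(f"USE CATALOG {dbutils.widgets.get('catalog')}")
spark.sql(f"USE SCHEMA {dbutils.widgets.get('schema')}")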
12. Handle Delta Lake Tables (Optional)
- Re-register Delta Tables: If your notebooks or jobs use Delta Lake tables, re-register these tables under Unity Catalog (see the sketch at the end of this section). Unity Catalog offers enhanced governance over Delta tables.
- Access Delta Tables: After migration, your notebooks should access Delta tables via Unity Catalog:
SELECT * FROM my_catalog.my_schema.my_delta_table;
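For external Delta tables, a hedged re-registration sketch looks like the following; the path must sit inside an external location you have defined, and all names are placeholders:
# Re-register an existing external Delta table under Unity Catalog
spark.sql("""
  CREATE TABLE IF NOT EXISTS my_catalog.my_schema.my_delta_table
  USING DELTA
  LOCATION 'abfss://<container>@<storage-account>.dfs.core.windows.net/delta/my_delta_table'
""")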
13. Test All Data Access and Jobs
- Test Notebooks: Validate that all notebook queries return the expected results and that access patterns have been correctly updated to Unity Catalog.
- Test Jobs: Ensure all automated jobs and workflows continue to function correctly with updated table names, paths, and permissions.
14. Ongoing Maintenance
- Monitor Jobs and Permissions: Use Unity Catalog’s auditing capabilities to track job execution and permission usage. This will help maintain data governance standards after the migration.
- Data Governance: Periodically review and refine the data governance strategy as your environment grows.
By carefully following these steps and handling mount paths, notebooks, jobs, and storage updates, you can ensure a smooth migration from the Databricks Hive Metastore to Unity Catalog while leveraging its advanced security and governance capabilities.