Azure Data Lake

<p><em>TL;DR: Clone this&nbsp;</em><a href="https://github.com/rebremer/blog-databrickshubspoke-git" rel="noopener ugc nofollow" target="_blank"><em>git project</em></a><em>, set the params and run 0_script.sh to deploy 1 ADLSgen2 hub and N Databricks spokes</em></p> <p>A&nbsp;<a href="https://en.wikipedia.org/wiki/Data_lake" rel="noopener ugc nofollow" target="_blank">data lake</a>&nbsp;is a centralized repository of data that allows enterprises to create business value from data. Azure Databricks is a popular tool to analyze data and build data pipelines. This blog discusses how Azure Databricks can be connected to an ADLSgen2 storage account in a secure and scalable way. Three principles are key:</p> <ul> <li>Defense in depth: ADLSgen2 contains sensitive data and must be secured using private endpoints and Azure AD (<a href="https://docs.microsoft.com/en-us/azure/storage/common/shared-key-authorization-prevent?tabs=portal" rel="noopener ugc nofollow" target="_blank">disabling access keys</a>). Databricks can only access ADLSgen2 using private link and Azure AD</li> <li>Access control: Business units typically have their&nbsp;<a href="https://github.com/Azure/AzureDatabricksBestPractices/blob/master/toc.md#map-workspaces-to-business-divisions" rel="noopener ugc nofollow" target="_blank">own Databricks workspace</a>. Multiple workspaces must be granted access to ADLSgen2 File Systems using Role Based Access Control (RBAC)</li> <li>Hub/spoke architecture: Only one hub network can access the ADLSgen2 account using private link. Databricks spoke networks peer to the hub network to simplify networking</li> </ul> <p><a href="https://towardsdatascience.com/how-to-connect-databricks-to-your-azure-data-lake-ff499f4ca1c"><strong>Read More</strong></a></p>
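<p>To make the "Azure AD only" point concrete, here is a minimal sketch of how a Databricks notebook might authenticate to ADLSgen2 with an Azure AD service principal instead of an access key. It is an illustration under assumptions, not the linked project's actual code: the helper names <code>adls_oauth_conf</code> and <code>abfss_uri</code>, the account name, the file system name and the credentials are all hypothetical, and the service principal is assumed to already hold an RBAC role (for example Storage Blob Data Reader) on the file system.</p>

```python
def adls_oauth_conf(storage_account, tenant_id, client_id, client_secret):
    """Build the Spark settings for Azure AD (OAuth) access to one ADLSgen2 account.

    Hypothetical helper: all four arguments are placeholders supplied by the caller.
    """
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        # Use OAuth (Azure AD) rather than the storage account access key.
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def abfss_uri(file_system, storage_account, path=""):
    """Return the abfss:// URI for a path inside an ADLSgen2 file system."""
    return f"abfss://{file_system}@{storage_account}.dfs.core.windows.net/{path.lstrip('/')}"


# Example values; every name below is made up for illustration.
conf = adls_oauth_conf("mydatalake", "my-tenant-id", "my-client-id", "my-client-secret")
uri = abfss_uri("businessunit1", "mydatalake", "/raw/sales")

# Inside a Databricks notebook, where `spark` is predefined, one would then run:
#   for key, value in conf.items():
#       spark.conf.set(key, value)
#   df = spark.read.parquet(uri)
```

<p>In practice the client secret would come from a Databricks secret scope rather than a literal string, and the read only succeeds if the workspace network can reach the storage account through the hub's private endpoint.</p>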
Tags: Azure Data