Tag: Databricks

Databricks Acquires MosaicML and Other Recent AI Acquisitions

Though the economy has been quite dynamic, AI is still a hot market. There have been a few major acquisitions and mergers over the past couple of weeks, each potentially redefining the market in the near future. So let's take a look at some of the biggest newsmakers making their rounds and if ...

Mastering MLOps: Building a Powerful MLOps Platform with Databricks

In the ever-evolving landscape of data-driven technologies, organizations are constantly on the lookout for platforms that can streamline their data analytics and machine learning workflows. Databricks — a powerful and innovative cloud-based platform that has garnered significant attention for...

Databricks vs Snowflake

Introduction: Data is the backbone of any business, and managing it efficiently can give organisations a competitive edge. As data volumes continue to grow, businesses are turning to cloud-based data platforms to store, process, and analyze their data. Two popular options for cloud-based data ...

Intro to Databricks with PySpark

First, we'll quickly go over the fundamental ideas behind Apache Spark and Databricks, their relationship with one another, and how to utilize them to model and analyze big data. Why should we use Databricks? Big data and machine learning-related tasks are primarily carri...

I Passed the Databricks Certified Data Engineer Associate Exam: Here’s How You Can Too

As a consultant data engineer, I have worked with Databricks extensively over the past months. I decided to apply the knowledge I had gained to their certification exams. I'm also aware that there are two schools of thought regarding certifications, especially in the tech environment. Some beli...

Database Change Management on Lakehouse with Databricks SQL and Liquibase

Change is hard. Change in software is even harder, so with any software project, engineers set up a process for change management using a series of tools such as Git, GitHub Actions, Azure DevOps, Jenkins, CircleCI, Terraform, and many, many more. However, missing from these toolsets is the core concept o...

Using Spark on client side applications via Databricks Connect

The Data + AI Summit 2023 presented some new interesting capabilities of the Databricks platform and the Spark ecosystem as a whole. One of the most interesting releases for application development is Spark Connect, which is available as a service in Databricks via Databricks Connect. Databricks con...
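
Since the excerpt describes connecting client-side applications through Databricks Connect, here is a minimal sketch of what that looks like with the Spark Connect-based databricks-connect package (v13+); it assumes connection details (host, token, cluster ID) are already supplied via a configured profile or environment variables.

```python
# Minimal Databricks Connect sketch (assumes `databricks-connect` v13+ is
# installed and connection details come from a profile or environment).
from databricks.connect import DatabricksSession

# Builds a SparkSession whose queries execute on a remote Databricks cluster.
spark = DatabricksSession.builder.getOrCreate()

# From here, ordinary PySpark code runs against the remote cluster.
df = spark.range(10)
print(df.count())
```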

Mastering Databricks Machine Learning: A Comprehensive Guide

The Databricks Feature Store serves as a centralized repository that empowers data scientists to discover and collaborate on features, preventing data fragmentation. It ensures consistency by using the same code for both feature value computation during model training and inference, avoiding discre...
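
To make the excerpt concrete, here is a hedged sketch of registering a feature table with the Feature Store client; the table name, key, and the `customer_features_df` DataFrame are illustrative assumptions, not the article's code.

```python
# Hedged sketch: registering a feature table with the Feature Store client.
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# `customer_features_df` is assumed to be an existing Spark DataFrame with
# exactly one row per customer_id.
fs.create_table(
    name="ml.customer_features",     # hypothetical schema/table name
    primary_keys=["customer_id"],
    df=customer_features_df,
    description="Aggregated customer features",
)
```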

How to Configure Azure Databricks Unity Catalog with Terraform Part 4

In this story, we will learn how to configure Azure Databricks Unity Catalog with Terraform, and we will talk about how to design External Storage Accounts for Multiple Applications. In particular, we will learn: Creating Databricks External Storage Account for Multiple Applications ...

Managing Databricks: Asset Bundles

Introduction Remember when I mentioned that Databricks was cooking up a process for better workflow management in their quarterly roadmap call? Well, it appears to finally be here (or in Public Preview at least). Asset bundles are a way to follow proper software development prac...

Databricks Boosts Its Data Ingestion Capabilities with $100 Million Acquisition of Arcion

In a significant move to expand its data management capabilities, Databricks, the data lakehouse provider, has announced its acquisition of data replication and ingestion technology provider Arcion. This strategic acquisition, valued at over $100 million, strengthens Databricks’ port...

The Best Approach to Safeguarding Your Databricks Environment: A Comprehensive Guide to Backup and Restore Databricks

Backing up major Platform-as-a-Service (PaaS) systems can be a daunting task, but the importance of safeguarding these platforms cannot be overstated. Disaster can strike at any moment, regardless of how diligently you've followed best practices. In this blog, I will guide you through the s...

Efficient Change Data Capture (CDC) on Databricks Delta Tables with Spark

In today’s data-driven applications, organizations face a critical challenge: ensuring near-real-time data aggregation and accuracy for display on dashboards. As businesses integrate larger and more complex datasets from various sources, including streaming data from Kafka streams, they encoun...
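
A typical building block for the CDC pattern the excerpt describes is a Delta MERGE that upserts a batch of change records into the target table. The sketch below assumes a hypothetical `orders` table and an existing `updates_df` DataFrame of incoming changes.

```python
# Minimal CDC-style upsert into a Delta table via MERGE.
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "main.sales.orders")  # hypothetical table

(target.alias("t")
    .merge(updates_df.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()       # overwrite rows that changed
    .whenNotMatchedInsertAll()    # insert rows seen for the first time
    .execute())
```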

DABs: Streamlining CI/CD for Your Data Products on Databricks

E2E lifecycle management for AI/ML projects and associated pipelines. Preface: In the ever-evolving landscape of data science and engineering, Databricks has emerged as a popular platform for many data scientists and engineers. Its rich suite of data products, encompassing workflows, delta ...

Auto Loader — Handling Incremental ETL with Databricks

Data handling is one of the crucial segments of any data-related job, as proper data planning leads to efficient and economical storage, retrieval, and disposal of data. In a data engineering role, data loading (ETL) plays an equally important part. Data loading ...

Running DBT on Databricks while using dbt_external_tables package to utilize Snowflake Tables

This article highlights a specific use case where one might need to run dbt on Databricks while utilizing tables in Snowflake. Typically, dbt runs on top of the database where it is instantiated. However, if a table needed to run dbt in Databricks does not exist in the hive-metastore and instead exi...

Ingest data with Databricks Autoloader

As a relatively new user to Databricks, I wanted to introduce a great tool that can take your big data from cloud storage to your data pipelines. Within Databricks, Databricks Autoloader is capable of ingesting millions of files per hour from your cloud service provider in a multitude...
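
For context, an Auto Loader ingestion pipeline usually boils down to a `cloudFiles` read stream plus a checkpointed write; the sketch below uses placeholder paths and a hypothetical bronze table name.

```python
# Minimal Auto Loader sketch: incrementally ingest JSON files into Delta.
stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders/_schema")
    .load("s3://my-bucket/raw/orders/"))       # placeholder source path

(stream.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders")
    .trigger(availableNow=True)                # drain the backlog, then stop
    .toTable("bronze.orders"))                 # hypothetical target table
```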

How to Configure Azure Databricks Unity Catalog with Terraform Part 2

1. Intro: What is Azure Databricks, and What is it Used For? Azure Databricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale. The Azure Databricks Lakehouse Platform integrates with cloud stora...

Create/Update Databricks Workflows Dynamically

Introduction We will be discussing how to dynamically create, update, and delete Databricks workflows using the Databricks REST API and a user-specified configuration file. Below is a snippet of the configuration YAML file; using this configuration, we will create workflows and jobs. For this sample con...
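
As a rough illustration of the approach (not the article's own config format), the sketch below reads a hypothetical YAML file and posts it to the documented Jobs API 2.1 create endpoint; the YAML keys are invented for the example.

```python
# Hedged sketch: create a Databricks job from a YAML config via Jobs API 2.1.
import requests
import yaml

with open("workflow.yaml") as f:          # hypothetical config file
    cfg = yaml.safe_load(f)

resp = requests.post(
    f"https://{cfg['host']}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {cfg['token']}"},
    json={
        "name": cfg["job_name"],
        "tasks": cfg["tasks"],             # task list taken from the config
    },
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```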

Databricks Snowflake Summit

Recap of the keynotes: e.g. Responsible AI; the Lakehouse's role in democratizing AI (more support for unstructured data for the AI era, in addition to structured data); scaling AI for capacity and cost efficiency; LakehouseIQ to query data in English via an LLM with Unity Catalog; MosaicML machine lear...

Are Lakehouses a joke or is Databricks the endgame?

About me I’m Hugo Lu — I started my career working in finance before moving to JUUL, a scale-up, and falling into data engineering. I headed up the Data function at London-based Fintech Codat. I’m now CEO at Orchestra, a data release pipeline management platform that h...

How to Configure Azure Databricks Unity Catalog with Terraform Part 3

In this story, we will learn how to configure Azure Databricks Unity Catalog with Terraform. In particular, we will learn: Creating Databricks Access Connector for the External Storage Account Creating Databricks Storage Credential Creating the External Azure Storage Account Creating ...

Task Parameters and Values in Databricks Workflows

Databricks provides a set of powerful and dynamic orchestration capabilities that are leveraged to build scalable pipelines supporting data engineering, data science, and data warehousing workloads. Databricks Workflows allow users to create jobs that can have many tasks. Tasks can be exec...
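
One of those orchestration capabilities is task values, which let one task publish a small value that downstream tasks read back; the task and key names below are hypothetical.

```python
# In an upstream task: publish a value for downstream tasks.
dbutils.jobs.taskValues.set(key="row_count", value=42)

# In a downstream task: read it back by the upstream task's key ("ingest"
# is a hypothetical task name); debugValue is used when the notebook runs
# interactively outside a job.
row_count = dbutils.jobs.taskValues.get(
    taskKey="ingest", key="row_count", default=0, debugValue=0
)
```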

Bridging the Gap: Empowering Business Stakeholders and Consultants with Databricks

In the ever-evolving world of data engineering and analytics, the need for a seamless and efficient means of communication between business stakeholders, data engineers, and consultants is more critical than ever. This communication bridge is where Databricks, combined with the power of English as a...

#80 Prepare for Databricks Data Engineer Associate certification exam part #5: Practice test samples

Databricks Academy Databricks offers its own practice exam sample with answers. This is helpful in that you have to do your own research into why one answer to a particular question is considered correct while the others are not. However, if you are not that confident even after finishing this...

How to read a Delta table's .snappy.parquet file in Databricks

In Databricks, learn how to read the .snappy.parquet files of your Delta tables. TL;DR: Copy the .snappy.parquet file you want to read from the table's location to a different directory in your storage. Verify that the "_delta_log" folder for that table does not exist in the co...
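
A minimal sketch of the TL;DR steps, with placeholder paths: copy the Parquet file somewhere without a `_delta_log` folder, then read it as plain Parquet.

```python
# Copy one data file out of the Delta table directory (placeholder paths).
src = "dbfs:/mnt/tables/orders/part-00000-xyz.snappy.parquet"  # hypothetical
dst = "dbfs:/tmp/parquet_inspect/part-00000-xyz.snappy.parquet"
dbutils.fs.cp(src, dst)

# With no _delta_log alongside it, the file reads as ordinary Parquet.
df = spark.read.parquet("dbfs:/tmp/parquet_inspect/")
df.show()
```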

Setting up Databricks CLI on macOS: A Comprehensive Guide

Databricks is a powerful unified analytics platform that brings together big data and AI. To harness its capabilities, you’ll often need to interact with Databricks programmatically. One way to achieve this is by using the Databricks Command Line Interface (CLI). In this article, we’ll g...

SQL Variables in Databricks

In the above code, I am making use of the year_variable variable. Previously this required a slight workaround with an odd syntax, ${var.variable_name}. That workaround only worked in notebooks, not in the Databricks SQL interface. So now that we have the officia...
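
For reference, the official syntax (available on recent runtimes, DBR 14.1+) looks roughly like the sketch below, run here through `spark.sql`; the `sales` table is a placeholder.

```python
# Declare and set a SQL variable, then use it in a query.
spark.sql("DECLARE OR REPLACE VARIABLE year_variable INT DEFAULT 2023")
spark.sql("SET VAR year_variable = 2024")

# The variable is referenced directly by name in later statements.
spark.sql("SELECT * FROM sales WHERE year = year_variable").show()
```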

Databricks Lakehouse Federation

My typical blogs explore broad tech value propositions, but today we're diving into the specifics: Databricks' latest Lakehouse Federation capability, announced at DAIS 2023. My friends asked, and I'm here to deliver the scoop. Welcome to the Exciting World of Data M...

Unity Catalog as a Databricks Hive Metastore

So this is the last of the articles on metastores on Databricks. We've covered all the legacy metastores: external HMS, Glue, and HMS. Unity Catalog, as mentioned earlier, is Databricks's latest solution to metastore problems. Unity Catalog is a cutting-edge metastore service provided ...

Goodbye Stack Overflow, Hello Databricks Assistant

The world is suddenly enthralled with GenAI and it seems as if there are no other topics worthy of discussion or consideration with the possible exception of Swifties now watching the NFL. Seriously, which of us has not given ChatGPT a try and many of us are deeply involved in our company’s la...

Explained: What is Databricks and why do we need it?

Before we understand what exactly Databricks is, we need to understand what Apache Spark is. Apache Spark is like a super-smart computer system that can handle lots and lots of information at the same time. It helps people do really big tasks, like sorting through a huge pile of data, figur...

Converting Stored Procedures to Databricks

Special thanks to co-author Kyle Hale, Sr. Specialist Solutions Architect at Databricks. Introduction A stored procedure is an executable set of commands that is recorded in a relational database management system as an object. More generally speaking, it is simply code that can b...

Video: Leveraging Databricks AutoLoader: Better Visibility of CloudTrail Logs (Hebrew)

S3 logs generated by AWS CloudTrail provide organizations with essential visibility into user activity and resource utilization within their AWS infrastructure. However, working with raw CloudTrail logs can be challenging due to their size, complexity, and the need for optimal storage and query per...

Databricks Machine Learning Associate Certification: A Comprehensive Study Guide

Progressing further into 2023, the global stage is being reshaped by the deep-seated effects of technology, specifically the tidal wave of artificial intelligence (AI) and ML. Databricks has positioned itself as the premier platform for training these advanced models, growing in popularity due to it...

Exploring SQL Statement Execution with Databricks REST API

During the past week, there was a need to explore the execution of SQL statements on Databricks through the API to facilitate data consumption from our Lakehouse in company applications. This exploration involved a thorough examination of Databricks documentation and the subsequent summarization of ...
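
The core of that API is a single POST to the documented /api/2.0/sql/statements endpoint; the sketch below uses placeholder host, token, and warehouse ID, and assumes the statement completes within the synchronous wait window so the inline result is present.

```python
# Minimal SQL Statement Execution API sketch (placeholders in angle brackets).
import requests

resp = requests.post(
    "https://<workspace-host>/api/2.0/sql/statements",
    headers={"Authorization": "Bearer <token>"},
    json={
        "warehouse_id": "<warehouse-id>",
        "statement": "SELECT COUNT(*) AS n FROM samples.nyctaxi.trips",
        "wait_timeout": "30s",     # wait synchronously up to 30 seconds
    },
)
resp.raise_for_status()
print(resp.json()["result"]["data_array"])   # inline rows, if finished
```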

Databricks System Tables — An Introduction

System tables in Databricks serve as an analytical repository for operational data related to your account. They offer historical observability and can be highly useful for tracking various aspects of your Databricks environment. Currently, Databricks provides system tables for audit logs, billable ...
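
System tables are queried like any other table; for example, assuming the billing schema is enabled on your account, a cost overview can be pulled from the documented system.billing.usage table.

```python
# Daily DBU usage by SKU from the billing system table.
usage = spark.sql("""
    SELECT usage_date, sku_name, SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    GROUP BY usage_date, sku_name
    ORDER BY usage_date DESC
""")
usage.show(10)
```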

Writing PySpark logs in Apache Spark and Databricks

The closer your data product gets to production, the more important it becomes to properly collect and analyse logs. Logs help both when debugging in-depth issues and when analysing the behaviour of your application. For general Python applications the classical choice would be to use ...
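
A pattern commonly used on Databricks is to route your own messages through Spark's log4j logger so they land in the driver logs next to Spark's output; this is a sketch, and the logger name and table are arbitrary.

```python
# Write application logs through Spark's log4j (driver-side sketch).
log4j = spark._jvm.org.apache.log4j
logger = log4j.LogManager.getLogger("my_pipeline")   # arbitrary logger name

logger.info("Starting ingestion step")
try:
    df = spark.read.table("bronze.orders")           # hypothetical table
    logger.info("Read {} rows".format(df.count()))
except Exception as e:
    logger.error("Ingestion failed: {}".format(e))
    raise
```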

Can Microsoft Fabric complement your Databricks platform?

More than a year ago, we decided to leverage Databricks and the Delta Lake as the core of our data platform. Fast-forward to Microsoft Build 2023, when Microsoft announced their new data platform, Microsoft Fabric. We were positively surprised by Fabric's highlighted key points and d...

Real-Time Data Processing with Delta Live Tables: Use Cases and Best Practices for Databricks

After explaining Delta Live Tables (DLTs) in Databricks and how to incorporate them into data pipelines in my previous post, I wanted to take a deeper dive into some specific use cases of Delta Live Tables. What are Delta Live Tables again? Delta Live Tables, often abbreviated as DLTs...
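
As a quick refresher on the shape of DLT code, here is a minimal two-table pipeline sketch (placeholder paths and names; it runs inside a DLT pipeline, not a plain notebook).

```python
# Minimal DLT sketch: streaming bronze table plus a derived silver table.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events ingested with Auto Loader")
def bronze_events():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/raw/events/"))            # placeholder source path

@dlt.table(comment="Cleaned events")
def silver_events():
    return dlt.read_stream("bronze_events").where(col("event_id").isNotNull())
```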

Managing Databricks at a Wide Scale

Introduction I somehow managed to convince our data platform team that I need a higher level of access so I can see everyone’s cluster and job configurations. My OCD regarding FinOps was quickly confirmed after starting to peruse our inventory. While we do have various guardrails in place, ...

4 methods to execute any REST API including Databricks REST API

The four methods below can be used to execute/call any REST API, including the Databricks REST API. In this blog, we have provided examples for the Databricks API, but the same can be implemented for any other REST API: curl requests, Python requests, using the Postman application, using ...
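
As an example of the second method, here is a short requests call against the documented clusters list endpoint; host and token are placeholders.

```python
# List clusters via the Databricks REST API with Python requests.
import requests

resp = requests.get(
    "https://<workspace-host>/api/2.0/clusters/list",
    headers={"Authorization": "Bearer <token>"},
)
resp.raise_for_status()
for cluster in resp.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["cluster_name"], cluster["state"])
```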

Complete Guide to Crack: “Databricks Certified Associate Developer for Apache Spark”

Recently, I cracked the "Databricks Certified Associate Developer for Apache Spark 3.0" certification, and since then I have received many requests for guidance on how to prepare for it. I won't waste your time by providing all the exam details, as you can find them at: ...

Unveiling the Secrets: External Tables vs. External Volumes in Azure Databricks Unity Catalog

While reviewing the Databricks documentation about Unity Catalog, I came across a concept that initially seemed a bit perplexing: the distinction between accessing data objects stored in our cloud storage using External Tables versus External Volumes. This inspired me to write an article exploring t...

Azure Databricks

Azure Databricks is a collaborative analytics platform that is fully integrated with Microsoft Azure. It’s an Apache Spark-based analytics platform optimized for Azure, designed in collaboration between Microsoft and Databricks. Key Features 1. Managed Apache Spark: Azure Databricks p...

Databricks Workflows: Orchestration Made Easy

When it comes to orchestration frameworks for data engineering, there are many different options. Airflow is either loved or hated based on who you ask, as I’ve discussed in a previous post. As someone who has used Airflow for close to 6 years now, I haven’t seen enough bad to put m...

Git And Databricks

Introduction Databricks is one of the most popular platforms out there because of how easy it is for people of all backgrounds to get up and running. Its easy-to-use UI is very intuitive for analysts and even product managers who just need to go in and run the occasional query. For those who w...

Downloading files from Databricks’ DBFS

More often than not, you may be interested in downloading data from your Databricks instance. And whilst Databricks provides a UI for retrieving your DataFrame result, sometimes you are interested in generating data from your Databricks instance not directly related to DataFrames. Typical use cases ...

JSON in Databricks and PySpark

In the simple case, JSON is easy to handle within Databricks. You can read a file of JSON objects directly into a DataFrame or table, and Databricks knows how to parse the JSON into individual fields. But, as with most things software-related, there are wrinkles and variations. This article shows ho...
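
The simple case reads JSON straight into a DataFrame, with nested structs reachable by dot notation; the path and field names below are placeholders.

```python
# Read a file of JSON objects; Spark infers the schema, including structs.
df = spark.read.json("/mnt/raw/events.json")   # placeholder path

# Nested fields are selected with dot notation (hypothetical field names).
df.select("user.id", "user.name", "event_type").show()
```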

Databricks vs Snowflake: Choosing Your Cloud Data Partner

Explore the critical decision of Databricks vs Snowflake as your cloud data partner. Discover the key differences to make the right choice for your needs. During an interview in 2009, Google's Chief Economist, Hal Varian, quote...

Azure Databricks vs Azure Synapse

Introduction Azure Databricks: A Deep Dive Azure Databricks, built on Apache Spark, stands as a powerful analytics platform optimized for Microsoft Azure. It’s designed to facilitate seamless collaboration among data scientists, data engineers, and business analysts. The platform offers ...

Databricks Autoloader Cookbook — Part 1

In this article, we are going to discuss the following topics: How Autoloader handles empty files and file names starting with an underscore When to use the compression codec in Autoloader and what are the best practices for compressed files and various file formats modifiedAfter and ...
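
As a taste of those options, the sketch below combines `modifiedAfter` (ignore files older than a timestamp) with `pathGlobFilter` (skip names starting with an underscore); values and paths are placeholders.

```python
# Auto Loader with file-filtering options (placeholder paths/values).
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/sales/_schema")
    .option("modifiedAfter", "2023-06-01 00:00:00.000000 UTC")
    .option("pathGlobFilter", "[!_]*")   # exclude names starting with "_"
    .load("/mnt/raw/sales/"))
```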

Databricks Delta Lake Tables: Managed vs Unmanaged

Delta Lake is a powerful storage layer for big data processing workloads in Databricks. In the previous article, we discussed Delta Lake on Databricks: Python Installation and Setup Guide When working with Delta Lake tables, you can choose between two types of tables: managed and unmanaged. In...
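
The difference shows up directly in how you write the table: without a path the table is managed (DROP TABLE deletes the files), while an explicit path makes it unmanaged/external (the files outlive a DROP). Names and paths below are placeholders, and `df` is assumed to be an existing DataFrame.

```python
# Managed: data lives in the metastore-managed location.
df.write.saveAsTable("analytics.managed_orders")

# Unmanaged/external: data pinned to an explicit storage path.
(df.write
   .option("path", "s3://my-bucket/tables/orders")
   .saveAsTable("analytics.external_orders"))
```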

New Gradient quick-start — Optimize your Databricks jobs in minutes

Our new quick-start notebooks make testing Gradient crazy easy, to help people quickly optimize their Databricks Jobs at scale. Since we launched Gradient to help control and optimize Databricks Jobs, one piece of feedback from users was crystal clear to us: ...

Azure Data Factory & Databricks: Migration of Init Scripts from DBFS to Workspace (2023/2024)

In this article, we will discuss how to make a simple data loading process using Microsoft Azure Data Factory and Databricks in 2023/2024. In the second part, we will analyze the migration of init scripts from DBFS to Workspace in connection with the new update from Databricks. If you don’t do...

How to pass parameters between Data Factory and Databricks

When working with data in Azure, running a Databricks notebook as part of a Data Factory pipeline is a common scenario. There could be various arguments for choosing Databricks as an activity in your Data Factory flow, for example when the default Data Factory activities are not sufficient to meet y...
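
On the Databricks side, parameters sent from a Data Factory Notebook activity arrive as widget values, and a result can be handed back through the notebook exit value; the parameter name here is hypothetical.

```python
# Read a parameter passed from ADF (arrives as a notebook widget).
dbutils.widgets.text("run_date", "")          # declare with a default
run_date = dbutils.widgets.get("run_date")    # value supplied by ADF at runtime

# Return a value to the ADF pipeline via the notebook exit value.
dbutils.notebook.exit("processed:" + run_date)
```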

Databricks Secures $500M Investment, Elevating Valuation to $43 Billion Amid Late-Stage Uncertainty

TL;DR:
- Databricks, an AI and data analytics firm, raised over $500 million in a Series I round.
- This funding has elevated its valuation to an impressive $43 billion.
- Notably, Databricks' valuation surged despite a slowdown in late-stage startup valuations.
- Diverse investors...

Optimizing Databricks SQL: Achieving Blazing-Fast Query Speeds at Scale

In this data age, delivering a seamless user experience is paramount. While there are numerous ways to measure this experience, one metric stands tall when evaluating the responsiveness of applications and databases: the P99 latency. Especially vital for SQL queries, this seemingly esoteric number i...

Finding the Path to a Managed Table in Databricks

This article shows how to find a path for a managed Databricks table. In Databricks, you might have been creating managed tables, writing to managed tables and reading from managed tables using the database.tablename (catalog.database.tablename, if you have upgraded to Unity Catalog) pattern. And...
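
One straightforward way to recover that path is DESCRIBE DETAIL, which returns the table's storage location among its metadata; the table name below is a placeholder.

```python
# Look up the storage path behind a (managed) table.
detail = spark.sql("DESCRIBE DETAIL catalog.database.tablename")
path = detail.select("location").first()["location"]
print(path)
```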

Win the Title "Databricks Solutions Architect Champion"

Recently, I was acknowledged as a “Databricks Solutions Architect Champion” for my recurring contributions to customer success and meaningful value creation through data engineering solutions leveraging Databricks. Winning this title is often seen as arduous, but with the proper insig...

Getting started with Databricks in Azure

In the modern world of data-driven decision-making, developers and data scientists play a crucial role in harnessing the potential of data. Databricks is a unified analytics platform designed to help developers, data scientists, and analysts collaborate seamlessly on big data projects. Leveraging th...

How to unit test PySpark programs in a Databricks notebook?

Unit testing is a software development process in which the smallest testable parts of an application, called units, are individually and independently scrutinized for proper operation. The Nutter framework from Microsoft makes it e...
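
A Nutter fixture pairs run_ methods (the code under test) with matching assertion_ methods; the sketch below follows that convention, with a hypothetical table standing in for real logic.

```python
# Hedged Nutter sketch: run_/assertion_ pairs share state via self.
from runtime.nutterfixture import NutterFixture

class TestOrderCounts(NutterFixture):
    def run_order_counts(self):
        # Hypothetical table; real tests would exercise your own logic.
        self.count = spark.read.table("bronze.orders").count()

    def assertion_order_counts(self):
        assert self.count > 0

result = TestOrderCounts().execute_tests()
print(result.to_string())
```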

Cleaning up Cluster Logs in Databricks

In any data engineering or analytics environment, managing logs is a crucial task. Logs provide valuable insights into the health and performance of your clusters, but they can also consume valuable storage space if not managed properly. In this blog post, we will guide you through the process of cl...
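
The cleanup itself usually comes down to listing the configured log delivery directory and removing entries you no longer need; the sketch below uses a placeholder path and deletes indiscriminately, so in practice you would filter by age or by decommissioned cluster IDs.

```python
# Hedged sketch: prune cluster log directories from DBFS (placeholder path).
log_root = "dbfs:/cluster-logs/"

for entry in dbutils.fs.ls(log_root):
    # Subdirectories are typically named after cluster IDs; this removes
    # every one of them, so add your own age/ID filtering before using it.
    if entry.isDir():
        dbutils.fs.rm(entry.path, recurse=True)
```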

Designing a Multi-Cloud Data Platform with Databricks

Multi-cloud deployments have become increasingly popular in recent years due to the benefits they provide, such as increased resiliency and availability of applications and services. By utilising multiple cloud providers, organisations can avoid service disruptions caused by a single provider's ...

How to Read a Target Table's Column Data Types and Cast the Same Columns of the Source Table in Azure Databricks Using PySpark

In this blog post, I will show you how to copy a Delta table while dynamically casting all the columns to the data types of the target Delta table's columns in Azure Databricks using PySpark ...
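
The core of the technique can be sketched in a few lines: read the target table's schema, cast each source column to the corresponding target type, and write. Table names are placeholders, and real code would also handle columns that exist on only one side.

```python
# Cast source columns to the target Delta table's data types before writing.
from pyspark.sql.functions import col

target_schema = spark.read.table("prod.db.target").schema   # placeholder
source_df = spark.read.table("prod.db.source")              # placeholder

casted = source_df.select(
    [col(f.name).cast(f.dataType) for f in target_schema.fields]
)
casted.write.mode("append").saveAsTable("prod.db.target")
```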

Configuring DNS resolution for Private Databricks Workspaces (AWS)

For customers on the E2 Platform, Databricks has a feature that allows them to use AWS PrivateLink to provision secure private workspaces by creating VPC endpoints to both the front-end and back-end interfaces of the Databricks infrastructure. The front-end VPC endpoint ensures that users connect to...

Streamlining Your Journey: Automating SCIM Configuration for Azure Databricks with Terraform

Recently, I embarked on a particularly challenging automation task: automating the SCIM (System for Cross-domain Identity Management) provisioning within the Azure Databricks environment. This journey, spurred by the necessity to efficiently manage user access and identities, highlighted the importa...