What is Databricks?
Many data engineering tools have existed for decades:
- RDBMS to store and process structured data (OLTP)
- Apache Spark to execute batch analytics on top of storage
- MongoDB to store unstructured data
- Data warehouses for OLAP, centralizing all structured data from RDBMS
- Data lakes to store both structured and unstructured data, but with limited performance
Then, finally, the lakehouse: a combination of data warehouse and data lake. It offers high-speed processing for structured data while also being able to process unstructured data (with acceptable performance).
Databricks is a scalable, unified data platform capable of all of these, including collaborative workspaces where data engineers, data analysts, and data scientists can work together on a single source of code in the form of notebooks (similar to Jupyter, supporting Scala, R, SQL, and Python).
Databricks is often abbreviated as DBX.
Key Capabilities of Databricks
Databricks consolidates the capabilities of these legacy data engineering tools into four main capabilities:
- ETL/ELT data pipeline creation
- End-to-end machine learning development, from EDA and feature engineering all the way to model deployment (including inference)
- Data streaming: Apache Spark on Databricks can perform micro-batch streaming, which works for most cases; however, Apache Flink still has lower latency
- Big data analytics: distributed SQL queries
All of these capabilities are developed in notebooks as collaborative workspaces; a minimal PySpark sketch of the ETL and SQL capabilities follows the note below.
Note: Apache Spark on Databricks (the Databricks Runtime) is an optimized version of open-source Apache Spark, with better performance than the original.
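As an illustration, here is a minimal PySpark sketch of a batch ETL job that lands a Delta table and then queries it with distributed SQL. The paths and table names are hypothetical, and `spark` is the session object that Databricks notebooks provide automatically.

```python
# Minimal PySpark batch ETL sketch; paths and table names are hypothetical.
# `spark` is the SparkSession that Databricks notebooks provide automatically.
from pyspark.sql import functions as F

# Extract: read raw CSV files from cloud object storage
raw_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/mnt/raw/orders/"))

# Transform: keep completed orders and aggregate revenue per day
daily_revenue = (raw_df
                 .filter(F.col("status") == "COMPLETED")
                 .groupBy(F.to_date("order_ts").alias("order_date"))
                 .agg(F.sum("amount").alias("revenue")))

# Load: persist as a Delta table, then query it with distributed SQL
daily_revenue.write.format("delta").mode("overwrite").saveAsTable("analytics.daily_revenue")
spark.sql("SELECT * FROM analytics.daily_revenue ORDER BY order_date DESC LIMIT 10").show()
```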
Databricks Technology Stack
Below is the Databricks technology stack used on its unified data platform, together with the equivalent stacks on other cloud data platforms (AWS, GCP, Azure) and their on-premises equivalents.
Databricks Built-in Techstack
Note: Databricks is also available on AWS, GCP, and Azure.
| No | Category | Databricks Built-In | Mandatory External Tech | (Optional) Integrations |
|---|---|---|---|---|
| 1 | Storage & Engine | Delta Lake, Photon engine | AWS S3, Azure ADLS, GCP GCS | N/A |
| 2 | Compute & Runtime | Databricks Runtime (based on Apache Spark), Databricks SQL | AWS EC2, Azure VM/Scale Sets, GCP CE | Data shaping tools: dbt (transformation), Airbyte/Prefect (connectors), Informatica (MDM) |
| 3 | ML/AI | MLflow, AutoML, Unity Catalog (Vector Search), Model Serving | N/A | TensorFlow, PyTorch, scikit-learn, LangChain, Hugging Face |
| 4 | Collaboration | Notebooks, Repos, Workflows | N/A | GitHub, GitLab, Azure DevOps, Jenkins, Power BI, Tableau, Looker |
| 5 | Governance | Unity Catalog, Delta Sharing, Cluster Policies | Cloud IAM (AWS IAM, Azure Entra ID/AD, GCP IAM) | N/A |
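To make the ML/AI row above concrete, here is a minimal MLflow tracking sketch of the kind that runs unchanged on the Databricks ML runtime. The dataset, run name, and hyperparameters are illustrative assumptions, not part of the article.

```python
# Minimal MLflow tracking sketch; the dataset, run name, and hyperparameters
# are illustrative assumptions. MLflow ships with the Databricks ML runtime.
import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="iris-rf-baseline"):
    n_estimators = 100
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)

    accuracy = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_param("n_estimators", n_estimators)  # tracked hyperparameter
    mlflow.log_metric("accuracy", accuracy)         # tracked evaluation metric
    mlflow.sklearn.log_model(model, "model")        # artifact usable by Model Serving
```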
Databricks Built-in Techstack Comparison
Below are the equivalents of the Databricks tech stack, either as open-source components that can be hosted on-prem or as cloud-native services.
| No | Databricks Native | Open Source / Can Be Hosted On-Prem | AWS Cloud Native | Azure Cloud Native | GCP Cloud Native |
|---|---|---|---|---|---|
| 1 | Delta Lake | Delta Lake (OSS), Apache Iceberg (table format), Apache Hudi | Athena + Lake Formation | Synapse, ADLS, and Purview | BigLake |
| 2 | Photon (vectorized SQL query engine) | Apache Spark SQL, DuckDB | Redshift RA3 + Spectrum | Synapse SQL Pools | BigQuery |
| 3 | MLflow | MLflow (OSS) | SageMaker | Azure ML | Vertex AI |
| 4 | Unity Catalog | Apache Atlas, DataHub | Glue Data Catalog + Lake Formation | Purview and Microsoft Fabric | Dataplex |
| 5 | Databricks Workflows | Apache Airflow, Prefect, Dagster | Step Functions / MWAA | Data Factory / Synapse Pipelines | Cloud Composer |
| 6 | Databricks SQL (BI + SQL on the lakehouse) | Trino/Presto, Apache Drill, Dremio | Athena / Redshift Spectrum | Synapse | BigQuery |
| 7 | DLT (Delta Live Tables) for streaming | Spark Structured Streaming + Airflow | Glue Streaming + Kinesis | Stream Analytics + Synapse Pipelines | Dataflow + Pub/Sub |
| 8 | Auto Loader (ingest into Delta Lake, schema evolution) | Apache Spark + Apache Kafka Connect | Kinesis, Firehose, Glue | Event Hubs, Azure Data Factory | Dataflow + Pub/Sub |
| 9 | Object storage and data lake (on top of object storage) on AWS/GCP/Azure, for both storage and metastore/catalog | HDFS + Hive, or MinIO/Ceph + Hive Metastore | AWS S3 | Azure ADLS | GCP GCS |
| 10 | Delta Lake (lakehouse: data lake + table format + catalog, i.e. Unity Catalog) | HDFS + Apache Iceberg + Hive Metastore | S3 + Iceberg + Glue Data Catalog + Lake Formation | ADLS + Synapse/Fabric (Iceberg-based) + Purview | GCS + Iceberg + Dataplex |
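Rows 7 and 8 above are easiest to see in code. Below is a minimal Auto Loader sketch that incrementally ingests JSON files into a Delta table with schema evolution; the storage paths and target table are illustrative assumptions.

```python
# Minimal Auto Loader sketch (Databricks-only "cloudFiles" source); the
# storage paths and target table are hypothetical.
stream = (spark.readStream
          .format("cloudFiles")                                          # Auto Loader source
          .option("cloudFiles.format", "json")                           # raw file format
          .option("cloudFiles.schemaLocation", "/mnt/chk/events_schema") # schema tracking/evolution
          .load("/mnt/landing/events/"))

(stream.writeStream
       .option("checkpointLocation", "/mnt/chk/events")  # exactly-once bookkeeping
       .option("mergeSchema", "true")                    # allow new columns over time
       .trigger(availableNow=True)                       # drain pending files as micro-batches
       .toTable("bronze.events"))                        # Delta table target
```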
Conclusion:
In the end, every big player in the tech world is building its own version of data engineering technology for (but not limited to) storage (data lake / delta lake / warehouse), processing (streaming, batch, SQL), orchestration (pipelines / workflows), ML/AI development tools, governance (catalogs, security, lineage), and visualization.
Provisioning a Databricks Instance
There are two ways to use Databricks in the cloud:
- Databricks on AWS, GCP, or Azure
- Databricks Free Edition
The first option is suitable for actual project development work, as it is deployed on top of the most renowned cloud providers. However, for learning purposes, Databricks provides a free sandbox environment.
Databricks Free Edition Limitations
A free Databricks instance should only be used for learning purposes, not for commercial or production work.
While the Databricks Free Edition is genuinely free of charge, it has several important limitations:
- Limited cluster size, which might not be able to handle medium-to-large datasets
- One SQL Warehouse (cannot be scaled)
- Only one active pipeline
- A limited number of model serving endpoints
- Limited notebook language support (Python and SQL only)
- Limited security options (VNet, MFA, role-based access control per column or table)
Creating a Free Databricks Instance
Go to the following URL:
https://login.databricks.com/signup
Register and authenticate on the free Databricks instance using a Google or Microsoft account, or another preferred authentication method.
Done: the free Databricks workspace can now be accessed.
Cost Comparison Between Azure, GCP, and AWS for Hosting a Databricks Instance with the Lowest Computing Power
Costs may vary by region (and cloud providers may adjust their pricing); the estimates below are based on the Southeast Asia region.
DBU cost is already included, based on the VM specs.
All instances have 4 vCPU and 16 GB RAM.
- Databricks on Azure: D4as (4 vCPU, 16 GB RAM), $0.240 per hour plus 0.55 DBU/hour; per month ≈ $550, per 60 hours (for learning purposes, about 2 hours per day) ≈ $45 (storage and other infrastructure costs might still be billed)
- Databricks on AWS: m5.xlarge (4 vCPU, 16 GB RAM); per month ≈ $540, per 60 hours ≈ $44
- Databricks on GCP: n2-standard-4 (4 vCPU, 16 GB RAM); per month ≈ $420, per 60 hours ≈ $34
The cost differences between these clouds are not significant. Use whichever one best matches the skill set of the developers on the project.
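For reference, estimates like the ones above can be reproduced with a small calculation such as the sketch below; the VM and DBU rates used here are assumptions and vary by region, pricing plan, and compute type.

```python
# Rough cost-estimation sketch; all rates are assumptions and differ by
# region, pricing plan, and compute type.
HOURS_PER_MONTH = 730  # average hours in a calendar month

def monthly_compute_cost(vm_rate_per_hour: float,
                         dbus_per_hour: float,
                         dbu_rate: float,
                         hours: float = HOURS_PER_MONTH) -> float:
    """VM rental plus Databricks DBU charges for the given usage hours."""
    return hours * (vm_rate_per_hour + dbus_per_hour * dbu_rate)

# Hypothetical Azure example: $0.240/h VM consuming 0.55 DBU/h at an assumed $0.55/DBU
print(f"full month: ${monthly_compute_cost(0.240, 0.55, 0.55):,.0f}")
print(f"60 hours  : ${monthly_compute_cost(0.240, 0.55, 0.55, hours=60):,.0f}")
```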