93% Off Udemy Coupon - CoursesWyn

Databricks Certified Data Engineer Professional -Preparation

UPDATED: 2026 | Preparation course for Databricks Data Engineer Professional certification exam with hands-on training

$11.99 (93% OFF)
Get Course Now

About This Course

<div>If you are interested in becoming a Certified Data Engineer Professional from Databricks, you have come to the right place! This study guide will help you prepare for the certification exam.</div><div><br></div><div>By the end of this course, you should be able to:</div><div><br></div><div>1- Develop Code for Data Processing using Python and SQL</div><div><br></div><div>Using Python and Tools for development</div><div><ul><li><span style="font-size: 1rem;">Design and implement a scalable Python project structure optimized for Databricks Asset Bundles (DABs), enabling modular development, deployment automation, and CI/CD integration.</span></li><li><span style="font-size: 1rem;">Manage and troubleshoot external third-party library installations and dependencies in Databricks, including PyPI packages, local wheels, and source archives.</span></li><li><span style="font-size: 1rem;">Develop User-Defined Functions (UDFs) using Pandas and Python UDFs.</span></li></ul></div><div><br></div><div>Building and Testing an ETL pipeline with Lakeflow Declarative Pipelines, SQL, and Apache Spark on the Databricks platform</div><div><ul><li><span style="font-size: 1rem;">Build and manage reliable, production-ready data pipelines for batch and streaming data using Lakeflow Declarative Pipelines and Auto Loader.</span></li><li><span style="font-size: 1rem;">Create and automate ETL workloads using Jobs via the UI, APIs, and CLI.</span></li><li><span style="font-size: 1rem;">Explain the advantages and disadvantages of streaming tables compared to materialized views.</span></li><li><span style="font-size: 1rem;">Use APPLY CHANGES APIs to simplify CDC in Lakeflow Declarative Pipelines.</span></li><li><span style="font-size: 1rem;">Compare Spark Structured Streaming and Lakeflow Declarative Pipelines to determine the optimal approach for building scalable ETL pipelines.</span></li><li><span style="font-size: 1rem;">Create a pipeline component that uses control flow operators (e.g. if/else, foreach, etc.).</span></li><li><span style="font-size: 1rem;">Choose the appropriate configs for environments and dependencies, high memory for notebook tasks, and auto-optimization to disallow retries.</span></li><li><span style="font-size: 1rem;">Develop unit and integration tests using assertDataFrameEqual, assertSchemaEqual, DataFrame.transform, and testing frameworks to ensure code correctness, including the use of a built-in debugger.</span></li></ul></div><div><br></div><div>2- Data Ingestion &amp; Acquisition</div><div><ul><li><span style="font-size: 1rem;">Design and implement data ingestion pipelines to efficiently ingest a variety of data formats, including Delta Lake, Parquet, ORC, AVRO, JSON, CSV, XML, text, and binary, from diverse sources such as message buses and cloud storage.</span></li><li><span style="font-size: 1rem;">Create an append-only data pipeline capable of handling both batch and streaming data using Delta.</span></li></ul></div><div><br></div><div>3- Data Transformation, Cleansing, and Quality</div><div><ul><li><span style="font-size: 1rem;">Write efficient Spark SQL and PySpark code to apply advanced data transformations, including window functions, joins, and aggregations, to manipulate and analyze large datasets.</span></li><li><span style="font-size: 1rem;">Develop a quarantining process for bad data with Lakeflow Declarative Pipelines or Auto Loader in classic jobs.</span></li></ul></div><div><br></div><div>4- Data Sharing and Federation</div><div><ul><li><span style="font-size: 1rem;">Demonstrate secure Delta Sharing between Databricks deployments using Databricks-to-Databricks sharing (D2D) or to external platforms using the open sharing protocol (D2O).</span></li><li><span style="font-size: 1rem;">Configure Lakehouse Federation with proper governance across supported source systems.</span></li><li><span style="font-size: 1rem;">Use Delta Sharing to share live data from the Lakehouse to any computing platform.</span></li></ul></div><div><br></div><div>5- Monitoring and Alerting</div><div><ul><li><span style="font-size: 1rem;">Monitoring</span></li><li><span style="font-size: 1rem;">Use system tables for observability over resource utilization, cost, auditing, and workload monitoring.</span></li><li><span style="font-size: 1rem;">Use the Query Profiler UI and Spark UI to monitor workloads.</span></li><li><span style="font-size: 1rem;">Use the Databricks REST APIs and Databricks CLI to monitor jobs and pipelines.</span></li><li><span style="font-size: 1rem;">Use Lakeflow Declarative Pipelines event logs to monitor pipelines.</span></li><li><span style="font-size: 1rem;">Alerting</span></li><li><span style="font-size: 1rem;">Use SQL Alerts to monitor data quality.</span></li><li><span style="font-size: 1rem;">Use the Workflows UI and Jobs API to set up notifications for job status and performance issues.</span></li></ul></div><div><br></div><div>6- Cost &amp; Performance Optimization</div><div><ul><li><span style="font-size: 1rem;">Understand how and why using Unity Catalog managed tables reduces operational overhead and maintenance burden.</span></li><li><span style="font-size: 1rem;">Understand Delta optimization techniques, such as deletion vectors and liquid clustering.</span></li><li><span style="font-size: 1rem;">Understand the optimization techniques used by Databricks to ensure the performance of queries on large datasets (data skipping, file pruning, etc.).</span></li><li><span style="font-size: 1rem;">Apply Change Data Feed (CDF) to address specific limitations of streaming tables and reduce latency.</span></li><li><span style="font-size: 1rem;">Use the query profile to analyze a query and identify bottlenecks, such as poor data skipping, inefficient join types, and data shuffling.</span></li></ul></div><div><br></div><div>7- Ensuring Data Security and Compliance</div><div><ul><li><span style="font-size: 1rem;">Applying Data Security Mechanisms</span></li><li><span style="font-size: 1rem;">Use ACLs to secure workspace objects, enforcing the principle of least privilege and consistent policy enforcement.</span></li><li><span style="font-size: 1rem;">Use row filters and column masks to filter and mask sensitive table data.</span></li><li><span style="font-size: 1rem;">Apply anonymization and pseudonymization methods such as hashing, tokenization, suppression, and generalization to confidential data.</span></li><li><span style="font-size: 1rem;">Ensuring Compliance</span></li><li><span style="font-size: 1rem;">Implement a compliant batch &amp; streaming pipeline that detects and masks PII to ensure data privacy.</span></li><li><span style="font-size: 1rem;">Develop a data purging solution ensuring compliance with data retention policies.</span></li></ul></div><div><br></div><div>8- Data Governance</div><div><ul><li><span style="font-size: 1rem;">Create and add descriptions/metadata about enterprise data to make it more discoverable.</span></li><li><span style="font-size: 1rem;">Demonstrate understanding of the Unity Catalog permission inheritance model.</span></li></ul></div><div><br></div><div>9- Debugging and Deploying</div><div><ul><li><span style="font-size: 1rem;">Debugging and Troubleshooting</span></li><li><span style="font-size: 1rem;">Identify pertinent diagnostic information using the Spark UI, cluster logs, system tables, and query profiles to troubleshoot errors.</span></li><li><span style="font-size: 1rem;">Analyze errors and remediate failed job runs with job repairs and parameter overrides.</span></li><li><span style="font-size: 1rem;">Use Lakeflow Declarative Pipelines event logs &amp; the Spark UI to debug Lakeflow Declarative Pipelines and Spark pipelines.</span></li><li><span style="font-size: 1rem;">Deploying CI/CD</span></li><li><span style="font-size: 1rem;">Build and deploy Databricks resources using Databricks Asset Bundles.</span></li><li><span style="font-size: 1rem;">Configure and integrate with Git-based CI/CD workflows using Databricks Git Folders for notebook and code deployment.</span></li></ul></div><div><br></div><div>10- Data Modelling</div><div><ul><li><span style="font-size: 1rem;">Design and implement scalable data models using Delta Lake to manage large datasets.</span></li><li><span style="font-size: 1rem;">Simplify data layout decisions and optimize query performance using Liquid Clustering.</span></li><li><span style="font-size: 1rem;">Identify the benefits of using Liquid Clustering over partitioning and Z-Ordering.</span></li><li><span style="font-size: 1rem;">Design dimensional models for analytical workloads, ensuring efficient querying and aggregation.</span></li></ul></div><div><span style="font-size: 1rem;">With the knowledge you gain during this course, you will be ready to take the certification exam.</span></div><div><br></div><div>I am looking forward to meeting you!</div>
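To give a flavor of the pseudonymization techniques mentioned in section 7, here is a minimal sketch in plain Python using keyed hashing. The column name, records, and salt are illustrative and not taken from the course; in a real Databricks pipeline, logic like this would typically run inside a PySpark UDF over the sensitive column.

```python
import hashlib
import hmac

def pseudonymize(value: str, salt: str) -> str:
    """Replace a sensitive value with a keyed SHA-256 digest.

    Using HMAC with a secret salt (rather than a bare hash) resists
    rainbow-table attacks, while staying deterministic: the same input
    always maps to the same token, so joins on the pseudonymized
    column still work across tables.
    """
    return hmac.new(salt.encode(), value.encode(), hashlib.sha256).hexdigest()

# Illustrative records (not from the course).
rows = [{"email": "alice@example.com"}, {"email": "bob@example.com"}]
masked = [{"email": pseudonymize(r["email"], salt="s3cret")} for r in rows]
```

Because the mapping is deterministic per salt, referential integrity between pseudonymized tables is preserved; rotating the salt effectively re-keys the whole dataset.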

What you'll learn:

  • Learn how to model data management solutions on Databricks Lakehouse
  • Build data processing pipelines using the Spark and Delta Lake APIs
  • Understand how to use the Databricks platform and its tools, and the benefits they provide
  • Build production pipelines using best practices around security and governance
  • Learn how to monitor and log production jobs
  • Follow best practices for deploying code on Databricks
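The deployment practices above center on Databricks Asset Bundles, which are driven by a databricks.yml file. A minimal config sketch follows; the bundle name, workspace host, job name, and notebook path are placeholders, not taken from the course:

```yaml
# Minimal Databricks Asset Bundle config (illustrative names and paths).
bundle:
  name: demo_etl_project

targets:
  dev:
    mode: development          # dev mode prefixes deployed resources per user
    default: true
    workspace:
      host: https://example.cloud.databricks.com

resources:
  jobs:
    nightly_etl:
      name: nightly-etl
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./notebooks/ingest.py
```

A bundle like this is typically validated and deployed from the CLI with `databricks bundle validate` and `databricks bundle deploy -t dev`.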