Apache Spark 4.0 Overview: Spark Connect, Materialized Views, ANSI SQL, and Collation Support
Apache Spark 4.0 brings new features such as Spark Connect, a protocol for remote Spark development, Materialized Views for improved query performance and data management, ANSI SQL compliance, and col...
Strategies for Data Quality with Apache Spark
Data quality is crucial for the success of any data-driven organization. In this article, we explore the data quality landscape and how to ensure data quality across all data pipeline stages with Apac...
Unleashing the Power of Deequ for Efficient Spark Data Analysis
Ensuring data quality in big data environments can be challenging due to the sheer scale and complexity of the data. However, using automated data quality checks and data profiling processes can help ...
Build efficient tests for your Spark data pipeline using BDD with Cucumber
In this Medium article by Omar LARAQUI, you'll learn about Behavior Driven Development (BDD) and how it can help you build efficient tests for your Spark data pipeline. Cucumber is the tool that allows...
Spark Tips. Optimizing JDBC data source reads
In this blog post, we will discuss how to optimize reading from JDBC data sources in Spark. By default, JDBC data sources load data sequentially using a...
Databricks Opens Up Its Delta Lakehouse at Data + AI Summit
Databricks has open-sourced most of the technology behind its Delta Lake, including APIs, with the launch of Delta Lake 2.0. The move towards open standards has been welcomed, as previously vendors ha...
Stop Using Notebooks: Why Data Scientists Should Code Like Developers
Data scientists often use notebooks at the early stages of a project to explore solutions and validate technical feasibility. However, coding in notebooks can prevent the implementation of good softwa...
Spark Tips. Partition Tuning
Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. In this blog post, the author provides tips and optimization methods that help achieve high ...