TAGS

Apache Spark 4.0 Overview: Spark Connect, Materialized Views, ANSI SQL, and Collation Support

Apache Spark 4.0 brings new features such as Spark Connect, a protocol for remote Spark development, Materialized Views for improved query performance and data management, ANSI SQL compliance, and col...

Strategies for Data Quality with Apache Spark

Data quality is crucial for the success of any data-driven organization. In this article, we explore the data quality landscape and how to ensure data quality across all data pipeline stages with Apac...

Unleashing the Power of Deequ for Efficient Spark Data Analysis

Ensuring data quality in big data environments can be challenging due to the sheer scale and complexity of the data. However, using automated data quality checks and data profiling processes can help ...

Build efficient tests for your Spark data pipeline using BDD with Cucumber

In this Medium article by Omar LARAQUI, you'll learn about Behavior Driven Development (BDD) and how it can help you build efficient tests for your Spark data pipeline. Cucumber is the tool that allows...

Spark Tips. Optimizing JDBC data source reads

In this blog post, we will discuss how to optimize reading from JDBC data sources in Spark. By default, JDBC data sources load data sequentially using a...

Databricks Opens Up Its Delta Lakehouse at Data + AI Summit

Databricks has open-sourced most of the technology behind Delta Lake, including its APIs, with the launch of Delta Lake 2.0. The move towards open standards has been welcomed, as previously vendors ha...

Stop Using Notebooks: Why Data Scientists Should Code Like Developers

Data scientists often use notebooks at the early stages of a project to explore solutions and validate technical feasibility. However, coding in notebooks can prevent the implementation of good softwa...

Spark Tips. Partition Tuning

Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. In this blog post, the author provides tips and optimization methods that help achieve high ...
