Spark Out of Memory (OOM) Exception Handling
1. Understanding Out of Memory (OOM) Exceptions
Spark OOM exceptions occur when a Spark application consumes more memory than allocated, leading to task failures.
Typical causes:
Insufficient memory allocation for executors or the driver.
Skewed data partitions that force some tasks to process far more data than others.
Unoptimized operations such as wide transformations or large shuffles.
2. Types of Memory in Spark
Driver Memory: Used for the Spark driver’s internal data structures and task scheduling.
Executor Memory: Divided into:
Storage Memory: Caches RDDs or DataFrames.
Execution Memory: Allocated for tasks (e.g., shuffles, joins, aggregations).
Off-Heap Memory: Managed outside the JVM heap; enabled via spark.memory.offHeap.enabled and sized with spark.memory.offHeap.size.
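As a rough illustration of where these regions are configured, the sketch below sets the main memory knobs on the session builder. The values are placeholders, not recommendations, and driver/executor memory are static settings that are normally passed to spark-submit rather than changed at runtime.

from pyspark.sql import SparkSession

# Illustrative values only; tune them to your cluster and workload.
spark = (
    SparkSession.builder
    .appName("memory-regions-sketch")
    .config("spark.driver.memory", "2g")             # driver heap
    .config("spark.executor.memory", "4g")           # executor heap, shared by storage and execution
    .config("spark.memory.fraction", "0.6")          # share of the heap available to storage + execution
    .config("spark.memory.storageFraction", "0.5")   # share of that region protected for storage before eviction
    .config("spark.executor.memoryOverhead", "512m") # non-heap overhead (JVM internals, native buffers)
    .getOrCreate()
)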
3. Diagnosing the Issue
Error Logs: Look for messages like java.lang.OutOfMemoryError: Java heap space or GC overhead limit exceeded.
Metrics: Use the Spark UI and external monitors such as Ganglia or Prometheus to identify stages and tasks with high memory usage or heavy spill (see the sketch below).
Job Duration: Tasks that run unusually long may indicate memory pressure or excessive garbage collection.
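For scripted checks, one option is to read spill metrics from Spark's monitoring REST API. This is a minimal sketch: it assumes the driver UI is reachable on localhost:4040, that at least one application is running, and that the requests library is installed.

import requests

# Hypothetical driver address; replace with your application's UI host and port.
base = "http://localhost:4040/api/v1"

app_id = requests.get(f"{base}/applications").json()[0]["id"]
stages = requests.get(f"{base}/applications/{app_id}/stages").json()

# Stages that spill to memory or disk are common precursors to OOM failures.
for s in stages:
    if s.get("memoryBytesSpilled", 0) > 0 or s.get("diskBytesSpilled", 0) > 0:
        print(s["stageId"], s["name"], s["memoryBytesSpilled"], s["diskBytesSpilled"])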
4. Best Practices for Preventing OOM Exceptions
a. Memory Configuration
Increase memory allocation: --executor-memory 4G --driver-memory 2G
Allocate sufficient cores to executors to distribute the load: --executor-cores 4
b. Shuffle Optimizations
Minimize wide transformations such as groupBy or join on large datasets; when a shuffle is unavoidable, repartition on the key that will be shuffled: df = df.repartition(100, "partition_column")
Enable adaptive query execution (AQE) for better shuffle management: spark.conf.set("spark.sql.adaptive.enabled", "true")
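Putting both ideas together, the sketch below enables AQE (including runtime coalescing and skew-join handling) and repartitions before a wide aggregation. The path, column name, and partition count are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-tuning-sketch").getOrCreate()

# Let AQE coalesce small shuffle partitions and split skewed ones at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

df = spark.read.parquet("hdfs:///path/large_data")   # placeholder path
df = df.repartition(100, "partition_column")          # spread the shuffle across 100 partitions
result = df.groupBy("partition_column").count()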
c. Skewed Data Management
Identify skewed partitions using Spark UI task metrics (for example, a handful of tasks processing far more records or spilling far more than the rest).
Use techniques like salting or custom partitioners for better load balancing (a bounded salt keeps the key space manageable; see the sketch after this item): df = df.withColumn("salted_key", concat(col("key"), lit("_"), (rand() * 10).cast("int")))
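A fuller sketch of a salted join follows, assuming a salt factor of 10 and two hypothetical DataFrames, skewed_df (large, skewed) and small_df: salt the skewed side randomly, replicate each key on the small side once per salt value, and join on the salted key.

from pyspark.sql import functions as F

SALTS = 10  # assumed salt factor; choose based on how skewed the hot keys are

# Randomly salt the large, skewed side.
salted_large = skewed_df.withColumn(
    "salted_key", F.concat(F.col("key"), F.lit("_"), (F.rand() * SALTS).cast("int"))
)

# Replicate every key on the small side once per salt value.
salted_small = (
    small_df
    .withColumn("salt", F.explode(F.array([F.lit(i) for i in range(SALTS)])))
    .withColumn("salted_key", F.concat(F.col("key"), F.lit("_"), F.col("salt")))
    .drop("key", "salt")
)

result = salted_large.join(salted_small, "salted_key")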
d. Broadcast Joins
Use broadcast joins for small datasets:
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")
Ensure the broadcast threshold is appropriately configured: spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024) # 10MB
e. Caching and Persistence
Cache intelligently to prevent excessive memory usage (StorageLevel comes from the pyspark package): df.persist(StorageLevel.MEMORY_AND_DISK)
Use unpersist() to release memory when caching is no longer needed: df.unpersist()
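A minimal sketch of the pattern, assuming an existing spark session and a placeholder input path; MEMORY_AND_DISK spills cached blocks to disk instead of failing when memory is tight.

from pyspark import StorageLevel

df = spark.read.parquet("hdfs:///path/large_data")   # placeholder path
df.persist(StorageLevel.MEMORY_AND_DISK)

df.count()                           # materializes the cache
df.groupBy("key").count().show()     # reuses the cached data

df.unpersist()                       # release memory once the DataFrame is no longer needed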
f. Optimized Serialization
Use Kryo serialization for better memory efficiency: spark.conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
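Because the serializer is read when the application starts, it is usually configured on the builder or via spark-submit rather than changed at runtime. A minimal sketch, with an illustrative buffer limit:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kryo-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryoserializer.buffer.max", "256m")  # raise if large objects fail to serialize
    .getOrCreate()
)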
5. Advanced Strategies
a. Off-Heap Memory
Enable off-heap memory to reduce heap pressure:
spark.conf.set("spark.memory.offHeap.enabled", "true")
spark.conf.set("spark.memory.offHeap.size", "512M")
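Because these are static memory settings, in practice they are supplied when the session is created (or via spark-submit). A minimal sketch with a placeholder size:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("offheap-sketch")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "512m")   # off-heap pool shared by execution and storage
    .getOrCreate()
)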
b. Garbage Collection (GC) Tuning
Use G1GC for better performance:
--conf spark.executor.extraJavaOptions="-XX:+UseG1GC"
--conf spark.driver.extraJavaOptions="-XX:+UseG1GC"
c. Dynamic Allocation
Enable dynamic resource allocation to scale resources based on demand: spark.conf.set("spark.dynamicAllocation.enabled", "true")
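Dynamic allocation is usually configured at session creation alongside bounds on the executor count. The sketch below uses illustrative bounds and assumes Spark 3+ shuffle tracking in place of an external shuffle service.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-allocation-sketch")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "20")                # illustrative upper bound
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")   # needed without an external shuffle service
    .getOrCreate()
)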
d. File Formats
Prefer columnar file formats like Parquet/ORC to reduce memory usage during I/O operations.
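With a columnar source, selecting only the needed columns and filtering early lets Spark prune columns and push predicates down to the scan, so far less data reaches executor memory. A sketch using the same placeholder paths as the example below and a hypothetical amount column:

# Only the selected columns are read from Parquet, and the filter is pushed down to the scan.
events = (
    spark.read.parquet("hdfs:///path/large_data")
    .select("key", "amount")
    .filter("amount > 100")
)
events.write.mode("overwrite").parquet("hdfs:///path/filtered_data")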
6. Code Example: Handling Large Joins
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
spark = SparkSession.builder \
.appName("OOM Example") \
.config("spark.sql.adaptive.enabled", "true") \
.config("spark.sql.autoBroadcastJoinThreshold", "50MB") \
.getOrCreate()
# Load large datasets
large_df = spark.read.parquet("hdfs:///path/large_data")
small_df = spark.read.parquet("hdfs:///path/small_data")
# Optimize join using broadcast
result = large_df.join(broadcast(small_df), "key")
# Cache result for reuse
result.persist()
# Perform operations
result.show()
# Release memory
result.unpersist()
7. Monitoring and Debugging
Use the Spark UI for inspecting stage-level memory usage.
Enable event logging so completed applications can be inspected later in the Spark History Server:
spark.conf.set("spark.eventLog.enabled", "true")
spark.conf.set("spark.eventLog.dir", "/path/to/logs")
8. Checklist for Avoiding OOM Exceptions
Analyze data skew and repartition.
Optimize shuffles and joins.
Allocate sufficient memory and cores.
Monitor Spark UI regularly.
Use caching wisely and clean up unused data.