Spark Unit Functions Cause "Serialized results bigger than spark.driver.maxResultSize" Errors
Recently, while debugging an error during my on-call rotation, I ran into a situation where a foreachPartition call was triggering a "serialized results bigger than spark.driver.maxResultSize" error. This error is caused by too much data being pulled back to the driver and is normally associated with a collect call. But the stacktrace pointed to the foreachPartition in our code base, so I struggled to see how to fix the issue. The documentation says foreachPartition takes a Unit function, so nothing should be returned to the driver. When I dug into the Spark source code, though, I found the cause.
Spark Source Code
Diving into the Spark source code, I was eventually led to a SparkContext runJob function that shows where the results are accumulated. While it is neither the final runJob overload nor the function that actually submits the job, it makes the accumulation of task results plain to see.
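For reference, here is a lightly paraphrased sketch of the two pieces involved (the exact code differs between Spark versions): RDD.foreachPartition funnels its Unit-returning function into sc.runJob, and this runJob overload allocates a results array on the driver and stores every task's result in it.

```scala
// Paraphrased from RDD.scala: foreachPartition still funnels through runJob.
def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}

// Paraphrased from SparkContext.scala: every task's result -- even a Unit --
// is written into the `results` array on the driver.
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int]): Array[U] = {
  val results = new Array[U](partitions.size)
  runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)
  results
}
```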
But wait: if the function passed to foreachPartition returns Unit, how can that cause a "bigger than spark.driver.maxResultSize" error? The answer is that Scala's Unit is still an object, and it still takes up memory. So, given enough partitions, the accumulated results eventually consume enough memory to trigger the error.
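To make that concrete, here is a tiny stand-alone illustration (this is not Spark's actual result-packaging code, just a rough measure using plain Java serialization): even a bare Unit value has a nonzero serialized size, and those bytes add up across thousands of task results.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object UnitSize {
  def main(args: Array[String]): Unit = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)

    // Writing Unit boxes it to scala.runtime.BoxedUnit, which is Serializable.
    out.writeObject(())
    out.close()

    // The count includes the ObjectOutputStream header, but the point stands:
    // a "nothing" result is still not zero bytes.
    println(s"A serialized Unit takes ${buffer.size()} bytes")
  }
}
```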
Reproducing the Issue
I was able to easily reproduce the issue with the code below (which is also available in this gist).
By setting spark.driver.maxResultSize = 1b, even a simple count fails.
Of course, setting spark.driver.maxResultSize = 1b is ridiculous, but it makes the behavior easy to reproduce.
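For completeness, here is a minimal sketch of that kind of reproduction; the app name, dataset, and partition count are illustrative rather than the exact contents of the gist.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import scala.util.Try

object MaxResultSizeRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("maxResultSize-repro")
      // Absurdly low on purpose, so the error shows up immediately.
      .config("spark.driver.maxResultSize", "1b")
      .getOrCreate()

    val df = spark.range(0, 1000000).toDF("id").repartition(200)

    // A Unit-returning function: nothing is explicitly collected, yet the
    // per-task results sent back to the driver still count against the limit.
    val doNothing: Iterator[Row] => Unit = _ => ()
    println(Try(df.foreachPartition(doNothing))) // expected: Failure with the maxResultSize error

    // Even count() is expected to hit the same limit at this setting.
    println(Try(df.count()))

    spark.stop()
  }
}
```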
Conclusion
Even when using a DataFrame Unit function like foreachPartition or foreach, it is still possible to get a "bigger than spark.driver.maxResultSize" error. I'm not sure whether this is expected behavior or an actual bug, and after searching Jira, it doesn't seem like anyone has logged a ticket about it.