Allow user to remove broadcast variables when they are no longer used #771
Conversation
In Spark, users can create broadcast variables to share read-only variables across tasks or operations, especially when they are large. However, Spark currently does not allow users to remove those variables within one SparkContext. This becomes a major issue for a long-running Shark server, which uses a single SparkContext. To address this, this patch allows users to remove broadcast variables when they are no longer used. To remove a broadcast variable, users only need to call the Broadcast.rm(toClearSource: Boolean) method; the broadcast variable will then be deleted across the slaves. If toClearSource is set to true, the data source (e.g., the file used by HttpServer) will be deleted too.
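The semantics described above can be modeled with a minimal, self-contained sketch (plain Scala, not the actual Spark classes; `BlockStore` and `SketchBroadcast` are hypothetical stand-ins for Spark's block manager and Broadcast class):

```scala
import scala.collection.mutable

// Hypothetical stand-in for the per-slave block store; not Spark's BlockManager.
object BlockStore {
  private val blocks = mutable.Map[Long, Any]()
  def put(id: Long, value: Any): Unit = blocks(id) = value
  def drop(id: Long): Unit = blocks -= id
  def contains(id: Long): Boolean = blocks.contains(id)
}

// Sketch of the proposed API: rm(toClearSource) drops the cached block,
// and optionally releases the data source (modeled here as a flag).
class SketchBroadcast[T](val id: Long, value: T) {
  BlockStore.put(id, value)
  var sourceCleared = false

  def rm(toClearSource: Boolean = false): Unit = {
    BlockStore.drop(id)                      // delete the block on the slaves
    if (toClearSource) sourceCleared = true  // e.g. remove the HttpServer file
  }
}
```

Usage would mirror the description: create the broadcast, use it, then call `rm(true)` once it is no longer needed to free both the cached block and its source file.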
Thank you for your pull request. An admin will review this request soon.
@@ -46,6 +47,21 @@ extends Broadcast[T](id) with Logging with Serializable {
  if (!isLocal) {
    sendBroadcast()
  }

  override def rm(toClearSource: Boolean = false) {
Can we rename this function to remove, and toClearSource to releaseSource?
2. Add a parameter to determine whether block managers report the broadcast block to the master or not.
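If the renames suggested above were adopted, the API shape might look like the following sketch (hypothetical: `BroadcastSketch`, `LocalBroadcastSketch`, and the `reportToMaster` flag are illustrations of the review suggestions, not the actual patch):

```scala
// Hypothetical sketch of the reviewed API shape: rm -> remove,
// toClearSource -> releaseSource, plus a flag controlling whether block
// managers report the removed broadcast block to the master.
abstract class BroadcastSketch[T](val id: Long) extends Serializable {
  def value: T
  def remove(releaseSource: Boolean = false, reportToMaster: Boolean = true): Unit
}

// Toy local implementation so the signature can be exercised.
class LocalBroadcastSketch[T](id: Long, private var v: Option[T])
  extends BroadcastSketch[T](id) {
  var removed = false
  def value: T = v.getOrElse(throw new IllegalStateException("broadcast removed"))
  override def remove(releaseSource: Boolean = false,
                      reportToMaster: Boolean = true): Unit = {
    v = None        // drop the cached value
    removed = true  // releaseSource / reportToMaster would act here in Spark
  }
}
```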
Hi @RongGu, AFAIK Spark already has a time-based automatic cleanup mechanism in HttpBroadcast when spark.cleaner.ttl is enabled; this can mostly clean the JobConf in HadoopRDD. But this mechanism has an issue with Spark Streaming (https://spark-project.atlassian.net/browse/STREAMING-38?jql=project%20%3D%20STREAMING), so it would be a great help to clean broadcast variables automatically using a memory-tracking approach rather than a time-based one.
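For reference, the time-based cleaner mentioned above is enabled via the spark.cleaner.ttl property, which in Spark of this era was read from a Java system property set before creating the SparkContext. A minimal sketch (the `CleanerTtl` helper is hypothetical; it only sets and reads the property, it does not start Spark):

```scala
// Sketch: enabling the time-based metadata cleaner mentioned above.
// spark.cleaner.ttl is a TTL in seconds; Spark reads it at startup.
object CleanerTtl {
  def enable(ttlSeconds: Int): Unit =
    System.setProperty("spark.cleaner.ttl", ttlSeconds.toString)

  def current: Option[Int] =
    Option(System.getProperty("spark.cleaner.ttl")).map(_.toInt)
}
```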
Hi @jerryshao, thanks for your comment. It would be nice to have an automatic memory cleaner for broadcast variables. Nevertheless, the purpose of this patch is to provide a broadcast-removal API to users, and the two do not conflict in essence. For memory cleanup, the lesson I have learned is that no program-monitoring mechanism seems better than letting users clear memory explicitly when possible: GC is not always timely and carries overhead. Moreover, in this case it is hard to determine automatically whether a broadcast variable is still needed by the user, and a TTL may lead to errors, as the Spark Streaming issue shows. On the other hand, leaving large unused broadcast variables in memory is a problem that users currently have no means to handle. Therefore, we provide an explicit broadcast-removal method to users.