Do I have to restart the Spark streaming job for a new schema to take effect? #309

I was trying the ABRiS library, consuming CDC records generated by Debezium. I added a column while the streaming job was running and expected the new column to appear in the output, but it did not. Does it not dynamically refresh the schema from the version information in the payload? Do I have to restart the Spark streaming job to process/view the new columns?
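For context, a minimal sketch of the kind of job the question describes, assuming the ABRiS 4.x-style `AbrisConfig` API, an existing `SparkSession` named `spark`, and placeholder topic and registry values. The reader schema is resolved once, when the query is defined:

```scala
import org.apache.spark.sql.functions.col
import za.co.absa.abris.avro.functions.from_avro
import za.co.absa.abris.config.AbrisConfig

// Download the latest reader schema from the Schema Registry *once*,
// at query definition time (topic and URL are placeholders).
val abrisConfig = AbrisConfig
  .fromConfluentAvro
  .downloadReaderSchemaByLatestVersion
  .andTopicNameStrategy("cdc-topic")
  .usingSchemaRegistry("http://localhost:8081")

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "cdc-topic")
  .load()
  // The resulting schema is fixed for the lifetime of the query, even if
  // newer schema versions are registered while the job is running.
  .select(from_avro(col("value"), abrisConfig).as("data"))
```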
Hi @akshayar …
Thanks @kevinwallimann. Is this a feature that you are considering implementing? Or is the nature of the problem itself such that it cannot be solved?
Hi @akshayar. This is a limitation of Spark, so it's not in the scope of this project.
What could be done is to run Spark's streaming `foreachBatch` and initialize the …

I'm a bit surprised that there is no way to implement a refresh behaviour. What if …
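A hedged sketch of that `foreachBatch` idea: keep the Kafka value as raw bytes and rebuild the ABRiS config inside every microbatch, so the latest reader schema is fetched per batch. `spark`, the topic, the registry URL, and the output path are placeholder assumptions:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import za.co.absa.abris.avro.functions.from_avro
import za.co.absa.abris.config.AbrisConfig

// Keep the Kafka value as raw bytes: no Avro schema is fixed at this point.
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "cdc-topic")
  .load()

raw.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Rebuild the ABRiS config inside the batch, so the latest reader
    // schema is downloaded for every microbatch rather than once at start.
    val config = AbrisConfig
      .fromConfluentAvro
      .downloadReaderSchemaByLatestVersion
      .andTopicNameStrategy("cdc-topic")
      .usingSchemaRegistry("http://localhost:8081")

    val decoded = batch.select(from_avro(col("value"), config).as("data"))

    // The sink must tolerate a schema that may differ between batches.
    decoded.write.mode("append").parquet("/tmp/cdc-out")
  }
  .start()
  .awaitTermination()
```

Whether this helps depends on the sink: each microbatch may carry a different schema, so appended Parquet files with differing schemas, for example, need schema merging at read time.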
Look at this code: …

If we later decide to change the reader schema, we still have no way to change the Spark type, because Spark calls us, not the other way around. As far as I know, a DataFrame's dataType cannot be changed; you can only create a new DataFrame, so that is another limitation here. I am not an expert on Spark streaming, so if you think this can be done, please provide a code example of how it should work.
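To illustrate the point about the Spark type being fixed: the SQL type that `from_avro` exposes is derived from the Avro reader schema when the expression is constructed, roughly as below. This is a standalone illustration using spark-avro's `SchemaConverters` with a toy schema, not ABRiS's actual internals:

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters

// A toy reader schema; in ABRiS this would come from the Schema Registry.
val avroSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Record","fields":[{"name":"id","type":"long"}]}""")

// The Catalyst type is computed once from the reader schema. Spark asks the
// expression for its dataType during analysis, and the resolved plan is then
// fixed; registering a newer schema version cannot alter an existing DataFrame.
val sparkType = SchemaConverters.toSqlType(avroSchema).dataType
println(sparkType) // e.g. StructType(StructField(id,LongType,true))
```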
Thanks for the answer. I see the issue: Spark infers the DataFrame schema only once and it cannot be changed. An alternative would be the ability to stop the app automatically when a schema change happens, so that it could be restarted (watchdog) with the new schema.
Hi @github-raphael-douyere. Spot on. Theoretically, it may be possible to change the logical plan from one microbatch to the next one, but I don't know how this could be implemented or if it would break something else in Spark. Your suggestion to restart a query when a schema change has been detected can work. In case of a schema change in the middle of a microbatch, some messages might still be consumed using the old schema. Maybe these constraints aren't present with the continuous mode, but I'm no expert about the continuous mode. If you find a solution to your problem, please feel free to share it here.
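A hedged sketch of that restart-on-schema-change approach, assuming the Confluent `CachedSchemaRegistryClient` API, an existing `SparkSession` named `spark`, and a hypothetical `buildQuery` helper that defines and starts the application's streaming query:

```scala
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical helper: defines and starts the stream, resolving the
// latest reader schema via ABRiS at (re)start time.
def buildQuery(spark: SparkSession): StreamingQuery = ???

val registry = new CachedSchemaRegistryClient("http://localhost:8081", 32)
val subject  = "cdc-topic-value" // TopicNameStrategy: <topic>-value

var version = registry.getLatestSchemaMetadata(subject).getVersion
var query   = buildQuery(spark)

while (true) {
  Thread.sleep(60000) // poll interval; tune to taste
  val latest = registry.getLatestSchemaMetadata(subject).getVersion
  if (latest != version) {
    // As noted above, messages in the microbatch that is currently running
    // are still decoded with the old schema.
    query.stop()
    version = latest
    query = buildQuery(spark) // restart picks up the new reader schema
  }
}
```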
Hi, I am facing the same issue. I just wanted to know if there is a solution to this problem.