Long delay after writing edge snapshots due to single-threaded renaming of output files by the Spark driver #324
I edited the issue description: previously, I had mistakenly assumed that the number of output files produced in the insert/delete stages is the same as the number of tasks, which is not the case. The number of output files produced for inserts/deletes is significantly smaller than the number of output files in the initial snapshot. This explains the difference in behavior between these two sets of stages. So the question reduces to: is there any way to control the number of files produced by the initial snapshot writing?
@arvindshmicrosoft thanks for reporting this and apologies for the delay in responding. @dszakallas has started to look into this. We have one question: which Spark version do you use? If it's 2.x, can you upgrade to 3.x without much hassle?
Thanks! My cluster config is:
So I am on the current latest for both, with OpenJDK 8.
Taking a look at the hadoop-azure documentation:

"Rename and Delete blob operations on directories with large number of files and sub directories currently is very slow as these operations are done one blob at a time serially. These files and sub folders can be deleted or renamed parallel. Following configurations can be used to enable threads to do parallel processing."

Note that I found this in the WASB connector's documentation, and I don't know whether this configuration works with ABFSS as well. Also, I found this piece of performance info on file operations regarding ABFSS: https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html#Performance_and_Scalability

It states that for stores with hierarchical namespaces, directory renames/deletes should be atomic. So in case you are using a namespaced store, the possible reason for the slowness is that individual file renames are happening instead of a single directory rename. I wonder if there's a configuration to achieve a single directory rename.

@szarnyasg we should also think about recalibrating the size of our partitions/files in our EMR workload, since they seem very small (on the order of a couple of MBs). Outputting larger chunks would mitigate issues like this.
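If you want to experiment, these are the settings the quoted WASB documentation refers to; whether the ABFS driver honors them is unverified, so treat this as a sketch of a possible experiment rather than a fix:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable parallel rename/delete threads for the hadoop-azure WASB
// connector, per the documentation quoted above. The thread counts are
// arbitrary examples, and it is unclear whether ABFS reads these keys at all.
val spark = SparkSession.builder()
  .appName("ldbc-datagen")
  .config("spark.hadoop.fs.azure.rename.threads", "32")
  .config("spark.hadoop.fs.azure.delete.threads", "32")
  .getOrCreate()
```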
Thanks @dszakallas - there is no obvious configuration within ABFSS to group these. ABFSS is just a victim here; the looping is triggered by these lines:
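(The embedded code reference did not survive this copy. Schematically, the pattern in question is a driver-side loop over the Hadoop FileSystem API like the following illustrative sketch — not the exact Datagen code:)

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative only: renaming output part files one at a time on the driver.
// Each fs.rename is a blocking remote call, so with tens of thousands of
// files this serializes badly against cloud object stores.
def renameOutputs(fs: FileSystem, srcDir: Path, dstDir: Path): Unit = {
  for (status <- fs.listStatus(srcDir)) {
    val src = status.getPath
    val dst = new Path(dstDir, src.getName)
    fs.rename(src, dst) // one blob-store round trip per file
  }
}
```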
In turn, such looping is due to the large number of partitions. The issue would be mitigated if we could configure the partition sizes in Datagen to increase the "chunk size". Is there such a configuration in Datagen itself?
You can configure the per-entity file-count multipliers in ldbc_snb_datagen_spark/src/main/java/ldbc/snb/datagen/serializer/FileName.java (lines 5 to 31 at commit 3c3af90).

You can try reducing those. Furthermore, I created a pull request that exposes an internal configuration parameter, the oversize factor, on the CLI. The multipliers are divided by this value, so a larger oversizeFactor results in fewer, larger files. See #334.
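As a rough sketch of the relationship described above (the function and names below are illustrative, not the actual Datagen code):

```scala
// Sketch: how an oversize factor shrinks the number of output files.
// `multiplier` stands in for a per-entity value from FileName.java; the
// actual Datagen computation may differ in detail.
def numberOfFiles(multiplier: Int, oversizeFactor: Double): Int =
  math.max(1, math.ceil(multiplier / oversizeFactor).toInt)

// e.g. a multiplier of 100 with oversizeFactor = 10.0 yields 10 files,
// each roughly 10x larger than before.
```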
I generated the SF30000 dataset directly to ADLS Gen2. What I observed is that after the major snapshots are written, there are significant delays due to single-threaded renaming of the output files. There is a simple, serialized loop to do this, and all of the action is on the driver. When working with remote cloud storage, the overall delays are really significant, taking up ~67% of the total datagen time:
The subsequent stages (inserts and deletes) for each of these edges do not exhibit as bad a "serialized" behavior, as they seem to write much smaller numbers of files compared to the initial snapshot. Is there any way to control the number of files produced by the initial snapshot writing?
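(For reference, the generic Spark-level technique for capping output file counts is to coalesce before writing, sketched below with assumed CSV output; whether Datagen exposes such a knob is exactly the question.)

```scala
import org.apache.spark.sql.DataFrame

// Generic Spark technique for producing fewer, larger output files:
// coalesce the DataFrame before writing. Whether this can be applied
// inside Datagen's snapshot writer is the open question here.
def writeSnapshot(df: DataFrame, path: String, numFiles: Int): Unit =
  df.coalesce(numFiles)
    .write
    .option("header", "true")
    .csv(path)
```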
A sample call stack of the "slow behavior" for the snapshots is below.