Add scan planning api request and response models, parsers #11369

rahil-c · 2024-10-21T16:57:09Z

Opening a new PR which focuses only on the scan planning models and parsers, based off the original PR here: https://github.com/apache/iceberg/pull/11180/files#diff-eeec6df5a574c599e4bab34e926583f9e382ec55bf0519d8fbea382bdd25c6c5

cc @amogh-jahagirdar @rdblue @danielcweeks @jackye1995 @nastra @singhpk234

* Cleanup plan table scan response
* more response cleanup

rahil-c · 2024-10-21T18:40:29Z

Thanks @amogh-jahagirdar for helping clean up the pr, it is greatly appreciated!

core/src/main/java/org/apache/iceberg/rest/responses/FetchPlanningResultResponse.java

core/src/main/java/org/apache/iceberg/UnboundBaseFileScanTask.java

amogh-jahagirdar · 2024-10-21T22:10:10Z

api/src/main/java/org/apache/iceberg/exceptions/EntityNotFoundException.java

+import com.google.errorprone.annotations.FormatMethod;
+
+/** Exception raised when an entity is not found. */
+public class EntityNotFoundException extends RESTException implements CleanableFailure {


Do we need this? I remember @rdblue mentioned a common exception for when a resource can't be found, but it doesn't seem like it's really being used anywhere.

I am not sure why we added this. I think we can remove it.

I'd say let's remove this for now since it's not being used at the moment. I think it's something we can add once there's clarity on how it would get used.

core/src/main/java/org/apache/iceberg/rest/requests/PlanTableScanRequest.java

core/src/main/java/org/apache/iceberg/rest/requests/PlanTableScanRequestParser.java

nastra · 2024-10-29T09:14:01Z

@rahil-c can you please make sure that my comments from #11180 are addressed here?

rahil-c · 2024-10-29T15:49:48Z

> @rahil-c can you please make sure that my comments from #11180 are addressed here?

@nastra yes, I will try to address the comments from the original PR here.

cc @amogh-jahagirdar

core/src/main/java/org/apache/iceberg/rest/requests/PlanTableScanRequestParser.java

core/src/main/java/org/apache/iceberg/UnboundBaseFileScanTask.java

* Fix partitioned table planning
* Fix condition for failing when specs by id is missing

amogh-jahagirdar

I think we're missing tests for the parsing of requests.

Edit: Never mind, we do have tests for those. The names of the tests were a bit different than I expected. Discussed with @rahil-c, who'll make the names more consistent.

amogh-jahagirdar · 2024-11-25T22:33:42Z

core/src/test/java/org/apache/iceberg/TestBase.java

+  public static final Map<Integer, PartitionSpec> PARTITION_SPECS_BY_ID = Map.of(0, SPEC);
+
+  public static final DataFile FILE_A =


Following up on #11180 (comment), I'm good if we want to make these public. It would've been nice to keep them package-private. The original concern was that we were sort of breaking the existing pattern, but as @rahil-c mentioned, SPEC and SCHEMA are already public.

amogh-jahagirdar · 2024-11-25T22:40:51Z

core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java

@@ -511,7 +511,6 @@ public Table loadTable(SessionContext context, TableIdentifier identifier) {
            endpoints);

    trackFileIO(ops);
-


Can we undo this change? It's unnecessary.

Will fix this

rahil-c · 2024-11-25T23:39:36Z

@nastra @amogh-jahagirdar I have addressed the comments from the main PR (https://github.com/apache/iceberg/pull/11180/files) regarding the models in this PR. If something looks off or was not addressed, please let me know. I will also do another pass.

amogh-jahagirdar · 2024-11-27T04:30:07Z

api/src/main/java/org/apache/iceberg/exceptions/EntityNotFoundException.java

+import com.google.errorprone.annotations.FormatMethod;
+
+/** Exception raised when an entity is not found. */
+public class EntityNotFoundException extends RESTException implements CleanableFailure {


I'd say let's remove this for now since it's not being used at the moment. I think it's something we can add once there's clarity on how it would get used.

amogh-jahagirdar · 2024-11-27T04:32:45Z

core/src/main/java/org/apache/iceberg/ContentFileParser.java

+    // ignore the ordinal position (ContentFile#pos) of the file in a manifest,
+    // as it isn't used and BaseFile constructor doesn't support it.


Don't think we need this comment, we can remove it

@amogh-jahagirdar I actually am not the original author of that comment.

This was the pr that added it: b8db3f0. Do you still want me to remove it?

I think it is okay since it is pre-existing. It's also good context since the position isn't in the spec and is used to help streaming readers keep incremental state.

amogh-jahagirdar · 2024-11-27T04:35:40Z

core/src/main/java/org/apache/iceberg/rest/requests/FetchScanTasksRequest.java

+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.rest.RESTRequest;
+
+@SuppressWarnings("checkstyle:VisibilityModifier")


Do we need this?

Let me see if I can get rid of this. I believe it was added because of a complaint about not having getters and setters for the plan-task.

I'm not sure if Checkstyle has a bug or if I'm doing something incorrect. It seems that Checkstyle is complaining that the private variable planTask does not have an accessor method. However, I do have the following in the class:

public String planTask() { return planTask; }

And it seems to follow the pattern such as https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/requests/UpdateTableRequest.java

If I remove the @SuppressWarnings("checkstyle:VisibilityModifier"), the build continues to fail, unfortunately, so I will keep it for now.
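For context on the complaint above: under its default settings, Checkstyle's VisibilityModifier check reports "Variable 'x' must be private and have accessor methods" whenever an instance field is not private, whether or not an accessor exists. The field declaration is not shown in this thread, so this is only a guess at the cause. A minimal sketch of the shape the check accepts, using a hypothetical `FetchScanTasksRequestSketch` class rather than the PR's actual code:

```java
import java.util.Objects;

/** Hypothetical sketch; not the FetchScanTasksRequest class from this PR. */
final class FetchScanTasksRequestSketch {
  // Declared private: the check looks at field visibility, not at whether an accessor exists.
  private final String planTask;

  FetchScanTasksRequestSketch(String planTask) {
    this.planTask = Objects.requireNonNull(planTask, "Invalid plan task: null");
  }

  // Accessor named after the field, matching the style quoted in the comment above.
  public String planTask() {
    return planTask;
  }

  public static void main(String[] args) {
    System.out.println(new FetchScanTasksRequestSketch("task-1").planTask());
  }
}
```

If the field in the real class is already private, the cause lies elsewhere and the suppression may still be needed.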

amogh-jahagirdar · 2024-11-27T04:43:26Z

core/src/main/java/org/apache/iceberg/rest/requests/PlanTableScanRequestParser.java

+    if (request.snapshotId() != null) {
+      gen.writeNumberField(SNAPSHOT_ID, request.snapshotId());
+    }
+
+    if (request.startSnapshotId() != null) {
+      gen.writeNumberField(START_SNAPSHOT_ID, request.startSnapshotId());
+    }
+
+    if (request.endSnapshotId() != null) {
+      gen.writeNumberField(END_SNAPSHOT_ID, request.endSnapshotId());
+    }


On second look, the way this is written is a bit confusing: on the surface it looks like snapshotId, startSnapshotId, and endSnapshotId could all be serialized, even though a request is either a point-in-time scan or an incremental scan. To be clear, I know that validate checks this and the request can't be in a state where all 3 are set. I'm talking more about how a reader would interpret this code.

Concretely, I'd recommend replacing this with:

    private void serializeSnapshotIdForScan(JsonGenerator gen, PlanTableScanRequest request)
        throws IOException {
      if (request.snapshotId() != null) {
        // point-in-time scan
        gen.writeNumberField(SNAPSHOT_ID, request.snapshotId());
      } else {
        // incremental scan
        gen.writeNumberField(START_SNAPSHOT_ID, request.startSnapshotId());
        gen.writeNumberField(END_SNAPSHOT_ID, request.endSnapshotId());
      }
    }

I think you'd then also be able to get rid of the cyclomatic complexity override above.

Actually, looking at this again, are we sure this is what we want? Reading the spec again: https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L609

If the request does not include either incremental or point-in-time config properties, scan planning should produce a point-in-time scan of the latest snapshot in the table's main branch

It seems possible that the client does not need to pass either a snapshotId or start and end snapshot ids, and that the server can just plan with the latest snapshot id if nothing was provided. I would then expect the following code to be an issue at serialization time,

    private void serializeSnapshotIdForScan(JsonGenerator gen, PlanTableScanRequest request)
        throws IOException {
      if (request.snapshotId() != null) {
        gen.writeNumberField(SNAPSHOT_ID, request.snapshotId());
      } else {
        gen.writeNumberField(START_SNAPSHOT_ID, request.startSnapshotId());
        gen.writeNumberField(END_SNAPSHOT_ID, request.endSnapshotId());
      }
    }

as it may hit an error if the start and end ids were also null. I think the current approach works because we have a null check around each of the ids. Let me know what you think, though.

Talked offline with @amogh-jahagirdar. I will try amending this to use an else-if that checks whether START_SNAPSHOT_ID is set.
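The else-if shape discussed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that landed in the PR: `ScanRequest` is a hypothetical stand-in for PlanTableScanRequest, a `Map` stands in for Jackson's JsonGenerator so the example runs without dependencies, and the field names `snapshot-id`, `start-snapshot-id`, and `end-snapshot-id` are assumed values for the SNAPSHOT_ID, START_SNAPSHOT_ID, and END_SNAPSHOT_ID constants.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SnapshotIdSerializationSketch {

  /** Hypothetical stand-in for PlanTableScanRequest, modeling only the three snapshot ids. */
  static final class ScanRequest {
    private final Long snapshotId;
    private final Long startSnapshotId;
    private final Long endSnapshotId;

    ScanRequest(Long snapshotId, Long startSnapshotId, Long endSnapshotId) {
      // Stand-in for the validate step mentioned in the thread: the two scan modes are exclusive.
      if (snapshotId != null && (startSnapshotId != null || endSnapshotId != null)) {
        throw new IllegalArgumentException(
            "Cannot set both a point-in-time snapshot id and incremental snapshot ids");
      }
      this.snapshotId = snapshotId;
      this.startSnapshotId = startSnapshotId;
      this.endSnapshotId = endSnapshotId;
    }

    Long snapshotId() {
      return snapshotId;
    }

    Long startSnapshotId() {
      return startSnapshotId;
    }

    Long endSnapshotId() {
      return endSnapshotId;
    }
  }

  /** Collects the fields that would be written; a Map stands in for Jackson's JsonGenerator. */
  static Map<String, Long> serializeSnapshotIds(ScanRequest request) {
    Map<String, Long> fields = new LinkedHashMap<>();
    if (request.snapshotId() != null) {
      // Point-in-time scan.
      fields.put("snapshot-id", request.snapshotId());
    } else if (request.startSnapshotId() != null) {
      // Incremental scan.
      fields.put("start-snapshot-id", request.startSnapshotId());
      // Assumption: the end id is written only when present; the thread does not settle this.
      if (request.endSnapshotId() != null) {
        fields.put("end-snapshot-id", request.endSnapshotId());
      }
    }
    // Neither mode set: write nothing and leave the server to plan against the latest snapshot.
    return fields;
  }

  public static void main(String[] args) {
    System.out.println(serializeSnapshotIds(new ScanRequest(1L, null, null)));
    System.out.println(serializeSnapshotIds(new ScanRequest(null, 2L, 3L)));
    System.out.println(serializeSnapshotIds(new ScanRequest(null, null, null)));
  }
}
```

Compared with the plain if/else proposal, the else-if keeps the two scan modes visibly exclusive while still tolerating a request that sets neither, which is the case the spec excerpt allows.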

amogh-jahagirdar · 2024-11-27T04:52:06Z

core/src/main/java/org/apache/iceberg/RESTFileScanTaskParser.java

+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.util.JsonUtil;
+
+public class RESTFileScanTaskParser {


On second look, shouldn't this be in the REST package?

If we move this to the REST package then we run into issues that we saw in the past with the Data and Delete Files being package private.

I think we can keep it where it is currently, so we can avoid making anything public.

Thanks for the reminder. I think I forgot to address this part in my PR to your branch. I think it's worth seeing how we can refactor this so that we can actually move RESTFileScanTaskParser into the REST package. It kind of stands out by not really following existing convention.

amogh-jahagirdar · 2024-11-27T05:02:07Z