GH-44065: [Java] Implement C Data Interface for RunEndEncodedVector #44241

Merged
merged 12 commits into from
Oct 17, 2024

Conversation

ViggoC
Contributor

@ViggoC ViggoC commented Sep 26, 2024

@vibhatha
Collaborator

@ViggoC it would be better to add a test here as well https://github.com/apache/arrow/blob/main/java/c/src/test/python/integration_tests.py
Please refer to the equivalent ones for StringView and ListView.

@ViggoC
Contributor Author

ViggoC commented Sep 26, 2024

@vibhatha Your response was really quick. I am still trying to understand your PR #41967, as well as how to run the tests. If you could give me some guidance, it would be a great help.
So is integration_tests.py the only must-have? How do I run it?

@vibhatha
Collaborator

Right, so there are a few things we have to do here.
First, try to build the C Data Interface related code; you can get an idea here. That way you can run the test cases.

For integration tests, you need to use archery. Refer to this. Make sure archery and everything else it needs are installed. I would use a mamba environment for this. Please refer to the Python development guidelines to learn more about using conda/mamba for that. The CLI help menu can give more guidance. Make sure to run both the IPC and the C Data tests.

One more important thing is the ComplexWriter component, which contains the JSON-based writers and readers that are key to these tests.

If you have any questions, please ask here. And thanks for pushing this effort.

@ViggoC
Contributor Author

ViggoC commented Oct 8, 2024

@vibhatha @lidavidm If I want to release this feature in version 18.0, when is the due date?

@vibhatha
Collaborator

vibhatha commented Oct 8, 2024

cc @raulcd

@lidavidm
Member

lidavidm commented Oct 8, 2024

@ViggoC ViggoC marked this pull request as ready for review October 8, 2024 14:29
Collaborator

@vibhatha vibhatha left a comment

I have made only one quick pass so far; I will review again.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 9, 2024
@vibhatha
Collaborator

vibhatha commented Oct 9, 2024

@github-actions crossbow submit -g java


github-actions bot commented Oct 9, 2024

Revision: b34aca3

Submitted crossbow builds: ursacomputing/crossbow @ actions-302154c076

Task Status
java-jars GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
verify-rc-source-java-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-java-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-java-linux-ubuntu-20.04-amd64 GitHub Actions
verify-rc-source-java-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-java-macos-amd64 GitHub Actions

java/c/src/test/python/integration_tests.py (outdated, resolved)
*/
@Override
public void splitAndTransfer(int startIndex, int length) {
throw new UnsupportedOperationException();
Member

Do we have an issue filed for this?

Collaborator

I don't think we have, but we should generally complete split and transfer features before we finalise the C Data interface. IMHO it is important to cover that in this stage.

Collaborator

At least this was the approach we took for StringView and ListView before.

Contributor Author

I have only implemented the necessary functions to pass through RoundTripTest. I'll implement it in this PR, if you all think so.

Collaborator

@ViggoC if this requires a lot more code or time, let's do it later, it is not a must. But later means as an immediate next feature. I was merely mentioning my practice.

Contributor Author

I think it's not time consuming, I'll implement it here.

Member

Thanks. I'll wait for that.


@Override
public void copyValueSafe(int from, int to) {
this.to.copyFrom(from, to, RunEndEncodedVector.this);
Member

It looks like this will just throw because we don't implement copyFrom, right?

Contributor Author

I think so.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Oct 9, 2024
@github-actions github-actions bot added the awaiting change review Awaiting change review label Oct 12, 2024
Member

@lidavidm lidavidm left a comment

LGTM, just waiting for splitAndTransfer, though let me know if you'd prefer to split it out after all.


@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 14, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 14, 2024
Collaborator

@vibhatha vibhatha left a comment

Added a few comments.

int physicalEndIndex,
int physicalLength) {
toRunEndVector.setValueCount(physicalLength);
toRunEndVector.getValidityBuffer().setOne(0, toRunEndVector.getValidityBuffer().capacity());
Collaborator

I am not sure about this. Could you please take a look at how other vectors handle this for the validity buffer?

Contributor Author

In ListVector, it slices the validityBuffer if startIndex % 8 == 0, and copies the data one by one when the first bit starts in the middle of a byte.
But for a run-end encoded vector, an element of the run-end vector can never be null, so I just set the validity buffer of the RunEndVector to 1.
What's your concern about this code? That the memory of the validity buffer should be reused when startIndex % 8 == 0, or that we should not set the bits beyond the physical length?

Member

I think this is fine

Collaborator

@ViggoC not saying it's wrong, I mentioned I wasn't sure.
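
To make the two validity-buffer strategies from the thread above concrete, here is a minimal sketch on plain byte arrays. It is not Arrow's implementation: the class and method names (`ValidityBitmapSlice`, `slice`, `allValid`) are hypothetical, the bitmap is assumed to be LSB-first as in the Arrow format, and the unaligned path copies one bit at a time purely for clarity.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only; not part of Arrow.
final class ValidityBitmapSlice {

    // Reads bit `index` from an LSB-first bitmap.
    static boolean getBit(byte[] bitmap, int index) {
        return (bitmap[index >> 3] & (1 << (index & 7))) != 0;
    }

    // Sets bit `index` in an LSB-first bitmap.
    static void setBit(byte[] bitmap, int index) {
        bitmap[index >> 3] |= (byte) (1 << (index & 7));
    }

    // Returns a bitmap holding bits [startIndex, startIndex + length) of src.
    static byte[] slice(byte[] src, int startIndex, int length) {
        int byteLength = (length + 7) / 8;
        if (startIndex % 8 == 0) {
            // Byte-aligned start: whole bytes can be taken as-is. A real
            // vector could share the memory here; this sketch copies it.
            // Bits past `length` in the last byte are carried along unchanged.
            int from = startIndex / 8;
            return Arrays.copyOfRange(src, from, from + byteLength);
        }
        // Unaligned start: the slice begins mid-byte, so every bit moves to
        // a new position and has to be copied individually.
        byte[] dst = new byte[byteLength];
        for (int i = 0; i < length; i++) {
            if (getBit(src, startIndex + i)) {
                setBit(dst, i);
            }
        }
        return dst;
    }

    // Run ends are never null, so their bitmap can simply be filled with
    // ones, including the padding bits beyond `length`.
    static byte[] allValid(int length) {
        byte[] bitmap = new byte[(length + 7) / 8];
        Arrays.fill(bitmap, (byte) 0xFF);
        return bitmap;
    }
}
```

`allValid` mirrors the approach taken in the PR (set everything to 1), while `slice` mirrors the aligned and unaligned cases described for ListVector.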

ViggoC and others added 3 commits October 15, 2024 13:28
fix nits comment

Co-authored-by: Vibhatha Lakmal Abeykoon <[email protected]>

int physicalLastIndex = physicalLength - 1;
if (toRunEndVector instanceof SmallIntVector) {
byte typeWidth = SmallIntVector.TYPE_WIDTH;
for (int i = 0; i < physicalLastIndex; i++) {
Member

Shouldn't the loop go from [0, physicalLength) or [physicalStartIndex, physicalLastIndex)?

Contributor Author

It iterates through [0, physicalLength - 1) and handles physicalLength - 1 separately.

Member

Right, but this loop is from [0, physicalLastIndex) - aren't we "crossing indices" here?

Contributor Author

physicalLastIndex = physicalLength - 1

Member

Ah, right. Sorry, the naming is confusing, IMO - physicalLastIndex feels like it should go with physicalStartIndex and not physicalLength

Contributor Author

Yeah, the naming is kind of confusing. I renamed them; does it make sense to you now?
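
The index discussion above is easier to follow with a small standalone example. The sketch below rebases run ends for a logical slice using plain int arrays. It is an illustration under stated assumptions, not the code merged in this PR: `RunEndSlice`, `physicalIndex`, and `sliceRunEnds` are hypothetical names, run ends are assumed to be exclusive cumulative logical offsets, and the lookup is a linear scan for simplicity.

```java
// Hypothetical helper, for illustration only; not part of Arrow.
final class RunEndSlice {

    // Index of the run that contains the given logical index, i.e. the
    // first run whose (exclusive) end is greater than logicalIndex.
    static int physicalIndex(int[] runEnds, int logicalIndex) {
        for (int i = 0; i < runEnds.length; i++) {
            if (runEnds[i] > logicalIndex) {
                return i;
            }
        }
        throw new IndexOutOfBoundsException("logical index " + logicalIndex);
    }

    // Run ends of the logical slice [startIndex, startIndex + length),
    // rebased so that the slice starts at logical index 0.
    static int[] sliceRunEnds(int[] runEnds, int startIndex, int length) {
        if (length == 0) {
            return new int[0];
        }
        int physicalStartIndex = physicalIndex(runEnds, startIndex);
        int physicalEndIndex = physicalIndex(runEnds, startIndex + length - 1);
        int physicalLength = physicalEndIndex - physicalStartIndex + 1;

        int[] result = new int[physicalLength];
        // Every run except the last ends inside the slice, so its end is
        // just shifted left by startIndex: the loop covers [0, physicalLength - 1).
        for (int i = 0; i < physicalLength - 1; i++) {
            result[i] = runEnds[physicalStartIndex + i] - startIndex;
        }
        // The last run may extend past the slice, so it is handled
        // separately and always ends at the slice length.
        result[physicalLength - 1] = length;
        return result;
    }
}
```

For run ends `{3, 5, 9}`, slicing 6 logical values starting at logical index 1 touches all three runs and yields `{2, 4, 6}`. The last entry is clamped to the slice length rather than computed as `9 - 1`, which is why the final physical index is treated separately from the loop.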

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 15, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Oct 15, 2024
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Oct 17, 2024
@lidavidm lidavidm merged commit b175463 into apache:main Oct 17, 2024
18 checks passed
@lidavidm lidavidm removed the awaiting merge Awaiting merge label Oct 17, 2024

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit b175463.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

3 participants