Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make FSTCompiler.compile() to only return the FSTMetadata #12831

Merged
merged 2 commits into from
Feb 5, 2024

Conversation

dungba88
Copy link
Contributor

@dungba88 dungba88 commented Nov 22, 2023

Description

Make FSTCompiler.compile() to only return the FSTMetadata. Depending on whether the DataOutput used implements FSTReader or not, the returned FST might be unreadable. Thus having this method only return the FSTMetadata means we will never return an unusable FST.

Now to create the FST there will be 2 ways:

  1. If a FSTReader DataOutput is used, we can create the FST with:
var fstMetadata = fstCompiler.compile();
var fst = FST.fromFSTReader(fstMetadata, fstCompiler.getFSTReader());
  1. If a non-FSTReader DataOutput is used, we need to first create the corresponding DataInput, then call:
var fstMetadata = fstCompiler.compile();
if (fstMetadata == null) {
    return null; // FST is empty
}

// Create the DataInput
var dataInput = ...

var fst = new FST<>(fstMetadata, dataInput);

Note that this PR depends on 2 others to be merged first:

FST.fromFSTReader will also handle the case where fstMetadata is null, in which case it would return null. It is essentially a sugar syntactic utility to make the migration easier. So we will change this code:

var fst = fstCompiler.compile();

to this:

var fst = FST.fromFSTReader(fstCompiler.compile(), fstCompiler.getFSTReader());

instead of:

var fstMetadata = fstCompiler.compile();
if (fstMetadata == null) {
    return;
}
var fst = new FST<>(fstMetadata, fstCompiler.getFSTReader());

@dungba88 dungba88 marked this pull request as draft November 22, 2023 05:22
@dungba88
Copy link
Contributor Author

dungba88 commented Dec 5, 2023

I'll put the CHANGES.txt for #12624 together with this

@dungba88 dungba88 force-pushed the compiler-metadata branch 3 times, most recently from 8ca0921 to 6418026 Compare December 8, 2023 04:21
* Save the FST to DataOutput. If you use an {@link org.apache.lucene.store.IndexOutput} to build
* the FST, then you should not and do not need to call this method, as the FST is already saved.
* Doing so will throw an {@link UnsupportedOperationException}.
* Save the FST to DataOutput.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's no longer possible for user to create a FST that uses NullFSTReader, so this Javadoc will no longer holds. If users are able to create FST, then it will always be readable.

@@ -111,7 +111,7 @@ public NormalizeCharMap build() {
for (Map.Entry<String, String> ent : pendingPairs.entrySet()) {
fstCompiler.add(Util.toUTF16(ent.getKey(), scratch), new CharsRef(ent.getValue()));
}
map = fstCompiler.compile();
map = FST.fromFSTReader(fstCompiler.compile(), fstCompiler.getFSTReader());
Copy link
Contributor Author

@dungba88 dungba88 Dec 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fromFSTReader is there to avoid the boilerplate null-check that each consumer must now do. Open for method name suggestion.

@@ -175,7 +176,7 @@ private FSTCompiler(
fst =
new FST<>(
new FST.FSTMetadata<>(inputType, outputs, null, -1, VERSION_CURRENT, 0),
toFSTReader(dataOutput));
NULL_FST_READER);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we can't pass actual null to FSTReader, NullFSTReader will be used instead.

@dungba88 dungba88 marked this pull request as ready for review December 9, 2023 07:29
@dungba88
Copy link
Contributor Author

dungba88 commented Dec 10, 2023

One of the point I'm unsure about this PR is that there is now 2 (obscured) ways to construct the FST, one using the DataInput+FSTStore and one using FSTReader returned by the on-heap DataOutput. If users don't read the Javadoc, they might be confused on how to get the FST. Maybe there could be other way which feels more natural.

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 11, 2024
@mikemccand
Copy link
Member

Thank you stale bot!

@dungba88 -- what is the status of this change?

I think it makes sense to have two FST compile+consume paths -- one on heap, that you can (efficiently) consume (read) right away without writing FST to stable storage, another that writes and then reads from stable storage.

@github-actions github-actions bot removed the Stale label Jan 13, 2024
@dungba88
Copy link
Contributor Author

Thanks bot and @mikemccand, apparently I forgot about this PR!

I'll make another revision. The rebase after conflict seems to break the build.

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 30, 2024
Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @dungba88, and, incredibly, no additional conflicts despite getting stale.

I'll merge soon. Thanks!

@mikemccand
Copy link
Member

This is technically an API break, but FSTCompiler is an experimental API and effectively an internal Lucene datastructure, so I think we can safely backport to 9.x without deprecation path.

@mikemccand mikemccand merged commit 63d4ba9 into apache:main Feb 5, 2024
5 checks passed
@dungba88
Copy link
Contributor Author

dungba88 commented Feb 6, 2024

Thank you for merging @mikemccand !

mikemccand pushed a commit that referenced this pull request Feb 7, 2024
* Make FSTCompiler.compile() to only return the FSTMetadata

* tidy code
@mikemccand
Copy link
Member

Thank you for merging @mikemccand !

Thank you!

@mikemccand mikemccand added this to the 9.10.0 milestone Feb 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants