Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Throw CorruptSegmentInfoException on encountering missing segment info (_N.si) file in CheckIndex #12872

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

gokaai
Copy link
Contributor

@gokaai gokaai commented Dec 3, 2023

Description

Changes:

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for tackling this @gokaai! I added some comments -- I'm hoping we can find a less invasive way to know which .si failed.

@@ -664,7 +667,7 @@ public int compare(String a, String b) {
// failure (fastFail == false). so if we get here, we should // always have a valid lastCommit:
assert lastCommit != null;

if (lastCommit == null) {
if (lastCommit == null) { // TODO: Remove/move unreachable if statement?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this might still happen? E.g. some transient IOException or an Access Denied or so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we already assert lastCommit != null just before

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh you are right! And looking at the code, unless there is a bug, lastCommit should never be null. So I think an assert is appropriate and we should remove the if.

infoStream,
"WARNING: "
+ result.numBadSegments
+ " broken segments (containing "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe say containing at least?

@@ -948,7 +965,7 @@ private Status.SegmentInfoStatus testSegment(
segInfoStat.maxDoc = info.info.maxDoc();

final Version version = info.info.getVersion();
if (info.info.maxDoc() <= 0) {
if (info.info.maxDoc() < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm why did you have to remove the 0 case? Lucene considers this corruption (IW should never flush or merge to a 0 doc segment).

@@ -957,6 +974,9 @@ private Status.SegmentInfoStatus testSegment(
SegmentReader reader = null;

try {
if (info.info.maxDoc() == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I see, you now separately handle the maxDoc == 0 case, here. But, what if an index has an illegal 0 doc segment, and an intact .si? We will now (incorrectly) state that it is missing its SegmentInfo? Maybe we could add a separate boolean somewhere to track segmentInfoMissing or so?

@@ -957,6 +974,9 @@ private Status.SegmentInfoStatus testSegment(
SegmentReader reader = null;

try {
if (info.info.maxDoc() == 0) {
throw new CheckIndexException(" this segment has missing or corrupt segment info");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the extra space before this?

Also, can we include specifics about which segment we were unable to read the .si for in the exception message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the extra space before this

I was keeping it consistent with this exception message - should I change this one too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yes please?

+ " docs from the index. YOU WILL LOSE DATA. THIS IS YOUR LAST CHANCE TO CTRL+C!");
if (result.numBadSegmentsWithoutSegmentInfo > 0) {
opts.out.println(
"WARNING: " + result.totLoseDocCount + " or more documents will be lost\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change the wording to at least ...? I think that's stronger / more alarming on casual read than or more.

Or maybe more than XX documents will be lost, since we (almost certainly) know the lost segment(s) have maxDoc > 0?

Whichever wording we use, let's be consistent in the thrown exception message above too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! Will rephrase as 'at least' in all occurrences

codec.segmentInfoFormat().read(directory, segName, segmentID, IOContext.READ);
try {
info = codec.segmentInfoFormat().read(directory, segName, segmentID, IOContext.READ);
} catch (FileNotFoundException | NoSuchFileException e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm can we upgrade this to catch any "normal" (not crazy things like OOME) Throwable? E.g. if the .si file had some bits scrambled or what not we will get a more exotic exception or maybe an exception because the checksum is mismatched.

Can we somehow return the root cause exception here and include it in CheckIndexException that we raise above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - will catch all exceptions and throw a new, specific CorruptSegmentInfoException

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about IOException? Tried catching all Throwable t here but it causes TestIndexWrtier#testThreadInterruptDeadlock to fail as it catches ThreadInterruptExceptions when it shouldn't

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exception handling is tricky! I'm not sure what is the right exact set of exceptions to catch here ... but it should be broader than just the "no such file" exceptions I think. Corruption in .si can result in exotic exceptions...

// segment info (.si) file, we initialize an empty segment info (containing 0 docs)
// to keep track of the segment name, so that we can continue reading other segments, and
// the segment with the missing .si file can be later validated or exorcised by CheckIndex()
info =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I don't like that this will suppress corruption when SegmentInfos is being loaded NOT from CheckIndex? This is the wrong layer to have such leniency? This method should throw the exception back up to the caller and caller should handle properly?

Could we instead 1) create a subclass of CorruptIndexException, maybe CorruptSegmentInfoException or so, 2) require that caller pass in the segment name, and root cause exception, when creating it, and 3) catch any Throwable here and re-throw that subclass?

Then we can fix CheckIndex to specifically catch that, pull the segment name out, and do its thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing @mikemccand ! What I find tricky here are two things -

  • Throwing an exception in CheckIndex short-circuits the flow that makes it possible for -exorcise to do it's thing
  • Since we don't do a deeper level check on segments which aren't the latest segment, the recently added unit test where we expect prior commits to throw an exception if corrupt, does not pass

I'll make a new revision trying to address both of these things + your comments

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Throwing an exception in CheckIndex short-circuits the flow that makes it possible for -exorcise to do it's thing

Hmm this should still work. The exception should be thrown at a low level, when the codec is trying to read the .si file. This is then caught by CheckIndex, and it knows which segment is affected by checking instanceof of the exception and then extracting the segment name. Then it can report the root cause and/or exorcise just fine?

  • Since we don't do a deeper level check on segments which aren't the latest segment, the recently added unit test where we expect prior commits to throw an exception if corrupt, does not pass

Hmm why is that? Codec should throw an exception when we try to load that (non-latest) SIS too?

@gokaai
Copy link
Contributor Author

gokaai commented Dec 19, 2023

Decided to tackle this in smaller steps -

  • Made a new (simpler) revision where we throw a more specific error on encountering missing segment info. Created a new exception CorruptSegmentInfoException.java
  • Will create a new commit (or issue?) to add in fixes and unit tests for exorciseIndex

@gokaai gokaai force-pushed the LUCENE-7820 branch 2 times, most recently from 5632e30 to 59384e7 Compare December 22, 2023 11:24
Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gokaai!

import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.DataOutput;

/** This exception is thrown when Lucene detects an inconsistency in the index. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we improve the javadoc to state when this particular class is used (to express corruption when loading a single _N.si file?).

"segment info file: " + segName + ".si cannot be read - it may be missing or corrupt",
input);
} else {
throw t;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm why not also CorruptSegmentInfoException in this case too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried catching all Throwable t here but it causes TestIndexWrtier#testThreadInterruptDeadlock to fail as it doesn't throw ThreadInterruptExceptions

public class CorruptSegmentInfoException extends CorruptIndexException {

/** Create exception with a message only */
public CorruptSegmentInfoException(String message, DataInput input) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this class also take the segmentName so CheckIndex can do something interesting with it (know which segment to exorcise, report which segment's .si is broken, etc.)?

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 18, 2024
info = codec.segmentInfoFormat().read(directory, segName, segmentID, IOContext.READ);
} catch (ThreadInterruptedException e) {
throw e;
} catch (Exception e) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching a general-purpose exception here was tricky !

  • TestIndexWrtier#testThreadInterruptDeadlock expected a ThreadInterruptedException to be thrown as is
  • TestTransactions#testTransactions expected an 'on-purpose' IOException to be thrown as is

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gokaai!

@gokaai gokaai changed the title Enable CheckIndex to exorcise segments with missing segment infos (.si) (#7820) Throw CorruptSegmentInfoException on encountering missing segment info (_N.si) file in CheckIndex Jan 19, 2024
@gokaai
Copy link
Contributor Author

gokaai commented Jan 19, 2024

I edited the PR description to more accurately represent what is being done. We aren't dealing with the exorcism issue yet, but throwing the right exception on encountering missing segment info. I will tackle the final step (actually making exorcism possible) in a subsequent PR

@github-actions github-actions bot removed the Stale label Jan 20, 2024
Copy link

github-actions bot commented Feb 4, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Feb 4, 2024
@mikemccand
Copy link
Member

Thanks @gokaai -- I'll try to review soon!

If possible please try not to force-push: it removes the history of the past commits and makes it harder to see what changed on this iteration :)

@github-actions github-actions bot removed the Stale label Feb 6, 2024
@gokaai
Copy link
Contributor Author

gokaai commented Feb 8, 2024

Thanks @gokaai -- I'll try to review soon!

If possible please try not to force-push: it removes the history of the past commits and makes it harder to see what changed on this iteration :)

Sorry about that @mikemccand . I messed up my workspace and couldn't salvage it - I'll keep it in mind while making any new revisions.

@mikemccand
Copy link
Member

Thanks @gokaai -- I'll try to review soon!
If possible please try not to force-push: it removes the history of the past commits and makes it harder to see what changed on this iteration :)

Sorry about th

Ahh OK no worries. I experience such git disasters myself far more often than I would hope :)

@dweiss
Copy link
Contributor

dweiss commented Feb 8, 2024

Thanks @gokaai -- I'll try to review soon!
If possible please try not to force-push: it removes the history of the past commits and makes it harder to see what changed on this iteration :)

Sorry about th

Ahh OK no worries. I experience such git disasters myself far more often than I would hope :)

Look at what git reflog does and how it may help. It is really, really difficult to not be able to recover your work with git.

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants