-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Throw CorruptSegmentInfoException on encountering missing segment info (_N.si) file in CheckIndex #12872
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for tackling this @gokaai! I added some comments -- I'm hoping we can find a less invasive way to know which .si
failed.
@@ -664,7 +667,7 @@ public int compare(String a, String b) { | |||
// failure (fastFail == false). so if we get here, we should // always have a valid lastCommit: | |||
assert lastCommit != null; | |||
|
|||
if (lastCommit == null) { | |||
if (lastCommit == null) { // TODO: Remove/move unreachable if statement? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm this might still happen? E.g. some transient IOException
or an Access Denied or so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we already assert lastCommit != null
just before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh you are right! And looking at the code, unless there is a bug, lastCommit
should never be null. So I think an assert
is appropriate and we should remove the if
.
infoStream, | ||
"WARNING: " | ||
+ result.numBadSegments | ||
+ " broken segments (containing " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe say containing at least
?
@@ -948,7 +965,7 @@ private Status.SegmentInfoStatus testSegment( | |||
segInfoStat.maxDoc = info.info.maxDoc(); | |||
|
|||
final Version version = info.info.getVersion(); | |||
if (info.info.maxDoc() <= 0) { | |||
if (info.info.maxDoc() < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm why did you have to remove the 0
case? Lucene considers this corruption (IW should never flush or merge to a 0 doc segment).
@@ -957,6 +974,9 @@ private Status.SegmentInfoStatus testSegment( | |||
SegmentReader reader = null; | |||
|
|||
try { | |||
if (info.info.maxDoc() == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh I see, you now separately handle the maxDoc == 0
case, here. But, what if an index has an illegal 0 doc segment, and an intact .si
? We will now (incorrectly) state that it is missing its SegmentInfo
? Maybe we could add a separate boolean somewhere to track segmentInfoMissing
or so?
@@ -957,6 +974,9 @@ private Status.SegmentInfoStatus testSegment( | |||
SegmentReader reader = null; | |||
|
|||
try { | |||
if (info.info.maxDoc() == 0) { | |||
throw new CheckIndexException(" this segment has missing or corrupt segment info"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the extra space before this
?
Also, can we include specifics about which segment we were unable to read the .si
for in the exception message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the extra space before this
I was keeping it consistent with this exception message - should I change this one too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh yes please?
+ " docs from the index. YOU WILL LOSE DATA. THIS IS YOUR LAST CHANCE TO CTRL+C!"); | ||
if (result.numBadSegmentsWithoutSegmentInfo > 0) { | ||
opts.out.println( | ||
"WARNING: " + result.totLoseDocCount + " or more documents will be lost\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change the wording to at least ...
? I think that's stronger / more alarming on casual read than or more
.
Or maybe more than XX documents will be lost
, since we (almost certainly) know the lost segment(s) have maxDoc > 0
?
Whichever wording we use, let's be consistent in the thrown exception message above too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense! Will rephrase as 'at least' in all occurrences
codec.segmentInfoFormat().read(directory, segName, segmentID, IOContext.READ); | ||
try { | ||
info = codec.segmentInfoFormat().read(directory, segName, segmentID, IOContext.READ); | ||
} catch (FileNotFoundException | NoSuchFileException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm can we upgrade this to catch any "normal" (not crazy things like OOME) Throwable
? E.g. if the .si
file had some bits scrambled or what not we will get a more exotic exception or maybe an exception because the checksum is mismatched.
Can we somehow return the root cause exception here and include it in CheckIndexException
that we raise above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - will catch all exceptions and throw a new, specific CorruptSegmentInfoException
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about IOException
? Tried catching all Throwable t
here but it causes TestIndexWrtier#testThreadInterruptDeadlock
to fail as it catches ThreadInterruptExceptions
when it shouldn't
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exception handling is tricky! I'm not sure what is the right exact set of exceptions to catch here ... but it should be broader than just the "no such file" exceptions I think. Corruption in .si
can result in exotic exceptions...
// segment info (.si) file, we initialize an empty segment info (containing 0 docs) | ||
// to keep track of the segment name, so that we can continue reading other segments, and | ||
// the segment with the missing .si file can be later validated or exorcised by CheckIndex() | ||
info = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I don't like that this will suppress corruption when SegmentInfos
is being loaded NOT from CheckIndex
? This is the wrong layer to have such leniency? This method should throw the exception back up to the caller and caller should handle properly?
Could we instead 1) create a subclass of CorruptIndexException
, maybe CorruptSegmentInfoException
or so, 2) require that caller pass in the segment name, and root cause exception, when creating it, and 3) catch any Throwable
here and re-throw that subclass?
Then we can fix CheckIndex
to specifically catch that, pull the segment name out, and do its thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing @mikemccand ! What I find tricky here are two things -
- Throwing an exception in
CheckIndex
short-circuits the flow that makes it possible for-exorcise
to do it's thing - Since we don't do a deeper level check on segments which aren't the latest segment, the recently added unit test where we expect prior commits to throw an exception if corrupt, does not pass
I'll make a new revision trying to address both of these things + your comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Throwing an exception in
CheckIndex
short-circuits the flow that makes it possible for-exorcise
to do it's thing
Hmm this should still work. The exception should be thrown at a low level, when the codec is trying to read the .si
file. This is then caught by CheckIndex
, and it knows which segment is affected by checking instanceof
of the exception and then extracting the segment name. Then it can report the root cause and/or exorcise just fine?
- Since we don't do a deeper level check on segments which aren't the latest segment, the recently added unit test where we expect prior commits to throw an exception if corrupt, does not pass
Hmm why is that? Codec should throw an exception when we try to load that (non-latest) SIS too?
Decided to tackle this in smaller steps -
|
5632e30
to
59384e7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gokaai!
import org.apache.lucene.store.DataInput; | ||
import org.apache.lucene.store.DataOutput; | ||
|
||
/** This exception is thrown when Lucene detects an inconsistency in the index. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we improve the javadoc to state when this particular class is used (to express corruption when loading a single _N.si
file?).
"segment info file: " + segName + ".si cannot be read - it may be missing or corrupt", | ||
input); | ||
} else { | ||
throw t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm why not also CorruptSegmentInfoException
in this case too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried catching all Throwable t here but it causes TestIndexWrtier#testThreadInterruptDeadlock
to fail as it doesn't throw ThreadInterruptExceptions
public class CorruptSegmentInfoException extends CorruptIndexException { | ||
|
||
/** Create exception with a message only */ | ||
public CorruptSegmentInfoException(String message, DataInput input) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this class also take the segmentName
so CheckIndex
can do something interesting with it (know which segment to exorcise, report which segment's .si
is broken, etc.)?
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
…o (_N.si) file in CheckIndex
info = codec.segmentInfoFormat().read(directory, segName, segmentID, IOContext.READ); | ||
} catch (ThreadInterruptedException e) { | ||
throw e; | ||
} catch (Exception e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Catching a general-purpose exception here was tricky !
TestIndexWrtier#testThreadInterruptDeadlock
expected a ThreadInterruptedException to be thrown as isTestTransactions#testTransactions
expected an 'on-purpose' IOException to be thrown as is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gokaai!
I edited the PR description to more accurately represent what is being done. We aren't dealing with the exorcism issue yet, but throwing the right exception on encountering missing segment info. I will tackle the final step (actually making exorcism possible) in a subsequent PR |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
Thanks @gokaai -- I'll try to review soon! If possible please try not to force-push: it removes the history of the past commits and makes it harder to see what changed on this iteration :) |
Sorry about that @mikemccand . I messed up my workspace and couldn't salvage it - I'll keep it in mind while making any new revisions. |
Ahh OK no worries. I experience such |
Look at what git reflog does and how it may help. It is really, really difficult to not be able to recover your work with git. |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
Description
Changes: