
fix(bitswap/client/msgq): prevent duplicate requests #691

Open · wants to merge 2 commits into main
Conversation

@Wondertan (Member) commented Oct 17, 2024:

Previously, in-progress requests could be re-requested during periodic rebroadcast. The queue sends requests, and while a response is still awaited, the rebroadcast event fires. The rebroadcast event moves previously sent WANTs back to pending and sends them again in a new message, duplicating some WANT requests.

The solution here is to ensure a WANT has been in sent status for long enough before bringing it back to pending. This utilizes the existing sentAt map, which tracks when every CID was sent: on every rebroadcast event, it checks whether each message has been outstanding for longer than rebroadcastInterval.

@Wondertan Wondertan requested a review from a team as a code owner October 17, 2024 18:42
```go
if mq.bcstWants.sent.Len() == 0 && mq.peerWants.sent.Len() == 0 {
	return false
}
mq.rebroadcastIntervalLk.RLock()
rebroadcastInterval := mq.rebroadcastInterval
```
@Wondertan (Member, Author) commented Oct 17, 2024:
Alternatively, this could be a different new parameter/constant

@Wondertan (Member, Author):

I tested this on a k8s cluster and with a local node connected to it. It works as expected, but I believe this would benefit a lot from a proper test. Unfortunately, I can't allocate time to writing one. It's not that straightforward.

@Wondertan (Member, Author) commented Oct 17, 2024:

For context, I detect duplicates with a custom multihash that logs when the same data is hashed again. This essentially uncovered #690 as well as this issue.

@Wondertan Wondertan force-pushed the message-queue-duplicates branch 3 times, most recently from d193c2f to 9020b71 on October 19, 2024 23:20
codecov bot commented Oct 19, 2024:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.41%. Comparing base (19bcc75) to head (a61d89f).
Report is 50 commits behind head on main.


```
@@            Coverage Diff             @@
##             main     #691      +/-   ##
==========================================
+ Coverage   60.36%   60.41%   +0.05%
==========================================
  Files         243      243
  Lines       31034    31030       -4
==========================================
+ Hits        18734    18748      +14
+ Misses      10634    10618      -16
+ Partials     1666     1664       -2
```
| Files with missing lines | Coverage Δ |
| --- | --- |
| ...tswap/client/internal/messagequeue/messagequeue.go | 84.49% <100.00%> (-0.43%) ⬇️ |
| bitswap/client/wantlist/wantlist.go | 90.90% <ø> (-0.88%) ⬇️ |

... and 16 files with indirect coverage changes



@lidel lidel added the need/triage Needs initial labeling and prioritization label Oct 22, 2024
@gammazero gammazero added need/analysis Needs further analysis before proceeding need/maintainers-input Needs input from the current maintainer(s) and removed need/triage Needs initial labeling and prioritization labels Oct 22, 2024
```go
		mq.peerWants.sent.Remove(want.Cid)
		toRebroadcast++
	}
}
```
Contributor:

This loop looks like a duplicate of the above.

@Wondertan (Member, Author):

It's peerWants vs broadcastWants, though.

@Wondertan (Member, Author):

What I mean is that it looks like a duplicate, but it works on a different map.

@gammazero (Contributor) commented Oct 29, 2024:

What I meant is that these look like they should call the same function defined for a WantList, like `Absorb` in boxo/bitswap/client/wantlist/wantlist.go.

```go
		mq.peerWants.sent.Remove(want.Cid)
		toRebroadcast++
	}
}
```
Contributor:

The previous call to mq.peerWants.pending.Absorb also did:

```go
// Invalidate the cache up-front to avoid doing any work trying to keep it up-to-date.
w.cached = nil
```

Even though not all the sent entries may be added to pending in the new code, the cache still needs to be cleared. Otherwise, calling mq.peerWants.pending.Entries() will not return the newly added entries. Alternatively, the new entries can be added to the wantlist's cached entries and sorted.

I suggest keeping the previous code:

```go
mq.bcstWants.pending.Absorb(mq.bcstWants.sent)
mq.peerWants.pending.Absorb(mq.peerWants.sent)
```

And instead modify the Absorb function in boxo/bitswap/client/wantlist/wantlist.go. WDYT?

@Wondertan (Member, Author):

This doesn't work because the pending *WantList doesn't have access to the sentAt map.

@Wondertan (Member, Author):

I can try to make a new method on the recallWantList struct, though.

@Wondertan (Member, Author):

Ok, check the new version

@Wondertan (Member, Author):

On the cache invalidation note: WantList.Add also invalidates the cache, so we don't have to do it here.
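To illustrate the invalidation pattern being discussed (a minimal sketch of a simplified wantlist; only the `cached` field mirrors the code quoted earlier, everything else is an assumption):

```go
package main

import (
	"fmt"
	"sort"
)

// minimal wantlist sketch: a set of CIDs plus a lazily built,
// sorted cache of entries
type wantlist struct {
	set    map[string]bool
	cached []string
}

// Add inserts a CID and invalidates the cache, so the next call to
// Entries rebuilds the view and sees the new entry.
func (w *wantlist) Add(cid string) {
	w.set[cid] = true
	w.cached = nil
}

// Entries returns a sorted view of the set, rebuilding it on demand.
func (w *wantlist) Entries() []string {
	if w.cached == nil {
		for cid := range w.set {
			w.cached = append(w.cached, cid)
		}
		sort.Strings(w.cached)
	}
	return w.cached
}

func main() {
	w := &wantlist{set: map[string]bool{}}
	w.Add("b")
	fmt.Println(w.Entries()) // prints [b]
	w.Add("a")               // invalidates the cache built above
	fmt.Println(w.Entries()) // prints [a b]
}
```

Because Add clears `cached`, callers never observe a stale Entries() view, which is the concern raised in the review above.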

Comment on lines -491 to -492:

```go
if mq.bcstWants.sent.Len() == 0 && mq.peerWants.sent.Len() == 0 {
	return false
```
Contributor:

This is probably good to leave since it avoids Lock/Unlock of mq.rebroadcastIntervalLk and time.Now().

```go
if mq.bcstWants.sent.Len() == 0 && mq.peerWants.sent.Len() == 0 {
	return 0
}
```

@Wondertan (Member, Author):

The lock exists only for testing. The interval is never changed outside of the unit test. Thus, I don't see any contention that the zero-length check could prevent.

Contributor:

I think the comment is not about contention but about saving unnecessary lock/unlock calls, but if this only happens every 30 seconds, then it's probably not very important.

@gammazero (Contributor):

triage note: This is a good candidate for testing in rainbow staging to observe performance differences.

@gammazero gammazero added status/blocked Unable to be worked further until needs are met need/author-input Needs input from the original author and removed need/maintainers-input Needs input from the current maintainer(s) labels Oct 29, 2024
Previously, in-progress requests could be re-requested during periodic rebroadcast.
The queue sends requests, and while awaiting a response, the rebroadcast event happens.
The rebroadcast event changes previously sent WANTs to pending and sends them again in a new message.

The solution here is to ensure a WANT was in sent status for long enough before bringing it back to pending.
This utilizes the existing `sentAt` map, which tracks when every CID was sent.
Comment on lines -134 to -135
```go
// Absorb all the entries in other into this want list
func (w *Wantlist) Absorb(other *Wantlist) {
```
@Wondertan (Member, Author):

Deleted as dead code.

@lidel lidel requested a review from hsanjuan November 12, 2024 17:35
@hsanjuan (Contributor) left a comment:

The main thing to consider here is that:

  • before, a "want" would be re-broadcast at most 30 seconds after it was sent (could be 0.1s)
  • after, a "want" would be re-broadcast only once at least 30 seconds have passed since it was sent (could be 59.9s).

In that respect the code looks good.

I am not sure how much of an improvement this is in practice (perhaps clients were lucky to hit a short rebroadcast period sometimes), but it makes clients more respectful at least and perf should not be based on "luck".

I think we can test on staging and discuss in the next triage if we accept the change.
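Assuming the old code moved every sent want back to pending on each periodic tick, the bounds described in the review above can be reproduced with a small model (a standalone sketch, not the actual queue code; times are in tenths of a second to keep the arithmetic exact):

```go
package main

import "fmt"

// interval in tenths of a second (30s), matching the default
// rebroadcast period discussed above
const interval = 300

// timeUntilRebroadcast returns how many tenths of a second after being
// sent a want is rebroadcast, given the send time's offset into the
// current tick cycle. requireMinAge models the new rule: the want must
// have been in sent status for at least one full interval.
func timeUntilRebroadcast(offset int, requireMinAge bool) int {
	// ticks fire at interval, 2*interval, ...
	for tick := interval; ; tick += interval {
		age := tick - offset
		if !requireMinAge || age >= interval {
			return age
		}
	}
}

func main() {
	// old rule: sent 29.9s into a cycle, rebroadcast 0.1s later
	fmt.Println(timeUntilRebroadcast(299, false)) // prints 1
	// new rule: sent 0.1s into a cycle, skipped at the 30s tick
	// (only 29.9s old), rebroadcast at the 60s tick, 59.9s after send
	fmt.Println(timeUntilRebroadcast(1, true)) // prints 599
}
```

So the worst-case wait roughly doubles, but a want can never be rebroadcast while it has been outstanding for less than a full interval.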


@gammazero gammazero added need/maintainers-input Needs input from the current maintainer(s) and removed need/analysis Needs further analysis before proceeding need/author-input Needs input from the original author status/blocked Unable to be worked further until needs are met labels Nov 19, 2024
@gammazero (Contributor):

Need to test on staging before merge.
