Control memory explosion on large list of queries #102
base: master
Conversation
Overlapping fields is a bit of an artificial test case.
.gitignore
Outdated
@@ -1 +1,3 @@
 /internal/tests/testdata/graphql-js
+
+.idea/
This looks like it's specific to your developer environment. It'd probably be best to add in your global .gitignore.
Have zapped the gitignore as requested.
This seems like a very reasonable change. Thank you for including the benchmark.
Would you be able to fix up the merge conflicts? I'm happy to merge the changes once this PR is good to go.
if o {
	a = "non-overlapping"
}
b.Run(fmt.Sprintf("%d queries %s aliases", c, a), func(b *testing.B) {
Maybe we'd want to b.ResetTimer() after setup? What do you think?
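For reference, a minimal sketch of that pattern (the helper and the measured call are illustrative stand-ins, not the PR's actual benchmark code):

package validation_test

import (
	"fmt"
	"strings"
	"testing"
)

// buildQueries stands in for the benchmark's setup phase: assembling a large
// batched query string whose construction cost we don't want to measure.
func buildQueries(n int) string {
	var sb strings.Builder
	sb.WriteString("{\n")
	for i := 0; i < n; i++ {
		fmt.Fprintf(&sb, "  q%d: hero { name }\n", i)
	}
	sb.WriteString("}")
	return sb.String()
}

func BenchmarkLargeQuery(b *testing.B) {
	query := buildQueries(10000) // expensive setup, excluded from the measurement

	b.ResetTimer() // discard time spent on the setup above
	for i := 0; i < b.N; i++ {
		_ = strings.Count(query, "{") // stand-in for the validation under test
	}
}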
@@ -171,9 +171,10 @@ func validateSelectionSet(c *opContext, sels []query.Selection, t schema.NamedTy
 		validateSelection(c, sel, t)
 	}
 
+	useCache := len(sels) <= 100
Should this threshold be configurable? How did we arrive at 100?
@@ -381,16 +382,21 @@ func detectFragmentCycleSel(c *context, sel query.Selection, fragVisited map[*qu
 	}
 }
 
-func (c *context) validateOverlap(a, b query.Selection, reasons *[]string, locs *[]errors.Location) {
+func (c *context) validateOverlap(a, b query.Selection, reasons *[]string, locs *[]errors.Location, useCache bool) {
Just a side comment on the internal function signatures, which does not apply to these changes: the internal functions might have nicer APIs if we took structs instead of ever-expanding lists of arguments.
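Purely as an illustration of that idea (these names are not proposed in this PR), the accumulating state could be bundled into one struct so future knobs don't grow the parameter list again:

// overlapState is a hypothetical container for the values validateOverlap
// currently threads through as separate pointer arguments.
type overlapState struct {
	reasons   []string
	locations []errors.Location
	useCache  bool
}

func (c *context) validateOverlap(a, b query.Selection, s *overlapState) {
	// ... same pairwise comparison as today, appending to s.reasons and
	// s.locations and consulting s.useCache.
}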
I agree with @tonyghita: this is a refactoring that sooner or later has to be done and would improve the maintainability of the library. However, it will be a breaking change as well, and it will require a considerable amount of work.
Can someone merge this?
There are merge conflicts to be fixed first...
graphql_test.go
Outdated
@@ -8,6 +8,8 @@ import (
 	"github.com/neelance/graphql-go"
 	"github.com/neelance/graphql-go/example/starwars"
 	"github.com/neelance/graphql-go/gqltesting"
+	"fmt"
+	"bytes"
Imports are in the wrong group.
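For reference, the grouping goimports would produce here keeps standard-library imports in their own block ahead of third-party packages:

import (
	"bytes"
	"fmt"

	"github.com/neelance/graphql-go"
	"github.com/neelance/graphql-go/example/starwars"
	"github.com/neelance/graphql-go/gqltesting"
)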
The situation occurs quickly with large request sizes. I've sent a ~100KB query string and it rapidly leaked to more than 1GB of memory.
I'd be willing to put together a PR with the merge conflicts fixed here, as well as the benchmark fix. If we did this, how would we want to approach the threshold? Make it configurable? Or should we just allow 100 to be the default?
Thank you, @svanburen! I would prefer this to be opt-in, with the threshold configurable.
I've been playing around with this for a bit and have it mostly working. The only issue is that, given the current implementation, if we make the threshold configurable and it's below a certain value, the validation tests will panic (a value of 1 or 2 seems to trigger this). I can put up a PR with the failing tests, but any initial thoughts here? I'm wondering if we should ignore the configuration for the time being and just get in a merge-able PR?
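One hypothetical shape for that opt-in knob, sketched as a functional option; none of these names exist in the library, and the panic at very small values would still need a guard such as the illustrative lower bound below:

// Hypothetical sketch only: neither schemaOptions nor OverlapCacheLimit is
// part of graphql-go's API.
type schemaOptions struct {
	overlapCacheLimit int // 0 = always use the cache (today's behaviour)
}

type SchemaOpt func(*schemaOptions)

// OverlapCacheLimit opts in to skipping the overlap-validation cache once a
// selection set grows beyond n fields, trading CPU for bounded memory.
func OverlapCacheLimit(n int) SchemaOpt {
	return func(o *schemaOptions) {
		if n < 3 { // illustrative lower bound to sidestep the panic seen at 1 or 2
			n = 3
		}
		o.overlapCacheLimit = n
	}
}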
# Conflicts:
#	graphql_test.go
Fixed
This reverts commit 00e76d5.
mute scanner errors
In our use case (server to server), we were making batch queries to our Go server and ran into a problem when a single request started to exceed about 100 queries. We needed to pack a bunch of queries together because we can parallelise them very effectively and want to hide the request latencies.
Performance would collapse and resident memory would spike to tens of GB in production, where the requests had ~3000 queries. I added a benchmark BenchmarkFragmentQueries to graphql_test.go, and with a little profiling it looked like almost all of this was due to the validation. With an early return on the validation, the time and memory overheads would vanish.

https://github.com/neelance/graphql-go/blob/master/internal/validation/validation.go#L174-L178 causes a quadratic explosion of memory due to https://github.com/neelance/graphql-go/blob/master/internal/validation/validation.go#L389-L393: the keys start taking up a huge amount of space, and the hashmap implementation struggles while it expands buckets internally to cope. This memory also does not get released when the map itself is garbage-collected. To control this, this PR stops using the map when the selection length is greater than 100, which costs more computation but is faster than fighting the map. I also made a few other tweaks to improve performance, and it now works fine for our use case, with MBs instead of GBs of RSS and sub-second overheads.
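To make the trade-off concrete, here is a rough sketch of the pattern (identifiers are approximate; see the linked validation.go lines for the real code): below the threshold, pairwise results are memoised in a map keyed by selection pairs; above it, they are simply recomputed.

// Sketch only; field and type names approximate the real validation code.
// overlapValidated is assumed to be a map[selectionPair]struct{} on *context.
type selectionPair struct{ a, b query.Selection }

func (c *context) alreadyValidated(a, b query.Selection, useCache bool) bool {
	if !useCache {
		// Above the threshold: skip the map entirely. More repeated work,
		// but no quadratic growth in map keys for huge batched queries.
		return false
	}
	if _, ok := c.overlapValidated[selectionPair{a, b}]; ok {
		return true // this pair was already checked
	}
	c.overlapValidated[selectionPair{a, b}] = struct{}{}
	c.overlapValidated[selectionPair{b, a}] = struct{}{}
	return false
}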
The numbers get much worse if field names overlap, due to validateFieldOverlap not terminating early, but I skipped fixes/investigation there as that does not look like a common issue in the real world. However, if you try it, the current master will exceed the 10-minute limit for a benchmark test, so it likely needs some guards to prevent DoS issues with public APIs. I believe this can all be a bit smarter, as the queries have the same repeating fragments, but I am new to this codebase.

Performance tests
The headline for us was reducing 65s (of mainly validation) down to 1s for 10,000 queries in the request.