-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support multiple keys in bigdiffy #138
Support multiple keys in bigdiffy #138
Conversation
401dab3
to
5669ac1
Compare
@tailrec | ||
def get(xs: Array[String], i: Int, r: GenericRecord): String = | ||
if (i == xs.length - 1) { | ||
r.get(xs(i)).toString | ||
} else { | ||
get(xs, i + 1, r.get(xs(i)).asInstanceOf[GenericRecord]) | ||
} | ||
val xs = key.split('.') | ||
(r: GenericRecord) => get(xs, 0, r) | ||
(r: GenericRecord) => keys.map { k => get(k.split('.'), 0, r) }.mkString("_") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, what if we change this to a Set[String]
instead? My concern is this can produce some weird edge cases. They will probably be rare but it's not really transparent to end user.
e.g.
key1_something, key2
vs key1, something_key2
produces the same aggregate key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only suggesting Set
since we use that in sampler but open to other ideas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it be more confusing is the results don't respect the user ordering of the keys?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe some dataset that preserves order instead. Up to you. Doesn't block this PR imo but would be nice to have fixed :)
Codecov Report
@@ Coverage Diff @@
## master #138 +/- ##
=========================================
Coverage ? 70.93%
=========================================
Files ? 37
Lines ? 1421
Branches ? 170
=========================================
Hits ? 1008
Misses ? 413
Partials ? 0
Continue to review full report at Codecov.
|
Can you also add a test that runs with multiple keys? |
Sorry, was travelling for a bit. Can you resolve conflicts? and then we can try to get this merged |
c0752b6
to
875e038
Compare
@idreeskhan PTAL, rebased master and added multikey case class |
LGTM, if you have a chance could you do some quick and dirty benchmarking to see the performance impact of boxing? If not we can probably just create an open issue for it and revisit if it crops up. |
@idreeskhan I created #141 |
👍 |
Addresses #137.
There were two options that I thought about:
*KeyFn
functions.Decided to go with (1) since we will need to join the keys anyways to write the results.