-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ESQL Add esql hash function #117989
ESQL Add esql hash function #117989
Conversation
Documentation preview: |
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Hi @idegtiarenko, I've created a changelog YAML for you. |
# Conflicts: # x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/esql/60_usage.yml
/** | ||
* Hash function | ||
*/ | ||
HASH_FUNCTION(Build.current().isSnapshot()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we ok just releasing this as is without keeping it on a snapshot build first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not entirely familiar with the release procedure re esql functions.
This one looks fairly simple so maybe? But I am looking for input from others on this question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally it's fine to just release it to the wild. If, for example, you were uncomfortable with the syntax or list of supported hashes. I scanned this PR this morning and saw you talking about the list of hashes being jvm dependent which isn't a surprise but is kind of annoying. So if you are worried about it then, yeah, we can keep this a snapshot. But if not, 🤷.
The real thing is to ask yourself, "what's left before I remove this?" If the list doesn't fit in one PR, then, yeah. This or a feature flag is the way to go. Frankly I think a feature flag is better, but it's not a big deal either way.
|
||
new FunctionDefinition[] { def(Match.class, Match::new, "match"), def(QueryString.class, QueryString::new, "qstr") }, | ||
// hash | ||
new FunctionDefinition[] { def(Hash.class, Hash::new, "hash") } }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's just a regular old string function. I think.
I'd just stick it next to LTRIM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add also short hands for common algorithms as noted in #98545 - SHA0, SHA1 plus SHAKE128.
For that you'd simply extend the main hash function and use a built in algorithm:
public Md5 extends Hash {
public Md5(Expression input) {
super("md5", input);
}
}
It's a small thing that we'll be appreciated by the security folks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. If nobody minds I would like to add those aliases in a separate followup pr to keep this one smaller.
Re supported algorithms, I believe the set of supported ones could be different for different JVMs.
In openjdk javadocs I am seeing:
* <p> Every implementation of the Java platform is required to support
* the following standard {@code MessageDigest} algorithms:
* <ul>
* <li>{@code SHA-1}</li>
* <li>{@code SHA-256}</li>
In https://docs.oracle.com/javase/7/docs/api/java/security/MessageDigest.html this list is following:
Every implementation of the Java platform is required to support the following standard MessageDigest algorithms:
* MD5
* SHA-1
* SHA-256
I have checked list of supported algorithms with Security.getAlgorithms("MessageDigest")
in various JVMs and was getting the same [SHA3-512, SHA-1, SHA-384, SHA3-384, SHA-224, SHA-512/256, SHA-256, MD2, SHA-512/224, SHA3-256, SHA-512, SHA3-224, MD5]
set in
- 17.0.2-open
- 21.0.2-open
- 23-open
- 23.0.1-oracle
- 17-jbr
- eclipse adoptium 17
So we could easily add aliases for MD5 and SHA, but SHAKE would require a some custom implementation.
var result = alg.digest(); | ||
return new BytesRef(HexFormat.of().formatHex(result)); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this wants a resolveType
method that forces the parameters to be strings. Without that folks can do something like
ROW a = HASH("md5", 12)
and we'll end up with runtime exceptions casting stuff to BytesRefBlock
. The compute engine wants an external system - like the language itself - to make sure to only build trees of operators that make valid types. And resolveTypes
is the trick that does it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also things like:
ROW a = HASH("md5", "192.168.0.1"::IP)
will work. But it'll hash the 128 bit representation we use for IPs. That might even be what people want, but it's maybe unexpected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify the comment above : "192.168.0.1" as a string gets converted into an IP. Which then gets automatically converted into a byteref; please check whether it's the original "192.168.0.1" or something else.
|
||
@Override | ||
public EvalOperator.ExpressionEvaluator.Factory toEvaluator(ToEvaluator toEvaluator) { | ||
if (alg.foldable() && alg.dataType() == DataType.KEYWORD) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check for KEYWORD
type should be redundant with a resolveType
method that checks for isString
or something.
...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/hash/Hash.java
Outdated
Show resolved
Hide resolved
for (String alg : List.of("MD5", "SHA", "SHA-224", "SHA-256", "SHA-384", "SHA-512")) { | ||
cases.addAll(createTestCases(alg)); | ||
} | ||
return parameterSuppliersFromTypedData(cases); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This almost certainly wants the WithDefaultChecks
flavor - that'll add tests that make sure the types you haven't mentioned aren't accidentally supported by the function.
private static BytesRef hash(MessageDigest alg, BytesRef input) { | ||
alg.update(input.bytes, input.offset, input.length); | ||
var result = alg.digest(); | ||
return new BytesRef(HexFormat.of().formatHex(result)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we have three or four hand rolled hex encoders. I might add a flavor that encodes directly into a BytesRef
into MessageDigests
. Something like
static BytesRef process(@Fixed(includeInToString = false) BreakingBytesRefBuilder scratch, BytesRef alg, BytesRef input) throws NoSuchAlgorithmException {
...
alg.update(input.bytes, input.offset, input.length);
byte[] digest = alg.digest();
scratch.clear();
scratch.grow(digest.length * 2);
MessageDigests.toHexUtf8Bytes(scratch.bytes(), digest);
return scratch.bytesRefView();
}
Something like that. It'd skip a copy or two. And it'd allow us to reuse the allocations from last time.
@@ -92,7 +92,7 @@ setup: | |||
- gt: {esql.functions.to_long: $functions_to_long} | |||
- match: {esql.functions.coalesce: $functions_coalesce} | |||
# Testing for the entire function set isn't feasbile, so we just check that we return the correct count as an approximation. | |||
- length: {esql.functions: 127} # check the "sister" test below for a likely update to the same esql.functions length check | |||
- length: {esql.functions: 128} # check the "sister" test below for a likely update to the same esql.functions length check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we'd blasted this thing....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We didn't (we should IMHO) and it keeps failing.
@idegtiarenko please also update the counter at line 166, otherwise it will fail in release tests, and you won't realize it until it's merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit surprised about the other one.
Currently it is passing in the pr, meaning the count is expected (for snapshot build).
Is there a way to match against different count in release and non release builds?
Or may be this is an argument against (Build.current().isSnapshot())
? (see #117989 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR is green because the second one is just not running. You can make it run adding test-release
label to the PR, but the CI will take much more to run.
(Build.current().isSnapshot())
in capabilities only impacts on tests; if you want to make the function as snapshot-only you'll have to register it as a snapshot function in EsqlFunctionRegistry rather than a normal one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@idegtiarenko those two tests are executed depending on snapshot nature of the build. One test runs in snapshot-only CI runs, the other one runs in release-only CI runs (using test-release
label on the PR). The different nature of those two tests is visible in the capabilities settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks all!
public void writeTo(StreamOutput out) throws IOException { | ||
source().writeTo(out); | ||
out.writeNamedWriteable(alg); | ||
out.writeNamedWriteable(input); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be needed for the majority (if not all) functions - raised #118037
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will take a look and open a followup pr.
It also sounds like something similar might be possible for most of the resolveType()
implementations. Happy to discuss it separately.
|
||
@Evaluator(warnExceptions = NoSuchAlgorithmException.class) | ||
static BytesRef process(BytesRef alg, BytesRef input) throws NoSuchAlgorithmException { | ||
return hash(MessageDigest.getInstance(alg.utf8ToString()), input); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validate the algorithm at planning time. Use Validatable
interface to verify the algorithm expression can be folded to a string and check the algorithm exist:
... implements Validatable {
public void validate(Failures failures) {
try { MessageDigest.getInstance(alg)
} catch (Exception ex) {
failures.add("Invalid hashing algorithm [{}}", alg);
}
}
}
By doing the validation here, you can save the algorithm directly as a string instead as an expression.
Separately, you can validate the well known algorithms and use org.es.common.MessageDigests
instead which has MD5 and SHA_1 variants.
Update: My comment is similar to Nik's
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly that is executed during the planning.
As a result we could only have constant alg
inputs and will be unable to evaluate hash with algorithm read from the field or a variable. I wonder if we want to be able to dynamically resolve alg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is a constant then I think we should do what Costin says. I think the Validatable interface brings the failure much further forwards at planning time which is good. And it's a standard we'd like to stick to more closely, which is good. But for non-constant algorithms I'm fine supporting non-constant algorithms just as you have it.
var result = alg.digest(); | ||
return new BytesRef(HexFormat.of().formatHex(result)); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify the comment above : "192.168.0.1" as a string gets converted into an IP. Which then gets automatically converted into a byteref; please check whether it's the original "192.168.0.1" or something else.
|
||
new FunctionDefinition[] { def(Match.class, Match::new, "match"), def(QueryString.class, QueryString::new, "qstr") }, | ||
// hash | ||
new FunctionDefinition[] { def(Hash.class, Hash::new, "hash") } }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add also short hands for common algorithms as noted in #98545 - SHA0, SHA1 plus SHAKE128.
For that you'd simply extend the main hash function and use a built in algorithm:
public Md5 extends Hash {
public Md5(Expression input) {
super("md5", input);
}
}
It's a small thing that we'll be appreciated by the security folks
27a9042
to
3abf208
Compare
return new HashConstantEvaluator.Factory( | ||
source(), | ||
context -> new BreakingBytesRefBuilder(context.breaker(), "hash"), | ||
context -> md, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be MessageDigest.getInstance(algorithm)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Safe for maybe some more tests, it LGTM.
|
||
@FunctionInfo( | ||
returnType = "keyword", | ||
description = "Computes the hash of the input using various algorithms such as MD5, SHA, SHA-224, SHA-256, SHA-384, SHA-512." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd maybe add that their availability varies with the JVM and its configuration?
Wondering if we'll ever want to export this functionality through a SHOW (-equivalent) command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so long as we have a list of the ones that are guaranteed to be supported we're good. If we need to support more and it's only on some JVMs we can figure out the wording.
# Conflicts: # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/EsqlFunctionRegistry.java
|
||
private static TestCaseSupplier createTestCase(String algorithm, DataType algorithmType, DataType inputType) { | ||
return new TestCaseSupplier(algorithm, List.of(algorithmType, inputType), () -> { | ||
var input = randomAlphaOfLength(10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's randomize the length of this one. Like @bpintea mentioned, it'd be nice to get empty string sometimes. Maybe:
rarely() ? "" : randomRealisticUnicodeOfLengthBetween(1, 1000)
It'll probably work fine, but more paranoia seems good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to call TestCaseSupplier.stringCases
which has zoo of these things.
|
||
private static TestCaseSupplier createLiteralTestCase(String algorithm, DataType algorithmType, DataType inputType) { | ||
return new TestCaseSupplier(algorithm, List.of(algorithmType, inputType), () -> { | ||
var input = randomAlphaOfLength(10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same randomization here too I think.
return new Hash(source, args.get(0), args.get(1)); | ||
} | ||
|
||
public void testInvalidAlgorithmLiteral() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have a test case for invalid algorithm
already. Do you want to keep this one? I might move it to another class in that case - a HashExtraTests
or something. That way you don't run it 1233242413414 times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe they are covering different branches. One added via cases.add
is covering non-constant evaluator.
This one here is foldable, as a result it throws from org.elasticsearch.xpack.esql.expression.function.scalar.string.Hash#toEvaluator
as input is not a valid algotythm.
Such exception is not caught and not verified by withWarning
nor withFoldingException
.
Please let me know if I am missing something or if exception should not be thrown there.
@@ -63,6 +64,7 @@ public static List<NamedWriteableRegistry.Entry> getNamedWriteables() { | |||
entries.add(Concat.ENTRY); | |||
entries.add(E.ENTRY); | |||
entries.add(EndsWith.ENTRY); | |||
entries.add(Hash.ENTRY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alphabetical order, please.
💚 Backport successful
|
This change introduces esql hash(alg, input) function that relies on the Java MessageDigest to compute the hash.
This change introduces esql hash(alg, input) function that relies on the Java MessageDigest to compute the hash.
This change introduces esql
hash(alg, input)
function that relies on the Java MessageDigest to compute the hash.I will also add
md5(input)
,sha(input)
and may be several other functions in a followup pr once this is merged