ESQL Add esql hash function #117989

Merged
merged 44 commits into from
Dec 18, 2024

Conversation

idegtiarenko
Contributor

@idegtiarenko idegtiarenko commented Dec 4, 2024

This change introduces the ES|QL hash(alg, input) function, which relies on the Java MessageDigest to compute the hash.

I will also add md5(input), sha(input) and maybe several other functions in a follow-up PR once this is merged.
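
For readers unfamiliar with MessageDigest, the core of what the description refers to can be sketched in plain Java. This is a minimal stand-alone illustration, not the class added in this PR: the `HashSketch` name and the choice to rethrow an unchecked exception are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Hypothetical stand-alone helper; not the Hash class from this PR.
final class HashSketch {
    /** Returns the lowercase hex digest of the UTF-8 bytes of {@code input}. */
    static String hash(String algorithm, String input) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest); // requires Java 17+
        } catch (NoSuchAlgorithmException e) {
            // Assumption for the sketch: rethrow unchecked instead of emitting a warning.
            throw new IllegalArgumentException("Unknown hash algorithm [" + algorithm + "]", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hash("MD5", "abc"));     // 900150983cd24fb0d6963f7de1cf2e72
        System.out.println(hash("SHA-256", "abc"));
    }
}
```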

@idegtiarenko idegtiarenko added >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL v9.0.0 labels Dec 4, 2024
@idegtiarenko idegtiarenko requested a review from nik9000 December 4, 2024 13:40
@idegtiarenko idegtiarenko changed the title Add esql hash function [ES|QL] Add esql hash function Dec 4, 2024
Contributor

github-actions bot commented Dec 4, 2024

Documentation preview:

@idegtiarenko idegtiarenko changed the title [ES|QL] Add esql hash function ESQL Add esql hash function Dec 4, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Collaborator

Hi @idegtiarenko, I've created a changelog YAML for you.

idegtiarenko and others added 4 commits December 4, 2024 14:41
# Conflicts:
#	x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/esql/60_usage.yml
@idegtiarenko idegtiarenko requested a review from bpintea December 4, 2024 17:32
/**
* Hash function
*/
HASH_FUNCTION(Build.current().isSnapshot()),
Member

Are we ok just releasing this as is without keeping it on a snapshot build first?

Contributor Author

I am not entirely familiar with the release procedure re esql functions.
This one looks fairly simple so maybe? But I am looking for input from others on this question.

Member

Generally it's fine to just release it to the wild. The exception would be if, for example, you were uncomfortable with the syntax or the list of supported hashes. I scanned this PR this morning and saw you talking about the list of hashes being JVM-dependent, which isn't a surprise but is kind of annoying. So if you are worried about it then, yeah, we can keep this a snapshot. But if not, 🤷.

The real thing is to ask yourself, "what's left before I remove this?" If the list doesn't fit in one PR, then, yeah. This or a feature flag is the way to go. Frankly I think a feature flag is better, but it's not a big deal either way.


new FunctionDefinition[] { def(Match.class, Match::new, "match"), def(QueryString.class, QueryString::new, "qstr") },
// hash
new FunctionDefinition[] { def(Hash.class, Hash::new, "hash") } };
Member

I think it's just a regular old string function. I think.

I'd just stick it next to LTRIM

Member

@costin costin Dec 5, 2024

Also add shorthands for common algorithms as noted in #98545 - SHA0, SHA1 plus SHAKE128.
For that you'd simply extend the main hash function and use a built-in algorithm:

public class Md5 extends Hash {
    public Md5(Expression input) {
        super("md5", input);
    }
}

It's a small thing that will be appreciated by the security folks.

Contributor Author

Makes sense. If nobody minds I would like to add those aliases in a separate followup pr to keep this one smaller.

Re supported algorithms, I believe the set of supported ones could be different for different JVMs.
In openjdk javadocs I am seeing:

 * <p> Every implementation of the Java platform is required to support
 * the following standard {@code MessageDigest} algorithms:
 * <ul>
 * <li>{@code SHA-1}</li>
 * <li>{@code SHA-256}</li>

In https://docs.oracle.com/javase/7/docs/api/java/security/MessageDigest.html the list is as follows:

Every implementation of the Java platform is required to support the following standard MessageDigest algorithms:
* MD5
* SHA-1
* SHA-256

I checked the list of supported algorithms with Security.getAlgorithms("MessageDigest") in various JVMs and got the same set, [SHA3-512, SHA-1, SHA-384, SHA3-384, SHA-224, SHA-512/256, SHA-256, MD2, SHA-512/224, SHA3-256, SHA-512, SHA3-224, MD5], in:

  • 17.0.2-open
  • 21.0.2-open
  • 23-open
  • 23.0.1-oracle
  • 17-jbr
  • eclipse adoptium 17

So we could easily add aliases for MD5 and SHA, but SHAKE would require some custom implementation.
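
The check described above can be reproduced on any JVM with a few lines. This is a small sketch; the `ListDigests` class name is made up for the example, and the output depends on the installed security providers.

```java
import java.security.Security;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for inspecting which digests the running JVM offers.
final class ListDigests {
    /** Names of the MessageDigest algorithms offered by the installed providers, sorted. */
    static Set<String> available() {
        return new TreeSet<>(Security.getAlgorithms("MessageDigest"));
    }

    public static void main(String[] args) {
        // The printed set varies with the JVM and its security configuration.
        System.out.println(available());
    }
}
```

Only the algorithms that the MessageDigest javadoc lists as required are guaranteed to be present; everything else in the printed set is provider-specific.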

var result = alg.digest();
return new BytesRef(HexFormat.of().formatHex(result));
}

Member

I think this wants a resolveType method that forces the parameters to be strings. Without that folks can do something like

ROW a = HASH("md5", 12)

and we'll end up with runtime exceptions casting stuff to BytesRefBlock. The compute engine wants an external system - like the language itself - to make sure to only build trees of operators that make valid types. And resolveType is the trick that does it.

Member

Also things like:

ROW a = HASH("md5", "192.168.0.1"::IP)

will work. But it'll hash the 128 bit representation we use for IPs. That might even be what people want, but it's maybe unexpected.

Member

To clarify the comment above: "192.168.0.1" as a string gets converted into an IP, which then gets automatically converted into a BytesRef; please check whether it's the original "192.168.0.1" or something else.
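
Why this matters can be shown outside ES|QL: hashing the text of an address and hashing a 128-bit encoding of it give unrelated digests. The sketch below assumes the 128-bit form is the IPv4-mapped IPv6 layout (ten zero bytes, then 0xff 0xff, then the four octets); that layout, the `IpHashSketch` name and the use of SHA-256 are assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Hypothetical illustration; not code from this PR.
final class IpHashSketch {
    static String sha256Hex(byte[] bytes) {
        try {
            return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(bytes));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        }
    }

    /** Assumed layout: 16-byte IPv4-mapped IPv6 form (::ffff:a.b.c.d). */
    static byte[] ipv4Mapped(int a, int b, int c, int d) {
        byte[] out = new byte[16]; // bytes 0..9 stay zero
        out[10] = (byte) 0xff;
        out[11] = (byte) 0xff;
        out[12] = (byte) a;
        out[13] = (byte) b;
        out[14] = (byte) c;
        out[15] = (byte) d;
        return out;
    }

    public static void main(String[] args) {
        String asText = sha256Hex("192.168.0.1".getBytes(StandardCharsets.UTF_8));
        String asIp = sha256Hex(ipv4Mapped(192, 168, 0, 1));
        System.out.println("text bytes: " + asText);
        System.out.println("ip bytes:   " + asIp);
        System.out.println("equal? " + asText.equals(asIp));
    }
}
```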


@Override
public EvalOperator.ExpressionEvaluator.Factory toEvaluator(ToEvaluator toEvaluator) {
if (alg.foldable() && alg.dataType() == DataType.KEYWORD) {
Member

This check for KEYWORD type should be redundant with a resolveType method that checks for isString or something.

for (String alg : List.of("MD5", "SHA", "SHA-224", "SHA-256", "SHA-384", "SHA-512")) {
cases.addAll(createTestCases(alg));
}
return parameterSuppliersFromTypedData(cases);
Member

This almost certainly wants the WithDefaultChecks flavor - that'll add tests that make sure the types you haven't mentioned aren't accidentally supported by the function.

private static BytesRef hash(MessageDigest alg, BytesRef input) {
alg.update(input.bytes, input.offset, input.length);
var result = alg.digest();
return new BytesRef(HexFormat.of().formatHex(result));
Member

I feel like we have three or four hand rolled hex encoders. I might add a flavor that encodes directly into a BytesRef into MessageDigests. Something like

    static BytesRef process(@Fixed(includeInToString = false) BreakingBytesRefBuilder scratch, BytesRef alg, BytesRef input) throws NoSuchAlgorithmException {
        ...
        alg.update(input.bytes, input.offset, input.length);
        byte[] digest = alg.digest();
        scratch.clear();
        scratch.grow(digest.length * 2);
        MessageDigests.toHexUtf8Bytes(scratch.bytes(), digest);
        return scratch.bytesRefView();
    }

Something like that. It'd skip a copy or two. And it'd allow us to reuse the allocations from last time.
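
The allocation-saving idea can be illustrated without the Elasticsearch classes. The following sketch writes the hex characters straight into a byte buffer that is kept between calls; `HexScratch` and its methods are hypothetical names, and a plain `byte[]` stands in for the BreakingBytesRefBuilder mentioned above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of hex-encoding into a reusable scratch buffer.
final class HexScratch {
    private static final byte[] HEX = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
    private byte[] scratch = new byte[0];

    /** Writes the hex form of {@code digest} into the reused buffer and returns the number of valid bytes. */
    int encode(byte[] digest) {
        int needed = digest.length * 2;
        if (scratch.length < needed) {
            scratch = new byte[needed]; // grow only when the previous buffer is too small
        }
        for (int i = 0; i < digest.length; i++) {
            int v = digest[i] & 0xff;
            scratch[2 * i] = HEX[v >>> 4];
            scratch[2 * i + 1] = HEX[v & 0x0f];
        }
        return needed;
    }

    /** The backing buffer; only the first {@code encode(...)} result bytes are meaningful. */
    byte[] buffer() {
        return scratch;
    }

    public static void main(String[] args) {
        HexScratch hs = new HexScratch();
        int n = hs.encode(new byte[] { 0x00, 0x0f, (byte) 0xab, (byte) 0xff });
        System.out.println(new String(hs.buffer(), 0, n, StandardCharsets.US_ASCII)); // 000fabff
    }
}
```

Compared with building a String and wrapping it again, this avoids the intermediate String and reuses the buffer across rows once it has grown to the digest size.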

@@ -92,7 +92,7 @@ setup:
   - gt: {esql.functions.to_long: $functions_to_long}
   - match: {esql.functions.coalesce: $functions_coalesce}
   # Testing for the entire function set isn't feasbile, so we just check that we return the correct count as an approximation.
-  - length: {esql.functions: 127} # check the "sister" test below for a likely update to the same esql.functions length check
+  - length: {esql.functions: 128} # check the "sister" test below for a likely update to the same esql.functions length check
Member

I thought we'd blasted this thing....

Contributor

We didn't (we should IMHO) and it keeps failing.
@idegtiarenko please also update the counter at line 166, otherwise it will fail in release tests, and you won't realize it until it's merged.

Contributor Author

I am a bit surprised about the other one.
Currently it is passing in the pr, meaning the count is expected (for snapshot build).
Is there a way to match against different count in release and non release builds?
Or maybe this is an argument against (Build.current().isSnapshot())? (see #117989 (comment))

Contributor

@luigidellaquila luigidellaquila Dec 5, 2024

The PR is green because the second one is just not running. You can make it run by adding the test-release label to the PR, but the CI will take much longer to run.

(Build.current().isSnapshot()) in capabilities only affects tests; if you want to make the function snapshot-only you'll have to register it as a snapshot function in EsqlFunctionRegistry rather than a normal one.

Contributor

@idegtiarenko those two tests are executed depending on snapshot nature of the build. One test runs in snapshot-only CI runs, the other one runs in release-only CI runs (using test-release label on the PR). The different nature of those two tests is visible in the capabilities settings.

Contributor Author

Thanks all!

Comment on lines 55 to 59
public void writeTo(StreamOutput out) throws IOException {
source().writeTo(out);
out.writeNamedWriteable(alg);
out.writeNamedWriteable(input);
}
Member

This shouldn't be needed for the majority (if not all) functions - raised #118037

Contributor Author

Will take a look and open a followup pr.
It also sounds like something similar might be possible for most of the resolveType() implementations. Happy to discuss it separately.


@Evaluator(warnExceptions = NoSuchAlgorithmException.class)
static BytesRef process(BytesRef alg, BytesRef input) throws NoSuchAlgorithmException {
return hash(MessageDigest.getInstance(alg.utf8ToString()), input);
Member

Validate the algorithm at planning time. Use the Validatable interface to verify that the algorithm expression can be folded to a string and check that the algorithm exists:

... implements Validatable {
    public void validate(Failures failures) {
        try {
            MessageDigest.getInstance(alg);
        } catch (Exception ex) {
            failures.add("Invalid hashing algorithm [{}]", alg);
        }
    }
}

By doing the validation here, you can save the algorithm directly as a string instead of as an expression.
Separately, you can validate the well known algorithms and use org.es.common.MessageDigests instead which has MD5 and SHA_1 variants.

Update: My comment is similar to Nik's

Contributor Author

If I understand correctly, that is executed during planning.
As a result, we could only have constant alg inputs and would be unable to evaluate hash with an algorithm read from a field or a variable. I wonder if we want to be able to dynamically resolve alg.

Member

If it is a constant then I think we should do what Costin says. I think the Validatable interface brings the failure much further forward, to planning time, which is good. And it's a standard we'd like to stick to more closely, which is also good. But I'm fine supporting non-constant algorithms just as you have it.
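
The split agreed on in this thread (fail early for a constant algorithm, defer to runtime otherwise) can be sketched in plain Java. Everything here is hypothetical: `AlgorithmValidation` is not a class from the PR, a `List<String>` stands in for the Failures collector, and `null` is used to mean "the algorithm is not a constant".

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of planning-time validation for a constant algorithm name.
final class AlgorithmValidation {
    /**
     * Returns the failures found for the given algorithm name.
     * A null name models a non-constant algorithm, which can only be checked per row at runtime.
     */
    static List<String> validate(String constantAlgorithm) {
        List<String> failures = new ArrayList<>();
        if (constantAlgorithm == null) {
            return failures; // nothing to verify before execution
        }
        try {
            MessageDigest.getInstance(constantAlgorithm);
        } catch (NoSuchAlgorithmException e) {
            failures.add("Invalid hashing algorithm [" + constantAlgorithm + "]");
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println(validate("SHA-256"));    // no failures
        System.out.println(validate("not-a-hash")); // one failure, reported before any row is processed
        System.out.println(validate(null));         // deferred to runtime
    }
}
```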

@idegtiarenko idegtiarenko requested a review from astefan December 11, 2024 15:03
@astefan astefan added auto-backport Automatically create backport pull requests when merged v8.18.0 labels Dec 12, 2024
return new HashConstantEvaluator.Factory(
source(),
context -> new BreakingBytesRefBuilder(context.breaker(), "hash"),
context -> md,
Member

Shouldn't this be MessageDigest.getInstance(algorithm)?
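
One likely reading of this question is that a MessageDigest carries internal state, so a factory that hands the same instance to every evaluator would let them interfere with each other. The sketch below shows a factory that validates the name once and then creates a fresh instance per call; `DigestFactorySketch` is a hypothetical name, and a `Supplier` stands in for the `context -> ...` lambda.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.function.Supplier;

// Hypothetical sketch: one MessageDigest per consumer instead of a shared instance.
final class DigestFactorySketch {
    static Supplier<MessageDigest> factory(String algorithm) {
        try {
            MessageDigest.getInstance(algorithm); // validate once, up front
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown hash algorithm [" + algorithm + "]", e);
        }
        return () -> {
            try {
                return MessageDigest.getInstance(algorithm); // fresh, independent state per call
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException(e); // not expected after the check above
            }
        };
    }

    public static void main(String[] args) {
        Supplier<MessageDigest> f = factory("SHA-256");
        MessageDigest a = f.get();
        MessageDigest b = f.get();
        a.update(new byte[] { 1, 2, 3 }); // pending input in a does not leak into b
        System.out.println(HexFormat.of().formatHex(b.digest("abc".getBytes(StandardCharsets.UTF_8))));
    }
}
```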

Contributor

@bpintea bpintea left a comment

Save for maybe some more tests, it LGTM.


@FunctionInfo(
returnType = "keyword",
description = "Computes the hash of the input using various algorithms such as MD5, SHA, SHA-224, SHA-256, SHA-384, SHA-512."
Contributor

I'd maybe add that their availability varies with the JVM and its configuration?
Wondering if we'll ever want to export this functionality through a SHOW (-equivalent) command.

Member

I think so long as we have a list of the ones that are guaranteed to be supported we're good. If we need to support more and it's only on some JVMs we can figure out the wording.

# Conflicts:
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/EsqlFunctionRegistry.java
@idegtiarenko idegtiarenko requested a review from nik9000 December 13, 2024 13:38

private static TestCaseSupplier createTestCase(String algorithm, DataType algorithmType, DataType inputType) {
return new TestCaseSupplier(algorithm, List.of(algorithmType, inputType), () -> {
var input = randomAlphaOfLength(10);
Member

Let's randomize the length of this one. Like @bpintea mentioned, it'd be nice to get empty string sometimes. Maybe:

rarely() ? "" : randomRealisticUnicodeOfLengthBetween(1, 1000)

It'll probably work fine, but more paranoia seems good.

Member

Another option is to call TestCaseSupplier.stringCases, which has a zoo of these things.
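
The randomization suggested in this thread relies on test-framework helpers. For readers without that framework at hand, the same idea (occasionally empty, otherwise a random-length Unicode string) can be approximated with java.util.Random. The `RandomInputs` class, the one-in-ten rate for empty strings and the code-point selection are assumptions of this sketch, not the behaviour of the helpers named above.

```java
import java.util.Random;

// Hypothetical stand-in for the framework helpers mentioned in the review.
final class RandomInputs {
    /** Returns "" roughly one time in ten, otherwise a string of 1..maxLength random code points. */
    static String randomInput(Random random, int maxLength) {
        if (random.nextInt(10) == 0) {
            return ""; // make sure the empty input is exercised now and then
        }
        int length = 1 + random.nextInt(maxLength);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            int cp;
            do {
                cp = random.nextInt(Character.MAX_CODE_POINT + 1);
            } while (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE); // skip lone surrogates
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Random random = new Random();
        for (int i = 0; i < 5; i++) {
            String s = randomInput(random, 1000);
            System.out.println(s.codePointCount(0, s.length()) + " code points");
        }
    }
}
```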


private static TestCaseSupplier createLiteralTestCase(String algorithm, DataType algorithmType, DataType inputType) {
return new TestCaseSupplier(algorithm, List.of(algorithmType, inputType), () -> {
var input = randomAlphaOfLength(10);
Member

Same randomization here too I think.

return new Hash(source, args.get(0), args.get(1));
}

public void testInvalidAlgorithmLiteral() {
Member

You have a test case for invalid algorithm already. Do you want to keep this one? I might move it to another class in that case - a HashExtraTests or something. That way you don't run it 1233242413414 times.

Contributor Author

I believe they cover different branches. The one added via cases.add covers the non-constant evaluator.
This one is foldable, so it throws from org.elasticsearch.xpack.esql.expression.function.scalar.string.Hash#toEvaluator because the input is not a valid algorithm.
Such an exception is neither caught nor verified by withWarning or withFoldingException.

Please let me know if I am missing something or if exception should not be thrown there.

@@ -63,6 +64,7 @@ public static List<NamedWriteableRegistry.Entry> getNamedWriteables() {
entries.add(Concat.ENTRY);
entries.add(E.ENTRY);
entries.add(EndsWith.ENTRY);
entries.add(Hash.ENTRY);
Contributor

Alphabetical order, please.

@idegtiarenko idegtiarenko merged commit 7cf28a9 into elastic:main Dec 18, 2024
16 checks passed
@idegtiarenko idegtiarenko deleted the add_esql_hash branch December 18, 2024 08:56
@elasticsearchmachine
Collaborator

💚 Backport successful

Status Branch Result
8.x

idegtiarenko added a commit to idegtiarenko/elasticsearch that referenced this pull request Dec 18, 2024
This change introduces esql hash(alg, input) function that relies on the Java MessageDigest to compute the hash.
elasticsearchmachine pushed a commit that referenced this pull request Dec 18, 2024
This change introduces esql hash(alg, input) function that relies on the Java MessageDigest to compute the hash.
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Dec 18, 2024
This change introduces esql hash(alg, input) function that relies on the Java MessageDigest to compute the hash.
Labels
:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.18.0 v9.0.0
7 participants