ESQL Add esql hash function #117989

idegtiarenko · 2024-12-04T13:40:29Z

This change introduces esql hash(alg, input) function that relies on the Java MessageDigest to compute the hash.

I will also add md5(input), sha(input), and maybe several other functions in a follow-up PR once this is merged.
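For readers unfamiliar with MessageDigest, here is a minimal standalone sketch of the idea: look up a digest by name, hash the UTF-8 bytes of the input, and hex-encode the result. The class and method names (HashSketch, hash) are made up for illustration and are not the code in this PR.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Standalone sketch, not the ES|QL implementation: hash a string with a named
// MessageDigest algorithm and return the digest as lowercase hex.
final class HashSketch {
    static String hash(String algorithm, String input) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            // Unknown names surface as an unchecked exception in this sketch.
            throw new IllegalArgumentException("Unknown hash algorithm [" + algorithm + "]", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hash("MD5", "abc"));
        System.out.println(hash("SHA-256", "abc"));
    }
}
```

HexFormat needs Java 17 or later, which matches the snippet quoted later in this review.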

github-actions · 2024-12-04T13:40:42Z

Documentation preview:

✨ Changed pages

elasticsearchmachine · 2024-12-04T13:41:17Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

elasticsearchmachine · 2024-12-04T13:41:18Z

Hi @idegtiarenko, I've created a changelog YAML for you.

# Conflicts: # x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/esql/60_usage.yml

nik9000 · 2024-12-04T17:25:32Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java

+        /**
+         * Hash function
+         */
+        HASH_FUNCTION(Build.current().isSnapshot()),


Are we ok just releasing this as is without keeping it on a snapshot build first?

I am not entirely familiar with the release procedure re esql functions.
This one looks fairly simple so maybe? But I am looking for input from others on this question.

Generally it's fine to just release it to the wild. The exception would be if, for example, you were uncomfortable with the syntax or the list of supported hashes. I scanned this PR this morning and saw you talking about the list of hashes being JVM dependent, which isn't a surprise but is kind of annoying. So if you are worried about it then, yeah, we can keep this a snapshot. But if not, 🤷.

The real thing is to ask yourself, "what's left before I remove this?" If the list doesn't fit in one PR, then, yeah. This or a feature flag is the way to go. Frankly I think a feature flag is better, but it's not a big deal either way.

x-pack/plugin/esql/qa/testFixtures/src/main/resources/hash.csv-spec

nik9000 · 2024-12-04T17:27:29Z

...sql/src/main/java/org/elasticsearch/xpack/esql/expression/function/EsqlFunctionRegistry.java

-
+            new FunctionDefinition[] { def(Match.class, Match::new, "match"), def(QueryString.class, QueryString::new, "qstr") },
+            // hash
+            new FunctionDefinition[] { def(Hash.class, Hash::new, "hash") } };


I think it's just a regular old string function. I think.

I'd just stick it next to LTRIM

Also add shorthands for common algorithms as noted in #98545 - SHA0, SHA1 plus SHAKE128.
For that you'd simply extend the main hash function and use a built-in algorithm:

public class Md5 extends Hash { public Md5(Expression input) { super("md5", input); } }

It's a small thing that will be appreciated by the security folks.

Makes sense. If nobody minds I would like to add those aliases in a separate followup pr to keep this one smaller.

Re supported algorithms, I believe the set of supported ones could be different for different JVMs.
In the OpenJDK Javadoc I see:

 * <p> Every implementation of the Java platform is required to support
 * the following standard {@code MessageDigest} algorithms:
 * <ul>
 * <li>{@code SHA-1}</li>
 * <li>{@code SHA-256}</li>

In https://docs.oracle.com/javase/7/docs/api/java/security/MessageDigest.html the list is as follows:

Every implementation of the Java platform is required to support the following standard MessageDigest algorithms:
* MD5
* SHA-1
* SHA-256

I checked the list of supported algorithms with Security.getAlgorithms("MessageDigest") in various JVMs and got the same set, [SHA3-512, SHA-1, SHA-384, SHA3-384, SHA-224, SHA-512/256, SHA-256, MD2, SHA-512/224, SHA3-256, SHA-512, SHA3-224, MD5], in:

17.0.2-open

21.0.2-open

23-open

23.0.1-oracle

17-jbr

eclipse adoptium 17

So we could easily add aliases for MD5 and SHA, but SHAKE would require a custom implementation.
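The check above can be reproduced on any JVM with a few lines. This is a sketch; the class name ListDigestAlgorithms is made up, and the output depends on the JVM and the security providers configured on it.

```java
import java.security.Security;
import java.util.Set;
import java.util.TreeSet;

// Sketch: print the MessageDigest algorithm names the running JVM reports.
final class ListDigestAlgorithms {
    static Set<String> available() {
        // Copy into a TreeSet so the output is sorted and easy to diff between JVMs.
        return new TreeSet<>(Security.getAlgorithms("MessageDigest"));
    }

    public static void main(String[] args) {
        System.out.println(System.getProperty("java.version") + " -> " + available());
    }
}
```

Running it under each JDK and comparing the printed sets is enough to spot any algorithm that is missing on one of them.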

nik9000 · 2024-12-04T17:33:55Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/hash/Hash.java

+        var result = alg.digest();
+        return new BytesRef(HexFormat.of().formatHex(result));
+    }
+


I think this wants a resolveType method that forces the parameters to be strings. Without that folks can do something like

ROW a = HASH("md5", 12)

and we'll end up with runtime exceptions casting stuff to BytesRefBlock. The compute engine wants an external system - like the language itself - to make sure to only build trees of operators that make valid types. And resolveType is the trick that does it.

Also things like:

ROW a = HASH("md5", "192.168.0.1"::IP)

will work. But it'll hash the 128 bit representation we use for IPs. That might even be what people want, but it's maybe unexpected.

To clarify the comment above: "192.168.0.1" as a string gets converted into an IP, which then gets automatically converted into a BytesRef. Please check whether it's the original "192.168.0.1" or something else.
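To make the concern concrete, here is a small sketch comparing the digest of the text "192.168.0.1" with the digest of a 16-byte binary form of the same address. The IPv4-mapped IPv6 layout used below is an assumption for illustration; the thread only says the internal representation is 128 bits. The names IpHashSketch, ipv4Mapped and md5Hex are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch: hashing the text of an IP and hashing a 16-byte encoding of it
// give unrelated digests, so the input type matters.
final class IpHashSketch {
    // Assumed layout: IPv4-mapped IPv6 address (::ffff:a.b.c.d), 16 bytes.
    static byte[] ipv4Mapped(int a, int b, int c, int d) {
        byte[] bytes = new byte[16];
        bytes[10] = (byte) 0xff;
        bytes[11] = (byte) 0xff;
        bytes[12] = (byte) a;
        bytes[13] = (byte) b;
        bytes[14] = (byte) c;
        bytes[15] = (byte) d;
        return bytes;
    }

    static String md5Hex(byte[] input) {
        try {
            return HexFormat.of().formatHex(MessageDigest.getInstance("MD5").digest(input));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] asText = "192.168.0.1".getBytes(StandardCharsets.UTF_8);
        byte[] asBinary = ipv4Mapped(192, 168, 0, 1);
        System.out.println("text:   " + md5Hex(asText));
        System.out.println("binary: " + md5Hex(asBinary));
    }
}
```

Whichever behaviour is chosen, a csv-spec test pinning the expected digest would make it explicit.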

nik9000 · 2024-12-04T17:34:17Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/hash/Hash.java

+
+    @Override
+    public EvalOperator.ExpressionEvaluator.Factory toEvaluator(ToEvaluator toEvaluator) {
+        if (alg.foldable() && alg.dataType() == DataType.KEYWORD) {


This check for KEYWORD type should be redundant with a resolveType method that checks for isString or something.

...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/hash/Hash.java

nik9000 · 2024-12-04T17:37:51Z

...ql/src/test/java/org/elasticsearch/xpack/esql/expression/function/scalar/hash/HashTests.java

+        for (String alg : List.of("MD5", "SHA", "SHA-224", "SHA-256", "SHA-384", "SHA-512")) {
+            cases.addAll(createTestCases(alg));
+        }
+        return parameterSuppliersFromTypedData(cases);


This almost certainly wants the WithDefaultChecks flavor - that'll add tests that make sure the types you haven't mentioned aren't accidentally supported by the function.

nik9000 · 2024-12-04T17:49:03Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/hash/Hash.java

+    private static BytesRef hash(MessageDigest alg, BytesRef input) {
+        alg.update(input.bytes, input.offset, input.length);
+        var result = alg.digest();
+        return new BytesRef(HexFormat.of().formatHex(result));


I feel like we have three or four hand-rolled hex encoders. I might add a flavor to MessageDigests that encodes directly into a BytesRef. Something like

static BytesRef process(
    @Fixed(includeInToString = false) BreakingBytesRefBuilder scratch,
    BytesRef alg,
    BytesRef input
) throws NoSuchAlgorithmException {
    ...
    alg.update(input.bytes, input.offset, input.length);
    byte[] digest = alg.digest();
    scratch.clear();
    scratch.grow(digest.length * 2);
    MessageDigests.toHexUtf8Bytes(scratch.bytes(), digest);
    return scratch.bytesRefView();
}

Something like that. It'd skip a copy or two. And it'd allow us to reuse the allocations from last time.
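The same idea can be shown without the Elasticsearch types. In the sketch below a plain byte[] stands in for the scratch builder, and the hex encoder is written by hand, since the toHexUtf8Bytes helper above is a proposal rather than an API I can confirm. HexScratchSketch and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: hex-encode a digest straight into a reusable byte buffer,
// avoiding the intermediate String that HexFormat.formatHex creates.
final class HexScratchSketch {
    private static final byte[] HEX = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);

    private final MessageDigest md;
    private byte[] scratch = new byte[0];

    HexScratchSketch(String algorithm) {
        try {
            this.md = MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Hashes input[offset, offset + length) and returns how many bytes of scratch() are valid.
    int hashInto(byte[] input, int offset, int length) {
        md.update(input, offset, length);
        byte[] digest = md.digest(); // digest() also resets md for the next call
        int needed = digest.length * 2;
        if (scratch.length < needed) {
            scratch = new byte[needed]; // grow once, then reuse on later calls
        }
        for (int i = 0; i < digest.length; i++) {
            int b = digest[i] & 0xff;
            scratch[2 * i] = HEX[b >>> 4];
            scratch[2 * i + 1] = HEX[b & 0x0f];
        }
        return needed;
    }

    byte[] scratch() {
        return scratch;
    }
}
```

After the first call the buffer is already large enough, so later calls with the same algorithm allocate only the digest array itself.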

nik9000 · 2024-12-04T17:49:34Z

x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/esql/60_usage.yml

@@ -92,7 +92,7 @@ setup:
  - gt: {esql.functions.to_long: $functions_to_long}
  - match: {esql.functions.coalesce: $functions_coalesce}
  # Testing for the entire function set isn't feasbile, so we just check that we return the correct count as an approximation.
-  - length: {esql.functions: 127} # check the "sister" test below for a likely update to the same esql.functions length check
+  - length: {esql.functions: 128} # check the "sister" test below for a likely update to the same esql.functions length check


I thought we'd blasted this thing....

We didn't (we should IMHO) and it keeps failing.
@idegtiarenko please also update the counter at line 166, otherwise it will fail in release tests, and you won't realize it until it's merged.

I am a bit surprised about the other one.
Currently it is passing in the pr, meaning the count is expected (for snapshot build).
Is there a way to match against a different count in release and non-release builds?
Or maybe this is an argument against (Build.current().isSnapshot())? (see #117989 (comment))

The PR is green because the second one is just not running. You can make it run by adding the test-release label to the PR, but the CI will take much longer to run.

(Build.current().isSnapshot()) in capabilities only affects tests; if you want to make the function snapshot-only you'll have to register it as a snapshot function in EsqlFunctionRegistry rather than as a normal one.

@idegtiarenko those two tests are executed depending on whether the build is a snapshot. One test runs in snapshot-only CI runs, the other runs in release-only CI runs (using the test-release label on the PR). The difference between the two tests is visible in their capabilities settings.

Thanks all!

costin · 2024-12-05T01:31:28Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/hash/Hash.java

+    public void writeTo(StreamOutput out) throws IOException {
+        source().writeTo(out);
+        out.writeNamedWriteable(alg);
+        out.writeNamedWriteable(input);
+    }


This shouldn't be needed for the majority of (if not all) functions - raised #118037

Will take a look and open a follow-up PR.
It also sounds like something similar might be possible for most of the resolveType() implementations. Happy to discuss it separately.

costin · 2024-12-05T01:41:47Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/hash/Hash.java

+
+    @Evaluator(warnExceptions = NoSuchAlgorithmException.class)
+    static BytesRef process(BytesRef alg, BytesRef input) throws NoSuchAlgorithmException {
+        return hash(MessageDigest.getInstance(alg.utf8ToString()), input);


Validate the algorithm at planning time. Use the Validatable interface to verify that the algorithm expression can be folded to a string, and check that the algorithm exists:

... implements Validatable {
    public void validate(Failures failures) {
        try {
            MessageDigest.getInstance(alg);
        } catch (Exception ex) {
            failures.add("Invalid hashing algorithm [{}]", alg);
        }
    }
}

By doing the validation here, you can save the algorithm directly as a string instead of as an expression.
Separately, you can validate the well-known algorithms and use org.es.common.MessageDigests instead, which has MD5 and SHA_1 variants.

Update: My comment is similar to Nik's

If I understand correctly, that is executed during planning.
As a result, we could only have constant alg inputs and would be unable to evaluate hash with an algorithm read from a field or a variable. I wonder if we want to be able to resolve alg dynamically.

If it is a constant then I think we should do what Costin says. I think the Validatable interface brings the failure much further forward, to planning time, which is good. And it's a standard we'd like to stick to more closely, which is also good. But I'm fine supporting non-constant algorithms just as you have it.
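A standalone sketch of the up-front check for the constant case. A plain List<String> stands in for the planner's failure collector, and AlgorithmValidationSketch is a hypothetical name, so this shows the shape of the idea rather than the ES|QL Validatable API.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Sketch: when the algorithm is a constant, reject unknown names before
// any row is processed instead of failing (or warning) per row.
final class AlgorithmValidationSketch {
    // failures is a stand-in for a planner-side failure collector.
    static void validate(String constantAlgorithm, List<String> failures) {
        try {
            MessageDigest.getInstance(constantAlgorithm);
        } catch (NoSuchAlgorithmException e) {
            failures.add("Invalid hashing algorithm [" + constantAlgorithm + "]");
        }
    }

    public static void main(String[] args) {
        List<String> failures = new ArrayList<>();
        validate("SHA-256", failures);
        validate("not-a-hash", failures);
        System.out.println(failures);
    }
}
```

For a non-constant algorithm the lookup has to stay in the per-row path, which is where the warnExceptions handling in the quoted evaluator comes in.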

costin · 2024-12-05T01:46:30Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/hash/Hash.java

+        var result = alg.digest();
+        return new BytesRef(HexFormat.of().formatHex(result));
+    }
+


To clarify the comment above: "192.168.0.1" as a string gets converted into an IP, which then gets automatically converted into a BytesRef. Please check whether it's the original "192.168.0.1" or something else.

costin · 2024-12-05T01:52:56Z

...sql/src/main/java/org/elasticsearch/xpack/esql/expression/function/EsqlFunctionRegistry.java

-
+            new FunctionDefinition[] { def(Match.class, Match::new, "match"), def(QueryString.class, QueryString::new, "qstr") },
+            // hash
+            new FunctionDefinition[] { def(Hash.class, Hash::new, "hash") } };


Also add shorthands for common algorithms as noted in #98545 - SHA0, SHA1 plus SHAKE128.
For that you'd simply extend the main hash function and use a built-in algorithm:

public class Md5 extends Hash { public Md5(Expression input) { super("md5", input); } }

It's a small thing that will be appreciated by the security folks.

nik9000 · 2024-12-12T15:36:39Z

.../esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Hash.java

+                return new HashConstantEvaluator.Factory(
+                    source(),
+                    context -> new BreakingBytesRefBuilder(context.breaker(), "hash"),
+                    context -> md,


Shouldn't this be MessageDigest.getInstance(algorithm)?
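One possible reading of this comment, sketched outside the ES|QL types: a MessageDigest carries internal state between update() and digest(), so handing the same instance to every context risks interleaved updates, whereas a factory that creates a fresh instance per context avoids sharing. DigestPerContextSketch and factory are hypothetical names, and this is an interpretation, not the fix that was merged.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.function.Supplier;

// Sketch: validate the algorithm name once, then create a new MessageDigest
// for every caller instead of capturing one shared instance.
final class DigestPerContextSketch {
    static Supplier<MessageDigest> factory(String algorithm) {
        newDigest(algorithm); // fail early if the name is unknown
        return () -> newDigest(algorithm);
    }

    private static MessageDigest newDigest(String algorithm) {
        try {
            return MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown hash algorithm [" + algorithm + "]", e);
        }
    }

    public static void main(String[] args) {
        Supplier<MessageDigest> perContext = factory("SHA-256");
        // Each call yields an independent instance with its own state.
        System.out.println(perContext.get() != perContext.get());
    }
}
```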

bpintea

Save for maybe some more tests, it LGTM.

x-pack/plugin/esql/qa/testFixtures/src/main/resources/hash.csv-spec

bpintea · 2024-12-12T15:51:59Z

.../esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/Hash.java

+
+    @FunctionInfo(
+        returnType = "keyword",
+        description = "Computes the hash of the input using various algorithms such as MD5, SHA, SHA-224, SHA-256, SHA-384, SHA-512."


I'd maybe add that their availability varies with the JVM and its configuration?
Wondering if we'll ever want to export this functionality through a SHOW (-equivalent) command.

I think so long as we have a list of the ones that are guaranteed to be supported we're good. If we need to support more and it's only on some JVMs we can figure out the wording.

# Conflicts: # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/EsqlFunctionRegistry.java

nik9000 · 2024-12-13T15:04:51Z

.../src/test/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/HashTests.java

+
+    private static TestCaseSupplier createTestCase(String algorithm, DataType algorithmType, DataType inputType) {
+        return new TestCaseSupplier(algorithm, List.of(algorithmType, inputType), () -> {
+            var input = randomAlphaOfLength(10);


Let's randomize the length of this one. Like @bpintea mentioned, it'd be nice to get an empty string sometimes. Maybe:

rarely() ? "" : randomRealisticUnicodeOfLengthBetween(1, 1000)

It'll probably work fine, but more paranoia seems good.

Another option is to call TestCaseSupplier.stringCases, which has a zoo of these things.
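Outside the Elasticsearch test framework, the same kind of input randomization can be sketched with java.util.Random. The helper names below are hypothetical stand-ins for rarely() and randomRealisticUnicodeOfLengthBetween, and the one-in-ten chance of an empty string is an arbitrary choice.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Random;

// Sketch: random inputs of varying length, occasionally empty, checked against
// a simple property (the hex output is twice the digest length).
final class RandomInputSketch {
    static String randomInput(Random random) {
        if (random.nextInt(10) == 0) {
            return ""; // sometimes exercise the empty-string case
        }
        int length = 1 + random.nextInt(1000);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            // Code points 0..0xD7FF: in the BMP and below the surrogate range.
            sb.appendCodePoint(random.nextInt(0xD800));
        }
        return sb.toString();
    }

    static int hexLength(String algorithm, String input) {
        try {
            byte[] digest = MessageDigest.getInstance(algorithm).digest(input.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest).length();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Random random = new Random();
        for (int i = 0; i < 100; i++) {
            if (hexLength("SHA-256", randomInput(random)) != 64) {
                throw new AssertionError("unexpected hex length");
            }
        }
        System.out.println("ok");
    }
}
```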

nik9000 · 2024-12-13T15:05:08Z

.../src/test/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/HashTests.java

+
+    private static TestCaseSupplier createLiteralTestCase(String algorithm, DataType algorithmType, DataType inputType) {
+        return new TestCaseSupplier(algorithm, List.of(algorithmType, inputType), () -> {
+            var input = randomAlphaOfLength(10);


Same randomization here too I think.

nik9000 · 2024-12-13T15:06:23Z

.../src/test/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/HashTests.java

+        return new Hash(source, args.get(0), args.get(1));
+    }
+
+    public void testInvalidAlgorithmLiteral() {


You have a test case for invalid algorithm already. Do you want to keep this one? I might move it to another class in that case - a HashExtraTests or something. That way you don't run it 1233242413414 times.

I believe they cover different branches. The one added via cases.add covers the non-constant evaluator.
This one is foldable; as a result, it throws from org.elasticsearch.xpack.esql.expression.function.scalar.string.Hash#toEvaluator because the input is not a valid algorithm.
That exception is not caught and is not verified by withWarning or withFoldingException.

Please let me know if I am missing something or if exception should not be thrown there.

astefan · 2024-12-13T20:00:44Z

...in/java/org/elasticsearch/xpack/esql/expression/function/scalar/ScalarFunctionWritables.java

@@ -63,6 +64,7 @@ public static List<NamedWriteableRegistry.Entry> getNamedWriteables() {
        entries.add(Concat.ENTRY);
        entries.add(E.ENTRY);
        entries.add(EndsWith.ENTRY);
+        entries.add(Hash.ENTRY);


Alphabetical order, please.

x-pack/plugin/esql/qa/testFixtures/src/main/resources/hash.csv-spec

elasticsearchmachine · 2024-12-18T08:58:11Z

💚 Backport successful

Status	Branch	Result
✅	8.x

This change introduces esql hash(alg, input) function that relies on the Java MessageDigest to compute the hash.

idegtiarenko added 8 commits December 4, 2024 10:17

test case

39a2d17

dummy implementation

a9ff454

simple implementation

5101864

attempt to reuse MessageDigest

fabaabb

unit tests

33d5473

update description

06a322a

replace getBytes usage

2c83925

introduce capability

644874f

idegtiarenko added >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL v9.0.0 labels Dec 4, 2024

idegtiarenko requested a review from nik9000 December 4, 2024 13:40

idegtiarenko changed the title ~~Add esql hash function~~ [ES|QL] Add esql hash function Dec 4, 2024

idegtiarenko changed the title ~~[ES|QL] Add esql hash function~~ ESQL Add esql hash function Dec 4, 2024

idegtiarenko and others added 4 commits December 4, 2024 14:41

Update docs/changelog/117989.yaml

b324f39

update functions counter

514e1db

fix required capability

052c247

Merge branch 'main' into add_esql_hash

48d4dd1

# Conflicts: # x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/esql/60_usage.yml

idegtiarenko requested a review from bpintea December 4, 2024 17:32

nik9000 reviewed Dec 4, 2024

costin reviewed Dec 5, 2024

idegtiarenko added 5 commits December 5, 2024 10:08

additional test cases

1adfc89

update spec

633a044

move function definition

f063055

update

062967f

enable by default

62f97c3

idegtiarenko requested a review from astefan December 11, 2024 15:03

idegtiarenko added 2 commits December 12, 2024 10:41

rename alg -> algorithm

99ad357

more hash test cases

3abf208

idegtiarenko force-pushed the add_esql_hash branch from 27a9042 to 3abf208 Compare December 12, 2024 09:55

astefan added auto-backport Automatically create backport pull requests when merged v8.18.0 labels Dec 12, 2024

nik9000 reviewed Dec 12, 2024

bpintea approved these changes Dec 12, 2024

idegtiarenko added 5 commits December 13, 2024 12:58

cover folded literal case

0983786

Merge branch 'main' into add_esql_hash

7b5a3ec

fix test

08eaee6

additional cases

bfab16b

Merge branch 'main' into add_esql_hash

8dfca90

# Conflicts: # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/EsqlFunctionRegistry.java

idegtiarenko requested a review from nik9000 December 13, 2024 13:38

update function count

be91871

nik9000 approved these changes Dec 13, 2024

astefan reviewed Dec 13, 2024

idegtiarenko added 6 commits December 16, 2024 08:32

Merge branch 'main' into add_esql_hash

cd2df96

fix order

fb86d18

randomize input

edcf755

merge test scenario creation

2c8f541

Merge branch 'main' into add_esql_hash

5ac4ae6

cleanup tests

c72e947

idegtiarenko merged commit 7cf28a9 into elastic:main Dec 18, 2024
16 checks passed

idegtiarenko deleted the add_esql_hash branch December 18, 2024 08:56

idegtiarenko mentioned this pull request Dec 18, 2024

[8.x] ESQL Add esql hash function (#117989) #118927

Merged

idegtiarenko added a commit to idegtiarenko/elasticsearch that referenced this pull request Dec 18, 2024

ESQL Add esql hash function (elastic#117989)

8b9a833

This change introduces esql hash(alg, input) function that relies on the Java MessageDigest to compute the hash.

elasticsearchmachine pushed a commit that referenced this pull request Dec 18, 2024

ESQL Add esql hash function (#117989) (#118927)

c34e8e2

This change introduces esql hash(alg, input) function that relies on the Java MessageDigest to compute the hash.

rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Dec 18, 2024

ESQL Add esql hash function (elastic#117989)

0077ab6

This change introduces esql hash(alg, input) function that relies on the Java MessageDigest to compute the hash.
