Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] - Performing complex relational queries is merely impossible. #446

Open
DumboJetEngine opened this issue Nov 24, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@DumboJetEngine
Copy link
Contributor

DumboJetEngine commented Nov 24, 2023

Is your feature request related to a problem?

While trying to translate some SQL relational queries into OpenSearch queries using nested documents and the OpenSearch.Client, I got into MANY problems due to the fact that I had to gather all nested fields under a single nested query.
I wrote a question here, but I am not sure if is clear enough.

Essentially, when the SQL query is complex and involves many JOINed tables, the nested queries in OpenSearch become too complex to write by hand.

For example, say you have this SQL query:

SELECT *
FROM Document d
INNER JOIN NestedType1 t1 ON t1.Id = d.Id  -- nested document 1
INNER JOIN NestedType2 t2 ON t2.Id = d.Id  -- nested document 2
WHERE (
    (t1.Key = 1 OR t2.Value = 7) AND (t1.Key2 = 5 OR t2.Value = 9 OR d.Id = 7)
)

To translate that into an OpenSearch query, after you create the nested documents for t1 & t2, you have to gather all t1 filters under a single nested query, all the t2 filters under another nested query, and all non-nested field filters outside the nested queries.

By doing some manual boolean transformations one could shape the query like that:

(t1.Key = 1 || t2.Value = 7) && (t1.Key2 = 5 || t2.Value = 9 || d.Id = 7)

    ===>

(t1.Key = 1 && t1.Key2= 5) ||
(t1.Key = 1 && t2.Value = 9) ||
(t1.Key = 1 && d.Id = 7) ||
(t2.Value = 7 && t1.Key = 5) ||
(t2.Value = 7 && t2.Value = 9) ||  -- not possible
(t2.Value = 7 && d.Id = 7) 

    ===>

( NestedQuery(t1.Key = 1 && t1.Key2= 5) ) ||
( NestedQuery(t1.Key = 1) && NestedQuery(t2.Value = 9) ) ||
( NestedQuery((t1.Key = 1)) && d.Id = 7 ) ||
( NestedQuery(t2.Value = 7) && NestedQuery(t1.Key = 5) ) ||
( NestedQuery(t2.Value = 7) && d.Id = 7) 

This gets very easily out of hand for more complex queries, and even more for dynamic queries where you cannot statically analyze and simplify the queries beforehand.

What solution would you like?

A way to preprocess queries, in order to gather all nested fields under a single nested query would be great, since this cannot be done manually most of the times.
In the example above, it would be great if the mentioned query "simplification" and gathering of all:

  • t1 filters under a single nested query
  • t2 filters under another single nested query
  • d filters outside all nested queries
    ...could be performed by the library. Also, when using property attributes/annotations for the nested object, the library could infer by itself when to use a nested query and when not, without having the programmer specify anything extra.
    I guess it is a complex feature, but at least I hope I have explained well what the problem is.
    If I am missing something (because I am quite new to OpenSearch), let me know.

The way I am imagining the implementation is something like this, which uses C# Expression trees in the .Relational() method:

var response = openSearchClient.Search<Document>(s => s
  .Index(documentsIndexName)
    .Query(q => q
      .Bool(bq => bq
        .Filter(mq =>
        {
          return mq.Relational(rq =>
            return  (  rq.Term(d => d.t1.First().Key, 1)  ||  rq.Term(d => d.t2.First().Value, 7)  ) &&
                (  rq.Term(d => d.t1.First().Key2, 5)  ||  rq.Term(d => d.t2.First().Value, 9) || rq.Term(d => d.id, 7) );
          );
        })
      )
    )
  );

I guess a similar feature must already exist in the SQL plugin for OS, since that problem would also be present there and had to be solved somehow, although the SQL plugin has some restrictions from what I can recall.

What alternatives have you considered?

  • Simplifying the queries manually as above : it was not feasible because my queries were built dynamically and were too complex.
  • Using the inner_hits from each nested query in my query to eliminate some returned results : it was not feasible since I had many nested queries and nested documents queried in a single convoluted query.
  • Creating a single nested document (instead of many) that contains the result of a LEFT JOIN between all the nested documents, so that all nested fields belong to a single nested path and query : that was not enough, since non-nested fields would still need to be placed outside the nested query and that was not so feasible by hand. Plus, the amount of data stored in OS was increasing dramatically by the join operation.
  • Using the SQL plugin instead of the OpenSearch.Client : this is not possible since my queries are built dynamically, I have more than 2 nested objects, etc.
@DumboJetEngine DumboJetEngine added enhancement New feature or request untriaged labels Nov 24, 2023
@Xtansia Xtansia removed the untriaged label Nov 26, 2023
@DumboJetEngine
Copy link
Contributor Author

DumboJetEngine commented Nov 28, 2023

Correcting myself on this statement :

I guess a similar feature must already exist in the SQL plugin for OS, since that problem would also be present there and had to be solved somehow

Turns out, the SQL plugin has the same (or similar) issues (check bullet #4):
https://opensearch.org/docs/latest/search-plugins/sql/sql/complex/

Quoting the page :

In a WHERE statement, don’t combine trees that contain multiple indexes.

So, using the SQL plugin does not seem to be a viable solution at all.

@DumboJetEngine
Copy link
Contributor Author

DumboJetEngine commented Nov 29, 2023

And on a second review, using C# expressions does not seem like a good idea (too many limitations).
Probably, preprocessing the query before sending it to O.S. would be better.
And ask for it like that :

var response = openSearchClient.Search<Document>(s => s
  .Index(documentsIndexName)
    .Query(q => q
      .Bool(bq => bq
        .Filter(rq =>
        {
            return  (  rq.Term(d => d.t1.First().Key, 1)  ||  rq.Term(d => d.t2.First().Value, 7)  ) &&
                (  rq.Term(d => d.t1.First().Key2, 5)  ||  rq.Term(d => d.t2.First().Value, 9) || rq.Term(d => d.id, 7) );
        })
      )
    )
    .RelationalMode(true)
  );

But I am not 100% sure how one could exclude specific nested queries from being "relationalized".
Perhaps with an extra parameter that could be specified on each nested query.

@DumboJetEngine
Copy link
Contributor Author

I have started experimenting on an implementation for this.
I am not sure what will be the end result, but it is interesting for me already...

@DumboJetEngine
Copy link
Contributor Author

I have a working implementation of this, that also adds some extra tiny features.
It appears to work for the queries I tested it on, but I will need to test it some more.

One thing that I had to do was to add a way of cloning IQuery/QueryBase instances, and I have created an extension method that handles each type of query individually, but I would guess that the best thing would be to make IQuery expose a .Clone() method. This, however, will make the code diverge more from the original ElasticSearch code (but maybe that is OK?) and it will also not keep the changes contained in a small portion of the code, but they will be all over the place.
If you want me to add the .Clone() method, let me know.

I am planning to add some tests as well, but I have various issues with Visual Studio not running the tests.

There were also some "challenges" with preserving the scoring behavior with Filter vs Must, but I think this is of smaller importance when you try to port SQL queries to O.S. .

This is how it works now:

var queryChangeLogStringBuilder = new StringBuilder(4 * 1024);
var queryResponse = openSearchClient.Search<Book>(s => s
  .Index(documentsIndexName)
    .Query(q => q
      .Bool(bq => bq
        .Filter(rq =>
        {
            var query = (
                    (rq.Term(d => d.Tags.First().Name, "cats") || rq.Term(d => d.Price, 11))
                    &&
                    rq.Term(d => d.Tags.First().Name, "dogs")
                )
                .Name("cats-and-dogs")
                .Strict(true)
                .TransformToRelationalEquivalent<Book>(queryChangeLogStringBuilder);

            return query;
        })
      )
    )
  );



[OpenSearchType(IdProperty = "Id")]
public class Book
{
    public int Id { get; set; }
    public string? Title { get; set; }
    public int Pages { get; set; }
    public decimal Price { get; set; }

    [Nested]
    public Tag[] Tags { get; set; } = new Tag[0];

    [Nested]
    public Meta[] Meta { get; set; } = new Meta[0];

}

public class Tag
{
    public string? Name { get; set; }
    public int Weight { get; set; }
}

public class Meta
{
    public string? Name { get; set; }
    public string? Value { get; set; }
}

@DumboJetEngine
Copy link
Contributor Author

DumboJetEngine commented Jan 4, 2024

@Xtansia
Is it OK if I send, in one commit and one PR, the implementation for this issue ( #446 ), #380, and #375 ?

@Xtansia
Copy link
Collaborator

Xtansia commented Jan 7, 2024

@DumboJetEngine for ease of review and change tracking purpose, it would best that they're separate PRs please

@DumboJetEngine
Copy link
Contributor Author

@Xtansia
OK. Done.
If you can review and merge the two small PRs I have just sent, so that I can resolve any conflict for the 3rd, it would be great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants