[FEATURE] Hadoop Client #134

Closed
brijos opened this issue Jun 10, 2022 · 38 comments

@brijos

brijos commented Jun 10, 2022

Is your feature request related to a problem?
Some community members who previously used Elasticsearch are accustomed to connecting through a Hadoop client. The ask is to create a Hadoop client that connects to OpenSearch.

@brijos brijos added the enhancement New feature or request label Jun 10, 2022
@kffallis

As part of this feature, it would be great to provide a signing mechanism or some sort of add-on extension that signs requests using SigV4 so that the client can be used on AWS. The authentication mechanisms should also account for other cloud providers: when installing and setting up the connector, much as you would with Fluentd, you would pull in specific packages for this functionality. That would ease adoption of other cloud providers' signing and auth mechanisms.

@mattweber

+1, I am looking for an equivalent of the elasticsearch-hadoop client. The initial blog post says this is going to be supported but I do not see any additional details or dates for release. Can we please get an ETA?

@dblock
Member

dblock commented Jun 15, 2022

Generally we've been forking these and making them work for OpenSearch. @CEHENKLE can speak to potential schedules, but don't hold back if you have cycles to do it: we'll gladly adopt the fork in the organization, or just contribute things like SigV4 signing to someone else's fork.

@mattweber

@dblock @CEHENKLE Previously I was able to use older versions of elasticsearch-hadoop that don't do version checks against OpenSearch 1.3+, but with the removal of types in OpenSearch 2.0 this is no longer possible and people are completely blocked. I will gladly contribute if I get a chance to dig into it, but I was hoping there might already be some existing work on it.

@dblock
Member

dblock commented Jun 15, 2022

@mattweber There's some reverting going on (e.g. opensearch-project/OpenSearch#3484) around types to make some of it backward compatible again, to prevent exactly these types of issues. Try against the 2.0 branch (OpenSearch 2.1), and open an issue if something is still not working?

@mattweber

Will do, thanks for the info @dblock

@brijos
Author

brijos commented Jun 22, 2022

@mattweber Did you have any luck?

I can answer on behalf of @CEHENKLE that we don't have anyone at the moment who can focus on Hadoop, but we would love assistance if you are offering @mattweber.

@mattweber

@brijos No luck. The client assumes we have types when it sees what it thinks is Elasticsearch version 2.0.0. I forked the client, removed that check, manually built the Spark driver, and have been using that fork. I only use it for writing data to OpenSearch, so the change was minimal. I am not sure what would be involved to support everything.

Are there any docs on how to add a new client, or a stub repository I can open a pull request against? I am not committing to anything, but if I get some time I would gladly look into adding it.
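
The version mix-up described above can be illustrated with a small sketch. It assumes the cluster's root endpoint returns a `version` object with a `number` field and, on OpenSearch, a `distribution` field set to `opensearch`. The helper name `expects_mapping_types` and the major-version thresholds are illustrative assumptions, not the logic of elasticsearch-hadoop or opensearch-hadoop.

```python
def expects_mapping_types(root_info: dict) -> bool:
    """Guess whether a cluster still uses mapping types (hypothetical helper)."""
    version = root_info.get("version", {})
    # Take the major component of a version string such as "2.0.0".
    major = int(str(version.get("number", "0")).split(".")[0])

    # OpenSearch identifies itself through the "distribution" field.
    if version.get("distribution") == "opensearch":
        # Assumption: types are gone from OpenSearch 2.0 onward, as reported in this thread.
        return major < 2

    # Assumption: anything else is Elasticsearch, where types were removed in 8.0.
    return major < 8


# The same version number leads to opposite answers depending on the distribution.
opensearch_2 = {"version": {"distribution": "opensearch", "number": "2.0.0"}}
elasticsearch_2 = {"version": {"number": "2.0.0"}}
print(expects_mapping_types(opensearch_2))
print(expects_mapping_types(elasticsearch_2))
```

A client that looks only at `number` cannot tell these two responses apart, which matches the behavior reported here.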

@kffallis

My needs for this are twofold: 1) support basic auth / certs for open source implementations of the engine, and 2) support additional auth protocols like SigV4 on AWS. If someone has done some initial work that addresses OpenSearch 1.x and can reach back to a minimum of Elasticsearch 7.1, that would get things started. I can hook in the SigV4 signing if this is in Java or Python. I just need to know the entry points for whichever HTTP client library is being used, and hopefully that package supports an auth interface.
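
For readers unfamiliar with SigV4, the core of the signing step is a chain of HMAC-SHA256 operations that derives a per-day, per-region, per-service key. The sketch below shows only that derivation, using the Python standard library. The function names are illustrative, the credentials are placeholders, and this is not the code used by any connector. The service name `es` is the one used for Amazon OpenSearch Service domains.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    """Return the raw HMAC-SHA256 digest of message under key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive the SigV4 signing key for one day, region, and service."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sign(signing_key: bytes, string_to_sign: str) -> str:
    """Return the hex signature placed in the Authorization header."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder credentials and date, for illustration only.
key = derive_signing_key("EXAMPLE-SECRET-KEY", "20220615", "us-west-2", "es")
print(len(key))
```

A full implementation also has to build the canonical request and the string to sign, which is where the HTTP client's entry points mentioned above come in.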

@brijos
Author

brijos commented Jul 1, 2022

Quick update, I'm going to start the process of creating a new repo for a Hadoop client with the intent of @mattweber contributing what he has. I'll keep this thread updated as we progress.

@brijos
Author

brijos commented Jul 5, 2022

@mattweber does your fork include support for Hive3 and Spark3 as well?

@mattweber

@brijos I do not have a fork; I literally just took Elastic's code and made a single-line change. This was based on 7.13.4, which is the version before the explicit version checks and supports Spark 3. For a real fork we would need to fork 7.10.2, and I do not know whether Spark 3 was in that version.

@mattweber

Spark 3 support didn't land until 7.12.0, so it would not be in any fork without us doing additional work.

@wbeckler

wbeckler commented Aug 9, 2022

> Are there any docs on how to add a new client, or a stub repository I can open a pull request against? I am not committing to anything, but if I get some time I would gladly look into adding it.

@mattweber You can create your own public repo and add an issue in the repo proposing to be moved to the OpenSearch project. Tag @wbeckler in that issue. There might be some checkpoints that I need to clear, and then we'll move it over. Then we would want to make you a maintainer once the repo is in the OpenSearch project.

@dblock
Member

dblock commented Aug 10, 2022

@wbeckler FYI, we do have precedent for keeping a repo admin (vs. just a maintainer). The owner of https://github.com/opensearch-project/opensearch-plugin-template-java retained admin rights on the repo when it was moved into this org.

@wbeckler

This is now in progress: https://github.com/opensearch-project/opensearch-hadoop/pulls

@indranilr

Will this client be made available via Maven? We are using OpenSearch 1.2.4 in our organization and have been hunting for a compatible Spark client.

@wbeckler

This is coming to Maven once the application passes a security review. These reviews usually take 1-2 months, but no date has been set yet.

@wbeckler wbeckler transferred this issue from opensearch-project/opensearch-clients Feb 28, 2023
@wbeckler wbeckler mentioned this issue Mar 9, 2023
@wbeckler wbeckler removed the untriaged label Mar 9, 2023
@prashantsc

prashantsc commented Mar 28, 2023

Thanks @harshavamsi and @wbeckler. Has there been any progress on an ETA for opensearch-hadoop in the last two weeks?

@harshavamsi
Collaborator

@prashantsc The client is undergoing a security review inside Amazon. We are currently targeting 4/30/23, but that could change depending on the outcome.

@prashantsc

Thanks @harshavamsi for the information.

@junhl

junhl commented May 1, 2023

@harshavamsi Do you have any updates on the security review? We have passed 4/30/23, so I am wondering whether a new date will be set.

@harshavamsi
Collaborator

@junhl Thanks for checking in. Happy to report that the security review is complete and we're now working towards a release. There is some ongoing discussion over at opensearch-project/opensearch-build#3385 about the nuances of releasing this client, and that is also where we will track the release. Stay tuned: a release will most likely happen this week. Thanks for being patient!

@chaitujil

chaitujil commented May 5, 2023

@harshavamsi Spark 3.x clusters can't connect to OpenSearch using the elasticsearch-spark libraries. Somehow the compatibility is broken and the connection fails with

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version

However Spark 2.x clusters don't have this problem. Blocker for us because we only have Spark 3.x clusters :(

Is this client coming anytime soon?

@harshavamsi
Collaborator

Hi @chaitujil, the OpenSearch build team is working on getting a release out as soon as possible. In the meantime, could you let me know whether you're using managed OpenSearch on AWS or hosting your own service? Also, which versions of OpenSearch are you targeting? Thanks.

@chaitujil

chaitujil commented May 8, 2023

@harshavamsi We are targeting an OpenSearch 1.1 cluster on managed OpenSearch.

@harshavamsi
Collaborator

Hi folks, we published Snapshots here -- https://aws.oss.sonatype.org/content/repositories/snapshots/org/opensearch/client/. Do give it a try and let us know. We're going to publish an actual release early this week. Thanks!

@prashantsc

Thanks @harshavamsi for the update.

@mgolatkar

@harshavamsi It looks like the opensearch-spark libs shared at the link above https://aws.oss.sonatype.org/content/repositories/snapshots/org/opensearch/client/ work with Amazon OpenSearch Service with OpenSearch engine v2.3. We want to connect to Amazon OpenSearch Service with Elasticsearch engine v7.10 from an AWS Glue job (type: Spark, Glue version 3.0 or 4.0). Can you please share the version of the libraries/jar files that we need to use for this?

@harshavamsi
Collaborator

> @harshavamsi It looks like the opensearch-spark libs shared at the link above https://aws.oss.sonatype.org/content/repositories/snapshots/org/opensearch/client/ work with Amazon OpenSearch Service with OpenSearch engine v2.3. We want to connect to Amazon OpenSearch Service with Elasticsearch engine v7.10 in the AWS Glue Job (Type: Spark, Glue version - 3.0 or 4.0). Can you please share the version of the libraries/jar files that we need to use for the same?

Hi, we have a workaround that supports ES 7.x clusters. Have you given the connector a try? Does it fail?

@indranilr

> Hi folks, we published Snapshots here -- https://aws.oss.sonatype.org/content/repositories/snapshots/org/opensearch/client/. Do give it a try and let us know. We're going to publish an actual release early this week. Thanks!

Is this release compatible with OpenSearch 1.2.x and Spark 3.2.x / Scala 2.12?

@mgolatkar

> > @harshavamsi It looks like the opensearch-spark libs shared at the link above https://aws.oss.sonatype.org/content/repositories/snapshots/org/opensearch/client/ work with Amazon OpenSearch Service with OpenSearch engine v2.3. We want to connect to Amazon OpenSearch Service with Elasticsearch engine v7.10 in the AWS Glue Job (Type: Spark, Glue version - 3.0 or 4.0). Can you please share the version of the libraries/jar files that we need to use for the same?
>
> Hi, we have a workaround that supports ES 7.x clusters. Have you given the connector a try? Does it fail?

@harshavamsi On referencing the jar file opensearch-spark-30_2.13-1.0.0-20230513.002822-1.jar in our AWS Glue Job (Type: Spark, Glue version - 3.0 or 4.0) we get the following error:

Scala signature package has wrong version expected: 5.0 found: 5.2 in package.class

Does this jar have to be compiled with Java 8? Can you please help take a look at the issue?
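
An error like the one above commonly points to a Scala binary version mismatch rather than a Java version problem: the jar name carries a `_2.13` suffix, so it would not load on a Scala 2.12 runtime. Whether the Glue runtime in question uses Scala 2.12 is an assumption to verify. The sketch below shows one way to check a jar name against the runtime's Scala version before submitting a job. The helper names are hypothetical.

```python
import re
from typing import Optional


def scala_suffix(jar_name: str) -> Optional[str]:
    """Extract the Scala binary version (for example "2.13") from a jar file name."""
    match = re.search(r"_(\d+\.\d+)-", jar_name)
    return match.group(1) if match else None


def matches_runtime(jar_name: str, runtime_scala_version: str) -> bool:
    """Compare the jar's Scala suffix with the runtime's major.minor version."""
    runtime_binary = ".".join(runtime_scala_version.split(".")[:2])
    return scala_suffix(jar_name) == runtime_binary


jar = "opensearch-spark-30_2.13-1.0.0-20230513.002822-1.jar"
print(scala_suffix(jar))
# The runtime version string here is a placeholder for whatever the cluster reports.
print(matches_runtime(jar, "2.12.15"))
```

If the check fails, the usual remedy is to pick the artifact whose suffix matches the runtime, if one is published.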

@akshayjain3450

akshayjain3450 commented May 30, 2023

Hi @harshavamsi, is there a place where I can find all the configuration options that the OpenSearch Spark connector supports?
This would help my project let users define these options according to their requirements.

I am using OpenSearch 1.3.10 with Spark 3.3.1 and Scala 2.12.

@harshavamsi
Collaborator

@akshayjain3450 please refer to this file -- https://github.com/opensearch-project/opensearch-hadoop/blob/main/mr/src/main/java/org/opensearch/hadoop/cfg/ConfigurationOptions.java. It has all the configuration options that you can set within the client. Thanks!
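
As a rough illustration of how such options might be exposed to users, the sketch below assembles a dictionary of connector settings. The option keys are assumptions modeled on the `opensearch.*` naming in the linked file and should be checked against it. The helper `build_connector_options` and the example host are hypothetical.

```python
from typing import Dict, Optional


def build_connector_options(
    host: str,
    port: int = 9200,
    use_ssl: bool = True,
    user: Optional[str] = None,
    password: Optional[str] = None,
    sigv4_region: Optional[str] = None,
) -> Dict[str, str]:
    """Assemble connector options as strings (key names are assumptions)."""
    options = {
        "opensearch.nodes": host,
        "opensearch.port": str(port),
        "opensearch.net.ssl": "true" if use_ssl else "false",
    }
    if sigv4_region is not None:
        # SigV4 signing for AWS-hosted domains.
        options["opensearch.aws.sigv4.enabled"] = "true"
        options["opensearch.aws.sigv4.region"] = sigv4_region
    elif user is not None and password is not None:
        # HTTP basic auth for self-hosted clusters.
        options["opensearch.net.http.auth.user"] = user
        options["opensearch.net.http.auth.pass"] = password
    return options


opts = build_connector_options("search.example.internal", port=443, sigv4_region="us-west-2")
for key in sorted(opts):
    print(key, "=", opts[key])
```

In PySpark, a dictionary like this could then be passed along the lines of `df.write.format("opensearch").options(**opts)`, assuming the connector registers that format name.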

@harshavamsi
Collaborator

Closing this issue as completed via #227

@venkatbrr

@harshavamsi I don't see a workflow for Scala 2.12. Are you not publishing any jars compatible with Spark 3.x and Scala 2.12?
https://github.com/opensearch-project/opensearch-hadoop/tree/main/.github/workflows

@Xtansia
Collaborator

Xtansia commented Nov 2, 2023

> @harshavamsi I don't see a workflow for Scala 2.12. Are you not publishing any jars compatible with Spark 3.x and Scala 2.12?
> https://github.com/opensearch-project/opensearch-hadoop/tree/main/.github/workflows

@venkatbrr The artifact you're looking for is published as org.opensearch.client:opensearch-spark-30_2.12
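
The artifact names in this thread appear to follow the pattern `opensearch-spark-<spark line>_<scala binary version>`. Under that assumption, a full Maven coordinate can be composed as sketched below. The helper name is hypothetical, and the version `1.0.1` is taken from a later comment in this thread, not from a verified release list.

```python
def connector_coordinate(spark_line: str, scala_binary: str, version: str) -> str:
    """Compose a Maven coordinate for the Spark connector (naming pattern assumed)."""
    artifact = f"opensearch-spark-{spark_line}_{scala_binary}"
    return f"org.opensearch.client:{artifact}:{version}"


coordinate = connector_coordinate("30", "2.12", "1.0.1")
print(coordinate)
```

A coordinate in this form can be given to Spark through the `spark.jars.packages` setting or the `--packages` flag of `spark-submit`.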

@venkatbrr

venkatbrr commented Nov 2, 2023

@harshavamsi @Xtansia Can you please update the compatibility section for version 1.0.1? We are trying to use the Hadoop client with an AWS-managed OpenSearch 2.5 cluster, so we want to check compatibility.
https://github.com/opensearch-project/opensearch-hadoop/blob/main/COMPATIBILITY.md
