This repository has been archived by the owner on Apr 4, 2019. It is now read-only.

Do I have to set up HDFS in order to use streamX? #60

Open
iShiBin opened this issue Jul 25, 2018 · 1 comment

Comments


iShiBin commented Jul 25, 2018

I noticed that I have to edit the Hadoop config files, like core-site.xml and hdfs-site.xml, to configure S3, but I could not find the mentioned config/hadoop-conf directory in my installation (Kafka 0.10.2.0). So do I have to use HDFS in order to use streamX?

What I am trying to do is transform messages in JSON format to Parquet and then store them in S3.

Using Spark could achieve this, but it would require a long-running cluster; alternatively, I could use checkpointing to run a basic per-day ETL.


OneCricketeer commented Feb 11, 2019

> I could not find the mentioned config/hadoop-conf directory in my installation (Kafka 0.10.2.0).

Kafka is not a Hadoop project, which is why you will not find that directory there; you have to create the folder yourself. An EMR instance, or another EC2 machine provisioned with Hadoop, would already have it.
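For reference, a minimal core-site.xml inside that directory might look like the sketch below, assuming the Hadoop S3A connector is available on your classpath; the property names are standard Hadoop S3A settings, and the credential values are placeholders.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Placeholder S3A credentials; prefer an IAM role or a
       credential provider over hard-coded keys where possible -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_AWS_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_AWS_SECRET_KEY</value>
  </property>
</configuration>
```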

> So do I have to use HDFS in order to use streamX?

Not exactly, but you do need a Hadoop-compatible filesystem (which S3 is).

Since this project uses the Hadoop FileSystem API, you just need to point it at a configuration directory containing those XML files.
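As a sketch of what that looks like, here is a connector properties file pointing at such a directory. The connector class and property names (s3.url, hadoop.conf.dir) follow the streamX README, but verify them against your version; the topic, bucket, and paths are placeholders.

```properties
name=s3-sink
connector.class=com.qubole.streamx.s3.S3SinkConnector
format.class=com.qubole.streamx.SourceFormat
tasks.max=1
topics=json-events
flush.size=1000
s3.url=s3://your-bucket/topics
hadoop.conf.dir=/etc/streamx/hadoop-conf
```

For the JSON-to-Parquet goal, you would swap format.class for a Parquet format if your build exposes one; the upstream HDFS connector ships io.confluent.connect.hdfs.parquet.ParquetFormat, which requires records with schemas, so schemaless JSON would need a converter that supplies a schema.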

> Using Spark could achieve this, but it would require a long-running cluster

Kafka Connect consumers are also typically long-running, operating as part of a cluster / consumer group.
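For example, a Connect cluster is normally started with the distributed-mode script that ships with Kafka, and connectors are then submitted over its REST API (port 8083 by default); s3-sink.json below is a hypothetical file wrapping a connector config in JSON.

```bash
# Start a distributed Connect worker (run one per node of the cluster)
bin/connect-distributed.sh config/connect-distributed.properties

# Submit the sink connector to the running cluster via the REST API;
# s3-sink.json is a placeholder containing {"name": ..., "config": {...}}
curl -X POST -H "Content-Type: application/json" \
     --data @s3-sink.json http://localhost:8083/connectors
```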
