I noticed I have to configure the Hadoop config files, such as core-site.xml and hdfs-site.xml, to configure S3. However, I could not find the mentioned config/hadoop-conf directory in my installation (Kafka 0.10.2.0). So do I have to use HDFS in order to use streamX?
What I am trying to do is transform some messages in JSON format to Parquet and then store them in S3.
Spark could achieve this, but it would require a long-running cluster, or I could use checkpointing to do a basic per-day ETL.
> And I could not find the mentioned config/hadoop-conf in my installation (Kafka 0.10.2.0).
Kafka is not a Hadoop project, which is why you will not find that folder there; you have to create it yourself. An EMR instance, or another EC2 machine provisioned with Hadoop, would already have this folder.
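For illustration, a minimal setup could look like the sketch below. The directory path is arbitrary (it just has to match whatever the connector is pointed at), the fs.s3a.* keys are standard Hadoop S3A properties, and the credential values are placeholders; if your streamX build uses the older s3n filesystem, the property names would differ.

```sh
# Create the Hadoop conf directory yourself; the path is a placeholder.
mkdir -p config/hadoop-conf
```

```xml
<!-- config/hadoop-conf/core-site.xml: minimal S3A settings (placeholder values) -->
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_AWS_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_AWS_SECRET_KEY</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>s3.amazonaws.com</value>
  </property>
</configuration>
```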
> So do I have to use HDFS in order to use this streamX?
Not exactly, but you need to use a Hadoop-compatible filesystem (which S3 is).
Since this project uses the Hadoop FileSystem API, you just need to point it at the configuration directory containing those XML files.
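As a rough sketch, the sink connector properties could look like this. The property names (hadoop.conf.dir, s3.url, flush.size) follow the kafka-connect-hdfs lineage that streamX is derived from, and the topic name and bucket are made up; check the project's quickstart properties for the exact keys.

```properties
# streamx-s3-sink.properties -- illustrative only; exact property names may
# differ between streamX versions, so verify against the shipped quickstart file.
name=streamx-s3-sink
connector.class=com.qubole.streamx.s3.S3SinkConnector
tasks.max=1
topics=json-events
# Directory holding core-site.xml / hdfs-site.xml
hadoop.conf.dir=config/hadoop-conf
# Destination bucket, addressed through the Hadoop filesystem layer
s3.url=s3a://my-bucket/topics
flush.size=1000
```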
> Using spark could achieve this target but it would require a long-running cluster to do
Kafka Connect consumers are also typically long-running, as part of a cluster / consumer group.
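In other words, the Connect worker itself is the long-running process. For example, with a stock Kafka distribution you could run it in standalone mode (the connector properties file name here is hypothetical):

```sh
# Long-lived Kafka Connect worker in standalone mode; use connect-distributed.sh
# with a worker cluster for the fault-tolerant setup.
bin/connect-standalone.sh config/connect-standalone.properties streamx-s3-sink.properties
```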