This repository has been archived by the owner on Apr 4, 2019. It is now read-only.
Merging kafka-connect-hdfs 3.3.0 to support Kafka 0.11 #49
Open
lewisdawson wants to merge 105 commits into qubole:master from lewisdawson:master
105 commits
24fcba4
"Version bump to 3.2.0-SNAPSHOT"
c8a2a83
Version bump to 0.10.2.0-SNAPSHOT
76e40c8
fix version to 3.2 in docs
b254849
Merge branch '3.1.x'
ewencp 7b5dfa1
fix a typo
aseigneurin b8de69c
fix a typo
aseigneurin 9e0c9c1
Merge pull request #135 from aseigneurin/fix-typos
ewencp c058019
fix typo
hjespers 43379dc
Merge remote-tracking branch 'origin/3.1.x'
ewencp 0f6173c
CC-273: connect deps should be scoped as provided
shikhar e42429d
Switch to toEnrichedRst() for config opts docgen
shikhar f68beb4
CC-391: Fix configuration with DefaultPartitioner
kkonstantine be9a84e
Merge branch 'CC-391-Configuration-with-DefaultPartitioner-compares-a…
kkonstantine f38410a
HOTFIX: Override new method in MockSinkTaskContext.
kkonstantine e26514e
Merge branch 'HOTFIX-Override-newly-added-method-in-SinkTaskContext-i…
kkonstantine e3f6641
Fix topics with periods for hive #136 #137
ig-michaelpearce fb2b0ab
accidental un-needed format change, revert
ig-michaelpearce 0f754b8
remove accidental new line added
ig-michaelpearce 413a169
Merge pull request #164 from IG-Group/ISSUE-136
ewencp b0c4839
Updated Readme for Kafka connect FAQ (#75)
rekhajoshm 9841a90
updated avro-tools instructions in doc (#65)
coughman 3090fa7
Merge with docs branch
9838b61
Merge branch '3.1.x'
5b21d21
Bump version to 3.3.0-SNAPSHOT
cf40f3b
Bump Confluent to 3.3.0-SNAPSHOT, Kafka to 0.10.3.0-SNAPSHOT
9196b6c
Bump version to 3.2.0-SNAPSHOT
34c9088
MINOR: Upgrade Hadoop version to 2.7.3 and joda-time to 2.9.7
kkonstantine 55716d4
Merge branch '3.2.x'
kkonstantine 4e357b5
Handle primitive types in AvroRecordWriterProvider. (#176)
ewencp 24cb9c0
Merge remote-tracking branch 'origin/3.0.x' into 3.1.x
ewencp 545f2f3
Merge remote-tracking branch 'origin/3.1.x' into 3.2.x
ewencp 2c8522e
Merge branch '3.2.x'
ewencp 535be47
Set Confluent to 3.2.0, Kafka to 0.10.2.0-cp1.
da94c38
Bump version to 3.2.1-SNAPSHOT
58f6c85
Bump Confluent to 3.2.1-SNAPSHOT
8c0596b
Bump Kafka to 0.10.2.1-SNAPSHOT
9afe713
Bump Kafka to 0.11.0.0-SNAPSHOT
1abe85c
Merge remote-tracking branch 'origin/3.2.x'
norwood ef68aa6
CC-437: Update 3.2.0 changelog
ewencp 241eb77
Add missing PR-170 to release notes.
ewencp 2a55cd9
Merge remote-tracking branch 'origin/3.2.0-post' into 3.2.x
ewencp 60a5323
Merge remote-tracking branch 'origin/3.2.x'
ewencp 25b10bb
Merge pull request #179 from confluentinc/cc-437-3.2.0-changelog
ewencp 713de9a
Merge remote-tracking branch 'origin/3.2.0-post' into 3.2.x
ewencp 4e12b94
Merge remote-tracking branch 'origin/3.2.x'
ewencp 449ef58
Add 3.2.1 changelog.
ewencp 54e0943
Merge pull request #184 from confluentinc/3.2.1-changelog
ewencp 0b53750
Merge remote-tracking branch 'origin/3.2.x'
ewencp 8b439a0
Set Confluent to 3.2.1, Kafka to 0.10.2.1-cp1.
64598ed
Bump version to 3.2.2-SNAPSHOT
cee729b
Bump Confluent to 3.2.2-SNAPSHOT, Kafka to 0.10.2.2-SNAPSHOT
7116cef
Merge remote-tracking branch 'origin/3.2.1-post' into 3.2.x
norwood 7efda02
Merge remote-tracking branch 'origin/3.2.x'
norwood e786248
Fix HdfsSinkConnector to extend from SinkConnector instead of Connector.
ewencp 33023cb
Merge pull request #194 from confluentinc/extend-sink-connector
ewencp d52155f
Merge remote-tracking branch 'origin/3.0.x' into 3.1.x
ewencp 01af529
Merge remote-tracking branch 'origin/3.1.x' into 3.2.x
ewencp d89f97a
Merge remote-tracking branch 'origin/3.2.x'
ewencp cc37611
CC-491: Consolidate and simplify unit tests of HDFS connector
kkonstantine 02d1bb7
Move verification methods and members to TestWithMiniDFSCluster
kkonstantine e939f75
Consolidate records creation methods in TopicPartitionWriterTest too.
kkonstantine b87115f
Fix hive tests.
kkonstantine 66ddb58
Add tests with interleaved (non consecutive) records between partitions.
kkonstantine 815a693
Code style fixes.
kkonstantine e57fed9
Increase maximum perm size to accommodate failing tests.
kkonstantine d449c4f
Upgrade surefire parameters for test forking.
kkonstantine b8a335b
Clean commented out code and fix typo.
kkonstantine be14311
Remove debugging statement from parquet test.
kkonstantine fd28e5e
Merge branch 'CC-491-Consolidate-unit-tests-of-HDFS-connector'
kkonstantine 6970a12
Fix incorrect licensing and webpage info.
ewencp 5d252e9
Merge pull request #200 from confluentinc/remove-incorrect-license
ewencp 08d8f1a
Merge remote-tracking branch 'origin/3.0.x' into 3.1.x
ewencp 6ccc55b
Merge remote-tracking branch 'origin/3.1.x' into 3.2.x
ewencp 5ac7268
Merge remote-tracking branch 'origin/3.2.x' into 3.3.x
ewencp 3592793
Upgrade avro to 1.8.0
mageshn c1d1f97
Add 3.2.2 changelog
ewencp 78b79a7
Merge pull request #204 from confluentinc/3.2.2-changelog
ewencp 95b100b
Merge remote-tracking branch 'origin/3.2.x' into 3.3.x
ewencp a1fd0c4
Upgrade avro to 1.8.2
mageshn c464ef2
Upgrade avro to 1.8.2
mageshn 437485a
jenkinsfile time
norwood 2afec6a
Merge pull request #205 from confluentinc/avro-upgrade
ewencp 5337079
Set Confluent to 3.2.2, Kafka to 0.10.2.1-cp2.
28eb13b
Add 3.3.0 changelog
kkonstantine 428f211
Merge pull request #209 from kkonstantine/3.3.0-changelog
kkonstantine 367da71
Update quickstart to use Confluent CLI
kkonstantine 2e8694d
Add stop instructions at the end of the quickstart.
kkonstantine e2d9b26
Improve wording around 'confluent log connect'
kkonstantine bb6996d
Set Confluent to 3.3.0, Kafka to 0.11.0.0-cp1.
ddddfd3
Merge pull request #217 from kkonstantine/Update-quickstart-to-use-CLI
kkonstantine 6af158f
Bump version to 3.3.1-SNAPSHOT
4a7803f
Bump Confluent to 3.3.1-SNAPSHOT
d04f924
Merge remote-tracking branch 'origin/3.3.0-post' into 3.3.x
326bdcd
Bump Kafka to 0.11.0.1-SNAPSHOT
4f73c33
Merge remote-tracking branch 'origin/3.2.2-post' into 3.2.x
79cbee7
Merge branch '3.2.x' into 3.3.x
44e0d85
Bump version to 3.2.3-SNAPSHOT
fb41464
Bump Confluent to 3.2.3-SNAPSHOT
d890d61
Merge remote-tracking branch 'origin/3.2.x' into 3.3.x
4ec9fe2
Set Confluent to 3.3.0-hotfix.1, Kafka to 0.11.0.0-hotfix.1.
ConfluentJenkins a4c6874
Merging upstream v3.3.0-hotfix.1 tag to support Kafka 0.11.0.0.
91322bf
Merging upstream v3.3.0-hotfix.1 tag to support Kafka 0.11.0.0.
lewisdawson e1041b5
Added ability to connector to convert CSV data to Parquet before uplo…
03eb245
Added ability to connector to convert CSV data to Parquet before uplo…
lewisdawson 227de21
Merge branch 'master' of github.com:lewisdawson/streamx
Jenkinsfile
@@ -0,0 +1,4 @@
+#!/usr/bin/env groovy
+common {
+  slackChannel = '#connect-eng'
+}
@@ -5,8 +5,8 @@ The HDFS connector allows you to export data from Kafka topics to HDFS files in
and integrates with Hive to make data immediately available for querying with HiveQL.

The connector periodically polls data from Kafka and writes them to HDFS. The data from each Kafka
topic is partitioned by the provided partitioner and divided into chucks. Each chunk of data is
represented as an HDFS file with topic, kafka partition, start and end offsets of this data chuck
topic is partitioned by the provided partitioner and divided into chunks. Each chunk of data is
represented as an HDFS file with topic, kafka partition, start and end offsets of this data chunk
in the filename. If no partitioner is specified in the configuration, the default partitioner which
preserves the Kafka partitioning is used. The size of each data chunk is determined by the number of
records written to HDFS, the time written to HDFS and schema compatibility.

@@ -20,9 +20,8 @@ Quickstart
In this Quickstart, we use the HDFS connector to export data produced by the Avro console producer
to HDFS.

Start Zookeeper, Kafka and SchemaRegistry if you haven't done so. The instructions on how to start
these services are available at the Confluent Platform QuickStart. You also need to have Hadoop
running locally or remotely and make sure that you know the HDFS url. For Hive integration, you
Before you start the Confluent services, make sure Hadoop is
running locally or remotely and that you know the HDFS url. For Hive integration, you
need to have Hive installed and to know the metastore thrift uri.

This Quickstart assumes that you started the required services with the default configurations and

@@ -39,12 +38,41 @@ Also, this Quickstart assumes that security is not configured for HDFS and Hive
please make the necessary configurations change following `Secure HDFS and Hive Metastore`_
section.

First, start the Avro console producer::
First, start all the necessary services using Confluent CLI.

.. tip::

   If not already in your PATH, add Confluent's ``bin`` directory by running: ``export PATH=<path-to-confluent>/bin:$PATH``

.. sourcecode:: bash

   $ confluent start

Every service will start in order, printing a message with its status:

.. sourcecode:: bash

   Starting zookeeper
   zookeeper is [UP]
   Starting kafka
   kafka is [UP]
   Starting schema-registry
   schema-registry is [UP]
   Starting kafka-rest
   kafka-rest is [UP]
   Starting connect
   connect is [UP]

Next, start the Avro console producer to import a few records to Kafka:

.. sourcecode:: bash

   $ ./bin/kafka-avro-console-producer --broker-list localhost:9092 --topic test_hdfs \
   --property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"f1","type":"string"}]}'

Then in the console producer, type in::
Then in the console producer, type in:

.. sourcecode:: bash

   {"f1": "value1"}
   {"f1": "value2"}
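An aside, not part of the upstream diff: before moving on to HDFS you can confirm that the records actually landed in the ``test_hdfs`` topic by reading them back with the Avro console consumer. The exact flags below are an assumption based on standard Kafka 0.11-era tooling:

.. sourcecode:: bash

   $ ./bin/kafka-avro-console-consumer --bootstrap-server localhost:9092 \
   --topic test_hdfs --from-beginning \
   --property schema.registry.url=http://localhost:8081

You should see the JSON records you typed echoed back; press Ctrl-C to stop the consumer.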
@@ -54,42 +82,102 @@ The three records entered are published to the Kafka topic ``test_hdfs`` in Avro

Before starting the connector, please make sure that the configurations in
``etc/kafka-connect-hdfs/quickstart-hdfs.properties`` are properly set to your configurations of
Hadoop, e.g. ``hdfs.url`` points to the proper HDFS and using FQDN in the host. Then run the
following command to start Kafka connect with the HDFS connector::
Hadoop, e.g. ``hdfs.url`` points to the proper HDFS and using FQDN in the host. Then start connector by loading its
configuration with the following command:

.. sourcecode:: bash

   $ confluent load hdfs-sink -d etc/kafka-connect-hdfs/quickstart-hdfs.properties
   {
     "name": "hdfs-sink",
     "config": {
       "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
       "tasks.max": "1",
       "topics": "test_hdfs",
       "hdfs.url": "hdfs://localhost:9000",
       "flush.size": "3",
       "name": "hdfs-sink"
     },
     "tasks": []
   }
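An aside, not part of the upstream diff: once ``confluent load`` returns, the same CLI can report whether the connector and its task are actually running. The subcommands below are an assumption based on the Confluent CLI that ships with CP 3.3:

.. sourcecode:: bash

   # List the connectors loaded into the local Connect worker
   $ confluent status connectors

   # Show whether the hdfs-sink connector and its task are RUNNING or FAILED
   $ confluent status hdfs-sink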
   $ ./bin/connect-standalone etc/schema-registry/connect-avro-standalone.properties \
   etc/kafka-connect-hdfs/quickstart-hdfs.properties
To check that the connector started successfully view the Connect worker's log by running:

You should see that the process starts up and logs some messages, and then exports data from Kafka
to HDFS. Once the connector finishes ingesting data to HDFS, check that the data is available
in HDFS::
.. sourcecode:: bash

   $ confluent log connect

Towards the end of the log you should see that the connector starts, logs a few messages, and then exports
data from Kafka to HDFS.
Once the connector finishes ingesting data to HDFS, check that the data is available in HDFS:

.. sourcecode:: bash

   $ hadoop fs -ls /topics/test_hdfs/partition=0

You should see a file with name ``/topics/test_hdfs/partition=0/test_hdfs+0+0000000000+0000000002.avro``
The file name is encoded as ``topic+kafkaPartition+startOffset+endOffset.format``.

You can use ``avro-tools-1.7.7.jar``
(available in `Apache mirrors <http://mirror.metrocast.net/apache/avro/avro-1.7.7/java/avro-tools-1.7.7.jar>`_)
to extract the content of the file. Run ``avro-tools`` directly on Hadoop as::
You can use ``avro-tools-1.8.2.jar``
(available in `Apache mirrors <http://mirror.metrocast.net/apache/avro/avro-1.8.2/java/avro-tools-1.8.2.jar>`_)
to extract the content of the file. Run ``avro-tools`` directly on Hadoop as:

.. sourcecode:: bash

   $ hadoop jar avro-tools-1.8.2.jar tojson \
   hdfs://<namenode>/topics/test_hdfs/partition=0/test_hdfs+0+0000000000+0000000002.avro

where "<namenode>" is the HDFS name node hostname.

   $ hadoop jar avro-tools-1.7.7.jar tojson \
   /topics/test_hdfs/partition=0/test_hdfs+0+0000000000+0000000002.avro
or, if you experience issues, first copy the avro file from HDFS to the local filesystem and try again with java:

or, if you experience issues, first copy the avro file from HDFS to the local filesystem and try again with java::
.. sourcecode:: bash

   $ hadoop fs -copyToLocal /topics/test_hdfs/partition=0/test_hdfs+0+0000000000+0000000002.avro \
   /tmp/test_hdfs+0+0000000000+0000000002.avro

   $ java -jar avro-tools-1.7.7.jar tojson /tmp/test_hdfs+0+0000000000+0000000002.avro
   $ java -jar avro-tools-1.8.2.jar tojson /tmp/test_hdfs+0+0000000000+0000000002.avro

You should see the following output::
You should see the following output:

.. sourcecode:: bash

   {"f1":"value1"}
   {"f1":"value2"}
   {"f1":"value3"}

Finally, stop the Connect worker as well as all the rest of the Confluent services by running:

.. sourcecode:: bash

   $ confluent stop
   Stopping connect
   connect is [DOWN]
   Stopping kafka-rest
   kafka-rest is [DOWN]
   Stopping schema-registry
   schema-registry is [DOWN]
   Stopping kafka
   kafka is [DOWN]
   Stopping zookeeper
   zookeeper is [DOWN]

or stop all the services and additionally wipe out any data generated during this quickstart by running:

.. sourcecode:: bash

   $ confluent destroy
   Stopping connect
   connect is [DOWN]
   Stopping kafka-rest
   kafka-rest is [DOWN]
   Stopping schema-registry
   schema-registry is [DOWN]
   Stopping kafka
   kafka is [DOWN]
   Stopping zookeeper
   zookeeper is [DOWN]
   Deleting: /tmp/confluent.w1CpYsaI

.. note:: If you want to run the Quickstart with Hive integration, before starting the connector,
   you need to add the following configurations to
@@ -144,7 +232,9 @@ description of the available configuration options.

Example
~~~~~~~
Here is the content of ``etc/kafka-connect-hdfs/quickstart-hdfs.properties``::
Here is the content of ``etc/kafka-connect-hdfs/quickstart-hdfs.properties``:

.. sourcecode:: bash

   name=hdfs-sink
   connector.class=io.confluent.connect.hdfs.HdfsSinkConnector

@@ -166,18 +256,22 @@ Format and Partitioner
~~~~~~~~~~~~~~~~~~~~~~
You need to specify the ``format.class`` and ``partitioner.class`` if you want to write other
formats to HDFS or use other partitioners. The following example configurations demonstrates how to
write Parquet format and use hourly partitioner::
write Parquet format and use hourly partitioner:

.. sourcecode:: bash

   format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
   partitioner.class=io.confluent.connect.hdfs.partitioner.HourlyPartitioner

.. note:: If you want ot use the field partitioner, you need to specify the ``partition.field.name``
.. note:: If you want to use the field partitioner, you need to specify the ``partition.field.name``
   configuration as well to specify the field name of the record.
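As an illustration of that note (an aside, not part of the upstream diff), a field-partitioned configuration would look roughly as follows; the field name ``department`` is a hypothetical example and must match a field in your record schema:

.. sourcecode:: bash

   partitioner.class=io.confluent.connect.hdfs.partitioner.FieldPartitioner
   partition.field.name=department

Each record is then written under a directory derived from that field's value, for example ``/topics/test_hdfs/department=engineering/``.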
Hive Integration
~~~~~~~~~~~~~~~~
At minimum, you need to specify ``hive.integration``, ``hive.metastore.uris`` and
``schema.compatibility`` when integrating Hive. Here is an example configuration::
``schema.compatibility`` when integrating Hive. Here is an example configuration:

.. sourcecode:: bash

   hive.integration=true
   hive.metastore.uris=thrift://localhost:9083 # FQDN for the host part
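The hunk above is cut off before the third required property, ``schema.compatibility``. Purely as an illustration (an assumption, since the truncated lines are not shown here), a common choice when Hive integration is enabled is:

.. sourcecode:: bash

   schema.compatibility=BACKWARD

``NONE`` is not allowed together with ``hive.integration=true``; ``BACKWARD``, ``FORWARD`` and ``FULL`` are the accepted settings.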
@@ -203,7 +297,9 @@ latest Hive table schema. Please find more information on schema compatibility i

Secure HDFS and Hive Metastore
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To work with secure HDFS and Hive metastore, you need to specify ``hdfs.authentication.kerberos``,
``connect.hdfs.principal``, ``connect.keytab``, ``hdfs.namenode.principal``::
``connect.hdfs.principal``, ``connect.keytab``, ``hdfs.namenode.principal``:

.. sourcecode:: bash

   hdfs.authentication.kerberos=true
   connect.hdfs.principal=connect-hdfs/[email protected]
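The hunk is truncated before the remaining two Kerberos properties named above, ``connect.keytab`` and ``hdfs.namenode.principal``. Purely as an illustration (the path and principal below are hypothetical placeholders, not values from the upstream file):

.. sourcecode:: bash

   # Keytab readable by the Connect worker (hypothetical path)
   connect.keytab=/etc/security/keytabs/connect-hdfs.keytab
   # Kerberos principal of the HDFS namenode (hypothetical realm)
   hdfs.namenode.principal=hdfs/_HOST@EXAMPLE.COM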
Review comment: Is the Jenkinsfile needed?