This repository has been archived by the owner on Apr 4, 2019. It is now read-only.

Merging kafka-connect-hdfs 3.3.0 to support Kafka 0.11 #49

Open · wants to merge 105 commits into base: master

105 commits
24fcba4
"Version bump to 3.2.0-SNAPSHOT"
Oct 3, 2016
c8a2a83
Version bump to 0.10.2.0-SNAPSHOT
Oct 5, 2016
76e40c8
fix version to 3.2 in docs
Oct 5, 2016
b254849
Merge branch '3.1.x'
ewencp Oct 11, 2016
7b5dfa1
fix a typo
aseigneurin Oct 13, 2016
b8de69c
fix a typo
aseigneurin Oct 13, 2016
9e0c9c1
Merge pull request #135 from aseigneurin/fix-typos
ewencp Oct 13, 2016
c058019
fix typo
hjespers Oct 14, 2016
43379dc
Merge remote-tracking branch 'origin/3.1.x'
ewencp Oct 18, 2016
0f6173c
CC-273: connect deps should be scoped as provided
shikhar Nov 4, 2016
e42429d
Switch to toEnrichedRst() for config opts docgen
shikhar Nov 4, 2016
f68beb4
CC-391: Fix configuration with DefaultPartitioner
kkonstantine Nov 15, 2016
be9a84e
Merge branch 'CC-391-Configuration-with-DefaultPartitioner-compares-a…
kkonstantine Nov 17, 2016
f38410a
HOTFIX: Override new method in MockSinkTaskContext.
kkonstantine Dec 2, 2016
e26514e
Merge branch 'HOTFIX-Override-newly-added-method-in-SinkTaskContext-i…
kkonstantine Dec 6, 2016
e3f6641
Fix topics with periods for hive #136 #137
ig-michaelpearce Jan 14, 2017
fb2b0ab
accidental un-needed format change, revert
ig-michaelpearce Jan 14, 2017
0f754b8
remove accidental new line added
ig-michaelpearce Jan 14, 2017
413a169
Merge pull request #164 from IG-Group/ISSUE-136
ewencp Jan 17, 2017
b0c4839
Updated Readme for Kafka connect FAQ (#75)
rekhajoshm Jan 19, 2017
9841a90
updated avro-tools instructions in doc (#65)
coughman Jan 19, 2017
3090fa7
Merge with docs branch
Jan 24, 2017
9838b61
Merge branch '3.1.x'
Jan 24, 2017
5b21d21
Bump version to 3.3.0-SNAPSHOT
Jan 25, 2017
cf40f3b
Bump Confluent to 3.3.0-SNAPSHOT, Kafka to 0.10.3.0-SNAPSHOT
Jan 25, 2017
9196b6c
Bump version to 3.2.0-SNAPSHOT
Jan 25, 2017
34c9088
MINOR: Upgrade Hadoop version to 2.7.3 and joda-time to 2.9.7
kkonstantine Feb 9, 2017
55716d4
Merge branch '3.2.x'
kkonstantine Feb 17, 2017
4e357b5
Handle primitive types in AvroRecordWriterProvider. (#176)
ewencp Feb 27, 2017
24cb9c0
Merge remote-tracking branch 'origin/3.0.x' into 3.1.x
ewencp Feb 27, 2017
545f2f3
Merge remote-tracking branch 'origin/3.1.x' into 3.2.x
ewencp Feb 27, 2017
2c8522e
Merge branch '3.2.x'
ewencp Feb 27, 2017
535be47
Set Confluent to 3.2.0, Kafka to 0.10.2.0-cp1.
Feb 27, 2017
da94c38
Bump version to 3.2.1-SNAPSHOT
Mar 2, 2017
58f6c85
Bump Confluent to 3.2.1-SNAPSHOT
Mar 2, 2017
8c0596b
Bump Kafka to 0.10.2.1-SNAPSHOT
Mar 2, 2017
9afe713
Bump Kafka to 0.11.0.0-SNAPSHOT
Mar 4, 2017
1abe85c
Merge remote-tracking branch 'origin/3.2.x'
norwood Mar 9, 2017
ef68aa6
CC-437: Update 3.2.0 changelog
ewencp Mar 13, 2017
241eb77
Add missing PR-170 to release notes.
ewencp Mar 15, 2017
2a55cd9
Merge remote-tracking branch 'origin/3.2.0-post' into 3.2.x
ewencp Mar 15, 2017
60a5323
Merge remote-tracking branch 'origin/3.2.x'
ewencp Mar 15, 2017
25b10bb
Merge pull request #179 from confluentinc/cc-437-3.2.0-changelog
ewencp Mar 15, 2017
713de9a
Merge remote-tracking branch 'origin/3.2.0-post' into 3.2.x
ewencp Mar 15, 2017
4e12b94
Merge remote-tracking branch 'origin/3.2.x'
ewencp Mar 15, 2017
449ef58
Add 3.2.1 changelog.
ewencp Apr 11, 2017
54e0943
Merge pull request #184 from confluentinc/3.2.1-changelog
ewencp Apr 12, 2017
0b53750
Merge remote-tracking branch 'origin/3.2.x'
ewencp Apr 12, 2017
8b439a0
Set Confluent to 3.2.1, Kafka to 0.10.2.1-cp1.
May 1, 2017
64598ed
Bump version to 3.2.2-SNAPSHOT
May 4, 2017
cee729b
Bump Confluent to 3.2.2-SNAPSHOT, Kafka to 0.10.2.2-SNAPSHOT
May 4, 2017
7116cef
Merge remote-tracking branch 'origin/3.2.1-post' into 3.2.x
norwood May 4, 2017
7efda02
Merge remote-tracking branch 'origin/3.2.x'
norwood May 4, 2017
e786248
Fix HdfsSinkConnector to extend from SinkConnector instead of Connector.
ewencp May 19, 2017
33023cb
Merge pull request #194 from confluentinc/extend-sink-connector
ewencp May 19, 2017
d52155f
Merge remote-tracking branch 'origin/3.0.x' into 3.1.x
ewencp May 19, 2017
01af529
Merge remote-tracking branch 'origin/3.1.x' into 3.2.x
ewencp May 19, 2017
d89f97a
Merge remote-tracking branch 'origin/3.2.x'
ewencp May 19, 2017
cc37611
CC-491: Consolidate and simplify unit tests of HDFS connector
kkonstantine Apr 20, 2017
02d1bb7
Move verification methods and members to TestWithMiniDFSCluster
kkonstantine Apr 20, 2017
e939f75
Consolidate records creation methods in TopicPartitionWriterTest too.
kkonstantine Apr 21, 2017
b87115f
Fix hive tests.
kkonstantine Apr 22, 2017
66ddb58
Add tests with interleaved (non consecutive) records between partitions.
kkonstantine Apr 22, 2017
815a693
Code style fixes.
kkonstantine Apr 22, 2017
e57fed9
Increase maximum perm size to accommodate failing tests.
kkonstantine Nov 28, 2016
d449c4f
Upgrade surefire parameters for test forking.
kkonstantine Dec 14, 2016
b8a335b
Clean commented out code and fix typo.
kkonstantine Apr 26, 2017
be14311
Remove debugging statement from parquet test.
kkonstantine May 24, 2017
fd28e5e
Merge branch 'CC-491-Consolidate-unit-tests-of-HDFS-connector'
kkonstantine May 24, 2017
6970a12
Fix incorrect licensing and webpage info.
ewencp May 26, 2017
5d252e9
Merge pull request #200 from confluentinc/remove-incorrect-license
ewencp May 26, 2017
08d8f1a
Merge remote-tracking branch 'origin/3.0.x' into 3.1.x
ewencp May 26, 2017
6ccc55b
Merge remote-tracking branch 'origin/3.1.x' into 3.2.x
ewencp May 26, 2017
5ac7268
Merge remote-tracking branch 'origin/3.2.x' into 3.3.x
ewencp May 26, 2017
3592793
Upgrade avro to 1.8.0
mageshn Jun 6, 2017
c1d1f97
Add 3.2.2 changelog
ewencp Jun 6, 2017
78b79a7
Merge pull request #204 from confluentinc/3.2.2-changelog
ewencp Jun 7, 2017
95b100b
Merge remote-tracking branch 'origin/3.2.x' into 3.3.x
ewencp Jun 7, 2017
a1fd0c4
Upgrade avro to 1.8.2
mageshn Jun 7, 2017
c464ef2
Upgrade avro to 1.8.2
mageshn Jun 8, 2017
437485a
jenkinsfile time
norwood Jun 12, 2017
2afec6a
Merge pull request #205 from confluentinc/avro-upgrade
ewencp Jun 13, 2017
5337079
Set Confluent to 3.2.2, Kafka to 0.10.2.1-cp2.
Jun 15, 2017
28eb13b
Add 3.3.0 changelog
kkonstantine Jun 21, 2017
428f211
Merge pull request #209 from kkonstantine/3.3.0-changelog
kkonstantine Jun 21, 2017
367da71
Update quickstart to use Confluent CLI
kkonstantine Jul 27, 2017
2e8694d
Add stop instructions at the end of the quickstart.
kkonstantine Jul 27, 2017
e2d9b26
Improve wording around 'confluent log connect'
kkonstantine Jul 27, 2017
bb6996d
Set Confluent to 3.3.0, Kafka to 0.11.0.0-cp1.
Jul 27, 2017
ddddfd3
Merge pull request #217 from kkonstantine/Update-quickstart-to-use-CLI
kkonstantine Jul 27, 2017
6af158f
Bump version to 3.3.1-SNAPSHOT
Jul 28, 2017
4a7803f
Bump Confluent to 3.3.1-SNAPSHOT
Jul 28, 2017
d04f924
Merge remote-tracking branch 'origin/3.3.0-post' into 3.3.x
Jul 28, 2017
326bdcd
Bump Kafka to 0.11.0.1-SNAPSHOT
Jul 28, 2017
4f73c33
Merge remote-tracking branch 'origin/3.2.2-post' into 3.2.x
Jul 31, 2017
79cbee7
Merge branch '3.2.x' into 3.3.x
Jul 31, 2017
44e0d85
Bump version to 3.2.3-SNAPSHOT
Jul 31, 2017
fb41464
Bump Confluent to 3.2.3-SNAPSHOT
Jul 31, 2017
d890d61
Merge remote-tracking branch 'origin/3.2.x' into 3.3.x
Jul 31, 2017
4ec9fe2
Set Confluent to 3.3.0-hotfix.1, Kafka to 0.11.0.0-hotfix.1.
ConfluentJenkins Aug 17, 2017
a4c6874
Merging upstream v3.3.0-hotfix.1 tag to support Kafka 0.11.0.0.
Sep 16, 2017
91322bf
Merging upstream v3.3.0-hotfix.1 tag to support Kafka 0.11.0.0.
lewisdawson Sep 16, 2017
e1041b5
Added ability to connector to convert CSV data to Parquet before uplo…
Sep 20, 2017
03eb245
Added ability to connector to convert CSV data to Parquet before uplo…
lewisdawson Sep 20, 2017
227de21
Merge branch 'master' of github.com:lewisdawson/streamx
Sep 20, 2017
4 changes: 4 additions & 0 deletions Jenkinsfile
@@ -0,0 +1,4 @@
#!/usr/bin/env groovy
common {
slackChannel = '#connect-eng'
}

Is the Jenkinsfile needed?

4 changes: 2 additions & 2 deletions NOTICE
@@ -285,8 +285,8 @@ The following libraries are included in packaged versions of this project:

* Pentaho aggdesigner
* COPYRIGHT: Copyright 2006 - 2013 Pentaho Corporation
* LICENSE: licenses/LICENSE.gpl2.txt
* HOMEPAGE: https://github.com/pentaho/pentaho-aggdesigner
* LICENSE: licenses/LICENSE.apache2.txt
* HOMEPAGE: https://github.com/julianhyde/aggdesigner/tree/master/pentaho-aggdesigner-algorithm

* SLF4J
* COPYRIGHT: Copyright (c) 2004-2013 QOS.ch
23 changes: 23 additions & 0 deletions docs/changelog.rst
@@ -3,6 +3,29 @@
Changelog
=========

Version 3.3.0
-------------

* `PR-187 <https://github.com/confluentinc/kafka-connect-hdfs/pull/187>`_ - CC-491: Consolidate and simplify unit tests of HDFS connector.
* `PR-205 <https://github.com/confluentinc/kafka-connect-hdfs/pull/205>`_ - Upgrade avro to 1.8.2.

Version 3.2.2
-------------

* `PR-194 <https://github.com/confluentinc/kafka-connect-hdfs/pull/194>`_ - Fix HdfsSinkConnector to extend from SinkConnector instead of Connector.
* `PR-200 <https://github.com/confluentinc/kafka-connect-hdfs/pull/200>`_ - Fix incorrect licensing and webpage info.

Version 3.2.1
-------------
No changes

Version 3.2.0
-------------

* `PR-135 <https://github.com/confluentinc/kafka-connect-hdfs/pull/135>`_ - Fix typos
* `PR-164 <https://github.com/confluentinc/kafka-connect-hdfs/pull/164>`_ - Issue 136 - Support topic with dots in hive.
* `PR-170 <https://github.com/confluentinc/kafka-connect-hdfs/pull/170>`_ - MINOR: Upgrade Hadoop version to 2.7.3 and joda-time to 2.9.7

Version 3.1.1
-------------
No changes
4 changes: 2 additions & 2 deletions docs/conf.py
@@ -57,9 +57,9 @@ def setup(app):
# built documents.
#
# The short X.Y version.
version = '3.0'
version = '3.3'
# The full version, including alpha/beta/rc tags.
release = '3.1.3-SNAPSHOT'
release = '3.3.0-hotfix.1'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
150 changes: 123 additions & 27 deletions docs/hdfs_connector.rst
@@ -5,8 +5,8 @@ The HDFS connector allows you to export data from Kafka topics to HDFS files in
and integrates with Hive to make data immediately available for querying with HiveQL.

The connector periodically polls data from Kafka and writes them to HDFS. The data from each Kafka
topic is partitioned by the provided partitioner and divided into chucks. Each chunk of data is
represented as an HDFS file with topic, kafka partition, start and end offsets of this data chuck
topic is partitioned by the provided partitioner and divided into chunks. Each chunk of data is
represented as an HDFS file with topic, kafka partition, start and end offsets of this data chunk
in the filename. If no partitioner is specified in the configuration, the default partitioner which
preserves the Kafka partitioning is used. The size of each data chunk is determined by the number of
records written to HDFS, the time written to HDFS and schema compatibility.
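
As an illustration of how this chunking is controlled (a sketch, not taken from this diff): ``flush.size`` appears later in this quickstart, while ``rotate.interval.ms`` and the ``DefaultPartitioner`` class name are assumed property and class names for the connector.

# Hypothetical excerpt from a connector .properties file -- values are illustrative.
# Close (rotate) the current file once this many records have been written to it:
flush.size=3
# Assumed option: also rotate after this many milliseconds, regardless of record count:
rotate.interval.ms=60000
# Assumed class name: if partitioner.class is omitted, the default partitioner
# preserves the original Kafka partitioning:
partitioner.class=io.confluent.connect.hdfs.partitioner.DefaultPartitioner
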
@@ -20,9 +20,8 @@ Quickstart
In this Quickstart, we use the HDFS connector to export data produced by the Avro console producer
to HDFS.

Start Zookeeper, Kafka and SchemaRegistry if you haven't done so. The instructions on how to start
these services are available at the Confluent Platform QuickStart. You also need to have Hadoop
running locally or remotely and make sure that you know the HDFS url. For Hive integration, you
Before you start the Confluent services, make sure Hadoop is
running locally or remotely and that you know the HDFS url. For Hive integration, you
need to have Hive installed and to know the metastore thrift uri.

This Quickstart assumes that you started the required services with the default configurations and
@@ -39,12 +38,41 @@ Also, this Quickstart assumes that security is not configured for HDFS and Hive
please make the necessary configurations change following `Secure HDFS and Hive Metastore`_
section.

First, start the Avro console producer::
First, start all the necessary services using Confluent CLI.

.. tip::

If not already in your PATH, add Confluent's ``bin`` directory by running: ``export PATH=<path-to-confluent>/bin:$PATH``

.. sourcecode:: bash

$ confluent start

Every service will start in order, printing a message with its status:

.. sourcecode:: bash

Starting zookeeper
zookeeper is [UP]
Starting kafka
kafka is [UP]
Starting schema-registry
schema-registry is [UP]
Starting kafka-rest
kafka-rest is [UP]
Starting connect
connect is [UP]

Next, start the Avro console producer to import a few records to Kafka:

.. sourcecode:: bash

$ ./bin/kafka-avro-console-producer --broker-list localhost:9092 --topic test_hdfs \
--property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"f1","type":"string"}]}'

Then in the console producer, type in::
Then in the console producer, type in:

.. sourcecode:: bash

{"f1": "value1"}
{"f1": "value2"}
@@ -54,42 +82,102 @@ The three records entered are published to the Kafka topic ``test_hdfs`` in Avro

Before starting the connector, please make sure that the configurations in
``etc/kafka-connect-hdfs/quickstart-hdfs.properties`` are properly set to your configurations of
Hadoop, e.g. ``hdfs.url`` points to the proper HDFS and using FQDN in the host. Then run the
following command to start Kafka connect with the HDFS connector::
Hadoop, e.g. ``hdfs.url`` points to the proper HDFS and using FQDN in the host. Then start connector by loading its
configuration with the following command:

.. sourcecode:: bash

$ confluent load hdfs-sink -d etc/kafka-connect-hdfs/quickstart-hdfs.properties
{
"name": "hdfs-sink",
"config": {
"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
"tasks.max": "1",
"topics": "test_hdfs",
"hdfs.url": "hdfs://localhost:9000",
"flush.size": "3",
"name": "hdfs-sink"
},
"tasks": []
}

$ ./bin/connect-standalone etc/schema-registry/connect-avro-standalone.properties \
etc/kafka-connect-hdfs/quickstart-hdfs.properties
To check that the connector started successfully view the Connect worker's log by running:

You should see that the process starts up and logs some messages, and then exports data from Kafka
to HDFS. Once the connector finishes ingesting data to HDFS, check that the data is available
in HDFS::
.. sourcecode:: bash

$ confluent log connect

Towards the end of the log you should see that the connector starts, logs a few messages, and then exports
data from Kafka to HDFS.
Once the connector finishes ingesting data to HDFS, check that the data is available in HDFS:

.. sourcecode:: bash

$ hadoop fs -ls /topics/test_hdfs/partition=0

You should see a file with name ``/topics/test_hdfs/partition=0/test_hdfs+0+0000000000+0000000002.avro``
The file name is encoded as ``topic+kafkaPartition+startOffset+endOffset.format``.
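
To make the encoding concrete, the example file above breaks down as follows (a worked illustration, not part of the diff):

# test_hdfs+0+0000000000+0000000002.avro
#   topic          = test_hdfs
#   kafkaPartition = 0
#   startOffset    = 0000000000
#   endOffset      = 0000000002   (three records, matching flush.size=3 in this quickstart)
#   format         = avro
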

You can use ``avro-tools-1.7.7.jar``
(available in `Apache mirrors <http://mirror.metrocast.net/apache/avro/avro-1.7.7/java/avro-tools-1.7.7.jar>`_)
to extract the content of the file. Run ``avro-tools`` directly on Hadoop as::
You can use ``avro-tools-1.8.2.jar``
(available in `Apache mirrors <http://mirror.metrocast.net/apache/avro/avro-1.8.2/java/avro-tools-1.8.2.jar>`_)
to extract the content of the file. Run ``avro-tools`` directly on Hadoop as:

.. sourcecode:: bash

$ hadoop jar avro-tools-1.8.2.jar tojson \
hdfs://<namenode>/topics/test_hdfs/partition=0/test_hdfs+0+0000000000+0000000002.avro

where "<namenode>" is the HDFS name node hostname.

$ hadoop jar avro-tools-1.7.7.jar tojson \
/topics/test_hdfs/partition=0/test_hdfs+0+0000000000+0000000002.avro
or, if you experience issues, first copy the avro file from HDFS to the local filesystem and try again with java:

or, if you experience issues, first copy the avro file from HDFS to the local filesystem and try again with java::
.. sourcecode:: bash

$ hadoop fs -copyToLocal /topics/test_hdfs/partition=0/test_hdfs+0+0000000000+0000000002.avro \
/tmp/test_hdfs+0+0000000000+0000000002.avro

$ java -jar avro-tools-1.7.7.jar tojson /tmp/test_hdfs+0+0000000000+0000000002.avro
$ java -jar avro-tools-1.8.2.jar tojson /tmp/test_hdfs+0+0000000000+0000000002.avro

You should see the following output::
You should see the following output:

.. sourcecode:: bash

{"f1":"value1"}
{"f1":"value2"}
{"f1":"value3"}

Finally, stop the Connect worker as well as all the rest of the Confluent services by running:

.. sourcecode:: bash

$ confluent stop
Stopping connect
connect is [DOWN]
Stopping kafka-rest
kafka-rest is [DOWN]
Stopping schema-registry
schema-registry is [DOWN]
Stopping kafka
kafka is [DOWN]
Stopping zookeeper
zookeeper is [DOWN]

or stop all the services and additionally wipe out any data generated during this quickstart by running:

.. sourcecode:: bash

$ confluent destroy
Stopping connect
connect is [DOWN]
Stopping kafka-rest
kafka-rest is [DOWN]
Stopping schema-registry
schema-registry is [DOWN]
Stopping kafka
kafka is [DOWN]
Stopping zookeeper
zookeeper is [DOWN]
Deleting: /tmp/confluent.w1CpYsaI

.. note:: If you want to run the Quickstart with Hive integration, before starting the connector,
you need to add the following configurations to
@@ -144,7 +232,9 @@ description of the available configuration options.

Example
~~~~~~~
Here is the content of ``etc/kafka-connect-hdfs/quickstart-hdfs.properties``::
Here is the content of ``etc/kafka-connect-hdfs/quickstart-hdfs.properties``:

.. sourcecode:: bash

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
@@ -166,18 +256,22 @@ Format and Partitioner
~~~~~~~~~~~~~~~~~~~~~~
You need to specify the ``format.class`` and ``partitioner.class`` if you want to write other
formats to HDFS or use other partitioners. The following example configurations demonstrates how to
write Parquet format and use hourly partitioner::
write Parquet format and use hourly partitioner:

.. sourcecode:: bash

format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
partitioner.class=io.confluent.connect.hdfs.partitioner.HourlyPartitioner

.. note:: If you want ot use the field partitioner, you need to specify the ``partition.field.name``
.. note:: If you want to use the field partitioner, you need to specify the ``partition.field.name``
configuration as well to specify the field name of the record.
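
A field-partitioned setup might look like the sketch below (illustrative, not taken from this diff); the ``FieldPartitioner`` class name and the field ``f1``, borrowed from the quickstart records, are assumptions:

# Hypothetical field-partitioner configuration -- illustrative only.
partitioner.class=io.confluent.connect.hdfs.partitioner.FieldPartitioner
# Record field whose value determines the output partition:
partition.field.name=f1
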

Hive Integration
~~~~~~~~~~~~~~~~
At minimum, you need to specify ``hive.integration``, ``hive.metastore.uris`` and
``schema.compatibility`` when integrating Hive. Here is an example configuration::
``schema.compatibility`` when integrating Hive. Here is an example configuration:

.. sourcecode:: bash

hive.integration=true
hive.metastore.uris=thrift://localhost:9083 # FQDN for the host part
@@ -203,7 +297,9 @@ latest Hive table schema. Please find more information on schema compatibility i
Secure HDFS and Hive Metastore
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To work with secure HDFS and Hive metastore, you need to specify ``hdfs.authentication.kerberos``,
``connect.hdfs.principal``, ``connect.keytab``, ``hdfs.namenode.principal``::
``connect.hdfs.principal``, ``connect.keytab``, ``hdfs.namenode.principal``:

.. sourcecode:: bash

hdfs.authentication.kerberos=true
connect.hdfs.principal=connect-hdfs/[email protected]
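
The example above is truncated at this point. As a sketch only, a configuration using the four properties named in this section might look like the following; the principals, keytab path, and realm are placeholders rather than values from the original file:

# Hypothetical Kerberos settings -- placeholder values.
hdfs.authentication.kerberos=true
connect.hdfs.principal=connect-hdfs/_HOST@EXAMPLE.COM
connect.keytab=/etc/security/keytabs/connect-hdfs.keytab
hdfs.namenode.principal=nn/_HOST@EXAMPLE.COM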