chanskw committed Feb 17, 2015
2 parents 8c5b7e9 + f929d95 commit 5b26d33
Showing 3 changed files with 29 additions and 28 deletions.
@@ -10,18 +10,19 @@ The `HDFS2DirectoryScan` is similar to the `DirectoryScan` operator.
The `HDFS2DirectoryScan` operator repeatedly scans an HDFS directory and writes the names of new or modified files
that are found in the directory to the output port. The operator sleeps between scans.

# Consistent Region Behavior
# Behavior in a consistent region

* The operator can participate in a consistent.
* The operator can be at the start of a consistent region if there is no input port.
* The operator supports periodic and operator-driven consistent region policies.
* If consistent region policy is set as operator driven, the operator initiates a drain after
each tuple is submitted. This allows for a consistent state to be established after a file is fully processed.
* If consistent region policy is set as periodic, the operator respects the period setting
and establishes consistent states accordingly.
This means that multiple files can be processed before a consistent state is established.
* At checkpoint, the operator saves the last submitted filename and its modification timestamp to the checkpoint.
* Upon application failures, the operator resubmits all files that are newer than the last submitted file at checkpoint.
The `HDFS2DirectoryScan` operator can participate in a consistent region.
The operator can be at the start of a consistent region if there is no input port.
The operator supports periodic and operator-driven consistent region policies.

If consistent region policy is set as operator driven, the operator initiates a drain after each tuple is submitted.
This allows for a consistent state to be established after a file is fully processed.
If consistent region policy is set as periodic, the operator respects the period setting and establishes consistent states accordingly.
This means that multiple files can be processed before a consistent state is established.

At checkpoint, the operator saves the last submitted filename and its modification timestamp to the checkpoint.
Upon application failures, the operator resubmits all files that are newer than the last submitted file at checkpoint.
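
The checkpoint and reset behavior described above can be illustrated with a small model. This is only a sketch, not the operator's actual code: the class and method names (`DirectoryScanModel`, `scan`, `checkpoint`, `reset`) are hypothetical, and it assumes that "newer" means a strictly greater modification timestamp.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ScanCheckpoint:
    """State the text says is saved: last submitted file name and its modification time."""
    last_file: Optional[str] = None
    last_mtime: float = float("-inf")


class DirectoryScanModel:
    """Toy model of the scan/checkpoint/reset cycle (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.state = ScanCheckpoint()

    def scan(self, listing: Dict[str, float]) -> List[str]:
        """Submit files strictly newer than the last submitted one, oldest first."""
        newer = sorted(
            (mtime, name) for name, mtime in listing.items()
            if mtime > self.state.last_mtime  # assumption: strict comparison
        )
        submitted = []
        for mtime, name in newer:
            submitted.append(name)
            self.state = ScanCheckpoint(name, mtime)
        return submitted

    def checkpoint(self) -> ScanCheckpoint:
        """Return a copy of the current state, as saved at checkpoint."""
        return ScanCheckpoint(self.state.last_file, self.state.last_mtime)

    def reset(self, saved: ScanCheckpoint) -> None:
        """Restore the saved state after a failure."""
        self.state = ScanCheckpoint(saved.last_file, saved.last_mtime)


listing = {"a.txt": 100.0, "b.txt": 200.0, "c.txt": 300.0}
scanner = DirectoryScanModel()
first = scanner.scan({"a.txt": 100.0})   # only a.txt exists so far
saved = scanner.checkpoint()             # consistent state after a.txt
scanner.scan(listing)                    # b.txt and c.txt submitted, then a failure occurs
scanner.reset(saved)
replayed = scanner.scan(listing)         # files newer than a.txt are submitted again
```

In this model, `b.txt` and `c.txt` are submitted a second time after the reset, which matches the statement that all files newer than the last checkpointed file are resubmitted.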

# Exceptions

@@ -12,9 +12,9 @@ You can optionally control whether the operator closes the current output file a
of the file in bytes, the number of tuples that are written to the file, or the time in seconds that the file is open for writing,
or when the operator receives a punctuation marker.

# Consistent Region Behavior
# Behavior in a consistent region

The `HDFS2FileSink` operator supports consistent region.
The `HDFS2FileSink` operator can participate in a consistent region.
The operator can be part of a consistent region, but cannot be at the start of a consistent region.
The operator guarantees that tuples are written to a file in HDFS at least once,
but duplicated tuples can be written to the file if application failure occurs.
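
The at-least-once guarantee can be made concrete with a toy replay model. This is a sketch under stated assumptions, not the operator's implementation: the function name `run_with_failure` is hypothetical, and it assumes that lines already written to the file are not rolled back when the region resets, which is one way duplicates could arise.

```python
from typing import List


def run_with_failure(tuples: List[str], checkpoint_at: int, fail_at: int) -> List[str]:
    """Model a sink that fails before writing tuples[fail_at], then replays from the checkpoint.

    checkpoint_at is the index of the first tuple NOT covered by the last consistent state.
    The returned list stands in for the contents of the HDFS file.
    """
    file_lines: List[str] = []

    # First attempt: write until the failure point.
    for index, item in enumerate(tuples):
        if index == fail_at:
            break
        file_lines.append(item)

    # Recovery: upstream replays everything after the last consistent state.
    # Assumption: earlier writes stay in the file, so some tuples appear twice.
    for item in tuples[checkpoint_at:]:
        file_lines.append(item)

    return file_lines


written = run_with_failure(["t0", "t1", "t2", "t3"], checkpoint_at=1, fail_at=3)
```

Here every tuple appears in `written` at least once, while `t1` and `t2` appear twice because they were written after the last consistent state and before the failure.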
@@ -11,22 +11,22 @@ The operator opens a file on HDFS and sends out its contents in tuple format on
If the optional input port is not specified, the operator reads the HDFS file that is specified in the **file** parameter and
provides the file contents on the output port. If the optional input port is configured, the operator reads the files that are
named by the attribute in the tuples that arrive on its input port and places a punctuation marker between each file.

# Behavior in a consistent region

# Consistent Region Behavior
The `HDFS2FileSource` operator can participate in a consistent region.
The operator can be at the start of a consistent region if there is no input port.

* The operator can participate in a consistent.
* The operator can be at the start of a consistent region if there is no input port.
* The operator supports periodic and operator-driven consistent region policies.
* If consistent region policy is set as operator driven,
the operator initiates a drain after a file is fully read.
* If consistent region policy is set as periodic, the operator respects the period setting
and establishes consistent states accordingly.
This means that multiple consistent states can be established before a file is fully read.
* At checkpoint, the operator saves the current file name and file cursor location.
* If the operator does not have an input port, upon application failures, the operator resets
the file cursor back to the checkpointed location, and starts replaying tuples from the cursor location.
* If the operator has an input port and is in a consistent region, the operator relies on its upstream operators
to properly replay the filenames for it to re-read the files from the beginning.
The operator supports periodic and operator-driven consistent region policies.
If the consistent region policy is set as operator driven, the operator initiates a drain after a file is fully read.
If the consistent region policy is set as periodic, the operator respects the period setting and establishes consistent states accordingly.
This means that multiple consistent states can be established before a file is fully read.

At checkpoint, the operator saves the current file name and file cursor location.
If the operator does not have an input port, upon application failures, the operator resets
the file cursor back to the checkpointed location, and starts replaying tuples from the cursor location.
If the operator has an input port and is in a consistent region, the operator relies on its upstream operators
to properly replay the filenames for it to re-read the files from the beginning.
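
For the case without an input port, the cursor-based recovery described above can be sketched as follows. The model is illustrative only: `FileSourceModel` and its methods are hypothetical names, an in-memory string stands in for the HDFS file, and one line is treated as one tuple.

```python
import io
from typing import Dict, Optional, Tuple


class FileSourceModel:
    """Toy model of checkpointing a file name plus cursor (hypothetical, for illustration only)."""

    def __init__(self, files: Dict[str, str]) -> None:
        self.files = files                      # file name -> content
        self.current: Optional[str] = None
        self.cursor = 0
        self._stream: Optional[io.StringIO] = None

    def open(self, name: str) -> None:
        self.current = name
        self._stream = io.StringIO(self.files[name])
        self.cursor = 0

    def read_tuple(self) -> Optional[str]:
        """Read one line as a tuple; return None at end of file."""
        line = self._stream.readline()
        self.cursor = self._stream.tell()
        return line.rstrip("\n") if line else None

    def checkpoint(self) -> Tuple[Optional[str], int]:
        """Save what the text describes: current file name and cursor location."""
        return (self.current, self.cursor)

    def reset(self, saved: Tuple[str, int]) -> None:
        """Reopen the checkpointed file and move the cursor back to the saved location."""
        name, cursor = saved
        self.open(name)
        self._stream.seek(cursor)
        self.cursor = cursor


source = FileSourceModel({"data.txt": "r1\nr2\nr3\n"})
source.open("data.txt")
first_tuple = source.read_tuple()     # "r1"
saved = source.checkpoint()           # consistent state after r1
source.read_tuple()                   # r2
source.read_tuple()                   # r3, then a failure occurs
source.reset(saved)
replayed_tuple = source.read_tuple()  # replay resumes at r2
```

When the operator has an input port, this model does not apply: the file names themselves must be replayed by the upstream operators, and each file is re-read from the beginning.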

# Exceptions

@@ -129,7 +129,7 @@ The following example shows how the operator accesses GPFS remotely and reads a
<parameter>
<name>file</name>
<description>
This parameter specifies the name of file that the operator opens and reads.
This parameter specifies the name of the file that the operator opens and reads.
This parameter must be specified when the optional input port is not configured.
If the optional input port is used and the file name is specified, the operator generates an error.</description>
<optional>true</optional>