Row filter plugin for Embulk

A filter plugin for Embulk to filter out rows

Configuration

Requirement: version >= 0.3.0

where: Select only rows which matches with conditions written in SQL-like syntax. See SQL-like Syntax

Example

filters:
  - type: row
    where: column1 = 'str'

filters:
  - type: row
    where: |-
      (
        string_column START_WITH 'str' AND
        number_column > 1.0
      )
      OR
      (
        time_column = TIMESTAMP '2016-01-01 +0900' AND
        "true_column" = true
      )

See SQL-like Syntax for more details

SQL-like Syntax

where: column1 = 'str'

where: |-
  (
    string_column START_WITH 'str' AND
    number_column > 1.0
  
  )
  OR
  (
    time_column = TIMESTAMP '2016-01-01 +0900' AND
    "true_column" = true AND
    string_column REGEXP '^reg'
  )

Literals

Boolean Literal

true or TRUE or false or FALSE are considered as a boolean literal

Number Literal

Characters matching with a regular expression -?[0-9]+(\.[0-9]+)? is considered as a number literal

String Literal

Characters surrounded by ' such as 'foo' is considered as a string literal

Timestamp Literal

NOTE: It became possible to omit TIMESTAMP keyword on comparing with timestamp identifier (column) from version >= 0.3.3.

TIMESTAMP ( NumberLiteral | StringLiteral ) such as TIMESTAMP 1470433087.747123 or TIMESTAMP '2016-08-06 06:38:07.747123 +0900' is considered as a timestamp literal

Number is a epoch time since 1970-01-01 UTC with nano time resolution.

String is a timestamp string which matches with one of following format:

%Y-%m-%d %H:%M:%S.%N %z
%Y-%m-%d %H:%M:%S.%N
%Y-%m-%d %H:%M:%S %z
%Y-%m-%d %H:%M:%S
%Y-%m-%d %z
%Y-%m-%d

The time zone for formats without %z is UTC, and the time resolution is micro second (caused by limitation of Embulk TimestampParser).

Json Literal

Not supported yet

Identifier Literal

Characters matching with a regular expression [a-zA-Z_][a-zA-z0-9_]* such as foobar, and characters surrounded by " such as "foo-bar", "foo.bar", and "foo\"bar" are considred as an identifier literal, that is, embulk's column name.

Operators

Boolean Operator

=
!=

Number Operator (Long and Double)

=
!=
>
>=
<=
<

String Operator

=
!=
START_WITH
END_WITH
INCLUDE
REGEXP

Timestamp Operator

=
!=
>
>=
<=
<

Json Operator

Not supported yet

unary operator

"xxx IS NULL"
"xxx IS NOT NULL"
"NOT xxx"

Old Configuration

Versions >= 0.3.0 has where option to supports SQL-like syntax. I recommend to use it.

Following options are deprecated, and will be removed someday.

condition: AND or OR (string, default: AND).
conditions: select only rows which matches with conditions.
- column: column name (string, required)
- operator operator (string, optional, default: ==)
  - boolean operator
    - ==
    - !=
  - numeric operator (long, double, Timestamp)
    - ==
    - !=
    - >
    - >=
    - <=
    - <
  - string operator
    - ==
    - !=
    - start_with (or startsWith)
    - end_with (or endsWith)
    - include (or contains)
  - unary operator
    - IS NULL
    - IS NOT NULL
- argument: argument for the operation (string, required for non-unary operators)
- not: not (boolean, optional, default: false)
- format: special option for timestamp column, specify the format of timestamp argument, parsed argument is compared with the column value as Timestamp object (string, default is %Y-%m-%d %H:%M:%S.%N %z)
- timezone: special option for timestamp column, specify the timezone of timestamp argument (string, default is UTC)

NOTE: column type is automatically retrieved from input data (inputSchema)

Example (AND)

Deprecated

filters:
  - type: row
    condition: AND
    conditions:
      - {column: foo,  operator: "IS NOT NULL"}
      - {column: id,   operator: ">=", argument: 10}
      - {column: id,   operator: "<",  argument: 20}
      - {column: name, opeartor: "include", argument: foo, not: true}
      - {column: time, operator: "==", argument: "2015-07-13", format: "%Y-%m-%d"}

Example (OR)

Deprecated

filters:
  - type: row
    condition: OR
    conditions:
      - {column: a, operator: "IS NOT NULL"}
      - {column: b, operator: "IS NOT NULL"}

Example (AND of OR)

Deprecated

You can express a condition such as (A OR B) AND (C OR D) by combining multiple filters like

filters:
  - type: row
    condition: OR
    conditions:
      - {column: a, operator: "IS NOT NULL"}
      - {column: b, operator: "IS NOT NULL"}
  - type: row
    condition: OR
    conditions:
      - {column: c, operator: "IS NOT NULL"}
      - {column: d, operator: "IS NOT NULL"}

Comparisions

embulk-filter-calcite
- embulk-filter-calcite is a pretty nice plugin which enables us to write SQL query to filter embulk records, not only WHERE but also SELECT.
- Based on my benchmark (Japanese), embulk-filter-row was faster than embulk-filter-calcite.
- Choose which to use as your demand.

ToDo

Support filtering by values of type: json with JSONPath
Support IN operator

ChangeLog

CHANGELOG.md

Development

Run example:

$ ./gradlew classpath
$ embulk preview -I lib example/example.yml

Run test:

$ ./gradlew test

Run checkstyle:

$ ./gradlew check

Release gem:

$ ./gradlew gemPush

Development of SQL-like Syntax

Read the article Supported SQL-like Syntax with embulk-filter-row using BYACC/J and JFlex.

To download BYACC/J and JFlex and run them, you can use:

$ script/byaccj.sh

or

$ ./gradlew byaccj # this runs script/byaccj.sh internally

This generates src/main/java/org/embulk/filter/row/where/{Parser,ParserVal,Yylex}.java.

The byaccj task of gradle is ran before compileJava task (which means to be ran before classpath or test task also) automatically.

Name		Name	Last commit message	Last commit date
Latest commit History 114 Commits
config/checkstyle		config/checkstyle
example		example
gradle/wrapper		gradle/wrapper
lib/embulk/filter		lib/embulk/filter
script		script
src		src
.gitignore		.gitignore
.travis.yml		.travis.yml
CHANGELOG.md		CHANGELOG.md
LICENSE.txt		LICENSE.txt
README.md		README.md
build.gradle		build.gradle
gradlew		gradlew
gradlew.bat		gradlew.bat
settings.gradle		settings.gradle

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Row filter plugin for Embulk

Configuration

Example

SQL-like Syntax

Literals

Boolean Literal

Number Literal

String Literal

Timestamp Literal

Json Literal

Identifier Literal

Operators

Boolean Operator

Number Operator (Long and Double)

String Operator

Timestamp Operator

Json Operator

unary operator

Old Configuration

Example (AND)

Example (OR)

Example (AND of OR)

Comparisions

ToDo

ChangeLog

Development

Development of SQL-like Syntax

About

Releases

Packages

Languages

License

medjed/embulk-filter-row

Folders and files

Latest commit

History

Repository files navigation

Row filter plugin for Embulk

Configuration

Example

SQL-like Syntax

Literals

Boolean Literal

Number Literal

String Literal

Timestamp Literal

Json Literal

Identifier Literal

Operators

Boolean Operator

Number Operator (Long and Double)

String Operator

Timestamp Operator

Json Operator

unary operator

Old Configuration

Example (AND)

Example (OR)

Example (AND of OR)

Comparisions

ToDo

ChangeLog

Development

Development of SQL-like Syntax

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages