A filter plugin for Embulk to change timestamp format
- columns: columns to retain (array of hash)
- name: name of column (required)
- type: type to cast, choose one of
string
,timestamp
,long
(unixtimestamp),double
(unixtimestamp) (string, default isstring
) - from_format: specify the format of the input string (array of strings, default is default_from_timestamp_format)
- from_timezone: specify the timezone of the input string (string, default is default_from_timezone)
- to_format: specify the format of the output string (string, default is default_to_timestamp_format)
- to_timezone: specify the timezone of the output string (string, default is default_to_timezone)
- from_unit: specify the time unit of the input unixtimestamp (string, default is default_from_timestamp_unit)
- to_unit: specify the time unit of the output unixtimestamp (string, default is default_to_timestamp_unit)
- default_from_timestamp_format: default timestamp format for the input string (array of strings, default is
["%Y-%m-%d %H:%M:%S.%N %z"]
) - default_from_timezone: default timezone for the input string (string, default is
UTC
) - default_to_timestamp_format: default timestamp format for the output string (string, default is
%Y-%m-%d %H:%M:%S.%N %z
) - default_to_timezone: default timezone for the output string (string, default is
UTC
) - default_from_timestamp_unit: default time unit such as
sec
(for second),ms
(for milli second),us
(for micro second),ns
(for nano second) for the input unixtimestamp (string, default issecond
) - default_to_timestamp_unit: default time unit such as
sec
(for second),ms
(for milli second),us
(for micro second),ns
(for nano second) for the output unixtimestamp (string, default issecond
) - stop_on_invalid_record: stop bulk load transaction if a invalid record is found (boolean, default is
false
) - timestamp_parser (experimental): set
auto_java
to try to convert ruby format to java format to use faster java timestamp parser (string, default isauto
)
Say example.jsonl is as follows (this is a typical format which Exporting BigQuery table outputs):
{"timestamp":"2015-07-12 15:00:00 UTC","nested":{"timestamp":"2015-07-12 15:00:00 UTC"}}
{"timestamp":"2015-07-12 15:00:00.1 UTC","nested":{"timestamp":"2015-07-12 15:00:00.1 UTC"}}
in:
type: file
path_prefix: example/example.jsonl
parser:
type: jsonl # not json parser
columns:
- {name: timestamp, type: string}
- {name: nested, type: json}
filters:
- type: timestamp_format
default_from_timestamp_format: ["%Y-%m-%d %H:%M:%S.%N %z", "%Y-%m-%d %H:%M:%S %z"]
default_to_timezone: "Asia/Tokyo"
default_to_timestamp_format: "%Y-%m-%d %H:%M:%S.%N"
columns:
- {name: timestamp, type: long, to_unit: ms}
- {name: $.nested.timestamp}
out:
type: stdout
Output will be as:
{"timestamp":1436713200000,"nested":{"timestamp":"2015-07-13 00:00:00.0}}
{"timestamp":1436713200100,"nested":{"timestamp":"2015-07-13 00:00:00.1}}
See ./example for more examples.
For type: json
column, you can specify JSONPath for column's name as:
name: $.payload.key1
name: "$.payload.array[0]"
name: "$.payload.array[*]"
name: $['payload']['key1.key2']
Following operators of JSONPath are not supported:
- Multiple properties such as
['name','name']
- Multiple array indexes such as
[1,2]
- Array slice such as
[1:2]
- Filter expression such as
[?(<expression>)]
Embulk's timestamp parser originally uses jruby implementation, but it is slow. To improve performance, this plugin also supports Java's Joda-Time DateTimeFormat format as:
in:
type: file
path_prefix: example/example.jsonl
parser:
type: jsonl
columns:
- {name: timestamp, type: string}
- {name: nested, type: json}
filters:
- type: timestamp_format
default_from_timestamp_format: ["yyyy-MM-dd HH:mm:ss.SSS z", "yyyy-MM-dd HH:mm:ss z", "yyyy-MM-dd HH:mm:ss"]
default_to_timezone: "Asia/Taipei"
default_to_timestamp_format: "yyyy-MM-dd HH:mm:ss.SSS Z"
columns:
- {name: timestamp, type: long, to_unit: ms}
- {name: $.nested.timestamp}
out:
type: stdout
If format strings contain %
, jruby parser/formatter is used. Otherwirse, java parser/formatter is used
Automatic Conversion of Ruby Timestamp Format to Java Timestamp Format (experimental)
If you configure timestamp_parser: auto_java
, this plugin tries to convert ruby format into java format automatically to use faster java timestamp parser.
COMPARISON:
Benchmark test sets are available at ./bench. In my environment (Mac Book Pro), for 1000000 timestamps:
- java parser + java formatter: 1.3s
- java parser + jruby formatter: 1.4s
- jruby parser + java formatter: 64.52s
- jruby parser + jruby formatter: 65.06s
JRuby parser is slow, but JRuby formatter is not so slow.
JRuby parser has micro second resolution. Java (Joda-Time) parser has milli second resolution.
Nano second resolution is partially supported by this plugin itself. Use parser format nnnnnnnnn
for Java parser as
yyyy-MM-dd HH:mm:ss.nnnnnnnnn z
This plugin finds places of nano second from texts with regular expression \.(\d+)
.
For formatter, you can use nnnnnnnnn
for nano and nnnnnn
for micro as
yyyy-MM-dd HH:mm:ss.nnnnnnnnn z
yyyy-MM-dd HH:mm:ss.nnnnnn z
FYI: Java8's DateTimeFormatter supports nano second resolution, but we can not use it because embulk supports Java7.
- Write test
Run example:
$ ./gradlew classpath
$ embulk preview -I lib example/example.yml
Run test:
$ ./gradlew test
Run checkstyle:
$ ./gradlew check
Release gem:
$ ./gradlew gemPush