Skip to content

Codezerker/csvsql

Repository files navigation

Csvsql

Build Status Gem Version

Use SQL to query your CSV file, and return a new CSV.

Requirements

  • SQlite 3.6.16 or newer.

Installation

Add this line to your application's Gemfile:

gem 'csvsql'

And then execute:

$ bundle

Or install it yourself as:

$ gem install csvsql

Usage

Example CSV mydata.csv

name,total:int,created_at:datetime
a,12,2018-1-1 11:22
b,21,2018-3-1 10:20
c,39,2018-1-19 20:10

The first line is title for each columns. We use : split the title name and column type. the column type is varchar(255) if you did give a type.

There are same types: int, integer float, double, date, datetime, varchar(255).

All sqlite3 type is supported, but maybe some type cannot convert to a sqlite3 value type.

Query from file & Output

# csvsql [options] SQL
csvsql -h # Get help
$ csvsql -i mydata.csv "select * from csv where created_at > '2018-1-1' and total > 20" > /tmp/result.csv
# name:varchar(255),total:int,created_at:datetime
# b,21,2018-3-1 10:20:00
# c,39,2018-1-19 20:10:00

csvsql -i mydata.csv "select count(*) as total_record from csv where created_at > '2018-1-1'" > /tmp/result.csv
# total_record:integer
# 2

csvsql -i mydata.csv "select name, total from csv where total < 30" > /tmp/result.csv
# name:varchar(255),total:integer
# a,12
# b,21

Query from stdin

If not give a csv by -i, we will get the input from stdin.

csvsql -i mydata.csv "select name, total from csv where total < 30" | csvsql "select count(*) as count from csv"
# count:integer
# 2

Cache CSV data

It will save the parsed data to ~/.csvsql_cache folder. it will not parse data again next time if the data file didn't change

csvsql -i large.csv -c "select count(*) from csv"

# the second, it will be fast.
csvsql -i large.csv -c "select count(*) from csv"

Query multiple CSV files

For multiple files, we should name each files. This name will be a table name.

csvsql -i users.csv:users -i posts.csv:posts "select * from posts join users on posts.user_id = users.id where users.role = 'guest'"

With cache, those name will be a database name, the table name is csv. If a csv file was updated, its db will automatically update. Other files still use cache.

csvsql -i users.csv:users -i posts.csv:posts -c "select * from posts.csv join users.csv on posts.csv.user_id = users.csv.id where users.csv.role = 'guest'"

Clear Cache

This command will remove all data in the ~/.csvsql_cache

csvsql --clear-cache

Performance (MBP 2016)

Data

title,desc,created_at:datetime
1asdfklajskdjfk alksd flka sdfkja sldfk ,lka sdfa sdlkfj alkr jl2kjlkajslkdfjak,2017-10-20 20:20

10,000 lines

$ time csvsql -i /tmp/a.csv "select count(*) from csv"

output

count(*)
100000

real	0m5.070s
user	0m4.776s
sys	0m0.256s

10,000 lines with cache

$ time csvsql -c -i /tmp/a.csv "select count(*) from csv"

Output

count(*)
100000

real	0m4.677s
user	0m4.309s
sys	0m0.253s

Second output

$ time csvsql -c -i /tmp/a.csv "select count(*) from csv"
count(*)
100000

real	0m0.502s
user	0m0.327s
sys	0m0.151s
  • 1,000,000 lines
  • 1,000,000 lines with cache

Known Issues

  1. Output columns have no a date type.

Because the SQLite3 version too old. Upgrate it.

Development

  • Make sure your code has some testing.
  • Run rubocop -a before commit your code.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/csvsql.

License

The gem is available as open source under the terms of the MIT License. /

Releases

No releases published

Packages

No packages published