update in ms excel compatible formats documentation #20

Open
wants to merge 1 commit into
base: main
2 changes: 1 addition & 1 deletion src/SUMMARY.md
@@ -14,7 +14,7 @@
- [NDJSON](config/dataset-formats/ndjson.md)
- [Delta Lake](config/dataset-formats/delta.md)
- [Arrow](config/dataset-formats/arrow.md)
- [Xlsx](config/dataset-formats/xlsx.md)
- [MS Excel compatible formats](config/dataset-formats/excel.md)
- [Blob store](./config/blob-store.md)
- [Databases](./config/databases.md)
- [Postgres wire protocol](postgres.md)
88 changes: 88 additions & 0 deletions src/config/dataset-formats/excel.md
@@ -0,0 +1,88 @@
# MS Excel compatible formats

ROAPI supports loading several Microsoft Excel compatible formats: xls, xlsx, xlsb, and ods.

## Configuration
To load MS Excel compatible files, specify the table config like this:
```yaml
tables:
  - name: "<table name>"
    uri: "<files path>"
    option:
      format: "<file format>"
      sheet_name: "Sheet1"
      rows_range_start: 2
      rows_range_end: 5
      columns_range_start: 1
      columns_range_end: 6
      schema_inference_lines: 3
```
* **format** - the file format. Currently supported formats:
  * xls (Microsoft Excel 5.0/95 Workbook)
  * xlsx (Excel Workbook)
  * xlsb (Excel Binary Workbook)
  * ods (OpenDocument Spreadsheet)
* **sheet_name** - the name of the sheet that holds the table data. Most new files name their first sheet Sheet1; change `sheet_name` if your sheet uses a different name.
  ![xlsx_sheet_name](../../images/xlsx_sheet_name.png)
  If `sheet_name` is not specified, ROAPI uses the first sheet.
* **Table range options**
  * **rows_range_start** - the first row of the table, which contains the column names. By default, `rows_range_start` is 0 (the first row in the sheet).
  * **rows_range_end** - the last row of the table. By default, ROAPI reads all rows.
  * **columns_range_start** - the first column of the table. By default, `columns_range_start` is 0 (the first column in the sheet).
  * **columns_range_end** - the last column of the table. By default, ROAPI reads all columns.

  For example, to take only the selected data:
  ![spread_sheet_range](../../images/spread_sheet_range.png)
  the config file looks like:
```yaml
tables:
  - name: "<table name>"
    uri: "<files path>"
    option:
      format: "<file format>"
      sheet_name: "Sheet1"
      rows_range_start: 1
      rows_range_end: 4
      columns_range_start: 1
      columns_range_end: 3
```
* **schema_inference_lines** - the number of rows (within the table range) to use for schema inference. The count includes the row with column names, so `schema_inference_lines: 3` means ROAPI uses the first row to infer column names and the next 2 rows to infer column types. If this option is not specified, ROAPI reads all rows to infer column data types.
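
The way the table range and `schema_inference_lines` options interact can be illustrated with a small Python sketch. It is not ROAPI's implementation: the `select_range` and `split_for_inference` helpers and the sample grid are hypothetical, and the sketch assumes zero-based indices with inclusive end bounds, which this page does not state explicitly.

```python
from typing import List, Optional, Tuple

Grid = List[List[object]]


def select_range(
    grid: Grid,
    rows_range_start: int = 0,
    rows_range_end: Optional[int] = None,
    columns_range_start: int = 0,
    columns_range_end: Optional[int] = None,
) -> Grid:
    """Return the cells inside the table range.

    Assumption: indices are zero-based and end bounds are inclusive.
    A missing end bound means "read to the last row/column".
    """
    row_stop = None if rows_range_end is None else rows_range_end + 1
    col_stop = None if columns_range_end is None else columns_range_end + 1
    return [
        row[columns_range_start:col_stop]
        for row in grid[rows_range_start:row_stop]
    ]


def split_for_inference(
    table: Grid, schema_inference_lines: Optional[int] = None
) -> Tuple[List[object], Grid]:
    """Split a table range into column names and rows used for type inference.

    The first row holds the column names and counts toward
    schema_inference_lines, so a value of 3 leaves 2 rows for types.
    """
    header, data = table[0], table[1:]
    if schema_inference_lines is not None:
        data = data[: max(schema_inference_lines - 1, 0)]
    return header, data


# Hypothetical sheet: row 0 and column 0 lie outside the table.
sheet = [
    ["", "", "", ""],
    ["", "id", "name", "score"],
    ["", 1, "a", 1.5],
    ["", 2, "b", 2.5],
    ["", 3, "c", 3.5],
]

table = select_range(
    sheet,
    rows_range_start=1,
    rows_range_end=4,
    columns_range_start=1,
    columns_range_end=3,
)
names, sample = split_for_inference(table, schema_inference_lines=3)
```

Under these assumptions, `names` holds the three column names and `sample` holds only the first two data rows, so the third data row would not influence the inferred types.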

## Schema inference
ROAPI can infer the schema of the data automatically. The first row of the data range is treated as the row with column names. After inferring the column names, ROAPI infers data types by scanning all remaining rows, or only the limited number of rows set by the `schema_inference_lines` option.
If a column contains more than one data type (for example, float and int), ROAPI uses the Utf8 data type.
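
The mixed-type rule can be sketched in Python as follows. This illustrates the rule as described here, not ROAPI's code: the `cell_type` and `infer_column_type` helpers are hypothetical, only three cell types are modeled, and empty cells are left out because this page does not describe how they are handled.

```python
from typing import List


def cell_type(value: object) -> str:
    """Map a Python value to a type name (only three types are modeled)."""
    if isinstance(value, int):
        return "Int64"
    if isinstance(value, float):
        return "Float64"
    return "Utf8"


def infer_column_type(values: List[object]) -> str:
    """Return the single type shared by all values, else fall back to Utf8."""
    types = {cell_type(v) for v in values}
    if len(types) == 1:
        return types.pop()
    # More than one type seen in the column: fall back to Utf8.
    # (An empty column also lands here; that choice is an assumption.)
    return "Utf8"


int_column = infer_column_type([1, 2, 3])
mixed_column = infer_column_type([1, 2.5, 3])
```

Here `int_column` is `"Int64"`, while `mixed_column` falls back to `"Utf8"` because the column mixes int and float values.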

You can also specify the schema in the configuration file. This skips schema inference from the data, so the table loads faster.

```yaml
tables:
  - name: "excel_table"
    uri: "path/to/file.xlsx"
    option:
      format: "xlsx"
    schema:
      columns:
        - name: "int_column"
          data_type: "Int64"
          nullable: true
        - name: "string_column"
          data_type: "Utf8"
          nullable: true
        - name: "float_column"
          data_type: "Float64"
          nullable: true
        - name: "datetime_column"
          data_type: !Timestamp [Seconds, null]
          nullable: true
        - name: "duration_column"
          data_type: !Duration Second
          nullable: true
        - name: "date32_column"
          data_type: Date32
          nullable: true
        - name: "date64_column"
          data_type: Date64
          nullable: true
        - name: "null_column"
          data_type: Null
          nullable: true
```
23 changes: 0 additions & 23 deletions src/config/dataset-formats/xlsx.md

This file was deleted.

Binary file added src/images/spread_sheet_range.png