Python package with data structures and functionality to read/write files in StarTable format and contain and manipulate the information therein.
For feature requests and bugs relating specifically to this Python package, please refer to this GitHub project's issue tracker.
For issues relating to the StarTable standard more broadly, please consult the StarTable standard page.
Contributions are welcome, especially if they relate to an issue that you have previously discussed or posted in the issue tracker.
Available on PyPI:
pip install startables
import startables as st # << recommended import idiom
import pandas as pd
# Build table manually
df = pd.DataFrame(data=[[float('nan'), 'gnu', 3], [4, 'gnat', '{{(+ x y)}}']], columns=['a', 'b', 'c'])
col_specs = {n: st.ColumnMetadata(st.Unit(u)) for n, u in zip(['a', 'b', 'c'], ['-', 'text', 'm'])}
table_x = st.Table(df=df, name='some_table', col_specs=col_specs, destinations={'success', 'glory'})
# Accessing Table contents: use Table.df pandas.DataFrame interface
print(table_x.df['b'])
# Read bundle of tables from file
b = st.read_csv('animal_farm_startable_file.csv')
# Make new bundle containing a subset of this bundle's tables by name and/or destination
farm_bundle = b.filter(name_pattern='farm')
# Accessing tables in bundle: use Bundle.tables List interface
for t in b.tables:
    print(t.name)
# Add tables to bundle
b.tables.append(table_x)
# Remove tables from bundle by name and/or destination
removed_tables = b.pop_tables(destination='my_farm') # tables now removed from b
# ... More examples to come ...
Table contains a single StarTable table block, including table name, destination field, and equal-length columns, with each column containing a list of values and having a name and metadata (currently, a unit field and a not-fully-implemented remark, both of them strings).
Table contents are stored in a pandas DataFrame. Table.df grants the user read/write access to this DataFrame. Table column names are stored as the DataFrame column (Series) names. Column units are stored separately.
The user can modify the DataFrame at will using the pandas.DataFrame API, or even replace the DataFrame entirely. However, introducing columns with names not covered in the Table's column specification will break the Table. Other than that, removing columns, adding/removing rows, editing cells etc. are all fine and shouldn't break anything.
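The coverage rule above can be illustrated with a minimal sketch. It uses plain Python containers rather than the package itself, and the helper name uncovered_columns is hypothetical, introduced only for this illustration:

```python
from typing import Dict, Iterable, List


def uncovered_columns(df_columns: Iterable[str], col_specs: Dict[str, object]) -> List[str]:
    """Return the column names that have no entry in the column specification."""
    return [name for name in df_columns if name not in col_specs]


# Hypothetical column specification: column name -> unit string
col_specs = {"a": "-", "b": "text", "c": "m"}

# Removing a column is fine: every remaining column is still covered.
print(uncovered_columns(["a", "c"], col_specs))  # []

# Adding a column that is not in the specification is what breaks the Table.
print(uncovered_columns(["a", "b", "c", "d"], col_specs))  # ['d']
```

An empty result means the DataFrame's columns are all described by the column specification.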
Bundle is a container for one or more Table objects that belong together, usually because they:
- have a common origin e.g. come from the same file or file structure, and/or
- are understood as having a common context, in particular when evaluating expressions
Bundle is intended as the primary interface for file I/O. The read_csv() and read_excel() functions read StarTable files in CSV and Excel format, respectively, and both return a Bundle, while Bundle.to_csv() and Bundle.to_excel() write a collection of tables to these same file formats.
TableOrigin contains an indication of where a given Table came from, intended for use in traceability. Currently it is just a stub that contains a single string.
ColumnMetadata is a container class for a column's unit (and free-text remark, though this is not tied with read/write methods yet, so of limited utility). A Table's columns are specified by supplying a dict of column_name:ColumnMetadata that covers (at least) all column names present in the Table's child DataFrame.
Table cells are allowed to contain not only literal values, but also Lisp expressions.
Table.evaluate_expressions(context) returns a Table with expressions (if there are any) evaluated based on the given context. Specify inplace=True to do this in place.
Bundle.evaluate_expressions(context) does the same thing, but for all its child tables.
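To make the idea of evaluating an expression cell against a context concrete, here is a minimal standalone sketch. It is not the package's evaluator: the function names are hypothetical, only four arithmetic operators are supported, and the '{{' ... '}}' cell delimiters are assumed from the usage example earlier in this README.

```python
import operator
from functools import reduce
from typing import Dict, List, Union

Number = Union[int, float]

# Assumption: only these four arithmetic operators are supported in this sketch.
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def tokenize(source: str) -> List[str]:
    """Split an expression such as '(+ x y)' into parentheses and atoms."""
    return source.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str]):
    """Consume tokens from the front and build a nested list (or a single atom)."""
    token = tokens.pop(0)
    if token == "(":
        form = []
        while tokens[0] != ")":
            form.append(parse(tokens))
        tokens.pop(0)  # discard the closing parenthesis
        return form
    return token


def evaluate(node, context: Dict[str, Number]) -> Number:
    """Evaluate a parsed expression, looking up symbols in the context."""
    if isinstance(node, list):
        func = OPERATORS[node[0]]
        args = [evaluate(arg, context) for arg in node[1:]]
        return reduce(func, args)
    if node in context:
        return context[node]
    return float(node)  # anything else is treated as a numeric literal


def evaluate_cell(cell, context: Dict[str, Number]):
    """Evaluate a cell wrapped in '{{' ... '}}'; return any other cell unchanged."""
    if isinstance(cell, str) and cell.startswith("{{") and cell.endswith("}}"):
        return evaluate(parse(tokenize(cell[2:-2])), context)
    return cell


context = {"x": 7, "y": 3}
print(evaluate_cell("{{(+ x y)}}", context))        # 10
print(evaluate_cell("{{(* 2 (- x y))}}", context))  # 8.0
print(evaluate_cell("gnat", context))               # gnat
```

Literal cells pass through untouched, which mirrors the statement above that only cells containing expressions are affected.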
The units of a Table can be converted according to a UnitPolicy.
A UnitConversion defines how a given source unit is converted to an associated reference unit.
km_m = ScaleUnitConversion(src_unit=Unit('km'), ref_unit=Unit('m'), ref_per_src=1000)
km_m.to_ref(42) # returns 42000
km_m.from_ref(2000) # returns 2
A UnitPolicy contains an arbitrary number of UnitConversion objects, with the restriction that any source unit is associated with one and only one reference unit, i.e. it can't include a UnitConversion from 'mile' to 'km' and another from 'mile' to 'm' (but sure can include one from 'km' to 'm' and another from 'mile' to 'm'). Reference units themselves are automatically added as source units, with themselves as their own reference unit through an IdentityUnitConversion. Conversion is then possible between any two source units that share the same reference unit.
C_K = LinearUnitConversion(Unit('°C'), Unit('K'), slope=1, intercept=273.15)
cup = CustomUnitPolicy([
    ScaleUnitConversion(Unit('km'), Unit('m'), 1000),
    ScaleUnitConversion(Unit('mm'), Unit('m'), 0.001),
    IdentityUnitConversion(Unit('metre'), Unit('m')),  # alias of a reference unit
    C_K,
    C_K.alias(src_unit_alias=Unit('deg_C'))  # alias of a source unit
])
cup.convert(42, from_unit=Unit('m'), to_unit=Unit('mm')) # returns 42000
cup.convert(42, from_unit=Unit('mm'), to_unit=Unit('metre')) # returns 0.042
cup.convert(42, from_unit=Unit('km'), to_unit=Unit('mm')) # returns 42000000
cup.convert_to_ref(20, src_unit=Unit('deg_C')) # returns 293.15
cup.ref_unit(src_unit=Unit('°C')) # returns Unit('K')
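The rule that two source units can be converted into each other only when they share a reference unit can be sketched in a few lines of standalone Python. This is an illustration of the concept, not the package's implementation: the class names LinearConversion and SimplePolicy are hypothetical, and units are plain strings here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class LinearConversion:
    """ref_value = slope * src_value + intercept (a pure scale has intercept 0)."""
    src_unit: str
    ref_unit: str
    slope: float = 1.0
    intercept: float = 0.0

    def to_ref(self, value: float) -> float:
        return self.slope * value + self.intercept

    def from_ref(self, value: float) -> float:
        return (value - self.intercept) / self.slope


class SimplePolicy:
    """Maps each source unit to exactly one conversion, hence one reference unit."""

    def __init__(self, conversions: Iterable[LinearConversion]):
        self._by_src: Dict[str, LinearConversion] = {}
        for c in conversions:
            if c.src_unit in self._by_src:
                raise ValueError(f"Source unit {c.src_unit!r} already has a reference unit")
            self._by_src[c.src_unit] = c
        # Reference units act as their own source units via an identity conversion.
        for ref in {c.ref_unit for c in self._by_src.values()}:
            self._by_src.setdefault(ref, LinearConversion(ref, ref))

    def convert(self, value: float, from_unit: str, to_unit: str) -> float:
        src, dst = self._by_src[from_unit], self._by_src[to_unit]
        if src.ref_unit != dst.ref_unit:
            raise ValueError(f"{from_unit!r} and {to_unit!r} do not share a reference unit")
        # Go through the shared reference unit: source -> reference -> target.
        return dst.from_ref(src.to_ref(value))


policy = SimplePolicy([
    LinearConversion("km", "m", slope=1000),
    LinearConversion("mile", "m", slope=1609.344),
    LinearConversion("°C", "K", slope=1, intercept=273.15),
])
print(policy.convert(5, from_unit="km", to_unit="m"))  # 5000.0
print(policy.convert(1, from_unit="mile", to_unit="km"))
```

Routing every conversion through the reference unit means that N source units need only N conversions, rather than one per pair of units.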
A Table's units are converted column by column in accordance with the UnitPolicy.
- Table.convert_to_ref_units() converts each column to its UnitPolicy reference unit by calling UnitPolicy.convert_to_ref()
- Table.convert_units() converts to new units explicitly specified for each column.
- Table.convert_to_home_units() is a special case of Table.convert_units() which converts back to the Table's "home units". "Home units" are saved in the Table's col_specs and are the column units with which the Table was created (whether manually or read from file), unless they are explicitly changed later.
Unit conversion does not support expressions. Expressions must be evaluated prior to unit conversion.
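The ordering constraint above (evaluate first, convert second) can be sketched as follows. This is a standalone illustration with hypothetical names. Raising a ValueError on an unevaluated expression is a choice made for this sketch, not documented behaviour of the package, and the '{{' ... '}}' delimiters are assumed from the usage example earlier in this README.

```python
from typing import Callable, Dict, List

# Hypothetical conversions from a source unit to the reference unit 'm'.
TO_REF: Dict[str, Callable[[float], float]] = {
    "km": lambda v: v * 1000,
    "mm": lambda v: v / 1000,
    "m": lambda v: v,
}


def is_expression(cell: object) -> bool:
    """Assumption: an expression cell is a string wrapped in '{{' and '}}'."""
    return isinstance(cell, str) and cell.startswith("{{") and cell.endswith("}}")


def convert_column_to_ref(values: List[object], unit: str) -> List[float]:
    """Convert one column to its reference unit, refusing unevaluated expressions."""
    if any(is_expression(v) for v in values):
        raise ValueError("Evaluate expressions before converting units")
    return [TO_REF[unit](v) for v in values]


# Hypothetical table contents: column name -> cell values, plus one unit per column.
columns = {"length": [1.5, 2.0], "depth": [300.0, "{{(+ x y)}}"]}
units = {"length": "km", "depth": "mm"}

print(convert_column_to_ref(columns["length"], units["length"]))  # [1500.0, 2000.0]

try:
    convert_column_to_ref(columns["depth"], units["depth"])
except ValueError as err:
    print(err)  # Evaluate expressions before converting units
```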
This project was migrated to GitHub from a private server at v0.8.0. Changes prior to this are not included in the GitHub repo; nevertheless, the pre-0.8.0 changelog is documented here. PS-## below refers to issue numbers on a legacy YouTrack issue tracker on a private server. These issue numbers are left as is for the historical record.
This project follows semantic versioning. This changelog follows the guidelines published on keepachangelog.com.
- First parameter of read_csv() renamed from stream to filepath_or_buffer. The name stream was inconsistent with the expected type since 0.7.3, namely str or pathlib.Path (in addition to TextIO streams). Also, the new name filepath_or_buffer is consistent with pandas.read_csv(). This change will break code that has used stream as a named argument, though we are hopeful that this has rarely if ever been done by users of this API.
- Removed restriction on openpyxl version (was previously restricted to < 2.6). This is a less crappy fix to PS-49 than had previously been implemented.
- PS-52 read_csv() throws warning when given a stream as input; asks for a filename
- PS-53 Bundle.to_csv() fails when column names are not strings
- openpyxl<26 dependency in environment.yml
- PS-19 Reading from CSV can fail if not enough column delimiters on first line of CSV file
- PS-48 Increase compatibility of startables python library by allowing non-standard formatted .csv files
- PS-50 CSV files exported from Excel result in first table not being read due to UTF-8-BOM
Version 2.6.0 of our openpyxl dependency, released a couple of weeks ago, contains major breaking changes (which kind of goes against the spirit of minor version updates in semantic versioning...) and these breaking changes do indeed break startables. To remedy this, the openpyxl version number is now fixed to <2.6 in the startables conda package recipe.
- Breaking changes in methods Bundle.filter() and Bundle.pop_tables():
  - Parameter exact_name renamed to name for consistency with the naming of destination-related parameters.
  - Ordering of parameters in signature changed to (name, name_pattern, destination, destination_pattern, ignore_case)
  - PS-43 Name and destination filters are now case-insensitive by default. Can be made case-sensitive again by setting parameter ignore_case=False.
- PS-41 Filtering on destinations by regular expression
All of the changes in this version address PS-27: Add/remove tables in Bundle
Breaking changes in Bundle:
- Method tables renamed to filter. Instead of returning a List[Table], now returns a Bundle containing the filtered tables (i.e. a subset of the original Bundle).
- Property tables introduced (not to be confused with the former method of the same name). Returns the internal list of tables stored in this Bundle.
- All list-related operations are delegated to the list returned by the tables property. In particular:
  - Can now add Tables to Bundle (a main driver for this major change) by invoking List's append() and extend() methods on Bundle.tables.
  - Magic methods __getitem__, __iter__, and __len__ have been removed.
- Method Bundle.pop_tables() to remove a selection of Bundle's member tables, selected by name and/or destination. Returns the removed tables. (This was the other main driver for this major change.)
- PS-2 The ordering of destinations is now preserved. Table destinations can now be supplied as any Iterable (changed from Set) and are then stored internally as a List, thus preserving pre-existing order (if any). Potentially breaking change: A ValueError will be raised upon encountering any duplicates in the destinations supplied to a Table, either when read from file or programmatically. (Because duplicates are indeed nonsensical.) Duplicates were previously eliminated silently when read from file using read_csv and read_excel, and were not possible programmatically (pedants please refrain) as they had to be given as a Set.
- Introducing: Unit conversion machinery PS-10
- Script that publishes this readme on windwiki
- PS-8 A more helpful error message on syntax error raised while parsing an expression cell, guiding the user to the offending cell
- PS-35 Python version requirement relaxed to 3.6 and above (was strictly 3.6)
- PS-33 Logging. Now gone. Was generating too much noise in client code logs.
- PS-15 read_csv() and read_excel() now accept 'nan', 'NaN', and 'NAN' as valid no-data markers. Previously, only '-' was accepted.
- PS-28 Numeric data in text columns doesn't get read.
- PS-5 Table blocks with zero rows are ignored by read_excel() and read_csv().
- PS-14 read_csv() now forwards *args and **kwargs to pandas.read_csv(), so the user can now make use of pandas.read_csv()'s many useful arguments, not least decimal to control which decimal separator is used in numerical columns (typically either '.' or ','). Breaking change: included in this forward is the previously explicitly implemented sep argument, which means that the default value of sep has now changed from ';' to pandas' own default, ','. This is a breaking change, but improves consistency with pandas' API.
- Column metadata (basically, just units for now) now stored in a separate field, rather than as monkeypatches on the child data frame's columns. The latter proved too fragile.
- Table.df setter: user can now replace the Table's child dataframe, as long as all the columns of the new df are described in (and consistent with) the Table's column specification (dict of name:ColumnMetadata). If not, an error is raised.
- PS-4 Ability to get Bundle.tables() by destination
- PS-13 Add exact_name option to Bundle.tables()
- Column metadata now supports not only a unit, but also a free-text remark, but this is not yet used in the file readers and writers; until it is, this feature won't be very useful.
- PS-7 After using read_excel(), evaluate_expressions() fails unless DataFrame index is manually reset
- Other minor bug fixes.
Complete redesign compared to the earlier 0.1 package. Total breaking change in the API. Pandas dataframes now lie at the heart of Table objects. Requires Python 3.6.