tab2dict
aims to solve a common problem in developing scientific software (i.e., models):
How to manage data with multiple subscripts, e.g., importing them into the model, saving them to result files, etc.
For example, tab2dict
, you can do as follows:
First, create a .xlsx
or .csv
file as follows to save the data:
id_sector | id_building_type | unit | value |
---|---|---|---|
1 | 1 | count | 10000 |
1 | 2 | count | 12000 |
2 | 3 | count | 500 |
2 | 6 | count | 3000 |
2 | 7 | count | 200 |
3 | 3 | count | 300 |
3 | 4 | count | 2000 |
4 | 5 | count | 2500 |
5 | 3 | count | 800 |
5 | 5 | count | 100 |
5 | 7 | count | 400 |
Second, use the following code to import the data into the model as an instance of TabDict
:
from tab2dict import TabDict
building_stock = TabDict.from_file("Data_BuildingStock.xlsx")
Third, create a model-specific class by inheriting the TabKey
class provided by tab2dict
,
with all the relevant id_xxx
as its attributes.
from typing import Optional
from tab2dict import TabKey
class BuildingTabKey(TabKey):
def __init__(
self,
id_sector: Optional[int] = None,
id_building_type: Optional[int] = None,
time_year: Optional[int] = None,
):
self.id_sector = id_sector
self.id_building_type = id_building_type
self.time_year = time_year
Fourth, create an instance of the BuildingTabKey
class to get data from the building_stock
.
tkey = BuildingTabKey(id_sector=1, id_building_type=2)
a = building_stock.get_item(tkey) # -> 12000
There are two main advantages of using tab2dict
:
- As implied by the name,
TabDict
usedict
to save data. So, the speed ofget_item()
is fast. But of course, the input data will all be loaded and take some space in the memory. - It allows the users to adapt the input tables relatively freely (e.g., adding or removing index columns) with limited changes in the code. This is a good feature especially useful in the agent-based models. Because each agent can be assigned with its own
tabkey
, knowing all theid_xxx
more than necessary for individual tables ortabdicts
. Then there can be no necessary changes in the code when one more "subscript (id_xxx
column)" is added to a parameter in is input table.
You can find more detailed introduction in the following sections as well as in the test folder of the project.
Hope tab2dict
can make your life easier in developing models!
Most functions of tab2dict
are clearly shown in the test files.
Below is a quick introduction of the main points.
There are three types of predefined input tables that can be loaded and converted to TabDict
instances:
- "ID" tables named as "ID_XXX"
- "Relation" tables named as "Relation_XXX"
- "Data" tables named as "Data_XXX"
In each type of tables, the column names must follow the convention shown in the examples below:
First, in the "ID" tables:
- The name of the first column must start with
id_
, e.g.,id_sector
orid_building_type
. - The name of the second column must be
name
.
Example: ID_Sector.xlsx
id_sector | name |
---|---|
1 | residential |
2 | manufacturing |
3 | retail |
4 | hotel and restaurant |
5 | public administration |
Example: ID_BuildingType.xlsx
id_building_type | name |
---|---|
1 | single family house |
2 | multiple family house |
3 | office |
4 | Shopping mall |
5 | accommodation and restaurant buildings |
6 | production buildings |
7 | other buildings |
Second, in the "Relation" tables:
- There can only be two columns, showing the relation between the two indexes.
Example: Relation_Sector_BuildingType.xlsx
id_sector | id_building_type |
---|---|
1 | 1 |
1 | 2 |
2 | 3 |
2 | 6 |
2 | 7 |
3 | 3 |
3 | 4 |
4 | 5 |
5 | 3 |
5 | 5 |
5 | 7 |
Third, in the "Data" tables:
tab2dict
supports multiple index columns in a "Data" table. And, apart fromid_
,tab2dict
also hastime_
as a known prefix for index column names, designed for timeseries data. But, there is no third known prefix. So, all index columns must have names starting with eitherid_
ortime_
. Or, they will not be identified as index columns.- The
unit
column is optional but suggested to be included. - By default, the name of the last "value column" is
value
.tab2dict
also allows users to define other names for the "value column". Besides, the users can have multiple value columns in one "Data" table. For such tables, users should specify thevalue_column_name
when loading the table and converting it to an instance ofTabDict
(example below).
Example: Data_BuildingStock.xlsx
id_sector | id_building_type | unit | value |
---|---|---|---|
1 | 1 | count | 10000 |
1 | 2 | count | 12000 |
2 | 3 | count | 500 |
2 | 6 | count | 3000 |
2 | 7 | count | 200 |
3 | 3 | count | 300 |
3 | 4 | count | 2000 |
4 | 5 | count | 2500 |
5 | 3 | count | 800 |
5 | 5 | count | 100 |
5 | 7 | count | 400 |
Example: Data_EnergyIntensityIndex.xlsx
id_building_type | time_year | unit | value |
---|---|---|---|
1 | 2020 | Euros | 1 |
1 | 2030 | Euros | 0.99 |
1 | 2040 | Euros | 0.98 |
1 | 2050 | Euros | 0.97 |
2 | 2020 | Euros | 1 |
2 | 2030 | Euros | 0.98 |
2 | 2040 | Euros | 0.96 |
2 | 2050 | Euros | 0.94 |
3 | 2020 | Euros | 1 |
3 | 2030 | Euros | 0.97 |
3 | 2040 | Euros | 0.94 |
3 | 2050 | Euros | 0.91 |
4 | 2020 | Euros | 1 |
4 | 2030 | Euros | 0.96 |
4 | 2040 | Euros | 0.92 |
4 | 2050 | Euros | 0.88 |
5 | 2020 | Euros | 1 |
5 | 2030 | Euros | 0.95 |
5 | 2040 | Euros | 0.9 |
5 | 2050 | Euros | 0.85 |
6 | 2020 | Euros | 1 |
6 | 2030 | Euros | 0.94 |
6 | 2040 | Euros | 0.88 |
6 | 2050 | Euros | 0.82 |
7 | 2020 | Euros | 1 |
7 | 2030 | Euros | 0.93 |
7 | 2040 | Euros | 0.86 |
7 | 2050 | Euros | 0.79 |
Example: Data_BuildingParameter.xlsx
id_sector | id_building_type | unit | a | b | c |
---|---|---|---|---|---|
1 | 1 | count | 1 | 10 | 100 |
1 | 2 | count | 2 | 20 | 200 |
2 | 3 | count | 3 | 30 | 300 |
2 | 6 | count | 4 | 40 | 400 |
2 | 7 | count | 5 | 50 | 500 |
3 | 3 | count | 6 | 60 | 600 |
3 | 4 | count | 7 | 70 | 700 |
4 | 5 | count | 8 | 80 | 800 |
5 | 3 | count | 9 | 90 | 900 |
5 | 5 | count | 10 | 100 | 1000 |
5 | 7 | count | 11 | 110 | 1100 |
tab2dict
provides the base class TabKey
that has to be inherited and extended according to the input table columns.
from typing import Optional
from tab2dict import TabKey
class BuildingTabKey(TabKey):
def __init__(
self,
id_sector: Optional[int] = None,
id_building_type: Optional[int] = None,
time_year: Optional[int] = None,
):
self.id_sector = id_sector
self.id_building_type = id_building_type
self.time_year = time_year
As shown, the attributes of the BuildingTabKey
class are the index column names in the input tables.
As mentioned, they must always start with id_
or time_
.
Users can load and convert a data file to a TabDict
instance (tdict
) by calling the from_file
function.
The index columns will be automatically identified and recorded as its key_cols
.
from tab2dict import TabDict
sectors = TabDict.from_file("ID_Sector.xlsx")
building_types = TabDict.from_file("ID_BuildingType.xlsx")
relation_sector_building_type = TabDict.from_file("Relation_Sector_BuildingType.xlsx")
building_stock = TabDict.from_file("Data_BuildingStock.xlsx")
energy_intensity_index = TabDict.from_file("Data_EnergyIntensityIndex.xlsx")
building_parameter_a = TabDict.from_file("Data_BuildingParameter.xlsx", value_column_name="a")
building_parameter_b = TabDict.from_file("Data_BuildingParameter.xlsx", value_column_name="b")
building_parameter_c = TabDict.from_file("Data_BuildingParameter.xlsx", value_column_name="c")
Besides, users can also use the from_dataframe
function to create a tdict
,
but the tdict_type
must be specified as ID
, Relation
, or Data
.
from tab2dict import TabDict
import pandas as pd
df = pd.DataFrame({
"id_sector": [1, 1, 2, 3, 4],
"id_building_type": [1, 2, 6, 4, 5],
"unit": "count",
"value": [10, 20, 30, 40, 50]
})
building_number = TabDict.from_dataframe(df=df, tdict_type="Data")
With a TabKey
instance (tkey
), users can fetch data from a tdict
, as long as the tkey
has values for the tdict
's key_cols
.
The tkey
can know more than necessary, which supports the following flexibilities
- removing an index column from an input table will not cause code changes;
- adding a new index column in the input table may neither, as long as the
tkey
has value for the added column.
tkey = BuildingTabKey(id_sector=1, id_building_type=2, time_year=2030)
sectors.get_item(tkey) # -> "residential"
building_types.get_item(tkey) # -> "multiple family house"
relation_sector_building_type.get_item(tkey) # -> [1, 2]
building_stock.get_item(tkey) # -> 12000
energy_intensity_index.get_item(tkey) # -> 0.98
building_parameter_a.get_item(tkey) # -> 2
building_parameter_b.get_item(tkey) # -> 20
building_parameter_c.get_item(tkey) # -> 200
building_number.get_item(tkey) # -> 20
Finally, users can also create an empty tdict
by calling the create_empty_data_tdict
function.
The key_cols
must be speficied. The type of the tdict
is Data
.
In practice, the empty data tdict
s can be used for collecting model results by calling the set_item
or accumulate_item
functions.
from tab2dict import TabDict
building_number_result = TabDict.create_empty_data_tdict(key_cols=["id_sector", "id_building_type"])
tkey = BuildingTabKey(id_sector=1, id_building_type=1)
building_number_result.set_item(tkey, 10000)
tkey.id_building_type = 2
building_number_result.set_item(tkey, 20000)
building_number_result.accumulate_item(tkey, 20000)
tkey.id_building_type = 3
building_number_result.accumulate_item(tkey, 30000)
tkey.id_building_type = 1
building_number_result.get_item(tkey) # -> 10000
tkey.id_building_type = 2
building_number_result.get_item(tkey) # -> 40000
tkey.id_building_type = 3
building_number_result.get_item(tkey) # -> 30000
- Python: 3.8 or later
package | user | contributor |
---|---|---|
pandas | ✓ | ✓ |
openpyxl | ✓ | ✓ |
pytest | ✓ | |
pytest-cov | ✓ | |
black | ✓ | |
pylint | ✓ |
- 29.12.2023 - Initial Release (v0.0.1)
Author: @SongminYu.
License: MIT.