Pandas Migration
Part of Google Summer of Code 2016, referent: @sstanovnik.
This page aims to be the progress monitor and later API migration instructions for the pandas migration. See sstanovnik/orange3:koalas for the development branch. See the pull request for code comments.
- Compatibility with existing API where sensible, deprecation notices where needed.
- New `Table` is a `pd.DataFrame`.
- SQL data source support (not strictly `pandas`).
- Nice code docs.
See this Google Sheet for something with more flexibility than this wiki's tables. It is basically a development timeline, including both research and TODOs (with detailed notes). Useful for exploring decisions in-depth.
- Orange now uses the same indexing as `pandas`. I'll leave the explanation to their documentation.
- If you try to change values in `table.X`, `table.Y` or `table.metas`, you'll get an error. The new way of setting data is directly, the `pandas` way. See the above indexing link.
- The table contains actual values directly, instead of e.g. integer indices of the discrete variable's values. This is much more intuitive, and the numeric data is transparently computed when using `X`, `Y` or `metas`.
- Read up on `pandas`' `Categorical` here, which is used for discrete variables, if you will be doing anything in-depth with them.
- `table.W` has been replaced with `table.weights`.
- `RowInstance`s have been removed. The new equivalent is `TableSeries`, which works similarly to `Table` for learning purposes.
- Filtering rows is no longer done with `Filter`s, but instead with `pandas`' filtering. Much better.
- A `Table` does not have an implicit `bool` value, so `if table` does not work. Use specific operators like `if table is not None`, `if not len(table)` and so on.
- Weights are now set through `Table.set_weights`, which accepts a scalar, a sequence, or an existing column name.
- Datetime operations on `TimeVariable` columns are now backed by `pandas`, which means they're much better than before.
- Sparse support is much better! However, the whole `SparseTable` is sparse now, not just the `X` part. `X`, `Y` and `metas` return dense matrices if they contain `StringVariable`s - this is a `scipy` limitation.
- `Variable.to_val` now transforms a descriptor into its numeric value: e.g. `"male"` into `1` where `variable.values == ["female", "male"]`.
- `Table.get_column_view` is no more, replaced by `pandas`' own attribute column access. If the column name is a valid Python identifier, `table.column_name` is the same as `table["column_name"]`.
- `Table.checksum` was removed in favour of `hash(Table)`.
- `Table.shuffle` was removed in favour of `pandas`' `t.sample(frac=1)`.
- A new `Table.merge` method was added - a wrapper for `pd.DataFrame.merge` which handles internal columns and the domain.
- No more `Table.from_table_rows`, use `pandas` indexing and slicing.
When this project is "complete", the work will not be over. What follows is a general guide to porting existing code to the new API (because of several breaking changes), what to be mindful of when fixing bugs that could be related to `pandas`, as well as some TODOs that will likely have to be completed to make the product fully stable.
As expected from a porting project, not much functionality was explicitly added. However, you get to use all of `pandas`' functionality for time series, `DataFrame` handling (e.g. `merge` is wrapped to handle Orange specifics), filtering, sparse features and other things. Expect a more stable codebase with new features and bugfixes being added by proxy by `pandas`.
Read the pandas subclassing instructions for an intro to subclassing `pandas`. This section only mentions Orange specifics.
Extend `_metadata` (do not overwrite it) if you want to specify any additional attributes. Any custom attribute, no matter how temporary, must be included, otherwise it won't work. If you define custom column names, include them in `_INTERNAL_COLUMN_NAMES`, which, again, you need to extend. Use (and maybe modify) `_is_orange_construction_path` to pass through constructions that weren't called implicitly - `pandas` uses the constructor every time we want a subset or anything similar.
Indexing as it was pre-pandas doesn't exist any more. It is replaced by pure pandas indexing. Read up on it here, especially the different behaviours and selectors (like `.ix`) and the notion of integer position vs. the index. In short, `t.iloc[i]` gets the i-th row of the table, whereas `t.loc[i]` returns the row where the index value (in pandas not necessarily an integer) is i. With unique indexing in Orange (explained later), the latter almost never works. Boolean indexing works with either `loc`, `iloc` or just `t[...]`.
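A minimal sketch of the difference, using a made-up three-row table (the column name and index values are hypothetical, chosen to make position and label visibly distinct):

```python
import pandas as pd

# Non-contiguous index, like Orange's globally unique index.
t = pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7]},
                 index=[10, 20, 30])

first_by_position = t.iloc[0]     # row at integer position 0
row_by_label = t.loc[20]          # row whose index *value* is 20
subset = t[t.sepal_length > 4.8]  # boolean indexing on the table directly
```

Note that `t.loc[0]` would raise a `KeyError` here, because no row has the index value 0 - exactly the situation Orange's unique indexing produces.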
The default `__getitem__` (and `__setitem__`) work on columns. `t[0]` does not return the first row of the table, but the column with the name `0`. See iterating for an elaboration on this.
The implementation before pandas had a notion of a unique index (think `_init_ids`). The pandas implementation formalizes that further and sets the pandas index (think `.loc`) to be globally unique. This is done transparently, no modification needed, and works with subsets and such. If tests fail when run in a group, but not by themselves, it's almost always an indexing issue, as that is an omnipresent global state.
The return type depends on the selector. Selecting a single row with `t.iloc[0]` yields a `Series`. Selecting a single row with `t.iloc[[0]]` or multiple with `t.iloc[list_of_positions]` yields a `Table`. This distinction is largely inconsequential---read the series section---but is important to keep in mind.
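The scalar-vs-list selector distinction in plain pandas terms (the table content is a throwaway example):

```python
import pandas as pd

t = pd.DataFrame({"a": [1, 2, 3]})

single = t.iloc[0]        # scalar selector -> a Series (one row)
wrapped = t.iloc[[0]]     # list selector -> a one-row DataFrame
several = t.iloc[[0, 2]]  # list selector -> a two-row DataFrame
```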
For columns with valid Python names, attribute indexing works. Example: for a column named `sex`, `t.sex` works and is equal to `t["sex"]` and `t[t.domain["sex"]]`. However, for a column named `column with space`, there is no attribute, not even with spaces transformed to underscores.
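In plain pandas terms (column names here are illustrative):

```python
import pandas as pd

t = pd.DataFrame({"sex": ["male", "female"],
                  "column with space": [1, 2]})

by_attribute = t.sex             # works: "sex" is a valid identifier
by_key = t["sex"]                # always works
spaced = t["column with space"]  # the only way to reach this column
# t.column_with_space would raise AttributeError - no underscore magic.
```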
While `Table`s should be immutable, sometimes one would like to change a value (or a subset of them). Setting the top-left-most element can be done with `t.iloc[0, 0]` or `t.loc[t.index[0], t.columns[0]]`, or with the explicit index and column if available. This can NOT be done with `t.iloc[0]["col1"]`, `t.iloc[0].loc["col1"]`, `t.iloc[0].iloc[0]` or similar, as in pandas that most likely creates a copy of the object (but not always).
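A short sketch of the safe patterns (the column names `col1`/`col2` are made up):

```python
import pandas as pd

t = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Correct: one indexing call reaches the underlying data.
t.iloc[0, 0] = 10
t.loc[t.index[1], "col2"] = 40

# Wrong: chained indexing such as t.iloc[0]["col1"] = ... may write to a
# temporary copy; pandas flags this with SettingWithCopyWarning.
```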
A `pd.DataFrame` has no implicit boolean value. Do not write `if data` or `if not data`; instead use `if data is not None`, `if data.isnull().any().any()` or similar. Most widgets have this error.
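The explicit checks, spelled out on a tiny example frame:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({"a": [1.0, np.nan]})

# `if data:` raises ValueError - the truth value of a DataFrame is ambiguous.
exists = data is not None                # presence check
non_empty = len(data) > 0                # row-count check
has_missing = data.isnull().any().any()  # missing-value check
```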
This topic is undecided, iteration may preserve previous Orange behaviour if that is decided upon.
Iterating through a `pd.DataFrame` gives column names, not rows, consistent with `t[...]` selection. However, `len(table)` gives the number of rows in the table. Blame pandas. To iterate over rows, use `t.iterrows()`, which yields tuples of `(row_index, row_series)`. See `iterrows`, `itertuples` and `iteritems` for alternatives.
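Both behaviours side by side (the two-column frame is a throwaway example):

```python
import pandas as pd

t = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

columns = list(t)                         # iteration yields column names
rows = [row for _, row in t.iterrows()]   # rows come as (index, Series) pairs
```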
The domain hasn't changed much. What has changed is that `Variable` now extends `str`, which allows `Variable`s to be used in table column names and for indexing with `col = t[variable]`. Column names are pure strings by default, but can be `Variable`s - this is discouraged, because strings work just as well if not better.
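Why extending `str` makes this work - a minimal sketch with a hypothetical stand-in class (Orange's real `Variable` carries much more):

```python
import pandas as pd

class Variable(str):
    """Hypothetical stand-in: a Variable that is also a string."""
    pass

var = Variable("sex")
t = pd.DataFrame({var: ["male", "female"]})

col = t[var]     # indexing with the Variable object...
same = t["sex"]  # ...selects the same column as the plain string
```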
In pandas, `Series` objects are one-dimensional (where `DataFrame` is 2D and `Panel` is 3D). They can be either rows or columns and behave the same. They're almost exactly like the previous `RowInstance` and provide `.X`, `.Y`, `.metas` and `.weights` for compatibility. The domain is transparently passed to the series when selected from a `Table`.
The data class structure has changed significantly. See this picture for a general overview. Don't use `from Orange.data.table import Table`, but just `from Orange.data import x`, as everything is pulled to that level.
`TableBase` and `SeriesBase` are now the base classes, like the now-defunct `Storage`. This is important for widget connections. All Orange data storage should extend from those. `Table` is dense, `SparseTable` is sparse, but `SparseTable` does not extend `Table`. `SqlTable` extends `Table`. `Corpus` extends `SparseTable`.
Some special columns have been introduced. To simplify just about everything, weights are now always present and default to 1. The column isn't included in the domain, but instead just exists in the table columns as `TableBase._WEIGHTS_COLUMN_NAME`, right now `__weights__`. This means weights are automatically transferred to subsets, and changing weights on a view also changes them in the parent.
`Corpus` has some other special columns; everything should be included in `cls._INTERNAL_COLUMN_NAMES`.
Weights are now accessible solely by `.weights`, no longer `.W`.
Previously, all data was held in `np.ndarray`s as X/Y/metas/weights. Now, these are computed properties with data stored in the table itself. An important distinction is that the table now holds actual values, not their indices and such. Example: for a discrete variable column, the table now contains `male` and `female`, not `0` and `1`. This allows `t[t.sex == 'male']`.
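A sketch in plain pandas of what "actual values in the table, numeric values on demand" means (the `sex` column and its category order mirror the `["female", "male"]` example above):

```python
import pandas as pd

# The table stores the real category labels, not 0/1 codes...
t = pd.DataFrame({"sex": pd.Categorical(["male", "female", "male"],
                                        categories=["female", "male"])})

males = t[t.sex == "male"]   # ...so filters read naturally.
codes = t.sex.cat.codes      # the numeric view, as X/Y/metas would expose it
```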
X, Y, metas and weights are computed properties which do not give the variable descriptors, but rather the computed values - the same as they did before. Taking the above example, `t.Y` with a `class_var` of `DiscreteVariable("sex")` would return a column of zeros and ones. The returned `np.array`s are marked as immutable and attempting to mutate them raises an error.
As before, X, Y and metas always return 2D matrices, except for Y, which returns 1D if there is only one class variable. `weights` always returns 1D, never `None`.
I feel like this is a very major change that needs to be emphasised, emphasised, emphasised. `Table.X`, `Table.Y`, `Table.metas` and `Table.weights` are read-only, computed properties, and cannot be set. Set or modify individual columns the `pandas` way if you need to.
Discrete variables are automatically converted into pandas `Categorical` columns. This is the equivalent of R's `factor` and tightly constrains values to the predefined set. As such, you cannot modify an element (or a row) to a value that doesn't exist in the registered values, at least not before modifying those. Upon creation, `DiscreteVariable.values` is synced with the column's allowed values, with the proper ordering flag. Care must be taken when appending rows to tables, but this shouldn't pose a problem in most cases.
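The constraint in action, as a plain-pandas sketch (the values `"a"`/`"b"`/`"c"` are arbitrary; depending on the pandas version the failed assignment raises `ValueError` or `TypeError`):

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b"], categories=["a", "b"]))

# Assigning a value outside the registered categories fails...
try:
    s.iloc[0] = "c"
    rejected = False
except (ValueError, TypeError):
    rejected = True

# ...unless the category is registered first.
s = s.cat.add_categories(["c"])
s.iloc[0] = "c"
```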
pandas has native `datetime` columns, in the same way as the categorical column type is `category`. This allows a whole bunch of nice temporal processing functionality to be inherited from pandas. Also, the following works: `t[t.timecol > '2015-06-07']`. Time variables are automatically parsed and their time zones registered upon creating/reading a table. Other functionality is inherited directly from pandas. The numeric value is unix time in seconds; internally, in the table, time values are represented by a native pandas data type.
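Both points as a plain-pandas sketch - the string comparison from above, plus one way to recover unix seconds from the nanosecond-based storage (`timecol` and the dates are made up):

```python
import pandas as pd

t = pd.DataFrame({"timecol": pd.to_datetime(["2015-06-01", "2015-06-10"])})

recent = t[t.timecol > "2015-06-07"]  # string comparison against datetimes works

# datetime64[ns] stores nanoseconds since the epoch; divide to get seconds.
unix_seconds = t.timecol.astype("int64") // 10**9
```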
Filters have been removed, use pandas filtering now. For Select Rows, a filter shim that is required for the GUI was constructed.
An important note right off the bat here is that proper sparse functionality requires (the yet unreleased) pandas 0.19.0
, which fixes a bucketload of bugs and adds proper multi-type containers.
Sparse now very likely works better than before. A big difference is that the entire table is sparse, not only `X`. This unifies behaviour for a likely negligible performance loss. A good migration example is in biolab/orange3-text#97.
Tables now use `.is_sparse` and `.is_dense` instead of `.density` returning some not-quite-enum. The previous `Table.DENSE|SPARSE|SPARSE_BOOL` do not exist any more.
The density of a dense and a sparse table is defined as the number of undefined values in the table. For dense, this is computed with `.isnull()`; for sparse, this uses the default `pandas` functionality, which takes into account `fill_value` (always `np.nan` for us).
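For the dense case, the `.isnull()`-based computation looks roughly like this (a sketch on a made-up frame; the count and the fraction are both shown since either may be what a caller wants):

```python
import pandas as pd
import numpy as np

t = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

missing = t.isnull().sum().sum()          # total count of undefined values
missing_share = t.isnull().values.mean()  # fraction of undefined values
```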
Because Orange historically broke `__new__` to enable a MaGiC signature, some compatibility shims had to be employed to conform to `pandas`. `__new__` and `__init__` check if they are being called from pandas internals in quite a roundabout way: checking for pandas-only `args` and `kwargs` signatures (detailed description in the comment block above the relevant code). If so, the Orange constructor is skipped entirely and only the `pandas` part is used.
Why do we need this complexity? Because `pandas` calls the constructor fairly regularly (it even has something like this in its own code called `fastpath`), and because each slicing or `DataFrame` op returns a new object (with possibly shared data) - and to do that, it calls the constructor. To transfer attributes and such, `pandas` has `__finalize__` (that's not Python native).
When subclassing `pandas` objects, you have to override `_constructor`, `_constructor_sliced` for dimensionality reduction and `_constructor_expanddim` for dimensionality expansion. These normally just return the respective classes, but we included a domain transfer mechanism so `SeriesBase` objects retain `TableBase` domains and properties, enabling X/Y/metas/weights.
To transfer custom properties when slicing etc., all property names must be added to `_metadata`. After this, `pandas` will automagically transfer those properties, but not when using some `pandas` global functions, such as `pd.concat`. When subclassing `Table` or similar, remember to extend the parent's `_metadata`, not overwrite it.
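A minimal sketch of the whole pattern - `MiniTable` and its `domain` attribute are stand-ins for illustration, not Orange's real classes:

```python
import pandas as pd

class MiniTable(pd.DataFrame):
    # Extend (never overwrite) in further subclasses.
    _metadata = ["domain"]

    @property
    def _constructor(self):
        # pandas calls this for every op returning a new 2D object;
        # __finalize__ then copies the attributes listed in _metadata.
        return MiniTable

t = MiniTable({"a": [1, 2, 3]})
t.domain = "some domain"   # stored as an attribute, not a column
subset = t[t.a > 1]        # the custom attribute survives slicing
```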
The MRO of the new classes is Orange first, `pandas` second. This means you can override any 2D pandas functionality by redefining it in `TableBase`.
Generally, `Table(df)` works via `Table.from_dataframe`. This infers the domain, but you can pass one to skip that.
Unneeded. `Value` was a remnant of C; `Instance` (`RowInstance` in particular) now has an analog in `SeriesBase`. `.columns` is not needed because pandas kind of already supports that with attribute access.
This is the ugliest part of the entire project. I haven't refactored much except what was urgently needed, because there is no future with this approach, which falls apart when you poke it in the wrong way. `len(SqlTable)` now returns the number of downloaded and stored rows, not the length of the database. This was needed because the data is now stored in `pandas` and its internals require `len` to be sensible. We have `approx_len` and `exact_len` for database lengths.
A lot of cython code was removed and replaced with pandas. This includes contingencies, value counts, distributions and statistics.
Should be faster. Also used for Excel files. A slightly different approach is used to try to infer the Orange header: first, the first 3 rows are read, then the rest. Check out `CsvReader` in `data/io.py` if you want to know more.
- When a test fails when run in a group, but not when run by itself, it's an indexing issue (see the global indexing section).
- When there is a column mismatch, check for columns from `_INTERNAL_COLUMN_NAMES`.
- Are you constructing matrices (ad-hoc) with numeric values of otherwise textual discrete attributes? Don't do that. Also keep in mind that mixing strings and numbers in numpy arrays forces the whole dtype to `object`; use an ad-hoc list instead.
- Tables don't have an implicit boolean value; as a consequence, `assertEqual(t1, t2)` doesn't work because of the implicit bool of a boolean matrix.
- Try not to use `"?"` for invalid values, use `np.nan`.
- Using proper pandas indexing? See indexing above.
- Iterating properly? See iterating above.
- Comparison of `Table`s with `.equals` failing? Check if the indexes are different, likely due to global indexing. If so, override one of them.
- Can't read a small, hand-written CSV file in a test? Use `@mock.patch("Orange.data.io.CSVReader.DELIMITERS", your_delimiter)` to force the sniffer to output a specific delimiter.
- Use `setUp` instead of `setUpClass` to avoid funny inter-test state.
- Does `pandas` maybe not return the correct data type? It could be a pandas bug; subclassing isn't completely there yet.
In rough order of prominence, descending.
- Indexing. Use `.loc`, `.iloc` and similar, see the dev note for more.
- `Table`s now hold actual values instead of integer/float descriptors.
- `Variable.to_val` converts from a descriptor to a value used in X/Y/metas.
- X/Y/metas aren't settable any more; they are computed properties (not even views).
- Iterating doesn't work the way you'd expect, see the dev note.
- No more `.W`, always `.weights`.
- No implicit boolean value for the table.
- Were in-place, are not any more: `Corpus.extend_corpus`.
- Removed `Table.from_table_rows`, use proper `pandas` indexing and slicing.
- "Deleting" rows works by selecting an opposite subset, not by `del t[i]`.
- Inserting and extending do not exist any more, as the row order is not important - use `Table.concatenate`.
- Due to the use of `Categorical`s, discrete variables are much more constrained.
- No more `Filter`s, `Instance`s (except in `sql/compat`, ugly) and `Value`s.
- `SeriesBase` has `.X`, not `.x` like `RowInstance` had previously.
- `len(SqlTable)` now returns the number of downloaded and stored rows, not the length of the database.
- `Table.checksum` removed, use `hash(Table)`.
- `Table.shuffle` removed, use `t.sample(frac=1)`.
- Go through the whole codebase and see where there could be very nice pandas functions used, instead of some weird workarounds using numpy and nulls. This is a big endeavour, but it would speed up Orange and ensure stability.
- Remove the whole `SqlTable` shebang and create a new Apache Spark addon.
  - Completely separate widgets, with a transformation/sampler widget to transform the Spark structure into a `Table`.
  - Don't try to maintain compatibility with `Table`, it's too much work. Create a new table-like structure that encloses Spark's `DataFrame` (not compatible with pandas') and has a domain and other needed things.
  - The most work would likely be with widgets, so some plan on how to use existing visualization widgets (with subclassing) would be nice. Just changing `setData` would be ideal.
- Fix basic stats, distributions, contingency. They have weird call patterns with even weirder structs; decide how to clean that up. pandas has `.describe()`, so use that.
  - Contingency may need to be directed back to the cython implementation for speed.
- Domain inference for sparse matrices.
- Performance improvements:
  - See how many copies of tables are made. Maybe too many, because of some old pipeline that now copies everything unnecessarily - likely due to the fact that `X`, `Y` and `metas` aren't primary storage any more.
  - LOO has problems. Actually, all parallel things. Does importing Orange take a while and is slow because of that?
- Any can't-see-the-forest-for-the-trees issues that I missed.
- See what has changed in `pandas 0.19.0` and adapt as needed.
- Transform `.Y` to return 2D.
- Improve `TimeVariable` processing (timezone discovery could likely be faster).
- Check widgets' inputs, some things may have broken because of the table class structure change.