Make all imputation methods consistent in regard to encoding requirements #824

nicolassidoux · 2024-11-13T09:35:33Z

Description of feature

Imputation requires proper encoding of the data, but at the moment, there is no consistent strategy about what to do if a passed AnnData is not encoded. Some methods throw an exception, while some others silently encode the data with an arbitrary method.

It would probably be better to just throw an exception and let the caller decide what to do instead of altering the data.

| Function | Encoding required | Action if not encoded |
| --- | --- | --- |
| `explicit_impute` | No | None |
| `simple_impute` | Strategy-dependent | Raise `ValueError` |
| `knn_impute` | Yes | Perform ordinal encoding |
| `miss_forest_impute` | Yes | Perform ordinal encoding |
| `mice_forest_impute` | Yes | Perform ordinal encoding |

So I suggest modifying `knn_impute`, `miss_forest_impute` and `mice_forest_impute` so that they also raise an exception when the data is not encoded.
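A minimal sketch of the proposed behavior, assuming the data is available as a NumPy matrix plus a list of column names. The helper name `raise_if_not_encoded` and its signature are hypothetical and not part of ehrapy; the point is only that the check reports the offending columns and leaves the data untouched.

```python
import numbers

import numpy as np


def raise_if_not_encoded(X: np.ndarray, var_names: list[str]) -> None:
    """Raise ValueError naming every column that holds non-numerical values.

    Hypothetical helper: NaN counts as numerical, strings and other objects do not.
    """
    offending = [
        name
        for name, column in zip(var_names, X.T)  # X.T iterates over columns
        if not all(isinstance(value, numbers.Real) for value in column)
    ]
    if offending:
        raise ValueError(
            f"Columns {offending} contain non-numerical values; "
            "encode the data before imputing."
        )
```

With such a check at the top of each imputation function, the caller chooses the encoding explicitly instead of getting an ordinal encoding applied silently.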


nicolassidoux · 2024-11-15T16:11:03Z

Comments so far (note: I haven’t tested anything yet)

Imports from protected members
- I noticed that preprocessing/_imputation.py contains imports of protected functions, some of them from protected modules:
```
from ehrapy.anndata.anndata_ext import _get_column_indices  
```
  - _get_column_indices is primarily used outside its module and could be valuable to the public API. Since the documentation is already in place, I suggest making it public.
```
from ehrapy.core._tool_available import _check_module_importable  
```
  - _check_module_importable is only used (outside its module) in preprocessing/_imputation.py to check for the presence of sklearnex. If conditional execution based on this is truly necessary, I propose making _check_module_importable public across the entire ehrapy module.
  - For consistency, I also recommend making _shell_command_accessible module-public. To align naming conventions, I renamed _tool_available to _utils_available to match _utils_doc (previously _doc_util).
Return Value Behavior
- Some functions return an AnnData object only if the copy argument is set. I believe this is a mistake. The type hints specify that the functions should consistently return the AnnData object passed to them, regardless of whether it was copied at the start.
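The intended contract could look roughly like the sketch below. `Table` is a hypothetical stand-in for AnnData and `impute_zero` is an invented toy imputer; both exist only to keep the example self-contained and are not ehrapy code.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Table:
    """Hypothetical stand-in for AnnData; None marks a missing value."""

    X: list = field(default_factory=list)


def impute_zero(data: Table, *, copy: bool = False) -> Table:
    """Replace missing values with 0 and always return the imputed object."""
    if copy:
        # Work on a copy so the caller's object stays untouched.
        data = deepcopy(data)
    data.X = [[0 if value is None else value for value in row] for row in data.X]
    # Returned in both cases, so the return type does not depend on `copy`.
    return data
```

With `copy=False` the returned object is the one that was passed in; with `copy=True` it is the imputed copy. Either way the caller can write `result = impute_zero(data)` without checking for `None`.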
Use of Progress
- I noticed that some Progress objects are used, but as far as I know, we don’t have any way to track the progress of imputations. To simplify the code, I recommend removing them.
Improvements to _get_non_numerical_column_indices (preprocessing/_imputation.py)
- I rewrote _is_float_or_nan to eliminate redundancy
- I added _is_float_or_nan_row to avoid the need for np.vectorize. This change also resolves the type mismatch warnings caused by the use of np.apply_along_axis with the result of np.vectorize.
Documentation
- I’m generally skeptical about documenting the exceptions a function can raise unless enforced by a compiler (e.g., in Java).
  - If we want to document all possible exceptions, we’d have to include everything that could be raised by nested calls, which is practically impossible to maintain—especially when external libraries are involved.
  - If we don’t want to do this, what’s the point of an incomplete list?
- In my opinion, documenting exceptions is only relevant in two cases:
  1. The function itself handles all exceptions from its calls and raises a limited, well-defined set of exceptions.
  2. The function raises custom exceptions that users might need to catch specifically.

nicolassidoux · 2024-11-20T18:58:56Z

I’ve updated the branch following my tests.

In test_imputation.py, I added a _base_check_imputation function designed to ensure that every imputation test meets the following criteria:

- No missing data remains after the imputation.
- No non-missing data is altered during the imputation process.
- Only the targeted subset is modified during partial imputations, even if the data in other subsets is missing.
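The three criteria can be expressed as assertions along these lines. This is a sketch under stated assumptions, not the `_base_check_imputation` from the branch: it works on plain float arrays where NaN marks missing data, and the name `check_imputation` and its parameters are hypothetical.

```python
from typing import Optional, Sequence

import numpy as np


def check_imputation(
    before: np.ndarray,
    after: np.ndarray,
    imputed_cols: Optional[Sequence[int]] = None,
) -> None:
    """Assert the three criteria for float arrays where NaN means missing."""
    n_cols = before.shape[1]
    if imputed_cols is None:
        imputed_cols = list(range(n_cols))  # full imputation
    imputed_cols = list(imputed_cols)
    untouched_cols = [c for c in range(n_cols) if c not in imputed_cols]

    # 1. No missing data remains in the imputed columns.
    assert not np.isnan(after[:, imputed_cols]).any(), "missing values remain"

    # 2. Values that were present before the imputation are unchanged.
    present = ~np.isnan(before)
    assert np.array_equal(before[present], after[present]), "observed values changed"

    # 3. Columns outside the targeted subset are identical, NaN included.
    assert np.array_equal(
        before[:, untouched_cols], after[:, untouched_cols], equal_nan=True
    ), "untargeted columns changed"
```

A test would take a copy of the matrix before imputing, run the imputation, and pass both arrays (plus the targeted column indices for a partial imputation) to the check.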

Additionally, I wrote dedicated tests to validate the _base_check_imputation function itself. Test-ception, if you will.

Let me know your thoughts before I proceed with creating a PR.

nicolassidoux added the enhancement New feature or request label Nov 13, 2024

nicolassidoux self-assigned this Nov 13, 2024

nicolassidoux mentioned this issue Nov 21, 2024

Encoding consistency with imputations #807

Closed
