PipeOpDecode #835

Open
mb706 opened this issue Oct 3, 2024 · 3 comments
Labels
Type: New PipeOp Issue suggests a new PipeOp

Comments

@mb706
Collaborator

mb706 commented Oct 3, 2024

Inverts one-hot encoding: creates a factor column that indicates which of the input numeric columns has the maximal value.

Should have an argument treatment_encoding (init: FALSE): if TRUE, the result gets an additional level that is used when all cols are 0 (making this the inverse of PipeOpEncode with method = "treatment").
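A minimal sketch of the decoding step for a single group of columns, to make the two cases concrete. The helper decode_group and its ref_level argument are hypothetical names used only for illustration; how the reference level gets its label is not settled in this issue, and ties and NAs are handled in the simplest possible way here.

```r
# Hypothetical helper, not part of mlr3pipelines: decode one group of numeric
# indicator columns into a single factor.
decode_group <- function(cols, treatment_encoding = FALSE, ref_level = "ref") {
  m <- as.matrix(cols)
  # index of the column with the maximal value in each row (first one on ties)
  idx <- max.col(m, ties.method = "first")
  lvls <- colnames(m)
  out <- lvls[idx]
  if (treatment_encoding) {
    # rows where every column is 0 map to the additional reference level
    # (ref_level is an assumed argument, the issue does not name it)
    out[rowSums(m != 0) == 0] <- ref_level
    lvls <- c(ref_level, lvls)
  }
  factor(out, levels = lvls)
}

# one-hot input: every level has its own column
onehot <- data.frame(a = c(1, 0, 0), b = c(0, 1, 0), c = c(0, 0, 1))
decode_group(onehot)

# treatment-encoded input: level "a" has no column and shows up as all zeros
treat <- data.frame(b = c(0, 1, 0), c = c(0, 0, 1))
decode_group(treat, treatment_encoding = TRUE, ref_level = "a")
```

Both calls should give back a factor with values a, b, c; in the second call the first row is only recoverable because of the extra level.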

Should also have an argument group_pattern, a regular expression. The group_pattern is applied to all column names and the first regex capture group is extracted. Columns for which the extracted value differs are treated separately from each other, giving one result column per distinct value. The levels that are created then correspond to gsub(group_pattern, "", colnames()). Initialized as "^([^.]*)\\.".

The point here is that we may have columns x.a, x.b, x.c, y.a, y.b. The "^([^.]*)\\."-match matches "x" for the first three cols and creates a factor col x with levels a, b, and c. It then matches "y" for the last two cols, creating a factor col y with levels a and b. Should the user e.g. have columns x_a, x_b, ..., then this would need to be changed to "^([^_]*)_". Should the user not want any groups, and instead get levels x.a, x.b, ..., y.b in a single result column, the pattern would be "".

If the pattern is not "", we ignore all columns that do not match the group_pattern; I am assuming that this is what a user wants basically all of the time, even though it unfortunately undermines the affect_columns argument somewhat.
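The grouping rule described above can be sketched in base R as follows. build_colmaps is a hypothetical helper, not an existing mlr3pipelines function, and the name "target" for the single result column in the group_pattern == "" case is a placeholder, since the issue does not say what that column should be called.

```r
# Hypothetical helper: build the column-to-level maps implied by group_pattern.
build_colmaps <- function(cnames, group_pattern = "^([^.]*)\\.") {
  if (identical(group_pattern, "")) {
    # no grouping: every column name becomes a level of one result column;
    # "target" is an assumed placeholder name
    return(list(target = stats::setNames(cnames, cnames)))
  }
  matches <- regmatches(cnames, regexec(group_pattern, cnames))
  # a successful match yields the full match plus the first capture group
  matched <- lengths(matches) >= 2
  # columns that do not match the pattern are ignored
  cnames <- cnames[matched]
  groups <- vapply(matches[matched], `[`, character(1), 2)
  # level names are the column names with the matched prefix removed
  lvls <- stats::setNames(gsub(group_pattern, "", cnames), cnames)
  split(lvls, groups)
}

# "z" does not match the default pattern and is dropped
colmaps <- build_colmaps(c("x.a", "x.b", "x.c", "y.a", "y.b", "z"))
colmaps

# underscore-separated names need the adjusted pattern
build_colmaps(c("x_a", "x_b"), group_pattern = "^([^_]*)_")
```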

@mb706 mb706 added the Type: New PipeOp Issue suggests a new PipeOp label Oct 3, 2024
@mb706
Collaborator Author

mb706 commented Oct 8, 2024

suggestion for state:

  • named list, named by columns that are being created, with content for each such column:
    • named character, named by the names of the input columns, containing in each entry the name of the resulting factor level.
    • for treatment_encoding, maybe also include an entry with empty name, containing the label of the reference level.

also maybe the value of the treatment_encoding flag for prediction, since changing the hyperparameter after training is not allowed to have an effect.

probably good idea to use PipeOpTaskPreprocSimple.

@mb706
Collaborator Author

mb706 commented Oct 8, 2024

in the x.a, x.b, x.c, y.a, y.b example, the state would be

list(
  colmaps = list(
    x = c(x.a = "a", x.b = "b", x.c = "c"),
    y = c(y.a = "a", y.b = "b")
  ),
  treatment_encoding = FALSE
)
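A rough sketch of how a state of this shape could be used at prediction time. decode_with_state is a hypothetical helper; it only covers the treatment_encoding = FALSE case and returns just the decoded columns, without carrying over unaffected columns as a real PipeOp would have to.

```r
# the state from the example above
state <- list(
  colmaps = list(
    x = c(x.a = "a", x.b = "b", x.c = "c"),
    y = c(y.a = "a", y.b = "b")
  ),
  treatment_encoding = FALSE
)

# Hypothetical helper: decode every group described by state$colmaps.
decode_with_state <- function(data, state) {
  decoded <- lapply(state$colmaps, function(colmap) {
    # pick the input columns of this group, in the order stored in the state
    m <- as.matrix(data[, names(colmap), drop = FALSE])
    lvls <- unname(colmap)
    # the column with the maximal value decides the level (first one on ties)
    factor(lvls[max.col(m, ties.method = "first")], levels = lvls)
  })
  as.data.frame(decoded)
}

encoded <- data.frame(
  x.a = c(1, 0), x.b = c(0, 0), x.c = c(0, 1),
  y.a = c(0, 1), y.b = c(1, 0)
)
decoded <- decode_with_state(encoded, state)
decoded
```

Because the level set comes from the state rather than from the data, a level that never wins in the prediction data (b in column x here) is still present in the resulting factor.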

@mb706
Collaborator Author

mb706 commented Oct 8, 2024

maybe the treatment_encoding flag is not necessary in the state, since we can see this from the fact that there are entries in the colmaps with empty name.
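For illustration, such a check could look like the following. Both the example colmap and the helper has_reference_level are hypothetical; the sketch only assumes that the reference level is stored under an empty name, as suggested earlier in this thread.

```r
# Hypothetical colmap for treatment encoding: the entry with the empty name
# holds the reference level, which has no indicator column of its own.
colmap_treatment <- stats::setNames(c("a", "b", "c"), c("", "x.b", "x.c"))
colmap_onehot <- c(x.a = "a", x.b = "b", x.c = "c")

# Hypothetical helper: TRUE if the colmap carries a reference-level entry.
has_reference_level <- function(colmap) any(names(colmap) == "")

has_reference_level(colmap_treatment)
has_reference_level(colmap_onehot)
```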
