PipeOpDecode #835
Labels
Type: New PipeOp
Issue suggests a new PipeOp
Comments
suggestion for
also maybe the content of the probably good idea to use PipeOpTaskPreprocSimple.
in the
list(
  colmaps = list(
    x = c(x.a = "a", x.b = "b", x.c = "c"),
    y = c(y.a = "a", y.b = "b")
  ),
  treatment_encoding = FALSE
)
maybe
Inverts one-hot encoding: creates a factor column that indicates which of the input numeric columns has the maximal value.
Should have an argument `treatment_encoding` (init: `FALSE`): if `TRUE`, it includes an additional level for rows where all cols are 0 (becoming the inverse of PipeOpEncode with `method == "treatment"`).
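To make the intended behaviour concrete, here is a rough sketch of decoding a single group of columns. `decode_group()` and its `ref_level` argument are hypothetical names made up for illustration (the issue does not say what the additional level should be called), and ties are resolved by taking the first maximal column, which is also just an assumption.

```r
# Hypothetical helper, not part of mlr3pipelines: decode one group of
# one-hot (or treatment) encoded numeric columns into a factor.
decode_group = function(cols, treatment_encoding = FALSE, ref_level = "ref") {
  mat = as.matrix(cols)
  lvls = colnames(mat)
  # index of the column holding the maximal value in each row
  # (assumption: the first such column wins on ties)
  idx = max.col(mat, ties.method = "first")
  out = lvls[idx]
  if (treatment_encoding) {
    # rows where every column is 0 get the additional (reference) level;
    # the name "ref" is a placeholder, not something the issue specifies
    out[rowSums(mat != 0) == 0] = ref_level
    lvls = c(ref_level, lvls)
  }
  factor(out, levels = lvls)
}

df = data.frame(a = c(1, 0, 0), b = c(0, 1, 0))
print(decode_group(df))
print(decode_group(df, treatment_encoding = TRUE))
```

Without `treatment_encoding`, the all-zero third row simply falls back to the first column; with it, that row is mapped to the extra level instead.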
Should also have an argument `group_pattern`, a regular expression. The `group_pattern` is applied to all col names and the first regex group is extracted. All columns that have a different value here are treated separately from each other. The levels that are created then correspond to `gsub(group_pattern, "", colnames())`. Initialized as `"^([^.]*)\\."`.

The point here is that we may have columns `x.a`, `x.b`, `x.c`, `y.a`, `y.b`. The `"^([^.]*)\\."` match extracts `"x"` for the first three cols and creates a factor col with levels `a`, `b`, and `c`. It then extracts `"y"` for the last two cols, creating a factor col with levels `a` and `b`. Should the user e.g. have columns `x_a`, `x_b`, ..., then this would need to be changed to `"^([^_]*)_"`. Should the user not want any groups, and instead get levels `x.a`, `x.b`, ..., `y.b` in a single result column, the pattern would be `""`.

If the pattern is not `""`, we ignore all columns that do not match the `group_pattern`; I am assuming that this is what a user wants basically all of the time, even though it unfortunately undermines the `affect_columns` argument somewhat.
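The grouping rule described above could be sketched roughly as follows, using base R only. `make_colmaps()` is a hypothetical helper name; it returns a list with one named character vector per output column (names are the input columns, values are the resulting levels). The sketch assumes that a non-empty pattern contains at least one capture group.

```r
# Hypothetical helper, not part of mlr3pipelines: derive the column-to-level
# mapping from column names and a group_pattern.
make_colmaps = function(cn, group_pattern = "^([^.]*)\\.") {
  if (identical(group_pattern, "")) {
    # no grouping: a single result column whose levels are the full col names
    return(list(setNames(cn, cn)))
  }
  # columns that do not match the pattern are ignored
  cn = cn[grepl(group_pattern, cn)]
  # the first regex group identifies the output column
  m = regmatches(cn, regexec(group_pattern, cn))
  groups = vapply(m, function(x) x[[2]], character(1))
  # what remains after removing the match becomes the level name
  lvls = gsub(group_pattern, "", cn)
  lapply(split(seq_along(cn), groups), function(i) setNames(lvls[i], cn[i]))
}

str(make_colmaps(c("x.a", "x.b", "x.c", "y.a", "y.b", "z")))
str(make_colmaps(c("x_a", "x_b"), group_pattern = "^([^_]*)_"))
str(make_colmaps(c("x.a", "y.b"), group_pattern = ""))
```

In the first call the column `z` is dropped because it does not match the default pattern, which is the behaviour that interacts awkwardly with `affect_columns`.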