
custom_token argument in step_tokenize() doesn't like it when main argument isn't x #248

Open · gaohuachuan opened this issue Oct 4, 2023 · 2 comments

@gaohuachuan

The problem

I created a function cn_seg() for Chinese word segmentation. The function takes a character vector as input and outputs a list of character vectors, as required. But when I set custom_token = cn_seg, it throws an error.

Reproducible example

words <- c("下面是不分行输出的结果", "下面是不输出的结果")

library(jiebaR)                           # For Chinese word segmentation
library(textrecipes)                      # For step_tokenize() etc. (also attaches recipes)

# Takes a character vector; returns a list of character vectors
cn_seg <- function(text) {
  engine <- worker(bylines = TRUE)
  segment(text, engine)
}

cn_seg(words)

cn_text <- tibble(words = c("下面是不分行输出的结果", "下面是不输出的结果"))

recipe(~ words, data = cn_text) |> 
  step_tokenize(words, custom_token = cn_seg) |> 
  show_tokens(content)
#> Error in `step_tokenize()`:
#> Caused by error in `token()`:
#> ! unused argument (x = data[, 1, drop = TRUE])
#> Run `rlang::last_trace()` to see where the error occurred.
@gaohuachuan changed the title from "step_tokenize() throw an error when it's used to token Chinese text" to "step_tokenize() throw an error when it's used to tokenize Chinese text" on Oct 4, 2023
@gaohuachuan changed the title from "step_tokenize() throw an error when it's used to tokenize Chinese text" to "step_tokenize() throws an error when it's used to tokenize Chinese text" on Oct 4, 2023
@EmilHvitfeldt (Member)

Hello @gaohuachuan! 👋 Thanks for reporting!

I found two things. Firstly, it wasn't documented, but the custom tokenization function is called with its input passed as the argument x, so the function's main argument must be named x. That should be fixed or documented correctly.
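
This is just base R's unused-argument error: calling any function with x = ... fails if it has no parameter named x. A minimal illustration, not specific to textrecipes:

f <- function(text) text
f(x = "hi")
#> Error in f(x = "hi") : unused argument (x = "hi")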

Secondly, you should reference the same variable in show_tokens() as you used in step_tokenize(), so it should be show_tokens(words) instead of show_tokens(content).

words <- c("下面是不分行输出的结果", "下面是不输出的结果")

library(jiebaR)
#> Loading required package: jiebaRD

cn_seg <- function(x) {                   # main argument renamed from text to x
  engine <- worker(bylines = TRUE)
  segment(x, engine)
}

cn_seg(words)
#> [[1]]
#> [1] "下面" "是"   "不"   "分行" "输出" "的"   "结果"
#> 
#> [[2]]
#> [1] "下面" "是"   "不"   "输出" "的"   "结果"

library(textrecipes)

cn_text <- tibble(words = c("下面是不分行输出的结果", "下面是不输出的结果"))

recipe(~ words, data = cn_text) |>
  step_tokenize(words, custom_token = cn_seg) |>
  show_tokens(words)
#> [[1]]
#> [1] "下面" "是"   "不"   "分行" "输出" "的"   "结果"
#> 
#> [[2]]
#> [1] "下面" "是"   "不"   "输出" "的"   "结果"
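
If renaming the argument inside cn_seg() isn't an option, wrapping it in an anonymous function whose first argument is named x should work as well (an untested sketch of the same idea, reusing the original cn_seg(text) definition):

recipe(~ words, data = cn_text) |>
  step_tokenize(words, custom_token = function(x) cn_seg(x)) |>
  show_tokens(words)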

@EmilHvitfeldt changed the title from "step_tokenize() throws an error when it's used to tokenize Chinese text" to "custom_token argument in step_tokenize() doesn't like it when main argument isn't x" on Oct 4, 2023
@EmilHvitfeldt added the feature, bug, and documentation labels and removed the feature label on Oct 4, 2023
@gaohuachuan (Author)

Thanks for your reply. The problem with my code was that the main argument wasn't named x.
