
custom_token argument in step_tokenize() doesn't like it when main argument isn't x #248

Open · gaohuachuan opened this issue Oct 4, 2023 · 2 comments

@gaohuachuan

The problem

I created a function cn_seg() for Chinese word segmentation. The function takes a character vector as input and outputs a list of character vectors, as required. But when I set custom_token = cn_seg, it throws an error.

Reproducible example

words <- c("下面是不分行输出的结果", "下面是不输出的结果")

library(jiebaR)                           # For Chinese word segmentation
library(textrecipes)                      # For step_tokenize() etc. (also attaches recipes)

# Takes a character vector; returns a list of character vectors
cn_seg <- function(text) {
  engine <- worker(bylines = TRUE)
  segment(text, engine)
}

cn_seg(words)

cn_text <- tibble(words = c("下面是不分行输出的结果", "下面是不输出的结果"))

recipe(~ words, data = cn_text) |> 
  step_tokenize(words, custom_token = cn_seg) |> 
  show_tokens(content)
#> Error in `step_tokenize()`:
#> Caused by error in `token()`:
#> ! unused argument (x = data[, 1, drop = TRUE])
#> Run `rlang::last_trace()` to see where the error occurred.
@gaohuachuan changed the title from "step_tokenize() throw an error when it's used to token Chinese text" to "step_tokenize() throw an error when it's used to tokenize Chinese text" on Oct 4, 2023
@gaohuachuan changed the title from "step_tokenize() throw an error when it's used to tokenize Chinese text" to "step_tokenize() throws an error when it's used to tokenize Chinese text" on Oct 4, 2023
@EmilHvitfeldt (Member)

Hello @gaohuachuan! 👋 Thanks for reporting!

I found two things. Firstly, it wasn't documented, but the custom tokenization function is called with its input passed as the argument x, so the function's main argument must be named x. That should be fixed or documented correctly.
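
This is just base R's unused-argument error: calling any function with x = ... fails if it has no parameter named x. A minimal illustration, not specific to textrecipes:

f <- function(text) text
f(x = "hi")
#> Error in f(x = "hi") : unused argument (x = "hi")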

Secondly, you should reference the same variable in show_tokens() as you used in step_tokenize(), so it should be show_tokens(words) instead of show_tokens(content).

words <- c("下面是不分行输出的结果", "下面是不输出的结果")

library(jiebaR)
#> Loading required package: jiebaRD

cn_seg <- function(x) {                   # main argument renamed from text to x
  engine <- worker(bylines = TRUE)
  segment(x, engine)
}

cn_seg(words)
#> [[1]]
#> [1] "下面" "是"   "不"   "分行" "输出" "的"   "结果"
#> 
#> [[2]]
#> [1] "下面" "是"   "不"   "输出" "的"   "结果"

library(textrecipes)

cn_text <- tibble(words = c("下面是不分行输出的结果", "下面是不输出的结果"))

recipe(~ words, data = cn_text) |>
  step_tokenize(words, custom_token = cn_seg) |>
  show_tokens(words)
#> [[1]]
#> [1] "下面" "是"   "不"   "分行" "输出" "的"   "结果"
#> 
#> [[2]]
#> [1] "下面" "是"   "不"   "输出" "的"   "结果"
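
If renaming the argument inside cn_seg() isn't an option, wrapping it in an anonymous function whose first argument is named x should work as well (an untested sketch of the same idea, reusing the original cn_seg(text) definition):

recipe(~ words, data = cn_text) |>
  step_tokenize(words, custom_token = function(x) cn_seg(x)) |>
  show_tokens(words)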

@EmilHvitfeldt changed the title from "step_tokenize() throws an error when it's used to tokenize Chinese text" to "custom_token argument in step_tokenize() doesn't like it when main argument isn't x" on Oct 4, 2023
@EmilHvitfeldt added the feature, bug, and documentation labels and removed the feature label on Oct 4, 2023
@gaohuachuan (Author)

Thanks for your reply. The problem with my code was that the main argument wasn't named x.
