I created a function cn_seg() for Chinese word segmentation. The function takes a character vector as input and outputs a list of character vectors, as required. But when I set custom_token = cn_seg, it throws an error.
Reproducible example
library(jiebaR)      # Chinese word segmentation
library(tibble)      # for tibble()
library(recipes)     # for recipe()
library(textrecipes) # for step_tokenize() and show_tokens()

# Custom tokenizer: takes a character vector, returns a list of character vectors
cn_seg <- function(text) {
  engine <- worker(bylines = TRUE)
  segment(text, engine)
}

words <- c("下面是不分行输出的结果", "下面是不输出的结果")
cn_seg(words)

cn_text <- tibble(words = c("下面是不分行输出的结果", "下面是不输出的结果"))

recipe(~ words, data = cn_text) |>
  step_tokenize(words, custom_token = cn_seg) |>
  show_tokens(content)
#> Error in `step_tokenize()`:
#> Caused by error in `token()`:
#> ! unused argument (x = data[, 1, drop = TRUE])
#> Run `rlang::last_trace()` to see where the error occurred.
I found two things. Firstly, it isn't documented, but it appears that the custom tokenization function is expected to take its input via an argument named x. That should either be fixed or documented correctly.
Secondly, you should reference the same variable in show_tokens() as you used in step_tokenize(), so it should be show_tokens(words) instead of show_tokens(content).
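Putting those two points together, here is a minimal sketch of a workaround, assuming the custom tokenizer really is called with its input passed as x (as the error message suggests):

# Rename the main argument to `x` so it matches how step_tokenize()
# appears to call the custom tokenizer, and reference the tokenized
# column `words` in show_tokens().
cn_seg <- function(x) {
  engine <- worker(bylines = TRUE)
  segment(x, engine)
}

recipe(~ words, data = cn_text) |>
  step_tokenize(words, custom_token = cn_seg) |>
  show_tokens(words)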
EmilHvitfeldt changed the title to: custom_token argument in step_tokenize() doesn't like it when main argument isn't x (Oct 4, 2023)