Hi,
I searched for existing documentation or discussions on how to run a sentence-embeddings model over two separate columns and did not find any. I was wondering if there are any recommendations or known gotchas on the topic. Say I have a data frame with a name column and an address column, and would like to use a RoBERTa model to compute sentence embeddings for both. The best I could come up with is the following:
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.{SentenceEmbeddings, XlmRoBertaEmbeddings}
import org.apache.spark.ml.Pipeline

// Builds a sentence-embeddings pipeline that reads from a single source column.
def createPipeline(source: String): Pipeline = {
  val documentAssembler = new DocumentAssembler()
    .setInputCol(source)
    .setOutputCol("document")

  val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

  val embeddings = XlmRoBertaEmbeddings
    .pretrained("xlm_roberta_base", "xx")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(false)

  // Average the token embeddings into one vector per document.
  val sentenceEmbeddings = new SentenceEmbeddings()
    .setInputCols(Array("document", "embeddings"))
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")

  val embeddingsFinisher = new EmbeddingsFinisher()
    .setInputCols("sentence_embeddings")
    .setOutputCols("finished_embeddings")
    .setOutputAsVector(true)
    .setCleanAnnotations(false)

  new Pipeline().setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings,
    sentenceEmbeddings,
    embeddingsFinisher
  ))
}
And then basically doing something like this:
val testDataWithNameEmbeddings = createPipeline("name")
  .fit(testData)
  .transform(testData)
  .select($"name", $"address", $"finished_embeddings".alias("name_embeddings"))

val testDataWithBothEmbeddings = createPipeline("address")
  .fit(testDataWithNameEmbeddings)
  .transform(testDataWithNameEmbeddings)
  .select($"name", $"address", $"name_embeddings", $"finished_embeddings".alias("address_embeddings"))
This appears to work, but feels... wrong? The existence of MultiDocumentAssembler and setInputCols APIs on several of the stages led me down a rabbit hole of trying out different approaches to see if I could annotate and tokenize multiple columns in one stage, but I hit a variety of issues and assertions for different components of the pipeline. For example, calling setInputCols on Tokenizer with an array containing more than one column results in:
IllegalArgumentException: requirement failed: setInputCols in REGEX_TOKENIZER_2889f26665ad expecting 1 columns. Provided column amount: 2. Which should be columns from the following annotators: document.
Closest thing I stumbled upon is this old issue where someone was trying to run multiple models in one pipeline, but if I try to add two embeddings stages in the pipeline spark-ml fails with:
IllegalArgumentException: requirement failed: Cannot have duplicate components in a pipeline.
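For reference, a minimal and untested sketch of what a single-pipeline variant might look like. It assumes that the duplicate-components check is triggered by the same stage object appearing twice in the stages array, so each column gets its own freshly created stages, and every intermediate column is prefixed with the source column name to avoid collisions. The helper name `stagesFor` and the `<source>_...` column names are hypothetical, not part of Spark NLP. Calling `pretrained` once per column presumably loads the model twice, so this may not be cheaper than the two-pass approach above.

```scala
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.{SentenceEmbeddings, XlmRoBertaEmbeddings}
import org.apache.spark.ml.{Pipeline, PipelineStage}

// Hypothetical helper: builds a fresh chain of stages for one source column.
// All intermediate and output columns are prefixed with the source name
// (an assumed naming scheme) so that two chains can live in one pipeline.
def stagesFor(source: String): Array[PipelineStage] = {
  val documentAssembler = new DocumentAssembler()
    .setInputCol(source)
    .setOutputCol(s"${source}_document")

  val tokenizer = new Tokenizer()
    .setInputCols(Array(s"${source}_document"))
    .setOutputCol(s"${source}_token")

  // A separate model instance per column; likely doubles memory use.
  val embeddings = XlmRoBertaEmbeddings
    .pretrained("xlm_roberta_base", "xx")
    .setInputCols(Array(s"${source}_document", s"${source}_token"))
    .setOutputCol(s"${source}_token_embeddings")
    .setCaseSensitive(false)

  val sentenceEmbeddings = new SentenceEmbeddings()
    .setInputCols(Array(s"${source}_document", s"${source}_token_embeddings"))
    .setOutputCol(s"${source}_sentence_embeddings")
    .setPoolingStrategy("AVERAGE")

  val embeddingsFinisher = new EmbeddingsFinisher()
    .setInputCols(s"${source}_sentence_embeddings")
    .setOutputCols(s"${source}_embeddings")
    .setOutputAsVector(true)
    .setCleanAnnotations(false)

  Array(documentAssembler, tokenizer, embeddings, sentenceEmbeddings, embeddingsFinisher)
}

// One pipeline holding both chains; no stage object is reused.
val pipeline = new Pipeline().setStages(stagesFor("name") ++ stagesFor("address"))
```

Under these assumptions, usage would be a single `pipeline.fit(testData).transform(testData)` followed by a `select` of `name`, `address`, `name_embeddings` and `address_embeddings`.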
I'm not sure how common this use case is, given that there don't seem to be other issues like it, but I would appreciate any thoughts on the topic.
Thanks!
Environment:
Spark 3.5.0, Scala 2.12, Spark NLP 5.5.1