R sparklyr extension for using the John Snow Labs Spark NLP library.
Install from this GitHub repository using:

```r
# latest development version
remotes::install_github("r-spark/sparknlp")

# a specific tagged release
remotes::install_github("r-spark/[email protected]")

# the dev branch
remotes::install_github("r-spark/sparknlp@dev")
```
The following features/annotators have not been implemented yet:
- AlbertForSequenceClassification
- AnnotationMerger
- BertForSequenceClassification
- CamemBertEmbeddings
- DeBertaEmbeddings
- DeBertaForSequenceClassification
- DeBertaForTokenClassification
- DistilBertForSequenceClassification
- Doc2Vec
- GPT2Transformer
- LongformerForSequenceClassification
- MedicalBertForTokenClassification
- RoBertaForSequenceClassification
- XlmRoBertaForSequenceClassification
- XlnetForSequenceClassification
- WordSegmenter

The following licensed (healthcare) features/annotators have not been implemented yet:

- AnnotationMerger
- ChunkKeyPhraseExtraction
- ChunkMapperApproach
- ChunkSentenceSplitter
- DeIdentification
- EntityChunkEmbeddings
- MedicalBertForSequenceClassification
- MedicalBertForTokenClassifier
- MedicalDistilBertForSequenceClassification
- StructuredDeIdentification
- TFGraphBuilder
- ZeroShotRelationExtraction
There are a lot of examples in R notebooks inside the `examples` directory. I recommend starting with the notebooks in `tutorials/certification_trainings`. The `examples` directory structure here follows the notebook examples found at spark-nlp-workshop. Note that not all of the Jupyter notebooks found there have been ported yet, but all of the functionality is still available in the package.
The package has a default Spark NLP version that will be used when a Spark session is created with `spark_connect()`. This is usually the latest version of Spark NLP.
| R package version | Default Spark NLP version |
|---|---|
| 0.2.x | 3.0.1 |
| 0.3.x | 3.0.2 |
| 0.5.x | 3.0.3 |
| 0.6.x | 3.1.0 |
| 0.7.x | 3.1.1 |
| 0.8.x | 3.1.2 |
| 0.9.x | 3.3.0 |
| 0.10.x | 3.3.1 |
| 0.11.x | 3.3.4 |
| 0.12.x | 3.4.0 |
| 0.13.x | 3.4.1 |
| 0.14.x | 3.4.2 |
| 0.15.x | 3.4.4 |
| 0.16.x | 4.2.0 |
The function `nlp_version()` will show you the version that will be used. If you wish to change the version, call `set_nlp_version()` before connecting to Spark. This changes only the version of the Spark NLP jar that is loaded by Spark; some code in this package may not work with older versions of the library. If that happens, the best thing to do is use an older version of this package.
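For example, to check the default version and pin an older one before connecting (a minimal sketch, assuming `set_nlp_version()` takes the version string; `"3.4.2"` is just an illustrative value):

```r
library(sparklyr)
library(sparknlp)

nlp_version()             # show the Spark NLP version that will be loaded
set_nlp_version("3.4.2")  # pin an older jar before connecting (illustrative)

sc <- spark_connect(master = "local")
```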
Note: the version used when the licensed healthcare library is enabled is determined separately based on the secret code provided by John Snow Labs. See the section Licensed models and annotators below for more information.
John Snow Labs does provide GPU-enabled versions of the library jars. If you would like to use these jars, set the environment variable `SPARK_NLP_GPU` to `"TRUE"`. If this variable is not set, or is set to something that `as.logical()` does not treat as `TRUE`, the regular CPU library will be used.
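For example, before connecting to Spark:

```r
# Request the GPU build of the Spark NLP jar; set this before spark_connect()
Sys.setenv(SPARK_NLP_GPU = "TRUE")
```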
If you have purchased a license for the licensed models and annotators, first follow the normal steps in Spark NLP for Healthcare Getting Started. Once you've done this you should have your AWS CLI credentials set up and can use the licensed pretrained models in the unlicensed annotators (such as `ner_dl`). In order to use the licensed annotators you must also set the environment variable `SPARK_NLP_SECRET_CODE` to the secret code provided by John Snow Labs with your license, and `SPARK_NLP_LICENSE` to the license key.
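For example (the values below are placeholders; substitute the credentials supplied with your license):

```r
# Placeholder values; use the secret code and license key from John Snow Labs
Sys.setenv(SPARK_NLP_SECRET_CODE = "<your-secret-code>")
Sys.setenv(SPARK_NLP_LICENSE = "<your-license-key>")
```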