Spark NLP 5.4.0: Launching OpenVINO Runtime Integration, Advanced Model Support for LLMs, Enhanced Performance with New Annotators, Improved Cloud Scalability, and Comprehensive Updates Across the Board!
📢 It's All About LLMs!
We're excited to share some amazing updates in Spark NLP 🚀 5.4.0! This update is packed with new features and improvements that are set to transform natural language processing. One of the highlights is the integration of OpenVINO Runtime, which significantly boosts performance and efficiency across Intel hardware. You can now enjoy up to a 40% increase in performance compared to TensorFlow, with support for various model formats like ONNX, PaddlePaddle, TensorFlow, and TensorFlow Lite.
We've also added some powerful new annotators: BertEmbeddings, RoBertaEmbeddings, and XlmRoBertaEmbeddings. These are specially fine-tuned to take full advantage of the OpenVINO toolkit, offering better model accuracy and speed.
Another big change is in how we distribute models. We've moved from Broadcast to addFile for model distribution, which makes it easier to scale and manage large language models (LLMs) in cloud environments. This is especially helpful for models with over 7 billion parameters.
In addition, we've introduced the Mistral and Phi-2 architectures, optimized for high-efficiency quantization. There are also practical improvements to core components, like enhanced pooling for BERT-based models and updates to the OpenAIEmbeddings annotator for better performance and integration.
We want to thank our community for their valuable feedback, feature requests, and contributions. Our Models Hub now contains over 37,000 free and truly open-source models & pipelines. 🎉
Spark NLP ❤️ OpenVINO
🔥 New Features & Enhancements
NEW Integration: OpenVINO Runtime for Spark NLP 🚀: We're thrilled to announce the integration of OpenVINO Runtime, enhancing Spark NLP with high-performance inference capabilities. OpenVINO Runtime supports direct reading of models in ONNX, PaddlePaddle, TensorFlow, and TensorFlow Lite formats, enabling out-of-the-box optimizations and superior performance on supported Intel hardware.
Enhanced Model Support and Performance Gains: The integration allows Spark NLP to utilize the OpenVINO Runtime API for Java, facilitating the loading and execution of models across various formats including ONNX, PaddlePaddle, TensorFlow, TensorFlow Lite, and OpenVINO IR. Impressively, benchmarks show up to a 40% performance improvement over TensorFlow with no additional tuning required. Additionally, users can harness the full optimization and quantization capabilities of the OpenVINO toolkit via the Model Conversion API.
Enabled Annotators: This update brings OpenVINO compatibility to a range of Spark NLP annotators, including BertEmbeddings, RoBertaEmbeddings, XlmRoBertaEmbeddings, T5Transformer, E5Embeddings, LLAMA2, Mistral, Phi2, and M2M100.
Acknowledgements: This significant enhancement was accomplished during Google Summer of Code 2023. Special thanks to Rajat Krishna (@rajatkrishna) and the entire OpenVINO team for their invaluable support and collaboration. #14200
- New Mistral Integration: We are excited to introduce the Mistral integration, featuring models fine-tuned on the MistralForCausalLM architecture. This addition enhances performance and efficiency by supporting quantization in INT4 and INT8 for CPUs via OpenVINO. #14318
Continuing our commitment to user-friendly and scalable solutions, the integration of the Mistral architecture has been designed to be straightforward and easily adoptable, ensuring that users can leverage these enhancements without complexity:
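from sparknlp.base import DocumentAssembler
from sparknlp.annotator import MistralTransformer

# Turn the raw "text" column into Spark NLP documents
doc_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Load the default pretrained model; cap the output length at 50 and disable sampling
mistral = MistralTransformer \
    .pretrained() \
    .setMaxOutputLength(50) \
    .setDoSample(False) \
    .setInputCols(["document"]) \
    .setOutputCol("mistral_generation")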
- New Phi-2 Integrations: Introducing Phi-2, featuring models fine-tuned using the PhiForCausalLM architecture. This update enhances OpenVINO's capabilities, enabling quantization in INT4 and INT8 for CPUs to optimize both performance and efficiency. #14318
Continuing our commitment to user-friendly and scalable solutions, the integration of the Phi architecture has been designed to be straightforward and easily adoptable, ensuring that users can leverage these enhancements without complexity:
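from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Phi2Transformer

# Turn the raw "text" column into Spark NLP documents
doc_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Load the default pretrained model; cap the output length at 50 and disable sampling
phi2 = Phi2Transformer \
    .pretrained() \
    .setMaxOutputLength(50) \
    .setDoSample(False) \
    .setInputCols(["document"]) \
    .setOutputCol("phi2_generation")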
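The Mistral and Phi-2 items above both mention INT4 and INT8 quantization. As a rough illustration of what reducing weights to 4 or 8 bits means, here is a minimal sketch of generic symmetric per-tensor quantization in plain Python. This is an assumption-laden teaching example, not the scheme that OpenVINO applies; the function names `quantize_symmetric` and `dequantize` are hypothetical.

```python
from typing import List, Tuple


def quantize_symmetric(weights: List[float], bits: int) -> Tuple[List[int], float]:
    """Map a non-empty list of floats to signed integers sharing one scale.

    Values land in [-(2**(bits-1) - 1), 2**(bits-1) - 1], e.g. [-127, 127] for 8 bits.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    # Avoid a zero scale when every weight is zero.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the shared scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    for bits in (8, 4):
        q, scale = quantize_symmetric(weights, bits)
        restored = dequantize(q, scale)
        worst = max(abs(w - r) for w, r in zip(weights, restored))
        print(f"INT{bits}: ints={q}, worst absolute error={worst:.4f}")
```

Fewer bits mean a coarser grid and a larger rounding error, which is the trade-off against the smaller memory footprint.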
- NEW: Enhanced LLM Distribution: We've optimized the scalability of large language models (LLMs) in the cloud by transitioning from Broadcast to addFile for deep learning distribution across any cluster. This change addresses the challenges of handling modern LLMs, some boasting over 7 billion parameters, by improving memory management and overcoming serialization limits previously encountered with Java Bytes and Apache Spark's Broadcast method. This update significantly boosts Spark NLP's ability to process LLMs efficiently, underscoring our dedication to delivering scalable NLP solutions. #14236
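To see why byte-level serialization becomes a problem at this scale, a back-of-the-envelope calculation helps. The sketch below assumes that the limit in question is the JVM's cap on a single byte array (`Integer.MAX_VALUE` elements, just under 2 GiB) and uses typical bytes-per-parameter figures; the release notes do not spell out either, so treat both as assumptions. The helper names are hypothetical.

```python
# Largest possible length of one JVM byte array (Integer.MAX_VALUE), just under 2 GiB.
JVM_MAX_BYTE_ARRAY = 2**31 - 1

# Assumed storage cost per parameter for a few common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def model_size_bytes(num_params: int, precision: str) -> int:
    """Approximate weight size in bytes, ignoring metadata and tokenizer files."""
    return int(num_params * BYTES_PER_PARAM[precision])


def fits_in_one_byte_array(num_params: int, precision: str) -> bool:
    """True if the weights could be held in a single JVM byte array."""
    return model_size_bytes(num_params, precision) <= JVM_MAX_BYTE_ARRAY


if __name__ == "__main__":
    params = 7_000_000_000
    for precision in BYTES_PER_PARAM:
        size_gib = model_size_bytes(params, precision) / 2**30
        fits = fits_in_one_byte_array(params, precision)
        print(f"{precision}: about {size_gib:.1f} GiB, fits in one byte array: {fits}")
```

Under these assumptions a 7-billion-parameter model exceeds the single-array cap at every listed precision, which is consistent with shipping model files to the workers instead of serializing them as in-memory bytes.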
- NEW: MPNetForTokenClassification Annotator: Introducing the MPNetForTokenClassification annotator in Spark NLP 🚀. This annotator efficiently loads MPNet models equipped with a token classification head (a linear layer atop the hidden-states output), ideal for Named-Entity Recognition (NER) tasks. It supports models exported to ONNX that were trained or fine-tuned using MPNetForTokenClassification for PyTorch or TFCamembertForTokenClassification for TensorFlow from HuggingFace 🤗. #14322
- Enhanced Pooling for BERT, RoBERTa, and XLM-RoBERTa: We've added support for average pooling in the BertSentenceEmbeddings, RoBertaSentenceEmbeddings, and XLMRoBertaEmbeddings annotators. This feature is especially useful when a model's [CLS] token was not fine-tuned to produce sentence embeddings, in which case averaging the token embeddings is the better option. View Pull Request
- Refined OpenAIEmbeddings: Upgraded to support escape characters to prevent JSON content issues, changed the output annotator type from DOCUMENT to SENTENCE_EMBEDDINGS (note: this affects backward compatibility), enhanced output embeddings with metadata from the document column, introduced a Python unit test class, and added a new submodule for reliable saving/loading of the annotator. View Pull Request
- New OpenVINO Notebooks: Released notebooks for exporting HuggingFace models using Optimum Intel and importing into Spark NLP. This update includes notebooks for BertEmbeddings, E5Embeddings, LLAMA2Transformer, RoBertaEmbeddings, XlmRoBertaEmbeddings, and T5Transformer. View Pull Request
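The pooling item above contrasts using the [CLS] token with averaging the token embeddings. The following is a minimal plain-Python sketch of the two strategies, written for illustration only: it is not Spark NLP's implementation, the function names are hypothetical, and masking out padded positions is an assumption about how the average is taken.

```python
from typing import List


def cls_pooling(token_embeddings: List[List[float]]) -> List[float]:
    """Use the first token's vector ([CLS] in BERT-style models) as the sentence vector."""
    return list(token_embeddings[0])


def average_pooling(token_embeddings: List[List[float]],
                    attention_mask: List[int]) -> List[float]:
    """Mean of the token vectors, skipping padded positions (mask value 0)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if kept == 0:
        raise ValueError("attention_mask masks out every token")
    return [total / kept for total in totals]


if __name__ == "__main__":
    # Three real tokens followed by one padding position.
    tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
    mask = [1, 1, 1, 0]
    print("CLS pooling:    ", cls_pooling(tokens))
    print("Average pooling:", average_pooling(tokens, mask))
```

With average pooling every real token contributes to the sentence vector, so the result does not depend on how well the first position was trained.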
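The OpenAIEmbeddings item above mentions escape characters and JSON content issues. The sketch below shows the general failure mode in plain Python: building a request body by string concatenation breaks on quotes and newlines, while a JSON serializer escapes them. It is not the annotator's actual code; the helper names are hypothetical, and the `input` and `model` fields follow the shape of an OpenAI embeddings request.

```python
import json


def build_embeddings_request(text: str, model: str) -> str:
    """Serialize the body with json.dumps, which escapes quotes, backslashes and control characters."""
    return json.dumps({"input": text, "model": model})


def naive_request(text: str, model: str) -> str:
    """String concatenation without escaping, shown only to illustrate the failure mode."""
    return '{"input": "' + text + '", "model": "' + model + '"}'


if __name__ == "__main__":
    tricky = 'She said "hello"\nthen left C:\\temp'
    model = "text-embedding-ada-002"

    safe_body = build_embeddings_request(tricky, model)
    # The escaped body round-trips back to the original text.
    print(json.loads(safe_body)["input"] == tricky)

    try:
        json.loads(naive_request(tricky, model))
    except json.JSONDecodeError as err:
        print("Unescaped body is not valid JSON:", err.msg)
```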
🐛 Bug Fixes
- Resolved Connection Timeout Issue: Fixed the "Timeout waiting for connection from pool" error that occurred when downloading multiple models simultaneously. View Pull Request
- Corrected Llama-2 Decoder Position ID: Addressed an issue where the Llama-2 decoder received an incorrect next position ID. View Pull Request
- Stabilized BertForZeroShotClassification: Fixed crashes in sentence-wise pipelines by implementing a method to pad all required arrays within a batch to the same length. View Pull Request
- Updated Transformers Dependency: Resolved the import issue with keras.engine by updating the transformers version to 4.34.1. View Pull Request
- ONNX Model Version Compatibility: Fixed "Unsupported model IR version: 10, max supported IR version: 9" by setting the ONNX version to onnx==1.14.0. View Pull Request
- Resolved Breeze Compatibility Issue: Addressed java.lang.NoSuchMethodError by ensuring compatibility with Spark 3.4 and updating documentation accordingly. View Pull Request
- Updated Libraries in Notebooks: Updated transformers and TensorFlow versions across all notebooks. View Pull Request
- Fixed Division by Zero Error in SwinForImageClassification Notebook: Addressed an error that occurred when updating the TensorFlow version. View Pull Request
- Fixed Missing spp File in XLMRoberta Annotator: Corrected a bug causing a missing spp file in the XLMRobertaForXXX annotator. View Pull Request
- Enhanced XLNet Embeddings Signature: Updated TensorFlow signature in XLNet embeddings source code to support custom inputs while maintaining backward compatibility. View Pull Request
- Added ModelHub Cards for M2M100 and Llama-2: Included missing modelhub cards to enhance model accessibility. View Pull Request
- Optimized Caching for Streamlit Demos: Implemented caching to enhance performance across all Streamlit demonstrations. View Pull Request
- Introduced UAEEmbeddings Notebook: Added a new example notebook for UAEEmbeddings. View Pull Request
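The BertForZeroShotClassification fix in the list above works by padding all required arrays within a batch to the same length. As a rough picture of that idea, here is a minimal plain-Python sketch. It is not the Scala code from the fix; the function name `pad_batch`, the pad value of 0, and the returned attention mask are assumptions made for illustration.

```python
from typing import List, Tuple


def pad_batch(sequences: List[List[int]],
              pad_value: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest length in the batch.

    Returns the padded sequences and a mask with 1 for real tokens and 0 for padding.
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


if __name__ == "__main__":
    # Two sentences of different lengths, as token IDs.
    batch = [[101, 7, 102], [101, 102]]
    padded, mask = pad_batch(batch)
    print(padded)
    print(mask)
```

Once every row has the same length, the batch can be packed into a rectangular tensor, which is what a model's batched input requires.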
🐛 Dependencies
- Published New OpenVINO Artifacts: Built and published new OpenVINO artifacts for both CPU and GPU to enhance performance and compatibility. View Changes
- Updated ONNX Runtime: Upgraded onnxruntime to version 1.18.0 for enhanced stability and performance on both CPU and GPU.
- Upgraded Azure Libraries: Updated azure-identity to 1.12.2 and azure-storage-blob to 12.26.0 to improve security and integration with Azure services.
💾 Models
The complete list of all 37000+ models & pipelines in 230+ languages is available on Models Hub
📓 New Notebooks
- You can visit Import Transformers in Spark NLP
- You can visit Spark NLP Examples for 100+ examples
📖 Documentation
- Import models from TF Hub & HuggingFace
- Spark NLP Notebooks
- Models Hub with new models
- Spark NLP Articles
- Spark NLP in Action
- Spark NLP Documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
❤️ Community support
- Slack For live discussion with the Spark NLP community and the team
- GitHub Bug reports, feature requests, and contributions
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium Spark NLP articles
- JohnSnowLabs official Medium
- YouTube Spark NLP video tutorials
Installation
Python
#PyPI
pip install spark-nlp==5.4.0
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.4.0</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.4.0</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.4.0</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.4.0</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.4.0.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.4.0.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.4.0.jar
- AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.4.0.jar
Pull Requests:
- #14233
- #14255
- #14236
- #14200
- #14249
- #14253
- #14276
- #14238
- #14234
- #14232
- #14309
- #14308
- #14322
- #14313
- #14318
- #14319
- #14328
- #14333
- #14331
- #14330
- #14324
- #14334
What's Changed
- Update 2024-02-11-bge_m3_xx.md by @maziyarpanahi in #14244
- Closed the S3 connection by @mehmetbutgul in #14233
- openVINO Dependencies by @DevinTDHa in #14255
- [SPARKNLP-1037] Adding addFile changes to to replace broadcast in all ONNX based annotators by @danilojsl in #14236
- Integrating OpenVINO Runtime in Spark NLP by @rajatkrishna in #14200
- Fixing colab notebook bugs by @ahmedlone127 in #14249
- adding model hub cards + updating readme + small typo fix on M2M100Te… by @ahmedlone127 in #14253
- bert for zero shot classification crashes on sentence basis by @ahmedlone127 in #14276
- Sparknlp 1035 test all notebooks to import tensor flow models to spark nlp by @ahmedlone127 in #14238
- SPARKNLP-1036: Onnx Example notebooks by @DevinTDHa in #14234
- Adding caching to streamlit demos by @AbdullahMubeenAnwar in #14232
- Fixies by @agsfer in #14307
- Add openvino GPU dependency by @DevinTDHa in #14309
- LLAMA2 OpenVINO Position ID Fix by @rajatkrishna in #14308
- Sparknlp 1016 implement mp net for token classification by @ahmedlone127 in #14322
- Uploading OpenVINO example notebooks by @rajatkrishna in #14313
- SparkNLP - 995 Introducing MistralAI LLMs by @prabod in #14318
- SparkNLP 1043 integrate new casual lm annotators to use open vino by @prabod in #14319
- Fixed LLAMA generation bug by @prabod in #14320
- Add Pooling Average to Broken XXXForSentenceEmbedding annotators by @ahmedlone127 in #14328
- Fix models link on FAQ by @dcecchini in #14333
- adding onnx support and average pooling by @ahmedlone127 in #14330
- uploading UAEEmbeddings notebook by @AbdullahMubeenAnwar in #14324
- Refactor OpenAIEmbeddings by @mehmetbutgul in #14334
- Models hub by @maziyarpanahi in #14335
- 540 Release Candidate by @maziyarpanahi in #14247
New Contributors
- @rajatkrishna made their first contribution in #14200
Full Changelog: 5.3.3...5.4.0