📢 It's All About LLMs!

We're excited to share the latest release, Spark NLP 🚀 5.4.0! This update is packed with new features and improvements for natural language processing. One of the highlights is the integration of OpenVINO Runtime, which boosts performance and efficiency on Intel hardware. Benchmarks show up to a 40% performance improvement over TensorFlow, with support for model formats including ONNX, PaddlePaddle, TensorFlow, and TensorFlow Lite.

We've also enabled OpenVINO in several annotators, including BertEmbeddings, RoBertaEmbeddings, and XlmRoBertaEmbeddings. These can now take full advantage of the OpenVINO toolkit for faster inference.

Another big change is in how we distribute models. We've moved from Broadcast to addFile for model distribution, which makes it easier to scale and manage large language models (LLMs) in cloud environments. This is especially helpful for models with over 7 billion parameters.

In addition, we've introduced the Mistral and Phi-2 architectures, optimized for high-efficiency quantization. There are also practical improvements to core components, like enhanced pooling for BERT-based models and updates to the OpenAIEmbeddings annotator for better performance and integration.

We want to thank our community for their valuable feedback, feature requests, and contributions. Our Models Hub now contains more than 37,000 free and truly open-source models & pipelines. 🎉

Spark NLP ❤️ OpenVINO

🔥 New Features & Enhancements

NEW Integration: OpenVINO Runtime for Spark NLP 🚀: We're thrilled to announce the integration of OpenVINO Runtime, enhancing Spark NLP with high-performance inference capabilities. OpenVINO Runtime supports direct reading of models in ONNX, PaddlePaddle, TensorFlow, and TensorFlow Lite formats, enabling out-of-the-box optimizations and superior performance on supported Intel hardware.

Enhanced Model Support and Performance Gains: The integration allows Spark NLP to utilize the OpenVINO Runtime API for Java, facilitating the loading and execution of models across various formats including ONNX, PaddlePaddle, TensorFlow, TensorFlow Lite, and OpenVINO IR. Impressively, benchmarks show up to a 40% performance improvement over TensorFlow with no additional tuning required. Additionally, users can harness the full optimization and quantization capabilities of the OpenVINO toolkit via the Model Conversion API.

Enabled Annotators: This update brings OpenVINO compatibility to a range of Spark NLP annotators, including BertEmbeddings, RoBertaEmbeddings, XlmRoBertaEmbeddings, T5Transformer, E5Embeddings, LLAMA2, Mistral, Phi2, and M2M100.

Acknowledgements: This significant enhancement was accomplished during Google Summer of Code 2023. Special thanks to Rajat Krishna (@rajatkrishna) and the entire OpenVINO team for their invaluable support and collaboration. #14200

New Mistral Integration: We are excited to introduce the Mistral integration, featuring models fine-tuned on the MistralForCausalLM architecture. This addition enhances performance and efficiency by supporting quantization in INT4 and INT8 for CPUs via OpenVINO. #14318

> Performance of Mistral 7B and different Llama models on a wide range of benchmarks. For all metrics, all models were re-evaluated with our evaluation pipeline for accurate comparison. Mistral 7B significantly outperforms Llama 2 13B on all metrics, and is on par with Llama 34B (since Llama 2 34B was not released, we report results on Llama 34B). It is also vastly superior in code and reasoning benchmarks. https://mistral.ai/news/announcing-mistral-7b/

Continuing our commitment to user-friendly and scalable solutions, the integration of the Mistral architecture has been designed to be straightforward and easily adoptable, ensuring that users can leverage these enhancements without complexity:

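from sparknlp.base import DocumentAssembler
from sparknlp.annotator import MistralTransformer

# Turn the raw "text" column into Spark NLP document annotations
doc_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Load the default pretrained Mistral model; sampling is disabled and
# the generated output length is capped at 50
mistral = MistralTransformer \
    .pretrained() \
    .setMaxOutputLength(50) \
    .setDoSample(False) \
    .setInputCols(["document"]) \
    .setOutputCol("mistral_generation")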

New Phi-2 Integration: Introducing Phi-2, featuring models fine-tuned using the PhiForCausalLM architecture. As with Mistral, the integration supports INT4 and INT8 quantization for CPUs via OpenVINO to optimize both performance and efficiency. #14318
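To make the INT8 quantization mentioned for Mistral and Phi-2 more concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is an illustration only: the function names `quantize_int8` and `dequantize` are hypothetical, and the scheme shown is a generic textbook one, not necessarily what the OpenVINO toolkit applies to these models.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would work.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the shared scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to keep the arithmetic easy to follow.
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored weight differs from the original by at most half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of four (FP32), at the cost of a small rounding error bounded by half of `scale`. INT4 follows the same idea with a narrower integer range, and therefore a coarser step.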

Continuing our commitment to user-friendly and scalable solutions, the integration of the Phi architecture has been designed to be straightforward and easily adoptable, ensuring that users can leverage these enhancements without complexity:

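from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Phi2Transformer

# Turn the raw "text" column into Spark NLP document annotations
doc_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Load the default pretrained Phi-2 model; sampling is disabled and
# the generated output length is capped at 50
phi2 = Phi2Transformer \
    .pretrained() \
    .setMaxOutputLength(50) \
    .setDoSample(False) \
    .setInputCols(["document"]) \
    .setOutputCol("phi2_generation")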

NEW: Enhanced LLM Distribution: We've improved the scalability of large language models (LLMs) in the cloud by transitioning from Broadcast to addFile for distributing deep learning models across any cluster. This change addresses the challenges of handling modern LLMs, some with over 7 billion parameters, by improving memory management and overcoming serialization limits previously encountered with Java byte arrays and Apache Spark's Broadcast method. As a result, Spark NLP can process LLMs far more efficiently. #14236
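A rough back-of-the-envelope calculation shows why byte-array serialization becomes a problem at this scale. The sketch below assumes the limiting factor is the maximum length of a single Java array (`Integer.MAX_VALUE` elements) and counts only the weights; the helper names are hypothetical, and real model files also carry metadata, so treat the numbers as estimates.

```python
# A single Java array can hold at most Integer.MAX_VALUE elements,
# so one byte[] tops out just under 2 GiB.
JAVA_MAX_ARRAY_BYTES = 2**31 - 1


def model_size_bytes(num_parameters: int, bits_per_weight: int) -> int:
    """Approximate size of the weights alone, ignoring any file metadata."""
    return num_parameters * bits_per_weight // 8


def fits_in_one_byte_array(num_parameters: int, bits_per_weight: int) -> bool:
    """True if the weights could be held in a single Java byte array."""
    return model_size_bytes(num_parameters, bits_per_weight) <= JAVA_MAX_ARRAY_BYTES


SEVEN_BILLION = 7_000_000_000
precisions = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

# Estimated size of a 7-billion-parameter model at each precision.
sizes = {name: model_size_bytes(SEVEN_BILLION, bits) for name, bits in precisions.items()}
fits = {name: fits_in_one_byte_array(SEVEN_BILLION, bits) for name, bits in precisions.items()}

# For comparison: a model with roughly 110 million parameters in FP32.
small_model_fits = fits_in_one_byte_array(110_000_000, 32)
```

Under these assumptions a 7-billion-parameter model exceeds the single-array limit at every listed precision, including INT4 at about 3.5 GB, while a model of around 110 million parameters fits comfortably. Shipping the model as a file avoids having to hold it in one serialized array.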

NEW: MPNetForTokenClassification Annotator: Introducing the MPNetForTokenClassification annotator in Spark NLP 🚀. This annotator efficiently loads MPNet models equipped with a token classification head (a linear layer atop the hidden-states output), ideal for Named-Entity Recognition (NER) tasks. It supports models trained or fine-tuned in ONNX format using MPNetForTokenClassification for PyTorch or TFMPNetForTokenClassification for TensorFlow from HuggingFace 🤗. View Pull Request (#14322)
Enhanced Pooling for BERT, RoBERTa, and XLM-RoBERTa: We've added support for average pooling in the BertSentenceEmbeddings, RoBertaSentenceEmbeddings, and XlmRoBertaSentenceEmbeddings annotators. Average pooling is especially useful when the model's [CLS] token has not been fine-tuned to produce sentence embeddings. View Pull Request
Refined OpenAIEmbeddings: Upgraded to support escape characters to prevent JSON content issues, changed the output annotator type from DOCUMENT to SENTENCE_EMBEDDINGS (note: this affects backward compatibility), enhanced output embeddings with metadata from the document column, introduced a Python unit test class, and added a new submodule for reliable saving/loading of the annotator. View Pull Request
New OpenVINO Notebooks: Released notebooks for exporting HuggingFace models using Optimum Intel and importing into Spark NLP. This update includes notebooks for BertEmbeddings, E5Embeddings, LLAMA2Transformer, RoBertaEmbeddings, XlmRoBertaEmbeddings, and T5Transformer. View Pull Request
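The difference between [CLS] pooling and the average pooling added to the sentence embedding annotators can be sketched in a few lines of plain Python. This is a conceptual illustration with made-up two-dimensional vectors and hypothetical function names, not the code used inside the annotators.

```python
from typing import List


def cls_pooling(token_embeddings: List[List[float]]) -> List[float]:
    """Use the first token's vector (the [CLS] position) as the sentence embedding."""
    return list(token_embeddings[0])


def average_pooling(token_embeddings: List[List[float]],
                    attention_mask: List[int]) -> List[float]:
    """Average the vectors of real tokens, skipping padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    count = max(count, 1)  # avoid dividing by zero for a fully masked input
    return [total / count for total in totals]


# Four token positions; the last one is padding and must not affect the mean.
tokens = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

sentence_avg = average_pooling(tokens, mask)
sentence_cls = cls_pooling(tokens)
```

Here `sentence_avg` reflects all three real tokens, while `sentence_cls` depends only on the first position, which is only meaningful if the model was trained to summarize the sentence there.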

🐛 Bug Fixes

Resolved Connection Timeout Issue: Fixed the "Timeout waiting for connection from pool" error that occurred when downloading multiple models simultaneously. View Pull Request
Corrected Llama-2 Decoder Position ID: Addressed an issue where the Llama-2 decoder received an incorrect next position ID. View Pull Request
Stabilized BertForZeroShotClassification: Fixed crashes in sentence-wise pipelines by implementing a method to pad all required arrays within a batch to the same length. View Pull Request
Updated Transformers Dependency: Resolved the import issue with keras.engine by updating the transformers version to 4.34.1. View Pull Request
ONNX Model Version Compatibility: Fixed the "Unsupported model IR version: 10, max supported IR version: 9" error by setting the ONNX version to onnx==1.14.0. View Pull Request
Resolved Breeze Compatibility Issue: Addressed java.lang.NoSuchMethodError by ensuring compatibility with Spark 3.4 and updating documentation accordingly. View Pull Request
Updated Libraries in Notebooks: Updated transformers and TensorFlow versions across all notebooks. View Pull Request
Fixed Division by Zero Error in SwinForImageClassification Notebook: Addressed an error that occurred when updating the TensorFlow version. View Pull Request
Fixed Missing spp File in XLMRoberta Annotator: Corrected a bug causing a missing spp file in the XLMRobertaForXXX annotator. View Pull Request
Enhanced XLNet Embeddings Signature: Updated TensorFlow signature in XLNet embeddings source code to support custom inputs while maintaining backward compatibility. View Pull Request
Added ModelHub Cards for M2M100 and Llama-2: Included missing modelhub cards to enhance model accessibility. View Pull Request
Optimized Caching for Streamlit Demos: Implemented caching to enhance performance across all Streamlit demonstrations. View Pull Request
Introduced UAEEmbeddings Notebook: Added a new example notebook for UAEEmbeddings. View Pull Request
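The BertForZeroShotClassification fix above relies on padding every array in a batch to the same length. A minimal sketch of that idea follows, assuming integer token IDs and a padding value of 0; the helper `pad_batch` and the sample IDs are hypothetical and do not reproduce the annotator's actual implementation.

```python
from typing import List, Tuple


def pad_batch(sequences: List[List[int]],
              pad_value: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every sequence to the longest length in the batch.

    Returns the padded sequences and a matching attention mask
    (1 for real tokens, 0 for padding).
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


# Three sentences of different lengths in the same batch.
batch = [[101, 7592, 102], [101, 102], [101, 2023, 2003, 102]]
padded, mask = pad_batch(batch)
```

After padding, every row has the same length, so the batch can be packed into a single rectangular tensor, and the mask tells the model which positions to ignore.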

🐛 Dependencies

Published New OpenVINO Artifacts: Built and published new OpenVINO artifacts for both CPU and GPU to enhance performance and compatibility. View Changes
Updated ONNX Runtime: Upgraded onnxruntime to version 1.18.0 for enhanced stability and performance on both CPU and GPU.
Upgraded Azure Libraries: Updated azure-identity to 1.12.2 and azure-storage-blob to 12.26.0 to improve security and integration with Azure services.

💾 Models

The complete list of all 37000+ models & pipelines in 230+ languages is available on Models Hub

📓 New Notebooks

You can visit Import Transformers in Spark NLP
You can visit Spark NLP Examples for 100+ examples

📖 Documentation

❤️ Community support

Slack For live discussion with the Spark NLP community and the team
GitHub Bug reports, feature requests, and contributions
Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Medium Spark NLP articles
JohnSnowLabs official Medium
YouTube Spark NLP video tutorials

Installation

Python

#PyPI

pip install spark-nlp==5.4.0

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0

Apple Silicon (M1 & M2)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0
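All of the coordinates above follow the same group:artifact_scala:version pattern, so they can be built programmatically, for example when a job has to choose between CPU and GPU builds. The helper below is a hypothetical convenience, not part of Spark NLP; it only assembles the strings listed in this section.

```python
GROUP_ID = "com.johnsnowlabs.nlp"
SCALA_VERSION = "2.12"

# Artifact names as listed in the Spark Packages section.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}


def package_coordinate(target: str = "cpu", version: str = "5.4.0") -> str:
    """Build the Maven coordinate for the given hardware target."""
    if target not in ARTIFACTS:
        raise ValueError(f"Unknown target {target!r}; expected one of {sorted(ARTIFACTS)}")
    return f"{GROUP_ID}:{ARTIFACTS[target]}_{SCALA_VERSION}:{version}"


cpu_coordinate = package_coordinate()
gpu_coordinate = package_coordinate("gpu")
```

The resulting string can be passed to `--packages` on the command line, or set as the value of Spark's `spark.jars.packages` configuration property when building a session.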

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>5.4.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>5.4.0</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>5.4.0</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>5.4.0</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.4.0.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.4.0.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.4.0.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.4.0.jar

Pull Requests:

What's Changed

Update 2024-02-11-bge_m3_xx.md by @maziyarpanahi in #14244
Closed the S3 connection by @mehmetbutgul in #14233
openVINO Dependencies by @DevinTDHa in #14255
[SPARKNLP-1037] Adding addFile changes to to replace broadcast in all ONNX based annotators by @danilojsl in #14236
Integrating OpenVINO Runtime in Spark NLP by @rajatkrishna in #14200
Fixing colab notebook bugs by @ahmedlone127 in #14249
adding model hub cards + updating readme + small typo fix on M2M100Te… by @ahmedlone127 in #14253
bert for zero shot classification crashes on sentence basis by @ahmedlone127 in #14276
Sparknlp 1035 test all notebooks to import tensor flow models to spark nlp by @ahmedlone127 in #14238
SPARKNLP-1036: Onnx Example notebooks by @DevinTDHa in #14234
Adding caching to streamlit demos by @AbdullahMubeenAnwar in #14232
Fixies by @agsfer in #14307
Add openvino GPU dependency by @DevinTDHa in #14309
LLAMA2 OpenVINO Position ID Fix by @rajatkrishna in #14308
Sparknlp 1016 implement mp net for token classification by @ahmedlone127 in #14322
Uploading OpenVINO example notebooks by @rajatkrishna in #14313
SparkNLP - 995 Introducing MistralAI LLMs by @prabod in #14318
SparkNLP 1043 integrate new casual lm annotators to use open vino by @prabod in #14319
Fixed LLAMA generation bug by @prabod in #14320
Add Pooling Average to Broken XXXForSentenceEmbedding annotators by @ahmedlone127 in #14328
Fix models link on FAQ by @dcecchini in #14333
adding onnx support and average pooling by @ahmedlone127 in #14330
uploading UAEEmbeddings notebook by @AbdullahMubeenAnwar in #14324
Refactor OpenAIEmbeddings by @mehmetbutgul in #14334
Models hub by @maziyarpanahi in #14335
540 Release Candidate by @maziyarpanahi in #14247

New Contributors

@rajatkrishna made their first contribution in #14200

Full Changelog: 5.3.3...5.4.0

Spark NLP 5.4.0: Launching OpenVINO Runtime Integration, Advanced Model Support for LLMs, Enhanced Performance with New Annotators, Improved Cloud Scalability, and Comprehensive Updates Across the Board!