diff --git a/assets/models/system/cxrreportgen/asset.yaml b/assets/models/system/cxrreportgen/asset.yaml
new file mode 100644
index 0000000000..fcf5c5a05b
--- /dev/null
+++ b/assets/models/system/cxrreportgen/asset.yaml
@@ -0,0 +1,4 @@
extra_config: model.yaml
spec: spec.yaml
type: model
categories: ["Foundation Models"]
diff --git a/assets/models/system/cxrreportgen/description.md b/assets/models/system/cxrreportgen/description.md
new file mode 100644
index 0000000000..e8d3e876e1
--- /dev/null
+++ b/assets/models/system/cxrreportgen/description.md
@@ -0,0 +1,123 @@
## Overview

The CxrReportGen model uses a multimodal architecture, integrating a BiomedCLIP image encoder with a Phi-3-Mini text encoder, to interpret complex chest X-ray studies. CxrReportGen follows the same framework as **[MAIRA-2](https://www.microsoft.com/en-us/research/publication/maira-2-grounded-radiology-report-generation/)**. Its primary function is to generate comprehensive and structured radiology reports, with visual grounding represented by bounding boxes on the images.

### Training information

| **Training Dataset** | **Details** |
|----------------------|-------------|
| **[MIMIC-CXR](https://physionet.org/content/mimic-cxr/2.0.0/)** | Frontal chest X-rays from the training partition of the MIMIC-CXR dataset and the associated text reports. Rule-based processing was carried out to extract findings and impressions separately, or to map non-labeled report sections to the relevant sections. During training, text is randomly sampled from either the findings or the impression section. In total, 203,170 images from this dataset were used. |
| **Proprietary datasets** | Multiple other proprietary datasets, composed of procured data, were additionally leveraged for training. Caution was taken to ensure there was no leakage of test data samples into the data used for training. |

**Training Statistics:**
 - **Data Size:** ~400,000 samples
 - **Batch Size:** 16
 - **Epochs:** 3
 - **Learning Rate:** 2.5e-05
 - **Hardware:** 8 A100 GPUs
 - **Training Time:** 1 day and 19 hours
 - **SKU:** Standard_ND96amsr_A100_v4

### License and where to send questions or comments about the model
The license for CxrReportGen is the MIT license.
For questions or comments, please contact: hlsfrontierteam@microsoft.com

## Benchmark Results

### Findings Generation on MIMIC-CXR test set:

| CheXpert F1-14 (Micro) | CheXpert F1-5 (Micro) | RadGraph-F1 | ROUGE-L | BLEU-4 |
|------------------------|-----------------------|-------------|---------|--------|
| 59.1 | 59.7 | 40.8 | 39.1 | 23.7 |

### Grounded Reporting on [GR-Bench test set](https://arxiv.org/pdf/2406.04449v1):

| CheXpert F1-14 (Micro) | RadGraph-F1 | ROUGE-L | Box-Completion (Precision/Recall) |
|------------------------|-------------|---------|-----------------------------------|
| 60.0 | 55.6 | 56.6 | 71.5/82.0 |

## Carbon Footprint
The estimated carbon emissions during training are 0.06364 tCO2eq.

## Sample Input and Output

### Input:
```python
{'input_data':
  {'columns': ['frontal_image', 'lateral_image', 'indication', 'technique', 'comparison'],
   'index': [0],
   'data': [
     [
       base64.encodebytes(read_image(frontal)).decode("utf-8"),
       base64.encodebytes(read_image(lateral)).decode("utf-8"),
       'Pneumonia',
       'One view chest',
       'None'
     ]]},
 'params': {}}
```

### Output:
The output is JSON-encoded inside an array.
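The snippet below is a minimal sketch of how the request above might be submitted to a deployed Azure ML online endpoint in order to obtain the `result` object parsed in the next block. The `read_image` helper, image paths, scoring URI, and API key are illustrative placeholders, and the exact response shape can vary with the deployment.

```python
import base64
import json
import urllib.request


def read_image(path: str) -> bytes:
    """Illustrative helper: read an image file from disk as raw bytes."""
    with open(path, "rb") as f:
        return f.read()


# Illustrative input image paths.
frontal = "frontal.png"
lateral = "lateral.png"

payload = {
    "input_data": {
        "columns": ["frontal_image", "lateral_image", "indication", "technique", "comparison"],
        "index": [0],
        "data": [[
            base64.encodebytes(read_image(frontal)).decode("utf-8"),
            base64.encodebytes(read_image(lateral)).decode("utf-8"),
            "Pneumonia",
            "One view chest",
            "None",
        ]],
    },
    "params": {},
}

# Hypothetical deployment details -- substitute your own endpoint values.
scoring_uri = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"
api_key = "<api-key>"

request = urllib.request.Request(
    scoring_uri,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
)
with urllib.request.urlopen(request) as response:
    result = json.loads(response.read())
```

The returned `result` can then be parsed as shown below.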
```python
findings = json.loads(result[0]["output"])
findings
```

```python
[['Cardiac silhouette remains normal in size.', None],
 ['Hilar contours are unremarkable.', None],
 ['There are some reticular appearing opacities in the left base not seen on the prior exam.',
  [[0.505, 0.415, 0.885, 0.775]]],
 ['There is blunting of the right costophrenic sulcus.',
  [[0.005, 0.555, 0.155, 0.825]]],
 ['Upper lungs are clear.', None]]
```
The generated bounding box coordinates are the (x, y) coordinates of the top-left and bottom-right corners of the box, i.e. (x_topleft, y_topleft, x_bottomright, y_bottomright). They are relative to the cropped image (that is, the image the model ultimately received as input), so take care when visualising them.

You can optionally apply the code below to map the boxes back to the original image size (see the example at the end of this document):
```python
from typing import Tuple

# Box type used throughout: normalised (x_min, y_min, x_max, y_max) coordinates.
BoxType = Tuple[float, float, float, float]


def adjust_box_for_original_image_size(box: BoxType, width: int, height: int) -> BoxType:
    """
    This function adjusts the bounding boxes from the MAIRA-2 model output to account for the image processor
    cropping the image to be square prior to the model forward pass. The box coordinates are adjusted to be
    relative to the original shape of the image, assuming the image processor cropped the image based on the length
    of the shortest side.

    Args:
        box (BoxType):
            The box to be adjusted, normalised to (0, 1).
        width (int):
            Original width of the image, in pixels.
        height (int):
            Original height of the image, in pixels.

    Returns:
        BoxType: The box normalised relative to the original size of the image.
    """
    crop_width = crop_height = min(width, height)
    x_offset = (width - crop_width) // 2
    y_offset = (height - crop_height) // 2

    norm_x_min, norm_y_min, norm_x_max, norm_y_max = box

    abs_x_min = int(norm_x_min * crop_width + x_offset)
    abs_x_max = int(norm_x_max * crop_width + x_offset)
    abs_y_min = int(norm_y_min * crop_height + y_offset)
    abs_y_max = int(norm_y_max * crop_height + y_offset)

    adjusted_norm_x_min = abs_x_min / width
    adjusted_norm_x_max = abs_x_max / width
    adjusted_norm_y_min = abs_y_min / height
    adjusted_norm_y_max = abs_y_max / height

    return (adjusted_norm_x_min, adjusted_norm_y_min, adjusted_norm_x_max, adjusted_norm_y_max)
```

## Ethical Considerations

CxrReportGen should not be used as a diagnostic tool or as a substitute for professional medical advice. It is designed to assist radiologists by generating findings and reports, but final clinical decisions should always be made by human experts.

For detailed guidelines on ethical use, refer to Microsoft's [Responsible AI Principles](https://www.microsoft.com/en-us/ai/responsible-ai).
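## Example: Mapping Boxes Back to the Original Image

The snippet below is a minimal usage sketch of how `adjust_box_for_original_image_size` (defined above) might be applied to the parsed grounded findings. The file name and the use of Pillow to read the original image size are illustrative assumptions, not part of the model output contract.

```python
from PIL import Image  # assumption: Pillow is available in the client environment

# Original (pre-crop) size of the frontal image that was sent to the endpoint.
width, height = Image.open("frontal.png").size  # illustrative file name

adjusted_findings = []
for sentence, boxes in findings:
    if boxes is None:
        adjusted_findings.append((sentence, None))
        continue
    adjusted_findings.append(
        (sentence, [adjust_box_for_original_image_size(tuple(box), width, height) for box in boxes])
    )

# The adjusted boxes are still normalised to (0, 1), now relative to the original image,
# so multiply by (width, height) to obtain pixel coordinates when drawing overlays.
```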
diff --git a/assets/models/system/cxrreportgen/model.yaml b/assets/models/system/cxrreportgen/model.yaml
new file mode 100644
index 0000000000..9d8360c59f
--- /dev/null
+++ b/assets/models/system/cxrreportgen/model.yaml
@@ -0,0 +1,8 @@
path:
  container_name: models
  container_path: huggingface/CxrReportGen/mlflow_model_folder
  storage_name: automlcesdkdataresources
  type: azureblob
publish:
  description: description.md
  type: mlflow_model
diff --git a/assets/models/system/cxrreportgen/spec.yaml b/assets/models/system/cxrreportgen/spec.yaml
new file mode 100644
index 0000000000..36d46ca4f1
--- /dev/null
+++ b/assets/models/system/cxrreportgen/spec.yaml
@@ -0,0 +1,30 @@
$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json

name: CxrReportGen
path: ./

properties:
  inference-min-sku-spec: 24|1|220|64
  inference-recommended-sku: Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4
  languages: en
  SharedComputeCapacityEnabled: true

tags:
  task: image-text-to-text
  industry: health-and-life-sciences
  Preview: ""
  inference_supported_envs:
  - hf
  license: mit
  author: Microsoft
  hiddenlayerscanned: ""
  SharedComputeCapacityEnabled: ""
  inference_compute_allow_list:
  [
    Standard_NC24ads_A100_v4,
    Standard_NC48ads_A100_v4,
    Standard_NC96ads_A100_v4,
    Standard_ND96asr_v4,
    Standard_ND96amsr_A100_v4,
  ]
version: 2
\ No newline at end of file
diff --git a/assets/models/system/medimageinsight/asset.yaml b/assets/models/system/medimageinsight/asset.yaml
new file mode 100644
index 0000000000..fcf5c5a05b
--- /dev/null
+++ b/assets/models/system/medimageinsight/asset.yaml
@@ -0,0 +1,4 @@
extra_config: model.yaml
spec: spec.yaml
type: model
categories: ["Foundation Models"]
diff --git a/assets/models/system/medimageinsight/description.md b/assets/models/system/medimageinsight/description.md
new file mode 100644
index 0000000000..d0a82c22d5
--- /dev/null
+++ b/assets/models/system/medimageinsight/description.md
@@ -0,0 +1,146 @@
### Overview
Most medical imaging AI today is narrowly built to detect a small set of individual findings on a single modality, such as chest X-rays.
This training approach is data- and compute-inefficient, requiring ~6-12 months per finding, and often fails to generalize in real-world environments.
By further training existing multimodal foundation models on medical images and associated text data, Microsoft and Nuance created a multimodal foundation model that shows evidence of generalizing across various medical imaging modalities, anatomies, locations, severities, and types of medical data.
The training method learns to map medical text and images into a unified numerical vector representation space, which makes it easy for computers to understand the relationships between those modalities.

Embeddings are an important building block in AI research and development for retrieval, search, comparison, classification, and tagging tasks, and developers and researchers can now use MedImageInsight embeddings in the medical domain.
MedImageInsight embeddings are open source, allowing developers to customize and adapt them to their specific use cases.

### Model Architecture

Microsoft MedImageInsight includes a 360-million-parameter image encoder and a 252-million-parameter language encoder and comes as a pretrained model with fine-tuning capability. The language encoder is not run at inference time for each image.
It is only run once (offline) to generate the classifier head. MedImageInsight is a vision-language transformer and was derived from the Florence computer vision foundation model. Florence is a two-tower architecture similar to CLIP, except that the DaViT architecture is used as the image encoder and the UniCL objective is used as the objective function for MedImageInsight.

The model accepts image and text inputs and generates vector embeddings as output. This is a static model trained on an offline dataset that is described below.

### License and where to send questions or comments about the model
A custom commercial license is available. Please contact the team for details.

### Training information

| **Training Dataset** | **Details** |
|----------------------|-------------|
| **[MIMIC-CXR](https://physionet.org/content/mimic-cxr/2.0.0/)** | Frontal chest X-rays from the training partition of the MIMIC-CXR dataset and the associated text reports. Rule-based processing was carried out to extract findings and impressions separately, or to map non-labeled report sections to the relevant sections. During training, text is randomly sampled from either the findings or the impression section. In total, 203,170 images from this dataset were used. |
| **[NIH-CXR-LT](https://pubmed.ncbi.nlm.nih.gov/36318048/)** | The NIH-CXR-LT dataset contains categories with a long-tail distribution spanning 20 disease classes for frontal chest X-rays. 68,058 images from the training dataset were leveraged. |
| **[IRMA 2009](https://publications.rwth-aachen.de/record/113524/files/Lehmann_IRMACode_2003.pdf)** | A dataset containing X-rays covering a spectrum of body regions, views, and patient positions. Category information is specified in a coding system, with a PDF mapping the coding system to text for each of the code sub-parts. We converted the coding scheme to its text counterparts by extracting this mapping from the PDF, and leveraged the image and code-text pairs for training. |
| **[RSNA BoneAge](https://pubs.rsna.org/doi/abs/10.1148/radiol.2018180736?journalCode=radiology)** | Pediatric bone-age hand X-rays annotated with the development age of the images. The images are supplied in 8-bit format with inconsistent window leveling. Preprocessing was applied, including histogram equalization followed by window leveling, to control and standardize the appearance of the images for subsequent training and inference. The development age and gender of each image were converted to text using a standardized template. 12,611 images from the training partition are leveraged. |
| **[UPENN](https://www.nature.com/articles/s41597-022-01560-7)** | A dataset of MRI images of glioblastomas. Images were paired with the text of their DICOM image series descriptions. In total, 4,645 images with associated texts were organized for training. |
| **[TCGA](https://www.cancerimagingarchive.net/collection/tcga-sarc/)** | A multi-modal dataset of imaging for sarcoma diagnostics. CT and MRI images were extracted and associated with the text of their series descriptions, constituting 5,643 image and text pairs. |
| **[SD198](https://link.springer.com/chapter/10.1007/978-3-319-46466-4_13)** | A dataset of clinical photographs of 198 skin lesions crawled from the web. Train and test splits were not made available, but prior work used random 50% sampling, which we followed for consistency, yielding 3,253 images for training. |
| **[ISIC2019](https://arxiv.org/abs/1902.03368)** | A collection of dermoscopic images of skin lesions, associated with 8 diagnostic states spanning metastatic and non-metastatic disease. 20,268 images from the training partition were leveraged. |
| **[PatchCamelyon](https://jamanetwork.com/journals/jama/fullarticle/2665774)** | Histopathological images of breast tissue depicting the presence or absence of cancer. 262,144 images and associated text labels were used in training. |
| **[RSNA Mammography](https://www.kaggle.com/competitions/rsna-breast-cancer-detection/data)** | Images from the RSNA-hosted and -managed challenge on breast cancer detection from mammography. The dataset comprises several styles of mammograms with varying window levels and contrasts. No attempt was made to standardize or normalize the images. In total, 43,764 mammograms were leveraged for training. |
| **[LIDC-IDRI](https://ieee-dataport.org/documents/lung-image-database-consortium-image-collection-lidc-idri)** | A dataset of chest CTs depicting lung nodules at various stages of development. The dataset was broken into 5x5 tiles across images, with tiles labeled for the maturity of the lung nodule present in the tile. 80,201 tiles were sampled for training. |
| **[PAD-UFES-20](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7479321/)** | A collection of clinical photographs of skin lesions taken with mobile devices, where the images have been cropped over the lesion of interest. 6 diseases are represented. Following precedent, 2,065 images (90%) were leveraged for training, and 233 (10%) for testing. |
| **[ODIR-5k](https://www.kaggle.com/datasets/andrewmvd/ocular-disease-recognition-odir5k)** | Fundus images, where pairs of eyes were annotated across 6 categories. If one eye is not normal, the pair is labeled with the disease of the abnormal eye. Laterality-specific textual descriptions were also available. Upon further processing, we discovered about 79 unique textual descriptions were assigned across 6,495 unique eyes, and opted to use these descriptions as labels instead of the reduced 6 labels. 5,228 images were used for training, and 1,267 images were used for evaluation, which constituted a random 20% sampling of the top 30 categories (with 10 or more instances in the dataset). |
| **Proprietary datasets** | Multiple other proprietary datasets, composed of procured data, data supplied by collaborative partners, and data crawled from the web were additionally leveraged for training. Caution was taken to ensure there was no leakage of test data samples into the crawled data used for training. |


| **Carbon Footprint** | **Details** |
|----------------------|-------------|
| **Carbon Footprint** | Pretraining utilized a cumulative 7680 GPU hours of computation on hardware of type V100 (TDP of 250W-400W). Estimated total emissions were 0.89184 tCO2eq. We trained on Azure Machine Learning. We used 64 V100 GPUs. The compute region was West US 2. |


### Evaluation Results
In this section, we report the results for the models on standard academic benchmarks. For all the evaluations, we use our internal evaluations library. For these models, we always pick the best score between our evaluation framework and any publicly reported results.
+| **Modality** | **Use Case** | **Benchmark** | **Maturity relative to Human Expert** | **MSFT IP or Partner Models** | **Google Models** | +|----------------|---------------------|-----------------------------------------------------------------------------------------------|--------------------------------------|---------------------------------|---------------------------------------| +| **Radiology** | Classification | X-Ray: RSNA Bone age | 🟒 | 6.85 avg L1* | No test results | +| | Classification | X-Ray: IRMA2005 body-region/view categories | 🟒 | 0.99 mAUC* | No test results | +| | Classification | ChestXray14: Consolidation (finetuning) | 🟑 | 0.74 mAUC* | 0.74 mAUC (ELiXR)* | +| | Classification | ChestXray14: Edema (finetuning) | 🟑 | 0.86 mAUC* | 0.85 mAUC* (ELiXR) | +| | Classification | ChestXray14: Effusion (finetuning) | 🟑 | 0.83 mAUC* | 0.83 mAUC* (ELiXR) | +| | Classification | MR/CT: Exam categories | 🟑 | 0.95 mAUC* | No test results | +| | Classification | Chest CT: LIDC-IDRI Lung Nodules | 🟑 | 0.81 mAUC* | No model | +| | Classification | Mammography: RSNA Mammography | 🟑 | 0.81 mAUC* | No model | +| **Dermatology**| Classification | ISIC2019 | 🟑 | 0.84 mAUC* | No test results | +| | Classification | SD-198 | 🟑 | 0.93 mAUC* | No test results | +| | Classification | PADUFES20 | 🟑 | 0.96 mAUC | 0.97* (Med-PaLM-M 84B) | +| **Pathology** | Classification | PCAM | 🟑 | 0.96 mAUC* (PaLM) | No test results | +| | Classification | WILDS | 🟑 | 0.97 mAUC (PaLM) | No test results | + + +*SOTA for this task + +### Fairness evaluation + +The table below highlights the performance (AUC) of Bone Age prediction and ChextX-ray text search tasks for female and male respectively. + +| Tasks | AUC | +|----------------------------------------|--------| +| Bone Age (Female) | 6.9343 | +| Bone Age (Male) | 6.5446 | +| ChestX-ray text search (Female) | 0.8651 | +| ChestX-ray text search (Male) | 0.8603 | + + +The table below highlight characterisitcs of patients whose OCT images were included in the analysis. + +| Diagnosis | Diabetic Macular Edema (DME) | Choroidal Neovascularization (CNV) | Drusen | Normal | +|--------------------------------|------------------------------|------------------------------------|--------|--------| +| **Number of Patients** | 709 | 791 | 713 | 3548 | +| **Mean Age (years)** | 57 (Range: 20-90) | 83 (Range: 58-97) | 82 (Range: 40-95) | 60 (Range: 21-86) | +| **Gender** | | | | | +| Male | 38.3% | 54.2% | 44.4% | 59.2% | +| Female | 61.7% | 45.8% | 55.6% | 40.8% | +| **Ethnicity** | | | | | +| Caucasian | 42.6% | 83.3% | 85.2% | 59.9% | +| Asian | 23.4% | 6.3% | 8.6% | 21.1% | +| Hispanic | 23.4% | 8.3% | 4.9% | 10.2% | +| African American | 4.3% | 2.1% | 1.2% | 1.4% | +| Mixed or Other | 10.6% | 0% | 0% | 7.5% | + + +We plan on doing more comprehensive fairness evaluations before public release. + +### Ethical Considerations and Limitations + +Microsoft believes Responsible AI is a shared responsibility and we have identified six principles and practices help organizations address risks, innovate, and create value: fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability. 
When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant use case and addresses unforeseen product misuse.

While testing the model with images and/or text, ensure that the data is PHI-free and that it contains no patient information or information that can be traced back to a patient's identity.

The model is not designed for the following use cases:
* **Use as a diagnostic tool or as a medical device** - Using information extracted by our service in diagnosis, cure, mitigation, treatment, or prevention of disease or other conditions, as a substitute for professional medical advice, diagnosis, treatment, or the clinical judgment of a healthcare professional.

* **Scenarios without consent for data** - Any scenario that uses health data for a purpose for which consent was not obtained.

* **Use outside of health scenarios** - Any scenario that uses non-medical images and/or serves purposes outside of the healthcare domain.

Please see Microsoft's Responsible AI Principles and approach available at [https://www.microsoft.com/en-us/ai/principles-and-approach/](https://www.microsoft.com/en-us/ai/principles-and-approach/)


### Sample inputs and outputs (for real-time inference)

Input:
```python
data = {
    "input_data": {
        "columns": [
            "image",
            "text"
        ],
        "index": [0, 1],
        "data": [
            [base64.encodebytes(read_image(sample_image_ct_8Bits_Mono)).decode("utf-8"), "This 3D volume depicts the pancreas with a single tumor, the largest of which measures 5.10 centimeters in length."],
            [base64.encodebytes(read_image(sample_image_mri_8Bits_Mono)).decode("utf-8"), "This 3D volume depicts the brain with a single tumor."]
        ]
    },
    "params": {}
}
```

Output:
```json
[{"image_features": [[-0.040428221225738525, 0.015632804483175278, -0.034625787287950516, -0.013094332069158554, 0.023215821012854576, -0.010303247720003128, -0.003998206462711096, -0.00022746287868358195]]
```

## Data and Resource Specification for Deployment
* **Supported Data Input Format**
- Monochromatic 8-bit Images (e.g. PNG, TIFF)
- RGB Images (e.g.
JPEG, PNG) +- Text (Maximum: 77 Tokens) + +* **Hardware Requirement for Compute Instances** +- Default: Single V100 GPU +- Minimum: Single GPU instance with 8Gb Memory +- Batch size: 4 (~6Gb Memory) diff --git a/assets/models/system/medimageinsight/model.yaml b/assets/models/system/medimageinsight/model.yaml new file mode 100644 index 0000000000..368c284af4 --- /dev/null +++ b/assets/models/system/medimageinsight/model.yaml @@ -0,0 +1,8 @@ +path: + container_name: models + container_path: huggingface/MedImageInsight/mlflow_model_folder + storage_name: automlcesdkdataresources + type: azureblob +publish: + description: description.md + type: mlflow_model diff --git a/assets/models/system/medimageinsight/spec.yaml b/assets/models/system/medimageinsight/spec.yaml new file mode 100644 index 0000000000..f1814275fb --- /dev/null +++ b/assets/models/system/medimageinsight/spec.yaml @@ -0,0 +1,34 @@ +$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json + +name: MedImageInsight +path: ./ + +properties: + inference-min-sku-spec: 6|1|112|64 + inference-recommended-sku: Standard_NC6s_v3, Standard_NC12s_v3, Standard_NC24s_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2 + languages: en + SharedComputeCapacityEnabled: true + +tags: + task: embeddings + industry: health-and-life-sciences + Preview: "" + inference_supported_envs: + - hf + license: mit + author: Microsoft + hiddenlayerscanned: "" + SharedComputeCapacityEnabled: "" + inference_compute_allow_list: + [ + Standard_NC6s_v3, + Standard_NC12s_v3, + Standard_NC24s_v3, + Standard_NC24ads_A100_v4, + Standard_NC48ads_A100_v4, + Standard_NC96ads_A100_v4, + Standard_ND96asr_v4, + Standard_ND96amsr_A100_v4, + Standard_ND40rs_v2, + ] +version: 1 \ No newline at end of file diff --git a/assets/models/system/medimageparse/asset.yaml b/assets/models/system/medimageparse/asset.yaml new file mode 100644 index 0000000000..fcf5c5a05b --- /dev/null +++ b/assets/models/system/medimageparse/asset.yaml @@ -0,0 +1,4 @@ +extra_config: model.yaml +spec: spec.yaml +type: model +categories: ["Foundation Models"] diff --git a/assets/models/system/medimageparse/description.md b/assets/models/system/medimageparse/description.md new file mode 100644 index 0000000000..50ba7b8445 --- /dev/null +++ b/assets/models/system/medimageparse/description.md @@ -0,0 +1,133 @@ +### Overview +Biomedical image analysis is fundamental for biomedical discovery in cell biology, pathology, radiology, and many other biomedical domains. MedImageParse is a biomedical foundation model for imaging parsing that can jointly conduct segmentation, detection, and recognition for 82 object types across 9 imaging modalities. Through joint learning, we can improve accuracy for individual tasks and enable novel applications such as segmenting all relevant objects in an image through a text prompt, rather than requiring users to laboriously specify the bounding box for each object. + +On image segmentation, we showed that MedImageParse is broadly applicable, outperforming state-of-the-art methods on 102,855 test image-mask-label triples across 9 imaging modalities. + +MedImageParse is also able to identify invalid user inputs describing objects that do not exist in the image. On object detection, which aims to locate a specific object of interest, MedImageParse again attained state-of-the-art performance, especially on objects with irregular shapes. 
On object recognition, which aims to identify all objects in a given image along with their semantic types, we showed that MedImageParse can simultaneously segment and label all biomedical objects in an image.

In summary, MedImageParse is an all-in-one tool for biomedical image analysis, jointly solving segmentation, detection, and recognition.

It is broadly applicable to all major biomedical image modalities, paving the path for efficient and accurate image-based biomedical discovery.

### Model Architecture
MedImageParse is built upon a transformer-based architecture, optimized for processing large biomedical corpora. Leveraging multi-head attention mechanisms, it excels at identifying and understanding biomedical terminology, as well as extracting contextually relevant information from dense scientific texts. The model is pre-trained on vast biomedical datasets, allowing it to generalize across various biomedical domains with high accuracy.

### License and where to send questions or comments about the model
The license for MedImageParse is the MIT license.
For questions or comments, please contact: hlsfrontierteam@microsoft.com

### Training information

MedImageParse was trained on a large dataset comprising over six million triples of image, segmentation mask, and textual description.

MedImageParse used 16 NVIDIA A100-SXM4-40GB GPUs for a duration of 58 hours.

### Evaluation Results
Please see [the paper](https://microsoft.github.io/BiomedParse/assets/BiomedParse_arxiv.pdf) for detailed information about methods and results.

Bar plot comparing the Dice score between our method and competing methods on 102,855 test instances (image-mask-label triples) across 9 modalities. MedSAM and SAM require a bounding box as input.

![MedImageParse comparison results on segmentation](medimageparseresults.png)


### Fairness evaluation
We conducted fairness evaluations across sex and age groups. A two-sided independent t-test shows non-significant differences between female and male patients and between different age groups, with p-values > 5% for all imaging modalities and segmentation targets evaluated.

### Ethical Considerations and Limitations

Microsoft believes Responsible AI is a shared responsibility, and we have identified six principles and practices that help organizations address risks, innovate, and create value: fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant use case and addresses unforeseen product misuse.

While testing the model with images and/or text, ensure that the data is PHI-free and that it contains no patient information or information that can be traced back to a patient's identity.
The model is not designed for the following use cases:
* **Use as a diagnostic tool or as a medical device** - Although MedImageParse is highly accurate in parsing biomedical images, it is not intended to be consumed directly, and information extracted by our service should not be used in diagnosis, cure, mitigation, treatment, or prevention of disease or other conditions, or as a substitute for professional medical advice, diagnosis, treatment, or the clinical judgment of a healthcare professional.

* **Scenarios without consent for data** - Any scenario that uses health data for a purpose for which consent was not obtained.

* **Use outside of health scenarios** - Any scenario that uses non-medical images and/or serves purposes outside of the healthcare domain.

Please see Microsoft's Responsible AI Principles and approach available at [https://www.microsoft.com/en-us/ai/principles-and-approach/](https://www.microsoft.com/en-us/ai/principles-and-approach/)


### Sample inputs and outputs (for real-time inference)

Input:
```python
data = {
    "input_data": {
        "columns": [
            "image",
            "text"
        ],
        "index": [0, 1],
        "data": [
            [base64.encodebytes(read_image('./examples/Part_3_226_pathology_breast.png')).decode("utf-8"), "neoplastic cells in breast pathology & inflammatory cells."],
            [base64.encodebytes(read_image('./examples/TCGA_HT_7856_19950831_8_MRI-FLAIR_brain.png')).decode("utf-8"), "brain tumor"]
        ],
    },
    "params": {}
}
```


## Data and Resource Specification for Deployment
* **Supported Data Input Format**
1. The model expects 2D 8-bit RGB or grayscale images by default, with pixel values ranging from 0 to 255 and a resolution of 1024x1024.
2. We provided preprocessing notebooks 4, 5, 6 to illustrate how to convert raw formats including DICOM, NIFTI, PNG, and JPG to the desired format, with preprocessing steps such as CT windowing.
3. The model outputs pixel probabilities in the same shape as the input image. We convert the floating point probabilities to 8-bit grayscale outputs. The probability threshold for the segmentation mask is 0.5, which corresponds to 127.5 in the 8-bit grayscale output.
4. The model takes in text prompts for segmentation and doesn't have a fixed number of targets to handle. However, to ensure quality performance, we recommend the following tasks, based on evaluation results:
 - CT: abdomen: adrenal gland, aorta, bladder, duodenum, esophagus, gallbladder, kidney, kidney cyst,
     kidney tumor, left adrenal gland, left kidney, liver, pancreas, postcava,
     right adrenal gland, right kidney, spleen, stomach, tumor
   colon: tumor
   liver: liver, tumor
   lung: COVID-19 infection, nodule
   pelvis: uterus
 - MRI-FLAIR: brain: edema, lower-grade glioma, tumor, tumor core, whole tumor
 - MRI-T1-Gd: brain: enhancing tumor, tumor core
 - MRI-T2: prostate: prostate peripheral zone, prostate transitional zone
 - MRI: abdomen: aorta, esophagus, gallbladder, kidney, left kidney, liver, pancreas, postcava,
     right kidney, spleen, stomach
   brain: anterior hippocampus, posterior hippocampus
   heart: left heart atrium, left heart ventricle, myocardium, right heart ventricle
   prostate: prostate
 - OCT: retinal: edema
 - X-Ray: chest: COVID-19 infection, left lung, lung, lung opacity, right lung, viral pneumonia
 - dermoscopy: skin: lesion, melanoma
 - endoscope: colon: neoplastic polyp, non-neoplastic polyp, polyp
 - fundus: retinal: optic cup, optic disc
 - pathology: bladder: neoplastic cells
     breast: epithelial cells, neoplastic cells
     cervix: neoplastic cells
     colon: glandular structure, neoplastic cells
     esophagus: neoplastic cells
     kidney: neoplastic cells
     liver: epithelial cells, neoplastic cells
     ovarian: epithelial cells, neoplastic cells
     prostate: neoplastic cells
     skin: neoplastic cells
     stomach: neoplastic cells
     testis: epithelial cells
     thyroid: epithelial cells, neoplastic cells
     uterus: neoplastic cells
 - ultrasound: breast: benign tumor, malignant tumor, tumor
     heart: left heart atrium, left heart ventricle
     transperineal: fetal head, pubic symphysis

* **Hardware Requirement for Compute Instances**
- Default: Single V100 GPU
- Minimum: Single GPU instance with 8 GB memory
- Batch size: 4 (~6 GB memory)
- Image Compression Ratio: 75 (Default)
- Image Size: 512 (Default for X-Y Dimension)
diff --git a/assets/models/system/medimageparse/model.yaml b/assets/models/system/medimageparse/model.yaml
new file mode 100644
index 0000000000..a86648672c
--- /dev/null
+++ b/assets/models/system/medimageparse/model.yaml
@@ -0,0 +1,8 @@
path:
  container_name: models
  container_path: huggingface/MedImageParse/mlflow_model_folder
  storage_name: automlcesdkdataresources
  type: azureblob
publish:
  description: description.md
  type: mlflow_model
diff --git a/assets/models/system/medimageparse/spec.yaml b/assets/models/system/medimageparse/spec.yaml
new file mode 100644
index 0000000000..9a122f786f
--- /dev/null
+++ b/assets/models/system/medimageparse/spec.yaml
@@ -0,0 +1,34 @@
$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json

name: MedImageParse
path: ./

properties:
  inference-min-sku-spec: 6|1|112|64
  inference-recommended-sku: Standard_NC6s_v3, Standard_NC12s_v3, Standard_NC24s_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2
  languages: en
  SharedComputeCapacityEnabled: true

tags:
  task: image-segmentation
  industry: health-and-life-sciences
  Preview: ""
  inference_supported_envs:
  - hf
  license: mit
  author: Microsoft
  hiddenlayerscanned: ""
  SharedComputeCapacityEnabled: ""
  inference_compute_allow_list:
  [
    Standard_NC6s_v3,
    Standard_NC12s_v3,
    Standard_NC24s_v3,
    Standard_NC24ads_A100_v4,
    Standard_NC48ads_A100_v4,
    Standard_NC96ads_A100_v4,
    Standard_ND96asr_v4,
    Standard_ND96amsr_A100_v4,
    Standard_ND40rs_v2,
  ]
version: 1
\ No newline at end of file
diff --git a/assets/models/system/microsoft-llava-med-v1.5-mistral-7b/asset.yaml b/assets/models/system/microsoft-llava-med-v1.5-mistral-7b/asset.yaml
new file mode 100644
index 0000000000..fcf5c5a05b
--- /dev/null
+++ b/assets/models/system/microsoft-llava-med-v1.5-mistral-7b/asset.yaml
@@ -0,0 +1,4 @@
extra_config: model.yaml
spec: spec.yaml
type: model
categories: ["Foundation Models"]
diff --git a/assets/models/system/microsoft-llava-med-v1.5-mistral-7b/description.md b/assets/models/system/microsoft-llava-med-v1.5-mistral-7b/description.md
new file mode 100644
index 0000000000..95cea135cc
--- /dev/null
+++ b/assets/models/system/microsoft-llava-med-v1.5-mistral-7b/description.md
@@ -0,0 +1,55 @@
# LLaVA-Med v1.5, using [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) as the LLM for a better commercial license

Large Language and Vision Assistant for bioMedicine (i.e., "LLaVA-Med") is a large language and vision model trained using a curriculum learning method for adapting LLaVA to the biomedical domain. It is an open-source release intended for research use only, to facilitate reproducibility of the corresponding paper, which claims improved performance on open-ended biomedical question answering tasks, including common visual question answering (VQA) benchmark datasets such as PathVQA and VQA-RAD.

LLaVA-Med was proposed in [LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day](https://arxiv.org/abs/2306.00890) by Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, Jianfeng Gao.


**Model date:**
LLaVA-Med-v1.5-Mistral-7B was trained in April 2024.

**Paper or resources for more information:**
https://aka.ms/llava-med

**Where to send questions or comments about the model:**
https://github.com/microsoft/LLaVA-Med/issues


## License
[mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) license.

## Intended use

The data, code, and model checkpoints are intended to be used solely for (I) future research on visual-language processing and (II) reproducibility of the experimental results reported in the reference paper. The data, code, and model checkpoints are not intended to be used in clinical care or for any clinical decision-making purposes.

### Primary Intended Use

The primary intended use is to support AI researchers reproducing and building on top of this work. LLaVA-Med and its associated models should be helpful for exploring various biomedical vision-language processing (VLP) and visual question answering (VQA) research questions.

### Out-of-Scope Use

Any deployed use case of the model --- commercial or otherwise --- is out of scope. Although we evaluated the models using a broad set of publicly available research benchmarks, the models and evaluations are intended for research use only and not intended for deployed use cases. Please refer to [the associated paper](https://aka.ms/llava-med) for more details.


## Data

This model builds upon the [PMC-15M dataset](https://aka.ms/biomedclip-paper), which is a large-scale parallel image-text dataset for biomedical vision-language processing.
It contains 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central. It covers a diverse range of biomedical image types, such as microscopy, radiography, histology, and more. + + +## Limitations + +This model was developed using English corpora, and thus may be considered English-only. This model is evaluated on a narrow set of biomedical benchmark tasks, described in [LLaVA-Med paper](https://aka.ms/llava-med). As such, it is not suitable for use in any clinical setting. Under some conditions, the model may make inaccurate predictions and display limitations, which may require additional mitigation strategies. In particular, this model is likely to carry many of the limitations of the model from which it is derived, [LLaVA](https://llava-vl.github.io/). + +Further, this model was developed in part using the [PMC-15M](https://aka.ms/biomedclip-paper) dataset. The figure-caption pairs that make up this dataset may contain biases reflecting the current practice of academic publication. For example, the corresponding papers may be enriched for positive findings, contain examples of extreme cases, and otherwise reflect distributions that are not representative of other sources of biomedical data. + + +### BibTeX entry and citation info + +```bibtex +@article{li2023llavamed, + title={Llava-med: Training a large language-and-vision assistant for biomedicine in one day}, + author={Li, Chunyuan and Wong, Cliff and Zhang, Sheng and Usuyama, Naoto and Liu, Haotian and Yang, Jianwei and Naumann, Tristan and Poon, Hoifung and Gao, Jianfeng}, + journal={arXiv preprint arXiv:2306.00890}, + year={2023} +} +``` \ No newline at end of file diff --git a/assets/models/system/microsoft-llava-med-v1.5-mistral-7b/model.yaml b/assets/models/system/microsoft-llava-med-v1.5-mistral-7b/model.yaml new file mode 100644 index 0000000000..5b81ca69d9 --- /dev/null +++ b/assets/models/system/microsoft-llava-med-v1.5-mistral-7b/model.yaml @@ -0,0 +1,8 @@ +path: + container_name: models + container_path: huggingface/llava-med-v1.5-mistral-7b/mlflow_model_folder + storage_name: automlcesdkdataresources + type: azureblob +publish: + description: description.md + type: mlflow_model diff --git a/assets/models/system/microsoft-llava-med-v1.5-mistral-7b/spec.yaml b/assets/models/system/microsoft-llava-med-v1.5-mistral-7b/spec.yaml new file mode 100644 index 0000000000..383a58772b --- /dev/null +++ b/assets/models/system/microsoft-llava-med-v1.5-mistral-7b/spec.yaml @@ -0,0 +1,36 @@ +$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json + +name: microsoft-llava-med-v1.5-mistral-7b +path: ./ + +properties: + inference-min-sku-spec: 6|1|112|64 + inference-recommended-sku: Standard_NC6s_v3, Standard_NC12s_v3, Standard_NC24s_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2 + languages: en + SharedComputeCapacityEnabled: true + +tags: + industry: health-and-life-sciences + author: Microsoft + Preview: "" + SharedComputeCapacityEnabled: "" + inference_supported_envs: + - hf + license: apache-2.0 + task: image-text-to-text + hiddenlayerscanned: "" + huggingface_model_id: microsoft/llava-med-v1.5-mistral-7b + inference_compute_allow_list: + [ + Standard_NC6s_v3, + Standard_NC12s_v3, + Standard_NC24s_v3, + Standard_NC24ads_A100_v4, + Standard_NC48ads_A100_v4, + Standard_NC96ads_A100_v4, + Standard_ND96asr_v4, + Standard_ND96amsr_A100_v4, + Standard_ND40rs_v2, + ] + +version: 1 \ No 
newline at end of file