From fc8b30bfc41646ad091c002f485ee3dc3b13c86c Mon Sep 17 00:00:00 2001 From: Carey Phelps Date: Thu, 8 Feb 2024 07:51:44 -0800 Subject: [PATCH] edit readme, add notebook --- examples/streamlit/README.md | 3 +- .../annotation/Annotation_Streamlit_WB.ipynb | 2860 +++++++++++++++++ 2 files changed, 2862 insertions(+), 1 deletion(-) create mode 100644 examples/streamlit/annotation/Annotation_Streamlit_WB.ipynb diff --git a/examples/streamlit/README.md b/examples/streamlit/README.md index 02480616..6959e697 100644 --- a/examples/streamlit/README.md +++ b/examples/streamlit/README.md @@ -4,4 +4,5 @@ Use Streamlit with W&B for quick, interactive apps. We have two example use cases here in this repo: 1. **Quickstart**: Embed an iframe of W&B in a Streamlit app -2. **Annotation**: Interactively annotate LLM data \ No newline at end of file +2. **Annotation**: Interactively annotate LLM data + diff --git a/examples/streamlit/annotation/Annotation_Streamlit_WB.ipynb b/examples/streamlit/annotation/Annotation_Streamlit_WB.ipynb new file mode 100644 index 00000000..1f0d69d4 --- /dev/null +++ b/examples/streamlit/annotation/Annotation_Streamlit_WB.ipynb @@ -0,0 +1,2860 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "S-Uf9D78nfrX" + }, + "source": [ + "# Annotations for LLMs with Streamlit and W&B\n", + "\n", + "With [Weights & Biases](https://wandb.ai/site), log inputs and outputs from LLM experiments, then evaluate results. Examine individual prompts and responses at the application scale.\n", + "\n", + "W&B Tables stores these critical assets in a single system of record alongside other artifacts, such as input datasets and model checkpoints, with essential metadata and lineage tracked for transparency and reproducibility.\n", + "\n", + "One smart strategy is revising these assets in a table to improve on model performance. 
[Streamlit's data editor](https://docs.streamlit.io/library/api-reference/data/st.data_editor), showcased in this application, provides an elegant and flexible solution using W&B Tables. Through the application's UI, annotators can flag outlier model responses, select next steps for refinement, and edit results in-place as needed. All of that can be easily exported and stored as a subsequent artifact to a Weights & Biases LLM development or tuning project.\n", + "\n", + "This notebook walks through one simple approach, with the following steps:\n", + "1. Run automated summary of news articles with Hugging Face [pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines)\n", + "2. Log Tables to W&B to compare two model approaches\n", + "3. Download CSV files of Tables to annotate in Streamlit\n", + "4. Annotate tables with Streamlit data editor\n", + "5. Load annotated Tables to W&B for versioning and evaluation\n", + "\n", + "### 🏁 Let's get started!\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "11msJvwtbSMO" + }, + "source": [ + "First, install dependencies for W&B and Hugging Face." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JkP6kNsyXVzx" + }, + "outputs": [], + "source": [ + "# Dependencies\n", + "! pip install datasets transformers\n", + "! pip install wandb -qq\n", + "! 
pip install accelerate -U" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 139 + }, + "id": "BGSdI2XDtOVA", + "outputId": "19703d97-9601-4072-ad30-d3617725aefe" + }, + "outputs": [ + { + "data": { + "application/javascript": "\n window._wandbApiKey = new Promise((resolve, reject) => {\n function loadScript(url) {\n return new Promise(function(resolve, reject) {\n let newScript = document.createElement(\"script\");\n newScript.onerror = reject;\n newScript.onload = resolve;\n document.body.appendChild(newScript);\n newScript.src = url;\n });\n }\n loadScript(\"https://cdn.jsdelivr.net/npm/postmate/build/postmate.min.js\").then(() => {\n const iframe = document.createElement('iframe')\n iframe.style.cssText = \"width:0;height:0;border:none\"\n document.body.appendChild(iframe)\n const handshake = new Postmate({\n container: iframe,\n url: 'https://wandb.ai/authorize'\n });\n const timeout = setTimeout(() => reject(\"Couldn't auto authenticate\"), 5000)\n handshake.then(function(child) {\n child.on('authorize', data => {\n clearTimeout(timeout)\n resolve(data)\n });\n });\n })\n });\n ", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[34m\u001b[1mwandb\u001b[0m: Logging into wandb.ai. 
(Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)\n", + "\u001b[34m\u001b[1mwandb\u001b[0m: You can find your API key in your browser here: https://wandb.ai/authorize\n", + "wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + " ··········\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + " ··········\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[34m\u001b[1mwandb\u001b[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc\n" + ] + }, + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "\n", + "from transformers import pipeline\n", + "from datasets import load_dataset, Dataset\n", + "\n", + "import wandb\n", + "wandb.login()\n", + "\n", + "# import weave\n", + "# # from weave import ops_arrow\n", + "# # from weave.ops_arrow import constructors as arrow_constructors\n", + "# from weave.monitoring import StreamTable\n", + "# import pyarrow as pa" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K8lQxTtlWV3j" + }, + "source": [ + "## 1. Run automated summary of news articles with Hugging Face pipelines\n", + "\n", + "This notebook will use a summarization example to showcase W&B and Streamlit, together. 
Summarization can serve a lot of important uses in ML pipelines, from assisting in data quality checks to preprocessing long-form data into something digestible for a downstream task, e.g., classification. There are many options out there for generating summaries automatically, but for ease of use we are going with Hugging Face [pipelines](https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/pipelines#transformers.SummarizationPipeline)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gOfSJN4pTVng" + }, + "outputs": [], + "source": [ + "NUM_ARTICLES = 20" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4lXOLMOzcU0u" + }, + "source": [ + "We will use the tried-and-true CNN/Daily Mail [dataset](https://huggingface.co/datasets/cnn_dailymail) to test out summarization outputs from two different pre-trained models from the Hugging Face [model repository](https://huggingface.co/models)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_ld6lWoRTVkp" + }, + "outputs": [], + "source": [ + "cnn_dailymail = load_dataset('cnn_dailymail', '3.0.0')\n", + "\n", + "input_df = cnn_dailymail['train'].to_pandas().sample(frac=1)[:NUM_ARTICLES]\n", + "articles = input_df['article'].values" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Pm2U9YtiTVh2" + }, + "outputs": [], + "source": [ + "# Define summarizers for two different models for comparison\n", + "bart_summarizer = pipeline(\"summarization\", \"facebook/bart-large-cnn\")\n", + "samsum_summarizer = pipeline(\"summarization\", \"philschmid/bart-large-cnn-samsum\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MjU4QCVwTdAL" + }, + "outputs": [], + "source": [ + "# Summarize each article with both models (inputs are truncated by character count)\n", + "bart_summaries = []\n", + "bart_samsum_summaries = []\n", + "\n", + "for article in articles:\n", + " 
bart_summaries.append(bart_summarizer(article[:1024])[0][\"summary_text\"])\n", + " bart_samsum_summaries.append(samsum_summarizer(article[:512])[0][\"summary_text\"])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5bYJ_MbTTuJP" + }, + "outputs": [], + "source": [ + "bart_df = pd.DataFrame({\n", + " \"articles\": articles,\n", + " \"bart_summaries\": bart_summaries,\n", + "})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 677 + }, + "id": "HSE46Xm1iMqA", + "outputId": "5fc4e42a-fbe6-4a11-b230-53a7c9abbe2b" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
articlesbart_summaries
0By . Chris Parsons . PUBLISHED: . 02:41 EST, 2...Valerie Trierweiler tweeted support for a poli...
1By . Damien Gayle . PUBLISHED: . 08:10 EST, 12...Nasa and Florida Institute for Human and Machi...
2A father who imported a stun gun disguised as ...John Liddiatt, 40, ordered the device online w...
3February 10, 2015 . Economics, international p...This page includes the show Transcript. Use th...
4By . James Rush . PUBLISHED: . 10:51 EST, 5 Ap...Boy, three, two teenagers and a man in his 30s...
5By . Daily Mail Reporter . PUBLISHED: . 19:11 ...Grover J. Prewitt Jr., 60, of Bristow was arre...
6By . James Rush . PUBLISHED: . 06:02 EST, 30 S...Sheik Mohammed bin Rashid Al Maktoum has order...
7By . Sam Webb and Amanda Williams . PUBLISHED:...Six out of the last seven UK summers have seen...
8(CNN) -- When CNN highlighted some excellent h...Tampa's Columbia Restaurant is 107 years old. ...
9(CNN) -- Japanese golf prodigy Ryo Ishikawa ha...Japanese golf prodigy Ryo Ishikawa will donate...
10Warsaw (CNN) -- Eleven people died and one, so...A small aircraft belonging to a private parach...
11By . Associated Press . PUBLISHED: . 09:08 EST...Allegations that male instructors had sex with...
12By . Adam Shergold . PUBLISHED: . 13:08 EST, 5...Matt Prior hits back at Piers Morgan's claims ...
13Once an annual round of exclusive balls, refin...The Queen Charlotte's Ball was held last night...
14(CNN) -- The memory unit that may tell why an ...NEW: Air France says the memory unit is part o...
15By . James Nye . Under mounting pressure to s...The rapper has signed a contract to design an ...
16Police should face time limits on how long a p...Theresa May is demanding action amid mounting ...
17Two Second World War Lancaster bombers flew to...The Lancaster Thumper joined the Canadian Lanc...
18Andy Murray knows there can be no let up in hi...Andy Murray beat David Ferrer 5-7 6-2 7-5 in t...
19A lifelong dream becomes reality for Chris Cor...Chris Cork is set to embark on the 5,500-mile ...
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "text/plain": [ + " articles \\\n", + "0 By . Chris Parsons . PUBLISHED: . 02:41 EST, 2... \n", + "1 By . Damien Gayle . PUBLISHED: . 08:10 EST, 12... \n", + "2 A father who imported a stun gun disguised as ... \n", + "3 February 10, 2015 . Economics, international p... \n", + "4 By . James Rush . PUBLISHED: . 10:51 EST, 5 Ap... \n", + "5 By . Daily Mail Reporter . PUBLISHED: . 19:11 ... \n", + "6 By . James Rush . PUBLISHED: . 06:02 EST, 30 S... \n", + "7 By . Sam Webb and Amanda Williams . PUBLISHED:... \n", + "8 (CNN) -- When CNN highlighted some excellent h... \n", + "9 (CNN) -- Japanese golf prodigy Ryo Ishikawa ha... \n", + "10 Warsaw (CNN) -- Eleven people died and one, so... \n", + "11 By . Associated Press . PUBLISHED: . 09:08 EST... \n", + "12 By . Adam Shergold . PUBLISHED: . 13:08 EST, 5... \n", + "13 Once an annual round of exclusive balls, refin... \n", + "14 (CNN) -- The memory unit that may tell why an ... \n", + "15 By . James Nye . Under mounting pressure to s... \n", + "16 Police should face time limits on how long a p... \n", + "17 Two Second World War Lancaster bombers flew to... \n", + "18 Andy Murray knows there can be no let up in hi... \n", + "19 A lifelong dream becomes reality for Chris Cor... \n", + "\n", + " bart_summaries \n", + "0 Valerie Trierweiler tweeted support for a poli... \n", + "1 Nasa and Florida Institute for Human and Machi... \n", + "2 John Liddiatt, 40, ordered the device online w... \n", + "3 This page includes the show Transcript. Use th... \n", + "4 Boy, three, two teenagers and a man in his 30s... \n", + "5 Grover J. Prewitt Jr., 60, of Bristow was arre... \n", + "6 Sheik Mohammed bin Rashid Al Maktoum has order... \n", + "7 Six out of the last seven UK summers have seen... \n", + "8 Tampa's Columbia Restaurant is 107 years old. ... \n", + "9 Japanese golf prodigy Ryo Ishikawa will donate... \n", + "10 A small aircraft belonging to a private parach... 
\n", + "11 Allegations that male instructors had sex with... \n", + "12 Matt Prior hits back at Piers Morgan's claims ... \n", + "13 The Queen Charlotte's Ball was held last night... \n", + "14 NEW: Air France says the memory unit is part o... \n", + "15 The rapper has signed a contract to design an ... \n", + "16 Theresa May is demanding action amid mounting ... \n", + "17 The Lancaster Thumper joined the Canadian Lanc... \n", + "18 Andy Murray beat David Ferrer 5-7 6-2 7-5 in t... \n", + "19 Chris Cork is set to embark on the 5,500-mile ... " + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bart_df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "1wWj3MK0Ts4Q" + }, + "outputs": [], + "source": [ + "samsum_df = pd.DataFrame({\n", + " \"articles\": articles,\n", + " \"bart_samsum_summaries\": bart_samsum_summaries,\n", + "})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 677 + }, + "id": "jPIrQ_fUiNjX", + "outputId": "77c725f5-10b9-4b25-af8c-230b415bc03c" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
articlesbart_samsum_summaries
0By . Chris Parsons . PUBLISHED: . 02:41 EST, 2...Segolene Royal lost the presidential election ...
1By . Damien Gayle . PUBLISHED: . 08:10 EST, 12...Nasa and the Florida Institute for Human and M...
2A father who imported a stun gun disguised as ...John Liddiatt, 40, imported a stun gun disguis...
3February 10, 2015 . Economics, international p...This page includes CNN Student News stories on...
4By . James Rush . PUBLISHED: . 10:51 EST, 5 Ap...A three-year-old boy and two other people rema...
5By . Daily Mail Reporter . PUBLISHED: . 19:11 ...Grover J. Prewitt Jr., 60, of Bristow, Oklahom...
6By . James Rush . PUBLISHED: . 06:02 EST, 30 S...Banned equine drugs were found on a Dubai gove...
7By . Sam Webb and Amanda Williams . PUBLISHED:...Met Office experts predict a decade of wet sum...
8(CNN) -- When CNN highlighted some excellent h...Last month, CNN highlighted some excellent his...
9(CNN) -- Japanese golf prodigy Ryo Ishikawa ha...Ryo Ishikawa will donate his tournament earnin...
10Warsaw (CNN) -- Eleven people died and one, so...Eleven people died and one survived when a sma...
11By . Associated Press . PUBLISHED: . 09:08 EST...Col. Glenn Palmer delivered his first order to...
12By . Adam Shergold . PUBLISHED: . 13:08 EST, 5...Matt Prior has hit back at Piers Morgan's clai...
13Once an annual round of exclusive balls, refin...The debutante ball has lost its royal patronag...
14(CNN) -- The memory unit that may tell why an ...The memory unit that may tell why an Air Franc...
15By . James Nye . Under mounting pressure to s...James Nye has released a statement on his webs...
16Police should face time limits on how long a p...Theresa May is demanding action against police...
17Two Second World War Lancaster bombers flew to...Two World War II-era Lancaster bombers flew to...
18Andy Murray knows there can be no let up in hi...Andy Murray beat David Ferrer in the final of ...
19A lifelong dream becomes reality for Chris Cor...Chris Cork will start his Dakar journey in Bue...
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "text/plain": [ + " articles \\\n", + "0 By . Chris Parsons . PUBLISHED: . 02:41 EST, 2... \n", + "1 By . Damien Gayle . PUBLISHED: . 08:10 EST, 12... \n", + "2 A father who imported a stun gun disguised as ... \n", + "3 February 10, 2015 . Economics, international p... \n", + "4 By . James Rush . PUBLISHED: . 10:51 EST, 5 Ap... \n", + "5 By . Daily Mail Reporter . PUBLISHED: . 19:11 ... \n", + "6 By . James Rush . PUBLISHED: . 06:02 EST, 30 S... \n", + "7 By . Sam Webb and Amanda Williams . PUBLISHED:... \n", + "8 (CNN) -- When CNN highlighted some excellent h... \n", + "9 (CNN) -- Japanese golf prodigy Ryo Ishikawa ha... \n", + "10 Warsaw (CNN) -- Eleven people died and one, so... \n", + "11 By . Associated Press . PUBLISHED: . 09:08 EST... \n", + "12 By . Adam Shergold . PUBLISHED: . 13:08 EST, 5... \n", + "13 Once an annual round of exclusive balls, refin... \n", + "14 (CNN) -- The memory unit that may tell why an ... \n", + "15 By . James Nye . Under mounting pressure to s... \n", + "16 Police should face time limits on how long a p... \n", + "17 Two Second World War Lancaster bombers flew to... \n", + "18 Andy Murray knows there can be no let up in hi... \n", + "19 A lifelong dream becomes reality for Chris Cor... \n", + "\n", + " bart_samsum_summaries \n", + "0 Segolene Royal lost the presidential election ... \n", + "1 Nasa and the Florida Institute for Human and M... \n", + "2 John Liddiatt, 40, imported a stun gun disguis... \n", + "3 This page includes CNN Student News stories on... \n", + "4 A three-year-old boy and two other people rema... \n", + "5 Grover J. Prewitt Jr., 60, of Bristow, Oklahom... \n", + "6 Banned equine drugs were found on a Dubai gove... \n", + "7 Met Office experts predict a decade of wet sum... \n", + "8 Last month, CNN highlighted some excellent his... \n", + "9 Ryo Ishikawa will donate his tournament earnin... \n", + "10 Eleven people died and one survived when a sma... \n", + "11 Col. 
Glenn Palmer delivered his first order to... \n", + "12 Matt Prior has hit back at Piers Morgan's clai... \n", + "13 The debutante ball has lost its royal patronag... \n", + "14 The memory unit that may tell why an Air Franc... \n", + "15 James Nye has released a statement on his webs... \n", + "16 Theresa May is demanding action against police... \n", + "17 Two World War II-era Lancaster bombers flew to... \n", + "18 Andy Murray beat David Ferrer in the final of ... \n", + "19 Chris Cork will start his Dakar journey in Bue... " + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "samsum_df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YfNlCobpTc9T" + }, + "outputs": [], + "source": [ + "# Combine dataframes for logging to W&B\n", + "\n", + "joint_df = pd.DataFrame({\n", + " \"articles\": articles,\n", + " \"bart_summaries\": bart_summaries,\n", + " \"bart_samsum_summaries\": bart_samsum_summaries,\n", + "})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 677 + }, + "id": "oIO0RyatiPIW", + "outputId": "4cf643d9-38fc-4b9a-9b4e-b04574efcd3c" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
articlesbart_summariesbart_samsum_summaries
0By . Chris Parsons . PUBLISHED: . 02:41 EST, 2...Valerie Trierweiler tweeted support for a poli...Segolene Royal lost the presidential election ...
1By . Damien Gayle . PUBLISHED: . 08:10 EST, 12...Nasa and Florida Institute for Human and Machi...Nasa and the Florida Institute for Human and M...
2A father who imported a stun gun disguised as ...John Liddiatt, 40, ordered the device online w...John Liddiatt, 40, imported a stun gun disguis...
3February 10, 2015 . Economics, international p...This page includes the show Transcript. Use th...This page includes CNN Student News stories on...
4By . James Rush . PUBLISHED: . 10:51 EST, 5 Ap...Boy, three, two teenagers and a man in his 30s...A three-year-old boy and two other people rema...
5By . Daily Mail Reporter . PUBLISHED: . 19:11 ...Grover J. Prewitt Jr., 60, of Bristow was arre...Grover J. Prewitt Jr., 60, of Bristow, Oklahom...
6By . James Rush . PUBLISHED: . 06:02 EST, 30 S...Sheik Mohammed bin Rashid Al Maktoum has order...Banned equine drugs were found on a Dubai gove...
7By . Sam Webb and Amanda Williams . PUBLISHED:...Six out of the last seven UK summers have seen...Met Office experts predict a decade of wet sum...
8(CNN) -- When CNN highlighted some excellent h...Tampa's Columbia Restaurant is 107 years old. ...Last month, CNN highlighted some excellent his...
9(CNN) -- Japanese golf prodigy Ryo Ishikawa ha...Japanese golf prodigy Ryo Ishikawa will donate...Ryo Ishikawa will donate his tournament earnin...
10Warsaw (CNN) -- Eleven people died and one, so...A small aircraft belonging to a private parach...Eleven people died and one survived when a sma...
11By . Associated Press . PUBLISHED: . 09:08 EST...Allegations that male instructors had sex with...Col. Glenn Palmer delivered his first order to...
12By . Adam Shergold . PUBLISHED: . 13:08 EST, 5...Matt Prior hits back at Piers Morgan's claims ...Matt Prior has hit back at Piers Morgan's clai...
13Once an annual round of exclusive balls, refin...The Queen Charlotte's Ball was held last night...The debutante ball has lost its royal patronag...
14(CNN) -- The memory unit that may tell why an ...NEW: Air France says the memory unit is part o...The memory unit that may tell why an Air Franc...
15By . James Nye . Under mounting pressure to s...The rapper has signed a contract to design an ...James Nye has released a statement on his webs...
16Police should face time limits on how long a p...Theresa May is demanding action amid mounting ...Theresa May is demanding action against police...
17Two Second World War Lancaster bombers flew to...The Lancaster Thumper joined the Canadian Lanc...Two World War II-era Lancaster bombers flew to...
18Andy Murray knows there can be no let up in hi...Andy Murray beat David Ferrer 5-7 6-2 7-5 in t...Andy Murray beat David Ferrer in the final of ...
19A lifelong dream becomes reality for Chris Cor...Chris Cork is set to embark on the 5,500-mile ...Chris Cork will start his Dakar journey in Bue...
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "text/plain": [ + " articles \\\n", + "0 By . Chris Parsons . PUBLISHED: . 02:41 EST, 2... \n", + "1 By . Damien Gayle . PUBLISHED: . 08:10 EST, 12... \n", + "2 A father who imported a stun gun disguised as ... \n", + "3 February 10, 2015 . Economics, international p... \n", + "4 By . James Rush . PUBLISHED: . 10:51 EST, 5 Ap... \n", + "5 By . Daily Mail Reporter . PUBLISHED: . 19:11 ... \n", + "6 By . James Rush . PUBLISHED: . 06:02 EST, 30 S... \n", + "7 By . Sam Webb and Amanda Williams . PUBLISHED:... \n", + "8 (CNN) -- When CNN highlighted some excellent h... \n", + "9 (CNN) -- Japanese golf prodigy Ryo Ishikawa ha... \n", + "10 Warsaw (CNN) -- Eleven people died and one, so... \n", + "11 By . Associated Press . PUBLISHED: . 09:08 EST... \n", + "12 By . Adam Shergold . PUBLISHED: . 13:08 EST, 5... \n", + "13 Once an annual round of exclusive balls, refin... \n", + "14 (CNN) -- The memory unit that may tell why an ... \n", + "15 By . James Nye . Under mounting pressure to s... \n", + "16 Police should face time limits on how long a p... \n", + "17 Two Second World War Lancaster bombers flew to... \n", + "18 Andy Murray knows there can be no let up in hi... \n", + "19 A lifelong dream becomes reality for Chris Cor... \n", + "\n", + " bart_summaries \\\n", + "0 Valerie Trierweiler tweeted support for a poli... \n", + "1 Nasa and Florida Institute for Human and Machi... \n", + "2 John Liddiatt, 40, ordered the device online w... \n", + "3 This page includes the show Transcript. Use th... \n", + "4 Boy, three, two teenagers and a man in his 30s... \n", + "5 Grover J. Prewitt Jr., 60, of Bristow was arre... \n", + "6 Sheik Mohammed bin Rashid Al Maktoum has order... \n", + "7 Six out of the last seven UK summers have seen... \n", + "8 Tampa's Columbia Restaurant is 107 years old. ... \n", + "9 Japanese golf prodigy Ryo Ishikawa will donate... \n", + "10 A small aircraft belonging to a private parach... 
\n", + "11 Allegations that male instructors had sex with... \n", + "12 Matt Prior hits back at Piers Morgan's claims ... \n", + "13 The Queen Charlotte's Ball was held last night... \n", + "14 NEW: Air France says the memory unit is part o... \n", + "15 The rapper has signed a contract to design an ... \n", + "16 Theresa May is demanding action amid mounting ... \n", + "17 The Lancaster Thumper joined the Canadian Lanc... \n", + "18 Andy Murray beat David Ferrer 5-7 6-2 7-5 in t... \n", + "19 Chris Cork is set to embark on the 5,500-mile ... \n", + "\n", + " bart_samsum_summaries \n", + "0 Segolene Royal lost the presidential election ... \n", + "1 Nasa and the Florida Institute for Human and M... \n", + "2 John Liddiatt, 40, imported a stun gun disguis... \n", + "3 This page includes CNN Student News stories on... \n", + "4 A three-year-old boy and two other people rema... \n", + "5 Grover J. Prewitt Jr., 60, of Bristow, Oklahom... \n", + "6 Banned equine drugs were found on a Dubai gove... \n", + "7 Met Office experts predict a decade of wet sum... \n", + "8 Last month, CNN highlighted some excellent his... \n", + "9 Ryo Ishikawa will donate his tournament earnin... \n", + "10 Eleven people died and one survived when a sma... \n", + "11 Col. Glenn Palmer delivered his first order to... \n", + "12 Matt Prior has hit back at Piers Morgan's clai... \n", + "13 The debutante ball has lost its royal patronag... \n", + "14 The memory unit that may tell why an Air Franc... \n", + "15 James Nye has released a statement on his webs... \n", + "16 Theresa May is demanding action against police... \n", + "17 Two World War II-era Lancaster bombers flew to... \n", + "18 Andy Murray beat David Ferrer in the final of ... \n", + "19 Chris Cork will start his Dakar journey in Bue... 
" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "joint_df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UuqHWjuwdMp5" + }, + "source": [ + "Here, we define a simple word function and [lexical diversity](https://en.wikipedia.org/wiki/Lexical_diversity) function, which can be useful data points for examining text inputs and gauging how completely and fluently summarization outputs capture their \"meaning.\"\n", + "
\n", + "
\n", + "There are many methods and dimensions to consider when evaluating summaries, to a quick vibes check to reference-based metrics (if you are lucky enough to have gold-standard reference summaries 馃崁). This walkthrough shows a simple manual approach, where automated summaries are evaluated for further refinement." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ftlmADWATc6c" + }, + "outputs": [], + "source": [ + "# Function to calculate word count\n", + "def calculate_word_count(text):\n", + " words = text.split()\n", + " return len(words)\n", + "\n", + "# Function to calculate lexical diversity\n", + "def calculate_lexical_diversity(text):\n", + " words = text.split()\n", + " unique_words = set(words)\n", + " return round((len(unique_words) / len(words)), 3)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "NM69E2HITc3W", + "outputId": "f414d726-3a0c-482f-be98-2de788bc2421" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n" + ], + "text/plain": [ + " articles \\\n", + "0 By . Chris Parsons . PUBLISHED: . 02:41 EST, 2... \n", + "1 By . Damien Gayle . PUBLISHED: . 08:10 EST, 12... \n", + "2 A father who imported a stun gun disguised as ... \n", + "3 February 10, 2015 . Economics, international p... \n", + "4 By . James Rush . PUBLISHED: . 10:51 EST, 5 Ap... \n", + "5 By . Daily Mail Reporter . PUBLISHED: . 19:11 ... \n", + "6 By . James Rush . PUBLISHED: . 06:02 EST, 30 S... \n", + "7 By . Sam Webb and Amanda Williams . PUBLISHED:... \n", + "8 (CNN) -- When CNN highlighted some excellent h... \n", + "9 (CNN) -- Japanese golf prodigy Ryo Ishikawa ha... \n", + "10 Warsaw (CNN) -- Eleven people died and one, so... \n", + "11 By . Associated Press . PUBLISHED: . 09:08 EST... \n", + "12 By . Adam Shergold . PUBLISHED: . 13:08 EST, 5... \n", + "13 Once an annual round of exclusive balls, refin... \n", + "14 (CNN) -- The memory unit that may tell why an ... \n", + "15 By . James Nye . Under mounting pressure to s... \n", + "16 Police should face time limits on how long a p... \n", + "17 Two Second World War Lancaster bombers flew to... \n", + "18 Andy Murray knows there can be no let up in hi... \n", + "19 A lifelong dream becomes reality for Chris Cor... \n", + "\n", + " bart_summaries source_word_count \\\n", + "0 Valerie Trierweiler tweeted support for a poli... 719 \n", + "1 Nasa and Florida Institute for Human and Machi... 853 \n", + "2 John Liddiatt, 40, ordered the device online w... 538 \n", + "3 This page includes the show Transcript. Use th... 239 \n", + "4 Boy, three, two teenagers and a man in his 30s... 581 \n", + "5 Grover J. Prewitt Jr., 60, of Bristow was arre... 789 \n", + "6 Sheik Mohammed bin Rashid Al Maktoum has order... 727 \n", + "7 Six out of the last seven UK summers have seen... 1351 \n", + "8 Tampa's Columbia Restaurant is 107 years old. ... 1440 \n", + "9 Japanese golf prodigy Ryo Ishikawa will donate... 
235 \n", + "10 A small aircraft belonging to a private parach... 193 \n", + "11 Allegations that male instructors had sex with... 1011 \n", + "12 Matt Prior hits back at Piers Morgan's claims ... 598 \n", + "13 The Queen Charlotte's Ball was held last night... 1022 \n", + "14 NEW: Air France says the memory unit is part o... 468 \n", + "15 The rapper has signed a contract to design an ... 1475 \n", + "16 Theresa May is demanding action amid mounting ... 850 \n", + "17 The Lancaster Thumper joined the Canadian Lanc... 902 \n", + "18 Andy Murray beat David Ferrer 5-7 6-2 7-5 in t... 562 \n", + "19 Chris Cork is set to embark on the 5,500-mile ... 703 \n", + "\n", + " summary_word_count source_lexical_diversity summary_lexical_diversity \n", + "0 40 0.517 0.925 \n", + "1 56 0.523 0.857 \n", + "2 54 0.507 0.815 \n", + "3 48 0.640 0.833 \n", + "4 48 0.508 0.938 \n", + "5 48 0.485 0.917 \n", + "6 53 0.469 0.925 \n", + "7 59 0.432 0.814 \n", + "8 48 0.578 0.917 \n", + "9 49 0.651 0.878 \n", + "10 57 0.731 0.895 \n", + "11 47 0.533 0.894 \n", + "12 68 0.548 0.824 \n", + "13 48 0.430 0.875 \n", + "14 54 0.611 0.833 \n", + "15 61 0.445 0.951 \n", + "16 51 0.486 0.922 \n", + "17 56 0.491 0.839 \n", + "18 46 0.552 0.804 \n", + "19 58 0.545 0.759 " + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Compute source word count and append to bart dataframe\n", + "bart_df['source_word_count'] = bart_df['articles'].apply(calculate_word_count)\n", + "\n", + "# Compute summary word count and append to bart dataframe\n", + "bart_df['summary_word_count'] = bart_df['bart_summaries'].apply(calculate_word_count)\n", + "\n", + "# Compute source lexical diversity and append to bart dataframe\n", + "bart_df['source_lexical_diversity'] = bart_df['articles'].apply(calculate_lexical_diversity)\n", + "\n", + "# Compute summary lexical diversity and append to bart dataframe\n", + "bart_df['summary_lexical_diversity'] = 
bart_df['bart_summaries'].apply(calculate_lexical_diversity)\n", + "\n", + "# Display the DataFrame\n", + "bart_df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "09eJMd2CT5h5", + "outputId": "7216b3b9-3113-4393-98c9-a3dafe37eb26" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n" + ], + "text/plain": [ + " articles \\\n", + "0 By . Chris Parsons . PUBLISHED: . 02:41 EST, 2... \n", + "1 By . Damien Gayle . PUBLISHED: . 08:10 EST, 12... \n", + "2 A father who imported a stun gun disguised as ... \n", + "3 February 10, 2015 . Economics, international p... \n", + "4 By . James Rush . PUBLISHED: . 10:51 EST, 5 Ap... \n", + "5 By . Daily Mail Reporter . PUBLISHED: . 19:11 ... \n", + "6 By . James Rush . PUBLISHED: . 06:02 EST, 30 S... \n", + "7 By . Sam Webb and Amanda Williams . PUBLISHED:... \n", + "8 (CNN) -- When CNN highlighted some excellent h... \n", + "9 (CNN) -- Japanese golf prodigy Ryo Ishikawa ha... \n", + "10 Warsaw (CNN) -- Eleven people died and one, so... \n", + "11 By . Associated Press . PUBLISHED: . 09:08 EST... \n", + "12 By . Adam Shergold . PUBLISHED: . 13:08 EST, 5... \n", + "13 Once an annual round of exclusive balls, refin... \n", + "14 (CNN) -- The memory unit that may tell why an ... \n", + "15 By . James Nye . Under mounting pressure to s... \n", + "16 Police should face time limits on how long a p... \n", + "17 Two Second World War Lancaster bombers flew to... \n", + "18 Andy Murray knows there can be no let up in hi... \n", + "19 A lifelong dream becomes reality for Chris Cor... \n", + "\n", + " bart_samsum_summaries source_word_count \\\n", + "0 Segolene Royal lost the presidential election ... 719 \n", + "1 Nasa and the Florida Institute for Human and M... 853 \n", + "2 John Liddiatt, 40, imported a stun gun disguis... 538 \n", + "3 This page includes CNN Student News stories on... 239 \n", + "4 A three-year-old boy and two other people rema... 581 \n", + "5 Grover J. Prewitt Jr., 60, of Bristow, Oklahom... 789 \n", + "6 Banned equine drugs were found on a Dubai gove... 727 \n", + "7 Met Office experts predict a decade of wet sum... 1351 \n", + "8 Last month, CNN highlighted some excellent his... 1440 \n", + "9 Ryo Ishikawa will donate his tournament earnin... 
235 \n", + "10 Eleven people died and one survived when a sma... 193 \n", + "11 Col. Glenn Palmer delivered his first order to... 1011 \n", + "12 Matt Prior has hit back at Piers Morgan's clai... 598 \n", + "13 The debutante ball has lost its royal patronag... 1022 \n", + "14 The memory unit that may tell why an Air Franc... 468 \n", + "15 James Nye has released a statement on his webs... 1475 \n", + "16 Theresa May is demanding action against police... 850 \n", + "17 Two World War II-era Lancaster bombers flew to... 902 \n", + "18 Andy Murray beat David Ferrer in the final of ... 562 \n", + "19 Chris Cork will start his Dakar journey in Bue... 703 \n", + "\n", + " source_lexical_diversity summary_word_count summary_lexical_diversity \n", + "0 0.517 47 0.872 \n", + "1 0.523 56 0.857 \n", + "2 0.507 50 0.880 \n", + "3 0.640 54 0.889 \n", + "4 0.508 50 0.800 \n", + "5 0.485 54 0.889 \n", + "6 0.469 43 0.930 \n", + "7 0.432 41 0.854 \n", + "8 0.578 41 0.902 \n", + "9 0.651 48 0.875 \n", + "10 0.731 42 0.881 \n", + "11 0.533 54 0.889 \n", + "12 0.548 54 0.852 \n", + "13 0.430 40 0.900 \n", + "14 0.611 50 0.780 \n", + "15 0.445 57 0.895 \n", + "16 0.486 46 0.935 \n", + "17 0.491 45 0.867 \n", + "18 0.552 47 0.745 \n", + "19 0.545 47 0.851 " + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Compute source word count and append to samsum dataframe\n", + "samsum_df['source_word_count'] = samsum_df['articles'].apply(calculate_word_count)\n", + "\n", + "# Compute source lexical diversity and append to samsum dataframe\n", + "samsum_df['source_lexical_diversity'] = samsum_df['articles'].apply(calculate_lexical_diversity)\n", + "\n", + "# Compute summary word count and append to samsum dataframe\n", + "samsum_df['summary_word_count'] = samsum_df['bart_samsum_summaries'].apply(calculate_word_count)\n", + "\n", + "# Compute summary lexical diversity and append to samsum dataframe\n", + 
"samsum_df['summary_lexical_diversity'] = samsum_df['bart_samsum_summaries'].apply(calculate_lexical_diversity)\n", + "\n", + "# Display the DataFrame\n", + "samsum_df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c4gVzSU9Wkja" + }, + "source": [ + "## 2. Log Tables to W&B to compare two model approaches" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zU3hVGV3erzn" + }, + "source": [ + "[W&B Tables](https://docs.wandb.ai/guides/tables) help you visualize and query tabular data, whether it be numeric, categorical, text, images, or multimodal datasets. Tables help users compare how different models perform on the same test set, identify patterns in data (especially helpful with text analysis), and query inputs and outputs effectively to find outliers or useful patterns.\n", + "
\n", + "
\n", + "Here we log our automatically generated summaries to W&B as an initial step in the overall LLM development and evaluation process. If you do not have a W&B account yet, follow this simple [quickstart](https://docs.wandb.ai/quickstart) to get set up 🌟" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Cpdn0IVBT5X_" + }, + "outputs": [], + "source": [ + "# log bart table to W&B\n", + "run = wandb.init(project=\"news_summarization\", name=\"load_bart_df\")\n", + "bart_table_v1 = wandb.Table(dataframe=bart_df)\n", + "wandb.log({\"BART summaries v1\": bart_table_v1})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6xBm3xsAVh4g" + }, + "outputs": [], + "source": [ + "wandb.finish()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "WP44kTkBT5VJ" + }, + "outputs": [], + "source": [ + "# log samsum table to W&B\n", + "run = wandb.init(project=\"news_summarization\", name=\"load_samsum_df\")\n", + "samsum_table_v1 = wandb.Table(dataframe=samsum_df)\n", + "wandb.log({\"SAMSUM summaries v1\": samsum_table_v1})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2YlHkibaVjic" + }, + "outputs": [], + "source": [ + "wandb.finish()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VHsKF21jVUuk" + }, + "outputs": [], + "source": [ + "# log joint table to W&B\n", + "run = wandb.init(project=\"news_summarization\", name=\"load_joint_df\")\n", + "joint_table_v1 = wandb.Table(dataframe=joint_df)\n", + "wandb.log({\"Combined summaries v1\": joint_table_v1})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MD_JJS-lVUr7" + }, + "outputs": [], + "source": [ + "wandb.finish()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cIoZtV7-Wzlp" + }, + "source": [ + "## 3. 
Download CSV files of Tables to annotate in Streamlit" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YzbmHcXqf4WR" + }, + "source": [ + "W&B Tables can be exported easily, [programmatically](https://docs.wandb.ai/guides/tables/tables-download) or from the UI. To do this in Python, we convert a table to a W&B artifact (learn more [here](https://docs.wandb.ai/guides/artifacts)) and then to a dataframe. From there, it's a simple CSV export.\n", + "
\n", + "
\n", + "These CSV files can be loaded into a simple Streamlit app for labeling." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "BbUR_kYtgPp1" + }, + "outputs": [], + "source": [ + "# Example of how to load a table from step 2 into a dataframe\n", + "bart_WB_df = bart_table_v1.get_dataframe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xYnj85vugPm6" + }, + "outputs": [], + "source": [ + "# Write the dataframe to a .csv file\n", + "bart_WB_df.to_csv(\"example.csv\", encoding=\"utf-8\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j2wM0qWZWrni" + }, + "source": [ + "## 4. Annotate tables with Streamlit data editor\n", + "\n", + "This W&B repo contains a simple app that takes a user-loaded .csv file, creates a dataframe, displays that dataframe in a Streamlit app, and enables manual editing and exporting of a revised .csv file." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MrtDzSjEh1Ba" + }, + "source": [ + "Once you have built your app and stored it along with any dependencies it needs, you can run it wherever Streamlit is installed with `streamlit run app.py`. Streamlit then prints a URL for the app (by default http://localhost:8501/)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j33w1C7UWvsI" + }, + "source": [ + "## 5. Load annotated Tables to W&B for versioning and evaluation\n", + "\n", + "Once you have revised any or all entries in your Streamlit tables and exported the new .csv files, you can load the annotated versions to the same W&B project. This captures the annotation step, and all its metadata, in a central system of record for your LLM development project."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EZhpxH0FW6VI" + }, + "outputs": [], + "source": [ + "# Load the annotated BART CSV exported from Streamlit\n", + "annotated_bart_df = pd.read_csv('annotated_bart.csv', index_col=0)\n", + "annotated_bart_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EBP_VT8WgDD0" + }, + "outputs": [], + "source": [ + "# Load the annotated SAMSUM CSV exported from Streamlit\n", + "annotated_samsum_df = pd.read_csv('annotated_samsum.csv', index_col=0)\n", + "annotated_samsum_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "blPAKlLeW6Sh" + }, + "outputs": [], + "source": [ + "# Log the annotated BART table to the project\n", + "run = wandb.init(project=\"news_summarization\")\n", + "bart_table = wandb.Table(dataframe=annotated_bart_df)\n", + "wandb.log({\"Annotated BART summaries\": bart_table})\n", + "wandb.finish()  # close this run before starting the next one" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Na2kNHVrW6Pq" + }, + "outputs": [], + "source": [ + "# Log the annotated SAMSUM table to the project\n", + "run = wandb.init(project=\"news_summarization\")\n", + "samsum_table = wandb.Table(dataframe=annotated_samsum_df)\n", + "wandb.log({\"Annotated SAMSUM summaries\": samsum_table})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "HecwlqqGjGg5" + }, + "outputs": [], + "source": [ + "wandb.finish()" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "V100", + "machine_shape": "hm", + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}
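Before logging the annotated files in step 5, it can be worth confirming that each annotated CSV still lines up with the export from step 3. The sketch below is an illustration, not part of the notebook or of the W&B or Streamlit APIs: `check_annotated_csv` is a hypothetical helper, and the `flag` column in the demo is an assumed annotation column. Using only the standard-library `csv` module, it checks that no original column was dropped and that the row count is unchanged, while allowing annotators to add new columns.

```python
import csv
import io


def _read_csv(handle):
    """Return (column names, rows) from an open text-mode CSV handle."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    return list(reader.fieldnames or []), rows


def check_annotated_csv(original_handle, annotated_handle):
    """Hypothetical helper: list problems found in an annotated CSV.

    An empty list means every original column is still present and the
    number of rows is unchanged. Extra annotation columns are allowed.
    """
    original_cols, original_rows = _read_csv(original_handle)
    annotated_cols, annotated_rows = _read_csv(annotated_handle)

    problems = []
    missing = [col for col in original_cols if col not in annotated_cols]
    if missing:
        problems.append(f"missing columns: {missing}")
    if len(annotated_rows) != len(original_rows):
        problems.append(
            f"row count changed: {len(original_rows)} -> {len(annotated_rows)}"
        )
    return problems


# Small in-memory demo; "flag" is an assumed annotation column.
original = io.StringIO("articles,bart_summaries\na1,s1\na2,s2\n")
annotated = io.StringIO(
    "articles,bart_summaries,flag\na1,s1,ok\na2,s2 edited,review\n"
)
print(check_annotated_csv(original, annotated))  # prints []
```

With real files, the same function could be called on handles opened with `open("example.csv", newline="", encoding="utf-8")` and `open("annotated_bart.csv", newline="", encoding="utf-8")`, assuming those are the names used for the export and the annotated copy.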