diff --git a/.gitignore b/.gitignore
index d22d76c..2d6e554 100644
--- a/.gitignore
+++ b/.gitignore
@@ -51,3 +51,6 @@
 docs/
 python/mall/src/
 python/assets/style.css
+
+python/README_files
+python/README.html
diff --git a/README.md b/README.md
index 6f14838..8602420 100644
--- a/README.md
+++ b/README.md
@@ -2,11 +2,10 @@
-[![R-CMD-check](https://github.com/mlverse/mall/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/mlverse/mall/actions/workflows/R-CMD-check.yaml)
-[![Codecov test
-coverage](https://codecov.io/gh/mlverse/mall/branch/main/graph/badge.svg)](https://app.codecov.io/gh/mlverse/mall?branch=main)
-[![Lifecycle:
-experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
+[![R check](https://github.com/mlverse/mall/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/mlverse/mall/actions/workflows/R-CMD-check.yaml)
+[![Python tests](https://github.com/mlverse/mall/actions/workflows/python-tests.yaml/badge.svg)](https://github.com/mlverse/mall/actions/workflows/python-tests.yaml)
+[![R package coverage](https://codecov.io/gh/mlverse/mall/branch/main/graph/badge.svg)](https://app.codecov.io/gh/mlverse/mall?branch=main)
+[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
@@ -18,5 +17,5 @@ pre-determined one-shot prompt, along with the current row’s content.
 `mall` is now available in both R and Python.
 To find out how to install and use, or just to learn more about it, please
-visit the official website: https://edgararuiz.github.io/mall/
+visit the official website: https://mlverse.github.io/mall/
diff --git a/_freeze/index/execute-results/html.json b/_freeze/index/execute-results/html.json
index c214459..59d528e 100644
--- a/_freeze/index/execute-results/html.json
+++ b/_freeze/index/execute-results/html.json
@@ -1,8 +1,8 @@
 {
-  "hash": "8cd0260bd9b9c35d70e1ee3db107ae7f",
+  "hash": "991ac920bd4b176aa0812245664ed968",
   "result": {
     "engine": "knitr",
-    "markdown": "---\nformat:\n html:\n toc: true\nexecute:\n eval: true\n freeze: true\n---\n\n\n\n\n\n\n\n\n\n\n\n\n[![R check](https://github.com/mlverse/mall/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/mlverse/mall/actions/workflows/R-CMD-check.yaml)\n[![Python tests](https://github.com/mlverse/mall/actions/workflows/python-tests.yaml/badge.svg)](https://github.com/mlverse/mall/actions/workflows/python-tests.yaml)\n[![R package coverage](https://codecov.io/gh/mlverse/mall/branch/main/graph/badge.svg)](https://app.codecov.io/gh/mlverse/mall?branch=main)\n[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)\n\n\n\nRun multiple LLM predictions against a data frame. The predictions are processed \nrow-wise over a specified column. It works using a pre-determined one-shot prompt,\nalong with the current row's content. `mall` has been implemented for both R\nand Python. The prompt that is use will depend of the type of analysis needed. 
\n\nCurrently, the included prompts perform the following: \n\n- [Sentiment analysis](#sentiment)\n- [Text summarizing](#summarize)\n- [Classify text](#classify)\n- [Extract one, or several](#extract), specific pieces information from the text\n- [Translate text](#translate)\n- [Custom prompt](#custom-prompt)\n\nThis package is inspired by the SQL AI functions now offered by vendors such as\n[Databricks](https://docs.databricks.com/en/large-language-models/ai-functions.html) \nand Snowflake. `mall` uses [Ollama](https://ollama.com/) to interact with LLMs \ninstalled locally. \n\n\n\nFor **R**, that interaction takes place via the \n[`ollamar`](https://hauselin.github.io/ollama-r/) package. The functions are \ndesigned to easily work with piped commands, such as `dplyr`. \n\n```r\nreviews |>\n llm_sentiment(review)\n```\n\n\n\nFor **Python**, `mall` is a library extension to [Polars](https://pola.rs/). To\ninteract with Ollama, it uses the official\n[Python library](https://github.com/ollama/ollama-python).\n\n```python\nreviews.llm.sentiment(\"review\")\n```\n\n## Motivation\n\nWe want to new find ways to help data scientists use LLMs in their daily work. \nUnlike the familiar interfaces, such as chatting and code completion, this interface\nruns your text data directly against the LLM. \n\nThe LLM's flexibility, allows for it to adapt to the subject of your data, and \nprovide surprisingly accurate predictions. This saves the data scientist the\nneed to write and tune an NLP model. \n\nIn recent times, the capabilities of LLMs that can run locally in your computer \nhave increased dramatically. This means that these sort of analysis can run\nin your machine with good accuracy. Additionally, it makes it possible to take\nadvantage of LLM's at your institution, since the data will not leave the\ncorporate network. 
\n\n## Get started\n\n- Install `mall` from Github\n\n \n::: {.panel-tabset group=\"language\"}\n## R\n```r\npak::pak(\"mlverse/mall/r\")\n```\n\n## Python\n```python\npip install \"mall @ git+https://git@github.com/mlverse/mall.git#subdirectory=python\"\n```\n:::\n\n- [Download Ollama from the official website](https://ollama.com/download)\n\n- Install and start Ollama in your computer\n\n\n::: {.panel-tabset group=\"language\"}\n## R\n- Install Ollama in your machine. The `ollamar` package's website provides this\n[Installation guide](https://hauselin.github.io/ollama-r/#installation)\n\n- Download an LLM model. For example, I have been developing this package using\nLlama 3.2 to test. To get that model you can run: \n ```r\n ollamar::pull(\"llama3.2\")\n ```\n \n## Python\n\n- Install the official Ollama library\n ```python\n pip install ollama\n ```\n\n- Download an LLM model. For example, I have been developing this package using\nLlama 3.2 to test. To get that model you can run: \n ```python\n import ollama\n ollama.pull('llama3.2')\n ```\n:::\n \n#### With Databricks (R only)\n\nIf you pass a table connected to **Databricks** via `odbc`, `mall` will \nautomatically use Databricks' LLM instead of Ollama. *You won't need Ollama \ninstalled if you are using Databricks only.*\n\n`mall` will call the appropriate SQL AI function. For more information see our \n[Databricks article.](https://mlverse.github.io/mall/articles/databricks.html) \n\n## LLM functions\n\nWe will start with loading a very small data set contained in `mall`. It has\n3 product reviews that we will use as the source of our examples.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(mall)\ndata(\"reviews\")\n\nreviews\n#> # A tibble: 3 × 1\n#> review \n#> \n#> 1 This has been the best TV I've ever used. Great screen, and sound. \n#> 2 I regret buying this laptop. 
It is too slow and the keyboard is too noisy \n#> 3 Not sure how to feel about my new washing machine. Great color, but hard to f…\n```\n:::\n\n\n\n\n## Python\n\n\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport mall \ndata = mall.MallData\nreviews = data.reviews\n\nreviews \n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n
review
"This has been the best TV I've ever used. Great screen, and sound."
"I regret buying this laptop. It is too slow and the keyboard is too noisy"
"Not sure how to feel about my new washing machine. Great color, but hard to figure"
\n```\n\n:::\n:::\n\n\n\n:::\n\n\n\n\n\n\n\n\n\n### Sentiment\n\nAutomatically returns \"positive\", \"negative\", or \"neutral\" based on the text.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreviews |>\n llm_sentiment(review)\n#> # A tibble: 3 × 2\n#> review .sentiment\n#> \n#> 1 This has been the best TV I've ever used. Great screen, and sound. positive \n#> 2 I regret buying this laptop. It is too slow and the keyboard is to… negative \n#> 3 Not sure how to feel about my new washing machine. Great color, bu… neutral\n```\n:::\n\n\n\n\nFor more information and examples visit this function's \n[R reference page](reference/llm_sentiment.qmd) \n\n## Python \n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.sentiment(\"review\")\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n
| review | sentiment |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "positive" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "negative" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "neutral" |
\n```\n\n:::\n:::\n\n\n\n\nFor more information and examples visit this function's \n[Python reference page](reference/MallFrame.qmd#mall.MallFrame.sentiment) \n\n:::\n\n### Summarize\n\nThere may be a need to reduce the number of words in a given text. Typically to \nmake it easier to understand its intent. The function has an argument to \ncontrol the maximum number of words to output \n(`max_words`):\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreviews |>\n llm_summarize(review, max_words = 5)\n#> # A tibble: 3 × 2\n#> review .summary \n#> \n#> 1 This has been the best TV I've ever used. Gr… it's a great tv \n#> 2 I regret buying this laptop. It is too slow … laptop purchase was a mistake \n#> 3 Not sure how to feel about my new washing ma… having mixed feelings about it\n```\n:::\n\n\n\n\nFor more information and examples visit this function's \n[R reference page](reference/llm_summarize.qmd) \n\n## Python \n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.summarize(\"review\", 5)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n
| review | summary |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "great tv with good features" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "laptop purchase was a mistake" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "feeling uncertain about new purchase" |
\n```\n\n:::\n:::\n\n\n\n\nFor more information and examples visit this function's \n[Python reference page](reference/MallFrame.qmd#mall.MallFrame.summarize) \n\n:::\n\n### Classify\n\nUse the LLM to categorize the text into one of the options you provide: \n\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreviews |>\n llm_classify(review, c(\"appliance\", \"computer\"))\n#> # A tibble: 3 × 2\n#> review .classify\n#> \n#> 1 This has been the best TV I've ever used. Gr… computer \n#> 2 I regret buying this laptop. It is too slow … computer \n#> 3 Not sure how to feel about my new washing ma… appliance\n```\n:::\n\n\n\n\nFor more information and examples visit this function's \n[R reference page](reference/llm_classify.qmd) \n\n## Python \n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.classify(\"review\", [\"computer\", \"appliance\"])\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n
| review | classify |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "appliance" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "computer" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "appliance" |
\n```\n\n:::\n:::\n\n\n\n\nFor more information and examples visit this function's \n[Python reference page](reference/MallFrame.qmd#mall.MallFrame.classify) \n\n:::\n\n### Extract \n\nOne of the most interesting use cases Using natural language, we can tell the \nLLM to return a specific part of the text. In the following example, we request\nthat the LLM return the product being referred to. We do this by simply saying \n\"product\". The LLM understands what we *mean* by that word, and looks for that\nin the text.\n\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreviews |>\n llm_extract(review, \"product\")\n#> # A tibble: 3 × 2\n#> review .extract \n#> \n#> 1 This has been the best TV I've ever used. Gr… tv \n#> 2 I regret buying this laptop. It is too slow … laptop \n#> 3 Not sure how to feel about my new washing ma… washing machine\n```\n:::\n\n\n\n\nFor more information and examples visit this function's \n[R reference page](reference/llm_extract.qmd) \n\n## Python \n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.extract(\"review\", \"product\")\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n
| review | extract |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "tv" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "laptop" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "washing machine" |
\n```\n\n:::\n:::\n\n\n\n\nFor more information and examples visit this function's \n[Python reference page](reference/MallFrame.qmd#mall.MallFrame.extract) \n\n:::\n\n\n### Translate\n\nAs the title implies, this function will translate the text into a specified \nlanguage. What is really nice, it is that you don't need to specify the language\nof the source text. Only the target language needs to be defined. The translation\naccuracy will depend on the LLM\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreviews |>\n llm_translate(review, \"spanish\")\n#> # A tibble: 3 × 2\n#> review .translation \n#> \n#> 1 This has been the best TV I've ever used. Gr… Esta ha sido la mejor televisió…\n#> 2 I regret buying this laptop. It is too slow … Me arrepiento de comprar este p…\n#> 3 Not sure how to feel about my new washing ma… No estoy seguro de cómo me sien…\n```\n:::\n\n\n\n\nFor more information and examples visit this function's \n[R reference page](reference/llm_translate.qmd) \n\n## Python \n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.translate(\"review\", \"spanish\")\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n
| review | translation |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "Esta ha sido la mejor televisión que he utilizado hasta ahora. Gran pantalla y sonido." |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "Me arrepiento de comprar este portátil. Es demasiado lento y la tecla es demasiado ruidosa." |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "No estoy seguro de cómo sentirme con mi nueva lavadora. Un color maravilloso, pero muy difícil de en… |
\n```\n\n:::\n:::\n\n\n\n\nFor more information and examples visit this function's \n[Python reference page](reference/MallFrame.qmd#mall.MallFrame.translate) \n\n:::\n\n### Custom prompt\n\nIt is possible to pass your own prompt to the LLM, and have `mall` run it \nagainst each text entry:\n\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_prompt <- paste(\n \"Answer a question.\",\n \"Return only the answer, no explanation\",\n \"Acceptable answers are 'yes', 'no'\",\n \"Answer this about the following text, is this a happy customer?:\"\n)\n\nreviews |>\n llm_custom(review, my_prompt)\n#> # A tibble: 3 × 2\n#> review .pred\n#> \n#> 1 This has been the best TV I've ever used. Great screen, and sound. Yes \n#> 2 I regret buying this laptop. It is too slow and the keyboard is too noi… No \n#> 3 Not sure how to feel about my new washing machine. Great color, but har… No\n```\n:::\n\n\n\n\nFor more information and examples visit this function's \n[R reference page](reference/llm_custom.qmd) \n\n## Python \n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmy_prompt = (\n \"Answer a question.\"\n \"Return only the answer, no explanation\"\n \"Acceptable answers are 'yes', 'no'\"\n \"Answer this about the following text, is this a happy customer?:\"\n)\n\nreviews.llm.custom(\"review\", prompt = my_prompt)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n
| review | custom |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "Yes" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "No" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "No" |
\n```\n\n:::\n:::\n\n\n\n\nFor more information and examples visit this function's \n[Python reference page](reference/MallFrame.qmd#mall.MallFrame.custom) \n\n:::\n\n## Model selection and settings\n\nYou can set the model and its options to use when calling the LLM. In this case,\nwe refer to options as model specific things that can be set, such as seed or\ntemperature. \n\n::: {.panel-tabset group=\"language\"}\n## R\n\nInvoking an `llm` function will automatically initialize a model selection\nif you don't have one selected yet. If there is only one option, it will \npre-select it for you. If there are more than one available models, then `mall`\nwill present you as menu selection so you can select which model you wish to \nuse.\n\nCalling `llm_use()` directly will let you specify the model and backend to use.\nYou can also setup additional arguments that will be passed down to the \nfunction that actually runs the prediction. In the case of Ollama, that function\nis [`chat()`](https://hauselin.github.io/ollama-r/reference/chat.html). \n\nThe model to use, and other options can be set for the current R session\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nllm_use(\"ollama\", \"llama3.2\", seed = 100, temperature = 0)\n```\n:::\n\n\n\n\n\n## Python \n\nThe model and options to be used will be defined at the Polars data frame \nobject level. If not passed, the default model will be **llama3.2**.\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.use(\"ollama\", \"llama3.2\", options = dict(seed = 100))\n```\n:::\n\n\n\n\n:::\n\n#### Results caching \n\nBy default `mall` caches the requests and corresponding results from a given\nLLM run. Each response is saved as individual JSON files. By default, the folder\nname is `_mall_cache`. The folder name can be customized, if needed. 
Also, the\ncaching can be turned off by setting the argument to empty (`\"\"`).\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nllm_use(.cache = \"_my_cache\")\n```\n:::\n\n\n\n\nTo turn off:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nllm_use(.cache = \"\")\n```\n:::\n\n\n\n\n## Python \n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.use(_cache = \"my_cache\")\n```\n:::\n\n\n\n\nTo turn off:\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.use(_cache = \"\")\n```\n:::\n\n\n\n\n:::\n\nFor more information see the [Caching Results](articles/caching.qmd) article. \n\n## Key considerations\n\nThe main consideration is **cost**. Either, time cost, or money cost.\n\nIf using this method with an LLM locally available, the cost will be a long \nrunning time. Unless using a very specialized LLM, a given LLM is a general model. \nIt was fitted using a vast amount of data. So determining a response for each \nrow, takes longer than if using a manually created NLP model. The default model\nused in Ollama is [Llama 3.2](https://ollama.com/library/llama3.2), \nwhich was fitted using 3B parameters. \n\nIf using an external LLM service, the consideration will need to be for the \nbilling costs of using such service. Keep in mind that you will be sending a lot\nof data to be evaluated. \n\nAnother consideration is the novelty of this approach. Early tests are \nproviding encouraging results. But you, as an user, will still need to keep\nin mind that the predictions will not be infallible, so always check the output.\nAt this time, I think the best use for this method, is for a quick analysis.\n\n\n## Vector functions (R only)\n\n`mall` includes functions that expect a vector, instead of a table, to run the\npredictions. This should make it easier to test things, such as custom prompts\nor results of specific text. 
Each `llm_` function has a corresponding `llm_vec_`\nfunction:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nllm_vec_sentiment(\"I am happy\")\n#> [1] \"positive\"\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nllm_vec_translate(\"Este es el mejor dia!\", \"english\")\n#> [1] \"It's the best day!\"\n```\n:::\n",
+    "markdown": "---\nformat:\n html:\n toc: true\nexecute:\n eval: true\n freeze: true\n---\n\n\n\n\n\n\n\n\n\n\n\n[![R check](https://github.com/mlverse/mall/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/mlverse/mall/actions/workflows/R-CMD-check.yaml)\n[![Python tests](https://github.com/mlverse/mall/actions/workflows/python-tests.yaml/badge.svg)](https://github.com/mlverse/mall/actions/workflows/python-tests.yaml)\n[![R package coverage](https://codecov.io/gh/mlverse/mall/branch/main/graph/badge.svg)](https://app.codecov.io/gh/mlverse/mall?branch=main)\n[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)\n\n\n\n\nRun multiple LLM predictions against a data frame. The predictions are processed \nrow-wise over a specified column. It works using a pre-determined one-shot prompt,\nalong with the current row's content. `mall` has been implemented for both R\nand Python. The prompt that is use will depend of the type of analysis needed. \n\nCurrently, the included prompts perform the following: \n\n- [Sentiment analysis](#sentiment)\n- [Text summarizing](#summarize)\n- [Classify text](#classify)\n- [Extract one, or several](#extract), specific pieces information from the text\n- [Translate text](#translate)\n- [Verify that something it true](#verify) about the text (binary)\n- [Custom prompt](#custom-prompt)\n\nThis package is inspired by the SQL AI functions now offered by vendors such as\n[Databricks](https://docs.databricks.com/en/large-language-models/ai-functions.html) \nand Snowflake. 
`mall` uses [Ollama](https://ollama.com/) to interact with LLMs \ninstalled locally. \n\n\n\nFor **R**, that interaction takes place via the \n[`ollamar`](https://hauselin.github.io/ollama-r/) package. The functions are \ndesigned to easily work with piped commands, such as `dplyr`. \n\n```r\nreviews |>\n llm_sentiment(review)\n```\n\n\n\nFor **Python**, `mall` is a library extension to [Polars](https://pola.rs/). To\ninteract with Ollama, it uses the official\n[Python library](https://github.com/ollama/ollama-python).\n\n```python\nreviews.llm.sentiment(\"review\")\n```\n\n## Motivation\n\nWe want to new find ways to help data scientists use LLMs in their daily work. \nUnlike the familiar interfaces, such as chatting and code completion, this interface\nruns your text data directly against the LLM. \n\nThe LLM's flexibility, allows for it to adapt to the subject of your data, and \nprovide surprisingly accurate predictions. This saves the data scientist the\nneed to write and tune an NLP model. \n\nIn recent times, the capabilities of LLMs that can run locally in your computer \nhave increased dramatically. This means that these sort of analysis can run\nin your machine with good accuracy. Additionally, it makes it possible to take\nadvantage of LLM's at your institution, since the data will not leave the\ncorporate network. \n\n## Get started\n\n- Install `mall` from Github\n\n \n::: {.panel-tabset group=\"language\"}\n## R\n```r\npak::pak(\"mlverse/mall/r\")\n```\n\n## Python\n```python\npip install \"mall @ git+https://git@github.com/mlverse/mall.git#subdirectory=python\"\n```\n:::\n\n- [Download Ollama from the official website](https://ollama.com/download)\n\n- Install and start Ollama in your computer\n\n\n::: {.panel-tabset group=\"language\"}\n## R\n- Install Ollama in your machine. The `ollamar` package's website provides this\n[Installation guide](https://hauselin.github.io/ollama-r/#installation)\n\n- Download an LLM model. 
For example, I have been developing this package using\nLlama 3.2 to test. To get that model you can run: \n ```r\n ollamar::pull(\"llama3.2\")\n ```\n \n## Python\n\n- Install the official Ollama library\n ```python\n pip install ollama\n ```\n\n- Download an LLM model. For example, I have been developing this package using\nLlama 3.2 to test. To get that model you can run: \n ```python\n import ollama\n ollama.pull('llama3.2')\n ```\n:::\n \n#### With Databricks (R only)\n\nIf you pass a table connected to **Databricks** via `odbc`, `mall` will \nautomatically use Databricks' LLM instead of Ollama. *You won't need Ollama \ninstalled if you are using Databricks only.*\n\n`mall` will call the appropriate SQL AI function. For more information see our \n[Databricks article.](https://mlverse.github.io/mall/articles/databricks.html) \n\n## LLM functions\n\nWe will start with loading a very small data set contained in `mall`. It has\n3 product reviews that we will use as the source of our examples.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(mall)\ndata(\"reviews\")\n\nreviews\n#> # A tibble: 3 × 1\n#> review \n#> \n#> 1 This has been the best TV I've ever used. Great screen, and sound. \n#> 2 I regret buying this laptop. It is too slow and the keyboard is too noisy \n#> 3 Not sure how to feel about my new washing machine. Great color, but hard to f…\n```\n:::\n\n\n\n## Python\n\n\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nimport mall \ndata = mall.MallData\nreviews = data.reviews\n\nreviews \n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n
review
"This has been the best TV I've ever used. Great screen, and sound."
"I regret buying this laptop. It is too slow and the keyboard is too noisy"
"Not sure how to feel about my new washing machine. Great color, but hard to figure"
\n```\n\n:::\n:::\n\n\n:::\n\n\n\n\n\n\n\n### Sentiment\n\nAutomatically returns \"positive\", \"negative\", or \"neutral\" based on the text.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreviews |>\n llm_sentiment(review)\n#> # A tibble: 3 × 2\n#> review .sentiment\n#> \n#> 1 This has been the best TV I've ever used. Great screen, and sound. positive \n#> 2 I regret buying this laptop. It is too slow and the keyboard is to… negative \n#> 3 Not sure how to feel about my new washing machine. Great color, bu… neutral\n```\n:::\n\n\n\nFor more information and examples visit this function's \n[R reference page](reference/llm_sentiment.qmd) \n\n## Python \n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.sentiment(\"review\")\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n
| review | sentiment |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "positive" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "negative" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "neutral" |
\n```\n\n:::\n:::\n\n\n\nFor more information and examples visit this function's \n[Python reference page](reference/MallFrame.qmd#mall.MallFrame.sentiment) \n\n:::\n\n### Summarize\n\nThere may be a need to reduce the number of words in a given text. Typically to \nmake it easier to understand its intent. The function has an argument to \ncontrol the maximum number of words to output \n(`max_words`):\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreviews |>\n llm_summarize(review, max_words = 5)\n#> # A tibble: 3 × 2\n#> review .summary \n#> \n#> 1 This has been the best TV I've ever used. Gr… it's a great tv \n#> 2 I regret buying this laptop. It is too slow … laptop purchase was a mistake \n#> 3 Not sure how to feel about my new washing ma… having mixed feelings about it\n```\n:::\n\n\n\nFor more information and examples visit this function's \n[R reference page](reference/llm_summarize.qmd) \n\n## Python \n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.summarize(\"review\", 5)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n
| review | summary |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "great tv with good features" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "laptop purchase was a mistake" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "feeling uncertain about new purchase" |
\n```\n\n:::\n:::\n\n\n\nFor more information and examples visit this function's \n[Python reference page](reference/MallFrame.qmd#mall.MallFrame.summarize) \n\n:::\n\n### Classify\n\nUse the LLM to categorize the text into one of the options you provide: \n\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreviews |>\n llm_classify(review, c(\"appliance\", \"computer\"))\n#> # A tibble: 3 × 2\n#> review .classify\n#> \n#> 1 This has been the best TV I've ever used. Gr… computer \n#> 2 I regret buying this laptop. It is too slow … computer \n#> 3 Not sure how to feel about my new washing ma… appliance\n```\n:::\n\n\n\nFor more information and examples visit this function's \n[R reference page](reference/llm_classify.qmd) \n\n## Python \n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.classify(\"review\", [\"computer\", \"appliance\"])\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n
| review | classify |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "appliance" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "computer" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "appliance" |
\n```\n\n:::\n:::\n\n\n\nFor more information and examples visit this function's \n[Python reference page](reference/MallFrame.qmd#mall.MallFrame.classify) \n\n:::\n\n### Extract \n\nOne of the most interesting use cases Using natural language, we can tell the \nLLM to return a specific part of the text. In the following example, we request\nthat the LLM return the product being referred to. We do this by simply saying \n\"product\". The LLM understands what we *mean* by that word, and looks for that\nin the text.\n\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreviews |>\n llm_extract(review, \"product\")\n#> # A tibble: 3 × 2\n#> review .extract \n#> \n#> 1 This has been the best TV I've ever used. Gr… tv \n#> 2 I regret buying this laptop. It is too slow … laptop \n#> 3 Not sure how to feel about my new washing ma… washing machine\n```\n:::\n\n\n\nFor more information and examples visit this function's \n[R reference page](reference/llm_extract.qmd) \n\n## Python \n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.extract(\"review\", \"product\")\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n
| review | extract |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "tv" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "laptop" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "washing machine" |
\n```\n\n:::\n:::\n\n\n\nFor more information and examples visit this function's \n[Python reference page](reference/MallFrame.qmd#mall.MallFrame.extract) \n\n:::\n\n### Classify\n\nUse the LLM to categorize the text into one of the options you provide: \n\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreviews |>\n llm_classify(review, c(\"appliance\", \"computer\"))\n#> # A tibble: 3 × 2\n#> review .classify\n#> \n#> 1 This has been the best TV I've ever used. Gr… computer \n#> 2 I regret buying this laptop. It is too slow … computer \n#> 3 Not sure how to feel about my new washing ma… appliance\n```\n:::\n\n\n\nFor more information and examples visit this function's \n[R reference page](reference/llm_classify.qmd) \n\n## Python \n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.classify(\"review\", [\"computer\", \"appliance\"])\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n
| review | classify |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "appliance" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "computer" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "appliance" |
\n```\n\n:::\n:::\n\n\n\nFor more information and examples visit this function's \n[Python reference page](reference/MallFrame.qmd#mall.MallFrame.classify) \n\n:::\n\n### Verify \n\nThis function allows you to check whether a statement is true, based\non the provided text. By default, it returns 1 for \"yes\" and 0 for\n\"no\". This can be customized.\n\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreviews |>\n llm_verify(review, \"is the customer happy with the purchase\")\n#> # A tibble: 3 × 2\n#> review .verify\n#> \n#> 1 This has been the best TV I've ever used. Great screen, and sound. 1 \n#> 2 I regret buying this laptop. It is too slow and the keyboard is too n… 0 \n#> 3 Not sure how to feel about my new washing machine. Great color, but h… 0\n```\n:::\n\n\n\nFor more information and examples visit this function's \n[R reference page](reference/llm_verify.qmd) \n\n## Python \n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.verify(\"review\", \"is the customer happy with the purchase\")\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n
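<table><thead><tr><th>review</th><th>verify</th></tr></thead><tbody>
<tr><td>"This has been the best TV I've ever used. Great screen, and sound."</td><td>1</td></tr>
<tr><td>"I regret buying this laptop. It is too slow and the keyboard is too noisy"</td><td>0</td></tr>
<tr><td>"Not sure how to feel about my new washing machine. Great color, but hard to figure"</td><td>0</td></tr></tbody></table>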
\n```\n\n:::\n:::\n\n\n\nFor more information and examples visit this function's \n[Python reference page](reference/MallFrame.qmd#mall.MallFrame.verify) \n\n:::\n\n\n\n### Translate\n\nAs the title implies, this function will translate the text into a specified \nlanguage. What is really nice is that you don't need to specify the language\nof the source text. Only the target language needs to be defined. The translation\naccuracy will depend on the LLM.\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreviews |>\n llm_translate(review, \"spanish\")\n#> # A tibble: 3 × 2\n#> review .translation \n#> \n#> 1 This has been the best TV I've ever used. Gr… Esta ha sido la mejor televisió…\n#> 2 I regret buying this laptop. It is too slow … Me arrepiento de comprar este p…\n#> 3 Not sure how to feel about my new washing ma… No estoy seguro de cómo me sien…\n```\n:::\n\n\n\nFor more information and examples visit this function's \n[R reference page](reference/llm_translate.qmd) \n\n## Python \n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.translate(\"review\", \"spanish\")\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n
<table><thead><tr><th>review</th><th>translation</th></tr></thead><tbody>
<tr><td>"This has been the best TV I've ever used. Great screen, and sound."</td><td>"Esta ha sido la mejor televisión que he utilizado hasta ahora. Gran pantalla y sonido."</td></tr>
<tr><td>"I regret buying this laptop. It is too slow and the keyboard is too noisy"</td><td>"Me arrepiento de comprar este portátil. Es demasiado lento y la tecla es demasiado ruidosa."</td></tr>
<tr><td>"Not sure how to feel about my new washing machine. Great color, but hard to figure"</td><td>"No estoy seguro de cómo sentirme con mi nueva lavadora. Un color maravilloso, pero muy difícil de en…</td></tr></tbody></table>
\n```\n\n:::\n:::\n\n\n\nFor more information and examples visit this function's \n[Python reference page](reference/MallFrame.qmd#mall.MallFrame.translate) \n\n:::\n\n### Custom prompt\n\nIt is possible to pass your own prompt to the LLM, and have `mall` run it \nagainst each text entry:\n\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_prompt <- paste(\n \"Answer a question.\",\n \"Return only the answer, no explanation\",\n \"Acceptable answers are 'yes', 'no'\",\n \"Answer this about the following text, is this a happy customer?:\"\n)\n\nreviews |>\n llm_custom(review, my_prompt)\n#> # A tibble: 3 × 2\n#> review .pred\n#> \n#> 1 This has been the best TV I've ever used. Great screen, and sound. Yes \n#> 2 I regret buying this laptop. It is too slow and the keyboard is too noi… No \n#> 3 Not sure how to feel about my new washing machine. Great color, but har… No\n```\n:::\n\n\n\nFor more information and examples visit this function's \n[R reference page](reference/llm_custom.qmd) \n\n## Python \n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nmy_prompt = (\n \"Answer a question. \"\n \"Return only the answer, no explanation \"\n \"Acceptable answers are 'yes', 'no' \"\n \"Answer this about the following text, is this a happy customer?:\"\n)\n\nreviews.llm.custom(\"review\", prompt = my_prompt)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
\n
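<table><thead><tr><th>review</th><th>custom</th></tr></thead><tbody>
<tr><td>"This has been the best TV I've ever used. Great screen, and sound."</td><td>"Yes"</td></tr>
<tr><td>"I regret buying this laptop. It is too slow and the keyboard is too noisy"</td><td>"No"</td></tr>
<tr><td>"Not sure how to feel about my new washing machine. Great color, but hard to figure"</td><td>"No"</td></tr></tbody></table>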
\n```\n\n:::\n:::\n\n\n\nFor more information and examples visit this function's \n[Python reference page](reference/MallFrame.qmd#mall.MallFrame.custom) \n\n:::\n\n## Model selection and settings\n\nYou can set the model and its options to use when calling the LLM. Here,\noptions are model-specific settings, such as seed or\ntemperature. \n\n::: {.panel-tabset group=\"language\"}\n## R\n\nInvoking an `llm` function will automatically initialize a model selection\nif you don't have one selected yet. If there is only one option, it will \npre-select it for you. If more than one model is available, `mall`\nwill present a menu so you can select which model you wish to \nuse.\n\nCalling `llm_use()` directly will let you specify the model and backend to use.\nYou can also set up additional arguments that will be passed down to the \nfunction that actually runs the prediction. In the case of Ollama, that function\nis [`chat()`](https://hauselin.github.io/ollama-r/reference/chat.html). \n\nThe model to use, and other options, can be set for the current R session:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nllm_use(\"ollama\", \"llama3.2\", seed = 100, temperature = 0)\n```\n:::\n\n\n\n\n## Python \n\nThe model and options to be used will be defined at the Polars data frame \nobject level. If not passed, the default model will be **llama3.2**.\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.use(\"ollama\", \"llama3.2\", options = dict(seed = 100))\n```\n:::\n\n\n\n:::\n\n#### Results caching \n\nBy default `mall` caches the requests and corresponding results from a given\nLLM run. Each response is saved as an individual JSON file. By default, the folder\nname is `_mall_cache`. The folder name can be customized, if needed. 
Also, the\ncaching can be turned off by setting the argument to empty (`\"\"`).\n\n::: {.panel-tabset group=\"language\"}\n## R\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nllm_use(.cache = \"_my_cache\")\n```\n:::\n\n\n\nTo turn off:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nllm_use(.cache = \"\")\n```\n:::\n\n\n\n## Python \n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.use(_cache = \"my_cache\")\n```\n:::\n\n\n\nTo turn off:\n\n\n\n::: {.cell}\n\n```{.python .cell-code}\nreviews.llm.use(_cache = \"\")\n```\n:::\n\n\n\n:::\n\nFor more information see the [Caching Results](articles/caching.qmd) article. \n\n## Key considerations\n\nThe main consideration is **cost**: either time cost or money cost.\n\nIf using this method with a locally available LLM, the cost will be a long \nrunning time. Unless using a very specialized LLM, a given LLM is a general model. \nIt was fitted using a vast amount of data. So determining a response for each \nrow takes longer than if using a manually created NLP model. The default model\nused in Ollama is [Llama 3.2](https://ollama.com/library/llama3.2), \nwhich has 3B parameters. \n\nIf using an external LLM service, the consideration will need to be for the \nbilling costs of using such service. Keep in mind that you will be sending a lot\nof data to be evaluated. \n\nAnother consideration is the novelty of this approach. Early tests are \nproviding encouraging results. But you, as a user, will still need to keep\nin mind that the predictions will not be infallible, so always check the output.\nAt this time, I think the best use for this method is for a quick analysis.\n\n\n## Vector functions (R only)\n\n`mall` includes functions that expect a vector, instead of a table, to run the\npredictions. This should make it easier to test things, such as custom prompts\nor results of specific text. 
Each `llm_` function has a corresponding `llm_vec_`\nfunction:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nllm_vec_sentiment(\"I am happy\")\n#> [1] \"positive\"\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nllm_vec_translate(\"Este es el mejor dia!\", \"english\")\n#> [1] \"It's the best day!\"\n```\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/index.qmd b/index.qmd index 0c32a34..8929777 100644 --- a/index.qmd +++ b/index.qmd @@ -32,6 +32,7 @@ mall::llm_use("ollama", "llama3.2", seed = 100, .cache = "_readme_cache") + Run multiple LLM predictions against a data frame. The predictions are processed row-wise over a specified column. It works using a pre-determined one-shot prompt, along with the current row's content. `mall` has been implemented for both R @@ -44,6 +45,7 @@ Currently, the included prompts perform the following: - [Classify text](#classify) - [Extract one, or several](#extract), specific pieces information from the text - [Translate text](#translate) +- [Verify that something is true](#verify) about the text (binary) - [Custom prompt](#custom-prompt) This package is inspired by the SQL AI functions now offered by vendors such as @@ -298,6 +300,63 @@ For more information and examples visit this function's ::: +### Classify + +Use the LLM to categorize the text into one of the options you provide: + + +::: {.panel-tabset group="language"} +## R + +```{r} +reviews |> + llm_classify(review, c("appliance", "computer")) +``` + +For more information and examples visit this function's +[R reference page](reference/llm_classify.qmd) + +## Python + +```{python} +reviews.llm.classify("review", ["computer", "appliance"]) +``` + +For more information and examples visit this function's +[Python reference page](reference/MallFrame.qmd#mall.MallFrame.classify) + +::: + +### Verify + +This function allows you to check and see if a statement is true, based +on the provided text. 
By default, it will return a 1 for "yes", and 0 for +"no". This can be customized. + + +::: {.panel-tabset group="language"} +## R + +```{r} +reviews |> + llm_verify(review, "is the customer happy with the purchase") +``` + +For more information and examples visit this function's +[R reference page](reference/llm_verify.qmd) + +## Python + +```{python} +reviews.llm.verify("review", "is the customer happy with the purchase") +``` + +For more information and examples visit this function's +[Python reference page](reference/MallFrame.qmd#mall.MallFrame.verify) + +::: + + ### Translate @@ -486,4 +545,3 @@ llm_vec_sentiment("I am happy") ```{r} llm_vec_translate("Este es el mejor dia!", "english") ``` - diff --git a/python/README.md b/python/README.md index 18194a4..b51a59b 100644 --- a/python/README.md +++ b/python/README.md @@ -1,100 +1,295 @@ -# mall -## Intro + + + + + +[![Python +tests](https://github.com/mlverse/mall/actions/workflows/python-tests.yaml/badge.svg)](https://github.com/mlverse/mall/actions/workflows/python-tests.yaml) +[![Code +coverage](https://codecov.io/gh/mlverse/mall/branch/main/graph/badge.svg)](https://app.codecov.io/gh/mlverse/mall?branch=main) +[![Lifecycle: +experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental) + Run multiple LLM predictions against a data frame. The predictions are processed row-wise over a specified column. It works using a pre-determined one-shot prompt, along with the current row’s content. +`mall` has been implemented for both R and Python. The prompt that is +used will depend on the type of analysis needed. 
+ +Currently, the included prompts perform the following: + +- [Sentiment analysis](#sentiment) +- [Text summarizing](#summarize) +- [Classify text](#classify) +- [Extract one, or several](#extract), specific pieces of information from + the text +- [Translate text](#translate) +- [Verify that something is true](#verify) about the text (binary) +- [Custom prompt](#custom-prompt) -## Install +This package is inspired by the SQL AI functions now offered by vendors +such as +[Databricks](https://docs.databricks.com/en/large-language-models/ai-functions.html) +and Snowflake. `mall` uses [Ollama](https://ollama.com/) to interact +with LLMs installed locally. -To install from Github, use: +For **Python**, `mall` is a library extension to +[Polars](https://pola.rs/). To interact with Ollama, it uses the +official [Python library](https://github.com/ollama/ollama-python). ``` python -pip install "mall @ git+https://git@github.com/edgararuiz/mall.git@python#subdirectory=python" +reviews.llm.sentiment("review") ``` -## Examples +## Motivation + +We want to find new ways to help data scientists use LLMs in their daily +work. Unlike the familiar interfaces, such as chatting and code +completion, this interface runs your text data directly against the LLM. + +The LLM’s flexibility allows it to adapt to the subject of your +data and provide surprisingly accurate predictions. This saves the data +scientist the need to write and tune an NLP model. + +In recent times, the capabilities of LLMs that can run locally on your +computer have increased dramatically. This means that this sort of +analysis can run on your machine with good accuracy. Additionally, it +makes it possible to take advantage of LLMs at your institution, since +the data will not leave the corporate network. 
+ +## Get started + +- Install `mall` from GitHub + +``` python +pip install "mall @ git+https://git@github.com/mlverse/mall.git#subdirectory=python" +``` + +- [Download Ollama from the official + website](https://ollama.com/download) + +- Install and start Ollama on your computer + +- Install the official Ollama library + + ``` python + pip install ollama + ``` + +- Download an LLM model. For example, I have been developing this + package using Llama 3.2 to test. To get that model you can run: + + ``` python + import ollama + ollama.pull('llama3.2') + ``` + +## LLM functions + +We will start by loading a very small data set contained in `mall`. It +has 3 product reviews that we will use as the source of our examples. ``` python import mall -import polars as pl - -reviews = pl.DataFrame( - data=[ - "This has been the best TV I've ever used. Great screen, and sound.", - "I regret buying this laptop. It is too slow and the keyboard is too noisy", - "Not sure how to feel about my new washing machine. Great color, but hard to figure" - ], - schema=[("review", pl.String)], -) +data = mall.MallData +reviews = data.reviews + +reviews ``` -## Sentiment +| review | +|----| +| "This has been the best TV I've ever used. Great screen, and sound." | +| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | +| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | + +

+### Sentiment + +Automatically returns “positive”, “negative”, or “neutral” based on the +text. ``` python reviews.llm.sentiment("review") ``` -shape: (3, 2) +| review | sentiment | +|----|----| +| "This has been the best TV I've ever used. Great screen, and sound." | "positive" | +| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "negative" | +| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "neutral" | -| review | sentiment | -|----------------------------------|------------| -| str | str | -| "This has been the best TV I've… | "positive" | -| "I regret buying this laptop. I… | "negative" | -| "Not sure how to feel about my … | "neutral" | +### Summarize -## Summarize +There may be a need to reduce the number of words in a given text. +Typically to make it easier to understand its intent. The function has +an argument to control the maximum number of words to output +(`max_words`): ``` python reviews.llm.summarize("review", 5) ``` -shape: (3, 2) +| review | summary | +|----|----| +| "This has been the best TV I've ever used. Great screen, and sound." | "great tv with good features" | +| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "laptop purchase was a mistake" | +| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "feeling uncertain about new purchase" | -| review | summary | -|----------------------------------|----------------------------------| -| str | str | -| "This has been the best TV I've… | "it's a great tv" | -| "I regret buying this laptop. 
I… | "laptop not worth the money" | -| "Not sure how to feel about my … | "feeling uncertain about new pu… | +### Classify -## Translate (as in ‘English to French’) +Use the LLM to categorize the text into one of the options you provide: ``` python -reviews.llm.translate("review", "spanish") +reviews.llm.classify("review", ["computer", "appliance"]) +``` + +| review | classify | +|----|----| +| "This has been the best TV I've ever used. Great screen, and sound." | "appliance" | +| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "computer" | +| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "appliance" | + +### Extract + +One of the most interesting use cases: using natural language, we can +tell the LLM to return a specific part of the text. In the following +example, we request that the LLM return the product being referred to. +We do this by simply saying “product”. The LLM understands what we +*mean* by that word, and looks for that in the text. + +``` python +reviews.llm.extract("review", "product") ``` -shape: (3, 2) +| review | extract | +|----|----| +| "This has been the best TV I've ever used. Great screen, and sound." | "tv" | +| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "laptop" | +| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "washing machine" | -| review | translation | -|----------------------------------|----------------------------------| -| str | str | -| "This has been the best TV I've… | "Esta ha sido la mejor TV que h… | -| "I regret buying this laptop. 
I… | "Lo lamento comprar este portát… | -| "Not sure how to feel about my … | "No estoy seguro de cómo sentir… | +### Classify -## Classify +Use the LLM to categorize the text into one of the options you provide: ``` python reviews.llm.classify("review", ["computer", "appliance"]) ``` -shape: (3, 2) +| review | classify | +|----|----| +| "This has been the best TV I've ever used. Great screen, and sound." | "appliance" | +| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "computer" | +| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "appliance" | -| review | classify | -|----------------------------------|-------------| -| str | str | -| "This has been the best TV I've… | "appliance" | -| "I regret buying this laptop. I… | "appliance" | -| "Not sure how to feel about my … | "appliance" | +### Verify -## LLM session setup +This function allows you to check whether a statement is true, based +on the provided text. By default, it returns 1 for “yes” and 0 +for “no”. This can be customized. ``` python -reviews.llm.use(options = dict(seed = 100)) +reviews.llm.verify("review", "is the customer happy with the purchase") ``` - {'backend': 'ollama', 'model': 'llama3.2', 'options': {'seed': 100}} +| review | verify | +|----|----| +| "This has been the best TV I've ever used. Great screen, and sound." | 1 | +| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | 0 | +| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | 0 | + +### Translate + +As the title implies, this function will translate the text into a +specified language. What is really nice is that you don’t need to +specify the language of the source text. Only the target language needs +to be defined. 
The translation accuracy will depend on the LLM. + +``` python +reviews.llm.translate("review", "spanish") +``` + +| review | translation | +|----|----| +| "This has been the best TV I've ever used. Great screen, and sound." | "Esta ha sido la mejor televisión que he utilizado hasta ahora. Gran pantalla y sonido." | +| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "Me arrepiento de comprar este portátil. Es demasiado lento y la tecla es demasiado ruidosa." | +| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "No estoy seguro de cómo sentirme con mi nueva lavadora. Un color maravilloso, pero muy difícil de en… | + +### Custom prompt + +It is possible to pass your own prompt to the LLM, and have `mall` run +it against each text entry: + +``` python +my_prompt = ( + "Answer a question. " + "Return only the answer, no explanation " + "Acceptable answers are 'yes', 'no' " + "Answer this about the following text, is this a happy customer?:" +) + +reviews.llm.custom("review", prompt = my_prompt) +``` + +| review | custom | +|----|----| +| "This has been the best TV I've ever used. Great screen, and sound." | "Yes" | +| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "No" | +| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "No" | + +## Model selection and settings + +You can set the model and its options to use when calling the LLM. +Here, options are model-specific settings, +such as seed or temperature. + +The model and options to be used will be defined at the Polars data +frame object level. If not passed, the default model will be +**llama3.2**. + +``` python +reviews.llm.use("ollama", "llama3.2", options = dict(seed = 100)) +``` + +#### Results caching + +By default `mall` caches the requests and corresponding results from a +given LLM run. Each response is saved as an individual JSON file. 
By +default, the folder name is `_mall_cache`. The folder name can be +customized, if needed. Also, the caching can be turned off by setting +the argument to empty (`""`). + +``` python +reviews.llm.use(_cache = "my_cache") +``` + +To turn off: + +``` python +reviews.llm.use(_cache = "") +``` + +## Key considerations + +The main consideration is **cost**: either time cost or money cost. + +If using this method with a locally available LLM, the cost will be a +long running time. Unless using a very specialized LLM, a given LLM is a +general model. It was fitted using a vast amount of data. So determining +a response for each row takes longer than if using a manually created +NLP model. The default model used in Ollama is [Llama +3.2](https://ollama.com/library/llama3.2), which has 3B +parameters. + +If using an external LLM service, the consideration will need to be for +the billing costs of using such service. Keep in mind that you will be +sending a lot of data to be evaluated. + +Another consideration is the novelty of this approach. Early tests are +providing encouraging results. But you, as a user, will still need to +keep in mind that the predictions will not be infallible, so always +check the output. At this time, I think the best use for this method is +for a quick analysis. 
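The custom prompt example in this README builds `my_prompt` from adjacent string literals. Python joins adjacent literals with no separator, so the spacing between sentences depends entirely on the whitespace inside each literal. A minimal sketch of an alternative, using only the standard library (the `prompt_parts` name is hypothetical and not part of `mall`), joins the fragments explicitly with single spaces, which is what R's `paste()` does by default:

```python
# Hypothetical helper list; the fragments are the ones used in the README example.
prompt_parts = [
    "Answer a question.",
    "Return only the answer, no explanation",
    "Acceptable answers are 'yes', 'no'",
    "Answer this about the following text, is this a happy customer?:",
]

# Join with a single space so no two fragments run together.
my_prompt = " ".join(prompt_parts)

print(my_prompt)
```

The resulting string could then be passed to `reviews.llm.custom("review", prompt = my_prompt)`, as in the README example.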
diff --git a/python/README.qmd b/python/README.qmd index 862f56a..3bebf77 100644 --- a/python/README.qmd +++ b/python/README.qmd @@ -1,71 +1,259 @@ --- format: gfm +execute: + eval: true --- -# mall + -## Intro + +[![Python tests](https://github.com/mlverse/mall/actions/workflows/python-tests.yaml/badge.svg)](https://github.com/mlverse/mall/actions/workflows/python-tests.yaml) +[![Code coverage](https://codecov.io/gh/mlverse/mall/branch/main/graph/badge.svg)](https://app.codecov.io/gh/mlverse/mall?branch=main) +[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental) + -Run multiple LLM predictions against a data frame. The predictions are processed row-wise over a specified column. It works using a pre-determined one-shot prompt, along with the current row’s content. -## Install -To install from Github, use: +Run multiple LLM predictions against a data frame. The predictions are processed +row-wise over a specified column. It works using a pre-determined one-shot prompt, +along with the current row's content. `mall` has been implemented for both R +and Python. The prompt that is used will depend on the type of analysis needed. + +Currently, the included prompts perform the following: + +- [Sentiment analysis](#sentiment) +- [Text summarizing](#summarize) +- [Classify text](#classify) +- [Extract one, or several](#extract), specific pieces of information from the text +- [Translate text](#translate) +- [Verify that something is true](#verify) about the text (binary) +- [Custom prompt](#custom-prompt) + +This package is inspired by the SQL AI functions now offered by vendors such as +[Databricks](https://docs.databricks.com/en/large-language-models/ai-functions.html) +and Snowflake. `mall` uses [Ollama](https://ollama.com/) to interact with LLMs +installed locally. + +For **Python**, `mall` is a library extension to [Polars](https://pola.rs/). 
To +interact with Ollama, it uses the official +[Python library](https://github.com/ollama/ollama-python). + +```python +reviews.llm.sentiment("review") +``` + +## Motivation + +We want to find new ways to help data scientists use LLMs in their daily work. +Unlike the familiar interfaces, such as chatting and code completion, this interface +runs your text data directly against the LLM. + +The LLM's flexibility allows it to adapt to the subject of your data and +provide surprisingly accurate predictions. This saves the data scientist the +need to write and tune an NLP model. + +In recent times, the capabilities of LLMs that can run locally on your computer +have increased dramatically. This means that this sort of analysis can run +on your machine with good accuracy. Additionally, it makes it possible to take +advantage of LLMs at your institution, since the data will not leave the +corporate network. + +## Get started + +- Install `mall` from GitHub + ```python -pip install "mall @ git+https://git@github.com/edgararuiz/mall.git@python#subdirectory=python" +pip install "mall @ git+https://git@github.com/mlverse/mall.git#subdirectory=python" ``` -## Examples +- [Download Ollama from the official website](https://ollama.com/download) + +- Install and start Ollama on your computer + +- Install the official Ollama library + ```python + pip install ollama + ``` + +- Download an LLM model. For example, I have been developing this package using +Llama 3.2 to test. To get that model you can run: + ```python + import ollama + ollama.pull('llama3.2') + ``` +## LLM functions + +We will start by loading a very small data set contained in `mall`. It has +3 product reviews that we will use as the source of our examples. 
+ ```{python} #| include: false +#| import polars as pl from polars.dataframe._html import HTMLFormatter + +pl.Config(fmt_str_lengths=100) +pl.Config.set_tbl_hide_dataframe_shape(True) +pl.Config.set_tbl_hide_column_data_types(True) + html_formatter = get_ipython().display_formatter.formatters['text/html'] html_formatter.for_type(pl.DataFrame, lambda df: "\n".join(HTMLFormatter(df).render())) ``` - ```{python} import mall -import polars as pl data = mall.MallData reviews = data.reviews + +reviews ``` + ```{python} #| include: false -reviews.llm.use(options = dict(seed = 100)) +reviews.llm.use(options = dict(seed = 100), _cache = "_readme_cache") ``` +

+### Sentiment + +Automatically returns "positive", "negative", or "neutral" based on the text. ```{python} reviews.llm.sentiment("review") ``` -## Summarize +### Summarize + +There may be a need to reduce the number of words in a given text, typically to +make it easier to understand its intent. The function has an argument to +control the maximum number of words to output +(`max_words`): + ```{python} reviews.llm.summarize("review", 5) ``` -## Translate (as in 'English to French') +### Classify + +Use the LLM to categorize the text into one of the options you provide: + ```{python} -reviews.llm.translate("review", "spanish") +reviews.llm.classify("review", ["computer", "appliance"]) +``` + +### Extract + +One of the most interesting use cases: using natural language, we can tell the +LLM to return a specific part of the text. In the following example, we request +that the LLM return the product being referred to. We do this by simply saying +"product". The LLM understands what we *mean* by that word, and looks for that +in the text. + +```{python} +reviews.llm.extract("review", "product") ``` -## Classify +### Classify + +Use the LLM to categorize the text into one of the options you provide: ```{python} reviews.llm.classify("review", ["computer", "appliance"]) ``` -## LLM session setup +### Verify + +This function allows you to check whether a statement is true, based +on the provided text. By default, it returns 1 for "yes" and 0 for +"no". This can be customized. + +```{python} +reviews.llm.verify("review", "is the customer happy with the purchase") +``` + + +### Translate + +As the title implies, this function will translate the text into a specified +language. What is really nice is that you don't need to specify the language +of the source text. Only the target language needs to be defined. 
The translation +accuracy will depend on the LLM. + +```{python} +reviews.llm.translate("review", "spanish") +``` + +### Custom prompt + +It is possible to pass your own prompt to the LLM, and have `mall` run it +against each text entry: + +```{python} +my_prompt = ( + "Answer a question. " + "Return only the answer, no explanation " + "Acceptable answers are 'yes', 'no' " + "Answer this about the following text, is this a happy customer?:" +) + +reviews.llm.custom("review", prompt = my_prompt) +``` + +## Model selection and settings + +You can set the model and its options to use when calling the LLM. Here, +options are model-specific settings, such as seed or +temperature. + +The model and options to be used will be defined at the Polars data frame +object level. If not passed, the default model will be **llama3.2**. + +```{python} +#| eval: false +reviews.llm.use("ollama", "llama3.2", options = dict(seed = 100)) +``` + +#### Results caching + +By default `mall` caches the requests and corresponding results from a given +LLM run. Each response is saved as an individual JSON file. By default, the folder +name is `_mall_cache`. The folder name can be customized, if needed. Also, the +caching can be turned off by setting the argument to empty (`""`). + +```{python} +#| eval: false +reviews.llm.use(_cache = "my_cache") +``` + +To turn off: ```{python} -reviews.llm.use(options = dict(seed = 100)) +#| eval: false +reviews.llm.use(_cache = "") ``` + +## Key considerations + +The main consideration is **cost**: either time cost or money cost. + +If using this method with a locally available LLM, the cost will be a long +running time. Unless using a very specialized LLM, a given LLM is a general model. +It was fitted using a vast amount of data. So determining a response for each +row takes longer than if using a manually created NLP model. 
The default model +used in Ollama is [Llama 3.2](https://ollama.com/library/llama3.2), +which has 3B parameters. + +If using an external LLM service, the consideration will need to be for the +billing costs of using such service. Keep in mind that you will be sending a lot +of data to be evaluated. + +Another consideration is the novelty of this approach. Early tests are +providing encouraging results. But you, as a user, will still need to keep +in mind that the predictions will not be infallible, so always check the output. +At this time, I think the best use for this method is for a quick analysis. diff --git a/python/pyproject.toml b/python/pyproject.toml index 277be4c..4a42186 100644 --- a/python/pyproject.toml +++ b/python/pyproject.toml @@ -1,14 +1,46 @@ +[tool.hatch.build.targets.wheel] +packages = ["mall"] + [project] -name = "mall" +name = "mlverse-mall" version = "0.1.0" -description = "Add your description here" +description = "Run multiple 'Large Language Model' predictions against a table. The predictions run row-wise over a specified column." 
 readme = "README.md"
-requires-python = ">=3.12"
+authors = [
+    { name = "Edgar Ruiz", email = "edgar@posit.co" }
+]
+requires-python = ">=3.9"
 dependencies = [
     "ollama>=0.3.3",
     "polars>=1.9.0",
+    "json5>=0.9.25",
+    "pytest>=8.3.3",
+    "pytest-cov>=5.0.0",
+    "pytest-html>=4.1.1",
+    "pytest-metadata>=3.1.1"
 ]
+classifiers = [
+    "Intended Audience :: End Users/Desktop",
+    "Intended Audience :: Financial and Insurance Industry",
+    "Intended Audience :: Science/Research",
+    "Intended Audience :: Healthcare Industry",
+    "Intended Audience :: Developers",
+    "License :: OSI Approved :: MIT License",
+    "Programming Language :: Python",
+    "Programming Language :: Python :: 3.9",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+    "Programming Language :: Python :: Implementation :: PyPy",
+    "Topic :: Software Development :: Libraries :: Python Modules"
+]
+keywords = ["llm", "nlp", "polars", "large language models", "natural language processing"]
 
 [build-system]
 requires = ["hatchling"]
 build-backend = "hatchling.build"
+
+[project.urls]
+homepage = "https://mlverse.github.io/mall/"
+issues = "https://github.com/mlverse/mall/issues"
diff --git a/r/README.md b/r/README.md
new file mode 100644
index 0000000..190efee
--- /dev/null
+++ b/r/README.md
@@ -0,0 +1,22 @@
+# mall
+
+
+
+[![R-CMD-check](https://github.com/mlverse/mall/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/mlverse/mall/actions/workflows/R-CMD-check.yaml)
+[![Codecov test
+coverage](https://codecov.io/gh/mlverse/mall/branch/main/graph/badge.svg)](https://app.codecov.io/gh/mlverse/mall?branch=main)
+[![Lifecycle:
+experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
+
+
+
+
+
+Run multiple LLM predictions against a data frame. The predictions are
+processed row-wise over a specified column. It works using a
+pre-determined one-shot prompt, along with the current row’s content.
+`mall` is now available in both R and Python.
+
+To find out how to install and use, or just to learn more about it, please
+visit the official website: https://mlverse.github.io/mall/
+
diff --git a/r/man/figures/logo.png b/r/man/figures/logo.png
new file mode 100644
index 0000000..ebeea89
Binary files /dev/null and b/r/man/figures/logo.png differ