Merge pull request #11 from dotCMS/issue-10-update-listener-to-publis…

…h-automatically feat(embeddings) updates listener to automatically insert embeddings …
dotCMS · Jan 5, 2024 · 71c2692 · 71c2692
2 parents e9f21ac + fee3403
commit 71c2692
Showing 18 changed files with 402 additions and 76 deletions.
diff --git a/README-Velocity Tooling.md b/README-Velocity-Tooling.md
diff --git a/README-Workflow-Actionlets.md b/README-Workflow-Actionlets.md
@@ -0,0 +1,64 @@
+## Workflow Actionlets
+
+The plugin provides four workflow actionlets that can be attached to any workflow process in dotCMS.
+They are described below.
+
+### AI Embeddings - DotEmbeddingsActionlet
+
+This actionlet can automatically add or remove content from the `dot_embeddings` semantic search
+index. You can specify one or more content types separated by line breaks or by commas. The
+actionlet tries to intelligently select which field or fields should be read for indexing. In the
+case of content, it will index the first StoryBlock, WYSIWYG or TextArea field. For pages, it will
+try to render the page and use the resultant HTML. For fileAssets and dotAssets, it will index the
+first binary field it finds. If needed, you can also name the field to index by appending its
+variable to the content type after a dot (`contentType.fieldVariable`).
+For example:
+
+```
+blog.blogContent
+news
+webPageContent
+document.fileAsset
+```
+
+would mean that any `blog`, `news`, `webPageContent` or `document` passed through this workflow
+actionlet would be added to the index. In the case of a `blog`, it would index the
+field `blogContent` and in the case of `document`, it would index the file found under
+the `fileAsset` field.
+
+
+### AI Content Prompt - OpenAIContentPromptActionlet
+
+This actionlet can be used to automatically modify content based on a prompt and/or the properties in
+the content itself. The prompt that is passed to OpenAI is a Velocity template that gets merged with
+the content that is passing through the workflow step.
+The response returned by OpenAI can be stored in a field or, if OpenAI's response is a JSON
+object, can be used to update multiple fields in the content itself. Take the following prompt as an
+example:
+
+> Generate an article for SEO. The article should describe  """\n$contentlet.topic\n""". Return this
+> article as RFC8259 compliant JSON with 3 properties, "title", for the article title, "blog" for
+> the article content, and "urlSlug" which would contain the article title value all
+> lowercase with any special characters removed and dashes instead of spaces between the words. The
+> article content should be in HTML. Try to use the keywords "$contentlet.keywords" as often as
+> possible. Make the article at least 1500 words long with an informative, friendly tone of voice
+> and try to write the article in such a way that it will not be detectable as having been written by
+> AI.
+
+This prompt will substitute the content values for `$contentlet.topic` and `$contentlet.keywords`
+and send the result as a prompt to OpenAI. Because we ask OpenAI to respond with a JSON object, we can
+use the values that are returned in the object to populate the title, slug and blog fields of our content.
+
+This action generally runs asynchronously in the background as generating content can take some time.
+
+### AI Generate Image - OpenAIGenerateImageActionlet
+
+This actionlet can be used to automatically generate an image based on the given prompt. You can
+use any value from the `$contentlet` object or you can use a special value `$contentletToString`
+which tries to intelligently render a content object as a string value depending on its type.
+
+### AI Auto-tag Content - OpenAIAutoTagActionlet
+
+This actionlet converts the content in the workflow to a string and then submits it to OpenAI to
+tag. You can "limit" the tags to those that already exist in dotCMS (the top 1000 tags), which are sent to
+OpenAI as suggestions. The results will be appended to your content's `tag` field.
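
The prompt-merging step described for `OpenAIContentPromptActionlet` can be illustrated with a small stand-alone sketch. This is not the plugin's code: the plugin merges the prompt with Velocity, while the example below only substitutes `$contentlet.<field>` references using a regular expression. The class `PromptMergeSketch`, its `merge` method and the sample field values are hypothetical names chosen for illustration.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for Velocity merging; it handles only $contentlet.<field> references.
class PromptMergeSketch {

    private static final Pattern FIELD_REF =
            Pattern.compile("\\$contentlet\\.([A-Za-z_][A-Za-z0-9_]*)");

    // Replaces each $contentlet.<field> with the matching value; unknown fields are left untouched.
    static String merge(String template, Map<String, String> contentlet) {
        Matcher matcher = FIELD_REF.matcher(template);
        StringBuilder merged = new StringBuilder();
        while (matcher.find()) {
            String value = contentlet.get(matcher.group(1));
            String replacement = value != null ? value : matcher.group();
            // quoteReplacement keeps '$' and '\' in the value from being read as group references
            matcher.appendReplacement(merged, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(merged);
        return merged.toString();
    }

    public static void main(String[] args) {
        String template = "Describe \"$contentlet.topic\" using the keywords \"$contentlet.keywords\".";
        Map<String, String> contentlet = Map.of("topic", "headless CMS", "keywords", "dotCMS, API");
        System.out.println(merge(template, contentlet));
    }
}
```

Leaving unknown references in place makes a misspelled field name visible in the outgoing prompt instead of silently producing an empty string.
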
diff --git a/README.md b/README.md
@@ -1,14 +1,20 @@
 # README
 
 This is a plugin that provides a number of dotCMS-specific tools that leverage AI tooling (OpenAI specifically) for use in dotCMS.  
-This includes REST apis, Workflow Actions and Viewtools that let dotCMS interact with AI in a variety of ways.  For examples on how to use these endpoints and to create content embedding indexes, see 
-[this document](README-CURL.md). To see how OpenAI can be used in Velocity contexts, see this [this document](README-Velocity%20Tooling.md)
+This includes REST apis, Workflow Actions and Viewtools that let dotCMS interact with AI in a variety of ways.  
+
+* **REST APIs** - For examples of how to use these endpoints and create content embedding indexes, see 
+[README-CURL.md](README-CURL.md). 
+* **Velocity Viewtools** - For examples of how OpenAI can be used in Velocity contexts, see [README-Velocity-Tooling.md](README-Velocity-Tooling.md).
+* **Workflow Actionlets** - For examples of how to use the included Workflow Actionlets, see [README-Workflow-Actionlets.md](README-Workflow-Actionlets.md).
+
 
 Out of the box, it provides:
 ### An App
-  Where credentials and defaults can be configured and/or overridden
+  Where credentials and defaults can be configured and/or overridden.  For a full list of possible configurations, see the config tab on the Portlet/Tool.
 ### dotAI Portlet/Tool
   - Search and Chat with Content
+  - Generate and Save new AI Images
   - View/Update/Delete Content Embedding Indexes
   - View AI Plugin configuration values.  These are important because they can override and parameterize the prompts that we send to OpenAI.
 ### REST APIs
@@ -27,10 +33,12 @@ Out of the box, it provides:
     - Perform a `raw` chat request to OpenAIs completion endpoint
 - `/api/v1/ai/image/generate` - Image Resource
     - Create an AI generated image based on a prompt.  The resulting image will be stored as a temp file in dotCMS.
+
 ### Workflow Actions
   - **OpenAI Embeddings** (`DotEmbeddingsActionlet`)  This actionlet uses OpenAI to generate and save (or delete) the embeddings for content.  This is used so that an embedding index can be kept up to date as new content is published and/or unpublished from dotCMS.
   - **OpenAI Content Prompt** (`OpenAIContentPromptActionlet`).  This actionlet can be called to automatically populate/update fields of content as the content is pushed through a workflow.  It works by expecting OpenAI to return its data/answer in a json format which will then be used to update the content.  The example usage is to post content to OpenAI and have OpenAI automatically write appropriate SEO title and SEO short description for the content.
   - **OpenAI Generate Image** (`OpenAIGenerateImageActionlet`).  This actionlet can automatically generate an image based on a content prompt.  This content prompt is a velocity template and can use the values of the content in it.  By default, this actionlet will add this image to the first binary field in the content.
+  - **OpenAI Auto Tag Content** (`OpenAIAutoTagActionlet`).  This actionlet can automatically tag content based on its values.
 
 ### Velocity Viewtools
   - `$ai` can be used to generate content and/or images from a prompt.

diff --git a/src/main/java/com/dotcms/ai/Activator.java b/src/main/java/com/dotcms/ai/Activator.java
@@ -112,7 +112,7 @@ public void stop(BundleContext context) throws Exception {
         OpenAIThreadPool.shutdown();
 
         //Unregister all the bundle services
-        unregisterServices(context);
+        //unregisterServices(context);
 
         unregisterViewToolServices();
 

diff --git a/src/main/java/com/dotcms/ai/api/CompletionsAPI.java b/src/main/java/com/dotcms/ai/api/CompletionsAPI.java
@@ -4,7 +4,6 @@
 import com.dotcms.ai.rest.forms.CompletionsForm;
 import com.dotmarketing.util.json.JSONObject;
 import io.vavr.Lazy;
-
 import java.io.OutputStream;
 
 public interface CompletionsAPI {
@@ -19,8 +18,8 @@ static CompletionsAPI impl() {
 
 
     /**
-     * this method takes the query/prompt, searches dotCMS content for matching
-     * embeddings and then returns an AI summary based on the matching content in dotCMS
+     * this method takes the query/prompt, searches dotCMS content for matching embeddings and then returns an AI
+     * summary based on the matching content in dotCMS
      *
      * @param searcher
      * @return
@@ -29,8 +28,8 @@ static CompletionsAPI impl() {
 
 
     /**
-     * this method takes the query/prompt, searches dotCMS content for matching
-     * embeddings and then streams the AI response based on the matching content in dotCMS
+     * this method takes the query/prompt, searches dotCMS content for matching embeddings and then streams the AI
+     * response based on the matching content in dotCMS
      *
      * @param searcher
      * @return
@@ -55,7 +54,17 @@ static CompletionsAPI impl() {
      */
     JSONObject raw(JSONObject promptJSON);
 
-
+    /**
+     * this method takes a prompt in the form of parameters and returns a json AI response based on the parameters
+     * passed in.
+     *
+     * @param systemPrompt
+     * @param userPrompt
+     * @param model
+     * @param temperature
+     * @param maxTokens
+     * @return
+     */
     JSONObject prompt(String systemPrompt, String userPrompt, String model, float temperature, int maxTokens);
 
 

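The diff does not show how `prompt(...)` is implemented. As a rough illustration only, the sketch below assembles the five parameters into a request body in the shape used by OpenAI's chat completions endpoint (`model`, `messages`, `temperature`, `max_tokens`). The class `ChatPayloadSketch`, its helper methods and the sample model name are assumptions made for this example, and the JSON is built by hand so that the block has no dependencies.

```java
import java.util.Locale;

// Hypothetical helper: builds a chat-completions style request body from the prompt() parameters.
class ChatPayloadSketch {

    // Escapes the characters that must be escaped inside a JSON string literal.
    static String escape(String raw) {
        StringBuilder sb = new StringBuilder();
        for (char c : raw.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // remaining control characters become four-digit hex escapes
                        sb.append("\\").append(String.format("u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Locale.ROOT keeps the decimal separator a dot regardless of the JVM's default locale.
    static String build(String systemPrompt, String userPrompt, String model,
                        float temperature, int maxTokens) {
        return String.format(Locale.ROOT,
                "{\"model\":\"%s\",\"messages\":["
                        + "{\"role\":\"system\",\"content\":\"%s\"},"
                        + "{\"role\":\"user\",\"content\":\"%s\"}],"
                        + "\"temperature\":%.2f,\"max_tokens\":%d}",
                escape(model), escape(systemPrompt), escape(userPrompt), temperature, maxTokens);
    }

    public static void main(String[] args) {
        // "gpt-4" is only a sample value
        System.out.println(build("You are a helpful assistant.", "Summarize this page.", "gpt-4", 0.7f, 500));
    }
}
```

In real code a JSON library (the plugin already uses `JSONObject`) would be the safer way to build the body; the manual escaping here only keeps the example self-contained.
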
diff --git a/src/main/java/com/dotcms/ai/api/EmbeddingsAPI.java b/src/main/java/com/dotcms/ai/api/EmbeddingsAPI.java
@@ -6,10 +6,9 @@
 import com.dotmarketing.portlets.contentlet.model.Contentlet;
 import com.dotmarketing.util.json.JSONObject;
 import io.vavr.Tuple2;
-
-import javax.validation.constraints.NotNull;
 import java.util.List;
 import java.util.Map;
+import javax.validation.constraints.NotNull;
 
 public interface EmbeddingsAPI {
 
@@ -24,32 +23,111 @@ static EmbeddingsAPI impl() {
 
     void shutdown();
 
+    /**
+     * given a contentlet, a list of fields and an index, this method will do its best to turn that content into an
+     * index-able string that then gets vectorized and stored in postgres.
+     * <p>
+     * Important - if you send in an empty list of fields to index, the method will try to intelligently(tm) pick how to
+     * index the content.  For example, if you send in a fileAsset or dotAsset, it will try to index the content of the
+     * file. If you send a htmlPageAsset, it will render the page and index the rendered page.  If you send a contentlet
+     * with a Storyblock or wysiwyg field, it will render those and index the resultant content.
+     *
+     * @param contentlet
+     * @param fields
+     * @param index
+     * @return
+     */
     boolean generateEmbeddingsforContent(Contentlet contentlet, List<Field> fields, String index);
 
+    /**
+     * this method takes a contentlet and a velocity template, generates a velocity context that includes the
+     * $contentlet in it and indexes the rendered result.
+     *
+     * @param contentlet
+     * @param velocityTemplate
+     * @param indexName
+     * @return
+     */
     boolean generateEmbeddingsforContent(@NotNull Contentlet contentlet, String velocityTemplate, String indexName);
 
+    /**
+     * Takes a DTO object and based on its properties deletes from the embeddings index.
+     *
+     * @param dto
+     * @return
+     */
     int deleteEmbedding(EmbeddingsDTO dto);
 
 
+    /**
+     * This method takes a comma or line separated string of content types and optionally fields and returns
+     * a map of <contentTypeVar, List<FieldsToIndex>>
+     * @param typeAndFieldParam the comma or line separated list of content types and optional fields
+     * @return
+     */
+    Map<String, List<Field>> parseTypesAndFields(String typeAndFieldParam);
+
+    /**
+     * This method takes a list of semantic search results, which are just fragments of content and returns a json
+     * object of a list of the actual contentlets and fragments that matched the result. The idea is to provide the
+     * ability to show exactly which contentlets matched the semantic query and specifically, which fragments in that
+     * content matched.
+     *
+     * @param searcher
+     * @param searchResults
+     * @return
+     */
     JSONObject reduceChunksToContent(EmbeddingsDTO searcher, List<EmbeddingsDTO> searchResults);
 
+    /**
+     * Takes a searcher DTO and returns a JSON object that is a list of matching contentlets and the fragments that
+     * matched.
+     *
+     * @param searcher
+     * @return
+     */
     JSONObject searchForContent(EmbeddingsDTO searcher);
 
+    /**
+     * returns a list of matching content+embeddings from the dot_embeddings table based on the searcher dto
+     *
+     * @param searcher
+     * @return
+     */
     List<EmbeddingsDTO> getEmbeddingResults(EmbeddingsDTO searcher);
 
+    /**
+     * returns a count of matching content+embeddings based on the searcher dto
+     *
+     * @param searcher
+     * @return
+     */
     long countEmbeddings(EmbeddingsDTO searcher);
 
+    /**
+     * returns a map of all the available dot_embeddings 'indexes' plus the count of embeddings in them
+     *
+     * @return
+     */
     Map<String, Map<String, Object>> countEmbeddingsByIndex();
 
+    /**
+     * drops the dot_embeddings table
+     */
     void dropEmbeddingsTable();
 
+    /**
+     * inits pg_vector and builds the dot_embeddings table
+     */
     void initEmbeddingsTable();
 
     /**
-     * Returns
+     * Takes a string and returns the embeddings value for the string
      *
      * @param content
      * @return
      */
     Tuple2<Integer, List<Float>> pullOrGenerateEmbeddings(String content);
+
+
 }
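
The name `pullOrGenerateEmbeddings` suggests a look-up-first, generate-on-miss pattern, and the implementation class declares a Caffeine-backed `embeddingCache`, but the diff does not show the method body. The sketch below illustrates that general pattern with a small LRU cache from the standard library. It is an assumption-laden stand-in, not the plugin's logic: `EmbeddingCacheSketch`, `pullOrGenerate` and the generator function are hypothetical, and the token count carried by the real `Tuple2` is omitted.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of a "pull or generate" lookup backed by a small LRU cache.
class EmbeddingCacheSketch {

    private final Map<String, List<Float>> cache;
    private final Function<String, List<Float>> generator;
    int generatorCalls = 0; // exposed only so the example can show cache hits

    EmbeddingCacheSketch(int maxEntries, Function<String, List<Float>> generator) {
        this.generator = generator;
        // An access-ordered LinkedHashMap drops the least recently used entry once it is over capacity.
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, List<Float>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached vector if present; otherwise calls the generator and stores the result.
    synchronized List<Float> pullOrGenerate(String content) {
        List<Float> cached = cache.get(content);
        if (cached != null) {
            return cached;
        }
        generatorCalls++;
        List<Float> generated = generator.apply(content);
        cache.put(content, generated);
        return generated;
    }

    public static void main(String[] args) {
        // The generator is a placeholder; a real one would call an embeddings service.
        EmbeddingCacheSketch sketch = new EmbeddingCacheSketch(1000, text -> List.of((float) text.length()));
        sketch.pullOrGenerate("hello");
        sketch.pullOrGenerate("hello");
        System.out.println("generator calls: " + sketch.generatorCalls);
    }
}
```

Caching by the exact input string only pays off when identical text is embedded repeatedly, for example when the same query is searched several times within the cache's lifetime.
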
diff --git a/src/main/java/com/dotcms/ai/api/EmbeddingsAPIImpl.java b/src/main/java/com/dotcms/ai/api/EmbeddingsAPIImpl.java
@@ -13,6 +13,7 @@
 import com.dotcms.api.web.HttpServletRequestThreadLocal;
 import com.dotcms.api.web.HttpServletResponseThreadLocal;
 import com.dotcms.contenttype.model.field.Field;
+import com.dotcms.contenttype.model.type.ContentType;
 import com.dotcms.rendering.velocity.util.VelocityUtil;
 import com.dotcms.rest.ContentResource;
 import com.dotmarketing.beans.Host;
@@ -29,6 +30,8 @@
 import io.vavr.Tuple2;
 import io.vavr.Tuple3;
 import io.vavr.control.Try;
+import java.util.ArrayList;
+import java.util.HashMap;
 import org.apache.velocity.context.Context;
 
 import javax.validation.constraints.NotNull;
@@ -42,7 +45,7 @@
 import java.util.stream.Collectors;
 
 
-public class EmbeddingsAPIImpl implements EmbeddingsAPI {
+class EmbeddingsAPIImpl implements EmbeddingsAPI {
 
 
     static final Cache<String, Tuple2<Integer, List<Float>>> embeddingCache = Caffeine.newBuilder()
@@ -113,6 +116,39 @@ public boolean generateEmbeddingsforContent(@NotNull Contentlet contentlet, Stri
 
     }
 
+    @Override
+    public Map<String, List<Field>> parseTypesAndFields(final String typeAndFieldParam) {
+
+        if (UtilMethods.isEmpty(typeAndFieldParam)) {
+            return Map.of();
+        }
+
+        final Map<String, List<Field>> typesAndFields = new HashMap<>();
+        final String[] typeFieldArr = typeAndFieldParam.trim().split("[\\r\\n,]+");
+
+        for (String typeField : typeFieldArr) {
+            String[] typeOptField = typeField.trim().split("\\.");
+            Optional<ContentType> type = Try.of(
+                    () -> APILocator.getContentTypeAPI(APILocator.systemUser()).find(typeOptField[0])).toJavaOptional();
+            if (type.isEmpty()) {
+                continue;
+            }
+            List<Field> fields = typesAndFields.getOrDefault(type.get().variable(), new ArrayList<>());
+
+            Optional<Field> field = Try.of(() -> type.get().fields().stream().filter(f->f.variable().equalsIgnoreCase(typeOptField[1])).findFirst()).getOrElse(Optional.empty());
+            if (field.isPresent()) {
+                fields.add(field.get());
+            }
+
+            typesAndFields.put(type.get().variable(), fields);
+
+        }
+
+        return typesAndFields;
+    }
+
+
+
 
     @Override
     public JSONObject reduceChunksToContent(EmbeddingsDTO searcher, final List<EmbeddingsDTO> searchResults) {

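To make the `type` / `type.field` parsing step easier to follow outside of dotCMS, here is a dependency-free sketch of the same idea. It differs from `parseTypesAndFields` in ways that should be read as assumptions: field names stay plain strings instead of `Field` objects, and no content type lookup is performed, so unknown types are not skipped. The class name `TypeFieldSpecSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, dependency-free variant of the parsing step; field names stay plain strings.
class TypeFieldSpecSketch {

    // Splits on commas and line breaks, then on the first dot: "type" or "type.field".
    static Map<String, List<String>> parse(String spec) {
        Map<String, List<String>> typesAndFields = new LinkedHashMap<>();
        if (spec == null || spec.isBlank()) {
            return typesAndFields;
        }
        for (String entry : spec.trim().split("[\\r\\n,]+")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split("\\.", 2);
            // A type listed without a field maps to an empty list
            List<String> fields = typesAndFields.computeIfAbsent(parts[0], k -> new ArrayList<>());
            if (parts.length == 2 && !parts[1].isEmpty()) {
                fields.add(parts[1]);
            }
        }
        return typesAndFields;
    }

    public static void main(String[] args) {
        // The example list from README-Workflow-Actionlets.md
        System.out.println(parse("blog.blogContent\nnews\nwebPageContent\ndocument.fileAsset"));
    }
}
```

An empty field list corresponds to the case documented on `generateEmbeddingsforContent`, where the API picks which field to index on its own.
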
diff --git a/src/main/java/com/dotcms/ai/app/AppKeys.java b/src/main/java/com/dotcms/ai/app/AppKeys.java
@@ -28,6 +28,7 @@ public enum AppKeys {
     EMBEDDINGS_THREADS_QUEUE("com.dotcms.ai.embeddings.threads.queue", "10000"),
     EMBEDDINGS_CACHE_TTL_SECONDS("com.dotcms.ai.embeddings.cache.ttl.seconds", "600"),
     EMBEDDINGS_CACHE_SIZE("com.dotcms.ai.embeddings.cache.size", "1000"),
+    LISTENER_INDEXER("listenerIndexer", "{}"),
     EMBEDDINGS_DB_DELETE_OLD_ON_UPDATE("com.dotcms.ai.embeddings.delete.old.on.update", "true");
     public static final String APP_KEY = "dotAI";
     public static final String APP_YAML_NAME = APP_KEY + ".yml";