Merge pull request #11 from dotCMS/issue-10-update-listener-to-publis…
…h-automatically

feat(embeddings) updates listener to automatically insert embeddings …
wezell authored Jan 5, 2024
2 parents e9f21ac + fee3403 commit 71c2692
Showing 18 changed files with 402 additions and 76 deletions.
File renamed without changes.
64 changes: 64 additions & 0 deletions README-Workflow-Actionlets.md
@@ -0,0 +1,64 @@
## Workflow Actionlets

The plugin provides 4 workflow actionlets that can be attached to any workflow process in dotCMS.
They are described below.

### AI Embeddings - DotEmbeddingsActionlet

This actionlet can automatically add or remove content from the `dot_embeddings` semantic search
index. You can specify one or more content types separated by line breaks or by commas. The
actionlet tries to intelligently select which field or fields should be read for indexing. In the
case of content, it will index the first StoryBlock, WYSIWYG or TextArea field. For pages, it will
try to render the page and use the resulting HTML. For fileAssets and dotAssets, it will index the
first binary field it finds. If needed, you can also specify which field of a content type to
index. For example:

```
blog.blogContent
news
webPageContent
document.fileAsset
```

would mean that any `blog`, `news`, `webPageContent` or `document` passed through this workflow
actionlet would be added to the index. In the case of a `blog`, it would index the
field `blogContent` and in the case of `document`, it would index the file found under
the `fileAsset` field.
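
The `type` / `type.field` syntax can be illustrated with a small, self-contained sketch. This is not the plugin's code: `TypeFieldSpec` is a hypothetical helper that works on plain strings, whereas the plugin resolves real content types and fields through the dotCMS APIs. An empty field list here stands for "let the actionlet pick the field".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the plugin: it parses the
// "type" / "type.field" syntax into plain strings.
final class TypeFieldSpec {

    private TypeFieldSpec() {
    }

    /** Maps each content type variable to the field variables listed for it. */
    static Map<String, List<String>> parse(String spec) {
        Map<String, List<String>> typesAndFields = new LinkedHashMap<>();
        if (spec == null || spec.isBlank()) {
            return typesAndFields;
        }
        // Entries are separated by commas and/or line breaks (\n or \r\n).
        for (String rawEntry : spec.split("[,\\r\\n]+")) {
            String entry = rawEntry.trim();
            if (entry.isEmpty()) {
                continue;
            }
            // At most two parts: the content type and an optional field.
            String[] parts = entry.split("\\.", 2);
            List<String> fields = typesAndFields.computeIfAbsent(parts[0], k -> new ArrayList<>());
            if (parts.length == 2 && !parts[1].isEmpty()) {
                fields.add(parts[1]);
            }
        }
        return typesAndFields;
    }

    public static void main(String[] args) {
        String spec = "blog.blogContent\nnews\nwebPageContent\ndocument.fileAsset";
        // prints {blog=[blogContent], news=[], webPageContent=[], document=[fileAsset]}
        System.out.println(parse(spec));
    }
}
```

Listing the same type twice with different fields (for example `blog.title, blog.blogContent`) accumulates both fields under `blog` in this sketch.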


### AI Content Prompt - OpenAIContentPromptActionlet

This actionlet can be used to automatically modify content based on a prompt and/or the properties in
the content itself. The prompt that is passed to OpenAI is a Velocity template that gets merged with
the content that is passing through the workflow step.
The response returned by OpenAI can be stored in a field or, if OpenAI's response is a JSON
object, can be used to update multiple fields in the content itself. Take the following prompt as an
example:

> Generate an article for SEO. The article should describe """\n$contentlet.topic\n""". Return this
> article as RFC8259 compliant JSON with 3 properties, "title", for the article title, "blog" for
> the article content, and "urlSlug" which would contain the article title value all
> lowercase with any special characters removed and dashes instead of spaces between the words. The
> article content should be in HTML. Try to use the keywords "$contentlet.keywords" as often as
> possible. Make the article at least 1500 words long with an informative, friendly tone of voice
> and try to write the article in such a way that it will not be detectable as having been written by
> AI.

This prompt will replace the `$contentlet.topic` and `$contentlet.keywords` references with the
content's values and send the result as a prompt to OpenAI. Because we ask OpenAI to respond with a
JSON object, we can use the values that are returned in the object to populate the `title`,
`urlSlug` and `blog` fields of our content.
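
To make the merge step concrete, here is a minimal sketch of what substituting `$contentlet.<field>` references looks like. It is only a stand-in: the plugin uses a real Velocity engine, which supports far more than simple references, and `PromptMerge` is a hypothetical name introduced for illustration. This sketch leaves references it cannot resolve untouched.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a tiny stand-in for the Velocity merge.
// PromptMerge is a hypothetical name, not a class in the plugin.
final class PromptMerge {

    // Matches references such as $contentlet.topic and captures the field name.
    private static final Pattern REFERENCE = Pattern.compile("\\$contentlet\\.(\\w+)");

    private PromptMerge() {
    }

    /** Replaces each $contentlet.<field> reference with the matching content value. */
    static String merge(String template, Map<String, String> content) {
        return REFERENCE.matcher(template).replaceAll(match -> {
            String value = content.get(match.group(1));
            // Unknown references are kept as written; quoting keeps "$" and "\" literal.
            return Matcher.quoteReplacement(value != null ? value : match.group());
        });
    }

    public static void main(String[] args) {
        Map<String, String> content = Map.of("topic", "headless CMS", "keywords", "dotCMS, API");
        // prints Describe headless CMS using "dotCMS, API".
        System.out.println(merge("Describe $contentlet.topic using \"$contentlet.keywords\".", content));
    }
}
```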

This action generally runs asynchronously in the background as generating content can take some time.

### AI Generate Image - OpenAIGenerateImageActionlet

This actionlet can be used to automatically generate an image based on the given prompt. You can
use any value from the `$contentlet` object or you can use a special value `$contentletToString`
which tries to intelligently render a content object as a string value depending on its type.

### AI Auto-tag Content - OpenAIAutoTagActionlet

This actionlet converts the content in the workflow to a string and then submits it to OpenAI to
tag. You can "limit" the tags to what already exists in dotCMS (the top 1000 tags) which are sent to
OpenAI as suggestions. The results will be appended to your content's `tag` field.
14 changes: 11 additions & 3 deletions README.md
@@ -1,14 +1,20 @@
# README

This is a plugin that provides a number of dotCMS-specific tools that leverage AI tooling (OpenAI specifically) for use in dotCMS.
This includes REST apis, Workflow Actions and Viewtools that let dotCMS interact with AI in a variety of ways. For examples on how to use these endpoints and to create content embedding indexes, see
[this document](README-CURL.md). To see how OpenAI can be used in Velocity contexts, see this [this document](README-Velocity%20Tooling.md)
This includes REST apis, Workflow Actions and Viewtools that let dotCMS interact with AI in a variety of ways.

* **RestAPIs** - Examples on how to use these endpoints and to create content embedding indexes, see
[README-CURL.md](README-CURL.md).
* **Velocity Viewtools** - Examples on how OpenAI can be used in Velocity contexts, see [README-Velocity-Tooling.md](README-Velocity-Tooling.md)
* **Workflow Actionlets** - Examples on how to use the included Workflow Actionlets, see [README-Workflow-Actionlets.md](README-Workflow-Actionlets.md)


Out of the box, it provides:
### An App
Where credentials and defaults can be configured and/or overridden
Where credentials and defaults can be configured and/or overridden. For a full list of possible configurations, see the config tab on the Portlet/Tool
### dotAI Portlet/Tool
- Search and Chat with Content
- Generate and Save new AI Images
- View/Update/Delete Content Embedding Indexes
- View AI Plugin configuration values. These are important because they can override and parameterize the prompts that we send to OpenAI.
### REST APIs
@@ -27,10 +33,12 @@ Out of the box, it provides:
- Perform a `raw` chat request to OpenAI's completion endpoint
- `/api/v1/ai/image/generate` - Image Resource
- Create an AI generated image based on a prompt. The resulting image will be stored as a temp file in dotCMS

### Workflow Actions
- **OpenAI Embeddings** (`DotEmbeddingsActionlet`) This actionlet uses OpenAI to generate and save (or delete) the embeddings for content. This is used so that an embedding index can be kept up to date as new content is published and/or unpublished from dotCMS.
- **OpenAI Content Prompt** (`OpenAIContentPromptActionlet`). This actionlet can be called to automatically populate/update fields of content as the content is pushed through a workflow. It works by expecting OpenAI to return its data/answer in a json format which will then be used to update the content. The example usage is to post content to OpenAI and have OpenAI automatically write appropriate SEO title and SEO short description for the content.
- **OpenAI Generate Image** (`OpenAIGenerateImageActionlet`). This actionlet can automatically generate an image based on a content prompt. This content prompt is a velocity template and can use the values of the content in it. By default, this actionlet will add this image to the first binary field in the content.
- **OpenAI Auto Tag Content** (`OpenAIAutoTagActionlet`). This actionlet can automatically tag content based on its values.

### Velocity Viewtools
- `$ai` can be used to generate content and/or images from a prompt.
2 changes: 1 addition & 1 deletion src/main/java/com/dotcms/ai/Activator.java
@@ -112,7 +112,7 @@ public void stop(BundleContext context) throws Exception {
OpenAIThreadPool.shutdown();

//Unregister all the bundle services
unregisterServices(context);
//unregisterServices(context);

unregisterViewToolServices();

21 changes: 15 additions & 6 deletions src/main/java/com/dotcms/ai/api/CompletionsAPI.java
@@ -4,7 +4,6 @@
import com.dotcms.ai.rest.forms.CompletionsForm;
import com.dotmarketing.util.json.JSONObject;
import io.vavr.Lazy;

import java.io.OutputStream;

public interface CompletionsAPI {
@@ -19,8 +18,8 @@ static CompletionsAPI impl() {


/**
* this method takes the query/prompt, searches dotCMS content for matching
* embeddings and then returns an AI summary based on the matching content in dotCMS
* this method takes the query/prompt, searches dotCMS content for matching embeddings and then returns an AI
* summary based on the matching content in dotCMS
*
* @param searcher
* @return
@@ -29,8 +28,8 @@ static CompletionsAPI impl() {


/**
* this method takes the query/prompt, searches dotCMS content for matching
* embeddings and then streams the AI response based on the matching content in dotCMS
* this method takes the query/prompt, searches dotCMS content for matching embeddings and then streams the AI
* response based on the matching content in dotCMS
*
* @param searcher
* @return
@@ -55,7 +54,17 @@ static CompletionsAPI impl() {
*/
JSONObject raw(JSONObject promptJSON);


/**
* this method takes a prompt in the form of parameters and returns a json AI response based on the parameters
* passed in.
*
* @param systemPrompt
* @param userPrompt
* @param model
* @param temperature
* @param maxTokens
* @return
*/
JSONObject prompt(String systemPrompt, String userPrompt, String model, float temperature, int maxTokens);


84 changes: 81 additions & 3 deletions src/main/java/com/dotcms/ai/api/EmbeddingsAPI.java
@@ -6,10 +6,9 @@
import com.dotmarketing.portlets.contentlet.model.Contentlet;
import com.dotmarketing.util.json.JSONObject;
import io.vavr.Tuple2;

import javax.validation.constraints.NotNull;
import java.util.List;
import java.util.Map;
import javax.validation.constraints.NotNull;

public interface EmbeddingsAPI {

@@ -24,32 +23,111 @@ static EmbeddingsAPI impl() {

void shutdown();

/**
* given a contentlet, a list of fields and an index, this method will do its best to turn that content into an
* index-able string that then gets vectorized and stored in postgres.
* <p>
* Important - if you send in an empty list of fields to index, the method will try to intelligently(tm) pick how to
* index the content. For example, if you send in a fileAsset or dotAsset, it will try to index the content of the
* file. If you send a htmlPageAsset, it will render the page and index the rendered page. If you send a contentlet
* with a Storyblock or wysiwyg field, it will render those and index the resultant content.
*
* @param contentlet
* @param fields
* @param index
* @return
*/
boolean generateEmbeddingsforContent(Contentlet contentlet, List<Field> fields, String index);

/**
* this method takes a contentlet and a velocity template, generates a velocity context that includes the
* $contentlet in it and indexes the rendered result.
*
* @param contentlet
* @param velocityTemplate
* @param indexName
* @return
*/
boolean generateEmbeddingsforContent(@NotNull Contentlet contentlet, String velocityTemplate, String indexName);

/**
* Takes a DTO object and based on its properties deletes from the embeddings index.
*
* @param dto
* @return
*/
int deleteEmbedding(EmbeddingsDTO dto);


/**
* This method takes a comma- or line-separated string of content types and, optionally, fields and returns
* a map of <contentTypeVar, List<FieldsToIndex>>.
*
* @param typeAndFieldParam comma- or line-separated content types, each optionally followed by .fieldVar
* @return
*/
Map<String, List<Field>> parseTypesAndFields(String typeAndFieldParam);

/**
* This method takes a list of semantic search results, which are just fragments of content and returns a json
* object of a list of the actual contentlets and fragments that matched the result. The idea is to provide the
* ability to show exactly which contentlets matched the semantic query and specifically, which fragments in that
* content matched.
*
* @param searcher
* @param searchResults
* @return
*/
JSONObject reduceChunksToContent(EmbeddingsDTO searcher, List<EmbeddingsDTO> searchResults);

/**
* Takes a searcher DTO and returns a JSON object that is a list of matching contentlets and the fragments that
* matched.
*
* @param searcher
* @return
*/
JSONObject searchForContent(EmbeddingsDTO searcher);

/**
* returns a list of matching content+embeddings from the dot_embeddings table based on the searcher dto
*
* @param searcher
* @return
*/
List<EmbeddingsDTO> getEmbeddingResults(EmbeddingsDTO searcher);

/**
* returns a count of matching content+embeddings based on the searcher dto
*
* @param searcher
* @return
*/
long countEmbeddings(EmbeddingsDTO searcher);

/**
* returns a map of all the available dot_embeddings 'indexes' plus the count of embeddings in them
*
* @return
*/
Map<String, Map<String, Object>> countEmbeddingsByIndex();

/**
* drops the dot_embeddings table
*/
void dropEmbeddingsTable();

/**
* inits pg_vector and builds the dot_embeddings table
*/
void initEmbeddingsTable();

/**
* Returns
* Takes a string and returns the embeddings value for the string
*
* @param content
* @return
*/
Tuple2<Integer, List<Float>> pullOrGenerateEmbeddings(String content);


}
38 changes: 37 additions & 1 deletion src/main/java/com/dotcms/ai/api/EmbeddingsAPIImpl.java
@@ -13,6 +13,7 @@
import com.dotcms.api.web.HttpServletRequestThreadLocal;
import com.dotcms.api.web.HttpServletResponseThreadLocal;
import com.dotcms.contenttype.model.field.Field;
import com.dotcms.contenttype.model.type.ContentType;
import com.dotcms.rendering.velocity.util.VelocityUtil;
import com.dotcms.rest.ContentResource;
import com.dotmarketing.beans.Host;
@@ -29,6 +30,8 @@
import io.vavr.Tuple2;
import io.vavr.Tuple3;
import io.vavr.control.Try;
import java.util.ArrayList;
import java.util.HashMap;
import org.apache.velocity.context.Context;

import javax.validation.constraints.NotNull;
@@ -42,7 +45,7 @@
import java.util.stream.Collectors;


public class EmbeddingsAPIImpl implements EmbeddingsAPI {
class EmbeddingsAPIImpl implements EmbeddingsAPI {


static final Cache<String, Tuple2<Integer, List<Float>>> embeddingCache = Caffeine.newBuilder()
@@ -113,6 +116,39 @@ public boolean generateEmbeddingsforContent(@NotNull Contentlet contentlet, Stri

}

@Override
public Map<String, List<Field>> parseTypesAndFields(final String typeAndFieldParam) {

if (UtilMethods.isEmpty(typeAndFieldParam)) {
return Map.of();
}

final Map<String, List<Field>> typesAndFields = new HashMap<>();
final String[] typeFieldArr = typeAndFieldParam.trim().split("[\\r?\\n,]");

for (String typeField : typeFieldArr) {
String[] typeOptField = typeField.trim().split("\\.");
Optional<ContentType> type = Try.of(
() -> APILocator.getContentTypeAPI(APILocator.systemUser()).find(typeOptField[0])).toJavaOptional();
if (type.isEmpty()) {
continue;
}
List<Field> fields = typesAndFields.getOrDefault(type.get().variable(), new ArrayList<>());

Optional<Field> field = Try.of(() -> type.get().fields().stream().filter(f->f.variable().equalsIgnoreCase(typeOptField[1])).findFirst()).getOrElse(Optional.empty());
if (field.isPresent()) {
fields.add(field.get());
}

typesAndFields.put(type.get().variable(), fields);

}

return typesAndFields;
}




@Override
public JSONObject reduceChunksToContent(EmbeddingsDTO searcher, final List<EmbeddingsDTO> searchResults) {
1 change: 1 addition & 0 deletions src/main/java/com/dotcms/ai/app/AppKeys.java
@@ -28,6 +28,7 @@ public enum AppKeys {
EMBEDDINGS_THREADS_QUEUE("com.dotcms.ai.embeddings.threads.queue", "10000"),
EMBEDDINGS_CACHE_TTL_SECONDS("com.dotcms.ai.embeddings.cache.ttl.seconds", "600"),
EMBEDDINGS_CACHE_SIZE("com.dotcms.ai.embeddings.cache.size", "1000"),
LISTENER_INDEXER("listenerIndexer", "{}"),
EMBEDDINGS_DB_DELETE_OLD_ON_UPDATE("com.dotcms.ai.embeddings.delete.old.on.update", "true");
public static final String APP_KEY = "dotAI";
public static final String APP_YAML_NAME = APP_KEY + ".yml";