
User Behavior Insights implementation for Apache Solr #2452

Draft · wants to merge 112 commits into base: main

Conversation

@epugh (Contributor) commented May 9, 2024

Description

I am working with other folks, especially Stavros Macrakis ([email protected]), to come up with a solution for understanding what users are doing in response to search results. We have great visibility into an incoming query, what we do with it, and which docs are sent back. We do NOT have a way of tying that search to what the user does next, or of telling whether a follow-on query is connected to the original one.

Many teams lean on GA, Snowplow, or custom code for tracking click-throughs, add-to-carts, etc. as signals, but nothing exists that is drop-dead simple to use and open source.

Solution

User Behavior Insights is a shared schema for tracking search-related activities. There is a basic implementation for OpenSearch, and this is a version for Apache Solr.

Tasks to be done:

  • Demonstrate providing a .expr file and using it to write to another Solr collection.
  • Look at the performance implications of every query generating a streaming expression.
  • Check that we only record on the main node, not the replicas, when sharding.
  • How can I load test this?
  • Write up Ref Guide docs.
  • Can we add it to techproducts as an example?
  • Add UBI to the Admin UI as a flag.
  • Add UBI to the SolrJ basic client.
  • Add UBI to the SolrJ JSON Query client.
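As a sketch of what the first task might look like (the collection name `ubi_queries` and the `ubiQuery()` tuple source are assumptions for illustration, not the actual implementation), a .expr file could wrap the logged query tuple in Solr's `update()` stream decorator to write it to another collection:

```
update(ubi_queries,
       batchSize=1,
       ubiQuery())
```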

Tests

Bats test to demonstrate the end-to-end use of UBI.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide.

@HoustonPutman (Contributor) commented:
I'd love to review here, but I think I need some more starting information either in a ref guide page or a JIRA, I'm kind of lost right now...

@epugh (Contributor Author) commented May 10, 2024

> I'd love to review here, but I think I need some more starting information either in a ref guide page or a JIRA, I'm kind of lost right now...

Yeah... I'll go ahead and write up some ref guide docs! And finish the demo .bats script ;-)

@chatman (Contributor) commented May 20, 2024

Usually, features like these are discussed on the dev@ list, in JIRA, or in a SIP.
The most important question I have in mind is whether this needs to be in the core search engine. If not, could this be a plugin/package, shipped outside of solr-core?

@epugh (Contributor Author) commented May 20, 2024

This is definitely draft-mode code... I opened it as a PR just to be able to track the work, and once it gets a bit further, I plan on opening a proper discussion about it. A module? Solr Sandbox? A component? A full-blown package? So many fun options...

@epugh (Contributor Author) commented Nov 27, 2024

A question for folks smarter than me: should the classes UBIQuery and UBIQueryStream be added to UBIComponent.java? UBIQuery is just a POJO, and UBIQueryStream wires the use of the component up to a streaming expression. I don't see either ever being used elsewhere...

@epugh (Contributor Author) commented Nov 27, 2024

Just stubbed my toe on "distributed processing is harder than single-core processing"! With a two-node setup, I discovered that I am logging to a SINGLE userfiles/ubi_queries.jsonl file, and I log once for each shard instead of just logging at the collector step:

{"query_id":"c4e40af6-67b7-4824-8b63-5aae70a485f6","timestamp":"2024-11-27T13:42:19.121Z"}
{"query_id":"5dfedf02-fd89-4e40-b3aa-7700c162800b","timestamp":"2024-11-27T13:42:19.121Z"}

Sigh.

The fact that we are calling .keySet may be a problem, because that means the other components might come back in a random order. Maybe we shouldn't even use a map of String to class; it should just be a list of classes?
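To make the ordering concern concrete (this is generic Java, not Solr code): `HashMap.keySet()` makes no iteration-order guarantee, while a `LinkedHashMap` (or a plain `List`) preserves the order in which components were registered:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class ComponentOrder {
    public static void main(String[] args) {
        // HashMap: iteration order is unspecified and may differ from insertion order
        Map<String, String> hashed = new HashMap<>();
        hashed.put("query", "QueryComponent");
        hashed.put("facet", "FacetComponent");
        hashed.put("ubi", "UBIComponent");

        // LinkedHashMap preserves the order the components were registered in
        Map<String, String> ordered = new LinkedHashMap<>();
        ordered.put("query", "QueryComponent");
        ordered.put("facet", "FacetComponent");
        ordered.put("ubi", "UBIComponent");

        // Guaranteed for LinkedHashMap; NOT guaranteed for hashed.keySet()
        System.out.println(new ArrayList<>(ordered.keySet())); // [query, facet, ubi]
    }
}
```

Search components are order-sensitive (a logging component like UBI generally needs to run after the query component has produced results), so an ordering guarantee matters here.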
@epugh (Contributor Author) commented Nov 28, 2024

Argh, a bit stuck. I can't figure out how to have the UBIComponent, during a distributed query, look up the final doc ids and record them before sending them back to the user. With a single node and a single shard it works great, but not in a distributed fashion.

I keep getting:

2024-11-28 12:36:31.368 ERROR (qtp428039780-40-localhost-11) [c:twoshard s:shard1 r:core_node4 x:twoshard_shard1_replica_n2 t:localhost-11] o.a.s.s.HttpSolrCall 500 Exception => java.lang.NullPointerException: Cannot read field "docList" because the return value of "org.apache.solr.handler.component.ResponseBuilder.getResults()" is null
	at org.apache.solr.handler.component.UBIComponent.doStuff(UBIComponent.java:315)
java.lang.NullPointerException: Cannot read field "docList" because the return value of "org.apache.solr.handler.component.ResponseBuilder.getResults()" is null
	at org.apache.solr.handler.component.UBIComponent.doStuff(UBIComponent.java:315) ~[?:?]
	at org.apache.solr.handler.component.UBIComponent.distributedProcess(UBIComponent.java:252) ~[?:?]
	at org.apache.solr.handler.component.SearchHandler.processComponents(SearchHandler.java:552) ~[?:?]
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:429) ~[?:?]
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:238) ~[?:?]
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2875) ~[?:?]

@mkhludnev (Member) left a comment:

Hi @epugh !

      }
    }

    ResultContext rc = (ResultContext) rb.rsp.getResponse();
@mkhludnev (Member):

To make this distributed, we need to split this method here and execute the parts at the appropriate stages.
It seems UBIComponent needs to submit the doc ids found, right?
In a distributed context, do we record the per-shard ids before merging, or the merged-and-cropped result ids? I bet the latter; please confirm.

@epugh (Contributor Author):

Thanks @mkhludnev for looking at this; it gives me renewed energy to know someone else is looking at it! Yes, it is the merged-and-cropped result ids. We want to record in the plugin the candidate result ids that the user MIGHT have seen, which are later compared against clicks and impressions to identify which docs are NOT attractive to the user.

    stream = constructStream(streamFactory, streamExpression);

    streamContext.put("ubi-query", ubiQuery);
    stream.setStreamContext(streamContext);
@mkhludnev (Member):

I know nothing about streams, but is it possible to submit UBIQuery here when it isn't a subclass of any serializable framework? You know, passing a reference might work locally, but not in a remote/distributed environment.

@epugh (Contributor Author):

Good question. I believe that the getTuple method will be run immediately, which means this code doesn't actually get run in a distributed sense. I.e., the ubiQuery object that we put into the streamContext is immediately read back out in the getTuple() method. That is the job of the UBIQueryStream class: to convert the UBIQuery found in the context into a Tuple that is used by streaming expressions.

I am going to try actually making UBIQuery and UBIQueryStream inner classes of UBIComponent to see how that looks...
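As a generic illustration of the local pass-through described above (plain Java with hypothetical stand-ins, not the actual Solr streaming API): the object placed in the context map is read straight back out in the same JVM, so no wire serialization is involved:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class ContextTupleSketch {
    // Hypothetical stand-in for the StreamContext shared between component and stream
    static Map<String, Object> streamContext = new HashMap<>();

    // Hypothetical stand-in for UBIQueryStream.getTuple(): reads the query object
    // back out of the context immediately; same JVM, no serialization required
    static Map<String, Object> getTuple() {
        Map<?, ?> ubiQuery = (Map<?, ?>) streamContext.get("ubi-query");
        Map<String, Object> tuple = new LinkedHashMap<>();
        tuple.put("query_id", ubiQuery.get("query_id"));
        return tuple;
    }

    public static void main(String[] args) {
        streamContext.put("ubi-query", Map.of("query_id", "q-123"));
        System.out.println(getTuple()); // {query_id=q-123}
    }
}
```

This is why a non-serializable POJO works here: the reference never crosses a process boundary before being converted into a Tuple.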

@mkhludnev (Member):

OK, got it. I found toMap(). Shouldn't it yield docIds as well? I can't see any references to this field there: https://github.com/epugh/solr/blob/99d6b7a7eb7b28a92f4cb36d4a525f8b901ba93c/solr/core/src/java/org/apache/solr/handler/component/UBIQuery.java#L103

@epugh (Contributor Author):

That's a fair question... OMG, I can't believe that I forgot to add the doc ids!

Thanks for spotting that! How embarrassing!

@epugh (Contributor Author):

And fixed.
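For reference, a minimal self-contained sketch of a toMap() that includes the doc ids (the class and field names here are assumptions based on this discussion, not the actual UBIQuery code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the UBIQuery POJO discussed above
public class UbiQuerySketch {
    private final String queryId;
    private final String timestamp;
    private final String docIds; // the field that was originally missing from toMap()

    public UbiQuerySketch(String queryId, String timestamp, String docIds) {
        this.queryId = queryId;
        this.timestamp = timestamp;
        this.docIds = docIds;
    }

    // The serialized form must carry doc_ids; without it, the logged record
    // loses the candidate results the user might have seen.
    public Map<String, Object> toMap() {
        Map<String, Object> map = new LinkedHashMap<>();
        map.put("query_id", queryId);
        map.put("timestamp", timestamp);
        map.put("doc_ids", docIds);
        return map;
    }
}
```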

Pardon, I barely understand what's going on there.
    names.add(TermsComponent.COMPONENT_NAME);

    return names;
    List<String> l = new ArrayList<String>(SearchComponent.STANDARD_COMPONENTS.keySet());
@mkhludnev (Member):

As a result, RTG occurs as a default component, which causes a problem in an essential cloud test. 😭

@epugh (Contributor Author):

Hmm... maybe I should just back out this optimization...

@epugh (Contributor Author):

Check out the changes I made to back this out while still keeping the name STANDARD_COMPONENTS...


    Set<String> fields = Collections.singleton(schema.getUniqueKeyField().getName());
    for (DocIterator iter = dl.iterator(); iter.hasNext(); ) {
      sb.append(schema.printableUniqueKey(searcher.getDocFetcher().doc(iter.nextDoc(), fields)))
@mkhludnev (Member):

What about commas in id values? Isn't it safer to use a JSON array as the convention for this field?

@epugh (Contributor Author):

Hmm... maybe? Is that ever done? Most folks are going to take this data and load it into a pandas DataFrame or something like that. Now, since the doc is already JSON, maybe it's okay to make this a JSON array as well, since you already have to parse some JSON to do anything with it...
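To make the reviewer's concern concrete (generic Java for illustration, not the actual patch): a comma-joined string is ambiguous as soon as an id itself contains a comma, while a JSON array round-trips cleanly:

```java
import java.util.Arrays;
import java.util.List;

public class DocIdSerialization {
    // Comma-joined: the ids ("doc,1", "doc2") and ("doc", "1", "doc2")
    // serialize to the same string, so the reader cannot recover them
    static String joinWithCommas(List<String> ids) {
        return String.join(",", ids);
    }

    // JSON array with minimal escaping; a real implementation would use a JSON library
    static String toJsonArray(List<String> ids) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < ids.size(); i++) {
            if (i > 0) sb.append(",");
            sb.append('"')
              .append(ids.get(i).replace("\\", "\\\\").replace("\"", "\\\""))
              .append('"');
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        List<String> twoIds = Arrays.asList("doc,1", "doc2");
        System.out.println(joinWithCommas(twoIds)); // doc,1,doc2 -> two ids or three?
        System.out.println(toJsonArray(twoIds));    // ["doc,1","doc2"] -> unambiguous
    }
}
```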

epugh and others added 9 commits December 9, 2024 13:45

  • Neither UBIQuery nor UBIQueryStream will ever be used outside of the UBIComponent. Thought about an o.a.s.handler.component.ubi package as well, but this seems more specific...
  • This doesn't cover converting doc_ids into a JSON array of any type...
  • Except that we had to change to support TEN standard components by not using Map.of. Also, I couldn't stand the lowercase-plus-underscore "standard_components" object name.
  • I suspect lots of places need fixing, like pulling the field names out into a UBIParams.java file. May want to namespace the UBI query params under "ubi."... What about in JSON query?
5 participants