Filter params for vector search (#1889)
* adding serializer fields

* adding filters

* update spec

* updating tests and accounting for wrapped boolean arrays

* adding test for qdrant conditions

* looking up resources by readable id

* typo

* removing id field from request serializer and replacing with readable_id

* clarifying docstring

* changing order of recreating collection

* adding query counts to results

* some consolidation

* fixing test

* removing unused filters
shanbady authored Dec 12, 2024
1 parent 25f4c24 commit 3051448
Showing 10 changed files with 1,097 additions and 63 deletions.
413 changes: 413 additions & 0 deletions frontends/api/src/generated/v0/api.ts

Large diffs are not rendered by default.

14 changes: 8 additions & 6 deletions learning_resources_search/api.py
@@ -952,12 +952,14 @@ def get_similar_resources_qdrant(value_doc: dict, num_resources: int):
list of learning resources
"""
hits = _qdrant_similar_results(value_doc, num_resources)
-    return LearningResource.objects.for_search_serialization().filter(
-        id__in=[
-            resource["id"]
-            for resource in hits
-            if resource["id"] != value_doc["id"] and resource["published"]
-        ]
+    return (
+        LearningResource.objects.for_search_serialization()
+        .filter(
+            readable_id__in=[
+                resource["readable_id"] for resource in hits if resource["published"]
+            ]
+        )
+        .exclude(id=value_doc["id"])
     )
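
The hunk above switches the lookup key from the numeric `id` to `readable_id` and moves the self-exclusion into a separate `.exclude()` step. A minimal sketch of that selection logic on plain dictionaries, with no Django or Qdrant involved, might look like the following. The helper name `select_similar` and the sample records are hypothetical and exist only for illustration.

```python
def select_similar(hits, resources, value_doc):
    """Mimic the filter/exclude chain using plain dicts.

    hits: payloads returned by the vector store (hypothetical sample data)
    resources: rows standing in for LearningResource records
    value_doc: the resource whose neighbours are being looked up
    """
    # Step 1: collect the readable_ids of published hits only.
    wanted = {hit["readable_id"] for hit in hits if hit["published"]}
    # Step 2: keep matching rows (the .filter(readable_id__in=...) step).
    matched = [row for row in resources if row["readable_id"] in wanted]
    # Step 3: drop the source document itself (the .exclude(id=...) step).
    return [row for row in matched if row["id"] != value_doc["id"]]


hits = [
    {"readable_id": "course-a", "published": True},
    {"readable_id": "course-b", "published": False},
    {"readable_id": "course-c", "published": True},
]
resources = [
    {"id": 1, "readable_id": "course-a"},
    {"id": 2, "readable_id": "course-b"},
    {"id": 3, "readable_id": "course-c"},
]

similar = select_similar(hits, resources, value_doc={"id": 1})
print([row["readable_id"] for row in similar])  # ['course-c']
```

In this sketch `course-a` is dropped because it is the source document and `course-b` because its hit is unpublished, which leaves only `course-c`.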


325 changes: 325 additions & 0 deletions openapi/specs/v0.yaml
@@ -303,22 +303,347 @@ paths:
description: Vector Search for learning resources
summary: Vector Search
parameters:
- in: query
name: certification
schema:
type: boolean
nullable: true
description: True if the learning resource offers a certificate
- in: query
name: certification_type
schema:
type: array
items:
enum:
- micromasters
- professional
- completion
- none
type: string
description: |-
* `micromasters` - MicroMasters Credential
* `professional` - Professional Certificate
* `completion` - Certificate of Completion
* `none` - No Certificate
description: "The type of certificate \n\n* `micromasters` - MicroMasters\
\ Credential\n* `professional` - Professional Certificate\n* `completion`\
\ - Certificate of Completion\n* `none` - No Certificate"
- in: query
name: course_feature
schema:
type: array
items:
type: string
minLength: 1
description: The course feature. Possible options are at api/v1/course_features/
- in: query
name: delivery
schema:
type: array
items:
enum:
- online
- hybrid
- in_person
- offline
type: string
description: |-
* `online` - Online
* `hybrid` - Hybrid
* `in_person` - In person
* `offline` - Offline
description: "The delivery options in which the learning resource is offered\
\ \n\n* `online` - Online\n* `hybrid` - Hybrid\n* `in_person`\
\ - In person\n* `offline` - Offline"
- in: query
name: department
schema:
type: array
items:
enum:
- '1'
- '2'
- '3'
- '4'
- '5'
- '6'
- '7'
- '8'
- '9'
- '10'
- '11'
- '12'
- '14'
- '15'
- '16'
- '17'
- '18'
- '20'
- 21A
- 21G
- 21H
- 21L
- 21M
- '22'
- '24'
- CC
- CMS-W
- EC
- ES
- ESD
- HST
- IDS
- MAS
- PE
- SP
- STS
- WGS
type: string
description: |-
* `1` - Civil and Environmental Engineering
* `2` - Mechanical Engineering
* `3` - Materials Science and Engineering
* `4` - Architecture
* `5` - Chemistry
* `6` - Electrical Engineering and Computer Science
* `7` - Biology
* `8` - Physics
* `9` - Brain and Cognitive Sciences
* `10` - Chemical Engineering
* `11` - Urban Studies and Planning
* `12` - Earth, Atmospheric, and Planetary Sciences
* `14` - Economics
* `15` - Management
* `16` - Aeronautics and Astronautics
* `17` - Political Science
* `18` - Mathematics
* `20` - Biological Engineering
* `21A` - Anthropology
* `21G` - Global Languages
* `21H` - History
* `21L` - Literature
* `21M` - Music and Theater Arts
* `22` - Nuclear Science and Engineering
* `24` - Linguistics and Philosophy
* `CC` - Concourse
* `CMS-W` - Comparative Media Studies/Writing
* `EC` - Edgerton Center
* `ES` - Experimental Study Group
* `ESD` - Engineering Systems Division
* `HST` - Medical Engineering and Science
* `IDS` - Data, Systems, and Society
* `MAS` - Media Arts and Sciences
* `PE` - Athletics, Physical Education and Recreation
* `SP` - Special Programs
* `STS` - Science, Technology, and Society
* `WGS` - Women's and Gender Studies
description: "The department that offers the learning resource \
\ \n\n* `1` - Civil and Environmental Engineering\n* `2` - Mechanical Engineering\n\
* `3` - Materials Science and Engineering\n* `4` - Architecture\n* `5` -\
\ Chemistry\n* `6` - Electrical Engineering and Computer Science\n* `7`\
\ - Biology\n* `8` - Physics\n* `9` - Brain and Cognitive Sciences\n* `10`\
\ - Chemical Engineering\n* `11` - Urban Studies and Planning\n* `12` -\
\ Earth, Atmospheric, and Planetary Sciences\n* `14` - Economics\n* `15`\
\ - Management\n* `16` - Aeronautics and Astronautics\n* `17` - Political\
\ Science\n* `18` - Mathematics\n* `20` - Biological Engineering\n* `21A`\
\ - Anthropology\n* `21G` - Global Languages\n* `21H` - History\n* `21L`\
\ - Literature\n* `21M` - Music and Theater Arts\n* `22` - Nuclear Science\
\ and Engineering\n* `24` - Linguistics and Philosophy\n* `CC` - Concourse\n\
* `CMS-W` - Comparative Media Studies/Writing\n* `EC` - Edgerton Center\n\
* `ES` - Experimental Study Group\n* `ESD` - Engineering Systems Division\n\
* `HST` - Medical Engineering and Science\n* `IDS` - Data, Systems, and\
\ Society\n* `MAS` - Media Arts and Sciences\n* `PE` - Athletics, Physical\
\ Education and Recreation\n* `SP` - Special Programs\n* `STS` - Science,\
\ Technology, and Society\n* `WGS` - Women's and Gender Studies"
- in: query
name: free
schema:
type: boolean
nullable: true
- in: query
name: level
schema:
type: array
items:
enum:
- undergraduate
- graduate
- high_school
- noncredit
- advanced
- intermediate
- introductory
type: string
description: |-
* `undergraduate` - Undergraduate
* `graduate` - Graduate
* `high_school` - High School
* `noncredit` - Non-Credit
* `advanced` - Advanced
* `intermediate` - Intermediate
* `introductory` - Introductory
- in: query
name: limit
schema:
type: integer
description: Number of results to return per page
- in: query
name: ocw_topic
schema:
type: array
items:
type: string
minLength: 1
description: The ocw topic name.
- in: query
name: offered_by
schema:
type: array
items:
enum:
- mitx
- ocw
- bootcamps
- xpro
- mitpe
- see
type: string
description: |-
* `mitx` - MITx
* `ocw` - MIT OpenCourseWare
* `bootcamps` - Bootcamps
* `xpro` - MIT xPRO
* `mitpe` - MIT Professional Education
* `see` - MIT Sloan Executive Education
description: "The organization that offers the learning resource \
\ \n\n* `mitx` - MITx\n* `ocw` - MIT OpenCourseWare\n* `bootcamps` -\
\ Bootcamps\n* `xpro` - MIT xPRO\n* `mitpe` - MIT Professional Education\n\
* `see` - MIT Sloan Executive Education"
- in: query
name: offset
schema:
type: integer
description: The initial index from which to return the results
- in: query
name: platform
schema:
type: array
items:
enum:
- edx
- ocw
- oll
- mitxonline
- bootcamps
- xpro
- csail
- mitpe
- see
- scc
- ctl
- whu
- susskind
- globalalumni
- simplilearn
- emeritus
- podcast
- youtube
type: string
description: |-
* `edx` - edX
* `ocw` - MIT OpenCourseWare
* `oll` - Open Learning Library
* `mitxonline` - MITx Online
* `bootcamps` - Bootcamps
* `xpro` - MIT xPRO
* `csail` - CSAIL
* `mitpe` - MIT Professional Education
* `see` - MIT Sloan Executive Education
* `scc` - Schwarzman College of Computing
* `ctl` - Center for Transportation & Logistics
* `whu` - WHU
* `susskind` - Susskind
* `globalalumni` - Global Alumni
* `simplilearn` - Simplilearn
* `emeritus` - Emeritus
* `podcast` - Podcast
* `youtube` - YouTube
description: "The platform on which the learning resource is offered \
\ \n\n* `edx` - edX\n* `ocw` - MIT OpenCourseWare\n* `oll` - Open\
\ Learning Library\n* `mitxonline` - MITx Online\n* `bootcamps` - Bootcamps\n\
* `xpro` - MIT xPRO\n* `csail` - CSAIL\n* `mitpe` - MIT Professional Education\n\
* `see` - MIT Sloan Executive Education\n* `scc` - Schwarzman College of\
\ Computing\n* `ctl` - Center for Transportation & Logistics\n* `whu` -\
\ WHU\n* `susskind` - Susskind\n* `globalalumni` - Global Alumni\n* `simplilearn`\
\ - Simplilearn\n* `emeritus` - Emeritus\n* `podcast` - Podcast\n* `youtube`\
\ - YouTube"
- in: query
name: professional
schema:
type: boolean
nullable: true
- in: query
name: q
schema:
type: string
minLength: 1
description: The search text
- in: query
name: readable_id
schema:
type: string
minLength: 1
description: The readable id of the resource
- in: query
name: resource_category
schema:
type: array
items:
enum:
- course
- program
- learning_material
type: string
description: |-
* `course` - Course
* `program` - Program
* `learning_material` - Learning Material
description: "The category of learning resource \n\n* `course`\
\ - Course\n* `program` - Program\n* `learning_material` - Learning Material"
- in: query
name: resource_type
schema:
type: array
items:
enum:
- course
- program
- learning_path
- podcast
- podcast_episode
- video
- video_playlist
type: string
description: |-
* `course` - course
* `program` - program
* `learning_path` - learning path
* `podcast` - podcast
* `podcast_episode` - podcast episode
* `video` - video
* `video_playlist` - video playlist
description: "The type of learning resource \n\n* `course` - course\n\
* `program` - program\n* `learning_path` - learning path\n* `podcast` -\
\ podcast\n* `podcast_episode` - podcast episode\n* `video` - video\n* `video_playlist`\
\ - video playlist"
- in: query
name: topic
schema:
type: array
items:
type: string
minLength: 1
description: The topic name. To see a list of options go to api/v1/topics/
tags:
- learning_resources_vector_search
responses:
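
The parameters added to the spec above mix scalars (`q`, `readable_id`, `limit`, `offset`), nullable booleans (`free`, `certification`, `professional`) and arrays of enum strings (`resource_type`, `department`, and so on). As a rough sketch of how a client might encode them, the snippet below builds a query string with the standard library. The base URL is a placeholder, since the endpoint path is not visible in this hunk, and sending booleans as the string `"true"` is an assumption. Array values are sent as repeated keys, which is the OpenAPI default for query arrays when no `style` or `explode` override is given.

```python
from urllib.parse import parse_qs, urlencode

# Placeholder URL: the real endpoint path is not shown in this diff.
BASE_URL = "https://example.com/vector-search/"

params = {
    "q": "linear algebra",                 # free-text search
    "resource_type": ["course", "video"],  # array of enum values
    "department": ["18"],                  # '18' is Mathematics in the spec
    "free": "true",                        # assumed boolean encoding
    "limit": 5,                            # page size
    "offset": 0,                           # index of the first result
}

# doseq=True expands each list into repeated key=value pairs.
query = urlencode(params, doseq=True)
url = f"{BASE_URL}?{query}"
print(url)

# Round-trip check: decoding gives the array values back as lists.
decoded = parse_qs(query)
print(decoded["resource_type"])
```

Whether the server also accepts comma-separated values for the array parameters is not stated in the spec excerpt, so repeated keys are the safer choice here.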
2 changes: 2 additions & 0 deletions vector_search/conftest.py
@@ -1,5 +1,6 @@
import numpy as np
import pytest
from qdrant_client.http.models.models import CountResult

from vector_search.encoders.base import BaseEncoder

@@ -33,6 +34,7 @@ def _use_test_qdrant_settings(settings, mocker):
[],
None,
]
mock_qdrant.count.return_value = CountResult(count=10)
mocker.patch(
"vector_search.utils.qdrant_client",
return_value=mock_qdrant,
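
The fixture change above gives the mocked client a canned `count` result so that code reporting total hits can run under test. A self-contained sketch of the same pattern follows, using only the standard library. The local `CountResult` dataclass is a stand-in for the class imported from `qdrant_client`, and `total_hits` is a hypothetical helper rather than a function from this repository.

```python
from dataclasses import dataclass
from unittest.mock import MagicMock


@dataclass
class CountResult:
    """Local stand-in for qdrant_client's CountResult (assumption)."""

    count: int


def total_hits(client, collection_name):
    """Hypothetical helper: return how many points a collection holds."""
    return client.count(collection_name=collection_name).count


# Configure the mock the same way the fixture does.
mock_qdrant = MagicMock()
mock_qdrant.count.return_value = CountResult(count=10)

print(total_hits(mock_qdrant, "resources"))  # 10
# The mock also records how it was called, which a test can verify.
mock_qdrant.count.assert_called_once_with(collection_name="resources")
```

Because `MagicMock` returns the configured value for any arguments, a test that depends on a filter being applied should assert on the recorded call arguments, not only on the returned count.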
5 changes: 2 additions & 3 deletions vector_search/management/commands/generate_embeddings.py
@@ -64,12 +64,11 @@ def handle(self, *args, **options): # noqa: ARG002
for object_type in sorted(LEARNING_RESOURCE_TYPES):
self.stdout.write(f" --{object_type}s")
return

if options["recreate_collections"]:
create_qdrand_collections(force_recreate=True)
task = start_embed_resources.delay(
indexes_to_update, skip_content_files=options["skip_content_files"]
)
if options["recreate_collections"]:
create_qdrand_collections(force_recreate=True)
self.stdout.write(
f"Started celery task {task} to index content for the following"
f" Types to embed: {indexes_to_update}"