Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multilingual search #651

Merged
merged 5 commits into from
Nov 3, 2023
Merged

Multilingual search #651

merged 5 commits into from
Nov 3, 2023

Conversation

fgravin
Copy link
Member

@fgravin fgravin commented Oct 18, 2023

Introduces a new option for the metadata_language configuration property: current.
This is mostly useful for multilingual catalogs.

If metadata_language == current, then the Elasticsearch search is queried in the current language of the application (the UI language).

The other rules remains

  • metadata_language is undefined: we use the .* wildcard
  • metadata_language is set to one language: we use the that language for the search eg. .langeng

Here what the search looks like with the UI in english and the property set to current

{
  "query_string": {
    "query": "carte établie",
    "default_operator": "AND",
    "fields": [
      "resourceTitleObject.langeng^5",
      "tag.langeng^4",
      "resourceAbstractObject.langeng^3",
      "lineageObject.langeng^2",
      "any.langeng",
      "uuid"
    ]
  }
} 

@fgravin fgravin added this to the 2.1.0 milestone Oct 18, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Oct 18, 2023

Affected libs: api-repository, feature-catalog, feature-record, feature-router, feature-search, feature-map, feature-dataviz, feature-auth,
Affected apps: metadata-editor, datahub, demo, webcomponents, search, map-viewer, datafeeder,

  • 🚀 Build and deploy storybook and demo on GitHub Pages
  • 📦 Build and push affected docker images

if set to current, the search is requested with the current language of the application (UI language)
if current is set and UI lang is 'eng', then the requested fields are .langeng ones
@benoitregamey
Copy link

Hi, here's my 2 cents

I'm wondering if it would be useful to implement kind of a fallback mechanism to any.* in case the metadata_language is set to current.

I can imagine the usecase where a user with e.g. an english configured browser lands on the catalogue and try to search something. If the catalogue contains very little english records, he will find almost nothing even if the words used for the search are technical (often the same in all languages) or proper nouns.

Can't we imagine something like

{
  "query_string": {
    "query": "carte établie",
    "default_operator": "AND",
    "fields": [
      "resourceTitleObject.langeng^6",
      "tag.langeng^5",
      "resourceAbstractObject.langeng^4",
      "lineageObject.langeng^3",
      "any.langeng^2",
      "any.*",
      "uuid"
    ]
  }
} 

Copy link
Collaborator

@jahow jahow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @benoitregamey here, also does the request fail if for instance the field .langesp is used, but there's absolutely no spanish translations in the catalog?

conf/default.toml Outdated Show resolved Hide resolved
conf/default.toml Outdated Show resolved Hide resolved
@fgravin
Copy link
Member Author

fgravin commented Oct 19, 2023

I agree with @benoitregamey here, also does the request fail if for instance the field .langesp is used, but there's absolutely no spanish translations in the catalog?

Yes, we actually discussed that internally, we should have a fallback on * while boosting on the current language.
I think that it's very hard to have something working well though, there are so many possible combinations of things about the languages... 🙃

I'll adapt the query

@fgravin
Copy link
Member Author

fgravin commented Oct 19, 2023

I've adapted the query if current is used to produce the following payload:

[
    'resourceTitleObject.langeng^5',
    'tag.langeng^4',
    'resourceAbstractObject.langeng^3',
    'lineageObject.langeng^2',
    'any.langeng',
    'uuid',
    'resourceTitleObject.*',
    'tag.*',
    'resourceAbstractObject.*',
    'lineageObject.*',
    'any.*',
  ]

@jahow suggested to fallback all fields, I am wondering if it's usefull though as every field falls into any anyway.

@coveralls
Copy link

coveralls commented Oct 19, 2023

Coverage Status

coverage: 86.121% (-13.9%) from 100.0%
when pulling 020fa07 on multilang-search
into 96f8590 on main.

if the lang is 'current', then multilingual fields priority are increased by 10, with a fallback on the * wildcard
@fgravin
Copy link
Member Author

fgravin commented Oct 31, 2023

Ok, I've made another adaption where the priority are kept in the fallback as well.
The original priorities are increased by 10
eg:

[
  'resourceTitleObject.langeng^15',
  'resourceTitleObject.*^5',
  'tag.langeng^14',
  'tag.*^4',
  'resourceAbstractObject.langeng^13',
  'resourceAbstractObject.*^3',
  'lineageObject.langeng^12',
  'lineageObject.*^2',
  'any.langeng^11',
  'any.*',
  'uuid',
]

@fgravin fgravin requested a review from jahow October 31, 2023 19:29
Copy link
Collaborator

@jahow jahow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've tested it with a multilingual config and it seems to be producing good results. I made a couple of comments about readability, feel free to merge when you think this is ready!

Comment on lines 205 to 208
? this.isCurrentSearchLang()
? `lang${this.lang3}`
: `lang${this.metadataLang}`
: `*`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nested ternary operators are hard to read IMO


describe('ElasticsearchService', () => {
let service: ElasticsearchService
let searchFilters

beforeEach(() => {
service = new ElasticsearchService('fre')
service = new ElasticsearchService(langServiceMock, 'fre')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a testBed here would make the test more readable I think, as it would be clear what are the injected dependencies (currently the test manipulates the service properties directly inc. private ones)

@fgravin fgravin merged commit 7661ac3 into main Nov 3, 2023
5 checks passed
@fgravin fgravin deleted the multilang-search branch November 3, 2023 12:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants