feat(api): add ES score #197

revolunet · 2022-09-06T07:06:30Z

fix #196

github-actions · 2022-09-06T07:19:22Z

🎉 Deployment for commit 4a09f90 :

Ingresses

Docker images

📦 docker pull ghcr.io/socialgouv/recherche-entreprises/api:sha-4a09f90da32149700a0c803674eb12e4fd8d447f
📦 docker pull ghcr.io/socialgouv/recherche-entreprises/front:sha-4a09f90da32149700a0c803674eb12e4fd8d447f

Debug

yohanboniface · 2022-09-14T13:49:34Z

Cool, thanks!

Do you think it would be possible to get an order of magnitude for the score? Ideally it would fall between 0 and 1 (to tell whether a score is good in absolute terms), or else have the maxScore returned alongside it?

revolunet · 2022-09-14T22:00:56Z

Modifying the score is actually not trivial: it is computed from the query and is not normalized to [0,1] :/

The maxScore is the score of the first item in the list, if I'm not mistaken.
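Since the top hit carries the maximum score, one cheap way to get a value in [0, 1] is to divide every score by that maximum. The sketch below is only an illustration of that idea, not what this PR implements: the `add_relative_scores` helper, the hit dictionaries, and the second score are hypothetical, and only the `_score` field name follows Elasticsearch's response format.

```python
from typing import Any


def add_relative_scores(hits: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of the hits with a `relative_score` in [0, 1].

    The score is relative to the best hit of the same query, so it says
    nothing about how good the best hit is in absolute terms.
    """
    if not hits:
        return []
    # Sorted results put the maximum first; max() also covers unsorted input.
    max_score = max(hit["_score"] for hit in hits)
    if max_score <= 0:
        return [{**hit, "relative_score": 0.0} for hit in hits]
    return [{**hit, "relative_score": hit["_score"] / max_score} for hit in hits]


# Hypothetical hits: the first score is taken from the explain output,
# the second is made up for the example.
hits = [
    {"_id": "a", "_score": 21.91652},
    {"_id": "b", "_score": 10.95826},
]
ranked = add_relative_scores(hits)
```

This only ranks hits against each other within one query: the best hit always gets 1.0, even when it is a poor match, so it does not answer the "good in absolute terms" question on its own.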

An example "explain" output for the score calculation:

{
  "value": 21.91652,
  "description": "sum of:",
  "details": [
    {
      "value": 17.90895,
      "description": "sum of:",
      "details": [
        {
          "value": 9.977191,
          "description": "weight(namingMain:michelin in 9444530) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 9.977191,
              "description": "score(freq=1.0), computed as boost * idf * tf from:",
              "details": [
                {
                  "value": 2.2,
                  "description": "boost",
                  "details": []
                },
                {
                  "value": 9.977191,
                  "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details": [
                    {
                      "value": 1515,
                      "description": "n, number of documents containing term",
                      "details": []
                    },
                    {
                      "value": 32628324,
                      "description": "N, total number of documents with field",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 0.45454544,
                  "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "freq, occurrences of term within document",
                      "details": []
                    },
                    {
                      "value": 1.2,
                      "description": "k1, term saturation parameter",
                      "details": []
                    },
                    {
                      "value": 0,
                      "description": "b, length normalization parameter",
                      "details": []
                    },
                    {
                      "value": 4,
                      "description": "dl, length of field",
                      "details": []
                    },
                    {
                      "value": 1.7728646,
                      "description": "avgdl, average length of field",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        },
        {
          "value": 7.931761,
          "description": "weight(naming:michelin in 9444530) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 7.931761,
              "description": "score(freq=1.0), computed as boost * idf * tf from:",
              "details": [
                {
                  "value": 2.2,
                  "description": "boost",
                  "details": []
                },
                {
                  "value": 7.9317613,
                  "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details": [
                    {
                      "value": 11722,
                      "description": "n, number of documents containing term",
                      "details": []
                    },
                    {
                      "value": 32639261,
                      "description": "N, total number of documents with field",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 0.45454544,
                  "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "freq, occurrences of term within document",
                      "details": []
                    },
                    {
                      "value": 1.2,
                      "description": "k1, term saturation parameter",
                      "details": []
                    },
                    {
                      "value": 0,
                      "description": "b, length normalization parameter",
                      "details": []
                    },
                    {
                      "value": 4,
                      "description": "dl, length of field",
                      "details": []
                    },
                    {
                      "value": 2.6909916,
                      "description": "avgdl, average length of field",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "value": 3.945307,
      "description": "Saturation function on the _feature field for the etablissements feature, computed as w * S / (S + k) from:",
      "details": [
        {
          "value": 4,
          "description": "w, weight of this function",
          "details": []
        },
        {
          "value": 1.84375,
          "description": "k, pivot feature value that would give a score contribution equal to w/2",
          "details": []
        },
        {
          "value": 133,
          "description": "S, feature value",
          "details": []
        }
      ]
    },
    {
      "value": 0.062262263,
      "description": "Saturation function on the _feature field for the siretRank feature, computed as w * S / (S + k) from:",
      "details": [
        {
          "value": 0.1,
          "description": "w, weight of this function",
          "details": []
        },
        {
          "value": 51814485000000,
          "description": "k, pivot feature value that would give a score contribution equal to w/2",
          "details": []
        },
        {
          "value": 85487029000000,
          "description": "S, feature value",
          "details": []
        }
      ]
    },
    {
      "value": 0,
      "description": "match on required clause, product of:",
      "details": [
        {
          "value": 0,
          "description": "# clause",
          "details": []
        },
        {
          "value": 1,
          "description": "etatAdministratifUniteLegale:A",
          "details": []
        }
      ]
    },
    {
      "value": 0,
      "description": "match on required clause, product of:",
      "details": [
        {
          "value": 0,
          "description": "# clause",
          "details": []
        },
        {
          "value": 1,
          "description": "etatAdministratifEtablissement:A",
          "details": []
        }
      ]
    }
  ]
}
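To make the explain output easier to follow, here is a minimal sketch that recomputes the total from the leaf values above, using the formulas quoted in the `description` fields. The function names are chosen for this example only. Elasticsearch computes these values in single precision, so the result should match the reported 21.91652 only approximately.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """idf = log(1 + (N - n + 0.5) / (n + 0.5))"""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, k1: float, b: float, dl: float, avgdl: float) -> float:
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))"""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def saturation(w: float, S: float, k: float) -> float:
    """Saturation function on a feature field: w * S / (S + k)"""
    return w * S / (S + k)


# Text clauses: boost * idf * tf, with values copied from the explain output.
naming_main = 2.2 * bm25_idf(1515, 32628324) * bm25_tf(1, 1.2, 0, 4, 1.7728646)
naming = 2.2 * bm25_idf(11722, 32639261) * bm25_tf(1, 1.2, 0, 4, 2.6909916)

# Feature clauses: number of establishments and siretRank.
etablissements = saturation(4, 133, 1.84375)
siret_rank = saturation(0.1, 85487029000000, 51814485000000)

# The two required clauses on etatAdministratif* contribute 0 to the sum.
total = naming_main + naming + etablissements + siret_rank
print(round(total, 3))
```

The text clauses depend on the idf of each query term, which grows with the size of the index and the rarity of the term, so the total has no fixed upper bound. Only the saturation terms are capped, each by its weight `w`.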

yohanboniface · 2022-10-11T12:25:50Z

Wouldn't there be a way to add, on the fly, a comparison score between each result found and the search string? Something like a Levenshtein or n-gram comparison. I seem to remember doing that in my wild youth, but it's a distant memory. I can dig further if you want :)
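A minimal sketch of the Levenshtein variant of this idea, assuming the comparison happens in application code after the search returns. The `levenshtein` and `similarity` helpers are hypothetical and not part of this PR; the normalization (one minus the edit distance divided by the longer length) is one common choice among several.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(query: str, candidate: str) -> float:
    """Score in [0, 1]: 1.0 for identical strings, ignoring case and outer spaces."""
    q, c = query.strip().lower(), candidate.strip().lower()
    longest = max(len(q), len(c))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(q, c) / longest


# Illustrative comparison between a search string and a result name.
print(similarity("michelin", "Michelin"))
```

Unlike the Elasticsearch score, this value is comparable across queries, but it penalizes long official names that merely contain the search term, so an n-gram or token-based comparison may suit company names better.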

arnaudriegert

Thanks for this PR!

I read @yohanboniface's comments, and it would indeed be nice to have a normalized score eventually.

In the meantime, the ES score can already be useful 🙂

Julien Bouquillon added 2 commits September 6, 2022 09:06

feat(api): add ES score

0ec9ec0

fix: snaps

4a09f90

github-actions bot deployed to recherche-entreprises-issue-196 September 6, 2022 07:18

arnaudriegert approved these changes Nov 15, 2022

revolunet closed this Dec 11, 2023
