feat(api): add ES score #197

Closed
wants to merge 2 commits into from

revolunet
Member

fix #196

@github-actions

github-actions bot commented Sep 6, 2022

@yohanboniface

Cool, thanks!

Do you think it would be possible to get an order of magnitude for the score? Ideally it would fall between 0 and 1 (so we can tell whether a score is good in absolute terms), or else could we have the maxScore next to it?

@revolunet
Member Author

revolunet commented Sep 14, 2022

Changing the score is not trivial, actually: it is computed from the query and is not normalized to [0,1] :/

The maxScore is the score of the first item in the list, if I'm not mistaken.
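
To illustrate the maxScore idea, here is a minimal sketch that divides each hit's score by the best score in the same result list. The `score` and `relative_score` field names and the `add_relative_score` helper are hypothetical and are not taken from this PR.

```python
from typing import Any


def add_relative_score(hits: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of the hits with score / max_score attached.

    Assumes every hit carries a numeric "score" field (hypothetical name).
    """
    if not hits:
        return []
    # With results sorted by relevance this is the first hit's score,
    # but taking max() avoids relying on the ordering.
    max_score = max(hit["score"] for hit in hits)
    return [
        {**hit, "relative_score": hit["score"] / max_score if max_score > 0 else 0.0}
        for hit in hits
    ]


hits = [
    {"siren": "A", "score": 21.91652},
    {"siren": "B", "score": 10.95826},
]
ranked = add_relative_score(hits)
```

A relative score only says how a hit compares with the best hit for the same query. It does not say whether that best hit is good in absolute terms, so it only partly answers the question above.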

An example "explain" output showing how the score is computed:

{
  "value": 21.91652,
  "description": "sum of:",
  "details": [
    {
      "value": 17.90895,
      "description": "sum of:",
      "details": [
        {
          "value": 9.977191,
          "description": "weight(namingMain:michelin in 9444530) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 9.977191,
              "description": "score(freq=1.0), computed as boost * idf * tf from:",
              "details": [
                {
                  "value": 2.2,
                  "description": "boost",
                  "details": []
                },
                {
                  "value": 9.977191,
                  "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details": [
                    {
                      "value": 1515,
                      "description": "n, number of documents containing term",
                      "details": []
                    },
                    {
                      "value": 32628324,
                      "description": "N, total number of documents with field",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 0.45454544,
                  "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "freq, occurrences of term within document",
                      "details": []
                    },
                    {
                      "value": 1.2,
                      "description": "k1, term saturation parameter",
                      "details": []
                    },
                    {
                      "value": 0,
                      "description": "b, length normalization parameter",
                      "details": []
                    },
                    {
                      "value": 4,
                      "description": "dl, length of field",
                      "details": []
                    },
                    {
                      "value": 1.7728646,
                      "description": "avgdl, average length of field",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        },
        {
          "value": 7.931761,
          "description": "weight(naming:michelin in 9444530) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 7.931761,
              "description": "score(freq=1.0), computed as boost * idf * tf from:",
              "details": [
                {
                  "value": 2.2,
                  "description": "boost",
                  "details": []
                },
                {
                  "value": 7.9317613,
                  "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details": [
                    {
                      "value": 11722,
                      "description": "n, number of documents containing term",
                      "details": []
                    },
                    {
                      "value": 32639261,
                      "description": "N, total number of documents with field",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 0.45454544,
                  "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "freq, occurrences of term within document",
                      "details": []
                    },
                    {
                      "value": 1.2,
                      "description": "k1, term saturation parameter",
                      "details": []
                    },
                    {
                      "value": 0,
                      "description": "b, length normalization parameter",
                      "details": []
                    },
                    {
                      "value": 4,
                      "description": "dl, length of field",
                      "details": []
                    },
                    {
                      "value": 2.6909916,
                      "description": "avgdl, average length of field",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "value": 3.945307,
      "description": "Saturation function on the _feature field for the etablissements feature, computed as w * S / (S + k) from:",
      "details": [
        {
          "value": 4,
          "description": "w, weight of this function",
          "details": []
        },
        {
          "value": 1.84375,
          "description": "k, pivot feature value that would give a score contribution equal to w/2",
          "details": []
        },
        {
          "value": 133,
          "description": "S, feature value",
          "details": []
        }
      ]
    },
    {
      "value": 0.062262263,
      "description": "Saturation function on the _feature field for the siretRank feature, computed as w * S / (S + k) from:",
      "details": [
        {
          "value": 0.1,
          "description": "w, weight of this function",
          "details": []
        },
        {
          "value": 51814485000000,
          "description": "k, pivot feature value that would give a score contribution equal to w/2",
          "details": []
        },
        {
          "value": 85487029000000,
          "description": "S, feature value",
          "details": []
        }
      ]
    },
    {
      "value": 0,
      "description": "match on required clause, product of:",
      "details": [
        {
          "value": 0,
          "description": "# clause",
          "details": []
        },
        {
          "value": 1,
          "description": "etatAdministratifUniteLegale:A",
          "details": []
        }
      ]
    },
    {
      "value": 0,
      "description": "match on required clause, product of:",
      "details": [
        {
          "value": 0,
          "description": "# clause",
          "details": []
        },
        {
          "value": 1,
          "description": "etatAdministratifEtablissement:A",
          "details": []
        }
      ]
    }
  ]
}
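
The formulas quoted in the explain output above can be checked by hand. The sketch below recomputes each contribution from the values shown; the function names are illustrative, and only the formulas and numbers come from the output.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """idf = log(1 + (N - n + 0.5) / (n + 0.5))"""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, k1: float, b: float, dl: float, avgdl: float) -> float:
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))"""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def saturation(w: float, S: float, k: float) -> float:
    """Feature contribution = w * S / (S + k)"""
    return w * S / (S + k)


# Term matches: boost * idf * tf
naming_main = 2.2 * bm25_idf(1515, 32628324) * bm25_tf(1, 1.2, 0, 4, 1.7728646)
naming = 2.2 * bm25_idf(11722, 32639261) * bm25_tf(1, 1.2, 0, 4, 2.6909916)

# Feature fields
etablissements = saturation(4, 133, 1.84375)
siret_rank = saturation(0.1, 85487029000000, 51814485000000)

# The two required clauses contribute 0, so they are left out of the sum.
total = naming_main + naming + etablissements + siret_rank
print(round(total, 3))
```

With b = 0 and freq = 1, tf is 1 / 2.2, which cancels the 2.2 boost, so each term score equals its idf. The idf depends on the index statistics (N and n), which is one reason the total has no fixed upper bound such as 1.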

@yohanboniface

Would there be a way to add, on the fly, a comparison score between each result found and the search string? Say, with a Levenshtein or n-gram comparison. I think I did that in my wild youth, but it is a distant memory. I can dig further if you like :)
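
As a rough sketch of the Levenshtein variant of this suggestion, the code below computes a similarity in [0, 1] between the search string and a result's name. The `similarity` helper and the casefold/strip normalization are assumptions made for illustration, not something this PR implements.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the row buffer as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                       # deletion
                    current[j - 1] + 1,                    # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


def similarity(query: str, candidate: str) -> float:
    """1.0 for identical strings (ignoring case), approaching 0.0 as they diverge."""
    q, c = query.casefold().strip(), candidate.casefold().strip()
    longest = max(len(q), len(c))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(q, c) / longest


exact = similarity("michelin", "MICHELIN")
partial = similarity("michelin", "michelin sa")
```

Unlike the ES score, this value does not depend on index statistics, so it is comparable across queries. It ignores the feature boosts visible in the explain output, so it would complement the ES score rather than replace it.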

@arnaudriegert arnaudriegert left a comment

Thanks for this PR!

I read @yohanboniface's comments, and it would indeed be nice to have a normalized score eventually.

In the meantime, the ES score can already be useful 🙂

@revolunet revolunet closed this Dec 11, 2023
Successfully merging this pull request may close these issues.

Score de confiance