Kilimandjaro is a program that maps arbitrary labels to medical terminologies.
Currently two sources are used:
- the French SNOMED CT
- the CCAM terminology
In order to provide a useful mapping, we compare text embeddings.
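As an illustration of what comparing embeddings means, here is a minimal sketch that ranks candidate terms by cosine similarity to a query vector. The vectors, the term names and the choice of cosine similarity are assumptions made for the example; the metric actually configured in this project may differ.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidates):
    """Sort (term, score) pairs from most to least similar to the query."""
    scored = [(term, cosine_similarity(query_vec, vec)) for term, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "embeddings" (hypothetical values, real ones have hundreds of dimensions).
terminology = {
    "term A": [0.9, 0.1, 0.0],
    "term B": [0.0, 1.0, 0.0],
}
query = [1.0, 0.0, 0.0]

ranking = rank_candidates(query, terminology)
best_term, best_score = ranking[0]
```

The term whose vector points in the direction closest to the query's comes first in `ranking`.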
The main way to use Kilimandjaro is to:
- produce and store embeddings
- query through the UI.
Before anything:
- install rye (several methods here, there's a Homebrew formula)
- run
rye sync
at the root of this repo directory
- copy the
config-skeleton.toml
file to have a dedicated
config.toml
in which you provide actual values
rye run indexer config
will display the configuration sections (more on how to complete them below)
To produce the embeddings - here for the CCAM data source:
rye run indexer add ccam
It fetches the source data, then produces and stores the embeddings in a local ChromaDB instance.
For this source, the whole process takes several minutes.
Then launch the web UI:
rye run webui
The three main pieces of this application are:
- the vector database, using ChromaDB, to produce and store embeddings
- the indexer program, which fetches source data and pushes it to ChromaDB
- the web UI, which lets people query the indexed terminologies
graph
DB[(vector DB)]
INDEXER([Indexer])
WEBUI(Web UI)
SOURCES(Sources)
INDEXER -->|fetches| SOURCES
INDEXER -->|indexes| DB
DB -->|produces embeddings| DB
WEBUI -->|queries| DB
The indexer is a command-line program. This:
rye run indexer --help
will display available commands.
It currently fetches data from a triple store.
To be able to fetch data, you must provide a triple store endpoint in the corresponding configuration section:
[kilimandjaro.sources]
triple-store-url = "<ENDPOINT URL>"
- when parsing the JSON payload externally with
rye run indexer add ccam | yq
some errors appear, for example for this procedure (acte):
{'code':'MBFA001', 'label': 'Résection "en bloc" d\'une extrémité et/ou de la diaphyse de l\'humérus'}
- would this be a better encoding?
"label":"Résection \"en bloc\" d\\'une extrémité et/ou de la diaphyse de l\\'humérus"
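One possible explanation, offered as an assumption rather than a diagnosis of the indexer: the payload shown above looks like the Python `repr` of a dict (single-quoted strings, `\'` escapes), which is not valid JSON. The sketch below reproduces that situation and shows that `json.dumps` emits a form any JSON parser accepts, escaping only the double quotes and leaving the apostrophes as they are.

```python
import json

# The procedure from the example above, as a plain Python dict.
acte = {
    "code": "MBFA001",
    "label": 'Résection "en bloc" d\'une extrémité et/ou de la diaphyse de l\'humérus',
}

# Printing the dict directly yields Python syntax, which JSON parsers reject.
python_repr = str(acte)
try:
    json.loads(python_repr)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False

# Serializing with the json module yields valid JSON; ensure_ascii=False keeps accents readable.
encoded = json.dumps(acte, ensure_ascii=False)
decoded = json.loads(encoded)

print(encoded)
```

In valid JSON, apostrophes inside a double-quoted string need no escaping, so no backslash is required before them.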