About efficiency of the model #80
Hi @AOZMH,
Thanks for the follow-up, I'll provide my code snippet and the execution logs (time costs) shortly!
Using Biencoder+Crossencoder is much slower than Biencoder only.
Since the transformer is O(n^2), we can infer that T(crossencoder) : T(biencoder) = (n+n)^2 : 2*n^2 = 2 : 1.
Roughly, then, T(cross+bi) : T(bi) = 3 : 1, ignoring the other processing.
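Spelling out that rough estimate (assuming the mention and the candidate are both of length n and that self-attention dominates): the biencoder encodes the two sequences separately, while the crossencoder encodes their concatenation, so

```latex
% rough estimate: self-attention cost ~ (sequence length)^2
T_{\mathrm{bi}} \propto n^2 + n^2 = 2n^2, \qquad
T_{\mathrm{cross}} \propto (n+n)^2 = 4n^2
\;\Longrightarrow\;
\frac{T_{\mathrm{cross}}}{T_{\mathrm{bi}}} = 2, \qquad
\frac{T_{\mathrm{cross+bi}}}{T_{\mathrm{bi}}} = \frac{4n^2 + 2n^2}{2n^2} = 3 .
```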
To wrap up, I conclude that:
Thanks for all the help! I'll be happy to follow any updates.
My code snippet was the same as the one in the README, as shown below.
Also, I changed the config as below to add the faiss index.
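The faiss-related part of that change looked roughly like the following (a sketch, not my exact file: the index type string and the index file name are placeholders that depend on which prebuilt index you use):

```python
# Sketch of the config change; only the faiss-related fields are new.
# "flat" and the index file name are placeholders for whichever index you actually load.
config = {
    # ... biencoder/crossencoder model paths, entity_catalogue, entity_encoding as in the README ...
    "faiss_index": "flat",                               # type of faiss index to load
    "index_path": models_path + "faiss_flat_index.pkl",  # path to the prebuilt index file
    "output_path": "logs/",
}
```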
Hey, how did you manually change the biencoder to GPU? Could you share the snippets?
You can simply revert this commented code to restore the transition to GPU (# .to(device) => .to(device)) and manually put the corresponding model input tensors on the GPU; that should work. It may take a few days for me to clean up my (experimental) code, so maybe you can give it a try with the aforementioned ideas; as far as I can recall, it requires fewer than 20 lines of code changes. Anyway, if you still have any problems, please feel free to reply and I'll try to share my code snippet.
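Roughly, the kind of change I mean looks like the following (a sketch rather than my exact diff; `biencoder.model` and `score_candidate` stand in for wherever the forward pass happens in the BLINK code you touch):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Re-enable the commented-out ".to(device)" so the biencoder weights live on the GPU.
biencoder.model.to(device)

# 2. At every forward pass, move the input batches to the same device first.
context_input = context_input.to(device)
cand_input = cand_input.to(device)
with torch.no_grad():
    scores = biencoder.score_candidate(context_input, cand_input)
```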
I am facing the same memory issue you did. Could you elaborate on your third point, on how to lower memory usage?
I tried two ways as follows:
BTW, I really hope the developers of BLINK can look into this issue to solve the faiss index problem I mentioned before, thanks in advance! |
I wonder if this is also a feasible solution: splitting the candidate_encoding when you pass it to the GPU, then concatenating the split scores and continuing with the code? That way the memory passed to the GPU at each call is reduced without removing entities.
That should work properly, but the time cost would be considerable: the basic assumption is that the whole candidate_encoding cannot fit into GPU memory, so if you split it into A and B, you still cannot put both into GPU memory simultaneously. Thus, for each execution (instead of each model initialization), we need to first transfer A to the GPU and execute on A, then delete A from GPU memory, transfer B to the GPU, execute on B, and delete B from GPU memory. That is, such a splitting approach requires a transfer between main memory and GPU memory for each EXECUTION, which would be costly. However, it should be a good idea if you have multiple GPUs, e.g. putting A and B PERMANENTLY on GPU 0 and GPU 1, so that the per-execution transfer is no longer needed.
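To make that trade-off concrete, the per-call chunked scoring would look something like this (a sketch; `candidate_encoding` and the dot-product scoring stand in for the corresponding BLINK tensors):

```python
import torch

def score_in_chunks(context_emb, candidate_encoding, chunk_size=500_000, device="cuda"):
    """Score query embeddings against all candidates, moving one chunk of
    candidate_encoding to the GPU at a time (so there is a CPU->GPU copy
    for every chunk on every call)."""
    context_emb = context_emb.to(device)
    chunk_scores = []
    for start in range(0, candidate_encoding.size(0), chunk_size):
        chunk = candidate_encoding[start:start + chunk_size].to(device)  # per-call transfer
        chunk_scores.append(context_emb @ chunk.t())                     # (num_queries, chunk) scores
        del chunk                                                        # free GPU memory before the next chunk
    return torch.cat(chunk_scores, dim=1)
```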
I think a possible solution is to encode all the queries and all the candidates with the GPU and save them, then build a faiss index on the CPU to find the nearest entities. Faiss is much more efficient, with satisfactory results, but this takes considerable effort and means the pipeline has to be reconstructed. BTW, if you use one 32GB V100, the problem will not occur.
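As a rough sketch of that idea (array names and sizes are placeholders; in practice you would load the encodings saved from the GPU pass instead of generating random ones):

```python
import numpy as np
import faiss

# Placeholders: load the precomputed encodings here instead of random data.
num_entities, num_queries, dim = 100_000, 8, 1024
candidate_encoding = np.random.rand(num_entities, dim).astype("float32")
query_encoding = np.random.rand(num_queries, dim).astype("float32")

index = faiss.IndexFlatIP(dim)                    # exact inner-product search, CPU only
index.add(candidate_encoding)                     # add all candidate vectors
scores, ids = index.search(query_encoding, 10)    # top-10 nearest entities per query
```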
Hi @AOZMH, could you please share your snippets for using the GPU for the biencoder? I did that but the speed is still slow; maybe I did it wrongly...
Hey, same for me too! Changing it to GPU was actually slower than the CPU.
Hi @AOZMH, can you please share how you converted BLINK to work with FP16? I am getting errors.
I met the same problem. When I add the faiss index path, it becomes slower.
Hi,
You may want to make some changes to the codebase and add support for more sparse indexes. Currently, the BLINK codebase only supports flat indices. I am currently using a sparse index. For e.g., this is what my config looks like and how I load the models:

```python
config = {
"interactive": False,
"fast": False,
"top_k": 8,
"biencoder_model": models_path + "biencoder_wiki_large.bin",
"biencoder_config": models_path + "biencoder_wiki_large.json",
"crossencoder_model": models_path + "crossencoder_wiki_large.bin",
"crossencoder_config": models_path + "crossencoder_wiki_large.json",
"entity_catalogue": models_path + "entities_aliases_with_ids.jsonl",
"entity_encoding": models_path + "all_entities_aliases.t7",
"faiss_index": "OPQ32_768,IVF4096,PQ32x8",
"index_path": models_path + "index_opq32_768_ivf4096_pq32x8.faiss",
"output_path": "logs/", # logging directory
}
self.args = argparse.Namespace(**config)
logger.info("Loading BLINK model...")
self.models = main_dense.load_models(self.args, logger=logger)
```
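In case it helps others, building such a non-flat index offline with faiss looks roughly like this (a sketch; `entity_vectors` is a placeholder for however you load the saved entity encodings):

```python
import faiss
import numpy as np

# entity_vectors: the saved entity encodings as a (num_entities, dim) float32 matrix (placeholder).
candidate_encoding = np.ascontiguousarray(entity_vectors, dtype="float32")
dim = candidate_encoding.shape[1]

# Same factory string as in the config above; OPQ/IVF/PQ indexes need a training pass.
index = faiss.index_factory(dim, "OPQ32_768,IVF4096,PQ32x8", faiss.METRIC_INNER_PRODUCT)
index.train(candidate_encoding)
index.add(candidate_encoding)
faiss.write_index(index, "index_opq32_768_ivf4096_pq32x8.faiss")
```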
Hi,
Thanks for the great repo, I enjoy exploring it a lot!
However, when I tried to run the code from the "Use BLINK in your codebase" chapter of the README, I found the model relatively slow to run (in fast=False mode).
To be more specific, when I execute "main_dense.run", the first stage of processing proceeds relatively slowly (~2.5 seconds per item) while the later stage (printing "Evaluation") proceeds at ~5 items per second. Also, I tried adding indices as below.
However, the performance of the first stage became even worse (~20 seconds per item). I'm wondering if I'm setting something wrong (especially for the faiss index) that resulted in the low speed. Are there any corrections/methods to speed it up? Thanks for your help!
(I'll post the performance logs below if needed!)