Comparative analysis of protein function text-based embeddings and its potential for prediction tasks
This thesis explores how information for protein functions can be exploited through embeddings so that the produced information can be used to improve protein function annotations. The underlying hypothesis here is that any pair of proteins with high sequence similarity will also share a similar biological function which would be reflected by the corresponding protein embeddings. The comparion and evaluation of this is done using two text-driven embedding approaches: Word2doc2Vec and Hybrid-Word2doc2Vec.