In recent years we have witnessed a major breakthrough in the field of natural language processing (NLP). New, more complex models were trained to replicate and “understand” human language. One of the major advances in this area was the invention of transformers, a network architecture whose attention mechanism lets every part of the input take every other part into account. Even though biological sequences originate very differently from human language, we can still use similar strategies to work with them. After all, they are both just arrays of letters, aren’t they?
Researchers from Facebook took the same approach and trained a high-capacity transformer on evolutionary data. The resulting model, ESM, is able to detect remote homology and even infer structure from sequence. The model is also trained to fill artificial gaps in a sequence by choosing the right amino acid. Still, even the best of us make mistakes, and this applies to machine learning too. Most algorithms have their limitations and flaws. When a model fails to generate coherent text in a human language, we are quick to point out the nonsense. But no one can speak the protein language, so spotting mistakes with the naked eye becomes impossible. Our project aims to uncover such limitations in ESM.
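To make the gap-filling task more concrete, here is a minimal sketch in plain Python of how one might hide residues in a sequence and score how many a predictor recovers. It does not call ESM itself. The `<mask>` token string, the 15% masking fraction, the example sequence, and the helper names `mask_sequence`, `recovery_rate`, and `random_guess` are all assumptions made for illustration; a real experiment would replace `random_guess` with a call to the actual model.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "<mask>"  # placeholder token; the exact string is an assumption


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues; return the tokens and the hidden answers."""
    rng = random.Random(seed)
    # Mask at least one position; 15% is an assumed, illustrative fraction.
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    tokens = list(sequence)
    answers = {}
    for pos in positions:
        answers[pos] = tokens[pos]  # remember the true residue
        tokens[pos] = MASK          # replace it with the gap token
    return tokens, answers


def recovery_rate(predict, tokens, answers):
    """Fraction of masked positions where `predict` returns the true residue."""
    hits = sum(predict(tokens, pos) == true for pos, true in answers.items())
    return hits / len(answers)


def random_guess(tokens, pos, rng=random.Random(1)):
    """Stand-in for a real model: picks an amino acid uniformly at random."""
    return rng.choice(AMINO_ACIDS)


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
    tokens, answers = mask_sequence(seq)
    print("".join("_" if t == MASK else t for t in tokens))
    print("recovery:", recovery_rate(random_guess, tokens, answers))
```

Passing the predictor in as a function keeps the scoring code independent of any particular model, so the same loop can compare a random baseline with a trained network on identical masked positions.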