seq-to-pheno

Code for longevity: put it in the longevity project.

Code for depmap: put it in the depmap project.

Fetching embeddings from sequences: get_embeddings.py

Datasets

Get

Get the first rows of the filtered ortholog dataset:

curl -X GET \
     "https://datasets-server.huggingface.co/first-rows?dataset=seq-to-pheno%2Ffiltered_orthologs&config=default&split=train"

Get the first 100 rows of the mapped ortholog dataset, sending a Hugging Face token from $HF_TOKEN:

curl -X GET \
     -H "Authorization: Bearer $HF_TOKEN" \
     "https://datasets-server.huggingface.co/rows?dataset=seq-to-pheno%2Fmapped_orthologs&config=default&split=train&offset=0&length=100"

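The same /rows request can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_rows_url` and `fetch_rows` are illustrative and not part of this repository. It assumes the response JSON carries a `rows` list whose entries hold each record under a `row` key.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://datasets-server.huggingface.co"


def build_rows_url(dataset, config="default", split="train", offset=0, length=100):
    """Build the /rows URL used in the curl example above."""
    query = urllib.parse.urlencode(
        {
            "dataset": dataset,  # the "/" is percent-encoded as %2F
            "config": config,
            "split": split,
            "offset": offset,
            "length": length,
        }
    )
    return f"{API_ROOT}/rows?{query}"


def fetch_rows(dataset, token=None, **kwargs):
    """Fetch one page of rows; the token is sent only if one is given."""
    request = urllib.request.Request(build_rows_url(dataset, **kwargs))
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Assumed response layout: each entry in "rows" wraps the record under "row".
    return [item["row"] for item in payload["rows"]]


# Example call (performs a network request):
# rows = fetch_rows("seq-to-pheno/mapped_orthologs", token=os.environ.get("HF_TOKEN"))
```

To page through the dataset, call `fetch_rows` repeatedly with increasing `offset` values.
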
Use

# Load the TCGA dataset with the datasets library
from datasets import load_dataset

ds = load_dataset("seq-to-pheno/TCGA-Cancer-Variant-and-Clinical-Data")

# Or load the same dataset through its Croissant metadata
from mlcroissant import Dataset

ds = Dataset(jsonld="https://huggingface.co/api/datasets/seq-to-pheno/TCGA-Cancer-Variant-and-Clinical-Data/croissant")
records = ds.records("default")

# Or read the protein sequence metadata table with pandas
# (hf:// paths need huggingface_hub installed)
import pandas as pd

df = pd.read_csv("hf://datasets/seq-to-pheno/TCGA-Cancer-Variant-and-Clinical-Data/protein_sequences_metadata.tsv", sep="\t")

# Load the mapped ortholog dataset
from datasets import load_dataset

mapped = load_dataset("seq-to-pheno/mapped_orthologs")

# Load the filtered ortholog dataset
from datasets import load_dataset

filtered = load_dataset("seq-to-pheno/filtered_orthologs")

Re-create the filtered ortholog dataset:

python ./scripts/filtered_dataset.py --folder /downloads --template_path /seq_to_pheno/hug/zoonomia_dataset_repo_template/README.md --token hf_xxx --max_length 1000 --max_orthologs 20 --publish
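
The internals of filtered_dataset.py are not shown here. As a rough sketch only, assuming that `--max_length` drops sequences longer than the given length and that `--max_orthologs` keeps at most that many orthologs per gene, the filtering could look like this. The function `filter_orthologs` and the `(gene, species, sequence)` record layout are hypothetical.

```python
from collections import defaultdict


def filter_orthologs(records, max_length=1000, max_orthologs=20):
    """Keep short sequences, and at most `max_orthologs` of them per gene.

    `records` is an iterable of (gene, species, sequence) tuples; this
    layout is hypothetical and not taken from filtered_dataset.py.
    """
    kept = defaultdict(list)
    for gene, species, sequence in records:
        if len(sequence) > max_length:
            continue  # assumed meaning of --max_length
        if len(kept[gene]) >= max_orthologs:
            continue  # assumed meaning of --max_orthologs
        kept[gene].append((species, sequence))
    return dict(kept)
```

The script itself may apply these limits differently, for example by dropping whole genes that exceed the ortholog cap, so check its source before relying on this reading.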

Re-create the Zoonomia FASTA dataset:

To extract sequences for a specific gene and publish:

python extract_and_publish_protein_sequences.py --input_folder data/zoonomia/ --input_file protein_sequence_df.tsv --output_folder data/zoonomia/ --output_file TP53_protein_sequences.fasta --gene TP53 --publish --repo_name filtered-zoonomia-tp53 --hf_token hf_your_token

To extract all sequences and publish:

python extract_and_publish_protein_sequences.py --input_folder data/zoonomia/ --input_file protein_sequence_df.tsv --output_folder data/zoonomia/ --output_file all_protein_sequences.fasta --publish --repo_name filtered-zoonomia-all --hf_token hf_your_token
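
For orientation, the extraction step (TSV table in, FASTA file out, optionally restricted to one gene) might be sketched as below. This is not the code of extract_and_publish_protein_sequences.py. The function `write_fasta` and the column names `gene`, `species` and `sequence` are assumptions and may not match protein_sequence_df.tsv.

```python
import csv


def write_fasta(tsv_file, fasta_file, gene=None, width=60):
    """Copy sequences from a TSV table to FASTA, optionally for one gene.

    Assumes columns named "gene", "species" and "sequence"; these names
    are hypothetical. Returns the number of records written.
    """
    count = 0
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        if gene is not None and row["gene"] != gene:
            continue
        fasta_file.write(f">{row['gene']}|{row['species']}\n")
        sequence = row["sequence"]
        # Wrap the sequence at a fixed width, as is conventional for FASTA.
        for start in range(0, len(sequence), width):
            fasta_file.write(sequence[start:start + width] + "\n")
        count += 1
    return count
```

Both arguments are open text-mode file objects, so a call could look like `write_fasta(open("protein_sequence_df.tsv"), open("TP53_protein_sequences.fasta", "w"), gene="TP53")`. Leaving `gene` unset writes every sequence.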