Closes #716 #722

shamikbose · 2022-07-03T20:54:59Z

Note: This dataset has a few issues:

- The abstracts have to be downloaded from PubMed with eUtils, which is slow because the API is throttled.
- The way the abstracts are constructed seems to be inconsistent: in some cases the title is included, in others it seems to be ignored. As a result, 7 examples have mismatched offsets.

Is there a standard way these abstracts are formed?
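
For reference, a minimal sketch of how the PubMed IDs could be grouped into fewer eUtils `efetch` requests to soften the throttling. This is not the code in this PR; the helper names and the batch size are assumptions for illustration only, and the sketch builds request URLs without sending them.

```python
from typing import Iterator, List
from urllib.parse import urlencode

# Public NCBI E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batched(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` IDs (hypothetical helper)."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]


def efetch_urls(pmids: List[str], batch_size: int = 100) -> List[str]:
    """Build one efetch URL per batch of PubMed IDs.

    batch_size=100 is an assumed value, not something taken from the PR.
    """
    urls = []
    for batch in batched(pmids, batch_size):
        params = {"db": "pubmed", "id": ",".join(batch), "retmode": "xml"}
        urls.append(EFETCH_URL + "?" + urlencode(params))
    return urls


# The two PMIDs discussed in this thread, one request each for illustration.
urls = efetch_urls(["18239642", "18092344"], batch_size=1)
```

Each URL would still need to be fetched with a pause between requests to stay within NCBI's rate limits; batching only reduces how many requests are needed.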

Confirm that this PR is linked to the dataset issue.
Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py. - Note
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

Tagging errors down from 475 to 247

Abstract is built as follows: `{title} {label}: {abstract.label}`

Mismatched offsets in 7 examples, all others pass

shamikbose · 2022-07-05T21:49:11Z

Some concrete examples of strange abstract creation

Example 1:
18239642 '-174G>C' G-174C 394 401 rs1800795 NSM
18239642 '-572G>C' G-572C 375 382 rs1800796 NSM
18239642 '-596A>G' A-596G 356 363 rs1800797 NSM
Abstract
Title:Modifying effects of IL-6 polymorphisms on body size-associated breast cancer risk.OBJECTIVE:The association between obesity and breast cancer risk is complex. We examined whether the association between body size and breast cancer risk is modified by interleukin-6 (IL6) genotype.METHODS AND PROCEDURES:Five polymorphisms in the IL-6 gene (rs1800797/-596A>G, rs1800796/-572G>C, rs1800795/-174G>C, rs2069832/IVS2G>A, and rs2069849 exon 5 C>T) were studied. We investigated IL6 genotypes and haplotypes with indicators of body size among non-Hispanic white (NHW) and Hispanic/American Indian (AI) breast cancer cases and controls living in the Southwestern United States.RESULTS:We observed lower mean levels of BMI among NHW women who carried one or two copies of the GGCAC haplotype (in order: rs1800797, rs1800796, rs1800795, rs2069832, and rs2069849; P trend 0.02). This haplotype, with an estimated frequency of 43% in NHW study controls, was considerably less common in Hispanic/AI controls (19%). We did not detect significant interactions between IL6 genotypes or haplotypes and BMI categorized as low/normal (<25), overweight (25 to <30), or obese (> or =30) and breast cancer risk in either NHW or Hispanic/AI women. However, we detected consistent and significant interactions between waist-to-hip ratio (WHR) and IL6 rs1800795/-174 G>C genotype for breast cancer risk. These associations were restricted to postmenopausal NHW women. Among women without recent hormone exposure, those with a WHR >0.9 and the rs1800795 GG genotype had a greater than threefold increased risk of breast cancer (odds ratios (ORs) 3.22, 95% confidence intervals (CIs) 1.27, 817) when compared with women with a WHR <0.8 and the rs1800795 GG genotype (P interaction 0.01).DISCUSSION:These data suggest that IL-6 genotypes may influence breast cancer risk in conjunction with central adiposity.

Example 2
18092344 'c.30T>A' c.30T>A 36 43 rs2043211 NSM
18092344 'p.C10X' p.C10X 45 51 rs2043211 PSM
Abstract:
Title: No association of the CARD8 (TUCAN) c.30T>A (p.C10X) variant with Crohn's disease: a study in 3 independent European cohorts.
BACKGROUND:A recent study reported that the c.30T>A (p.Cys10Ter; rs2043211) variant, in the CARD8 (TUCAN) gene, is associated with Crohn's disease (CD). The aim of this study was to analyze the frequency of p.C10X in 3 independent European (IBD) cohorts from Germany, Hungary, and the Netherlands.METHODS:We included a European IBD cohort of 921 patients and compared the p.C10X genotype frequency to 832 healthy controls. The 3 study populations analyzed were: (1) Germany [CD, n = 317; ulcerative colitis (UC), n = 180], (2) Hungary (CD, n = 149; UC, n = 119), and (3) the Netherlands (CD, n = 156). Subtyping analysis was performed in respect to NOD2 variants (p.Arg702Trp, p.Gly908Arg, c.3020insC) and to clinical characteristics. Ethnically matched controls were included (German, n = 413; Hungarian, n = 202; Dutch, n = 217).RESULTS:We observed no significant difference in p.C10X genotype frequency in either patients with CD or patients with UC compared with controls in all 3 cohorts. Conversely to the initial association study, we found a trend toward lower frequencies of the suggestive risk wild type in CD from the Netherlands compared with controls (P = 0.14). We found neither evidence for genetic interactions between p.C10X and NOD2 nor the C10X variant to be associated with a CD or UC phenotype.CONCLUSIONS:Analyzing 3 independent European IBD cohorts, we found no evidence that the C10X variant in CARD8 confers susceptibility for CD.

In Example 1, the start and end offsets match up if the "Title" prefix is included in the text, but in Example 2 they match up only if the word "Title" is excluded.
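
The check for Example 2 can be sketched as below. This is an illustrative snippet rather than the loader's actual logic: `offsets_match` and `pick_candidate` are hypothetical helpers, and only the title of Example 2 is used because both annotated mentions fall inside it.

```python
from typing import Dict, List, Optional, Tuple

# (mention, start, end) with end exclusive, as in the annotation lines above.
Annotation = Tuple[str, int, int]


def offsets_match(text: str, annotations: List[Annotation]) -> bool:
    """True if every annotated span reproduces its mention exactly."""
    return all(text[start:end] == mention for mention, start, end in annotations)


def pick_candidate(
    candidates: Dict[str, str], annotations: List[Annotation]
) -> Optional[str]:
    """Return the name of the first candidate text whose offsets all match."""
    for name, text in candidates.items():
        if offsets_match(text, annotations):
            return name
    return None


# Title of PMID 18092344 (Example 2).
title = (
    "No association of the CARD8 (TUCAN) c.30T>A (p.C10X) variant "
    "with Crohn's disease: a study in 3 independent European cohorts."
)
annotations = [("c.30T>A", 36, 43), ("p.C10X", 45, 51)]

# Two ways of forming the text: with and without the "Title: " prefix.
candidates = {
    "with_prefix": "Title: " + title,
    "without_prefix": title,
}
chosen = pick_candidate(candidates, annotations)
```

Here `chosen` is `"without_prefix"`, consistent with the observation for Example 2. Trying both forms per document is one possible workaround, not a confirmed description of how the original corpus was built.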

mariosaenger · 2024-10-28T10:51:40Z

@phlobo What do we want to do with this dataset? It contains only the annotations, not the abstracts / texts. The latter could be downloaded via the API; however, there might be a lot of offset errors due to changed content, etc.

phlobo · 2024-10-28T12:10:44Z

> @phlobo What do we want to do with this dataset? It just contains the annotations but not the abstracts / texts. The latter could be downloaded via API however there might be a lot of offset errors due to changed content etc

Would it be an option to include the abstracts (e.g., as a zip file) as part of the repo? I guess there are other datasets (MedMentions comes to mind) that redistribute PubMed abstracts as part of a GitHub repo.

WIP

70c3773

Tagging errors down from 475 to 247

shamikbose requested review from hakunanatasha, jason-fries, sunnnymskang, ruisi-su, galtay, leonweber, sg-wbi and debajyotidatta as code owners July 3, 2022 20:55

shamikbose added 2 commits July 5, 2022 12:56

Changes for building abstract

587cda4

Abstract is built as follows: `{title} {label}: {abstract.label}`

Passes all tests

4c7b813

Mismatched offsets in 7 examples, all others pass

shamikbose changed the title ~~WIP #716~~ Closes #716 Jul 5, 2022

shamikbose mentioned this pull request Jul 11, 2022

thomas2011 implementation issues -- missing passages / entity only implementation #716

Open

mariosaenger self-assigned this Oct 28, 2024
