Simplifying/streamlining #7

dalejn · 2020-06-16T18:41:53Z

To help streamline and simplify the Binder, note that the code uses both Python and R. The most straightforward way to modify the code is to do so within the Jupyter Notebook launched through Binder (see instructions https://github.com/dalejn/cleanBib#instructions). After modifying it, you can keep your changes by saving/exporting the Jupyter Notebook as an .ipynb file on your computer. Please test the changes in your own GitHub branch and/or attach your code to a reply here, with a general description of the problem being addressed and the change to the code.

j6k4m8 · 2021-04-11T19:56:37Z

I have adapted the code to a fully Python-based implementation here, which runs quite fast (~1 minute on my laptop with my janky internet connection).

Main differences:

The Good:

I use biblatexparser, which dramatically increases speed and eliminates a few parsing errors that I hit with the cleanBib implementation on more esoteric cite types
I cache results so that the same author is not looked up more than once (which saves gender-api credits too)
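To illustrate the caching point, here is a minimal sketch and not the code from the linked implementation. `query_name_service` is a hypothetical stand-in for one paid lookup (for example, a single gender-api request); it only counts calls so that the saving is visible. The normalization step is also an assumption.

```python
from functools import lru_cache

# Hypothetical stand-in for a paid name-lookup request.
# It counts invocations instead of contacting a real service.
api_calls = 0


def query_name_service(first_name: str) -> str:
    global api_calls
    api_calls += 1
    return f"prediction-for-{first_name}"


@lru_cache(maxsize=None)
def _cached_lookup(normalized_name: str) -> str:
    # Runs at most once per distinct normalized name.
    return query_name_service(normalized_name)


def lookup(first_name: str) -> str:
    # Normalize first so "Ada" and "ada " share one cache entry.
    return _cached_lookup(first_name.strip().lower())


names = ["Ada", "Grace", "ada ", "Grace"]
predictions = [lookup(name) for name in names]
print(api_calls)  # 2: one request per distinct name
```

An in-memory cache like this only lasts for one run; keeping results across runs would need the mapping to be written to disk.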

The Bad:

I don't "heal" broken references by looking them up on crossref (the cleanBib implementation didn't catch that many new datapoints in my bib file so I didn't bother adding it, but maybe this is important!)
~~I don't perform the full analysis yet, I just generate the predictions/data and save to a pandas dataframe / csv~~
I don't check for first/last author of the current paper. (!!!)
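On the ref-healing point above: a rough sketch of what a Crossref lookup for an incomplete reference could look like, assuming the public Crossref REST API `works` endpoint and its `query.bibliographic` parameter. The function names and the injectable `fetch` argument are hypothetical, and this is not how cleanBib does it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(title: str, rows: int = 1) -> str:
    # Ask Crossref for the best bibliographic match for this title.
    params = {"query.bibliographic": title, "rows": rows}
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def extract_authors(response: dict) -> list:
    # Return (given, family) pairs from the top hit, or [] if there is none.
    items = response.get("message", {}).get("items", [])
    if not items:
        return []
    authors = items[0].get("author", [])
    return [(a.get("given", ""), a.get("family", "")) for a in authors]


def heal_reference(title: str, fetch=None) -> list:
    # `fetch` maps a URL to parsed JSON; it is injectable so the
    # parsing logic can be exercised without network access.
    if fetch is None:
        def fetch(url):
            with urlopen(url, timeout=10) as resp:
                return json.load(resp)
    return extract_authors(fetch(build_query_url(title)))
```

The top hit is not guaranteed to be the intended paper, so a real pipeline would probably want to compare titles or DOIs before trusting the returned author list.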

The I-Don't-Know-if-It's-Good-Or-Bad:

I think the race model I use is very slightly different from the one used here
~~I don't (yet) count single-author papers~~

Just dropping this here in case it's helpful to you or if you are able to repurpose any of the code. I definitely want to respect your emphasis on reproducibility.

(This is also explicit permission to use that code if you want any of it)

dalejn · 2021-04-11T23:32:00Z

Thanks for working on this and for the write-up. It looks great! I particularly appreciate your emphasis on simplicity and usability, and caching results and trying a different parser are great ideas. I've been planning to rewrite the cleanBib implementation with functions and to clean up some bloat, so thank you also for this material. Healing broken references and automatically dealing with flagged self-citations without burdening the user too much is still very much a work in progress (I think we struck a decent trade-off prior to adding in the race code, and I'll be trying to get us back there). If you end up going in this direction and in a way that doesn't feel at odds with usability, I'd love to follow up!

j6k4m8 · 2021-04-12T16:25:38Z

Amazing! If you're interested, I can spend a little more time on getting the ref-healing code to work here. Or I can stop bothering you with github notifications 🤣

j6k4m8 · 2021-04-13T20:57:35Z

Some improvements:

Some basic ref-healing, using the crossref API
Race / gender breakdown summaries (ratio numbers)
Export to CSV
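For anyone skimming, the breakdown and export steps could be sketched roughly as below. This is not the gist's code: the sample rows, the `cite_key` and `category` column names, and the two-letter labels are all made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical per-citation labels; real rows would come from the predictions.
rows = [
    {"cite_key": "smith2019", "category": "MM"},
    {"cite_key": "lee2020", "category": "WM"},
    {"cite_key": "garcia2018", "category": "MM"},
    {"cite_key": "chen2021", "category": "WW"},
]


def category_ratios(rows):
    # Fraction of citations falling in each category.
    counts = Counter(row["category"] for row in rows)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def to_csv(rows):
    # Serialize the rows to CSV text with a header line.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["cite_key", "category"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(category_ratios(rows))
print(to_csv(rows))
```

Writing the string returned by `to_csv` to a file gives a table that can be compared against the output of another implementation.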

https://gist.github.com/j6k4m8/3b86b0a78c7966e9257be2677feff781

Would love the opportunity to run this alongside some known results using the current implementation to see how the numbers compare; I'd run it myself but I'm quickly running out of API credits!
