Simplifying/streamlining #7

Open
dalejn opened this issue Jun 16, 2020 · 4 comments

Comments

@dalejn
Owner

dalejn commented Jun 16, 2020

To help streamline and simplify the Binder, note that the code uses both Python and R. The most straightforward way to modify the code is within the Jupyter Notebook launched through Binder (see the instructions at https://github.com/dalejn/cleanBib#instructions). After making changes, save or export the notebook as an .ipynb file on your computer. Please test the changes in your own GitHub branch and/or attach your code to a reply here, with a general description of the problem being addressed and the change to the code.

@j6k4m8

j6k4m8 commented Apr 11, 2021

I have adapted the code to a fully Python-based implementation here, which runs quite fast (~1 minute on my laptop with my janky internet connection).

Main differences:

The Good:

  • I use biblatexparser, which dramatically increases speed and eliminates a few parsing errors I encountered on more esoteric cite types in the cleanBib implementation
  • I cache results so that the same author is not looked up more than once (which saves gender-api credits too)
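
A minimal sketch of the caching idea described above, assuming the lookup is a plain function from a first name to a label. `make_cached_lookup` and `fake_gender_lookup` are hypothetical names for illustration; they are not taken from the linked code or from cleanBib, and the stand-in lookup makes no API call.

```python
from typing import Callable, Dict, List


def make_cached_lookup(lookup: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap `lookup` so each distinct first name is queried only once."""
    cache: Dict[str, str] = {}

    def cached(first_name: str) -> str:
        # Normalise so "Ada", "ada" and "Ada " share one cache entry.
        key = first_name.strip().lower()
        if key not in cache:
            cache[key] = lookup(key)
        return cache[key]

    return cached


# Hypothetical stand-in for a paid gender-inference API call.
# It records every name it is asked about so the saving is visible.
calls: List[str] = []


def fake_gender_lookup(first_name: str) -> str:
    calls.append(first_name)
    return "unknown"


cached_lookup = make_cached_lookup(fake_gender_lookup)
for name in ["Ada", "ada", "Grace", "Ada "]:
    cached_lookup(name)

print(len(calls))  # number of underlying lookups actually performed
```

Four requests result in two underlying lookups here, which is where the saving in API credits would come from. Persisting the dictionary to disk between runs would extend the saving across sessions.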

The Bad:

  • I don't "heal" broken references by looking them up on crossref (the cleanBib implementation didn't catch that many new datapoints in my bib file so I didn't bother adding it, but maybe this is important!)
  • I don't perform the full analysis yet; I just generate the predictions/data and save them to a pandas DataFrame / CSV
  • I don't check for first/last author of the current paper. (!!!)
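
For reference, "healing" a reference through Crossref could look roughly like the sketch below. It assumes the public Crossref REST endpoint `https://api.crossref.org/works` with the `query.bibliographic` and `rows` parameters, and the usual `message` / `items` / `author` response layout. The function names are hypothetical, and this is not how cleanBib itself performs the lookup.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List, Optional

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(title: str, rows: int = 1) -> str:
    """Build a Crossref search URL for a free-text bibliographic query."""
    params = {"query.bibliographic": title, "rows": rows}
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def first_names_from_response(payload: Dict) -> Optional[List[str]]:
    """Return the given names of the top hit's authors, or None if no hit."""
    items = payload.get("message", {}).get("items", [])
    if not items:
        return None
    authors = items[0].get("author", [])
    # Some Crossref records omit "given"; keep an empty string as a placeholder.
    return [author.get("given", "") for author in authors]


def heal_entry(title: str) -> Optional[List[str]]:
    """Query Crossref for `title`. Performs a network request."""
    with urllib.request.urlopen(build_query_url(title), timeout=10) as resp:
        return first_names_from_response(json.load(resp))


# Hand-written payload in the expected shape, NOT real API output.
sample = {"message": {"items": [{"author": [
    {"given": "Ada", "family": "Example"},
    {"family": "Initials-Only"},
]}]}}
print(first_names_from_response(sample))
```

Taking the top hit blindly can attach the wrong authors to an entry, so a real implementation would probably compare the returned title against the original before accepting it.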

The I-Don't-Know-if-It's-Good-Or-Bad:

  • I think the race model I use is very slightly different than the one in here
  • I don't (yet) count single-author papers

Just dropping this here in case it's helpful to you or if you are able to repurpose any of the code. I definitely want to respect your emphasis on reproducibility.

(This is also explicit permission to use that code if you want any of it)

@dalejn
Owner Author

dalejn commented Apr 11, 2021

Thanks for working on this and for the write-up. It looks great! I particularly appreciate your emphasis on simplicity and usability, and caching results and trying a different parser are great ideas. I've been planning to rewrite the cleanBib implementation with functions and to clean up some bloat, so thank you also for this material. Healing broken references and automatically dealing with flagged self-citations without burdening the user too much is still very much a work in progress (I think we struck a decent trade-off prior to adding in the race code, and I'll be trying to get us back there). If you end up going in this direction and in a way that doesn't feel at odds with usability, I'd love to follow up!

@j6k4m8

j6k4m8 commented Apr 12, 2021

Amazing! If you're interested, I can spend a little more time on getting the ref-healing code to work here. Or I can stop bothering you with GitHub notifications 🤣

@j6k4m8

j6k4m8 commented Apr 13, 2021

Some improvements:

  • Some basic ref-healing, using the crossref API
  • Race / gender breakdown summaries (ratio numbers)
  • Export to CSV
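
As a rough illustration of the summary and export steps listed above, the sketch below computes the share of each category and writes the per-citation rows to CSV. It uses only the standard library rather than pandas, and the column names, category labels, and cite keys are placeholders, not the ones used in the gist.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable, List, TextIO


def category_proportions(labels: Iterable[str]) -> Dict[str, float]:
    """Return each label's share of the total (empty dict for no data)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def write_csv(rows: List[Dict[str, str]], handle: TextIO) -> None:
    """Write one row per citation with a header line."""
    writer = csv.DictWriter(handle, fieldnames=["cite_key", "category"])
    writer.writeheader()
    writer.writerows(rows)


# Placeholder rows; the keys and labels are invented for illustration.
rows = [
    {"cite_key": "ref_a", "category": "MM"},
    {"cite_key": "ref_b", "category": "MM"},
    {"cite_key": "ref_c", "category": "WM"},
    {"cite_key": "ref_d", "category": "WW"},
]

proportions = category_proportions(row["category"] for row in rows)
buffer = io.StringIO()  # swap for open("predictions.csv", "w", newline="")
write_csv(rows, buffer)
print(proportions)
```

Keeping the per-citation rows in the CSV, rather than only the ratios, makes it easier to compare two implementations entry by entry.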

https://gist.github.com/j6k4m8/3b86b0a78c7966e9257be2677feff781

Would love the opportunity to run this alongside some known results using the current implementation to see how the numbers compare; I'd run it myself but I'm quickly running out of API credits!


2 participants