Manually extract text citations from syllabi #23

Open
JonathanReeve opened this issue May 26, 2021 · 9 comments

@JonathanReeve
Owner

Let's break up the task of extracting text citations from syllabi.

Once we've manually extracted the text citations from the syllabi, I can probably use Anystyle to generate BibTeX, and then RDF. But first we need to produce plain text citations from the syllabi.

@JonathanReeve
Owner Author

I can do the first 100.

Amber: 100-175.

Serena: 175-250+.

@JonathanReeve
Owner Author

Let's create plain text files named {id}.texts.txt, where {id} is the course ID.

JonathanReeve added this to the First Prototype milestone May 28, 2021
@sy2657
Collaborator

sy2657 commented May 28, 2021 via email

@JonathanReeve
Owner Author

Sure. Let's do this:

  1. Start with the ID listed above, so in your case, deCourse:175.
  2. If it has a syllabus (in HTML, PDF, .docx, etc), retrieve it.
  3. Open the syllabus, and look for a section with assigned readings.
  4. Copy the assigned readings, and paste them into a new plain text document, called {id}.texts.txt, where {id} is the course ID. So if it's deCourse:175, it would be called 175.texts.txt.
  5. Make sure it's a plain text document in which each line is one reading (citation). Here's an example line: O'Neil, Cathy. Weapons of math destruction: How big data increases inequality and threatens democracy. Crown, 2016.
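
The naming and one-citation-per-line conventions in steps 4 and 5 could be sketched in Python roughly as follows. This is only an illustration, not part of the project: the helper names (`texts_filename`, `clean_citations`, `write_texts_file`) are hypothetical, and it assumes the part of the course ID after the colon is always numeric.

```python
from __future__ import annotations

from pathlib import Path


def texts_filename(course_id: str) -> str:
    """Turn 'deCourse:175' (or plain '175') into '175.texts.txt'."""
    local_id = course_id.split(":")[-1].strip()
    # Assumption: course IDs are numeric after the prefix.
    if not local_id.isdigit():
        raise ValueError(f"unexpected course ID: {course_id!r}")
    return f"{local_id}.texts.txt"


def clean_citations(pasted: str) -> list[str]:
    """Return one citation per line: collapse runs of whitespace, drop blank lines."""
    lines = (" ".join(line.split()) for line in pasted.splitlines())
    return [line for line in lines if line]


def write_texts_file(course_id: str, pasted: str, out_dir: Path = Path(".")) -> Path:
    """Write the cleaned citations to {id}.texts.txt as UTF-8 plain text."""
    path = out_dir / texts_filename(course_id)
    path.write_text("\n".join(clean_citations(pasted)) + "\n", encoding="utf-8")
    return path
```

Note that this does not rejoin a citation that was wrapped across several lines when copied out of a PDF; those still need to be merged by hand before saving.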

You can do this quasi-automatically, if you like, with a little scripting. For instance:

  1. Write a SPARQL query that finds courses with hasSyllabus
  2. Try to get the syllabus HTML or PDF
  3. Verify that it's not a 404 page
  4. If it's a real syllabus, open it so that you can manually identify the readings section
  5. Copy and paste as above

But it might be faster to do it manually.
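
For anyone who does want to script the first three steps, here is a minimal Python sketch using only the standard library. Several things in it are assumptions rather than facts about this project: the `de:` prefix IRI in the query is a placeholder for the real ontology namespace, the function names are made up for illustration, and the "not found" check is a crude heuristic that will miss some error pages.

```python
from __future__ import annotations

import urllib.error
import urllib.request

# Hypothetical namespace: replace the prefix IRI with the project's real one.
SYLLABUS_QUERY = """
PREFIX de: <http://example.org/de#>
SELECT ?course ?syllabus WHERE {
  ?course de:hasSyllabus ?syllabus .
}
"""

# Phrases that often appear on error pages served with a 200 status (heuristic).
NOT_FOUND_HINTS = ("page not found", "404 not found", "error 404")


def fetch(url: str, timeout: float = 20.0) -> tuple[int, bytes]:
    """Return (HTTP status, raw body); HTTP errors such as 404 are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()


def looks_like_syllabus(status: int, body: bytes) -> bool:
    """Rough check that a response is a real document and not a 404 page."""
    if status != 200:
        return False
    if body.startswith(b"%PDF"):
        return True  # PDFs are accepted as-is; a person still has to read them.
    head = body[:2000].decode("utf-8", errors="replace").lower()
    return not any(hint in head for hint in NOT_FOUND_HINTS)
```

The query would be run against whatever SPARQL endpoint or local graph holds the course data, and each `?syllabus` URL passed to `fetch`. Connection failures (`urllib.error.URLError`) are deliberately left to propagate. Anything that passes `looks_like_syllabus` still has to be opened by hand to find the readings section.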

@JonathanReeve
Owner Author

@Zhuohan-Amber and @sy2657, could you submit pull request(s) with these changes, when you're done? And in the pull request text, just say "fixes #23," which will mark this issue as completed. Thanks in advance!

@JonathanReeve
Owner Author

@Zhuohan-Amber and @sy2657, let me know if you need any help with submitting pull requests on this one. It'd be nice to close the issue when that's done.

@sy2657
Collaborator

sy2657 commented Jul 4, 2021 via email

@JonathanReeve
Owner Author

I don't think you submitted a pull request; if you had, it would appear in this list of pull requests. Maybe review some tutorials and try again?

@sy2657
Collaborator

sy2657 commented Jul 5, 2021 via email
