Write scripts to run basic text reuse pipeline #67

Closed
13 tasks done
mnaydan opened this issue Aug 15, 2024 · 4 comments

@mnaydan
Collaborator

mnaydan commented Aug 15, 2024

I/O

  • Generate a minimal Chadwyck-Healey text corpus file (.jsonl) (limited to poems in the test set)
    • Write script
    • Test script output (i.e. jsonl file)
    • Create initial .jsonl file for the Chadwyck-Healey test set.
  • Build minimal ppa page corpus using old OCR for Gale volumes.

Passim-specific

  • Initial script(s) for running passim
    • Write script to convert input text .jsonl files to a passim-suitable format
    • Initial bash script for running passim
    • Write script for standardizing passim output
      • Minimal output
      • Optionally include original/aligned text excerpts for manual evaluation
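
The conversion step in the list above could look roughly like the sketch below. It is not the project's actual script. It assumes each corpus record carries `id` and `text` fields, and that passim reads JSON lines with `id`, `series`, and `text` fields, which is my understanding of passim's documented input. The function names and the `series` argument are hypothetical.

```python
import json
from typing import Iterable, Iterator


def to_passim_records(records: Iterable[dict], series: str) -> Iterator[dict]:
    """Map corpus records onto the id/series/text fields (assumed passim input)."""
    for record in records:
        yield {
            "id": record["id"],      # assumed field name in the corpus file
            "series": series,        # label for the source collection
            "text": record["text"],  # full text, passed through unchanged
        }


def convert_jsonl(in_path: str, out_path: str, series: str) -> int:
    """Rewrite a corpus .jsonl file as passim-style input; return the record count."""
    count = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        # Skip blank lines so a trailing newline does not break parsing.
        records = (json.loads(line) for line in src if line.strip())
        for out in to_passim_records(records, series):
            dst.write(json.dumps(out, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Any extra fields in the corpus records are dropped here; whether they should be carried through depends on what the output-standardization step needs.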

Evaluation

  • Get passim results for test set with default parameters
    • "Raw" passim output (i.e. various files within a specific (top-level) output directory)
@mnaydan mnaydan changed the title from "Write a script to run basic Passim pipeline" to "Write a script to run basic text reuse pipeline" Aug 19, 2024
@mnaydan mnaydan changed the title from "Write a script to run basic text reuse pipeline" to "Write scripts to run basic text reuse pipeline" Aug 19, 2024
@mnaydan mnaydan added the "🗜️ awaiting testing" (Implemented and ready to be tested) label Sep 4, 2024
@mnaydan
Collaborator Author

mnaydan commented Sep 4, 2024

@laurejt I am testing the jsonl. Could you please reply with a comment here specifying the acceptance criteria? Are new lines important? What else am I looking for?

@mnaydan mnaydan self-assigned this Sep 4, 2024
@laurejt
Contributor

laurejt commented Sep 4, 2024

@mnaydan The immediate goal of testing the jsonl is to examine the "text" field and confirm that it corresponds to the "full text" of the poem. I'm not sure how best to check this beyond finding an external copy of the poem and confirming that the text "looks" right.

@mnaydan
Collaborator Author

mnaydan commented Sep 4, 2024

@laurejt thank you, that is helpful! I used Visual Studio Code and json-lines-viewer.preview to read the jsonl, and spot checked a dozen or so poems. The text field does correspond to the full text of the poem as I would expect it, so I would consider this "tested" and "accepted."
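
For anyone repeating this kind of spot check without an editor extension, a minimal sketch follows. It assumes each line of the jsonl is a JSON object with `id` and `text` fields; the function names are hypothetical and this is not part of the project's scripts. It draws a reproducible random sample for manual comparison and flags records whose `text` is missing or blank.

```python
import json
import random


def sample_records(path: str, k: int = 12, seed: int = 0) -> list[dict]:
    """Return up to k randomly chosen records from a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    rng = random.Random(seed)  # fixed seed so the same sample can be re-checked
    return rng.sample(records, min(k, len(records)))


def empty_text_ids(records: list[dict]) -> list[str]:
    """List ids of records whose "text" field is missing or blank."""
    return [
        str(r.get("id"))
        for r in records
        if not str(r.get("text") or "").strip()
    ]
```

Printing each sampled record's `text` (or its `repr`, to make line breaks visible) gives something to compare against an external copy of the poem. Judging whether it is the full text still takes a human reader.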

@mnaydan mnaydan removed the "🗜️ awaiting testing" (Implemented and ready to be tested) label Sep 4, 2024
@mnaydan mnaydan removed their assignment Sep 4, 2024
@mnaydan mnaydan closed this as completed by moving to Done in Iteration Planning Board Sep 11, 2024

3 participants