Some Sections are Too Long for LLMs #1558

Open
1 of 60 tasks
npmccallum opened this issue Nov 28, 2023 · 2 comments

Comments

@npmccallum
Contributor

npmccallum commented Nov 28, 2023

Some sections of the text are too long to be held in the context window of LLMs. Here's a list of the sections (the encodings of Plutarch are a particularly bad offender):

  • tlg0007.tlg067; section: 14
  • tlg0007.tlg067; section: 7
  • tlg0007.tlg067; section: 9
  • tlg0007.tlg068; section: 1
  • tlg0007.tlg068; section: 10
  • tlg0007.tlg068; section: 11
  • tlg0007.tlg068; section: 13
  • tlg0007.tlg068; section: 14
  • tlg0007.tlg068; section: 2
  • tlg0007.tlg068; section: 4
  • tlg0007.tlg068; section: 6
  • tlg0007.tlg068; section: 8
  • tlg0007.tlg070; section: 22
  • tlg0007.tlg071; section: 10
  • tlg0007.tlg071; section: 11
  • tlg0007.tlg079; section: 18
  • tlg0007.tlg079; section: 2
  • tlg0007.tlg079; section: 3
  • tlg0007.tlg083; section: 15
  • tlg0007.tlg083; section: 19
  • tlg0007.tlg087; section: 2:13
  • tlg0007.tlg094; section: 10
  • tlg0007.tlg094; section: 12
  • tlg0007.tlg094; section: 4
  • tlg0007.tlg094; section: 6
  • tlg0007.tlg094; section: 7
  • tlg0007.tlg097; section: 18
  • tlg0007.tlg107; section: 22 #1636
  • tlg0007.tlg109; section: 20
  • tlg0007.tlg109; section: 22
  • tlg0007.tlg112; section: 4:1:3
  • tlg0007.tlg112; section: 6:2:2
  • tlg0007.tlg112; section: 7:6:3
  • tlg0007.tlg112; section: 8:9:3
  • tlg0007.tlg113; section: 13
  • tlg0007.tlg113; section: 16
  • tlg0007.tlg113; section: 17
  • tlg0007.tlg113; section: 18
  • tlg0007.tlg113; section: 19
  • tlg0007.tlg113; section: 21
  • tlg0007.tlg113; section: 9
  • tlg0007.tlg114; section: 3
  • tlg0007.tlg115; section: 1
  • tlg0007.tlg118; section: 13
  • tlg0007.tlg118; section: 14
  • tlg0007.tlg118; section: 15
  • tlg0007.tlg118; section: 32
  • tlg0007.tlg123; section: 42
  • tlg0007.tlg126; section: 20
  • tlg0007.tlg126; section: 21
  • tlg0007.tlg126; section: 25
  • tlg0007.tlg126; section: 26
  • tlg0007.tlg129; section: 10
  • tlg0007.tlg129; section: 13
  • tlg0007.tlg129; section: 3
  • tlg0007.tlg129; section: 36
  • tlg0010.tlg023; section: 6
  • tlg0010.tlg024; section: 9
  • tlg0010.tlg027; section: 2
  • tlg0010.tlg029; section: 4

The worst, by far, is this one:

http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A2008.01.0301%3Asection%3D22

Would it be possible to break up these sections into smaller sizes?

npmccallum changed the title from "Section 22 of Plutarch's De sera numinis vindicta is too long" to "Some Sections are Too Long for LLMs" on Nov 29, 2023
@lcerrato
Collaborator

@npmccallum
The Isocrates encoding appears incorrect to me (sections were not encoded), but this was very early conversion work and will be revisited as part of the workflow.

For the Plutarch, smaller sections (unless there are errors lurking in here) would require imposing another level of CTS structure on the texts. That would be an arbitrary imposition on the standard structure. (I also do not believe another layer is possible for works that already have 3 levels.)

The plain text versions of the texts might be a better option depending on the type of work being done? I know others have done post-processing text chunking as needed using those versions.
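For anyone going the plain-text route, here is a minimal sketch of that kind of post-processing chunking. It is only an illustration, not a script from this repository: the function name `chunk_text`, the `max_words` budget, and the use of word count as a rough stand-in for an LLM's token count are all assumptions.

```python
import re

# Sentence-final punctuation commonly found in Greek editions: full stop,
# ano teleia (U+00B7 or U+0387), and the Greek question mark (";" or U+037E).
SENTENCE_END = re.compile(r"(?<=[.;·\u0387\u037e])\s+")


def chunk_text(text, max_words=200):
    """Split text into chunks of at most max_words words.

    Breaks fall at sentence ends where possible. A single sentence longer
    than the budget is split on word boundaries. (Hypothetical helper.)
    """
    chunks, current, count = [], [], 0
    for sentence in SENTENCE_END.split(text.strip()):
        words = sentence.split()
        if not words:
            continue
        # Oversized sentence: flush what we have, then emit full-size pieces.
        while len(words) > max_words:
            if current:
                chunks.append(" ".join(current))
                current, count = [], 0
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        # Start a new chunk if this sentence would exceed the budget.
        if count + len(words) > max_words and current:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk could then be labelled with the section it came from plus a running index, so that results can still be traced back to the canonical citation.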

Because CTS references can designate any span of text, and so can point to smaller portions or subsets of a work, we haven't been adding more containers top-down as a general practice, except for some more obscure works here and there.
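To illustrate the span idea, here is a rough sketch that builds a CTS URN pointing at part of a section, using the `@token[n]` subreference form as I understand it from the CTS URN specification. The helper name `cts_span_urn`, the version label `perseus-grc2`, and the two tokens in the usage line are placeholders for illustration, not values checked against the actual edition.

```python
def cts_span_urn(work_urn, passage, start_token, end_token, start_n=1, end_n=1):
    """Build a CTS URN for a span inside one citable passage.

    start_n / end_n select the n-th occurrence of each token within the
    passage. (Hypothetical helper; it does not check that the tokens occur.)
    """
    start = f"{passage}@{start_token}[{start_n}]"
    end = f"{passage}@{end_token}[{end_n}]"
    return f"{work_urn}:{start}-{end}"


# Placeholder version label and tokens, for illustration only.
urn = cts_span_urn("urn:cts:greekLit:tlg0007.tlg107.perseus-grc2", "22", "καὶ", "δὲ")
```

A consumer could cite sub-spans of a long section this way without any new containers being added to the XML.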

This is really beyond my role and would be something others should decide.

@helmadik
Contributor

Just a quick comment: the offender singled out in particular, tlg0007.tlg107 section 22, does seem to be missing section numbers after that point in Vernardakis, but the Loeb edition does have them, e.g. 23, https://www.loebclassics.com/view/plutarch-delays_divine_vengeance/1959/pb_LCL405.273.xml?result=1&rskey=5RUAuX, and 24, https://www.loebclassics.com/view/plutarch-delays_divine_vengeance/1959/pb_LCL405.277.xml?result=1&rskey=5RUAuX
