
Number of text corpuses #125

Open · Glorifier85 opened this issue Sep 12, 2020 · 4 comments

@Glorifier85

Hi there,

first off, great application! Intuitive and easy to use - exactly what I needed. The question I have is: is there a reason why the minimum number of texts to be chosen is ten? I am sure there is, but can we change it somehow? What if I just wanted to tokenize and compare two corpuses?

Thanks!
Glorifier

@severinsimmler (Collaborator) commented Sep 12, 2020

Hi @Glorifier85,

thank you for the positive feedback; we are very happy that the application is useful for people.

Topic modeling is a technique that works well with a large number of documents. I think it makes no sense, theoretically or practically, to topic model fewer than 10 documents (though 10 is actually a more or less arbitrary choice). See, e.g., Tang et al.:

> The number of documents plays perhaps the most important role; it is theoretically impossible to guarantee identification of topics from a small number of documents, no matter how long.

The length of the documents also plays an important role. Maybe you should consider segmenting the documents of your small corpus – topic modeling works quite well even with tweets (i.e. 280 characters); see, e.g., Ordun et al.
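For illustration, a minimal sketch of what such a segmentation could look like – the whitespace tokenization, the segment size, and the file names are assumptions for the example, not something the application prescribes:

```python
# Split each document of a small corpus into fixed-size word segments,
# so the topic model sees many shorter pseudo-documents instead of a
# few very long ones.

def segment(text: str, size: int = 1000) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Hypothetical two-document corpus read from disk:
corpus = [open(path, encoding="utf-8").read() for path in ("a.txt", "b.txt")]
pseudo_documents = [chunk for text in corpus for chunk in segment(text)]
```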

@Glorifier85 (Author)

Hi @severinsimmler,

thanks for your response, much appreciated!
Understood re the number of corpuses. Frankly, I can see why the length of the documents plays a role, but I don't quite understand why the sheer number of documents would be so important. I'll have a closer look at the papers you've linked, though.

Speaking of document length, is there an optimal length in terms of word count? Törnberg (2016), for example (see the link below), mentions that they split documents into chunks of 1000 words. Is this something you can confirm?
https://www.sciencedirect.com/science/article/pii/S2211695816300290

Many thanks!

@severinsimmler (Collaborator)

> but I don't quite understand why the sheer number of documents would be so important

Most natural language processing algorithms are designed to extract information from an extensive data set. In general, one could say: the more, the better. Always. But I think this is also a question of methodology. If I only have two documents, why do I need a quantitative method? I could evaluate the texts with qualitative methods (e.g. close reading) and probably gain more valuable insights.

> they split documents into chunks of 1000 words. Is this something you can confirm?

Yes, 1000 words per document is a good starting point. I don't know how your texts are structured, but you could also segment by paragraph or chapter.
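As a variation on the fixed-size sketch above, a quick sketch of paragraph-based segmentation, assuming paragraphs are separated by blank lines (the minimum-length filter is a hypothetical choice to drop headings and other short fragments):

```python
# Segment by paragraph instead of a fixed word count. Paragraphs are
# assumed to be separated by blank lines; very short fragments such as
# headings are dropped via the hypothetical "min_words" threshold.

def segment_by_paragraph(text: str, min_words: int = 25) -> list[str]:
    paragraphs = (p.strip() for p in text.split("\n\n"))
    return [p for p in paragraphs if len(p.split()) >= min_words]
```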

@Glorifier85 (Author)

Thanks again! To your knowledge, is there a maximum number of words per document that should not be exceeded, like a hard cap? I am planning to model social media comments from news outlets over a certain period of time (3-5 years), so I might end up with >500k words per document.

Thanks!
