I had some free time this week and I was able to pen down some features I'm hoping we'll be able to include. These are:
A class for handling all forms of scraping. The API for this feature can act as an interface that other scrapers are built on. We can leverage either bs4 or scrapy. I'm thinking something like:

    import scrapy

    class BaseScrapper(scrapy.Spider):
        def __init__(self, name, urls, **kwargs):
            super().__init__(name, **kwargs)
            self.start_urls = urls  # Scrapy starts crawling from these URLs

        def parse_urls(self):
            # Do something to the URLs before starting
            pass

        def parse(self, response):
            # Crawling logic
            pass

Then a scraper like the Bibeli scraper can use this class:

    class BibeliScrapper(BaseScrapper):
        # Logic goes here
        pass

The major advantage here is reusability: anyone can build their own Yoruba scraper with a minimal amount of work.
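Since the choice between bs4 and scrapy is still open, here is a rough, library-agnostic sketch of the same idea using only the standard library. Everything in it is hypothetical and just for illustration: the `ParagraphScrapper` subclass, the `_ParagraphParser` helper, and the choice to have `parse` take raw HTML are my assumptions, not a settled design.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser
from typing import Iterable, Iterator, List


class BaseScrapper(ABC):
    """Hypothetical library-agnostic base class; subclasses supply parse()."""

    def __init__(self, name: str, urls: Iterable[str]):
        self.name = name
        self.urls = self.parse_urls(urls)

    def parse_urls(self, urls: Iterable[str]) -> List[str]:
        # Default clean-up: strip whitespace, drop blanks and duplicates,
        # and keep the original order.
        return list(dict.fromkeys(u.strip() for u in urls if u.strip()))

    @abstractmethod
    def parse(self, html: str) -> Iterator[str]:
        """Yield pieces of text extracted from one downloaded page."""


class _ParagraphParser(HTMLParser):
    """Collects the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data


class ParagraphScrapper(BaseScrapper):
    """Example subclass (an assumption): yields non-empty paragraph text."""

    def parse(self, html: str) -> Iterator[str]:
        parser = _ParagraphParser()
        parser.feed(html)
        for paragraph in parser.paragraphs:
            text = paragraph.strip()
            if text:
                yield text
```

A new scraper then only has to override `parse` (and `parse_urls` if it needs different URL handling). Downloading pages is left out on purpose, since that part depends on which library we pick.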
Corpus class and DirectoryCorpus class (inspired by gensim)
This would be a class that can be used to load Yoruba corpora in various formats through a single API. It should support:
Streaming files
Reading various file formats, such as txt, gzip, and csv.
Validating a file format. For example, if a user loads an Owe file, it should be able to validate that the content of the file conforms to that format.
Preprocessing while reading.
Generating random text
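To make the list above a bit more concrete, here is a minimal sketch of what such classes might look like. It is not the code from the commit. The constructor parameters (`preprocess`, `validate`), the `random_text` method, joining CSV cells with spaces, and reading "generating random text" as sampling random lines are all assumptions on my part.

```python
import csv
import gzip
import random
from pathlib import Path
from typing import Callable, Iterator, List, Optional


class Corpus:
    """Hypothetical streaming reader for a single corpus file."""

    def __init__(self, path,
                 preprocess: Optional[Callable[[str], str]] = None,
                 validate: Optional[Callable[[str], bool]] = None):
        self.path = Path(path)
        self.preprocess = preprocess  # applied to every line while reading
        self.validate = validate      # format check, e.g. for an Owe file

    def _raw_lines(self) -> Iterator[str]:
        # Pick a reader from the file extension (assumed convention).
        suffix = self.path.suffix.lower()
        if suffix == ".gz":
            with gzip.open(self.path, "rt", encoding="utf-8") as handle:
                yield from handle
        elif suffix == ".csv":
            with open(self.path, newline="", encoding="utf-8") as handle:
                for row in csv.reader(handle):
                    yield " ".join(row)
        else:
            with open(self.path, encoding="utf-8") as handle:
                yield from handle

    def __iter__(self) -> Iterator[str]:
        # Stream one line at a time so large files never sit fully in memory.
        for number, line in enumerate(self._raw_lines(), start=1):
            text = line.strip()
            if not text:
                continue
            if self.validate is not None and not self.validate(text):
                raise ValueError(f"{self.path}: line {number} fails validation")
            yield self.preprocess(text) if self.preprocess else text

    def random_text(self, k: int = 1, seed: Optional[int] = None) -> List[str]:
        # Draw k lines at random, with replacement; this loads the whole file.
        return random.Random(seed).choices(list(self), k=k)


class DirectoryCorpus:
    """Hypothetical wrapper that streams every file in a directory."""

    def __init__(self, directory, **kwargs):
        self.directory = Path(directory)
        self.kwargs = kwargs  # forwarded to each Corpus

    def __iter__(self) -> Iterator[str]:
        for path in sorted(self.directory.iterdir()):
            if path.is_file():
                yield from Corpus(path, **self.kwargs)
```

With something like this, `for line in Corpus("some_file.txt", preprocess=str.lower)` would stream cleaned lines, and a format-specific validator could be passed in without the reader knowing anything about the format itself.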
A commit for this is available here
The interface is described below:
iranlowo.
They should return a Corpus object.

I imagine a downside of these features is that they might make the project bloated, but I think the uses would outweigh this downside.