This project generates Urdu poetry with N-Gram models trained on a scraped poetry dataset. The dataset is a collection of Urdu poems by various Urdu poets and serves as the training corpus for generating new verses. The project explores the following N-Gram models:
- Unigram
- Bigram
- Trigram
- Backward Bigram
- Bi-directional Bigram
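As a rough illustration of how such models can be built and sampled, here is a minimal sketch using only the Python standard library. It is not the project's actual implementation: the function names `build_ngram_model` and `generate_line`, the `<s>` / `</s>` padding markers, whitespace tokenization, and the toy romanized lines are all assumptions made for this example. It covers the unigram, bigram, trigram, and backward bigram cases; the bi-directional bigram is not shown, since how the two directions are combined is a design choice this section does not specify.

```python
import random
from collections import Counter, defaultdict


def build_ngram_model(lines, n, backward=False):
    """Map each (n-1)-word context to a Counter of the words observed after it."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.split()  # assumption: verses are tokenized on whitespace
        if backward:
            tokens = tokens[::-1]  # a backward model reads each verse from its last word
        # Pad so the first word has a full context and every verse has an end marker.
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            model[context][padded[i + n - 1]] += 1
    return model


def generate_line(model, n, max_words=10, rng=random):
    """Sample one verse word by word, weighting candidates by their counts."""
    context = ("<s>",) * (n - 1)
    words = []
    while len(words) < max_words:
        counts = model.get(context)
        if not counts:
            break  # unseen context: stop instead of guessing
        candidates = list(counts)
        word = rng.choices(candidates, weights=[counts[c] for c in candidates])[0]
        if word == "</s>":
            break
        words.append(word)
        # Slide the context window; for a unigram model it stays empty.
        context = (context + (word,))[1:]
    return words


# Made-up toy lines for illustration only; they are not taken from the dataset.
toy_lines = ["dil ki baat", "dil ki raah"]

forward_bigram = build_ngram_model(toy_lines, n=2)
print(" ".join(generate_line(forward_bigram, n=2)))

backward_bigram = build_ngram_model(toy_lines, n=2, backward=True)
# A backward model emits the last word first, so reverse before printing.
print(" ".join(reversed(generate_line(backward_bigram, n=2))))
```

Passing `n=1` or `n=3` to the same two functions gives the unigram and trigram variants. A backward model can be handy when a verse has to end on a chosen word, for example to hold a rhyme.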
The dataset was scraped from https://www.rekhta.org/. All poems of at least 25 Urdu poets were collected from this website using Scrapy (https://scrapy.org/). The scraped data is saved in scrapped_poems.csv. The spider used for this purpose is available in the I212705urdupoemsspider.py file.
The CSV file contains the following 3 columns:
- Poem Line: a single verse (line) of a poem.
- Nazm Name: the name of the poem.
- Author Name: the name of the poem's author.
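The sketch below shows one way to read this file into per-poet lists of verses, ready to be fed to an N-Gram model. It assumes that the CSV header row uses exactly the column names listed above ("Poem Line", "Nazm Name", "Author Name") and that the file is UTF-8 encoded. Neither is confirmed here, and `load_lines_by_author` is a hypothetical helper name.

```python
import csv
from collections import defaultdict


def load_lines_by_author(csv_file):
    """Group verses by poet.

    csv_file is an open text file, or any iterable of CSV text lines,
    whose first row is the header (assumed column names, see above).
    """
    lines_by_author = defaultdict(list)
    for row in csv.DictReader(csv_file):
        verse = (row.get("Poem Line") or "").strip()
        author = (row.get("Author Name") or "").strip()
        if verse:  # skip rows that have no verse text
            lines_by_author[author].append(verse)
    return dict(lines_by_author)
```

To use it on the real file, open it with `open("scrapped_poems.csv", encoding="utf-8", newline="")` and pass the file object to `load_lines_by_author`. Grouping by poet keeps the option of training one model per poet instead of a single model over the whole corpus.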
Follow these steps to use the spider that scraped the poems:
- Create a virtual environment in Visual Studio Code.
- Install Scrapy (pip install scrapy).
- Create a Scrapy project inside the virtual environment (scrapy startproject, followed by a project name of your choice).
- Either copy my spider's code into your own spider, or add my spider file to the spiders directory of your project.
- Run the spider (scrapy crawl, followed by the spider's name).
- The data will automatically be stored in a new file called scrapped_poems.csv in the same directory.
- The spider code can be modified to suit your requirements.
- The code can also serve as a base for scraping other websites, but knowledge of Scrapy is a must.
Through this project, I aim to showcase the versatility and creativity of N-Gram models in generating Urdu poetry while preserving the aesthetic and linguistic richness of the language. A PDF report of the project is also available.