GitHub - amih77/tabla-parsa: Document parser and validator

Document Parser And Validator

Description

This project is meant for parsing documents, and validating them according to some rules. It presents a programmatic API for storing, getting, updating and deleting a document.

Usage

Install all dependencies in requirements.txt
Pass the documents directory to api.create_documents()
Perform CRUD operations on the stored documents

Design Minutes

The parsing of the html documents is done by beautifulsoup
Both Document and Discrepancy are ODM classes (Object Document Model) and implement relevant DB methods
I am using pydantic class framework whenever I can. I like it for its simplicity, cleanness and parsing/validating power
I chose to implement DocumentValidator with Strategy design pattern as it provides great extensibility and dynamic flexibility. This pattern involves defining a family of algorithms, encapsulating each one, and making them interchangeable. It lets the algorithm vary independently from the clients that use it, which in the context of calculating discrepancies, allows for flexible and maintainable discrepancy detection logic which can be further developed and enriched.
I separated Documents and Discrepancys among different DB collections for the following reasons:
- Modularity: It allows us to tailor the schema, indexing, and access patterns to the specific needs of each type of data, enhancing clarity and maintainability of the database structure,
- Performance: we can optimize queries, updates, and indexes for each collection independently,
- Scalability: We shouldn't expect the same growth behavior from the two collections therefore we have the flexibility of treating each one differently yet having minimal effect on the entire system.
I decided to go with soft deletion as the mean of db document deletion - I think it's considered to be a good practice (never actually delete from the DB...)
I contemplated a bit about the implementation of creation_date and location. Decided to go with computing them dynamically as properties and storing the raw footer.

Tech Debt

Many improvements come to mind, here are some of them:

MongoDB improvements, such as inserting and fetching multiple documents at once
The async job of calculating document's discrepancies can be impelemnted with a more robust framework such as Celery etc.
update and delete document should have a transaction/lock mechanism as it involves document and discrepancies

Conclusion

I had great fun working on this exercise, especially getting to experience with MongoDB and Strategy design pattern.

Thanks and see you on Wednesday!

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
data_models		data_models
odm_models		odm_models
utils		utils
README.md		README.md
api.py		api.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Document Parser And Validator

Description

Usage

Design Minutes

Tech Debt

Conclusion

About

Releases

Packages

Languages

amih77/tabla-parsa

Folders and files

Latest commit

History

Repository files navigation

Document Parser And Validator

Description

Usage

Design Minutes

Tech Debt

Conclusion

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages