Building a Powerful Document Search Engine: Leveraging HDFS, Apache Tika, SFTP, NiFi, MongoDB, Elasticsearch, Logstash, FastAPI, React
Our project offers a comprehensive solution that combines modern technologies and architectures into a powerful document search engine. This engine is not just a tool but a sophisticated ecosystem designed to handle complex data processing and retrieval tasks. It is built on a foundation of critical components, each contributing to an efficient and robust data processing system. Here's an overview of the key elements:
This project is based on the original project: Document Search Engine Architecture.
Our system incorporates a 3-datanode HDFS cluster, specifically designed for the storage of original files awaiting processing. This arrangement guarantees both dependable data storage and easy access.
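To illustrate how raw files might be organized on the cluster, here is a minimal sketch of a date-partitioned destination path. The root directory `/data/raw` and the partitioning scheme are assumptions for illustration, not the project's actual layout:

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

# Hypothetical layout: /data/raw/<YYYY>/<MM>/<DD>/<filename>.
# The root directory and partitioning scheme are assumptions,
# not taken from the project itself.
def hdfs_destination(filename: str, when: datetime) -> str:
    """Build the HDFS path under which a raw file would be stored."""
    return str(PurePosixPath("/data/raw") / f"{when:%Y/%m/%d}" / filename)

print(hdfs_destination("report.pdf", datetime(2024, 1, 15, tzinfo=timezone.utc)))
# -> /data/raw/2024/01/15/report.pdf
```

Partitioning by ingestion date keeps directories small and makes it easy to reprocess or expire a day's worth of files at a time.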
Our solution leverages two distinct Apache Tika servers to streamline content extraction. One of these servers is enhanced with OCR capabilities, enabling us to extract text from images and scanned PDF documents, thus expanding our processing capabilities.
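The routing between the two servers can be sketched as a simple MIME-type dispatch. The server URLs and the set of OCR-bound types below are assumptions (routing all PDFs to the OCR server is a simplification; a real pipeline might first check whether a PDF already contains a text layer):

```python
import mimetypes

# Hypothetical server URLs; the real hosts and ports depend on the deployment.
TIKA_PLAIN = "http://tika:9998/tika"
TIKA_OCR = "http://tika-ocr:9998/tika"

# MIME types assumed to need the OCR-enabled server (images, scanned PDFs).
OCR_TYPES = {"image/png", "image/jpeg", "image/tiff", "application/pdf"}

def tika_endpoint(filename: str) -> str:
    """Route a file to the plain or OCR-enabled Tika server by MIME type."""
    mime, _ = mimetypes.guess_type(filename)
    return TIKA_OCR if mime in OCR_TYPES else TIKA_PLAIN

print(tika_endpoint("scan.png"))   # routed to the OCR server
print(tika_endpoint("notes.txt"))  # routed to the plain server
```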
The SFTP server marks the beginning of our ETL process. It serves as the gateway where a microservice deposits files into a communal directory. These files are then continuously monitored and picked up by a dedicated NiFi processor.
Central to our system is the NiFi ETL process. It efficiently transports files to the HDFS directory, determines their MIME type, and executes HTTP requests to the relevant Apache Tika server for the extraction of metadata and textual content. This extracted information is then preserved in a MongoDB collection, and updates on the process state are sent to Kafka.
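The record produced by this step can be sketched as follows. The field names and the `EXTRACTED` state label are illustrative assumptions, not the project's actual schema:

```python
import json
from datetime import datetime, timezone

# Sketch of the record the ETL step might persist to MongoDB; the field
# names are illustrative, not the project's actual schema.
def build_document(hdfs_path: str, mime_type: str, metadata: dict, text: str) -> dict:
    """Assemble the MongoDB document produced after Tika extraction."""
    return {
        "hdfs_path": hdfs_path,
        "mime_type": mime_type,
        "metadata": metadata,
        "content": text,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

def status_event(hdfs_path: str, state: str) -> bytes:
    """Serialize a processing-state update for the Kafka topic."""
    return json.dumps({"file": hdfs_path, "state": state}).encode("utf-8")

doc = build_document("/data/raw/report.pdf", "application/pdf",
                     {"Author": "jane"}, "extracted text...")
event = status_event("/data/raw/report.pdf", "EXTRACTED")
```

Keeping the extracted text and the Tika metadata in one document means the downstream Logstash pipeline can ship both to Elasticsearch in a single pass.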
We employ a Logstash pipeline to synchronize our data with Elasticsearch, thereby augmenting our search capabilities. This integration is crucial as MongoDB does not inherently provide extensive search features.
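A minimal sketch of such a pipeline, using the community `logstash-input-mongodb` plugin, might look like this. The connection strings, collection name, and index name are assumptions, not the project's actual configuration:

```conf
input {
  mongodb {
    uri                => "mongodb://mongo:27017/documents"
    collection         => "extracted"
    placeholder_db_dir => "/usr/share/logstash/mongo_state"
  }
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "documents"
  }
}
```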
For the purpose of data exploration and visualization, we utilize Kibana. This tool offers an intuitive interface for the analysis and interpretation of indexed data.
To enhance the power and flexibility of our document search engine, we've established a FastAPI connection to Elasticsearch. This integration is a key part of our architecture, enabling swift and sophisticated search queries across the indexed data.
FastAPI, known for its high performance and ease of use, provides an optimal framework for building APIs. It facilitates the creation of robust, efficient endpoints that interact seamlessly with Elasticsearch. This means our system can handle complex search operations, offering features like full-text search, real-time indexing, and analytics capabilities.
This connection not only speeds up the data retrieval process but also allows for the implementation of advanced search functionalities. Users can perform detailed queries, benefiting from Elasticsearch's powerful search and analytics engine, directly through the FastAPI endpoints. This setup makes it easier to query and retrieve documents based on a wide range of criteria, enhancing the overall user experience in our document search engine system.
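As a sketch, the query body an endpoint might build for a user's search term could look like this. The field names (`content`, `metadata.Author`, `metadata.title`) are assumptions about the indexed document shape, not the project's actual mapping:

```python
# Sketch of the Elasticsearch query body a search endpoint might build.
# The searched field names are assumptions about the index mapping.
def build_search_body(term: str, size: int = 10) -> dict:
    """Full-text query across extracted text and selected metadata fields."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": term,
                "fields": ["content", "metadata.Author", "metadata.title"],
            }
        },
        "highlight": {"fields": {"content": {}}},
    }

# A FastAPI route would typically forward this body to Elasticsearch, e.g.:
#   @app.get("/search")
#   async def search(q: str):
#       return es.search(index="documents", body=build_search_body(q))
```

The `highlight` clause asks Elasticsearch to return matching snippets from the extracted text, which the frontend can display alongside each hit.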
To efficiently manage and start the various components of our Document Search Engine Architecture, we use a Makefile with several commands. Here's how to use them:
- **Start Services:** Run `make up` to launch all services (Zookeeper, Kafka, Elasticsearch, and more) and display their access URLs.
- **Change Permissions for SFTP:** Run `make chmod-sftp` if you need to modify permissions for the SFTP uploads directory.
- **Execute Commands in Docker Container:** Use `make exec-logstash` to run specific commands inside the Logstash Docker container.
- **Run API:** Use `make run-api` to install dependencies and start the Python app in the API directory. This will also display the documentation access URL.
- **Start Frontend Application:** Run `make run-front` to initiate the frontend application using Docker Compose. The command will print the URL where the frontend is accessible.
- **Clean Up:** Run `make clean-all` to tear down all services and remove unused Docker images. This command ensures that all containers, volumes, and images are appropriately removed.
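As a rough sketch of how two of these targets might be written (the service URLs and compose flags are assumptions based on the commands described above, not the project's actual Makefile):

```make
up:
	docker compose up -d
	@echo "Elasticsearch: http://localhost:9200"
	@echo "Kibana:        http://localhost:5601"

clean-all:
	docker compose down --volumes --remove-orphans
	docker image prune -f
```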
Finally, a React-based web application offers a seamless user experience for searching and interacting with the data processed by our system.
Each component plays a crucial role in ensuring our data processing system is comprehensive, efficient, and user-friendly.