OddpubMetrics Docker image and Database Table Generation #15

Open
wants to merge 21 commits into base: main
Changes from 13 commits
Commits (21):
1182a9c
Add oddpub_metrics table and model definition
quang-ng Dec 11, 2024
580d6c4
create Oddpub API
quang-ng Dec 11, 2024
ac53e92
Enhance Dockerfile and pyproject.toml for R support and author attrib…
quang-ng Dec 11, 2024
157a217
Update pyproject.toml to include pandas and numpy dependencies
quang-ng Dec 11, 2024
05180ed
Refactor Dockerfile and update pyproject.toml to enhance R package in…
quang-ng Dec 11, 2024
f20241a
Add FastAPI application and Docker setup for ODDPub processing
quang-ng Dec 12, 2024
d7e2115
Enhance logging and update PDF processing in Oddpub application
quang-ng Dec 13, 2024
1bb19ec
Add OddpubWrapper class for PDF processing and API integration
quang-ng Dec 13, 2024
ae524a7
Update Oddpub metrics to allow nullable fields and enhance OddpubWrap…
quang-ng Dec 13, 2024
789cc2d
Refactor R result conversion in OddpubWrapper to use pandas2ri for im…
quang-ng Dec 13, 2024
9943c5f
Add unit tests for OddpubWrapper and refactor test setup in RTranspar…
quang-ng Dec 13, 2024
7ce9b16
Refactor OddpubWrapper class documentation to clarify its purpose as …
quang-ng Dec 13, 2024
f20dbdb
Implement a new test for OddpubWrapper to validate PDF processing wit…
quang-ng Dec 18, 2024
92dd494
remove superfluous oddpub deps
leej3 Dec 18, 2024
e694cbe
Revert name of service in docker compose
quang-ng Dec 19, 2024
eca7643
Update README.md to enhance Docker setup instructions and API usage e…
quang-ng Dec 19, 2024
3ccece3
Refactor Dockerfile: removing unnecessary dependencies for oddpub and…
quang-ng Dec 19, 2024
dca4d6d
Update README.md to correct Docker run command and API access URL for…
quang-ng Dec 19, 2024
4af8cd1
Update Dockerfile to install pdftotext utility for PDF processing
quang-ng Dec 19, 2024
11f9ef0
update mock env files
quang-ng Dec 19, 2024
8f11c26
Added platform to postgres-compose.yaml
joshlawrimore Dec 20, 2024
10 changes: 9 additions & 1 deletion .docker/postgres-compose.yaml
@@ -1,5 +1,5 @@
 services:
-  postgres:
+  postgres-dsst:
     image: postgres:15
     container_name: postgres_db
     env_file: "../.env"
@@ -13,5 +13,13 @@ services:
       timeout: 5s # Added missing timeout value
       retries: 5

+  oddpub-dsst:
Review comment (Contributor): If we add all the services to this file, we should move it to the root directory and rename it to compose.yaml. Alternatively, we could launch the container using python-on-whales, or add separate compose files to this directory.

+    build:
+      context: ../services/oddpub
+      dockerfile: Dockerfile
+    container_name: oddpub_service
+    ports:
+      - "8071:8071"
+
 volumes:
   postgres_data:
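On the reviewer's python-on-whales suggestion: a minimal sketch of launching one service from its own compose file, assuming a layout with one compose file per service under `.docker/`. The helper names `compose_file_for` and `start_service` and the per-service file naming are hypothetical; only `postgres-compose.yaml` exists in this PR. `DockerClient(compose_files=...)` and `docker.compose.up(detach=True)` are python-on-whales calls.

```python
from pathlib import Path


def compose_file_for(service: str, docker_dir: Path = Path(".docker")) -> Path:
    """Hypothetical convention: .docker/<service>-compose.yaml per service."""
    return docker_dir / f"{service}-compose.yaml"


def start_service(service: str) -> None:
    """Start the given service in the background via Docker Compose."""
    # Imported lazily so this module can be imported without python-on-whales installed.
    from python_on_whales import DockerClient

    docker = DockerClient(compose_files=[str(compose_file_for(service))])
    docker.compose.up(detach=True)  # equivalent to `docker compose up -d`
```

Calling `start_service("postgres")` would then bring up the existing Postgres compose file; whether this is preferable to a single root-level compose.yaml is a design decision for the maintainers.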
4 changes: 3 additions & 1 deletion .mockenv
@@ -15,4 +15,6 @@ NCBI_API_KEY=

 S3_BUCKET_NAME=osm-pdf-uploads
 HOSTNAME=localhost
-USERNAME=quang
+USERNAME=quang
+
+ODDPUB_HOST_API=http://localhost:80
64 changes: 64 additions & 0 deletions alembic/versions/600039d1785e_add_field_in_oddpub_is_nullable.py
@@ -0,0 +1,64 @@
"""add field in OddPub is nullable

Revision ID: 600039d1785e
Revises: 832c238c1be7
Create Date: 2024-12-13 17:00:30.689184

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision: str = '600039d1785e'
down_revision: Union[str, None] = '832c238c1be7'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    op.alter_column('oddpub_metrics', 'article',
                    existing_type=sa.VARCHAR(),
                    nullable=True)
    op.alter_column('oddpub_metrics', 'is_open_data',
                    existing_type=sa.BOOLEAN(),
                    nullable=True)
    op.alter_column('oddpub_metrics', 'is_reuse',
                    existing_type=sa.BOOLEAN(),
                    nullable=True)
    op.alter_column('oddpub_metrics', 'is_open_code',
                    existing_type=sa.BOOLEAN(),
                    nullable=True)
    op.alter_column('oddpub_metrics', 'is_open_data_das',
                    existing_type=sa.BOOLEAN(),
                    nullable=True)
    op.alter_column('oddpub_metrics', 'is_open_code_cas',
                    existing_type=sa.BOOLEAN(),
                    nullable=True)
    # ### end Alembic commands ###


def downgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    op.alter_column('oddpub_metrics', 'is_open_code_cas',
                    existing_type=sa.BOOLEAN(),
                    nullable=False)
    op.alter_column('oddpub_metrics', 'is_open_data_das',
                    existing_type=sa.BOOLEAN(),
                    nullable=False)
    op.alter_column('oddpub_metrics', 'is_open_code',
                    existing_type=sa.BOOLEAN(),
                    nullable=False)
    op.alter_column('oddpub_metrics', 'is_reuse',
                    existing_type=sa.BOOLEAN(),
                    nullable=False)
    op.alter_column('oddpub_metrics', 'is_open_data',
                    existing_type=sa.BOOLEAN(),
                    nullable=False)
    op.alter_column('oddpub_metrics', 'article',
                    existing_type=sa.VARCHAR(),
                    nullable=False)
    # ### end Alembic commands ###
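The migration above relaxes the NOT NULL constraints so that a row can be stored even when a value is missing. A minimal sketch of the difference, using an in-memory SQLite database from the standard library rather than the project's PostgreSQL setup; the table names `strict_metrics` and `relaxed_metrics` are hypothetical stand-ins for `oddpub_metrics` before and after the migration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-ins for the column definition before and after the migration.
conn.execute(
    "CREATE TABLE strict_metrics (id INTEGER PRIMARY KEY, is_open_data BOOLEAN NOT NULL)"
)
conn.execute(
    "CREATE TABLE relaxed_metrics (id INTEGER PRIMARY KEY, is_open_data BOOLEAN)"
)


def try_insert(table: str, value) -> bool:
    """Return True if the row was stored, False if a constraint rejected it."""
    try:
        conn.execute(f"INSERT INTO {table} (is_open_data) VALUES (?)", (value,))
        return True
    except sqlite3.IntegrityError:
        return False


strict_ok = try_insert("strict_metrics", None)    # rejected by the NOT NULL constraint
relaxed_ok = try_insert("relaxed_metrics", None)  # accepted, stored as NULL
```

Note that the downgrade path restores NOT NULL, which would fail on PostgreSQL if any rows with NULL values have been written in the meantime.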
52 changes: 52 additions & 0 deletions alembic/versions/832c238c1be7_add_oddpub_metrics_table.py
@@ -0,0 +1,52 @@
"""add oddpub_metrics table

Revision ID: 832c238c1be7
Revises: 52101c205c9d
Create Date: 2024-12-11 15:18:24.714630

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision: str = '832c238c1be7'
down_revision: Union[str, None] = '52101c205c9d'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    op.create_table('oddpub_metrics',
        sa.Column('id', sa.Integer(), nullable=False),
        sa.Column('article', sa.String(), nullable=False),
        sa.Column('is_open_data', sa.Boolean(), nullable=False),
        sa.Column('open_data_category', sa.String(), nullable=True),
        sa.Column('is_reuse', sa.Boolean(), nullable=False),
        sa.Column('is_open_code', sa.Boolean(), nullable=False),
        sa.Column('is_open_data_das', sa.Boolean(), nullable=False),
        sa.Column('is_open_code_cas', sa.Boolean(), nullable=False),
        sa.Column('das', sa.String(), nullable=True),
        sa.Column('open_data_statements', sa.String(), nullable=True),
        sa.Column('cas', sa.String(), nullable=True),
        sa.Column('open_code_statements', sa.String(), nullable=True),
        sa.Column('work_id', sa.Integer(), nullable=True),
        sa.Column('provenance_id', sa.Integer(), nullable=True),
        sa.Column('document_id', sa.Integer(), nullable=True),
        sa.ForeignKeyConstraint(['document_id'], ['documents.id'], name=op.f('fk_oddpub_metrics_document_id_documents')),
        sa.ForeignKeyConstraint(['provenance_id'], ['provenance.id'], name=op.f('fk_oddpub_metrics_provenance_id_provenance')),
        sa.ForeignKeyConstraint(['work_id'], ['works.id'], name=op.f('fk_oddpub_metrics_work_id_works')),
        sa.PrimaryKeyConstraint('id', name=op.f('pk_oddpub_metrics'))
    )
    op.create_index(op.f('ix_oddpub_metrics_article'), 'oddpub_metrics', ['article'], unique=True)
    # ### end Alembic commands ###


def downgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    op.drop_index(op.f('ix_oddpub_metrics_article'), table_name='oddpub_metrics')
    op.drop_table('oddpub_metrics')
    # ### end Alembic commands ###
20 changes: 20 additions & 0 deletions dsst_etl/models.py
@@ -250,3 +250,23 @@ class RTransparentPublication(Base):

    work_id = Column(Integer, ForeignKey("works.id"), nullable=True)
    provenance_id = Column(Integer, ForeignKey("provenance.id"), nullable=True)


class OddpubMetrics(Base):
    __tablename__ = "oddpub_metrics"

    id = Column(Integer, primary_key=True)
    article = Column(String, unique=True, nullable=True, index=True)
    is_open_data = Column(Boolean, nullable=True, default=False)
    open_data_category = Column(String)
    is_reuse = Column(Boolean, nullable=True, default=False)
    is_open_code = Column(Boolean, nullable=True, default=False)
    is_open_data_das = Column(Boolean, nullable=True, default=False)
    is_open_code_cas = Column(Boolean, nullable=True, default=False)
    das = Column(String)
    open_data_statements = Column(String)
    cas = Column(String)
    open_code_statements = Column(String)
    work_id = Column(Integer, ForeignKey("works.id"), nullable=True)
    provenance_id = Column(Integer, ForeignKey("provenance.id"), nullable=True)
    document_id = Column(Integer, ForeignKey("documents.id"), nullable=True)
73 changes: 73 additions & 0 deletions dsst_etl/oddpub_wrapper.py
@@ -0,0 +1,73 @@
import logging
from pathlib import Path

import requests
from sqlalchemy.orm import Session

from dsst_etl.models import OddpubMetrics

from .config import config

logger = logging.getLogger(__name__)


class OddpubWrapper:
    """
    Wrapper class for the ODDPub API.
    """

    def __init__(
        self,
        db_session: Session = None,
        work_id: int = None,
        document_id: int = None,
        oddpub_host_api: str = config.ODDPUB_HOST_API,
    ):
        """
        Initialize the OddpubWrapper.

        Args:
            db_session (Session, optional): SQLAlchemy database session
            work_id (int, optional): ID of the work being processed
            document_id (int, optional): ID of the document being processed
            oddpub_host_api (str): Base URL of the ODDPub API service
        """
        try:
            self.oddpub_host_api = oddpub_host_api
            self.db_session = db_session
            self.work_id = work_id
            self.document_id = document_id
            logger.info("Successfully initialized OddpubWrapper")
        except Exception as e:
            logger.error(f"Failed to initialize OddpubWrapper: {str(e)}")
            raise

    def process_pdfs(self, pdf_folder: str) -> OddpubMetrics:
        """
        Process PDFs through the complete ODDPub workflow and store results in database.

        Args:
            pdf_folder (str): Path to folder containing PDF files

        Returns:
            OddpubMetrics: Results of open data analysis
        """
        try:
            # Iterate over each PDF file in the folder
            for pdf_file in Path(pdf_folder).glob("*.pdf"):
                with open(pdf_file, "rb") as f:
                    # Use requests to call the API
                    response = requests.post(
                        f"{self.oddpub_host_api}/oddpub", files={"file": f}
                    )
                    response.raise_for_status()

                r_result = response.json()
                oddpub_metrics = OddpubMetrics(**r_result)
                oddpub_metrics.work_id = self.work_id
                oddpub_metrics.document_id = self.document_id
                self.db_session.add(oddpub_metrics)
                self.db_session.commit()
Review comment (Contributor): Can you provide some output to the user here to confirm that the tool has run correctly, and perhaps show the results concisely?

Review comment (Contributor): Also, perhaps we should get user confirmation before uploading to or updating the database, and have a -y flag for those who are sure they want to upload?

        except Exception as e:
            logger.error(f"Error in PDF processing workflow: {str(e)}")
            self.db_session.rollback()
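On the reviewer's request for concise user-facing output: a minimal sketch of a summary helper, assuming the service returns one flat JSON object whose keys match the `OddpubMetrics` columns. The `OddpubMetrics(**r_result)` call in the wrapper suggests this, but the response shape is not shown in this diff. The names `FLAG_FIELDS` and `summarize_result` are hypothetical.

```python
# Boolean columns of OddpubMetrics, as defined in dsst_etl/models.py in this PR.
FLAG_FIELDS = (
    "is_open_data",
    "is_open_code",
    "is_reuse",
    "is_open_data_das",
    "is_open_code_cas",
)


def summarize_result(pdf_name: str, result: dict) -> str:
    """Build a one-line summary of the boolean flags in an ODDPub result."""
    parts = []
    for field in FLAG_FIELDS:
        value = result.get(field)  # None when the service omitted the field
        shown = "n/a" if value is None else ("yes" if value else "no")
        parts.append(f"{field}={shown}")
    return f"{pdf_name}: " + ", ".join(parts)


line = summarize_result("paper.pdf", {"is_open_data": True, "is_open_code": False})
print(line)
```

A line like this could be printed or logged once per PDF after the commit, so the user sees which files were processed and what was detected.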
16 changes: 16 additions & 0 deletions scripts/run_oddpub.py
@@ -0,0 +1,16 @@
import argparse
from dsst_etl import get_db_engine
from dsst_etl.db import get_db_session
from dsst_etl.oddpub_wrapper import OddpubWrapper

def main():
    parser = argparse.ArgumentParser(description="Process PDFs with OddpubWrapper")
    parser.add_argument('pdf_folder', type=str, help='Path to the folder containing PDF files')
    args = parser.parse_args()

    oddpubWrapper = OddpubWrapper(get_db_session(get_db_engine()))
    oddpubWrapper.process_pdfs(args.pdf_folder)

if __name__ == "__main__":
    main()
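On the reviewer's suggestion of a confirmation step with a -y flag: a minimal sketch of how the script's argument parser could be extended, under the assumption that confirmation happens once before any database write. The `--yes` long option and the helpers `build_parser` and `confirm_upload` are hypothetical and not part of this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Process PDFs with OddpubWrapper")
    parser.add_argument("pdf_folder", type=str, help="Path to the folder containing PDF files")
    # Hypothetical flag: skip the interactive confirmation before writing to the database.
    parser.add_argument("-y", "--yes", action="store_true",
                        help="Upload results without asking for confirmation")
    return parser


def confirm_upload(assume_yes: bool, ask=input) -> bool:
    """Return True if results should be written to the database."""
    if assume_yes:
        return True
    answer = ask("Upload results to the database? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}


args = build_parser().parse_args(["./pdfs", "-y"])
```

Passing `ask` as a parameter keeps the prompt testable without patching `input`; the default answer is "no", so an accidental Enter does not trigger an upload.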

57 changes: 57 additions & 0 deletions services/oddpub/README.md
@@ -0,0 +1,57 @@
# Oddpub API

This is a FastAPI application for processing PDF files using the oddpub functions: `pdf_convert`, `pdf_load`, and `open_data_search`.

## Project structure
services/
└── oddpub/
    ├── dockerfile
    ├── main.py
    ├── pyproject.toml
    └── README.md

## Requirements

- Python 3.11
- Docker (optional, for containerized deployment)

## Setup

1. **Clone the repository**:
```bash
git clone https://github.com/nimh-dsst/dsst-etl.git
cd dsst-etl/services/oddpub
```

2. **Install dependencies**:
If you are using Poetry:
```bash
poetry install
```

3. **Run the application**:
```bash
uvicorn main:app --reload
```

## Usage

- Upload a PDF file with a POST request to `http://localhost:8000/oddpub`; the API responds with JSON output.

## Docker

To build and run the application using Docker:

1. **Build the Docker image**:
```bash
docker build -t oddpub-api .
```

2. **Run the Docker container**:
```bash
docker run -p 80:80 oddpub-api
```

## License

This project is licensed under the MIT License.
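The README's usage and Docker instructions, the compose file, and `.mockenv` in this diff refer to different ports (8000, 80, and 8071), so clients are less error-prone if they build the endpoint from a single configured base URL. A minimal sketch, assuming the `ODDPUB_HOST_API` variable from `.mockenv` holds the base URL; the helper name `oddpub_endpoint` and the fallback to `http://localhost:8000` (uvicorn's default port) are assumptions.

```python
import os
from typing import Optional


def oddpub_endpoint(host_api: Optional[str] = None) -> str:
    """Return the full URL of the /oddpub route for a given base URL."""
    # Assumed fallback: uvicorn's default address when no base URL is configured.
    base = host_api or os.environ.get("ODDPUB_HOST_API", "http://localhost:8000")
    return base.rstrip("/") + "/oddpub"


local_url = oddpub_endpoint("http://localhost:8071")
```

The resulting URL is what `OddpubWrapper.process_pdfs` posts each PDF to, using the multipart field name `file`.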
4 changes: 4 additions & 0 deletions services/oddpub/_entrypoint.sh
@@ -0,0 +1,4 @@
#!/bin/bash
source /opt/conda/etc/profile.d/conda.sh
conda activate osm
exec "$@"