Creating custom collection #1

Open
tino097 opened this issue Jun 12, 2024 · 10 comments

Comments

@tino097

tino097 commented Jun 12, 2024

I'm trying to implement the ICollection interface with the following code:

    # ICollection
    def get_collection_factories(self) -> dict[str, CollectionFactory]:

        return {
            'my-users': lambda n, p, **kwargs: cu.ApiListCollection(
                    n,
                    p,
                    data_factory=cu.ApiListData(action='user_list'))
        }


class UserSerializer(cu.Serializer):

    def serialize(self) -> list[dict[str, str]]:
        result = []
        for user in self.attached.data:
            result.append({
                "id": user.get("id"),
                "name": user.get("name"),
                "email": user.get("email"),
                "fullname": user.get("fullname")
            })
        return result


class UserCollection(cu.ApiCollection):

    def __init__(self, name: str, params: dict, **data_settings):
        super().__init__(name, params, **data_settings)
        self.serializer_factory = UserSerializer
        self.data_settings = data_settings


# def my_factory(name: str, params: dict, **kwargs: Any) -> UserCollection:
#     return UserCollection(name, params, **kwargs)

This collection gets registered and I'm able to select it in the explorer component, but when it is selected I get the following error:

    data_factory=cu.ApiListData(action='user_list'))
TypeError: Data.__init__() missing 1 required positional argument: 'obj'

What am I missing to initialize the ApiListCollection?

Thanks

@smotornyuk
Member

Hi @tino097, sorry for the delay

Your error is caused by data_factory. It must be the class itself, not an instance of it, and its parameters need to be passed as data_settings. So, the full version of my-users is:

{
  'my-users': lambda n, p, **kwargs: cu.ApiListCollection(
    n,
    p,
    data_factory=cu.ApiListData,
    data_settings={"action": 'user_list'})
}
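
The difference between passing a class and passing an instance can be reproduced without CKAN. The sketch below uses hypothetical stand-ins (Data, ListData, Collection) that only mimic the behaviour described above; they are not the ckanext-collection classes, and the real constructors may differ.

```python
from __future__ import annotations

from typing import Any


class Data:
    """Stand-in data factory: needs the owning collection as `obj`."""

    def __init__(self, obj: Any, **settings: Any):
        self.attached = obj
        self.settings = settings


class ListData(Data):
    """Stand-in for an action-backed data factory."""


class Collection:
    """Stand-in collection that instantiates the factory itself."""

    def __init__(self, name: str, params: dict, **kwargs: Any):
        self.name = name
        self.params = params
        data_factory = kwargs.get("data_factory", Data)
        data_settings = kwargs.get("data_settings", {})
        # The collection calls the factory and passes itself as `obj`.
        self.data = data_factory(self, **data_settings)


# Correct: pass the class and put its parameters into data_settings.
ok = Collection(
    "my-users", {}, data_factory=ListData, data_settings={"action": "user_list"}
)

# Incorrect: calling the class yourself fails, because `obj` is missing.
try:
    ListData(action="user_list")
except TypeError as err:
    error_message = str(err)
```

In this toy version the collection is the only place where the factory is called, which is why the factory has to arrive as a class.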

I'll rewrite the README and add examples of collection creation at the beginning, before diving into the internals. If you can give me a couple of use cases, they would be good material for the documentation.

@smotornyuk
Member

And regarding UserCollection at the end of your code snippet: most likely, you want to register a collection that uses a custom serializer, and assigning data_settings is accidental.

For such a situation, where you only want to replace a factory, you can omit the constructor and assign the factory to the corresponding attribute:

class UserCollection(cu.ApiCollection):
    SerializerFactory = UserSerializer

The signature of the collection constructor is def __init__(self, name: str, params: dict, **kwargs):. The important part here is kwargs: note that it's not data_settings.

For example, when you build a collection with Collection(n, p, data_settings={}), I imagine you want to access this data_settings, right? In this case, data_settings is kept inside kwargs:

class MyCollection(cu.ApiCollection):

    def __init__(self, name: str, params: dict, **kwargs):
        super().__init__(name, params, **kwargs)
        
        print("THIS IS DATA SETTINGS ->", kwargs.get("data_settings"))

And, with these comments in mind, we can try building your collection. If you just want a user list that filters users by the q parameter and displays only id, name, email, and fullname, you need the following code:

    # ICollection
    def get_collection_factories(self) -> dict[str, CollectionFactory]:

        return {
            'my-users': MyUserCollection,
        }


##### your implementation of UserSerializer is left unchanged ####

# ApiList and Api collections just override the data factory. We are going to do it
# ourselves, so there will be no difference if we just use a simple collection as a base class
class MyUserCollection(cu.Collection):

    # Data.with_attributes defines an anonymous class with specific attributes
    # overridden. If you are not going to use your custom data factory
    # elsewhere, this is the shortest possible syntax
    DataFactory = cu.ApiListData.with_attributes(action="user_list")

    SerializerFactory = UserSerializer

BTW, in your initial implementation, instead of cu.ApiListData(action='user_list'), which created an object and caused the error, you could use cu.ApiListData.with_attributes(action='user_list'), which creates a new class with a fixed value of action.
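
To make "creates a new class" concrete, here is a minimal sketch of how a with_attributes-style classmethod could work. It is an assumption for illustration only: the thread does not show how ckanext-collection implements it, and the Data class below is a stand-in, not the library's.

```python
from __future__ import annotations

from typing import Any


class Data:
    """Stand-in base class; `action` is a class-level setting."""

    action: str | None = None

    @classmethod
    def with_attributes(cls, **attributes: Any) -> type:
        # Build an anonymous subclass whose class attributes are overridden.
        # The base class itself is left untouched.
        return type(cls.__name__, (cls,), attributes)


# The result is still a class, so it can be used wherever a factory is expected.
UserListData = Data.with_attributes(action="user_list")
```

Because the result is a subclass rather than an instance, the collection can still instantiate it later with whatever arguments it needs.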

@tino097
Author

tino097 commented Jun 24, 2024

Thanks @smotornyuk

@tino097
Author

tino097 commented Jul 1, 2024

Hey @smotornyuk

from ckanext.collection import internal, types
ImportError: cannot import name 'internal' from 'ckanext.collection' 

I've pulled latest from master

@smotornyuk
Member

smotornyuk commented Jul 2, 2024

Thanks. I forgot to commit internal.py. Now it's added to the repo, so the issue should be fixed in the latest commit.

BTW, I'm rewriting the documentation. At the moment, I have finished the pages above the red line
image
Everything below the red line is still in a draft state.

Mainly, I'm trying to explain things gradually, with more examples. And there is one change: instead of importing everything like import ckanext.collection.utils as cu, it's recommended to import the shared module and access items from it.

from ckanext.collection.shared import collection, data, serialize

# and use it like below
collection.Collection
data.ApiSearchData
serialize.CsvSerializer

@tino097
Author

tino097 commented Jul 17, 2024

To confirm: if I want to have custom data, would I need to create my own action that returns the desired information?

Or could I use the ModelCollection for that purpose?

@smotornyuk
Member

Using ModelCollection is more efficient, but there are certain disadvantages.

If you use ModelCollection with a specific model from CKAN, you'll get all the records from the DB. Imagine that you create a ModelCollection for model.Package: you'll get public, private, deleted, and draft datasets at once. If you are showing this collection to admins only, it's OK. If you are filtering results from the collection before showing it to an anonymous user, it is also OK. But it's your responsibility to protect private data and show the collection only to people with the required access level.

If you are using API action instead of the model, all restrictions are handled inside the action. If you use ApiSearchCollection that takes data from package_search, package_search is called with the current user and gives you back only datasets that are accessible by the current user.

So, the answer is:

  • if only trusted users see the collection: use ModelCollection
  • if anyone sees the collection and you already have an action that hides private data: use the action
  • if anyone sees the collection and you don't have an action: you can either create an action and use it with the API collection, or extend ModelCollection and filter data inside it. As you have to implement this filtering logic anyway, it doesn't really matter where it is done.
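
If you take the model route for a page that anyone can see, the filtering you become responsible for could look roughly like the sketch below. It is plain Python for illustration: Record, visible_to and the is_sysadmin flag are hypothetical names, not CKAN or ckanext-collection API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Record:
    name: str
    private: bool
    state: str  # e.g. "active", "deleted", "draft"


def visible_to(records: Iterable[Record], is_sysadmin: bool) -> Iterator[Record]:
    """Yield only the records the viewer is allowed to see."""
    for record in records:
        if is_sysadmin:
            # Trusted users get everything, as with a raw model query.
            yield record
        elif not record.private and record.state == "active":
            # Everyone else only sees public, active records.
            yield record


records = [
    Record("public-ds", private=False, state="active"),
    Record("private-ds", private=True, state="active"),
    Record("deleted-ds", private=False, state="deleted"),
]
```

With an API action the same rule lives inside the action; with a model-based collection it has to live in your data factory or in the code that displays the collection.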

@tino097
Author

tino097 commented Jul 18, 2024

My use cases are reports within CKAN, for example:

  • Get all users and have column for organization / group membership or any showcase or apps
  • Report for datasets by user in orgs or groups

So there would be filtering and restrictions over some of the data, but I'm trying to figure out what the right path would be.

Thanks again

@smotornyuk
Member

Cool, another example for when I continue updating the documentation.

Here you can use models directly. It doesn't sound like you'll be able to use the API actions that collect this data anywhere else, so creating them is of little value. Here's the code that creates a collection of every user. The collection contains the user's ID, name, full name, and all groups + organizations of the user.

Example
from __future__ import annotations

import sqlalchemy as sa
from ckan import model

from ckanext.collection.shared import collection, data, serialize


# aliases that are required to select data from the same model twice: for the
# `groups` column and for the `organizations` column.
stmt_groups = sa.alias(model.Group, "groups")
stmt_orgs = sa.alias(model.Group, "organizations")


# Data factory that executes SQLAlchemy statement to compute data
# records. StatementSaData accepts `statement` attribute(sqlalchemy.sql.Select
# instance) and uses this statement to fetch data from DB. This is a low-level
# data factory that can be used when you need a Collection over arbitrary SQL
# query. I do not recommend using ModelData here, because ModelData is
# optimized for work with a single model, while here we have to combine data
# from User, Member and Group models.
#
# I'm using CLS.with_attributes(...) here, but if you read documentation, you
# already know that it's the same as if I defined class:
#
# >>> class UserData(data.StatementSaData):
# >>>     statement = sa.select(...) # and here goes the whole value of the statement attribute.
# 
UserData = data.StatementSaData.with_attributes(
    statement=sa.select(
        model.User.id,
        model.User.name,
        model.User.fullname,
        sa.func.string_agg(stmt_groups.c.name, ",").label("groups"),
        sa.func.string_agg(stmt_orgs.c.name, ",").label("organizations"),
    )
    .outerjoin(
        model.Member,
        sa.and_(
            model.User.id == model.Member.table_id,
            model.Member.table_name == "user",
        ),
    )
    .outerjoin(
        stmt_groups,
        sa.and_(
            stmt_groups.c.id == model.Member.group_id, stmt_groups.c.type == "group"
        ),
    )
    .outerjoin(
        stmt_orgs,
        sa.and_(
            stmt_orgs.c.id == model.Member.group_id, stmt_orgs.c.type == "organization"
        ),
    )
    .group_by(model.User)
)


# the collection itself. As you can see, the heavy work is done by data factory.
class UserCollection(collection.Collection):
    DataFactory = UserData
    # I don't know what format of report you are going to use, so let's choose CSV
    SerializerFactory = serialize.CsvSerializer


# initialize a collection    
users = UserCollection()

# transform it into CSV
print(users.serializer.serialize())
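
If you want to see the shape of that query without a CKAN database, here is a self-contained sketch using the standard-library sqlite3 module. The table names (users, grp, member) and sample rows are made up for illustration, and SQLite's group_concat stands in for PostgreSQL's string_agg; it is not what the collection executes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript(
    """
    CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT, fullname TEXT);
    CREATE TABLE grp (id TEXT PRIMARY KEY, name TEXT, type TEXT);
    CREATE TABLE member (table_id TEXT, table_name TEXT, group_id TEXT);

    INSERT INTO users VALUES ('u1', 'alice', 'Alice A'), ('u2', 'bob', 'Bob B');
    INSERT INTO grp VALUES
        ('g1', 'climate', 'group'),
        ('o1', 'city-hall', 'organization');
    INSERT INTO member VALUES ('u1', 'user', 'g1'), ('u1', 'user', 'o1');
    """
)

# Same shape as the statement above: one membership join, then the group
# table joined twice, once per type. Aggregates skip NULLs, so each column
# only collects names of the matching type.
rows = conn.execute(
    """
    SELECT u.id, u.name, u.fullname,
           group_concat(g.name, ',') AS "groups",
           group_concat(o.name, ',') AS "organizations"
    FROM users AS u
    LEFT JOIN member AS m ON u.id = m.table_id AND m.table_name = 'user'
    LEFT JOIN grp AS g ON g.id = m.group_id AND g.type = 'group'
    LEFT JOIN grp AS o ON o.id = m.group_id AND o.type = 'organization'
    GROUP BY u.id
    ORDER BY u.name
    """
).fetchall()

report = {row["name"]: (row["groups"], row["organizations"]) for row in rows}
```

Users without any membership still appear in the result because of the outer joins; their aggregated columns are simply empty (NULL).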

To add filters to the collection, we need to modify the data factory. It will be converted into a standard class (instead of using .with_attributes). The value of statement is not changed. statement defines the baseline of the source data - it must include as much data as possible. Filters will be applied by defining the statement_with_filters method.

Example
class UserData(data.StatementSaData):
    # statement is not changed
    statement = ...
 
    # this method is responsible for filtration. It's called automatically,
    # accepts `statement` of data factory and must return statement with
    # filters applied
    def statement_with_filters(self, stmt: sa.sql.Select) -> sa.sql.Select:
        # `self.attached` is a reference to collection that holds data
        # factory. `params` attribute contains data from the second argument
        # passed to the collection constructor
        params = self.attached.params

        # let's filter by exact match when using name
        if "name" in params:
            stmt = stmt.where(stmt.selected_columns["name"] == params["name"])

        # fullname will use case-insensitive substring match
        if "fullname" in params:
            fullname = params["fullname"]
            stmt = stmt.where(stmt.selected_columns["fullname"].ilike(f"%{fullname}%"))

        # groups/organizations are filtered the same way as fullname. But
        # you'll probably use something more sophisticated
        for group_type in ["groups", "organizations"]:
            if group_type not in params:
                continue
            value = params[group_type]
            stmt = stmt.having(stmt.selected_columns[group_type].contains(value))

        return stmt

# this class remains unchanged
class UserCollection(collection.Collection):
    ...


# `params` used by `statement_with_filters` is a dictionary
# passed as a second argument to collection constructor. You can build html-form,
# submit it and extract data from `ckan.plugins.toolkit.request.args`. This value
# is a good candidate for `params`
users = UserCollection("", {"name": "default"})

# transform it into CSV
print(users.serializer.serialize())
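
Building params from a submitted form might look like the sketch below. build_params and ALLOWED_FILTERS are hypothetical helpers, and a plain dict stands in for ckan.plugins.toolkit.request.args; the idea is only to pass along the filters that statement_with_filters understands and to drop empty values.

```python
from __future__ import annotations

from typing import Mapping

# Filter names understood by statement_with_filters in the example above.
ALLOWED_FILTERS = ("name", "fullname", "groups", "organizations")


def build_params(args: Mapping[str, str]) -> dict[str, str]:
    """Keep known, non-empty filters from submitted form data."""
    params: dict[str, str] = {}
    for key in ALLOWED_FILTERS:
        value = args.get(key, "").strip()
        if value:
            params[key] = value
    return params


# A plain dict stands in for request.args here.
submitted = {"name": "default", "fullname": "  ", "page": "2"}
params = build_params(submitted)
```

The result can then be passed as the second argument of the collection, as in UserCollection("", params).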

And here's the distribution of datasets created by users in different organizations/groups, defined in the same manner:

Example
from __future__ import annotations

import sqlalchemy as sa
from ckan import model

from ckanext.collection.shared import collection, data, serialize

# aliases that are required to join the Member model twice: once for package
# membership and once for user membership.
package_membership = sa.alias(model.Member)
user_membership = sa.alias(model.Member)


class GroupStatsData(data.StatementSaData):
    # baseline statement: datasets per group/organization and user
    statement = (
        sa.select(
            model.Group.name.label("group_name"),
            model.Group.title,
            model.Group.type,
            sa.func.count(model.Package.id).label("number of datasets"),
            model.User.name.label("user_name"),
        )
        .join(user_membership, model.Group.id == user_membership.c.group_id)
        .join(model.User, model.User.id == user_membership.c.table_id)
        .join(package_membership, model.Group.id == package_membership.c.group_id)
        .join(model.Package, model.Package.id == package_membership.c.table_id)
        .where(
            model.User.state == "active",
            model.Package.state == "active",
            model.Group.state == "active",
        )
        .group_by(model.User, model.Group)
    )


class GroupStatsCollection(collection.Collection):
    DataFactory = GroupStatsData
    SerializerFactory = serialize.CsvSerializer


stats = GroupStatsCollection()
print(stats.serializer.serialize())

@smotornyuk
Member

smotornyuk commented Jul 18, 2024

Here's an implementation of the first collection using an API action, just for reference. In this case, all the logic goes into the action and the collection becomes slim. You may find this style more readable, as you are more used to API actions.

Example
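from __future__ import annotations

from typing import Any

import sqlalchemy as sa

import ckan.plugins.toolkit as tk
from ckan import model
from ckan.types import Context

from ckanext.collection.shared import collection, data, serialize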

# action definition
@tk.side_effect_free
def my_user_listing(context: Context, data_dict: dict[str, Any]) -> dict[str, Any]:
    tk.check_access("my_user_listing", context, data_dict)

    # ApiSearchData uses package_search-style parameter names: `rows` is the
    # limit and `start` is the offset.
    rows = tk.asint(data_dict.get("rows", 10))
    start = tk.asint(data_dict.get("start", 0))

    stmt = sa.select(model.User)

    total = model.Session.scalar(sa.select(sa.func.count()).select_from(stmt.subquery()))

    stmt = stmt.limit(rows).offset(start)

    # ApiSearchData expects package_search-like result, with `results` and
    # `count` keys
    return {
        "results": [
            {
                "id": user.id,
                "name": user.name,
                "fullname": user.fullname,
                "groups": user.get_group_ids("group"),
                "organizations": user.get_group_ids("organization"),
            }
            for user in model.Session.scalars(stmt)
        ],
        "count": total,
    }

UserData = data.ApiSearchData.with_attributes(action="my_user_listing")

class UserCollection(collection.Collection):
    DataFactory = UserData
    SerializerFactory = serialize.CsvSerializer


users = UserCollection()
print(users.serializer.serialize())
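
The contract between the action and ApiSearchData described in the comments above (rows/start in, results/count out) can be sketched in plain Python. paginate is a hypothetical helper that works on an in-memory list; it only illustrates the expected shape of the result, not how the action queries the database.

```python
from __future__ import annotations

from typing import Any, Sequence


def paginate(items: Sequence[Any], data_dict: dict[str, Any]) -> dict[str, Any]:
    """Return a package_search-like page: `results` plus the total `count`."""
    rows = int(data_dict.get("rows", 10))
    start = int(data_dict.get("start", 0))
    return {
        # `count` is the size of the whole listing, not of the current page.
        "count": len(items),
        "results": list(items[start : start + rows]),
    }


page = paginate(["u1", "u2", "u3", "u4", "u5"], {"rows": 2, "start": 2})
```

Note that count is computed before limit and offset are applied, which mirrors the order of operations in the action above.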
