REPL not identifying custom transformers #43

Open
ansaardollie opened this issue Jun 4, 2024 · 7 comments
Labels
question Further information is requested

Comments

@ansaardollie

Hi there,

I am trying to implement my own custom transformer named customtransformer. However, when I run the ?: command in the REPL, it doesn't pick these transformers up.

I have a file called custom_transformers.jl which has the following content:

using DataToolkitBase

function storage(provider::DataStorage{:customtransformer}, ::Type{IO}; write::Bool)
    ...
end

function load(provider::DataLoader{:customtransformer}, io::IO, sink::Type)
    ...
end

function save(provider::DataStorage{:customtransformer}, io::IO, tbl)
    ...
end

I execute this file in the current Julia session and then run

using DataToolkit

Then when trying to list the transformers (using the ?: command in the DataRepl) my custom transformer never shows up.

What is the process to let DataToolkit.jl know about these custom transformers?

@tecosaur
Owner

tecosaur commented Jun 5, 2024

Currently the transformer list command only knows about transformers it's been explicitly told about. See what's currently done in DataToolkitCommon:

append!(DataToolkitCore.TRANSFORMER_DOCUMENTATION,
[(:storage, :filesystem) => FILESYSTEM_DOC,
(:storage, :git) => GIT_DOC,
(:storage, :null) => NULL_S_DOC,
(:storage, :passthrough) => PASSTHROUGH_S_DOC,
(:storage, :raw) => RAW_DOC,
(:storage, :s3) => S3_DOC,
(:storage, :web) => WEB_DOC,
(:loader, :arrow) => ARROW_DOC,
(:loader, :chain) => CHAIN_DOC,
(:loader, :gzip) => COMPRESSION_DOC,
(:loader, :zlib) => COMPRESSION_DOC,
(:loader, :deflate) => COMPRESSION_DOC,

(NB: DataToolkitBase has been renamed to DataToolkitCore in the development version)

I don't currently see a nicer way of fetching the documentation, but I think I could probably check for undocumented transformers and mention them at the end of ?:. How does that sound?

@tecosaur
Owner

tecosaur commented Jun 5, 2024

I'm also planning on improving the docs a bit to make this a bit easier/soften the learning curve 🙂

@tecosaur tecosaur added the question Further information is requested label Jun 5, 2024
@ansaardollie
Author

Hi

Completely understand regarding the documentation for the REPL. No worries, I've realized I mis-explained the real issue.

Out of interest, have there been any major changes between v0.9.x and v0.10? I ask because my initial thought, in trying to get a handle on how everything works, was just to get dummy transformers working and see if the toolkit could recognize them. I have since at least gotten the system to pick up the driver names in Data.toml; however I keep getting errors along the lines of:

UnsatisfyableTransformer: There are no storages for "cars" that can provide a .
 The defined storages are as follows:
   DataStorage{web}(IO)

I am trying to implement a Parquet driver, but I run into the error above. My basic approach thus far has been to create a Julia package and define all the loader logic inside it (the loader is the only transformer I've actually needed, since I can get the Parquet files over HTTPS).

I've tried following the approach of the example on this page

My package file src/dtk_data.jl

module dtk_data
using DataToolkit, DataToolkitBase, DataToolkitCommon, DataFrames

export load, supportedtypes, create

function __init__()
    @addpkg Parquet2 "98572fba-bba0-415d-956f-fa77e587d26d"
    @addpkg DataFrames "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
end


function load(loader::DataLoader{:parquet}, io::IO, ::Type{DataFrame})
    @import Parquet2
    @import DataFrames
    return Parquet2.Dataset(io) |> DataFrames.DataFrame
end


supportedtypes(::Type{DataLoader{:parquet}}) =
    [QualifiedType(:DataFrames, :DataFrame)]


create(::Type{DataLoader{:parquet}}, source::String) =
    !isnothing(match(r"\.parquet$"i, source))

end # module dtk_data



Then I open a Julia session in the root directory of this package and run the following code:

include("src/dtk_data.jl")

using .dtk_data

using DataToolkit

loadcollection!("Data.toml")

d"cars"

And my Data.toml has the following setup

data_config_version = 0
uuid = "74641622-11fb-438b-b7be-4626639b8eac"
name = "dtk_data"
plugins = ["store", "defaults", "memorise"]


[[cars]]
uuid = "a6cee431-bfa1-4690-b8f3-51de93d970f5"

    [[cars.storage]]
    url = "https://github.com/ansaardollie/dtk_data/raw/main/MT%20cars.parquet"
    type = "Base.IO"
    driver = "web"


    [[cars.loader]]
    driver = "parquet"
    type = "DataFrames.DataFrame"  

Then I get the following error

ERROR: UnsatisfyableTransformer: There are no storages for "cars" that can provide a .
 The defined storages are as follows:
   DataStorage{web}(IO)
Stacktrace:
  [1] _read(dataset::DataToolkitBase.DataSet, as::Type)
    @ DataToolkitBase ~\.julia\packages\DataToolkitBase\LJn9B\src\interaction\externals.jl:253
  [2] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::@Kwargs{})   
    @ Base .\essentials.jl:887
  [3] invokelatest(::Any, ::Any, ::Vararg{Any})
    @ Base .\essentials.jl:884
  [4] invokepkglatest(::Any, ::Any, ::Vararg{Any}; kwargs::@Kwargs{})
    @ DataToolkitBase ~\.julia\packages\DataToolkitBase\LJn9B\src\model\usepkg.jl:101
  [5] invokepkglatest(::Any, ::Any, ::Vararg{Any})
    @ DataToolkitBase ~\.julia\packages\DataToolkitBase\LJn9B\src\model\usepkg.jl:100
  [6] (::DataToolkitBase.AdviceAmalgamation)(::Function, ::Any, ::Vararg{Any}; kwargs...)
    @ DataToolkitBase ~\.julia\packages\DataToolkitBase\LJn9B\src\model\advice.jl:102
  [7] (::DataToolkitBase.AdviceAmalgamation)(::Function, ::Any, ::Vararg{Any})
    @ DataToolkitBase ~\.julia\packages\DataToolkitBase\LJn9B\src\model\advice.jl:98
  [8] macro expansion
    @ ~\.julia\packages\DataToolkitBase\LJn9B\src\model\advice.jl:131 [inlined]
  [9] _dataadvisecall(::typeof(DataToolkitBase._read), ::DataToolkitBase.DataSet, ::Type{…}; kwargs::@Kwargs{})
    @ DataToolkitBase ~\.julia\packages\DataToolkitBase\LJn9B\src\model\advice.jl:131
 [10] read(dataset::DataToolkitBase.DataSet)
    @ DataToolkitBase ~\.julia\packages\DataToolkitBase\LJn9B\src\interaction\externals.jl:160
 [11] macro expansion
    @ ~\.julia\packages\DataToolkit\VObGv\src\DataToolkit.jl:48 [inlined]
 [12] top-level scope
    @ REPL[5]:1
Some type information was truncated. Use `show(err)` to see complete types.

Any help would be appreciated. I would love to be able to get a Parquet driver working so I can hopefully contribute it, if you'd like.

@tecosaur
Owner

tecosaur commented Jun 5, 2024

Out of interest, have there been any major changes between v0.9.x and v0.10?

Yup! I'm making a few major changes (a changelog probably wouldn't hurt 😅), such as:

  • Bumping the minimum Julia to 1.9 and embracing package extensions
  • Replacing @import with @require: d9e9226
  • Removal of the SmallDict type (the original issue is a lot better with Memory in 1.11): 98a6723
  • Improved type inference/logic: 389df28
  • Separate the REPL mode and Store out into new packages
  • Rename DataToolkitBase to DataToolkitCore
  • Split DataToolkit into a more user-facing DataToolkit and package-facing (new) DataToolkitBase
  • Move all these packages into a monorepo (here)
  • Improved load time and precompilation
  • Support for opening files as a FilePathsBase.AbstractPath
  • Support for (basic) S3 downloads
  • More image types (gif, webp)
  • Logging capability moved from DataToolkitCommon to DataToolkitCore. It works a bit differently (IMO, better) and is now configured by Preferences
  • WIP documentation improvements
  • TODO support for directories as well as files

Regarding the problem you've run into, it looks like you've given enough info for it to be a MWE. I'll see if I can give it a look in the next day or two, otherwise I'll probably get to it on the weekend 🙂.

@tecosaur
Owner

tecosaur commented Jun 6, 2024

however I keep getting errors along the lines of:

Good news, this error message is improved in 0.10-dev 🙂

UnsatisfyableTransformer: There are no loaders for "cars" that can provide a DataFrames.DataFrame.

More good news: I think you'll find this works if you actually import the functions you want to overload.

- export load, supportedtypes, create
+ import DataToolkitBase: load, supportedtypes, create
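As a rough illustration of why the import matters, here is a minimal sketch in plain Julia with no DataToolkit packages involved. The modules `Upstream`, `WithoutImport`, and `WithImport` and the helper `readvalue` are hypothetical stand-ins invented for this example: `Upstream` plays the role of the package that owns `load` and calls it internally. A `function load(...)` definition in a module that has not imported `load` creates a separate function with the same name, so the owning package never sees the new method.

```julia
# Hypothetical stand-in for the package that owns the generic `load` function.
module Upstream
export load
load(x) = "fallback"
# Internal caller: it only ever dispatches on Upstream.load.
readvalue(x) = load(x)
end

# No import: this defines a brand new function WithoutImport.load.
# Exporting it does not connect it to Upstream.load.
module WithoutImport
export load
load(x::Int) = "custom"
end

# Explicit import: this adds a method to the existing Upstream.load.
module WithImport
import ..Upstream: load
load(x::String) = "custom"
end

# The two `load` names refer to different functions in the first case
# and to the same function in the second.
@assert WithoutImport.load !== Upstream.load
@assert WithImport.load === Upstream.load

# Upstream never sees the Int method, but it does see the String method.
@assert Upstream.readvalue(1) == "fallback"
@assert Upstream.readvalue("a") == "custom"
```

Writing the definition with a qualified name (for example `Upstream.load(x::String) = ...`) is an equivalent alternative to the `import` line.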

It would be great to see a Parquet driver. I should have some docs on adding a loader to DataToolkitCommon in the next week or so.

@ansaardollie
Author

Awesome, thank you so much for the update.

Out of interest, how would one add the v0.10-dev versions of the packages to a Julia environment using the monorepo link?

@tecosaur
Owner

I'll warn that development is slightly volatile at the moment, but if you want to play around with v0.10-dev then once JuliaLang/Pkg.jl#4026 is fixed you can add this to your Project.toml with Julia 1.11/1.12:

[deps]
DataToolkit = "dc83c90b-d41d-4e55-bdb7-0fc919659999"
DataToolkitBase = "e209d0c3-e863-446f-9b45-de6ca9730756"
DataToolkitCommon = "9e6fccbf-6142-406a-aa4c-75c1ae647f53"
DataToolkitCore = "caac3e55-418c-402e-a061-64d454aa8f4f"
DataToolkitREPL = "c58528a0-97a2-40a0-9a44-056fe1196995"
DataToolkitStore = "082ec3c2-3fb3-458f-ad22-5e5e31d4377a"

[sources]
DataToolkit = {rev = "main", subdir = "Main", url = "https://github.com/tecosaur/DataToolkit.jl.git"}
DataToolkitBase = {rev = "main", subdir = "Base", url = "https://github.com/tecosaur/DataToolkit.jl.git"}
DataToolkitCommon = {rev = "main", subdir = "Common", url = "https://github.com/tecosaur/DataToolkit.jl.git"}
DataToolkitCore = {rev = "main", subdir = "Core", url = "https://github.com/tecosaur/DataToolkit.jl.git"}
DataToolkitREPL = {rev = "main", subdir = "REPL", url = "https://github.com/tecosaur/DataToolkit.jl.git"}
DataToolkitStore = {rev = "main", subdir = "Store", url = "https://github.com/tecosaur/DataToolkit.jl.git"}
