Design, implement, and utilize an object-oriented pipeline architecture based on prior reference #1
Comments
@yambottle, @iamamutt, @dimitri-yatsenko... Just briefly capturing some of the first observations from our review meeting last week:
@apjanke and I had a good further discussion this week. Stay tuned for more feedback shortly!
Thanks Vijay! I've updated these points that you suggested last time. I'll update the pending PR tomorrow or early next Monday.
I'd recommend using something short and distinctive. The two important things about the top-level package name are:

1. It should try to be unique across the whole world of Matlab code, to avoid collisions: the Matlab package namespace is a single global one, and who knows what other code a user of your library might be pulling in from elsewhere? There's no central name registry for Matlab M-code package names to coordinate this name selection, so you just have to guess and do your best. Try to pick a name that nobody else would choose on their own. (I bet the Neuropixels probe manufacturer might well pick the obvious name themselves.)

2. The first couple of characters should be as unique and distinguishing as possible amongst all global Matlab identifiers (the set of package names, function names, and class names), to make tab completion work well for users, so your users can just type a short prefix, hit Tab, and find your top-level package name quickly.

The "ne" prefix has a ton of matches; the "np" prefix namespace is wide open.
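To picture the naming advice above, here is a hypothetical package-folder sketch; the name `npx` is purely an illustrative placeholder, not a name actually chosen in this thread:

```matlab
% Hypothetical on-path layout for a top-level package folder ("npx" is an
% illustrative placeholder, not the actual recommendation):
%
%   +npx/
%       Pipeline.m
%       Config.m
%
% Users then refer to classes with the package-qualified name, so typing
% "npx" and pressing Tab surfaces the whole toolkit at once:
p = npx.Pipeline();
cfg = npx.Config();
```

A short, distinctive prefix keeps the tab-completion candidate list small, which is the point being made above.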
Thanks for your suggestion, Andrew! I'll update the package name.
Hi, folks! A couple of thoughts on the initial design:

Minor things

In the Class Diagram, the Config -> Pipeline relationship is labeled as (1, 0..1). Should that be 1-to-* instead, since a Config can produce multiple Pipelines? Or should I expect that a Config only ever produces one Pipeline?

I read the Use Case diagram as saying that a Pipeline can have multiple Stages, and a Stage can have multiple probes, which I assume correspond to Jobs. I think that means the "->" associations in the class diagram from Pipeline -> Stage and Stage -> JobBase should be 1-to-*, not 1-to-0..1. Or does a Stage only have one Job, and one Job can act on multiple Probes?

The top-level package name in the Class Diagram should probably be updated to match the new package name, too.

Design Patterns

Like Vijay mentioned related to pipes and filters, I think Gang of Four-style Software Design Patterns are a useful framework for thinking about this design. Looking at the initial class design at https://github.com/vathes/neuropixels-toolkit, I'd say it employs the Strategy pattern - a Job object (like CatGT, KiloSort, etc.) is effectively a Strategy that is plugged into a sequence in the Stage object - and a Stage effectively follows the Visitor pattern to fire off a sequence of jobs. I don't think this implies any changes to make now; it's just a useful way of thinking about and discussing things.

Assembly class

In the use case diagram, it looks like the process of fanning processing out to multiple pipelines is called "assemble". There's no corresponding class in the class design. I think adding an Assembler or Assembly class that manages multiple, possibly concurrent, pipelines would be a good thing to add. The initial implementation of the "assemble" operation might be just a simple loop over the pipelines.

Or am I misreading that Use Case diagram? Does it mean that the assemble() operation is what produces a Pipeline from a Config in the first place? The fact that the arrow from "make Config objects" fans out to multiple Pipelines - does this mean that multiple pipelines exist and can run in parallel in one Matlab project?

Logging

This is a relatively complex project: it involves incorporating the library code and maybe user-supplied code, and it runs nontrivial batch processing. I think you might want to have some logging in here. Matlab doesn't provide a logging framework, so you'll need to bring your own. You could roll a custom one from scratch, or you could take an existing FLOSS Matlab logging library like log4m or my own SLF4M. log4m is a simple, fairly minimal logging framework written in native Matlab. SLF4M is a more powerful, but more heavyweight, framework that sits on top of the Java SLF4J and log4j logging libraries. That could be an issue: using SLF4M introduces a dependency on Java. Not a big deal now, but MathWorks has announced that future versions of Matlab will not ship with a bundled Java JVM, so users will have to install and reference their own. (This event is called "Jexit".) Given that, you may not want to make neuropixels-toolkit dependent on Java, so SLF4M is probably the wrong choice.

Using third-party libraries inside other libraries in Matlab is kind of a challenge: Matlab doesn't have a package manager or dependency management tool for handling this. And you definitely don't want to tell your users "also go install this logging library and get it on your Matlab path before neuropixels-toolkit will work". So you'll want to redistribute the logging framework library with neuropixels-toolkit itself by incorporating its source code into the neuropixels-toolkit repo. (This is sometimes called "vendoring" a dependency.) Just grabbing the log4m library and including it in a subdirectory of the repo might be the simplest approach.
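To make the Assembly idea above concrete, here is a minimal sketch of such a class. All names here are hypothetical, and `runAll()` is assumed to exist on the Pipeline class; the serial loop is the "simple initial implementation", which a parallel version could replace later:

```matlab
% Hypothetical Assembler that owns multiple Pipelines. Names and methods
% are illustrative only, not part of any existing design.
classdef Assembler < handle
    properties
        pipelines = {}   % cell array of Pipeline objects
    end
    methods
        function addPipeline(obj, p)
            obj.pipelines{end+1} = p;
        end
        function assemble(obj)
            % Initial implementation: run each pipeline in sequence.
            % A later version might use parfor or batch jobs for concurrency.
            for i = 1:numel(obj.pipelines)
                obj.pipelines{i}.runAll();
            end
        end
    end
end
```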
A note on GitHub workflows: the https://github.com/vathes/neuropixels-toolkit repo is a GitHub fork of this neuropixels-toolkit repo. You might want to use a non-fork repo as the main place for your in-progress work, so you can make a bunch of commits without having to worry about keeping your commit history tidy, and then use a fork repo as a place for "staging" Pull Requests with a tidier commit history and organization.
Thanks to @apjanke for all his input! I quite like the idea of having an Assembly class, i.e. adding one more level of hierarchy to the class diagram. It might be a place to implement parallel computing aspects, as Andrew notes. For this cycle's goals, it might be primarily scaffolding. One thing to consider is whether adding such a class might replace the need for a static utilities class. Note the Assembly class could have static methods, in addition to per-assembly methods.

Regarding the logging suggestions, I guess this came up in reaction to the logging function noted in the current utility static class. That's beyond the scope of the current agreement, though we could consider it as another option for the stretch goals this cycle. There are various tools to build from for implementation, as Andrew mentions. Besides this, one more general thing to consider is whether the logging implementation would be distributed across the various classes. If so, this might figure into the class design; e.g., each class might have a logger property.
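One way "logging distributed across the classes" could look, sketched with a hypothetical logger factory; none of these names (`npx.internal.getLogger`, the `info` method) are from an actual design or library:

```matlab
% Sketch: each class holds its own logger handle, obtained from a factory
% provided by whatever logging library is chosen. getLogger and info() are
% stand-ins for that library's API, not real npxtoolkit functions.
classdef Stage < handle
    properties (Access = protected)
        logger   % logger instance for this class
    end
    methods
        function obj = Stage()
            obj.logger = npx.internal.getLogger('npx.Stage');
        end
        function run(obj)
            obj.logger.info('Stage starting');
            % ... run the stage's jobs ...
            obj.logger.info('Stage finished');
        end
    end
end
```

Keeping the logger behind a single factory call would let the team swap logging frameworks later without touching call sites.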
Thanks @vijayiyer05 @apjanke for all your suggestions! I've made several changes in the above PR; it responds to most of your requests.
You may want to take a look at the diagrams again first. There are a few requests that I haven't done yet:
Cool! Can you talk a bit about why you chose the name "Session" instead of "Pipeline"? I actually thought "Pipeline" was a pretty good name; maybe I'm not familiar enough with the domain terminology to understand why "Session" is better. All the other changes you've made so far look straightforwardly good to me.

About the "non-fork repo" thing: if you just email GitHub Support and ask them to "convert my vathes/neuropixels-toolkit repo from a fork to a 'canonical' repo", they can just flip a switch on the back end and your repo becomes a non-fork repo with zero work for you. Then you can create a separate "PR staging" fork repo later when you're ready to send code upstream.

About logging: the big thing you need to decide on is what hierarchy of logger names you're going to use for your messages. Often people just pick the package-qualified class names of the objects sending the messages as the logger names. But that's a bit of an issue for a reusable library like npxtoolkit: some classes are part of the public API, but some are going to be internal implementation details, and you don't want to leak those names via logger names. So you might want to choose logger names by mapping private/internal classes onto classes that are part of the public API, or just define a logger name hierarchy that's similar to, but independent of, your package & class name hierarchy.
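The "map internal classes onto public logger names" idea could be sketched like this; the class names and mapping here are invented purely for illustration:

```matlab
% Hypothetical helper that keeps internal class names out of logger names.
function name = loggerNameFor(className)
    % Internal implementation classes log under the nearest public-API
    % class, so logger names never leak private identifiers.
    internalToPublic = containers.Map( ...
        {'npx.internal.CatGTRunner', 'npx.internal.KilosortRunner'}, ...
        {'npx.Stage', 'npx.Stage'});
    if isKey(internalToPublic, className)
        name = internalToPublic(className);
    else
        name = className;   % public classes log under their own names
    end
end
```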
Oh, I totally overlooked something: naming conventions for class/method/property names! In the class diagram you have, you're using snake_case for method and property names. What little contemporary OOP Matlab stuff I've seen lately seems to have settled on Java-style camelCase for method and property names. That's also what the existing Neuropixel Utils code is currently using. Maybe switch to camelCase? For example, "stageInfo", "currentJob", "addStage" instead of "stage_info", "current_job", and "add_stage".
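The suggested rename, shown side by side; the member names are the ones from the comment above, while the surrounding class context is hypothetical:

```matlab
% camelCase member names (previous snake_case names shown in comments).
classdef Pipeline < handle
    properties
        stageInfo    % was: stage_info
        currentJob   % was: current_job
    end
    methods
        function addStage(obj, s)   % was: add_stage
            % ... append the stage to the pipeline ...
        end
    end
end
```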
@apjanke Thanks for following up!
@apjanke I definitely apologize for the hodgepodge of camelCase and snake_case throughout the original neuropixel_utils code. It's a bit of my switching rapidly between Python / Julia and MATLAB showing through. Glad to see so much progress on a (far better designed) neuropixel toolkit!
Hi @apjanke, about the 'non-fork' repo: apart from the tidy commit history, are there any other concerns that you have? If tidy commit history is the only concern, I think squash and merge may be the best solution for us.
Good question! I think we can favor what we think is clearest, rather than mirroring the names from the reference architecture. For me "Pipeline -> Stage" seems for sure the clearest for the middle two levels. I actually think "Session" might work better than "Assembly" at the outer level, but I could go either way. On the inner level, I suggest going with "Task" over "Job". The parallelization implementation is TBD later -- could be with the batch submission approach, for instance.
Thanks for your update! @vijayiyer05 Sure, I'll finally go with "Session->Pipeline->Stage->Task". I'll check the batch submission approach as well. |
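A skeleton of the agreed "Session -> Pipeline -> Stage -> Task" containment might look like the following; each classdef would live in its own file inside the package folder (they are shown together for brevity), and the property names are placeholders:

```matlab
classdef Session < handle          % Session.m
    properties
        pipelines = {}   % Session -> 1..* Pipeline
    end
end

classdef Pipeline < handle         % Pipeline.m
    properties
        stages = {}      % Pipeline -> 1..* Stage
    end
end

classdef Stage < handle            % Stage.m
    properties
        tasks = {}       % Stage -> 1..* Task
    end
end
```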
After having a closer look at how Dan's own neuropixels-toolkit repo is set up, I think my concerns here were misplaced, and actually a fork repo is just fine. I had gotten the djoshea/neuropixels-toolkit and djoshea/neuropixel-utils repos confused, and thought that this vathes/neuropixels-toolkit was intended to be a major reorg of the neuropixel-utils repo under the "Toolkit" framework and would eventually produce a separate publishing destination, but actually the djoshea/neuropixels-toolkit repo is itself a separate repo set up specifically to receive the PRs from this one, and it will be the canonical destination for presenting the finished work to the public. So having vathes/neuropixels-toolkit be a fork of djoshea/neuropixels-toolkit is perfectly appropriate, and "squash and merge" is a good approach. Nothing to see here; move along. Sorry! :)
I like this naming choice, too. Makes good sense to me from a purely programmer's perspective.
Oh! I just ran into this article yesterday, which explicitly lays out some things about logging that I agree with, but which were just inarticulately sitting around in the back of my head. Have a read through and tell me what you think: https://tuhrig.de/my-logging-best-practices/ From the perspective of Neuropixels Toolkit, the researchers/neuroscience people running it are the "business" users who'll be reading the INFO-level logs, and the developers working on Neuropixels Toolkit/Utils itself, or IT staff supporting researchers' use of the Neuropixels-related code, are the "technology" people who'll be reading DEBUG-level logs. The big exception here is that because npxtoolkit stuff will be dispatching long-running tasks in interactive Matlab sessions, I think it should probably log both before and after doing stuff.
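The "log both before and after" pattern for long-running work could look like this; the `logger` object, the `task` object, and the message wording are all hypothetical stand-ins:

```matlab
% Bracket a long-running task with INFO-level messages so an interactive
% user can see both dispatch and completion (plus elapsed time).
logger.info(sprintf('Dispatching KiloSort task for probe %d ...', probeId));
t0 = tic;
task.run();
logger.info(sprintf('KiloSort task for probe %d finished in %.1f s', ...
    probeId, toc(t0)));
```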
@apjanke Nice article, thanks for sharing! The INFO/DEBUG level separation is what I always do as well. I'll take your suggestion to log before and after. |
@apjanke @vijayiyer05 This PR has been updated based on our discussion, please take a look. Let me know if you want me to squash and merge it, thanks!
The Janelia implementation of the Allen Institute's pipeline is the suggested prior reference architecture: https://github.com/jenniferColonell/ecephys_spike_sorting
As discussed at the kick-off meeting, this issue captures the "abstract" component of the implementation, whereas #2 captures the first concrete implementation.
For this issue, you might consider the "pipes and filters" ideas from the design patterns community (mostly Java programmers in my experience). You might also consider ideas from the dataflow programming community.
As mentioned, I could also imagine a good solution as being quite lightweight and straightforward:
- a `run()` method for each stage
- a `runAll()` method
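The lightweight approach above could be sketched as follows; names other than `run()` and `runAll()` are placeholders:

```matlab
% Minimal sketch: each stage object exposes run(), and the pipeline's
% runAll() simply iterates over the stages in order.
classdef Pipeline < handle
    properties
        stages = {}   % cell array of stage objects, each with a run() method
    end
    methods
        function addStage(obj, s)
            obj.stages{end+1} = s;
        end
        function runAll(obj)
            for i = 1:numel(obj.stages)
                obj.stages{i}.run();
            end
        end
    end
end
```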