Document Object is required before uploading a file whereas Noark 5 requires a file at all times #285
Comments
[ivaylomitrev]
Is it possible to allow documents to be uploaded _prior_ to creating a
document object. This way, the Noark 5 requirements will be met by the
following flow:
1. Upload document
2. Create document object referencing the uploaded document
Nope, there is no such mechanism described in the API specification. I
personally believe uploading and creating a dokumentobjekt should be
done in one step, as suggested in
<URL: #25 >,
but it will have to wait. The change we got for the current version is in
<URL: ba1e63e >.
…--
Happy hacking
Petter Reinholdtsen
|
Thanks for the confirmation! Uploading and creating a document object in one step will satisfy the Noark 5 requirements (and, thus, ours). I might have to follow up on this one in the near future, as it heavily affects our own implementation of the API. By saying that it will have to wait, do you know if there's an ongoing process for fixing outstanding issues and releasing a new version of the API in the short term? |
[ivaylomitrev]
By saying that it will have to wait, do you know if there's an ongoing
process for fixing outstanding issues and releasing a new version of
the API in the short-term?
Not quite sure what your definition of short term is, but the editors
are working on and off on the specification, and we have an editorial
meeting planned in a few weeks. No idea if we will agree on wrapping up
a new release any time soon, but I will at least argue that it is a good
idea. :)
As always, patches and suggestions to improve the specification text are
welcome. :)
…--
Happy hacking
Petter Reinholdtsen
|
That might prove difficult with my (non-existent) Norwegian skills, but I will try where possible :) |
[ivaylomitrev]
That might prove difficult with my (non-existent) Norwegian skills,
but I will try where possible :)
I would be happy to set up a translation framework for the Noark 5
Tjenestegrensesnitt specification if someone wants to translate the text
to English, like I have done for the Noark 5 standard text. Let me know
if someone wants to translate it to English, and I can spend some time
setting it up on <URL: https://hosted.weblate.org/ >. It is quite a lot
of work to translate such texts, so I have not had the motivation to
start myself.
Regarding providing patches, if you cannot draft texts, perhaps you can
proofread proposals and provide insights to improve them? I'll try to
find time to draft a proposal for a unified file upload and
dokumentobjekt creation in the next few days, for a future edition of
the specification. I have some ideas on how to do it in a backwards
compatible way.
…--
Happy hacking
Petter Reinholdtsen
|
Note: this proposal is based on ideas in issue arkivverket#25. It also resolves challenges discussed in issue arkivverket#285.
I guess it is time to start discussing my old idea for uploading. The
idea is to upload the file as early as possible in the archiving
process, and then update the metadata that the system has the option to
derive from the uploaded file. The open questions are how high up in the
hierarchy it should be possible to do the upload, how to differentiate
autodetected metadata values from manually entered/edited/checked
metadata values, and how to return the list of automatically created
archive entities to the client to allow the client to present the
entities for validation and updates.
For example, it could be possible to upload a new document file into a
file (mappe), and create the entries for registrering,
dokumentbeskrivelse and dokumentobjekt automatically based on the
content of the uploaded file. The same could be done by uploading into
an existing registrering or dokumentbeskrivelse. It is just a question
of how much information we want to ensure is created before the file
upload. Perhaps it should be optional in the specification how "high"
in the hierarchy it should be possible to upload a new file? What about
container files like ZIP and TAR.GZ files? Uploading such a container
into Mappe or Registrering might create several
dokumentbeskrivelse+dokumentobjekt entries, while doing it in
Dokumentbeskrivelse might create only one dokumentobjekt entry.
Further, it is the question of what to return if several entities are
created in one upload request. The result could either simply be the
dokumentobjekt entry created (if only one is created), which will
contain parent links that can be used to update the other generated
entries. It can also be a dokumentobjekt entry with the created parent
entries in a '_embedded' block according to the JSON Hypertext
Application Language specification. Finally, it can be a list of
different object types formatted like a search result in a "results"
attribute. All these options would be consistent with the current
specification.
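To make two of these return shapes concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the function names, the systemID values and the href values are hypothetical and not taken from the specification; only the '_embedded' and "results" keys come from the text above.

```python
import json


def embed_created_parents(dokumentobjekt, created_parents):
    """Return the dokumentobjekt with auto-created parents under '_embedded'."""
    body = dict(dokumentobjekt)  # shallow copy, the input is left untouched
    if created_parents:
        # HAL maps a relation name to an embedded resource (or a list of them)
        body["_embedded"] = dict(created_parents)
    return body


def as_results_list(entities):
    """Return the created entities as a flat list in a 'results' attribute."""
    return {"results": list(entities)}


# Hypothetical entities; identifiers and hrefs are placeholders only.
dokumentobjekt = {
    "systemID": "hypothetical-dokumentobjekt-id",
    "_links": {"self": {"href": "/hypothetical/dokumentobjekt/1"}},
}
dokumentbeskrivelse = {
    "systemID": "hypothetical-dokumentbeskrivelse-id",
    "_links": {"self": {"href": "/hypothetical/dokumentbeskrivelse/1"}},
}

hal_body = embed_created_parents(
    dokumentobjekt, {"dokumentbeskrivelse": dokumentbeskrivelse}
)
print(json.dumps(hal_body, indent=2))
print(json.dumps(as_results_list([dokumentbeskrivelse, dokumentobjekt]), indent=2))
```

The first shape keeps the dokumentobjekt as the primary result, while the second treats all created entities as equals, which may be easier for clients when a container file produces many entries.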
Finally, there is the question of how to handle automatically detected
values, which in many cases should be manually checked by someone before
the archive entities are considered finalized. For some file formats
titles, authors, dates and other metadata can be extracted from the
file, but for others there is no such metadata available. There needs to
be a generic way to handle attributes that do not yet have a sensible
value, to ensure they can be tracked down and updated manually if
needed. Perhaps a "magic value" should be used to indicate that the
current value is automatically generated? Perhaps if the string starts
with ASCII value 26 (Substitute), it can be seen as a marker that the
archive client needs to update the value and remove the ASCII 26
character? <URL: https://en.wikipedia.org/wiki/Substitute_character >
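As a rough illustration of the ASCII 26 marker idea, here is a small Python sketch. The helper names and the example field names are hypothetical; nothing like this is defined in the specification today.

```python
SUB = "\x1a"  # ASCII 26, the Substitute control character


def mark_autodetected(value: str) -> str:
    """Prefix a value with SUB to flag it as automatically generated."""
    return value if value.startswith(SUB) else SUB + value


def is_autodetected(value: str) -> bool:
    """True if the value still carries the SUB marker."""
    return value.startswith(SUB)


def confirm(value: str) -> str:
    """Strip the marker once someone has checked the value."""
    return value[1:] if value.startswith(SUB) else value


def pending_fields(entity: dict) -> list:
    """List the string attributes that still need manual review."""
    return sorted(
        key
        for key, value in entity.items()
        if isinstance(value, str) and is_autodetected(value)
    )


# Hypothetical entity with one extracted and one manually entered value.
entity = {"tittel": mark_autodetected("Extracted title"), "forfatter": "Ola"}
print(pending_fields(entity))
```

A client could call something like pending_fields on each returned entity to find the values it needs to present for validation.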
I am drafting a specification update, but do not yet know which
path is the best way forward through this landscape. Would love input
from other users and implementors of the specification.
…--
Happy hacking
Petter Reinholdtsen
|
My immediate reaction to this (emphasis on immediate) is that the API specification should not bother with such details. As long as it provides a simple, generic way of uploading single documents that satisfies the requirements of the standard and, hopefully, all vendors, it should be sufficient. In other words, my impression is that the API specification should not overcomplicate the upload specification, especially considering that there are already ways of creating resources such as dokumentbeskrivelse, dokumentobjekt, etc.
I do not think new ways of creating these resources should be exposed, as this would allow vendors to go with very specific (bordering on custom) interpretations of what metadata should be extracted from an uploaded (archive) file. Of course, the metadata for dokumentbeskrivelse/dokumentobjekt can be specified with multipart requests, which would limit the amount of guesswork for the vendors, but uploading an arbitrary tarball would pose a lot of open questions to vendors as to the mapping of the data, as such (arbitrary) multipart requests would be bothersome to build.
It seems to me that this particular point boils down to what the goal is, because I see two separate topics here: bulk upload/creation and single file uploads. I would say resolving the latter is more important at this point, as it diverges from the requirements of the standard, whereas the former can always be added to the specification in a backwards-compatible way if a need arises for it. I would even argue that bulk upload should not be a topic of the API specification as long as it provides a means of single file uploads. The argument here is that uploading a tarball may mean different things to different business systems (clients), and having common vendor-specific processing of such tarballs may lead to more issues than it solves. Open topics off the top of my head are:
Reading the questions I posed above, I am thinking that a bulk upload specification would either have to be extremely configurable to satisfy the requirements of various client-side business systems, making it difficult for vendors to support, or it would have to be very lenient, allowing for a lot of interpretation and making it useless for clients.
I am not critiquing the idea in any way. I am convinced bulk upload would be a requirement for certain business systems/integrations. It is just my personal opinion that this is best left to the discretion of the clients that are responsible for the business logic of the corresponding business system, as the API specification would have to be either very limiting (posing issues for one or another existing vendor), or very lenient (making the implementation very vendor-specific), or overly configurable (making it difficult to implement and support, both for vendors and for the API specification).
EDIT: Of course, if there are actual requirements for bulk uploads from business systems/clients, maybe the best approach would be to gather those requirements from them. Until they are available, I believe bulk upload can go in too many directions and it might, unfortunately, be a guessing game as to what developers might need. |
…nstanser Allow upload directly from mappe, registrering, dokumentbeskrivelse and dokumentobjekt. Change from the current upload procedure, which has an intermediate step where the archive is in an incomplete state between the creation of the dokumentobjekt instance and the successful upload of the archive file, and instead allow a file to be uploaded directly from dokumentbeskrivelse, registrering and mappe. After upload, the newly created child instances are returned in _embedded, cf. JSON Hypertext Application Language. This proposal is based on ideas in issue arkivverket#25, and resolves challenges discussed in issue arkivverket#285.
…nstanser Allow upload directly from dokumentbeskrivelse and dokumentobjekt. This creates an alternative to the current upload procedure, which has an intermediate step where the archive is in an incomplete state between the creation of the dokumentobjekt instance and the successful upload of the archive file, by making it possible to upload a file directly from dokumentbeskrivelse in addition to from dokumentobjekt. This proposal is based on ideas in issue arkivverket#25, and reduces the challenges discussed in issue arkivverket#285.
…trering Changed to allow upload directly from mappe, registrering, dokumentbeskrivelse and dokumentobjekt. After upload, the newly created child instances are returned in _embedded, cf. JSON Hypertext Application Language. This proposal is based on ideas in issue arkivverket#25, and resolves challenges discussed in issue arkivverket#285.
…nstanser Allow upload directly from dokumentbeskrivelse and dokumentobjekt. This gives clients an alternative to the current upload procedure, which has an intermediate step where the archive is in an incomplete state between the creation of the dokumentobjekt instance and the successful upload of the archive file, by making it possible to upload a file directly from dokumentbeskrivelse in addition to from dokumentobjekt. When a dokumentobjekt is created automatically, the variantformat Produksjonsformat is used unless the API service recognizes an archive format. This proposal is based on ideas in issue #25, and reduces the challenges discussed in issue #285.
…nstanser Allow upload directly from dokumentbeskrivelse and dokumentobjekt. This creates an alternative to the current upload procedure, which has an intermediate step where the archive is in an incomplete state between the creation of the dokumentobjekt instance and the successful upload of the archive file, by making it possible to upload a file directly from dokumentbeskrivelse in addition to from dokumentobjekt. This proposal is based on ideas in issue #25, and reduces the challenges discussed in issue #285.
It occurred to me that a way to ensure consistency, and to avoid
dokumentobjekt instances without attached files being visible to
unsuspecting consumers of the API, is to delay the attachment of the
dokumentobjekt child entity to the dokumentbeskrivelse instance until
the file is successfully uploaded.
When creating a dokumentobjekt instance, the instance is returned with a
_links dictionary including both a self link and the
https://rel.arkivverket.no/noark5/v5/api/arkivstruktur/fil/ link,
so the program creating it can do the upload with the
information available. But the
https://rel.arkivverket.no/noark5/v5/api/arkivstruktur/dokumentobjekt/
key in the _links dictionary of the parent dokumentbeskrivelse instance
does not need to be presented to API consumers before the file is
uploaded.
This ensures programs uploading files can use the unfinished
dokumentobjekt instance without causing consistency problems for any
other API consumer.
As far as I can tell, this is allowed by both the Noark 5 specification
and the Noark 5 tjenestegrensesnitt specification, and would allow an
implementation to provide a consistent view without changing the current
API description.
Note, I still would like to handle uploads directly from registrering,
as proposed in #309. My point is that it is possible to avoid the
problem described in this issue without any changes to the API
specification.
…--
Happy hacking
Petter Reinholdtsen
|
…en fil. Noark 5 requires a file to be linked to every dokumentobjekt instance, but there will be a period between the creation of a dokumentobjekt and the upload of a file during which such a dokumentobjekt lacks that link. To ensure that no API reader sees such inconsistent dokumentobjekt instances, make it clear that they must not be returned as children of their parent dokumentbeskrivelse until the file upload has succeeded. Related to issues arkivverket#285 and arkivverket#25, and offers a backwards-compatible solution without the extra functionality described in arkivverket#309 and arkivverket#298.
That would only be possible for a dokumentbeskrivelse that had no dokumentobjekt instance in the first place. If the dokumentobjekt is being created in an existing dokumentbeskrivelse with multiple objects in it, the vendor would still need to return the dokumentobjekt key in the _links dictionary due to the presence of other dokumentobjekter.
Would not that clash with the relasjoner requirements for dokumentobjekt in the specification that says that a dokumentobjekt must have a dokumentbeskrivelse: |
[ivaylomitrev]
That would only be possible for dokumentbeskrivelse that had no
dokumentobjekt instance in the first place. If the dokumentobjekt is
being created in an existing dokumentbeskrivelse with multiple objects
in it, the vendor would still need to return the dokumentobjekt key in
the _links dictionary due to the presence of other dokumentobjekter.
Why would that be a requirement?
As far as I can tell, the only feature implementations need to have for
this to work is a one-way link between objects, which can be turned into
a two-way link when the file is uploaded. This can be done with a state
variable or by keeping the dokumentobjekt entity in a holding area until
it is ready to be hooked up to the rest of the data hierarchy. In other
words, when dokumentbeskrivelse and dokumentobjekt are created:
[dokumentbeskrivelse] <---- [ dokumentobjekt]
And after the file is uploaded:
[dokumentbeskrivelse] <----> [ dokumentobjekt] ---> [uploaded file]
The dokumentobjekt _links list should point to its parent, but the
parent should not point to its child without any uploaded file.
This can be implemented by the API by only handing out dokumentobjekt
instances in the list returned behind the
https://rel.arkivverket.no/noark5/v5/api/arkivstruktur/dokumentobjekt/
relation when a file is attached, and only hand out the "empty"
dokumentobjekt instance to the creator who used the
https://rel.arkivverket.no/noark5/v5/api/arkivstruktur/ny-dokumentobjekt/
relation.
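A minimal Python sketch of the state variable approach, to show the intended behaviour. It is an in-memory illustration under assumptions, not a description of any existing implementation; the class, method and field names are all hypothetical.

```python
class DocumentStore:
    """In-memory sketch: a dokumentobjekt is hidden from its parent's
    child list until a file has been uploaded to it."""

    def __init__(self):
        self._objects = {}  # dokumentobjekt id -> record
        self._next_id = 1

    def create_dokumentobjekt(self, dokumentbeskrivelse_id):
        """Create an 'empty' instance; only the creator gets to see it."""
        object_id = self._next_id
        self._next_id += 1
        record = {"id": object_id, "parent": dokumentbeskrivelse_id, "file": None}
        self._objects[object_id] = record
        return dict(record)  # the one-way link to the parent is already set

    def upload_file(self, object_id, content: bytes):
        """Attach the file; from now on the parent also lists the child."""
        self._objects[object_id]["file"] = content

    def list_dokumentobjekt(self, dokumentbeskrivelse_id):
        """Children of a dokumentbeskrivelse, skipping those without a file."""
        return [
            dict(record)
            for record in self._objects.values()
            if record["parent"] == dokumentbeskrivelse_id
            and record["file"] is not None
        ]


store = DocumentStore()
first = store.create_dokumentobjekt("db-1")
store.upload_file(first["id"], b"first file")
second = store.create_dokumentobjekt("db-1")  # no file uploaded yet
print([obj["id"] for obj in store.list_dokumentobjekt("db-1")])
```

Note that a dokumentbeskrivelse that already has other dokumentobjekt instances keeps listing those; only the instance still waiting for its file is left out.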
Would not that clash with the relasjoner requirements for
dokumentobjekt in the specification that says that a dokumentobjekt
must have a dokumentbeskrivelse:
![image](https://github.com/arkivverket/noark5-tjenestegrensesnitt-standard/assets/21099109/4c1e8417-b1fd-4c9e-9e0b-49b5bf6ea92a)
Not really, as it does have a dokumentbeskrivelse. It is just not
"committed" to the data structure before it has a file uploaded to it.
…--
Happy hacking
Petter Reinholdtsen
|
Description
As per the API specification (Chapter 6 document upload):
As per the Noark 5 standard (section 2.7 Dokumentbeskrivelse og dokumentobjekt), however:
Additionally, arkivstruktur.xsd (as shipped with the Noark 5.5 standard) defines:
The metadatakatalog also identifies the field as "obligatory".
As a result, the API specification requires that archive cores allow "empty" document objects which might, however, lead to data quality issues (as a document may never be linked to said document object). Such empty document objects might also have to be "worked around" in implementations of the API specification as they cannot be returned as results of queries (them not being Noark-compliant in this intermediary state).
Please let me know if I have misinterpreted the specification or the standard.
Requested change
Is it possible to allow documents to be uploaded prior to creating a document object? This way, the Noark 5 requirements will be met by the following flow:
1. Upload document
2. Create document object referencing the uploaded document