Make _metadata optional on writing #809
1/ I am sorry, I don't have a clear picture of the content of
2/ How would you call this new parameter?
if write_fmd:
    # here, if-condition to be added?
    write_common_metadata(join_path(dn, '_metadata'), fmd, open_with,
                          no_row_groups=False)
    write_common_metadata(join_path(dn, '_common_metadata'), fmd,
                          open_with)
3/ We could check this thanks to self.fn = join_path(basepath, '_metadata') if basepath else '_metadata'
4/ Ok. Do you identify a specific attribute in
5/ |
Code to reproduce the result mentioned in 3/:
import os
import pandas as pd
import fastparquet as fp
path = os.path.expanduser('~/Documents/code/data/fastparquet/')
df1 = pd.DataFrame({'val': [0,1]})
path1 = f"{path}file.1.parquet"
df2 = pd.DataFrame({'val': [-1,0]})
path2 = f"{path}file.2.parquet"
fp.write(path1, df1, file_scheme="simple")
fp.write(path2, df2, file_scheme="simple")
pf = fp.ParquetFile(path)
pf.fn
Out[28]: '/home/yoh/Documents/code/data/fastparquet/_metadata'
But there is no _metadata file at this location. |
_common_metadata is the same as _metadata, but with an empty list for the row_groups. This means it contains the schema and key-value metadata, but no chunk-specific information. It is much smaller, easier to make, and doesn't require updating when appending to the dataset. |
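To make the difference concrete, here is a minimal sketch (the 'mydata' paths are assumptions) comparing the two sidecar files with fastparquet:
import fastparquet as fp

full = fp.ParquetFile('mydata/_metadata')            # schema + key-value metadata + row groups
common = fp.ParquetFile('mydata/_common_metadata')   # schema + key-value metadata only

print(len(full.row_groups))    # one entry per row group in the whole dataset
print(len(common.row_groups))  # 0 -- no chunk-specific information
print(common.key_value_metadata == full.key_value_metadata)  # expected True: same key-value metadata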
We made an effective in-memory one by grabbing the row-group definitions in the two files. That's the location the file would be in, if it existed. |
Hi,
import os
import pandas as pd
import fastparquet as fp

path = os.path.expanduser('~/Documents/code/data/fastparquet/')
append = False
for i in range(11):
    # trouble starts once there are 11 files
    df = pd.DataFrame({'val': [i]})
    fp.write(path, df, file_scheme="hive", append=append)
    append = True
# if there is no '_metadata' to keep track of file order,
os.remove(f"{path}_metadata")
# then a file named part.10.parquet is read before a file named part.2.parquet.
pf = fp.ParquetFile(path)
pf.to_pandas()
Out[13]:
val
0 0
1 1
2 10
3 2
4 3
5 4
6 5
7 6
8 7
9 8
10 9 |
Expected is:
Out[11]:
val
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10 |
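The underlying cause is plain string sorting of the data file names (assuming the part.N.parquet names that fastparquet's hive scheme produces): "part.10" sorts before "part.2".
names = [f"part.{i}.parquet" for i in range(11)]
print(sorted(names)[:4])
# ['part.0.parquet', 'part.1.parquet', 'part.10.parquet', 'part.2.parquet']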
In key-value-metadata,
Should we add a mechanism to check this file on |
Correct, without a _metadata, there is no absolute way to guarantee that appended data will appear at the end of the dataset. We must attempt to generate filenames that come after previous ones in both numerical and lexical sorting. |
There is no such convention. The parquet schemas must match, that is all. So we are free to decide whether to rewrite or not. I would vote for no, unless specifically updating the key-value metadata. |
My proposal would be: fastparquet already has a "naming convention" that I find OK. I would check if the filename follows this naming convention, then extract the int, and assume that row groups are ordered as per the int. For other naming schemes, we take lexicographic ordering. But I have little knowledge of possible naming conventions. If one exists that satisfies both, why not. I have been liking the "short" file names produced by fastparquet as opposed to those of pyarrow. |
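A minimal sketch of that ordering rule (the regular expression and function name are illustrative assumptions, not fastparquet code): extract the integer from fastparquet-style part.N.parquet names, and fall back to plain lexicographic order for anything else.
import re

def row_group_sort_key(filename):
    # Files matching fastparquet's own naming scheme are ordered by their integer.
    m = re.fullmatch(r"part\.(\d+)\.parquet", filename)
    if m:
        return (0, int(m.group(1)), filename)
    # Any other naming scheme falls back to lexicographic ordering.
    return (1, 0, filename)

files = ["part.10.parquet", "part.2.parquet", "part.0.parquet"]
print(sorted(files, key=row_group_sort_key))
# ['part.0.parquet', 'part.2.parquet', 'part.10.parquet']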
Ok for me as well (I am aware I am being "over meticulous" here) |
Do you think we could define a convention here about its value? Currently,
elif hasattr(fn, 'read'):
    # file-like
    self.fn = None
Could we set it the following way:
This way, when appending, a |
Yes, I'd be happy with that convention. I think it's more typical to use the first found parquet file rather than the last. Note that some file-like objects expose a |
So far, I was thinking that key-value metadata + statistics were being recorded in every
I understand now that I have been wrong, at least for key-value metadata, which are in
Hence the logic I proposed to track the filename of the last
Side question: |
I see a pro in keeping the current fastparquet naming scheme:
From this root function then stem other 'utils':
It would save us time, I think, to avoid reworking them. |
Yes, fastparquet builds a virtual ParquetFile from the row groups found in all of the data files, if there is no _metadata. This could reasonably be skipped in some instances, and dask does indeed not use this mode by default any more. Fastparquet's use case is a little different, though, since the main operation is to load the whole dataset into memory, so we will need that information anyway (and in the main thread) to be able to build the empty dataframe that is to be filled in. |
I'm not sure if you asked this question: each parquet data file contains statistics relevant to its own row-groups. All the files contain key-value metadata, and this is supposed to be the same across all files (and the same as _common_metadata). If it has changed as a consequence of append operations, then you are right that the last file should be the most up to date. |
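A short sketch of where each piece lives (the path is an assumption; .statistics and .key_value_metadata are properties of fastparquet's ParquetFile):
import fastparquet as fp

# One data file from a hive dataset: its footer carries statistics for its
# own row groups only, plus the shared key-value metadata.
part = fp.ParquetFile('mydata/part.0.parquet')
print(part.statistics)          # min/max/null counts for this file's row groups
print(part.key_value_metadata)  # expected to be identical in every data file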
Ok. What matters is that if there is no |
Yes, you have persuaded me |
@martindurant, I think here is a summary of all our discussions so far:
|
A couple of small notes follow. The only thing we don't consider in all this is a ParquetFile made from a list of paths. We could simply refuse to append in that case.
Attempt a numerical sort that would match our scheme and maybe others, falling back to lexical. "Others" here might be arrow and spark.
and schema? Now it seems odd to read key-values from one file and schema from another. The value of
Yes. That means we need to analyse the existing file names and potentially cope with file lists that don't follow a good numbering scheme. I'm not sure if we already do that. |
I am ok with that.
Yes, I am ok with that as well; we are in the case where there is no
I don't think we do. And at the moment, I have no clear idea what would be the appropriate way of naming the new files. |
About the parameter name to be set to write or not: it could be this parameter? This parameter could have a
Or it could be |
Note: I modified the previous comment / I removed what I think was not adequate in my previous message. The way of setting
def write_fmd_auto_set(pf: ParquetFile):
    if pf.fn == '_metadata':
        return ...
    elif ...
This function would then be called from
In |
Rather than None, True, False, "common"...., you could have an enum
which is more verbose, but at least explicit. It would require function signatures
I can understand if maybe you think that's too many characters. |
Hi, yes we can have an enum, I don't see any trouble with that. I would maybe rename/reword it this way?
class MDWriteMode(enum.Enum):
    ALL_META = 1
    ONLY_COMMON = 2
    NO_META = 3
|
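A sketch of how such an enum might be threaded through a write call (the parameter name and the branching are assumptions from this discussion, not fastparquet's actual API):
import enum

class MDWriteMode(enum.Enum):
    ALL_META = 1     # write _metadata (with row groups) and _common_metadata
    ONLY_COMMON = 2  # write only _common_metadata (schema + key-value metadata)
    NO_META = 3      # write neither sidecar file

def write(path, data, metadata=MDWriteMode.ALL_META):
    ...  # write the data files themselves
    if metadata is MDWriteMode.ALL_META:
        ...  # write _metadata with the full row-group list
    if metadata in (MDWriteMode.ALL_META, MDWriteMode.ONLY_COMMON):
        ...  # write _common_metadata with an empty row-group list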
The "only" point I see that needs some discussion is what to do when appending to a hive dataset whose naming does not follow fastparquet's convention (as you also identified). I think that what could be appropriate is:
If we consider the last character of the filename, we can check if it is out of the ranges [49, 57] or [97, 122]. These ranges come from:
ord('1')
Out[11]: 49
ord('9')
Out[12]: 57
ord('a')
Out[13]: 97
ord('z')
Out[14]: 122
Then, we increment this last character or, if needed, the last-but-one character as well, to name the new file. Something like (but handling of the out-of-range case is to be implemented in addition):
the_string = the_string[:-1] + chr(ord(the_string[-1]) + 1)
(comes from this SO answer)
This way, we don't break lexical ordering with new appended files. |
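A minimal sketch of that increment with a simple carry (an illustration only; the exact policy for names we cannot bump is the open question above):
def next_name(name):
    # Bump the last character so the new name sorts after the old one lexically,
    # carrying into the previous character when we are already at '9' or 'z'.
    chars = list(name)
    i = len(chars) - 1
    while i >= 0:
        c = chars[i]
        if '0' <= c < '9' or 'a' <= c < 'z':
            chars[i] = chr(ord(c) + 1)   # simple in-range increment
            return ''.join(chars)
        if c in ('9', 'z'):
            chars[i] = '0' if c == '9' else 'a'   # wrap and carry left
            i -= 1
            continue
        break  # previous character is not one we know how to increment
    # Fallback: appending a character still sorts after the original name.
    return name + '0'

print(next_name("part.8"))   # part.9
print(next_name("part.9"))   # part.90
print(next_name("part.az"))  # part.ba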
dask.utils has
so that you can do |
Hi Martin,
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
path = os.path.expanduser('~/Documents/code/data/fastparquet')
df = pd.DataFrame({'a': list(range(4))})
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path=path)
I get a filename like:
This is consistent with the documentation of
So the |
Following discussion in #807
I would do as follows:
A user that has built a dataset using append without a _metadata can create one as a separate step with the existing merge() function, if they want.