Options to skip small files and not recurse on input paths #90
…e for files to index from input paths.
LOG.info("Unable to get status of path " + path);
return false;
}
return status.getLen() >= status.getBlockSize() ? true : false;
This is too restrictive. With high compression levels and a large FS block, you might still want to split the FS block to get a spill-less mapper. I would make the threshold configurable.
That's a good point. Does it make sense to fall back to the block size as the default (for example, when the user's config has a bogus value like "abc")?
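A minimal sketch of the configurable threshold discussed above, assuming the setting is read as a string and that an unset or unparsable value falls back to a caller-supplied default such as the block size. The class `SmallFileThreshold` and its methods are hypothetical and use a plain `Map` in place of the real job configuration; the key name is the one mentioned later in this conversation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of hadoop-lzo. */
class SmallFileThreshold {
    // Key name borrowed from this conversation; the final name may differ.
    static final String SMALL_FILE_SIZE_KEY = "lzo_small_file_size";

    /** Returns the configured threshold, or fallback when the value is unset or not a number. */
    static long resolve(Map<String, String> conf, long fallback) {
        String raw = conf.get(SMALL_FILE_SIZE_KEY);
        if (raw == null) {
            return fallback;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            // A value like "abc" lands here and the fallback is used.
            return fallback;
        }
    }

    /** A file is indexed when it is at least as large as the threshold. */
    static boolean shouldIndex(long fileLen, long threshold) {
        return fileLen >= threshold;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(SMALL_FILE_SIZE_KEY, "abc");
        long blockSize = 128L * 1024 * 1024; // assumed block size for the demo
        System.out.println(resolve(conf, blockSize));
    }
}
```

Whether a silent fallback is preferable to failing fast on a bad value is a design choice this sketch does not settle.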
This is odd -- I swear we did this years ago. @rangadi do you remember what the deal is? Is this something we put into EB instead of hadoop-lzo?
It looks like the build is failing when using -P hadoop-old due to:
I've added a configuration option for what size of file should be considered "small." By default it is Long.MIN_VALUE, which should preserve current behavior if it is not specified. As it stands, the user can set lzo_skip_indexing_small_files = true without setting lzo_small_file_size, which leaves the size at the default Long.MIN_VALUE. In that case, asking to skip small files would not actually skip any files. I see two possible remedies; any preference on which one? I am leaning towards option 1.
@gerashegalov @sjlee Thoughts?
private final String LZO_RECURSIVE_INDEXING_KEY = "lzo_recursive_indexing";
private final boolean LZO_SKIP_INDEXING_SMALL_FILES_DEFAULT = false;
private final boolean LZO_RECURSIVE_INDEXING_DEFAULT = true;
private final long LZO_SMALL_FILE_SIZE_DEFAULT = Long.MIN_VALUE;
33-38: if these are meant as constants (which I think they are), they should be declared private static final.
I'm not sure if Long.MIN_VALUE is the best default value here, as it would be -2**63. Note that this is printed in the usage as well. If the goal is to disable the small-file-skipping feature if this configuration is not set, isn't 0 fine as well?
Yes, those are meant to be constants. I missed the static modifier. 0 seems reasonable, I'll work that into the next commit.
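To illustrate why 0 works as the "feature off" default, here is a hedged sketch assuming the skip check compares the file length against the threshold with a strict less-than. The class `SmallFileDefaults` and the method `shouldSkip` are hypothetical names, not the PR's actual code.

```java
/** Hypothetical illustration of the default-value discussion; not the PR's code. */
class SmallFileDefaults {
    // Declared static final, as suggested in the review.
    private static final boolean LZO_SKIP_INDEXING_SMALL_FILES_DEFAULT = false;
    private static final long LZO_SMALL_FILE_SIZE_DEFAULT = 0L;

    /** Skips a file only when skipping is enabled and the file is below the threshold. */
    static boolean shouldSkip(boolean skipSmallFiles, long fileLen, long smallFileSize) {
        return skipSmallFiles && fileLen < smallFileSize;
    }

    /** Applies both defaults, which preserves the pre-existing index-everything behavior. */
    static boolean shouldSkipWithDefaults(long fileLen) {
        return shouldSkip(LZO_SKIP_INDEXING_SMALL_FILES_DEFAULT, fileLen, LZO_SMALL_FILE_SIZE_DEFAULT);
    }

    public static void main(String[] args) {
        // No file length is below 0, so a threshold of 0 skips nothing,
        // exactly as Long.MIN_VALUE would, but it reads better in usage output.
        System.out.println(shouldSkip(true, 10L, 0L));
        System.out.println(shouldSkip(true, 10L, Long.MIN_VALUE));
    }
}
```

Under this assumption both defaults behave identically, so the choice comes down to what is printed in the usage text.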
Sorry it took me a super long time to revisit this. I went over the PR, and have some comments (some more major than others). Comments coming... A high level comment: it would be great if you can add some unit tests that cover this.
private static final String LZO_EXTENSION = new LzopCodec().getDefaultExtension();

private static final String LZO_SKIP_INDEXING_SMALL_FILES_KEY = "lzo_skip_indexing_small_files";
Nit: I think it's a common practice for hadoop and related code bases to use dots (".") as separators for config keys; e.g. "lzo_skip_indexing_small_files" -> "lzo.skip-indexing-small-files". By the same token, how about "lzo.skip-indexing-small-files.size" instead of "lzo_small_file_size", and "lzo.recursive-indexing.enabled" instead of "lzo_recursive_indexing"?
Another nit: let's pair each key definition and its default.
Another nit: have an empty line between the static members and the instance members.
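The three nits above could look roughly like this. This is only a sketch of the layout: the key strings are the ones proposed in this comment, while the class name `IndexerConfigKeys`, the shortened constant names, and the instance field are made up for illustration.

```java
/** Hypothetical layout sketch; key strings follow the naming proposed in the review. */
class IndexerConfigKeys {
    // Each key sits directly above its default.
    static final String SKIP_SMALL_FILES_KEY = "lzo.skip-indexing-small-files";
    static final boolean SKIP_SMALL_FILES_DEFAULT = false;

    static final String SMALL_FILE_SIZE_KEY = "lzo.skip-indexing-small-files.size";
    static final long SMALL_FILE_SIZE_DEFAULT = 0L;

    static final String RECURSIVE_KEY = "lzo.recursive-indexing.enabled";
    static final boolean RECURSIVE_DEFAULT = true;

    // An empty line separates the static members above from instance members.
    private long smallFileSize = SMALL_FILE_SIZE_DEFAULT;

    long getSmallFileSize() {
        return smallFileSize;
    }
}
```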
…o_indexer_skip_small_files
…variable grouping, rename constants for grouping, rename constants values for hadoop naming style.
…Name instead of path.getString for extension filtering. Add comments. Reduce number of Path.getFileSystem and getFileStatus.
…nal access. Refactor configuration/job setup. Add unit tests. Remove unused variable in TestLzoRandData.
…Add javadoc to DistributedLzoIndexer.
public static final long LZO_INDEXING_SMALL_FILE_SIZE_DEFAULT = 0;
public static final String LZO_INDEXING_RECURSIVE_KEY = "lzo.indexing.recursive.enabled";
public static final boolean LZO_INDEXING_RECURSIVE_DEFAULT = true;
private static final String TEMP_FILE_EXTENSION = "/_temporary";
This isn't really an extension but rather a file/directory name. Should it be more like TEMP_FILE_NAME or TEMP_DIRECTORY_NAME (depending on whether it is a file or directory)?
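If the value is treated as a directory name, one way to check for it is to compare whole path components rather than searching for a substring. The sketch below assumes "/"-separated path strings; the class `TempPathCheck` and the method `isUnderTempDirectory` are hypothetical and operate on plain strings instead of the indexer's path type.

```java
/** Hypothetical sketch of matching a temporary directory by name; not the PR's code. */
class TempPathCheck {
    // Named as a directory, without the leading slash used in the diff.
    private static final String TEMP_DIRECTORY_NAME = "_temporary";

    /** Returns true when any component of the slash-separated path equals the temp directory name. */
    static boolean isUnderTempDirectory(String path) {
        for (String component : path.split("/")) {
            if (component.equals(TEMP_DIRECTORY_NAME)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isUnderTempDirectory("/data/out/_temporary/part-0.lzo"));
        // A substring search for "/_temporary" would also match this path.
        System.out.println(isUnderTempDirectory("/data/out/_temporary_backup/part-0.lzo"));
    }
}
```

Matching whole components avoids false positives on names that merely start with the same text.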
…cking Path names.
return -1;
}
/**
 * Determine based on previous configuration of this indexer whether a file
Determine -> Determines
Could you please add unit tests around the recursive behavior? There are quite a few tests around whether the file should be indexed, but I don't see tests for the recursion. Also, it would be great if you can test this code against real data to see if there is any surprise that isn't caught by the unit tests (and review). Thanks again!
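A test for the recursion could follow the shape below: build a small directory tree, list it with recursion on and off, and compare the counts. This is a sketch against the local filesystem using `java.nio.file`, not against the indexer itself; `RecursiveListing`, `lzoFiles`, and `demoCounts` are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical stand-in for the indexer's path walking, used to show the test shape. */
class RecursiveListing {
    /** Lists .lzo files under root; with recursive=false only direct children are considered. */
    static List<Path> lzoFiles(Path root, boolean recursive) {
        int maxDepth = recursive ? Integer.MAX_VALUE : 1;
        try (Stream<Path> walk = Files.walk(root, maxDepth)) {
            return walk.filter(Files::isRegularFile)
                       .filter(p -> p.getFileName().toString().endsWith(".lzo"))
                       .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds root/top.lzo and root/nested/deep.lzo, then returns {recursive count, flat count}. */
    static int[] demoCounts() {
        try {
            Path root = Files.createTempDirectory("lzo-recursion");
            Path nested = Files.createDirectory(root.resolve("nested"));
            Files.createFile(root.resolve("top.lzo"));
            Files.createFile(nested.resolve("deep.lzo"));
            return new int[] { lzoFiles(root, true).size(), lzoFiles(root, false).size() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same two-level fixture, driven through the real indexer with the recursive option toggled, would cover the case the review asks about.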
Added support for a boolean configuration key "skip_indexing_small_files". If this is enabled, files smaller than one block in size will not be indexed. This is useful because indexing files smaller than a block is essentially wasteful. The default is false so the current behavior is preserved.
Added support for a boolean configuration key "recursive_indexing". If this is disabled, paths passed in on the command line will not be recursively searched for files to index. This allows for flexibility in specifying input paths for indexing. The default is true so the current behavior is preserved.