* Documentation
** Examples
*** Reduce-side join
* Integration for writing other Hadoop classes
** Input formats
** Output formats
** (Raw)Comparators (see the sketch below)
** Writables (?)
** Other things?
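
As one concrete illustration of what such integration involves, below is a
hedged sketch of a raw comparator over serialized LongWritable keys, written
with plain reify; all names are illustrative. Note that Hadoop normally
locates comparators by class name via reflection, which is exactly the
limitation the class-loader item below addresses.

#+BEGIN_SRC clojure
(ns example.comparators
  (:import [org.apache.hadoop.io LongWritable RawComparator
            WritableComparator]))

(def long-raw-comparator
  "Illustrative RawComparator over LongWritable keys, comparing the
  serialized 8-byte big-endian representations directly."
  (reify RawComparator
    (compare [_ b1 s1 l1 b2 s2 l2]
      ;; Byte-level form: read the longs straight from the raw bytes,
      ;; avoiding deserialization entirely.
      (Long/compare (WritableComparator/readLong b1 s1)
                    (WritableComparator/readLong b2 s2)))
    (compare [_ x y]
      ;; Object form, for already-deserialized instances.
      (Long/compare (.get ^LongWritable x) (.get ^LongWritable y)))))
#+END_SRC
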
* Logging
- Probably other logging?
* Use class loader instead of pools
The current implementation builds small statically-compiled pools of
var-trampolining implementations of the various Hadoop classes which Hadoop
expects to locate via name-based reflection. A more general implementation
would use a custom ClassLoader which could dynamically generate and return new
instances of the supported types.
The ClassLoader approach may not be possible, however. It’s not clear whether
Hadoop allows tasks to configure an alternative ClassLoader prior to task
initialization.
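
For concreteness, a minimal sketch of the shape the approach might take,
building on clojure.lang.DynamicClassLoader. The `gen-trampoline-bytes`
function is assumed rather than existing, and whether Hadoop would honor the
configured loader early enough is exactly the open question above.

#+BEGIN_SRC clojure
(ns example.loader
  (:import [clojure.lang DynamicClassLoader]
           [org.apache.hadoop.conf Configuration]))

(defn trampoline-class-loader
  "Loader which defines var-trampolining classes on demand.
  `gen-trampoline-bytes` maps a class name to bytecode, or nil."
  [^ClassLoader parent gen-trampoline-bytes]
  (proxy [DynamicClassLoader] [parent]
    (findClass [name]
      (if-let [bs (gen-trampoline-bytes name)]
        (.defineClass ^DynamicClassLoader this ^String name ^bytes bs nil)
        (proxy-super findClass name)))))

(defn install-loader!
  "Point a job Configuration at the dynamic loader."
  [^Configuration conf gen-trampoline-bytes]
  (doto conf
    (.setClassLoader (trampoline-class-loader
                      (.getClassLoader Configuration)
                      gen-trampoline-bytes))))
#+END_SRC
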
* Support for EMR
Probably a separate project. Suggested by ztellman: potentially a Leiningen
plugin providing a mechanism for easily spinning up EMR clusters and launching
Parkour jobs.
* Error messages
** Errors on repeated reduction (?)
Attempting to reduce a task input collection multiple times is almost always a
mistake. Right now, reducing an exhausted input collection acts like reducing
an empty collection. Perhaps it should raise an exception instead?
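
A minimal sketch of what raising could look like, as a wrapper over an
arbitrary reducible; the `once-reducible` name is made up:

#+BEGIN_SRC clojure
(ns example.once
  (:require [clojure.core.protocols :as p]))

(defn once-reducible
  "Wrap coll so a second reduction throws instead of silently acting
  as reduction of an empty collection."
  [coll]
  (let [consumed? (atom false)
        check! #(when-not (compare-and-set! consumed? false true)
                  (throw (IllegalStateException.
                          "task input may only be reduced once")))]
    (reify p/CollReduce
      (coll-reduce [_ f] (check!) (reduce f coll))
      (coll-reduce [_ f init] (check!) (reduce f init coll)))))
#+END_SRC
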
* Allow repeated reduction of inputs
Alternative to erroring on repeated reduction. We could re-initialize the input
record-reader (and context) for the original input split, allowing the same data
to be subsequently re-processed. Not certain this is a useful feature, but it
is an interesting idea.
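
A hedged sketch of the re-initialization step, assuming the task's InputFormat
instance, original InputSplit, and TaskAttemptContext are all in hand:

#+BEGIN_SRC clojure
(ns example.reread
  (:import [org.apache.hadoop.mapreduce
            InputFormat InputSplit TaskAttemptContext]))

(defn reopen-reader
  "Create and initialize a fresh RecordReader over the task's original
  split, allowing its records to be consumed again."
  [^InputFormat input-format ^InputSplit split ^TaskAttemptContext context]
  (doto (.createRecordReader input-format split context)
    (.initialize split context)))
#+END_SRC
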
* n-record dseq
A dseq which acts like NLineInputFormat, but wraps an arbitrary existing
InputFormat, distributing only <n> records per mapper (or distributing available
records across <m> mappers).
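
Most of the work would be in split construction, but as a hedged sketch of the
reader half, here is a wrapper exposing only records [start, start + n) of an
underlying reader. It positions itself by reading and discarding, so each
mapper pays for a pass over the split; a real implementation would want
something smarter.

#+BEGIN_SRC clojure
(ns example.nrecord
  (:import [org.apache.hadoop.mapreduce RecordReader]))

(defn window-reader
  "Wrap RecordReader rr to yield only records [start, start + n)."
  [^RecordReader rr start n]
  (let [pos (atom 0)]
    (proxy [RecordReader] []
      (initialize [split context]
        (.initialize rr split context)
        ;; Skip forward to the start of this reader's window.
        (while (and (< @pos start) (.nextKeyValue rr))
          (swap! pos inc)))
      (nextKeyValue []
        (boolean (and (< @pos (+ start n))
                      (.nextKeyValue rr)
                      (swap! pos inc))))
      (getCurrentKey [] (.getCurrentKey rr))
      (getCurrentValue [] (.getCurrentValue rr))
      (getProgress []
        (float (min 1.0 (/ (max 0 (- @pos start)) (double n)))))
      (close [] (.close rr)))))
#+END_SRC
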
* jobconf dseq
A dseq which distributes input records via the job configuration. Should
probably use extended configuration serialization.
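
A hedged sketch of the basic mechanism, using plain EDN strings under an
assumed configuration key; as noted, a real version should use the extended
configuration serialization instead.

#+BEGIN_SRC clojure
(ns example.jobconf
  (:require [clojure.edn :as edn])
  (:import [org.apache.hadoop.conf Configuration]))

(def ^:private records-key "example.jobconf.records")

(defn set-records!
  "Stash a (small) collection of input records in the job configuration."
  [^Configuration conf records]
  (doto conf (.set records-key (pr-str records))))

(defn get-records
  "Recover the input records in each task from its configuration."
  [^Configuration conf]
  (some-> (.get conf records-key) edn/read-string))
#+END_SRC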