I've been writing the tests for the current DataSet implementations, but some things need work (especially if the idea is to use them to train NNs):
Fix the JavaDoc; it's really hard to read
Add JavaDoc to the Attribute interface
Rename DiscreteAttribute to CategoricalAttribute
Add DiscreteAttribute
There are some missing features:
Editing MetaData (at least attribute names, so it can be removed from the Attribute interface)
Frequencies for continuous and discrete attributes in intervals
Several filtering options
Incomplete/Dirty data should be removed by the DataSet
MySQLDataSet resources are left open (I believe we talked about this a while ago)
Generalize the TextFileDataSet a bit (at least allow setting the splitting regex, checking whether the file has headers, etc.)
Create something like a MatrixDataSet and a LargeTextFileDataSet
I'm assuming that these classes were created specifically for C4.5, but they need to be generalized a bit.
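To make the TextFileDataSet point concrete, here is a rough sketch of what a configurable reader could look like. Everything in it is hypothetical (`DelimitedTextParser`, its constructor arguments, and its getters are made-up names, not the existing TextFileDataSet API); it only shows the splitting regex and the header flag being passed in rather than hard-coded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, NOT the library's TextFileDataSet: it parses delimited
// text using a caller-supplied splitting regex and an optional header row.
final class DelimitedTextParser {
    private final Pattern splitter;
    private final boolean hasHeader;
    private List<String> attributeNames = new ArrayList<>();
    private final List<String[]> rows = new ArrayList<>();

    DelimitedTextParser(String splitRegex, boolean hasHeader) {
        this.splitter = Pattern.compile(splitRegex);
        this.hasHeader = hasHeader;
    }

    // Reads every line from the source; the reader is closed when done.
    void parse(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            boolean first = true;
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                // Limit -1 keeps trailing empty fields instead of dropping them.
                String[] fields = splitter.split(line, -1);
                if (first && hasHeader) {
                    attributeNames = Arrays.asList(fields);
                } else {
                    rows.add(fields);
                }
                first = false;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    List<String> getAttributeNames() {
        return attributeNames;
    }

    List<String[]> getRows() {
        return rows;
    }
}
```

Something like `new DelimitedTextParser("\\s*,\\s*", true)` would then handle a comma-separated file with a header row, and `new DelimitedTextParser("\t", false)` a tab-separated file without one.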
I know these classes need tons of work. They kind of grew organically from the C4.5 implementation into something else when I was investigating the Bayes Network implementations.
Some comments on some of your points:
I might agree with the rename from DiscreteAttribute to CategoricalAttribute; however, the addition of DiscreteAttribute seems superfluous, as it would be a subset of the ContinuousAttribute (just using the integer part). But then it won't be discrete anymore, as integers are continuous.
Frequencies for continuous and discrete attributes in intervals: this is implemented on the DataSet via getFrequencies(int lo, int hi, int index).
Incomplete/dirty data removal: I think it's the responsibility of whoever traverses the data set; for instance, C4.5 has ways to deal with such data (that I haven't implemented yet). I would be more inclined to have a flag or a special method that does it for you if needed, but certainly not as a default option.
The problem with filtering options is that they can lead to incredibly complex code. Keep in mind that any library might expect certain kinds of input, and more often than not, the inputs have to be pre-processed before they can be fed to any library. Because of this, I think filtering should be part of that pre-processing step.
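As an illustration of the last two points, filtering and dropping incomplete rows could live in a small opt-in pre-processing helper that runs before anything reaches the DataSet. This is only a sketch under assumptions: `RowFilters`, `isComplete`, and `keep` are hypothetical names, rows are modelled as plain `String[]`, and treating `?` as a missing-value marker is an assumption borrowed from C4.5-style data files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical pre-processing helper, not part of the current DataSet classes.
final class RowFilters {
    private RowFilters() {
    }

    // A row counts as complete when no field is null, blank, or "?".
    // The "?" marker is an assumption (C4.5-style data files use it for unknowns).
    static boolean isComplete(String[] row) {
        for (String value : row) {
            if (value == null) {
                return false;
            }
            String trimmed = value.trim();
            if (trimmed.isEmpty() || trimmed.equals("?")) {
                return false;
            }
        }
        return true;
    }

    // Returns a new list holding only the rows that satisfy the condition;
    // the input list is left untouched.
    static List<String[]> keep(List<String[]> rows, Predicate<String[]> condition) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (condition.test(row)) {
                kept.add(row);
            }
        }
        return kept;
    }
}
```

A caller that wants clean data would run `RowFilters.keep(rows, RowFilters::isComplete)` before building the data set; a caller like C4.5 that handles unknowns itself would simply skip that step, so nothing is removed by default.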
I would rather focus on this later, as it will be part of a bigger architectural change that might affect several other components (C4.5 & Bayes), and I want to assess the scope of the change first.