DataSets are weird... #55

Open
11 tasks
dktcoding opened this issue Jan 7, 2017 · 1 comment

Comments

@dktcoding
Contributor

I've been writing the tests for the current DataSet implementations, but there are some things that need work (especially if the idea is to use them to train NNs):

  • Fix the JavaDoc; it's really hard to read
  • Add JavaDoc to the Attribute interface
  • Rename DiscreteAttribute to CategoricalAttribute
  • Add DiscreteAttribute

There are some missing features:

  • Editing MetaData (at least attribute names, so it can be removed from the Attribute interface)
  • Frequencies for continuous and discrete attributes in intervals
  • Several filtering options
  • Incomplete/Dirty data should be removed by the DataSet
  • MySQLDataSet resources are left open (I believe we talked about this a while ago)
  • Generalize the TextFileDataSet a bit (at least allow setting the splitting regex, checking whether the file has headers, etc.)
  • Create something like a MatrixDataSet and a LargeTextFileDataSet
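To make the TextFileDataSet point concrete, here is a minimal sketch of a parser that takes the splitting regex and a header flag as parameters. `DelimitedTextParser`, `Parsed`, and the generated `attr0`, `attr1`, ... names are assumptions made up for this example, not existing classes or behavior in the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the library: splits lines with a
// caller-supplied regex and optionally treats the first line as a header.
final class DelimitedTextParser {

    // Result holder: attribute names plus the raw string rows.
    static final class Parsed {
        final List<String> names;
        final List<String[]> rows;

        Parsed(List<String> names, List<String[]> rows) {
            this.names = names;
            this.rows = rows;
        }
    }

    private final Pattern separator;
    private final boolean hasHeader;

    DelimitedTextParser(String separatorRegex, boolean hasHeader) {
        this.separator = Pattern.compile(separatorRegex);
        this.hasHeader = hasHeader;
    }

    Parsed parse(List<String> lines) {
        List<String> names = new ArrayList<>();
        List<String[]> rows = new ArrayList<>();
        boolean headerPending = hasHeader;

        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            // Limit -1 keeps trailing empty cells, so missing values stay visible.
            String[] cells = separator.split(trimmed, -1);
            if (headerPending) {
                names.addAll(Arrays.asList(cells));
                headerPending = false;
            } else {
                rows.add(cells);
            }
        }

        // Without a header, fall back to generated names (assumed convention).
        if (!hasHeader && !rows.isEmpty()) {
            for (int i = 0; i < rows.get(0).length; i++) {
                names.add("attr" + i);
            }
        }
        return new Parsed(names, rows);
    }
}
```

Taking a list of lines rather than a file path keeps the sketch independent of how the file is read, so the same logic could sit behind a streaming reader for larger files.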

I'm assuming that these classes were created specifically for C4.5, but they need to be generalized a bit.

@kronenthaler
Owner

I know these classes need tons of work. They kind of grew organically from the C4.5 implementation into something else when I was investigating the Bayes Network implementations.

Some comments on some of your points:

  • I might agree with renaming DiscreteAttribute to CategoricalAttribute. However, adding a DiscreteAttribute seems superfluous, as it would be a subset of the ContinuousAttribute (just using the integer part). But then it won't be discrete anymore, as integers are continuous.
  • Frequencies for continuous and discrete attributes in intervals are implemented on the DataSet via getFrequencies(int lo, int hi, int index).
  • Incomplete/dirty data removal is, I think, the responsibility of whoever traverses the data set; for instance, C4.5 has ways to deal with such data (that I haven't implemented yet). I would be more inclined to have a flag or a special method that does it for you if needed, but certainly not as a default option.
  • The problem with filtering options is that they can lead to incredibly complex code. Keep in mind that any library might expect certain kinds of input, and more often than not, the input has to be pre-processed before it can be fed to the library. Because of this, I think filtering should be part of that pre-processing step.
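As a rough illustration of the "special method, not a default" idea, incomplete rows could be dropped in an explicit pre-processing step before anything reaches a DataSet. `RowCleaner` and the missing-value marker parameter are hypothetical names for this sketch, not existing library API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-processing helper; callers opt in by invoking it explicitly.
final class RowCleaner {

    private RowCleaner() {
        // static helpers only
    }

    // True when any cell is null, blank, or equal to the missing-value marker.
    static boolean isIncomplete(String[] row, String missingMarker) {
        for (String cell : row) {
            if (cell == null) {
                return true;
            }
            String trimmed = cell.trim();
            if (trimmed.isEmpty() || trimmed.equals(missingMarker)) {
                return true;
            }
        }
        return false;
    }

    // Returns a new list holding only the complete rows; the input is not modified.
    static List<String[]> dropIncomplete(List<String[]> rows, String missingMarker) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (!isIncomplete(row, missingMarker)) {
                kept.add(row);
            }
        }
        return kept;
    }
}
```

Because the helper returns a new list, an algorithm such as C4.5 that has its own strategy for missing values can simply skip this step and work on the original rows.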

I would rather focus on this later, as it will be part of a bigger architectural change that might affect several other components (C4.5 & Bayes), and I want to assess the scope of the change first.
