gilbcharlene edited this page Jun 5, 2012 · 1 revision


ML.AppendID

AppendID

AppendID(recordset,idfield,output)

recordset A record set to process.
idfield The name of the field to be appended containing the id for each row.
output The name of the returned record set.
Return: AppendID returns a record set.
The AppendID macro appends a record ID column to any record set.

ML.Associate

Associate

Associate(recordset,count)

recordset A record set to process.
count An integer expression defining the number of times items must occur to be considered equivalent.
The Associate module is used to perform frequent pattern matching on the underlying data.

Associate.Apriori1

Associate(recordset,count).Apriori1

The Associate.Apriori1 attribute returns a record set showing which single items are most likely to appear, using an ‘old school’ brute force and speed approach.

Associate.Apriori2

Associate(recordset,count).Apriori2

The Associate.Apriori2 attribute returns a record set showing which pairs of items are most likely to appear together, using an ‘old school’ brute force and speed approach.
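As a rough illustration of the brute-force idea behind pair counting, the following Python sketch counts every unordered pair of items per basket and keeps the pairs seen at least a minimum number of times. It is not the ML library's implementation: the function name frequent_pairs, the basket data, and the reading of count as an inclusive minimum are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_count):
    """Count each unordered item pair per basket; keep pairs seen >= min_count times."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair a canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical transaction data
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "eggs", "milk"],
]
pairs = frequent_pairs(baskets, min_count=3)
```

Here only the pairs (bread, milk) and (eggs, milk) reach the threshold of three baskets; (bread, eggs) appears in just two.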

Associate.Apriori3

Associate(recordset,count).Apriori3

The Associate.Apriori3 attribute returns a record set showing which triplets of items are most likely to appear together, using an ‘old school’ brute force and speed approach.

Associate.AprioriN

Associate(recordset,count).AprioriN(maxN[,minN])

maxN An integer expression defining the maximum size of sets to return.
minN (Optional) An integer expression defining the minimum size of sets to return. Default: 2.
Return: AprioriN returns a record set.
The Associate.AprioriN subroutine returns a record set with values that occur together using ‘new school’ techniques.

Associate.EclatN

Associate(recordset,count).EclatN(maxN[,minN])

maxN An integer expression defining the maximum size of sets to return.
minN (Optional) An integer expression defining the minimum size of sets to return. Default: 2.
Return: EclatN returns a record set.
The Associate.EclatN subroutine behaves similarly to AprioriN but uses the ‘eclat’ technique.

Associate.Rules

Associate(recordset,count).Rules(patterns)

patterns A record set derived from an Apriori1, Apriori2, Apriori3, AprioriN or EclatN subroutine.
Return: Rules returns a record set.
The Associate.Rules subroutine uses patterns generated by the AprioriN or EclatN subroutines to answer the question: “Given a pattern, what comes next?”

ML.Classify

Perceptron

Perceptron(N[,Alpha])

N An integer expression defining the number of passes over the data to make during the learning process.
Alpha (Optional) A REAL value for the learning rate. Default: 0.1.
The Perceptron routine builds a perceptron for multiple dependent (Boolean) variables.
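To make the roles of N and Alpha concrete, here is a minimal Python sketch of the classic perceptron update rule for a single Boolean output. It is an illustration under stated assumptions, not the ML library's code: the names train_perceptron and predict are hypothetical, the weights start at zero, and only one dependent variable is trained (the routine described above handles several).

```python
def train_perceptron(rows, labels, n_passes, alpha=0.1):
    """Train one Boolean output; weights[0] is the bias, the rest pair with the inputs."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(n_passes):  # N passes over the data
        for x, target in zip(rows, labels):
            activation = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
            predicted = 1 if activation > 0 else 0
            error = target - predicted  # -1, 0 or +1
            weights[0] += alpha * error  # Alpha scales every correction
            for i, v in enumerate(x):
                weights[i + 1] += alpha * error * v
    return weights

def predict(weights, x):
    """Apply the trained weights to one input row."""
    return 1 if weights[0] + sum(w * v for w, v in zip(weights[1:], x)) > 0 else 0

rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]  # logical AND, which is linearly separable
weights = train_perceptron(rows, labels, n_passes=25)
```

Because logical AND is linearly separable, the weights stop changing after a few passes and predict reproduces the labels.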

Logistic

Logistic([ridge][,epsilon][,maxIter])

ridge (Optional) A REAL value for the ridge term used to ensure existence of Inv(X'*X) even if some independent variables X are linearly dependent. Default: 0.0001.
epsilon (Optional) A REAL value for the parameter used to test convergence. Default: 0.000000001.
maxIter (Optional) An integer expression defining the maximum number of iterations. Default: 200.

Classifier Interface

The Classifier Interface (NaiveBayes, Perceptron and Logistic) exports the following attributes and subroutines:

LearnC

LearnC(independent,dependent)

independent A record set containing independent values.
dependent A record set containing dependent values.
The LearnC subroutine trains a classifier model with continuous values.

LearnD

LearnD(independent,dependent)

independent A record set containing independent values.
dependent A record set containing dependent values.
The LearnD subroutine trains a classifier model with discrete values.

ClassifyC

ClassifyC(independent,model)

independent A record set containing independent values.
model A record set containing a model derived from the LearnC subroutine.
The ClassifyC subroutine classifies continuous data.

ClassifyD

ClassifyD(independent,model)

independent A record set containing independent values.
model A record set containing a model derived from the LearnD subroutine.
The ClassifyD subroutine classifies discrete data.

TestC

TestC(independent,dependent)

independent A record set containing independent values.
dependent A record set containing dependent values.
The TestC subroutine tests the performance of a classifier trained with continuous data.

TestD

TestD(independent,dependent)

independent A record set containing independent values.
dependent A record set containing dependent values.
The TestD subroutine tests the performance of a classifier trained with discrete data.

Compare

Compare(dependent,computed)

dependent A record set containing dependent values.
computed A record set containing classification tags derived from a ClassifyC or ClassifyD subroutine.
The Compare routine computes the efficiency of a classification process. It exports the following attributes and subroutines:

Compare.Raw

Compare(dependent,computed).Raw

The Compare.Raw attribute returns a detailed breakdown of every record in the test corpus including what the classification should have been and what it was.

Compare.CrossAssignments

Compare(dependent,computed).CrossAssignments

The Compare.CrossAssignments attribute shows, when a class is misclassified, which class it is most likely to be misclassified as.

Compare.PrecisionByClass

Compare(dependent,computed).PrecisionByClass

The Compare.PrecisionByClass attribute returns the precision broken down by the class that it should have been classified to.

Compare.HeadLine

Compare(dependent,computed).HeadLine

The Compare.HeadLine attribute returns the main precision number that shows how often the classifier was correct.
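The three summary measures described for Compare can be sketched in a few lines of Python. This is only an illustration of the definitions given on this page; the function names and the sample labels are hypothetical, and the record layouts used by the ML library are not reproduced.

```python
from collections import Counter

def headline(actual, computed):
    """Fraction of records whose computed class equals the actual class."""
    correct = sum(1 for a, c in zip(actual, computed) if a == c)
    return correct / len(actual)

def precision_by_class(actual, computed):
    """For each class a record should have had, the fraction classified correctly."""
    totals = Counter(actual)
    hits = Counter(a for a, c in zip(actual, computed) if a == c)
    return {cls: hits[cls] / totals[cls] for cls in totals}

def cross_assignments(actual, computed):
    """For each actual class, the wrong class it is most often assigned to."""
    wrong = Counter((a, c) for a, c in zip(actual, computed) if a != c)
    result = {}
    for (a, c), _ in wrong.most_common():  # most frequent confusion first
        result.setdefault(a, c)
    return result

actual = ["cat", "cat", "cat", "dog", "dog", "bird"]
computed = ["cat", "dog", "dog", "dog", "dog", "cat"]
```

With this data half of the records are correct, every dog is classified correctly, and a misclassified cat is most often labelled dog.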

ML.Cluster

KMeans

KMeans(documentset,centroidset[,niterations,nconverge,algorithm])

documentset A record set of documents to process.
centroidset A record set of centroids to process.
niterations (Optional) An integer expression defining the maximum number of iterations before stopping. Default: 1.
nconverge (Optional) A REAL value for the minimum distance for non-convergence. Default: 0.0.
algorithm (Optional) The distance algorithm to use.

Possible Values:
  • DF.Euclidean
  • DF.EuclideanSquared
  • DF.Manhattan
  • DF.Cosine
  • DF.Tanimoto
Default: DF.Euclidean.
The KMeans routine exports the following attributes and subroutines:

KMeans.AllResults

KMeans(documentset,centroidset[,niterations,nconverge,algorithm]).AllResults

The AllResults attribute returns a record set with the result of all iterations.

KMeans.Convergence

KMeans(documentset,centroidset[,niterations,nconverge,algorithm]).Convergence

The Convergence attribute returns the number of iterations that were performed.

KMeans.Result()

KMeans(documentset,centroidset[,niterations,nconverge,algorithm]).Result()

The Result() subroutine returns the final locations of the centroids.

KMeans.Result(n)

KMeans(documentset,centroidset[,niterations,nconverge,algorithm]).Result(n)

n An integer expression defining the iteration to consider.
Return: Result(n) returns a record set.
The Result(n) subroutine returns the locations of the centroids after the nth iteration.

KMeans.Delta(minN,maxN)

KMeans(documentset,centroidset[,niterations,nconverge,algorithm]).Delta(minN,maxN)

minN An integer expression defining the minimum number of iterations.
maxN An integer expression defining the maximum number of iterations.
Return: Delta returns a record set.
The Delta subroutine returns the distance traveled by every centroid across each axis from iterations minN to maxN.

KMeans.Delta(0)

KMeans(documentset,centroidset[,niterations,nconverge,algorithm]).Delta(0)

The Delta(0) subroutine returns the total distance traveled by every centroid across each axis.

KMeans.DistanceDelta(minN,maxN)

KMeans(documentset,centroidset[,niterations,nconverge,algorithm]).DistanceDelta(minN,maxN)

minN An integer expression defining the minimum number of iterations.
maxN An integer expression defining the maximum number of iterations.
Return: DistanceDelta returns a record set.
The DistanceDelta subroutine returns the straight-line distance traveled by each centroid from iterations minN to maxN.

KMeans.DistanceDelta(0)

KMeans(documentset,centroidset[,niterations,nconverge,algorithm]).DistanceDelta(0)

The DistanceDelta(0) subroutine returns the straight-line distance traveled by each centroid.

KMeans.DistanceDelta()

KMeans(documentset,centroidset[,niterations,nconverge,algorithm]).DistanceDelta()

The DistanceDelta() subroutine returns the distance traveled by each centroid during the last iteration.

KMeans.Allegiances()

KMeans(documentset,centroidset[,niterations,nconverge,algorithm]).Allegiances()

The Allegiances() subroutine returns the table of allegiances (the centroid each entity is closest to) after convergence.

KMeans.Allegiance(entityId,iterationN)

KMeans(documentset,centroidset[,niterations,nconverge,algorithm]).Allegiance(entityId,iterationN)

entityId An integer expression defining the entity to find the allegiance for.
iterationN An integer expression defining the iteration to find the allegiance for.
Return: Allegiance returns a record set.
The Allegiance subroutine returns the centroid to which entityId is closest after iteration iterationN.
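The iteration that the KMeans attributes report on can be sketched as follows. This Python version is an illustration only: it uses Euclidean distance, keeps a centroid in place when no point is assigned to it, and stops when no centroid moves farther than n_converge. All three choices are assumptions, as are the names kmeans, closest and history.

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids, n_iterations=1, n_converge=0.0):
    """Return the centroid positions after each iteration (index 0 is the start)."""
    history = [list(centroids)]
    for _ in range(n_iterations):
        current = history[-1]
        groups = [[] for _ in current]
        for p in points:
            groups[closest(p, current)].append(p)
        # Move each centroid to the mean of its members; leave it alone if it has none
        moved = [tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
                 for g, c in zip(groups, current)]
        history.append(moved)
        # Stop once no centroid has moved farther than n_converge
        if max(math.dist(a, b) for a, b in zip(current, moved)) <= n_converge:
            break
    return history

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
history = kmeans(points, [(0.0, 0.0), (10.0, 10.0)], n_iterations=5)
allegiances = [closest(p, history[-1]) for p in points]
```

In this sketch history[-1] plays the role of Result(), history[n] the role of Result(n), and allegiances the role of Allegiances().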

AggloN

AggloN(numericfield,n[,algorithm,method])

numericfield A NumericField set of records to process.
n An integer expression defining the number of iterations.
algorithm (Optional) The distance algorithm to use.

Possible Values:
  • DF.Euclidean
  • DF.EuclideanSquared
  • DF.Manhattan
  • DF.Cosine
  • DF.Tanimoto
Default: DF.Euclidean.
method (Optional) How to compute distance between clusters.

Possible values:
  • min-dist
  • max-dist
  • ave-dist
Default: min-dist.
Return: AggloN returns a record set.
The AggloN routine exports the following attributes and subroutines:

AggloN.Dendrogram

AggloN(numericfield,n[,algorithm,method]).Dendrogram

The Dendrogram attribute displays the output as a string representation of the tree diagram.

AggloN.Distances

AggloN(numericfield,n[,algorithm,method]).Distances

The Distances attribute returns a record set of the remaining distances that would be used to further cluster the entities.

AggloN.Clusters

AggloN(numericfield,n[,algorithm,method]).Clusters

The Clusters attribute returns a record set with each entity and the id of the cluster that the entity was assigned to.

Distances

Distances(numericfield1,numericfield2[,algorithm])

numericfield1 A set of NumericField records to process.
numericfield2 A set of NumericField records to process.
algorithm (Optional) The distance algorithm to use.

Possible Values:
  • DF.Euclidean
  • DF.EuclideanSquared
  • DF.Manhattan
  • DF.Cosine
  • DF.Tanimoto
Default: DF.Euclidean.
Return: Distances returns a record set.
The Distances routine is the ‘distance computation engine’ that computes the distance matrix.

Closest

Closest(distances)

distances A dataset containing distances.
Return: Closest returns a record set.
The Closest routine takes a set of distances and returns the closest centroid for each row.
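A small Python sketch may help show how a distance matrix and a closest-centroid lookup fit together. It is not the library's implementation: the triple layout (row id, centroid id, distance) is a hypothetical stand-in for the record sets described above, cosine distance is taken as one minus cosine similarity (an assumption), and Tanimoto is omitted.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def euclidean_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    # Assumed definition: 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def distances(rows, centroids, metric=euclidean):
    """Distance matrix as (row_id, centroid_id, distance) triples."""
    return [(i, j, metric(r, c))
            for i, r in enumerate(rows) for j, c in enumerate(centroids)]

def closest(triples):
    """For each row id, keep the triple with the smallest distance."""
    best = {}
    for i, j, d in triples:
        if i not in best or d < best[i][2]:
            best[i] = (i, j, d)
    return [best[i] for i in sorted(best)]

rows = [(0.0, 0.0), (3.0, 4.0)]
centroids = [(0.0, 0.0), (3.0, 0.0)]
nearest = closest(distances(rows, centroids))
```

Passing a different metric function to distances changes the matrix but not the way closest selects from it.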

ML.Correlate

Correlate

Correlate(numericfield)

numericfield A set of NumericField records to process.
The Correlate module exports the following attributes and subroutines:

Correlate.Simple

Correlate(numericfield).Simple

The Simple attribute returns a record set containing the Pearson and Spearman correlation coefficients for every pair of fields.

Correlate.Kendall

Correlate(numericfield).Kendall

The Kendall attribute returns the Kendall Tau statistic for every pair of fields.
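For readers unfamiliar with the statistics named above, the following Python sketch computes the Pearson coefficient and a simple Kendall Tau for one pair of fields. It is a textbook formulation rather than the library's code: the tau shown is the tau-a variant and assumes there are no tied values, and the function names are hypothetical.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def kendall_tau(xs, ys):
    """Tau-a: (concordant - discordant) / number of pairs; assumes no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / pairs

xs = [1, 2, 3, 4]
linear = [2, 4, 6, 8]
shuffled = [1, 3, 2, 4]
```

A perfectly linear pair gives a value of 1 for both measures, while swapping two neighbouring values in shuffled makes one of the six pairs discordant.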

ML.Discretize

ByRounding

ByRounding(numericfield[,scale,delta])

numericfield A set of NumericField records to process.
scale (Optional) A REAL value for the factor to multiply to bring data into a desired range. Default: 1.0.
delta (Optional) A REAL value to add to rebase a range, cause truncation or rounding up. Default: 0.0.
Return: ByRounding returns a record set.
The ByRounding routine returns a record set in which each value passed in is converted to a discrete element using scale and delta.

ByBucketing

ByBucketing(numericfield[,numgroups])

numericfield A set of NumericField records to process.
numgroups (Optional) An integer expression defining the number of groups to discretize numericfield into. Default: 10.
Return: ByBucketing returns a record set.
The ByBucketing routine returns a record set with each value allocated to one of numgroups buckets formed by dividing the range of the variable equally. Because the buckets have equal width, they may hold unequal numbers of values.

ByTiling

ByTiling(numericfield[,numgroups])

numericfield A set of NumericField records to process.
numgroups (Optional) An integer expression defining the number of groups to discretize numericfield into. Default: 10.
Return: ByTiling returns a record set.
The ByTiling routine returns a record set with values allocated evenly into one of numgroups groups such that all of the elements of group 2 have a higher value than those of group 1.
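The difference between bucketing and tiling is easiest to see side by side. The Python sketch below is a minimal illustration under stated assumptions: groups are numbered from 1, the maximum value is clamped into the last bucket, the input values are not all equal, and the function names are hypothetical rather than the library's.

```python
def by_bucketing(values, numgroups=10):
    """Equal-width buckets over [min, max]; bucket populations may differ."""
    low, high = min(values), max(values)
    width = (high - low) / numgroups  # assumes high > low
    # The maximum would fall just past the last bucket, so clamp it
    return [min(int((v - low) / width) + 1, numgroups) for v in values]

def by_tiling(values, numgroups=10):
    """Equal-population groups by rank; higher groups hold higher values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for rank, i in enumerate(order):
        groups[i] = rank * numgroups // len(values) + 1
    return groups

values = [1.0, 2.0, 3.0, 4.0, 100.0, 101.0]
bucketed = by_bucketing(values, 2)
tiled = by_tiling(values, 2)
```

With two groups, bucketing splits the range at 51 and puts four values in the first bucket, whereas tiling puts three values in each group regardless of how far apart they are.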

Do

Do(numericfield,instructionset)

numericfield A set of NumericField records to process.
instructionset A set of r_Method records containing metadata instructions.
Return: Do returns a record set.
The Do routine returns a record set with all the fields in a file discretized applying a different method to each if specified.

ML.Distribution

GenData

GenData(nrecords,distribution[,nfield])

nrecords An integer expression defining the number of records to generate.
distribution A record set containing a distribution to take a random variable from.
nfield (Optional) An integer expression defining the column to fill. Default: 1.
Return: GenData returns a record set.
The GenData routine generates a record set using random values from a distribution.

Uniform

Uniform(low,high[,ranges])

low A REAL value for the minimum value in the distribution.
high A REAL value for the maximum value in the distribution.
ranges (Optional) An integer expression defining the number of divisions to split the distribution into. Default: 10,000.
The Uniform routine specifies that any (continuous) value between low and high is equally likely to occur.

Normal

Normal(mean,stdeviation[,ranges])

mean A REAL value for the mean.
stdeviation A REAL value for the standard deviation.
ranges (Optional) An integer expression defining the number of divisions to split the distribution into. Default: 10,000.
The Normal routine implements a normal distribution (bell curve) with the specified mean and standard deviation (stdeviation), approximated by a number of straight-line segments given by ranges.

StudentT

StudentT(v[,ranges])

v An integer expression defining the degrees of freedom.
ranges (Optional) An integer expression defining the number of divisions to split the distribution into. Default: 10,000.
The StudentT routine specifies the degrees of freedom for a Student-T distribution.

Exponential

Exponential(lambda[,ranges])

lambda A REAL value for the rate parameter.
ranges (Optional) An integer expression defining the number of divisions to split the distribution into. Default: 10,000.
The Exponential routine implements the exponential (sometimes called negative exponential) distribution.

Binomial

Binomial(p[,ranges])

p A REAL value for the success probability.
ranges (Optional) An integer expression defining the number of divisions to split the distribution into. Default: 100.
The Binomial routine gives the distribution showing the chances of getting ‘k’ successful events in ranges-1 trials, where the chance of success in one trial is ‘p’.
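The quantity the Binomial distribution describes can be written out directly. The Python sketch below uses the standard binomial formula; the parameter name trials and the function names are hypothetical, and it does not model how the routine maps ranges onto the number of trials.

```python
from math import comb

def binomial_density(k, trials, p):
    """Probability of exactly k successes in the given number of trials."""
    return comb(trials, k) * p ** k * (1 - p) ** (trials - k)

def binomial_cumulative(k, trials, p):
    """Probability of at most k successes."""
    return sum(binomial_density(i, trials, p) for i in range(k + 1))

two_heads_in_four = binomial_density(2, 4, 0.5)
```

For four fair coin tosses, six of the sixteen equally likely outcomes contain exactly two heads, which is the value of two_heads_in_four.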

NegBinomial

NegBinomial(p,r[,ranges])

p A REAL value for the success probability.
r An integer expression defining the number of failures.
ranges (Optional) An integer expression defining the number of divisions to split the distribution into. Default: 1,000.
The NegBinomial routine gives the distribution showing the chances of getting ‘k’ successful events before ‘r’ number of failures occurs. The geometric distribution can be obtained by setting r equal to 1.

Poisson

Poisson(lambda[,ranges])

lambda A REAL value for the expected value.
ranges (Optional) An integer expression defining the number of divisions to split the distribution into. Default: 100.
The Poisson routine returns a discrete distribution characterized by lambda.

Distribution Interface

The Distribution Interface (Uniform, Normal, StudentT, Exponential, Binomial, NegBinomial, Poisson) exports the following attributes and subroutines:

Density

Density(RH)

RH A REAL value.
Return: Density returns a single REAL value.
The Density subroutine returns the probability density at point RH.

Cumulative

Cumulative(RH)

RH A REAL value.
Return: Cumulative returns a single REAL value.
The Cumulative subroutine returns the cumulative probability function from negative infinity to RH.

DensityV

DensityV()

The DensityV subroutine returns a record set providing the probability density function at each range point.

CumulativeV

CumulativeV()

The CumulativeV subroutine returns a vector providing the cumulative probability function at each range point.

Ntile

Ntile(percent)

percent A REAL percentage value.
Return: Ntile returns a single REAL value.
The Ntile subroutine returns the value from the underlying domain that corresponds to the given percentile.

InvDensity

InvDensity(delta)

delta A REAL value.
Return: InvDensity returns a single REAL value.
The InvDensity subroutine is an inverse of the Density subroutine. Given a probability density function value delta, the InvDensity subroutine returns the value X such that Density(X) = delta.

ML.FieldAggregates

FieldAggregates

FieldAggregates(numericfield)

numericfield The name of the input field.
The FieldAggregates routine exports the following attributes and subroutines:

FieldAggregates.Simple

FieldAggregates(numericfield).Simple

The Simple attribute returns the

  • minimum
  • maximum
  • sum
  • mean
  • variance
  • standard deviation
of each column.
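The per-column aggregates listed above can be sketched as follows. This Python version is an illustration only: it treats the input as a list of equal-length numeric tuples and uses the population variance (dividing by n), which is an assumption, since this page does not say which variance the attribute reports.

```python
import math

def simple_aggregates(rows):
    """rows is a list of equal-length numeric tuples; returns one dict per column."""
    result = []
    for column in zip(*rows):
        n = len(column)
        mean = sum(column) / n
        # Population variance (divide by n); an assumption for this sketch
        variance = sum((v - mean) ** 2 for v in column) / n
        result.append({
            "min": min(column),
            "max": max(column),
            "sum": sum(column),
            "mean": mean,
            "var": variance,
            "sd": math.sqrt(variance),
        })
    return result

rows = [(1, 10), (2, 20), (3, 30)]
stats = simple_aggregates(rows)
```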

FieldAggregates.SimpleRanked

FieldAggregates(numericfield).SimpleRanked

The SimpleRanked attribute assigns every record a rank, arbitrarily picking which duplicate value receives the lower rank.

FieldAggregates.Ranked

FieldAggregates(numericfield).Ranked

The Ranked attribute assigns every record a rank.

FieldAggregates.Medians

FieldAggregates(numericfield).Medians

The Medians attribute calculates the median for each column.

FieldAggregates.Modes

FieldAggregates(numericfield).Modes

The Modes attribute calculates the mode for each column.

FieldAggregates.NTiles(n)

FieldAggregates(numericfield).NTiles(n)

n An integer expression defining how many groups to split population into.
Return: NTiles(n) returns a record set.
The NTiles subroutine splits records into Tiles based on the value of n. NTiles(4) is quartiles, NTiles(100) is percentiles.

FieldAggregates.NTileRanges(n)

FieldAggregates(numericfield).NTileRanges(n)

n An integer expression defining how many tiles to split population into.
Return: NTileRanges(n) returns a record set.
The NTileRanges subroutine returns information about the highest and lowest value in every tile.

FieldAggregates.Buckets(n)

FieldAggregates(numericfield).Buckets(n)

n An integer expression defining how many buckets to split population into.
Return: Buckets(n) returns a record set.
The Buckets subroutine splits records into groups based on the value of n. Unlike NTiles, the population of each bucket may not be even. However, the range of each group is even.

FieldAggregates.BucketRanges(n)

FieldAggregates(numericfield).BucketRanges(n)

n An integer expression defining how many buckets to split population into.
Return: BucketRanges(n) returns a record set.
The BucketRanges subroutine returns information about the range and size of each bucket.

ML.FromField

FromField

FromField(numericfield,layout,output[,map])

numericfield A set of NumericField records to process.
layout The name of the resulting layout of the returned set.
output The name of the resulting record set.
map (Optional) The mapping table that was created by the ToField routine. Default: ‘ ‘ (left blank)
Return: FromField returns a record set.
The FromField macro reconstitutes an original matrix from a set of NumericField records.

ML.Regression

OLS

OLS(X,Y)

X A record set containing independent variables.
Y A record set containing dependent variables.
The OLS routine contains attributes to calculate regression using the Ordinary Least Squares linear regression models. It exports the following attributes and subroutines:

OLS.Beta

OLS(X,Y).Beta([control])

control (Optional) The Matrix decomposition method. Possible Values:
  • MDM.LU
  • MDM.Cholesky
Default: MDM.Cholesky.
The Beta subroutine calculates beta parameters using the LU or Cholesky matrix decomposition methods.

OLS.Extrapolate

OLS(X,Y).Extrapolate(independent,beta)

independent A record set containing independent variables.
beta A record set containing results derived from Beta.
The Extrapolate subroutine takes the independent variables X and the linear regression model beta, and calculates the dependent variables Y.
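The relationship between Beta and Extrapolate can be illustrated with the simplest case of one independent variable. The Python sketch below uses the closed-form least-squares solution for a straight line instead of the LU or Cholesky decomposition mentioned above, so it is a stand-in for the idea, not the library's method; the function names are hypothetical.

```python
def ols_beta(xs, ys):
    """Return (intercept, slope) minimizing the squared error of y = b0 + b1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def extrapolate(xs, beta):
    """Apply a fitted (intercept, slope) model to new independent values."""
    b0, b1 = beta
    return [b0 + b1 * x for x in xs]

beta = ols_beta([1, 2, 3, 4], [3, 5, 7, 9])
predicted = extrapolate([5, 6], beta)
```

The sample points lie exactly on y = 1 + 2x, so the fitted model recovers that line and extends it to the new inputs.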

Poly

Poly(X,Y,maxN)

X A NumericField record set containing independent variables.
Y A NumericField record set containing dependent variables.
maxN An integer expression defining the maximum number of polynomial components used. Default: 6.
The Poly routine contains attributes to calculate regression using the polynomial regression model. It exports the following attributes and subroutines:

Poly.Beta

Poly(X,Y,maxN).Beta

The Beta attribute returns the unknown parameter value b used to predict values.

Poly.Rsquared

Poly(X,Y,maxN).Rsquared

The Rsquared attribute returns the coefficient of determination, a measure of goodness of fit.

Poly.SubBeta

Poly(X,Y,maxN).SubBeta(K,N)

K An integer expression defining the minimum number of polynomial components used.
N An integer expression defining the maximum number of polynomial components used.
Return: SubBeta returns a record set.
The SubBeta subroutine uses K out of N polynomial components and finds the best model.

ML.ToField

ToField

ToField(recordset,output[,idfield,datafields])

recordset A set of records to process.
output The name of the resulting NumericField record set.
idfield (Optional) A field that contains the Record ID for each row. Default: If omitted, it is assumed to be the first field.
datafields (Optional) A STRING containing a comma-delimited list of the fields to be treated as axes. Default: If omitted, all numeric fields that are not the Record ID will be treated as axes. NOTE: idfield defaults to the first field in the table, so if that field is specified as an axis field, then the user should be sure to specify a value in the idfield parameter.
Return: ToField returns a record set.
The ToField routine takes an input dataset with a record ID column and expands it into a NumericField dataset to be used by the ML-Library.
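The expansion that ToField performs, and the reverse step performed by FromField, can be pictured with a short Python sketch. It assumes that each NumericField cell carries a record id, an axis number and a value, which this page does not spell out, and it assumes every non-id field is numeric. The names to_field and from_field are hypothetical stand-ins, not the ECL macros.

```python
def to_field(records, idfield=None, datafields=None):
    """Expand wide records (dicts) into (id, number, value) cells, one per axis."""
    names = list(records[0])
    idfield = idfield or names[0]  # default: the first field is the record id
    if datafields:
        axes = datafields.split(",")  # comma-delimited list of axis fields
    else:
        axes = [n for n in names if n != idfield]
    return [(r[idfield], number, float(r[name]))
            for r in records for number, name in enumerate(axes, start=1)]

def from_field(cells, idfield, axes):
    """Rebuild wide records from (id, number, value) cells."""
    rows = {}
    for rec_id, number, value in cells:
        rows.setdefault(rec_id, {idfield: rec_id})[axes[number - 1]] = value
    return list(rows.values())

records = [
    {"id": 1, "height": 1.5, "weight": 60},
    {"id": 2, "height": 1.8, "weight": 80},
]
cells = to_field(records)
rebuilt = from_field(cells, "id", ["height", "weight"])
```

Each input row becomes one cell per axis, and the round trip through from_field restores the original rows.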

ML.Docs.CoLocation

Words

Words(rawrecordset)

rawrecordset A set of Raw records to process.
Return: Words returns a record set.
The Words routine takes a Raw record set and calls the Tokenize.Clean and Tokenize.Split routines to map words for further textual analysis.

Lexicon

Lexicon(words)

words A set of WordElement records derived from a Words routine to process.
Return: Lexicon returns a record set.
The Lexicon routine calls the Tokenize.Lexicon routine.

AllNGrams

AllNGrams(words[,lexicon][,n])

words A set of WordElement records derived from a Words routine to process.
lexicon (Optional) The output from the Lexicon routine.
n (Optional) An integer expression defining the maximum ngram size. Default: 3.
Return: AllNGrams returns a record set.
The AllNGrams routine harvests every n-gram, from unigrams up to n defined by the user.
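As an illustration of what harvesting every n-gram means, the Python sketch below collects all contiguous word sequences from unigrams up to size n for a single document. It works on plain strings rather than lexicon ids and ignores document ids, so it is a simplification of the routine described above; the function name is hypothetical.

```python
def all_ngrams(words, n=3):
    """Every contiguous n-gram from unigrams up to size n, shortest sizes first."""
    grams = []
    for size in range(1, n + 1):
        for start in range(len(words) - size + 1):
            grams.append(tuple(words[start:start + size]))
    return grams

bigrams_and_below = all_ngrams(["a", "b", "c"], 2)
```

A three-word document yields three unigrams, two bigrams and one trigram, so the default n of 3 produces six n-grams in total.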

Support

Support(setofstrings,allngrams)

setofstrings A set of strings.
allngrams The output from the AllNGrams routine.
Return: Support returns a single real value.
The Support routine returns the ratio of the number of documents that contain all the items in the set to the total number of documents.

Confidence

Confidence(setofstrings1,setofstrings2,allngrams)

setofstrings1 A set of strings.
setofstrings2 A set of strings.
allngrams The output from the AllNGrams routine.
Return: Confidence returns a single real value.
The Confidence routine returns the ratio of the support of two sets of ngrams to the support of the first set of ngrams.

Lift

Lift(setofstrings1,setofstrings2,allngrams)

setofstrings1 A set of strings.
setofstrings2 A set of strings.
allngrams The output from the AllNGrams routine.
Return: Lift returns a single real value.
The Lift routine returns the ratio of the support of two sets of ngrams to the product of the supports for the two sets of ngrams.

Conviction

Conviction(setofstrings1,setofstrings2,allngrams)

setofstrings1 A set of strings.
setofstrings2 A set of strings.
allngrams The output from the AllNGrams routine.
Return: Conviction returns a single real value.
The Conviction routine is the ratio of one minus the support for the second set to one minus the confidence of the two sets.
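The four ratios defined above (Support, Confidence, Lift and Conviction) can be sketched in Python as follows. The sketch reads "the support of two sets" as the support of their union, which matches the usual association-rule definitions but is an interpretation of this page. It works on plain word lists rather than the AllNGrams output, and the function names and documents are hypothetical.

```python
def support(items, documents):
    """Fraction of documents that contain every item in the set."""
    items = set(items)
    return sum(1 for doc in documents if items <= set(doc)) / len(documents)

def confidence(first, second, documents):
    """Support of both sets together relative to the support of the first."""
    return support(set(first) | set(second), documents) / support(first, documents)

def lift(first, second, documents):
    """Support of both sets together relative to the product of their supports."""
    return support(set(first) | set(second), documents) / (
        support(first, documents) * support(second, documents))

def conviction(first, second, documents):
    """(1 - support of second) / (1 - confidence); undefined when confidence is 1."""
    return (1 - support(second, documents)) / (1 - confidence(first, second, documents))

documents = [
    ["big", "data", "tools"],
    ["big", "data"],
    ["big", "ideas"],
    ["small", "data"],
]
```

In this corpus "big" and "data" each appear in three of four documents and together in two, giving a confidence of about 0.67, a lift of about 0.89 and a conviction of 0.75.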

NGrams

NGrams(allngrams)

allngrams The output from the AllNGrams routine.
Return: NGrams returns a record set.
The NGrams routine strips the document ids and groups the dataset so that there is one row per unique n-gram with aggregate information such as:
  • number of documents in which the item appears
  • the term frequency
  • the inverse document frequency (IDF).

SubGrams

SubGrams(ngrams)

ngrams The output from the NGrams routine.
Return: SubGrams returns a record set.
The SubGrams routine produces a dataset of every n-gram where n>1 along with a comparison of the document frequency of the n-gram to the product of the frequencies of all of its constituent unigrams.

SplitCompare

SplitCompare(ngrams)

ngrams The output from the NGrams routine.
Return: SplitCompare returns a record set.
The SplitCompare routine splits every n-gram with n>1 into two rows, each with two parts:
  • the initial unigram and the remainder
  • the final unigram and the remainder
The document frequencies of all three items (the full n-gram, and the two constituent parts) are then presented side-by-side so their relative values can be evaluated.

ShowPhrase

ShowPhrase(lexicon,string)

lexicon A record set containing a lexicon derived from the Lexicon routine.
string A STRING of INTEGERs representing words in lexicon.
Return: ShowPhrase returns a record set.
The ShowPhrase routine replaces the integers with the words they represent, reconstituting phrases of significance.

ML.Docs.Tokenize

Enumerate

Enumerate(rawrecordset)

rawrecordset A set of Raw records to process.
Return: Enumerate returns a record set.
The Enumerate routine assigns a sequential integer ID to the dataset.

Clean

Clean(rawrecordset)

rawrecordset A set of Raw records to process.
Return: Clean returns a record set.
The Clean routine standardizes the text by performing actions such as removing punctuation, converting all letters into uppercase and normalizing common contractions.

Split

Split(recordset)

recordset A record set containing output derived from the Clean routine.
Return: Split returns a record set.
The Split routine breaks each word into a separate entity.

Lexicon

Lexicon(recordset)

recordset A record set containing output derived from the Split routine.
Return: Lexicon returns a record set.
The Lexicon routine aggregates the data from recordset, grouping by word.

ToO

ToO(recordset,lexicon)

recordset A record set containing output derived from the Split routine.
lexicon A record set containing output derived from the Lexicon routine.
Return: ToO returns a record set.
The ToO routine creates a dataset that replaces the word with the word id assigned to it in the lexicon that was passed as a parameter.

FromO

FromO(owordelement,lexicon)

owordelement A record set containing output derived from the ToO routine.
lexicon A record set containing output derived from the Lexicon routine.
Return: FromO returns a record set.
The FromO routine re-constitutes a table that was produced by the ToO routine back into the WordElement format.
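The round trip between words and word ids can be sketched in Python. This is an illustration only: the lexicon here is a plain dictionary, ids are assigned most-frequent-first with ties broken alphabetically (an assumption, since this page does not say how ids are ordered), and the names lexicon, to_o and from_o are hypothetical stand-ins for the routines above.

```python
from collections import Counter

def lexicon(words):
    """Assign an id to each distinct word, most frequent first (assumed ordering)."""
    counts = Counter(words)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: word_id for word_id, word in enumerate(ranked, start=1)}

def to_o(words, lex):
    """Replace each word with its id from the lexicon."""
    return [lex[w] for w in words]

def from_o(ids, lex):
    """Replace each id with the word it stands for."""
    reverse = {word_id: word for word, word_id in lex.items()}
    return [reverse[i] for i in ids]

words = ["THE", "CAT", "SAW", "THE", "DOG"]
lex = lexicon(words)
encoded = to_o(words, lex)
```

Encoding with to_o and decoding with from_o against the same lexicon returns the original word sequence.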

ML.Docs.Trans

Trans

Trans(owordelement)

owordelement A record set containing output derived from the ToO routine.
The Trans routine exports the following attributes and subroutines:

Trans.Wordbag

Trans(owordelement).Wordbag

The WordBag attribute turns every document in the dataset into a wordbag by removing multiple occurrences of a word within a document and counting them.

Trans.WordsCounted

Trans(owordelement).WordsCounted

The WordsCounted attribute returns a dataset with the number of times that each word occurs and its tf-idf (Term Frequency – Inverse Document Frequency).

Trans.TfIdf

Trans(owordelement).TfIdf([lowthreshold][,lowdoccount])

lowthreshold (Optional) A REAL value for the tf-idf value a word must be above to be kept. Default: 0.05.
lowdoccount (Optional) An integer expression defining the least number of documents a word must appear to qualify as a keyword candidate. Default: 200.
The TfIdf subroutine takes a word stream, determines the Term Frequency – Inverse Document Frequency (TF-IDF) for every id/word combination, and returns those that are above a set threshold lowthreshold.
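The filtering that TfIdf describes can be sketched with a common textbook formulation of tf-idf. The Python below is an assumption-laden illustration, not the library's formula: term frequency is the word count divided by document length, inverse document frequency is the natural log of the number of documents over the document frequency, and the function and parameter names are hypothetical (only the default values are taken from this page).

```python
import math
from collections import Counter

def tf_idf(documents, low_threshold=0.05, low_doc_count=200):
    """Return {(doc_id, word): score} for scores above low_threshold."""
    # Number of documents in which each word appears at least once
    doc_freq = Counter(word for doc in documents for word in set(doc))
    scores = {}
    for doc_id, doc in enumerate(documents):
        for word, count in Counter(doc).items():
            if doc_freq[word] < low_doc_count:
                continue  # too rare to be a keyword candidate
            tf = count / len(doc)
            idf = math.log(len(documents) / doc_freq[word])
            if tf * idf > low_threshold:
                scores[(doc_id, word)] = tf * idf
    return scores

documents = [["a", "b", "a"], ["a", "c"]]
scores = tf_idf(documents, low_doc_count=1)
```

The word "a" appears in every document, so its inverse document frequency is zero and it is dropped, while "b" and "c" are kept for the documents they appear in. With the default low_doc_count of 200 nothing in such a small corpus would qualify, which is why the example lowers it to 1.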