public class SGDText extends RandomizableClassifier implements UpdateableClassifier, WeightedInstancesHandler
-F Set the loss function to minimize. 0 = hinge loss (SVM), 1 = log loss (logistic regression) (default = 0)
-outputProbs Output probabilities for SVMs (fits a logistic model to the output of the SVM)
-L The learning rate (default = 0.01).
-R <double> The lambda regularization constant (default = 0.0001)
-E <integer> The number of epochs to perform (batch learning only, default = 500)
-W Use word frequencies instead of binary bag of words.
-P <# instances> How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)
-M <double> Minimum word frequency. Words with less than this frequency are ignored. If periodic pruning is turned on then this is also used to determine which words to remove from the dictionary (default = 3).
-normalize Normalize document length (use in conjunction with -norm and -lnorm)
-norm <num> Specify the norm that each instance must have (default 1.0)
-lnorm <num> Specify L-norm to use (default 2.0)
-lowercase Convert all tokens to lowercase before adding to the dictionary.
-stoplist Ignore words that are in the stoplist.
-stopwords <file> A file containing stopwords to override the default ones. Using this option automatically sets the flag ('-stoplist') to use the stoplist if the file exists. Format: one stopword per line, lines starting with '#' are interpreted as comments and ignored.
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
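As an illustration of the -P and -M options, here is a minimal sketch of periodic dictionary pruning. It assumes the dictionary is a map from word to accumulated count; the class and method names (`DictionaryPruningSketch`, `prune`, `pruningDue`) are hypothetical and are not part of SGDText, whose internal bookkeeping may differ.

```java
import java.util.Iterator;
import java.util.Map;

// Hypothetical helper, not part of Weka: illustrates the idea behind -P and -M.
final class DictionaryPruningSketch {

    /** Removes every word whose count is below minFrequency; returns how many were removed. */
    static int prune(Map<String, Double> dictionary, double minFrequency) {
        int removed = 0;
        Iterator<Map.Entry<String, Double>> it = dictionary.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() < minFrequency) {
                it.remove(); // safe removal while iterating
                removed++;
            }
        }
        return removed;
    }

    /** Pruning is due when period > 0 and instancesSeen is a positive multiple of period. */
    static boolean pruningDue(long instancesSeen, int period) {
        return period > 0 && instancesSeen > 0 && instancesSeen % period == 0;
    }
}
```

With a period of 0 `pruningDue` never fires, which matches the documented default of not pruning.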
Modifier and Type | Field and Description |
---|---|
static int | HINGE: the hinge loss function. |
static int | LOGLOSS: the log loss function. |
static Tag[] | TAGS_SELECTION: loss functions to choose from. |
Constructor and Description |
---|
SGDText() |
Modifier and Type | Method and Description |
---|---|
void | buildClassifier(Instances data): Method for building the classifier. |
double[] | distributionForInstance(Instance inst): Predicts the class memberships for a given instance. |
String | epochsTipText(): Returns the tip text for this property. |
Capabilities | getCapabilities(): Returns default capabilities of the classifier. |
int | getEpochs(): Get the current number of epochs. |
double | getLambda(): Get the current value of lambda. |
double | getLearningRate(): Get the learning rate. |
double | getLNorm(): Get the L-norm used. |
SelectedTag | getLossFunction(): Get the current loss function. |
boolean | getLowercaseTokens(): Get whether to convert all tokens to lowercase. |
double | getMinWordFrequency(): Get the minimum word frequency. |
double | getNorm(): Get the instance's norm. |
boolean | getNormalizeDocLength(): Get whether to normalize the length of each document. |
String[] | getOptions(): Gets the current settings of the classifier. |
boolean | getOutputProbsForSVM(): Get whether to fit a logistic regression (itself trained using SGD) to the outputs of the SVM (if an SVM is being learned). |
int | getPeriodicPruning(): Get how often to prune the dictionary. |
String | getRevision(): Returns the revision string. |
Stemmer | getStemmer(): Returns the current stemming algorithm, null if none is used. |
File | getStopwords(): Returns the file used for obtaining the stopwords; if the file represents a directory, the default ones are used. |
Tokenizer | getTokenizer(): Returns the current tokenizer algorithm. |
boolean | getUseStopList(): Get whether to ignore all words that are on the stoplist. |
boolean | getUseWordFrequencies(): Get whether to use word frequencies rather than binary bag of words representation. |
String | globalInfo(): Returns a string describing the classifier. |
String | lambdaTipText(): Returns the tip text for this property. |
String | learningRateTipText(): Returns the tip text for this property. |
Enumeration<Option> | listOptions(): Returns an enumeration describing the available options. |
String | LNormTipText(): Returns the tip text for this property. |
String | lossFunctionTipText(): Returns the tip text for this property. |
String | lowercaseTokensTipText(): Returns the tip text for this property. |
static void | main(String[] args): Main method for testing this class. |
String | minWordFrequencyTipText(): Returns the tip text for this property. |
String | normalizeDocLengthTipText(): Returns the tip text for this property. |
String | normTipText(): Returns the tip text for this property. |
String | outputProbsForSVMTipText(): Returns the tip text for this property. |
String | periodicPruningTipText(): Returns the tip text for this property. |
void | reset(): Reset the classifier. |
void | setEpochs(int e): Set the number of epochs to use. |
void | setLambda(double lambda): Set the value of lambda to use. |
void | setLearningRate(double lr): Set the learning rate. |
void | setLNorm(double newLNorm): Set the L-norm to use. |
void | setLossFunction(SelectedTag function): Set the loss function to use. |
void | setLowercaseTokens(boolean l): Set whether to convert all tokens to lowercase. |
void | setMinWordFrequency(double minFreq): Set the minimum word frequency. |
void | setNorm(double newNorm): Set the norm of the instances. |
void | setNormalizeDocLength(boolean norm): Set whether to normalize the length of each document. |
void | setOptions(String[] options): Parses a given list of options. |
void | setOutputProbsForSVM(boolean o): Set whether to fit a logistic regression (itself trained using SGD) to the outputs of the SVM (if an SVM is being learned). |
void | setPeriodicPruning(int p): Set how often to prune the dictionary. |
void | setStemmer(Stemmer value): Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used). |
void | setStopwords(File value): Sets the file containing the stopwords; null or a directory unsets the stopwords. |
void | setTokenizer(Tokenizer value): Sets the tokenizer algorithm to use. |
void | setUseStopList(boolean u): Set whether to ignore all words that are on the stoplist. |
void | setUseWordFrequencies(boolean u): Set whether to use word frequencies rather than binary bag of words representation. |
String | stemmerTipText(): Returns the tip text for this property. |
String | stopwordsTipText(): Returns the tip text for this property. |
String | tokenizerTipText(): Returns the tip text for this property. |
String | toString() |
void | updateClassifier(Instance instance): Updates the classifier with the given instance. |
String | useStopListTipText(): Returns the tip text for this property. |
String | useWordFrequenciesTipText(): Returns the tip text for this property. |
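Methods inherited from class RandomizableClassifier: getSeed, seedTipText, setSeed
Methods inherited from class AbstractClassifier: classifyInstance, debugTipText, forName, getDebug, makeCopies, makeCopy, runClassifier, setDebug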
public static final int HINGE
public static final int LOGLOSS
public static final Tag[] TAGS_SELECTION
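For orientation, the two loss functions named by HINGE and LOGLOSS can be sketched with their standard textbook definitions. This is not the SGDText source: the class name `LossSketch` is hypothetical, labels are assumed to be encoded as -1 or +1, and the constant values 0 and 1 are assumed to mirror the -F option values.

```java
// Hypothetical illustration of the two loss functions; not part of Weka.
final class LossSketch {
    static final int HINGE = 0;   // assumed to mirror "-F 0"
    static final int LOGLOSS = 1; // assumed to mirror "-F 1"

    /** Loss for a label y in {-1, +1} and a raw model output z. */
    static double loss(int lossFunction, double y, double z) {
        double margin = y * z;
        switch (lossFunction) {
            case HINGE:
                return Math.max(0.0, 1.0 - margin);      // zero once the margin reaches 1
            case LOGLOSS:
                return Math.log1p(Math.exp(-margin));    // log(1 + e^(-margin))
            default:
                throw new IllegalArgumentException("Unknown loss function: " + lossFunction);
        }
    }

    /** Derivative of the loss with respect to z (a subgradient for the hinge loss). */
    static double dLoss(int lossFunction, double y, double z) {
        double margin = y * z;
        switch (lossFunction) {
            case HINGE:
                return margin < 1.0 ? -y : 0.0;
            case LOGLOSS:
                return -y / (1.0 + Math.exp(margin));
            default:
                throw new IllegalArgumentException("Unknown loss function: " + lossFunction);
        }
    }
}
```

The hinge loss ignores examples that are already classified with a margin of at least 1, while the log loss always contributes a small gradient.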
public Capabilities getCapabilities()
Specified by: getCapabilities in interface Classifier
Specified by: getCapabilities in interface CapabilitiesHandler
Overrides: getCapabilities in class AbstractClassifier
See Also: Capabilities
public void setStemmer(Stemmer value)
Parameters: value - the configured stemming algorithm, or null (in which case the NullStemmer is used)
public Stemmer getStemmer()
public String stemmerTipText()
public void setTokenizer(Tokenizer value)
Parameters: value - the configured tokenizing algorithm
public Tokenizer getTokenizer()
public String tokenizerTipText()
public String useWordFrequenciesTipText()
public void setUseWordFrequencies(boolean u)
Parameters: u - true if word frequencies are to be used.
public boolean getUseWordFrequencies()
Returns: true if word frequencies are to be used.
public String lowercaseTokensTipText()
public void setLowercaseTokens(boolean l)
Parameters: l - true if all tokens are to be converted to lowercase
public boolean getLowercaseTokens()
public String useStopListTipText()
public void setUseStopList(boolean u)
Parameters: u - true to ignore all words on the stoplist.
public boolean getUseStopList()
public void setStopwords(File value)
Parameters: value - the file containing the stopwords
public File getStopwords()
public String stopwordsTipText()
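The stopwords file format (one stopword per line, lines starting with '#' treated as comments) can be illustrated with a small parser. This is a sketch only: `StopwordFileSketch` is a hypothetical name, and trimming whitespace and skipping blank lines are assumptions that the documentation does not state.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical parser for the documented stopwords file format; not part of Weka.
final class StopwordFileSketch {

    /** Collects one stopword per line, ignoring comment lines that start with '#'. */
    static Set<String> parse(List<String> lines) {
        Set<String> words = new LinkedHashSet<>(); // keeps file order, drops duplicates
        for (String line : lines) {
            String word = line.trim();             // assumption: surrounding whitespace is not significant
            if (word.isEmpty() || word.startsWith("#")) {
                continue;                          // assumption: blank lines are skipped like comments
            }
            words.add(word);
        }
        return words;
    }
}
```

The lines of an actual file could be obtained with `java.nio.file.Files.readAllLines` before being passed to `parse`.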
public String periodicPruningTipText()
public void setPeriodicPruning(int p)
Parameters: p - how often to prune
public int getPeriodicPruning()
public String minWordFrequencyTipText()
public void setMinWordFrequency(double minFreq)
Parameters: minFreq - the minimum word frequency to use
public double getMinWordFrequency()
Returns: the minimum word frequency to use
public String normalizeDocLengthTipText()
public void setNormalizeDocLength(boolean norm)
Parameters: norm - true if document lengths are to be normalized
public boolean getNormalizeDocLength()
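Document length normalization, controlled by setNormalizeDocLength together with setNorm and setLNorm, can be pictured as rescaling each document vector so that its L-p norm equals a target value. The sketch below shows that generic rescaling under the assumption that a document is a dense array of word weights; `DocLengthNormSketch` is a hypothetical class and may not match how SGDText applies the norm internally.

```java
// Hypothetical illustration of L-p document length normalization; not part of Weka.
final class DocLengthNormSketch {

    /** Computes the L-p norm: (sum of |v|^p)^(1/p). */
    static double lNorm(double[] values, double p) {
        double sum = 0.0;
        for (double v : values) {
            sum += Math.pow(Math.abs(v), p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    /** Returns a copy of values rescaled so that its L-p norm equals targetNorm. */
    static double[] normalize(double[] values, double targetNorm, double p) {
        double[] out = values.clone();
        double current = lNorm(values, p);
        if (current == 0.0) {
            return out; // an all-zero document cannot be rescaled
        }
        double scale = targetNorm / current;
        for (int i = 0; i < out.length; i++) {
            out[i] *= scale;
        }
        return out;
    }
}
```

With the documented defaults (norm 1.0, L-norm 2.0) this corresponds to scaling each document to unit Euclidean length.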
public String normTipText()
public double getNorm()
public void setNorm(double newNorm)
Parameters: newNorm - the norm to which the instances must be set
public String LNormTipText()
public double getLNorm()
public void setLNorm(double newLNorm)
Parameters: newLNorm - the L-norm
public String lambdaTipText()
public void setLambda(double lambda)
Parameters: lambda - the value of lambda to use
public double getLambda()
public void setLearningRate(double lr)
Parameters: lr - the learning rate to use.
public double getLearningRate()
public String learningRateTipText()
public String epochsTipText()
public void setEpochs(int e)
Parameters: e - the number of epochs to use
public int getEpochs()
public void setLossFunction(SelectedTag function)
Parameters: function - the loss function to use.
public SelectedTag getLossFunction()
public String lossFunctionTipText()
public void setOutputProbsForSVM(boolean o)
Parameters: o - true if a logistic regression is to be fit to the output of the SVM to produce probability estimates.
public boolean getOutputProbsForSVM()
public String outputProbsForSVMTipText()
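To show how the learning rate, lambda, the number of epochs and the loss function interact, here is a generic textbook SGD step for an L2-regularized linear model. It is a sketch, not the SGDText update rule: the class `SgdStepSketch`, the dense weight array, the -1/+1 label encoding and the exact form of the weight decay are all assumptions made for illustration.

```java
// Hypothetical, generic SGD sketch for a linear model; not the Weka implementation.
final class SgdStepSketch {
    static final int HINGE = 0;   // assumed to mirror "-F 0"
    static final int LOGLOSS = 1; // assumed to mirror "-F 1"

    /** One in-place update of w for the example (x, y), with y in {-1, +1}. */
    static void step(double[] w, double[] x, double y, int lossFunction,
                     double learningRate, double lambda) {
        double z = 0.0;
        for (int i = 0; i < w.length; i++) {
            z += w[i] * x[i];                        // raw model output
        }
        double margin = y * z;
        double dLoss = (lossFunction == HINGE)
                ? (margin < 1.0 ? -y : 0.0)          // hinge subgradient
                : -y / (1.0 + Math.exp(margin));     // log loss gradient
        for (int i = 0; i < w.length; i++) {
            // shrink the weight (L2 regularization), then move against the loss gradient
            w[i] = w[i] * (1.0 - learningRate * lambda) - learningRate * dLoss * x[i];
        }
    }

    /** Batch-style training: repeats a full pass over the data for the given number of epochs. */
    static void train(double[] w, double[][] xs, double[] ys, int epochs,
                      int lossFunction, double learningRate, double lambda) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int n = 0; n < xs.length; n++) {
                step(w, xs[n], ys[n], lossFunction, learningRate, lambda);
            }
        }
    }
}
```

In this picture, updateClassifier would correspond to a single call to `step`, while buildClassifier would correspond to `train` over the whole data set, which is why the epochs setting applies to batch learning only.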
public Enumeration<Option> listOptions()
Specified by: listOptions in interface OptionHandler
Overrides: listOptions in class RandomizableClassifier
public void setOptions(String[] options) throws Exception
-F Set the loss function to minimize. 0 = hinge loss (SVM), 1 = log loss (logistic regression) (default = 0)
-outputProbs Output probabilities for SVMs (fits a logistic model to the output of the SVM)
-L The learning rate (default = 0.01).
-R <double> The lambda regularization constant (default = 0.0001)
-E <integer> The number of epochs to perform (batch learning only, default = 500)
-W Use word frequencies instead of binary bag of words.
-P <# instances> How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)
-M <double> Minimum word frequency. Words with less than this frequency are ignored. If periodic pruning is turned on then this is also used to determine which words to remove from the dictionary (default = 3).
-normalize Normalize document length (use in conjunction with -norm and -lnorm)
-norm <num> Specify the norm that each instance must have (default 1.0)
-lnorm <num> Specify L-norm to use (default 2.0)
-lowercase Convert all tokens to lowercase before adding to the dictionary.
-stoplist Ignore words that are in the stoplist.
-stopwords <file> A file containing stopwords to override the default ones. Using this option automatically sets the flag ('-stoplist') to use the stoplist if the file exists. Format: one stopword per line, lines starting with '#' are interpreted as comments and ignored.
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
Specified by: setOptions in interface OptionHandler
Overrides: setOptions in class RandomizableClassifier
Parameters: options - the list of options as an array of strings
Throws: Exception - if an option is not supported
public String[] getOptions()
Specified by: getOptions in interface OptionHandler
Overrides: getOptions in class RandomizableClassifier
public String globalInfo()
public void reset()
public void buildClassifier(Instances data) throws Exception
Specified by: buildClassifier in interface Classifier
Parameters: data - the set of training instances.
Throws: Exception - if the classifier can't be built successfully.
public void updateClassifier(Instance instance) throws Exception
Specified by: updateClassifier in interface UpdateableClassifier
Parameters: instance - the new training instance to include in the model
Throws: Exception - if the instance could not be incorporated in the model.
public double[] distributionForInstance(Instance inst) throws Exception
Description copied from class: AbstractClassifier
Specified by: distributionForInstance in interface Classifier
Overrides: distributionForInstance in class AbstractClassifier
Parameters: inst - the instance to be classified
Throws: Exception - if distribution could not be computed successfully
public String getRevision()
Specified by: getRevision in interface RevisionHandler
Overrides: getRevision in class AbstractClassifier
public static void main(String[] args)
Copyright © 2012 University of Waikato, Hamilton, NZ. All Rights Reserved.