NaiveBayesMultinomialText (weka-dev 3.7.6-SNAPSHOT API)

java.lang.Object
- weka.classifiers.AbstractClassifier
- - weka.classifiers.bayes.NaiveBayesMultinomialText

All Implemented Interfaces: Serializable, Cloneable, Classifier, UpdateableClassifier, CapabilitiesHandler, OptionHandler, RevisionHandler, WeightedInstancesHandler

public class NaiveBayesMultinomialText
extends AbstractClassifier
implements UpdateableClassifier, WeightedInstancesHandler

Multinomial naive Bayes for text data. Operates directly (and only) on String attributes. Other types of input attributes are accepted but ignored during training and classification.
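
To make the model concrete, here is a minimal, self-contained sketch of multinomial naive Bayes over raw text. It is not Weka's implementation: the class name `TinyMultinomialNB`, the whitespace tokenization, and the add-one (Laplace) smoothing are all assumptions chosen for illustration. It mirrors the shape of the API documented here, with an incremental `update` step and a `distribution` step that returns class membership probabilities.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Weka.
class TinyMultinomialNB {
    private final int numClasses;
    private final double[] docsPerClass;   // documents seen per class
    private final double[] tokensPerClass; // total token count per class
    private final Map<String, double[]> wordCounts = new HashMap<>(); // word -> per-class counts
    private double totalDocs = 0;

    public TinyMultinomialNB(int numClasses) {
        this.numClasses = numClasses;
        this.docsPerClass = new double[numClasses];
        this.tokensPerClass = new double[numClasses];
    }

    /** Incrementally adds one labelled document to the counts. */
    public void update(String text, int classIndex) {
        docsPerClass[classIndex]++;
        totalDocs++;
        for (String tok : text.split("\\s+")) {
            if (tok.isEmpty()) continue;
            wordCounts.computeIfAbsent(tok, k -> new double[numClasses])[classIndex]++;
            tokensPerClass[classIndex]++;
        }
    }

    /** Returns class membership probabilities for a piece of text. */
    public double[] distribution(String text) {
        int vocabSize = wordCounts.size();
        double[] logScore = new double[numClasses];
        for (int c = 0; c < numClasses; c++) {
            // Smoothed class prior (assumption: add-one smoothing).
            logScore[c] = Math.log((docsPerClass[c] + 1.0) / (totalDocs + numClasses));
        }
        for (String tok : text.split("\\s+")) {
            double[] counts = wordCounts.get(tok);
            if (counts == null) continue; // words never seen in training are skipped
            for (int c = 0; c < numClasses; c++) {
                logScore[c] += Math.log((counts[c] + 1.0) / (tokensPerClass[c] + vocabSize));
            }
        }
        // Convert log scores to probabilities with the log-sum-exp trick.
        double max = Double.NEGATIVE_INFINITY;
        for (double s : logScore) max = Math.max(max, s);
        double[] probs = new double[numClasses];
        double sum = 0;
        for (int c = 0; c < numClasses; c++) {
            probs[c] = Math.exp(logScore[c] - max);
            sum += probs[c];
        }
        for (int c = 0; c < numClasses; c++) probs[c] /= sum;
        return probs;
    }
}
```

Working in log space avoids underflow when a document contains many tokens; the final normalization makes the scores sum to one.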

Valid options are:

 -W
  Use word frequencies instead of binary bag of words.

 -P <# instances>
  How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)

 -M <double>
  Minimum word frequency. Words with a frequency below this value are ignored.
  If periodic pruning is turned on then this is also used to determine which
  words to remove from the dictionary (default = 3).

 -normalize
  Normalize document length (use in conjunction with -norm and -lnorm)

 -norm <num>
  Specify the norm that each instance must have (default 1.0)

 -lnorm <num>
  Specify L-norm to use (default 2.0)

 -lowercase
  Convert all tokens to lowercase before adding to the dictionary.

 -stoplist
  Ignore words that are in the stoplist.

 -stopwords <file>
  A file containing stopwords to override the default ones.
  Using this option automatically sets the flag ('-stoplist') to use the
  stoplist if the file exists.
  Format: one stopword per line, lines starting with '#'
  are interpreted as comments and ignored.

 -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)

 -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.

Author: Mark Hall (mhall{[at]}pentaho{[dot]}com), Andrew Golightly (acg4@cs.waikato.ac.nz), Bernhard Pfahringer (bernhard@cs.waikato.ac.nz)
See Also: Serialized Form

Constructor Summary

Constructors
Constructor and Description

NaiveBayesMultinomialText()

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`buildClassifier(Instances data)` Generates the classifier.
`double[]`	`distributionForInstance(Instance instance)` Calculates the class membership probabilities for the given test instance.
`Capabilities`	`getCapabilities()` Returns default capabilities of the classifier.
`double`	`getLNorm()` Get the L Norm used.
`boolean`	`getLowercaseTokens()` Get whether to convert all tokens to lowercase
`double`	`getMinWordFrequency()` Get the minimum word frequency.
`double`	`getNorm()` Get the instance's Norm.
`boolean`	`getNormalizeDocLength()` Get whether to normalize the length of each document
`String[]`	`getOptions()` Gets the current settings of the classifier.
`int`	`getPeriodicPruning()` Get how often to prune the dictionary
`String`	`getRevision()` Returns the revision string.
`Stemmer`	`getStemmer()` Returns the current stemming algorithm, null if none is used.
`File`	`getStopwords()` Returns the file used for obtaining the stopwords; if the file represents a directory, the default ones are used.
`Tokenizer`	`getTokenizer()` Returns the current tokenizer algorithm.
`boolean`	`getUseStopList()` Get whether to ignore all words that are on the stoplist.
`boolean`	`getUseWordFrequencies()` Get whether to use word frequencies rather than binary bag of words representation.
`String`	`globalInfo()` Returns a string describing the classifier.
`Enumeration<Option>`	`listOptions()` Returns an enumeration describing the available options.
`String`	`LNormTipText()` Returns the tip text for this property
`String`	`lowercaseTokensTipText()` Returns the tip text for this property
`static void`	`main(String[] args)` Main method for testing this class.
`String`	`minWordFrequencyTipText()` Returns the tip text for this property
`String`	`normalizeDocLengthTipText()` Returns the tip text for this property
`String`	`normTipText()` Returns the tip text for this property
`String`	`periodicPruningTipText()` Returns the tip text for this property
`void`	`reset()` Reset the classifier.
`void`	`setLNorm(double newLNorm)` Set the L-norm to use.
`void`	`setLowercaseTokens(boolean l)` Set whether to convert all tokens to lowercase
`void`	`setMinWordFrequency(double minFreq)` Set the minimum word frequency.
`void`	`setNorm(double newNorm)` Set the norm of the instances
`void`	`setNormalizeDocLength(boolean norm)` Set whether to normalize the length of each document
`void`	`setOptions(String[] options)` Parses a given list of options.
`void`	`setPeriodicPruning(int p)` Set how often to prune the dictionary
`void`	`setStemmer(Stemmer value)` Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
`void`	`setStopwords(File value)` Sets the file containing the stopwords; null or a directory unsets the stopwords.
`void`	`setTokenizer(Tokenizer value)` Sets the tokenizer algorithm to use.
`void`	`setUseStopList(boolean u)` Set whether to ignore all words that are on the stoplist.
`void`	`setUseWordFrequencies(boolean u)` Set whether to use word frequencies rather than binary bag of words representation.
`String`	`stemmerTipText()` Returns the tip text for this property.
`String`	`stopwordsTipText()` Returns the tip text for this property.
`String`	`tokenizerTipText()` Returns the tip text for this property.
`String`	`toString()` Returns a textual description of this classifier.
`void`	`updateClassifier(Instance instance)` Updates the classifier with the given instance.
`String`	`useStopListTipText()` Returns the tip text for this property
`String`	`useWordFrequenciesTipText()` Returns the tip text for this property

Methods inherited from class weka.classifiers.AbstractClassifier
classifyInstance, debugTipText, forName, getDebug, makeCopies, makeCopy, runClassifier, setDebug

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - NaiveBayesMultinomialText
```
public NaiveBayesMultinomialText()
```
- Method Detail
  - globalInfo
```
public String globalInfo()
```
    Returns a string describing the classifier.
    
    Returns:
    a description suitable for displaying in the explorer/experimenter gui
  - getCapabilities
```
public Capabilities getCapabilities()
```
    Returns default capabilities of the classifier.
    
    Specified by:
    
    getCapabilities in interface Classifier
    
    Specified by:
    
    getCapabilities in interface CapabilitiesHandler
    
    Overrides:
    
    getCapabilities in class AbstractClassifier
    
    Returns:
    the capabilities of this classifier
    See Also:
    Capabilities
  - buildClassifier
```
public void buildClassifier(Instances data)
                     throws Exception
```
    Generates the classifier.
    
    Specified by:
    
    buildClassifier in interface Classifier
    
    Parameters:
    data - set of instances serving as training data
    
    Throws:
    
    Exception - if the classifier has not been generated successfully
  - updateClassifier
```
public void updateClassifier(Instance instance)
                      throws Exception
```
    Updates the classifier with the given instance.
    
    Specified by:
    
    updateClassifier in interface UpdateableClassifier
    
    Parameters:
    instance - the new training instance to include in the model
    
    Throws:
    
    Exception - if the instance could not be incorporated in the model.
  - distributionForInstance
```
public double[] distributionForInstance(Instance instance)
                                 throws Exception
```
    Calculates the class membership probabilities for the given test instance.
    
    Specified by:
    
    distributionForInstance in interface Classifier
    
    Overrides:
    
    distributionForInstance in class AbstractClassifier
    
    Parameters:
    instance - the instance to be classified
    
    Returns:
    predicted class probability distribution
    
    Throws:
    
    Exception - if there is a problem generating the prediction
  - reset
```
public void reset()
```
    Reset the classifier.
  - setStemmer
```
public void setStemmer(Stemmer value)
```
    Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
    
    Parameters:
    value - the configured stemming algorithm, or null
    See Also:
    NullStemmer
  - getStemmer
```
public Stemmer getStemmer()
```
    Returns the current stemming algorithm, null if none is used.
    
    Returns:
    the current stemming algorithm, null if none set
  - stemmerTipText
```
public String stemmerTipText()
```
    Returns the tip text for this property.
    
    Returns:
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - setTokenizer
```
public void setTokenizer(Tokenizer value)
```
    Sets the tokenizer algorithm to use.
    
    Parameters:
    value - the configured tokenizing algorithm
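
    The interaction between the tokenizer and the `-lowercase` and `-stoplist` options can be pictured with the following sketch. It is an illustration only, not Weka code: the class `TokenPipelineSketch`, the delimiter set, and the order of the steps (lowercase first, then stoplist check) are assumptions, and stemming is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical illustration class; not part of Weka.
class TokenPipelineSketch {
    // Assumed delimiter set for splitting text into word tokens.
    private static final String DELIMITERS = " \r\n\t.,;:'\"()?!";

    /** Splits text into tokens, optionally lowercasing and dropping stopwords. */
    public static List<String> tokens(String text, boolean lowercase, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, DELIMITERS);
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            if (lowercase) {
                tok = tok.toLowerCase();
            }
            if (stopwords.contains(tok)) {
                continue; // token is on the stoplist, so it never reaches the dictionary
            }
            out.add(tok);
        }
        return out;
    }
}
```

    Note that with lowercasing switched off, a stoplist entry such as "the" would not match the token "The" in this sketch.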
  - getTokenizer
```
public Tokenizer getTokenizer()
```
    Returns the current tokenizer algorithm.
    
    Returns:
    the current tokenizer algorithm
  - tokenizerTipText
```
public String tokenizerTipText()
```
    Returns the tip text for this property.
    
    Returns:
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - useWordFrequenciesTipText
```
public String useWordFrequenciesTipText()
```
    Returns the tip text for this property
    
    Returns:
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - setUseWordFrequencies
```
public void setUseWordFrequencies(boolean u)
```
    Set whether to use word frequencies rather than binary bag of words representation.
    
    Parameters:
    u - true if word frequencies are to be used.
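
    The difference between the two representations is easiest to see on a single document. The sketch below is a hypothetical helper (`BagOfWordsSketch` is not a Weka class) that assumes whitespace tokenization: with word frequencies each token maps to its count, while in the binary bag of words every token that occurs maps to 1.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Weka.
class BagOfWordsSketch {
    /** Builds a word -> value map for one document. */
    public static Map<String, Double> bag(String text, boolean useWordFrequencies) {
        Map<String, Double> bag = new LinkedHashMap<>();
        for (String tok : text.trim().split("\\s+")) {
            if (tok.isEmpty()) continue;
            if (useWordFrequencies) {
                bag.merge(tok, 1.0, Double::sum); // count every occurrence
            } else {
                bag.put(tok, 1.0);                // presence only
            }
        }
        return bag;
    }
}
```

    For the text "to be or not to be", the frequency variant maps "to" to 2.0 and the binary variant maps it to 1.0.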
  - getUseWordFrequencies
```
public boolean getUseWordFrequencies()
```
    Get whether to use word frequencies rather than binary bag of words representation.
    
    Returns:
    true if word frequencies are to be used.
  - lowercaseTokensTipText
```
public String lowercaseTokensTipText()
```
    Returns the tip text for this property
    
    Returns:
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - setLowercaseTokens
```
public void setLowercaseTokens(boolean l)
```
    Set whether to convert all tokens to lowercase
    
    Parameters:
    l - true if all tokens are to be converted to lowercase
  - getLowercaseTokens
```
public boolean getLowercaseTokens()
```
    Get whether to convert all tokens to lowercase
    
    Returns:
    true if all tokens are to be converted to lowercase
  - periodicPruningTipText
```
public String periodicPruningTipText()
```
    Returns the tip text for this property
    
    Returns:
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - setPeriodicPruning
```
public void setPeriodicPruning(int p)
```
    Set how often to prune the dictionary
    
    Parameters:
    p - how often to prune
  - getPeriodicPruning
```
public int getPeriodicPruning()
```
    Get how often to prune the dictionary
    
    Returns:
    how often to prune the dictionary
  - minWordFrequencyTipText
```
public String minWordFrequencyTipText()
```
    Returns the tip text for this property
    
    Returns:
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - setMinWordFrequency
```
public void setMinWordFrequency(double minFreq)
```
    Set the minimum word frequency. Words that don't occur at least min freq times are ignored when updating weights. If periodic pruning is turned on, then min frequency is used when removing words from the dictionary.
    
    Parameters:
    minFreq - the minimum word frequency to use
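
    How periodic pruning and the minimum word frequency work together can be sketched as follows. This is an assumption-laden illustration rather than Weka's code: `PruningDictionary` is a hypothetical class, tokens are split on whitespace, and a single global count per word is kept instead of per-class counts.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Weka.
class PruningDictionary {
    private final Map<String, Double> counts = new HashMap<>();
    private final int pruneEvery;  // prune after this many documents; 0 means never
    private final double minFreq;  // words below this count are dropped when pruning
    private int seen = 0;

    public PruningDictionary(int pruneEvery, double minFreq) {
        this.pruneEvery = pruneEvery;
        this.minFreq = minFreq;
    }

    /** Adds one document and prunes the dictionary if a pruning point is reached. */
    public void addDocument(String text) {
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty()) {
                counts.merge(tok, 1.0, Double::sum);
            }
        }
        seen++;
        if (pruneEvery > 0 && seen % pruneEvery == 0) {
            counts.values().removeIf(v -> v < minFreq);
        }
    }

    public boolean contains(String word) {
        return counts.containsKey(word);
    }
}
```

    Pruning trades a little accuracy on rare words for a bounded dictionary, which matters when the classifier is updated incrementally on a long stream of documents.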
  - getMinWordFrequency
```
public double getMinWordFrequency()
```
    Get the minimum word frequency. Words that don't occur at least min freq times are ignored when updating weights. If periodic pruning is turned on, then min frequency is used when removing words from the dictionary.
    
    Returns:
    the minimum word frequency to use
  - normalizeDocLengthTipText
```
public String normalizeDocLengthTipText()
```
    Returns the tip text for this property
    
    Returns:
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - setNormalizeDocLength
```
public void setNormalizeDocLength(boolean norm)
```
    Set whether to normalize the length of each document
    
    Parameters:
    norm - true if document lengths are to be normalized
  - getNormalizeDocLength
```
public boolean getNormalizeDocLength()
```
    Get whether to normalize the length of each document
    
    Returns:
    true if document lengths are to be normalized
  - normTipText
```
public String normTipText()
```
    Returns the tip text for this property
    
    Returns:
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - getNorm
```
public double getNorm()
```
    Get the instance's Norm.
    
    Returns:
    the Norm
  - setNorm
```
public void setNorm(double newNorm)
```
    Set the norm of the instances
    
    Parameters:
    newNorm - the norm to which the instances must be set
  - LNormTipText
```
public String LNormTipText()
```
    Returns the tip text for this property
    
    Returns:
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - getLNorm
```
public double getLNorm()
```
    Get the L Norm used.
    
    Returns:
    the L-norm used
  - setLNorm
```
public void setLNorm(double newLNorm)
```
    Set the L-norm to use
    
    Parameters:
    newLNorm - the L-norm
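
    Document length normalization with a target norm and an L-norm can be written out as a short computation: take the L-p length of the word-count vector, then rescale so that the length equals the requested norm. The helper below is a sketch under that reading of the options; `DocLengthNormSketch` is a hypothetical class, not Weka's implementation.

```java
// Hypothetical illustration class; not part of Weka.
class DocLengthNormSketch {
    /**
     * Rescales counts so that their L-p length (p = lnorm) equals norm.
     * Returns a vector of zeros if the input has zero length.
     */
    public static double[] normalize(double[] counts, double norm, double lnorm) {
        double sum = 0;
        for (double c : counts) {
            sum += Math.pow(Math.abs(c), lnorm);
        }
        double length = Math.pow(sum, 1.0 / lnorm);
        double[] out = new double[counts.length];
        if (length == 0) {
            return out; // nothing to scale
        }
        for (int i = 0; i < counts.length; i++) {
            out[i] = counts[i] / length * norm;
        }
        return out;
    }
}
```

    With the defaults named in the options (norm 1.0, L-norm 2.0), the counts {3, 4} have Euclidean length 5 and are rescaled to approximately {0.6, 0.8}.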
  - useStopListTipText
```
public String useStopListTipText()
```
    Returns the tip text for this property
    
    Returns:
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - setUseStopList
```
public void setUseStopList(boolean u)
```
    Set whether to ignore all words that are on the stoplist.
    
    Parameters:
    u - true to ignore all words on the stoplist.
  - getUseStopList
```
public boolean getUseStopList()
```
    Get whether to ignore all words that are on the stoplist.
    
    Returns:
    true to ignore all words on the stoplist.
  - setStopwords
```
public void setStopwords(File value)
```
    Sets the file containing the stopwords; null or a directory unsets the stopwords. If the file exists, it automatically turns on the flag to use the stoplist.
    
    Parameters:
    value - the file containing the stopwords
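
    The stopword file format described for the `-stopwords` option (one stopword per line, lines starting with '#' treated as comments) is simple enough to parse by hand. The sketch below shows one way to do so on the file contents held in a string; `StopwordFileSketch` is a hypothetical helper, and trimming whitespace and skipping blank lines are assumptions not stated on this page.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of Weka.
class StopwordFileSketch {
    /** Parses stopword file contents: one word per line, '#' lines are comments. */
    public static Set<String> parse(String fileContents) {
        Set<String> words = new HashSet<>();
        for (String line : fileContents.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            words.add(trimmed);
        }
        return words;
    }
}
```

    The contents of a real file could be obtained with `new String(java.nio.file.Files.readAllBytes(path))` before calling `parse`.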
  - getStopwords
```
public File getStopwords()
```
    Returns the file used for obtaining the stopwords; if the file represents a directory, the default ones are used.
    
    Returns:
    the file containing the stopwords
  - stopwordsTipText
```
public String stopwordsTipText()
```
    Returns the tip text for this property.
    
    Returns:
    tip text for this property suitable for displaying in the explorer/experimenter gui
  - listOptions
```
public Enumeration<Option> listOptions()
```
    Returns an enumeration describing the available options.
    
    Specified by:
    
    listOptions in interface OptionHandler
    
    Overrides:
    
    listOptions in class AbstractClassifier
    
    Returns:
    an enumeration of all the available options.
  - setOptions
```
public void setOptions(String[] options)
                throws Exception
```
    Parses a given list of options.
    Valid options are:
```
 -W
  Use word frequencies instead of binary bag of words.
```
```
 -P <# instances>
  How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)
```
```
 -M <double>
  Minimum word frequency. Words with a frequency below this value are ignored.
  If periodic pruning is turned on then this is also used to determine which
  words to remove from the dictionary (default = 3).
```
```
 -normalize
  Normalize document length (use in conjunction with -norm and -lnorm)
```
```
 -norm <num>
  Specify the norm that each instance must have (default 1.0)
```
```
 -lnorm <num>
  Specify L-norm to use (default 2.0)
```
```
 -lowercase
  Convert all tokens to lowercase before adding to the dictionary.
```
```
 -stoplist
  Ignore words that are in the stoplist.
```
```
 -stopwords <file>
  A file containing stopwords to override the default ones.
  Using this option automatically sets the flag ('-stoplist') to use the
  stoplist if the file exists.
  Format: one stopword per line, lines starting with '#'
  are interpreted as comments and ignored.
```
```
 -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)
```
```
 -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.
```
    Specified by:
    
    setOptions in interface OptionHandler
    
    Overrides:
    
    setOptions in class AbstractClassifier
    
    Parameters:
    options - the list of options as an array of strings
    
    Throws:
    
    Exception - if an option is not supported
  - getOptions
```
public String[] getOptions()
```
    Gets the current settings of the classifier.
    
    Specified by:
    
    getOptions in interface OptionHandler
    
    Overrides:
    
    getOptions in class AbstractClassifier
    
    Returns:
    an array of strings suitable for passing to setOptions
  - toString
```
public String toString()
```
    Returns a textual description of this classifier.
    
    Overrides:
    
    toString in class Object
    
    Returns:
    a textual description of this classifier.
  - getRevision
```
public String getRevision()
```
    Returns the revision string.
    
    Specified by:
    
    getRevision in interface RevisionHandler
    
    Overrides:
    
    getRevision in class AbstractClassifier
    
    Returns:
    the revision
  - main
```
public static void main(String[] args)
```
    Main method for testing this class.
    
    Parameters:
    args - the options
