Text Analytics Algorithms
Text data is fed to algorithms in vectorized form to build an NLP model. NLP models can be broadly classified into supervised and unsupervised learning models. QuickML provides algorithms that use labelled data to build supervised learning models.
The algorithms include:
- Naive Bayes
- Support vector machine (SVM)
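To make the idea of a "vectorized form" concrete, here is a minimal Python sketch of a bag-of-words count vectorizer. It is an illustration only: the source does not state which vectorization QuickML applies, and the function names and sample texts are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct lowercase token to a column index (sorted for stability)."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(document, vocabulary):
    """Return a count vector with one entry per vocabulary token."""
    counts = Counter(document.lower().split())
    # Dicts keep insertion order, so columns follow the vocabulary indices.
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical labelled training texts (assumption, not from the source).
docs = ["good service good food", "bad service"]
labels = ["positive", "negative"]

vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]
```

Each text becomes a fixed-length list of token counts, and the labels stay alongside the vectors so that a supervised algorithm can learn from the pairs.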
Naive Bayes
Naive Bayes is a classification algorithm based on Bayes' theorem, with the naive assumption that every pair of features is conditionally independent given the class. Bayes' theorem gives the probability P(c|x), where c is one of the possible target classes and x is the instance to be classified, described by its features.
P(c|x) = P(x|c) * P(c) / P(x)
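As a worked illustration of the formula, the following Python sketch computes P(c|x) for two classes. Under the naive independence assumption, P(x|c) is taken as the product of the per-feature likelihoods. The class names, features, and probabilities are made-up numbers for illustration, and the function name is hypothetical.

```python
from math import prod


def posterior(priors, likelihoods, features):
    """Apply Bayes' theorem with the naive independence assumption.

    priors: {class: P(c)}
    likelihoods: {class: {feature: P(feature | c)}}
    features: the features present in the instance x
    """
    # Numerator P(x|c) * P(c), with P(x|c) as a product over features.
    joint = {
        c: priors[c] * prod(likelihoods[c][f] for f in features)
        for c in priors
    }
    evidence = sum(joint.values())  # P(x), the normalizing constant
    return {c: joint[c] / evidence for c in joint}


# Made-up numbers for illustration only.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.4},
}
result = posterior(priors, likelihoods, ["free", "meeting"])
```

The instance is assigned to the class with the largest posterior. Because P(x) is the same for every class, it only rescales the scores so that they sum to 1.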
Hyperparameters:
- priors: array-like of shape (n_classes,), default=None. Prior probabilities of the classes. If specified, the priors are not adjusted according to the data.
- var_smoothing: float, default=1e-9. Portion of the largest variance of all features that is added to variances for calculation stability.
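One way to read var_smoothing, assuming it behaves as described above: a fraction of the largest per-feature variance is added to every variance, so that a feature with zero variance does not cause a division by zero in the Gaussian density. The sketch below is a hedged illustration, not QuickML's implementation; the helper name and data are hypothetical, and the use of population variance is an assumption.

```python
from statistics import pvariance


def smoothing_term(rows, var_smoothing=1e-9):
    """Return var_smoothing times the largest per-feature variance."""
    columns = list(zip(*rows))  # transpose rows into feature columns
    return var_smoothing * max(pvariance(col) for col in columns)


# Hypothetical feature matrix: the second feature is constant (variance 0).
X = [[1.0, 5.0], [3.0, 5.0], [5.0, 5.0]]
epsilon = smoothing_term(X)

# Without smoothing the constant feature has variance 0; with it, the
# variance used in the Gaussian density becomes strictly positive.
smoothed_variances = [pvariance(col) + epsilon for col in zip(*X)]
```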
Attributes:
- class_count_: ndarray of shape (n_classes,). Number of training samples observed in each class.
- class_prior_: ndarray of shape (n_classes,). Probability of each class.
- classes_: ndarray of shape (n_classes,). Class labels known to the classifier.
- epsilon_: float. Absolute additive value to variances.
- n_features_in_: int. Number of features seen during fit.
- feature_names_in_: ndarray of shape (n_features_in_,). Names of features seen during fit. Defined only when X has feature names that are all strings.
- var_: ndarray of shape (n_classes, n_features). Variance of each feature per class.
- theta_: ndarray of shape (n_classes, n_features). Mean of each feature per class.
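The attributes above are summary statistics of the training data. As a rough sketch, the per-class counts, priors, means, and variances can be computed by hand as follows. This assumes population variance and leaves out the additive smoothing term; the function name and data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def fit_gaussian_stats(X, y):
    """Compute per-class counts, priors, feature means and feature variances."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    classes = sorted(rows_by_class)
    class_count = [len(rows_by_class[c]) for c in classes]
    class_prior = [n / len(y) for n in class_count]
    # One row per class, one column per feature.
    theta = [[fmean(col) for col in zip(*rows_by_class[c])] for c in classes]
    var = [[pvariance(col) for col in zip(*rows_by_class[c])] for c in classes]
    return classes, class_count, class_prior, theta, var


# Hypothetical data: 2 features, 2 classes.
X = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [14.0, 20.0], [12.0, 26.0]]
y = ["a", "a", "b", "b", "b"]
classes, class_count, class_prior, theta, var = fit_gaussian_stats(X, y)
```

Here theta and var each have shape (n_classes, n_features), matching the shapes listed for theta_ and var_.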
Support vector machine (SVM)
SVM is another popular machine learning algorithm for classification. It classifies data by finding the hyperplane (decision boundary) that separates the classes with the largest margin.
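For a linear decision boundary, classification comes down to checking which side of the hyperplane a point falls on. The sketch below uses a hand-picked hyperplane to show that step only; it does not train an SVM, and the weights, bias, and points are hypothetical.

```python
def decision_value(weights, bias, x):
    """Signed score w . x + b; its sign tells which side of the hyperplane x is on."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(weights, bias, x):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(weights, bias, x) >= 0 else -1


# Hypothetical, hand-picked hyperplane x1 + x2 - 3 = 0 (not a trained model).
w, b = [1.0, 1.0], -3.0
points = [[0.5, 1.0], [2.0, 2.5]]
predictions = [classify(w, b, point) for point in points]
```

Training an SVM amounts to choosing the weights and bias; the hyperparameters control how that choice is made.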
Hyperparameters:
- C: float, default=1.0. Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared L2 penalty.
- kernel: {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’} or callable, default=‘rbf’. Specifies the kernel type to be used in the algorithm. If none is given, ‘rbf’ is used. If a callable is given, it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).
- degree: int, default=3. Degree of the polynomial kernel function (‘poly’). Must be non-negative. Ignored by all other kernels.
- gamma: {‘scale’, ‘auto’} or float, default=‘scale’. Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. If ‘scale’ (default), gamma is 1 / (n_features * X.var()); if ‘auto’, gamma is 1 / n_features; if a float, it must be non-negative.
- coef0: float, default=0.0. Independent term in the kernel function. It is only significant in ‘poly’ and ‘sigmoid’.
- shrinking: bool, default=True. Whether to use the shrinking heuristic.
- probability: bool, default=False. Whether to enable probability estimates. This must be enabled before calling fit. It slows down fit because it internally uses 5-fold cross-validation, and predict_proba may be inconsistent with predict.
- Tolerance: float, default=1e-3. Tolerance for the stopping criterion.
- cache_size: float, default=200. Size of the kernel cache (in MB).
- class_weight: dict or ‘balanced’, default=None. Sets the parameter C of class i to class_weight[i] * C for SVC. If not given, all classes are assumed to have weight one. The ‘balanced’ mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data, as n_samples / (n_classes * np.bincount(y)).
- verbose: bool, default=False. Enables verbose output. Note that this setting takes advantage of a per-process runtime setting in libsvm that, if enabled, may not work properly in a multithreaded context.
- max_iter: int, default=-1. Hard limit on iterations within the solver, or -1 for no limit.
- decision_function_shape: {‘ovo’, ‘ovr’}, default=‘ovr’. Whether to return a one-vs-rest (‘ovr’) decision function of shape (n_samples, n_classes), as all other classifiers do, or the original one-vs-one (‘ovo’) decision function of libsvm, which has shape (n_samples, n_classes * (n_classes - 1) / 2). Note that internally, one-vs-one (‘ovo’) is always used as the multi-class strategy to train models; an ovr matrix is only constructed from the ovo matrix. The parameter is ignored for binary classification.
- break_ties: bool, default=False. If true, decision_function_shape=‘ovr’, and the number of classes > 2, predict breaks ties according to the confidence values of decision_function; otherwise the first class among the tied classes is returned. Note that breaking ties comes at a relatively high computational cost compared to a simple predict.
- random_state: int, RandomState instance or None, default=None. Controls the pseudo-random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls.
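Several entries in the list above quote small formulas: the gamma values for ‘scale’ and ‘auto’, the ‘balanced’ class weights, and the number of decision function columns for ‘ovr’ and ‘ovo’. The Python sketch below evaluates them on toy inputs. It assumes that X.var() means the population variance over all entries of X, as in NumPy; the function names and data are hypothetical.

```python
from collections import Counter
from statistics import pvariance


def gamma_value(X, gamma="scale"):
    """Resolve gamma for 'scale', 'auto' or an explicit float."""
    n_features = len(X[0])
    if gamma == "scale":
        # Variance over every entry of X, flattened (assumed meaning of X.var()).
        flat = [value for row in X for value in row]
        return 1.0 / (n_features * pvariance(flat))
    if gamma == "auto":
        return 1.0 / n_features
    return float(gamma)


def balanced_class_weights(y):
    """n_samples / (n_classes * count of the class), for each class."""
    counts = Counter(y)
    return {c: len(y) / (len(counts) * n) for c, n in counts.items()}


def decision_columns(n_classes, shape="ovr"):
    """Number of columns in the decision function for each shape."""
    if shape == "ovr":
        return n_classes
    return n_classes * (n_classes - 1) // 2  # one column per pair of classes


# Hypothetical toy inputs; X_toy and y_toy are unrelated to each other.
X_toy = [[0.0, 2.0], [2.0, 4.0]]
y_toy = ["pos", "neg", "neg", "neg"]

gamma_scale = gamma_value(X_toy, "scale")
gamma_auto = gamma_value(X_toy, "auto")
weights = balanced_class_weights(y_toy)
ovo_columns = decision_columns(4, "ovo")
```

With ‘balanced’ weights, the rarer class receives the larger weight, which raises its effective C and makes errors on that class more costly.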
Last Updated 2024-12-27 14:14:58 +0530