Introduction
A standard supervised learning problem.
- Soft classification: output a score or probability for each class
- Hard classification: output a definite class label
Process
- A given set of classes C
- Given text data x, determine its class y in C
- All the classes are predefined
Variants of Problem Formulation
- Binary Categorization: only two categories
- K-category Categorization: more than two categories
K-category categorization can be done via multiple binary categorizations (e.g., one-vs-rest); see the sketch below.
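As a rough illustration of reducing a K-category problem to binary ones, here is a minimal one-vs-rest sketch, assuming scikit-learn is available; the toy documents and labels are made up for illustration:

```python
# Minimal sketch (assumption: scikit-learn installed): K-category text categorization
# via one-vs-rest, i.e., one binary classifier per category.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

docs = ["cheap pills online", "meeting at noon", "win a free prize", "team wins the game"]
labels = ["spam", "work", "spam", "sports"]   # K = 3 categories

X = TfidfVectorizer().fit_transform(docs)
clf = OneVsRestClassifier(LinearSVC())        # trains one binary LinearSVC per category
clf.fit(X, labels)
print(clf.predict(X[:1]))                     # expected: ['spam']
```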
Basic Methodology
- Manual approach: classification using hand-written rules
- Popular automatic techniques: retrieval-based (Rocchio, kNN), generative (Naive Bayes), discriminative (SVM, logistic regression)
- generative: model the joint probability p(x, y)
- discriminative: model the conditional probability p(y|x)
- Automatic Methods
  - Retrieval-based: treat the category as the information need (query), and the training documents as the document collection
    - Prototype-based (Rocchio)
    - K-nearest neighbor (kNN)
  - Generative classifier
    - Naive Bayes classifier
  - Discriminative classifier
Retrieval-based Categorization
Treat each category as a query, and treat the training documents in each category as the relevant documents for that query. Use feedback approaches to learn a good query for each category, then compute the vector-space distance between the learned query and each incoming (test) document.
Prototype-based Classifier
- For every class, create a prototype vector (a query), constructed with a feedback method (e.g., Rocchio); see the sketch below.
- Classify new incoming data by its distance to each class prototype.
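A minimal sketch of the prototype (Rocchio-style) idea, assuming TF-IDF vectors and cosine similarity; the library choice, toy data, and function names are illustrative assumptions, not from the notes:

```python
# Hypothetical Rocchio-style prototype classifier: one centroid ("query") per class,
# classify a new document by its cosine similarity to each centroid.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["cheap pills buy now", "free prize click here",
              "meeting agenda attached", "project status report"]
train_labels = ["spam", "spam", "work", "work"]

vec = TfidfVectorizer()
X = vec.fit_transform(train_docs).toarray()

# Prototype = mean (centroid) of the training vectors in each class
prototypes = {c: X[[i for i, y in enumerate(train_labels) if y == c]].mean(axis=0)
              for c in set(train_labels)}

def classify(doc):
    v = vec.transform([doc]).toarray()[0]
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    # Assign the class whose prototype is closest (highest cosine similarity)
    return max(prototypes, key=lambda c: cos(v, prototypes[c]))

print(classify("free pills, claim your prize"))  # expected: "spam"
```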
Pros
Simple to use and efficient.
Cons
- Distance to different prototypes may not be measured on the same scale if we treat a prototype as a query.
- Little guidance on improving the classifier
kNN classifier
-> Treat each incoming unlabeled document as a query, and all the training documents as the collection; rank the training documents by their similarity to the test document.
-> Find the k examples that are most similar to the new document.
-> Assign the category that is most common among these k neighbors (the neighbors vote for the category; use the majority label).
Improvement
Give nearer neighbors more weight (e.g., weight each vote by similarity); see the sketch below.
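A minimal sketch of similarity-weighted kNN voting, assuming precomputed document vectors and cosine similarity; all names and data here are illustrative assumptions:

```python
# Hypothetical weighted-kNN classifier: neighbors vote for the category,
# and each vote is weighted by the neighbor's similarity to the test document.
import numpy as np
from collections import defaultdict

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def knn_classify(test_vec, train_vecs, train_labels, k=3):
    # Rank training documents by similarity to the test document
    sims = [(cosine(test_vec, v), y) for v, y in zip(train_vecs, train_labels)]
    sims.sort(key=lambda p: p[0], reverse=True)
    # Each of the k nearest neighbors votes, weighted by its similarity
    votes = defaultdict(float)
    for sim, label in sims[:k]:
        votes[label] += sim
    return max(votes, key=votes.get)

# Toy usage with made-up 3-dimensional "document vectors"
train_vecs = [np.array([1.0, 0.0, 0.0]), np.array([0.9, 0.1, 0.0]),
              np.array([0.0, 1.0, 0.2]), np.array([0.1, 0.9, 0.0])]
train_labels = ["spam", "spam", "work", "work"]
print(knn_classify(np.array([0.0, 0.8, 0.1]), train_vecs, train_labels, k=3))  # expected: "work"
```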
Pros
- No training needed
- Can be applied to any distance measure and document representation
- Empirically effective
Cons
- Finding nearest neighbors has high time complexity
- Imprecise when training examples are sparse, which is often the case in high-dimensional spaces
Generative Classifier: Naive Bayes
P(y|x) = P(x|y) * P(y) / P(x)
In text classification, x is the list of features of a document, F1, ..., Fk, where each feature is a word; y is the event that document d belongs to category C.
For text classification: P(C|D) = P(D|C) * P(C) / P(D), where C is the category and D is the document. Since we only want to pick the most likely category C, we can drop P(D).
To calculate P(D|C):
Document Model
Multi-Bernoulli Model
Each coin represents a word; given the class, flipping the coin decides whether that word appears in the document.
Multinomial Model
Each face of the die represents a word in the vocabulary. For every token position in the document, roll the die to decide which word is generated (see the sketch below).
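A minimal sketch of a multinomial Naive Bayes text classifier with Laplace (add-one) smoothing; the toy corpus and function names are illustrative assumptions, not from the notes:

```python
# Hypothetical multinomial Naive Bayes: score(C, D) = log P(C) + sum of log P(w|C) over tokens w in D,
# with add-one smoothing for unseen words. Computed in log space to avoid underflow.
import math
from collections import Counter, defaultdict

train = [("cheap pills buy now", "spam"), ("free prize click", "spam"),
         ("meeting agenda attached", "work"), ("project status report", "work")]

class_docs = defaultdict(int)          # number of documents per class -> P(C)
word_counts = defaultdict(Counter)     # word counts per class -> P(w|C)
vocab = set()
for text, c in train:
    class_docs[c] += 1
    for w in text.split():
        word_counts[c][w] += 1
        vocab.add(w)

def classify(text):
    scores = {}
    for c in class_docs:
        log_prob = math.log(class_docs[c] / len(train))       # log P(C)
        total = sum(word_counts[c].values())
        for w in text.split():
            # Laplace smoothing: every word gets a pseudo-count of 1
            p_w = (word_counts[c][w] + 1) / (total + len(vocab))
            log_prob += math.log(p_w)                          # log P(w|C)
        scores[c] = log_prob
    return max(scores, key=scores.get)

print(classify("free pills now"))  # expected: "spam"
```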
Discriminative Classifier
These methods all try to predict the class label directly from the document representation, typically a feature vector.
The goal is to find a function that fits the historical (training) data and can then be applied to future data. This function maps each feature vector to a class, and it can be found by minimizing errors on the training data.
Decision Tree
Neural Networks
Learn from mistakes: the weights are updated whenever the classifier misclassifies a training example.
Perceptron algorithm: the simplest neural network (see the sketch below).
Multilayer Perceptron: the simplest deep neural network.
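A minimal sketch of the perceptron's mistake-driven update rule for binary classification; the toy data and learning-rate choice are illustrative assumptions:

```python
# Hypothetical perceptron: start with zero weights and, for every misclassified
# example, nudge the weights toward (or away from) that example ("learn from mistakes").
import numpy as np

def train_perceptron(X, y, epochs=10, lr=1.0):
    # X: (n_examples, n_features) feature vectors; y: labels in {-1, +1}
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:     # mistake: prediction disagrees with label
                w += lr * yi * xi          # update only on mistakes
                b += lr * yi
    return w, b

# Toy usage with made-up 2-D points
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))  # expected: [ 1.  1. -1. -1.]
```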
Cons
Hard to interpret the trained classifier.
Support Vector Machines
- Class = f(document-vector)
- The constraint is to maximize the margin, i.e., the distance between the decision boundary and the support vectors.
See notes of SI670 Applied Machine Learning
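As a minimal end-to-end illustration, assuming scikit-learn is available (the toy data is made up), a linear SVM text classifier might look like:

```python
# Hypothetical linear SVM for text categorization: map each document to a TF-IDF
# vector and learn a max-margin boundary between the classes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = ["cheap pills buy now", "free prize click here",
        "meeting agenda attached", "project status report"]
labels = ["spam", "spam", "work", "work"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())  # class = f(document-vector)
model.fit(docs, labels)
print(model.predict(["free pills, claim your prize"]))  # expected: ['spam']
```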