Natural Language Processing Overview
Let's sketch the big picture of Natural Language Processing.
Background
Hypothesis
The distributional hypothesis
You shall know a word by the company it keeps
Firth, 1957 (Studies in Linguistic Analysis)
Meanings of words are (largely) determined by their distributional patterns (Distributional Hypothesis)
Harris, 1968 (Mathematical Structures of Language)
Words that occur in similar contexts will have similar meanings (Strong Contextual Hypothesis)
Miller and Charles, 1991 (Language and Cognitive Processes)
Various extensions…
- Similar contexts will have similar meanings
- Names that occur in similar contexts will refer to the same underlying person
Ref
- Ted Pedersen, Language Independent Methods of Clustering Similar Contexts
- aclwiki/Distributional_Hypothesis
- M. Sahlgren, The Distributional Hypothesis, Italian Journal of Linguistics, 20:33-53, 2008
- ratsgo’s blog for textmining/idea of statistical semantics
Contexts
The string unit that defines a context:
windows (of size n), sentences, paragraphs, documents, etc.
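A minimal sketch of window-based contexts (pure Python; the helper name `context_windows` and the toy sentence are mine, not from the outline):

```python
def context_windows(tokens, n=2):
    """Yield (target, context) pairs using a symmetric window of size n."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - n):i]
        right = tokens[i + 1:i + 1 + n]
        yield target, left + right

tokens = "you shall know a word by the company it keeps".split()
for target, context in context_windows(tokens):
    print(target, context)
```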
Lexical Features
N-gram
Dictionary-based Tokenization
Unsupervised Segmentation
Co-occurrences
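As a sketch, n-grams are just overlapping slices over the token sequence (toy example, assuming simple whitespace tokenization):

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing is fun".split()
print(ngrams(tokens, 2))  # bigrams: ('natural', 'language'), ...
print(ngrams(tokens, 3))  # trigrams
```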
Models
Bag of words
Word Weighting
Vector Space Model
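A minimal bag-of-words and word-weighting sketch; scikit-learn and the toy documents are assumptions here, the outline names no library:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Bag of words: raw term counts, one row vector per document
bow = CountVectorizer().fit_transform(docs)

# Word weighting: TF-IDF down-weights terms shared by every document
tfidf = TfidfVectorizer().fit_transform(docs)
print(tfidf.shape)  # (number of documents, vocabulary size)
```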
Context representations
first order vector
Term-Document Matrix
second order vector
Term-Co-occurrence Matrix
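A sketch of both representations on a toy corpus (pure numpy; counting co-occurrence at the document level is just one of the context choices above):

```python
import numpy as np

docs = [["i", "like", "nlp"], ["i", "like", "coffee"], ["nlp", "is", "fun"]]
vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}

# First order: term-document matrix (terms x documents)
td = np.zeros((len(vocab), len(docs)), dtype=int)
for j, doc in enumerate(docs):
    for w in doc:
        td[idx[w], j] += 1

# Second order: term-co-occurrence matrix (terms x terms),
# counting word pairs that appear in the same document
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for doc in docs:
    for w1 in doc:
        for w2 in doc:
            if w1 != w2:
                cooc[idx[w1], idx[w2]] += 1
```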
Dimensionality Reduction
SVD (Singular Value Decomposition) and LSA (Latent Semantic Analysis); see the sketch after this list
MDS (Multi-Dimensional Scaling)
PCA (Principal Component Analysis), unsupervised learning
ICA (Independent Component Analysis)
LDA (Linear Discriminant Analysis, Fisher's LDA)
LDA (Latent Dirichlet Allocation)
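A minimal LSA-style sketch: truncate the SVD of a term-document matrix to its k largest singular values (toy data; numpy only):

```python
import numpy as np

# Toy term-document matrix (terms x documents)
X = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 2, 1],
              [1, 0, 2]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation of X
docs_k = np.diag(s[:k]) @ Vt[:k, :]          # documents in the k-dim latent space
```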
Similarity
Measuring Similarity
- Integer Values
- Matching Coefficient
- Jaccard Coefficient
- Dice Coefficient
- Real Values
- Cosine
- Measuring document similarity
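Minimal implementations of the measures above (set-based coefficients on token sets, cosine on real-valued vectors; toy data):

```python
import math

def matching(a, b):   # |A ∩ B|
    return len(a & b)

def jaccard(a, b):    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

def dice(a, b):       # 2|A ∩ B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))

def cosine(u, v):     # u·v / (|u| |v|)
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

a = set("natural language processing".split())
b = set("language processing is fun".split())
print(matching(a, b), jaccard(a, b), dice(a, b))
print(cosine([1, 2, 0], [2, 1, 1]))
```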
Distance
Generative model
Semantics
Word Embedding
Sequence-to-Sequence
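For word embeddings, a minimal skip-gram sketch; gensim 4.x and the toy sentences are assumptions, not part of the outline:

```python
from gensim.models import Word2Vec  # assumes gensim 4.x

sentences = [["i", "like", "nlp"],
             ["i", "like", "deep", "learning"],
             ["nlp", "is", "fun"]]

# sg=1 selects skip-gram; a real corpus would need far more data
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv.most_similar("nlp", topn=3))
```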
Applications
Collocations
Topic Modeling
Comparing Corpora
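For topic modeling, a minimal LDA sketch; gensim and the toy corpus are assumptions:

```python
from gensim import corpora
from gensim.models import LdaModel  # assumes gensim is installed

texts = [["cat", "dog", "pet"],
         ["python", "code", "nlp"],
         ["dog", "pet", "vet"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```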
Measures
Measures of Association
- Log-likelihood Ratio (ll)
- True Mutual Information (tmi)
- Pearson’s Chi-squared Test (x2)
- Pointwise Mutual Information (pmi); see the sketch after this list
- Phi coefficient (phi)
- T-test (tscore)
- Fisher’s Exact Test (leftFisher, rightFisher)
- Dice Coefficient (dice)
- Odds Ratio (odds)
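A sketch of pmi from corpus counts; the counts below are hypothetical, made up for illustration:

```python
import math

N = 10000            # total observation windows (hypothetical)
c_xy = 30            # windows containing both x and y
c_x, c_y = 150, 400  # windows containing x, windows containing y

p_xy = c_xy / N
p_x, p_y = c_x / N, c_y / N

pmi = math.log2(p_xy / (p_x * p_y))
print(f"PMI(x, y) = {pmi:.3f}")  # positive: co-occur more often than chance
```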
Probability
T-score
Z-score
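In collocation work these are commonly computed from an observed co-occurrence count O and its expected count E under independence (a standard formulation, assumed here since the outline gives no formulas):

\[t = \frac{O - E}{\sqrt{O}}, \qquad z = \frac{O - E}{\sqrt{E}}\]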
Chi-Square Statistic (χ²)
The distance between the observed and expected values:
\[\chi^2=\sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}\]
Log-likelihood Ratio (G²)
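The log-likelihood ratio compares the same observed and expected counts on a log scale:

\[G^2 = 2\sum_{k=1}^{n} O_k \ln\frac{O_k}{E_k}\]

A worked sketch of both statistics on a hypothetical 2x2 contingency table of word co-occurrence counts:

```python
import math

# Hypothetical 2x2 contingency table for words x and y:
#            y     not-y
# x         30       120
# not-x    370      9480
obs = [[30, 120], [370, 9480]]
N = sum(sum(r) for r in obs)
row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]

chi2 = g2 = 0.0
for i in range(2):
    for j in range(2):
        e = row[i] * col[j] / N   # expected count under independence
        o = obs[i][j]
        chi2 += (o - e) ** 2 / e
        g2 += 2 * o * math.log(o / e)

print(f"chi2 = {chi2:.2f}, G2 = {g2:.2f}")
```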
Information Theory
- SanghyukChun's Blog: Information Theory (Entropy, KL divergence, Mutual information)
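A minimal sketch of the quantities named above (pure Python; the distributions are toy examples):

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(entropy(p))           # 1.5 bits
print(kl_divergence(p, q))  # >= 0, zero only when p == q
```

Mutual information then follows as the KL divergence between the joint distribution p(x, y) and the product of its marginals p(x)p(y).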