Natural Language Processing Overview
Let's sketch the big picture of Natural Language Processing.
Background
Hypothesis
The distributional hypothesis
You shall know a word by the company it keeps
Firth, 1957 (Studies in Linguistic Analysis)
Meanings of words are (largely) determined by their distributional patterns (Distributional Hypothesis)
Harris, 1968 (Mathematical Structures of Language)
Words that occur in similar contexts will have similar meanings (Strong Contextual Hypothesis)
Miller and Charles, 1991 (Language and Cognitive Processes)
Various extensions…
- Similar contexts will have similar meanings
- Names that occur in similar contexts will refer to the same underlying person
Ref
- Ted Pedersen, Language Independent Methods of Clustering Similar Contexts
- aclwiki/Distributional_Hypothesis
- M. Sahlgren, The Distributional Hypothesis, Italian Journal of Linguistics, 20:33-53, 2008
- ratsgo’s blog for textmining/idea of statistical semantics
Contexts
The string unit that defines a context:
windows (of size n), sentences, paragraphs, documents, etc.
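A minimal sketch of window-based contexts (pure Python; the helper name `context_windows` and the toy sentence are mine, not from the outline):

```python
def context_windows(tokens, n=2):
    """Yield (target, context) pairs using a symmetric window of size n."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - n):i]
        right = tokens[i + 1:i + 1 + n]
        yield target, left + right

tokens = "you shall know a word by the company it keeps".split()
for target, context in context_windows(tokens):
    print(target, context)
```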
Lexical Features
N-gram
Dictionary-based Tokenization
Unsupervised Segmentation
Co-occurrences
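As a sketch, n-grams are just overlapping slices over the token sequence (toy example, assuming simple whitespace tokenization):

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing is fun".split()
print(ngrams(tokens, 2))  # bigrams: ('natural', 'language'), ...
print(ngrams(tokens, 3))  # trigrams
```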
Models
Bag of words
Word Weighting
Vector Space Model
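A minimal bag-of-words and word-weighting sketch; scikit-learn and the toy documents are assumptions here, the outline names no library:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Bag of words: raw term counts, one row vector per document
bow = CountVectorizer().fit_transform(docs)

# Word weighting: TF-IDF down-weights terms shared by every document
tfidf = TfidfVectorizer().fit_transform(docs)
print(tfidf.shape)  # (number of documents, vocabulary size)
```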
Context representations
first order vector
Term-Document Matrix
second order vector
Term-Co-occurrence Matrix
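A sketch of both representations on a toy corpus (pure numpy; counting co-occurrence at the document level is just one of the context choices above):

```python
import numpy as np

docs = [["i", "like", "nlp"], ["i", "like", "coffee"], ["nlp", "is", "fun"]]
vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}

# First order: term-document matrix (terms x documents)
td = np.zeros((len(vocab), len(docs)), dtype=int)
for j, doc in enumerate(docs):
    for w in doc:
        td[idx[w], j] += 1

# Second order: term-co-occurrence matrix (terms x terms),
# counting word pairs that appear in the same document
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for doc in docs:
    for w1 in doc:
        for w2 in doc:
            if w1 != w2:
                cooc[idx[w1], idx[w2]] += 1
```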
Dimensionality Reduction
SVD (Singular Value Decomposition) and LSA (Latent Semantic Analysis); see the sketch after this list
MDS (Multi-Dimensional Scaling)
PCA (Principal Component Analysis), unsupervised learning
ICA (Independent Component Analysis)
LDA (Linear Discriminant Analysis, Fisher's LDA)
LDA (Latent Dirichlet Allocation)
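A minimal LSA-style sketch: truncate the SVD of a term-document matrix to its k largest singular values (toy data; numpy only):

```python
import numpy as np

# Toy term-document matrix (terms x documents)
X = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 2, 1],
              [1, 0, 2]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation of X
docs_k = np.diag(s[:k]) @ Vt[:k, :]          # documents in the k-dim latent space
```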
Similarity
Measuring Similarity
- Integer Values
- Matching Coefficient
- Jaccard Coefficient
- Dice Coefficient
- Real Values
- Cosine
- Measuring document similarity
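Minimal implementations of the measures above (set-based coefficients on token sets, cosine on real-valued vectors; toy data):

```python
import math

def matching(a, b):   # |A ∩ B|
    return len(a & b)

def jaccard(a, b):    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

def dice(a, b):       # 2|A ∩ B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))

def cosine(u, v):     # u·v / (|u| |v|)
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

a = set("natural language processing".split())
b = set("language processing is fun".split())
print(matching(a, b), jaccard(a, b), dice(a, b))
print(cosine([1, 2, 0], [2, 1, 1]))
```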
Distance
Generative model
Semantics
Word Embedding
Sequence-to-Sequence
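For word embeddings, a minimal skip-gram sketch; gensim 4.x and the toy sentences are assumptions, not part of the outline:

```python
from gensim.models import Word2Vec  # assumes gensim 4.x

sentences = [["i", "like", "nlp"],
             ["i", "like", "deep", "learning"],
             ["nlp", "is", "fun"]]

# sg=1 selects skip-gram; a real corpus would need far more data
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv.most_similar("nlp", topn=3))
```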
Applications
Collocations
Topic Modeling
Comparing Corpora
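For topic modeling, a minimal LDA sketch; gensim and the toy corpus are assumptions:

```python
from gensim import corpora
from gensim.models import LdaModel  # assumes gensim is installed

texts = [["cat", "dog", "pet"],
         ["python", "code", "nlp"],
         ["dog", "pet", "vet"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```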
Measures
Measures of Association
- Log-likelihood Ratio (ll)
- True Mutual Information (tmi)
- Pearson’s Chi-squared Test (x2)
- Pointwise Mutual Information (pmi); see the sketch after this list
- Phi coefficient (phi)
- T-test (tscore)
- Fisher’s Exact Test (leftFisher, rightFisher)
- Dice Coefficient (dice)
- Odds Ratio (odds)
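A sketch of pmi from corpus counts; the counts below are hypothetical, made up for illustration:

```python
import math

N = 10000            # total observation windows (hypothetical)
c_xy = 30            # windows containing both x and y
c_x, c_y = 150, 400  # windows containing x, windows containing y

p_xy = c_xy / N
p_x, p_y = c_x / N, c_y / N

pmi = math.log2(p_xy / (p_x * p_y))
print(f"PMI(x, y) = {pmi:.3f}")  # positive: co-occur more often than chance
```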
Probability
T-score
Z-score
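In collocation work these are commonly computed from an observed co-occurrence count O and its expected count E under independence (a standard formulation, assumed here since the outline gives no formulas):

\[t = \frac{O - E}{\sqrt{O}}, \qquad z = \frac{O - E}{\sqrt{E}}\]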
Chi-Square Statistic (χ²)
The distance between the observed and expected values:
\[\chi^2=\sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}\]
Log-likelihood Ratio (G²)
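The log-likelihood ratio compares the same observed and expected counts on a log scale:

\[G^2 = 2\sum_{k=1}^{n} O_k \ln\frac{O_k}{E_k}\]

A worked sketch of both statistics on a hypothetical 2x2 contingency table of word co-occurrence counts:

```python
import math

# Hypothetical 2x2 contingency table for words x and y:
#            y     not-y
# x         30       120
# not-x    370      9480
obs = [[30, 120], [370, 9480]]
N = sum(sum(r) for r in obs)
row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]

chi2 = g2 = 0.0
for i in range(2):
    for j in range(2):
        e = row[i] * col[j] / N   # expected count under independence
        o = obs[i][j]
        chi2 += (o - e) ** 2 / e
        g2 += 2 * o * math.log(o / e)

print(f"chi2 = {chi2:.2f}, G2 = {g2:.2f}")
```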
Information Theory
- SanghyukChun's Blog: Information Theory (Entropy, KL divergence, Mutual information)
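A minimal sketch of the quantities named above (pure Python; the distributions are toy examples):

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(entropy(p))           # 1.5 bits
print(kl_divergence(p, q))  # >= 0, zero only when p == q
```

Mutual information then follows as the KL divergence between the joint distribution p(x, y) and the product of its marginals p(x)p(y).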