NLP cheatsheet


Term Meaning
Weights and Vectors  
TF-IDF Weight of a word in a doc: higher the more often the word appears in the doc and the rarer it is across the corpus
Term Frequency - Inverse Document Frequency
length(TF-IDF, doc) = number of distinct words in doc; the vector holds one number per word.
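A minimal sketch of the TF-IDF weighting described above, in plain Python. The function name `tf_idf`, the whitespace tokenizer, the three sample docs and the exact formula (relative term frequency times log of inverse document frequency) are assumptions made for illustration; libraries use several smoothed variants.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many docs does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Frequent in this doc (tf) and rare in the corpus (idf) => high weight.
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return vectors

docs = [
    "scala is a programming language",
    "football is a sport",
    "scala scala scala",
]
vectors = tf_idf(docs)
```

Each resulting vector has one entry per distinct word in its doc, and a word that appears in only one doc ("programming") outweighs a word shared with other docs ("is").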
Word Vectors Calculate word vectors:
for each word w1 => for each word w2 in a 5-word window around it, nudge the vectors
closer together: v[w1] moves closer to v[w2]
king - queen ~ man - woman // training finds such relations for you
You can even download ready-made word vectors
Google Word Vectors Ready-made word vectors trained by Google, available for download
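To make the "king - queen ~ man - woman" relation concrete, here is a small sketch using cosine similarity. The 2-dimensional vectors are hand-made toy values, not trained embeddings, and the helper names `cosine` and `subtract` are made up for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

# Hand-made toy vectors (NOT trained): axis 0 ~ "royalty", axis 1 ~ "gender".
v = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.7],
    "woman": [0.1, -0.7],
}

royal_gap = subtract(v["king"], v["queen"])  # differs only along the gender axis
plain_gap = subtract(v["man"], v["woman"])   # differs only along the gender axis
similarity = cosine(royal_gap, plain_gap)    # close to 1.0 when the gaps point the same way
```

With real downloaded vectors the similarity would be high but not perfect; the toy values are chosen so the two difference vectors line up exactly.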
Text Structure  
Part-Of-Speech Tagging Word roles: is it a verb, a noun, …? It’s not always obvious
Head of sentence head(sentence) = the most important word. It’s not necessarily the first
word; it’s the root of the sentence.
she hit the wall => hit
You build a graph (a dependency tree) for the sentence and the head is its root.
Named entities People, Companies, Locations, …: a quick way to know what the text is about.
Sentiment Analysis  
Sentiment Dictionary loved: +2.9, hate: -3.2, “I loved you but now I hate you” => 2.9 - 3.2 = -0.3
Sentiment Entities Which entity is the sentiment about: the movie or the cinema?
Sentiment Features Sentiment per product feature: Camera/Resolution, Camera/Convenience
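The dictionary approach from the Sentiment Dictionary row can be sketched as a lookup and a sum. The two-entry lexicon below is hypothetical: it reuses the two scores from the example, keyed by the word forms that appear in the sentence. `sentiment_score` is an illustrative name, not a library function.

```python
import re

# Hypothetical two-word lexicon; real sentiment lexicons hold thousands of entries.
SENTIMENT = {"loved": 2.9, "hate": -3.2}

def sentiment_score(text):
    """Sum the scores of all known words; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(SENTIMENT.get(word, 0.0) for word in words)

score = sentiment_score("I loved you but now I hate you")  # 2.9 - 3.2, slightly negative
```

A plain sum ignores negation and word order, which is one reason the entity-level and feature-level rows above exist.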
Text Classification Decisions, decisions: what’s the topic, is the writer happy, a native English speaker?
Mostly supervised training: we have labeled examples, then map new text to labels
Supervised Learning We have 3 sets: Train Set, Dev Set, Test Set.
Train Set Fit the model
Dev (= Validation) Set Tune hyperparameters and watch for overfitting
Test Set Final check of the model on data it has not seen
Text Features Convert the documents to be classified into features:
bag of words, word vectors; can use TF-IDF
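One way to produce the three sets is a shuffled split, sketched below. The 80/10/10 ratio, the fixed seed, the function name `split_dataset` and the dummy labeled examples are all assumptions for illustration.

```python
import random

def split_dataset(examples, train=0.8, dev=0.1, seed=0):
    """Shuffle once, then cut into train / dev / test slices."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_dev = round(len(shuffled) * dev)
    return (
        shuffled[:n_train],                 # fit the model
        shuffled[n_train:n_train + n_dev],  # tune parameters
        shuffled[n_train + n_dev:],         # final check
    )

# Dummy (text, label) pairs standing in for a labeled corpus.
examples = [(f"text {i}", i % 2) for i in range(100)]
train_set, dev_set, test_set = split_dataset(examples)
```

Every example lands in exactly one set, so the test set stays unseen while the model is trained and tuned.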
LDA Latent Dirichlet Allocation: LDA(Documents) => Topics
Technology Topic: Scala, Programming, Machine Learning
Sport Topic: Football, Basketball, Skateboards (3 most important words)
Pick the number of topics ahead of time, e.g. 5 topics
Doc = Distribution(topics): a probability for each topic
Topic = Distribution(words): the technology topic gives a higher probability to the word “cpu”
Unsupervised: it finds what topic patterns are there. Good for getting a sense of what the doc is about.
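The two distributions above combine as follows: the chance of seeing a word in a doc is the topic-weighted average of the per-topic word probabilities. The sketch below only shows that combination. It does not train LDA, and every number in it is made up for illustration.

```python
# Made-up distributions, as if they had already been learned by LDA.
doc_topics = {"technology": 0.7, "sport": 0.3}  # Doc = Distribution(topics)
topic_words = {                                  # Topic = Distribution(words)
    "technology": {"cpu": 0.5, "scala": 0.4, "football": 0.1},
    "sport":      {"cpu": 0.1, "scala": 0.1, "football": 0.8},
}

def word_probability(word, doc_topics, topic_words):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        p_topic * topic_words[topic].get(word, 0.0)
        for topic, p_topic in doc_topics.items()
    )

p_cpu = word_probability("cpu", doc_topics, topic_words)            # 0.7*0.5 + 0.3*0.1
p_football = word_probability("football", doc_topics, topic_words)  # 0.7*0.1 + 0.3*0.8
```

For this mostly-technology doc, "cpu" comes out more likely than "football", even though "football" dominates the sport topic.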
Machine Reading  
Entity Extraction EntityRecognition(text) => (EntityName -> EntityType)
(“paul newman is a great actor”) => [(PaulNewman -> Person)]
Entity Linking EntityLinking(Entity) => FixedMeaning
EntityLinking(“PaulNewman”) => “http://wikipedia../paul_newman_the_actor”
(and not the other Paul Newman; the choice is based on the text)
DBpedia Structured DB built from Wikipedia; machines can read it because it’s a DB. Query DBpedia with SPARQL
FRED (lib) / Pikes FRED(natural-language) => formal-structure
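As a rough illustration of the EntityRecognition(text) => (EntityName -> EntityType) signature from the Entity Extraction row, here is a lookup-based sketch. The one-entry gazetteer is hypothetical and hand-written; real systems use trained models, not a fixed list.

```python
# Hypothetical hand-written gazetteer: phrase -> (EntityName, EntityType).
GAZETTEER = {"paul newman": ("PaulNewman", "Person")}

def entity_recognition(text):
    """Return (EntityName, EntityType) pairs for every known phrase found in text."""
    lowered = text.lower()
    return [entity for phrase, entity in GAZETTEER.items() if phrase in lowered]

entities = entity_recognition("paul newman is a great actor")
```

Entity linking would then take "PaulNewman" plus the surrounding text and pick which Paul Newman is meant, which a plain lookup like this cannot do.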
