NLP cheatsheet
Term | Meaning |
---|---|
Weights and Vectors | |
TF-IDF | Term Frequency - Inverse Document Frequency: a word's weight is higher the more often it appears in the doc and the rarer it is in the rest of the corpus. |
length(TF-IDF, doc) | Number of distinct words in the doc: the vector holds one number (the weight) per word. |
Word Vectors | Calculate word vectors: for each word w1 and each word w2 in a 5-word window around it, make the vectors increasingly closer (v[w1] closer to v[w2]). Result: king - queen ~ man - woman // wow, it will find that for you! |
Google Word Vectors | You can download ready-made word vectors trained by Google. |
Text Structure | |
Part-Of-Speech Tagging | Word roles: is it a verb, a noun, …? It’s not always obvious. |
Head of sentence | head(sentence): the most important word. It’s not necessarily the first word; it’s the root of the sentence, e.g. "she hit the wall" => hit. You build a graph for the sentence and the head becomes its root. |
Named entities | People, Companies, Locations, …: a quick way to know what a text is about. |
Sentiment Analysis | |
Sentiment Dictionary | Each word carries a score, e.g. love: +2.9, hate: -3.2. “I loved you but now I hate you” => 2.9 - 3.2 = -0.3 |
Sentiment Entities | Which entity is the sentiment about: the movie or the cinema? |
Sentiment Features | Which feature of the entity is the sentiment about, e.g. Camera/Resolution, Camera/Convenience |
Text Classification | Decisions, decisions: what’s the topic, is the author happy, a native English speaker? Mostly supervised training: we have labeled examples, then map new text to labels. |
Supervised Learning | We have 3 sets: Train Set, Dev Set, Test Set. |
Train Set | Train (fit) the model |
Dev (= Validation) Set | Tune the model's parameters (also helps to prevent overfitting) |
Test Set | Check your model |
Text Features | Convert the documents to be classified into features: bags of words or word vectors; can use TF-IDF weights. |
LDA | Latent Dirichlet Allocation: LDA(Documents) => Topics. Technology topic: Scala, Programming, Machine Learning. Sport topic: Football, Basketball, Skateboards (the 3 most important words of each topic). Pick the number of topics ahead of time, e.g. 5 topics. Doc = Distribution(topics): a probability for each topic. Topic = Distribution(words): the technology topic puts a higher probability on the word "cpu". Unsupervised: it finds what topic patterns are there. Good for getting a sense of what a doc is about. |
Machine Reading | |
Entity Extraction | EntityRecognition(text) => (EntityName -> EntityType), e.g. EntityRecognition(“paul newman is a great actor”) => [(PaulNewman -> Person)] |
Entity Linking | EntityLinking(Entity) => FixedMeaning, e.g. EntityLinking(“PaulNewman”) => “http://wikipedia../paul_newman_the_actor” (and not the other Paul Newman, decided from the text) |
DBpedia | A DB built from Wikipedia; machines can read it because it’s a DB. Query DBpedia with SPARQL. |
FRED (lib) / Pikes | FRED(natural-language) => formal-structure |
Resources | https://www.youtube.com/watch?v=FcOH_2UxwRg https://tinyurl.com/word-vectors |
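
Two rows of the table are concrete enough to sketch in code: the TF-IDF weights and the sentiment dictionary score. The snippet below is a minimal illustration, not a reference implementation. The function names `tf_idf` and `sentiment_score`, the whitespace tokenizer, the sample documents, and the particular formula (term frequency times `log(N / document frequency)`, one of several common variants) are all assumptions made for the example. The lexicon scores are the ones from the table, and words are matched exactly with no stemming.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per document (hypothetical helper)."""
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on an empty doc
        vectors.append({
            # Frequent in this doc (tf) and rare in the corpus (idf) => high weight.
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return vectors


def sentiment_score(text, lexicon):
    """Sum the lexicon scores of the words in text; unknown words count as 0."""
    return sum(lexicon.get(word, 0.0) for word in text.lower().split())


# Made-up sample corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
vectors = tf_idf(docs)

# Scores taken from the table; exact word matching, so "loved" would not match.
lexicon = {"love": 2.9, "hate": -3.2}
score = sentiment_score("I love you but now I hate you", lexicon)
```

Each TF-IDF vector has one entry per distinct word of its doc. A word that appears in every doc ("the") gets weight 0, while a word unique to one doc ("mat") outweighs one shared by two docs ("cat"). The sentiment score comes out at roughly -0.3, i.e. 2.9 - 3.2.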