csv - Creating Rules for Word Permutations Based on a Word's POS Tag
I have a question, and since the community has been great at helping me along, I thought I'd give it a shot.
Right now I have Python 3 code that imports a CSV file whose first column holds a list of words, in the following format:
the words in column
Once the CSV file is uploaded and read by Python, the words are tagged using the NLTK POS tagger. From there, permutations are made of the words and the results are exported to a new CSV file. Right now, the total code goes like this:
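    import csv
    import itertools

    import nltk

    # Read every cell of the CSV into one flat list of words.
    with open(r'c:\users\jkk\desktop\python.csv', 'r') as f:
        reader = csv.reader(f)
        j = []
        for row in reader:
            j.extend(row)

    # Tag each word with its part of speech: a list of (word, tag) tuples.
    d = nltk.pos_tag(j)

    # Every ordered arrangement of three tagged words.
    c = list(itertools.permutations(d, 3))

    with open('test.csv', 'w') as a_file:
        for result in c:
            # Each item is a (word, tag) tuple, so join only the words.
            result = ' '.join(word for word, tag in result)
            a_file.write(result + '\n')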
My question is, how does one create rules for word permutations based on a word's tag? More specifically, the reason I'm tagging the words is that I don't want nonsense permutations (i.e. in / in / etc). Once the words are tagged with their respective parts of speech, how do I code rules based on the tag, for example: never have two "DT"-labeled words follow each other (i.e. "the" and "a"), or have an NN-tagged word followed by a VBG-tagged word (i.e. "looks" comes after "words")? And finally, once the rules are implemented, how do I get rid of the tags so that only the original words remain? I realize this is a general question, but any guidance on how to approach it would be much appreciated, as I am still new and learning every step of the way! Any resources, code, or advice would help! Thanks once again for taking the time to read this long post!
The set of rules that define the legal strings in a language is called a grammar (or formal grammar). There are many formalisms that allow you to express these rules. One that is reasonably simple to experiment with is a context-free grammar (CFG). NLTK comes with tools to generate strings from these. See the NLTK book's chapter on syntax, which goes into much more depth.
The following code targets Python 3 and NLTK 3 (it was written against 3.0a4). The API changed between NLTK 2 and 3, so it will not run on older versions.
    from nltk import CFG, pos_tag
    from nltk.parse.generate import generate
    from nltk.util import trigrams

    # Build a simple grammar
    cfg = """
    S -> NP VP
    VP -> VBZ NP
    NP -> DT | NN | DT NN | DT JJ NN | JJ NN
    """

    # These would come from your CSV
    words = "this is a simple sentence".split()
    tagged = set(pos_tag(words))

    # Add the words to the grammar as terminals, one rule per (tag, word) pair
    for word, tag in tagged:
        cfg += "{tag} -> '{word}'\n".format(word=word, tag=tag)

    # NLTK 3.0a4 spelled this parse_cfg(cfg); released NLTK 3 uses CFG.fromstring
    grammar = CFG.fromstring(cfg)

    # Collect every trigram that occurs in some sentence the grammar generates
    valid_trigrams = set()
    language = generate(grammar)
    for valid_sentence in language:
        valid_trigrams.update(list(trigrams(valid_sentence)))

    print(valid_trigrams)
    # Output (set order may vary):
    # {('simple', 'sentence', 'is'), ('this', 'is', 'this'),
    #  ('a', 'sentence', 'is'), ('sentence', 'is', 'a'),
    #  ('a', 'is', 'a'), ('this', 'is', 'simple'),
    #  ('sentence', 'is', 'this'), ('this', 'is', 'sentence'),
    #  ('is', 'a', 'sentence'), ('is', 'a', 'simple'),
    #  ('a', 'simple', 'sentence'), ('a', 'is', 'this'),
    #  ('this', 'simple', 'sentence'), ('this', 'is', 'a'),
    #  ('is', 'simple', 'sentence'), ('a', 'is', 'simple'),
    #  ('this', 'sentence', 'is'), ('is', 'this', 'sentence'),
    #  ('sentence', 'is', 'sentence'), ('sentence', 'is', 'simple'),
    #  ('is', 'this', 'simple'), ('a', 'is', 'sentence')}
There is a limitation to this approach though, since a context-free grammar cannot cover all of English. There is no known way of validating the syntax of the English language anyway, so you can only have an approximate solution.
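If a full grammar is more than you need, the two things the question asks for (rejecting permutations by tag and then dropping the tags) can also be approximated with a plain filter. This is only a minimal sketch: the `tagged` list stands in for the output of `nltk.pos_tag`, and `FORBIDDEN_PAIRS`, `is_allowed` and `strip_tags` are hypothetical names made up for illustration.

```python
import itertools

# Hypothetical input: (word, tag) pairs shaped like nltk.pos_tag output.
tagged = [("the", "DT"), ("a", "DT"), ("words", "NNS"), ("look", "VBP")]

# Tag pairs that must never appear next to each other (illustrative rule set).
FORBIDDEN_PAIRS = {("DT", "DT")}


def is_allowed(permutation):
    """Return True if no adjacent pair of tags is in FORBIDDEN_PAIRS."""
    tags = [tag for _, tag in permutation]
    return all(pair not in FORBIDDEN_PAIRS for pair in zip(tags, tags[1:]))


def strip_tags(permutation):
    """Drop the tags and join the remaining words into one string."""
    return " ".join(word for word, _ in permutation)


# Keep only the permutations that pass the rule, then discard the tags.
phrases = [
    strip_tags(p)
    for p in itertools.permutations(tagged, 3)
    if is_allowed(p)
]
```

Each string in `phrases` can then be written to the output file one per line, as in the question's code. More rules can be added by putting further tag pairs into the set.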
Another thing you should be aware of is that the POS tagger assumes that the order of the words is relevant. Roughly, it gives each word a set of possible tags, then refines them based on the preceding and/or following words. If your CSV contains sentences, that is fine; otherwise, you might want to use a unigram POS tagger, nltk.tag.UnigramTagger, which ignores context and assigns each word its most common tag. The issue is that words like "run" can be a verb or a noun ("a morning run" vs "I run").
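To make the "most common tag" idea concrete, here is a rough sketch in plain Python of what a unigram-style lookup does. It is not NLTK's implementation, and the tiny `training` list is invented for illustration; a real tagger would be trained on a tagged corpus.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sentences (invented, not from a real corpus).
training = [
    [("a", "DT"), ("morning", "NN"), ("run", "NN")],
    [("i", "PRP"), ("run", "VBP")],
    [("they", "PRP"), ("run", "VBP")],
]

# Count how often each word appears with each tag.
tag_counts = defaultdict(Counter)
for sentence in training:
    for word, tag in sentence:
        tag_counts[word][tag] += 1


def most_common_tag(word, default=None):
    """Return the tag seen most often for `word`, ignoring its neighbours."""
    counts = tag_counts.get(word)
    if not counts:
        return default  # word never seen in the training data
    return counts.most_common(1)[0][0]


# Words are tagged one at a time, so their order in the CSV does not matter.
tags = [(w, most_common_tag(w)) for w in ["run", "morning", "column"]]
```

In this toy data "run" is seen twice as a verb and once as a noun, so it always comes out as a verb, whatever surrounds it. That is the trade-off of ignoring context.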
csv python-3.x nlp permutation nltk