python - Removing a small number of lines from a large file
I have a big text file of lines composed of ASCII characters, but a small fraction of the lines contain non-ASCII characters. What is the fastest way to create a new text file containing only the ASCII lines? Right now I am checking each character in each line to see if it's ASCII, and writing each line to the new file if all of its characters are ASCII, but this method is rather slow. Also, I am using Python, but would be open to using other languages in the future.
Edit: updated with code
    #!/usr/bin/python

    def isascii(s):
        # Return True only if every character in s has a code point of 127 or less.
        for c in s:
            if ord(c) > 127:
                return False
        return True

    f = open('data.tsv')
    g = open('data-ascii-only.tsv', 'w')
    linenumber = 1
    for line in f:
        # Copy the line only when it is pure ASCII.
        if isascii(line):
            g.write(line)
        linenumber += 1
    f.close()
    g.close()
You can use grep: "-v" inverts the match so that only non-matching lines are kept, "-P" uses Perl regex syntax, and [\x80-\xff] is the character range for non-ASCII.

    grep -vP "[\x80-\xff]" data.tsv > data-ascii-only.tsv

See the question on how to grep for non-ASCII characters in UNIX for more about searching for ASCII characters with grep.
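If you would rather stay in Python, the same byte-range test can be applied per line instead of per character. The following is only a minimal sketch, assuming Python 3 and that the file can be read in binary mode; `filter_ascii_lines` and `NON_ASCII` are hypothetical names chosen for illustration, not part of the original code.

```python
import re

# Same byte range as the grep pattern: any byte from 0x80 to 0xff is non-ASCII.
NON_ASCII = re.compile(rb"[\x80-\xff]")


def filter_ascii_lines(src_path, dst_path):
    """Copy only the pure-ASCII lines of src_path to dst_path.

    Returns the number of lines that were dropped.
    """
    dropped = 0
    # Binary mode avoids decoding errors and compares raw bytes, as grep does.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            if NON_ASCII.search(line):
                dropped += 1
            else:
                dst.write(line)
    return dropped
```

Calling `filter_ascii_lines("data.tsv", "data-ascii-only.tsv")` would write the filtered file and return the count of removed lines. The regex search runs in C once per line, which is likely faster than calling `ord()` on every character, though it has not been benchmarked here.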
python unix