Find and replace using PowerShell Get-Content
I am attempting to mask SSNs with random SSNs in a big text file. The file is 400 MB, or 0.4 GB.
There are 17,000 instances of SSNs that I want to find and replace.
Here is an illustration of the PowerShell script I am using.
(get-content c:\trainingfile\trainingfile.txt) | foreach-object {$_ -replace "123-45-6789", "666-66-6666"} | set-content c:\trainingfile\trainingfile.txt
My problem is that I have 17,000 lines of code like this in a .ps1 file. The .ps1 file looks similar to this:
(get-content c:\trainingfile\trainingfile.txt) | foreach-object {$_ -replace "123-45-6789", "666-66-6666"} | set-content c:\trainingfile\trainingfile.txt
(get-content c:\trainingfile\trainingfile.txt) | foreach-object {$_ -replace "122-45-6789", "666-66-6668"} | set-content c:\trainingfile\trainingfile.txt
(get-content c:\trainingfile\trainingfile.txt) | foreach-object {$_ -replace "223-45-6789", "666-66-6667"} | set-content c:\trainingfile\trainingfile.txt
(get-content c:\trainingfile\trainingfile.txt) | foreach-object {$_ -replace "123-44-6789", "666-66-6669"} | set-content c:\trainingfile\trainingfile.txt
And so on for 17,000 PowerShell commands in the .ps1 file, one command per line.
I tested one command and it took 15 seconds to execute. Doing the math, 17,000 x 15 seconds comes out to about 3 days to run the .ps1 script of 17,000 commands.
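As a quick check of that estimate, here is a minimal sketch of the arithmetic. The two figures come from the question; the variable names are only illustrative.

```powershell
# Figures taken from the question: 17,000 commands at roughly 15 seconds each.
$commandCount = 17000
$secondsPerCommand = 15

$totalSeconds = $commandCount * $secondsPerCommand
$estimate = [TimeSpan]::FromSeconds($totalSeconds)

# TotalDays is a fractional number of days; format it to two decimals for display.
"{0:N0} seconds is about {1:N2} days" -f $totalSeconds, $estimate.TotalDays
```

That is 255,000 seconds, a little under 3 days, assuming every command takes about as long as the one that was timed.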
Is there a faster way to do this?
The reason for the poor performance is that a lot of extra work is being done. Let's look at your process as a pseudo-algorithm, like so:
Select an SSN (x) and its masked SSN (x') from the list
Read all rows from the file
Look at each file row for the string x
If found, replace it with x'
Save all rows to the file
Loop until all SSNs are processed
So what's the problem? For each SSN replacement, you process all the rows: not only those that need masking but also those that don't. That's a lot of extra work. If you've got, say, 100 rows and 10 replacements, you are going to use 1000 steps when only 100 are needed. In addition, reading and saving the file creates disk IO. Whilst that's not often an issue for a single operation, multiply the IO cost by the loop count and you'll find quite a big amount of time wasted on disk waits.
For great performance, tune your algorithm like so:
Read all rows from the file
Loop through the rows
For the current row, change x -> x'
Save the result
Why should this be faster? 1) You read and save the file only once. Disk IO is slow. 2) You process each row only once, so extra work is not being done. As for how to actually perform the x -> x' transform, you've got to define more carefully what the masking rule is.
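The single-pass idea can be sketched as follows. This is only an illustration under assumptions: the sample rows and the $mask lookup are hypothetical stand-ins for the real file and the real masking rule, and only the first SSN on each row is handled.

```powershell
# Hypothetical sample rows standing in for the contents of the training file.
$rows = @(
    'alice,123-45-6789,Seattle',
    'no ssn on this row',
    'bob,223-45-6789,Denver'
)

# Hypothetical lookup of original SSN -> masked SSN (pairs taken from the question).
$mask = @{
    '123-45-6789' = '666-66-6666'
    '223-45-6789' = '666-66-6667'
}

# One pass: each row is visited exactly once, however many SSNs are in the lookup.
$masked = foreach ($row in $rows) {
    if ($row -match '\d{3}-\d{2}-\d{4}' -and $mask.ContainsKey($matches[0])) {
        # String.Replace does a literal replacement, so no regex escaping is needed
        $row.Replace($matches[0], $mask[$matches[0]])
    }
    else {
        $row   # rows without a known SSN pass through unchanged
    }
}

$masked
```

With a real file, $rows would come from a single Get-Content call and $masked would go to a single Set-Content call, so the disk is read once and written once.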
Edit
Here's a more practical resolution:
Since you already know the f(x) -> x' results, you should have a pre-calculated list saved to disk like so,
ssn, mask
"123-45-6789", "666-66-6666"
...
"223-45-6789", "666-66-6667"
Import the file into a hash table and work forwards, stealing all the juicy bits from Ansgar's answer, like so,
$ssnmask = @{}
$ssn = import-csv "c:\temp\ssnmasks.csv" -delimiter ","
# add x -> x' to the hashtable
$ssn | % {
    if (-not $ssnmask.containskey($_.ssn)) {
        # it's an error to add an existing key, so check first
        $ssnmask.add($_.ssn, $_.mask)
    }
}
$datatomask = get-content "c:\temp\training.txt"
$datatomask | % {
    if ( $_ -match '(\d{3}-\d{2}-\d{4})' ) {
        # replace the SSN look-alike with the value from the hashtable
        # nb: an SSN that has no match in the hashtable is simply removed
        $_ -replace $matches[1], $ssnmask[$matches[1]]
    }
    else {
        # pass rows without an SSN through unchanged so they are not lost
        $_
    }
} | set-content "c:\temp\training2.txt"
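The script above only looks at the first SSN on each row. If a row can hold several, one possible variation (a sketch, not part of the original answer) is to pass a script block to [regex]::Replace as a MatchEvaluator, so every SSN-shaped match is looked up individually. The sample line and the $lookup name are hypothetical, and this version chooses to leave unknown SSNs untouched instead of removing them.

```powershell
# Hypothetical in-memory lookup; in the script above it is loaded with import-csv.
$ssnmask = @{
    '123-45-6789' = '666-66-6666'
    '122-45-6789' = '666-66-6668'
}
$pattern = '\d{3}-\d{2}-\d{4}'

# Called once per regex match; returns the masked value,
# or the original text when the SSN is not in the hashtable.
$lookup = [System.Text.RegularExpressions.MatchEvaluator]{
    param($match)
    if ($ssnmask.ContainsKey($match.Value)) { $ssnmask[$match.Value] }
    else { $match.Value }
}

# Hypothetical row holding two known SSNs and one unknown SSN.
$line = 'a 123-45-6789 b 122-45-6789 c 999-99-9999'
$result = [regex]::Replace($line, $pattern, $lookup)
$result
```

Each row is still processed once, and the hashtable lookup per match is cheap compared with running 17,000 separate -replace passes over the file.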
powershell