Monday, 15 June 2015

Find and replace using PowerShell Get-Content

I am attempting to mask SSNs with random SSNs in a big text file. The file is 400 MB, or 0.4 GB.

There are 17,000 instances of SSNs that I want to find and replace.

Here is an illustration of the PowerShell script I am using:

(Get-Content c:\trainingfile\trainingfile.txt) | ForEach-Object {$_ -replace "123-45-6789", "666-66-6666"} | Set-Content c:\trainingfile\trainingfile.txt

My problem is that I have 17,000 lines like this in a .ps1 file. The .ps1 file looks similar to this:

(Get-Content c:\trainingfile\trainingfile.txt) | ForEach-Object {$_ -replace "123-45-6789", "666-66-6666"} | Set-Content c:\trainingfile\trainingfile.txt
(Get-Content c:\trainingfile\trainingfile.txt) | ForEach-Object {$_ -replace "122-45-6789", "666-66-6668"} | Set-Content c:\trainingfile\trainingfile.txt
(Get-Content c:\trainingfile\trainingfile.txt) | ForEach-Object {$_ -replace "223-45-6789", "666-66-6667"} | Set-Content c:\trainingfile\trainingfile.txt
(Get-Content c:\trainingfile\trainingfile.txt) | ForEach-Object {$_ -replace "123-44-6789", "666-66-6669"} | Set-Content c:\trainingfile\trainingfile.txt

That continues for 17,000 PowerShell commands in the .ps1 file, one command per line.

I did a test on one command and it took 15 seconds to execute. Doing the math, 17,000 x 15 seconds comes out to about 255,000 seconds, or roughly 3 days, to run the .ps1 script of 17,000 commands.

Is there a faster way to do this?

The reason for the poor performance is the sheer amount of unnecessary work being done. Let's write the process out as a pseudo-algorithm, like so:

select an SSN (x) and its masked SSN (x') from the list
    read the rows from the file
    for each file row, search for string x
        if found, replace with x'
    save the rows to the file
loop until all SSNs are processed

So what's the problem? For each SSN replacement, every row is processed, whether it needs masking or not. That's a lot of wasted work. If you have 100 rows and 10 replacements, you are going to use 1,000 steps when only 100 are needed. In addition, reading and saving the file creates disk I/O. Whilst that's not an issue for a single operation, multiply the I/O cost by the loop count and you'll find quite a lot of time is wasted on disk waits.

For better performance, tune the algorithm like so:

read the rows from the file
loop through the rows
    for the current row, change x -> x'
save the result

Why should this be faster? 1) You read and save the file only once; disk I/O is slow. 2) You process each row only once, so redundant work is not being done. As for how to perform the x -> x' transform, you'll have to define in more detail what the masking rule is.
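
As a minimal sketch of that structure, assuming a hypothetical Mask-Ssn function that implements whatever x -> x' rule you settle on:

# Read the file once
$rows = Get-Content c:\trainingfile\trainingfile.txt
# Touch each row exactly once; Mask-Ssn is a placeholder for the actual transform
$masked = $rows | ForEach-Object { Mask-Ssn $_ }
# Write the result once
$masked | Set-Content c:\trainingfile\trainingfile.txt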

Edit:

Here's a more practical solution:

Since you know the f(x) -> x' results, you should have a pre-calculated list saved on disk, like so:

ssn, mask
"123-45-6789", "666-66-6666"
...
"223-45-6789", "666-66-6667"
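
If that list doesn't exist yet, one way to build it is to scan the training file for SSN-shaped strings and pair each one with a random mask. This is only a sketch: the random 3-2-4 mask format is an assumption, and it does not guarantee that the generated masks are unique.

# One-time step (sketch): collect the distinct SSN-shaped strings in the file
$found = Get-Content "c:\temp\training.txt" |
    Select-String -Pattern '\d{3}-\d{2}-\d{4}' -AllMatches |
    ForEach-Object { $_.Matches.Value } |
    Sort-Object -Unique

# Pair each one with a random-looking mask and save the list for reuse
# (NB: this simple version does not guarantee uniqueness of the masks)
$rand = New-Object System.Random
$found | ForEach-Object {
    [pscustomobject]@{
        ssn  = $_
        mask = '{0:000}-{1:00}-{2:0000}' -f $rand.Next(0,1000), $rand.Next(0,100), $rand.Next(0,10000)
    }
} | Export-Csv "c:\temp\ssnmasks.csv" -NoTypeInformation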

Import that file into a hash table and work forward from there, stealing the juicy bits from Ansgar's answer, like so:

$ssnMask = @{}
$ssn = Import-Csv "c:\temp\ssnmasks.csv" -Delimiter ","

# Add the x -> x' pairs to a hashtable
$ssn | % {
    if (-not $ssnMask.ContainsKey($_.ssn)) {
        # It's an error to add an existing key, so check first
        $ssnMask.Add($_.ssn, $_.mask)
    }
}

$dataToMask = Get-Content "c:\temp\training.txt"
$dataToMask | % {
    if ($_ -match '(\d{3}-\d{2}-\d{4})') {
        # Replace the SSN look-a-like with the value from the hashtable
        # NB: this removes SSNs that don't have a match in the hashtable
        $_ -replace $matches[1], $ssnMask[$matches[1]]
    }
} | Set-Content "c:\temp\training2.txt"
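
One possible refinement, not part of the original answer: as the comment notes, the snippet above drops rows that contain no SSN-shaped string (the if has no else) and blanks out any SSN that has no entry in the hashtable. A sketch that passes such rows through unchanged, assuming the same $ssnMask table, could look like this:

Get-Content "c:\temp\training.txt" | % {
    if ($_ -match '(\d{3}-\d{2}-\d{4})' -and $ssnMask.ContainsKey($matches[1])) {
        # Known SSN: swap in its pre-calculated mask
        $_ -replace $matches[1], $ssnMask[$matches[1]]
    } else {
        # No SSN on this row, or no mapping for it: keep the row as-is
        $_
    }
} | Set-Content "c:\temp\training2.txt"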

powershell
