Tuesday, 15 September 2015

How to Remove Hidden Characters in PHP




I have the following piece of code, which reads text files from a directory. I use a list of stopwords, and after removing the stopwords from the files, when I print the words of these files along with their positions, blank entries appear in the places where the stopwords existed in the document.

For example, a file reads like this:

department of computer science // document

After removing the stopword 'of' from the document, when I loop through the document the following output comes out:

department(0) (1) computer(2) science(3) //output

But the blank entry should not be there.

Here is the code:

<?php
$directory = "archive/";
$dir = opendir($directory);

while (($file = readdir($dir)) !== false) {
    $filename = $directory . $file;
    $type = filetype($filename);

    if ($type == 'file') {
        $contents = file_get_contents($filename);
        // Collapse whitespace, then strip everything except letters, digits, hyphens and spaces.
        $texts = preg_replace('/\s+/', ' ', $contents);
        $texts = preg_replace('/[^a-zA-Z0-9\-\n ]/', '', $texts);
        $text = explode(" ", $texts);
        $text = array_map('strtolower', $text);
        $stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it", "i", "is", "its", "of", "on", "that", "the", "to", "was", "were", "will", "with", "or", " ");
        $text = (array_diff($text, $stopwords));
        echo "<br><br>";
        $total_count = count($text);
        $b = -1;
        // Print each word followed by its position.
        foreach ($text as $a => $v) {
            $b++;
            echo $text[$b] . "(" . $b . ")" . " ";
        }
    }
}
closedir($dir);
?>

I'm genuinely not 100% sure about the final output of the string position, but I'm assuming you are placing that there for reference only. This test code, which uses a regex with preg_replace, seems to work well.

<?php
header('Content-Type: text/plain; charset=utf-8');

// Set test content array.
$contents_array = array();
$contents_array[] = "department of computer science // document";
$contents_array[] = "department of economics // document";

// Set stopwords.
$stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it", "i", "is", "its", "of", "on", "that", "the", "to", "was", "were", "will", "with", "or");

// Set regex based on stopwords; \b on both sides so only whole words match.
$regex = '/\b(?:' . implode('|', $stopwords) . ')\b/i';

foreach ($contents_array as $contents) {
    // Remove stopwords.
    $contents = preg_replace($regex, '', $contents);
    // Clear out whitespace; 2 spaces or more in a row.
    $contents = preg_replace('/\s{2,}/', ' ', $contents);
    // Echo contents.
    echo $contents . "\n";
}

The output is cleaned & formatted like this:

department computer science // document

department economics // document
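One detail worth checking in a pattern like this is that every stopword is anchored with \b on both sides and escaped before it goes into the regex. Without that, a short stopword such as "a" or "in" can match part of a longer word. Here is a minimal sketch of the idea; the helper names buildStopwordRegex and removeStopwords are made up for illustration and are not from the original post:

```php
<?php
// Hypothetical helper (not from the original post): builds one
// case-insensitive pattern that matches any stopword as a whole word.
function buildStopwordRegex(array $stopwords): string
{
    // preg_quote() escapes regex metacharacters in each stopword.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stopwords);

    // \b on both sides keeps "in" from matching inside "main".
    return '/\b(?:' . implode('|', $escaped) . ')\b/i';
}

// Hypothetical helper: strips stopwords, then collapses leftover whitespace.
function removeStopwords(string $text, array $stopwords): string
{
    $text = preg_replace(buildStopwordRegex($stopwords), '', $text);
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo removeStopwords("data is stored in the main department", array("a", "in", "is", "the"));
```

Here "data" and "main" are left alone even though they contain "a" and "in", because those letters are not standalone words.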

So to integrate this into your code, it should look like this. Note how I moved $stopwords & $regex outside of the while loop, since it makes no sense to reset those values on each while loop iteration. Set them one time outside of the loop & let the stuff in the loop focus on what needs to be there in the loop:

<?php
$directory = "archive/";
$dir = opendir($directory);

// Set stopwords.
$stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it", "i", "is", "its", "of", "on", "that", "the", "to", "was", "were", "will", "with", "or");

// Set regex based on stopwords; \b on both sides so only whole words match.
$regex = '/\b(?:' . implode('|', $stopwords) . ')\b/i';

while (($file = readdir($dir)) !== false) {
    $filename = $directory . $file;
    $type = filetype($filename);

    if ($type == 'file') {
        // Contents of filename.
        $contents = file_get_contents($filename);
        // Remove stopwords.
        $contents = preg_replace($regex, '', $contents);
        // Clear out whitespace; 2 spaces or more in a row.
        $contents = preg_replace('/\s{2,}/', ' ', $contents);
        // Echo contents.
        echo $contents;
    }
}
closedir($dir);
?>
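If the word positions from the question are still needed, another option is to keep the array approach and re-index after filtering. array_diff() preserves the original keys, so the removed stopwords leave gaps in the numbering; array_values() renumbers the remaining words from 0. This is only a sketch, and the helper name wordsWithoutStopwords is made up for illustration:

```php
<?php
// Hypothetical helper (not from the original post): returns the remaining
// words of $line, re-indexed from 0 after the stopwords are dropped.
function wordsWithoutStopwords(string $line, array $stopwords): array
{
    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words = preg_split('/\s+/', strtolower($line), -1, PREG_SPLIT_NO_EMPTY);

    // array_diff() keeps the original keys (e.g. 0, 2, 3), so
    // array_values() is needed to renumber them as 0, 1, 2.
    return array_values(array_diff($words, $stopwords));
}

$words = wordsWithoutStopwords("department of computer science", array("of", "the"));
foreach ($words as $position => $word) {
    echo $word . "(" . $position . ") ";
}
```

Iterating with the foreach key, rather than a separate counter, keeps each word paired with its own position.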

