Editing (v7+)

Searching within files

January 31, 2012 at 8:23 AM

Is it possible to extend the site search so that it also searches the contents of PDF files?

Thanks for your help!

Asseco replied on May 3, 2012 at 4:50 am

Does anyone have an idea? Unfortunately, Google search is not an option because the site has to be protected.

JohntheFish replied on May 3, 2012 at 6:14 am

I don't think this is available off the shelf. Some coding will be needed.

Jordanlev wrote a howto on something like your requirement:
http://www.concrete5.org/documentation/how-tos/developers/how-to-in...

You may also find this one useful, as it does the same thing for products (so you could use a similar technique for files):
http://www.concrete5.org/documentation/how-tos/developers/modify-si...

These addons may also do some of what you are looking for (they search files, but not file content)
http://www.concrete5.org/marketplace/addons/image-file-search/...
http://www.concrete5.org/marketplace/addons/document_library/...

Asseco replied on May 23, 2012 at 8:04 am (1 attachment)

Thanks!

In the meantime, I've programmed my own solution.
After a file is uploaded, the contents of the PDF are read and written to the database. The search then looks for matches in these fields and shows them separately in the results list (with a direct download link to the file). In addition, the search can be limited to specific file sets.

If you want to know the details, please let me know.
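
For readers following along, the store-and-search step described above might look roughly like this. It is a minimal sketch, not Asseco's code: the table name, column names, and function names are all made up for illustration, it uses plain PDO rather than concrete5's own database layer, and it simplifies things by giving each file a single file set.

```php
<?php
// Sketch only: the table, columns, and function names are hypothetical.

function createIndexTable(PDO $db): void
{
    $db->exec(
        'CREATE TABLE IF NOT EXISTS FileTextIndex (
            fileId INTEGER PRIMARY KEY,
            fileSetId INTEGER,
            content TEXT NOT NULL
        )'
    );
}

/** Store (or refresh) the extracted text for one uploaded file. */
function indexFileText(PDO $db, int $fileId, ?int $fileSetId, string $text): void
{
    $stmt = $db->prepare(
        'REPLACE INTO FileTextIndex (fileId, fileSetId, content) VALUES (?, ?, ?)'
    );
    $stmt->bindValue(1, $fileId, PDO::PARAM_INT);
    $stmt->bindValue(2, $fileSetId, $fileSetId === null ? PDO::PARAM_NULL : PDO::PARAM_INT);
    $stmt->bindValue(3, $text, PDO::PARAM_STR);
    $stmt->execute();
}

/** Return the IDs of files whose text contains $query, optionally within one file set. */
function searchFileText(PDO $db, string $query, ?int $fileSetId = null): array
{
    // Escape LIKE wildcards so the query is matched literally.
    $pattern = '%' . strtr($query, ['!' => '!!', '%' => '!%', '_' => '!_']) . '%';
    $sql = "SELECT fileId FROM FileTextIndex WHERE content LIKE ? ESCAPE '!'";
    if ($fileSetId !== null) {
        $sql .= ' AND fileSetId = ?';
    }
    $sql .= ' ORDER BY fileId';

    $stmt = $db->prepare($sql);
    $stmt->bindValue(1, $pattern, PDO::PARAM_STR);
    if ($fileSetId !== null) {
        $stmt->bindValue(2, $fileSetId, PDO::PARAM_INT);
    }
    $stmt->execute();

    return array_map('intval', $stmt->fetchAll(PDO::FETCH_COLUMN));
}
```

The returned file IDs would then be turned into download links by whatever renders the search results. Binding the text through a prepared statement also means the extraction code does not need to escape it for SQL.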

melat0nin replied on May 1, 2013 at 12:26 pm

I've achieved this independently using the same method, with server-side processing of the files (though not in PHP, so it's not packageable). Did you use a PHP library to scrape the text from the files? I couldn't get any of the open source ones to work reliably with all the various PDF versions.

Asseco replied on May 2, 2013 at 4:39 am

I use these two functions. Unfortunately, they do not work with every PDF, and I don't know why.

private function pdf2string($sourcefile) {          
      $fp = fopen($sourcefile, 'rb'); 
      $content = fread($fp, filesize($sourcefile)); 
      fclose($fp);
 
      $searchstart = 'stream'; 
      $searchend = 'endstream'; 
      $pdfText = ''; 
      $pos = 0; 
      $pos2 = 0; 
      $startpos = 0; 
 
      while ($pos !== false && $pos2 !== false) { 
         $pos = strpos($content, $searchstart, $startpos); 
         $pos2 = strpos($content, $searchend, $startpos + 1); 
         if ($pos !== false && $pos2 !== false){ 
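            // The stream data begins after the "stream" keyword and its
            // end-of-line marker (CRLF or LF). Compare against characters,
            // not integers, otherwise the checks never match.
            $datastart = $pos + strlen($searchstart); 
            if (substr($content, $datastart, 2) === "\r\n") { 
               $datastart += 2; 
            } else if (substr($content, $datastart, 1) === "\n") { 
               $datastart++; 
            } 
            // Drop the end-of-line marker that precedes "endstream". 
            $dataend = $pos2; 
            if (substr($content, $dataend - 2, 2) === "\r\n") { 
               $dataend -= 2; 
            } else if (substr($content, $dataend - 1, 1) === "\n") { 
               $dataend--; 
            } 
            $textsection = substr($content, $datastart, $dataend - $datastart); 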
            $data = @gzuncompress($textsection);
            //$data = $textsection; 
            $pdfText .= self::pdfExtractText($data); 
            $startpos = $pos2 + strlen($searchend) - 1; 
         } 
      } 
      return preg_replace('/(\s)+/', ' ', $pdfText); 
   } 
 
   private function pdfExtractText($psData){ 
      if (!is_string($psData)) { return '';  } 
      $text = ''; 
      // Handle brackets in the text stream that could be mistaken for 
      // the end of a text field. I'm sure you can do this as part of the 
      // regular expression, but my skills aren't good enough yet. 
      $psData = str_replace('\)', '##ENDBRACKET##', $psData); 
      $psData = str_replace('\]', '##ENDSBRACKET##', $psData); 
 
      preg_match_all( 
         '/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si', 
         $psData, 
         $matches 
      ); 
      for ($i = 0; $i < sizeof($matches[0]); $i++) { 
         if ($matches[3][$i] != '') { 
            // Run another match over the contents. 
            preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches); 
            foreach ($subMatches[1] as $subMatch) { 
               $text .= $subMatch; 
            } 
         } else if ($matches[4][$i] != '') { 
            $text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i]; 
         } 
      } 
      // Translate special characters and put back brackets. 
      $trans = array( 
         '\221'            => chr(145), 
         '\222'            => chr(146), 
         '\223'            => chr(147), 
         '\224'            => chr(148), 
         '\226'            => '-', 
         '\267'            => '•',
         '\('            => '(', 
         '\['            => '[', 
         '##ENDBRACKET##'   => ')', 
         '##ENDSBRACKET##'   => ']', 
         chr(133)         => '-', 
         chr(141)         => chr(147), 
         chr(142)         => chr(148), 
         chr(143)         => chr(145), 
         chr(144)         => chr(146),
         "'"               => '"',
         ".."            => ' ',
         chr(13)            => ' ',
         chr(10)            => ' ',
         chr(9)            => ' ',
         '\340'            => 'a',         // à      |http://www.the-art-of-web.com/html/character-codes/...
         chr(224)         => 'a',
         '\341'            => 'a',         // á
         chr(225)         => 'a',
         '\225'            => chr(149),   // *
         '\304'            => chr(196),   // Ä
         '\344'            => chr(228),   // ä
         '\374'            => chr(252),   // ü
         '\334'            => chr(220),   // Ü
         '\366'            => chr(246),   // ö
         '\326'            => chr(214),   // Ö
         '\337'            => chr(223),   // ß
         '\44'            => chr(36),      // $
         '\45'            => chr(37),      // %
         '\251'            => chr(169),   // ©
         '\256'            => chr(174),   // ®
         '\74'            => chr(60),      // <
         '\76'            => chr(62),      // >
         chr(149)         => ' -',
         chr(150)         => ' -',
         chr(133)         => 'a',
         chr(205)         => 'a',
         '\037'            => ' -',
         '\n'            => ' ',
         '\r'            => ' '
      );
      $text = strtr($text, $trans); 
      $text = preg_replace('/([0-9])([A-Z][a-z])/', '$1 $2', $text);
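      // mysql_real_escape_string() was removed in PHP 7.0 and utf8_encode() is
      // deprecated as of PHP 8.2. Return UTF-8 text (requires mbstring) and bind
      // it through a prepared statement when writing it to the database.
      return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1'); 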
   }

melat0nin replied on May 2, 2013 at 5:40 am

Ah yes, I tried the same script and had limited success. The reason is that it doesn't support all versions of the PDF file format, which makes it fairly useless in practice.

I managed to get around it by combining two server-side tools with PHP's shell_exec function. For PDFs I'm using the pdftotext utility from the xpdf package, and for Word files I'm using a headless install of OpenOffice together with the unoconv command-line utility. Both can write to stdout, so it's easy to get the parsed text back into PHP. This is on a Linux (CentOS) server, so I'm not sure how cross-platform this approach is, but it works well for me.

I'll probably write up the steps into a howto to share with the community.
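
Until that howto exists, here is a rough sketch of what the shell_exec approach described above could look like. It is not melat0nin's actual code: the function names are invented for illustration, and it assumes pdftotext and unoconv are installed and on the PATH of a Unix-like server. The only tool-specific details it relies on are that pdftotext writes to stdout when the output file is given as "-" and that unoconv has a --stdout option.

```php
<?php
// Sketch only: the function names are hypothetical, and pdftotext (xpdf or
// poppler) and unoconv are assumed to be installed and on the PATH.

/** Build the shell command that prints a file's text to stdout, or null. */
function buildExtractCommand(string $path): ?string
{
    $arg = escapeshellarg($path);
    switch (strtolower(pathinfo($path, PATHINFO_EXTENSION))) {
        case 'pdf':
            // A lone "-" as the output file makes pdftotext write to stdout.
            return "pdftotext -enc UTF-8 $arg -";
        case 'doc':
        case 'docx':
            // unoconv drives the headless office install; --stdout prints the result.
            return "unoconv -f txt --stdout $arg";
        default:
            return null;
    }
}

/** Return the extracted text with whitespace collapsed, or '' on failure. */
function extractText(string $path): string
{
    $command = buildExtractCommand($path);
    if ($command === null || !is_readable($path)) {
        return '';
    }
    // Discard stderr so tool warnings do not end up in the indexed text.
    $output = shell_exec($command . ' 2>/dev/null');
    if (!is_string($output)) {
        return '';
    }
    return trim((string) preg_replace('/\s+/', ' ', $output));
}
```

Passing the path through escapeshellarg matters here because uploaded file names are user-controlled. The "2>/dev/null" redirect is Unix-specific, so this would need adjusting on a Windows host.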

webpresso replied on Jun 26, 2013 at 4:08 pm

It would be really interesting to hear how you achieved this!
