Searching within files

Is it possible to extend the site search so that it also searches within PDF files?

Thanks for your help!

Asseco
 
Asseco replied:
Does anyone have an idea? Unfortunately, Google search is not an option because the site has to stay protected.
JohntheFish replied:
I don't think this is available off the shelf. Some coding will be needed.

Jordanlev wrote a howto on something like your requirement:
http://www.concrete5.org/documentation/how-tos/developers/how-to-in...

You may also find this one useful. It does the same thing for products, so a similar technique might work for files:
http://www.concrete5.org/documentation/how-tos/developers/modify-si...

These add-ons may also do some of what you are looking for (they search files, but not file content):
http://www.concrete5.org/marketplace/addons/image-file-search/...
http://www.concrete5.org/marketplace/addons/document_library/...
Asseco replied (1 attachment):
Thanks!

Meanwhile, I've programmed my own solution.
After a file is uploaded, the contents of the PDF are read and written to the database. The search then looks for matches in these fields and shows them separately in the result list (with a direct download link to the file). In addition, the search can be limited to specific file sets.
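
A minimal sketch of that idea, not Asseco's actual code. It assumes the text has already been extracted from the PDF, and it uses plain PDO rather than concrete5's own database layer. The table name `file_text` and the functions `createIndex`, `indexFile` and `searchFiles` are hypothetical names made up for illustration.

```php
<?php
// Hypothetical schema: one row per uploaded file, holding its extracted text.
function createIndex(PDO $db): void {
    $db->exec('CREATE TABLE IF NOT EXISTS file_text (
        file_id INTEGER PRIMARY KEY,
        fileset TEXT NOT NULL,
        content TEXT NOT NULL
    )');
}

// Store (or overwrite) the extracted text for one file.
function indexFile(PDO $db, int $fileId, string $fileset, string $text): void {
    $stmt = $db->prepare(
        'REPLACE INTO file_text (file_id, fileset, content) VALUES (?, ?, ?)'
    );
    $stmt->execute([$fileId, $fileset, $text]);
}

// Return the IDs of files whose text contains $query,
// optionally limited to a single file set.
function searchFiles(PDO $db, string $query, ?string $fileset = null): array {
    // Escape LIKE wildcards so that "%" and "_" in the query match literally.
    // "!" is used as the escape character and must be replaced first.
    $escaped = str_replace(['!', '%', '_'], ['!!', '!%', '!_'], $query);
    $sql = "SELECT file_id FROM file_text WHERE content LIKE ? ESCAPE '!'";
    $params = ['%' . $escaped . '%'];
    if ($fileset !== null) {
        $sql .= ' AND fileset = ?';
        $params[] = $fileset;
    }
    $sql .= ' ORDER BY file_id';
    $stmt = $db->prepare($sql);
    $stmt->execute($params);
    // Normalise to integers; some PDO drivers return numeric columns as strings.
    return array_map('intval', $stmt->fetchAll(PDO::FETCH_COLUMN));
}
```

Each returned ID could then be turned into a download link in the result list. A plain `LIKE` scan is the simplest option and will get slow on large libraries, where a full-text index would probably be the better fit.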

If you want to know the details, please let me know.
melat0nin replied:
I've achieved this independently with the same method, using server-side processing of the files (though not in PHP, so it can't be packaged). Did you use a PHP library to extract text from the files? I couldn't get any of the open source ones to work reliably with all the various PDF versions.
Asseco replied:
I use these two functions. Unfortunately, they do not work with every PDF, and I don't know why.

private function pdf2string($sourcefile) {
    // Read the whole PDF into memory as a binary string.
    $fp = fopen($sourcefile, 'rb');
    $content = fread($fp, filesize($sourcefile));
    fclose($fp);
    // PDF content streams sit between these two keywords.
    $searchstart = 'stream';
    $searchend = 'endstream';
    $pdfText = '';
    $pos = 0;
    $pos2 = 0;
    $startpos = 0;
    // Walk through the file, one stream/endstream pair at a time.
    while ($pos !== false && $pos2 !== false) {
        $pos = strpos($content, $searchstart, $startpos);
        $pos2 = strpos($content, $searchend, $startpos + 1);
        if ($pos !== false && $pos2 !== false) {
            if ($content[$pos] == 0x0d && $content[$pos + 1] == 0x0a) {
melat0nin replied:
Ah yes, I tried the same script and had limited success. The reason is that it doesn't support all versions of the PDF file format, which makes it fairly useless in practice.

I managed to get round it using two server-side approaches combined with the PHP shell_exec function. For PDFs I'm using the pdftotext utility from the xpdf package, and for Word files I'm using a headless install of OpenOffice combined with the unoconv command-line utility. Both can write to stdout, so it's easy to get the parsed text back into PHP. This is on a Linux (CentOS) server, so I'm not sure how cross-platform this approach is, but it works well for me.
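
A minimal sketch of the PDF half of this approach, assuming `pdftotext` is installed and on the web server's PATH and that `shell_exec` is not disabled in php.ini. The function names `buildPdfToTextCommand` and `extractPdfText` are hypothetical, and the whitespace clean-up at the end is an added convenience rather than part of the method described above.

```php
<?php
// Build the shell command. Passing "-" as the output file makes
// pdftotext write the extracted text to stdout instead of a file.
function buildPdfToTextCommand(string $pdfPath): string {
    // escapeshellarg() quotes the path so spaces and shell
    // metacharacters in file names cannot break the command.
    return 'pdftotext -enc UTF-8 ' . escapeshellarg($pdfPath) . ' -';
}

// Return the extracted text, or null if the file is unreadable
// or pdftotext produced no output.
function extractPdfText(string $pdfPath): ?string {
    if (!is_readable($pdfPath)) {
        return null;
    }
    $output = shell_exec(buildPdfToTextCommand($pdfPath));
    if (!is_string($output) || trim($output) === '') {
        return null;
    }
    // Collapse runs of whitespace so the text is easier to store and search.
    return trim(preg_replace('/\s+/', ' ', $output));
}
```

Calling `extractPdfText('/path/to/file.pdf')` after an upload would give a string that can be written to the database. A `null` result covers both a missing utility and a PDF with no extractable text (a scanned document, for example), so the caller may want to log those cases.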

I'll probably write up the steps into a howto to share with the community.
webpresso replied:
It would be really interesting to see how you achieved this!