Files, their attributes, and finding them in the site search engine

I've seen bits and pieces about this in the forums but decided I need to sum some things up in a post.

The issue is this -- I've got a page that shows a list of music files the client adds through the file manager. The page spits out each file's title, url, etc. according to File Attributes the client specifies in the file manager.

But he wants the matching files to show up in the search results when a user searches for, say, a particular artist. The only place that artist text would be stored would be within the Artist custom attribute that's been created and assigned to a specific audio file.

First, what does checking "Content included in "Keyword Search"" actually do? Does it mean the file manager search or the site search? If the latter, that'd be great, but what would that search result point to? The actual file?

If the above doesn't get me anywhere, I wonder if it's possible to automatically create a page for each file that's added in the file manager -- that page content would be indexed, but in its page type template it would have a redirect that sends the user to the master music listing page, with the searched-for track highlighted.

Any guidance in this general area is much appreciated!

Best,
Dan

dvodvo
 
jordanlev replied (Best Answer, 1 attachment):
As far as I can tell, 'Content included in "Keyword Search"' and 'Field available in "Advanced Search"' are only applicable to the search in the File Manager (keyword search is the textbox you type into, advanced search is when you click the little "plus sign" and choose individual fields to filter on).

Also as far as I can tell, many of the things on this documentation page are wrong (well, out of date): http://www.concrete5.org/documentation/general-topics/search/...

One thing to consider is that the search block only links to pages, not files -- so even if the system's underlying file attribute search worked as you'd hope it would (that is, automatically be included in searches from the search block just by the mere existence of attributes), the search block wouldn't be able to link to it because it only links to pages, not file downloads. Ideally one day the search block would be modified to address this, but that day is not today :)

So, all that being said, here's how you can make file searching work:

1) You must alter the code of the block that is displaying the files on the page. You must add some code like this to its controller.php file:
public function getSearchableContent() {
   // getPermittedFileObjects() is a function you will have to write, but it is probably
   // the same thing you're doing when retrieving files for the view.
   // It needs to return an array of file objects.
   $files = $this->getPermittedFileObjects();
   $content = '';
   foreach ($files as $file) {
      // The trailing space keeps one file's description from running into the next file's title.
      $content .= $file->getTitle() . ' - ' . $file->getDescription() . ' ';
   }
   return $content;
}

If your files were being listed from the built-in "Content" block, then unfortunately you're out of luck and this solution won't work (because the "Content" block already has a getSearchableContent() function, and it only returns raw content -- which, if someone inserted a file from the file manager, is actually in there as a code along the lines of "CCM:CID_79", not the name or description of your file).
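
The function in step 1 only indexes each file's title and description. If the text you need lives in a custom file attribute (the Artist attribute from the original question, for example), the same idea can be extended. What follows is a rough, standalone sketch rather than tested concrete5 code: `StubFile` is a made-up stand-in for a real file object, `buildSearchableContent()` is a hypothetical helper, and the `artist` handle is an assumption about how the attribute was named.

```php
<?php
// Hypothetical stand-in for a concrete5 file object, so the sketch runs on its own.
class StubFile {
    private $title;
    private $description;
    private $attributes;

    public function __construct($title, $description, array $attributes = array()) {
        $this->title = $title;
        $this->description = $description;
        $this->attributes = $attributes;
    }
    public function getTitle() { return $this->title; }
    public function getDescription() { return $this->description; }
    // Assumed to mirror getAttribute($handle) on a real file: the value, or null if unset.
    public function getAttribute($handle) {
        return isset($this->attributes[$handle]) ? $this->attributes[$handle] : null;
    }
}

// Builds one searchable string from titles, descriptions, and the chosen attributes.
function buildSearchableContent(array $files, array $attributeHandles) {
    $parts = array();
    foreach ($files as $file) {
        $values = array($file->getTitle(), $file->getDescription());
        foreach ($attributeHandles as $handle) {
            $values[] = $file->getAttribute($handle);
        }
        foreach ($values as $value) {
            // Skip empty fields so the indexed text has no stray gaps.
            if ($value !== null && $value !== '') {
                $parts[] = (string) $value;
            }
        }
    }
    // Join with spaces so words from adjacent files do not run together.
    return implode(' ', $parts);
}

$files = array(
    new StubFile('Blue Train', 'Title track', array('artist' => 'John Coltrane')),
    new StubFile('So What', '', array('artist' => 'Miles Davis')),
);
$content = buildSearchableContent($files, array('artist'));
echo $content, "\n";
```

In a real block controller, getSearchableContent() would return the result of a helper like this, fed with the same file objects the view uses.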


2) Make sure the page Area that this block lives in is included in the list of areas that get indexed for searches. Go to Dashboard -> Sitemap -> Page Search, and click the teeny tiny "Setup Index" link (see attached screenshot -- it's impossible to find if you don't already know where to look). If you don't know what the things on this page mean, then just make sure none of the boxes are checked and that the "Blacklist" option is selected in the dropdown menu (then click the "Save" button in the lower-right if you had to make any changes). BTW, if you don't see the Area name that you want in the list... well then I'm not sure what to tell you (I've never not seen an area I wanted -- hopefully it's always like that).

3) Run the search index job. No, not that one -- it won't do anything! You want the super-secret search index job that actually indexes blocks other than "Content". To run this, go to the usual place (Dashboard -> System & Maintenance). But instead of just clicking the "Run Checked" button, you want to copy the URL at the bottom, paste it into your browser's address bar, then add "&force=1" to the end of it. For example, if the URL was this:
http://example.com/tools/required/jobs?auth=c5e9439560b97c3eff17893eabf72706

you would paste this into your address bar:
http://example.com/tools/required/jobs?auth=c5e9439560b97c3eff17893eabf72706&force=1

Then hit enter to go to that url. It might take a little bit depending on how big your site is and how much content there is, but eventually it will stop loading and leave you on a blank white page. Just click the back button to get back to the dashboard.
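
If you would rather build that URL in a script than edit it by hand, the tweak is small enough to sketch. `addForceFlag()` is a made-up helper name, not part of concrete5, and the URL is the same placeholder used in step 3.

```php
<?php
// Appends force=1 to a jobs URL, using "?" or "&" depending on whether
// the URL already has a query string.
function addForceFlag($jobUrl) {
    $separator = (strpos($jobUrl, '?') === false) ? '?' : '&';
    return $jobUrl . $separator . 'force=1';
}

$url = addForceFlag('http://example.com/tools/required/jobs?auth=c5e9439560b97c3eff17893eabf72706');
echo $url, "\n";
```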


4) Success! Well, kind of... there are a few limitations on what the search block actually searches for. The biggest one is that it completely fails on spaces (if someone wanted to search for "daily specials" for example, nothing would come up). Fortunately this can be fixed fairly easily:
http://www.concrete5.org/index.php?cID=112065...
(Note that this fix will be included in the core system for whatever version is after 5.4.1.1, so if you're reading this in the future, you might not need to do this step).

I have also gotten tripped up because I kept trying to search on file names (which were in the file's "title" attribute, so I know they were getting added to the index), but they never came up. I eventually realized it's because there were dashes in the name (like "client-file-10.pdf"), which the search block also fails on. It basically ignores everything that isn't alphanumeric (or a space or tab or newline character if you add the fix above). So if this is going to be a problem, you'll need to tweak the regex in the above fix to include whatever characters are important to you. But I don't know if there are bad consequences to this (probably nothing more than returning too many results for certain search terms).

Ideally, the search block would strip out the characters that it's ignoring from both the content it's searching AND the query terms that were entered by the user. Maybe that can be fixed some day as well, but that day also is not today :)
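
To illustrate what treating both sides the same way could look like, here is a minimal standalone sketch. It is not the search block's actual code: the pattern (keep only letters, digits, and whitespace) is an assumption based on the behavior described above, and `normalizeForSearch()` and `contentMatchesQuery()` are hypothetical names.

```php
<?php
// Assumption: only letters, digits, and whitespace are significant to the search.
// The real pattern lives in the search block and may differ.
function normalizeForSearch($text) {
    // Replace every run of ignored characters with a space, then collapse whitespace.
    $text = preg_replace('/[^A-Za-z0-9\s]+/', ' ', $text);
    $text = preg_replace('/\s+/', ' ', $text);
    return strtolower(trim($text));
}

// Applies the same normalization to the indexed content and to the user's query.
function contentMatchesQuery($content, $query) {
    $needle = normalizeForSearch($query);
    if ($needle === '') {
        return false; // nothing searchable was typed
    }
    return strpos(normalizeForSearch($content), $needle) !== false;
}

$matches = contentMatchesQuery('Report: client-file-10.pdf', 'client-file-10');
var_dump($matches);
```

This sketch swaps the ignored characters for spaces instead of deleting them outright, so "client-file" becomes "client file" rather than "clientfile". Either choice works as long as content and query go through the same function.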

Okay, I think that does it. I hope this works for you.

-Jordan
dvodvo replied:
(You know, I never went to school for programming, I've just spent 14 years learning by example and posting to forums when I was stuck. And in all that time I've never gotten a more thoughtful and extensive response. Thank you Jordan!)

This highlights a lot of what I was stuck on and luckily points out much that I'd never have realized was going to be a problem.

I've followed much of what Jordan's suggested; however, I needed to go a bit further for silly reasons. The public page my client wants for listing the music files can't have all of the files on it; the list would be too long. So, what I've done instead of block-ifying the file listing is to make a barebones Single Page that lists the files. The idea is that the search engine will index that master list, and on my search results page I add some jQuery to append the search argument onto links to the Single Page -- and finally, on that Single Page, more jQuery finds the specific file that matches the search query (if there is a search query) and redirects back to the public page with the appropriate file ID passed as an argument.

It sounds almost ridiculous, but without doing some heavy hacking I can't think of anything better.

However, this all rests on one question -- does content embedded in Single Pages get indexed by the search engine??
RadiantWeb replied:
Not sure.

I do know that Tony's Image/File search block is amazing and handles all this.

It is single-handedly one of the most underrated blocks on this site, in my opinion.

Chad
jtfjtf replied:
Apologies for resurrecting an old post - but this *almost* does what I need it to (quite likely due to a lack of thorough understanding on my part)!

When implemented as described (to the best of my knowledge), the search result returns invalid entries, due to the concatenation of file attributes, I suspect. I have used my doclibrary (custom made) block on several pages, each time with a different file set passed in. So my question is - how do you pull back a distinct list, where the search term matches a given document's attributes?

Maybe my implementation of the getPermittedFileObjects function is a little awry?

FYI - I am running c5 v5.6.

Apologies if I'm missing something fundamental.

Many thanks in advance; JTF
dvodvo replied:
Alright, job done. It's not pretty, I warn you, but time is money.

I've ended up just creating a new search block template in which I insert some SQL queries before the code that spits out the results array. The SQL commands pull out approved files from the FileVersions, FileAttributeValues and atDefault tables that have my desired attributes and match the search query.

Probably the quickest and easiest (and slowest and least efficient/elegant) way of making Concrete5 search for files, assuming you're familiar with the C5 DB structure.

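<div id="filesearchResults">
   <?php
   $db = Loader::db();
   // Find attribute values (akID 28 and 30) that match the search query.
   // LIKE without % wildcards matches the whole value; pass '%' . $query . '%' for substring matches.
   $sql = "SELECT FileAttributeValues.fID,FileAttributeValues.fvID,atDefault.value FROM atDefault,FileAttributeValues WHERE atDefault.value LIKE ? AND atDefault.avID = FileAttributeValues.avID AND (FileAttributeValues.akID = 28 OR FileAttributeValues.akID = 30)";
   $vars = array($query);
   $items = $db->GetAll($sql, $vars);
   foreach ($items as $item) {
      // Only keep file versions that are approved.
      $sql = "SELECT * FROM FileVersions WHERE fvIsApproved = 1 AND fvID = ? AND fID = ?";
      if ($det = $db->GetRow($sql, array($item['fvID'], $item['fID']))) {
         // Output whatever you need here; fvTitle is assumed to be the title column.
         echo htmlspecialchars($det['fvTitle']) . '<br>';
      }
   }
   ?>
   <i>Files</i>
</div><div style='clear:both;'></div>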


(where akID=28,30 are my file attributes I'm searching on)
jordanlev replied:
Hey, if it works it works.

*BUT* you have a gaping security hole in your code -- you're wide open to a SQL injection attack because you are not properly escaping your input. htmlentities and htmlspecialchars are only for things being outputted to the browser -- databases need different things escaped (for example, a <script> tag isn't going to hurt it at all, but quotation marks and semicolons could mess it up real good).

ADODB (the database library that C5 uses) has a built-in feature called "parameterized queries" (or sometimes "prepared statements"), which makes it really easy to protect yourself from such attacks. For example, in your code above, you would do this:
$db = Loader::db();
$sql = "SELECT FileAttributeValues.fID,FileAttributeValues.fvID,atDefault.value FROM atDefault,FileAttributeValues WHERE atDefault.value LIKE ? AND atDefault.avID = FileAttributeValues.avID AND (FileAttributeValues.akID = 28 OR FileAttributeValues.akID = 30)";
$vars = array($query);
$items = $db->GetAll($sql, $vars);


So you're just putting question marks into your SQL string (without quotes -- it handles that for you), then you have an array with one element for each question mark (they're inserted into the SQL in the same order that the question marks are there). Your example only has one value that needs to be "parameterized" (because the other values are not coming from an "untrusted" outside source), but if you did have more than one parameter, it would look something like this:
$db = Loader::db();
$sql = "SELECT * FROM table WHERE field1 = ? AND field2 = ?";
$vars = array($val1, $val2);
$items = $db->GetAll($sql, $vars);
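
If you want to see the difference for yourself outside of C5, here is a small standalone sketch. Note the swap: it uses PHP's PDO with an in-memory SQLite database instead of ADODB and MySQL (so it assumes the pdo_sqlite extension is available), and the `tracks` table and its data are invented for the demo. The placeholder mechanism works the same way.

```php
<?php
// Stand-in database: in-memory SQLite via PDO (an assumption for this sketch;
// concrete5 itself uses ADODB on MySQL).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE tracks (id INTEGER PRIMARY KEY, artist TEXT)');
$pdo->exec("INSERT INTO tracks (artist) VALUES ('Miles Davis'), ('John Coltrane')");

// Hostile input that breaks out of a quoted string when concatenated into SQL.
$query = "x' OR '1'='1";

// Parameterized: the value is bound separately and is never parsed as SQL.
$stmt = $pdo->prepare('SELECT artist FROM tracks WHERE artist LIKE ?');
$stmt->execute(array($query));
$safeRows = $stmt->fetchAll(PDO::FETCH_COLUMN);

// Concatenated (do not do this): the quotes in $query rewrite the WHERE clause.
$unsafeRows = $pdo->query("SELECT artist FROM tracks WHERE artist LIKE '" . $query . "'")
                  ->fetchAll(PDO::FETCH_COLUMN);

// A normal search still works with a placeholder; wildcards go in the value, not the SQL.
$stmt->execute(array('%Davis%'));
$found = $stmt->fetchAll(PDO::FETCH_COLUMN);

echo count($safeRows), ' safe, ', count($unsafeRows), ' unsafe, ', count($found), " found\n";
```

The parameterized query treats the hostile string as a plain value and matches nothing, while the concatenated version returns every row in the table.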


This is no joke by the way, just yesterday there was a massive SQL injection attack against tons of different sites (funnily/sadly enough, even mysql.com was vulnerable!)
http://news.google.com/news/more?pz=1&cf=all&ned=us&ncl...
dvodvo replied:
Ah awesome, thanks again. I've updated my code above with this fix in case any sloppy coders like myself decide to use it.