Handling data for a large number of pages.

Hi all,

I've looked through the forums and couldn't find any help - but maybe I missed it.

I need to iterate over attribute data for 2000+ pages, taking what I need and outputting to a JSON data cache file.

I've got it all working great in terms of the caching - the problem comes from too much memory usage in the script as it creates the cache.

I've tried raising the memory limit (even at 100M it still hits fatal errors).

I've tried several ways of unsetting / destroying the page object, as I know that nested objects can linger in memory and I think this is where my overhead is.

At this point my options look like:
1: find the memory leak and figure out how to plug it (after about 8 hrs I'm running out of patience with that)
2: figure out the select statements for a custom db query, construct all the data myself and go from there.

Any advice would be helpful. This is my last stumbling block to taking the site to full scale and it's driving me up the wall. Thanks! Love C5!

Here is the related section of code:

$cachedData = array();

$pl = new PageList();
$pl->filterByPath($pathToPages);
$total = $pl->getTotal();
$itemsToGet = 20;

for ($i = 0; $i < $total; $i += $itemsToGet) {
   $pages = $pl->get($itemsToGet, $i);
   foreach ($pages as $page) {
      // the method below returns a json encoded assoc array
      $json = $this->createPageDataSubset($page, true);
      $cachedData[] = $json;
      $json = null;
      unset($json);
      // attempt to free the Page object and everything nested inside it
      // (requires the custom __destruct() added to my Page override)
      $page->__destruct();
      $page = null;
      unset($page);
   }
}

snowbound
 
Mainio replied (Best Answer):
I'd suggest going with the custom database querying; generally in PHP, object creation is not the most memory- or performance-efficient way to go. Handling arrays of data, without wrapping it all in object hierarchies, will probably save you a lot of trouble.

By the way, I was struggling with the same issue a few weeks ago with big lists of objects, and it turned out that the array method with custom querying was way more efficient. I was also using active record objects in that specific case before, and then I just went to using straight $db->query() instead.
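
For what it's worth, a minimal sketch of that array-based approach in concrete5 5.x might look like the following (assuming the ADODB-style Loader::db() helper; $parentID and $cacheFile are placeholders for your own values):

$db = Loader::db();

// fetch only the columns we need, as plain associative arrays -
// no Page objects are ever instantiated
$rows = $db->GetAll(
   "SELECT cID, cParentID FROM Pages WHERE cParentID = ?",
   array($parentID)
);

$cachedData = array();
foreach ($rows as $row) {
   // $row is just an array, so it is cheap to keep and cheap to discard
   $cachedData[] = $row;
}

file_put_contents($cacheFile, json_encode($cachedData));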
snowbound replied:
I sort of figured as much. MySQL is not my strength, but I guess it means I have something to learn.

Have you (or anyone else reading) come across any good examples of a SELECT query that gets the page content and joins in attributes? Or is it a "get the cID first and then go about gathering the attributes" sort of affair?

Thanks for the advice.
Mainio replied:
Yes, cID is the key with which you do table joining.

The attributes case might be a bit hard if you're not familiar with the c5 schema, but I'd suggest making those attributes searchable, which gives them their own columns in the CollectionSearchIndexAttributes table and might make things easier.

It is also doable with some heavy joining if those attributes are not search indexed, but it might be a bit harder.
snowbound replied:
Thanks Mainio.

I'll see if I can find any examples in the forums, but likely time to dive in and see what kinda ruckus I can cause. Thanks again.

If anyone reading this has any query snippets to pass along they are appreciated!

Cheers,

Gerr
snowbound replied:
Hi all,

In case anyone finds this and is needing something along the same lines I thought I'd post the query I used which is working great.

Thanks to Mainio for the advice I had a bad case of developer tunnel vision - this solution took a fraction of the time I wasted trying to find memory leaks in PHP and PageList - ahh well - it happens.

The items in [ ] should be replaced to fit your own attributes / needs.

$query = "SELECT Pages.cID, csia.ak_[ak_handle] as [alias] FROM Pages
LEFT JOIN  CollectionSearchIndexAttributes as csia
ON Pages.cID = csia.cID   WHERE Pages.cParentID='$parentID'";
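
A rough usage sketch of running it, assuming concrete5's Loader::db() / GetAll() helpers; the attribute column and alias are still the placeholders above, and $parentID is passed as a bound parameter rather than interpolated into the string:

$db = Loader::db();
$rows = $db->GetAll(
   "SELECT Pages.cID, csia.ak_[ak_handle] AS [alias]
    FROM Pages
    LEFT JOIN CollectionSearchIndexAttributes AS csia ON Pages.cID = csia.cID
    WHERE Pages.cParentID = ?",
   array($parentID)
);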


Performance wise:

1. With 500-ish pages, running through them all with PageList to perform my process took approximately 24 seconds.

2. Doing the same with a query similar to the above, selecting 13 fields, took less than 1 second.

Hands down, when dealing with large numbers of pages, direct querying is best. Make your attributes search indexed and they are easy to grab.

I'd still love to see a proper join that selects from the Pages table and joins in attribute keys - is this doable?

Thanks again!

Gerr
Mainio replied:
[quote]
I'd still love to see a proper join that selects from the Pages table and joins in attribute keys - is this doable?
[/quote]

Could you tell me more specifically what you are looking for? I don't understand what you mean by "...and joins in attribute keys", because doesn't your query already do that?
snowbound replied:
Hi Mainio -

Yes that was unclear. And this is not something I need - just curious to see how it would be done.

The CollectionSearchIndexAttributes table only gives access to those attributes that you've rendered searchable.

I was curious what a query would look like that would give access to any attribute for a Page just because I could see that being valuable as well in the future.

Thanks again,

Gerr
Mainio replied:
Ok, yes, everything is doable but as said, your method is the best out there for simple use.

To do this you need to know the attribute key you're looking for and also its attribute type's value table in the database, but it's doable with something like this:
SELECT P.cID, atB.value FROM Pages P
   INNER JOIN CollectionAttributeValues CAV ON CAV.cID = P.cID 
   INNER JOIN AttributeKeys AK ON CAV.akID = AK.akID
   INNER JOIN AttributeKeyCategories AKC ON AK.akCategoryID = AKC.akCategoryID
   INNER JOIN atBoolean atB ON atB.avID = CAV.avID
   WHERE AKC.akCategoryHandle = 'collection' AND AK.akHandle = 'exclude_nav'


That gets the value of the 'exclude_nav' attribute for collections/pages.
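
For reference, a minimal sketch of running that join from PHP, assuming the Loader::db() / GetAll() helpers, with the attribute handle passed in as a bound parameter:

$db = Loader::db();
$rows = $db->GetAll(
   "SELECT P.cID, atB.value FROM Pages P
      INNER JOIN CollectionAttributeValues CAV ON CAV.cID = P.cID
      INNER JOIN AttributeKeys AK ON CAV.akID = AK.akID
      INNER JOIN AttributeKeyCategories AKC ON AK.akCategoryID = AKC.akCategoryID
      INNER JOIN atBoolean atB ON atB.avID = CAV.avID
      WHERE AKC.akCategoryHandle = 'collection' AND AK.akHandle = ?",
   array('exclude_nav')
);

foreach ($rows as $row) {
   // $row['cID'] is the page ID, $row['value'] the boolean attribute value
}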


Antti
Mainio replied:
And just for the record, I'd like to add that you could probably also do automatic detection of the attribute types with some heavy IF clauses in the query, but I don't believe those belong in SQL queries; instead you should do such checks straight in PHP.

There you could also look up the type of the attribute key you're after and decide the attribute type value table according to that lookup.
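
One way that PHP-side lookup could go is sketched below; it assumes AttributeKeys carries an atID pointing at an AttributeTypes table with an atHandle column, $akHandle is the key handle you are after, and the handle-to-table map is illustrative rather than exhaustive, so verify it against your own schema:

$db = Loader::db();

// look up the attribute type handle for a given attribute key handle
$atHandle = $db->GetOne(
   "SELECT at.atHandle FROM AttributeKeys ak
      INNER JOIN AttributeTypes at ON ak.atID = at.atID
      WHERE ak.akHandle = ?",
   array($akHandle)
);

// map the type handle to its value table (extend as needed)
$valueTables = array(
   'boolean'  => 'atBoolean',
   'number'   => 'atNumber',
   'text'     => 'atDefault',
   'textarea' => 'atDefault',
);
$valueTable = isset($valueTables[$atHandle]) ? $valueTables[$atHandle] : null;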
snowbound replied:
Awesome!

As I think I mentioned above - I'm no MySQL guru - so what you have here would have taken me a while to put together. Thanks very much as I think it will be useful in my future C5 career.

Thanks for the snippet. If you're ever in N.S. I owe you a beer. Cheers,

Gerr
Mainio replied:
I'll put that behind my ear if I ever visit North America. :)

Yeah, I used to work for a software firm that put quite a lot of the logic into SQL. I'm kinda surprised now seeing e.g. some Rails devs who don't know how to handle a raw database... But maybe I'm just "old school".

So that's where I learnt SQL almost as well as my native language. :D Just kidding; it's a great skill, e.g. in this kind of optimization issue, but of course I much prefer a well-written API that provides the "backend stuff" for me, so thanks to Andy and the folks at C5, we rarely need to do such things!
snowbound replied:
Well, I'm one of those who's been learning on the fly for 8-10 years, so if I've done it, I know it; otherwise I learn it, and I try never to pass up an opportunity to do so.

I appreciate the knowledge drop - it gives me a foot in the door to a new, more complex level of MySQL. It's one of those things where I know that if I knew it better, it would solve many problems for me.

Only been at C5 for a year or so now, but I have a few sites under my belt. And having worked with other CMSes, I can say C5 is my favorite, hands down. You are absolutely correct that credit should go where credit is due.

G
Mainio replied:
By the way, I just adjusted the query after looking at what I had written. :)

I noticed the subquery was useless (they usually eat up performance), so you might want to take a second look.
snowbound replied:
I see what you did.

This is excellent for targeting a single attribute. I can see how complex it would be to target different attributes of different types. Knowing which attribute type corresponds to each individual handle is crucial to getting at the proper value table.

In my case I needed a subset of attributes and had made them all searchable.

My next approach was going to be to compile a custom list of actual AttributeValue objects for each page after getting the cIDs, using API calls, and then manage unsetting the attribute objects as I went - trying to stay tidy.
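
Roughly, something like the sketch below is what I had in mind; it assumes the Loader::db() / GetCol() helpers, and $handles / $parentID are placeholders for my own attribute handles and parent page:

$db = Loader::db();
$cIDs = $db->GetCol("SELECT cID FROM Pages WHERE cParentID = ?", array($parentID));

$cachedData = array();
foreach ($cIDs as $cID) {
   $page = Page::getByID($cID);
   $item = array('cID' => $cID);
   foreach ($handles as $handle) {
      $item[$handle] = $page->getAttribute($handle);
   }
   $cachedData[] = $item;
   unset($page); // drop the Page object before moving to the next cID
}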

Your suggested method is obviously far superior, should a person have all the needed attributes searchable (or have no problem making them so).

The memory leak research I had done informed me that an Object with nested Objects does not get cleaned up very well upon unset(). Essentially the nested objects are not also unset, but remain clogging up memory. As you know a Page object contains a great deal of other objects within objects. I played around with overriding Page and adding a __destruct() method that would recursively unset the internal objects of the Page, but met with little success.

g
snowbound replied:
At any rate - I am mostly trying to brain dump these issues here because I couldn't find threads in the forums related to the issues I was experiencing. There seem to be some unique issues once a site gets over a certain threshold of pages - perhaps a section in the forum related to these issues would make them easier to track down - or maybe I'm just not looking the proper way.

At any rate - I'm procrastinating. - g
Mainio replied:
Maybe file a bug report on the memory leak issue, if you managed to dig into it that far?

The sad thing is that if the bug doesn't actually break anything crucial or produce, e.g., a fatal error, bugs of that type relating to edge cases like these are often left "hanging" in the bug reporting section, even if you file them again and again from version to version... :(