How I lowered my pagelist load time from 4 minutes to 40ms

I'm currently working on a site which required me to write a migration script to scrape an old Drupal 4 site (which I did in node.js, and could be the topic of another thread), and a concrete5 ENVIRONMENT_ONLY script to load all the pages I screen-scraped from the old site into c5. My site was up and running well before the migration, but after moving over 2,300 new pages (a number which wouldn't make your typical performance expert bat an eye), one page with a big pagelist (the class, not the block) on it took so long to render that it would time out. With all time limits removed, it could take upwards of 4 minutes to run.

This page would build a big grid, with preview images, of the full set of pages. I wanted users to be able to sort, filter, and do basic searches inside this grid, and have it all update dynamically right in the browser. I coded all this up in pure JavaScript (i.e. not relying on jQuery), and could do all this filtering so fast there was no visible delay.

While I was initially worried that listing so many pages all together would be slow on the client side, that part ended up working very well. I had a new problem now: the server was taking a very, very long time to return my pagelist. I turned on mysql query logging (http://dev.mysql.com/doc/refman/5.1/en/query-log.html... ) and then the problem became abundantly clear: the log of just the queries (not including any actual results) from loading the page once was over 6.4MB. Unsurprisingly, running that many queries took a long, long time and put a huge load on my server.

Normally I'd turn to caching first to solve a problem like this, but to cache you need to at least be able to render your results when your cache is empty, and that just couldn't happen. Instead, I looked into the query log and saw a massive amount of optimization that could be done. The PageList was essentially selecting a list of cIDs, turning each of those cIDs into a Page object, looking up each Page, then looking up each AttributeValue that I needed. When I was listing ~50 pages, this would run in far less time than all the basic image loading/page downloading, so to the end-user the perceived delay from this was negligible. At 2300+ pages, it took longer than anyone would wait, and strained the server + DB so much that it would certainly crash the whole site.
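To make the shape of the problem concrete, here's a rough sketch of the N+1 pattern described above (this is illustrative, not the actual PageList internals):

```php
<?php
// Sketch of the N+1 query pattern: one query for the cID list,
// then several more queries per page for each attribute.
$pl = new PageList();
$pages = $pl->get(); // query 1: select the cIDs, then a lookup per Page

foreach ($pages as $page) {
    // Each getAttribute() call runs its own query (or several) against
    // CollectionAttributeValues plus the attribute type's value table.
    $price    = $page->getAttribute('price');
    $category = $page->getAttribute('category');
    $status   = $page->getAttribute('status');
}
// 2,300+ pages x several attributes => well over 10,000 queries per render.
```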

My solution? Roll everything up, as much as possible, into one big query that returns all the attributes.

First, I switched my PageList for a DatabaseItemList. I looked into the PageList source, copied out the query it was building, and set that in my DatabaseItemList::addToQuery(). I then wrapped the code which fetched each attribute in Database::setDebug(true) / Database::setDebug(false) calls to capture its SQL, and rewrote those queries as LEFT JOINs in my SQL as best I could.

Getting the attribute key IDs turned out to be extremely tricky, so I eventually gave up on trying to do that in pure SQL, and just fetched them from the CollectionAttributeKey objects. The akIDs only require one lookup at the start of the query, so it shouldn't matter if I do a few extra lookups to get them.

After adding the attribute names to the front of my SELECT, I updated the code that used to call $page->getAttribute() to instead read the attribute values my query returned.

I put it all together, and the 4-minute page dropped to a clunky but manageable 5 seconds. I threw a few Cache::set()s and get()s around the query results (expiring every 5 minutes), and around the rendered results of each page in the list. The uncached page load time climbed to ~10 seconds, but cache it and voila - the whole thing loads in 40ms. Almost nothing.
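The caching wrapper itself is short; here is a rough sketch using concrete5's legacy Cache facade (the 'item_search' type matches the code below, but treat the exact Cache::set() expiry signature as an assumption):

```php
<?php
// Sketch of the caching layer described above. Assumes concrete5's legacy
// Cache class: Cache::get($type, $id) and Cache::set($type, $id, $value, $expire).
$results = Cache::get('item_search', 'all');
if (!$results) {
    $db = Loader::db();
    $results = $db->GetAll($dbQuery); // the big LEFT JOIN query from this post
    // Expire after 300 seconds so content edits show up within 5 minutes.
    Cache::set('item_search', 'all', $results, 300);
}
```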

My SQL skills aren't as sharp as they were 5 years ago, so please feel free to recommend any updates I can do to this. My big next-steps are:
- How can I generalize this some more? It would be awesome to take out all the LEFT JOINs I'm using to load each attribute and include them as a standard function in the Attributes themselves (similar to the load() and getValue() functions' SELECT statements in the AttributeTypeControllers, but one that adds to the list of JOINs)
- Am I missing some obvious, or at least more c5-standard, way to do something similar?
- Please review my code and make suggestions for any updates, so I can expose this work as a howto
$results = Cache::get('item_search','all');
if(!$results) {
  /* A stored procedure is the only pure-SQL way to do this, so instead we'll just grab the attribute keys right here */
  $akPrice = CollectionAttributeKey::getByHandle('price');
  $akCategory = CollectionAttributeKey::getByHandle('category');
  $akStatus = CollectionAttributeKey::getByHandle('status');
  $akEra = CollectionAttributeKey::getByHandle('era');
  $akImages = CollectionAttributeKey::getByHandle('images');
  $dbQuery = <<<DBQ
SELECT c.cID,
cv.cvName AS title,
PagePaths.cPath AS path,
MAX(atMultipleFiles.value) AS images,
MAX(atCurrency.amount) AS price,
MAX(atCurrency.code) AS currency,

 
JohntheFish replied:
That is some monster SQL. It will take me a while to get my head round it.

Remo also has some SQL for faster page list queries. I am not sure how it compares. I think it may be on his GitHub or his blog, or you may be able to google for it.
jkoudys replied:
Most of the start of that is straight out of Concrete5_Model_PageList::setBaseQuery() at the top, assuming it's not supposed to display page aliases. The main part I added was all the LEFT JOINs after the INNER JOIN on the Collections.

Reading up on this, I'm thinking the best approach would be to extend the PageList so it has some kind of prefetchAttribute() you can call, which adds those JOINs to the main query it runs so it returns those attributes. The Collection could be extended so that getAttribute() will check if that attribute is already set.

Haven't thought through all the particulars of implementation, but the use of it could be something like this:
$pl = new PageList();
$pl->prefetchAttribute('phone_number');
$pl->prefetchAttribute(CollectionAttributeKey::getByHandle('exclude_from_nav'));
$pages = $pl->get(); // Since prefetchAttribute() was called, the get() query includes a lookup of those attributes' values and saves them in each of the $pages[].
foreach ($pages as $page) {
  echo "Phone number: {$page->getAttribute('phone_number')}  ";
  /* The phone_number attribute is loaded directly out of the $page object, because the prefetch saved it */
  echo "Something else: {$page->getAttribute('foobar')}";
  /* foobar goes to the DB, since it wasn't prefetched */
}


The big challenge I see is that the SQL for getting most attribute values is pretty ad-hoc -- just a bunch of SELECT statements. Perhaps each attribute could define a method for getting its value SQL that could be included in a standard-formatted JOIN (e.g. in my SQL in the previous post, it would rely on that CollectionAttributeValues cav being defined and joined already). It could be interesting to try to use the adodb ActiveRecord support somehow to help define that, though I haven't investigated that at all.
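As a rough sketch of that idea (everything here is hypothetical -- getSearchJoinSQL() is not a real concrete5 method, and the table/alias names are made up): each attribute type controller could expose the JOIN fragment needed to fetch its value, keyed off the already-joined cav alias:

```php
<?php
// Hypothetical method on an attribute type controller. Assumes the main
// query has already joined CollectionAttributeValues with the alias "cav".
class NumberAttributeTypeController extends AttributeTypeController {
    public function getSearchJoinSQL($akID) {
        $alias = 'atNumber_' . intval($akID);
        // LEFT JOIN so pages without a value for this attribute still appear.
        return "LEFT JOIN atNumber {$alias} "
             . "ON {$alias}.avID = cav.avID AND cav.akID = " . intval($akID);
    }
}
```

A prefetchAttribute() on the PageList could then collect these fragments from each attribute's controller and append them to its base query, instead of hand-writing every JOIN.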
A3020 replied:
So your code is 6000 times faster? :) Nice!
JohntheFish replied:
Another similar solution has just been posted

http://www.concrete5.org/community/forums/customizing_c5/new-query-...