Cache rebuild issues on AWS deployment - latest C5 build

We are experiencing an issue with cache clearing and rebuilding in our live environment. We have not seen it happen before, and we are unable to replicate it in our local or staging environments.

Local is on our team's Macs and staging is on a Rackspace cloud server, but the live environment is an AWS cluster. That said, the hosts have scaled the AWS environment back in an attempt to debug the issue:

"We isolated one web node from our active load balancer, spun up another ELB instance just so we are replicating live 100%, and pointed the isolated node at a copy of the production DB on the same RDS instance as the production DB."

Here is what they are experiencing when they run a full cache clear from the C5 back-end:

----

1st time: Same result as last Thursday: white page, ELB serving up 503s. It resolved itself within 40-50 minutes, and the cache then seemed to start populating.

2nd time: Cache repopulated immediately after initial slow page load with extended time to first byte due to DB lookups. However, once the page had reloaded from cache, all of the URLs had been populated with the internal VPC IPs from our EC2 instance. The canonical URL was defined in config at the time as well (we checked).

3rd time: Cache repopulated immediately after initial slow page load with extended time to first byte. However, once the page had been reloaded from cache, there were numerous missing images across the site, most notably the homepage banner, featured course image and some of the social media images. Strangely enough, the missing images seemed to start populating after 25-30 minutes; timestamps in the /files/cache/ directory reflected this too.

Just to make you aware, we restored the files to the “original” copy from the production web node before each cache clear.

----
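
In case it helps anyone trying to spot the second symptom (internal VPC IPs ending up in cached URLs), here is a minimal Python sketch of how a saved copy of a rendered page could be scanned for private addresses. It is not something our hosts ran; the function name `find_private_ips` and the sample markup are made up for illustration, and it assumes the internal addresses fall in the standard RFC 1918 ranges.

```python
import ipaddress
import re

# Dotted-quad candidates; real validation is done by the ipaddress module.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# RFC 1918 private ranges (assumption: the VPC uses one of these).
PRIVATE_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def find_private_ips(html):
    """Return distinct private IPv4 addresses in html, in order of first appearance."""
    found = []
    for match in IPV4_CANDIDATE.finditer(html):
        candidate = match.group(0)
        try:
            address = ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # not a valid IPv4 address, e.g. an octet above 255
        is_private = any(address in network for network in PRIVATE_NETWORKS)
        if is_private and candidate not in found:
            found.append(candidate)
    return found


if __name__ == "__main__":
    # Hypothetical markup resembling a page rebuilt with internal hostnames.
    sample = (
        '<a href="http://10.0.1.25/courses">Courses</a>'
        '<img src="http://172.31.4.9/files/banner.jpg">'
        '<link rel="canonical" href="https://www.example.com/">'
    )
    print(find_private_ips(sample))
```

Running this against page source saved straight after a cache clear, and again once the cache has settled, would show whether the internal addresses only appear in freshly rebuilt pages.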

As a control, they did the following:

"...installed an out of the box copy of C5 on the same node after our tests and made some textual changes to the home page and attempted full page caching and manual page clearing and everything seemed to work fine."

So we are a little stumped. The problem appears to affect the client site only in the AWS environment, while a 'vanilla' install behaves fine in that same environment.

Anyone have any suggestions? Happy to provide more information if needed.