CPU and Load Average Conflict on EC2 server

amazon ec2cpuload-averagetopuptime

I am having touble understanding what server resource is causing lag in my Java game server. In the last patch of my game server, I updated my EC2 lamp server from apache2.2, php5.3, mysql5.5 to apache2.4, php7.0, mysql5.6. I also updated my game itself, to include many more instances of monsters that are looped though every game loop – among other things.

Here is output from right when my game server starts up:
enter image description here

Here is output from a few minutes later:
enter image description here

And here is output from the next morning:
enter image description here

As you can see in the images the cpu usage of my Java process levels off around 80% in the last screenshot, yet load avg goes to 1.20. I have even seen it go as high as 2.7 this morning. The cpu credits affect how much actual cpu juice my server has so it makes sense that the percentage goes up as my credits balance diminishes, but why at 80% does my server lag?

On my Amazon EC2 metrics I see cpu at 10% (which confuses me even more):
enter image description here

Right when I start up my server my mmorpg does not lag at all. Then as soon as my cpu credits are depleted it starts to lag. This makes me feel like it is cpu based, but when I see 10% and 80% I don't see why. Any help would be greatly appreciated. I am on a T2.micro instance, so it has 1 vCPU. If I go up to the next instance it nearly doubles in price, and stays at same vCPU of 1, but with more credits.

Long story short, I want to understand fully what I going on as the 80% number is throwing me. I don't just want to throw money at the problem.

Best Answer

You notice the large values of st? Those are "stolen" CPU cycles -- cycles you can't use, because you have completely almost -- or fully -- depleted your CPU credit balance.

The usage is 10% is averaged over some time window, probably 5 minutes. If you watch the output from top, you should see that 100% minus stolen minus idle is approximately 10% over time.

You essentially have no available CPU headroom at this point. A timing-critical workload would be expected to exhibit inconsistent responsiveness under these conditions.

Your workload is too large for a t2.micro. If this were not the case, you'd always have a surplus of CPU credits... essentially, by definition. Unless you can do something to reduce the workload or improve the efficiency of your code, the current symptoms indicate the need for a larger instance class.

Related Question