Sorry for the recent instabilities
The server has been restarted and things should be solid again for a while. We have a coredump from this afternoon to look at so we can debug what the hell happened. I'll be checking in on the server periodically, and I imagine Michael should be waking up soon too.
Things should be solid enough for Nation War and whatever other events this evening. If anything goes awry, we'll be around.
I'll write a more involved newspost about this later, but we're hard at work on making all of this more stable. Sorry for the hiccups in the meantime.
At the time of this writing, the server is lagging out. Something similar happened about half an hour ago, too.
Yes. It's very bad.
Speaking of borked... CAN'T LOG ON!
The issue exacerbates itself. Once the first lagspike (caused by a timeout between erlang and the server) has happened, everything is in a wonky state and the lagspikes happen more frequently. During an especially big one about 30 min ago, I restarted everything with a change that *might* slow the problem down (can't make any promises as I still don't know what's actually causing it). So things should hopefully be running fine for the next several hours, and I'll be watching it all closely for the next 12 hours, by which time the other guys will be awake again.
Yes, I've been having problems logging on since the update.
I believe I have fixed it; we'll see after the restart. I'm definitely the one who broke it.
I will be available for a public flogging tomorrow evening.
Time and place?
Just kidding.
/me pats Andy on the head and gives him a cookie.
"Keep yer chin up there 'lil camper!"
[ ] a1k0n
[ ] a1k0n
[x] a1k0n
[ ] a1k0n
[x] a1k0n
Och, it's complicated... I'm sure you will get to the bottom of it...
Ecka
Teddy Grahams for everyone!
So, hey, do we know what the problem was yet?
Well, here's the technical explanation.
Lua 5.0.2 has a bug where if a function closure object is the only reference to a coroutine state and the garbage collector runs, that coroutine could erroneously get collected, leading to all sorts of crashes when you resume it.
As a workaround, I added a global table which stores references to all active coroutines -- we only create and resume coroutines in a single place, which is a wrapper that also adds message passing mailboxes and all that good stuff. Thing is, I assumed that after the coroutine initially ran, it was alive and I only cleaned up dead ones after subsequent resume() calls. But we have a bunch of cases where the coroutine doesn't ever yield and can be immediately collected after its first run.
Those were being held onto indefinitely. Most of them were requests from Kourier via the Erlang interface, so there were a lot of dead Erlang terms in memory. Once it reached about a gig of garbage, it took a really, really long time to run the GC, causing all sorts of cascading timeouts and lots of swapping and general badness.
It's simply amazing (and embarrassing) we didn't catch this on the test server.
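For anyone who wants to see the shape of the leak described above, here is a rough model in Python, with generators standing in for Lua coroutines. It is only a sketch of the idea, not the server's actual code: `CoroutineRegistry`, `spawn`, `resume`, `one_shot_request`, `long_running`, and the `prune_after_first_run` flag are all hypothetical names made up for the illustration.

```python
class CoroutineRegistry:
    """Hypothetical stand-in for the global table of active coroutines."""

    def __init__(self, prune_after_first_run):
        self.live = {}  # key -> generator; the strong reference that keeps it alive
        self.prune_after_first_run = prune_after_first_run
        self._next_key = 0

    def spawn(self, fn, *args):
        """Create a coroutine, register it, and run it up to its first yield."""
        key = self._next_key
        self._next_key += 1
        self.live[key] = fn(*args)
        self._step(key, first_run=True)
        return key

    def resume(self, key):
        """Resume a previously spawned coroutine."""
        self._step(key, first_run=False)

    def _step(self, key, first_run):
        co = self.live[key]
        try:
            next(co)
            finished = False
        except StopIteration:
            finished = True
        # Leaky variant: dead coroutines are only dropped after a later
        # resume, so one that finishes on its first run is never dropped.
        if finished and (not first_run or self.prune_after_first_run):
            del self.live[key]


def one_shot_request(payload):
    """Assumed request handler that finishes without ever yielding."""
    _ = payload.upper()
    return
    yield  # unreachable; only here to make this function a generator


def long_running(steps):
    """Assumed worker that yields a few times before finishing."""
    for _ in range(steps):
        yield


def simulate(prune_after_first_run, requests=1000):
    """Spawn many one-shot requests plus one worker; return (registry, worker key)."""
    reg = CoroutineRegistry(prune_after_first_run)
    for i in range(requests):
        reg.spawn(one_shot_request, f"request {i}")
    worker = reg.spawn(long_running, 2)
    return reg, worker
```

With `simulate(False)` the registry still holds every one of the 1000 finished one-shot requests, plus the worker. With `simulate(True)` only the still-running worker remains. In the real server each leaked entry also pinned its Erlang terms, which is how the garbage piled up.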
Perhaps it didn't have time to collect as much garbage on the test server? Possibly due to things being restarted all the time while you're working on stuff.
Nah, it didn't happen on the test server because a1k0n was collecting all of the garbage with his famous garbage truck.