Sorry for the recent instabilities

Jun 17, 2007 incarnate link
The server has been restarted and things should be solid again for a while. We have a coredump from this afternoon to look at so we can debug what the hell happened. I'll be checking in on the server periodically, and I imagine Michael should be waking up soon too.

Things should be solid enough for Nation War and whatever other events this evening. If anything goes awry, we'll be around.

I'll write a more involved newspost about this later, but we're hard at work on making all of this more stable. Sorry for the hiccups in the meantime.
Jun 17, 2007 jexkerome link
At the time of this writing, the server is lagging out. Something similar happened about half an hour ago, too.
Jun 17, 2007 FatStrat85 link
Yes. It's very bad.
Jun 17, 2007 Impavid link
Speaking of borked... CAN'T LOG ON!
Jun 17, 2007 momerath42 link
The issue exacerbates itself. Once the first lagspike (caused by a timeout between Erlang and the server) has happened, everything is in a wonky state and the lagspikes happen more frequently. During an especially big one about 30 min ago, I restarted everything with a change that *might* slow the problem down (can't make any promises, as I still don't know what's actually causing it). So things should hopefully be running fine for the next several hours, and I'll be watching it all closely for the next 12 hours, by which time the other guys will be awake again.
Jun 17, 2007 mr bean link
Yes, I've been having problems logging on since the update.
Jun 17, 2007 a1k0n link
I believe I have fixed it; we'll see after the restart. I'm definitely the one who broke it.

I will be available for a public flogging tomorrow evening.
Jun 17, 2007 SilentWave link
Time and place?
Just kidding.
Jun 18, 2007 LeberMac link
/me pats Andy on the head and gives him a cookie.

"Keep yer chin up there 'lil camper!"
Jun 18, 2007 Aleksey link
[ ] a1k0n
[ ] a1k0n
[x] a1k0n
Jun 18, 2007 davejohn link
Och, it's complicated... I'm sure you will get to the bottom of it...

Ecka
Jun 18, 2007 fzuazo link
Teddy Grahams for everyone!

Jun 18, 2007 roguelazer link
So, hey, do we know what the problem was yet?
Jun 18, 2007 a1k0n link
Well, here's the technical explanation.

Lua 5.0.2 has a bug where if a function closure object is the only reference to a coroutine state and the garbage collector runs, that coroutine could erroneously get collected, leading to all sorts of crashes when you resume it.

As a workaround, I added a global table which stores references to all active coroutines -- we only create and resume coroutines in a single place, which is a wrapper that also adds message-passing mailboxes and all that good stuff. Thing is, I assumed that after the coroutine initially ran, it was still alive, so I only cleaned up dead ones after subsequent resume() calls. But we have a bunch of cases where the coroutine doesn't ever yield and can be immediately collected after its first run.

Those were being held onto indefinitely. Most of them were requests from Kourier via the Erlang interface, so there were a lot of dead Erlang terms in memory. Once it reached about a gig of garbage, it took a really, really long time to run the GC, causing all sorts of cascading timeouts and lots of swapping and general badness.
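
For anyone who wants to see the shape of the mistake, here is a rough sketch. It is a Python analogy only, with generators standing in for Lua coroutines, and not the actual server code. Every name in it (active, spawn, resume, the two request handlers, the cleanup_after_first_run flag) is made up for illustration.

```python
# Python analogy only: generators stand in for Lua coroutines.
# All names here are hypothetical, not taken from the real server.

active = {}      # global registry: id -> generator (holds a strong reference)
_next_id = 0


def _step(co_id, value=None):
    """Advance a coroutine once; drop it from the registry if it finished."""
    co = active[co_id]
    try:
        return co.send(value)
    except StopIteration:
        del active[co_id]      # coroutine is dead, release the reference
        return None


def spawn(fn, *args, cleanup_after_first_run=True):
    """Create a coroutine, register it, and run it up to its first yield."""
    global _next_id
    _next_id += 1
    co_id = _next_id
    co = fn(*args)
    active[co_id] = co
    if cleanup_after_first_run:
        _step(co_id)           # fixed: the dead check also covers the first run
    else:
        try:
            next(co)           # buggy: assumes the coroutine is still alive
        except StopIteration:
            pass               # finished, yet still referenced from `active`
    return co_id


def resume(co_id, value=None):
    """Resume a registered coroutine; dead ones are cleaned up here."""
    return _step(co_id, value)


def one_shot_request(payload):
    """A handler that never yields, so it is dead after its first run."""
    _ = len(payload)
    return
    yield  # unreachable; only here to make this function a generator


def waiting_request():
    """A handler that yields once to wait for a reply."""
    reply = yield "waiting"
    return reply
```

With cleanup_after_first_run=False, every one_shot_request stays in active forever, which is the leak described above. With the default, the registry only holds coroutines that are actually waiting on something.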

It's simply amazing (and embarrassing) we didn't catch this on the test server.
Jun 18, 2007 look... no hands link
Perhaps it didn't have time to collect as much garbage on the test server? Possibly due to things being restarted all the time while you're working on stuff.
Jun 18, 2007 roguelazer link
Nah, it didn't happen on the test server because a1k0n was collecting all of the garbage with his famous garbage truck.