Fiber cut and website outage.

Aug 22, 2024 incarnate
At 8:57am Central on August 21st, a large "boring" machine drilled through a fiber vault upstream of one of our core datacenters, wrapping a large amount of fiber around the drill and pulling/damaging it for over a quarter of a mile (400+ meters).

This was a big event, impacting at least two major long-haul single-mode fiber bundles (a 24-pair and a 96-pair cable), and collapsing a fiber ring due to a lack of route diversity.

To give some idea, a single "fiber pair" (send/receive) of single-mode can be lit with DWDM at 19.2 terabits/s, so the 120 damaged pairs could in theory have carried upwards of 2,300 terabits/s, although I imagine in practice there was a broad mixture of different circuit types.
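
For the curious, the back-of-the-envelope math behind that figure, as a rough theoretical ceiling (assuming every damaged pair were fully lit, which they almost certainly were not):

```python
# Rough upper bound on the capacity at risk; circuit types varied in practice.
pairs = 24 + 96          # damaged fiber pairs across the two cable bundles
per_pair_tbps = 19.2     # Tb/s for a fully-lit DWDM single-mode pair
print(pairs * per_pair_tbps, "Tb/s")  # -> 2304.0 Tb/s theoretical ceiling
```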

(The main point being, this is far larger than "us").

This simultaneously impacted AT&T, Lumen and Spectrum (not Spectrum home-service, but their 10gig+ carrier service).

This outage did not actually impact the Vendetta Online game server, but it did impact our webserver and website. (Ignore the Players Online graph: People are playing, but our analytics are still down).

As a result of the impact on the website, some users on Google Login (Android) had problems connecting to the game. This has been fixed by bringing the website back online. The failure was a design oversight that will be addressed sometime in the coming weeks; it should not have happened, as the game is intended to be fault-tolerant to these kinds of outages.

Additionally, users making new accounts would have had problems logging into the game as well (also fixed by bringing the website back up). This was a known limitation, and one that will be improved eventually. But our main goal is to keep the game fault-tolerant and continuously operational for existing players, even under extreme outages.
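
For the technically inclined, the eventual fix amounts to not letting login depend on a single web endpoint. A minimal sketch of that general pattern (the endpoint names are invented for illustration, not our actual infrastructure):

```python
import urllib.error
import urllib.request

# Hypothetical endpoints, for illustration only.
AUTH_ENDPOINTS = [
    "https://www.example.com/auth/validate",   # primary (the website)
    "https://alt.example.com/auth/validate",   # fallback on a separate network path
]

def validate_login_token(token: str, timeout: float = 5.0) -> bool:
    """Try each auth endpoint in turn, so one outage can't block logins."""
    for url in AUTH_ENDPOINTS:
        req = urllib.request.Request(url, data=token.encode(), method="POST")
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, TimeoutError):
            continue  # endpoint unreachable; try the next path
    return False  # every endpoint down: fail closed
```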

The fiber-cut outage is still ongoing, some 18 hours later. Last I heard, crews were having to cut through concrete to locate broken sections of fiber, and some of it runs under a large building they had no access to at night, so the repair may take some time.

Given the uncertainty over how long the fiber repair will take, I physically went to our datacenter, and we managed to migrate some things to different address space and a different provider (off of Lumen), allowing us to bring the website back online.

From here we have to slowly migrate a lot of other internal services off of impacted network paths and onto something functional. It's going to be a bit of a process, but hopefully the rest of it will not be very user-facing.

(For instance, the Players Online graph, the test server and the "PCC" / mission editor are all currently offline).

FYI.
Aug 22, 2024 Whistler
It is a testament to the excellent design of VO that such a potentially catastrophic event did not completely bring the game down!

I would be remiss if I did not invoke the infamous Garbage Truck Incident of 2003:

"Oct 22, 2003 a1k0n
So a garbage truck was tooling around in the industrial park where our internet service provider is located with the forks used to haul dumpsters all the way up. Voila, it drives right through the fiber connection to the building, which is for some strange reason strung up on a telephone pole above the road rather than buried. This was roughly at 7AM."
Aug 22, 2024 incarnate
So, Lumen is back up as of a little while ago: total downtime of 31 hours and 36 minutes.

To put it simply: fiber providers do not have sufficient physical path diversity, and frequently lie about this to customers. In this case, Lumen was riding Spectrum glass, and AT&T was in the same trench with them. So, a single event took out all three major backbone networks and connections in the area.

We do what we can to design in as much service resilience as possible; we're in something like 30 datacenters, globally. But there are design tradeoffs, like relative security versus distribution: it can be more secure to run on your own physical hardware in a facility you know, versus a cloud that may be compromised at a level you cannot see. Our services are designed to operate in "rings" of relative trust, where we trust our own "metal" the most, which tries to balance these tradeoffs.

But, basically, we need more physical hardware locations to back up our existing cloud redundancies.
Aug 26, 2024 Lord~spidey
There's room on my sketchy chinese xeon rig, gonna have to move some porn around though.

Jokes aside you should host a fallback near halifax t'is a prime spot smack between the eu and us-west.
Aug 26, 2024 incarnate
Jokes aside you should host a fallback near halifax t'is a prime spot smack between the eu and us-west.

Kind of, but there's only limited connectivity. There's EXA North & South and EXA Express.

There's a lot more path diversity coming out of the US, and it's generally cheaper as well.

Anyway, we have no lack of potential locations; the issue is the time and effort required to re-architect, migrate, test, and so on.
Aug 27, 2024 Lord~spidey
EXA Express is like ten whole milliseconds faster if we're talking about Ireland/United Kingdom though.

And Canadian dollaridoos are cheaper than american dollaridoos!

stoopid canadian dollaridoos urgh!
Aug 31, 2024 7heMech
Definitely interesting.
I wonder, in what terms is trust measured when you say your own hardware has more trust? Would that be like database access, and how is the DB itself managed? Hope you don't mind me asking, I'm just curious.
Aug 31, 2024 incarnate
Basically, if you host your data and software on someone else's hardware, you never have absolute certainty of who has access to it.

The "cloud" is made up of infrastructure hardware called "hypervisors", which run the "virtual machines" that customers actually have access to. But the client who buys the virtual machine, be it on Amazon or Google or DigitalOcean or whatever, has no idea what kind of IT infrastructure is used to manage the hypervisor hardware itself.

Case in point: the SolarWinds exploit, which hit the US intelligence community pretty hard. The SolarWinds software is (or was) a package used for IT management and analytics, meaning it ran little client processes on all kinds of machines and let IT infrastructure managers more "easily" maintain large fleets of hardware.

However, because that software was also insecure, it went from being "management software for IT people" to "management software for unfriendly hackers".

Similarly, several CPU issues have been exposed over the last decade, like Meltdown, which have some potential to let someone in another VM on the same hypervisor read your memory. That potential is small, technically challenging to exploit, and largely mitigated by various patches, but only for the problems that are known; major state actors tend to exploit problems that are not publicly known. In any event, these examples illustrate how shared-hardware infrastructure ("the cloud"), while very convenient in many ways, comes with inherent security tradeoffs.

If one owns and operates one's own hardware, and has it set up carefully, that mitigates a layer of that risk, to some extent. Someone could still theoretically break into a facility and physically access your hardware, but that's a much different sort of attack. Many attacks that are convenient against virtual infrastructure become much more challenging when physical access is required.

As to your core question: yes, this can mean any kind of data or service you operate from the hardware. Of course, if you have something like a database that must remain remotely accessible, you have to ask whether there's any real "point" in hosting it in a more hardened manner, since in that case the most likely attack is the front door anyway.

But, depending on the individual situation, it can be feasible to separate access control and storage into different layers: limiting the scope of access, having the secure site only "ingest" data (so nothing can read back from it), or other methods of reducing the attack surfaces available to an adversary.
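
As a toy illustration of the "ingest-only" idea (hypothetical, not our actual design): the hardened layer exposes an append operation and nothing else, so even a fully compromised outer service can push records in but has no way to read the archive back out.

```python
# Hypothetical write-only ingest layer: append is the entire interface.
class IngestOnlyStore:
    def __init__(self, path: str):
        self._path = path

    def append(self, record: bytes) -> None:
        # Append-only; deliberately no read()/list() methods are exposed.
        with open(self._path, "ab") as f:
            f.write(record + b"\n")
```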

We operate in logical "concentric rings", which have limited access and data flow. Inner rings can access outer rings, but the opposite is avoided.
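
A toy model of that policy, with illustrative ring names (not our actual topology): lower numbers are more trusted, and requests may only flow from an inner ring outward.

```python
# Illustrative trust rings: lower number = more trusted.
RINGS = {
    "owned-metal": 0,      # our own physical hardware, most trusted
    "private-hosting": 1,
    "public-cloud": 2,     # shared hypervisors, least trusted
}

def access_allowed(src: str, dst: str) -> bool:
    """Inner (more trusted) rings may reach outward; never the reverse."""
    return RINGS[src] <= RINGS[dst]

assert access_allowed("owned-metal", "public-cloud")      # inner -> outer: ok
assert not access_allowed("public-cloud", "owned-metal")  # outer -> inner: denied
```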

I hope that helps.