I always wanted to start my own web host, and the earliest I remember having that desire was in 1998, when I was still in middle school. I had written my own web server software and already had a solid grasp of the protocols of the time. Being young and still in school, I put the idea on the back burner for nearly a decade.
Fast forward to 2007, when I was looking for reliable hosting for one of my own projects and having significant trouble finding a provider I felt I could trust. This was even before EIG, or Endurance International Group (now called Newfold Digital), had swallowed up all of the common providers. Many of the larger providers were still independent, providing decent but not amazing services.
My decision at the time came down to Site5 and HostGator, both of which are now owned by Newfold Digital. To be straightforward, I honestly don’t remember what tipped me towards HostGator over Site5, but that is where I ended up. Having wanted to start my own provider for nearly a decade at that point, I went for a reseller plan with HostGator.
The service overall was okay, but nothing I’d consider spectacular. The long and short of it was that the server I was placed on was so hopelessly overloaded that important system-wide processes like Apache or named would crash. Over the first couple of weeks I saw several outages of 1 to 4 hours as services on the server crashed and support took time to fix the issues. There was no way I could reasonably and truthfully recommend my own services to anyone else based upon a HostGator reseller account.
Moving to a Managed Dedicated Server with HostGator.
After only a couple of weeks I moved from a reseller plan at HostGator to a fully managed dedicated server. Even though I had only acquired a couple of customers, and the revenue generated was nowhere near what a fully managed dedicated server was going to cost, I made that leap. I felt confident that so long as the network and hardware were reliable, I could build a solid service on top of them. For a short time, about 4 months, I was right.
At the end of May 2008, the data center where my server from HostGator was located, The Planet, experienced an explosion, fire, and several days of downtime. By this point I had managed to acquire a few hundred clients, and this was quite the nightmare experience. Ultimately I had no control over whether the server was going to be okay or when, if at all, it was going to come back online. All I could do was tell my clients that I would keep them as informed as I possibly could.
If my memory serves me right, as this was more than 14 years ago at the time of writing, MDDHosting was offline for about 72 hours, although it may have been as long as 96. Unfortunately I don’t have uptime statistics from that time, so all I have to go on is my memory. Fortunately, many of my clients were very understanding that this was outside of my control, and once everything eventually came back online I had only lost a few. I felt terrible that there was nothing I could do during the outage, and this was a lesson I would take to heart.
Moving to Unmanaged Dedicated Servers at SoftLayer, now IBM.
I moved from the fully managed dedicated server at HostGator to an unmanaged server at SoftLayer within the month following the explosion at The Planet data center. Over the several months I had the dedicated server from HostGator, I used every opportunity I could to learn more about server management and security. Any time I asked for help with anything at all, I made sure to ask how the support technician solved the issue as well as why the issue happened in the first place. My goal has always been to avoid issues for my clients whenever possible, even when I knew how to fix them after the fact.
One of the largest motivating factors in moving to SoftLayer was, honestly, pricing. I learned during the major outage with HostGator, which admittedly was not their fault, that I needed backups, and that I needed to provide them at no cost to my customers. I was confident at this point in my ability to manage a server, and the move from an expensive fully managed server to a cheaper self-managed server gave me the revenue I needed to make this happen. I brought online a secondary server with SoftLayer in another location to function as a backup server.
MDDHosting continued to grow over the next couple of years, from one server to several and from only a couple hundred dollars per month in expenses to several thousand. All this time I was providing 100% of the support for the company as well. Thankfully, my goal of providing solid and reliable services helped significantly here, as the number of support issues was low and mostly unrelated to the server or network itself.
The move to fully owned colocated hardware.
By owning our own hardware and paying only for colocation we stood to save nearly $10,000 per month, although at the time we did not have the capital on hand to buy all of the hardware we needed up front.
In late 2010 I worked out a deal with Handy Networks, the data center facility we still use to this day, to obtain enough hardware to move to fully owned and colocated equipment. I still thank Jay Sudowski, Co-Founder and CEO of Handy Networks, for going out on a limb for me and my company. Jay and Handy Networks obtained more than $100,000 worth of hardware for us on our promise alone that we would repay the debt.
Our monthly cost for about the first year with Handy Networks was about the same as what we had been paying SoftLayer, except that we were working towards owning our own hardware. As we paid off pieces of hardware, our monthly expenses dropped, and this opened up our margins for hiring staff and making other improvements.
Hiring the first staff member beyond myself.
Although the services we offered were solid and reliable, and most issues raised by customers were not server or network related, the number of inquiries increased as the company grew. The first staff member hired beyond myself was one of our clients, the one who asked the most questions about everything, particularly issues with which I was not acquainted. His name is Scott Swezey, and he asked about Ruby support more than anything. Before hiring him on directly, I remember offering to pay him to write documentation for the company covering anything he had questions about that I could not answer directly and quickly.
For a long time Scott and I were able to handle and manage the business and our clients with little to no issue. As the goal had always been to provide the most solid and reliable services that we could, we kept our servers at about 50% to 80% of what they were fully capable of handling.
What made us so special?
In contrast, our competitors were often loading servers well beyond 100%, meaning that every user and every request had to fight for server resources. On our platform there were always resources available for processes and users as they needed them. While we couldn’t say that the resources were “dedicated,” as there were certainly times when a server would be out of available resources for a short period, we did our absolute best to make sure this happened as infrequently as possible.
Even today we keep our servers loaded below what they are capable of to ensure performance and reliability. With the advent of account isolation technologies such as the CloudLinux operating system, we were able to give each individual user their own “bubble” of resources, which helped ensure no individual user could monopolize the server.
Over the next decade we hired a few more staff members as we continued to grow, to make sure we could always provide fast and helpful support for our clients. While our support load is, and always has been, fairly low, we always wanted to make sure that we could both respond to and resolve issues quickly and efficiently. While many of our competitors were taking hours or even days to respond, we were, and still are, often resolving issues in minutes – usually less than 15.
I have never personally understood why anyone would tolerate an unreliable service or slow and/or incompetent support.
Changing storage technology.
The technology powering our servers has changed at least a couple of times over the years: from in-server RAID-based storage, to a physical storage area network, or SAN, and then to software-defined distributed storage by StorPool.
With RAID storage the only backups we had were the ones we took off-server and off-site. I don’t remember ever having to restore a full server while we were running RAID, but we did have to restore individual files, folders, databases, and even whole accounts for our users on a semi-regular basis at their request.
When we moved to a Nimble Storage CS500 storage area network, we gained what are called snapshots. Snapshots were automated and on by default, which meant that if we experienced a major hardware failure or data loss, we could very quickly and easily, often within minutes, recover from the incident. I don’t believe we ever had to do this, but we knew it was a possibility and considered it our first level of disaster recovery, with our off-site backups as the second level.
Eventually our needs outgrew the storage area network, and at the time an all-flash array, a SAN built on solid state drives, was still prohibitively expensive. I remember dreaming about the day when we would be powered 100% by solid state drives and how great I believed that would be. We looked at a few software-defined storage solutions and ended up choosing StorPool. We saw massive improvements in speed and reductions in latency by moving to StorPool’s software-defined storage platform.
Our first major outage since The Planet explosion in 2008.
In 2018 we experienced our first major outage since the explosion at The Planet in 2008. Although we had done our best over the preceding 10 years to ensure we had a solid and reliable disaster recovery plan, it turns out that you often don’t know what you don’t know. Our experience with storage area networks was that snapshots are on by default and that you have to go out of your way to disable them. Due to either a miscommunication or a misunderstanding between us and StorPool, we did not have snapshots enabled, although we believed we did.
At the time StorPool was still relatively new, and TRIM support under Linux was not 100% functional. This wasn’t a huge deal, as it just meant that we needed to perform manual trim operations periodically. In September of 2018 a systems administrator performing this regular and mundane task, a file system trim, made a major error, resulting in the loss of all customer data across all servers nearly instantaneously.
Instead of issuing an “fstrim” command, the administrator issued a “blkdiscard” command. The former trims the file system, telling the underlying storage which blocks are free so they can be reclaimed, whereas the latter discards every block on the device, essentially erasing all of its data. At the time, “blkdiscard” did not ask, “Are you sure you want to do this?” before running. Although the administrator canceled the command almost immediately, our storage platform was so fast that major damage had already been done to every machine on the network. This wasn’t an intentional or malicious act but simple human error.
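The difference is easy to guard against with a small wrapper. This is a hypothetical sketch, not our actual tooling: it refuses to run blkdiscard unless the operator passes an explicit confirmation flag, supplying the “Are you sure?” prompt the command itself lacked at the time. The flag name and device are illustrative.

```shell
#!/bin/sh
# Hypothetical guard wrapper around blkdiscard (illustrative only).
# blkdiscard irreversibly discards every block on a device, so we require
# an explicit confirmation flag before passing anything through to it.
safe_blkdiscard() {
    device="$1"
    confirm="$2"
    if [ "$confirm" != "--yes-erase-all-data" ]; then
        # Refuse by default: nothing on the device is touched.
        echo "refusing: blkdiscard would erase ALL data on $device" >&2
        echo "re-run with --yes-erase-all-data to proceed" >&2
        return 1
    fi
    # Only reached with explicit confirmation.
    blkdiscard "$device"
}

# Without the flag, the wrapper refuses and the device is untouched:
safe_blkdiscard /dev/sdb || echo "blocked"
```

A wrapper like this, or a shell alias shadowing the raw command, is one cheap way to put a human confirmation step in front of an irreversible block-level operation.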
I remember where I was and what I was doing when I received the phone call letting me know what had just happened. I was heading east away from our office, turning into the first roundabout on the way to my house, when my phone rang. I got the call before the server-down notifications, as it would take a few minutes for the servers to begin failing.
The administrator was panicked and “freaking out” on the phone, and I did my best to calm them down. I had learned over the years that panicking doesn’t solve problems and only makes things more difficult. It’s not only harder to think but also harder to type when you are panicking, which helps nobody. Once I managed to get them calmed down a bit, I said, “Just bring everything back up from the snapshots.” While we had never tested this, which was a major oversight, in theory it should have brought everything back online within a few minutes.
It turned out that at the time snapshots were not automated by StorPool: we had to write the automation to create, maintain, and discard snapshots ourselves. Essentially, this meant that we did not have any snapshots and, as a result, no quick way to recover from the incident. To be clear, I do not blame StorPool at all for this – it was an oversight on our part.
Without snapshots the recovery would be longer than a few minutes.
We had never chosen to rely entirely on snapshots for disaster recovery. Not only could our clients not restore anything on their own from snapshots, but we always felt more comfortable having another copy of the data in a physically separate data center facility. We built the backup hardware and network with the goal of being able to perform a network-wide restoration, start to finish, in under 12 hours.
It turns out that while our backup system was initially capable of such a restoration in the estimated timeframe, over time our backups had gotten slower and slower. This was due to some poor configuration choices in the ZFS file system we used for backups, as well as an unexpected hardware bottleneck. The updated estimate for a full restoration was approximately 28 days.
Had we simply accepted that it was going to take that long to restore all data network-wide, this incident could very well have been the end of MDDHosting as a company. The systems administrator who originally made the mistake and I worked diligently to optimize the restoration process as much as possible, and we were eventually able to get the entire process done in about 96 hours, or 4 days.
The restoration process.
Originally the plan was simply to conduct restorations directly from our backup servers to our production servers, as we would have for any other non-major restoration. As this was going to take 28 days or longer, we had to adapt. The facility where our backups are located offered to move the drives into a new piece of hardware that did not have the same bottleneck, but as this was our last remaining copy of customer data, and we could not be 100% sure the new server would recognize the storage array, we were not willing to take that chance.
What we ended up doing was attaching 4 individual solid state drives to the backup server and copying over backup data one server at a time. After copying a server’s data onto one of the drives, we would move that drive into another piece of hardware and conduct restorations from there. While this was still significantly slower than our original 12-hour estimate, it was substantially faster than direct restorations. We actually began referring to the original backup servers as “slow backup” and the faster restoration hardware as “fast backup.”
Communicating with our clients during the outage.
We kept our company forums updated with details of the restoration process and any new information we had to share. This was simply the best way for us to get information to our clients at the time. We also sent some mass emails, directing everyone to our forums for further updates. Because our solid and reliable services had always allowed us to operate with a very small team, we did not have the staffing to answer every direct question individually to our normal standards during the incident.
Every staff member handled approximately 3,000 support tickets per day for 4 days, and then an average of 500 support tickets per day for a couple of weeks thereafter. Most of what was said in direct support tickets was information we were already dispersing on our forums in a one-to-many way. That said, we all still did our best to answer every question we received individually. The biggest question was, “What is the ETA for my account being restored?” and, due to the nature of the restorations, we were unfortunately unable to give anyone a specific ETA. We were eventually able to start providing ETAs for whole-server restorations, and while that helped, it wasn’t what our clients really wanted to know, which was extremely frustrating for us and for them.
What we learned.
As I said earlier in this post, you often don’t know what you don’t know, and this incident opened our eyes to quite a bit of that. We learned that having a disaster recovery plan is only part of the process: testing the plan is just as important, if not more so, than having it at all. Had we tested our disaster recovery plan before having an actual disaster, we could have identified that we didn’t have the snapshots we thought we did and avoided the drawn-out major outage entirely. Testing would also have shown that our backups were going to take 28 days instead of 12 hours, and we could have solved that issue before it became a problem.
Many of our customers wanted to know what happened, why it happened, and how we would prevent it. Much of this information was provided on the forums and in a direct email to every client.
Here is a short list of some of the changes we’ve made although this is not an all-encompassing list:
- We monitor our snapshots to ensure both that they are being created and that our latest snapshot is no older than 2 hours. If our newest snapshot is older than 2 hours, this is treated as a disaster event and all staff are notified.
- We monitor the performance of our backup servers to ensure not only that they are able to take timely backups but that should we ever need to perform major restorations they will perform to our expectations or better.
- We test our disaster recovery plan on a regular basis to identify any issues we may not be aware of so that we can proactively resolve them.
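The snapshot-age check described above can be sketched in a few lines of shell. This is an illustrative sketch, not our production monitoring: the newest snapshot’s creation time is passed in as a Unix epoch, and the actual StorPool query that would supply it is omitted.

```shell
#!/bin/sh
# Illustrative snapshot-age monitor. How you obtain the newest snapshot's
# creation epoch depends on your storage platform; here it is an argument.
MAX_AGE=$((2 * 60 * 60))   # two hours, per the policy above

check_snapshot_age() {
    newest_epoch="$1"                       # epoch seconds of newest snapshot
    age=$(( $(date +%s) - newest_epoch ))
    if [ "$age" -gt "$MAX_AGE" ]; then
        # In production this branch would page all staff as a disaster event.
        echo "DISASTER: newest snapshot is ${age}s old (limit ${MAX_AGE}s)"
        return 1
    fi
    echo "OK: newest snapshot is ${age}s old"
}

# A snapshot taken three hours ago trips the alert:
check_snapshot_age "$(( $(date +%s) - 3 * 60 * 60 ))"
```

The key design point is that the check alerts on the *absence* of fresh snapshots, not just on failed creation jobs, which is exactly the gap that bit us in 2018.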
It can happen to any provider, always keep your own backups.
A piece of advice I share whenever I can: “Always take your own backups, no matter who your provider is or what they promise – even us.” The biggest issue for us was how quickly we could transfer the data from the remote backup servers to our production servers. Even though the client data was gone, we were able to bring the production servers themselves back online within only a few minutes. Any client who had their own backup, we were able to restore and get back online within minutes of that backup being made available to us.
I know that, even with our goal since inception of providing the best and most reliable services we possibly could, we made some pretty big mistakes in this incident. I can only imagine that providers who do not care as much as we do would make similar or worse mistakes, so it’s always a good idea to keep your own backups.
We are thankful for our customers.
While September of 2018 was a trying time for us and our customers, it did not kill us, and it did make us stronger. We are very thankful for the customers who stuck with us through the outage and who still trust us. We have always been as straightforward, honest, and transparent as we can possibly be, and I believe that is a big reason we were able to survive the outage in 2018 with minimal losses. At the end of the day, if not for our customers we wouldn’t exist, and we are thankful every day for each and every individual and business that chooses to trust us with their web hosting needs.
To the future!
Every day we strive to improve and to grow. Our basic principles have remained the same – to provide solid and reliable hosting with excellent and competent support that responds within minutes. Our network-wide uptime since the outage in 2018 has averaged 99.95%. Over the last year our uptime has averaged 99.997%, which works out to roughly 16 minutes of downtime per physical server over the whole year.
Uptime is not the only important factor though – we strive to respond to support tickets in under 5 minutes and to resolve issues whenever possible in less than 15 minutes.
We continue to work every day to improve our services and, as of this writing, are offering services on the latest AMD EPYC high-frequency processors with high-speed RAM and NVMe storage. We have labeled the new services “Plaid,” a Spaceballs reference. These are truly exciting times, with hardware performance higher than I could have imagined a decade ago.
Here’s to another 10 years!