Thread by @fstokesman on Thread Reader App

Seeing the HBO email mishap got me thinking about the fuck up I once made that I still think about to this day.

A 🧵 that I hope will illustrate some of the factors that can go into these kinds of events.

I'm keeping everything, including the companies, anonymised. I know some people on here know me personally, and may have even worked with me on these projects, and I'd ask you not to name any names either.

I was still pretty junior, working as a contractor for one of the world's biggest technology companies. I was part of a small prototyping and piloting division that worked outside of (or at least around) the corporate bureaucracy. The whole sell of this team was that they could get

an end-to-end idea up and running extremely fast and cheap. We were able to do this in part by leveraging our existing codebase, which had been built up over quite a few years. We could take our general architecture and throw it at different problems, and 9/10 times we'd be able

to hack something together that would meet the goal (and we'd have a little bit more code to leverage in the future).

Keep in mind, this is still mostly internal to the company - so our team and others would basically be competing to win the budget money of other teams

throughout the organisation. This atmosphere was both awesome and awful.

Awesome because we were really streamlined and really did get stuff done in almost no time.

Awful because we did this by cutting every possible corner, including all forms of testing, and by always being on call to hotfix pilots in production, day or night.

The team consisted of a pair of iOS developers, 2-3 web frontend devs, 1 backender, a product owner, and a visionary team leader who had quite a significant position in the organisation. It also included an additional frontender and backender, both of whom worked remotely, but for

various reasons were treated very much as a separate entity, and not part of the core.

I had joined the team as a frontender with pretty limited experience (I actually studied game dev at uni), but part of my growth plan was to move into backend development. And in this team,

backend basically meant "everything that wasn't frontend or mobile".

I'd been part of the team for a while, and had found my feet in the frontend. We were working on a new pilot - which was technologically more complex in all dimensions than anything we'd done before. I was onboarding a new dev, at the same time as setting up a whole new

architecture for our web app.

Our backender was working hard to make sure all of the various databases, server code, and scripts were able to fit this new pilot, and it was a big job - made especially hard because of the lack of tests. At the same time, he was very slowly

teaching me bits and pieces - sharing insights about the codebase and practices. I had been told that a proper onboarding would take place at some point, and that he would gradually move off this project and I would replace him.

Great! Except that one day, my boss from the contracting company dropped the bomb that it wasn't going to happen in a few months, it was going to happen *today*.

And it wasn't going to be gradual, it was going to be "he's not working there at all anymore, but will be available for questions you might have".

Gulp. We were in the middle of a project, and I had my hands completely full with the FE. All of a sudden, I was also in charge of the backend. As it turned out, the old dev had been burning out due to stress, and had all but threatened to quit if not removed asap.

I can completely understand that, but I felt a bit like I'd been bait-and-switched.

And very quickly, I began to see why he was so stressed. The server infrastructure was not what you would call modern.

Making changes often required going through mistrustful, faceless IT admins via email in other parts of the company. We had servers that were for testing, staging and production, but also servers that were a mix of all of these.

That meant it was very difficult to understand the impact of operations, and made every mutable DB query into an anxiety nightmare.

The codebase was huge, and a mess of legacy and hacks, stitched together with undocumented interfaces. We had databases, caching layers, and files on the disk that were incredibly interdependent, and which together formed some kind of monolithic state machine.

It wasn't clear if the code you were working on was important or vestigial.

Every new pilot would add some project specific scripts, which would help us out in the development phase - but would also often do critical state operations during the running of a pilot.

They all had innocuous names, and were copied/pasted/modified from older scripts. And we could be running 3 or 4 pilots concurrently sometimes, so there was a huge amount of context switching involved.

At this point I want to point out that this wasn't any developer's fault - especially not my predecessor's. The pace and pressure were just so high in this environment that this was the only way to achieve the results that were demanded. He had begun, against the wishes of the PO

and team leader, to write tests that would document the system - and had done some significant refactoring to improve important areas of the code. It was, however, a bit of a drop in the ocean so to speak.

Aside from those factors, I soon found out that a culture had been fostered around the backend role: this person would be responsible for anything that the other developers felt was outside their purview. This meant every other developer would come with a bespoke request

every day - which could quickly add up to eat an entire day. Many of these requests were also expressed as "absolutely top priority" or "completely blocking". Every day. They could be anything from a bespoke API that made no sense in a wider context, and

would just add to the legacy, to server operations that, as I previously mentioned, could be absolutely terrifying (SSH and yoloing like it's 1999).

So we were more than halfway into the development of the latest pilot, with the old backender gone (and, as it turned out, not very available for questions), and I still had all of my FE responsibilities to boot. My sleep was affected, and I would lie awake at night

dreading the next day. I felt like I was treading water. But somehow, I *was* making progress. The pilot looked like it was coming together. And I had a 3 week holiday fast approaching, which was helping to keep me sane.

It was the day before we were going to production (also the day before my holiday) - still hacking to the very last minute - and we were in the morning standup. During these standups I would be frantically hacking away, trying desperately to keep up with the seemingly never-ending

demands I was faced with - when one of the mobile guys told me he needed me to run the pilot reset script so they could do some last minute testing. And of course, it was a critically blocking request, which I should drop everything for.

So during this standup, I opened a terminal, SSH'd into the server, ran the script, and told him then and there it was done.

In that moment, I actually felt like I had shit under control. Like I was truly stepping into my predecessor's shoes! But that feeling vanished very quickly. Just after the standup, Slack messages started dropping in the team chat.

"I'm not getting any response from the <blah> server. None of my devices are connecting"

"I can't see it on <XYZ> frontend either."

"Francis you need to look at this asap."

They were referring to stuff outside of the project we were crunching to finish, and I was just annoyed to have to context switch yet again. But then a DM came in.

"There's no data in the <blah> database"

The <blah> database was a kind of testing/production hybrid database, full of very domain-specific data we'd been collecting for years. It was considered an enormous asset to the team, and there were big plans for it related to ML and AI applications.

Now, I only peripherally understood this at the time. I had thought it was just a test server (since we were using it that way). But I also knew it was me who had fucked up. The pilot reset script - which was written by the old backender - was apparently

meant to be run ONLY on the actual pilot server. I opened it up, and after digging through the various abstract calls it made into other parts of the codebase, realised that it did in fact, amongst other things, perform a drop of all data in the DB.

No warning, no confirmation, no output in the terminal. Just *poof* gone.
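Even a tiny guard rail would have caught this. A minimal sketch of what one could look like (the host name and `drop_all_pilot_data` are hypothetical illustrations, not the real script):

```python
import socket

# Hypothetical allow-list: hosts where a full pilot reset is permitted.
ALLOWED_HOSTS = {"pilot-staging-01"}

def confirm_reset(hostname: str, answer: str) -> bool:
    """Permit the destructive reset only when the host is allow-listed
    AND the operator has typed back the exact confirmation phrase."""
    if hostname not in ALLOWED_HOSTS:
        return False
    return answer.strip() == f"reset {hostname}"

# Usage sketch: the drop itself only ever runs behind the check.
# host = socket.gethostname()
# if confirm_reset(host, input(f"Type 'reset {host}' to wipe pilot data: ")):
#     drop_all_pilot_data()  # hypothetical destructive call
# else:
#     print("Refusing to reset on this host.")
```

Two lines of checking, and running the script on the wrong server becomes a refusal instead of a silent wipe.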

The heart palpitations started immediately, and I all but sprinted into a meeting room by myself. I couldn't breathe. I sent out some Slack messages saying that I was looking into it and would update as soon as I knew something.

The PO and other devs were virtually breathing down my neck the entire time.

Now, it shouldn't surprise you at this point to find out that we didn't have any backups - but it certainly surprised me!

Later I found out, according to the PO, backups were expensive, both to store and to maintain - and that he was sure that data could be recovered "somehow" by the faceless IT department. I'm not so sure about that.

So I thought this was it. My company would lose this client because of my mistake, and I'd likely be fired as a result. Filled with panic, I did the only thing I could think to do, which was to call my predecessor. He was on vacation at the time, which I felt even worse about,

but he did pick up. I explained everything, completely panicked, and he listened calmly, nodding along without any condescension at all.

In some kind of deus ex machina moment, he told me he thought that he actually had a backup he'd made a few weeks before he left the team, when he needed to clone the DB somewhere else. And within a few minutes he'd found it, sent it to me,

and I was able to get it back up and running. The whole thing was down for perhaps 2 hours total.

I should even make clear here that this backup would have been considered unauthorised, and probably in breach of NDA. But there was no ill intent here, no cloak and dagger, just a pure saving grace.

I thanked him the best I could in the state I was in, and went back to Slack to update everyone. I framed the event in the vaguest terms I possibly could, stating that we'd lost a few weeks of data, but nothing too serious. It would have been better to be completely transparent,

but at the time I just couldn't. I was still scared about losing my position, as well as my standing in the team if I did somehow manage to stay employed. There was a bit of grumbling about the data loss, but in the end it didn't really matter.

The next day we launched, and I stepped onto the plane, and while I felt relief, it took several days for me to return to a state of non-anxiety. And on the return flight everything came back up again as I imagined what conversations might be waiting for me.

Thankfully, there weren't any. They were all just happy the gears of the machine were still turning.

I learned a lot from that incident. The biggest lesson of course is that you need to be thoughtful and take your time when it comes to operations - especially when the infrastructure you have doesn't have any guard rails.

You make mistakes when you rush, and rushing is the natural reaction when pressure is applied. You need to unlearn that, and I certainly did! I also learned the importance of documentation, proper procedures for on/off-boarding, and planning ahead for change.

And in hindsight, and with more experience with other teams, I learned that the way we had things set up there was a ticking time bomb. There was so much firing from the hip there - so much absolute and mandated yolo behaviour - that this was bound to happen.

From the top down, these bad practices were built into our working process as "acceptable" risk. They were encouraged, even demanded, and requests to add more safety or careful procedures were shut down because of costs.

I refuse to work like that anymore, and I make sure to highlight and escalate any perceived risks so that it's clear where responsibility will lie if my advice is ignored. I'm always trying to push for better documentation and onboarding procedures because I never want

anyone else to experience the sheer dread that I went through. I feel lucky that in a lot of the teams I've worked in since then, communication and openness are valued and encouraged, and that people can face difficulties together instead of feeling isolated and cornered.

I hope that was interesting - it's honestly taken me a long time to get to the point where I don't feel ashamed to share this. Every single person makes mistakes. I'm still making mistakes!

But you have to turn those mistakes into knowledge and intuition, learning from every one of them, and trying hard not to make them again (though even that can happen!).

