What's the worst failure you've ever witnessed as a data analytics engineer?



Having worked at an IT company for many years, I have come across all kinds of failures. A thread on Reddit titled "Worst Data Engineering Mistake you've seen?" asks exactly that question, and many engineers have shared their experiences there.

Worst Data Engineering Mistake youve seen? : dataengineering

https://reddit.com/r/dataengineering/comments/16vhp70/worst_data_engineering_mistake_youve_seen/



Inevitable-Quality15, who created the thread, also posted their own experience, quoted below.

'I started working at a company that had just implemented Databricks without understanding how it worked. Auto-termination had been turned off so that jobs could keep running over the weekend, and everything ran on a dedicated cluster using general-purpose compute (3x the price). Finance stopped using Databricks after 2 months lol.'

Databricks bills by the DBU (Databricks Unit), a per-hour unit whose rate depends on the processing power and type of compute used, multiplied by how long the cluster runs. In Inevitable-Quality15's case, a multi-node cluster was running on general-purpose compute, the most expensive tier, and the feature that automatically shuts a cluster down after it has been idle for a while was turned off, so the bill ballooned.
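For illustration only, here is a minimal sketch (not the poster's actual setup) of how this setting is controlled through the Databricks Clusters API: the autotermination_minutes field in the cluster spec. Setting it to 0 disables auto-termination, which is what left the idle cluster burning DBUs all weekend. The workspace URL, token, and cluster details below are placeholders.

```python
# Sketch: re-enabling auto-termination on an existing Databricks cluster.
# All identifiers below are placeholders, not values from the story.
import requests

DATABRICKS_HOST = "https://example.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "dapiXXXXXXXX"                                    # placeholder access token

cluster_spec = {
    "cluster_id": "1234-567890-abcde123",   # hypothetical cluster ID
    "cluster_name": "etl-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    # 0 disables auto-termination; a value like 30 shuts the cluster down
    # after 30 idle minutes and stops DBU charges while nothing is running.
    "autotermination_minutes": 30,
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/edit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
```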

In addition, bitsynthesis, who has worked at many companies, posted a whole list of failures they have witnessed.

・An engineer who went home on Friday after kicking off a huge batch job that downloaded a large number of files from a third-party host
On Monday there was a bill of about $100,000 (roughly 14.7 million yen) waiting, along with an email from AWS warning that the account would be closed because the job amounted to a DDoS against the third party. The file download, incidentally, was nowhere near complete.

・A young engineer promoted from the help desk who deleted the main production database a few months later
Because the production database was left completely unsecured against anyone on the internal network, it was an accident waiting to happen. Recovery reportedly took 12 hours.



・An incident in which the same S3 location was specified for both the input and the output of a serverless streaming pipeline
A streaming pipeline on AWS was supposed to automatically process documents as soon as they were saved to Amazon Simple Storage Service (Amazon S3), but the same location was mistakenly specified for both input and output. For about a year no one noticed the loop in which each output document was picked up, processed again, and written back to the same location, so the documents stored there were duplicated hundreds of millions of times. The problem only came to light when AWS complained that objects with hundreds of millions of versions were causing trouble for its backend systems.
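As a rough illustration, here is a minimal sketch (not the actual pipeline) assuming an S3-triggered AWS Lambda function: if the output prefix overlaps the prefix that triggers the function, every output becomes a new input and the loop never ends. A guard like the one below would have broken the cycle. The bucket and prefix names are invented.

```python
# Sketch of an S3-triggered Lambda handler with a guard against
# reprocessing its own output. Prefixes are hypothetical.
import urllib.parse

import boto3

s3 = boto3.client("s3")

INPUT_PREFIX = "incoming/"     # hypothetical prefix the trigger watches
OUTPUT_PREFIX = "processed/"   # must NOT overlap with INPUT_PREFIX

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Guard: skip objects this function wrote itself; otherwise each
        # output re-triggers the function and objects multiply forever.
        if key.startswith(OUTPUT_PREFIX):
            continue

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        result = body.upper()  # stand-in for the real document processing

        out_key = OUTPUT_PREFIX + key[len(INPUT_PREFIX):]
        s3.put_object(Bucket=bucket, Key=out_key, Body=result)
```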

・An engineer who enabled debug logging on a large production ETL pipeline
The log aggregation service billed more than $100,000 (approximately 14.7 million yen) for a single week.
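For illustration, here is a minimal sketch (an assumed setup, not the pipeline in question) of gating verbosity by environment with Python's standard logging module, so that DEBUG-level output never reaches a volume-billed log aggregation service in production.

```python
# Sketch: log level driven by an environment variable so debug chatter
# only appears outside production. APP_ENV is a hypothetical variable name.
import logging
import os

level = logging.DEBUG if os.getenv("APP_ENV") == "dev" else logging.INFO

logging.basicConfig(
    level=level,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

log = logging.getLogger("etl")
log.debug("per-record details")   # dropped in production
log.info("batch finished")        # always emitted
```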

・An incident in which a data ingestion system's JSON implementation did not comply with the specification
When the 'legacy' data ingestion system used by everyone first gained JSON support, the team behind it implemented its own custom JSON encoder that did not comply with the JSON specification, so the output could not be parsed by standard JSON libraries. Although the system was called 'legacy,' it was the only ingestion path available. The team in charge was asked to fix the issue but refused on the grounds that changes to a legacy system could not be justified, so every other team was asked to rework its JSON parsing instead.
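To see why this is so painful downstream, here is a minimal illustration with invented data (not the actual system's output): text that merely looks like JSON is rejected by any standard parser, so every consumer has to carry its own workaround.

```python
# Sketch: spec-compliant vs. non-compliant "JSON" and how a standard
# parser reacts. The payloads are made up for illustration.
import json

spec_compliant = '{"user_id": 42, "name": "alice"}'
non_compliant = "{'user_id': 42, 'name': 'alice'}"   # single quotes: invalid JSON

print(json.loads(spec_compliant))   # parses fine

try:
    json.loads(non_compliant)
except json.JSONDecodeError as err:
    # Every consumer now needs a custom parser or a pre-processing shim.
    print("standard parser rejects it:", err)
```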

Others in the original thread shared stories such as "I spun up a Redshift cluster just for an experiment, forgot about it, and lost $120,000 (approximately 17.6 million yen)" and "a visualization tool was running queries without partitioning and it cost more than 5 times as much," so check out the thread if you are interested.

Posted by log1d_ts