We are the Platform Database Services Disaster Recovery as a Service (DRaaS) SRE team. We administer end-to-end disaster recovery testing across Bloomberg's datacenters for the many services that support the applications behind Bloomberg's line of products. On any given day we're inventing, engineering, building, coding, troubleshooting, and maintaining a wide range of tools, monitors, frameworks, interfaces, protocols, solutions, and best practices around disaster recovery. Together, these components form a robust suite of automated, self-healing systems that manage the services the Platform Database Services SRE team provides to the rest of the firm.

What's in it for you:

You will be part of a team that helps Bloomberg meet company- and regulator-defined disaster testing standards. You will manage and develop solutions that support various disaster recovery tools, building these applications so that the services they provide integrate into the Bloomberg operational environment as well as Bloomberg products. This in-house tooling suite is required to test the clusters and managed services that reside in our datacenters and nodesites in an automated, scalable, and self-driven fashion, complete with the metrics and transparency tools that internal and external clients require. Tooling is expected to be written with unit and end-to-end testing and continuous integration to provide the highest level of stability.

We have product ownership and "the classic SRE responsibilities": system tuning, performance analysis, and defining and following availability targets and indicators such as SLAs, SLOs, and SLIs. We also have immediate access to the experts who design and code the Bloomberg-specific components, APIs, and methods that the disaster recovery infrastructure uses and relies on. You'll gain insight into, and access to, the lowest levels of how Bloomberg applications interact with each other and with their runtime environments, both for in-depth troubleshooting and for improving stability, reliability, performance, and feature set.

You'll need to have:

4+ years of experience in Python and/or TypeScript
A degree in Computer Science, Engineering or similar field of study or equivalent work experience
5+ years of experience with Unix, Unix tools, and shell scripting
Experience designing stable, long-lasting APIs
Deep understanding of TCP/IP networking and the OSI model
Experience designing and automating repeatable processes in a client/server modeled environment
Ability to build and maintain sophisticated, highly available, performant, and scalable mission-critical systems
Experience building monitors and alarms for system performance, status and stability
Experience with CI/CD systems and writing robust unit and system tests

We'd love to see:

Basic knowledge of the Rapid framework
Experience analyzing existing systems and identifying shortcomings with proven methods for improvement
Experience with Chaos Engineering
Experience with Splunk/Humio and Grafana or other metric-based reporting tools
Experience with GitHub and JIRA
Passion for product ownership


The #1 Source for In-Person NYC Tech Jobs

Senior Software Engineer/SRE - Automated Disaster Recovery
