Senior Site Reliability Engineer (SRE) (3467)

at SMX in Augusta, Maine, United States

Job Description

SMX is seeking a driven and talented Senior Site Reliability Engineer (SRE) to join our thriving Cloud Services business unit. Senior SREs provide high-level implementation support services and subject-matter expertise to SMX clients on IT consulting engagements. Using knowledge and experience in technical architecture and systems integration, our Senior SREs assist with Technology team deliverables, including building dashboards that monitor metrics on top-tier apps, continuously building and deploying automation scripts, and maintaining system configurations across multiple environments hosted on the AWS cloud tech stack. They also work closely with the delivery teams and SMX clients to drive adoption of modern reliability practices such as SLOs, error budget policies, actionable alerts, incident retrospectives, chaos testing, and end-to-end ownership, and to prioritize the timely completion and delivery of these tasks.

This individual will bring a passion for technology, a strong technical skill set, and the ability to deploy, employ, operate, and sustain production-ready solutions, software, and tools for our customers. Our SREs have working knowledge of continuous integration models, work directly with leads and program managers, and exhibit an overall willingness to contribute to the SMX team. This individual will also bring hands-on experience in infrastructure and operations automation, implementing cloud-native, automation-centric solutions that drive operational efficiencies with a strong focus on quality, communication, customer success, and results.

This is a remote role supporting a Herndon, VA based team.

Essential Duties and Responsibilities:

+ Implement application/infrastructure observability solutions and perform maintenance to ensure desired application availability

+ Perform real-time service management, including building monitoring for the golden-signal SLIs, establishing and negotiating SLOs with the business, building alerting, and creating playbooks and runbooks for services in conjunction with development teams, product owners, and support

+ Apply automation and software to any tasks or parts of the system that would benefit from it or are performed manually.

+ Handle Cloud Operations (Events, Incidents, and Requests) based on a defined, ticket-driven service catalog.

+ Provide guidance and leadership to the SRE team, including performing team technical reviews

+ Work with customer and SRE team to identify, develop, deploy, and maintain solutions

+ Be a primary “face to the customer” during the Manage phase of the customer lifecycle – communicating clearly and concisely to identify, triage, remediate, and resolve infrastructure and solution issues when customer needs are greatest.

+ Take direction from, and provide clear and timely updates to, Project Lead or Project Manager

+ Proactively identify potential operations and reliability issues and work to resolve, while also identifying system / performance issues and developing resolutions using automation

+ Identify opportunities for automation and implement them to drive operational efficiency and cost reduction

+ Implement and maintain backup and disaster recovery solutions for customers’ cloud computing resources

+ Optimize existing – and identify new opportunities for – monitoring, logging, and management metrics to improve operational effectiveness and customer knowledge

+ Participate in troubleshooting of infrastructure and/or application related issues

+ Produce well-written technical project documentation and operational runbooks

+ Participate in change management processes

+ Maintain core working hours but remain flexible to support after-hours maintenance and escalations (as necessary)

+ Participate as a team player capable of high performance and flexibility in a dynamic working environment

+ Improve CI/CD tools integration/operations and full automation of CI/testing

+ Identify and support Continuous Improvement opportunities to increase system reliability

+ Deploy and configure cloud services according to best practices (e.g., Virtual Machines, Virtual Network, AWS AD, CDN, serverless functions, DNS, Monitor, Key Vault, Blob storage)

+ Achieve and maintain AWS certifications

Required Skills:

+ 7+ years of experience in DevOps or SRE

+ Proven ability to dissect a technical architecture into engineering plans and discrete tasks

+ Excellent customer facing skills and the calm professional demeanor necessary to bolster customer confidence when stress is highest

+ Scripting experience, including Kusto Query Language, ARM templates, and PowerShell

+ Strong skillset with AWS Automation, DevOps Pipeline and related AWS tooling

+ Ability to collaborate with the internal dev team to support end-to-end testing

+ Solid command of standard CI/CD tools (Terraform, Ansible, Git, Jenkins, etc.)

+ Solid experience with container-based deployments using Docker, including working with Docker images, Docker Hub, and Docker registries, and with installing and configuring Kubernetes clusters

+ Proficiency and proven hands-on experience with AWS IaaS and PaaS Services, AWS Active Directory, and SQL Server Infrastructure.

+ Experience with AWS Monitoring, Migrate, Log Analytics, AWS SSM, Load Balancer techniques

+ Ability to write scripts in JavaScript, Bash, Python, or similar

+ Experience in monitoring, metrics collection, and reporting using open-source tools

+ Depth of knowledge in security best practices, tools, and compliance frameworks (NIST, FedRAMP, HIPAA, etc.)

+ Strong written and verbal communication skills

Desired Skills / Certs:

+ BS/BA in Computer Science, Computer Engineering or related field or equivalent technical experience

+ Current operations experience within a Cloud Managed Services Provider (MSP) delivery environment.

+ One or more of the following certifications is required:

  o AWS Certified Developer – Associate (DVA-C01)
  o AWS Certified SysOps Administrator – Associate (SOA-C02)
  o AWS Certified Solutions Architect – Associate (SAA-C03)
  o AWS Certified DevOps Engineer – Professional (DOP-C01)
  o AWS Certified Solutions Architect – Professional (SAP-C01)
  o DevOps Institute: Site Reliability Engineering Foundation (SREF)



At SMX®, we are a team of technical and domain experts dedicated to enabling your mission. From priority national security initiatives for the DoD to highly assured and compliant solutions for healthcare, we understand that digital transformation is key to your future success.

We share your vision for the future and strive to accelerate your impact on the world. We bring both cutting edge technology and an expansive view of what’s possible to every engagement. Our delivery model and unique approaches harness our deep technical and domain knowledge, providing forward-looking insights and practical solutions to power secure mission acceleration.

SMX is committed to hiring and retaining a diverse workforce. All qualified candidates will receive consideration for employment without regard to disability status, protected veteran status, race, color, age, religion, national origin, citizenship, marital status, sex, sexual orientation, gender identity or expression, pregnancy or genetic information. SMX is an Equal Opportunity/Affirmative Action employer including disability and veterans.

Selected applicant will be subject to a background investigation.

Job Posting: JC258290986

Posted On: Apr 13, 2024

Updated On: May 04, 2024