ABOUT THE ROLE
Our international team is responsible for the architecture, design and implementation of central, business-critical platforms for API handling, system integration and automation of rule-based business decisions. Advising our specialist departments and IT product teams is an essential part of this work. We advise on and support the implementation of new use cases and the optimization of existing solutions. In addition, we work with service providers for operation and support to guarantee the highest level of reliability.
We see you, as IT operations engineer, as a flexible all-rounder and dynamic team player who enjoys operating and optimizing our cloud-based platforms. We offer a variety of tasks in expanding and improving our solution, from design to implementation and the management of operations-related challenges. As part of a powerful team, you can contribute to parts of Hapag-Lloyd's critical core systems and make a difference.
KEY RESPONSIBILITIES & TASKS
Monitoring & Observability:
- Continuous observation of our systems regarding availability, performance, system usage and costs.
- Definition, design and implementation of observability / monitoring regarding service levels (SLIs / SLOs / SLAs).
- Integration into central observability solutions (e.g. Datadog, Elastic).
- Regular reporting on availability, performance, system usage and costs.
Maintenance:
- Planning, coordination and implementation of system updates in collaboration with our vendors and suppliers.
- Keep our systems secure by fixing vulnerabilities in collaboration with our CISO department.
- Take care of housekeeping tasks.
Automation:
- Drive automation based on paradigms such as CaC/IaC (Configuration as Code / Infrastructure as Code) to minimize error-prone manual work.
- Optimize our CI/CD pipeline.
Incident & Problem Management:
- Take responsibility for coordinating and resolving incidents to keep “Mean-Time-To-Repair” and user impact as low as possible.
- Drive and support problem management to ensure system reliability and prevent recurring incidents.
Service Management:
- Take responsibility for handling service requests.
Continuous Improvement:
- Drive continuous improvement of our platform with regard to scalability, reliability and cost-efficiency.
BEHAVIOURS & APPROACH
- Strong analytical and problem-solving skills.
- Team-oriented with excellent communication and collaboration skills.
- Ability to build proactive, cooperative working relationships with customers, peers and key stakeholders based on respect and teamwork.
- Ability to act under pressure and manage crisis situations efficiently.
- Able to evaluate information, identify key issues and formulate conclusions based on sound, practical judgment, experience, and common sense.
WORK EXPERIENCE
- Extensive experience in operating business-critical, cloud-based platforms (monitoring, maintenance, improvement, troubleshooting, …) at enterprise scale.
- Extensive experience with AWS cloud and container platforms such as ROSA (Red Hat OpenShift Service on AWS).
- Good knowledge of end-to-end monitoring of applications and systems with enterprise observability tools (e.g. Datadog, Elastic, Prometheus, Grafana).
- Experience with automation tools such as Terraform or Ansible is an advantage
- Experience in software development and its tooling, such as version control, CI/CD, planning and collaboration tools (e.g. Git, Jenkins, Jira, Confluence, ...).
- Excellent communication, problem-solving, and stakeholder management skills.
EDUCATION & QUALIFICATIONS
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related discipline.
- English language – expert proficiency (additional languages are beneficial).