Engineering Manager - Technical Platform Systems (Observability)
Poznań, PL, 61-569
Job Description:
The Technical Platform Observability team is the backbone of Allegro’s technical excellence. We guarantee Allegro's platform stability by performing 250,000 system health checks every minute. These checks catch problems before they impact customers and give the entire Tech organization, including 1000+ on-duty officers working 24/7, clear performance insights and proactive alerts.
We are looking for an Engineering Manager to lead this mission-critical domain. You will spearhead the evolution of our Observability ecosystem, including a high-stakes transition of our incident management and on-call alerting systems.
This is the right job for you if you:
- Have experience in leading and growing engineering teams, with a focus on coaching, mentoring, and building a culture of ownership.
- Possess a strong technical background in Observability, Monitoring, or Site Reliability Engineering (SRE).
- Have a proven track record of managing complex migrations or large-scale infrastructure projects (e.g., transitioning between mission-critical enterprise tools).
- Understand the "Last Mile of Observability" - ensuring that automated signals translate into effective human action.
- Are proficient in modern infrastructure practices, including Infrastructure as Code, GitOps, open-source tooling, and high-availability distributed systems.
- Know how to balance technical debt with the delivery of new, scalable platform features.
- Communicate effectively in English at a minimum B2 level.
- Demonstrate the ability to bridge the gap between deep technical implementation and business impact by translating technical complexities into clear business value.
In your daily work you will handle the following tasks:
- Leadership & Strategy: Leading the team responsible for Allegro’s central observability and monitoring ecosystem, overseeing our mission-critical alerting, routing infrastructure, and self-service monitoring platforms.
- Mission-Critical Innovation: Overseeing initiatives such as the strategic transition of our on-call management and incident response system, ensuring zero downtime in alerting coverage for over 2,000 services.
- Platform Evolution: Driving the evolution of our monitoring-as-a-service capabilities, moving towards a fully declarative, Git-based workflow to democratize monitoring ownership across the organization.
- Scalability & Performance: Managing a massive-scale data ecosystem, including VictoriaMetrics (ingesting 100M+ samples/sec) and Zabbix, to ensure physical infrastructure safety and long-term performance baselines.
- Stakeholder Management: Collaborating with Area Managers and the wider tech community to ensure Grafana remains the primary "Operational Front Door" for incident response and system behavior exploration.
- System Discovery: Maintaining automated collection targets for approximately 141,000 active instances, ensuring new services are monitored the moment they are deployed.
- Technical Excellence: Managing technical debt and providing expertise in high-complexity architectures to ensure platform stability during worst-case failure scenarios.
What's in it for you:
- Flexible working hours in the hybrid model (4/1) - working hours start between 7:00 a.m. and 10:00 a.m.
- Well-located offices (with, e.g., fully equipped kitchens, bicycle parking, terraces full of greenery) and excellent work tools (e.g., raised desks, ergonomic chairs, interactive conference rooms).
- A 16" or 14" MacBook Pro or corresponding Dell with Windows (if you don't like Macs) and all the necessary accessories.
- A wide selection of fringe benefits in a cafeteria plan - you choose what you like (e.g., medical, sports or lunch packages, insurance, purchase vouchers).
- English classes, paid for by us, tailored to the specific nature of your job.
- A training budget, inter-team tourism, hackathons, and an internal learning platform.
- An additional day off for volunteering, which you can use alone, with a team, or with a larger group.
- Social events for Allegro people - Spin Kilometers, Family Day, Fat Thursday, Advent of Code, and many other occasions we enjoy.
#goodtobehere means that:
- You will join a team you can count on - we work with top-class specialists who have knowledge- and experience-sharing in their DNA.
- You will love our level of autonomy in team organization, the space for continuous development, and the opportunity to try new things.
- You get to choose which technology solves the problem and you are responsible for what you create.
- You will value our Developer Experience and the full platform of tools and technologies that make creating software easier.
- We rely on an internal ecosystem based on self-service and widely used tools such as Kubernetes, Allegro's open-source Hermes, Docker, Consul, GitHub, and GitHub Actions.
- You will be equipped with modern AI tools to automate repetitive tasks, allowing you to focus on developing new services.
- You will meet the Allegro Scale: 2000+ microservices, 300K+ rps on our data bus (Hermes), and tens of petabytes of data.
- You will become part of Allegro Tech - we speak at conferences, have run a blog for 10+ years, record podcasts, and lead guilds.
Send us your CV and... see you at Allegro!