Sr Inference Engineer

Tanla Platforms Limited

Hyderabad, Telangana, India
Expired on 30 Apr 2026

About the Job

About The Role:

As a Model Inference Engineer, you will bridge the gap between model training and production deployment. You will take high-performance checkpoints from our Training Engineers and transform them into optimized, production-ready artifacts. Your mission is to architect, build, and rigorously test inference servers that deliver our Voice AI capabilities across both real-time streaming and high-throughput batch scenarios.


You will also play a key role in hardware-software co-optimization, selecting the right compute profiles and implementing scaling strategies to balance high-fidelity audio quality with cost-efficient, reliable production delivery.

What you’ll be responsible for:

  • Transform trained checkpoints into high-performance artifacts using TensorRT, ONNX, or TVM. Implement quantization strategies (FP16, INT8, FP8) to balance precision and performance.
  • Architect and maintain inference servers using Triton Inference Server or vLLM. Implement efficient request handling through dynamic batching and streaming protocols (gRPC, WebSockets).
  • Profile and optimize model performance at the kernel level. Select and tune compute profiles across various NVIDIA GPU architectures (T4, L4, A100, H100) to maximize cost-efficiency.
  • Design and execute rigorous performance tests to measure latency (TTFC), throughput, and memory usage. Ensure optimized models maintain the required acoustic fidelity and accuracy.
  • Partner with Training Engineers to define export-friendly architectures and provide feedback on model performance in production-like environments.


What we are looking for in you:


Must have:

  • Deep practical experience with model serving frameworks such as vLLM, Triton Inference Server, and Ollama.
  • Strong experience with model acceleration and runtime frameworks including TensorRT, TensorRT-LLM, and ONNX Runtime.
  • Ability to optimize inference performance through batching, quantization, GPU utilization, and latency tuning for large-scale model serving.
  • Ability to profile and identify bottlenecks across the entire stack—from Python/C++ code to GPU kernels and memory bandwidth.


Good to have:

  • Experience writing or optimizing kernels using CUDA (C++) or Triton (Python) to accelerate non-standard operators.
  • Familiarity with Apache TVM, Kubernetes, Docker, and managing GPU clusters for large-scale inference deployment.

Required:

  • 5–7 years of industry experience in machine learning model optimization, inference systems, or ML infrastructure engineering.
  • BE/BTech/ME/MTech/PhD in Computer Science, Artificial Intelligence, Machine Learning, or a related field preferred.
  • Strong proficiency in C++ and Python for building high-performance machine learning and inference systems.
  • Solid understanding of NVIDIA GPU architectures (e.g., Ampere, Hopper) and CUDA programming concepts for accelerated computing.
  • Experience working with Linux-based environments, including system-level debugging and performance tuning.
  • Familiarity with networking protocols and APIs such as gRPC, WebSockets, and HTTP/2 for real-time inference services.
  • Proficiency with version control systems such as Git, and experience with collaborative software development workflows.


Why join us?

  • Impactful Work: Play a pivotal role in safeguarding Tanla's assets, data, and reputation in the industry.
  • Tremendous Growth Opportunities: Be part of a rapidly growing company in the telecom and CPaaS space, with opportunities for professional development.
  • Innovative Environment: Work alongside a world-class team in a challenging and fun environment, where innovation is celebrated.


Tanla is an equal opportunity employer. We champion diversity and are committed to creating an inclusive environment for all employees.


www.tanla.com
