Engineering
AI/ML Engineer
Full-time | Hybrid - Chennai/Madurai, TN | Experience: 3-4 years
Skills Required
Python, PyTorch, TensorFlow/Keras, NumPy, SciPy, Pandas, Matplotlib, Seaborn, NLP, LLMs, VLMs, RAG, Agentic Frameworks, Vector Databases, Linear Regression, Docker, Git, GitHub
Role Summary
We seek an AI/ML Engineer to design, train, and optimize the multimodal AI systems powering Vizhi's intelligent coaching. You will focus on:
Vision-Language Models (VLMs): Adapting models (LLaVA, BLIP-2, Qwen-VL) to understand human movement and generate personalized coaching
Multimodal Data Pipelines: Fusing vision (pose/video), language (coaching cues), and sensor data (heart rate, IMU)
Model Training & Optimization: Implementing training loops, experiments, and deploying models to edge devices (<100ms latency)
Agentic Systems & RAG: Building retrieval-augmented generation and multi-step reasoning agents for grounded coaching
You'll collaborate with CV, Mobile ML, and Engineering teams to power real-time intelligent assistance.
Key Responsibilities
1) VLM Architecture & Multimodal Fusion
Evaluate and select state-of-the-art VLMs for the fitness domain (LLaVA, Flamingo, BLIP-2, Qwen-VL)
Understand architecture tradeoffs: model size vs. accuracy vs. latency, vision encoder capacity, language decoder design
Design multimodal pipelines fusing pose keypoints, coaching text, and sensor data with temporal alignment
Build cross-modal embeddings using contrastive learning (align poses to coaching cues)
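As a rough illustration of the contrastive alignment described above, here is a minimal CLIP-style InfoNCE loss in pure Python. The temperature value and the toy 2-D embeddings are placeholders; a real pipeline would use PyTorch tensors and learned pose/text encoders.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def info_nce(pose_embs, text_embs, temperature=0.07):
    """Symmetric batch contrastive loss: matched (pose, coaching-cue)
    pairs sit on the diagonal of the similarity matrix."""
    poses = [l2_normalize(p) for p in pose_embs]
    texts = [l2_normalize(t) for t in text_embs]
    loss = 0.0
    for i in range(len(poses)):
        logits = [dot(poses[i], t) / temperature for t in texts]
        m = max(logits)  # numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # -log softmax at the matched index
    return loss / len(poses)
```

A well-aligned batch (each pose closest to its own cue) scores much lower than a shuffled one, which is exactly the signal the encoder trains against.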
2) Model Training & Fine-Tuning
Implement training loops and model definitions in PyTorch
Use LoRA/QLoRA for efficient fine-tuning (training only 1-5% of parameters) on consumer GPUs
Run experiments: hyperparameter sweeps, architecture variations, augmentation strategies
Implement training pipelines with contrastive loss, supervised fine-tuning, and instruction tuning
Monitor metrics (loss curves, accuracy, validation) and debug training issues (NaNs, divergence, data mismatches)
Handle data imbalance through augmentation and oversampling
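To see where the "1-5% of parameters" figure for LoRA comes from, here is a pure-Python parameter count: freezing a d_out x d_in weight and learning only a rank-r update B @ A adds r*(d_in + d_out) trainable parameters. The 4096x4096 layer below is a hypothetical transformer projection, not a specific model.

```python
def lora_param_fraction(d_in, d_out, rank):
    """Fraction of a frozen d_out x d_in weight matrix that a rank-r
    LoRA update (B: d_out x r times A: r x d_in) adds as trainables."""
    full = d_out * d_in
    adapter = rank * (d_in + d_out)
    return adapter / full

# Hypothetical 4096x4096 projection at rank 64: ~3.1% of the layer.
frac = lora_param_fraction(4096, 4096, rank=64)
```

In practice one would reach for a library such as Hugging Face PEFT rather than hand-rolling this, but the arithmetic is why consumer GPUs suffice.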
3) Data Pipeline & Dataset Management
Design and maintain datasets for vision, language, and multimodal tasks
Implement data pipelines: train/val/test splits, versioning (DVC), quality checks
Build preprocessing scripts: normalization, scaling, sequence preparation, temporal smoothing
Curate exercise descriptions, trainer coaching scripts, and form feedback annotations
Extract multimodal data from trainer recordings (video, audio, pose) and create aligned training pairs
Create gold-standard templates for exercises (5-10 perfect-rep examples)
4) Generative Coaching & Form Assessment
Train generative models: pose sequence → natural language coaching cue (5-30 tokens, <100ms)
Implement safety filters preventing dangerous suggestions
Support coaching personas (motivational, technical, balanced) via prompt engineering
Build form quality scoring (0-100) with explainable components: positioning, stability, range of motion (ROM), timing, symmetry
Detect form degradation under fatigue
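A minimal sketch of an explainable 0-100 form score as a weighted component sum. The component weights below are invented for illustration; in practice they would be calibrated against trainer judgments.

```python
def form_score(components, weights=None):
    """Weighted 0-100 form score with a per-component breakdown.
    components: sub-scores in [0, 100] for positioning, stability,
    rom, timing, symmetry. Default weights are illustrative only."""
    weights = weights or {"positioning": 0.30, "stability": 0.20,
                          "rom": 0.20, "timing": 0.15, "symmetry": 0.15}
    breakdown = {k: round(components[k] * w, 2) for k, w in weights.items()}
    return round(sum(breakdown.values()), 1), breakdown
```

Returning the breakdown alongside the scalar is what makes the score explainable: the coaching layer can point at the weakest component ("stability dropped on rep 8") instead of only the headline number.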
5) Model Optimization & Deployment
Quantization: INT8/FP16 post-training and quantization-aware training (<2% accuracy loss)
Knowledge distillation: train smaller student models (100M-300M params) mimicking teacher VLMs
Architecture optimization: efficient encoders (MobileViT), pruning, dynamic quantization
Export to ONNX/TFLite/CoreML for edge deployment
Profile on target devices (Snapdragon XR2+ smart glasses), achieve <100ms latency
Batch inference, embedding caching, inference optimization
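The INT8 step above reduces to a scale/round/clamp mapping; here is the symmetric post-training variant in pure Python. Real deployments would use PyTorch/ONNX quantization tooling with calibration data rather than this per-tensor toy.

```python
def quantize_int8(weights):
    """Symmetric post-training quantization: floats -> int8 codes + scale."""
    scale = max(abs(w) for w in weights) / 127.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats; per-weight error is bounded by
    half a quantization step (scale / 2)."""
    return [c * scale for c in codes]
```

That half-step error bound is the starting point for the "<2% accuracy loss" budget: quantization-aware training is what recovers the gap when post-training rounding alone exceeds it.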
6) Agentic Systems & RAG
Agentic Coaching: Design multi-step reasoning agents analyzing form, retrieving knowledge, generating grounded feedback
Multimodal RAG: Build retrieval-augmented generation using vector search over exercise standards and trainer libraries
Tool-Augmented VLM: Enable VLMs to invoke tools (pose analysis, biomechanics calculators, rep counters)
Agent Training Pipeline: Design agents monitoring data quality, detecting labeling issues, suggesting improvements
Model Coordination: Coordinate VLM, CV models, sensors through structured tool calls and shared context
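At its core, the multimodal RAG step above is a nearest-neighbour search over embedded knowledge snippets. The sketch below uses tiny hand-made vectors standing in for a real embedding model and vector database; the snippet texts are invented examples.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def retrieve(query_emb, corpus, k=2):
    """corpus: list of (snippet, embedding) pairs, e.g. exercise
    standards or trainer-library entries; returns the k snippets
    most similar to the query embedding."""
    ranked = sorted(corpus, key=lambda item: cosine(query_emb, item[1]),
                    reverse=True)
    return [snippet for snippet, _ in ranked[:k]]
```

The retrieved snippets are then injected into the VLM prompt so that generated feedback is grounded in the exercise standard rather than free-floating model priors.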
7) Evaluation & Benchmarking
Define and track metrics: form classification (>90%), form quality prediction (<5 MAE), coaching accuracy (>85%)
Language quality: BLEU, ROUGE, METEOR, human evaluation (fluency, relevance, correctness)
Latency targets: vision encoder <20ms, language decoder <60ms, end-to-end <100ms
Test robustness across body types, environments, exercises, edge cases
Build validation datasets: 20+ exercises, diverse demographics, occlusions, extreme angles
Implement automated regression tests and CI integration
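The quantitative gates above drop naturally into automated regression tests; as one sketch, a CI check for the <5 MAE form-quality target quoted in this posting:

```python
def mae(preds, targets):
    """Mean absolute error between predicted and reference form scores."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def regression_gate(preds, targets, max_mae=5.0):
    """CI gate: fail the build if form-quality MAE drifts past target."""
    return mae(preds, targets) <= max_mae
```

Wiring gates like this into CI is what turns the metric targets from aspirations into enforced invariants across model revisions.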
8) Reproducibility & Experiment Tracking
Use experiment tracking tools (Weights & Biases, MLflow, TensorBoard)
Ensure reproducibility: fix seeds, document dependencies, version datasets
Maintain documentation: training scripts, preprocessing steps, experiment results
Summarize findings in reports and dashboards for team
9) Cross-Functional Collaboration
Partner with CV Engineers on pose representation formats and quality validation
Work with Mobile ML on resource-aware deployment and profiling
Collaborate with Product/Trainers on coaching quality feedback
Participate in technical reviews, code reviews, and retrospectives
Required Skills & Experience
Educational Background
Bachelor's degree in Computer Science, ML, Mathematics, Statistics, Physics, or related field
Master's degree is a plus
Strong foundation in linear algebra, probability, statistics, calculus
Core ML & Deep Learning (3-4 years)
Programming & Frameworks:
Expert-level Python: NumPy, SciPy, pandas, Matplotlib/Seaborn
PyTorch (1.5+ years): building/training networks, custom loss functions, data loading
Hugging Face Transformers (1+ year): fine-tuning, model architectures, vision models
Experience with at least one DL framework (PyTorch primary, TensorFlow/Keras acceptable)
ML Fundamentals:
Train/val/test splits, overfitting vs. underfitting
Loss functions (MSE, cross-entropy), optimizers (SGD, Adam)
Regularization (dropout, weight decay, early stopping)
Gradient descent and training dynamics
Experience training small-to-medium models end-to-end
Data Engineering:
Working with real datasets: cleaning, transforming, augmenting
Data loaders and preprocessing pipelines
Logging metrics, saving checkpoints
Dataset versioning and quality checks
VLM & Multimodal AI (2+ years preferred)
Vision-Language Models:
Experience fine-tuning VLMs (CLIP, LLaVA, Flamingo, BLIP-2, Qwen-VL)
Understanding VLM architecture: vision encoders, language decoders, cross-attention
Multimodal fusion techniques: early/late fusion, contrastive learning
NLP & Language Models:
Transformer architecture: attention, self-attention, encoder-decoder
Autoregressive generation, beam search
Fine-tuning: instruction tuning, LoRA, parameter-efficient methods
Prompt engineering and tokenization (BPE, WordPiece)
Evaluation: BLEU, ROUGE, METEOR, human evaluation
Computer Vision (Intermediate):
Understanding of pose estimation: 2D/3D keypoints, skeleton representations, temporal modeling
Basic image processing: rotation, scaling, normalization
Comfortable reading CV code and papers
Model Optimization & Deployment
Quantization: INT8, FP16, post-training and QAT
Knowledge distillation: student-teacher models
Pruning and model compression
ONNX/TFLite conversion
Edge inference and low-latency optimization
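The student-teacher distillation listed above is, in miniature, a temperature-softened KL objective (the formulation from Hinton et al.'s distillation work); the logits below are made up for illustration.

```python
import math

def softened(logits, T):
    """Temperature-softened softmax distribution (toy-scale logits)."""
    exps = [math.exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by
    T^2 so gradients stay comparable across temperatures."""
    p = softened(teacher_logits, T)
    q = softened(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

A student matching the teacher exactly incurs zero loss; raising T exposes the teacher's "dark knowledge" in the non-argmax classes, which is what lets a 100M-300M parameter student approximate a much larger VLM.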
Agentic AI & RAG Systems
RAG Systems: Vector databases, embedding selection, chunking strategies, retrieval evaluation
Agent Frameworks: ReAct-style agents, multi-agent patterns, tool/function calling, agent memory
LLM Orchestration: Prompt engineering, chain-of-thought, multi-step workflows, function calling
Tool Protocols: Model-tool interfaces, context sharing, JSON schemas, cross-model coordination
Software Engineering
Git workflows and code review
Clean, modular, well-documented code
Reproducibility: seed management, documentation
Debugging, logging, testing
CI/CD basics
Mathematical Foundation
Linear Algebra: Vectors, matrices, decompositions, eigenvalues
Calculus: Derivatives, chain rule, gradients, optimization
Probability & Statistics: Distributions, expectation/variance, hypothesis testing
Numerical Methods: Least-squares fitting, handling ill-conditioned problems
Preferred Qualifications
Advanced Multimodal & Generative AI
Experience with cutting-edge VLMs: GPT-4V, Gemini Vision, Claude Vision
Generative AI: seq2seq, RAG, RLHF
Multimodal distillation across modalities
Production ML: monitoring, drift detection, retraining, A/B testing
Specialized Domain Knowledge
Biomechanics or sports science
Human pose estimation experience
Computer vision tasks (detection, segmentation, tracking)
Advanced Optimization
Torch JIT, CUDA kernels
Mixed-precision training
Distributed training
GPU optimization and parallel processing
Research & Open Source
Published papers (NeurIPS, ICML, CVPR, ICCV, arXiv)
Kaggle competitions
GitHub repos with VLM/multimodal projects
Hugging Face contributions
Production Experience
Deployed ML models to real users
Model monitoring and retraining pipelines
Experiment tracking tools (W&B, MLflow)
Format conversion (ONNX, TFLite, CoreML)
What You'll Gain
Technical ownership of multimodal AI pipeline (data to production)
Deep expertise in VLMs, cross-modal learning, and agentic systems
Real-world impact on thousands of users' fitness experiences
Research-to-product bridge: cutting-edge AI shipped to real users
Contribution to Nutpaa's multimodal AI patents
Collaboration with CV, mobile, and product specialists
Path to Senior ML Engineer or Research Scientist roles
Hybrid working from Month 6+
MVP Success Criteria (8 Months)
VLM selected, fine-tuned on 1,000+ pose-coaching pairs
Multimodal data pipeline operational (vision + language + sensor alignment)
Coaching generation >85% trainer approval, form assessment >90% accuracy
Quantized model <2% accuracy loss, <100ms inference on device
RAG and agentic coaching system functional
Comprehensive evaluation across 20+ exercises, diverse demographics
Training pipelines reproducible with experiment tracking
Documentation and handoff ready
Work Arrangement
Duration: 8 months (March–October 2026)
Commitment: Full-time, 45–50 hours/week
Location: Hybrid – Chennai or Madurai
Post-MVP: Flexible hybrid working from Month 6+
Application Process
Email careers@nutpaa.ai with:
1. Resume highlighting:
ML/AI experience (3–4+ years)
VLM or multimodal ML projects (2+ years preferred)
PyTorch and Hugging Face expertise
Production deployment experience
RAG/agent framework experience
2. Portfolio:
GitHub: VLM fine-tuning, multimodal pipelines, training experiments, RAG systems
Technical writing: blog posts, papers, project documentation
Published work: arXiv, Kaggle, Hugging Face contributions
3. Statement (~200–250 words):
Interest in VLMs and multimodal AI for fitness
One ML/VLM project you led (challenge, approach, results)
Why real-world AI deployment matters to you
Excitement about early-stage deep-tech
Email Subject: AI/ML Engineer – Multimodal Systems – [Your Name]
Equal Opportunity
Nutpaa is an equal opportunity employer. We value ML fundamentals, VLM/multimodal expertise, and a shipping mentality over strict credentialism.
Non-traditional backgrounds welcome: candidates without formal degrees but with demonstrable expertise (portfolio, projects, shipped products) are encouraged to apply.
Questions?
Email: careers@nutpaa.ai
