Engineering

AI/ML Engineer – VLM Focused

Full-time | Madurai (Hybrid) | Exp. 3-4 Years


AI/ML Engineer – VLM Focused (Vision-Language Models)


Skills Required

Python, PyTorch, TensorFlow, Linear Regression, VLM, NLP, NumPy, SciPy, Pandas, Matplotlib, Seaborn, Keras, Git, GitHub, Vector Databases, RAG, LLM, Pinecone, ChromaDB, CLIP, LLaVA, Flamingo, BLIP-2, Qwen-VL, CUDA, Quantization, Torch JIT, LangChain

Role Summary

We are seeking an AI/ML Engineer specializing in Vision-Language Models (VLMs) to design, train, and optimize multimodal AI systems that understand human movement (pose, form, biomechanics) and generate intelligent, personalized coaching through generative language models.

This role focuses on:

  • VLM Architecture & Fine-Tuning: Adapting state-of-the-art Vision-Language Models (LLaVA, Flamingo, BLIP-2, Qwen-VL) to the fitness domain using LoRA/QLoRA

  • Multimodal Data Engineering: Building pipelines that fuse vision (pose/video), language (coaching cues), and sensor data (IMU, heart rate)

  • Generative Coaching Models: Training models to generate real-time, natural language feedback based on human movement

  • Model Optimization & Deployment: Compressing multimodal models for on-device inference (<100ms latency)

You will collaborate closely with Computer Vision Engineers (pose keypoints), Mobile ML Engineers (deployment), and the Head of Engineering (roadmap).

This is a core technical role that directly impacts product quality through intelligent multimodal AI.


Key Responsibilities

1. Vision-Language Model (VLM) Architecture & Selection

  • Evaluate and select state-of-the-art VLMs (LLaVA, Flamingo, BLIP-2, CogVLM, Qwen-VL) for the fitness domain

  • Understand architecture tradeoffs: model size vs. accuracy vs. latency, vision encoder capacity, language decoder design

2. Multimodal Data Pipeline Architecture

  • Vision modality: Receive pose keypoints (2D/3D) from CV Engineer, design temporal representations (keypoint sequences, skeleton graphs, pose embeddings)

  • Language modality: Curate exercise descriptions, trainer coaching scripts, form feedback annotations

  • Sensor modality: Integrate heart rate, HRV, IMU acceleration data from wearables

  • Data alignment: Timestamp video frames, pose keypoints, coaching annotations, sensor readings; create aligned training pairs

  • Data versioning: Manage dataset versions using DVC (Data Version Control)
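To illustrate the alignment step above, here is a minimal sketch of pairing video frames with wearable sensor readings by nearest timestamp. The sample rates, timestamps, and the `nearest` helper are all made up for illustration; a production pipeline would operate on real recording metadata.

```python
import bisect

def nearest(timestamps: list[int], t: int) -> int:
    """Return the sensor timestamp (ms) closest to frame time t.

    Assumes `timestamps` is sorted ascending.
    """
    i = bisect.bisect_left(timestamps, t)
    # Only the neighbors around the insertion point can be closest.
    candidates = timestamps[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda s: abs(s - t))

sensor_ts = [0, 40, 80, 120, 160]   # 25 Hz IMU samples (ms)
frame_ts = [0, 33, 66, 100, 133]    # 30 fps video frames (ms)

# Each (frame, sensor) pair becomes one aligned training example.
pairs = [(f, nearest(sensor_ts, f)) for f in frame_ts]
print(pairs)  # → [(0, 0), (33, 40), (66, 80), (100, 80), (133, 120)]
```

The same nearest-timestamp pattern extends to aligning pose keypoints and coaching annotations against the frame clock.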

3. VLM Fine-Tuning & Training

  • Use LoRA (Low-Rank Adaptation) for efficient fine-tuning: freeze the base model, train small adapters (1-5% of parameters)

  • Use QLoRA to combine 4-bit base-model quantization with LoRA, enabling fine-tuning on consumer GPUs

  • Implement the training pipeline using Hugging Face Transformers with contrastive loss, supervised fine-tuning, instruction tuning, and RLHF

  • Handle data imbalance: augmentation for underrepresented exercises, oversampling strategies

  • Experiment tracking: Log to Weights & Biases, manage reproducibility
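To make the LoRA idea above concrete, here is a self-contained toy sketch of a frozen linear layer plus a low-rank trainable adapter. The `LoRALinear` class and its dimensions are illustrative only; real fine-tuning would use the PEFT library on a full VLM checkpoint.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus trainable low-rank adapter: y = Wx + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.1%}")  # → trainable fraction: 3.0%
```

Because B is zero-initialized, the adapted layer reproduces the base layer exactly before training, and only the adapter's ~3% of parameters receive gradients.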

4. Model Optimization for Mobile Deployment

  • Quantization: Post-training (INT8) and quantization-aware training (QAT), validate <2% accuracy loss

  • Knowledge distillation: Train smaller student model (100M-300M params) to mimic teacher VLM (1B+ params)

  • Architecture optimization: Explore efficient encoders (MobileViT), lighter decoders, pruning, dynamic quantization

  • Inference optimization: Batch inference, embedding caching, ONNX conversion, TFLite deployment

  • Extract multimodal data from trainer recordings: video (3 angles), audio (coaching narration), pose (CV extraction)
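As a small illustration of the post-training quantization bullet above, the sketch below applies PyTorch dynamic INT8 quantization to a toy model standing in for a distilled student head, then measures output drift on one input. The architecture and drift check are illustrative, not a production validation protocol.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Toy model standing in for a distilled student head (not a real VLM).
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 64))

# Post-training dynamic quantization: weights stored as INT8,
# activations quantized on the fly at inference time.
quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 256)
fp32_out, int8_out = model(x), quantized(x)

# A rough sanity check on accuracy loss: compare FP32 vs INT8 outputs.
drift = (fp32_out - int8_out).abs().max().item()
print(f"max output drift: {drift:.4f}")
```

In practice the "<2% accuracy loss" target would be validated on a held-out evaluation set, not on single-input drift, and quantization-aware training would be used where post-training quantization falls short.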

5. Evaluation & Benchmarking

  • Accuracy metrics: Form classification (>90%), form quality prediction (<5 points MAE), coaching cue accuracy (>85%)

  • Language quality: BLEU, ROUGE, METEOR scores, human evaluation (fluency, relevance, correctness)

  • Latency: End-to-end <100ms on mobile (vision encoder <20ms, language decoder <60ms)

  • Robustness: Test across body types, gym environments, exercise diversity, edge cases

  • Validation datasets: 20+ exercises, diverse demographics, occlusion/extreme angles

  • Benchmarking cadence: weekly core metrics, monthly full suite, quarterly additions
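To show what the language-quality metrics above measure, here is a simplified unigram BLEU with a brevity penalty, written from scratch for illustration; real evaluations would use an established implementation such as sacrebleu or NLTK.

```python
import math
from collections import Counter

def bleu1(reference: str, candidate: str) -> float:
    """Unigram BLEU with brevity penalty (simplified sketch of the metric)."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    ref_counts, cand_counts = Counter(ref), Counter(cand)
    # Clipped unigram matches: each reference word can be credited at most
    # as many times as it appears in the reference.
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in cand_counts)
    precision = overlap / len(cand)
    # Brevity penalty discourages trivially short candidates.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

ref = "keep your knees tracking over your toes"
print(bleu1(ref, "keep your knees over your toes"))  # ≈ 0.85 (brevity penalty applies)
```

ROUGE and METEOR complement this with recall- and alignment-oriented views, and human evaluation remains the final check for fluency and correctness of coaching cues.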

6. Agentic Solutions and Tools

  • Agentic Coaching System – Design multi-step reasoning agents that analyze user form, retrieve relevant coaching knowledge, and generate grounded, personalized feedback in real time.

  • Multimodal RAG – Build retrieval-augmented generation systems that use vector search over exercise standards, trainer libraries, and pose examples to ground coaching outputs.

  • Model & Tool Coordination (MCP-style) – Coordinate VLM, CV models, sensor interpreters, and knowledge bases through structured tool/function calls and shared context.

  • Tool-Augmented VLM – Enable the VLM to dynamically invoke tools (pose analysis, biomechanics calculators, rep counters, validators) during inference.
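As a toy sketch of the multimodal RAG bullet above: cosine-similarity retrieval over a tiny in-memory knowledge base of coaching snippets. The snippets and random embeddings are invented for illustration; a production system would use learned embeddings and a vector database such as Pinecone or ChromaDB.

```python
import numpy as np

# Toy knowledge base: coaching snippets with made-up embeddings.
snippets = [
    "Squat: keep knees tracking over toes.",
    "Deadlift: maintain a neutral spine throughout the pull.",
    "Push-up: brace the core to avoid hip sag.",
]
rng = np.random.default_rng(0)
kb_embeddings = rng.normal(size=(len(snippets), 8))
kb_embeddings /= np.linalg.norm(kb_embeddings, axis=1, keepdims=True)

def retrieve(query_emb: np.ndarray, k: int = 1) -> list[str]:
    """Cosine-similarity retrieval; stands in for a vector-database query."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = kb_embeddings @ q                 # cosine similarity per snippet
    top = np.argsort(scores)[::-1][:k]         # highest-scoring snippets first
    return [snippets[i] for i in top]

# A query embedding near the squat entry retrieves the squat cue,
# which would then ground the VLM's generated coaching feedback.
query = kb_embeddings[0] + rng.normal(scale=0.05, size=8)
print(retrieve(query))
```

The retrieved snippets are injected into the generation prompt so the coaching output is grounded in vetted exercise standards rather than the model's parametric memory alone.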


Required Skills & Experience


Educational Background

  • Bachelor's degree in Computer Science, ML, Statistics, Mathematics, Physics, or related field

  • Master's degree (MS) is a plus but not required.

  • Strong foundation: linear algebra, probability & statistics, calculus, information theory


VLM & Multimodal ML Experience (2+ years)

2+ years of hands-on experience with Vision-Language Models or multimodal AI:

  • Fine-tuned VLMs (CLIP, LLaVA, Flamingo, BLIP-2, Qwen-VL)

  • Built systems combining vision and language modalities

  • Shipped multimodal models in production or research

Expert-level Python:

  • NumPy, SciPy, pandas for numerical computing

Expert PyTorch (1.5+ years):

  • Building and training neural networks from scratch

  • Custom loss functions and training loops

  • Data loading (torch.utils.data)

Hugging Face Transformers (1+ year):

  • Fine-tuning with the Trainer API

  • Model architectures (attention, encoder-decoder)

  • Vision models (image processors, ViT)


Computer Vision Knowledge (Intermediate)

  • Understanding of pose estimation: 2D/3D keypoints, skeleton representations, pose embeddings, temporal modeling

  • Comfortable reading CV code and papers

  • Basic image processing knowledge (rotation, scaling, normalization)


Natural Language Processing (Intermediate to Advanced)

  • Language model architecture: Transformers, attention, self-attention

  • Generation: autoregressive models, beam search

  • Fine-tuning: instruction tuning, LoRA, parameter-efficient methods

  • Evaluation: BLEU, ROUGE, METEOR, human evaluation

  • Prompt engineering and text preprocessing (tokenization, BPE, WordPiece)


Model Optimization & Deployment

  • Quantization: INT8, FP16, post-training and quantization-aware training

  • Knowledge distillation: student-teacher models

  • Pruning and model compression

  • ONNX and TFLite conversion

  • Edge inference and low-latency optimization


Data Engineering

  • Multimodal data alignment and versioning (DVC)

  • Annotation management and quality control

  • Data augmentation and imbalance handling

  • Experiment tracking (Weights & Biases, MLflow, TensorBoard)


Software Engineering Practices

  • Git workflows and code review

  • Clean, modular, well-documented code

  • Reproducibility: seed management, documentation

  • Debugging and logging

  • Testing and CI/CD basics


Preferred Skills & Experience

  • Experience with cutting-edge VLMs: GPT-4V, Gemini Vision, Claude Vision API

  • Multimodal fusion techniques: cross-attention, late/early fusion, contrastive learning (CLIP-style), distillation across modalities

  • Generative AI: seq2seq models, RAG, RLHF

  • Production ML: model monitoring, data drift, retraining, A/B testing

  • Biomechanics or sports science knowledge, human pose estimation experience

  • Advanced optimization: Torch JIT, CUDA kernels, mixed-precision training, distributed training

  • Open-source contributions: GitHub repos with VLM/multimodal projects, Hugging Face contributions, published papers (NeurIPS, ICML, CVPR, ICCV)

  • Agent Development – ReAct-style agents, multi-agent patterns, tool/function calling, agent memory (session + long-term), experience with LangChain / LangGraph / similar frameworks.

  • RAG Systems – Vector databases (e.g., Pinecone, ChromaDB), text and multimodal embeddings, retrieval strategies (semantic, hybrid, reranking), context window management, RAG evaluation for faithfulness.

  • Model Context & Tool Protocols (MCP-style) – Designing model–tool interfaces, context sharing patterns, tool registration and discovery, robust JSON/function schemas, cross-model coordination.

  • LLM Orchestration – Prompt engineering, few-shot and chain-of-thought prompting, multi-step workflow design, function calling patterns, integrating multiple tools and models into a coherent flow.


What You'll Gain

  • Technical ownership of the entire VLM pipeline (data to production)

  • Deep multimodal expertise in vision-language models and cross-modal learning

  • Real-world impact: Your models affect thousands of users' fitness experiences

  • Research-to-product bridge: Work with cutting-edge AI while shipping to real users

  • Patent involvement: Contribute to Nutpaa's multimodal AI patents

  • Collaboration with specialists: Work with CV engineers, mobile engineers, trainers

  • Career growth: Path to Senior ML Engineer, Research Scientist, or ML Architecture roles

  • Early-stage deep-tech: Work on hard problems in early-stage startup

  • Hybrid working: Post-MVP (Month 6+), flexibility for remote collaboration


Organizational & Cultural Expectations

  • Maintain scientific rigor in model development: proper train/val/test splits, reproducible experiments

  • Balance research depth with shipping pragmatism: good model now > perfect model later

  • Communicate clearly with non-ML specialists about capabilities and limitations

  • Collaborate genuinely across teams: ask for help, offer help, share learnings

  • Uphold Nutpaa's values: Engineering Excellence, Long-Termism, Open Evolution, Peer-Driven Collaboration

  • Be comfortable with ambiguity and iteration (VLMs are frontier tech)

  • Mentor junior engineers on multimodal AI concepts


Application Process

Please email careers@nutpaa.ai with:

1. Resume highlighting:

  • VLM or multimodal ML experience (2+ years)

  • Specific models worked with (LLaVA, CLIP, Flamingo, etc.)

  • PyTorch and Hugging Face expertise

  • Quantization/deployment experience

  • Production shipping experience

2. Portfolio:

  • GitHub repos: VLM fine-tuning, multimodal data pipelines, optimization work, pose estimation projects

  • Technical writing: blog posts, papers on VLMs/multimodal learning, project documentation, experiment reports

  • Published work: arXiv papers, Kaggle competitions, open-source contributions (Hugging Face, PyTorch)

3. Statement of Interest (~250 words):

  • Why are you interested in VLMs and multimodal AI?

  • One specific VLM fine-tuning or multimodal project you led: What was the challenge? What did you build? What were the results (accuracy, latency, learnings)?

  • Why does real-world AI deployment matter to you?

  • What excites you about early-stage deep-tech?

Email Subject: AI/ML Engineer – VLM Specialist – [Your Name]


Equal Opportunity Statement

Nutpaa is an equal opportunity employer. We do not discriminate based on race, religion, color, national origin, gender, gender identity or expression, sexual orientation, age, marital status, veteran status, or disability status.

Apply now to join us

