Engineering

AI/ML Engineer – VLM Focused

Full-time | Madurai (Hybrid) | Exp. 3-4 Years


AI/ML Engineer – VLM Focused (Vision-Language Models)


Skills Required

Python, PyTorch, TensorFlow, Linear Regression, VLM, NLP, NumPy, SciPy, Pandas, Matplotlib, Seaborn, Keras, Git, GitHub, Vector Databases, RAG, LLM, Pinecone, ChromaDB, CLIP, LLaVA, Flamingo, BLIP-2, Qwen-VL, CUDA, Quantization, Torch JIT, LangChain

Role Summary

We are seeking an AI/ML Engineer specializing in Vision-Language Models (VLMs) to design, train, and optimize multimodal AI systems that understand human movement (pose, form, biomechanics) and generate intelligent, personalized coaching through generative language models.

This role focuses on:

  • VLM Architecture & Fine-Tuning: Adapting state-of-the-art Vision-Language Models (LLaVA, Flamingo, BLIP-2, Qwen-VL) to the fitness domain using LoRA/QLoRA

  • Multimodal Data Engineering: Building pipelines that fuse vision (pose/video), language (coaching cues), and sensor data (IMU, heart rate)

  • Generative Coaching Models: Training models to generate real-time, natural language feedback based on human movement

  • Model Optimization & Deployment: Compressing multimodal models for on-device inference (<100ms latency)

You will collaborate closely with Computer Vision Engineers (pose keypoints), Mobile ML Engineers (deployment), and the Head of Engineering (roadmap).

This is a core technical role that directly impacts product quality through intelligent multimodal AI.


Key Responsibilities

1. Vision-Language Model (VLM) Architecture & Selection

  • Evaluate and select state-of-the-art VLMs (LLaVA, Flamingo, BLIP-2, CogVLM, Qwen-VL) for the fitness domain

  • Understand architecture tradeoffs: model size vs. accuracy vs. latency, vision encoder capacity, language decoder design

2. Multimodal Data Pipeline Architecture

  • Vision modality: Receive pose keypoints (2D/3D) from CV Engineer, design temporal representations (keypoint sequences, skeleton graphs, pose embeddings)

  • Language modality: Curate exercise descriptions, trainer coaching scripts, form feedback annotations

  • Sensor modality: Integrate heart rate, HRV, IMU acceleration data from wearables

  • Data alignment: Timestamp video frames, pose keypoints, coaching annotations, sensor readings; create aligned training pairs

  • Data versioning: Manage dataset versions using DVC (Data Version Control)
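To illustrate the alignment step above, here is a minimal sketch of pairing video frames with wearable sensor readings by nearest timestamp. The sample rates, timestamps, and the `nearest` helper are all made up for illustration; a production pipeline would operate on real recording metadata.

```python
import bisect

def nearest(timestamps: list[int], t: int) -> int:
    """Return the sensor timestamp (ms) closest to frame time t.

    Assumes `timestamps` is sorted ascending.
    """
    i = bisect.bisect_left(timestamps, t)
    # Only the neighbors around the insertion point can be closest.
    candidates = timestamps[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda s: abs(s - t))

sensor_ts = [0, 40, 80, 120, 160]   # 25 Hz IMU samples (ms)
frame_ts = [0, 33, 66, 100, 133]    # 30 fps video frames (ms)

# Each (frame, sensor) pair becomes one aligned training example.
pairs = [(f, nearest(sensor_ts, f)) for f in frame_ts]
print(pairs)  # → [(0, 0), (33, 40), (66, 80), (100, 80), (133, 120)]
```

The same nearest-timestamp pattern extends to aligning pose keypoints and coaching annotations against the frame clock.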

3. VLM Fine-Tuning & Training

  • Use LoRA (Low-Rank Adaptation) for efficient fine-tuning: freeze the base model, train small adapters (1-5% of parameters)

  • Use QLoRA to combine 4-bit base-model quantization with LoRA, enabling fine-tuning on consumer GPUs

  • Implement the training pipeline using Hugging Face Transformers with contrastive loss, supervised fine-tuning, instruction tuning, and RLHF

  • Handle data imbalance: augmentation for underrepresented exercises, oversampling strategies

  • Experiment tracking: Log to Weights & Biases, manage reproducibility
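To make the LoRA idea above concrete, here is a self-contained toy sketch of a frozen linear layer plus a low-rank trainable adapter. The `LoRALinear` class and its dimensions are illustrative only; real fine-tuning would use the PEFT library on a full VLM checkpoint.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus trainable low-rank adapter: y = Wx + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.1%}")  # → trainable fraction: 3.0%
```

Because B is zero-initialized, the adapted layer reproduces the base layer exactly before training, and only the adapter's ~3% of parameters receive gradients.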

4. Model Optimization for Mobile Deployment

  • Quantization: Post-training (INT8) and quantization-aware training (QAT), validate <2% accuracy loss

  • Knowledge distillation: Train smaller student model (100M-300M params) to mimic teacher VLM (1B+ params)

  • Architecture optimization: Explore efficient encoders (MobileViT), lighter decoders, pruning, dynamic quantization

  • Inference optimization: Batch inference, embedding caching, ONNX conversion, TFLite deployment

  • Extract multimodal data from trainer recordings: video (3 angles), audio (coaching narration), pose (CV extraction)
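As a small illustration of the post-training quantization bullet above, the sketch below applies PyTorch dynamic INT8 quantization to a toy model standing in for a distilled student head, then measures output drift on one input. The architecture and drift check are illustrative, not a production validation protocol.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Toy model standing in for a distilled student head (not a real VLM).
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 64))

# Post-training dynamic quantization: weights stored as INT8,
# activations quantized on the fly at inference time.
quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 256)
fp32_out, int8_out = model(x), quantized(x)

# A rough sanity check on accuracy loss: compare FP32 vs INT8 outputs.
drift = (fp32_out - int8_out).abs().max().item()
print(f"max output drift: {drift:.4f}")
```

In practice the "<2% accuracy loss" target would be validated on a held-out evaluation set, not on single-input drift, and quantization-aware training would be used where post-training quantization falls short.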

5. Evaluation & Benchmarking

  • Accuracy metrics: Form classification (>90%), form quality prediction (<5 points MAE), coaching cue accuracy (>85%)

  • Language quality: BLEU, ROUGE, METEOR scores, human evaluation (fluency, relevance, correctness)

  • Latency: End-to-end <100ms on mobile (vision encoder <20ms, language decoder <60ms)

  • Robustness: Test across body types, gym environments, exercise diversity, edge cases

  • Validation datasets: 20+ exercises, diverse demographics, occlusion/extreme angles

  • Benchmarking cadence: weekly core metrics, monthly full suite, quarterly additions
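To show what the language-quality metrics above measure, here is a simplified unigram BLEU with a brevity penalty, written from scratch for illustration; real evaluations would use an established implementation such as sacrebleu or NLTK.

```python
import math
from collections import Counter

def bleu1(reference: str, candidate: str) -> float:
    """Unigram BLEU with brevity penalty (simplified sketch of the metric)."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    ref_counts, cand_counts = Counter(ref), Counter(cand)
    # Clipped unigram matches: each reference word can be credited at most
    # as many times as it appears in the reference.
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in cand_counts)
    precision = overlap / len(cand)
    # Brevity penalty discourages trivially short candidates.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

ref = "keep your knees tracking over your toes"
print(bleu1(ref, "keep your knees over your toes"))  # ≈ 0.85 (brevity penalty applies)
```

ROUGE and METEOR complement this with recall- and alignment-oriented views, and human evaluation remains the final check for fluency and correctness of coaching cues.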

6. Agentic Solutions and Tools

  • Agentic Coaching System – Design multi-step reasoning agents that analyze user form, retrieve relevant coaching knowledge, and generate grounded, personalized feedback in real time.

  • Multimodal RAG – Build retrieval-augmented generation systems that use vector search over exercise standards, trainer libraries, and pose examples to ground coaching outputs.

  • Model & Tool Coordination (MCP-style) – Coordinate VLM, CV models, sensor interpreters, and knowledge bases through structured tool/function calls and shared context.

  • Tool-Augmented VLM – Enable the VLM to dynamically invoke tools (pose analysis, biomechanics calculators, rep counters, validators) during inference.
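As a toy sketch of the multimodal RAG bullet above: cosine-similarity retrieval over a tiny in-memory knowledge base of coaching snippets. The snippets and random embeddings are invented for illustration; a production system would use learned embeddings and a vector database such as Pinecone or ChromaDB.

```python
import numpy as np

# Toy knowledge base: coaching snippets with made-up embeddings.
snippets = [
    "Squat: keep knees tracking over toes.",
    "Deadlift: maintain a neutral spine throughout the pull.",
    "Push-up: brace the core to avoid hip sag.",
]
rng = np.random.default_rng(0)
kb_embeddings = rng.normal(size=(len(snippets), 8))
kb_embeddings /= np.linalg.norm(kb_embeddings, axis=1, keepdims=True)

def retrieve(query_emb: np.ndarray, k: int = 1) -> list[str]:
    """Cosine-similarity retrieval; stands in for a vector-database query."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = kb_embeddings @ q                 # cosine similarity per snippet
    top = np.argsort(scores)[::-1][:k]         # highest-scoring snippets first
    return [snippets[i] for i in top]

# A query embedding near the squat entry retrieves the squat cue,
# which would then ground the VLM's generated coaching feedback.
query = kb_embeddings[0] + rng.normal(scale=0.05, size=8)
print(retrieve(query))
```

The retrieved snippets are injected into the generation prompt so the coaching output is grounded in vetted exercise standards rather than the model's parametric memory alone.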


Required Skills & Experience


Educational Background

  • Bachelor's degree in Computer Science, ML, Statistics, Mathematics, Physics, or related field

  • Master's degree (MS) is a plus but not required.

  • Strong foundation: linear algebra, probability & statistics, calculus, information theory


VLM & Multimodal ML Experience (2+ years)

2+ years of hands-on experience with Vision-Language Models or multimodal AI:

  • Fine-tuned VLMs (CLIP, LLaVA, Flamingo, BLIP-2, Qwen-VL)

  • Built systems combining vision and language modalities

  • Shipped multimodal models in production or research

Expert-level Python:

  • NumPy, SciPy, pandas for numerical computing

Expert PyTorch (1.5+ years):

  • Building and training neural networks from scratch

  • Custom loss functions and training loops

  • Data loading (torch.utils.data)

Hugging Face Transformers (1+ year):

  • Fine-tuning with the Trainer API

  • Model architectures (attention, encoder-decoder)

  • Vision models (image processors, ViT)


Computer Vision Knowledge (Intermediate)

  • Understanding of pose estimation: 2D/3D keypoints, skeleton representations, pose embeddings, temporal modeling

  • Comfortable reading CV code and papers

  • Basic image processing knowledge (rotation, scaling, normalization)


Natural Language Processing (Intermediate to Advanced)

  • Language model architecture: Transformers, attention, self-attention

  • Generation: autoregressive models, beam search

  • Fine-tuning: instruction tuning, LoRA, parameter-efficient methods

  • Evaluation: BLEU, ROUGE, METEOR, human evaluation

  • Prompt engineering and text preprocessing (tokenization, BPE, WordPiece)


Model Optimization & Deployment

  • Quantization: INT8, FP16, post-training and quantization-aware training

  • Knowledge distillation: student-teacher models

  • Pruning and model compression

  • ONNX and TFLite conversion

  • Edge inference and low-latency optimization


Data Engineering

  • Multimodal data alignment and versioning (DVC)

  • Annotation management and quality control

  • Data augmentation and imbalance handling

  • Experiment tracking (Weights & Biases, MLflow, TensorBoard)


Software Engineering Practices

  • Git workflows and code review

  • Clean, modular, well-documented code

  • Reproducibility: seed management, documentation

  • Debugging and logging

  • Testing and CI/CD basics


Preferred Skills & Experience

  • Experience with cutting-edge VLMs: GPT-4V, Gemini Vision, Claude Vision API

  • Multimodal fusion techniques: cross-attention, late/early fusion, contrastive learning (CLIP-style), distillation across modalities

  • Generative AI: seq2seq models, RAG, RLHF

  • Production ML: model monitoring, data drift, retraining, A/B testing

  • Biomechanics or sports science knowledge, human pose estimation experience

  • Advanced optimization: Torch JIT, CUDA kernels, mixed-precision training, distributed training

  • Open-source contributions: GitHub repos with VLM/multimodal projects, Hugging Face contributions, published papers (NeurIPS, ICML, CVPR, ICCV)

  • Agent Development – ReAct-style agents, multi-agent patterns, tool/function calling, agent memory (session + long-term), experience with LangChain / LangGraph / similar frameworks.

  • RAG Systems – Vector databases (e.g., Pinecone, ChromaDB), text and multimodal embeddings, retrieval strategies (semantic, hybrid, reranking), context window management, RAG evaluation for faithfulness.

  • Model Context & Tool Protocols (MCP-style) – Designing model–tool interfaces, context sharing patterns, tool registration and discovery, robust JSON/function schemas, cross-model coordination.

  • LLM Orchestration – Prompt engineering, few-shot and chain-of-thought prompting, multi-step workflow design, function calling patterns, integrating multiple tools and models into a coherent flow.


What You'll Gain

  • Technical ownership of the entire VLM pipeline (data to production)

  • Deep multimodal expertise in vision-language models and cross-modal learning

  • Real-world impact: Your models affect thousands of users' fitness experiences

  • Research-to-product bridge: Work with cutting-edge AI while shipping to real users

  • Patent involvement: Contribute to Nutpaa's multimodal AI patents

  • Collaboration with specialists: Work with CV engineers, mobile engineers, trainers

  • Career growth: Path to Senior ML Engineer, Research Scientist, or ML Architecture roles

  • Early-stage deep-tech: Work on hard problems in early-stage startup

  • Hybrid working: Post-MVP (Month 6+), flexibility for remote collaboration


Organizational & Cultural Expectations

  • Maintain scientific rigor in model development: proper train/val/test splits, reproducible experiments

  • Balance research depth with shipping pragmatism: good model now > perfect model later

  • Communicate clearly with non-ML specialists about capabilities and limitations

  • Collaborate genuinely across teams: ask for help, offer help, share learnings

  • Uphold Nutpaa's values: Engineering Excellence, Long-Termism, Open Evolution, Peer-Driven Collaboration

  • Be comfortable with ambiguity and iteration (VLMs are frontier tech)

  • Mentor junior engineers on multimodal AI concepts


Application Process

Please email careers@nutpaa.ai with:

1. Resume highlighting:

  • VLM or multimodal ML experience (2+ years)

  • Specific models worked with (LLaVA, CLIP, Flamingo, etc.)

  • PyTorch and Hugging Face expertise

  • Quantization/deployment experience

  • Production shipping experience

2. Portfolio:

  • GitHub repos: VLM fine-tuning, multimodal data pipelines, optimization work, pose estimation projects

  • Technical writing: blog posts, papers on VLMs/multimodal learning, project documentation, experiment reports

  • Published work: arXiv papers, Kaggle competitions, open-source contributions (Hugging Face, PyTorch)

3. Statement of Interest (~250 words):

  • Why are you interested in VLMs and multimodal AI?

  • One specific VLM fine-tuning or multimodal project you led: What was the challenge? What did you build? What were the results (accuracy, latency, learnings)?

  • Why does real-world AI deployment matter to you?

  • What excites you about early-stage deep-tech?

Email Subject: AI/ML Engineer – VLM Specialist – [Your Name]


Equal Opportunity Statement

Nutpaa is an equal opportunity employer. We do not discriminate based on race, religion, color, national origin, gender, gender identity or expression, sexual orientation, age, marital status, veteran status, or disability status.

Apply now to join us

