Hi, I’m Farhan Mohammed, a Senior AI/ML & NLP Engineer with a passion for building practical AI systems that scale. I bring hands-on experience in Python-based application development and end-to-end ML/AI model deployment, with deep expertise in TensorFlow, PyTorch, scikit-learn, and Generative AI, including GPT, LLaMA, Mistral, and Claude. I’ve shipped production-grade AI solutions across text, code, and knowledge retrieval, designing scalable APIs, ML workflows, and cloud-based deployments on AWS and GCP. I routinely apply fine-tuning (SFT, LoRA, PEFT), embeddings, prompt engineering, and efficient inference to deliver faster results and measurable improvements in accuracy and throughput.

Farhan Mohammed

Available to hire

Language

English
Fluent

Work Experience

Senior NLP Engineer at Augmedix
February 7, 2022 - October 3, 2024
Designed and developed a Python-based medical scribe AI agent using Generative AI frameworks (LangChain, Hugging Face Transformers, PyTorch, TensorFlow, scikit-learn) and Large Language Models (GPT, LLaMA, Med-PaLM, Gemini), achieving a 92% F1 score and a <10% hallucination rate while generating notes in under 40 seconds. Migrated legacy NLP models to advanced Generative AI/LLM solutions, delivering a 15% performance improvement in clinical text processing. Applied Generative AI techniques including fine-tuning, transfer learning, LoRA, PEFT, embeddings, quantization (16-bit/8-bit), and prompt engineering (Chain-of-Thought, Few-Shot, ReAct, schema-driven prompts) to improve accuracy and reduce note completion time.
AI/ML Engineer at EPICS
June 7, 2021 - February 7, 2022
Architected a Python-based personalized insurance recommendation engine using Matrix Factorization (collaborative filtering), TF-IDF (content-based filtering), and Gradient Boosting (XGBoost), increasing policy conversions by 28% and premium value by 15%. Built feature extraction pipelines covering 20+ predictive variables and trained models that reached 83% accuracy in identifying high-value policy matches. Automated retraining with Apache Airflow to maintain >85% accuracy amid changing customer behavior. Implemented an encoder-based chatbot using Hugging Face BERT and applied sentiment analysis with VADER, achieving 89% accuracy.

Education

Master of Business Administration at Trine University
June 10, 2024 - May 1, 2026
Master of Computer Science at Arizona State University
August 21, 2019 - May 5, 2021
Bachelor of Information Technology at Osmania University
August 20, 2015 - May 3, 2019

Qualifications

IBM Generative AI Engineering with LLMs
August 21, 2025 - October 1, 2025
In this Specialization, learners developed essential knowledge and skills to perform tasks in an AI or NLP Engineer role. The Certificate holder has demonstrated a firm grasp of, and practical experience with, the various generative AI architectures and LLMs for developing NLP-based systems. They have demonstrated knowledge of tokenization, data loaders, transformers, attention mechanisms, and prompt engineering. They have gained experience in frameworks and LLMs, such as GPT and BERT. The Certificate holder can use PyTorch and the Hugging Face Transformers library to develop RAG applications such as a question-and-answering system. They can also use document loaders in LangChain to bring in information from various sources and prepare it for processing.

Industry Experience

Software & Internet, Healthcare, Professional Services

    Medical Note Generation with TinyLlama

    This project implements fine-tuning of the TinyLlama model for medical note generation using various PEFT (Parameter-Efficient Fine-Tuning) methods. The model is trained to generate concise, professional medical notes in SOAP (Subjective, Objective, Assessment, Plan) format from patient transcripts.

    Features

    • Fine-tuning of TinyLlama-1.1B-Chat model for medical note generation
    • Support for multiple PEFT methods:
      • LoRA (Low-Rank Adaptation)
      • Prefix Tuning
      • Prompt Tuning
    • 8-bit quantization for efficient training
    • Custom dataset handling for medical transcripts
    • Chain-of-thought reasoning in medical note generation
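
The 8-bit quantization feature listed above can be illustrated with a simple absmax scheme. This is only a sketch under stated assumptions: the text does not say which quantization scheme or library the project uses, so absmax is swapped in for illustration, and the function names `quantize_absmax` and `dequantize` are hypothetical.

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q_weights, q_scale = quantize_absmax(weights)
restored = dequantize(q_weights, q_scale)
print(q_weights)  # [50, -127, 0, 100]
```

Each weight is stored as one signed byte plus a shared scale, so the rounding error per weight is at most half the scale.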

    PEFT Methods Explained

    LoRA (Low-Rank Adaptation)

    LoRA is a parameter-efficient fine-tuning method that adds small trainable rank decomposition matrices to existing weights. Instead of fine-tuning all parameters, LoRA only trains these small matrices, significantly reducing the number of trainable parameters while maintaining model performance. In this project, LoRA is applied to the key and value projection matrices of the attention layers.
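
The low-rank update described above can be sketched in plain Python. This is a minimal illustration, not the project's code: the class name `LoRALinear`, the `alpha / rank` scaling, and the toy dimensions are assumptions, and a real setup would use a tensor library rather than nested lists.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen: never updated during fine-tuning
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the adapted layer
        # initially behaves exactly like the frozen base layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


identity = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
layer = LoRALinear(identity, rank=2)
print(layer.trainable_params())  # 32, versus 64 entries in the full 8x8 weight
```

The saving grows with layer size: for a `d x d` weight, the adapter trains `2 * d * rank` values instead of `d * d`.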

    Prefix Tuning

    Prefix Tuning prepends trainable continuous vectors (prefixes) to the input sequence. These prefixes act as task-specific instructions that guide the model’s behavior. The method keeps the base model frozen and only trains these prefix vectors, making it highly parameter-efficient. In our implementation, we use 100 virtual tokens as prefixes.

    Prompt Tuning

    Like Prefix Tuning, Prompt Tuning adds trainable continuous vectors to the input while the base model stays frozen. It is simpler and typically uses fewer parameters: Prompt Tuning adds its vectors only at the input embedding layer, whereas Prefix Tuning can add them at multiple layers. We use 600 virtual tokens for prompt tuning in this project.
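
The difference between the two methods can be sketched as follows. This is an illustrative sketch with hypothetical function names and toy sizes (hidden size 8, 4 layers); the prefix count assumes one key vector and one value vector per virtual token per layer, with no reparameterization network, which may differ from the project's actual configuration.

```python
import random


def init_virtual_tokens(num_virtual_tokens, hidden_size, seed=0):
    """Create the trainable soft-prompt vectors (one per virtual token)."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
        for _ in range(num_virtual_tokens)
    ]


def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Prompt tuning: virtual tokens go in front of the real input embeddings."""
    return soft_prompt + token_embeddings


def prompt_tuning_params(num_virtual_tokens, hidden_size):
    # Vectors exist only at the input embedding layer.
    return num_virtual_tokens * hidden_size


def prefix_tuning_params(num_virtual_tokens, hidden_size, num_layers):
    # Assumed layout: a key and a value vector per virtual token, per layer.
    return num_virtual_tokens * num_layers * 2 * hidden_size


HIDDEN, LAYERS = 8, 4  # toy sizes, not the real model's dimensions
soft_prompt = init_virtual_tokens(3, HIDDEN)
inputs = [[0.0] * HIDDEN for _ in range(5)]
sequence = prepend_soft_prompt(soft_prompt, inputs)
print(len(sequence))                            # 8 positions: 3 virtual + 5 real
print(prompt_tuning_params(600, HIDDEN))        # 4800
print(prefix_tuning_params(100, HIDDEN, LAYERS))  # 6400
```

Under these assumptions, prefix tuning can train more values than prompt tuning even with fewer virtual tokens, because its vectors are repeated at every layer.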

    Advanced RAG Chatbot with LangGraph and Hybrid Memory

    This project implements an advanced Retrieval-Augmented Generation (RAG) chatbot using LangChain, LangGraph, and a hybrid memory system. The chatbot is designed to answer questions based on a knowledge base while maintaining conversational context.

    Features

    • Retrieval-Augmented Generation (RAG): The chatbot retrieves relevant information from a knowledge base before generating a response, ensuring answers are grounded in provided data.
    • LangGraph Orchestration: The entire workflow is orchestrated using LangGraph, providing a clear and maintainable structure for the different stages of processing a query.
    • Hybrid Memory: The chatbot utilizes a hybrid memory system, combining both a traditional chat history and a vector-based memory for more nuanced context management.
    • Long-Term Memory: Utilizes a vector database to store and access previous messages, enabling the chatbot to maintain context over extended conversations.
    • Query Deconstruction: User queries are deconstructed into sub-queries to identify the core question and any necessary context from the chat history.
    • Reciprocal Rank Fusion (RRF): Employs RRF to effectively combine and re-rank documents obtained from multiple sub-queries, ensuring the most relevant information is presented.
    • Debug Mode: A debug flag allows for streaming the output of each step in the LangGraph workflow, providing insight into the chatbot’s internal workings.
    • Vector Store Management: The project includes functionality for creating and managing a vector store for the knowledge base.
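
The Reciprocal Rank Fusion step listed above can be sketched in a few lines. This is a minimal sketch, not the project's code: the function name `reciprocal_rank_fusion` is hypothetical, and the constant `k=60` is the commonly used default rather than a value stated in the text.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1, and the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# One ranked list per sub-query (document IDs are made up for illustration).
sub_query_results = [
    ["a", "b", "c"],
    ["b", "c", "d"],
    ["b", "a"],
]
print(reciprocal_rank_fusion(sub_query_results))  # ['b', 'a', 'c', 'd']
```

Because only ranks are used, the lists can come from retrievers whose raw similarity scores are not comparable.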

    Sentence-Transformer

    This project implements a multi-task sentence transformer using the Hugging Face Transformers library. The model is designed to perform both text classification and Named Entity Recognition (NER), leveraging a shared BERT-based encoder.

    The model architecture includes:

    🔹 A shared BERT encoder that learns contextual embeddings for sentences and tokens.
    🔹 Two task-specific heads — one for classification and another for NER (BIO tagging).
    🔹 A flexible trainer that supports freezing specific components:
      • Freeze the encoder → fine-tune only the task heads for faster convergence.
      • Freeze one head → focus on improving the other task (e.g., NER).

    This selective freezing mechanism allows targeted improvement on individual tasks while retaining shared knowledge — an efficient way to optimize downstream performance.
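
To make the NER head's BIO tagging scheme concrete, the sketch below turns a sequence of predicted tags into entity spans. It is an illustration only: the function name `bio_to_spans`, the label names, and the example sentence are hypothetical, and the choice to drop an `I-` tag that does not continue an open entity is an assumption rather than the project's documented behavior.

```python
def bio_to_spans(tokens, tags):
    """Group tokens into (label, text) spans from BIO tags.

    B-X opens an entity of type X, I-X continues it, O is outside any entity.
    An I-X that does not continue an open X entity is treated as outside.
    """
    spans = []
    label, words = None, []

    def close():
        if label is not None:
            spans.append((label, " ".join(words)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close()
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(token)
        else:
            close()
            label, words = None, []
    close()
    return spans


tokens = ["Ada", "Lovelace", "visited", "New", "York"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))  # [('PER', 'Ada Lovelace'), ('LOC', 'New York')]
```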

    IMDB Movie Chatbot

    This project builds a chatbot using LangChain and OpenAI to answer questions about the IMDB Top 1000 movies dataset. It leverages a combination of semantic search (using ChromaDB) and structured querying (using Pandas DataFrames) to provide comprehensive and insightful responses.

    Features

    • Semantic Search: Ask questions about movie plots and themes using natural language.
    • Structured Querying: Filter and analyze movies based on various criteria like genre, release year, rating, director, and more.
    • Sub-Query Handling: Breaks down complex user queries into smaller, manageable sub-queries for efficient processing.
    • Intermediate Data Storage: Stores and utilizes intermediate results for subsequent queries, enabling multi-step analysis.
    • Chain of Thought Reasoning: The agent employs a chain of thought process for complex query resolution.

    RAG-based Question Answering System

    This project implements a Retrieval-Augmented Generation (RAG) Question Answering System. It answers user questions based on the content of a provided markdown document, using modern NLP techniques (embeddings, vector search, and LLMs).

    Main Components

    • Document Loading & Splitting: Loads a markdown file and splits it into manageable text chunks for processing.
    • Embedding & Vector Store: Uses a Sentence Transformer model (all-MiniLM-L6-v2) to generate embeddings for each chunk, stored in a FAISS vector database for efficient similarity search.
    • Retrieval: Retrieves the most relevant document chunks using vector similarity when a question is asked.
    • Question Answering: Uses OpenAI’s GPT-4o model to generate answers, but only if relevant context is found. If no relevant context is found, it returns a fallback message.
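
The flow through these components can be sketched end to end with stand-ins. Everything below is an assumption made for illustration: `embed` is a toy bag-of-words substitute for all-MiniLM-L6-v2, the linear scan replaces FAISS, `generate` is a caller-supplied placeholder for the GPT-4o call, and the chunk size, threshold, and fallback text are hypothetical values.

```python
import math
from collections import Counter

FALLBACK = "I could not find relevant information in the document."


def split_into_chunks(text, max_words=40):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine(a, b):
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, top_k=2, min_score=0.2):
    """Return up to top_k chunks whose similarity clears the threshold."""
    query = embed(question)
    scored = sorted(((cosine(query, embed(c)), c) for c in chunks), reverse=True)
    return [chunk for score, chunk in scored[:top_k] if score >= min_score]


def answer(question, chunks, generate):
    """Call the generator only when relevant context was retrieved."""
    context = retrieve(question, chunks)
    if not context:
        return FALLBACK
    return generate(question, context)


chunks = [
    "FAISS stores the chunk embeddings for similarity search.",
    "The loader reads a markdown file.",
]
echo_first = lambda question, context: context[0]  # placeholder for an LLM call
print(answer("Where are embeddings stored for similarity search?", chunks, echo_first))
print(answer("Who won yesterday?", chunks, echo_first))  # prints the fallback message
```

Checking for retrieved context before calling the generator is what keeps the system from answering questions the document does not cover.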