Hi, I’m Farhan Mohammed, a Senior AI/ML & NLP Engineer with a passion for building practical AI systems that scale. I bring hands-on experience in Python-based application development and end-to-end ML/AI model deployment, with deep expertise in TensorFlow, PyTorch, scikit-learn, and Generative AI, including GPT, LLaMA, Mistral, and Claude.
I’ve shipped production-grade AI solutions across text, code, and knowledge retrieval, designing scalable APIs, ML workflows, and cloud-based deployments on AWS and GCP. I routinely apply fine-tuning (SFT, LoRA, PEFT), embeddings, prompt engineering, and efficient inference to deliver faster inference and measurable improvements in accuracy and throughput.
This project implements fine-tuning of the TinyLlama model for medical note generation using various PEFT (Parameter-Efficient Fine-Tuning) methods. The model is trained to generate concise, professional medical notes in SOAP (Subjective, Objective, Assessment, Plan) format from patient transcripts.
Features
- Fine-tuning of the TinyLlama-1.1B-Chat model for medical note generation
- Support for multiple PEFT methods:
  - LoRA (Low-Rank Adaptation)
  - Prefix Tuning
  - Prompt Tuning
- 8-bit quantization for efficient training
- Custom dataset handling for medical transcripts
- Chain-of-thought reasoning in medical note generation
PEFT Methods Explained
LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient fine-tuning method that adds small trainable rank decomposition matrices to existing weights. Instead of fine-tuning all parameters, LoRA only trains these small matrices, significantly reducing the number of trainable parameters while maintaining model performance. In this project, LoRA is applied to the key and value projection matrices of the attention layers.
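A minimal sketch of what this setup could look like with Hugging Face `transformers` and `peft`; the checkpoint name, module names (`k_proj`, `v_proj`), and hyperparameters are illustrative assumptions, not the project's exact values.

```python
# Illustrative sketch: 8-bit base model + LoRA on the attention key/value projections.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit training
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_cfg = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,                                  # rank of the decomposition matrices (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["k_proj", "v_proj"],   # key/value projections, as described above
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()         # only the small LoRA matrices are trainable
```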
Prefix Tuning
Prefix Tuning prepends trainable continuous vectors (prefixes) to the input sequence. These prefixes act as task-specific instructions that guide the model’s behavior. The method keeps the base model frozen and only trains these prefix vectors, making it highly parameter-efficient. In our implementation, we use 100 virtual tokens as prefixes.
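For comparison, a hedged sketch of the corresponding `peft` configuration; only the 100 virtual tokens come from the description above, the rest is illustrative.

```python
# Illustrative sketch: prefix tuning keeps the base model frozen and trains
# only the 100 virtual prefix tokens mentioned above.
from peft import PrefixTuningConfig, get_peft_model

prefix_cfg = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=100,   # prefix length used in this project
)
# `model` is the base model loaded as in the LoRA sketch (before LoRA wrapping).
prefix_model = get_peft_model(model, prefix_cfg)
prefix_model.print_trainable_parameters()
```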
Prompt Tuning
Similar to Prefix Tuning, Prompt Tuning adds trainable continuous vectors to the input. However, Prompt Tuning is simpler and typically uses fewer parameters. The main difference is that Prompt Tuning only adds vectors at the beginning of the sequence, while Prefix Tuning can add them at multiple layers. We use 600 virtual tokens for prompt tuning in this project.
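The analogous sketch for prompt tuning, again assuming the same base model; only the 600 virtual tokens come from the description above.

```python
# Illustrative sketch: prompt tuning adds trainable vectors only at the input sequence.
from peft import PromptTuningConfig, get_peft_model

prompt_cfg = PromptTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=600,   # prompt length used in this project
)
prompt_model = get_peft_model(model, prompt_cfg)
prompt_model.print_trainable_parameters()
```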
This project implements an advanced Retrieval-Augmented Generation (RAG) chatbot using LangChain, LangGraph, and a hybrid memory system. The chatbot is designed to answer questions based on a knowledge base while maintaining conversational context.
Features
- Retrieval-Augmented Generation (RAG): The chatbot retrieves relevant information from a knowledge base before generating a response, ensuring answers are grounded in provided data.
- LangGraph Orchestration: The entire workflow is orchestrated using LangGraph, providing a clear and maintainable structure for the different stages of processing a query.
- Hybrid Memory: The chatbot utilizes a hybrid memory system, combining both a traditional chat history and a vector-based memory for more nuanced context management.
- Long-Term Memory: Utilizes a vector database to store and access previous messages, enabling the chatbot to maintain context over extended conversations.
- Query Deconstruction: User queries are deconstructed into sub-queries to identify the core question and any necessary context from the chat history.
- Reciprocal Rank Fusion (RRF): Employs RRF to effectively combine and re-rank documents obtained from multiple sub-queries, ensuring the most relevant information is presented (see the sketch after this list).
- Debug Mode: A debug flag allows for streaming the output of each step in the LangGraph workflow, providing insight into the chatbot’s internal workings.
- Vector Store Management: The project includes functionality for creating and managing a vector store for the knowledge base.
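The RRF step from the feature list can be summarized in a small, self-contained sketch; the helper below is hypothetical and simply applies the standard RRF formula, score(d) = Σ 1/(k + rank), rather than the project's exact code.

```python
# Illustrative sketch of Reciprocal Rank Fusion: combine the ranked document lists
# returned for several sub-queries into a single re-ranked list.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """ranked_lists: list of lists of document IDs, each ordered by relevance."""
    scores = defaultdict(float)
    for docs in ranked_lists:
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += 1.0 / (k + rank)   # standard RRF scoring
    return sorted(scores, key=scores.get, reverse=True)

# Example: results retrieved for two sub-queries of one user question
fused = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],
    ["doc1", "doc5", "doc3"],
])
print(fused)  # documents ranked highly in several lists rise to the top
```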
This project implements a multi-task sentence transformer using the Hugging Face Transformers library. The model is designed to perform both text classification and Named Entity Recognition (NER), leveraging a shared BERT-based encoder.
The model architecture includes:
🔹 A shared BERT encoder that learns contextual embeddings for sentences and tokens.
🔹 Two task-specific heads — one for classification and another for NER (BIO tagging).
🔹 A flexible trainer that supports freezing specific components:
  - Freeze the encoder → fine-tune only the task heads for faster convergence.
  - Freeze one head → focus on improving the other task (e.g., NER).
This selective freezing mechanism allows targeted improvement on individual tasks while retaining shared knowledge — an efficient way to optimize downstream performance.
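A minimal PyTorch-style sketch of this selective freezing; the class, attribute names, and the `bert-base-uncased` checkpoint are stand-ins for the actual implementation.

```python
# Illustrative sketch: shared encoder with two task heads, plus selective freezing.
import torch.nn as nn
from transformers import AutoModel

class MultiTaskModel(nn.Module):
    """Hypothetical stand-in for the project's shared-encoder architecture."""
    def __init__(self, num_classes: int, num_ner_tags: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")   # assumed checkpoint
        hidden = self.encoder.config.hidden_size
        self.classifier_head = nn.Linear(hidden, num_classes)   # sentence classification
        self.ner_head = nn.Linear(hidden, num_ner_tags)         # token-level BIO tagging

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

model = MultiTaskModel(num_classes=3, num_ner_tags=9)

# Freeze the shared encoder -> fine-tune only the task heads:
set_trainable(model.encoder, False)

# Or freeze one head (e.g., classification) to focus training on NER:
set_trainable(model.classifier_head, False)
```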
This project builds a chatbot using LangChain and OpenAI to answer questions about the IMDB Top 1000 movies dataset. It leverages a combination of semantic search (using ChromaDB) and structured querying (using Pandas DataFrames) to provide comprehensive and insightful responses.
Features
- Semantic Search: Ask questions about movie plots and themes using natural language.
- Structured Querying: Filter and analyze movies based on various criteria like genre, release year, rating, director, and more (both retrieval paths are sketched after this list).
- Sub-Query Handling: Breaks down complex user queries into smaller, manageable sub-queries for efficient processing.
- Intermediate Data Storage: Stores and utilizes intermediate results for subsequent queries, enabling multi-step analysis.
- Chain of Thought Reasoning: The agent employs a chain of thought process for complex query resolution.
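A hedged sketch of the two retrieval paths; the file name and column names are assumptions about the IMDB Top 1000 CSV, and the code is illustrative rather than the project's own.

```python
# Illustrative sketch: structured querying with Pandas + semantic search with ChromaDB.
import pandas as pd
import chromadb

df = pd.read_csv("imdb_top_1000.csv")  # assumed file and column names

# Structured querying: filter and analyze with Pandas.
top_nolan = df[(df["Director"] == "Christopher Nolan") & (df["IMDB_Rating"] >= 8.5)]

# Semantic search: index plot summaries in ChromaDB and query in natural language.
client = chromadb.Client()
plots = client.create_collection("movie_plots")
plots.add(
    ids=df.index.astype(str).tolist(),
    documents=df["Overview"].tolist(),
    metadatas=df[["Series_Title", "Genre"]].to_dict("records"),
)
hits = plots.query(query_texts=["movies about dreams within dreams"], n_results=5)
print(hits["documents"][0])
```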
This project implements a Retrieval Augmented Generation (RAG) Question Answering System. It answers user questions based on the content of a provided markdown document, using modern NLP techniques (embeddings, vector search, and LLMs).
Main Components
Document Loading & Splitting: Loads a markdown file and splits it into manageable text chunks for processing.
Embedding & Vector Store: Uses a Sentence Transformer model (all-MiniLM-L6-v2) to generate embeddings for each chunk, stored in a FAISS vector database for efficient similarity search.
Retrieval: Retrieves the most relevant document chunks using vector similarity when a question is asked.
Question Answering: Uses OpenAI’s GPT-4o model to generate answers when relevant context is found; otherwise it returns a fallback message.
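A compact, illustrative sketch of the pipeline described above; the file name, naive chunking strategy, and similarity threshold are assumptions, not the project's exact choices.

```python
# Illustrative sketch: MiniLM embeddings + FAISS retrieval + GPT-4o answer generation,
# with a fallback message when nothing relevant is retrieved.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

chunks = open("knowledge_base.md", encoding="utf-8").read().split("\n\n")  # naive chunking

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])   # inner product == cosine on normalized vectors
index.add(np.asarray(vectors, dtype="float32"))

def answer(question: str, k: int = 3, min_score: float = 0.3) -> str:
    q = embedder.encode([question], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(q, k)
    context = [chunks[i] for i, s in zip(ids[0], scores[0]) if s >= min_score]
    if not context:
        return "Sorry, I couldn't find relevant information in the document."  # fallback
    completion = OpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": "\n\n".join(context) + "\n\nQuestion: " + question},
        ],
    )
    return completion.choices[0].message.content
```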