I am a data scientist with 7+ years of experience delivering data science and AI projects. I analyze data and implement machine learning algorithms to surface meaningful insights, using Python and SQL to turn data into action. I enjoy building scalable solutions that improve business outcomes, and I communicate complex results clearly to both technical and non-technical stakeholders.
In my leadership roles as Co-Founder and Data Lead, I guide the development of AI-powered products, dashboards, and data platforms. I collaborate with cross-functional teams to translate business needs into practical, data-driven solutions while managing AI usage and costs in line with the company vision.
- Retention impact: identify customers likely to churn so the business can prioritize outreach, offers, or service recovery.
- Clear success measure: optimize for Recall on churners, because missing a true churner can be more costly than contacting a customer who would have stayed.
- Repeatable workflow: a structured project that can be re-run as new data arrives.
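Since the project optimizes for Recall on churners, a minimal sketch of how that metric is computed may help; the labels and predictions below are toy values, not output from the Telco dataset.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = true positives / (true positives + false negatives):
    the share of actual churners the model manages to catch."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy example: 4 actual churners, the model catches 3 of them.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]
print(recall(y_true, y_pred))  # 0.75
```

Note that the false positive at the end does not lower Recall at all; that asymmetry is exactly why the metric fits the stated business trade-off.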
Business understanding
Define the decision the business will make (who to target) and the primary metric (Recall).
Data collection and preparation
Use the Telco Customer Churn dataset (about 7k customers) and ensure the data is clean and usable for modeling.
Model build (baseline)
Train an initial model to establish baseline performance and generate early churn risk scores.
Optimization
Improve results through better feature choices and parameter tuning, guided by validation results and business trade-offs.
Deployment concept
Package the outcome so it can be used consistently (for example: scoring new customers regularly and producing a prioritized retention list).
- A churn risk scoring output (customer-level risk)
- A short performance summary with Recall emphasized
- A documented workflow that can be extended (new features, new models, monitoring)
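The "churn risk scoring output" and "prioritized retention list" deliverables can be sketched as a simple post-processing step; the customer IDs and scores below are illustrative placeholders, not real model output.

```python
def prioritized_retention_list(scores, top_n=None, threshold=0.5):
    """Sort customers by churn risk (descending) and keep those above a
    threshold, so outreach capacity goes to the riskiest customers first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    shortlist = [(cid, s) for cid, s in ranked if s >= threshold]
    return shortlist[:top_n] if top_n else shortlist

# Hypothetical customer-level risk scores from the trained model.
risk_scores = {"C001": 0.91, "C002": 0.12, "C003": 0.67, "C004": 0.48}
print(prioritized_retention_list(risk_scores, top_n=2))
# [('C001', 0.91), ('C003', 0.67)]
```

Re-running this step on freshly scored data each cycle is what makes the workflow repeatable as new customers arrive.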
- KDnuggets: Step-by-Step Tutorial to Building Your First Machine Learning Model (June 10, 2024)
- Dataset: Telco Customer Churn (Kaggle)
End-to-End Machine Learning Project: Telco Customer Churn Prediction
A beginner-friendly, business-oriented machine learning project that builds a churn prediction model from start to finish. The goal is not just “training a model”, but delivering something a company could actually use: a repeatable process that turns customer data into actionable churn risk signals.
What we want to achieve
Project structure (end-to-end)
Deliverables
References
- Voice input (recording): Record microphone audio to a WAV file.
- Speech-to-text (local transcription): Transcribe the audio locally using Whisper.
- Wake word activation: Gate the system behind a wake word, detected via embedding similarity to remain robust to transcription variations.
- Knowledge base preparation:
- Load and extract text from a PDF handbook.
- Split the text into overlapping chunks for retrieval.
- Embed each chunk and store it in a local vector database.
- Retrieval: Embed the user query and retrieve the top-k most relevant chunks from the vector store.
- Generation (local LLM): Generate a response using a local chat model, grounded by the retrieved chunks.
- Text-to-speech (local TTS): Convert the generated response to speech, save it as an audio file, and play it back.
Voice Receiver and Transcription
- Audio recording and playback
- Local transcription for user input
- Wake word detection via cosine similarity between text embeddings
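A minimal sketch of the embedding-similarity gate: in the real system the vectors would come from a sentence-embedding model such as all-MiniLM-L6-v2; here toy vectors stand in so the matching logic stays visible.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def wake_word_detected(transcript_emb, wake_emb, threshold=0.8):
    """Gate the pipeline: proceed only when the transcribed phrase is close
    enough to the wake phrase in embedding space, which tolerates small
    transcription variations that exact string matching would reject."""
    return cosine_similarity(transcript_emb, wake_emb) >= threshold

# Toy embeddings: the wake phrase vs. a near-miss transcription of it.
wake = [0.9, 0.1, 0.3]
heard = [0.88, 0.12, 0.28]
print(wake_word_detected(heard, wake))  # True
```

The threshold value is an assumption to tune: too low triggers on unrelated speech, too high rejects slightly garbled transcriptions.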
Knowledge Base
- PDF extraction
- Chunking with overlap
- Local vector storage and retrieval
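The chunking-with-overlap step can be sketched in a few lines of character-based splitting; this loosely mirrors what RecursiveCharacterTextSplitter does, though the real splitter also prefers natural boundaries such as paragraphs and sentences.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character chunks so that content spanning
    a chunk boundary is fully contained in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

text = "x" * 1200
chunks = chunk_text(text)
print(len(chunks), [len(c) for c in chunks])  # 3 [500, 500, 300]
```

Each chunk then gets embedded and stored in the local vector database, with the overlap ensuring retrieval does not miss facts that straddle a boundary.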
Audio File Response Generation
- Local response generation
- Local text-to-speech and playback
- Audio I/O: sounddevice
- Speech-to-text: Whisper (base.en)
- Embeddings: Sentence Transformers (all-MiniLM-L6-v2)
- Vector DB: ChromaDB (local)
- Chunking: RecursiveCharacterTextSplitter
- Similarity: cosine similarity (sklearn.metrics.pairwise)
- Local LLM: Qwen/Qwen1.5-0.5B-Chat (Hugging Face transformers)
- Text-to-speech: suno/bark-small (Hugging Face transformers)
Creating a Useful Voice-Activated Fully Local RAG System
A fully local, voice-first Retrieval-Augmented Generation (RAG) assistant that runs end-to-end on your machine. The system listens for a wake word, records and transcribes speech locally, retrieves relevant context from a PDF knowledge base using vector similarity search, generates a grounded answer with a local LLM, and converts the answer into a playable audio response.
What this project does
Core components
Tech stack
Installation
Create and activate a virtual environment:
python -m venv rag-env-audio
# Windows: rag-env-audio\Scripts\activate
# macOS/Linux: source rag-env-audio/bin/activate
Suggested project structure
.
├─ app.py # voice → RAG → audio response pipeline
├─ dataset/
│ └─ Insurance_Handbook_20103.pdf
└─ chroma_db/ # persisted vector store (generated)
- Ingest: Extract text from one or more PDF files.
- Chunk: Split raw text into overlapping, semantically meaningful chunks.
- Embed + Index: Convert chunks into vector embeddings and store them in a local vector database.
- Retrieve: For each user query, fetch the top-k most similar chunks via vector similarity search.
- Generate: Provide the retrieved context to an LLM to produce a grounded answer.
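The Retrieve step above reduces to a nearest-neighbour search over chunk embeddings. A minimal sketch, using toy 2-D vectors in place of real model embeddings and a brute-force scan instead of Chroma's index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query_emb, chunk_embs, k=5):
    """Rank all chunk embeddings by similarity to the query embedding
    and return the indices of the top-k most similar chunks."""
    ranked = sorted(range(len(chunk_embs)),
                    key=lambda i: cosine(query_emb, chunk_embs[i]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings: chunk 1 points in exactly the same direction as the query.
chunk_embs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query_emb = [0.6, 0.8]
print(retrieve_top_k(query_emb, chunk_embs, k=2))  # [1, 2]
```

The retrieved chunk texts (not shown here) are then concatenated into the prompt for the Generate step, which is what grounds the LLM's answer.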
- PDF ingestion with PyPDF2
- Chunking with LangChain RecursiveCharacterTextSplitter
- Embedding generation with Sentence Transformers (all-MiniLM-L6-v2)
- Vector storage and retrieval with Chroma (persistent local store for prototyping)
- LLM integration with LiteLLM (example: Gemini gemini-1.5-flash)
- Chunk size: 500
- Chunk overlap: 50
- Retrieval: top_k = 5
- Vector DB persistence path: chroma_db
- Example collection name: knowledge_base
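These defaults can be collected into a single configuration object; the dictionary below simply mirrors the values listed above (the variable name is illustrative).

```python
# Default pipeline configuration, mirroring the documented values.
RAG_CONFIG = {
    "chunk_size": 500,             # characters per chunk
    "chunk_overlap": 50,           # characters shared between adjacent chunks
    "top_k": 5,                    # chunks retrieved per query
    "persist_path": "chroma_db",   # local vector DB directory
    "collection_name": "knowledge_base",
}

# Overlap must be smaller than the chunk size or splitting never advances.
assert RAG_CONFIG["chunk_overlap"] < RAG_CONFIG["chunk_size"]
```

Keeping these values in one place makes it easy to experiment with chunk size and top_k, the two knobs that most directly affect retrieval quality.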
- A clean baseline for learning and demos of RAG fundamentals
- Internal knowledge-base Q&A prototypes (policies, manuals, SOPs)
- A starting point to add reranking, evaluation, guardrails, and observability
Simple RAG Implementation With Contextual Semantic Search
A minimal, end-to-end reference implementation of a Retrieval-Augmented Generation (RAG) pipeline that grounds LLM responses in your documents using contextual semantic search. This project demonstrates how to reduce hallucinations by retrieving the most relevant passages from a PDF knowledge base and injecting them as context at generation time.
What this project does
Key capabilities
Default configuration
Installation
pip install -q chromadb pypdf2 sentence-transformers litellm langchain
Environment variables
export HUGGINGFACE_TOKEN="YOUR_TOKEN"
export GEMINI_API_KEY="YOUR_KEY"
Suggested project structure
.
├─ dataset/ # PDF files used as the knowledge base
├─ chroma_db/ # local persisted vector store (generated)
├─ notebooks/ # tutorial / experiments
└─ src/ # reusable pipeline components (optional refactor)
Intended use