Cornellius Yudha Wijaya

Available to hire

I am a data scientist with 7+ years of experience delivering value through data science and AI projects. I analyze data and implement machine learning algorithms to gain meaningful insights, leveraging Python and SQL to turn data into action. I enjoy building scalable solutions that impact business outcomes, and I communicate complex results clearly to both technical and non-technical stakeholders.

In my leadership roles as Co-Founder and Data Lead, I guide the development of AI-powered products, dashboards, and data platforms. I collaborate with cross-functional teams to translate business needs into practical data-driven solutions, while managing AI usage and costs to align with the company vision.

Experience Level

Expert

Language

Indonesian
Fluent
English
Advanced

Work Experience

Co-Founder (Chief Product Officer and Data Lead) at GoArif Startup
January 1, 2024 - Present
Led the development of Data Science Dashboard features, including data visualization, statistical analysis, AI Agent development, Speech-to-Text, and an LLM-based chat platform with RAG capability. Produced Product Requirements Documents and steered the AI product to align with the company vision. Managed AI product usage and related costs. Collaborated with six cross-functional team members (developers, designers, and a project manager) to deliver four major features roughly 15% ahead of schedule.
AI Engineer (Contract-based) at PT. Sobat Sepadan
February 1, 2024 - June 1, 2025
Spearheaded the development of a Financial-AI application leveraging LLM services (OpenAI GPT-4, Gemini), achieving 98% accuracy in business classification and reducing manual processing time by 50%. Designed and deployed scalable AI architectures for three core product features (document classification, extraction, and summarization) in full alignment with the company vision. Ensured AI models integrated seamlessly with the existing system and translated business requirements into working solutions.
Data Scientist Assistant Manager at Allianz Life Indonesia
June 1, 2020 - December 1, 2024
Built an end-to-end data science working framework and implemented its MLOps design. Implemented and supervised ML models that improved business quality, including Propensity-to-Buy models (reaching ~150% of the annual target), a customer segmentation model with actionable insights to support Propensity-to-Buy, churn prediction that reduced churn by 3%, and a fraud detection project that reduced claim fraud by 1%. Led and mentored junior data scientists, developed MicroStrategy dashboards for business users, and implemented AWS cloud-based data science services for end-to-end use cases.

Education

Master of Science in Evolutionary Biology at Uppsala University
August 1, 2016 - June 1, 2018
Bachelor of Science in Biology at Universitas Gadjah Mada
August 1, 2011 - October 1, 2015

Industry Experience

Software & Internet, Professional Services, Financial Services

    End-to-End Machine Learning Project: Telco Customer Churn Prediction

    A beginner-friendly, business-oriented machine learning project that builds a churn prediction model from start to finish. The goal is not just “training a model”, but delivering something a company could actually use: a repeatable process that turns customer data into actionable churn risk signals.

    What we want to achieve

    • Retention impact: identify customers likely to churn so the business can prioritize outreach, offers, or service recovery.
    • Clear success measure: optimize for Recall on churners, because missing a true churner can be more costly than contacting a customer who would have stayed (a short metric illustration follows this list).
    • Repeatable workflow: a structured project that can be re-run as new data arrives.
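
    As a quick illustration of why Recall is the primary metric, here is a toy example with made-up labels (illustrative only, not project results):

    from sklearn.metrics import precision_score, recall_score

    # Toy labels: 1 = churned, 0 = stayed (made up for illustration)
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

    # Recall = share of true churners we actually caught: TP / (TP + FN)
    print("Recall:", recall_score(y_true, y_pred))        # 3 of 4 churners caught -> 0.75
    # Precision shows the trade-off: how many flagged customers truly churn
    print("Precision:", precision_score(y_true, y_pred))  # 3 of 4 flags correct -> 0.75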

    Project structure (end-to-end)

    1. Business understanding
      Define the decision the business will make (who to target) and the primary metric (Recall).

    2. Data collection and preparation
      Use the Telco Customer Churn dataset (about 7k customers) and ensure the data is clean and usable for modeling.

    3. Model build (baseline)
      Train an initial model to establish baseline performance and generate early churn risk scores (a baseline sketch follows this list).

    4. Optimization
      Improve results through better feature choices and parameter tuning, guided by validation results and business trade-offs.

    5. Deployment concept
      Package the outcome so it can be used consistently (for example: scoring new customers regularly and producing a prioritized retention list).
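
    A minimal baseline sketch covering steps 2 through 5, assuming the public IBM/Kaggle Telco CSV and its column names; the file name, model choice, and 0.5 threshold are illustrative assumptions, not the project's actual implementation:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import recall_score
    from sklearn.model_selection import train_test_split

    # Assumed file and column names from the public IBM/Kaggle Telco dataset
    df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce").fillna(0)
    y = (df["Churn"] == "Yes").astype(int)
    X = pd.get_dummies(df.drop(columns=["customerID", "Churn"]), drop_first=True)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # Simple, interpretable baseline before any optimization
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Customer-level churn risk scores and the primary metric
    risk = model.predict_proba(X_test)[:, 1]
    print("Recall on churners:", recall_score(y_test, risk >= 0.5))

    # Deployment concept: a prioritized retention list, highest risk first
    retention_list = (
        X_test.assign(churn_risk=risk).sort_values("churn_risk", ascending=False)
    )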

    Deliverables

    • A churn risk scoring output (customer-level risk)
    • A short performance summary with Recall emphasized
    • A documented workflow that can be extended (new features, new models, monitoring)

    Creating a Useful Voice-Activated Fully Local RAG System

    A fully local, voice-first Retrieval-Augmented Generation (RAG) assistant that runs end-to-end on your machine. The system listens for a wake word, records and transcribes speech locally, retrieves relevant context from a PDF knowledge base using vector similarity search, generates a grounded answer with a local LLM, and converts the answer into a playable audio response.

    What this project does

    1. Voice input (recording): Record microphone audio to a WAV file.
    2. Speech-to-text (local transcription): Transcribe the audio locally using Whisper (steps 1 and 2 are sketched after this list).
    3. Wake word activation: Gate the system behind a wake word, detected via embedding similarity to remain robust to transcription variations.
    4. Knowledge base preparation:
      • Load and extract text from a PDF handbook.
      • Split the text into overlapping chunks for retrieval.
      • Embed each chunk and store it in a local vector database.
    5. Retrieval: Embed the user query and retrieve the top-k most relevant chunks from the vector store.
    6. Generation (local LLM): Generate a response using a local chat model, grounded by the retrieved chunks.
    7. Text-to-speech (local TTS): Convert the generated response to speech, save it as an audio file, and play it back.
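
    A minimal sketch of steps 1 and 2, assuming the openai-whisper package and soundfile for WAV writing; the fixed 5-second capture is a simplification, not the project's actual recording logic:

    import sounddevice as sd
    import soundfile as sf
    import whisper  # openai-whisper; needs ffmpeg on PATH to load audio files

    SAMPLE_RATE = 16000  # Whisper expects 16 kHz mono audio
    DURATION_S = 5       # fixed-length capture for simplicity

    # 1. Record microphone audio to a WAV file
    audio = sd.rec(int(DURATION_S * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()  # block until recording finishes
    sf.write("input.wav", audio, SAMPLE_RATE)

    # 2. Transcribe locally with the base.en Whisper model
    model = whisper.load_model("base.en")
    text = model.transcribe("input.wav")["text"].strip()
    print("Heard:", text)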

    Core components

    • Voice Receiver and Transcription

      • Audio recording and playback
      • Local transcription for user input
      • Wake word detection using embedding similarity + cosine similarity (sketched after this list)
    • Knowledge Base

      • PDF extraction
      • Chunking with overlap
      • Local vector storage and retrieval
    • Audio File Response Generation

      • Local response generation
      • Local text-to-speech and playback
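
    A minimal sketch of the wake-word gate, using the same embedding model and cosine similarity named in the tech stack below; the wake phrase and threshold are illustrative assumptions:

    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    WAKE_WORD = "hey assistant"  # hypothetical wake phrase
    THRESHOLD = 0.6              # assumed cut-off; tune on real transcripts

    wake_emb = embedder.encode([WAKE_WORD])

    def is_wake_word(transcript: str) -> bool:
        """Compare the transcript to the wake phrase in embedding space, so
        variations like 'hey, assistant!' still activate the system."""
        sim = cosine_similarity(embedder.encode([transcript]), wake_emb)[0][0]
        return sim >= THRESHOLD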

    Tech stack

    • Audio I/O: sounddevice
    • Speech-to-text: Whisper (base.en)
    • Embeddings: Sentence Transformers (all-MiniLM-L6-v2)
    • Vector DB: ChromaDB (local)
    • Chunking: RecursiveCharacterTextSplitter
    • Similarity: cosine similarity (sklearn.metrics.pairwise)
    • Local LLM: Qwen/Qwen1.5-0.5B-Chat (Hugging Face transformers)
    • Text-to-speech: suno/bark-small (Hugging Face transformers; a playback sketch follows this list)
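
    A minimal sketch of local playback for step 7, assuming suno/bark-small runs through the transformers text-to-speech pipeline; field names follow that pipeline's output dict, and this is an illustration rather than the project's code:

    import sounddevice as sd
    import soundfile as sf
    from transformers import pipeline

    # Local TTS with bark-small via the transformers text-to-speech pipeline
    synthesiser = pipeline("text-to-speech", model="suno/bark-small")
    speech = synthesiser("Here is the answer from the knowledge base.")

    # Save the response as an audio file, then play it back
    sf.write("response.wav", speech["audio"].squeeze(), speech["sampling_rate"])
    sd.play(speech["audio"].squeeze(), speech["sampling_rate"])
    sd.wait()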

    Installation

    Create and activate a virtual environment:

    python -m venv rag-env-audio
    # Windows: rag-env-audio\Scripts\activate
    # macOS/Linux: source rag-env-audio/bin/activate
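
    The original setup stops at the virtual environment; a plausible dependency install, inferred from the tech stack above rather than taken from the source:

    pip install sounddevice soundfile openai-whisper sentence-transformers chromadb langchain transformers torch scikit-learn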
    

    Suggested project structure

    .
    ├─ app.py                 # voice → RAG → audio response pipeline
    ├─ dataset/
    │  └─ Insurance_Handbook_20103.pdf
    └─ chroma_db/             # persisted vector store (generated)
    

    Simple RAG Implementation With Contextual Semantic Search

    A minimal, end-to-end reference implementation of a Retrieval-Augmented Generation (RAG) pipeline that grounds LLM responses in your documents using contextual semantic search. This project demonstrates how to reduce hallucinations by retrieving the most relevant passages from a PDF knowledge base and injecting them as context at generation time.

    What this project does

    1. Ingest: Extract text from one or more PDF files.
    2. Chunk: Split raw text into overlapping, semantically meaningful chunks.
    3. Embed + Index: Convert chunks into vector embeddings and store them in a local vector database.
    4. Retrieve: For each user query, fetch the top-k most similar chunks via vector similarity search (steps 1 through 4 are sketched after this list).
    5. Generate: Provide the retrieved context to an LLM to produce a grounded answer.
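
    A minimal sketch of steps 1 through 4 under the default configuration below; the PDF file name and query are illustrative, and the rest mirrors the stack listed under Key capabilities:

    import chromadb
    from PyPDF2 import PdfReader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from sentence_transformers import SentenceTransformer

    # 1. Ingest: extract raw text from a PDF (hypothetical file name)
    reader = PdfReader("dataset/handbook.pdf")
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # 2. Chunk: overlapping splits, matching the default configuration below
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_text(text)

    # 3. Embed + Index: persist vectors in a local Chroma collection
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    client = chromadb.PersistentClient(path="chroma_db")
    collection = client.get_or_create_collection("knowledge_base")
    collection.add(
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embedder.encode(chunks).tolist(),
    )

    # 4. Retrieve: top-k most similar chunks for an example query
    query = "What does the handbook say about claims?"
    results = collection.query(
        query_embeddings=embedder.encode([query]).tolist(), n_results=5
    )
    context = "\n\n".join(results["documents"][0])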

    Key capabilities

    • PDF ingestion with PyPDF2
    • Chunking with LangChain RecursiveCharacterTextSplitter
    • Embedding generation with Sentence Transformers (all-MiniLM-L6-v2)
    • Vector storage and retrieval with Chroma (persistent local store for prototyping)
    • LLM integration with LiteLLM (example: Gemini gemini-1.5-flash; a generation sketch follows this list)
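
    A minimal sketch of step 5 with LiteLLM and the example Gemini model; the placeholder query and context would come from the retrieval sketch above, and the prompt wording is an assumption:

    from litellm import completion

    # Placeholders; in the full pipeline these come from the retrieval step
    query = "What does the handbook say about claims?"
    context = "...top-5 retrieved chunks joined together..."

    # Ground the answer in the retrieved context
    # (assumes GEMINI_API_KEY is set, as in the environment variables below)
    prompt = (
        "Answer using only the context below. If the answer is not in the "
        f"context, say so.\n\nContext:\n{context}\n\nQuestion: {query}"
    )
    response = completion(
        model="gemini/gemini-1.5-flash",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)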

    Default configuration

    • Chunk size: 500
    • Chunk overlap: 50
    • Retrieval: top_k = 5
    • Vector DB persistence path: chroma_db
    • Example collection name: knowledge_base

    Installation

    pip install -q chromadb pypdf2 sentence-transformers litellm langchain
    

    Environment variables

    export HUGGINGFACE_TOKEN="YOUR_TOKEN"
    export GEMINI_API_KEY="YOUR_KEY"
    

    Suggested project structure

    .
    ├─ dataset/          # PDF files used as the knowledge base
    ├─ chroma_db/        # local persisted vector store (generated)
    ├─ notebooks/        # tutorial / experiments
    └─ src/              # reusable pipeline components (optional refactor)
    

    Intended use

    • A clean baseline for learning and demos of RAG fundamentals
    • Internal knowledge-base Q&A prototypes (policies, manuals, SOPs)
    • A starting point to add reranking, evaluation, guardrails, and observability
