01

PDF Ingestion

The BIS SP 21 PDF is parsed with PyMuPDF or pdfplumber. Each BIS standard is extracted as a discrete document that preserves its IS number, title, and scope.

PyMuPDF
pdfplumber
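
As a rough illustration of the "one document per standard" step, the sketch below splits already-extracted text (for example, the string returned by PyMuPDF's `page.get_text()`) into per-standard records. The heading pattern, the `Standard` dataclass, and `split_standards` are assumptions made for this example; the real SP 21 layout may need a different pattern.

```python
import re
from dataclasses import dataclass

# Assumed heading shape, one per line: "IS 269 : 2015 Ordinary Portland Cement"
# (optionally "IS 1489 (Part 1) : 2015 ..."). Adjust to the real PDF layout.
HEADING = re.compile(
    r"^IS[ \t]+(\d+)(?:[ \t]*\(Part[ \t]+(\d+)\))?[ \t]*:[ \t]*(\d{4})[ \t]+(.+)$",
    re.MULTILINE,
)

@dataclass
class Standard:
    is_number: str  # normalised, e.g. "IS 269:2015"
    title: str
    body: str       # text up to the next heading (scope, clauses, ...)

def split_standards(text: str) -> list[Standard]:
    """Cut the extracted text at every heading that looks like an IS number."""
    matches = list(HEADING.finditer(text))
    standards = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        number, part, year, title = m.groups()
        is_number = f"IS {number}" + (f" (Part {part})" if part else "") + f":{year}"
        standards.append(Standard(is_number, title.strip(), text[m.end():end].strip()))
    return standards
```

Normalising the IS number at ingestion time gives later stages a single canonical key to match against.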
02

Chunking Strategy

Hierarchical chunking: the parent chunk is the full standard, and the child chunks are its individual clauses. 100-token overlap windows prevent context loss at chunk boundaries.

LangChain
100-token overlap
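
A minimal sketch of the parent/child idea with overlapping windows follows. It makes several simplifying assumptions: tokens are approximated by whitespace-separated words, clause detection is omitted so the children are fixed-size windows rather than true clauses, and the 400-token window size and both function names are illustrative only.

```python
def window_chunks(text: str, size: int = 400, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    tokens = text.split()  # crude stand-in for a real tokenizer
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):  # last window already reaches the end
            break
    return chunks

def hierarchical_chunks(is_number: str, body: str, size: int = 400, overlap: int = 100):
    """Return one parent record (full standard) and its child windows."""
    parent = {"id": is_number, "text": body}
    children = [
        {"id": f"{is_number}#{i}", "parent_id": is_number, "text": chunk}
        for i, chunk in enumerate(window_chunks(body, size, overlap))
    ]
    return parent, children
```

Keeping `parent_id` on every child lets retrieval match on a small chunk and still return the full standard.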
03

Embeddings & Vector Store

Chunks are encoded with SentenceTransformers or OpenAI embeddings and indexed in FAISS or ChromaDB for millisecond-scale cosine similarity search.

SentenceTransformers
FAISS / ChromaDB
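
To make the search step concrete, here is a dependency-free sketch of what a cosine similarity lookup computes. It is not how FAISS or ChromaDB are called; the toy vectors, the dictionary used as an "index", and the function names are assumptions for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3):
    """Brute-force search: score every stored vector, keep the k best."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A vector store does the same ranking, but with data structures that avoid scoring every vector on each query.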
04

Hybrid Retrieval

Dense vector retrieval is fused with BM25 keyword scoring for precision on IS numbers. A hallucination guard verifies every retrieved IS number against the index.

BM25
Cosine similarity
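
The page does not say how the two rankings are fused, so the sketch below uses reciprocal rank fusion (RRF) as one common option; that choice, the constant `k=60`, and both function names are assumptions. The guard simply drops any IS number that is not in the set of numbers known to the index.

```python
def rrf_fuse(dense_ranked: list[str], bm25_ranked: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = {}
    for ranking in (dense_ranked, bm25_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hallucination_guard(candidates: list[str], known_is_numbers: set[str]) -> list[str]:
    """Keep only IS numbers that actually exist in the index."""
    return [c for c in candidates if c in known_is_numbers]
```

For example, fusing a dense ranking with a BM25 ranking and then filtering against the index removes any identifier the index has never seen, whichever retriever produced it.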
05

LLM Generation

The top-K candidates are passed to Claude, GPT-4o, or Mistral for re-ranking with a natural-language rationale. The model returns the top 3–5 standards as structured JSON.

Claude / GPT-4o
Mistral
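
Model output should be checked before it is returned to the user. The sketch below parses a JSON reply and keeps only well-formed entries. The schema (a `standards` list whose items carry `is_number`, `title`, and `rationale`) is an assumption for this example, since the page does not show the actual format.

```python
import json

REQUIRED_KEYS = ("is_number", "title", "rationale")  # assumed schema

def parse_llm_output(raw: str, max_results: int = 5) -> list[dict]:
    """Return up to `max_results` valid entries; [] if the reply is unusable."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []
    items = payload.get("standards") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        return []
    valid = [
        item for item in items
        if isinstance(item, dict)
        and all(isinstance(item.get(key), str) and item[key].strip() for key in REQUIRED_KEYS)
    ]
    return valid[:max_results]
```

Returning an empty list on malformed output lets the caller fall back to the retrieval ranking instead of failing the request.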
06

API Layer

Served via FastAPI and inference.py. Optimised for consumer hardware, with a target of under 5 seconds end-to-end latency. The REST API returns ranked standards with a rationale for each.

FastAPI
inference.py
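
One way to check the latency target is to time the whole call from query to response. In the sketch below, `pipeline` is a hypothetical stand-in for whatever callable inference.py exposes; only the standard library is used.

```python
import time

def timed(pipeline, query: str):
    """Run the pipeline once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = pipeline(query)
    return result, time.perf_counter() - start

def average_latency(pipeline, queries: list[str]) -> float:
    """Mean end-to-end time per query: total_time / num_queries."""
    return sum(timed(pipeline, q)[1] for q in queries) / len(queries)
```

Measuring around the full call, not only the model step, keeps the number comparable to what an API client would experience.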
Pipeline: BIS SP 21 PDF → Chunk & Embed → Vector Store → Semantic Retrieval → LLM Generation → Standards Output
Hit Rate @3 (target: >80%): at least 1 correct standard appears in the top-3 results
MRR @5 (target: >0.7): Mean Reciprocal Rank of the first correct result in the top-5
Avg Latency (target: <5 sec): average end-to-end query response time
No Hallucinations (100% clean responses): IS numbers are verified against the index, and unseen numbers are suppressed
Metric Definitions
| Metric | Definition | Formula |
|---|---|---|
| Hit Rate @3 | ≥1 correct standard in top-3 results | (correct_queries / total) × 100 |
| MRR @5 | Mean reciprocal rank of first correct result in top-5 | Σ(1/rank_i) / N |
| Avg Latency | Average time per query end-to-end | total_time / num_queries |
| Hallucination Rate | % of responses with fabricated IS numbers | 1 − (verified / total) |
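
The metric definitions above translate almost directly into code. The sketch below assumes `results` maps each query to its ranked list of IS numbers and `gold` maps each query to the set of correct ones; these names and data shapes are illustrative.

```python
def hit_rate_at_k(results: dict[str, list[str]], gold: dict[str, set[str]], k: int = 3) -> float:
    """Percentage of queries with at least one correct standard in the top k."""
    hits = sum(1 for q, ranked in results.items() if set(ranked[:k]) & gold[q])
    return 100.0 * hits / len(results)

def mrr_at_k(results: dict[str, list[str]], gold: dict[str, set[str]], k: int = 5) -> float:
    """Mean of 1/rank of the first correct result within the top k (0 if none)."""
    total = 0.0
    for q, ranked in results.items():
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in gold[q]:
                total += 1.0 / rank
                break
    return total / len(results)

def hallucination_rate(verified: int, total: int) -> float:
    """Fraction of responses containing unverified IS numbers: 1 - verified/total."""
    return 1.0 - verified / total
```

A query with no correct result in the top k contributes 0 to both Hit Rate and MRR, so both scores are averaged over all queries rather than only the successful ones.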
Impact on MSEs

Weeks → Seconds

Compliance discovery time is cut from weeks of manual search to under 5 seconds.

💰

Zero Cost

The open-source stack runs on a consumer GPU, so no expensive infrastructure is needed.

🔄

80%+ Automation

Automates the majority of manual standard lookup effort, freeing compliance teams.

🏭

6.3 Cr MSEs

Serves the 63 million small businesses in India that face a compliance burden.

The Problem

Weeks
spent identifying applicable BIS standards
1000+
BIS standards covering building materials alone
6.3 Cr
MSEs in India facing compliance burden
The Team — Code_ninja
Manjeet Yadav: Team Leader · ML & RAG
Nikhil Kumar: Data Ingestion & Chunking
Nitin Jangra: Backend & Inference API
Team Member 4: UI/UX & Presentation
Acknowledgements
Bureau of Indian Standards — for the SP 21 dataset and mission to support Indian MSEs
HuggingFace — SentenceTransformers and open model ecosystem
LangChain — RAG pipeline orchestration framework
FAISS — High-performance vector similarity search