6. Ingestion & Embeddings (No Steps Skipped)
Ingestion flow
How data moves through the pipeline
The ingest path starts with raw files and progressively turns them into search-ready knowledge.
- Collect: source files are picked up from the shared ingest location.
- Prepare: content is split into clean chunks so retrieval can target precise passages.
- Encode: chunks are transformed into embeddings for semantic matching via Ollama (default embedding model: `bge-m3`).
- Index: vectors are stored in Qdrant to power fast similarity search.
- Enrich (optional): when graph mode is on, triplets are extracted with `GRAPH_OLLAMA_RAG_MODEL` and written to Neo4j.
- Track: ingest activity is logged so progress and issues are visible in real time.
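Conceptually, the Prepare stage is an overlapping splitter. A minimal sketch, assuming a fixed-size character chunker (the real splitter in `ingest_crawled.py` and its parameters are not documented here, so the names and defaults below are illustrative):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list:
    """Split text into overlapping chunks so retrieval can target
    precise passages. The size/overlap defaults are illustrative only."""
    chunks = []
    step = size - overlap  # advance less than a full chunk to preserve context
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
    return chunks
```

The overlap keeps a passage that straddles a chunk boundary retrievable from at least one chunk.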
Prepare your data
- This step is handled by the crawler service: `crawl4ai-service` crawls your website and writes the crawled output into the shared Docker volume, which is available to the bridge at `/app/shared/<foldername>`.
Run ingestion (inside the container, so no ports need to be exposed)
- Command:
```shell
docker exec hawki_rag_bridge sh -lc "python /app/ingest/ingest_crawled.py \
  --root /app/shared/<foldername> \
  --base-url http://localhost:8000 \
  --provider ollama \
  --graph \
  --batch 16"
```
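`--batch 16` bounds how many chunks are embedded per request. A batching helper of the kind the script presumably uses internally (a sketch; the actual implementation may differ):

```python
def batched(items: list, n: int = 16) -> list:
    """Group items into lists of at most n, e.g. one embedding call each."""
    return [items[i:i + n] for i in range(0, len(items), n)]
```

Smaller batches lower peak memory on the embedding side at the cost of more round trips.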
Resume or start fresh
The ingestion pipeline supports both resume and fresh-start behavior:
- `--resume`: skip already ingested docs.
- `--start`: ignore previous state and ingest fresh.
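A minimal sketch of how resume vs. fresh-start could work, assuming per-document state is persisted to a JSON file (the file name and schema here are hypothetical, not the script's actual state file):

```python
import json
import os

STATE_FILE = "ingest_state.json"  # hypothetical name, for illustration only

def load_done(path: str = STATE_FILE) -> set:
    """Return the set of document ids already ingested, if state exists."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def plan(docs: list, resume: bool, path: str = STATE_FILE) -> list:
    """--resume skips docs recorded as done; --start ignores prior state."""
    done = load_done(path) if resume else set()
    return [d for d in docs if d not in done]
```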
Ingestion with --graph-only
During ingestion, `--graph-only` skips Qdrant vector indexing and runs only graph extraction, writing the results to Neo4j.
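Graph extraction typically has the model emit (subject, predicate, object) triplets that are then merged into Neo4j. A parser for one plausible line-based output format (the actual format produced via `GRAPH_OLLAMA_RAG_MODEL` is an assumption here):

```python
def parse_triplets(raw: str) -> list:
    """Parse lines like 'subject | predicate | object' into 3-tuples,
    dropping malformed or incomplete lines. Format is assumed."""
    triplets = []
    for line in raw.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triplets.append(tuple(parts))
    return triplets
```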
Monitoring ingest progress
- View cached log: `docker exec hawki_rag_bridge tail -n 40 /var/www/storage/logs/ingest_progress_cache.log`
- Check status JSON: `docker exec hawki_rag_app cat storage/logs/ingest_status.json`
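The status JSON can be condensed into a one-line progress summary; the field names below (`state`, `done`, `total`) are assumptions about the file's schema, not documented keys:

```python
import json

def summarize_status(raw: str) -> str:
    """One-line progress summary from ingest_status.json contents.
    Field names are assumed for illustration."""
    status = json.loads(raw)
    return "{}: {}/{} docs".format(
        status.get("state", "unknown"),
        status.get("done", 0),
        status.get("total", 0),
    )
```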
Stopping ingestion
Methods to stop an active ingestion job:
- Use the ingestion stop action exposed by the API/controller flow.
- Kill the running ingest process using the PID recorded in the status file.
- Restart the `hawki_rag_bridge` container if you need a hard stop.
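Stopping via the recorded PID can be scripted; this sketch assumes the status file stores the process id under a `pid` key (an assumption about the schema):

```python
import json
import os
import signal

def stop_ingest(status_path: str) -> bool:
    """Send SIGTERM to the ingest PID recorded in the status file.
    The 'pid' field name is assumed. Returns True if a signal was sent."""
    with open(status_path) as f:
        pid = json.load(f).get("pid")
    if not pid:
        return False
    try:
        os.kill(pid, signal.SIGTERM)
        return True
    except ProcessLookupError:
        return False  # stale PID: the process already exited
```

SIGTERM gives the ingest script a chance to flush its state before exiting, which matters if you plan to `--resume` later.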