6. Ingestion & Embeddings (No Steps Skipped)
Ingestion flow
How data moves through the pipeline
The ingest path starts with raw files and progressively turns them into search-ready knowledge.
- Collect: source files are picked up from the shared ingest location.
- Prepare: content is split into clean chunks so retrieval can target precise passages.
- Encode: chunks are transformed into embeddings for semantic matching via Ollama (default embedding model: `bge-m3`).
- Index: vectors are stored in Qdrant to power fast similarity search.
- Enrich (optional): when graph mode is on, triplets are extracted with `GRAPH_OLLAMA_RAG_MODEL` and written to Neo4j.
- Track: ingest activity is logged so progress and issues are visible in real time.
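Conceptually, the Prepare stage is an overlapping splitter. A minimal sketch, assuming a fixed-size character chunker (the real splitter in `ingest_crawled.py` and its parameters are not documented here, so the names and defaults below are illustrative):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list:
    """Split text into overlapping chunks so retrieval can target
    precise passages. The size/overlap defaults are illustrative only."""
    chunks = []
    step = size - overlap  # advance less than a full chunk to preserve context
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
    return chunks
```

The overlap keeps a passage that straddles a chunk boundary retrievable from at least one chunk.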
Prepare your data
- This step is handled by the crawler service: `crawl4ai-service` crawls your website and writes the crawled output into the shared Docker volume, which is available to the bridge at `/app/shared/<foldername>`.
Run ingestion (inside the container, so no ports need to be exposed)
- Command:
```shell
docker exec hawki_rag_bridge sh -lc "python /app/ingest/ingest_crawled.py \
  --root /app/shared/<foldername> \
  --base-url http://localhost:8000 \
  --provider ollama \
  --graph \
  --batch 16"
```
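`--batch 16` bounds how many chunks are embedded per request. A batching helper of the kind the script presumably uses internally (a sketch; the actual implementation may differ):

```python
def batched(items: list, n: int = 16) -> list:
    """Group items into lists of at most n, e.g. one embedding call each."""
    return [items[i:i + n] for i in range(0, len(items), n)]
```

Smaller batches lower peak memory on the embedding side at the cost of more round trips.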
Resume or start fresh
The ingestion pipeline supports both resume and fresh-start behavior:
- `--resume`: skip already ingested docs.
- `--start`: ignore previous state and ingest fresh.
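A minimal sketch of how resume vs. fresh-start could work, assuming per-document state is persisted to a JSON file (the file name and schema here are hypothetical, not the script's actual state file):

```python
import json
import os

STATE_FILE = "ingest_state.json"  # hypothetical name, for illustration only

def load_done(path: str = STATE_FILE) -> set:
    """Return the set of document ids already ingested, if state exists."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def plan(docs: list, resume: bool, path: str = STATE_FILE) -> list:
    """--resume skips docs recorded as done; --start ignores prior state."""
    done = load_done(path) if resume else set()
    return [d for d in docs if d not in done]
```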
Ingestion with --graph-only
During ingestion, `--graph-only` skips Qdrant vector indexing and runs only graph extraction, writing the results to Neo4j.
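Graph extraction typically has the model emit (subject, predicate, object) triplets that are then merged into Neo4j. A parser for one plausible line-based output format (the actual format produced via `GRAPH_OLLAMA_RAG_MODEL` is an assumption here):

```python
def parse_triplets(raw: str) -> list:
    """Parse lines like 'subject | predicate | object' into 3-tuples,
    dropping malformed or incomplete lines. Format is assumed."""
    triplets = []
    for line in raw.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triplets.append(tuple(parts))
    return triplets
```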
Monitoring ingest progress
- View cached log: `docker exec hawki_rag_bridge tail -n 40 /var/www/storage/logs/ingest_progress_cache.log`
- Check status JSON: `docker exec hawki_rag_app cat storage/logs/ingest_status.json`
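The status JSON can be condensed into a one-line progress summary; the field names below (`state`, `done`, `total`) are assumptions about the file's schema, not documented keys:

```python
import json

def summarize_status(raw: str) -> str:
    """One-line progress summary from ingest_status.json contents.
    Field names are assumed for illustration."""
    status = json.loads(raw)
    return "{}: {}/{} docs".format(
        status.get("state", "unknown"),
        status.get("done", 0),
        status.get("total", 0),
    )
```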
Stopping ingestion
Methods to stop an active ingestion job:
- Use the ingestion stop action exposed by the API/controller flow.
- Kill the running ingest process using the PID recorded in the status file.
- Restart the `hawki_rag_bridge` container if you need a hard stop.
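Stopping via the recorded PID can be scripted; this sketch assumes the status file stores the process id under a `pid` key (an assumption about the schema):

```python
import json
import os
import signal

def stop_ingest(status_path: str) -> bool:
    """Send SIGTERM to the ingest PID recorded in the status file.
    The 'pid' field name is assumed. Returns True if a signal was sent."""
    with open(status_path) as f:
        pid = json.load(f).get("pid")
    if not pid:
        return False
    try:
        os.kill(pid, signal.SIGTERM)
        return True
    except ProcessLookupError:
        return False  # stale PID: the process already exited
```

SIGTERM gives the ingest script a chance to flush its state before exiting, which matters if you plan to `--resume` later.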