
Dumont DEP — Architecture Overview

Introduction

Viglet Dumont DEP is a modular data extraction platform that runs connectors independently and delivers the extracted documents to a search engine for indexing via an asynchronous message queue. It is designed as a companion application to Viglet Turing ES, but it can also deliver content directly to Apache Solr or Elasticsearch.


System Overview

The system has three layers: content sources on the left, the Dumont DEP pipeline in the center, and search engines on the right.

Figure: Dumont DEP High-Level Architecture

Each numbered block is detailed in its own diagram below.


① Connectors — How Content Enters the Pipeline

Connectors extract content and feed it into the pipeline. They come in three forms: Java plugins, standalone CLI tools, and the WordPress PHP plugin.

Figure: Dumont DEP Connector Types

| Type | Connectors | How they connect |
|---|---|---|
| Java Plugins | Web Crawler, AEM | Loaded into dumont-connector.jar via -Dloader.path; one plugin per JVM |
| Standalone CLI | Database, FileSystem | Separate JARs that connect to a running Dumont DEP instance via REST API |
| PHP Plugin | WordPress | Installed inside WordPress; sends content directly to Turing ES, bypasses Dumont |

For details on each connector, see Connectors Overview.


② Pipeline Engine — How Content Is Processed

Once a connector produces a Job Item, it passes through a multi-stage pipeline before reaching the search engine.

Figure: Dumont DEP Pipeline Detail

| Stage / Component | What it does |
|---|---|
| Job Item | A single document with fields, an action (INDEX / DELETE), and a locale |
| Processing Strategies | Priority chain: deindex (P10) → ignore rules (P20) → index new (P30) → reindex changed (P40) → skip unchanged (P50) |
| Batch Processor + Queue | Groups items into batches of 50, sends to Apache Artemis persistent queue |
| Indexing Plugin | Delivers to Turing ES (default), Apache Solr, or Elasticsearch |
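
The priority chain listed above can be pictured as an ordered list of rules in which the first matching rule decides what happens to a Job Item. The following is a minimal sketch under that assumption, not Dumont DEP's actual code. The type names (JobItem, Decision, Strategy, StrategyChain), the in-memory checksum map standing in for the Indexing DB, and the ignore rule are all hypothetical; only the five outcomes and their priorities (P10 to P50) come from this page.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// All names here are hypothetical; only the priorities (10..50) come from the text.
enum Action { INDEX, DELETE }

enum Decision { DEINDEX, IGNORE, INDEX_NEW, REINDEX_CHANGED, SKIP_UNCHANGED }

/** A document id, the requested action, a locale, and a content checksum. */
record JobItem(String id, Action action, String locale, String checksum) {}

/** One link in the chain: lower priority numbers are evaluated first. */
record Strategy(int priority, Decision decision, Predicate<JobItem> applies) {}

final class StrategyChain {
    private final List<Strategy> ordered;

    /**
     * @param stored     id -> checksum of already indexed documents (stand-in for the Indexing DB)
     * @param ignoreRule hypothetical filter for items that should never be indexed
     */
    StrategyChain(Map<String, String> stored, Predicate<JobItem> ignoreRule) {
        this.ordered = List.of(
                new Strategy(10, Decision.DEINDEX, item -> item.action() == Action.DELETE),
                new Strategy(20, Decision.IGNORE, ignoreRule),
                new Strategy(30, Decision.INDEX_NEW, item -> !stored.containsKey(item.id())),
                new Strategy(40, Decision.REINDEX_CHANGED,
                        item -> !item.checksum().equals(stored.get(item.id()))),
                new Strategy(50, Decision.SKIP_UNCHANGED, item -> true))
            .stream()
            // Sort explicitly so the outcome never depends on registration order.
            .sorted(Comparator.comparingInt(Strategy::priority))
            .toList();
    }

    /** Returns the decision of the first strategy, in priority order, that applies. */
    Decision decide(JobItem item) {
        return ordered.stream()
                .filter(s -> s.applies().test(item))
                .findFirst()
                .map(Strategy::decision)
                .orElseThrow(); // not expected: the priority-50 rule matches everything
    }
}
```

Because the rules are evaluated in priority order, a DELETE action is handled before any ignore rule or checksum comparison, and "skip unchanged" only applies when nothing earlier matched.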

The Indexing DB stores checksums and status for every processed document — enabling incremental indexing on subsequent runs.
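
One way such a checksum comparison could work is sketched below. This page does not say which hash function or which inputs Dumont DEP uses, so SHA-256 over the sorted field map is an assumption made purely for illustration, and the class name Checksums is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: the hash algorithm and field encoding are assumptions.
final class Checksums {
    private Checksums() {}

    /** SHA-256 over the fields in sorted key order, so map iteration order cannot change the result. */
    static String of(Map<String, String> fields) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (Map.Entry<String, String> e : new TreeMap<>(fields).entrySet()) {
                digest.update(e.getKey().getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0); // separator so ("ab", "c") hashes differently from ("a", "bc")
                digest.update(e.getValue().getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0);
            }
            return HexFormat.of().formatHex(digest.digest());
        } catch (NoSuchAlgorithmException ex) {
            // SHA-256 must be available on every Java platform implementation.
            throw new IllegalStateException(ex);
        }
    }

    /** A document needs (re)indexing when nothing is stored yet or the stored checksum differs. */
    static boolean changed(String storedChecksum, Map<String, String> fields) {
        return storedChecksum == null || !storedChecksum.equals(of(fields));
    }
}
```

On a later run, a connector would only need to forward documents for which `changed` returns true, which is the essence of incremental indexing.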

For the conceptual explanation of each stage, see Core Concepts — The Processing Pipeline.


③ Data Flow — Indexing Sequence

The complete sequence from content source to search engine:

Figure: Dumont DEP Indexing Flow
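
The asynchronous hand-off at the heart of this sequence can be approximated with plain JDK classes. In the sketch below, a LinkedBlockingQueue stands in for the Apache Artemis queue and a background thread stands in for the JMS listener. The class IndexingFlowSketch, the IndexingPlugin interface, and the run method are hypothetical names, and the real system uses persistent JMS queues rather than an in-memory queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: an in-memory queue stands in for the Artemis persistent queue.
final class IndexingFlowSketch {
    /** Stand-in for an output adapter (Turing ES, Solr, or Elasticsearch). */
    interface IndexingPlugin {
        void index(List<String> batch);
    }

    // Unique sentinel instance that tells the listener to stop.
    private static final List<String> END = new ArrayList<>();

    private IndexingFlowSketch() {}

    /** Splits ids into batches, queues them, and returns the number of batches sent. */
    static int run(List<String> documentIds, int batchSize, IndexingPlugin plugin) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        BlockingQueue<List<String>> queue = new LinkedBlockingQueue<>();

        // Consumer side: take batches off the queue and delegate to the plugin.
        Thread listener = new Thread(() -> {
            try {
                for (List<String> batch = queue.take(); batch != END; batch = queue.take()) {
                    plugin.index(batch);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        listener.start();

        // Producer side: the connector groups items and enqueues each batch.
        int batches = 0;
        try {
            for (int from = 0; from < documentIds.size(); from += batchSize) {
                int to = Math.min(from + batchSize, documentIds.size());
                queue.put(List.copyOf(documentIds.subList(from, to)));
                batches++;
            }
            queue.put(END);
            listener.join(); // wait until every queued batch has been delivered
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return batches;
    }
}
```

The producer never calls the plugin directly; it only enqueues batches, which is what lets extraction and delivery proceed at different speeds.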


Internal Module Structure

| Module | Package | Responsibility |
|---|---|---|
| Connector Core | connector | Plugin interface, session management, and REST API controllers |
| Processing Strategies | connector/strategy | Priority-based chain: index, re-index, de-index, ignore, or skip |
| Batch Processor | connector/batch | Thread-safe buffer that groups Job Items before queue delivery |
| Message Queue | connector/queue | JMS listener on Apache Artemis; delegates to indexing plugins |
| Indexing Plugins | connector/indexing | Output adapters: Turing ES, Apache Solr, Elasticsearch |
| Web Crawler | web-crawler | JSoup, URL filtering, authentication, locale detection |
| Database | db | JDBC queries, batch chunking, multi-database support |
| FileSystem | filesystem | Apache Tika text extraction, OCR, metadata mapping |
| AEM | aem | infinity.json, tags, model.json, delta tracking, custom extensions |
| WordPress | wordpress | PHP plugin; event-driven indexing inside WordPress |
| Commons | commons | Shared models, interfaces, utilities |
| AEM Commons | aem-commons | AEM extension interfaces (published to Maven Central) |
| DB Commons | db-commons | DB extension interface (published to Maven Central) |
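
The Batch Processor is described above as a thread-safe buffer that groups Job Items before queue delivery. A minimal sketch of that idea, assuming a simple lock-based design, might look like the following. BatchBuffer and its sink callback are hypothetical names, and the real module may be structured quite differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of a thread-safe batching buffer; not the actual Dumont DEP class.
final class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink; // e.g. a function that sends a batch to the queue
    private final List<T> pending = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Adds one item and hands a full batch to the sink once batchSize is reached. */
    void add(T item) {
        List<T> full = null;
        synchronized (pending) {
            pending.add(item);
            if (pending.size() >= batchSize) {
                full = drain();
            }
        }
        // Call the sink outside the lock so slow delivery does not block other producers.
        if (full != null) {
            sink.accept(full);
        }
    }

    /** Sends whatever is left, for example at the end of a connector run. */
    void flush() {
        List<T> rest;
        synchronized (pending) {
            rest = drain();
        }
        if (!rest.isEmpty()) {
            sink.accept(rest);
        }
    }

    // Caller must hold the lock on 'pending'.
    private List<T> drain() {
        List<T> batch = new ArrayList<>(pending);
        pending.clear();
        return batch;
    }
}
```

With a batch size of 50, adding 120 items hands two batches of 50 to the sink, and a final `flush()` delivers the remaining 20.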

Technology Stack

| Layer | Technology | Notes |
|---|---|---|
| Runtime | Java 21 | Minimum supported version |
| Framework | Spring Boot 4.0.3 | JMS, caching, async, scheduling |
| Message Broker | Apache Artemis | Embedded, persistent queues |
| Database | H2 (dev) / PostgreSQL (prod) | Indexing state, checksums, config |
| HTML Parsing | JSoup 1.22.1 | Web Crawler |
| Text Extraction | Apache Tika 3.2.3 | FileSystem: PDF, DOCX, images (OCR) |
| Search Clients | Turing Java SDK, SolrJ 10.0.0, ES Client 9.3.2 | One active per deployment |
| Build | Apache Maven | Multi-module project |

Deployment Topologies

Development

Dumont DEP (H2 embedded + Artemis embedded)
→ Turing ES (http://localhost:2700)

Production

Dumont DEP + PostgreSQL
→ Turing ES + Apache Solr

Direct Indexing (without Turing ES)

Dumont DEP + PostgreSQL
→ Apache Solr (direct via SolrJ)
→ Elasticsearch (direct via ES Client)

| Page | Description |
|---|---|
| Core Concepts | Pipeline stages, strategies, and change detection |
| Connectors Overview | All connectors and deployment types |
| Indexing Plugins | Turing ES, Solr, Elasticsearch output targets |
| Installation Guide | Setup with -Dloader.path and systemd |
| Turing ES — Architecture | Search-side architecture: indexing reception, search flow, and deployment topologies |