Six Systems That Are Quietly Automating the Hardest Parts of Knowledge Work

The phrase “AI for knowledge work” has become so broad as to be nearly meaningless. But underneath the marketing noise, a set of more precise research systems has emerged — each targeting a specific, stubborn bottleneck in how experts actually spend their time. Looking at six recent systems together reveals a pattern worth examining.

From Static Summaries to Interactive Mechanisms

Most document AI produces static outputs: a summary, a slide deck, a structured extraction. PaperVoyager takes a different approach. It uses visual language models to interpret the dynamic mechanisms described in research papers (state transitions, iterative algorithms, feedback loops) and generate executable web-based visualizations of them. The target artifact isn’t a summary but a running system. For anyone who has tried to understand a paper’s core algorithm from a prose description alone, this distinction matters. Static outputs can only describe behavior; interactive ones make it tangible and testable.

Screening at Scale Without Losing Depth

Paper screening is one of the most time-consuming tasks in academic research, and existing AI reviewers tend to produce shallow evaluations that miss genuine contributions. MemoNoveltyAgent addresses this by integrating a persistent historical research memory into a multi-agent workflow. Rather than evaluating a paper in isolation, the system situates it against prior work the agent has already processed, so it can assess genuine novelty rather than surface-level differentiation. The key insight is architectural: novelty assessment without longitudinal memory is inherently limited, regardless of how capable the underlying model is.
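
The idea of scoring a paper against remembered prior work can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ResearchMemory` class, the bag-of-words similarity (a crude stand-in for whatever representation the real system uses), and the example abstracts. It is not MemoNoveltyAgent's implementation, and it keeps the memory in process rather than persisting it.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Crude stand-in for an embedding: lowercase token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


class ResearchMemory:
    """Hypothetical store of papers the agent has already processed."""

    def __init__(self):
        self.entries = []  # list of (title, vector) pairs

    def add(self, title, abstract):
        self.entries.append((title, bag_of_words(abstract)))

    def novelty(self, abstract):
        """Return (score, closest_title); score is 1 minus the highest similarity."""
        if not self.entries:
            return 1.0, None
        query = bag_of_words(abstract)
        title, similarity = max(
            ((t, cosine(query, v)) for t, v in self.entries),
            key=lambda pair: pair[1],
        )
        return 1.0 - similarity, title


memory = ResearchMemory()
memory.add("Paper A", "graph neural networks for molecular property prediction")
memory.add("Paper B", "retrieval augmented generation for question answering")

score, closest = memory.novelty(
    "graph neural networks for predicting molecular properties"
)
print(f"novelty={score:.2f}, closest prior work: {closest}")
```

The point of the sketch is the dependency, not the metric: the novelty score is only as informative as the memory it is compared against, which is why an empty memory trivially reports maximum novelty.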

Agents as Problem Generators, Not Just Solvers

Code2Math flips a common assumption about code agents. The usual framing is that agents should solve math problems; Code2Math explores using them to generate and evolve challenging math problems instead. By treating code execution as a scalable environment for mathematical experimentation, the system produces training problems that would be prohibitively expensive to source by hand. The observation that code agents are better at problem generation than problem solving in this domain is the kind of counterintuitive finding that tends to open new research directions rather than close them.
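
A toy sketch can show what "code execution as an environment" means in practice: the generator samples parameters, and running code supplies a verified answer, so no human has to solve each problem. The problem family (counting solutions to a quadratic congruence), the function names, and the naive `scale_up` evolution step are all assumptions chosen for illustration, not Code2Math's actual procedure.

```python
import random


def count_solutions(modulus, residue, limit):
    """Ground truth by brute force: count n in [1, limit] with n*n % modulus == residue."""
    return sum(1 for n in range(1, limit + 1) if (n * n) % modulus == residue)


def make_problem(modulus, residue, limit):
    """Render a problem statement and attach an answer obtained by execution."""
    return {
        "statement": (
            f"How many integers n with 1 <= n <= {limit} satisfy "
            f"n^2 = {residue} (mod {modulus})?"
        ),
        "params": (modulus, residue, limit),
        "answer": count_solutions(modulus, residue, limit),
    }


def generate_problem(rng, limit=1000):
    """Sample parameters; squaring a witness guarantees at least one solution."""
    modulus = rng.randint(5, 50)
    witness = rng.randint(1, modulus)
    return make_problem(modulus, (witness * witness) % modulus, limit)


def scale_up(problem, factor=10):
    """One naive evolution step: widen the range and re-verify by execution."""
    modulus, residue, limit = problem["params"]
    return make_problem(modulus, residue, limit * factor)


rng = random.Random(0)  # fixed seed so the run is reproducible
easy = generate_problem(rng)
harder = scale_up(easy)
print(easy["statement"], "->", easy["answer"])
print(harder["statement"], "->", harder["answer"])
```

A realistic generator would need a much richer notion of difficulty than widening a range, but the verification pattern is the same: every emitted problem carries an answer that code has actually computed.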

Automating Operations Research Modeling

Translating a business problem stated in natural language into a formal optimization model is slow, expert-dependent work, and errors compound at each step of the translation. MIRROR attacks this with a multi-agent framework that combines iterative revision cycles with hierarchical, task-specific retrieval. The iterative revision component is critical: rather than generating the model in a single pass, agents collaborate to catch and correct errors before they propagate into the final model. This deliberately mirrors how experienced OR practitioners actually work, cycling through formulation and validation before committing.
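
The revision cycle described above can be sketched as a generic validate-then-revise loop. This is an illustrative assumption, not MIRROR's architecture: `validate` and `revise` are hypothetical stand-ins for the checking and correcting agents, the model is a plain dictionary, and the retrieval component is omitted entirely.

```python
REQUIRED_PARTS = ("variables", "objective", "constraints")


def validate(model):
    """Hypothetical checker agent: report each missing part of the formulation."""
    return [f"missing {part}" for part in REQUIRED_PARTS if not model.get(part)]


def revise(model, issues):
    """Hypothetical reviser agent: repairs only the first reported issue per round."""
    canned_fixes = {
        "variables": ["x", "y"],
        "objective": "maximize 3x + 2y",
        "constraints": ["x + y <= 4", "x >= 0", "y >= 0"],
    }
    part = issues[0].split(" ", 1)[1]
    return {**model, part: canned_fixes[part]}


def revise_until_valid(draft, validate, revise, max_rounds=5):
    """Alternate validation and revision; return (model, revision_rounds_used)."""
    for rounds_used in range(max_rounds):
        issues = validate(draft)
        if not issues:
            return draft, rounds_used
        draft = revise(draft, issues)
    issues = validate(draft)
    if issues:
        raise ValueError(f"unresolved after {max_rounds} rounds: {issues}")
    return draft, max_rounds


first_draft = {"objective": "maximize 3x + 2y"}  # single-pass output with gaps
model, rounds = revise_until_valid(first_draft, validate, revise)
print(rounds, model)
```

The round cap matters in practice: a reviser that cannot fix an issue should surface the failure rather than loop indefinitely.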

Making Implicit Preferences Legible

Decision-making from large option sets with competing objectives is difficult partly because human preferences in these situations are often implicit and only partially formed. LISTEN treats the LLM as an iterative decision agent that surfaces and refines these preferences through dialogue, eliciting trade-off evaluations in natural language and updating its internal preference model accordingly. The applications are broad: hiring decisions, procurement, research prioritization. The common thread is any scenario where experts know what they want but struggle to articulate it in a form a system can act on.
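
To make "updating a preference model from trade-off feedback" concrete, here is a minimal sketch that swaps in a much simpler technique: a linear utility over numeric attributes, corrected with a perceptron-style step whenever it disagrees with a stated preference. LISTEN itself works through natural-language dialogue with an LLM; the attribute names, vendor data, and update rule below are assumptions for illustration only.

```python
def utility(weights, option):
    """Linear utility: weighted sum of an option's attribute scores."""
    return sum(weights[name] * option[name] for name in weights)


def update(weights, preferred, rejected, step=0.5):
    """Perceptron-style step: if the current weights do not rank the pair
    the way the expert did, shift them toward the preferred option."""
    if utility(weights, preferred) > utility(weights, rejected):
        return dict(weights)  # already consistent with the feedback
    return {
        name: weights[name] + step * (preferred[name] - rejected[name])
        for name in weights
    }


# Hypothetical procurement options, scored on a 0-1 scale per attribute.
vendor_a = {"quality": 0.9, "savings": 0.2}
vendor_b = {"quality": 0.4, "savings": 0.8}

weights = {"quality": 0.0, "savings": 0.0}  # no preferences known yet

# Suppose the expert says: "I'd take A; quality matters more than price here."
weights = update(weights, preferred=vendor_a, rejected=vendor_b)
print(weights)
```

After one round of feedback the weights favor quality, so the model now ranks the two vendors the way the expert did. Repeating the step over many comparisons is what turns scattered judgments into an explicit, inspectable preference model.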

Fixing Facts Across Multiple Hops

Factual error correction in text is harder than it looks when errors require reasoning across multiple evidence sources. Single-hop corrections, where the error and the correcting evidence are directly linked, are manageable. Multi-hop cases are not, and most training data skews heavily toward the easier type. CECOR constructs synthetic training data specifically targeting compositional, multi-hop factual errors. By building paired datasets that require cross-source reasoning to correct, it concentrates training signal in the precise regime where current models struggle most.
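
One way to picture a compositional, multi-hop training pair: chain two facts through a shared entity, write a claim that never names that bridging entity, then corrupt the final entity. The sketch below is a hypothetical construction under those assumptions (the `Fact` type, the claim template, and the distractor are all invented for illustration) and is not CECOR's data pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str

    def sentence(self):
        return f"{self.subject} {self.relation} {self.obj}."


def make_training_pair(hop1, hop2, distractor):
    """Build an (erroneous, corrected) claim pair that needs both facts to fix."""
    if hop1.obj != hop2.subject:
        raise ValueError("facts do not chain: hop1.obj must equal hop2.subject")
    if distractor == hop2.obj:
        raise ValueError("distractor must differ from the true final entity")

    def claim(final_entity):
        # The bridging entity (hop1.obj) is never named in the claim, so
        # verifying it requires combining both pieces of evidence.
        return (
            f"{hop1.subject} {hop1.relation} the place that "
            f"{hop2.relation} {final_entity}."
        )

    return {
        "erroneous": claim(distractor),
        "corrected": claim(hop2.obj),
        "evidence": [hop1.sentence(), hop2.sentence()],
    }


pair = make_training_pair(
    Fact("Marie Curie", "was born in", "Warsaw"),
    Fact("Warsaw", "is the capital of", "Poland"),
    distractor="France",
)
for key, value in pair.items():
    print(f"{key}: {value}")
```

Neither evidence sentence alone exposes the error: the first says nothing about countries, and the second never mentions Marie Curie. That is the property that separates this kind of example from a single-hop correction.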

What These Systems Have in Common

None of these are general-purpose assistants. Each encodes a specific expert workflow (mechanism comprehension, paper screening, problem generation, OR modeling, preference elicitation, factual correction) and automates the parts that are slow, error-prone, or hard to scale. The structural similarities are worth noting: iterative revision appears in MIRROR and LISTEN; persistent memory matters in MemoNoveltyAgent; synthetic data generation connects Code2Math and CECOR. The pattern suggests that useful automation for knowledge work starts with getting specific about the failure modes in existing human workflows and designing around them, rather than building general-purpose pipelines and hoping they generalize.