Extraction Trees: Reliable Structured Data from Messy Text

AI/ML & Data Science Advanced

Single-prompt LLM extraction works about 70% of the time and breaks on documents with more than one event. This talk presents a layered extraction-tree pipeline, demonstrated on a biomedical benchmark with real ground truth.

Joshua Cook

Data Scientist at Caltech CTME

Most data work now involves pulling structured records out of unstructured text. The default approach is one big LLM prompt asking for JSON. It works maybe 70% of the time, gives different answers on different runs, and breaks the moment a document mentions more than one thing. This talk is about a different approach: extraction trees, a multi-layer pipeline where each layer does one job, each layer cleans up after the one before it, and a human review interface decides which failure modes get their own layer.

The worked example is biomedical event extraction on the GENIA corpus: pulling nested event records (event type, proteins and genes involved, trigger word, links to other events) from research abstracts. A single sentence can describe one event or six, and events point at other events, so the schema is a tree of records, which is what makes single-prompt approaches fall apart. GENIA earns its place for one reason a data engineer will appreciate: it ships with a gold standard and an official scorer, so every number in the talk is measured against ground truth.

The talk has three parts. The methodology covers what naive extraction produces on this text and where it fails, then the tree as a whole. The results cover real per-layer numbers against the official scorer, the layered pipeline beating single-prompt by 14 F1 points, what determinism looks like, and where the pipeline still gets things wrong, including how the obvious gold-free quality check misleads you and how a cross-vendor panel fixes it. The signals are a short heuristic for recognizing an extraction-tree problem, and the signs that you do not have one. Extraction at production scale is not prompt engineering; it is systems engineering with LLMs as components.
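To make the "tree of records" concrete, here is a minimal Python sketch of what a nested event record could look like. The class names, field names, and the example sentence fragment ("inhibits expression of IL-2") are illustrative assumptions, not the schema used in the talk. The event types `Gene_expression` and `Negative_regulation` and the `Theme` role follow GENIA event annotation conventions.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Union


@dataclass
class Entity:
    """A protein or gene mention (hypothetical shape, for illustration)."""
    id: str
    text: str


@dataclass
class Event:
    """One event record. An argument target is an entity or another event."""
    id: str
    event_type: str
    trigger: str
    # (role, target) pairs; a target that is itself an Event makes the record a tree.
    arguments: list[tuple[str, Union[Entity, Event]]] = field(default_factory=list)


def depth(event: Event) -> int:
    """Nesting depth: 1 when every argument is an entity, more when events nest."""
    nested = [depth(t) for _, t in event.arguments if isinstance(t, Event)]
    return 1 + max(nested, default=0)


def flatten(event: Event) -> Iterator[Event]:
    """Yield nested events innermost first, then the event itself."""
    for _, target in event.arguments:
        if isinstance(target, Event):
            yield from flatten(target)
    yield event


# Invented example: "... inhibits expression of IL-2 ..."
il2 = Entity("T1", "IL-2")
expr = Event("E1", "Gene_expression", "expression", [("Theme", il2)])
neg = Event("E2", "Negative_regulation", "inhibits", [("Theme", expr)])

print(depth(neg))                      # 2
print([e.id for e in flatten(neg)])    # ['E1', 'E2']
```

The outer event's argument is another event rather than a protein, so a flat JSON object per sentence cannot represent it without inventing its own linking scheme. That linking step is one of the places where a single prompt tends to go wrong.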