Most literature tools are designed to help users retrieve papers. Far fewer are designed to help users understand where evidence is limited, fragmented, or absent.
At mineris, we treat research gap detection primarily as a coverage analysis problem, rather than an inference problem. Instead of asking a model to generate gaps directly from unstructured text, we first construct a structured representation of the literature and then identify areas in which evidence coverage is limited.
From articles to structured evidence
Each article in a corpus is transformed into structured fields based on the PICO framework:
- Population
- Intervention
- Outcome
- optionally, Study design
Before analysis, these fields are cleaned, deduplicated, and normalized to consistent canonical forms. This step is important because even minor variation in terminology can fragment the evidence landscape and obscure meaningful patterns.
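As a rough illustration, the normalization step might look like the sketch below. The synonym map and helper names are hypothetical; a production pipeline would rely on curated vocabularies rather than a hand-written dictionary.

```python
# Hypothetical synonym map; real pipelines would draw on curated
# vocabularies such as MeSH rather than a hand-written table.
SYNONYMS = {
    "type 2 diabetes mellitus": "type 2 diabetes",
    "t2dm": "type 2 diabetes",
    "hba1c": "glycated hemoglobin",
}

def normalize(term: str) -> str:
    """Lowercase, collapse whitespace, and map a term to its canonical form."""
    cleaned = " ".join(term.lower().split())
    return SYNONYMS.get(cleaned, cleaned)

def normalize_fields(values: list[str]) -> list[str]:
    """Normalize a list of extracted field values and drop duplicates,
    preserving first-seen order."""
    seen: dict[str, None] = {}
    for v in values:
        seen.setdefault(normalize(v), None)
    return list(seen)
```

Even this toy version shows why the step matters: "T2DM" and "type 2 diabetes mellitus" would otherwise occupy different cells of the evidence map.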
The objective is not to impose a perfect ontology on the literature. Rather, it is to produce a representation that is sufficiently consistent to support systematic analysis of what the corpus actually contains.
Constructing a coverage map
Once articles have been structured, we identify the most frequent terms along each axis: population, intervention, and outcome. These terms define a bounded analysis space.
We then construct a PICO coverage grid, in which each cell corresponds to a specific combination of:
- Population × Intervention × Outcome
- optionally extended by Study design
Each cell stores the set of articles supporting that exact combination. In this way, a collection of publications is converted into a structured map of evidence density.
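The grid construction can be sketched as follows, assuming a simplified article schema with already-normalized population, intervention, and outcome fields. The schema and function names are illustrative, not the system's actual interface.

```python
from itertools import product

def build_coverage_grid(articles, populations, interventions, outcomes):
    """Map each (population, intervention, outcome) cell to the set of
    article ids supporting that exact combination.

    `articles` is a list of dicts with normalized 'id', 'population',
    'intervention', and 'outcome' fields (a simplified one-value-per-field
    schema for illustration).
    """
    grid = {cell: set() for cell in product(populations, interventions, outcomes)}
    for a in articles:
        cell = (a["population"], a["intervention"], a["outcome"])
        if cell in grid:  # ignore terms outside the bounded analysis space
            grid[cell].add(a["id"])
    return grid
```

Restricting the grid to the most frequent terms keeps the combinatorial space bounded while still covering the bulk of the corpus.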
Detecting low-coverage regions
Within this coverage grid, every combination inside the selected analysis space is evaluated, and each cell is characterized as one of the following:
- Well-studied, when supported by multiple articles
- Sparse, when supported by only a small number of articles
- Empty, when no supporting articles are present in the corpus
Sparse and empty cells form the basis of what we describe as candidate research gaps. However, absence of evidence within a grid cell should not automatically be interpreted as a meaningful gap.
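A minimal sketch of this classification step follows. The cutoff of three articles is an illustrative assumption, not a fixed parameter of the method.

```python
def classify_cell(article_ids, sparse_threshold=3):
    """Label a grid cell by evidence density.

    `article_ids` is the set of articles supporting the cell; the
    threshold of 3 is an illustrative choice, not a method constant.
    """
    n = len(article_ids)
    if n == 0:
        return "empty"
    if n < sparse_threshold:
        return "sparse"
    return "well-studied"
```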
Filtering likely artifacts
A central component of the method is filtering. Raw combinations can generate misleading outputs due to:
- extraction noise
- inconsistent terminology
- semantically incompatible terms
To reduce these artifacts, we apply two constraints before labeling a cell as a candidate gap.
1. Semantic plausibility
We remove combinations in which:
- an outcome resembles a population label or intervention term
- terms belong to clearly different clinical domains
- the combination is biologically or clinically implausible
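One simple way to approximate the cross-domain check is a lookup table of term domains. The table and its contents are hypothetical; in practice this signal could come from an ontology or an embedding-based similarity measure.

```python
# Hypothetical domain lookup; a real system would derive domains from an
# ontology or embedding similarity rather than a static table.
TERM_DOMAIN = {
    "type 2 diabetes": "endocrinology",
    "metformin": "endocrinology",
    "glycated hemoglobin": "endocrinology",
    "knee replacement": "orthopedics",
}

def plausible(population, intervention, outcome):
    """Reject combinations whose terms span clearly different clinical domains."""
    domains = {TERM_DOMAIN.get(t) for t in (population, intervention, outcome)}
    domains.discard(None)  # unknown terms do not count against plausibility
    return len(domains) <= 1
```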
2. Pairwise support
A candidate combination must also show at least minimal co-occurrence among its constituent parts somewhere in the corpus:
- population + intervention
- population + outcome
- intervention + outcome
If none of these pairwise relations appear, the full combination is more likely to be an artifact of the combinatorial search space than a credible low-coverage research area.
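The pairwise-support constraint can be sketched as below, using the same simplified article schema as earlier; the names are illustrative.

```python
def has_pairwise_support(cell, articles):
    """Return True if at least one article pairs two elements of the cell.

    `cell` is a (population, intervention, outcome) tuple; `articles` are
    dicts with normalized 'population', 'intervention', and 'outcome'
    fields (a simplified schema for illustration).
    """
    p, i, o = cell
    for a in articles:
        if (a["population"] == p and a["intervention"] == i) or \
           (a["population"] == p and a["outcome"] == o) or \
           (a["intervention"] == i and a["outcome"] == o):
            return True
    return False
```

A cell with no pairwise support anywhere in the corpus is exactly the kind of combinatorial artifact the filter is meant to discard.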
What the system returns
After filtering, the remaining cells are classified as:
- Empty, indicating no evidence in the corpus
- Sparse, indicating minimal evidence
These are returned as candidate research gaps together with:
- supporting articles, when present
- examples of well-studied combinations for context
The resulting output is intended to be:
- interpretable
- inspectable
- suitable for downstream synthesis or prioritization
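For illustration, a candidate-gap record might look like the following. The field names are assumptions made for this sketch, not the system's actual output schema.

```python
from dataclasses import dataclass, field

@dataclass
class CandidateGap:
    """Illustrative output record; field names are assumed, not the
    system's actual schema."""
    population: str
    intervention: str
    outcome: str
    status: str  # "empty" or "sparse"
    supporting_articles: list = field(default_factory=list)
    well_studied_examples: list = field(default_factory=list)
```

Keeping supporting articles and well-studied neighbors in the record is what makes each candidate gap inspectable rather than a bare assertion.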
Why we refer to them as candidate gaps
This approach does not claim to identify definitive research gaps automatically.
An empty or sparse cell may reflect:
- a genuinely under-studied question
- limitations of the corpus
- differences in terminology
- a clinically unimportant combination
For that reason, we treat these outputs as structured signals rather than conclusions. They are intended to support expert review, not replace it.
Why this approach matters
Many gap-detection systems depend heavily on model-based inference. Our approach follows a different sequence:
- first, map what is present in the literature
- then, identify where coverage is limited
- only afterward, interpret those low-coverage areas
This makes the process:
- more transparent
- easier to validate
- less dependent on opaque model behavior
What comes next
Coverage alone is not sufficient to define a research gap. A more complete system should also account for:
- study quality
- recency
- consistency of findings
- patient relevance
Even so, coverage remains a useful foundation.
By transforming literature into a structured evidence map and highlighting its sparse regions, we provide researchers with a more systematic starting point for identifying potentially important unanswered questions.
The aim is to show not only what has been studied, but also what has not been studied sufficiently, and whether that absence is likely to matter.