The underlying concept is to view a document as a pair of files (D, S).
File D is the display file that is presented to the user
when the document matches a query and is retrieved.
File S (the surrogate file) is what is indexed when the database is
created or when the document is added to it.
The key point is that the two files D and S need not be identical.
Indeed, the surrogate file can have:
- Non-displayable metadata, such as (a) authors' notes,
  (b) descriptive terms and keywords for equations, formulas,
  graphs and tables, and (c) annotations.
- Modified representations of mathematical equations, symbols
  and other constructs,
  so that those representations are indexable and searchable by text
  search systems. Roughly speaking, the modified representation
  corresponds to content markup in the parlance of
  MathML [23] and OpenMath [21], and the original
  representation corresponds to presentation markup as in MathML,
  LaTeX, and word processing systems. Note that the modified
  representation could be standard markup such as that of MathML, but it
  need not be, depending on the intended use.
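To make the idea of a modified, indexable representation concrete, here is a minimal sketch in Python. It is only an illustration: the text does not prescribe any particular transformation, and the lookup table `LATEX_WORDS` and the function `surrogate_text` are hypothetical names introduced here. The sketch assumes the original representation is a LaTeX fragment and that the surrogate should contain plain words a text search system can index.

```python
import re

# Hypothetical lookup table (an assumption, not from the source):
# LaTeX commands mapped to plain words that a text engine can index.
LATEX_WORDS = {
    r"\int": "integral",
    r"\sum": "sum",
    r"\sqrt": "square root",
    r"\pi": "pi",
    r"\infty": "infinity",
}


def surrogate_text(latex):
    """Turn a LaTeX fragment into a string of plain, indexable tokens.

    Known commands are replaced by words, unknown commands are dropped,
    and any remaining non-alphanumeric characters become spaces.
    """
    def replace(match):
        return " " + LATEX_WORDS.get(match.group(0), "") + " "

    text = re.sub(r"\\[A-Za-z]+", replace, latex)
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    return " ".join(text.split())


# The display file keeps the original LaTeX; only the surrogate gets the words.
indexable = surrogate_text(r"\int_0^\infty e^{-x} \, dx")
print(indexable)
```

A flat word list like this discards the structure of the formula. A content-markup surrogate would preserve it, at the cost of needing a search system that understands that markup.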
The main reason for having a pair of files is that display files do not
have all the textual (and thus indexable) content needed, while
the surrogate files contain content that is not meant for display.
In practice, D is a subset of S. File S is created at indexing time, and
can later be discarded (after the index has been generated);
only file D need be kept.
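The (D, S) pairing can be sketched with a toy inverted index. This is a minimal illustration under stated assumptions, not the system described here: the names `Document`, `PairIndex`, and `tokenize` are hypothetical, tokenization is a simple lowercase alphanumeric split, and a query matches when every query token occurs in S.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    display: str    # file D: what the user sees on retrieval
    surrogate: str  # file S: what gets indexed; may hold non-displayed text


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class PairIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of docs whose S has it
        self.display = {}                 # doc id -> D (the only file kept)

    def add(self, doc):
        # Index the surrogate file S, never the display file D.
        for token in tokenize(doc.surrogate):
            self.postings[token].add(doc.doc_id)
        self.display[doc.doc_id] = doc.display
        # doc.surrogate is not stored: once indexed, S can be discarded.

    def search(self, query):
        """Return the display files of documents whose S has every query token."""
        tokens = tokenize(query)
        if not tokens:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return [self.display[i] for i in sorted(ids)]


index = PairIndex()
index.add(Document(
    doc_id="d1",
    display="The area of a circle is $A = \\pi r^2$.",
    surrogate=("The area of a circle is A = pi r^2. "
               "keywords: circle area formula radius squared"),
))

# "radius squared" appears only in S, yet the query returns D.
hits = index.search("radius squared")
print(hits)
```

The query terms never appear in the display text, so a search over D alone would miss this document. Indexing S makes the match possible, and the user still sees only D.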