4.3 Search Augmentation

For the augmentation strategy to work, we need to perform the following fundamental tasks:

Textualization
of all symbolic entities in the contents and in the queries. This trivial but necessary step means mapping non-alphanumerical characters to alphanumerical strings. For example, $+$ is turned into ``plus'' and $<$ into ``lt''. This is needed since text search systems do not recognize symbols like $+, /, <$ or ^ as known mathematical operators. The textualizing process merely involves substituting non-alphanumerical characters with pre-defined alphanumerical counterparts.

Flattening
of mathematical equations & structures in contents and in queries, that is, converting them into a linear, sequential form. Flattening is needed because the closest thing to capturing structures in text search is through linear contiguity & proximity operators.

The main task of flattening is to identify and scope the parts of a formula such as numerators, denominators, exponents and arguments, and to surround them with pairs of matching markup tags. Table 1 below illustrates the concept.

The main approach, which performs both textualization and flattening, is as follows. First, each formula is parsed into a tree where each node is labeled with the corresponding entity in the formula (e.g., ``+'', ``2'', ``x'', and ``^''). Second, the parse tree is traversed and converted into a textualized sentence with all the proper matching pairs surrounding formula parts. The granularity of the parts to be scoped depends on how much specificity we want to allow the users to have when formulating queries.


Table 1: Illustrations of Textualization and Flattening
ORIGINAL TEXTUALIZED & FLATTENED FORM
1-12-2  
x^{t-2} = 1 x BeginExponent t minus 2 EndExponent Equal 1
f(x) or f@(x) f ApplyAt BeginArgument x EndArgument
(x+1)/(x-1) fraction (x plus 1)(x minus 1)
  -alternatively-
  fraction BeginNumerator x plus 1 EndNumerator
  BeginDenominator x minus 1 EndDenominator

Flattening deserves special attention, not only because it is inherent to text search, but also because it creates a new problem stemming from the multiplicity of linear forms that an expression can be mapped to. For example, $x_2^3$ can be flattened in two ways, depending on whether we put the subscript before or after the superscript. If the order in the contents is different from the order in the query (after flattening), then an undesirable mismatch results. Hence the need for the next task.

Normalization
of the linear forms, that is, converting the linear form of each expression/formula into a single format, called a normal form.

The normal form is a data model that should be generic enough to model all expressions/formulas. We define a new data model called the sorted parse tree normal form. It is a parse tree where the children of any nodes are sorted from left to right whenever changing the order of the children does not alter the mathematical meaning, such as when the node is an associative and commutative operator, and the children are its operands. The ordering is based on any numerical or alphabetical associated key. For example, we can sort the children based on the number of descendent nodes that each child has in the parse tree. If two nodes have the same number of descendants, then we sort them alphanumerically based on the entity labels of the nodes. The normalization algorithm is an algorithm that sorts the parse tree.

Development of a query language
to express math queries as well as text queries. The math query language will be similar to but much simpler than LaTeX, where many LaTeX word-commands are replaced by short yet intuitive strings. The language incorporates notations from other languages. Table 2 gives a sample of operators, and Table 3 illustrates math queries.


Table 2: A Sample of Operators in the Math Query Language
OPERATOR MEANING
1-12-2  
+, -, /, * the standard arithmetic operators
^, ** Superscript or power
_ subscript
@ apply at as in f(x) or f@(x)
=>, <=>, <== imply, equivalent, if
!=, ~=, not= not equal
=- equivalence ($\equiv$)
=~ congruence ($\cong$)
=. dot equal ($\doteq$)
-~ similar-to or equal ($\simeq$)
~ similar-to ($\sim$)
~~ approximate ($\approx$)
$\vert . \vert$ determinant, magnitude


Table 3: Illustrations of Math Queries. Each query in the left column finds documents having formulas that contain the corresponding matched element in the right column of the table.
QUERY MATCHED ELEMENT
1-12-2  
x^2 $x^2$
(x OR t)^3 $x^3$ or $t^3$
$^3 any single-character cubed
sqrt(x^2+1) $\sqrt{x^2+1}$
Gamma(1/3) $\Gamma(1/3)$
^(x+2) (x+2) in an exponent part
/(x+2) (x+2) as a denominator
(... x+2 ...)/ (x+2) embedded in a numerator

We built an augmented search system for the DLMF. Its principal modules are described next.

Math query language:
Our language, illustrated in Table 3, is flexible, intuitive, easy-to-use, and efficiently expressive. It is flexible in that it accommodates a variety of ``idioms'' for operator symbols, e.g., MATLAB, C, FORTRAN, LATEX, and others. It is intuitive and easy to use in that it draws on what most users know and use in their daily math reading and writing. It is efficiently expressive in that a formula (or fragment thereof) can be expressed in a minimum number of keyboard characters, without resorting to lengthy and elaborate commands and markups. The details of the language will be in the help file of the system, and will be published elsewhere.

Math query parser/translator:
This is a front-end layer to parse and translate formula queries to purely textual-and-numeric queries that use Boolean and proximity operators. This module textualizes, flattens, and normalizes the users' queries. Table 1 illustrates some aspects of the work of this module.

Contents transformation module:
This transforms the mathematical content in the DLMF database, especially equations, formulas, and mathematical symbols, into a textual-numerical normalized form. This module and the query parser/translator perform some of the same processing - textualization, flattening, and normalization. However, they still differ in that the parser's input format, that is, the math query language syntax, is different from the transformation module's input format (e.g., the Latex encoding of math formulas in documents).

Surrogate-files generator:
This module generates the surrogate files that contain the transformed math content as well as other metadata. It clearly calls upon the content transformation module to modify the mathematical representation into an indexable form, as discussed earlier.

Thesaurus:
The thesaurus, which will continue to evolve, contains generic math entries, DLMF-specific entries, and markup related entries.

An example of a generic math entry is that of the term derivative, which includes among its synonyms the term differential or diff for short. An example of a DLMF-specific entry is:

\begin{displaymath}\begin{array}{rl}
\texttt{U:} & \texttt{(U OR VoigtU OR KummerU OR WhitU}\\
& \texttt{ OR ChebyU OR ChebyUs)}
\end{array}\end{displaymath}

which tells the system that U should be interpreted as the OR of generic U, the Voigt function $U$ (through its local markup VoigtU), the Kummer function $U$ (through its markup KummerU), and so on. Note that the latter example illustrates also the use of the thesaurus for capturing internal markup.

Those modules have been designed and implemented, and a working math search system is in place. The system has math search capabilities that far exceed those of text search engines. Still, the system is work in progress. Additional refinements are being made, and new capabilities are being added.

Among those new capabilities are (1) the proper handling of customary names and symbols that are widely used in the area of special functions, and (2) the support of basic equivalences, including commutativity and associativity, among others.

In the future, we will explore searching based on new and emerging encodings such as MathML and OpenMath. The precise direction that will be taken will be determined according to which of the new encoding standards will be adopted and become widely used and technology-supported.

Also, an all-math search engine, created from scratch for that purpose, may be developed, if and when the necessary research in indexing of mathematical structures yields the necessary understanding and techniques. Experience gained from the mathematical and scientific user community regarding math search, especially in reference to query languages, modes of search, and what kinds of math information users usually need to search for, will be an extremely valuable guide into building future math search systems.

Technical Aspects of the Digital Library of Mathematical Functions 1
Bruce R. Miller - Abdou Youssef
Translated by Bruce R Miller on 2002-12-17
Comments? DLMF_feedback@nist.gov
Digital Library of Mathematical Functions