Most modern search engines provide a mechanism for searching via a
text input box where the user is expected to type search terms. While primitive, this interface was pioneered
by major web-search providers and represented an evolution from the far more complex
interfaces that came earlier. When you
search for multiple terms, however, there seems to be only one basic paradigm:
“find every term”. At AV Text
Ministries, we believe that the vast world of search is ripe for a
search syntax that moves beyond basic search expressions. To this end, we are proposing a
Human-Machine-Interface (HMI) that can be invoked within a simple text input
box. The syntax fully supports basic
Boolean operations such as AND, OR, and NOT.
While great care has been taken to support the construction of complex
queries, greater care has been taken to maintain a clear and concise
syntax. As Clarity was our primary
concern, it became the name of our specification. In the spirit of open source licensing, any
application can implement the Clarity-HMI specification without royalty. We provide this text-based HMI specification
and a corresponding reference implementation of a command interpreter. Both the specification and the reference
implementation are shared with the broader community with a liberal MIT open
source license.
The Clarity-HMI maintains the assumption that proximity of terms to
one another is an important aspect of searching unstructured data. Ascribing importance to the proximity between
search terms is sometimes referred to as a proximal search
technique. Proximal searches
intentionally constrain the number of words that can be used to constitute a
match. The Clarity HMI specification
defines that range between search terms as the span.
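To make the idea of a span concrete, here is a minimal sketch of a proximal match. It assumes that the span counts words inclusively and that a match requires every term to fall inside one window of that many consecutive words. The function name within_span is hypothetical and is not part of the specification.

```python
def within_span(words, terms, span):
    """Return True if every term occurs inside some window of `span` consecutive words."""
    wanted = {t.lower() for t in terms}
    lowered = [w.lower() for w in words]
    for start in range(len(lowered)):
        # Slide a window of `span` words across the text.
        window = set(lowered[start:start + span])
        if wanted <= window:
            return True
    return False


text = "In the beginning God created the heaven and the earth".split()
print(within_span(text, ["beginning", "created", "earth"], 8))
print(within_span(text, ["beginning", "created", "earth"], 7))
```

In this sample sentence, "beginning" and "earth" are eight words apart (inclusive), so the terms fit within a span of 8 but not within a span of 7.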
The Clarity specification defines a declarative syntax for specifying
search criteria using the find
verb. Clarity also defines additional
verbs to round out its syntax as a simple straightforward means to interact
with custom applications where searching text is the fundamental problem at
hand. As mentioned earlier, AV Text
Ministries provides a reference implementation.
This implementation is written in C# and runs in a command
console using .NET on Windows or using Mono on macOS. As source code is provided, it
can be seamlessly extended by application programmers.
Clarity syntax:
Clarity Syntax comprises a standard set of six (6) verbs. Each verb corresponds to a basic operation:
Find
Summarize
Export
Set
Get
Clear
The verbs listed above are for the English flavor of Clarity. As Clarity is an open and extensible
standard, verbs for other languages can be defined without altering the overall
syntax structure of the HMI. The
remainder of this document describes Version 1.0 of the Clarity-HMI
specification.
In Clarity terminology, each verb is considered to be a directive. While there are six distinct verbs, there
are only five types of directives:
1. SEARCH Directives
   · find
   · summarize
2. FILE Directives
   · export
3. CONFIG Directives
   · set
4. STATUS Directives
   · get
5. RESET Directives
   · clear
Most directives operate at the session level, but CONFIG & RESET
directives can be qualified to operate globally for the user across the current
and all future sessions. The at-symbol
adapts such directives to operate globally for the user. In other words, prefixing commands with an
at-symbol (e.g. @set & @clear)
adapts the command to program-scope.
When commands operate on configuration parameters that are at
program-scope, we refer to these parameters as “globals”.
Every Clarity statement begins with a verb. From a linguistic standpoint, all Clarity statements
are issued in the imperative. After the verb, a Clarity statement can be
broken into one or more segments. The
syntax for each segment is dependent upon the type of directive for the
segment. And the type of directive is
dictated by the verb. We will refer to
the directive of which a verb is a member as the verb-class. For example, the verb-class of find is
“SEARCH”. All verbs within a verb-class
share the same syntax rules.
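The relationship between verbs and verb-classes can be sketched as a simple lookup table. This is only an illustration in Python (the reference implementation is written in C#). The names VERB_CLASSES and verb_class are hypothetical.

```python
# English verbs of Clarity mapped to their verb-class (directive type).
VERB_CLASSES = {
    "find": "SEARCH",
    "summarize": "SEARCH",
    "export": "FILE",
    "set": "CONFIG",
    "get": "STATUS",
    "clear": "RESET",
}


def verb_class(statement):
    """Return the verb-class of a statement, based on its leading verb."""
    parts = statement.split(None, 1)
    if not parts:
        raise ValueError("a statement must begin with a verb")
    # An at-symbol prefix changes the scope, not the verb-class.
    verb = parts[0].lstrip("@").lower()
    if verb not in VERB_CLASSES:
        raise ValueError("unknown verb: " + verb)
    return VERB_CLASSES[verb]


print(verb_class('find "in the beginning"'))
```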
Clarity supports two types of statements:
1. Simple statements
2. Compound statements
A simple statement is merely a verb followed by a single segment. In simple statements, the verb-class fully
constrains the segment to be of that same type.
Here is an example of a simple statement using a SEARCH directive:
find "in the beginning"
We will get into the particular way that this statement is
parsed later in this document. But for
now, we should notice that we have one verb and one segment. The type of the segment agrees with the verb
and there is only one segment.
Consequently, this is, by definition, a simple statement. We should notice that we have not informed
our search engine what to search. So
here is another example of a simple statement using a CONFIG directive:
set source=bible
If we had run this configuration command prior to the search
command listed above, our first match would be found in Genesis 1:1. But as the source domain of our search is a
key element of our search, we should have a way to express both of these in a
single command. And this is the rationale
behind a compound statement. A compound
statement has more than one segment. To
combine the previous two statements into one compound statement, we can issue
this command:
find "in the beginning" + source=bible
Segments in compound statements are delimited by plus-signs. For all compound statements, at least one
segment must agree with the verb-class.
Each verb-class defines whether additional segment types are
permissible. In the case of FILE
directives and SEARCH directives, CONFIG segments are also valid segment
types. All other verb-classes require that
each segment agrees with the verb-class of the statement.
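The way a compound statement breaks into segments can be sketched as follows. This is a minimal illustration, not the reference parser. It assumes that a plus-sign inside double-quotes is literal text, that an unquoted segment containing an equals-sign is a CONFIG segment, and it ignores minus-signs entirely. The function names are hypothetical.

```python
def split_segments(body):
    """Split the text that follows the verb into segments on plus-signs outside double-quotes."""
    segments, current, in_quotes = [], [], False
    for ch in body:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == "+" and not in_quotes:
            segments.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    segments.append("".join(current).strip())
    # Drop empty segments produced by stray delimiters.
    return [s for s in segments if s]


def is_config_segment(segment):
    """Assumption: an unquoted segment containing '=' is a CONFIG key-value pair."""
    return "=" in segment and not segment.startswith('"')


for segment in split_segments('"in the beginning" + source=bible'):
    print(segment, "CONFIG" if is_config_segment(segment) else "SEARCH")
```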
Before we go deep into the syntax of compound statements, we
should define one more abstraction defined in the Clarity HMI specification. So far, we have considered only ordinary
statements. Ordinary statements always
begin with a verb and contain at least one segment.
In this section, we will introduce the abstraction of
labeled statements: we can apply a label
to a statement to provide a shorthand for subsequent execution. This gives rise
to two new definitions:
1. Labeled-Statement definitions
   Commands that allow us to save a statement with a label.
2. Labeled-Statement executions
   The ability to execute a previously labeled statement.
Let’s say we want to name our previously identified SEARCH
directive with a label; we’ll call it “genesis”. To accomplish this, we would issue this
command:
genesis: find "in the beginning" + source=bible
It’s that simple. Now, instead of typing the entire statement,
we can use the label as shorthand to execute our newly saved command. Here is the command:
{genesis}
By default, labeled commands are scoped to the session. When evaluating a command label, the session
is examined first, and if not defined within the session, the global label is
expanded. However, when defining the
label, complete control over user-scope versus session scope is available. As with all Clarity directives, session-scope
is the default. Prefixing a label with
the at-symbol ( @ ) defines the label globally. Of course, without the at-symbol, the label
is defined only within the current session.
Whenever an expression begins with open-brace ( { ) and ends with close-brace ( } ), then it invokes the
statement that has been associated with the label within the braces. As we saw earlier, if a command contains a
colon, then the label before the statement becomes registered as shorthand for
the statement.
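The lookup order for labels (session first, then global) can be sketched with two dictionaries. This is a simplified illustration under stated assumptions: the class LabelRegistry and its methods are hypothetical, the first colon in a command is assumed to end the label, and no attempt is made to distinguish label definitions from other commands that happen to contain a colon.

```python
class LabelRegistry:
    """Stores labeled statements at session scope and at global (user) scope."""

    def __init__(self):
        self.session = {}
        self.globals = {}

    def define(self, command):
        """Register 'label: statement'; an at-symbol prefix on the label makes it global."""
        label, statement = command.split(":", 1)
        label = label.strip()
        target = self.session
        if label.startswith("@"):
            label = label[1:].strip()
            target = self.globals
        target[label] = statement.strip()

    def resolve(self, invocation):
        """Expand '{label}', examining the session before the globals."""
        label = invocation.strip()[1:-1].strip()
        if label in self.session:
            return self.session[label]
        return self.globals[label]


registry = LabelRegistry()
registry.define('genesis: find "in the beginning" + source=bible')
print(registry.resolve("{genesis}"))
```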
It should be noted that compound statements also work with
labels.
Let’s label another statement:
my label can contain spaces: set span=8
Compound execution of labeled statements can be accomplished as follows:
{genesis} + {my label can contain spaces}
As the previous command is valid syntax for a statement, it even follows that we can define
this macro:
sample: {genesis} + {my label can contain spaces}
Later, we can issue this command:
{sample}
which is equivalent to executing these labeled statements:
{genesis} + {my label can contain spaces}
Labels can be defined in terms of an ordinary statement or
using one or more labels inside of braces.
The two constructs can also be mixed:
derived: {original} + find foo
Here are four more examples of labeled statement
definitions:
C1: SET search=strict
C2: SET span=8
F1: FIND Godhead
F2: FIND eternal
We can execute these as a compound statement by issuing this command:
{C1} + {C2} + {F1} + {F2}
Similarly, we could define another label from these by issuing this command:
sample2: {C1} + {C2} + {F1} + {F2}
Prior to running compound labeled statements, a normalization process occurs.
Example of normalization for the sample2 label:
FIND godhead + eternal + search=strict + span=8
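The normalization shown above can be approximated with a short sketch. It assumes that the labels have already been expanded into their statements, that search segments are emitted before configuration segments, and that search terms are lower-cased (case being insignificant). The function name normalize is hypothetical, and the reference implementation may normalize differently.

```python
def normalize(statements):
    """Merge expanded labeled statements into one statement under a single head verb."""
    head, search, config = None, [], []
    for statement in statements:
        verb, _, body = statement.strip().partition(" ")
        if verb.upper() == "SET":
            # SET is implied when its key-value pairs accompany another verb.
            config.append(body.strip())
        else:
            head = head or verb.upper()
            search.append(body.strip().lower())
    # If only SET statements were supplied, SET remains the head verb.
    return (head or "SET") + " " + " + ".join(search + config)


expanded = ["SET search=strict", "SET span=8", "FIND Godhead", "FIND eternal"]
print(normalize(expanded))
```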
Interestingly, when SET macros are combined with
another verb, the key-value pairs apply ONLY to the execution of the other verb
[FIND or EXPORT], not to the entire session.
However, if an execution ONLY contains SET verbs,
then the key-value pairs affect the session (or the program if @SET is
used). SET is implied (but not
explicitly required) when paired with a SEARCH or EXPORT directive. The head verb of the command always defines
the scope. For example, configuration
variables are always execution scope when combined with a SEARCH
directive. See the table below for
compatibility of directives and the implicit scope of the command.
Head Directive | Secondary Directive | Implied Scope |
SEARCH | CONFIG | Execution |
FILE | CONFIG | Execution |
CONFIG | n/a | Session |
RESET | n/a | Session |
STATUS | n/a | Session & Program |
@CONFIG | n/a | Program |
@RESET | n/a | Program |
When part of a command, the SEARCH directive or FILE directive becomes
the head directive. SEARCH directives
and FILE directives cannot be part of the same statement. Similarly, when STATUS is part of a command,
no other verb-class is permissible. At-symbols
on verbs are only permitted on the first verb in a sequence and only when the
verb-class is CONFIG or RESET. However, at-symbols
can be used in labeled commands and when such commands are paired with SEARCH
or FILE head directives, the command gets downgraded to execution-scope level,
just as if the macro were defined without an at-symbol.
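The compatibility rules and the implied scope can be sketched as a small function. This is an interpretation of the table and the rules above, not code from the reference implementation. The function name implied_scope and its parameters are hypothetical.

```python
def implied_scope(verb_classes, at_prefixed=False):
    """Return the implied scope for a command made of the given verb-classes."""
    classes = set(verb_classes)
    if "SEARCH" in classes and "FILE" in classes:
        raise ValueError("SEARCH and FILE directives cannot share a statement")
    if "STATUS" in classes:
        if classes != {"STATUS"}:
            raise ValueError("STATUS permits no other verb-class")
        return "Session & Program"
    if classes & {"SEARCH", "FILE"}:
        if classes - {"SEARCH", "FILE", "CONFIG"}:
            raise ValueError("only CONFIG may accompany a SEARCH or FILE head")
        # At-symbols are downgraded when a SEARCH or FILE directive is the head.
        return "Execution"
    return "Program" if at_prefixed else "Session"


print(implied_scope(["SEARCH", "CONFIG"]))
print(implied_scope(["CONFIG"], at_prefixed=True))
```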
This concludes our discussion of labeled statements. Now let’s go deeper into compound
statements. Just keep in mind that regardless
of the complexity of a compound statement, it can be labeled for shorthand
execution.
Consider the proximity search where the search target is the
bible. Here is an example search using
Clarity syntax:
find source=bible + beginning created earth
Clarity syntax can alter the span by supplying a configuration
segment:
find source=bible + span=8 + beginning created earth
Assignment clauses can also be standalone
to avoid redundancy with successive find commands:
set source=bible + span=7
set search=strict
Now consider a different search:
find God created earth
Next, consider a search to find that God created heaven or
earth:
find God created (earth heaven)
The order in which the search terms are provided is
insignificant. Additionally,
letter case is insignificant.
Of course, there are times when word order is significant. Accordingly, searching for explicit strings
can be accomplished using double-quotes as follows:
find "God created ... Earth"
These constructs can even be combined.
For example:
find "God created ... (Heaven Earth)"
As Clarity supports multiple segments, the above search criteria would
be equivalent to this search:
find "God created ... Heaven" + "God created ... Earth"
In all cases, “...” means “followed by”, but the ellipsis allows other
words to appear between created and heaven.
Likewise, it allows words to appear between created and Earth.
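One way to picture the "followed by" behavior of the ellipsis is to translate a quoted segment into a regular expression. The sketch below is only an illustration: it ignores the span and does not handle parenthetical or bracketed terms. The function name quoted_to_regex is hypothetical.

```python
import re


def quoted_to_regex(phrase):
    """Translate the text of a quoted segment into a case-insensitive regular expression."""
    parts = []
    gap = False
    for token in phrase.split():
        if token in ("...", "\u2026"):
            # Remember that other words may appear before the next term.
            gap = True
            continue
        if parts:
            parts.append(r"(?:\W+\w+)*?\W+" if gap else r"\W+")
        parts.append(re.escape(token))
        gap = False
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


pattern = quoted_to_regex("God created ... Earth")
verse = "In the beginning God created the heaven and the earth"
print(pattern.search(verse) is not None)
```

Without an ellipsis, adjacent terms must follow one another directly, so word order and adjacency are both enforced.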
AV Text Ministries imagines that Clarity HMI can be applied broadly in
the computing industry and can easily be applied outside of the narrow domain
of biblical studies. For example, the
Clarity syntax could easily handle statements such as:
find source=Wall Street Journal + "Trump ... tax cuts"
Of course, translating the commands into actual search results might
not be trivial for the application developer.
Still, a parser for Clarity commands is
freely available in the reference implementation.
Clarity is designed to be intuitive. It provides the ability to invoke
Boolean logic on how term matching should be performed. Parentheses can be used to invoke Boolean
multiplication upon the terms that compose a search expression. For instance, there are situations where the
exact word within a phrase is not precisely known. For example, when searching the KJV bible,
one might not recall which form of the second person pronoun was used in an
otherwise familiar passage. Attempting
to locate the serpent’s words to Eve in Genesis, one might execute a search such
as:
find (you thou ye) shall not surely die
This statement uses Boolean multiplication
and is equivalent to this lengthier statement:
find you shall not surely die + thou shall not surely die + ye shall not surely die
The example above also reveals how multiple
search segments can be strung together to form a compound search. Logically speaking, the segments are OR’ed together, which implies that any of the three matches
is acceptable. Parenthetical terms
provide a shorthand for this type of search.
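Boolean multiplication can be sketched as a Cartesian product over the alternatives. The sketch assumes unnested parentheses separated from neighboring words by spaces. The function name expand_parentheses is hypothetical.

```python
import itertools
import re


def expand_parentheses(segment):
    """Expand parenthetical terms into the equivalent list of plain segments."""
    # Each token is either a whole parenthetical group or a single word.
    tokens = re.findall(r"\([^()]*\)|\S+", segment)
    choices = [t[1:-1].split() if t.startswith("(") else [t] for t in tokens]
    return [" ".join(combo) for combo in itertools.product(*choices)]


for expanded in expand_parentheses("(you thou ye) shall not surely die"):
    print(expanded)
```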
Definitions:
While some of these concepts have already been introduced, the
following section can be used as a glossary for the terminology used in the
Clarity HMI specification.
Directives are
composed of verbs and are used to construct statements for the Clarity Command
Interpreter. Each directive has
specialized syntax tailored to the imperative verb used in the
statement. The directive limits the type
of segments that may follow. Most
directives permit only a single segment type.
FILE and SEARCH directives also allow CONFIG segments. There are five types of directives. These correspond exactly to five verb
classes. While there are six verbs,
there are only five verb-classes. Each
verb-class corresponds to exactly one of the five directive types.
Segments: the verb is followed by one or more
segments. Each segment has a type, and
the type of the segment must be compatible with the directive. As there are five types of directives, it is not
a coincidence that there are five types of segments. It is noteworthy that the syntax of a STATUS
segment is identical to the syntax of a RESET segment, but we still consider
the segment types to be distinct.
SEARCH statement: Each statement contains one or more search
segments. If there is more than one SEARCH segment, each segment is logically OR’ed
with all other segments.
SEARCH segment: Each segment contains one or more search
terms. A SEARCH segment is either unquoted or quoted.
Unquoted SEARCH segment: an unquoted segment
contains one or more search words. If there is more than one word, then each
word is logically AND’ed with all other words within
the segment. Like all other types of segments, the
segment terminates
with a plus-sign or a newline.
NOTE:
The absence of
double-quotes means that the segment is unquoted.
Quoted SEARCH segment: a quoted segment contains a single string
of terms to search. An explicit match on
the string is required. However, an
ellipsis ( … ) can be used to indicate that wildcards
may appear within the quoted string.
NOTE:
It is called quoted because the entire
segment is enclosed in double-quotes ( " ).
Parenthetical Terms: When searching, there are situations when the
exact word that appears in a text is not precisely known. Parentheses can be used to list the alternatives, any one of which constitutes a match.
Bracketed Terms: When searching, there are situations when the order of
some terms within a quoted segment is unknown.
Square brackets can be used to identify such terms. For example, consider this SEARCH statement:
find "[God created] heaven and earth" + source=bible
The above statement is equivalent to:
find "God created heaven and earth" + "created God heaven and earth" + source=bible
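The equivalence above can be sketched by generating every ordering of the bracketed terms. The sketch operates on the text inside the double-quotes and assumes unnested brackets. The function name expand_brackets is hypothetical.

```python
import itertools
import re


def expand_brackets(phrase):
    """Expand bracketed terms into every ordering of the words inside the brackets."""
    tokens = re.findall(r"\[[^\[\]]*\]|\S+", phrase)
    choices = []
    for token in tokens:
        if token.startswith("["):
            words = token[1:-1].split()
            choices.append([" ".join(p) for p in itertools.permutations(words)])
        else:
            choices.append([token])
    return [" ".join(combo) for combo in itertools.product(*choices)]


for expanded in expand_brackets("[God created] heaven and earth"):
    print(expanded)
```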
and: In Boolean logic, and means that all terms must be found. With Clarity-HMI, and is represented
by terms that appear within an unquoted segment.
or: In Boolean logic, or means that any term constitutes a match. With Clarity-HMI, or is represented by the plus-sign ( + ) between SEARCH segments.
not: In Boolean logic, not means that the term must not be found. With Clarity, not is represented by a
minus-sign ( - ) and applies to an entire segment (it
cannot be applied to individual words unless the search segment has only a single
term). In other words, a minus-sign
means subtract results; it cancels-out matches against all matches of other
segments. Most segments are additive as
each additional segment increases search results. Contrariwise, a not segment is subtractive as it decreases search results.
NOTE:
The minus-sign means that the segment will be subtracted from the
search results while its absence means that the segment will be added to the
search results. When only a single segment
follows a SEARCH directive, it is always positive. A single negative segment following the find
imperative, while it might be grammatically valid syntax, will never match
anything. Therefore, while permitted in
theory, it would have no real-world meaning.
Consequently, some implementations of Clarity-HMI may disallow such a
construct.
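The additive and subtractive behavior of segments can be sketched with set operations. The sketch assumes that each segment has already been evaluated into a set of match identifiers. The verse identifiers shown are made up for illustration, and the function name combine is hypothetical.

```python
def combine(positive, negative):
    """Union the matches of positive segments, then subtract the matches of negative segments."""
    matches = set().union(*positive) if positive else set()
    for hits in negative:
        matches -= set(hits)
    return matches


created_god = {"Gen 1:1", "Gen 1:21", "Gen 1:27"}  # hypothetical matches for: created God
heaven_earth = {"Gen 1:1"}                         # hypothetical matches for: - heaven earth
print(sorted(combine([created_god], [heaven_earth])))
print(combine([], [heaven_earth]))
```

The second call illustrates why a lone negative segment can never match anything: there are no positive results from which to subtract.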
It should also be noted that since minus-signs are used as special
characters in Clarity syntax, this can create ambiguity with respect to words
that contain hyphens. Set and clear directives can operate on syntax
variables. Currently, only one syntax
variable is supported. It will be
explained via usage examples:
set hyphen=_
clear hyphen
@set hyphen=_
@clear hyphen
When hyphens are mapped, words like x-ray can be specified as x_ray, without the hyphen colliding with the meaning of the
minus-sign as a segment delimiter. The
mapping character cannot be any character reserved by Clarity in its command
syntax.
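The hyphen mapping can be sketched as a simple substitution applied to each search term. The set of reserved characters is taken from the special-characters list in this document. The names RESERVED and map_hyphen are hypothetical.

```python
# Characters reserved by Clarity syntax, per the special-characters list in this document.
RESERVED = set(':{}@-+=()[]"/#*?%')


def map_hyphen(term, mapping_char):
    """Replace the configured mapping character in a search term with a literal hyphen."""
    if len(mapping_char) != 1 or mapping_char in RESERVED:
        raise ValueError("the mapping character must be a single unreserved character")
    return term.replace(mapping_char, "-")


print(map_hyphen("x_ray", "_"))
```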
In some dialects of Clarity (e.g. Clarity AVX), hyphens in words are
never required. For example, xray would be a valid shorthand for x-ray (making the
hyphen superfluous). In such
implementations, the default setting for hyphens would likely be to disallow them.
More Examples:
Consider a query for all passages that contain God AND created, but
NOT containing earth AND NOT containing heaven:
find source=old testament of bible + span=15 + created GOD - Heaven Earth
(this could be read as: find in the old testament, using a span of 15, the words
created AND God, but NOT Heaven AND Earth)
The simplest form to find ALL of three words (in the beginning):
find in the beginning
It should be noted that such a statement would find either of these strings
in the text:
in the beginning
the beginning of summer in
If a specific string should be matched, this can be stated explicitly:
find "in the beginning"
If you are unsure what article should match, you could issue this
statement:
find "in (a the that) beginning"
Boolean multiplication would match only these strings of text:
in a beginning
in the beginning
in that beginning
If you are unsure which words might separate a phrase, you could issue
this statement:
find "in the beginning … heaven and earth"
With this ellipsis in the find statement, it would match this string
of text:
in the beginning, God created heaven and earth
To issue a query to find ANY words, separating each word with a plus
sign would be the simplest form to find ANY of these five words:
find Lord + God + messiah + Jesus + Christ
If you are unsure about word order within a phrase, square brackets
can be used:
find "in the beginning … [earth heaven]"
With this ellipsis and the final two bracketed terms, it would also
match this string of text:
in the beginning, God created heaven and earth
The "export" verb has very limited grammar. For simplicity, consider the basic variants:
export: output="C:\user\me\Documents\genesis.txt" + format = text + selection=genesis
export: format = utf8 + output="C:\user\me\Documents\genesis.txt" + selection=genesis
export: format = utf16 + output = "C:\user\me\Documents\genesis.txt" + selection=genesis
export: format = utf32 + output = "C:\user\me\Documents\genesis.txt" + selection=genesis
export: format = html + output = "C:\user\me\Documents\exodus.html" + selection=exodus
export: format = docx + output = "C:\user\me\Documents\Leviticus.docx" + selection=leviticus
CONFIG directives:
The set command can be used to set the configuration for default
export formats for the current session of the command interpreter:
set format = docx
set format = html
set format = text
The @set command can be used to globally set the configuration for
default export formats for the current user:
@set format = docx
@set format = html
@set format = text
STATUS directives:
There are two status directives.
One displays the current setting for the session:
get format
The other displays the current global setting:
@get format
RESET directives:
Defaults for SEARCH directives can be restored within the session:
clear span [The global value for span will be restored for the session]
clear format [The global value for format will be restored for the session]
Defaults for SEARCH directives can be globally restored (for this and
any future session):
@clear span [equivalent to: @set span = 7]
@clear format [equivalent to: @set format = html]
ADDITIONAL NOTES:
In all cases, any number of spaces can be used between operators and terms.
Also noteworthy: The reference Clarity implementation automatically
adjusts the span of your search to be at least the number of search terms in the
largest segment. So if you were to express:
find span=1 + in the beginning (God Lord Jesus Christ Messiah)
The minimum span has to be four (4), because the parenthetical group occupies a single word position. So the Clarity
parser will adjust the search criteria as if the following command had been
issued:
find span=4 + in the beginning (God Lord Jesus Christ Messiah)
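The span adjustment can be sketched by counting word positions in each search segment, with a parenthetical group counted as a single position. This is an interpretation of the behavior described above, not code from the reference implementation. It does not treat an ellipsis specially, and the function names are hypothetical.

```python
import re


def minimum_span(segments):
    """Return the largest number of word positions found in any search segment."""
    def term_count(segment):
        # A parenthetical group counts as one position; double-quotes are ignored.
        return len(re.findall(r"\([^()]*\)|\S+", segment.replace('"', " ")))
    return max((term_count(s) for s in segments), default=0)


def adjust_span(span, segments):
    """Widen the requested span when it is smaller than the largest segment."""
    return max(span, minimum_span(segments))


print(adjust_span(1, ["in the beginning (God Lord Jesus Christ Messiah)"]))
```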
SPECIAL CHARACTERS & OPERATOR PRECEDENCE:
The order for operator precedence is defined in AV Word as follows:
:
{ }
@
-
+
=
( )
[ ]
"
/
#
*
?
%
FINAL SUMMARY NOTE:
Clarity V1.01 (English) can be summarized as follows:
Imperative Verb | Expected Object(s) of Verb | Optional configuration segments |
find summarize | one or more SEARCH segments | key-value pairs in the form of: key = value |
export | One FILE segment with optional variables: e.g. %source, %title, %chapter, %book | key-value pairs in the form of: key = value |
set | one or more key-value pairs in the form of: key = value | |
get clear | one or more keys in the form of: key | |
Clarity syntax is summarized below:
VERB segment
or
VERB segment + segment
or
VERB segment + segment + segment
etc.
Exactly one VERB is required.
At least one segment is required.
Directives refine the syntax generalized above.