LDC Search Functional Specification

LDC Search

Overview

This page describes the functional requirements for the LDC search system. The LDC search system provides the backend search functionality for projects at the LDC and is also the search system behind the LDC online. The system includes the ability to store, index, and search textual information with minimal intervention by the end user.

Philosophy

With a number of open source search engines available, one might wonder why the LDC would develop its own. Search engines are complex pieces of software, usually developed by many contributors from various backgrounds. No matter how general-purpose a piece of software is designed to be, design choices are ultimately made that may or may not suit a given application. Lucene, for example, is written in Java and designed to handle dynamic collections of documents, i.e. collections where documents are added and deleted often without reindexing. As a result (Java, index design, other constraints I'm not aware of), indexing is extremely slow (about 8 hours per million Gigaword documents). Our applications, by contrast, generally involve static document collections which we need to search using various techniques. Moreover, reindexing an entire collection is often desirable after some sort of preprocessing on the data has occurred. If indexing is fast, this isn't an issue. The LDC engine is able to index the entire Gigaword corpus (~3 million documents) in approximately 2 hours on a single-processor Dell workstation.

As another example, on the query side, Lucene is designed to 'stop' (ignore) high-frequency terms. For most applications, search results are not improved by including words like 'the' in the index, so they can usually be ignored. However, for linguistic applications one is often interested in finding high-frequency terms in a given context. Or, in a part-of-speech tagged corpus, one might be interested in finding all occurrences of an 'N' (noun) tagged term. Lucene's index is not designed for efficient searching of terms which occur potentially hundreds of times in every document of the corpus. The LDC engine solves this problem by indexing the indices of very high occurrence terms.

In addition to the performance advantages, the LDC engine is designed to take advantage of every lexical clue when searching. That means not only *not* employing a stop word list, but actually being able to leverage high-frequency terms like 'of', 'the', 'over' to find useful documents. A perfect example is the search for ACE documents. Annotators need to find documents that have a high likelihood of containing specific events and relationships among these events and the objects of these events. Queries for generic documents of this type cannot rely on specific terms from example documents such as 'Russian', 'knive', 'bomb', etc., since the discriminating power of such terms will dominate the search and return more of the same examples. Instead, the engine must be able to employ less specific terms in the hope of returning general documents containing events of all types.

Another feature of the LDC engine useful in the semantic annotation of a document corpus is the ability to provide negative evidence as well as positive evidence.

Indexing and Data Loading Features

Data is loaded into the system by placing a data definition file into a predefined location. The system automatically looks for data definition files and upon finding them it loads and indexes the data. Logs are also kept in a predefined location.

Search Features

keyword search

Simple keyword searches default to AND-ing multiple terms, e.g.

cat dog == cat & dog

phrase search

Phrases are specified using quotes, as is customary. The terms of a phrase must occur in the document in the same order and proximity to each other as in the query.

"summer european championships"

fielded search

Documents in a database may contain various fields holding textual and numeric information. The engine allows searches over this fielded information. A keyword search term matches in the body of the document as well as in any field. Additionally, a term may be restricted to match exclusively in a given field by prepending the field name and a colon to the search term, e.g.

banking executive source:apw

will match documents which contain 'banking' and 'executive' in any field of the document and which contain 'apw' in the source field.
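
The matching rule described above can be illustrated with a minimal sketch. It is not the engine's implementation: the function name matches, the representation of a document as a dict of field name to text, and whitespace tokenization are all assumptions made for illustration.

```python
def matches(doc, query):
    """Return True if every chunk of the query matches the document.

    doc:   dict mapping lowercase field names to text (hypothetical layout;
           the document body is just another field here).
    query: whitespace-separated terms, each optionally prefixed 'field:'.
    """
    for chunk in query.lower().split():
        field, sep, term = chunk.partition(":")
        if sep:
            # Restricted term: must occur in the named field.
            if term not in doc.get(field, "").lower().split():
                return False
        else:
            # Unrestricted term: may occur in the body or in any field.
            if not any(chunk in text.lower().split() for text in doc.values()):
                return False
    return True


doc = {"source": "apw", "body": "a banking executive resigned"}
print(matches(doc, "banking executive source:apw"))
```

In this sketch a document whose body merely mentions 'apw' satisfies the bare term apw but not source:apw, which is the distinction the fielded syntax is meant to express.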

exclusion term

A term prefixed with a minus sign is excluded, e.g. -dog matches only documents that do not contain 'dog'.

numeric fields

date:[19990101]                       exact match
date:[19990101-]                      matches dates 19990101 and later
date:[-19990101]                      matches dates up to and including 19990101
date:[19990101-20010101]              matches between and including 19990101 and 20010101
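
The four range forms above can be read as inclusive bounds, either of which may be open. The following sketch shows one way to parse and test them; the names parse_range and in_range are hypothetical, and treating dates as plain integers in YYYYMMDD form is an assumption based on the examples.

```python
def parse_range(spec):
    """Parse '[lo-hi]', '[lo-]', '[-hi]' or '[value]' into an inclusive
    (lo, hi) pair; None marks an open end."""
    spec = spec.strip()
    if not (spec.startswith("[") and spec.endswith("]")):
        raise ValueError("range must be enclosed in square brackets")
    inner = spec[1:-1]
    if "-" not in inner:
        value = int(inner)          # exact match: lo == hi
        return value, value
    lo, hi = inner.split("-", 1)
    return (int(lo) if lo else None, int(hi) if hi else None)


def in_range(value, bounds):
    """Check a numeric field value against inclusive, possibly open bounds."""
    lo, hi = bounds
    return (lo is None or value >= lo) and (hi is None or value <= hi)


print(in_range(20000615, parse_range("[19990101-20010101]")))
```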

search-within-results

Given a result set ranked by relevancy, search within the first X documents for a keyword.
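
A minimal sketch of this behaviour, assuming the ranked results are available as a list of document ids and the document text as a dict (both hypothetical representations, as is the function name search_within):

```python
def search_within(ranked_doc_ids, docs, keyword, top_n):
    """Keep, in rank order, those of the first top_n results that contain keyword."""
    keyword = keyword.lower()
    return [doc_id for doc_id in ranked_doc_ids[:top_n]
            if keyword in docs[doc_id].lower().split()]


docs = {1: "cat sat", 2: "dog ran", 3: "cat ran", 4: "cat slept"}
ranked = [4, 2, 3, 1]                       # best match first
print(search_within(ranked, docs, "cat", 3))
```

The original ranking is preserved; documents beyond the first top_n are never examined, even if they contain the keyword.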

Boolean Queries

The engine supports boolean query syntax. Query term 'chunks' may be combined together in a standard boolean fashion to describe a query to the engine. A 'chunk' is defined as follows:

chunk -> term
      -> "term term ..."              quoted phrase requires proximity to match
      -> fieldname:term               fielded term must occur only within fieldname
      -> fieldname:"term term ..."    fielded quoted phrase

Two chunks may be anded together using an implied AND or by an explicit use of &:

cat dog
cat & dog

Parentheses may be used to impose precedence on the operators. Otherwise, operators are applied left to right.

cat & (dog | sun) & title:lamb

Other examples:

"bank executive"               phrase
-dog                           term exclusion
cat | dog                      or'ed terms
source:apw                     fielded
source:apw | source:nyt        fielded

mixed boolean examples

date:[19990101-] (source:apw | source:nyt) cat dog ("new york" | "san francisco")
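
The boolean syntax above (chunks, implied AND, & and |, parentheses, left-to-right evaluation, and -term exclusion) can be sketched as a small recursive-descent parser. This is an illustration only, not the engine's parser: the names tokenize, parse and evaluate, the tuple-based tree, and the chunk_matches callback that decides whether a single chunk matches a document are all assumptions.

```python
import re

# One token: a parenthesis, an operator, or a chunk (optional '-' exclusion,
# optional 'field:' prefix, then a quoted phrase or a bare term).
TOKEN = re.compile(r'\s*(\(|\)|&|\||-?(?:\w+:)?(?:"[^"]*"|[^\s()&|"]+))')


def tokenize(query):
    query = query.strip()
    tokens, pos = [], 0
    while pos < len(query):
        m = TOKEN.match(query, pos)
        if not m:
            raise ValueError(f"cannot tokenize query at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse(tokens):
    """Build a nested-tuple tree, combining operators strictly left to right."""
    pos = 0

    def operand():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = expression()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if tok in (")", "&", "|"):
            raise ValueError(f"unexpected token {tok!r}")
        if tok.startswith("-"):
            return ("not", ("chunk", tok[1:]))      # exclusion term
        return ("chunk", tok)

    def expression():
        nonlocal pos
        node = operand()
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] in ("&", "|"):
                op = "and" if tokens[pos] == "&" else "or"
                pos += 1
            else:
                op = "and"                          # adjacent chunks: implied AND
            node = (op, node, operand())
        return node

    tree = expression()
    if pos != len(tokens):
        raise ValueError("unbalanced parenthesis")
    return tree


def evaluate(node, chunk_matches):
    """Evaluate the tree for one document; chunk_matches(chunk) -> bool."""
    kind = node[0]
    if kind == "chunk":
        return chunk_matches(node[1])
    if kind == "not":
        return not evaluate(node[1], chunk_matches)
    left = evaluate(node[1], chunk_matches)
    right = evaluate(node[2], chunk_matches)
    return (left and right) if kind == "and" else (left or right)


tree = parse(tokenize('cat & (dog | sun) & title:lamb'))
print(evaluate(tree, lambda chunk: chunk in {"cat", "sun", "title:lamb"}))
```

Because operators are combined left to right without precedence levels, cat | dog & sun is read as (cat | dog) & sun in this sketch; parentheses are needed to group it differently.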

relevancy search

A relevancy search ranks all documents of the database by their relevancy to a set of query terms. A query is identified as a relevancy search query by enclosing the terms within curly brackets. Relevancy searching can be combined with boolean searching such that the ranking occurs over only those documents that meet the boolean requirements. The following is an example of a relevancy search query.

{skiing olympics women's downhill}

relative keyword weighting

The weight of individual terms of a relevancy search query can be adjusted relative to the other terms of the query using the 'hat' (^) character followed by the weight.

{skiing olympics^2 women's downhill}
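
As an illustration of the weighting syntax, the sketch below splits a relevancy query into (term, weight) pairs. The function name parse_relevancy_query and the default weight of 1.0 for unweighted terms are assumptions; the specification does not state the default.

```python
import re


def parse_relevancy_query(query):
    """Return a list of (term, weight) pairs from a '{...}' relevancy query."""
    query = query.strip()
    if not (query.startswith("{") and query.endswith("}")):
        raise ValueError("relevancy queries are enclosed in curly brackets")
    weighted = []
    # Quoted phrases are kept as single tokens; everything else splits on whitespace.
    for token in re.findall(r'"[^"]*"|\S+', query[1:-1]):
        term, sep, weight = token.rpartition("^")
        if sep and term:
            weighted.append((term, float(weight)))
        else:
            weighted.append((token, 1.0))   # assumed default weight
    return weighted


print(parse_relevancy_query("{skiing olympics^2 women's downhill}"))
```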

query by example

A document of the database can be used as a relevancy search query by citing its internal document number. A generic file not in the database can be used by giving the pathname to the file. The file must either reside in the current directory, as in myExampleQueryFile, or be given by its fully qualified path (no dots).

{@356}
{@myExampleQueryFile}
{@/home/bob/hisExampleQueryFile}
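
One way to distinguish the two kinds of reference is sketched below. The function name resolve_example and the rule that an all-digit reference is a document number are assumptions; under this rule a file in the current directory whose name consists only of digits could not be addressed.

```python
def resolve_example(query):
    """Classify a '{@...}' query as a document number or a file path."""
    query = query.strip()
    if not (query.startswith("{@") and query.endswith("}")):
        raise ValueError("query-by-example has the form {@reference}")
    ref = query[2:-1]
    if ref.isdigit():
        return ("document", int(ref))   # internal document number
    return ("file", ref)                # file name or fully qualified path


print(resolve_example("{@356}"))
print(resolve_example("{@/home/bob/hisExampleQueryFile}"))
```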

negative evidence

Negative evidence can be presented to the engine as shown in the examples below. Documents containing negative evidence terms are penalized.

{skiing olympics women's downhill}-{peekaboo}
{@356}-{avocado}
{@myPositiveQuery}-{@myNegativeQuery}
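
The effect of negative evidence on ranking can be illustrated with a toy scoring function. The formula used here (count of positive terms minus a penalty times the count of negative terms) is purely an assumption for illustration; the specification does not say how the engine computes the penalty. The names score and rank are likewise hypothetical.

```python
from collections import Counter


def score(doc_text, positive, negative=(), penalty=1.0):
    """Toy score: occurrences of positive terms minus penalized negative terms."""
    counts = Counter(doc_text.lower().split())
    gain = sum(counts[term] for term in positive)
    loss = sum(counts[term] for term in negative)
    return gain - penalty * loss


def rank(docs, positive, negative=()):
    """Return document ids ordered by descending score (ties broken by id)."""
    scored = {doc_id: score(text, positive, negative)
              for doc_id, text in docs.items()}
    return sorted(scored, key=lambda doc_id: (-scored[doc_id], doc_id))


docs = {
    1: "skiing olympics",
    2: "skiing olympics downhill peekaboo peekaboo",
}
positive = ["skiing", "olympics", "downhill"]
print(rank(docs, positive))                  # positive evidence only
print(rank(docs, positive, ["peekaboo"]))    # with negative evidence
```

In this example document 2 ranks first on positive evidence alone, but drops below document 1 once 'peekaboo' is supplied as negative evidence.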


phrases

Phrase terms may also be used:

{skiing olympics "women's downhill"}
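
Matching a phrase term, here or in a quoted boolean phrase, requires knowing where each term occurs in a document. The sketch below uses a positional inverted index to test that the terms of a phrase occur consecutively and in order. It is not the engine's index design; the function names, the whitespace tokenization, and the strict adjacency rule are assumptions.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} for whitespace-tokenized text."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents containing the phrase terms consecutively."""
    terms = phrase.lower().split()
    if not terms:
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    hits = []
    for doc_id in candidates:
        starts = index[terms[0]][doc_id]
        # A start position matches if term i occurs at start + i for every i.
        if any(all(start + i in index[t][doc_id] for i, t in enumerate(terms))
               for start in starts):
            hits.append(doc_id)
    return sorted(hits)


docs = {
    1: "the summer european championships begin",
    2: "european summer championships",
    3: "summer european games",
}
index = build_positional_index(docs)
print(phrase_search(index, "summer european championships"))
```

Document 2 contains all three terms but in a different order, so it is a candidate that fails the position check.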