Working With Elastic Search
Elastic Search, first released by Shay Banon in 2010, is a Java-based search engine built on Apache Lucene. It exposes an HTTP- and JSON-based protocol. This post assumes that you have installed Elastic Search. If you have not, please follow the steps in my blog post on Elastic Search installation.
In this post, our interaction with Elastic Search will be with the curl utility. Curl (short for Client URL, previously called httpget and urlget) is a tool developed by Daniel Stenberg, and it is available on most mainstream Linux distributions. Common command-line switches that we can pass to curl are:
-L to follow HTTP redirects
-v for verbose output
-H <header-content> to pass an HTTP header to the request
-o <filename> to save the response to a file
--data <data> to pass data in the HTTP request body
--data "@filename" to pass data from a file
When sending a request to Elastic Search 6, sending the HTTP request header “Content-Type: application/json” is required. This was not a requirement in Elastic Search 5 and earlier.
To check if the Elastic Search server is accessible, send a request to /_cluster/health?pretty . The "?pretty" can be appended to any URL for a pretty-printed HTTP response. Alternatively, we can pipe the output to jq, a command-line JSON processor, for pretty-printing and syntax highlighting.
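For example, assuming Elastic Search is running locally on the default port 9200:

curl -s "http://localhost:9200/_cluster/health?pretty"
curl -s "http://localhost:9200/_cluster/health" | jq .   # same request, formatted by jq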
When creating a table in an RDBMS, we would create the table and define its columns. With Elastic Search, it is possible to let the server dynamically identify the fields in the documents being stored. However, this can lead to incorrectly inferred types, and every new field found in incoming documents is added to the mapping and indexed even if it is never searched, which can snowball into what is referred to as a mapping explosion. We therefore perform explicit mapping instead of dynamic mapping for most scenarios.
To define the mapping for a ’table’, referred to henceforth as an index, we send a PUT request to /<entity-name> with a JSON payload:
mappings: { properties: {
field_name: {
type: text|date|integer|ip|keyword|... (the older string type is deprecated; use text or keyword)
index: true|false (whether the field should be indexed; the older analyzed|not_analyzed|no values were removed in Elastic Search 5)
analyzer: standard|whitespace|simple|english|...
}
}}
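As a concrete sketch, the following creates an index with an explicit mapping (the index and field names are illustrative, and the typeless mapping body assumes a recent Elastic Search version):

curl -X PUT "http://localhost:9200/books" \
  -H "Content-Type: application/json" \
  --data '{
    "mappings": {
      "properties": {
        "title":     { "type": "text", "analyzer": "english" },
        "isbn":      { "type": "keyword" },
        "published": { "type": "date" }
      }
    }
  }'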
We can fetch the mapping for an index by sending a GET request to /<entity-name>/_mapping and we can remove an index by sending a DELETE request to /<entity-name> . List the indices on the server by sending a GET request to /_cat/indices .
Documents are immutable. Every document has a _version ; on updating, a new document is created with an incremented _version and the old document is marked for deletion.
Indices in Elastic Search can be closed to avoid using resources and can later be opened when they need to be accessed. This is done by sending a POST request to /<entity-name>/_close or /<entity-name>/_open . To check if an index is closed, send a GET request to /_cat/indices?v and look at the status column, or call /_cat/indices?h=status . Alternatively, send a GET request to /_cluster/state and look at metadata.indices.<entity-name>.state, or send a GET request to /_cluster/state/metadata?filter_path=metadata.indices.*.state .
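A minimal sketch, assuming an index named books:

curl -X POST "http://localhost:9200/books/_close"
curl -X POST "http://localhost:9200/books/_open"
curl -s "http://localhost:9200/_cat/indices/books?v&h=index,status"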
Settings for an index can be changed by sending a JSON document with the settings through a PUT request to /<entity-name>/_settings . Example:
{
index.mapping.ignore_malformed: true //ignore fields that don't match mapping
}
There is a default limit of 1,000 fields per index, but this can be changed with the setting “index.mapping.total_fields.limit”.
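For example, to raise the field limit on a hypothetical books index:

curl -X PUT "http://localhost:9200/books/_settings" \
  -H "Content-Type: application/json" \
  --data '{ "index.mapping.total_fields.limit": 2000 }'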
To insert data into an index, send the data through a POST request to /<entity-name>/_doc/<id> . Elastic Search can auto-generate an ID if one is not specified (POST to /<entity-name>/_doc ). The document can be fetched by sending a GET request to /<entity-name>/_doc/<id> . To fetch all documents, send a GET request to /<entity-name>/_search .
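A sketch of indexing and fetching a document (index name, ID, and fields are illustrative):

curl -X POST "http://localhost:9200/books/_doc/1" \
  -H "Content-Type: application/json" \
  --data '{ "title": "Brave New World", "isbn": "9780060850524", "published": "1932-01-01" }'

curl -s "http://localhost:9200/books/_doc/1?pretty"
curl -s "http://localhost:9200/books/_search?pretty"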
A bulk insert can be made by sending a PUT request to /<entity-name>/_bulk (or just /_bulk, with the index named in each action line) with a newline-delimited body in which a line with the create operation type is followed by a line with the actual document, repeated for each document:
{create: {_index: <entity-name>, _id: <the-id>}}
{id: <the-id>, ...}
Inserting the same ID multiple times would lead to a conflict.
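A sketch of a bulk request read from a file (the file name and contents are illustrative; note the NDJSON content type, the use of --data-binary so curl preserves the newlines, and the newline required at the end of the file):

Contents of bulk.ndjson:
{"create": {"_index": "books", "_id": "2"}}
{"title": "Nineteen Eighty-Four"}
{"create": {"_index": "books", "_id": "3"}}
{"title": "Animal Farm"}

curl -X PUT "http://localhost:9200/_bulk" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary "@bulk.ndjson"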
Updating documents can be performed by sending the new fields in a POST request to /<entity-name>/_update/<id> as follows:
{doc:
{new_field: ""} //this is merged into the existing doc
}
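For example, adding a field to an existing document (field name and value are illustrative):

curl -X POST "http://localhost:9200/books/_update/1" \
  -H "Content-Type: application/json" \
  --data '{ "doc": { "page_count": 311 } }'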
To remove a document, send a DELETE request to /<entity-name>/_doc/<id> .
An index can be copied to another index by sending a POST request to /_reindex with the source and destination specified:
{
source: {index: "collection1"},
dest: {index: "collection2"}
}
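A sketch of the corresponding request; note that _reindex copies documents but not the mapping or settings, so the destination index should normally be created first:

curl -X POST "http://localhost:9200/_reindex" \
  -H "Content-Type: application/json" \
  --data '{ "source": { "index": "collection1" }, "dest": { "index": "collection2" } }'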
Querying Elastic Search is done by sending POST requests to /<entity-name>/_search with a query document such as:
{
query: {
match: {
field: value
}
}
}
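For example, a full-text match query against a hypothetical books index:

curl -X POST "http://localhost:9200/books/_search?pretty" \
  -H "Content-Type: application/json" \
  --data '{ "query": { "match": { "title": "brave new world" } } }'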
The following types of filters can be specified in the query:
term - Exact values
terms - Exact values from a List
range - Value comparisons using gt, gte, lt, lte
Exists - {exists: {field: "field_name"}}
Missing - Opposite of exists; in recent versions this is expressed as a bool must_not wrapped around an exists query
Bool - Boolean logic; must (and), must_not (not), should (or); see the sketch after this list
match_all - Implicit default; Returns all docs
match - Search analyzed results (full text search)
multi_match - {multi_match: {query: ___, fields: [__, __]}}
bool - Like bool filter, but scores by relevance
match_phrase - Match terms in the same order as phrase
match_phrase_prefix - Like match_phrase, but the last term is matched as a prefix ("starts with"); slop can also be specified
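As a sketch of combining these, a bool query with must, filter, and must_not clauses (field names and values are illustrative):

curl -X POST "http://localhost:9200/books/_search?pretty" \
  -H "Content-Type: application/json" \
  --data '{
    "query": {
      "bool": {
        "must":     [ { "match": { "title": "farm" } } ],
        "filter":   [ { "range": { "published": { "gte": "1940-01-01" } } } ],
        "must_not": [ { "term":  { "isbn": "0000000000" } } ]
      }
    }
  }'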
Match phrase is commonly used for text search, and can be specified as follows:
{match_phrase: {
field_name: {query: value, slop: n}
}}
//slop is no. of other words in between or swapped positions
//words closer together would have higher relevance
Fuzzy search can be performed with:
{query: {fuzzy: {field_name: {value: the_value, fuzziness: n}}}}
Fuzziness can be specified as:
AUTO
0 for terms of length 1-2
1 for terms of length 3-5
2 for terms longer than 5
Prefix (starts with), wildcard, and regex search can be performed with:
{query: {prefix: {field_name: partial_value}}}
{query: {wildcard: {field_name: partial_value* }}} // "*" can be anywhere in string
{query: {regexp: {field_name: my-regex }}}
There is also a short-hand for performing queries by sending a GET request with the query specified in the URL query string, with "q": /<entity-name>/_search?q=+field:value+field2:>value2
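For example (the query needs URL-encoding if it contains spaces or other special characters):

curl -s "http://localhost:9200/books/_search?q=title:brave&pretty"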
Sorting of results on a text field requires the field (or a sub-field of it) to be mapped with the keyword type; the sort is then specified on that keyword field.
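A sketch, assuming the title field has a keyword sub-field named title.keyword:

curl -X POST "http://localhost:9200/books/_search?pretty" \
  -H "Content-Type: application/json" \
  --data '{ "query": { "match_all": {} }, "sort": [ { "title.keyword": "asc" } ] }'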
Analyzers in Elastic Search perform the functions of filtering characters (e.g. removing HTML, converting symbols to words), tokenizing on delimiters (e.g. punctuation, whitespace), and filtering the resulting tokens (e.g. lowercasing, stemming, synonyms, removing stop words). The choice of analyzers in Elastic Search includes the following:
Standard - Split words, remove punctuation, to-lowercase
Simple - Split on non-letter, to-lowercase
Whitespace - Split on whitespace, no capitalization change
Language - Standard, plus Stop Words and Stemming
keyword mapping i.e. {type: keyword} - for exact-match
text mapping i.e. {type: text, analyzer: english}
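The _analyze endpoint is handy for seeing what a given analyzer produces for a piece of text, for example:

curl -X POST "http://localhost:9200/_analyze?pretty" \
  -H "Content-Type: application/json" \
  --data '{ "analyzer": "english", "text": "The Quick Foxes Jumped" }'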
Custom analyzers can be defined as follows:
PUT /<entity-name>
{settings: {
analysis: {
filter: { my_autocomplete_filter: {type: "edge_ngram", min_gram: 1, max_gram: 20} },
analyzer: {my_autocomplete: {type: custom, tokenizer: standard, filter: [lowercase, my_autocomplete_filter]}}
}
}}
PUT /<entity-name>/_mapping
{properties: {field_name: {type: text, analyzer: my_autocomplete}}}
It is possible to have multiple analyzers, and to override the default analyzer:
{query: {match: {field_name: {query: the_value, analyzer: standard}}}}
The auto-complete field type is specified with {type: "search_as_you_type"}. The created sub-fields are field_name._2gram, field_name._3gram, and field_name._index_prefix. The ._2gram and ._3gram sub-fields are for autocomplete on whole words; ._index_prefix is for autocomplete on characters.
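A minimal mapping sketch (index and field names are illustrative):

curl -X PUT "http://localhost:9200/books_sayt" \
  -H "Content-Type: application/json" \
  --data '{ "mappings": { "properties": { "title": { "type": "search_as_you_type" } } } }'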
The edge n-gram tokenization behind auto-complete can be previewed with the _analyze endpoint, and auto-complete searches are then performed with a bool_prefix multi_match query against the generated sub-fields:
POST /<entity-name>/_analyze?pretty
{
tokenizer: standard,
filter: [{type: edge_ngram, min_gram: 1, max_gram: 4}],
text: "My Text Here"
}
GET /<entity-name>/_search
{
size: 5,
query: {
multi_match: {
query: "innov",
type: bool_prefix,
fields: [ field_name, field_name._2gram, field_name._3gram ]
}
}
}