_____ _           _   _                                _     
| ____| | __ _ ___| |_(_) ___   ___  ___  __ _ _ __ ___| |__  
|  _| | |/ _` / __| __| |/ __| / __|/ _ \/ _` | '__/ __| '_ \ 
| |___| | (_| \__ \ |_| | (__  \__ \  __/ (_| | | | (__| | | |
|_____|_|\__,_|___/\__|_|\___| |___/\___|\__,_|_|  \___|_| |_|
                                                              

Definition


  • Elasticsearch is a real-time, distributed search and analytics engine that is horizontally scalable and capable of solving a wide variety of use cases. It is the core search and analytics engine of the Elastic Stack.

  • It is SCHEMALESS and DOCUMENT-ORIENTED.

Q: Difference between MongoDB and Elasticsearch

Ans: Superficially, both store data objects made of key-value pairs and allow querying that body of objects. But they come from two different camps and are made for different purposes. MongoDB is a general-purpose database; Elasticsearch is a distributed text search engine backed by Lucene.

Elasticsearch is very good at one specific task: indexing and searching big datasets. It is used when you have some secondary information about your data and need to find the actual records that match it.

Since its architecture is optimized for this, it is weaker in some other use cases. For example, compared to many NoSQL databases, Elasticsearch is slow at adding new data. In Elasticsearch, indexing semantics is defined on the client side, so the actual indexing cannot be optimized as well as in general-purpose data stores.


Source: Medium link

  • Its core strength lies in its searching capabilities

    Searching is like zooming in and finding the needle in the haystack; analytics is zooming out and noticing that the haystack is yellow.

    In other words, seeing the overall picture is what analytics is about.

    It has rich client libraries and a REST API

  • Horizontal scalability is the ability to scale a system by adding more machines of the same type, rather than by adding more powerful machines.
  • ElasticSearch is both horizontally and vertically scalable.
  • Lightning fast
  • Fault tolerant
  • Even if there is a node failure or network failure, ES can keep working effectively.
  • ElasticSearch is excellent with geospatial data and time series data.
  • It works excellently with Kibana
  • The best part about Elasticsearch is that it is capable of performing Google-like searches; it is used by GitHub, Wikipedia, etc.
  • Elasticsearch can be leveraged in content aggregation platforms, which crawl and aggregate content from various places.

Core concepts

  • Indexes
  • Types
  • Documents
  • Clusters
  • Nodes
  • Shards and Replicas
  • Mapping and types
  • Inverted indexes

To send a request using the curl command:

curl -XPUT -H "content-type:application/json" http://127.0.0.1:9200/catalog/_doc/1 -d '{
  "sku": "SP000001",
  "title": "Elastic search for pewpew",
  "description": "UwU describes stuffs",
  "author": "Akuma!",
  "ISBN": "123123432432",
  "price": 27.90
}' | jq .
# or you can use fx for the same.

Indexes


An index is a container that stores and manages documents in Elasticsearch. Since Elasticsearch 6, an index can contain documents of only a single type.


     +----------------------------+
     | index                      |
     |    +---------------------+ |
     |    |  Type               | |
     |    |  +----------------+ | |
     |    |  |Document        | | |
     |    |  +----------------+ | |
     |    +---------------------+ |
     +----------------------------+

Documents


Documents contain multiple fields; each field in the JSON is of a particular type.

In the product catalogue example, sku, title, description and price are basically key-value pairs.

Nodes

  • Elasticsearch is a distributed system; it consists of multiple processes running across different machines in a network, communicating with each other.
  • A node corresponds to exactly one instance of the Elasticsearch process.

Clusters

  1. A cluster hosts one or more indices and is responsible for providing operations such as
    • searching
    • indexing
    • aggregations
  2. A cluster is formed by one or more nodes; every Elasticsearch node is always part of a cluster.
  3. config/elasticsearch.yml has various variables that can be modified.
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
#cluster.initial_master_nodes: ["node-1", "node-2"]
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
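
Once a node is up, a quick way to confirm that a cluster has formed and to check its health is the cluster health API (using the local node from the earlier examples):

curl -XGET http://127.0.0.1:9200/_cluster/health | jq .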

Shards and replicas


  • An index contains documents of one or more types. Shards help in distributing an index over the cluster.
  • Shards help in dividing the documents of a single index over multiple nodes.
  • There is an obvious limit to the amount of data a single node can store, so the data of a large index has to be split.
  • The process of dividing data amongst shards is called sharding; it is inherent to Elasticsearch and is a means of scaling and parallelization.

Benefits:

  1. Better utilization of storage
  2. Better utilization of processing power
  3. Less burden on each of the systems

By default an index used to be created with 5 primary shards (the default is 1 in Elasticsearch 7+), but the number of shards can be specified at index creation time, as shown below.
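
A minimal sketch of setting the shard and replica counts at index creation time (the index name catalog-sharded is just an example):

curl -XPUT -H "content-type:application/json" http://127.0.0.1:9200/catalog-sharded -d '{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1
  }
}' | jq .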


     +------------------------+        +------------------------+
     | Node 1                 |        |  Node 2                |
     |    +-------+           |        |   +--------+           |
     |    |  P1   |           |        |   |  P3    |           |
     |    +-------+           |        |   +--------+           |
     |                        |        |                        |
     |         +---------+    |        |        +---------+     |
     |         |  P2     |    |        |        | P4      |     |
     |         +---------+    |        |        +---------+     |
     +------------------------+        +------------------------+

Now if Node 1 goes down, the data in P1 and P2 obviously won't be available, yet distributed systems are expected to keep running in spite of failures (network or hardware).

This issue is tackled by replicas. Each shard can be configured to have zero or more replica shards. Replica shards are extra copies of the original primary shard and they provide high availability.

  Node 1                            Node 2
 +-------------------+          +-------------------+
 |                   |          |                   |
 | +------+          |          | +------------+    |
 | |  p1  |          |          | |  p3        |    |
 | +------+          |          | +------------+    |
 |     +-------+     |          | |  r1        |    |
 |     |  p2   |     |          | +------------+    |
 |     +-------+     |          | |  r2        |    |
 | +--------+        |          | +------------+    |
 | |  r3    |        |          |                   |
 | +--------+        |          |                   |
 |                   |          |                   |
 +-------------------+          +-------------------+

Apart from high availability and failover, replicas share the workload of search queries and aggregations. This entire distribution, and the querying of it, is totally transparent to the user.

This is very much like Hadoop's ecosystem.
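
Since the replica count is a dynamic index setting, it can also be changed on a live index; a small sketch against the catalog index used earlier:

curl -XPUT -H "content-type:application/json" http://127.0.0.1:9200/catalog/_settings -d '{
  "index": { "number_of_replicas": 2 }
}' | jq .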

Mapping and datatypes


  • ES supports a wide variety of datatypes for different scenarios.
  • These range from text, numbers, booleans, binary objects, arrays, objects, nested objects, geo-points, and geo-shapes to many other specialized datatypes, such as IPv4 and IPv6 addresses.

List of Datatypes


  1. String Datatypes:
    • text: useful for full-text search; these fields are analyzed before indexing.
    • keyword: enables sorting, aggregations, and exact matching on string fields.
  2. Numeric Datatypes:
    • byte, short, integer and long: 8,16,32,64 bits respectively.
    • float and double
    • half_float
  3. Complex Datatypes:
    • Array Datatype
    • Object Datatype
    • Nested Datatype
  4. Other Datatypes:
    • Geo point datatype
    • Geo shape datatype
    • IP datatype (I am too lazy to learn by heart or even read the definitions of these things.)
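
A minimal sketch of mapping some of these datatypes explicitly when creating an index (the index name catalog-mapped and the released_on field are made-up examples):

curl -XPUT -H "content-type:application/json" http://127.0.0.1:9200/catalog-mapped -d '{
  "mappings": {
    "properties": {
      "sku":         { "type": "keyword" },
      "title":       { "type": "text" },
      "description": { "type": "text" },
      "author":      { "type": "keyword" },
      "price":       { "type": "float" },
      "released_on": { "type": "date" }
    }
  }
}' | jq .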

Mappings


curl -XPUT -H "content-type:application/json" http://127.0.0.1:9200/catalog/_doc/2 -d '{
  "sku": "SP000002",
  "title": "Alienware",
  "description": "Hightech tech for hadoop",
  "author": "pewpew Akuma",
  "ISBN": "1777771281881",
  "price": 20006.99,
  "os": "Alpine",
  "resolution": "1920x1080"
}' | jq .

In Elasticsearch 7+, do not include the type when retrieving the mappings.

GET /catalog/_mapping will return the mapping response in JSON format.
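
The same call with curl, in the style of the earlier examples:

curl -XGET http://127.0.0.1:9200/catalog/_mapping | jq .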

Inverted Index

The inverted index is the core data structure of Elasticsearch and of any other system supporting full-text search. It is similar to the index at the back of a book. For example, take the following documents:

Document ID | Document
1           | It is sunday tomorrow
2           | Sunday is the last day of the week
3           | The choice is yours

Then the inverted index will be,

Term     | Frequency | Documents
choice   | 1         | 3
day      | 1         | 2
is       | 3         | 1,2,3
it       | 1         | 1
last     | 1         | 2
of       | 1         | 2
sunday   | 2         | 1,2
the      | 3         | 2,3
tomorrow | 1         | 1
week     | 1         | 2
yours    | 1         | 3

Note:

  • Documents were broken down into terms; punctuation was removed and terms were lowercased
  • Terms are sorted alphabetically
  • The frequency column captures the number of occurrences of each term
  • The third column captures the documents in which each term occurs.

In most scenarios searching is extremely quick, since there is no overhead of parsing the text at query time; it is more like looking up a word in a dictionary. For queries with more than one word, the document lists of the terms can be intersected or unioned to pinpoint the matching documents.

CRUD operations

                             +-----------------+
                             |  CRUD API       |
                             +--------+--------+
                                      |
                                      |
                                      |
                                      |
                                      |
                      +----------+----------+----------------+
                      |          |          |                |
                      |          |          |                |
                      |          |          |                |
                      |          |          |                |
             +-------------+  +---------+ +-------+  +---------------+
             | Index API   |  | Update  | | GET   |  |  Delete API   |
             |             |  |         | |       |  |               |
             +-------------+  +---------+ +-------+  +---------------+

Index API

  1. Adding/creating a document into a type within an index of Elasticsearch is called an indexing operation.
  2. It involves adding the document to the index by parsing all fields within the document and building the inverted index; this is why it is called an indexing operation.
  3. There are two methods of indexing,
    • With ID
    • Without ID

Indexing with ID

Basically, make a PUT request to /<index>/<type>/<id> with a JSON body:

curl -XPUT -H "content-type:application/json" http://127.0.0.1:9200/catalog/product/1 -d '{
  "sku": "SP000001",
  "title": "ElasticSearch for hadoop",
  "description": "Elasticsearch for hadoop",
  "author": "pewpew",
  "ISBN": "1772712717",
  "price": 26.99
}'

Indexing without ID

Almost the same, just don't send the ID :P

POST /catalog/product
{
  "sku": "SP000003",
  "title": "Pewpew elasticsearch",
  "description": "pewpew",
  "author": "pewpew akuma",
  "price": 54.99
}

In such cases the ID is an auto-generated unique string, so yay!

GET API

GET is useful for retrieving a document when its ID is known; it is basically like a select-by-primary-key query.

GET /catalog/product/<hash>

SYNTAX: GET /<index>/<type>/<id>

NOTE: Mapping types were deprecated in ES 6 (one type per index) and removed in ES 7+; use /<index>/_doc/<id> instead.
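
In Elasticsearch 7+ the typeless form of the same lookup, against the document indexed earlier, looks like this:

curl -XGET http://127.0.0.1:9200/catalog/_doc/1 | jq .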

curl -X PUT "localhost:9200/my-index-000001/_bulk?refresh&pretty" -H 'Content-Type: application/json' -d'
{ "index":{ } }
{ "@timestamp": "2099-11-15T14:12:12", "http": { "request": { "method": "get" }, "response": { "bytes": 1070000, "status_code": 200 }, "version": "1.1" }, "message": "GET /search HTTP/1.1 200 1070000", "source": { "ip": "127.0.0.1" }, "user": { "id": "kimchy" } }
{ "index":{ } }
{ "@timestamp": "2099-11-15T14:12:12", "http": { "request": { "method": "get" }, "response": { "bytes": 1070000, "status_code": 200 }, "version": "1.1" }, "message": "GET /search HTTP/1.1 200 1070000", "source": { "ip": "10.42.42.42" }, "user": { "id": "elkbee" } }
{ "index":{ } }
{ "@timestamp": "2099-11-15T14:12:12", "http": { "request": { "method": "get" }, "response": { "bytes": 1070000, "status_code": 200 }, "version": "1.1" }, "message": "GET /search HTTP/1.1 200 1070000", "source": { "ip": "10.42.42.42" }, "user": { "id": "elkbee" } }
'

Shards

Shards are a lot like indices; they are basically containers. Data in Elasticsearch is organized into indices. Each index is made up of one or more shards. Each shard is an instance of a Lucene index, which you can think of as a self-contained search engine that indexes and handles queries for a subset of the data in an Elasticsearch cluster.

As data is written to a shard, it is periodically published into new immutable Lucene segments on disk, and it is at that point that it becomes available for querying.
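
The Lucene segments backing each shard can be inspected with the _segments API; a quick sketch against the catalog index from earlier:

curl -XGET http://127.0.0.1:9200/catalog/_segments | jq .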

Why is ElasticSearch so AWESOME!! (it really is)


In Elasticsearch the most basic unit of storage is a shard, which is a logical abstraction over the data: you don't have to care about where exactly in the cluster the data is physically stored; if it is on a shard, it is searchable and you can use it. Under the hood, though, the Lucene engine makes things a bit different: in Elasticsearch a shard is a Lucene index, and each Lucene index consists of multiple Lucene segments.


            Index
+-----------------------------------+
|                                   |                        +------------------------+
|                                   |                        |  +----+      +-----+   |
| +------------+  +--------------+  |                        |  |    |      |     |   |
| |            |  |              |  | Each shard is a lucene |  |  1 |      | 2   |   |
| |            |  |              |-------------------------->|  +----+      +-----+   |
| |  shard 1   |  |  shard 2     |  |      index             |                        |
| |            |  |              |  |                        |  +----+      +-----+   |
| +------------+  +--------------+  |                        |  |    |      |     |   |
|                                   |                        |  |  3 |      | 4   |   |
|                                   |                        |  +----+      +-----+   |
|                                   |                        +------------------------+
|                                   |                      A single shard of ES in lucene's
+-----------------------------------+                              Structure
      Index structure in ES

The concept behind segmentation is that new documents are always written into new segments; there is no need to modify any existing segment. When a document is deleted, it is only flagged as deleted in its original segment, which means it is not physically removed from that segment (the space is reclaimed later when segments are merged).

As far as updating goes, the previous version is marked as deleted in its old segment and the updated version is kept under the same document ID in the current segment.

Lucene reopen

When a Lucene reopen is called, it makes the data accumulated so far available for search. Although the latest data becomes searchable, this does not guarantee that the data is persistent, i.e. that it has been written to disk.

Lucene commits the data to make it safe; for each commit, the data from the different segments is merged and pushed to disk, making the data persistent. Although commits are the ideal way to persist data, the issue is that each commit operation is resource-expensive: each commit has its own I/O operations and read/write cycles.

Now here comes the genius of Elasticsearch,

Translog

Elasticsearch addresses the issue of persistence by taking a different approach: it introduces a translog (transaction log) in every shard. Newly indexed documents are written both to this transaction log and to an in-memory buffer.

                                +--------------------------------------------+
                                |             +-------------+                |
                                |    |------->|             |                |
 +--------+                     |    |        |  Translog   |                |
+--------+|                     |    |        |             |                |
|        || New documents       |    |        +-------------+                |
|        ||---------------------|--->|                                       |
|        ||    indexing         |    |                                       |
|        ||                     |    |        +-------------+                |
|        |+                     |    |        | In memory   |                |
+--------+                      |    |        |             |                |
                                |    |------->|  Buffer     |                |
                                |             |             |                |
                                |             +-------------+                |
                                |                                            |
                                +--------------------------------------------+
                                                   Shard

In Elasticsearch the _refresh operation is executed every second by default; during a refresh the in-memory buffer contents are copied into a newly created segment.

The translog handles persistence very nicely: it lives on physical disk and is fsynced, so we obtain durability and persistence even for data that has not yet been committed. If something bad happens, the shard can be recovered by replaying the transaction log.
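
Both behaviours can be poked at through the API; a small sketch against the catalog index (the 30s interval is just an example value):

# force a refresh so buffered documents become searchable immediately
curl -XPOST http://127.0.0.1:9200/catalog/_refresh | jq .

# relax the default 1s refresh interval for heavy indexing workloads
curl -XPUT -H "content-type:application/json" http://127.0.0.1:9200/catalog/_settings -d '{
  "index": { "refresh_interval": "30s" }
}' | jq .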

Searching

  • basics of text analysis
  • Searching from structured data
  • Writing compound queries
  • Searching from full text

Analyzers

The job of an analyzer is to take the documents, and each field of a document, and extract terms from them. These terms make the index searchable, that is, they help us find out which documents contain particular search terms.

The core task of the analyzer is to parse the document fields and build the actual index. Every field of type text needs to be analyzed before the document is indexed; this analysis process is what makes the documents searchable by the search terms used at query time. Analyzers can be configured on a per-field basis, that is, it is possible to have two fields of type text within the same document, each one using a different analyzer.

  +-------------------+      +-------------+      +-----------------+
  | Character filters |----->|  Tokenizer  |----->|  Token filters  |
  |    (0 or more)    |      | (exactly 1) |      |   (0 or more)   |
  +-------------------+      +-------------+      +-----------------+
                     Elasticsearch Analyzer pipeline
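
To see an analyzer in action end to end, the _analyze API can be used; a minimal sketch with the built-in standard analyzer (the sample text is arbitrary):

curl -XPOST -H "content-type:application/json" http://127.0.0.1:9200/_analyze -d '{
  "analyzer": "standard",
  "text": "Elastic search for pewpew"
}' | jq '.tokens[].token'

This should return the lowercased terms that would end up in the inverted index.
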
  1. Character filters: a character filter works on a stream of characters from the input field; each character filter can add, remove, or change the characters in the input field. Elasticsearch ships with built-in character filters and also allows us to create our own filters.

For example, a custom mapping character filter that maps :) to _smile_, :( to _sad_, and :D to _laugh_:

"char_filter": {
"my-char-filter": {
"type": "mapping",
"mappings": [
":) => _smile_",
":( => _sad_",
":D => _laugh_"
    ]
  }
}
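
The character filter can be tried out without creating an index by passing it inline to the _analyze API (a sketch; the sample text is arbitrary):

curl -XPOST -H "content-type:application/json" http://127.0.0.1:9200/_analyze -d '{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [ ":) => _smile_", ":( => _sad_", ":D => _laugh_" ]
    }
  ],
  "text": "good morning :)"
}' | jq '.tokens[].token'

The emoticon is replaced before tokenization, so a _smile_ token should show up in the output.
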
  2. Tokenizer

An analyzer has exactly one tokenizer; its responsibility is to take a stream of characters and generate a stream of tokens. These tokens are then used in building the inverted index; a token is roughly equivalent to a word.

Word Oriented Tokenizers

The following tokenizers are usually used for tokenizing full text into individual words:

Standard Tokenizer
    The standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages. 
Letter Tokenizer
    The letter tokenizer divides text into terms whenever it encounters a character which is not a letter. 
Lowercase Tokenizer
    The lowercase tokenizer, like the letter tokenizer, divides text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms. 
Whitespace Tokenizer
    The whitespace tokenizer divides text into terms whenever it encounters any whitespace character. 
UAX URL Email Tokenizer
    The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens. 
Classic Tokenizer
    The classic tokenizer is a grammar based tokenizer for the English Language. 
Thai Tokenizer
    The thai tokenizer segments Thai text into words. 

Partial Word Tokenizers

These tokenizers break up text or words into small fragments, for partial word matching:

N-Gram Tokenizer
    The ngram tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word: a sliding window of continuous letters, e.g. quick → [qu, ui, ic, ck]. 
Edge N-Gram Tokenizer
    The edge_ngram tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word which are anchored to the start of the word, e.g. quick → [q, qu, qui, quic, quick]. 

Structured Text Tokenizers

The following tokenizers are usually used with structured text like identifiers, email addresses, zip codes, and paths, rather than with full text:

Keyword Tokenizer
    The keyword tokenizer is a “noop” tokenizer that accepts whatever text it is given and outputs the exact same text as a single term. It can be combined with token filters like lowercase to normalise the analysed terms. 
Pattern Tokenizer
    The pattern tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms. 
Simple Pattern Tokenizer
    The simple_pattern tokenizer uses a regular expression to capture matching text as terms. It uses a restricted subset of regular expression features and is generally faster than the pattern tokenizer. 
Char Group Tokenizer
    The char_group tokenizer is configurable through sets of characters to split on, which is usually less expensive than running regular expressions. 
Simple Pattern Split Tokenizer
    The simple_pattern_split tokenizer uses the same restricted regular expression subset as the simple_pattern tokenizer, but splits the input at matches rather than returning the matches as terms. 
Path Tokenizer
    The path_hierarchy tokenizer takes a hierarchical value like a filesystem path, splits on the path separator, and emits a term for each component in the tree, e.g. /foo/bar/baz → [/foo, /foo/bar, /foo/bar/baz ].
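
A quick way to feel the difference between these tokenizers is to run the same text through the _analyze API with each of them (a small sketch; the sample text is arbitrary):

# standard: splits on word boundaries and drops most punctuation
curl -XPOST -H "content-type:application/json" http://127.0.0.1:9200/_analyze -d '{
  "tokenizer": "standard",
  "text": "jack.sparrow@black-pearl.sea set sail!"
}' | jq '.tokens[].token'

# whitespace: splits only on whitespace, punctuation is kept
curl -XPOST -H "content-type:application/json" http://127.0.0.1:9200/_analyze -d '{
  "tokenizer": "whitespace",
  "text": "jack.sparrow@black-pearl.sea set sail!"
}' | jq '.tokens[].token'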