SIREn Solutions | Solr & Elasticsearch Consultancy – Relational Joins for Elasticsearch: the SIREn Join Plugin


We are very very happy to announce today the general availability of SIREn Join, an open source plugin for Elasticsearch which allows ultra fast “Filter Joins” across Elasticsearch indexes. It is often the case that one needs to filter something by the property of something else. “Articles” that talk about “Companies that have been funded not later than 3 months ago”, etc.



 
 

Related Links:

Content

Cached Text (at the time of saving)


We are very very happy to announce today the general availability of SIREn Join, an open source plugin for Elasticsearch which allows ultra fast “Filter Joins” across Elasticsearch indexes.
Introduction
It is often the case that one needs to filter something by the property of something else. E.g. “Articles” that talk about “Companies that have been funded not later than 3 months ago”, etc.
Sometime, one can use things like Parent-Child in Elasticsearch. Some other times (see scenarios below) this is not feasible and other much uglier/hard to maintain solutions are adopted, e.g., typically by materializing all sort of data into the records to be able to search and filter.
With SIREn Join, you can now say no (!) to this practice and keep things clean pretty much as you would in a normalized RDBMS.
Actually cleaner, thanks to the fact that ES, unlike DBs, nicely supports multi-valued attributes (see below).
Currently one type of join is implemented which we call “Filter Join”, but our objective is to add more type of joins in the future.
The Filter Join is basically a (left) semi-join between two set of documents based on a common attribute, where the result only contains the attributes of one of the joined set of documents. This join is used to filter one document set based on a second document set, hence its name. It is equivalent to the EXISTS() operator in SQL.
The Filter Join supports joins on both numerical and textual fields, but the joined attributes must be of the same type. You can also freely combine and nest multiple Filter Joins using boolean operators (conjunction, disjunction, negation) to create complex query plans.
It is fully integrated with the Elasticsearch API and is compatible with distributed environments.
How Does It Compare With Parent-Child
The Filter Join is similar in nature to the Parent-Child feature of Elasticsearch: they perform a join at query-time. However, there are important differences between them:
The parent document and all of its children must live on the same shard, which limits its scalability and flexibility. Filter Join removes this constraint: it allows to join documents across shards and across indices.
Thanks to the data locality of the Parent-Child model, joins are faster and more scalable. The Filter Join on the contrary needs to transfer data across the network to compute joins across shards, limiting its scalability – albeit in a different way – and performance.
There is no “one size fits all” solution to this problem, and you need to understand your requirements to choose the proper solution. As a basic rule, if your data model and data relationships are purely hierarchical (or can be mapped to a purely hierarchical model), then the Parent-Child model might be more appropriate. If on the contrary you need to query both directions of a data relationship (E.g. as Articles can mention more than a Company and Vice Versa), then the Filter Join might be more appropriate.
Easy N to M relationships via Multi-Valued Attributes
The most important requirement for the Filter Join is to have a common shared attribute between two indices. For example, let’s take a simple relational data model composed of two tables, Articles and Companies, and of one junction table ArticlesMentionCompanies to encode the many-to-many relationships between them.
This model can be mapped to two just 2 Elasticsearch indices: Articles and Companies. Compared to a traditional database model where a junction table is necessary, the model is simplified by leveraging multi-valued fields: Article documents simply need to have a multi-valued field mentions with the unique identifiers of the companies mentioned in the article. In other words, the field mentions is a foreign key in the Articles table that refers to the primary key of the Companies table.
How Fast is it?
Very. On a regular laptop we get easily 1 to 5 Million results from a fully connected join between 2 indexes containing 1 to 5 million records each in 40 to 60 MS.
Leverage it using our Kibana version, Kibi
We have engineered Kibi, our Friendly Kibana Fork around the extra capabilities that the SIREn Join plugin gives. Kibi allows (Graphically!) configuring the relationships between your Kibana indexes and allows one to applying cross index filters as easily as clicking on a checkbox that “links” 2 dashboards together.
If you haven’t see Kibi, we highly recommend you take a look at some of the videos on our youtube channel
Kibi is available at http://siren.solutions/kibi.
Taking SIREn Join it for a spin (Getting Started)
Let’s take it for a spin: install the SIREn Join plugin in Elasticsearch, load two collections of documents inter-connected by a common attribute, and execute a relational query across the two collections within the Elasticsearch environment.
This guide requires that you have downloaded and installed the Elasticsearch 1.7.x distribution on your computer. If you do not have an Elasticsearch distribution, you can run the following commands:
$ wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.3.zip $ unzip elasticsearch-1.7.3.zip $ cd elasticsearch-1.7.3
$ wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.3.zip
$ unzip elasticsearch-1.7.3.zip
$ cd elasticsearch-1.7.3
Installing the SIREn Join Plugin
Before starting Elasticsearch, you have to install the SIREn Join plugin. Assuming that you are in your Elasticsearch installation directory, you can run the following command:
$ ./bin/plugin -i solutions.siren/siren-join/1.0 -> Installing solutions.siren/siren-join/1.0... Trying http://download.elasticsearch.org/solutions.siren/siren-join/siren-join-1.0.zip... Trying http://search.maven.org/remotecontent?filepath=solutions/siren/siren-join/1.0/siren-join-1.0.zip... Downloading ..........................................................................................DONE Installed solutions.siren/siren-join/1.0 into /tmp/elasticsearch-1.7.3/plugins/siren-join
$ ./bin/plugin -i solutions.siren/siren-join/1.0
-> Installing solutions.siren/siren-join/1.0...
Trying http://download.elasticsearch.org/solutions.siren/siren-join/siren-join-1.0.zip...
Trying http://search.maven.org/remotecontent?filepath=solutions/siren/siren-join/1.0/siren-join-1.0.zip...
Downloading ..........................................................................................DONE
Installed solutions.siren/siren-join/1.0 into /tmp/elasticsearch-1.7.3/plugins/siren-join
In case you want to remove the plugin, you can run the following command:
$ ./bin/plugin -r siren-join
-> Removing siren-join...
Starting Elasticsearch
To launch Elasticsearch, run the following command:
In the output, you should see a line like the following which indicates that the SIREn Join plugin is installed and running:
1
[2015-10-21 15:07:33,369][INFO ][plugins ] [Free Spirit] loaded [FilterJoinPlugin], sites []
Loading Some Relational Data
We will use a simple synthetic dataset for the purpose of this demo. The dataset consists of two collections of documents: Articles and Companies. An article is connected to a company with the attribute mentions. Articles will be loaded into the articles index and companies in the companies index. To load the dataset, run the following command:
$ curl -XPUT 'http://localhost:9200/_bulk?pretty' -d ' { "index" : { "_index" : "articles", "_type" : "article", "_id" : "1" } } { "title" : "The NoSQL database glut", "mentions" : ["1", "2"] } { "index" : { "_index" : "articles", "_type" : "article", "_id" : "2" } } { "title" : "Graph Databases Seen Connecting the Dots", "mentions" : [] } { "index" : { "_index" : "articles", "_type" : "article", "_id" : "3" } } { "title" : "How to determine which NoSQL DBMS best fits your needs", "mentions" : ["2", "4"] } { "index" : { "_index" : "articles", "_type" : "article", "_id" : "4" } } { "title" : "MapR ships Apache Drill", "mentions" : ["4"] } { "index" : { "_index" : "companies", "_type" : "company", "_id" : "1" } } { "id": "1", "name" : "Elastic" } { "index" : { "_index" : "companies", "_type" : "company", "_id" : "2" } } { "id": "2", "name" : "Orient Technologies" } { "index" : { "_index" : "companies", "_type" : "company", "_id" : "3" } } { "id": "3", "name" : "Cloudera" } { "index" : { "_index" : "companies", "_type" : "company", "_id" : "4" } } { "id": "4", "name" : "MapR" } ' { "took" : 8, "errors" : false, "items" : [ { "index" : { "_index" : "articles", "_type" : "article", "_id" : "1", "_version" : 3, "status" : 200 } }, ... }
$ curl -XPUT 'http://localhost:9200/_bulk?pretty' -d '
'
{
"took" : 8,
"errors" : false,
"items" : [ {
"index" : {
"_index" : "articles",
"_type" : "article",
"_id" : "1",
"_version" : 3,
"status" : 200
}
},
...
}
Relational Querying of the Data
We will now show you how to execute a relational query across the two indices. For example, we would like to retrieve all the articles that mention companies whose name matches orient. This relational query can be decomposed in two search queries: the first one to find all the companies whose name matches orient, and a second query to filter out all articles that do not mention a company from the first result set. The SIREn Join plugin introduces a new Elasticsearch’s filter, named filterjoin, that allows to define such a query plan and a new search API _coordinate_search that allows to execute this query plan. Below is the command to run the relational query:
$ curl -XGET 'http://localhost:9200/articles/_coordinate_search?pretty' -d '{ "query" : { "filtered" : { "query" : { "match_all" : { } }, "filter" : { "filterjoin" : { (1) "mentions" : { (2) "indices" : ["companies"], (3) "path" : "id", (4) "query" : { (5) "term" : { "name" : "orient" } } } } } } } }'
$ curl -XGET 'http://localhost:9200/articles/_coordinate_search?pretty' -d '{
},
}
}
}
}
}
}
}
}'
The filterjoin definition
The path on the target index (i.e., articles) containing the foreign keys
The definition of the source index (i.e., companies)
The path to the primary key of the companies index
The query that will be used to filter out companies
The command should return you the following response with two search hits:
{ "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ { "_index" : "articles", "_type" : "article", "_id" : "1", "_score" : 1.0, "_source":{ "title" : "The NoSQL database glut", "mentions" : ["1", "2"] } }, { "_index" : "articles", "_type" : "article", "_id" : "3", "_score" : 1.0, "_source":{ "title" : "How to determine which NoSQL DBMS best fits your needs", "mentions" : ["2", "4"] } } ] } }
{
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [ {
"_index" : "articles",
"_type" : "article",
"_id" : "1",
"_score" : 1.0,
"_source":{ "title" : "The NoSQL database glut", "mentions" : ["1", "2"] }
}, {
"_index" : "articles",
"_type" : "article",
"_id" : "3",
"_score" : 1.0,
"_source":{ "title" : "How to determine which NoSQL DBMS best fits your needs", "mentions" : ["2", "4"] }
} ]
}
}
You can also reverse the order of the join, and query for all the companies that are mentioned in articles whose title matchesnosql:
$ curl -XGET 'http://localhost:9200/companies/_coordinate_search?pretty' -d '{ "query" : { "filtered" : { "query" : { "match_all" : { } }, "filter" : { "filterjoin" : { "id" : { "indices" : ["articles"], "path" : "mentions", "query" : { "term" : { "title" : "nosql" } } } } } } } }'
$ curl -XGET 'http://localhost:9200/companies/_coordinate_search?pretty' -d '{
},
}
}
}
}
}
}
}
}'
The command should return you the following response with three search hits:
{ "hits" : { "total" : 3, "max_score" : 1.0, "hits" : [ { "_index" : "companies", "_type" : "company", "_id" : "4", "_score" : 1.0, "_source":{ "id": "4", "name" : "MapR" } }, { "_index" : "companies", "_type" : "company", "_id" : "1", "_score" : 1.0, "_source":{ "id": "1", "name" : "Elastic" } }, { "_index" : "companies", "_type" : "company", "_id" : "2", "_score" : 1.0, "_source":{ "id": "2", "name" : "Orient Technologies" } } ] } }
{
"hits" : {
"total" : 3,
"max_score" : 1.0,
"hits" : [ {
"_index" : "companies",
"_type" : "company",
"_id" : "4",
"_score" : 1.0,
"_source":{ "id": "4", "name" : "MapR" }
}, {
"_index" : "companies",
"_type" : "company",
"_id" : "1",
"_score" : 1.0,
"_source":{ "id": "1", "name" : "Elastic" }
}, {
"_index" : "companies",
"_type" : "company",
"_id" : "2",
"_score" : 1.0,
"_source":{ "id": "2", "name" : "Orient Technologies" }
} ]
}
}
How about Elasticsearch 2.0 compatibility?
We’ll release the 2.0 compatible version shortly, Stay tuned!! (e.g. by signign up here below )