Review of "Mining Modern Repositories with Elasticsearch"

11 Oct 2015

Organizations are generating data at a rate that exceeds their ability to analyze it, yet analyzing this data is crucial to their success. Elasticsearch was developed to meet this need. It is a distributed full-text search engine that is scalable and provides near real-time query responses.

Elasticsearch (ES) is based on Apache Lucene: each ES index consists of one or more Lucene indices, called shards. When a new document is added to an index, the ES server determines which shard will be responsible for storing and indexing it. Because documents are spread across shards this way, ES can automatically balance shards among the nodes in a cluster. ES provides a REST API for communication with other applications. It is schema-less, so you can insert different types of documents into the same index, but you can also use a fixed mapping to better control the data inserted into the index, for example to disable indexing on certain fields. To balance insertion throughput against data visibility, ES refreshes each index at a fixed, tunable interval; newly inserted documents become searchable only after the next refresh. You can search ES using filters or queries; only queries also compute relevance scores for the returned items.
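As a rough illustration of the index setup and shard assignment described above, here is a minimal Python sketch. It only builds the JSON body that would be sent to the REST API when creating an index, and it mimics shard selection with a simplified form of the documented routing rule (hash of the routing key modulo the number of primary shards). The field names `message` and `raw_payload`, the shard count, the `30s` interval, and the function `pick_shard` are all hypothetical, and `zlib.crc32` is only a stand-in for the hash ES actually uses. The mapping follows the typeless layout of recent ES releases; older releases nested mappings under a type name and wrote `"index": "no"` instead.

```python
import json
import zlib

# Hypothetical number of primary shards (Lucene indices) for the index.
NUM_PRIMARY_SHARDS = 3

# Body for an index-creation request. All field names are illustrative.
index_body = {
    "settings": {
        "index": {
            "number_of_shards": NUM_PRIMARY_SHARDS,
            # Longer interval: better insertion throughput,
            # but new documents become searchable later.
            "refresh_interval": "30s",
        }
    },
    "mappings": {
        "properties": {
            # Analyzed for full-text search.
            "message": {"type": "text"},
            # Stored with the document but not searchable.
            "raw_payload": {"type": "keyword", "index": False},
        }
    },
}


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Simplified routing rule: hash(routing_key) % num_primary_shards.

    crc32 is an assumption used for illustration, not the hash ES uses.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    print(json.dumps(index_body, indent=2))
    for doc_id in ("commit-1", "commit-2", "issue-7"):
        print(doc_id, "-> shard", pick_shard(doc_id, NUM_PRIMARY_SHARDS))
```

Because the shard is derived from the routing key alone, the same document always maps to the same shard, which is also why the number of primary shards cannot simply be changed after the index is created.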

ES provides horizontal scalability and much better performance than a traditional RDBMS for similar queries, while also offering greater agility. However, it has weaknesses as well, such as the lack of access control lists (ACLs) and the steep learning curve of its query language.
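To give a feel for the query language, the following sketch builds two search request bodies in Python: one that uses only a scored full-text query, and one that adds a filter, which narrows the result set without contributing to the relevance score. The field names `message` and `author` and both helper functions are hypothetical. The `bool` query with a `filter` clause is the form used by more recent ES releases; the version discussed in the paper may have used a different syntax.

```python
import json


def scored_search(text: str) -> dict:
    """Full-text query: matching documents are ranked by relevance score."""
    return {"query": {"match": {"message": text}}}


def filtered_search(text: str, author: str) -> dict:
    """Same full-text query, restricted by a yes/no filter.

    The filter clause only includes or excludes documents;
    it does not change their relevance scores.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {"message": text}}],
                "filter": [{"term": {"author": author}}],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(scored_search("null pointer"), indent=2))
    print(json.dumps(filtered_search("null pointer", "alice"), indent=2))
```

Even this small example nests four levels deep, which hints at why the query language takes some time to learn.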

You can use ES to build modules that require search at a scale beyond the limits of a traditional RDBMS. It fills the gap between traditional RDBMSs and certain analytical requirements well. But for many kinds of analytics, especially those that require structured data, it is still better to rely on big data processing frameworks and implement the logic in your own code. Even so, ES serves unstructured full-text analytics very well, so it is still a very good project.