How to use Apache Solr: fundamental rules before working with it


There are some fundamental rules to understand before working with Apache Solr. These rules relate to the different ways you can configure and use it.

Rule #1

Solr can not only index your data but also store it. This doesn’t mean you can get rid of your application’s data source. Solr should never be used as the only data source. It’s a tool that lets your application handle data in ways that a classic data source such as a relational database cannot. Solr is therefore a complement to your data source, not a substitute for it.

Rule #2

When you use Apache Solr to index your data, each index must have a unique field that serves as an identifier. For example, one way to send data to Solr is to use a DIH (Data Import Handler). Although DIH is not a Solr reserved word, the concept is used in much the same way across indexing tools.

A DIH is a configuration that says where data is acquired from and how the Solr index is going to be populated. For example, you can set up a database DIH so that a SQL SELECT is executed against a relational database. Another type of DIH configuration looks for certain files in a folder, parses their data and populates the index.

You can have more than one database or file DIH in a Solr instance. You can even set up both kinds to send data to your Solr. But there is one thing you cannot forget when using more than one DIH: data coming from all DIHs must have an id that is unique among them.

Imagine you have two different tables in two different databases and set up two DIHs. If both tables have an auto-increment primary key, what will happen? Solr may receive two different types of data with the same id (both tables might have rows with a primary key value of 100, for example). The data indexed first may then be overwritten by the second, as both have the same id.

What to do?

Add another field to your index and set it up as the unique field in Solr’s schema.xml. Go to each of your DIHs and add another “column” to the SELECT statement that returns something like “'table1-' + your_auto_inc_field”. This will produce data such as table1-100, table1-101, table1-102 and so on for one of the tables, and table2-100, table2-101, table2-102 etc. for the other database table. Of course, it’s much better to use a prefix that names the data stored in each table instead of calling it “table1”.
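The same idea can be sketched outside of SQL. The snippet below is only an illustration: the table names, the rows, the `uid` field and the `make_unique_id` helper are all made up for this example, and a plain dictionary stands in for the Solr index.

```python
def make_unique_id(source: str, primary_key: int) -> str:
    """Prefix a table-local primary key with a label naming its source."""
    return f"{source}-{primary_key}"


# Two hypothetical tables whose auto-increment keys overlap (both contain 100).
products = [{"id": 100, "name": "keyboard"}, {"id": 101, "name": "mouse"}]
customers = [{"id": 100, "name": "Alice"}]

# A dict keyed by the unique field stands in for the Solr index:
# writing a document with an existing key replaces the previous one.
index = {}
for source, rows in (("product", products), ("customer", customers)):
    for row in rows:
        doc = dict(row, uid=make_unique_id(source, row["id"]))
        index[doc["uid"]] = doc

print(sorted(index))
```

Keyed on the raw `id`, the customer row would have replaced the product row with id 100. With the prefixed `uid`, all three documents are kept.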

This seems simple, but it is a frequently asked question.

Rule #3

Sometimes you need to index different types of data that have almost no properties in common, so that sharing the same Solr index becomes an ugly solution. This is where Solr’s Multicore feature comes in.

A core is a Solr index in a Solr installation. When a Solr installation has more than one index we call it Multicore.

Each Solr core has its own data (represented by the “data” folder) and its own configuration (represented by the “conf” folder, where the schema.xml and solrconfig.xml files go). With a multicore setup you have more than one set of data and conf folders, which means you have more than one Solr index, each with its independent data and configuration.

Although a multicore Solr installation has only one URL for entering the Solr admin interface, on the first screen you will be asked to choose which core you want to administer.
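Outside the admin interface, requests normally address a core by putting its name in the URL path. Here is a minimal sketch of that, assuming a local Solr on the usual port 8983 and two made-up cores named `products` and `customers`; the `select_url` helper is hypothetical.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumed host and port


def select_url(core: str, query: str) -> str:
    """Build the search URL for one core; the core name is part of the path."""
    return f"{SOLR_BASE}/{core}/select?{urlencode({'q': query})}"


# Each core is queried on its own, against its own schema and data.
print(select_url("products", "name:keyboard"))
print(select_url("customers", "name:alice"))
```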

Rule #4

When an index becomes too large to fit on a single system, or when a single query takes too long to execute, the index can be split across multiple Solr installations, and Solr can query and merge results across those installations. In the Solr world these installations are called Shards.

Since Shards act as a distributed search mechanism, Rule #2 discussed above must be applied here. This means that the unique key field must be unique across all Shards.
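In Solr’s classic distributed search, the request carries a `shards` parameter listing the installations to query, and the instance that receives the request merges the results. The sketch below only builds such a URL; the host names and the `distributed_query_url` helper are assumptions made for illustration.

```python
from typing import Sequence
from urllib.parse import parse_qs, urlencode, urlsplit


def distributed_query_url(entry_point: str, shards: Sequence[str], query: str) -> str:
    """Build a search URL that asks one instance to query every listed shard."""
    params = {"q": query, "shards": ",".join(shards)}
    return f"http://{entry_point}/select?{urlencode(params)}"


# Hypothetical shard addresses, written as host:port/path without a scheme.
shards = ["host1:8983/solr", "host2:8983/solr"]
url = distributed_query_url(shards[0], shards, "title:solr")

print(url)
print(parse_qs(urlsplit(url).query))  # decoded view of the parameters
```

Any of the shards can act as the entry point, since the full list travels with each request.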

Distributed search with Shards doesn’t have anything to do with indexing. You must decide and configure which data set goes to each Solr instance, for example by setting up DIHs, taking care to avoid indexing the same data in more than one of the Solr installations that make up the Shards.
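One possible way to make that decision, offered here only as an example strategy and not as something Solr does for you in this setup, is to derive the target installation from a stable hash of the document id. The `shard_for` helper and the sample ids are hypothetical.

```python
import zlib


def shard_for(doc_id: str, shard_count: int) -> int:
    """Map a document id to a shard number using a stable CRC32 hash."""
    return zlib.crc32(doc_id.encode("utf-8")) % shard_count


# Each id always maps to the same shard number, so a document
# is never indexed in two installations by accident.
ids = ["product-100", "product-101", "customer-100"]
assignment = {doc_id: shard_for(doc_id, 2) for doc_id in ids}
print(assignment)
```

The loader for each installation would then index only the documents assigned to its number. Changing the number of Shards changes the assignment, so it would require reindexing.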

It’s important to note that Shards should not be used as a way to improve the capacity (in terms of queries per second) of a Solr instance. If this is what you’re looking for, you should resort to Solr replication.

Rule #5

If you want to improve the capacity or the availability of an existing Solr installation, the way to go is Solr replication.

In a typical Solr replication configuration you will have a master Solr installation and at least one slave installation. The master installation is the one connected to your data source to index data. It is the slaves that periodically poll the master in order to get updated data. Therefore, the master doesn’t know anything about the slaves.

Using this model, Solr can scale horizontally with ease. Just add more slaves as necessary to handle any given load and then set up a load balancer (you’ll need one!).
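To make the division of work concrete, here is a toy round-robin router: updates go to the master and searches are spread across the slaves. It is a sketch rather than a real load balancer, and the host names and the `endpoint_for` helper are made up for this example.

```python
from itertools import cycle

# Hypothetical host names for one master and two slaves.
MASTER = "http://master:8983/solr"
SLAVES = ["http://slave1:8983/solr", "http://slave2:8983/solr"]

_next_slave = cycle(SLAVES)  # endless round-robin over the slaves


def endpoint_for(operation: str) -> str:
    """Send updates to the master and spread searches across the slaves."""
    if operation == "update":
        return MASTER
    return next(_next_slave)


print([endpoint_for("select") for _ in range(3)])
print(endpoint_for("update"))
```

Adding capacity then means appending another slave to the list. In production a dedicated load balancer would also check that each slave is healthy before sending traffic to it.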

Here is a good place to learn more about Solr distribution and replication concepts.

That’s all for this post. If you’d like to talk more about it, or if you still have any questions, feel free to post a comment. It will be my pleasure to discuss it further.

2 comments on “How to use Apache Solr: fundamental rules before working with it”

  1. Thanks for posting these details. They helped me understand Solr better.

    Just one question: are there any good practices to follow when we have a large load of data to index in a Solr server? For example, if I have 10 million records, each holding 15 to 20 fields, how can we speed up data indexing in the Solr server?

  2. Hi abhay,

    You can speed up data loading by developing your own loading engine (the one responsible for reading from the data source and posting the data to Solr), which means not using any of Solr’s existing import handlers, and by having many instances of it reading distinct data sets at the same time. Or you could do the same with several of Solr’s data import handlers 🙂
