Elasticsearch reindexing best practice

We occasionally find that new items do not come back in search results. NOTE: New records usually populate just fine; this only happens intermittently. We can fix the issue by running a system reindex from Admin, but we'd rather not do this manually, and I see that there is a CLI command to reindex. Is it good practice to run a system reindex from the CLI on a schedule? How often: nightly? According to the documentation[1], this does not sound like an unusual occurrence. Does that match others' experience?

[1] Performing a system index may resolve search issues for records that have been recently created or modules that were recently enabled for search

0 André Lopes over 4 years ago

In which module does that happen? What is the type of ES fields configured in those modules?

André Lopes
Lampada Global

Skype: andre.lampada
0 Brad Pitcher over 4 years ago in reply to André Lopes

It happens with both accounts and contacts. It's happening with some accounts right now, even though we reindexed just last night. I'm not sure how relevant this is, but Amazon ESS always shows a non-zero number of "Deleted Documents" while the problem is occurring, and a reindex drops that number to 0.
0 Gregory Kenenitz over 4 years ago in reply to Brad Pitcher

I'm working on the same project as Brad...

What is the type of ES fields configured in those modules?

All of the fields for `Accounts` and `Contacts` that ship as globally searchable, plus a few custom fields.
0 André Lopes over 4 years ago

We faced a situation in a customer's project where we had implemented Visibility Strategy specific bool fields in ES.

After a reindex it works great, but after editing any record, even if the target bool fields remain unchanged, the record is no longer fetched from ES.

By debugging the ES data we realized that after reindexing the bool fields are set to true or false, but after a record is edited those fields are updated to 0 or 1, so ES no longer matches them because the value types differ.

We fixed that odd issue by accepting both possible values in the Visibility Strategy (true or 1 / false or 0).

I'm not sure if your issue is related.

André Lopes
Lampada Global

Skype: andre.lampada
0 Brad Pitcher over 4 years ago in reply to André Lopes

André, please excuse my ignorance on this subject, but I don't understand what you mean by this: "We implemented Visibility Strategy specific bool fields in ES"
Is there any SugarCRM documentation describing what this means?
0 Enes Saridogan over 4 years ago
Hi Brad,

For one of our customers we implemented a weekly cron job that reindexes the whole database. It runs on Sunday evening so that the index is ready on Monday morning.

Here is the code:

#!/bin/bash
# Request an OAuth token (fill in client_secret, username and password).
json=$(curl -X POST -H Cache-Control:no-cache -H "Content-Type: application/json" -d '{
"grant_type":"password",
"client_id":"sugar",
"client_secret":"",
"username":"",
"password":"",
"platform":"mobile"
}' http://crm-address/rest/v10/oauth2/token)
# Extract the access_token value from the JSON response.
access_token=$(echo "$json" | sed "s/{.*\"access_token\":\"\([^\"]*\).*}/\1/g")
# Trigger a full reindex, clearing the existing index data first.
curl -X POST -H "oauth-token: $access_token" -H Cache-Control:no-cache -H "Content-Type: application/json" -d '{
"clear_data":true
}' http://crm-address/rest/v10/Administration/search/reindex

There is probably a better way to do this, but it works as expected :-)
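The token extraction step of the script can be pulled into a small function, which makes it easy to check by hand against a sample response. This is only a minimal sketch: the function name `extract_access_token` is hypothetical, and it assumes the key and its value sit on one line and the token contains no escaped quotes.

```shell
#!/bin/bash
# Hypothetical helper (not part of SugarCRM): print the value of
# "access_token" from a JSON string passed as the first argument.
# Assumes the token value contains no escaped quotes.
# Prints nothing when the key is absent (for example on a login error).
extract_access_token() {
  printf '%s\n' "$1" |
    sed -n 's/.*"access_token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

With this in place the script could check for an empty result before calling the reindex endpoint, instead of sending a request with a blank `oauth-token` header.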

Best regards,

Enes
0 André Lopes over 4 years ago in reply to Brad Pitcher

Yes, here is the official documentation.

The purpose of such a strategy is to restrict access to data on the fly, depending on criteria (Teams, Roles, the User itself). This dynamic filtering affects both SugarQuery and ES, on any platform (base, portal, mobile, custom, etc.).

Regards

André Lopes
Lampada Global

Skype: andre.lampada
0 Brad Pitcher over 4 years ago in reply to Enes Saridogan
Thank you Enes, we did end up setting up a cron job to re-index, using the CLI:
./bin/sugarcrm search:reindex -n --clearData

We are running it nightly, but we still end up with issues during the day. Today, for example, some accounts were missing, and these are not even new accounts. It's almost as if the reindex did not include them for some reason. Do you have any idea how we can go about debugging this?
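One way to start debugging a cron-driven reindex is to keep its output and exit status, so a failed or partial run is visible the next morning. The following is a rough sketch only: `SUGAR_DIR`, `LOG_FILE`, `run_logged` and `reindex` are hypothetical names, and the only thing taken from this thread is the CLI command itself.

```shell
#!/bin/bash
# Assumed locations; adjust to the actual installation.
SUGAR_DIR="${SUGAR_DIR:-/var/www/sugarcrm}"
LOG_FILE="${LOG_FILE:-/tmp/sugar-reindex.log}"

# Run a command, append its output to LOG_FILE with start and
# exit-status lines, and return the command's own exit status.
run_logged() {
  local rc
  echo "$(date '+%Y-%m-%d %H:%M:%S') start: $*" >> "$LOG_FILE"
  "$@" >> "$LOG_FILE" 2>&1
  rc=$?
  echo "$(date '+%Y-%m-%d %H:%M:%S') exit status: $rc" >> "$LOG_FILE"
  return "$rc"
}

# Run the CLI reindex from the Sugar root directory.
# The subshell keeps the caller's working directory unchanged.
reindex() {
  ( cd "$SUGAR_DIR" && run_logged ./bin/sugarcrm search:reindex -n --clearData )
}
```

Calling `reindex` from the cron script then leaves a timestamped record of each run, which can be compared against the times at which records go missing.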
0 Enes Saridogan over 4 years ago in reply to Brad Pitcher
You're welcome Brad.

How are your accounts created? Via the REST API, I suppose, or manually in the CRM? Can you share the following config values with me, please?

'search_engine' =>
array (
'max_bulk_query_threshold' => 15000,
'max_bulk_delete_threshold' => 999,
),

We faced an issue three years ago where indexing never finished because the threshold was too high: the query didn't complete, the cron job failed with an error, and the fts_queue table grew very, very big :-)

You can also apply the following non-upgrade-safe (NUS) modification:

src/Elasticsearch/Queue/QueueManager.php

function generateQueryModuleFromQueue

protected function generateQueryModuleFromQueue(\SugarBean $bean, int $bucketId = self::DEFAULT_BUCKET_ID)
{
    // Get all bean fields
    $beanFields = array_keys(
        $this->container->indexer->getBeanIndexFields($bean->module_name, true)
    );
    $beanFields[] = 'id';
    $beanFields[] = 'deleted';

    $sq = new \SugarQuery();
    // disable team security
    // add erased fields
    $sq->from($bean, ['add_deleted' => false, 'team_security' => false, 'erased_fields' => true]);
    $sq->select($beanFields);
    $sq->limit($this->maxBulkQueryThreshold);

    // join fts_queue table
    if ($this->isDefaultBucketId($bucketId)) {
        $sq->joinTable(self::FTS_QUEUE)->on()
            ->equalsField(self::FTS_QUEUE . '.bean_id', 'id')
            ->equals(self::FTS_QUEUE . '.bean_module', $bean->module_name);
    } else {
        $sq->joinTable(self::FTS_QUEUE)->on()
            ->equalsField(self::FTS_QUEUE . '.bean_id', 'id')
            ->equals(self::FTS_QUEUE . '.processed', $bucketId)
            ->equals(self::FTS_QUEUE . '.bean_module', $bean->module_name);
    }

    $additionalFields = array(
        array(self::FTS_QUEUE . '.id', 'fts_id'),
        array(self::FTS_QUEUE . '.processed', 'fts_processed'),
    );
    $sq->select($additionalFields);

    return $sq;
}

The part we modified is after the comment "// join fts_queue table", where we added the condition:

->equals(self::FTS_QUEUE . '.bean_module', $bean->module_name);

This code is ok for a V10.

You can begin by checking the fts_queue table daily; it should be empty most of the time.

Hope this helps :)
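A daily check of the queue could look roughly like the sketch below. It assumes a MySQL backend with credentials in `~/.my.cnf`; the `SUGAR_DB` variable and all function names are hypothetical, while the `fts_queue` table and its `bean_module` column come from the query code in this post.

```shell
#!/bin/bash
# Assumed database name; adjust to the actual instance.
SUGAR_DB="${SUGAR_DB:-sugarcrm}"

# Print the SQL that counts queued rows per module.
fts_queue_sql() {
  echo "SELECT bean_module, COUNT(*) AS queued FROM fts_queue GROUP BY bean_module ORDER BY queued DESC;"
}

# Print "module<TAB>count" lines using the mysql command-line client.
fts_queue_report() {
  mysql --batch --skip-column-names "$SUGAR_DB" -e "$(fts_queue_sql)"
}

# Read "module count" lines on stdin and succeed only if the total
# number of queued rows is at most the threshold given as $1.
queue_within_limit() {
  awk -v max="$1" '{ total += $2 } END { if (total + 0 > max + 0) exit 1 }'
}
```

For example, `fts_queue_report | queue_within_limit 100 || echo "fts_queue backlog"` could be run from cron to flag a queue that is not draining; the threshold of 100 is arbitrary.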

Best regards,

Enes
0 Brad Pitcher over 4 years ago in reply to Enes Saridogan

We are creating accounts via the REST API. We have not changed those config values, so I guess we are using the defaults of `max_bulk_query_threshold=15000` and `max_bulk_delete_threshold=3000`. This is a very new project (not even released yet) with a small amount of data. We've added some custom fields, but otherwise everything is out-of-the-box configuration.

We ran a reindex every hour overnight last night, and the number of searchable documents was different after every run. Any idea why that would be? Data was not changing during most of this time. The `fts_queue` table is empty.
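To compare document counts between reindex runs, the live and deleted counts can be recorded from Elasticsearch's `_cat/indices` API before and after each run. This is a sketch under assumptions: `ES_URL` and both function names are hypothetical, and on a managed service the endpoint may need authentication that is not shown here.

```shell
#!/bin/bash
# Assumed Elasticsearch endpoint; adjust to the actual cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Print one "index docs.count docs.deleted" line per index.
es_doc_counts() {
  curl -s "$ES_URL/_cat/indices?h=index,docs.count,docs.deleted"
}

# Sum the count columns read on stdin; prints "<live> <deleted>".
sum_doc_counts() {
  awk '{ live += $2; deleted += $3 } END { printf "%d %d\n", live, deleted }'
}
```

Appending `es_doc_counts | sum_doc_counts` output with a timestamp to a file around each hourly run would show whether the totals drift while the data itself is unchanged.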
