# Mysql full text search with other indexed data

full-text-searchMySQLperformance

After getting some mysql optimization advice in this question. Was able to get the performance to the acceptable levels (when performing lookups using EventType and event_log_domain_id combo). To recap my database table is around 11mil records, and these are the columns(relevant ones) :

id| EventTime | EventType | EventName ... | event_log_domain_id


Now however on top of filtering(exact match) on EventType and event_log_domain_id I need to add EventName into the mix and perform wildcard search.

After looking at mysql docs and options available the fulltext search option seemed very promising, however running this one :

SELECT COUNT(*) FROM access_logs
WHERE access_logs.event_log_domain_id IN (8, 59, 920, 1054, 2227)
AND (MATCH( EventName ) AGAINST( 'business' ))
AND access_logs.EventType IN (1, 5)


And this one :

SELECT  access_logs.* FROM access_logs
WHERE access_logs.event_log_domain_id IN (8, 59, 920, 1054, 2227)
AND (MATCH( EventName ) AGAINST( 'business' ))
AND access_logs.EventType IN (1, 5)
ORDER BY id desc LIMIT 50


It seemed like Full-Text index is used no matter what, and I have confirmed that first by running explain which looks something like this:

id| select_type | table       | type     | possible_keys                                                                                                 | key            | key_len | ref  | rows | extra
1 | SIMPLE      | access_logs | fulltext | IndexEventType, event_log_domain_id, EventTypeAndeventlogdomainid,eventlogdomainidAndEventType,IndexEventName | IndexEventName | 0       | null | 1    | Using where; Using filesort


It takes about 95-ish seconds to execute either of the queries above. If I take out the full text search using event type the queries run within few hundred ms.

When I run the fulltext search on it's own like so :

SELECT count(*) FROM access_logs
WHERE (MATCH( EventName ) AGAINST( 'business' ))


Executes roughly in 9seconds. Which is pretty high in itself but 'acceptable' given the number of rows.

Have tried tinkering with inner query alias like so :

SELECT
access_logs.*
FROM
inbound_access_logs access_logs
JOIN
(
SELECT  access_logs.* FROM access_logs
WHERE access_logs.event_log_domain_id IN (8, 59, 920, 1054, 2227)
ANDaccess_logs.EventType IN (1, 5)
) AS filtered_query
ON access_logs.id = filtered_query.id


Which executes in few seconds, but if I use the same approach with inner query alias and use it with full text search.

SELECT
access_logs.*
FROM
inbound_access_logs access_logs
JOIN
(
SELECT  access_logs.* FROM access_logs
WHERE access_logs.event_log_domain_id IN (8, 59, 920, 1054, 2227)
ANDaccess_logs.EventType IN (1, 5)
) AS filtered_query
ON access_logs.id = filtered_query.id
where MATCH( access_logs.EventName ) AGAINST( 'business' )


Again 8-9 seconds. Thinking that maybe I leave the like query in there, but I really wanted to optimize this that I don't have to touch it for a while.

Another option that I was thinking was to create a new table like so:

id | access_log_id | EventName


Would write to this table on data insert. On this table I would be having fulltext index on the EventName column, and I would join this table when doing lookups, was hoping this would make the query optimizer think 'better' given that it's two tables, hopefully using one index per table in the same search. Not sure if this is a good idea or not.

My question is: How do I go about optimizing query to filter on EventName with other columns I mentioned above? None of my approaches seems to be great

Update:
I have been experimenting with the inner query alias and like, this one :

SELECT
access_logs.*
FROM
inbound_access_logs access_logs
JOIN
(
SELECT  access_logs.* FROM access_logs
WHERE access_logs.event_log_domain_id IN (8, 59, 920, 1054, 2227)
ANDaccess_logs.EventType IN (1, 5)
) AS filtered_query
ON access_logs.id = filtered_query.id


And time varies depending on the EventName and other conditions. It ranges from 1 to 60 seconds. Which is frustrating, getting some closer runtime would be better.

I am running on the AWS RDS using m3.medium machine, not sure if that makes a difference. Would be good to know what else I can try or how to build in more certainty. Because once I run the query it remains in the database cache and performs fast the second time.

UPDATE 1

Showing table create statement:

'CREATE TABLE access_logs (
id int(11) NOT NULL AUTO_INCREMENT,
EventTime datetime NOT NULL DEFAULT ''0000-00-00 00:00:00'',
EventSize int(10) unsigned NOT NULL DEFAULT ''0'',
EventScore double DEFAULT NULL,
EventCustomer varchar(255) DEFAULT NULL,
event_log_domain_id smallint(6) DEFAULT NULL,
EventIdentifier varchar(255) DEFAULT NULL,
EventType tinyint(3) unsigned NOT NULL DEFAULT ''0'',
EventName varchar(255) DEFAULT NULL,
utc tinyint(1) DEFAULT ''0'',
job_queue_id varchar(255) DEFAULT NULL,
PRIMARY KEY (id),
KEY EventTime (EventTime),
KEY IndexIdentifier (EventIdentifier),
KEY IndexType (EventType),
KEY IndexUserType (EventIdentifier,EventType),
KEY event_log_domain_id (event_log_domain_id),
KEY job_queue_id (job_queue_id),
KEY EventTypeAndeventogdomainid (EventType,event_log_domain_id),
KEY eventogdomainidAndEventType (event_log_domain_id,EventType),
FULLTEXT KEY IndexEventName (EventName)
) ENGINE=InnoDB AUTO_INCREMENT=123163064 DEFAULT CHARSET=latin1'


We have built a page that filters on these events, there are 5 filters:

1. EventName (wildcard search %search_string%, not always provided)
2. EventCustomer (forward match search_string%, not always provided)
3. EventIdentifier (forward match search_string%, not always provided)
4. EventType (exact match, one or more ids but not always provided)
5. event_log_domain_id (exact match, one or more ids always provided)

It is possible for provide more than one filter, so you could provide name and customer for all log domains. or just event type for particular log domain etc.

We have a pagination on the page, so we do need to perform a count to get the row count.

Then we fetch 50/100 records at the time so we do limit too. We also provide default order, or give ability to order by few columns (EventCustomer, EventIdentifier, EventType and EventTime).

As I understand it, the FULLTEXT needs to be performed first. That is, your JOIN attempts make things worse.

access_logs.* probably has some TEXT columns, correct? (Please provide SHOW CREATE TABLE so I would have to guess so much.) That can be a lot to lug around. It may be better to fetch only id until you have done all the filtering, then do an extra JOIN to get *. id is lightweight; * may be bulky. And wasteful if you end up tossing most of the rows.

Give this a try:

SELECT  g.*
FROM ( SELECT  id
FROM  access_logs
WHERE  MATCH(EventName) AGAINST('business' IN BOOLEAN MODE)
) AS f
JOIN  access_logs AS g USING(id)
WHERE  g.domain_id IN (8, 59, 920, 1054, 2227)
AND  g.EventType IN (1, 5)


The LIMIT version needs some more care. First filter down to the 50 ids, then fetch the rest of the columns.

SELECT i.*
FROM ( SELECT  g.id
FROM ( SELECT  id
FROM  access_logs
WHERE  MATCH(EventName) AGAINST('business' IN BOOLEAN MODE)
) AS f
JOIN  access_logs AS g USING(id)
WHERE  g.domain_id IN (8, 59, 920, 1054, 2227)
AND  g.EventType IN (1, 5)
ORDER BY id DESC
LIMIT 50    -- Only here, not at the outer layer
) AS h
JOIN access_logs AS i  USING(id)
ORDER BY id DESC   -- Yes, this is needed a second time