MySQL – How should I save Apache logs into a MySQL table

index, mysql, performance, primary-key

I have a PHP script to view Apache logs.

I want to save the Apache logs into a MySQL database.

Then I want to add some rules for tagging URLs using MySQL REGEXP searches, like: UPDATE ... SET tag='some tag' WHERE url REGEXP 'some pattern';

a) Should I use one table for storing all the urls every time they are accessed even if they repeat and then do the REGEXP search and apply the tag to all of them?

b) Or would it be better to keep one table with unique URLs, and a second table with the id of the URL and the time it was accessed? The tagging would then be applied to the unique-URL table, which has fewer rows when URLs repeat.

If option 'b' is better, what kind of index should I use for the unique URLs? A varchar(4000) primary key? I was thinking about creating an MD5 hash of the URL string and using that as the primary key because it will be shorter.

I ask this question because I want to know what would be best performance when:

  • Tagging many URLs with a REGEXP search
  • Importing thousands of URLs into one table and making sure they are unique

Thanks!

Best Answer

Because this is really logging data that you are capturing, I would first store it in its raw form, then ETL/normalize it as needed.

For searching, you can index just the first few characters of each URL. Say every URL starts with http:// or https://: store the part past the protocol in its own column and limit the index to its first three characters.

CREATE TABLE log (
  datetime_created DATETIME,
  url VARCHAR(1024),
  domainname VARCHAR(255),
  someotherdata VARCHAR(255),
  ...
  INDEX `idx_domain` (domainname(3))
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED;

  • InnoDB will allow you to search the table without locking it.
  • Compression will help with disk space.
  • INDEX idx_domain (domainname(3)) indexes the first 3 characters of the domainname column and speeds up searches on it. REGEXP itself cannot use an index, so the strategy is to match on the front of the domain name first (for example with a left-anchored LIKE) and apply the REGEXP only to the rows that remain.
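
A tagging query along those lines might look like the sketch below. It assumes a tag column is added to the log table (the table above does not define one), and the 'www%' prefix and both patterns are placeholders rather than values from the question.

```sql
-- Assumption: the log table gets a nullable tag column for the rules to fill in.
ALTER TABLE log ADD COLUMN tag VARCHAR(64) NULL;

-- The left-anchored LIKE can be served by idx_domain (domainname(3)),
-- so the REGEXP is only evaluated on rows that survive the prefix filter.
UPDATE log
SET tag = 'some tag'
WHERE domainname LIKE 'www%'        -- placeholder prefix, narrows rows via the index
  AND url REGEXP 'some pattern';    -- placeholder pattern, checked row by row
```

Running EXPLAIN on the equivalent SELECT is a quick way to check whether idx_domain is actually chosen for a given prefix.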

Scale will eventually be an issue if the site becomes popular, so buyer beware.
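
If the raw table does grow, the ETL/normalize step mentioned at the top of this answer could be sketched roughly as follows, using the MD5 idea from the question. The urls and url_hits tables and the url_hash and tag columns are hypothetical names chosen for illustration; only log, url and datetime_created come from the table above.

```sql
-- Hypothetical table of unique URLs; the 16-byte MD5 digest is the unique key,
-- which keeps the index far smaller than one on the full URL string.
CREATE TABLE urls (
  url_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  url_hash BINARY(16) NOT NULL,
  url      VARCHAR(1024) NOT NULL,
  tag      VARCHAR(64) NULL,
  UNIQUE KEY uq_url_hash (url_hash)
) ENGINE=InnoDB;

-- Hypothetical table with one row per access.
CREATE TABLE url_hits (
  url_id           INT UNSIGNED NOT NULL,
  datetime_created DATETIME NOT NULL,
  INDEX idx_url_time (url_id, datetime_created)
) ENGINE=InnoDB;

-- Copy each distinct URL once; IGNORE skips URLs whose hash is already stored.
INSERT IGNORE INTO urls (url_hash, url)
SELECT DISTINCT UNHEX(MD5(url)), url
FROM log
WHERE url IS NOT NULL;

-- Record every access against the matching unique URL.
INSERT INTO url_hits (url_id, datetime_created)
SELECT u.url_id, l.datetime_created
FROM log AS l
JOIN urls AS u ON u.url_hash = UNHEX(MD5(l.url));

-- Tagging now touches one row per unique URL instead of one row per hit.
UPDATE urls SET tag = 'some tag' WHERE url REGEXP 'some pattern';
```

The REGEXP still scans every row of urls, but that table only grows with the number of distinct URLs rather than with traffic.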