Postgresql – Use case for hstore

hstorepostgresqlpostgresql-9.4

This is the first time (I think) I have the chance to use the hstore data type, but I would like to hear from more experienced folks if what I have in mind is actually a good idea.

Now, there's this web app in which we import payroll data in the form of XML files, which look roughly like this simplified version:

<Company>
    <Employee>
        <LastName>Smith</LastName>
        <FirstName>John</LastName>
        <HourlyWage>9999.99</HourlyWage>
        <!-- several other hundreds of tags -->
    </Employee>
    <Employee>
        <!-- ... -->
    </Employee> 
</Company>

Each employee carries extremely detailed information, and when I say "several other hundreds of tags" it's because they will usually range between 800 and upwards of 1400. And this is for each month. Apart from a set of core tags, each employee can have a different combination of them, hence the very fluctuating but very realistic figures I gave above.

Now, part of this data is imported with a long, slow and very complex process, and I've observed that, with increasing frequency, we are finding ourselves saying "gosh, if only we had always imported that particular tag!".

While the import process is highly configurable, making it run for just a small set of data is slow, impractical and downright painful. Adding whatever customization is needed to import the hypothetical new tag from now on is much easier, but for building up historic data as if we always imported it, it's messy and error prone.

As an added bonus, the task always falls on those two guys, and I'm one of them, and I would love to make our life a tad simpler.

This is why I'm thinking of writing a quick tool that, during the night, will crack those XML files open and, for each month and each employee, create a record with an hstore column containing all of the employee's tags for that month.

Being an absolute beginner at hstore, this looks to me like a very good use case for it, especially if we consider that:

  • since the tags can differ for each employee, this is essentially schema-less data.

  • storing the tags as an EAV with one row per tag would mean approximately 240k rows per month for a 200-employees company (well into the 2.8M per year). Nothing to be concerned about, sure, but there's not only one customer. And one of them has over 7,000 employees (which would be 100M records per year).

  • this data will only ever need to be read, never changed. Plus, it won't even be read that often anyway.

  • I don't really care or know what any given tag means. I just want to store it for future use, telling me which one is needed is the job of the domain experts. Again, schema-less.

The table I would design would look a bit like this:

- id bigserial
- user_id
- file_timestamp (it's embedded on the name of the file)
- employee_id_1 varchar
- employee_id_2 varchar
- month date
- file_id (id of the XML file, it gets logged in a table before being imported)
- tags hstore

(of employee_id 1 and 2, the former looks like the US SSN, the latter is assigned from our payroll app).

I would also create a unique index on (user_id, employee_id_1, employee_id_2, month, file_id). Not 100% sure about the order of the columns, but I think it would accommodate most SELECTs where one wants to progressively narrow the data down.

Also, I wouldn't want to duplicate the table for each customer, and I don't want or need it to be user-visible. I would create a dedicated schema and stick it in there, which would be a bit easier to manage in case I want to partition it at some point in the future. It's going to be a huge table (not so much for the number of rows, but for the space required for each of them), but keeping into its own schema also means it will be easier to exclude from most backups. Plus, we keep the original XML files anyway, so it wouldn't even be hard to rebuild if things go awry.

With such a design, it looks like generating historical data with a bunch of ad-hoc one-off queries would be child's play.

But since I'm not an expert, I was wondering if:

  1. this is actually a good use case for hstore
  2. there are obvious flaws in my design
  3. there are things I should pay attention to in order to avoid painting myself into a corner when hstore enters the scene
  4. since the tags are in a sizable number but still not that many per row, would it be worth it creating an index on the hstore keys?

Best Answer

At the scope at which you're working. I think JSONB is ideal. It handles deeply nested structures and structures with array-keys. It's also standardized and in the spec for sql2016.

In addition, as I answered here, there is an extension that should help you with space consumption called ZSON,

ZSON is a PostgreSQL extension for transparent JSONB compression. Compression is based on a shared dictionary of strings most frequently used in specific JSONB documents (not only keys, but also values, array elements, etc).

In some cases ZSON can save half of your disk space and give you about 10% more TPS. Memory is saved as well. See docs/benchmark.md. Everything depends on your data and workload, though. Don't believe any benchmarks, re-check everything on your data, configuration, hardware, workload and PostgreSQL version.

You may want to look into ZSON.