Postgresql – How to replace a substring within a string, but only if substring is matched within a regex group

postgresql

I need to extract information from a malformed json-formatted string. Due to an error, all quotes have been stripped, so instead of:

{"receipt_type":"Production","adam_id":1233,"app_item_id":1233,"more":[{"h":234}]}

I have:

 {receipt_type:Production,adam_id:1233,app_item_id:1233,more:[{h:234}]}

Using

replace(replace(replace(replace(replace(replace("transactionReceipt", ':', '":"'),',','","'),'{','{"'),'}','"}'),':"[{',':[{'), ']"}', ']}')

I can get mostly fine json, HOWEVER this incorrectly turns "receipt_creation_date":"2021-01-24 03:11:53 Etc/GMT" into "receipt_creation_date":"2021-01-24 03":"11":"53 Etc/GMT"

Using

REGEXP_REPLACE(replace(replace(replace(replace(replace(replace("transactionReceipt", ':', '":"'),',','","'),'{','{"'),'}','"}'),':"[{',':[{'), ']"}', ']}'),'\d\d\"\:\"\d\d\"\:\"\d\d','','g')

I can match the 03":"11":"53 and remove it totally, yielding correct JSON.

But I want to retain the time, meaning that I swap ":" with : but ONLY if ":" is within the regex pattern of '\d\d\"\:\"\d\d\"\:\"\d\d'?

Thank you!

edit: a complete malformed record (it's actually an IAP transaction receipt from Apple) looks like

{receipt_type:Production,adam_id:123456789,app_item_id:123456789,bundle_id:example.app.yo,application_version:101,download_id:123456789,version_external_identifier:123456789,receipt_creation_date:2020-03-19 16:18:27 Etc/GMT,receipt_creation_date_ms:1584634707000,receipt_creation_date_pst:2020-03-19 09:18:27 America/Los_Angeles,request_date:2020-03-19 16:18:28 Etc/GMT,request_date_ms:1584634708876,request_date_pst:2020-03-19 09:18:28 America/Los_Angeles,original_purchase_date:2020-02-28 17:55:33 Etc/GMT,original_purchase_date_ms:1582912533000,original_purchase_date_pst:2020-02-28 09:55:33 America/Los_Angeles,original_application_version:101,in_app:[{quantity:1,product_id:com.eample.iap.id,transaction_id:123456789,original_transaction_id:123456789,purchase_date:2020-03-19 16:18:27 Etc/GMT,purchase_date_ms:1584634707000,purchase_date_pst:2020-03-19 09:18:27 America/Los_Angeles,original_purchase_date:2020-03-19 16:18:27 Etc/GMT,original_purchase_date_ms:1584634707000,original_purchase_date_pst:2020-03-19 09:18:27 America/Los_Angeles,is_trial_period:false}]}

Best Answer

I would use two nested regexp_replace() calls:

  • first to enclose all keys with double quotes
  • then to enclose all values with double quotes
with data (input) as (
  values ('{receipt_type:Production,adam_id:1233,app_item_id:1233,more:[{h:234}]}')
)
select regexp_replace(
          regexp_replace(input, '(\w+)(:)', '"\1"\2', 'g'),
          '(:)(\w+)(,|})', '\1"\2"\3', 'g')
from data;          

returns:

{"receipt_type":"Production","adam_id":"1233","app_item_id":"1233","more":[{"h":"234"}]}

If you want have the integers without quotes, you could run a third regexp_replace that removes the quotes from integers with quotes:

regexp_replace(..., '"([0-9]+)"', '\1', 'g')