Why you have to start loving JSON and stop using XML


I recently had to onboard XML data into Splunk. To say the least, this is not something that works straight out of the box.

This time around I decided to try to work smarter, so I started looking into tools to convert the XML data into something acceptable for Splunk – enter JSON.

The reason for going with JSON is that Splunk can ingest well-formed JSON with little configuration. Every event carries its own field names as keys, so no field extraction magic is required either.

I scoured the net and ended up using the now-deprecated XML2JSON from GitHub.
This Python script simply takes XML as input and spits out JSON, no questions asked.
For this project I specified the options "--strip_namespace --strip_newlines --pretty --strip_text".
xml2json.py -t xml2json -o ${OUTPUTFILE}.json ${INPUTFILE}.xml
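
Since the script is deprecated, here is a minimal sketch of the same idea using only the Python standard library. It is not the XML2JSON implementation: the function names, the "@" prefix for attributes and the "#text" key are my own assumptions for illustration, and it only roughly mimics what the namespace and text stripping options do.

```python
import json
import xml.etree.ElementTree as ET


def strip_ns(tag):
    """Drop a leading '{namespace}' prefix from an ElementTree tag name."""
    return tag.split("}", 1)[-1]


def element_to_data(elem):
    """Convert one element into nested dicts, lists and strings."""
    children = list(elem)
    text = (elem.text or "").strip()  # surrounding whitespace is dropped
    if not children and not elem.attrib:
        return text
    node = {}
    for name, value in elem.attrib.items():
        node["@" + strip_ns(name)] = value  # "@" prefix is an assumed convention
    for child in children:
        key = strip_ns(child.tag)
        value = element_to_data(child)
        if key in node:
            # Repeated sibling tags are collected into a list.
            if not isinstance(node[key], list):
                node[key] = [node[key]]
            node[key].append(value)
        else:
            node[key] = value
    if text:
        node["#text"] = text  # "#text" key is an assumed convention
    return node


def xml_to_json(xml_text):
    """Parse an XML string and return pretty-printed JSON."""
    root = ET.fromstring(xml_text)
    return json.dumps({strip_ns(root.tag): element_to_data(root)}, indent=4)
```

Calling xml_to_json() on the contents of the XML file gives a JSON string that can be written to disk for the next step. Mixed content (text between child elements) is ignored in this sketch.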

The original XML contained several levels, so I had to use the marvellous tool jq to specify that I was only interested in the data some levels down in the tree.
jq .message.body.bodyContent.meeting ${INPUTFILE}.json > ${OUTPUTFILE}.json
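
For anyone who wants to see what that jq filter does, the same lookup can be sketched in Python. The helper name extract_path and the sample document are hypothetical; only the path itself is taken from the command above, and this handles plain dotted keys only, not the full jq syntax.

```python
import json


def extract_path(data, path):
    """Follow a jq-style dotted path such as '.a.b.c' through nested dicts."""
    current = data
    # Split on dots and ignore empty parts, so '.' alone returns the input.
    for key in (part for part in path.split(".") if part):
        if not isinstance(current, dict) or key not in current:
            # Simplification: return None. jq yields null for a missing key
            # but raises an error when indexing a non-object value.
            return None
        current = current[key]
    return current


# Made-up sample document with the same nesting as the path in the jq command.
doc = json.loads(
    '{"message": {"body": {"bodyContent": {"meeting": [{"title": "Standup"}]}}}}'
)
meetings = extract_path(doc, ".message.body.bodyContent.meeting")
print(json.dumps(meetings, indent=2))
```

Only the value found at the end of the path is kept, which is why the resulting file contains the meeting data without the outer wrapper levels.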

On the forwarder (heavy in my case), I had to specify a scripted input to actually fetch the data from the API, and also a monitor input to read the resulting JSON:

inputs.conf
[script:///path/to/script/script.sh]
index=someindex
sourcetype=vendor:product:script
disabled=false
start_by_shell=false

[monitor:///path/to/resulting/logs/*.json]
index=someindex
sourcetype=vendor:product:json
disabled=false
ignoreOlderThan = 14d

props.conf on the forwarder:
[vendor:product:json]
TRUNCATE = 0
CHARSET = UTF-8
KV_MODE=none
INDEXED_EXTRACTIONS=JSON
SHOULD_LINEMERGE=false
DATETIME_CONFIG=CURRENT


DATETIME_CONFIG=CURRENT is used because these events did not contain any timestamps, so Splunk stamps each event with the time at which it is indexed.

props.conf on the search heads:
[vendor:product:json]
KV_MODE=none

KV_MODE=none tells the search heads that they do not need to extract fields at search time, as this is already done at index time.

Please note that this configuration results in Splunk performing the extractions at index time and not at search time.
This could result in better search performance if you search using the key::value syntax, or using tstats to search indexed data.

This WILL also consume more storage on the indexers.
Final note – indexed extractions are usually not recommended except in specific cases.

