I am doing a web scraping project using Splunk and Scrapy. I have a server that is responsible for the web scraping and has the universal forwarder installed. The forwarder sends the scraped data, which is in JSON format, to a Splunk Enterprise instance every hour.
Each line in the JSON file is an event in Splunk, and each event has a `timestamp` field. The JSON file looks like this:
```
{..., "timestamp": "2016-03-01 08:30:46.094063", ...}
{..., "timestamp": "2016-03-01 08:30:19.596477", ...}
```
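For context, each spider writes its items as one JSON object per line. A minimal sketch of an item pipeline that produces such a file might look like the following (the class name, output path, and field handling are illustrative assumptions, not my exact code):

```python
# Illustrative Scrapy item pipeline: appends one JSON object per line and
# stamps each item with a microsecond-precision timestamp like the sample above.
import json
from datetime import datetime

class JsonLinesPipeline:
    def open_spider(self, spider):
        # Placeholder path; the real file lives under the directory monitored by the forwarder.
        self.file = open('/path_to_json/items.json', 'a', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        record = dict(item)
        # Produces e.g. '2016-03-01 08:30:46.094063'
        record['timestamp'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')
        self.file.write(json.dumps(record) + '\n')
        return item
```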
I created a custom source type for this data source. On Splunk Enterprise, `/opt/splunk/etc/apps/search/local/props.conf` is:
```
[scrapy_json]
DATETIME_CONFIG =
INDEXED_EXTRACTIONS = json
NO_BINARY_CHECK = true
TIMESTAMP_FIELDS = timestamp
TIME_FORMAT = %Y-%m-%d %H:%M:%S.%6N
TZ = Asia/Hong_Kong
category = Structured
pulldown_type = 1
disabled = false
SHOULD_LINEMERGE = false
```
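For the `TIME_FORMAT`, my understanding is that `%6N` is Splunk's six-digit subsecond specifier, which lines up with the microsecond fraction in the data. A quick standalone check of that assumption against the sample values (Python's `%f` plays the role of `%6N` here; this is not part of the Splunk configuration):

```python
# Sanity check: the sample timestamps should parse with microsecond precision,
# matching TIME_FORMAT = %Y-%m-%d %H:%M:%S.%6N (Python's %f covers the 6-digit fraction).
from datetime import datetime

samples = ['2016-03-01 08:30:46.094063', '2016-03-01 08:30:19.596477']
for s in samples:
    dt = datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')
    print(dt.isoformat(sep=' '), dt.microsecond)
```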
On the universal forwarder, `/opt/splunkforwarder/etc/system/local/inputs.conf` is:
```
[monitor:///{path_to_json}/*.json]
index=scrapy
sourcetype=scrapy_json
```
And `/opt/splunkforwarder/etc/system/local/props.conf` is:
```
[source::{path_to_json}/*.json]
sourcetype=scrapy_json
```
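To rule out malformed lines in the files this monitor stanza picks up, a check along the following lines can confirm that every line is a standalone JSON object with a parseable `timestamp` field (the glob path is a placeholder mirroring the monitor stanza; this is only an illustrative check, not part of the Splunk setup):

```python
# Illustrative pre-flight check: every non-empty line in each monitored *.json file
# should be a standalone JSON object whose 'timestamp' field parses with microseconds.
import glob
import json
from datetime import datetime

for path in glob.glob('/path_to_json/*.json'):  # placeholder, same pattern as the monitor stanza
    with open(path, encoding='utf-8') as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            datetime.strptime(event['timestamp'], '%Y-%m-%d %H:%M:%S.%f')
    print(f'{path}: OK')
```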
However, the extracted `timestamp` field has an extra `none` value, as shown in the screenshot below. Manually uploading the same JSON files through Splunk Web does not cause this issue. How do I get rid of that `none` value?
![Screenshot of the timestamp field in Splunk, showing an extra `none` value alongside the extracted timestamps][1]
[1]: /storage/temp/107202-screenshot-from-2016-03-01-09-50-18.png