Troy Lopez
Towards better log intake
log parsing is hard
Currently, LogCrunch uses a format similar to NDJSON, but with 2x \n after each entry. Each entry uses this format:
type Log struct {
	Host      string      `json:"host"`
	Timestamp int64       `json:"timestamp"`
	Type      string      `json:"type"`
	Payload   interface{} `json:"payload"`
}
While this is fast and simple to trace, it isn't really practical for data manipulation or querying. The original log is stored verbatim, and the metadata is sometimes redundant. No parsing of the log itself occurs, so accessing fields of the payload isn't possible without bespoke code, and any sort of query requires scanning the entire file, every time. The current solution evolved out of the need to send discrete entries across a channel to a single destination. While this method works, it doesn't give us anything to build on, and doesn't "expose" anything we could leverage as an interface without custom wrappers. Additionally, with many endpoints this file will quickly balloon and begin draining performance.
While a new solution is needed, the current method isn't a complete waste; with consistent log rotation and compression it could work as a detailed, albeit verbose, cold-storage format.
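For reference, here's a minimal sketch of what writing one entry looks like under the current scheme, reusing the Log struct above. The writeEntry name, the io.Writer sink, and the package name are assumptions for illustration, not LogCrunch's actual code.

package agent

import (
	"encoding/json"
	"io"
)

// writeEntry marshals a Log (defined above) and appends it to w in the
// current format: one JSON object followed by a blank line (two \n).
func writeEntry(w io.Writer, entry Log) error {
	data, err := json.Marshal(entry)
	if err != nil {
		return err
	}
	data = append(data, '\n', '\n')
	_, err = w.Write(data)
	return err
}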
Existing Solutions
LogCrunch aims to undercut the competition in log aggregation by staying lightweight and small. While proper enterprise SaaS is performant, it is extremely resource-intensive, often rendering it infeasible for a small network or lab. However, that does not mean we can't learn from existing OSS solutions.
Elasticsearch / the ELK stack uses Apache Lucene under the hood for log storage[1]. Lucene is a full-text search engine library written in Java that utilizes inverted indexes. When a document (in our case, a log or set of logs) is written to disk, it's tokenized and each word is added to an inverted index. This acts sort of like the index at the back of a textbook, where words are mapped to their appearances[2]. Logs are then stored in segment files, immutable groups of indexes and documents. Segments are merged every so often for efficiency. This splits writing and reading across different files, helping with concurrency.
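To make the idea concrete, here's a toy inverted index in Go. It's nowhere near what Lucene actually does (no posting lists, positions, or compression), just the textbook-index concept applied to log lines; all names here are made up for illustration.

package main

import (
	"fmt"
	"strings"
)

// InvertedIndex maps each token to the IDs of the entries containing it.
type InvertedIndex map[string][]int

// Add tokenizes a line naively on whitespace and records the entry ID
// under each token (no deduplication within a line).
func (idx InvertedIndex) Add(id int, line string) {
	for _, tok := range strings.Fields(strings.ToLower(line)) {
		idx[tok] = append(idx[tok], id)
	}
}

// Lookup returns the IDs of entries containing the given token.
func (idx InvertedIndex) Lookup(token string) []int {
	return idx[strings.ToLower(token)]
}

func main() {
	idx := InvertedIndex{}
	idx.Add(1, "sshd failed password for root")
	idx.Add(2, "sshd accepted publickey for troy")
	fmt.Println(idx.Lookup("sshd")) // [1 2]
	fmt.Println(idx.Lookup("root")) // [1]
}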
When a log is sent to Elasticsearch, it is first written to an in-memory buffer. It's then appended to a transaction log, to note its arrival, and later both are flushed from RAM to disk as a Lucene segment. The generated indexes are split into shards and duplicated for redundancy. This is designed for multi-server efficiency and scalability, but is luckily out of scope for our needs.
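That buffer-plus-transaction-log pattern can be sketched in a few lines; this is an illustration of the concept, not Elasticsearch or LogCrunch code, and it reuses the Log struct from above.

package server

import (
	"encoding/json"
	"io"
)

// Store holds recent entries in memory and mirrors every write to an
// append-only transaction log so a crash loses nothing.
type Store struct {
	buffer []Log
	wal    io.Writer
}

// Write appends the entry to the transaction log first, then buffers it.
func (s *Store) Write(entry Log) error {
	data, err := json.Marshal(entry)
	if err != nil {
		return err
	}
	if _, err := s.wal.Write(append(data, '\n')); err != nil {
		return err
	}
	s.buffer = append(s.buffer, entry)
	return nil
}

// A periodic flush would turn the buffer into an immutable on-disk
// segment (plus its indexes) and truncate the transaction log; omitted here.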
There are some good things we can take away from this. The inverted index method of storing documents should help improve our efficiency and begin to introduce proper querying/parsing/filtering capabilities. However, the concepts of sharding and replication are currently overkill for a project this small. Also, it goes without saying, but introducing the JVM into this is a non-starter.
Next Steps
One of the caveats of existing solutions is the beefiness required by the SIEM server. To counteract this, we can distribute the actual computing necessary for log storage by parsing before sending logs over the wire. Normalization and staging before transit mean the server won't need to both think and write; it can focus on storage and querying without mutating anything along the way. While LogCrunch is currently built with dumb agents, this will need to change for an effective solution.
Log aggregation functions in waves. Each log event must be read and lexed by the agent, then parsed into a standardized format for transit. Metadata can be attached, and the data is sent over the wire. In a perfect world, we can now stop thinking, and the server can just write to disk. Querying can then occur independently of this process. There is only one elephant in the room: alerting. If LogCrunch is to become a full SIEM, we'll need some way to generate alerts based on behavior deduced from these logs. The metadata pass during agent-side log processing could potentially identify unexpected behaviors and create some sort of alert indicator, but at this point there is no good solution for multi-endpoint alert generation (that is, alerts including logs from multiple boxes). This presents a variety of challenges, but thankfully can wait until parsing is implemented.
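Here's a rough sketch of that per-event wave as it might look on the agent; the ParsedLog type, the processEvent function, and its signature are all assumptions, not existing LogCrunch code.

package agent

import (
	"encoding/json"
	"io"
	"time"
)

// ParsedLog is a hypothetical normalized transit format: the server only
// has to store and index it, never reinterpret the raw line.
type ParsedLog struct {
	Host      string            `json:"host"`
	Timestamp int64             `json:"timestamp"`
	Source    string            `json:"source"` // which file or format the line came from
	Fields    map[string]string `json:"fields"` // fields extracted by the agent-side parser
	Raw       string            `json:"raw"`    // original line, kept verbatim for cold storage
}

// processEvent runs one event through the wave: lex/parse on the agent,
// attach metadata, then ship the normalized result over the wire.
func processEvent(line, host, source string, parse func(string) (map[string]string, error), wire io.Writer) error {
	fields, err := parse(line) // parsing happens here, so the server doesn't have to think
	if err != nil {
		return err
	}
	entry := ParsedLog{
		Host:      host,
		Timestamp: time.Now().Unix(),
		Source:    source,
		Fields:    fields,
		Raw:       line,
	}
	return json.NewEncoder(wire).Encode(entry) // the server just writes this to disk
}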
In conclusion
Log parsing is hard. It would be great if everyone adhered to an RFC for log formatting, such as RFC 5424, but that doesn't work for application-level logs. God, I need sleep. Data-sickness is real.
Update, a week later
After getting some sleep, I have found a somewhat more elegant solution to the log parsing problem. My first approach to parsing required a different function for each type of log format, plus a config file where the user would map log file locations to functions. This is a decent solution, but it isn't expandable; if a user wants to add new formats, they need to write more functions and then recompile. Additionally, having N functions for N formats slowly creeps up the size of the binary and the complexity of the logic, something we want to avoid. This can be alleviated by introducing a meta-parser: a single parsing function given a regular expression describing the log format and a struct of desired output fields. This means that for N formats we need N regexes, but only one function[3], which drastically simplifies our logical flow and maximizes code reuse. To further streamline deployment, we can move these regexes out of the binary itself and into the configuration file. Now, instead of just mapping file paths to parsing functions, users can define a regex for parsing and a list of logs to parse with that format, allowing for theoretically endless parsing patterns with the same binary. This introduces a bit of overhead, as the configuration file must be validated before execution, but it still reduces the amount of heft required to intake a large number of different logs.
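Here's a minimal sketch of that meta-parser using Go's named capture groups. The parseWith name and the example pattern are illustrative; in practice the pattern would come from the agent's configuration file rather than being compiled into the binary.

package main

import (
	"fmt"
	"regexp"
)

// parseWith extracts named capture groups from a log line into a map of
// fields, so any format with a regex definition can share this one function.
func parseWith(pattern *regexp.Regexp, line string) (map[string]string, error) {
	match := pattern.FindStringSubmatch(line)
	if match == nil {
		return nil, fmt.Errorf("line did not match pattern")
	}
	fields := make(map[string]string)
	for i, name := range pattern.SubexpNames() {
		if name != "" {
			fields[name] = match[i]
		}
	}
	return fields, nil
}

func main() {
	// A hypothetical user-defined pattern for a simple access-log format.
	pattern := regexp.MustCompile(`^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)`)
	line := `127.0.0.1 - - [15/Aug/2025:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512`
	fields, err := parseWith(pattern, line)
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["ip"], fields["method"], fields["path"]) // 127.0.0.1 GET /index.html
}

In the config, each watched file path would then point at one of these patterns, so supporting a new format is just a new regex entry rather than a recompile.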
Potential Issues
Allowing user-defined object creation and arbitrary file intake can be scary. An attacker could define an agent config that reads a sensitive file, for example /etc/shadow, and sends it to a SIEM server they control. Or they could define extremely redundant log locations and effectively fork-bomb a machine by making it watch an absurd number of files simultaneously. Preventing abuse of the SIEM agent will be an important step when hardening LogCrunch, and I'm actively working on mitigations to these risks.
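Some of those mitigations can be sketched as plain config validation; the names, the cap, and the allowlist below are assumptions, not decisions that have been made.

package agent

import (
	"fmt"
	"path/filepath"
	"strings"
)

const maxWatchedFiles = 128 // arbitrary cap so a config can't make the agent watch absurd numbers of files

// allowedPrefixes is a hypothetical allowlist of directories the agent may read from.
var allowedPrefixes = []string{"/var/log/"}

// validateWatchList rejects configs that watch too many files or reach
// outside the allowed log directories (e.g. /etc/shadow).
func validateWatchList(paths []string) error {
	if len(paths) > maxWatchedFiles {
		return fmt.Errorf("too many watched files: %d > %d", len(paths), maxWatchedFiles)
	}
	for _, p := range paths {
		clean := filepath.Clean(p)
		allowed := false
		for _, prefix := range allowedPrefixes {
			if strings.HasPrefix(clean, prefix) {
				allowed = true
				break
			}
		}
		if !allowed {
			return fmt.Errorf("path %q is outside the allowed log directories", p)
		}
	}
	return nil
}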
In conclusion, for real
Log parsing isn't hard; there's just a lot to do. By opening up formatting to user-defined regexes, LogCrunch won't need to be omniscient, just good enough out of the box. Rare or novel behavior can be defined at the configuration level by users and plugged in at startup. Server-side intake and storage are still to be done, but functioning agents should help with determining the database schema and general next steps.
[1] https://discuss.elastic.co/t/where-does-elasticsearch-store-read-logs/323119
[2] https://www.geeksforgeeks.org/dbms/inverted-index/
[3] Note: binary logs, such as the systemd journal read via journalctl, still need dedicated parsing functions.
8/15/2025