Tenzir is hiring for several key engineering roles to meet the needs of our expanding team. Our product: security data pipelines. From the data side, think of it as an Arrow-native, multi-schema ETL tool that offers optional storage in Parquet/Feather. From the security perspective, think of it as a solution for collecting, parsing, transforming, aggregating, and routing data. We typically sit between the data sources (endpoint, network, cloud) and sinks (SIEM, data lake).
Our open-source execution engine is C++20, our platform is SvelteKit and TypeScript. Experience with data-first frontend apps is a great plus. Open positions at https://tenzir.jobs.personio.de:
- Senior C++ Engineer
- (SecOps) Solution Engineer
We are based out of Hamburg, Germany, and hire across EU time zones, stretching all the way to India.
We see this trend as well, and AWS Security Lake is heading exactly in that direction.
Right now, we're working on OCSF normalization in our pipelines to drop structured security telemetry in the right format where you need it. Like a security ETL layer.
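To make that concrete, here is a minimal C++ sketch of what the normalization step boils down to conceptually: renaming vendor-specific fields to their OCSF counterparts. This is not our implementation, and the field names and mapping are made up for illustration.

    // Hypothetical sketch of OCSF-style field renaming; the mapping and field
    // names are illustrative only, not an actual OCSF schema definition.
    #include <map>
    #include <string>

    using Event = std::map<std::string, std::string>;

    Event normalize_to_ocsf(const Event& raw) {
      // Vendor field -> OCSF-ish attribute path (made-up examples).
      static const std::map<std::string, std::string> mapping = {
        {"src_ip", "src_endpoint.ip"},
        {"dst_ip", "dst_endpoint.ip"},
        {"ts", "time"},
      };
      Event out;
      for (const auto& [key, value] : raw) {
        auto it = mapping.find(key);
        out[it != mapping.end() ? it->second : key] = value;
      }
      return out;
    }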
We considered ClickHouse and DuckDB but struggled with making the execution engine multi-schema, e.g., more jq-like but still on top of data frames. So we started with a custom catalog and engine on top of Parquet and Feather that we will later factor into a plugin to transpile our query language (TQL) to SQL. We went with a custom language because security people are not data engineers.
We're building something similar at Tenzir, but more for operational security workloads. https://docs.tenzir.com
Differences to Vector:
- An agent has optional indexed storage, so you can store your data there and pick it up later. The storage is based on Apache Feather, Parquet's little brother.
- Pipeline operators work with both data frames (Arrow record batches) and chunks of bytes.
- Structured pipelines are multi-schema, i.e., a single pipeline can process streams of record batches with different schemas.
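To illustrate the multi-schema point, here is a minimal sketch, not Tenzir's actual storage code, of what handling a heterogeneous stream of Arrow record batches entails using only the public Arrow C++ API: group batches by schema so that downstream steps, such as writing one Feather file per schema, stay homogeneous.

    // Minimal sketch: demultiplex a multi-schema stream of Arrow record
    // batches by schema. Illustrative only, not Tenzir's storage code.
    #include <arrow/api.h>

    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    std::map<std::string, std::vector<std::shared_ptr<arrow::RecordBatch>>>
    group_by_schema(
        const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches) {
      std::map<std::string, std::vector<std::shared_ptr<arrow::RecordBatch>>> out;
      for (const auto& batch : batches) {
        // The schema's textual representation serves as a grouping key here;
        // a real engine would use a proper fingerprint or schema registry.
        out[batch->schema()->ToString()].push_back(batch);
      }
      return out;
    }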
Also a pipeline language, PRQL-inspired, but differing in that (i) TQL supports multiple data types between operators, both unstructured blocks of bytes and structured data frames as Arrow record batches, (ii) TQL is multi-schema, i.e., a single pipeline can have different "tables", as if you're processing semi-structured JSON, and (iii) TQL has support for batch and stream processing, with a light-weight indexed storage layer on top of Parquet/Feather files for historical workloads and a streaming executor.
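For point (i), here is a rough C++ sketch of the idea of multiple data types between operators; it is a simplification for illustration, not the actual operator API. Elements flowing through a pipeline are either raw byte chunks or structured batches of events.

    // Simplified sketch of the element types that flow between pipeline
    // operators: either an unstructured chunk of bytes or a structured
    // batch. Illustrative stand-ins, not the real interface.
    #include <cstddef>
    #include <map>
    #include <string>
    #include <variant>
    #include <vector>

    using Chunk = std::vector<std::byte>;              // unstructured bytes
    using Event = std::map<std::string, std::string>;  // stand-in for a row
    using Batch = std::vector<Event>;                  // stand-in for an Arrow record batch
    using Element = std::variant<Chunk, Batch>;        // what operators exchange
    using Stream = std::vector<Element>;

    // Example operator: pass structured batches through, drop raw bytes.
    inline Stream drop_bytes(const Stream& in) {
      Stream out;
      for (const auto& element : in)
        if (std::holds_alternative<Batch>(element))
          out.push_back(element);
      return out;
    }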
We're in the middle of getting TQL v2 [@] out the door, with support for expressions and more advanced control flow, e.g., match-case statements. There's a blog post [#] about the core design of the engine as well.
While it's a general-purpose ETL tool, we're primarily targeting operational security use cases where people today use Splunk, Sentinel/ADX, Elastic, etc. So some operators are very security-ish, like Sigma, YARA, or Velociraptor.
We want to achieve something similar with our pipelines [1] by making the beginning and the end of the pipeline symmetric, giving you this flow:
1. Acquire bytes (void → unstructured)
2. Parse bytes to events (unstructured → structured)
3. Transform events (structured → structured)
4. Print events (structured → unstructured)
5. Send bytes (unstructured → void)
The "Publish" part is a combination of (4) and (5). Sometimes they are fused because not all APIs differentiate those steps. We're currently focusing on building blocks (engine, connectors, formats) as opposed to application-level integrations, so turnkey reverse ETL is not close yet. But the main point is that the symmetry reduces cognitive effort for the user: they have already worked that muscle on the "E" side and now just need to find the dual in the docs.
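To make the five steps concrete, here is a schematic C++ sketch; the function names and types are illustrative, not the real API. Each stage is a function whose input and output types follow the void → bytes → events → bytes → void chain, and a closed pipeline is their composition.

    // Schematic of the five pipeline stages and their input/output types.
    // Purely illustrative; names and types are not the actual operator API.
    #include <cstddef>
    #include <map>
    #include <string>
    #include <vector>

    using Bytes = std::vector<std::byte>;
    using Event = std::map<std::string, std::string>;
    using Events = std::vector<Event>;

    Bytes acquire() { return {}; }                     // 1. void -> unstructured
    Events parse(const Bytes&) { return {}; }          // 2. unstructured -> structured
    Events transform(const Events& in) { return in; }  // 3. structured -> structured
    Bytes print(const Events&) { return {}; }          // 4. structured -> unstructured
    void send(const Bytes&) {}                         // 5. unstructured -> void

    int main() {
      // A closed pipeline composes all five stages end to end.
      send(print(transform(parse(acquire()))));
    }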
I don't do security, but I have been a data engineer for the better part of a decade and I don't understand what void and unstructured are. Am I the fool? I don't get it.
The primitives of many of these ETL systems are structured tables (snowflake, parquet, pandas dataframes, whatever) and I don't think I'd ever choose bytes over structured tables. The unstructured parts of data systems I've worked on have always chewed up an outsize portion of labor with difficult to diagnose failure modes. The biggest cognitive effort win of reverse ETL solutions has been to make external systems and applications "speak table".
The extra data type of unstructured/bytes is optional in that you don’t have to use it if you don’t need it. Just start with a table if that’s your use case.
In security, binary artifacts are common, e.g., to scan YARA rules on malware samples and produce a structured report (“table”). Turning packet traces into structured logs is another example. Typically you have to switch between a lot of tools for that, which makes the process complex.
(The “void” type exists only for symmetry, so that every operator has an input and an output type. The presence of void makes an operator a source or a sink. The invariant is that a “closed” pipeline, i.e., one with both a source and a sink, is the only kind that can execute in our mental model.)
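Here is a small sketch of that mental model, illustrative rather than the engine's actual type checker: every operator declares an input and an output type (including void), and a pipeline is closed only if adjacent types match and both ends are void.

    // Illustrative sketch of the "closed pipeline" check: a pipeline
    // executes only if it starts from void, ends in void, and adjacent
    // operator types line up. Not the actual engine's type checker.
    #include <cstddef>
    #include <vector>

    enum class Type { Void, Bytes, Events };

    struct Op {
      Type input;
      Type output;
    };

    bool is_closed(const std::vector<Op>& pipeline) {
      if (pipeline.empty())
        return false;
      if (pipeline.front().input != Type::Void)   // must begin with a source
        return false;
      if (pipeline.back().output != Type::Void)   // must end with a sink
        return false;
      for (std::size_t i = 1; i < pipeline.size(); ++i)
        if (pipeline[i - 1].output != pipeline[i].input)
          return false;
      return true;
    }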
Author here: it's describing how we built our pipeline engine. Technical content, presented in a form that I think is of interest to many HN users. I don't think "ad" is the correct term to refer to this. Can you elaborate on what prompted your judgement?
Why not both? You took a public action that your company could derive value from. You're offering value in return, in the form of technical content, but...it also advertises your company to your target audience. This article wasn't posted organically by someone whose only interest is in the content of the article.
Not op, but maybe the perception is based on relatively few aha moments or new insights in the article. The one thing I would come back to it for is the listing of pipeline languages, which sounds interesting. YMMV.
Hey, founder of Tenzir [1] here. We are building an open-core, pipeline-first engine that can massively reduce Splunk costs. Even though we go to market "mid-stream," we have a few users that use us as a lightweight SIEM (or, more accurately, just for plain log management).
We are still in early access but you can browse through our docs or swing by our Discord.