MobileTeleSystems/data-rentgen

What is Data.Rentgen?

Data.Rentgen is a Data Motion Lineage service, compatible with the OpenLineage specification.

Note: the service is under active development and is not ready for use yet.

Goals

  • Collect lineage events produced by OpenLineage clients & integrations (Spark, Airflow).
  • Support consuming large volumes of lineage events by using Kafka as an event buffer and storing data in tables partitioned by event timestamp.
  • Store operation-grained events (instead of job-grained events, as in Marquez) for finer detail.
  • Provide an API for building run ↔ dataset lineage, as well as parent run → child run lineage.
  • Build a lineage graph within specific time boundaries (unlike Marquez, where lineage is built only for the last job run).
  • Build a lineage graph at different granularities, e.g. by merging all individual Spark operations into a Spark applicationId or Spark applicationName.
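
The last two goals can be illustrated with a minimal Python sketch. It is not Data.Rentgen's API or storage model: the `OperationEvent` record, the `build_lineage` function, and the sample dataset names are all hypothetical and exist only to show how a time window and a granularity setting might change the resulting graph.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class OperationEvent:
    """Hypothetical, simplified record of one Spark operation touching a dataset."""
    application_name: str
    application_id: str
    operation: str
    dataset: str
    direction: str  # "input" or "output"
    event_time: datetime


def build_lineage(events, since, until, granularity="operation"):
    """Return (source, target) edges for events with since <= event_time < until."""
    # Pick how an event is mapped to a graph node at the requested granularity.
    key_for = {
        "operation": lambda e: f"{e.application_id}/{e.operation}",
        "application_id": lambda e: e.application_id,
        "application_name": lambda e: e.application_name,
    }[granularity]

    edges = set()
    for event in events:
        if not (since <= event.event_time < until):
            continue  # outside the requested time boundaries
        node = key_for(event)
        if event.direction == "input":
            edges.add((event.dataset, node))   # dataset -> run
        else:
            edges.add((node, event.dataset))   # run -> dataset
    return edges


# Made-up sample data: two runs of the same application on different days.
events = [
    OperationEvent("etl", "app-1", "read_orders", "hive://orders", "input",
                   datetime(2024, 1, 1, 10, 0)),
    OperationEvent("etl", "app-1", "write_report", "hdfs://report", "output",
                   datetime(2024, 1, 1, 10, 5)),
    OperationEvent("etl", "app-2", "read_orders", "hive://orders", "input",
                   datetime(2024, 1, 2, 10, 0)),
]

one_day = build_lineage(events, datetime(2024, 1, 1), datetime(2024, 1, 2),
                        granularity="application_id")
two_days = build_lineage(events, datetime(2024, 1, 1), datetime(2024, 1, 3),
                         granularity="application_name")
```

In this sketch, `one_day` contains only edges for `app-1`, while `two_days` collapses both runs into a single `etl` node, so the same events yield a coarser graph when merged by application name.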

Non-goals

  • This is not a Data Catalog. Use DataHub or OpenMetadata instead.
  • Static data lineage, such as view → table, is not supported.
  • Column-level lineage is currently collected by OpenLineage, but not yet consumed by Data.Rentgen.

Documentation

See https://data-rentgen.readthedocs.io/