Currently LinkedIn engineers. However, we’re receiving more and more PRs from individuals working at various companies.
We welcome contributions from everyone in the community. Please read our contributing guidelines. In general, we review PRs with the same rigor as our internal code review process to maintain overall quality.
We plan to organize public town hall meetings at a monthly cadence, which may change depending on interest from the community. Also, we recently deprecated Gitter and started using Slack as one of the main ways to support the community.
Should this be the platform we decide upon, we’d like to fully engage and work with LinkedIn and the community. What’s the best way and what level of engagement/involvement should we expect?
The best way to engage is through the Slack channel. You’ll get to interact with the developers and the community. We are fairly responsive on Slack and plan to set up proper on-call support during normal business hours (Pacific Time). Sometimes we also create working groups with a dedicated POC from the LinkedIn team to tackle specific use cases the community has.
For reproducible technical issues, bugs, and code contributions, GitHub issues and PRs are the preferred channels.
GitHub is the best resource. We have thoroughly documented the steps to install and test DataHub there. There is also copious documentation on the overall architecture, definitions, and onboarding guides.
The DataHub Introduction and Open Sourcing Datahub blog posts are also useful resources for getting a high level understanding of the system.
You can learn more about GMA's product roadmap, which gets updated regularly.
You can learn more about the current list of features.
Are the product strategy/vision/roadmap driven by the LinkedIn Engineering team, community, or a collaborative effort?
A mix of both the LinkedIn GMA team and the community. The roadmap will be a joint effort between LinkedIn and the community. However, we’ll most likely prioritize tasks that align with the community’s asks.
Please take a look at our roadmap & features to get a sense of what’s being open sourced in the near future. If there’s something missing from the list, we’re open to discussion. In fact, the town hall would be the perfect venue for such discussions.
How do the LinkedIn Engineering team and the community ensure the quality of the community code for GMA?
All PRs are reviewed by the LinkedIn team.
It varies depending on the data platform. HDFS, MySQL, Oracle, Teradata, and LDAP are scheduled on a daily basis. We rely on real-time pushes to ingest from several data platforms such as Hive, Presto, Kafka, Pinot, Espresso, Ambry, Galene, Venice, and more.
URN is the only sensible option to ensure that events for the same entity land in the same partition and are processed in chronological order.
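As a rough sketch of what this looks like on the producer side (assuming the confluent-kafka Python client, a local broker, and placeholder topic and URN values), keying each event by the entity URN is what guarantees per-entity ordering:

```python
# Minimal sketch: key MCEs by entity URN so all events for the same entity
# hash to the same partition and are consumed in order.
# Assumes the confluent-kafka client; topic name and URN are placeholders.
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

urn = "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"
event_bytes = b"..."  # serialized MCE payload (e.g. Avro-encoded)

# Same key -> same partition -> chronological processing per entity.
producer.produce("MetadataChangeEvent", key=urn.encode("utf-8"), value=event_bytes)
producer.flush()
```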
In addition to leveraging Kafka schema validation to validate the MXEs emitted by metadata producers, we also actively monitor the status of the ingestion streaming pipeline at the snapshot level.
When using Kafka and Confluent Schema Registry, does DataHub support multiple schemas for the same topic?
You can configure the compatibility level per topic in Confluent Schema Registry. The default is “Backward”, so you’re only allowed to make backward-compatible changes to the topic schema. You can also change this configuration to relax the compatibility check. However, as a best practice, we suggest not making backward-incompatible changes to the topic schema, because doing so will break your old metadata producers’ flows. Instead, consider creating a new Kafka topic (a new version).
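For example, the subject-level compatibility can be inspected or changed through Schema Registry’s REST API. The sketch below assumes the registry runs at localhost:8081 and that the default “&lt;topic&gt;-value” subject naming strategy is in use:

```python
# Sketch: inspect and set the compatibility level for a subject via the
# Confluent Schema Registry REST API. Registry address and subject name
# below are placeholders.
import requests

registry = "http://localhost:8081"
subject = "MetadataChangeEvent-value"  # placeholder subject name

# Read the subject-level compatibility override, if one has been set.
print(requests.get(f"{registry}/config/{subject}").json())

# Override the compatibility level for this subject only.
resp = requests.put(
    f"{registry}/config/{subject}",
    json={"compatibility": "BACKWARD"},
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
)
print(resp.json())
```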
How do we better document and map transformations within an ETL process? How do we create institutional knowledge and processes to help create a paradigm for tribal knowledge?
We plan to add “fine-grained lineage” in the near future, which should cover transformation documentation. DataHub currently has a simple “Docs” feature that allows tribal knowledge to be captured, and we plan to expand it significantly going forward.
We’re working on a similar feature internally. Will evaluate and update the roadmap once we have a better idea of the timeline.
Does DataHub capture/show column-level constraints set in the table definition?
The SchemaField model currently does not capture any property/field corresponding to constraints defined in the table definition. However, it should be fairly easy to extend the model to support that if needed.
MCE is the ideal way to push metadata from different security zones, assuming there is a common Kafka infrastructure that aggregates the events from various security zones.
Currently, GMA supports all major database providers that are supported by Ebean as the document store, i.e., Oracle, Postgres, MySQL, and H2. We also support Espresso, which is LinkedIn's proprietary document store. Other than that, we support Elasticsearch and Neo4j for search and graph use cases, respectively. However, as data stores in the backend are all abstracted and accessed through DAOs, you should be able to easily support other data stores by plugging in your own DAO implementations. Please refer to Metadata Serving for more details.
Supported data sources are listed here. To onboard your own data source which is not listed there, you can refer to the onboarding guide.
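As a very rough sketch of what onboarding a custom source involves (the field names, topic name, and platform name below are illustrative only; the authoritative MCE schema and working ETL scripts live in the repo), you crawl your source, build an MCE per entity, and emit it to the MCE Kafka topic:

```python
# Illustrative-only sketch of a custom ETL job: crawl a source, build an
# MCE-like payload per dataset, and emit it to the MCE Kafka topic.
# Field names and the topic name are placeholders -- consult the MCE Avro
# schema and the sample ingestion scripts in the repo for the real contract.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def crawl_my_source():
    """Placeholder for platform-specific metadata extraction."""
    yield {"name": "my_db.my_table", "owner": "urn:li:corpuser:jdoe"}

for table in crawl_my_source():
    urn = f"urn:li:dataset:(urn:li:dataPlatform:mysource,{table['name']},PROD)"
    mce = {  # illustrative structure, not the exact Avro schema
        "proposedSnapshot": {
            "urn": urn,
            "aspects": [{"owners": [{"owner": table["owner"], "type": "DATAOWNER"}]}],
        }
    }
    producer.produce("MetadataChangeEvent", key=urn.encode("utf-8"),
                     value=json.dumps(mce).encode("utf-8"))

producer.flush()
```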
You can call the rest.li API to ingest metadata into a GMS directly instead of using Kafka events. Metadata ingestion is real-time if you're updating via the rest.li API. It's near real-time in the case of Kafka events due to the asynchronous nature of Kafka processing.
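As a minimal sketch of the direct path (the GMS address, resource path, and payload shape here are assumptions for illustration; consult the GMS rest.li API documentation for the exact contract):

```python
# Sketch: pushing metadata straight to GMS over its rest.li HTTP API instead
# of going through Kafka. The endpoint path and payload shape below are
# assumptions for illustration only.
import json
import requests

gms = "http://localhost:8080"  # assumed GMS address

snapshot = {  # illustrative payload, not the exact rest.li schema
    "urn": "urn:li:corpuser:jdoe",
    "aspects": [{"fullName": "Jane Doe", "email": "jdoe@example.com"}],
}

# Hypothetical "ingest" action on a corpUsers resource.
resp = requests.post(
    f"{gms}/corpUsers?action=ingest",
    data=json.dumps({"snapshot": snapshot}),
    headers={"X-RestLi-Protocol-Version": "2.0.0", "Content-Type": "application/json"},
)
resp.raise_for_status()
print("Ingested synchronously via rest.li; no Kafka hop involved.")
```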
Yes. We are using the Spring Boot framework to start our apps, including setting up Kafka. You can use environment variables to set system properties, including Kafka properties. From there you can set your SSL configuration for Kafka.