
Releases: datahub-project/datahub

v0.15.0

16 Jan 15:37
3108b53

DataHub v0.15.0 Release Notes

User Experience

  • Structured Properties

    • Added comprehensive support for managing structured properties, including creation, editing, deletion, and display preferences. Introduced timestamps for tracking creation and modification. [#12100, #11419]
    • Enhanced property display options with badge styling, custom column types, and configurable visibility settings in asset sidebars and schema fields. [#12111, #12052]
    • Added structured property filtering in UI with improved aggregation logic and entity metadata display. Introduced new property validators and display settings. [#12097, #12099]
  • UI Enhancements

    • Enhanced container organization with parent hierarchy labels. [#11705]
    • Added support for markdown in incident descriptions, enabling rich formatting capabilities. [#11759]
    • Improved ingestion reporting with better visibility of successful ingestions with warnings. Enhanced browse paths display for business attributes and schema fields. [#11704, #11585]
    • Added support for timeseries aspects in OpenAPI and customizable date range fields for Analytics charts. [#12096, #11366]
  • Authorization & Authentication

    • Enabled authentication and API authorization by default, with support for URN-wildcard-based policies using STARTS_WITH condition. [#11484, #11441]
    • Added authorization checks for managing Glossary terms, including privileges for ownership, domain management, and link actions. [#11337]
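To illustrate what a URN-wildcard policy with a STARTS_WITH condition means in practice, the sketch below treats the policy value as a URN prefix. It shows plain prefix matching only and is not DataHub's policy engine; the function name and the sample URNs are hypothetical.

```python
def urn_starts_with(policy_prefix: str, resource_urn: str) -> bool:
    """Return True when the resource URN begins with the policy's prefix."""
    return resource_urn.startswith(policy_prefix)


# Hypothetical prefix intended to cover every Snowflake dataset.
prefix = "urn:li:dataset:(urn:li:dataPlatform:snowflake,"

in_scope = urn_starts_with(
    prefix, "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.orders,PROD)"
)
out_of_scope = urn_starts_with(
    prefix, "urn:li:dataset:(urn:li:dataPlatform:bigquery,proj.ds.orders,PROD)"
)
print(in_scope, out_of_scope)
```

A prefix that stops right after the platform segment scopes the policy to one platform without listing each asset individually.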

Metadata Ingestion

Ingestion Framework Improvements

  • Enhanced Data Source Support: Expanded ingestion capabilities for multiple platforms, including Superset (with dataset entities, schema fields, and column-level lineage), Feast (supporting tags and owners ingestion), Neo4j, and Cassandra. Added stateful ingestion support for file sources. [#11688, #11784, #11804, #11526, #11822]

  • SQL Processing Improvements: Replaced vulnerable sqlparse dependency with an in-house SQL parser, optimized CLL generation with reduced memory usage, and added special handling for MSSQL case sensitivity. Enhanced multi-query lineage support for Snowflake temporary tables. [#11645, #11708, #11920, #12020]

  • CLI Enhancements: Introduced new commands for managing ingestion, including listing source runs with filtering capabilities, undoing soft deletes with platform filtering, and listing structured properties. Added an offline flag to the SQL parser CLI. [#11740, #11980, #12012, #12283, #11635]

  • Ownership and Metadata Management: Extended ownership transformer capabilities across entities, improved glossary sync to preserve custom ownership types, and added support for multiple ownership types in glossaries and terms. Enhanced Forms CLI with additional filters for subtypes, platform instances, owners, tags, and glossary terms. [#11700, #11545, #12050, #10979]

  • Core Infrastructure Improvements: Implemented unique URN generation for all entities, added support for efficient entity ingestion through get_entity_as_mcps, improved empty field handling, and introduced progress reporting during ingestion. Added execution request cleanup job and support for dropping duplicate schema fields. [#11676, #11425, #11613, #12117, #11765, #12308]

Source-Specific Ingestion Improvements

Airflow

  • Upgraded infrastructure with support for Airflow 2.10, deprecated versions below 2.3, and improved template handling with Jinja support. Added configuration options for dag patterns and environment variables. [#11300, #11371, #11472, #11537, #11579, #12056]
  • Enhanced error handling and debugging with improved logging, fixed plugin stability issues on EMR, and added support for AthenaOperator lineage extraction. Introduced ability to disable plugin without restart. [#11857, #11877, #11880, #12098]

BigQuery

  • Enhanced data modeling capabilities with support for foreign/primary keys, BigLake tables, and improved handling of external tables. Added support for region qualifiers and partition management. [#11686, #11728, #11874, #11940]
  • Improved lineage tracking with GCS data source support and optimized query performance. Added platform resource entity generation from BigQuery labels. [#11442, #11492, #11534, #11602]
  • Enhanced profiling and performance with better type handling and size limits. Fixed issues with tag synchronization and platform instance settings. [#11807, #12060]

Dagster

  • Added support for skipping Asset ingestion, fixed input/output value formatting, and improved compatibility with latest Dagster versions (v1.9.6). Deprecated Python 3.8 support. [#11262, #11481, #12121, #12189]

dbt

  • Improved performance and functionality with node_name_patterns for faster CLL processing, support for multiple test paths, and better handling of custom owner types. [#11450, #11460, #11848]
  • Enhanced lineage handling by preventing cycles in SQL parsing and supporting multiple dataset assertions for tests. Added support for dbt Cloud's Explore page. [#11666, #11451, #12223]

Snowflake

  • Expanded support for various table types, including secure, dynamic, and hybrid tables. Enhanced lineage capabilities for renames, swaps, and external tables. [#11600, #12039, #12094, #12179]
  • Improved authentication with OAuth support and token management. Added incremental property processing and structured property support for tags. [#11888, #12048, #12080, #12285]
  • Enhanced error handling and logging with better parse failure reporting and dot handling in table names. [#12105, #12110, #12153]

Tableau

  • Enhanced project management with new path pattern filtering and improved handling of hidden assets. Added support for access roles and group permissions. [#10855, #11157, #11559]
  • Improved API integration with retry logic for various error codes (502, 504), better authentication handling, and consistent page size application. [#12213, #12216, #12233]
  • Enhanced reporting and debugging capabilities while maintaining efficient performance and proper permission handling. [#12015, #12024, #12175]

PowerBI

  • Improved M-query parsing with support for comments, better handling of quotes, and DatabricksMultiCloud native query functionality. [#12177, #11743, #11756]
  • Enhanced workspace management with cross-workspace dataset linking and app ingestion support. Added timeouts for M-query parsing. [#11560, #11629, #11753]
  • Improved error reporting and performance optimization with reduced type casting and better organization of responsibilities. [#11763, #12004]

Developer Experience

  • Entity Management: Introduced entity versioning for Datasets and ML Models, with support for version set linking. Improved timeline functionality with better handling of primary key changes and rename events. Added data transformation logic models to enhance data processing capabilities. [#11819, #11843, #12166, #12198]

  • Enhanced Configuration Management: Added new customization options through environment variables and Helm charts, including editable dataset names and configurable garbage collection scheduling. The bootstrap process has been optimized to reduce latency during installation. [#11391, #11518]

  • Development Environment Updates: Added Git support to the ingestion-base image, enabling better source control integration for ingestion workflows. [#11477]

  • Security Logging Enhancement: Improved security audit trails by adding actor URN tracking for unauthorized access attempts. [#12030]

NEW: Garbage Collection

  • Comprehensive Metadata Cleanup: Introduced a new ingestion source, DataHubGC, that acts as a garbage collector for dataflows, data jobs, and data process instances, with configurable retention policies and deletion parameters. Added a dry run mode for testing cleanup operations. [#11102, #11413]

  • Performance Optimizations: Reduced processing time from 1 hour to 15 minutes by implementing batch processing, optimizing queries, and removing unnecessary operations. Increased the default hard delete limit from 10k to 25k entities. [#11809, #12093, #12238]

  • Reliability Improvements: Enhanced garbage collection stability with additional validation checks, improved error handling, and better process visibility through ingestion stage reporting. Fixed issues with entity deletion logic and reference handling to preserve critical lineage relationships. [#12011, #12013, #12027, #12049, #12124, #12226]

Thank You to Our Contributors!

First-Time Contributors

@AColocho, @alberttwong, @Alice-608, @Bumyu, @chakru-r, @chriscc2, @dejan2609, @donovan-acryl, @eagle-25, @hwmarkcheng, @k-bartlett, @kanavnarula, @kartikey-visa, @kevinkarchacryl, @kousiknandy, @kris48k, @llance, @margaridafernandes-trip, @mikeburke24, @raudzis, @ronybony1990, @ryota-cloud, @shepherd44, @siong-tcha, @ssidorenko, @tanguyantoine, @th0ger, @udays-visa, @udbhav-hbk, @vejeta

Repeat Contributors

@aviv-julienjehannet, @bda618, @bossenti, @darnaut, @deepgarg-visa, @DSchmidtDev, @dushayntAW, @eboneil, @ethan-cartwright, @feldjay, @githendrik, @haeniya, @Jorricks, @Masterchen09, @mkamalas, @Nbagga14, @nicholas-fwang, @noggi, @pankajmahato-visa, @pinakipb2, @rtekal, @sagar-salvi-apptware, @steffengr

DataHub Maintainers

@acrylJonny, @anshbansal, @asikowitz, @chriscollins3456, @david-leifker, @gabe-lyons, @hsheth2, @jayacryl, @jjoyce0510, @maggiehays, @mayurinehate, @pedro93, @RyanHolstien, @sakethvarma397, @sgomezvillamor, @shirshanka, @sid-acryl, @skrydal, @treff7es, @yoonhyejin


v0.14.1

17 Sep 21:48
6a165a8

DataHub v0.14.1 Release Notes

User Experience

  • Enhanced Data Propagation UI: New features allow viewing propagated column documentation, source information, and asset-level propagation details. This improves visibility into data lineage and enables better understanding of data flow across the organization. (#11047)

  • Improved Search Result Tracking: Added page number to search result click events, enabling better measurement of search ranking performance. This helps users understand and optimize their search experience. (#11151)

  • Fixed Display Issues: Resolved issues with displaying "0" values for last ingested data and improved handling of multilingual characters in descriptions. These fixes ensure more accurate and readable information presentation. (#10840, #10975)

Developer Experience

  • Performance Improvements:

    • Implemented lazy dataLoaders for GraphQL queries, significantly reducing latency for local environments. (#11293)
    • Added option to log slow GraphQL queries, helping identify and address performance bottlenecks. (#11308)
    • Introduced session authorization caching for faster access checks. (#11327)
  • Enhanced Search Capabilities:

    • Added support for custom highlighting fields in GraphQL queries, allowing faster and more customizable data retrieval. (#11339)
    • Implemented new search query functionality to filter by parents/children of Domains or Containers. (#11279)
    • Added support for multiple values in 'CONTAIN', 'START_WITH', and 'END_WITH' operators, enabling more flexible and precise searches. (#11068)
  • API Improvements:

    • Extended throttling to API requests, supporting non-browser ingestion/write requests and manual throttling for better control over system load. (#11325)
    • Added support for 'START_WITH' and 'END_WITH' operators in GraphQL API, enhancing string query capabilities. (#11026)
  • Bug Fixes:

    • Resolved issues with forward slash handling in search queries, empty key-value pairs in Elasticsearch mapping, and support for various data types in object fields. These fixes improve search accuracy and data representation. (#10932, #11004, #11066)
    • Addressed Postgres regression by upgrading the ebean library from version 12.x to 15.x, resolving a read lock NPE issue. (#11379)

Metadata Ingestion

  • S3 Integration Enhancements:

    • Enhanced partition support for S3 dataset ingestion, improving metadata representation and enabling advanced partition detection. (#11083)
    • Enhanced S3 ingestion process to support reading specific file types, allowing more granular control over data ingestion. (#11177)
  • BigQuery Improvements:

    • Implemented query log extractor for BigQuery, creating "Query" entities with usage statistics, lineage, and operation details. (#10994)
    • Added support for filtering GCP project ingestion based on project labels, enabling more targeted data collection. (#11169)
    • Implemented query job retries for transient errors, improving system robustness. (#11162)
  • Snowflake Updates:

    • Added support for Iceberg tables in Snowflake access history, enhancing lineage capture capabilities. (#10961)
    • Introduced ability to define clustering key formulas for Snowflake datasets. (#11254)
    • Fixed tag exclusion issues in Snowflake ingestion process. (#11250)
  • New and Updated Connectors:

    • Added ingestion source for SAP Analytics Cloud, expanding DataHub's integration capabilities. (#10958)
    • Enhanced Salesforce connector with customizable API version and improved error messages. (#11145, #11266)
    • Updated Tableau ingestion process with new parameters and improved field type parsing. (#11255, #11202)
  • Other Ingestion Improvements:

    • Added support for MongoDB database ingestion as containers. (#11178)
    • Implemented automatic capturing of Snowflake assets with Pandas I/O Manager in Dagster module. (#11189)
    • Enhanced Fivetran ingestion with destination ID filtering capabilities. (#11277)
    • Added support for browse-only tables in Databricks ingestion. (#10766)

Other Improvements and Fixes

  • Upgraded various dependencies including Kafka, Azure Identity, Acryl-SQLglot, and GraphQL/Spring versions.
  • Improved error handling and logging across multiple components.
  • Enhanced test coverage and reliability.
  • Updated documentation for various features and processes.

Breaking Changes

Notable breaking changes include:

  • Removal of lower method from get_db_name in SQLAlchemySource, affecting URNs of related entities.
  • Changes to default sink mode and aspect handling that require server version 0.14.0+.

See the full details here.

Contributors

We extend our heartfelt thanks to all contributors for their valuable work on this release:

First-Time Contributors

@AaronYang0628, @alexandrebunn, @alisa-aylward-toast, @arpanchakra29, @esselius, @eunseokyang, @ignitz, @milindgupta, @milindgupta9, @Nbagga14, @rohansun, @sakethvarma397, @vignesh-hbk

Repeat Contributors

@deepgarg-visa, @dushayntAW, @feldjay, @filipe-caetano-ovo, @ksrinath, @Masterchen09, @matthew-coudert-cko, @mayurinehate, @nmbryant, @pinakipb2, @prashanthic23, @sagar-salvi-apptware, @siladitya2, @sleeperdeep

DataHub Maintainers

@anshbansal, @asikowitz, @chriscollins3456, @darnaut, @david-leifker, @eboneil, @hsheth2, @jjoyce0510, @maggiehays, @pedro93, @RyanHolstien, @shirshanka, @sid-acryl, @skrydal, @treff7es, @yoonhyejin

Your contributions are invaluable in making DataHub better for everyone. Thank you!


v0.14.0.2

21 Aug 15:29

DataHub v0.14.0.2 Release Notes

User Experience

  • Renamed: Validation --> Quality: The Validation tab has been renamed to Quality to make it more intuitive to end-users that it contains outcomes from data quality checks. [#10935]

  • Data Contract UI: A new Data Contract UI is now available under the Quality Tab, allowing users to handle various data assertion types and add/remove contracts more easily. [#10625]

  • Updates to Customized Search Ranking: By default, explore (*) query results are ranked based on enrichment (tags, terms, owners, description, domains, row/column counts) as well as incident status. [#10774]

  • Custom Dataset Names: Business users can now maintain an editable dataset name separate from default properties, providing more control over dataset identification. [#10608]

  • Documentation Propagation Setting Page: A new settings page has been added to the UI for managing Documentation Propagation, giving users more control over how documentation is shared across the platform. [#11038]

Developer Experience

  • NEW: DataHub Open Assertions Specification:

    • Announcing a universal assertions specification for declaring Data Quality checks and compiling them into artifacts for use by 3rd party Data Quality tools like Great Expectations, dbt tests, and Snowflake via Data Quality DMFs. [#10609]
    • Added ability to define data quality rules using a YAML specification file, enabling users to set assertions like volume metrics and conditions, with the ability to compile and schedule them to run on Snowflake as the assertion backend. [#10602]
  • API and SDK Enhancements:

    • New GraphQL APIs added for managing forms, structured properties, and data contracts. [#10826, #10825, #10632]
    • Updates to Java and Python SDKs to support creating and updating structured properties on assets. [#10823, #10824]
    • Support for conditional write semantics including If-Modified-Since, If-Unmodified-Since, and If-Version-Match in MetadataChangeProposals (MCP) and OpenAPI. [#10868]
  • CLI Improvements:

    • A new check server-config command has been added to test server credentials and retrieve diagnostic information. [#10990]
    • The get command now includes a --details/--no-details flag for more detailed output, facilitating easier issue debugging. [#10815]
    • Update to CLI to optionally display server configuration settings. [#10676]
    • Added the ability to assign actors (users or groups) to forms in the forms YAML API. [#10683]
  • Improved Logging and Monitoring:

    • Unified request logging implemented across GraphQL, OpenAPI, and Restli requests, including additional information like actor, IP address, and API type. [#10802]
  • Performance Optimizations:

    • Implemented throttling for the mce-consumer based on mae-consumer lag. [#10626]
    • Added an ASYNC_BATCH mode to the rest sink for improved performance. [#10733]
    • Improved the performance of read queries in Neo4j by specifying labels, and combined the multiple Neo4j statements within the addEdge function into a single statement. [#10593, #10598]
  • Security Enhancements:

    • Updated encryption and decryption methods with a stronger cryptographic algorithm. [#11059]
    • Optimized regular expressions to prevent potential ReDoS vulnerabilities. [#10315]
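As a rough illustration of the ASYNC_BATCH rest sink mode listed under Performance Optimizations, the sketch below builds the sink section of an ingestion recipe as a plain Python dict. The `mode` key, the `datahub-rest` sink type as used here, and the server URL are assumptions for illustration; check the rest sink documentation for the exact option names in your version.

```python
def rest_sink_config(server: str, mode: str = "ASYNC_BATCH") -> dict:
    """Build the sink section of an ingestion recipe as a plain dict.

    The "mode" key is an assumed option name, not confirmed by these notes.
    """
    return {"type": "datahub-rest", "config": {"server": server, "mode": mode}}


# Hypothetical local server address.
sink = rest_sink_config("http://localhost:8080")
print(sink)
```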

Metadata Ingestion

  • New Ingestion Sources:

    • Azure Blob Storage: Added as a new ingestion source with support for Path Specs. [#10813]
    • Grafana: New connector to ingest dashboards, providing documentation within DataHub for DevOps members on call. [#10891]
    • IBM DB2: Added support for this platform. [#10601]
  • Snowflake Improvements:

    • Enhanced view lineage parsing without query-based lineage/usage. [#10905]
    • Added support for more than 10k views in a Snowflake database. [#10718]
    • Implemented parallel schema extraction for improved performance. [#10653]
    • Added snowflake-queries source for lineage, usage, queries, and operational metadata to improve performance and configurability. [#10835]
  • BigQuery Enhancements:

    • Refactored and parallelized dataset metadata extraction for better performance. [#10884]
    • Added support for new data types including BIGNUMERIC, NUMERIC, DECIMAL, BIGDECIMAL, FLOAT64, and RANGE. [#10950]
    • Added support for ingesting View labels during ingestion. [#10648]
  • Looker Updates:

    • Ingested explore tags into DataHub. [#10547]
    • Fixed issues related to CLL generation when the view definition language is SQL. [#10542]
    • Added support for including platform instance details in URNs for dashboards and charts. [#10771]
  • Other Improvements:

    • dbt: Enhanced flexibility in lineage generation with the new experimental prefer_sql_parser_lineage flag. [#11039]
    • Airflow: Task ownership info can now be set as a group rather than an individual user. [#10742]
    • Athena: Enhanced profiling capabilities to support column quantiles and medians. [#10723]
    • Fivetran: Improved connector performance for faster ingestion. [#10556]
    • SageMaker: Added stateful ingestion capability to remove deleted assets during ingestion runs. [#10573]
    • Tableau: Support added for ingesting multiple Tableau sites in a single configuration, with sites appearing as containers in DataHub. [#10498]
    • Added support for ingesting schemas from schema registry in the Kafka module. [#10612]
    • Introduced a TagsToTermMapper transformer for mapping specific tags to glossary terms. [#10758]
    • Enhanced the SQL lineage parser with an optional default_dialect parameter for customized dialect selection. [#10830]

Other Improvements and Fixes

  • Fixed high vulnerabilities related to sensitive information logging. [#11088]
  • Improved error handling and logging across various modules.
  • Enhanced test coverage for new features and existing functionality.

Breaking Changes

  • Protobuf CLI will no longer create binary encoded protoc custom properties by default.
  • Changes to Data flow info and data job info aspects may require a server upgrade.
  • OpenAPI V3 - Creation of aspects now requires wrapping within a value key.
  • Profiling configuration for Glue source has been updated.

For full details on breaking changes, please refer to the updating guide.
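The OpenAPI V3 change above can be pictured as one extra level of nesting around each aspect body. The following is a minimal sketch of that wrapping only; the helper name and the sample aspect fields are hypothetical, and the full request shape should be taken from the updating guide.

```python
import json


def wrap_aspect(aspect_body: dict) -> dict:
    """Nest an aspect body under a "value" key, as OpenAPI V3 now expects."""
    return {"value": aspect_body}


# Hypothetical aspect content, for illustration only.
payload = wrap_aspect({"description": "Orders fact table"})
print(json.dumps(payload))
```

Clients that previously sent the aspect body directly would need a step like this before submitting create requests.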

Contributors

Massive shoutout to all of the contributors who made this release possible:

First-Time Contributors

@aabharti-visa, @acrylJonny, @amit-apptware, @AndreasHegerNuritas, @aviv-julienjehannet, @brbrown25, @chardaway, @dragontail, @ipolding-cais, @joelmataKPN, @john-claro-cko, @jordanjeremy, @lima-renan, @nadavgross, @nephtyws, @obaltian, @PeamThom, @pie1nthesky, @pulsar256, @samblackk, @shtephlee, @simaov, @steffengr, @tkdrahn, @TristanHeisler, @wornjs, @xkollar

Repeat Contributors

@ajoymajumdar, @bossenti, @cburroughs, @cccs-eric, @deepgarg-visa, @dushayntAW, @fjmacagno, @githendrik, @haeniya, @jayasimhankv, @k7ragav, @kevin1chun, @ksrinath, @Kunal-kankriya, @looppi, @Masterchen09, @mayurinehate, @ngamanda, @nmbryant, @noggi, @pankajmahato-visa, @PatrickfBraz, @pinakipb2, @Rajasekhar-Vuppala, @rtekal, @sagar-salvi-apptware, @shubhamjagtap639, @siladitya2, @ssilb4, @Sukeerthi31, @sumitappt, @TonyOuyangGit, @walter9388

DataHub Maintainers

@anshbansal, @asikowitz, @chriscollins3456, @darnaut, @david-leifker, @eboneil, @ethan-cartwright, @gabe-lyons, @hsheth2, @jayacryl, @jjoyce0510, @maggiehays, @pedro93, @RyanHolstien, @shirshanka, @sid-acryl, @skrydal, @treff7es, @yoonhyejin


v0.14.0

13 Aug 18:40
5e9188c

Known Issues

The kafka-setup image is missing a script needed for new deployments. A hotfix will be released shortly.


v0.13.3

23 May 23:11

DataHub Release Notes

User Experience

  • NEW: Business Attributes: Business Attributes are used to standardize and manage data elements across multiple domains, projects, and applications. By linking dataset attributes to Business Attributes, organizations ensure uniformity and ease of updates, as changes made to a Business Attribute are automatically propagated across all linked datasets. #9863
  • Improved UI for Dataset Properties: Added collapse functionality for long dataset properties, making it easier to navigate and view relevant information. #10203
  • Pagination for Ingestion Tasks Listing: Added pagination to the tasks listing page, making it easier to manage and navigate through tasks. #10293
  • Rich Text Support for Form Descriptions: Added support for rich text in form descriptions, enhancing the user experience. #10425
  • New Analytics Charts: Added charts in the Analytics tab to identify Top Users and New Users. #10344
  • Enhanced search functionality with customizable autocomplete configuration. #10426

Developer Experience

  • Unified CI Workflow Updates: Improved CI build with unified workflow updates and disk space cleanup, making the build process more efficient. #10353
  • Improved Logging for GraphQL Requests: Enhanced logging for GraphQL requests, providing better insights and debugging capabilities. #10404
  • Enhanced Documentation for Lineage Feature Guide: Updated documentation for the lineage feature guide, making it easier to understand and implement. #10401
  • Improved Documentation for SchemaField.label: Updated documentation for SchemaField.label, providing clearer guidance for developers. #10251
  • Enhanced CI with Docker Image Publishing: Added Docker image publishing capabilities to the CI workflow, streamlining the deployment process. #10193
  • Redesigned Docs Site Feedback Button: Improved the design of the feedback button in the documentation, making it more user-friendly. #10182

Metadata Ingestion

  • Improved data profiling by filtering tables early, correctly computing sample row counts, and combining unique count queries per table. #10378, #10319, #10322
  • Airflow: Introduced support for BigQueryInsertJobOperator. #10452
  • BigQuery: Added support for Table Clones and incremental column-level lineage.
  • Snowflake: Improved reporting for usage aggregation and handling of lineage errors; improved ingestion performance with system sampling on very large tables. #10279, #10430
  • Glue: Introduced support for delta schemas. #10299
  • Redshift: Improved usage extraction by filtering out system queries. #10247
  • Mode: Enhanced ingestion for Mode by adding dashboards into containers, improving data visualization and management. #10563
  • PowerBI: Added support to automatically extract table lineage between PowerBI and Databricks. #10416
  • dbt: Improved dbt ingestion by handling complex SQL and enhancing documentation, providing better data management and insights. #10323
  • NiFi: Enhanced ingestion for NiFi with process group as browse path and incremental lineage, improving data organization and tracking. #10202
  • Added incubating Sigma and CockroachDB sources. #10037, #10226

Breaking Changes

  • DynamoDB Connector: aws_region is now a required configuration. The connector will no longer loop through all AWS regions; instead, it will only use the region passed into the recipe configuration. #10419
  • Custom Validators and Mutators: Dropped a previously required constructor. #10389
  • FabricType RVW: Added as a new FabricType. No rollbacks allowed once metadata with this fabric type is added without manual cleanups in databases. #10472

For full details on breaking changes, please refer to the updating DataHub documentation.
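Because aws_region is now required for the DynamoDB connector, a recipe that omits it needs updating before the upgrade. The sketch below is a hypothetical pre-flight check over a recipe loaded as a Python dict; the helper name, the `source`/`config` layout as modelled here, and the sample region are assumptions for illustration.

```python
def check_dynamodb_recipe(recipe: dict) -> None:
    """Raise ValueError if a dynamodb source omits config.aws_region."""
    source = recipe.get("source", {})
    if source.get("type") != "dynamodb":
        return  # Not a DynamoDB recipe; nothing to check.
    if not source.get("config", {}).get("aws_region"):
        raise ValueError("dynamodb source requires config.aws_region")


# Hypothetical recipe with the region set explicitly.
recipe = {"source": {"type": "dynamodb", "config": {"aws_region": "us-west-2"}}}
check_dynamodb_recipe(recipe)
print("recipe ok")
```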

Contributors

A big thank you to all our contributors for this release!

First-Time Contributors

@bouaouda-achraf, @camilogutierrez, @dotan-mor, @egemenberk, @erikkvale, @guyr-ziprecruiter, @ishtartec, @jonasHanhan, @mrjefflewis, @noggi, @olgapenedo, @paguos, @richenc, @Rosmirose, @sagar-salvi-apptware, @timothyjin

Repeat Contributors

@ajoymajumdar, @deepgarg-visa, @dushayntAW, @filipe-caetano-ovo, @gaurav2733, @kevin1chun, @ksrinath, @Masterchen09, @mayurinehate, @ms32035, @Nelvin73, @rtekal, @sgomezvillamor, @shubhamjagtap639, @siladitya2, @skrydal

DataHub Maintainers

@anshbansal, @asikowitz, @chriscollins3456, @darnaut, @david-leifker, @eboneil, @gabe-lyons, @hsheth2, @jayacryl, @jjoyce0510, @RyanHolstien, @shirshanka, @sid-acryl, @treff7es, @yoonhyejin

Thank you all for your hard work and contributions!

What's Changed

  • fix(ingest/bigquery): Supporting lineage extraction in case the select query result's target table is set on job by @treff7es in #10191
  • fix(retention): fix time-based retention by @trialiya in #10118
  • feat(lineage): give via and paths in entity lineage response by @RyanHolstien in #10192
  • fix(ingestion/datahub): implemented the filter to ignore/include URN for ingestion by @dushayntAW in #10174
  • fix(ingestion/glue): fix to ingest the comment for partition key as description by @dushayntAW in #10189
  • feat(ingest/looker): cleanup usage generation code by @hsheth2 in #10153
  • fix(dev): fix env file overrides for profiles by @hsheth2 in #10194
  • fix(ingestion/hive): ignore sampling for tagged column/table by @dushayntAW in #10096
  • fix(ui/property): add collapse for long dataset properties by @gaurav2733 in #10203
  • saas release v0.3.1 release notes by @david-leifker in #10205
  • fix(ingest/databricks): pin pandas for databricks ingestion by @mayurinehate in #10204
  • Fixed issue where the custom defined aspects were missing from the API specification. by @ajoymajumdar in #10208
  • feat(ingestion/transformer): Handle overlapping while mapping in extract ownership from tags transformer by @shubhamjagtap639 in #10201
  • fix(build): avoid nested gradle commands by @hsheth2 in #10198
  • feat(ingest/great_expectations): support in-memory (Pandas) data assets by @bouaouda-achraf in #9811
  • ci(workflow): publish docker from pr with label by @david-leifker in #10193
  • bump(version): bump classgraph version, add early package filter by @david-leifker in #10207
  • fix(ingestion/mongodb): MongoDB source unable to parse datetimes with years > 9999 by @jonasHanhan in #10110
  • fix(graphql-core): DomainEntitiesResolver does not support values FacetFilterInput parameter by @siladitya2 in #10188
  • fix(graphql-core):Auto completion/suggestion of Domains are not working by @siladitya2 in #10150
  • chore(usage-stats): measure time for getting buckets and aggregations by @darnaut in #10220
  • test(search): introduce retry for search test by @david-leifker in #10206
  • feat(ingest/bigquery): fix support for incremental column lineage by @hsheth2 in #10222
  • fix(ingest/dbt): better dbt timestamp parsing by @hsheth2 in #10223
  • feat(ingest/sql): normalize bigquery partitioned tables when parsing by @hsheth2 in #10224
  • docs: fix feedback button design by @yoonhyejin in #10182
  • docs: add discourse to community tab by @yoonhyejin in #10181
  • docs: edit the text and destination for sign up link by @yoonhyejin in #10183
  • fix(ingestion/datahub): moved urn_pattern config to source config by @dushayntAW in #10215
  • fix(ingestio...

v0.13.2

16 Apr 19:09

Hotfix Release

Fixes MCL message deserialization bug when using internal schema registry and running specific upgrade jobs.

policyFields (enabled by default):
BOOTSTRAP_SYSTEM_UPDATE_POLICY_FIELDS_ENABLED:true

dataJobNodeCLL (disabled by default):
BOOTSTRAP_SYSTEM_UPDATE_DATA_JOB_NODE_CLL_ENABLED:false

Example Error:

Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id 1
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 13 out of bounds for length 2
        at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:460)
        at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:283)
        at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:188)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
        at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
        at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)

Recovery Directions:

If you are currently affected, remove the topic before upgrading to v0.13.2 so that the corrupted message is discarded. The default topic name is MetadataChangeLog_Versioned_v1; if you have customized the topic name, remove that topic instead.

If you are running Kafka per the example Helm chart for prerequisites, the following command will delete the topic.

kubectl exec -it prerequisites-kafka-broker-0 -c kafka -- kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic MetadataChangeLog_Versioned_v1
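
For deployments with a customized topic name, the command above can be parameterized. The following is a minimal Python sketch that only builds the command and does not execute it. The helper name `delete_topic_command` and its parameters are hypothetical; the default values are copied from the command above and may not match your cluster.

```python
import shlex

# Default topic name as stated in the recovery directions.
DEFAULT_TOPIC = "MetadataChangeLog_Versioned_v1"


def delete_topic_command(
    topic=DEFAULT_TOPIC,
    pod="prerequisites-kafka-broker-0",  # pod name from the example Helm chart command
    container="kafka",
    bootstrap_server="localhost:9092",
):
    """Return the kubectl argv that deletes `topic` (hypothetical helper, builds only)."""
    return [
        "kubectl", "exec", "-it", pod, "-c", container, "--",
        "kafka-topics.sh", "--bootstrap-server", bootstrap_server,
        "--delete", "--topic", topic,
    ]


if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(shlex.join(delete_topic_command()))
```

Passing `topic="..."` with your own topic name yields the matching command. Review the printed command, and confirm the pod and container names, before running it.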

Full Changelog: v0.13.1...v0.13.2

v0.13.1

02 Apr 19:40
2873736

DataHub Release Notes

User Experience

  • Capture and Manage Common Joins between Datasets: Users can now view and manage common join relationships between datasets, making it easier than ever to capture best practices and bespoke join logic. Watch the walkthrough here! [#8325]
    • Heads up: you'll need to enable the ER_MODEL_RELATIONSHIP_FEATURE_ENABLED env variable to use this feature!
  • Enhanced UI Interactions: Users can now enjoy an improved markdown editor and filter policies by active/inactive status, resulting in a more intuitive and manageable interface. [#9949, #9958]
  • Visual Context for Groups: You can now include picture links for groups in the UI, adding richer visual context and making navigation easier. [#9882]
  • Improved Error Visibility: The UI now displays error messages related to data size limitations, which makes troubleshooting easier. [#10038]

Developer Experience

  • Enhanced Kafka Compatibility: Updated client version for Kafka setup ensures better compatibility and functionality for developers. [#9962]
  • Optimized Docker Build: Docker setups now respect pip mirrors, optimizing the build process especially in restricted network environments. [#9963]
  • Advanced Error Handling: New error handling for duplicate class names and improved fspath lint error management enhance the code reliability and quality. [#9960, #9976]
  • Latest OpenSearch Image: Incorporation of OpenSearch image version 2.11.0 aligns with the latest stable releases, boosting performance and security. [#9984]

Metadata Ingestion

  • NEW: Dagster Integration: You can now seamlessly ingest your Dagster Pipelines, Jobs, Ops, and lineage into DataHub. [#10071]
  • Expanded Field Classification Support: This release introduces support for field-level classification during ingestion for Redshift, BigQuery, DynamoDB, and SQL Sources. [#10013, #10031]
  • Enhanced Ingestion Capabilities: DataHub now offers stateful ingestion by default, optimizing routines for REST sinks and improving metadata accuracy across diverse sources like dbt and BigQuery. [#9934, #10158, #10080]
  • Better Data Lineage: This release introduces support for OpenLineage in service of the Spark Lineage Beta Plugin. It also adds support for incremental Column-Level Lineage, improving the accuracy of detecting column-level relationships during ingestion. [#9870, #9967, #10090]
  • Schema Clarity: New description support for JSON schema arrays and a mechanism to escape special characters in BigQuery table descriptions aid in clearer schema validation and ingestion processes. Databricks ingestion now supports Hive Metastore schemas with special characters. [#9757, #9932, #10049]

Version Upgrades

  • Kafka client and OpenSearch image were updated to the latest versions.

Breaking Changes

This release introduces default settings for stateful ingestion and updates in handling dbt ingestion. For details on all breaking changes, view the full documentation here.
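
Because the stateful ingestion defaults changed, it can help to state the setting explicitly in a recipe instead of relying on the default. The sketch below expresses a recipe as a Python dict and reads back the explicit setting. It assumes the usual recipe layout, in which a `stateful_ingestion` block with an `enabled` key sits under the source config. The pipeline name, source type, server URL, and the helper `explicit_stateful_setting` are placeholders for illustration; check the breaking-changes documentation for the exact conditions under which the default applies.

```python
# Hypothetical recipe as a Python dict; all names and URLs are placeholders.
recipe = {
    "pipeline_name": "example_pipeline",
    "source": {
        "type": "mysql",
        "config": {
            # Explicit opt-out, so the new default does not apply to this recipe.
            "stateful_ingestion": {"enabled": False},
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}


def explicit_stateful_setting(recipe):
    """Return the explicit stateful_ingestion.enabled value, or None if unset."""
    source_config = recipe.get("source", {}).get("config", {})
    return source_config.get("stateful_ingestion", {}).get("enabled")


if __name__ == "__main__":
    setting = explicit_stateful_setting(recipe)
    if setting is None:
        print("stateful_ingestion not set explicitly; the release default applies")
    else:
        print(f"stateful_ingestion explicitly set to {setting}")
```

A result of `None` means the recipe relies on whatever default the installed release applies, which is the case to review when upgrading.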

Contributors

MASSIVE shoutout to our contributors!

First-Time Contributors

akarsh991, alexs-101, AvaniSiddhapuraAPT, diegmonti, dushayntAW, filipe-caetano-ovo, HuanjieGuo, jayacryl, k7ragav, kopax-polyconseil, LePuppy, Nelvin73, pinakipb2, poorvi767, rae89, trialiya, valeral.

Repeat Contributors

ANich, shubhamjagtap639, sgomezvillamor, siladitya2, skrydal, sumitappt, Masterchen09, mayurinehate, ngamanda, gaurav2733, githendrik, jayasimhankv.

DataHub Maintainers

anshbansal, asikowitz, chriscollins3456, darnaut, david-leifker, eboneil, ethan-cartwright, gabe-lyons, hsheth2, pedro93, RyanHolstien, treff7es, yoonhyejin.

What's Changed

  • bump(kafka-setup): client version bump by @david-leifker in #9962
  • feat(ingest): throw codegen error on duplicate class names by @hsheth2 in #9960
  • feat(docker): respect pip mirrors with uv by @hsheth2 in #9963
  • Openlineage endpoint and Spark Lineage Beta Plugin by @treff7es in #9870
  • fix(ingest/json-schema): adding support descriptions for array by @AvaniSiddhapuraAPT in #9757
  • fix(ingest/redshift): fix bug in lineage v2 table renames by @hsheth2 in #9967
  • feat(ingest): speed up to_obj() and validate() by @hsheth2 in #9969
  • feat(ingest): fix fspath lint error by @hsheth2 in #9976
  • docs: archive old version before 0.12.0 & fix broken links by @yoonhyejin in #9957
  • fix(ui/markdown-editor): arrows change field when editing description… by @gaurav2733 in #9949
  • feat(ui/policies): add filter for Active/Inactive/All on policy page by @gaurav2733 in #9958
  • feat(ui): add option to add picture link for groups by @akarsh991 in #9882
  • feat(ingest): add Looks subtype + stop reemitting browsePathV2 by @hsheth2 in #9978
  • fix(ingest/bigquery): escape special characters for table descriptions by @AvaniSiddhapuraAPT in #9932
  • feat(ui): add loading spin to access management table by @filipe-caetano-ovo in #9974
  • fix(ingestion/fivetran): Fix fivetran get connector jobs bug by @shubhamjagtap639 in #9975
  • feat(ingest/dbt): generate CLL for all node types by @hsheth2 in #9964
  • chore(search): bump OpenSearch image version to 2.11.0 by @darnaut in #9984
  • feat(ingest): enable stateful_ingestion by default for DataHub rest sink by @shubhamjagtap639 in #9934
  • feat(ingestion/cli): Adding check option to validate allow/deny and path_specs by @treff7es in #9983
  • fix(ingest): only import PathSpec when necessary by @hsheth2 in #9989
  • feat(config): add configuration to reprocess UI sourced events by @RyanHolstien in #9988
  • feat(pluginRegistry): add configuration to reduce runnable frequency by @RyanHolstien in #9990
  • build(react): Fix typescript errors in test files by @sumitappt in #9982
  • feat(docs): disable last update timestamps by @hsheth2 in #9987
  • feat: add versioned content for 0.12.1 by @yoonhyejin in #9944
  • doc: add version 0.13.0 by @yoonhyej...

v0.13.0

29 Feb 23:20
8b6790e

DataHub v0.13.0 Release Notes Summary

User Experience

  • NEW - Asset Documentation Forms & UI-Editable Properties: Define specific documentation requirements via a Form, and empower your asset owners to capture their valuable knowledge via UI-Editable Properties. Watch the demo here!
  • NEW - DataHub Incidents: Create and communicate data quality and observability incidents when they inevitably arise. Watch the demo here!
  • UI Improvements: Editing secrets, handling forms, and rendering token pages and lineage diagrams have been improved for a smoother user interface experience.

Developer Experience

  • Security Upgrades: Core dependencies like shiro-core and FastAPI have been upgraded to fix vulnerabilities, ensuring a safer development environment.
  • GraphQL/OpenAPI Enhancements: New GraphQL endpoints and better OpenAPI documentation provide more powerful tools for API interaction, making developers' jobs easier.
  • Performance Tuning: Backend improvements for search operations and ingestion processes make the platform faster and more reliable.

Metadata Ingestion

  • Platform Integrations: Enhanced support for dbt, Metabase, BigQuery, AWS Glue, Oracle, and Redshift allows for more comprehensive metadata capture, making integration with these platforms smoother.
  • Ingestion Framework: The reliability of ingestion has been improved, with new capabilities like support for tags from Tableau datasources and compatibility with Airflow 2.5.0, facilitating a broader range of data synchronization tasks.
  • Connector Improvements: Ingestion connectors for external data tools have been streamlined, ensuring easier integration and data synchronization.

Other Improvements and Fixes

  • Enhanced internal testing frameworks with Cypress and pytest-random-order for ingestion tests.
  • Simplified developer workflows with configurable Docker Compose project names in CLI.
  • Addressed various ingestion-related bugs for platforms like Feast and Snowflake.
  • Enhanced the UI codebase with TypeScript compilation linting and updated styles.
  • Streamlined CI processes for pull requests and linting conditions.
  • Version Upgrades: Upgraded pytest-docker, Pegasus, and SQLglot, among others, to improve stability and performance. Security vulnerabilities addressed by upgrading FastAPI, gitdb, and follow-redirects.

Notable Breaking Changes

  • Updates to MySQL version for quickstarts and migration to Neo4j 5.x may impact existing setups.
  • JDK17 build requirement and Docker Compose > 2.20 needed for building DataHub.
  • Python 3.8+ requirement for the acryl-datahub CLI.
  • Changes in Unity Catalog ingestion source configs and Redshift lineage generation.
  • Deprecation of Spark 2.x and associated JDK8 build requirements.

For full details on breaking changes, please visit DataHub's update guide.

Acknowledgements

A huge thank you to all our contributors for making this release possible. Your hard work and dedication are greatly appreciated.

First-Time Contributors

7onn, Adityamalik123, atjones0011, BlueHorn07, diegoreico, dim-ops, fer-marino, Gerrit-K, gp1105739, ilpianista, ingthorb, KaYunKIM, Kunal-kankriya, muzzacode, nnnkkk7, pankajmahato-visa, rubiojr, ryaminal, scalvanese452, sleeperdeep, stevenayers.

Repeat Contributors

allizex, arunvasudevan, cburroughs, feldjay, gaurav2733, iprentic, KulykDmytro, kushagra-apptware, mayurinehate, nmbryant, noggi, purnimagarg1, rinzool, sgomezvillamor, shubhamjagtap639, siddiquebagwan-gslab, siladitya2, skrydal, sumitappt, TonyOuyangGit, wngus606, yangjiandan, Salman-Apptware.

DataHub Maintainers

anshbansal, asikowitz, chriscollins3456, darnaut, david-leifker, eboneil, ethan-cartwright, gabe-lyons, hsheth2, jjoyce0510, maggiehays, pedro93, RyanHolstien, shirshanka, sid-acryl, treff7es, yoonhyejin.


DataHub v0.12.1

08 Dec 23:44
159a013

Release Highlights

New Features

SQLAlchemy Source Enhancements: Support for view lineage across all SQLAlchemy sources (PR #9039).
Airflow Integration: Retry callback and support for ExternalTaskSensor subclasses added (PR #8514).
Kafka Enhancements: Increased Kafka message size and enabled compression (PR #9038).
JSONSchema Ingestion: Enabled schema-aware JsonSchemaTranslator (PR #8971).
Search Bar Improvements: Added a flag to hide/display the autocomplete query (PR #9104).
SQL Parser Performance: Enhancements and asyncio fixes (PR #9119).
MongoDB Ingestion: Support for stateful ingestion and improved schema inference for lists (PR #9118, PR #9145).
Policy Engine Updates: Refactoring and enhancements, including support for 10k+ policies (PR #9163, PR #9177).
UI Enhancements: Numerous improvements including command-k icons in the search bar, updated Apollo cache, and auto-complete debounce in the search bar (PR #9194, PR #9193, PR #9205).
Fivetran Integration: Connector integration for Fivetran (PR #9018).
Neo4j Database Support: Connection to specific Neo4j databases now supported (PR #9179).
Chart Subtypes in UI: Support for chart subtypes (PR #9186).

Fixes and Improvements

BigQuery Fixes: Resolved issues with lineage filter query, and fixed extracting comments from complex types (PR #9114, PR #8950).
MongoDB Refactoring: Platform instance addition to MongoDB (PR #8663).
Kafka Setup: Adjusted truststore settings for PEM files (PR #8656).
REST API Authorization: Fixed rollback failure when authorization is enabled (PR #9092).
Java Exception Handling: Addressed java.util.ConcurrentModificationException (PR #9090).
UI and Documentation: Fixed filtering logic in UI, corrected documentation errors, and added feature guides (PR #9116, PR #9125, PR #9124, PR #9126, PR #9134, PR #9137, PR #9122, PR #9068).
SQL Server and Snowflake Ingestion: Updated queries and fixed missing view downstream call (PR #9127, PR #8966).
ClickHouse and DB2 Ingestion: Addressed column reflection regression and table properties handling (PR #9143, PR #9128).
Ingestion Improvements: Numerous fixes and enhancements across various ingestion sources (PR #9153, PR #9155, PR #9141, PR #9157, PR #9123).
CI and Build Process: Tweaked workflows, increased gradle retries, and addressed CI errors (PR #9052, PR #9091, PR #9160).
Security Updates: Addressed a zookeeper CVE and other security concerns (PR #9190).
UI Refactoring: Improved entity page loading indicators and renamed button texts (PR #9195, PR #9196).
Policy and Auth Enhancements: Refactored policy locking and added roles to policy engine validation logic (PR #9178).

Testing and Continuous Integration

API Testing: Added tests for managing secrets, access token privilege, and flaky tests fix (PR #9121, PR #9167, PR #9132, PR #9175).
Cypress Test Fixes: Addressed glossary navigation and download_lineage_results tests (PR #9175, PR #9132).

Cleanup and Refactoring

Ingestion Cleanup: Removed legacy memory_leak_detector and refactored ingestion sources (PR #9158, PR #9135, PR #9120, PR #9105).
PDL Refactoring: Refactored Assertion model enums (PR #9191).

Build and Deployment

Release Preparation: Updated files for the 0.12.0 release (PR #9130).

What's Changed

  • feat(ingest): support view lineage for all sqlalchemy sources by @mayurinehate in #9039
  • fix(ingest/bigquery): Fixing lineage filter query by @treff7es in #9114
  • refactor(ingestion/mongodb): Add platform_instance to mongodb by @nicholas-fwang in #8663
  • fix(kafka-setup): Don't set truststore pass for PEM files by @mmmeeedddsss in #8656
  • fix(ingest): Fix roll back failure when REST_API_AUTHORIZATION_ENABLED is set to true by @TonyOuyangGit in #9092
  • (fix): Avoid java.util.ConcurrentModificationException by @rtekal in #9090
  • Fix(ingest/bigquery): fix extracting comments from complex types by @maaaikoool in #8950
  • docs: add versions 0.12.0 by @yoonhyejin in #9125
  • fix(ui) Fix filtering logic for everwhere generating OR filters by @chriscollins3456 in #9116
  • build(release): Update files for 0.12.0 release by @pedro93 in #9130
  • fix(ingest/sql-server): update queries to use escaped procedure name by @mayurinehate in #9127
  • feat(airflow): retry callback, support ExternalTaskSensor subclasses by @richenc in #8514
  • docs: fix saasonly flags for some pages by @yoonhyejin in #9124
  • fix(ingest/snowflake): missing view downstream cll if platform instance is set by @mayurinehate in #8966
  • feat: Add flag to hide/display the autocomplete query for search bar by @kushagra-apptware in #9104
  • docs(timeline): correct markdown heading level by @mayurinehate in #9126
  • docs(graphql) Correct mutation -> query for searchAcrossLineage examples by @eboneil in #9134
  • feat(kafka): increase kafka message size and enable compression by @david-leifker in #9038
  • feat(ingest/jsonschema) enable schema-aware JsonSchemaTranslator by @KulykDmytro in #8971
  • fix(metadata-ingestion): adds default value to _resolved_domain_urn i… by @alexklavensnyt in #9115
  • ci: tweak to only run relevant workflows by @anshbansal in #9052
  • Fix for flaky download_lineage_results cypress test by @kkorchak in #9132
  • docs: Update updating-datahub.md by @pedro93 in #9131
  • fix(ingest/clickhouse): pin version to solve column reflection regression by @hsheth2 in #9143
  • feat(ingest/looker): cleanup error handling by @hsheth2 in #9135
  • feat(ingest): add entity_supports_aspect helper by @hsheth2 in #9120
  • feat(sqlparser): support more update syntaxes + fix bug with subqueries by @hsheth2 in #9105
  • docs: correct broken doc links by @sachinsaju in #9137
  • feat(ingest): sql parser perf + asyncio fixes by @hsheth2 in #9119
  • feat(quickstart): fix broker InconsistentClusterIdException issues by @hsheth2 in #9148
  • fix(policies): remove non-existent policies, fix name by @anshbansal in #9150
  • Fix for a test that passed on Oss and failed on Saas by @kkorchak in #9147
  • docs(teradata): teradata doc external link 404 fix by @sachinsaju in #9152
  • fix(datahub-client): Include relocation for snakeyaml dependency. by @jiateoh in #8911
  • fix(ingest): cleanup large images in CI by @hsheth2 in #9153
  • build: increase gradle retries by @hsheth2 in #9091
  • feat(ingest): bump sqlglot parser by @hsheth2 in #9155
  • feat(ingest/mongodb): support stateful ingestion by @TonyOuyangGit in #9118
  • API test for managing secrets privilege by @kkorchak in #9121
  • fix(ingest): handle exceptions in min, max, mean profiling by @mayurinehate in #9129
  • feat: rename Assets tab to Owner Of by @kushagra-apptware in #9141
  • fix(ingest/mongodb): fix schema inference for lists of values by @hsheth2 in #9145
  • fix(ingest/db2): fix handling for table properties by @deepgarg-visa in #9128
  • fix(ingest): fully support MCPs in urn_iter primitive by @hsheth2 in #9157
  • fix(ingest/bigquery): use correct row count in null count profiling c… by @mayurinehate in #9123
  • docs: add feature guides for subscriptions and notifications by @yoonhyejin in #9122
  • docs: unify oidc guides using tabs by @yoonhyejin in #9068
  • chore(ingest): remove legacy memory_leak_detector by @hsheth2 in #9158
  • feat(ingest/looker): support emitting unused explores by @hsheth2 in #9159
  • refactor(policy): refactor policy locking, no functional difference by @david-leifker in #9163
  • API test for managing access token privilege by @kkorchak in #9167
  • fix(mysql-setup): quote database name by @darnaut in #9169
  • fix(health): fix health check ...

v0.12.1rc2

28 Nov 14:22
ac7fa56
Pre-release

What's Changed

Full Changelog: v0.12.1...v0.12.1rc2