Test fg reader hive3 #10996

Closed

jonvex wants to merge 45 commits into master from use_fg_reader_hive

Changes from all commits

45 commits
7d6e12d  squash commits (Dec 28, 2023)
d419efd  refactor a bit and add some comments (Dec 28, 2023)
84b89a5  make build properly (Dec 28, 2023)
3f7189e  fix some of the failing tests (Dec 28, 2023)
0777367  revert to old impl when schema evolution enabled (Dec 28, 2023)
8f3a528  disable fg reader for stupid test (Dec 28, 2023)
b216504  fix some failing tests (Dec 28, 2023)
ddec84c  assigned the ports backwards (Dec 28, 2023)
4a84143  verbose output bundle validation (Dec 28, 2023)
1dbc824  add volume to docker compose (Dec 28, 2023)
40d1a74  need to lowercase the field names omg (Dec 28, 2023)
669ecaa  don't remove partition if it's listed in the schema (Dec 28, 2023)
b8c905d  put partition cols at end of output (Dec 28, 2023)
a73fa68  invert filter (Dec 28, 2023)
1fed423  support no base file, only log. support read from non-hudi table (Dec 29, 2023)
4b31fd3  disable for skip merge as well (Dec 29, 2023)
ea196eb  fix non hoodie path read (Dec 29, 2023)
987684e  revert setting precombine (Jan 2, 2024)
ebdeeb3  fix no meta cols table (Jan 2, 2024)
9aceb00  check if no requested fields (Jan 2, 2024)
de6898f  create empty schema properly (Jan 2, 2024)
c96dd5a  check if metadata folder exists (Jan 2, 2024)
de4e4cc  handle mor with no meta fields (Jan 2, 2024)
ffcf47d  disable reader for a test because mor seems to work different (Jan 2, 2024)
82f87fa  delete partition column from the jobconf if it is written in the file (Jan 4, 2024)
19f6f20  modify data schema due to partition column madness (Jan 4, 2024)
144aaf5  remove unused import (Jan 4, 2024)
1d5c295  add some comments (Jan 4, 2024)
bd6e0e3  don't add partition fields when the data schema doesn't have them (Jan 5, 2024)
c0fbf8d  Merge branch 'apache:master' into use_fg_reader_hive (jonvex, Jan 5, 2024)
8f08fa5  Merge branch 'master' into use_fg_reader_hive (Jan 16, 2024)
ac9cb0c  address review feedback (Jan 19, 2024)
15ed1ad  accidently put remove in for loop for combine reader (Jan 19, 2024)
c44be9d  Merge branch 'master' into use_fg_reader_hive (Jan 23, 2024)
3ae140a  Merge branch 'master' into use_fg_reader_hive (Jan 29, 2024)
c487e69  get building again (Jan 29, 2024)
2c38ef7  address some review comments (Jan 29, 2024)
68d31b7  add reviewer suggested change (Feb 2, 2024)
31978ae  Merge branch 'master' into use_fg_reader_hive (Feb 5, 2024)
ed4a1ba  add missing params fg reader (Feb 5, 2024)
0e0840b  Merge branch 'master' into use_fg_reader_hive (Feb 19, 2024)
a7a5219  address some comments (Feb 20, 2024)
d8ed4e3  Merge branch 'master' into use_fg_reader_hive (Apr 8, 2024)
7557d79  tmp (Apr 11, 2024)
974f610  add missing deps (Apr 16, 2024)

319 changes: 319 additions & 0 deletions docker/compose/docker-compose_hadoop310_hive312_spark321_mac_aarch64.yml
@@ -0,0 +1,319 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

version: "3.3"

services:

  namenode:
    image: apachehudi/hudi-hadoop_3.1.0-namenode:latest
    hostname: namenode
    container_name: namenode
    environment:
      - CLUSTER_NAME=hudi_hadoop310_hive312_spark321
    ports:
      - "9870:9870"
      - "8020:8020"
    env_file:
      - ./hadoop.env
    healthcheck:
      test: ["CMD", "curl", "-f", "http://namenode:9870"]
      interval: 30s
      timeout: 10s
      retries: 3

  datanode1:
    image: apachehudi/hudi-hadoop_3.1.0-datanode:latest
    container_name: datanode1
    hostname: datanode1
    environment:
      - CLUSTER_NAME=hudi_hadoop310_hive312_spark321
    env_file:
      - ./hadoop.env
    ports:
      - "50075:50075"
      - "50010:50010"
    links:
      - "namenode"
      - "historyserver"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://datanode1:50075"]
      interval: 30s
      timeout: 10s
      retries: 3
    depends_on:
      - namenode

  historyserver:
    image: apachehudi/hudi-hadoop_3.1.0-history:latest
    hostname: historyserver
    container_name: historyserver
    environment:
      - CLUSTER_NAME=hudi_hadoop310_hive312_spark321
    depends_on:
      - "namenode"
    links:
      - "namenode"
    ports:
      - "58188:8188"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://historyserver:8188"]
      interval: 30s
      timeout: 10s
      retries: 3
    env_file:
      - ./hadoop.env
    volumes:
      - historyserver:/hadoop/yarn/timeline

  hive-metastore-postgresql:
    image: bde2020/hive-metastore-postgresql:3.1.0
    volumes:
      - hive-metastore-postgresql:/var/lib/postgresql
    hostname: hive-metastore-postgresql
    container_name: hive-metastore-postgresql

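  # SERVICE_PRECONDITION below is consumed by the image's entrypoint, which
  # waits for the listed host:port endpoints to become reachable before
  # starting the metastore (an assumption based on the bde2020-style base
  # images used here).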
  hivemetastore:
    image: apachehudi/hudi-hadoop_3.1.0-hive_3.1.2:latest
    hostname: hivemetastore
    container_name: hivemetastore
    links:
      - "hive-metastore-postgresql"
      - "namenode"
    env_file:
      - ./hadoop.env
    command: /opt/hive/bin/hive --service metastore
    environment:
      SERVICE_PRECONDITION: "namenode:9870 hive-metastore-postgresql:5432"
    ports:
      - "9083:9083"
    healthcheck:
      test: ["CMD", "nc", "-z", "hivemetastore", "9083"]
      interval: 30s
      timeout: 10s
      retries: 3
    depends_on:
      - "hive-metastore-postgresql"
      - "namenode"

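  # hiveserver starts with a JDWP agent (JAVA_TOOL_OPTIONS below) listening
  # on container port 5005, published as 64757 on the host, so a remote
  # debugger can attach while queries run.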
  hiveserver:
    image: apachehudi/hudi-hadoop_3.1.0-hive_3.1.2:latest
    hostname: hiveserver
    container_name: hiveserver
    env_file:
      - ./hadoop.env
    environment:
      SERVICE_PRECONDITION: "hivemetastore:9083"
      JAVA_TOOL_OPTIONS: "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"
    ports:
      - "10000:10000"
      # JVM debugging port
      - "64757:5005"
    depends_on:
      - "hivemetastore"
    links:
      - "hivemetastore"
      - "hive-metastore-postgresql"
      - "namenode"
    volumes:
      - ${HUDI_WS}:/var/hoodie/ws
      - /Users/jon/Desktop/hiveWorkload:/var/hiveWorkload

  sparkmaster:
    image: apachehudi/hudi-hadoop_3.1.0-hive_3.1.2-sparkmaster_3.2.1:latest
    hostname: sparkmaster
    container_name: sparkmaster
    env_file:
      - ./hadoop.env
    ports:
      - "8080:8080"
      - "7077:7077"
    environment:
      - INIT_DAEMON_STEP=setup_spark
    links:
      - "hivemetastore"
      - "hiveserver"
      - "hive-metastore-postgresql"
      - "namenode"

  spark-worker-1:
    image: apachehudi/hudi-hadoop_3.1.0-hive_3.1.2-sparkworker_3.2.1:latest
    hostname: spark-worker-1
    container_name: spark-worker-1
    env_file:
      - ./hadoop.env
    depends_on:
      - sparkmaster
    ports:
      - "8081:8081"
    environment:
      - "SPARK_MASTER=spark://sparkmaster:7077"
    links:
      - "hivemetastore"
      - "hiveserver"
      - "hive-metastore-postgresql"
      - "namenode"

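  # zookeeper and kafka are pinned to linux/arm64 images, consistent with
  # the mac_aarch64 (Apple Silicon) target of this compose file.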
  zookeeper:
    image: 'arm64v8/zookeeper:3.4.12'
    platform: linux/arm64
    hostname: zookeeper
    container_name: zookeeper
    ports:
      - "2181:2181"
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes

  kafka:
    image: 'wurstmeister/kafka:2.12-2.0.1'
    platform: linux/arm64
    hostname: kafkabroker
    container_name: kafkabroker
    ports:
      - "9092:9092"
    environment:
      - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
      - ALLOW_PLAINTEXT_LISTENER=yes
      - KAFKA_ADVERTISED_HOST_NAME=kafkabroker

#  presto-coordinator-1:
#    container_name: presto-coordinator-1
#    hostname: presto-coordinator-1
#    image: apachehudi/hudi-hadoop_3.1.0-prestobase_0.271:latest
#    ports:
#      - '8090:8090'
#    environment:
#      - PRESTO_JVM_MAX_HEAP=512M
#      - PRESTO_QUERY_MAX_MEMORY=1GB
#      - PRESTO_QUERY_MAX_MEMORY_PER_NODE=256MB
#      - PRESTO_QUERY_MAX_TOTAL_MEMORY_PER_NODE=384MB
#      - PRESTO_MEMORY_HEAP_HEADROOM_PER_NODE=100MB
#      - TERM=xterm
#    links:
#      - "hivemetastore"
#    volumes:
#      - ${HUDI_WS}:/var/hoodie/ws
#    command: coordinator
#
#  presto-worker-1:
#    container_name: presto-worker-1
#    hostname: presto-worker-1
#    image: apachehudi/hudi-hadoop_3.1.0-prestobase_0.271:latest
#    depends_on: [ "presto-coordinator-1" ]
#    environment:
#      - PRESTO_JVM_MAX_HEAP=512M
#      - PRESTO_QUERY_MAX_MEMORY=1GB
#      - PRESTO_QUERY_MAX_MEMORY_PER_NODE=256MB
#      - PRESTO_QUERY_MAX_TOTAL_MEMORY_PER_NODE=384MB
#      - PRESTO_MEMORY_HEAP_HEADROOM_PER_NODE=100MB
#      - TERM=xterm
#    links:
#      - "hivemetastore"
#      - "hiveserver"
#      - "hive-metastore-postgresql"
#      - "namenode"
#    volumes:
#      - ${HUDI_WS}:/var/hoodie/ws
#    command: worker
#
#  trino-coordinator-1:
#    container_name: trino-coordinator-1
#    hostname: trino-coordinator-1
#    image: apachehudi/hudi-hadoop_3.1.0-trinocoordinator_368:latest
#    ports:
#      - '8091:8091'
#    links:
#      - "hivemetastore"
#    volumes:
#      - ${HUDI_WS}:/var/hoodie/ws
#    command: http://trino-coordinator-1:8091 trino-coordinator-1
#
#  trino-worker-1:
#    container_name: trino-worker-1
#    hostname: trino-worker-1
#    image: apachehudi/hudi-hadoop_3.1.0-trinoworker_368:latest
#    depends_on: [ "trino-coordinator-1" ]
#    ports:
#      - '8092:8092'
#    links:
#      - "hivemetastore"
#      - "hiveserver"
#      - "hive-metastore-postgresql"
#      - "namenode"
#    volumes:
#      - ${HUDI_WS}:/var/hoodie/ws
#    command: http://trino-coordinator-1:8091 trino-worker-1
#
#  graphite:
#    container_name: graphite
#    hostname: graphite
#    image: graphiteapp/graphite-statsd
#    ports:
#      - 80:80
#      - 2003-2004:2003-2004
#      - 8126:8126

  adhoc-1:
    image: apachehudi/hudi-hadoop_3.1.0-hive_3.1.2-sparkadhoc_3.2.1:latest
    hostname: adhoc-1
    container_name: adhoc-1
    env_file:
      - ./hadoop.env
    depends_on:
      - sparkmaster
    ports:
      - '4040:4040'
    environment:
      - "SPARK_MASTER=spark://sparkmaster:7077"
    links:
      - "hivemetastore"
      - "hiveserver"
      - "hive-metastore-postgresql"
      - "namenode"
      #- "presto-coordinator-1"
      #- "trino-coordinator-1"
    volumes:
      - ${HUDI_WS}:/var/hoodie/ws
      - /Users/jon/Desktop/hiveWorkload:/var/hiveWorkload

  adhoc-2:
    image: apachehudi/hudi-hadoop_3.1.0-hive_3.1.2-sparkadhoc_3.2.1:latest
    hostname: adhoc-2
    container_name: adhoc-2
    env_file:
      - ./hadoop.env
    depends_on:
      - sparkmaster
    environment:
      - "SPARK_MASTER=spark://sparkmaster:7077"
    links:
      - "hivemetastore"
      - "hiveserver"
      - "hive-metastore-postgresql"
      - "namenode"
      #- "presto-coordinator-1"
      #- "trino-coordinator-1"
    volumes:
      - ${HUDI_WS}:/var/hoodie/ws
      - /Users/jon/Desktop/hiveWorkload:/var/hiveWorkload

volumes:
  namenode:
  historyserver:
  hive-metastore-postgresql:

networks:
  default:
    name: hudi-network
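
For reference, a minimal sketch of driving this stack by hand rather than through the demo scripts, assuming the new file lands at docker/compose/docker-compose_hadoop310_hive312_spark321_mac_aarch64.yml (the name the scripts below reference) and that HUDI_WS points at a local Hudi workspace; note the /Users/jon/Desktop/hiveWorkload bind mounts are developer-local paths that would need to exist (or be removed) on other hosts:

# launch the Hive 3 stack directly (assumed path and file name)
HUDI_WS=/path/to/hudi docker-compose \
  -f docker/compose/docker-compose_hadoop310_hive312_spark321_mac_aarch64.yml up -d

# the namenode, datanode, and metastore services define healthchecks; poll one:
docker inspect --format '{{.State.Health.Status}}' namenode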
5 changes: 5 additions & 0 deletions docker/setup_demo.sh
@@ -23,6 +23,11 @@ COMPOSE_FILE_NAME="docker-compose_hadoop284_hive233_spark244.yml"
if [ "$HUDI_DEMO_ENV" = "--mac-aarch64" ]; then
COMPOSE_FILE_NAME="docker-compose_hadoop284_hive233_spark244_mac_aarch64.yml"
fi

if [ "$HUDI_DEMO_ENV" = "--hive3" ]; then
COMPOSE_FILE_NAME="docker-compose_hadoop310_hive312_spark321_mac_aarch64.yml"
fi

# restart cluster
HUDI_WS=${WS_ROOT} docker-compose -f ${SCRIPT_PATH}/compose/${COMPOSE_FILE_NAME} down
if [ "$HUDI_DEMO_ENV" != "dev" ]; then
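With the new branch above, a usage sketch, assuming HUDI_DEMO_ENV is populated from the script's first argument (the flag-style --hive3 value suggests so, though that assignment is outside this hunk):

cd docker
./setup_demo.sh --hive3   # selects the hadoop310/hive312/spark321 compose file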
5 changes: 5 additions & 0 deletions docker/stop_demo.sh
@@ -24,6 +24,11 @@ COMPOSE_FILE_NAME="docker-compose_hadoop284_hive233_spark244.yml"
if [ "$HUDI_DEMO_ENV" = "--mac-aarch64" ]; then
  COMPOSE_FILE_NAME="docker-compose_hadoop284_hive233_spark244_mac_aarch64.yml"
fi

if [ "$HUDI_DEMO_ENV" = "--hive3" ]; then
  COMPOSE_FILE_NAME="docker-compose_hadoop310_hive312_spark321_mac_aarch64.yml"
fi

# shut down cluster
HUDI_WS=${WS_ROOT} docker-compose -f ${SCRIPT_PATH}/compose/${COMPOSE_FILE_NAME} down

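Teardown mirrors setup, under the same assumption about how HUDI_DEMO_ENV is set:

cd docker
./stop_demo.sh --hive3   # brings the same hadoop310/hive312/spark321 stack down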