2024-12-02 Custom DC stable release (#4763)
# Highlights
- SQL queries have been parameterized.
- The /translate API endpoint and the /translator web endpoint have been
removed.
- Some base CSS files have been modified as part of ongoing visual
refreshes of the main Data Commons site.
- A new config file option `includeInputSubdirs` is now available.
Please note that while this feature has undergone basic testing, it may
have rough edges and is not yet documented on our docsite. The
associated feature request is
https://issuetracker.google.com/issues/369945544.
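
The first highlight mentions parameterized SQL queries, but the query code itself is not shown in this excerpt. The following is only a minimal sketch of the general technique using Python's standard `sqlite3` module. The table `observations`, its columns, and the helpers `find_observations` and `make_demo_db` are hypothetical and are not taken from the repository.

```python
import sqlite3


def find_observations(conn: sqlite3.Connection, entity: str) -> list[tuple]:
    """Look up rows for one entity using a bound parameter."""
    # The "?" placeholder makes the driver bind `entity` as data, so the
    # value is never spliced into the SQL text itself.
    cursor = conn.execute(
        "SELECT entity, variable, value FROM observations WHERE entity = ?",
        (entity,),
    )
    return cursor.fetchall()


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations (entity TEXT, variable TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO observations VALUES (?, ?, ?)",
        [
            ("country/USA", "Count_Person", 1.0),
            ("country/IND", "Count_Person", 2.0),
        ],
    )
    return conn
```

Because the input is bound rather than concatenated, a string such as `x' OR '1'='1` is compared literally against the `entity` column and matches nothing, instead of altering the query.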

# Submodule diffs
- Mixer:
datacommonsorg/mixer@656512f...b5d6d7c
- Import:
datacommonsorg/import@5d14167...98cd40c
hqpho authored Dec 2, 2024
2 parents 8373289 + fed5115 commit eb41486
Showing 152 changed files with 9,987 additions and 12,792 deletions.
34 changes: 0 additions & 34 deletions .github/workflows/all-commits-in-master.yml

This file was deleted.

8 changes: 4 additions & 4 deletions .github/workflows/codeql-analysis.yml
@@ -35,11 +35,11 @@ jobs:

steps:
- name: Checkout repository
uses: actions/checkout@v2
uses: actions/checkout@v4

# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL
uses: github/codeql-action/init@v2
uses: github/codeql-action/init@v3
with:
languages: ${{ matrix.language }}
# If you wish to specify custom queries, you can do so here or in a config file.
@@ -50,7 +50,7 @@ jobs:
# Autobuild attempts to build any compiled languages (C/C++, C#, or Java).
# If this step fails, then you should remove it and run the build manually (see below)
- name: Autobuild
uses: github/codeql-action/autobuild@v2
uses: github/codeql-action/autobuild@v3

# ℹ️ Command-line programs to run using the OS shell.
# 📚 https://git.io/JvXDl
@@ -64,4 +64,4 @@ jobs:
# make release

- name: Perform CodeQL Analysis
uses: github/codeql-action/analyze@v2
uses: github/codeql-action/analyze@v3
43 changes: 43 additions & 0 deletions .github/workflows/release-branch-checks.yml
@@ -0,0 +1,43 @@
name: Release branch checks

on:
pull_request:
branches: [ "customdc_stable" ]
# Required for merge queue to work: https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-a-merge-queue#triggering-merge-group-checks-with-github-actions
merge_group:
branches: [ "customdc_stable" ]

jobs:
verify_all_commits_are_already_in_master:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
# Fetch all history for accurate comparison
fetch-depth: 0
# Check out the PR branch
ref: ${{ github.event.pull_request.head.ref }}
repository: ${{ github.event.pull_request.head.repo.full_name }}

- name: Verify that all commits are already in the master branch
run: |
git remote add dc https://github.com/datacommonsorg/website.git
git fetch dc
MASTER_BRANCH="dc/master"
# Get the list of commits in the source branch that are not in the master branch
MISSING_COMMITS=$(git log --pretty="%H - %s" $MASTER_BRANCH..HEAD --)
if [[ -n "$MISSING_COMMITS" ]]; then
echo ""
echo "ERROR: The following commits are not present in $MASTER_BRANCH:"
echo ""
echo "$MISSING_COMMITS"
echo ""
echo "PRs to release branches should only contain commits that are already in master."
echo "To fix this PR, reset its branch locally to a commit at or behind https://github.com/datacommonsorg/website/commits/master/ and then force-push it."
echo "Note that a release branch PR should be based on master and not the previous version of the release branch, which contains merge commits."
exit 1
fi
echo "All commits are present in $MASTER_BRANCH"
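
The check above relies on `git log $MASTER_BRANCH..HEAD`, which lists commits reachable from `HEAD` but not from the master branch. As a rough illustration of that set difference (not the workflow's implementation, which simply calls git), the sketch below models history as a hypothetical in-memory mapping from each commit to its parents. The commit names `a`, `b`, `c`, and `merge1` and both helper functions are invented for this example.

```python
def reachable(parents: dict[str, list[str]], tip: str) -> set[str]:
    """Return every commit reachable from `tip` by following parent links."""
    seen: set[str] = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


def missing_commits(
    parents: dict[str, list[str]], master_tip: str, head_tip: str
) -> set[str]:
    """Commits reachable from the PR head but not from master."""
    return reachable(parents, head_tip) - reachable(parents, master_tip)


# Hypothetical history: master is a <- b <- c. The previous release branch
# started at `a` and later received `b` through the merge commit `merge1`.
HISTORY = {
    "a": [],
    "b": ["a"],
    "c": ["b"],
    "merge1": ["a", "b"],
}
```

In this toy history, a PR whose head is `c` has no missing commits, while a PR based on the old release branch tip `merge1` reports `merge1` as missing. That matches the error message's advice to base release PRs on master rather than on the previous release branch.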
72 changes: 28 additions & 44 deletions build/cdc_data/Dockerfile
@@ -12,47 +12,31 @@
# See the License for the specific language governing permissions and
# limitations under the License.

# #### Stage 1: Build env for data importer. ####
FROM python:3.11.4-slim as data-importer

ARG PIP_DISABLE_PIP_VERSION_CHECK=1
ARG PIP_NO_CACHE_DIR=1
# #### Stage 1: Download base dc model from GCS. ####
FROM google/cloud-sdk:slim as model-downloader

# Copy model.
RUN mkdir -p /tmp/datcom-nl-models \
&& gsutil -m cp -R gs://datcom-nl-models/ft_final_v20230717230459.all-MiniLM-L6-v2/ /tmp/datcom-nl-models/


# #### Stage 2: Copy required files. ####
FROM python:3.11.4-slim as file-copier

WORKDIR /workspace

# Copy requirements.
# Copy simple importer requirements.
COPY import/simple/requirements.txt ./import/simple/requirements.txt

# Create a virtual env and install requirements.
RUN python -m venv /workspace/venv
ENV PATH="/workspace/venv/bin:$PATH"
RUN pip3 install -r ./import/simple/requirements.txt

# Copy simple importer.
COPY import/simple/ ./import/simple/


# #### Stage 2: Build env for embeddings builder. ####
FROM python:3.11.4-slim as embeddings-builder

ARG PIP_DISABLE_PIP_VERSION_CHECK=1
ARG PIP_NO_CACHE_DIR=1

WORKDIR /workspace

# Copy requirements.
# Copy embeddings builder requirements.
# Copy nl_requirements.txt since it is referenced by embeddings requirements.txt
COPY tools/nl/embeddings/requirements.txt ./tools/nl/embeddings/requirements.txt
COPY nl_requirements.txt ./nl_requirements.txt

# Create a virtual env and install requirements.
# Remove lancedb - it is not used by custom dc.
RUN python -m venv ./venv
ENV PATH="/workspace/venv/bin:$PATH"
RUN sed -i'' '/lancedb/d' /workspace/nl_requirements.txt \
&& pip3 install torch==2.2.2 --extra-index-url https://download.pytorch.org/whl/cpu \
&& pip3 install -r ./tools/nl/embeddings/requirements.txt

# Copy the embeddings builder module.
COPY tools/nl/embeddings/. ./tools/nl/embeddings/
# Copy the shared module.
@@ -63,15 +47,7 @@ COPY nl_server/. /workspace/nl_server/
COPY deploy/nl/. /datacommons/nl/


# #### Stage 3: Download base dc model from GCS. ####
FROM google/cloud-sdk:slim as model-downloader

# Copy model.
RUN mkdir -p /tmp/datcom-nl-models \
&& gsutil -m cp -R gs://datcom-nl-models/ft_final_v20230717230459.all-MiniLM-L6-v2/ /tmp/datcom-nl-models/


# #### Stage 4: Runtime env. ####
# #### Stage 3: Runtime env. ####
FROM python:3.11.4-slim as runner

ARG ENV
@@ -80,19 +56,27 @@ ENV ENV=${ENV}
WORKDIR /workspace

# Copy scripts, dependencies and files from the build stages.
COPY --from=data-importer /workspace/ .
COPY --from=embeddings-builder /workspace/ .
COPY --from=embeddings-builder /datacommons/ /datacommons
COPY --from=file-copier /workspace/ .
COPY --from=file-copier /datacommons/ /datacommons
COPY --from=model-downloader /tmp/datcom-nl-models /tmp/datcom-nl-models

ARG PIP_DISABLE_PIP_VERSION_CHECK=1
ARG PIP_NO_CACHE_DIR=1

# Create a virtual env, add it to path, and install all requirements.
RUN python -m venv /workspace/venv
ENV PATH="/workspace/venv/bin:$PATH"
RUN pip3 install -r ./import/simple/requirements.txt
# Remove lancedb - it is not used by custom dc.
RUN sed -i'' '/lancedb/d' /workspace/nl_requirements.txt \
&& pip3 install torch==2.2.2 --extra-index-url https://download.pytorch.org/whl/cpu \
&& pip3 install -r ./tools/nl/embeddings/requirements.txt

# Copy executable script.
COPY build/cdc_data/run.sh .

# Make script executable.
RUN chmod +x run.sh

# Add virtual env to the path.
ENV PATH="/workspace/venv/bin:$PATH"

# Set the default command to run the script.
CMD ./run.sh
CMD ./run.sh
79 changes: 79 additions & 0 deletions build/ci/cloudbuild.push_cdc_stable.yaml
@@ -0,0 +1,79 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Updates stable-tagged Docker images for custom DC.
# Assumes the stable branch is already checked out, which it should be
# if this is triggered on push to branch for the stable branch.

################################################################################

# NOTE: Logs-based metrics for this build are dependent on step numbers.
# For this reason, please either add new steps at the end of the file OR
# update ALL metrics when adding/removing steps.

################################################################################

steps:
# Step 0: Initialize submods
- id: init-submods
name: gcr.io/cloud-builders/git
entrypoint: bash
args:
- -c
- |
set -e
git submodule update --init --recursive
waitFor: ["-"]

# Step 1: Get a label that combines commit hashes.
- id: get-label
name: gcr.io/cloud-builders/git
entrypoint: bash
args:
- -c
- |
set -e
set -o pipefail
./scripts/get_commits_label.sh | tail -1 >"$_IMAGE_LABEL_PATH"
waitFor: ["init-submods"]

# Step 2: Services container
- id: build-and-tag-stable-services
name: gcr.io/datcom-ci/deploy-tool
entrypoint: bash
args:
- -c
- |
set -e
image_label=$(cat "$_IMAGE_LABEL_PATH")
./scripts/build_cdc_services_and_tag_stable.sh $image_label
waitFor: ["get-label"]

# Step 3: Data management container
- id: build-and-tag-stable-data
name: gcr.io/datcom-ci/deploy-tool
entrypoint: bash
args:
- -c
- |
set -e
image_label=$(cat "$_IMAGE_LABEL_PATH")
./scripts/build_cdc_data_and_tag_stable.sh $image_label
waitFor: ["get-label"]

substitutions:
_IMAGE_LABEL_PATH: "/workspace/tmp_cdc_stable_image_label.txt"

options:
machineType: "E2_HIGHCPU_32"
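
The `waitFor` fields in the config above define a small dependency graph: a value of `["-"]` lets a step start immediately, and otherwise a step waits for the listed step ids. As a simplified model of the resulting ordering (a sketch using Python's standard `graphlib`, not how Cloud Build itself schedules work), the following groups the four steps into waves that could run in parallel. The function name `execution_waves` is an assumption made for this example; the step ids are copied from the config.

```python
from graphlib import TopologicalSorter

# Step ids and waitFor lists copied from cloudbuild.push_cdc_stable.yaml.
WAIT_FOR = {
    "init-submods": ["-"],
    "get-label": ["init-submods"],
    "build-and-tag-stable-services": ["get-label"],
    "build-and-tag-stable-data": ["get-label"],
}


def execution_waves(wait_for: dict[str, list[str]]) -> list[list[str]]:
    """Group steps into waves; steps within a wave have no unmet dependencies."""
    # "-" is a marker for "no dependencies", so drop it from the graph.
    graph = {
        step: {dep for dep in deps if dep != "-"}
        for step, deps in wait_for.items()
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Under this model the two image builds land in the same final wave, since both depend only on `get-label`, which is consistent with them being able to run concurrently once the label file has been written.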
2 changes: 1 addition & 1 deletion build/ci/cloudbuild.screenshot.yaml
@@ -15,7 +15,7 @@
steps:
# Build the static files
- id: package_js
name: gcr.io/datcom-ci/node:2024-06-11
name: gcr.io/datcom-ci/node:2024-11-19
entrypoint: /bin/bash
waitFor: ["-"]
args:
22 changes: 8 additions & 14 deletions build/ci/cloudbuild.webdriver.yaml
@@ -15,7 +15,7 @@
steps:
# Build the static files
- id: package_js
name: gcr.io/datcom-ci/node:2024-06-11
name: gcr.io/datcom-ci/node:2024-11-19
entrypoint: /bin/bash
waitFor: ["-"]
args:
@@ -30,33 +30,27 @@ steps:
# ./run_test.sh -b will build client packages.
# These generated js files will be necessary for the flask_webdriver_test task.
./run_test.sh -b
# Download the files needed for nl server to run. Do the download here because
# webdriver runs on multiple processes & we only want to do the download once.
- id: download_nl_files
- id: setup_python
name: python:3.11.3
entrypoint: /bin/sh
waitFor:
- package_js
entrypoint: /bin/bash
waitFor: ["-"]
args:
- -c
- |
cd tools/nl/download_nl_files
./run.sh
./run_test.sh --setup_python
# Run the webdriver tests
- id: flask_webdriver_test
name: gcr.io/datcom-ci/webdriver-chrome:2024-06-05
entrypoint: /bin/sh
waitFor:
- download_nl_files
waitFor: ["package_js", "setup_python"]
args:
- -c
- |
./run_test.sh --setup_python
./run_test.sh -w
timeout: 1800s
timeout: 1800s # 30 minutes

options:
machineType: "E2_HIGHCPU_32"
2 changes: 1 addition & 1 deletion build/node/cloudbuild.yaml
@@ -13,7 +13,7 @@
# limitations under the License.

substitutions:
_VERS: "2024-06-11"
_VERS: "2024-11-19"

steps:
- name: "gcr.io/cloud-builders/docker"
2 changes: 1 addition & 1 deletion deploy/nl/catalog.yaml
@@ -78,7 +78,7 @@ indexes:
bio_ft:
store_type: MEMORY
source_path: ../../tools/nl/embeddings/input/bio
embeddings_path: gs://datcom-nl-models/bio_ft_2024_06_24_23_40_05/embeddings.csv
embeddings_path: gs://datcom-nl-models/bio_ft_2024_11_08_19_00_38/embeddings.csv
model: ft-final-v20230717230459-all-MiniLM-L6-v2
healthcheck_query: "Gene"
base_uae_lance:
8 changes: 8 additions & 0 deletions docs/developer_guide.md
@@ -352,3 +352,11 @@ the same region.
### Testing cloudbuild changes

To test .yaml cloudbuild files, you can use cloud-build-local to dry run the file before actually pushing. Find documentation for how to install and use cloud-build-local [here](https://github.com/GoogleCloudPlatform/cloud-build-local).

### Inline Icons

The Data Commons site makes use of Material Design icons. In certain cases, font-based Material Design icon usage can result in
flashes of unstyled content that can be avoided by using SVG icons.

We have provided tools to facilitate the creation and use of Material SVG icons in both Jinja templates and React components.
For instructions on how to generate and use these SVGs and components, please see the [Icon Readme](../tools/resources/icons/README.md).
2 changes: 1 addition & 1 deletion gke/get_storage_permission.sh
@@ -30,7 +30,7 @@ SERVICE_ACCOUNT="$NAME@$PROJECT_ID.iam.gserviceaccount.com"

# Data store project roles
declare -a store_roles=(
"roles/bigquery.admin" # BigQuery
"roles/bigquery.dataViewer" # BigQuery
"roles/bigtable.reader" # Bigtable
"roles/storage.objectViewer" # Branch Cache Read
"roles/pubsub.editor" # Branch Cache Subscription