
Command to migrate MariaDB database #61

Merged: 54 commits, Sep 4, 2024
5a469e2
restoring maria-migrate branch
amyfromandi Jul 17, 2024
677cc4a
modified .gitignore to ignore specific files
amyfromandi Jul 17, 2024
871128b
Refactored some code
amyfromandi Jul 22, 2024
66c161a
updated utils.py
amyfromandi Jul 25, 2024
b2700bd
added liths prescript query
amyfromandi Jul 25, 2024
ce87076
Merge branch 'main' into maria-migrate
davenquinn Jul 25, 2024
b4da967
Moved entire mariadb_migration system to subdirectory
davenquinn Jul 25, 2024
28c062a
Incorporated legacy migration command into new directory
davenquinn Jul 25, 2024
00c1a02
Created basic command to run MariaDB CLI
davenquinn Jul 26, 2024
1f234f9
Remove need to connect with --net=host
davenquinn Jul 26, 2024
7f9193c
Add command that restores a database onto the MariaDB server
davenquinn Jul 26, 2024
93c7fb6
Removed overly complex logging approach
davenquinn Jul 26, 2024
225e016
find_row_variances() compares maridb data to macrostrat data to ensur…
amyfromandi Jul 26, 2024
12f5509
Refactor - move stream reader to a utils file
davenquinn Jul 26, 2024
76b1b2b
Fix reference to decoder
davenquinn Jul 26, 2024
4c58f5e
Rearrange CLI utils
davenquinn Jul 26, 2024
8d7eccd
Moved pre- and post- scripts to external SQL files
davenquinn Jul 26, 2024
9b8ba34
Removed extra files created by IntelliJ
davenquinn Jul 26, 2024
6245e56
Set up namespaced packages correctly in intellij
davenquinn Jul 26, 2024
8161b71
Fix a bit more sql
davenquinn Jul 26, 2024
985bab4
added find_col_variance() method
amyfromandi Jul 26, 2024
884814b
Merge branch 'maria-migrate' of https://github.com/UW-Macrostrat/macr…
amyfromandi Jul 26, 2024
acf35b9
Move change assessment methods to separate file
davenquinn Jul 26, 2024
aa4c84d
Updated MariaDB dump/restore functions
davenquinn Jul 27, 2024
1262ea3
PostgreSQL migration now mostly works
davenquinn Jul 29, 2024
de6470e
Get reporting to work
davenquinn Jul 29, 2024
e239b9b
Somewhat improved column and row count functions
davenquinn Jul 29, 2024
e7650e6
Updated mariadb migration functions
davenquinn Jul 29, 2024
0ab3678
Fixed MYSQLDump command and deadlock in dump/restore
davenquinn Jul 29, 2024
1a9fbf3
added find_col_variances() function
amyfromandi Jul 29, 2024
5fc2636
Updated some formatting
davenquinn Jul 29, 2024
4fdc534
Streamlined migration scripts and renamed files
davenquinn Jul 29, 2024
795395c
Added the ability to run different steps of the migration process
davenquinn Jul 29, 2024
5064029
accommodated code for port missing in macrostrat.toml. also added ssl…
Jul 31, 2024
5152d49
updated pgloader code to use new pg_temp_engine url and creds
amyfromandi Jul 31, 2024
fb9e628
Merged maria-migrate into this branch
amyfromandi Jul 31, 2024
02d7112
Added most up-to-date find_row_variances() and find_col_variances
amyfromandi Aug 1, 2024
6f3774f
Got find_row_variances() and find_col_variances to function!
amyfromandi Aug 1, 2024
15d7a90
Created utility function we might use to run PGLoader
davenquinn Aug 1, 2024
88c0191
Updated utility functions for creating temporary database users
davenquinn Aug 2, 2024
0c6dda9
Updated function names somewhat
davenquinn Aug 2, 2024
cb633f6
fixed pgloader issue by adding mariadb_migrator as superuser
amyfromandi Aug 6, 2024
761903c
Fixed pgloader and check-data stats
amyfromandi Aug 6, 2024
22214cf
Merge pull request #74 from UW-Macrostrat/maria-migrate-cli
davenquinn Aug 6, 2024
daa13ad
repointed migration scripts to point to macrostrat.macrostrat_temp sc…
amyfromandi Aug 7, 2024
454129f
added preserve_macrostrat_data() function for the final steps in the …
amyfromandi Aug 8, 2024
1f1cf02
Merge branch 'maria-migrate-cli', remote-tracking branch 'origin' int…
amyfromandi Aug 8, 2024
fd039a5
Added code to resolve table and column variances across macrostrat an…
amyfromandi Aug 9, 2024
4c729c3
post script and preserve-macrostrat-data script is finalized!
amyfromandi Aug 13, 2024
f03941c
refactored a little bit of data
amyfromandi Aug 13, 2024
265feaf
removing unnecessary schlep scripts. index files in schlep-index.sql …
amyfromandi Aug 30, 2024
8f8ab02
Delete cli/macrostrat/cli/commands/schlep.py
amyfromandi Sep 3, 2024
78067e6
Delete cli/macrostrat/cli/commands/table_meta.py
amyfromandi Sep 3, 2024
57a2ccf
Modified output to pass data variance test with known 'issues'
amyfromandi Sep 4, 2024
19 changes: 19 additions & 0 deletions .idea/dataSources.xml

Some generated files are not rendered by default.

69 changes: 69 additions & 0 deletions .idea/inspectionProfiles/Project_Default.xml

14 changes: 6 additions & 8 deletions .idea/macrostrat.iml

3 changes: 2 additions & 1 deletion .idea/sqldialects.xml

1 change: 0 additions & 1 deletion .idea/vcs.xml

3 changes: 2 additions & 1 deletion cli/macrostrat/cli/_dev/dump_database.py
@@ -6,7 +6,8 @@
from macrostrat.utils import get_logger
from sqlalchemy.engine import Engine

from .utils import _create_command, print_stdout, print_stream_progress
from .utils import _create_command
from .stream_utils import print_stream_progress, print_stdout

log = get_logger(__name__)

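The import shuffle above moves `print_stream_progress` and `print_stdout` out of `utils` into a new `stream_utils` module. At its core, `print_stream_progress` is a chunked read/write loop that tallies megabytes copied as a dump streams through. A minimal synchronous sketch of that pattern (the helper name below is hypothetical, not part of the PR):

```python
import io


def copy_and_count_megabytes(
    src: io.BufferedIOBase, dst: io.BufferedIOBase, chunk_size: int = 1024
) -> float:
    """Copy src to dst in chunks, returning the megabytes transferred.

    Mirrors the chunked loop in print_stream_progress, minus the async
    plumbing and the periodic progress printout.
    """
    megabytes = 0.0
    while chunk := src.read(chunk_size):
        dst.write(chunk)
        megabytes += len(chunk) / 1_000_000
    return megabytes


mb = copy_and_count_megabytes(io.BytesIO(b"x" * 2_000_000), io.BytesIO())
assert abs(mb - 2.0) < 1e-6
```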
12 changes: 4 additions & 8 deletions cli/macrostrat/cli/_dev/restore_database.py
@@ -10,9 +10,8 @@
from .utils import (
_create_command,
_create_database_if_not_exists,
print_stdout,
print_stream_progress,
)
from .stream_utils import print_stream_progress, print_stdout

console = Console()

@@ -26,10 +25,8 @@ def pg_restore(*args, **kwargs):

async def _pg_restore(
engine: Engine,
*,
*args,
create=False,
command_prefix: Optional[list] = None,
args: list = [],
postgres_container: str = "postgres:15",
):
# Pipe file to pg_restore, mimicking
@@ -42,11 +39,10 @@ async def _pg_restore(
# host, if possible, is probably the fastest option. There should be
# multiple options ideally.
_cmd = _create_command(
engine,
"pg_restore",
"-d",
args=args,
prefix=command_prefix,
engine,
*args,
container=postgres_container,
)

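The signature change above replaces the `args: list = []` keyword (and the `command_prefix` option) with variadic `*args`, so extra `pg_restore` flags are forwarded positionally into `_create_command`. A sketch of that forwarding pattern, where the command builder is a hypothetical stand-in for `_create_command`, not the real implementation:

```python
def build_restore_command(*args: str, container: str = "postgres:15") -> list[str]:
    # Hypothetical stand-in for _create_command: wrap pg_restore in a
    # containerized invocation and splice the caller's flags in positionally.
    return ["docker", "run", "-i", container, "pg_restore", "-d", *args]


# Callers now pass extra flags positionally rather than via args=[...]
cmd = build_restore_command("--no-owner", "--clean")
assert cmd == [
    "docker", "run", "-i", "postgres:15",
    "pg_restore", "-d", "--no-owner", "--clean",
]
```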
117 changes: 117 additions & 0 deletions cli/macrostrat/cli/_dev/stream_utils.py
@@ -0,0 +1,117 @@
import asyncio
import sys
import zlib

from aiofiles.threadpool import AsyncBufferedIOBase
from macrostrat.utils import get_logger
from .utils import console

log = get_logger(__name__)


async def print_stream_progress(
input: asyncio.StreamReader | asyncio.subprocess.Process,
out_stream: asyncio.StreamWriter | None,
*,
verbose: bool = False,
chunk_size: int = 1024,
prefix: str = None,
):
"""This should be unified with print_stream_progress, but there seem to be
slight API differences between the aiofiles and asyncio.StreamWriter APIs."""
in_stream = input
if isinstance(in_stream, asyncio.subprocess.Process):
in_stream = input.stdout

megabytes_written = 0
i = 0

# Iterate over the stream by chunks
try:
while True:
chunk = await in_stream.read(chunk_size)
if not chunk:
log.info("End of stream")
break
if verbose:
log.info(chunk)
megabytes_written += len(chunk) / 1_000_000
if isinstance(out_stream, AsyncBufferedIOBase):
await out_stream.write(chunk)
await out_stream.flush()
elif out_stream is not None:
out_stream.write(chunk)
await out_stream.drain()
i += 1
if i == 100:
i = 0
_print_progress(megabytes_written, end="\r", prefix=prefix)
except asyncio.CancelledError:
pass
finally:
_print_progress(megabytes_written, prefix=prefix)

if isinstance(out_stream, AsyncBufferedIOBase):
out_stream.close()
elif out_stream is not None:
out_stream.close()
await out_stream.wait_closed()


def _print_progress(megabytes: float, **kwargs):
prefix = kwargs.pop("prefix", None)
if prefix is None:
prefix = "Dumped"
progress = f"{prefix} {megabytes:.1f} MB"
kwargs["file"] = sys.stderr
print(progress, **kwargs)


async def print_stdout(stream: asyncio.StreamReader):
async for line in stream:
log.info(line)
console.print(line.decode("utf-8"), style="dim")


class DecodingStreamReader(asyncio.StreamReader):
"""A StreamReader that decompresses gzip files (if compressed)"""

# https://ejosh.co/de/2022/08/stream-a-massive-gzipped-json-file-in-python/

def __init__(self, stream, encoding="utf-8", errors="strict"):
super().__init__()
self.stream = stream
self._is_gzipped = None
self.d = zlib.decompressobj(zlib.MAX_WBITS | 16)

def decompress(self, input: bytes) -> bytes:
decompressed = self.d.decompress(input)
data = b""
while self.d.unused_data != b"":
buf = self.d.unused_data
self.d = zlib.decompressobj(zlib.MAX_WBITS | 16)
data = self.d.decompress(buf)
return decompressed + data

def transform_data(self, data):
if self._is_gzipped is None:
self._is_gzipped = data[:2] == b"\x1f\x8b"
log.info("is_gzipped: %s", self._is_gzipped)
if self._is_gzipped:
# Decompress the data
data = self.decompress(data)
return data

async def read(self, n=-1):
data = await self.stream.read(n)
return self.transform_data(data)

async def readline(self):
res = b""
while res == b"":
# Read next line
line = await self.stream.readline()
if not line:
break
res += self.transform_data(line)
return res
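The gzip auto-detection in `DecodingStreamReader.transform_data` hinges on two details: the two-byte gzip magic number `\x1f\x8b` at the start of the stream, and `zlib.decompressobj(zlib.MAX_WBITS | 16)`, which tells zlib to expect a gzip header and trailer. A minimal synchronous sketch of the same detect-and-decompress logic (the function name is hypothetical):

```python
import gzip
import zlib


def maybe_decompress(data: bytes) -> bytes:
    # Same check DecodingStreamReader.transform_data performs: the gzip
    # magic number occupies the first two bytes of a compressed stream.
    if data[:2] == b"\x1f\x8b":
        # MAX_WBITS | 16 selects gzip (not raw deflate or zlib) framing.
        d = zlib.decompressobj(zlib.MAX_WBITS | 16)
        return d.decompress(data) + d.flush()
    return data


compressed = gzip.compress(b"SELECT 1;")
assert maybe_decompress(compressed) == b"SELECT 1;"   # gzipped input is inflated
assert maybe_decompress(b"SELECT 1;") == b"SELECT 1;"  # plain input passes through
```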
2 changes: 1 addition & 1 deletion cli/macrostrat/cli/_dev/transfer_tables.py
@@ -1,5 +1,5 @@
import asyncio
from .utils import print_stream_progress, print_stdout
from .stream_utils import print_stream_progress, print_stdout
from sqlalchemy.engine import Engine
from .dump_database import _pg_dump
from .restore_database import _pg_restore