[WJ-1402] Revamp Wikicomma import script #1980

Merged
merged 133 commits into from
Jul 11, 2024
bc8571e
Delete prior importer script.
emmiegit Jun 12, 2024
5534270
Start new importer module.
emmiegit Jun 12, 2024
911afe6
Start s3 methods.
emmiegit Jun 12, 2024
19dbb72
Run black formatter.
emmiegit Jun 12, 2024
56b8b16
Start SQLite3 connection file.
emmiegit Jun 12, 2024
b62c732
Start process methods.
emmiegit Jun 13, 2024
1238640
Add import utility gitignore.
emmiegit Jun 13, 2024
8d6c915
Add user ingest method.
emmiegit Jun 13, 2024
6f56464
Update user ingestion code.
emmiegit Jun 13, 2024
07f3ea6
Start separate classes for S3 and SiteImporter.
emmiegit Jun 13, 2024
85e40e9
Add site data ingestion.
emmiegit Jun 13, 2024
d269cf6
Start work on SiteImporter class.
emmiegit Jun 13, 2024
6b10458
Start site subdirectories.
emmiegit Jun 13, 2024
005897d
Add process_pages() stub.
emmiegit Jun 13, 2024
1cab879
Add page ID mapping processing.
emmiegit Jun 13, 2024
53f8286
Add page data.
emmiegit Jun 13, 2024
a6f9b63
Change logging.
emmiegit Jun 13, 2024
4470851
Fix regex execution.
emmiegit Jun 13, 2024
3f14825
Fix init.
emmiegit Jun 13, 2024
846bdae
Run black formatter.
emmiegit Jun 13, 2024
6f43459
Fix add_site().
emmiegit Jun 13, 2024
4615453
Fix decorators.
emmiegit Jun 13, 2024
52c2053
Add missing import.
emmiegit Jun 13, 2024
afea615
Add another missing import.
emmiegit Jun 13, 2024
287a9b9
Fix format string.
emmiegit Jun 13, 2024
93e73fc
Add requirements.txt for importer.
emmiegit Jun 16, 2024
8b36ed2
Skip torrent files.
emmiegit Jun 17, 2024
caca641
Cache site ID (expensive get).
emmiegit Jun 17, 2024
6935462
Fetch site ID from database if present.
emmiegit Jun 18, 2024
1259a04
Add site to page log.
emmiegit Jun 18, 2024
a37a740
Start implementation for page metadata.
emmiegit Jun 18, 2024
6f38f81
Add method to convert page slugs to add colons.
emmiegit Jun 18, 2024
ba99a8c
Fix typo.
emmiegit Jun 18, 2024
849d032
Handle missing tag list.
emmiegit Jun 18, 2024
dd7a11a
Properly convert page slug.
emmiegit Jun 18, 2024
02f72b9
Fix insert query.
emmiegit Jun 18, 2024
46c8b4f
Fix page metadata variables.
emmiegit Jun 18, 2024
d5050fc
Add kangaroo_twelve() utility function.
emmiegit Jun 18, 2024
7602cf8
Add text table.
emmiegit Jun 18, 2024
fdf9e1b
Add wikitext storage to page revisions.
emmiegit Jun 18, 2024
d5c7a08
Add quotes.
emmiegit Jun 18, 2024
24acb5e
Add page revision wikitext extraction.
emmiegit Jun 18, 2024
b53dd1e
Change to properties.
emmiegit Jun 18, 2024
ed23cb9
Fix queries.
emmiegit Jun 20, 2024
e1af90f
Fix get_revision_id() return value.
emmiegit Jun 20, 2024
9a65026
Update metadata title retrieval.
emmiegit Jun 20, 2024
f7b9601
Add page_descr column to page_metadata table.
emmiegit Jun 29, 2024
f465dda
Update get_page_id() method.
emmiegit Jun 29, 2024
3a6fe29
Change logic to use page_descr.
emmiegit Jun 29, 2024
f57ab84
Remove deleted page_id cache.
emmiegit Jun 29, 2024
8953cb7
Get page_id for page metadata.
emmiegit Jun 29, 2024
457c137
Fix helper method.
emmiegit Jun 29, 2024
95791a8
Get page_id after inserting.
emmiegit Jun 29, 2024
f7424f3
Add log messages.
emmiegit Jun 29, 2024
9725a6b
Update schema for file table.
emmiegit Jun 29, 2024
2dd4b0f
Add get_page_descr() helper method.
emmiegit Jun 29, 2024
f7dd20d
Start process_files() implementation.
emmiegit Jun 29, 2024
e9278b6
Write output to log file too.
emmiegit Jun 29, 2024
7d803b7
Fix log file mode.
emmiegit Jun 29, 2024
beeca1e
Fix argument processing.
emmiegit Jun 29, 2024
acf3bce
Run black formatter.
emmiegit Jun 29, 2024
74e85e6
Remove extra newline.
emmiegit Jun 29, 2024
4aaa85c
Unify page table schema.
emmiegit Jun 29, 2024
e67b5d2
Update add_page() method for unified system.
emmiegit Jun 29, 2024
34ab297
Call all stubs.
emmiegit Jun 29, 2024
c3cc44d
Return s3_path after upload.
emmiegit Jun 29, 2024
632eb5a
Add missing import.
emmiegit Jun 29, 2024
2f41c63
Add s3_hash to file table.
emmiegit Jun 29, 2024
42b998d
Move comment placement.
emmiegit Jun 29, 2024
061f0f4
Fix runtime issues in s3.py
emmiegit Jun 29, 2024
e25308d
Add method for adding file row.
emmiegit Jun 29, 2024
a5960cc
Use match statement for get_page_id() method.
emmiegit Jun 29, 2024
f613241
Implement file uploads.
emmiegit Jun 29, 2024
b6d2776
Remove TODO comment.
emmiegit Jun 29, 2024
9909238
Add forum tables to schema.
emmiegit Jun 29, 2024
1642d13
Allow multiple meta paths.
emmiegit Jun 29, 2024
8406d17
Add method for forum categories.
emmiegit Jun 30, 2024
3600c5c
Fix forum category metadata ingestion.
emmiegit Jun 30, 2024
6977a6d
Remove last_posted_at.
emmiegit Jun 30, 2024
6402983
Fix missing data.
emmiegit Jun 30, 2024
963cb6c
Handle missing directory.
emmiegit Jun 30, 2024
82876c5
Start implementing forum ingestion.
emmiegit Jun 30, 2024
94567d5
Fix invalid path formation bug.
emmiegit Jun 30, 2024
a5084dc
Rename forum ingestion methods.
emmiegit Jun 30, 2024
74c2c72
Update schema SQL.
emmiegit Jun 30, 2024
14f8895
Start process_post() method.
emmiegit Jun 30, 2024
cf2e416
Add edited fields to forum_post.
emmiegit Jul 4, 2024
17de203
Remove extra newline.
emmiegit Jul 4, 2024
28517e4
Implement recursive forum post data ingestion.
emmiegit Jul 4, 2024
1b72b17
Only process revision section if there's data.
emmiegit Jul 4, 2024
d3c6dce
Removed debug comment line.
emmiegit Jul 4, 2024
f85d282
Insert blob records to SQLite database.
emmiegit Jul 4, 2024
4df6c2a
Store MIME type in SQLite too.
emmiegit Jul 4, 2024
210b741
Initial addition of methods for forum wikitexts.
emmiegit Jul 4, 2024
18eecb8
Remove extra whitespace from HTML.
emmiegit Jul 4, 2024
8867a25
Fix issues.
emmiegit Jul 4, 2024
763dd84
Skip missing forum directory.
emmiegit Jul 4, 2024
8b70757
Add debug line for _users 'site'.
emmiegit Jul 4, 2024
1ecef2b
Run black formatter.
emmiegit Jul 4, 2024
8a081f6
Add explanatory note on table.
emmiegit Jul 4, 2024
9a20bf9
Update message again.
emmiegit Jul 4, 2024
514b5b6
Fix add_page_vote().
emmiegit Jul 4, 2024
0b05279
Pass in file_metadata and store it.
emmiegit Jul 5, 2024
6564e93
Change database commit order.
emmiegit Jul 5, 2024
a1f9eaa
Add plus sign to vote values.
emmiegit Jul 5, 2024
b537368
Add logic to delete/re-insert pages with multiples.
emmiegit Jul 5, 2024
d129f42
Fix comparison.
emmiegit Jul 5, 2024
e4c3dc0
Handle other case with comparison.
emmiegit Jul 5, 2024
9bcea0b
Add rows to new page_deleted table.
emmiegit Jul 5, 2024
1c22dc9
Add method for is_deleted_page().
emmiegit Jul 5, 2024
028fb16
Add logic to read and skip deleted pages.
emmiegit Jul 5, 2024
1ae7a37
Fix seed syntax.
emmiegit Jul 5, 2024
3ff01f6
Return result of is_deleted_page().
emmiegit Jul 5, 2024
11335d4
Add separate method for adding deleted pages.
emmiegit Jul 5, 2024
f6d62ca
Add deleted page for other branch.
emmiegit Jul 5, 2024
5e0af3c
Fix deletion logic.
emmiegit Jul 5, 2024
15ce770
Fix argument.
emmiegit Jul 5, 2024
c435bec
Log updated fields.
emmiegit Jul 5, 2024
273e9ae
Modify exists() method for checking database and S3.
emmiegit Jul 5, 2024
622e04e
Only use database blob check.
emmiegit Jul 5, 2024
77972a0
Add back magic, use it in case a file_metadata entry is missing.
emmiegit Jul 5, 2024
4cfdfd7
Ignore un-downloaded files.
emmiegit Jul 5, 2024
53adddc
Emit commas for lengths.
emmiegit Jul 5, 2024
fa60b17
Consume missing page for file.
emmiegit Jul 5, 2024
19823d3
Fix logger call.
emmiegit Jul 5, 2024
3998b00
Support Wikidot created by forum thread.
emmiegit Jul 6, 2024
0500c16
Fix percent type.
emmiegit Jul 6, 2024
f1d02b9
Run black formatter.
emmiegit Jul 6, 2024
a4c9b6a
Add default for missing isLocked field.
emmiegit Jul 6, 2024
59b1313
Add another default False.
emmiegit Jul 6, 2024
2eac7f0
Resolve conflict issue in add_forum_post_revision().
emmiegit Jul 7, 2024
1ad5d8b
Add handling for anonymous revisions.
emmiegit Jul 9, 2024
af85d70
Ignore missing pages when inserting wikitexts.
emmiegit Jul 10, 2024
File renamed without changes.
Empty file added deepwell/importer/__init__.py
Empty file.
106 changes: 106 additions & 0 deletions deepwell/importer/__main__.py
@@ -0,0 +1,106 @@
#!/usr/bin/env python3

import argparse
import logging
import os
import sys

from .importer import Importer
from .wikicomma_config import parse_config

LOG_FORMAT = "[%(levelname)s] %(asctime)s %(name)s: %(message)s"
LOG_DATE_FORMAT = "%Y/%m/%d %H:%M:%S"
LOG_FILENAME = "import.log"
LOG_FILE_MODE = "a"

if __name__ == "__main__":
    argparser = argparse.ArgumentParser(description="WikiComma importer")
    argparser.add_argument(
        "-q",
        "--quiet",
        "--no-stdout",
        dest="stdout",
        action="store_false",
        help="Don't output to standard out",
    )
    argparser.add_argument(
        "--log",
        dest="log_file",
        default=LOG_FILENAME,
        help="The log file to write to",
    )
    argparser.add_argument(
        "-c",
        "--config",
        dest="wikicomma_config",
        required=True,
        help="The configuration JSON that Wikicomma uses",
    )
    argparser.add_argument(
        "-d",
        "--directory",
        "--wikicomma-directory",
        dest="wikicomma_directory",
        required=True,
        help="The directory where Wikicomma data resides",
    )
    argparser.add_argument(
        "-o",
        "--sqlite",
        "--output-sqlite",
        dest="sqlite_path",
        required=True,
        help="The location to output the SQLite database to",
    )
    argparser.add_argument(
        "-D",
        "--delete-sqlite",
        action="store_true",
        help="Delete the output SQLite before starting operations",
    )
    argparser.add_argument(
        "-b",
        "--bucket",
        "--s3-bucket",
        dest="s3_bucket",
        required=True,
        help="The S3 bucket to store uploaded files in",
    )
    argparser.add_argument(
        "-P",
        "--profile",
        "--aws-profile",
        dest="aws_profile",
        required=True,
        help="The AWS profile containing the secrets",
    )
    args = argparser.parse_args()

    log_fmtr = logging.Formatter(LOG_FORMAT, datefmt=LOG_DATE_FORMAT)
    log_file = logging.FileHandler(
        filename=args.log_file,  # honor --log (defaults to LOG_FILENAME)
        encoding="utf-8",
        mode=LOG_FILE_MODE,
    )
    log_file.setFormatter(log_fmtr)

    logger = logging.getLogger(__package__)
    logger.setLevel(level=logging.DEBUG)
    logger.addHandler(log_file)

    if args.stdout:
        log_stdout = logging.StreamHandler(sys.stdout)
        log_stdout.setFormatter(log_fmtr)
        logger.addHandler(log_stdout)

    wikicomma_config = parse_config(args.wikicomma_config)

    importer = Importer(
        wikicomma_config=wikicomma_config,
        wikicomma_directory=args.wikicomma_directory,
        sqlite_path=args.sqlite_path,
        delete_sqlite=args.delete_sqlite,
        s3_bucket=args.s3_bucket,
        aws_profile=args.aws_profile,
    )
    importer.run()
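
To illustrate how the option aliases above map onto attribute names, here is a minimal sketch that rebuilds a subset of the parser and feeds it a sample argument list. The `build_parser()` helper and the `config.json` path are hypothetical and exist only for this example; the flags, `dest` names, and defaults are copied from the diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild a subset of the importer's options (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="WikiComma importer")
    # All three spellings store False into args.stdout; the default is True.
    parser.add_argument(
        "-q", "--quiet", "--no-stdout", dest="stdout", action="store_false"
    )
    parser.add_argument("--log", dest="log_file", default="import.log")
    parser.add_argument("-c", "--config", dest="wikicomma_config", required=True)
    # No explicit dest: argparse derives "delete_sqlite" from the long flag.
    parser.add_argument("-D", "--delete-sqlite", action="store_true")
    return parser


# Hypothetical argument list, used only to exercise the parser.
args = build_parser().parse_args(["--no-stdout", "-c", "config.json", "-D"])
print(args.stdout, args.log_file, args.wikicomma_config, args.delete_sqlite)
```

Since `__main__.py` uses relative imports (`from .importer import Importer`), the real script would presumably be started as a package with `python -m`, rather than by running the file directly.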