Indexing WARCs for Warclight
Now that you have your Warclight application up and running, we need to index data into it.
You'll need Java 8 to run webarchive-discovery. You can compile it from source (mvn clean install) or use the pre-compiled jar available here.
You'll also need a directory or directories of W/ARCs.
You can point webarchive-discovery at a directory of W/ARCs. Let's use the example of the #WomensMarch crawl. The -i, -n, and -u options are for institution, collection_name, and collection_number. The -s option is the URL of the Solr core to index into.
$ java -jar /path/to/warc-indexer.jar -i "Web Archives for Historical Research" -n "#WomensMarch" -u "54321" -s http://localhost:8983/solr/blacklight-core /path/to/WomensMarch/warcs/*.gz
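Before starting a long run, it can help to print the exact command and check it. The following is a minimal sketch, not part of webarchive-discovery: the names index_collection, INDEXER_JAR, SOLR_URL, and DRY_RUN are hypothetical, the paths are the same placeholders used above, and only the options already described (-i, -n, -u, -s) are passed.

```shell
#!/bin/sh
# Hypothetical wrapper around the command above. INDEXER_JAR, SOLR_URL, DRY_RUN and
# index_collection are illustrative names, not part of webarchive-discovery.
INDEXER_JAR="${INDEXER_JAR:-/path/to/warc-indexer.jar}"
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/blacklight-core}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = actually run it

index_collection() {
  institution="$1"; collection_name="$2"; collection_number="$3"
  shift 3
  # Refuse to run when no W/ARC files were passed.
  if [ "$#" -eq 0 ]; then
    echo "no W/ARC files given" >&2
    return 1
  fi
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command for review instead of starting the JVM.
    printf '%s\n' "java -jar $INDEXER_JAR -i \"$institution\" -n \"$collection_name\" -u \"$collection_number\" -s $SOLR_URL $*"
  else
    java -jar "$INDEXER_JAR" -i "$institution" -n "$collection_name" \
      -u "$collection_number" -s "$SOLR_URL" "$@"
  fi
}

index_collection "Web Archives for Historical Research" "#WomensMarch" "54321" \
  /path/to/WomensMarch/warcs/example.warc.gz
```

With DRY_RUN left at its default the script only prints the command; setting DRY_RUN=0 runs the indexer with the same arguments.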
Note: If you are indexing a large number of W/ARCs and need a different tmp path than /tmp, you can set it with the JVM option -Djava.io.tmpdir, for example -Djava.io.tmpdir=/path/to/tmp.
You can also make use of a configuration file with webarchive-discovery. We have an example available in the repo.
$ java -Djava.io.tmpdir=/tmp -jar .internal_test_gem/tmp/warc-indexer.jar -c warclight_warc-indexer.conf -i "York University Libraries" -n "Test Collection" -u "12345" -s http://localhost:8983/solr/warclight /path/to/warcs/*.gz
The output should look like:
2017-10-05 17:23:18 INFO WARCIndexer:176 - Extract text = true
2017-10-05 17:23:18 INFO WARCIndexer:179 - Store text = true
2017-10-05 17:23:18 INFO WARCIndexer:181 - hashUrlId = false
2017-10-05 17:23:18 INFO WARCIndexer:224 - Hashing & Caching thresholds are: < 10485760 in memory, < 104857600 on disk.
2017-10-05 17:23:18 INFO WARCIndexer:227 - Setting up analysers...
2017-10-05 17:23:18 INFO WARCPayloadAnalysers:80 - first_bytes config: false 32
2017-10-05 17:23:18 INFO WARCPayloadAnalysers:88 - Image feature extraction = true
2017-10-05 17:23:19 WARN ImageParser:74 - JBIG2ImageReader not loaded. jbig2 files will be ignored
2017-10-05 17:23:19 INFO TikaExtractor:118 - Config: MIME exclude list: [x-tar, x-gzip, bz, lz, compress, zip, javascript, css, octet-stream]
2017-10-05 17:23:19 INFO TikaExtractor:121 - Config: Parser timeout (ms) 300000
2017-10-05 17:23:19 INFO TikaExtractor:124 - Config: Maximum length of text to extract (characters) 524288
2017-10-05 17:23:19 INFO TikaExtractor:128 - Config: extractAllMetadata false
2017-10-05 17:23:19 INFO TikaExtractor:131 - Config: useBoilerpipe false
2017-10-05 17:23:19 INFO HTMLAnalyser:68 - HTML - Extract resource links false
2017-10-05 17:23:19 INFO HTMLAnalyser:70 - HTML - Extract host links true
2017-10-05 17:23:19 INFO HTMLAnalyser:72 - HTML - Extract domain links true
2017-10-05 17:23:19 INFO HTMLAnalyser:74 - HTML - Extract elements used true
2017-10-05 17:23:19 INFO HTMLAnalyser:76 - HTML - Extract image links true
2017-10-05 17:23:19 INFO ImageAnalyser:74 - Image - detect faces = true
2017-10-05 17:23:19 INFO ImageAnalyser:76 - Image - max size in bytes 1048576
2017-10-05 17:23:19 INFO ImageAnalyser:79 - Image sample rate 0.1
2017-10-05 17:23:19 INFO FaceDetectionParser:86 - Face detection enabled.
2017-10-05 17:23:19 INFO FaceDetectionParser:88 - Dominant colour extraction enabled.
2017-10-05 17:23:20 INFO LanguageAnalyser:65 - Constructed language analyzer with enabled = true
2017-10-05 17:23:20 INFO WARCIndexer:252 - Initialisation of WARCIndexer complete.
Parsing Archive File [1/5]:spec/fixtures/warcs/2013-steacie-hackfest-2015_01_13.warc.gz
2017-10-05 17:23:22 INFO Instrument:249 - Performance statistics
WARCIndexer#content_types(#=29, time=2076.29ms, avg=0.01#/ms 71.60ms/#, 47.99%) top 20 sort=time
WARCIndexer#content_type_served=image/gif(#=6, time=839.65ms, avg=0.01#/ms 139.94ms/#, 19.40%)
WARCIndexer#content_type_served=text/html(#=9, time=437.63ms, avg=0.02#/ms 48.63ms/#, 10.11%)
WARCIndexer#content_type_served=text/plain(#=5, time=411.63ms, avg=0.01#/ms 82.33ms/#, 9.51%)
WARCIndexer#content_type_served=image/jpeg(#=1, time=192.43ms, avg=0.01#/ms 192.43ms/#, 4.45%)
WARCIndexer#content_type_served=text/css(#=5, time=105.19ms, avg=0.05#/ms 21.04ms/#, 2.43%)
WARCIndexer#content_type_served=text/xml(#=1, time=80.98ms, avg=0.01#/ms 80.98ms/#, 1.87%)
WARCIndexer#content_type_served=image/vnd.microsoft.icon(#=1, time=7.06ms, avg=0.14#/ms 7.06ms/#, 0.16%)
WARCIndexer#content_type_served=image/x-icon(#=1, time=1.65ms, avg=0.60#/ms 1.65ms/#, 0.04%)
WARCIndexerCommand.main#total(#=0, time=0.00ms, avg=0.00#/ms 0.00ms/#, 0.00%)
WARCIndexerCommand.parseWarcFiles#startup(#=1, time=2119.54ms, avg=0.00#/ms 2119.54ms/#, 48.97%)
WARCIndexerCommand.commit#success(#=1, time=71.80ms, avg=0.01#/ms 71.80ms/#, 1.66%)
WARCIndexerCommand.parseWarcFiles#fullarcprocess(#=1, time=2204.05ms, avg=0.00#/ms 2204.05ms/#, 50.92%)
WARCIndexerCommand.parseWarcFiles#solrdocCreation(#=104, time=2110.20ms, avg=0.05#/ms 20.29ms/#, 48.75%)
SolrRecord.removeControlCharacters#total(#=2829, time=23.13ms, avg=122.32#/ms 0.01ms/#, 0.53%)
SolrRecord.sanitiseUTF8(#=2829, time=10.81ms, avg=261.81#/ms 0.00ms/#, 0.25%)
WARCIndexer.extract#total(#=29, time=2076.02ms, avg=0.01#/ms 71.59ms/#, 47.95%)
WARCIndexer.extract#archeaders(#=33, time=250.97ms, avg=0.13#/ms 7.61ms/#, 5.80%)
WARCIndexer.extract#hashstreamwrap(#=29, time=6.70ms, avg=4.33#/ms 0.23ms/#, 0.15%)
WARCIndexer.extract#analyzetikainput(#=29, time=1727.17ms, avg=0.02#/ms 59.56ms/#, 39.89%)
WARCPayloadAnalyzers.analyze#total(#=29, time=1727.05ms, avg=0.02#/ms 59.55ms/#, 39.89%)
WARCPayloadAnalyzers.analyze#tikasolrextract(#=29, time=1345.46ms, avg=0.02#/ms 46.40ms/#, 31.08%)
TikaExtractor.extract#detect(#=29, time=78.66ms, avg=0.37#/ms 2.71ms/#, 1.82%)
TikaExtractor.extract#parse(#=28, time=1248.43ms, avg=0.02#/ms 44.59ms/#, 28.83%)
TikaExtractor.extract#extract(#=28, time=13.39ms, avg=2.09#/ms 0.48ms/#, 0.31%)
WARCPayloadAnalyzers.analyze#firstbytes(#=29, time=1.56ms, avg=18.62#/ms 0.05ms/#, 0.04%)
WARCPayloadAnalyzers.analyze#droid(#=29, time=80.18ms, avg=0.36#/ms 2.76ms/#, 1.85%) top 5 sort=avgtime
WARCPayloadAnalyzers.analyze#droid_type=image/vnd.microsoft.icon(#=1, time=3.66ms, avg=0.27#/ms 3.66ms/#, 0.08%)
WARCPayloadAnalyzers.analyze#droid_type=application/xhtml+xml; version=1.0(#=8, time=27.65ms, avg=0.29#/ms 3.46ms/#, 0.64%)
WARCPayloadAnalyzers.analyze#droid_type=text/html; version=5(#=1, time=3.35ms, avg=0.30#/ms 3.35ms/#, 0.08%)
WARCPayloadAnalyzers.analyze#droid_type=application/xml; version=1.0(#=1, time=3.23ms, avg=0.31#/ms 3.23ms/#, 0.07%)
WARCPayloadAnalyzers.analyze#droid_type=application/octet-stream(#=11, time=32.81ms, avg=0.34#/ms 2.98ms/#, 0.76%)
HTMLAnalyzer.analyze#total(#=19, time=110.89ms, avg=0.17#/ms 5.84ms/#, 2.56%)
HTMLAnalyzer.analyze#parser(#=19, time=72.93ms, avg=0.26#/ms 3.84ms/#, 1.68%)
HtmlFeatureParser.parse#jsoupparse(#=19, time=54.13ms, avg=0.35#/ms 2.85ms/#, 1.25%)
HtmlFeatureParser.parse#featureextract(#=19, time=11.68ms, avg=1.63#/ms 0.61ms/#, 0.27%)
ImageAnalyzer.analyze#facesanddominant(#=1, time=188.59ms, avg=0.01#/ms 188.59ms/#, 4.36%)
TextAnalyzers#total(#=29, time=43.04ms, avg=0.67#/ms 1.48ms/#, 0.99%)
LanguageAnalyzer#total(#=15, time=27.48ms, avg=0.55#/ms 1.83ms/#, 0.63%)
PostcodeAnalyzer(#=15, time=1.29ms, avg=11.60#/ms 0.09ms/#, 0.03%)
FuzzyHashAnalyzer(#=15, time=14.12ms, avg=1.06#/ms 0.94ms/#, 0.33%)
WARCIndexerCommand.parseWarcFiles#docdelivery(#=29, time=0.31ms, avg=93.04#/ms 0.01ms/#, 0.01%)
Parsing Archive File [2/5]:spec/fixtures/warcs/YULEARN-2014_12_10.warc.gz
2017-10-05 17:23:41 INFO Instrument:249 - Performance statistics
WARCIndexerCommand.main#total(#=0, time=0.00ms, avg=0.00#/ms 0.00ms/#, 0.00%)
WARCIndexerCommand.parseWarcFiles#startup(#=1, time=2119.54ms, avg=0.00#/ms 2119.54ms/#, 9.06%)
WARCIndexerCommand.commit#success(#=2, time=75.90ms, avg=0.03#/ms 37.95ms/#, 0.32%)
WARCIndexerCommand.parseWarcFiles#fullarcprocess(#=2, time=21279.16ms, avg=0.00#/ms 10639.58ms/#, 90.91%)
WARCIndexerCommand.parseWarcFiles#solrdocCreation(#=425, time=20949.05ms, avg=0.02#/ms 49.29ms/#, 89.50%)
SolrRecord.removeControlCharacters#total(#=12978, time=58.32ms, avg=222.54#/ms 0.00ms/#, 0.25%)
SolrRecord.sanitiseUTF8(#=12978, time=22.27ms, avg=582.73#/ms 0.00ms/#, 0.10%)
WARCIndexer.extract#total(#=122, time=20864.01ms, avg=0.01#/ms 171.02ms/#, 89.14%)
WARCIndexer.extract#archeaders(#=139, time=286.66ms, avg=0.48#/ms 2.06ms/#, 1.22%)
WARCIndexer.extract#hashstreamwrap(#=122, time=197.10ms, avg=0.62#/ms 1.62ms/#, 0.84%)
WARCIndexer.extract#analyzetikainput(#=122, time=20204.82ms, avg=0.01#/ms 165.61ms/#, 86.32%)
WARCPayloadAnalyzers.analyze#total(#=122, time=20204.36ms, avg=0.01#/ms 165.61ms/#, 86.32%)
WARCPayloadAnalyzers.analyze#tikasolrextract(#=122, time=8964.22ms, avg=0.01#/ms 73.48ms/#, 38.30%)
TikaExtractor.extract#detect(#=122, time=248.04ms, avg=0.49#/ms 2.03ms/#, 1.06%)
TikaExtractor.extract#parse(#=121, time=8664.07ms, avg=0.01#/ms 71.60ms/#, 37.01%)
TikaExtractor.extract#extract(#=121, time=39.41ms, avg=3.07#/ms 0.33ms/#, 0.17%)
WARCPayloadAnalyzers.analyze#firstbytes(#=122, time=5.00ms, avg=24.40#/ms 0.04ms/#, 0.02%)
WARCPayloadAnalyzers.analyze#droid(#=122, time=2854.05ms, avg=0.04#/ms 23.39ms/#, 12.19%) top 5 sort=avgtime
WARCPayloadAnalyzers.analyze#droid_type=application/pdf; version=1.3(#=10, time=2519.16ms, avg=0.00#/ms 251.92ms/#, 10.76%)
WARCPayloadAnalyzers.analyze#droid_type=image/jpeg; version=1.02(#=21, time=87.51ms, avg=0.24#/ms 4.17ms/#, 0.37%)
WARCPayloadAnalyzers.analyze#droid_type=image/vnd.microsoft.icon(#=2, time=6.79ms, avg=0.29#/ms 3.39ms/#, 0.03%)
WARCPayloadAnalyzers.analyze#droid_type=text/html; version=5(#=1, time=3.35ms, avg=0.30#/ms 3.35ms/#, 0.01%)
WARCPayloadAnalyzers.analyze#droid_type=application/xml; version=1.0(#=1, time=3.23ms, avg=0.31#/ms 3.23ms/#, 0.01%)
HTMLAnalyzer.analyze#total(#=72, time=280.18ms, avg=0.26#/ms 3.89ms/#, 1.20%)
HTMLAnalyzer.analyze#parser(#=72, time=157.62ms, avg=0.46#/ms 2.19ms/#, 0.67%)
HtmlFeatureParser.parse#jsoupparse(#=72, time=97.52ms, avg=0.74#/ms 1.35ms/#, 0.42%)
HtmlFeatureParser.parse#featureextract(#=72, time=33.48ms, avg=2.15#/ms 0.47ms/#, 0.14%)
ImageAnalyzer.analyze#facesanddominant(#=1, time=188.59ms, avg=0.01#/ms 188.59ms/#, 0.81%)
PDFAnalyzer.analyze(#=10, time=7911.28ms, avg=0.00#/ms 791.13ms/#, 33.80%)
TextAnalyzers#total(#=122, time=94.43ms, avg=1.29#/ms 0.77ms/#, 0.40%)
LanguageAnalyzer#total(#=81, time=61.95ms, avg=1.31#/ms 0.76ms/#, 0.26%)
PostcodeAnalyzer(#=81, time=3.64ms, avg=22.26#/ms 0.04ms/#, 0.02%)
FuzzyHashAnalyzer(#=81, time=28.23ms, avg=2.87#/ms 0.35ms/#, 0.12%)
WARCIndexerCommand.parseWarcFiles#docdelivery(#=122, time=220.52ms, avg=0.55#/ms 1.81ms/#, 0.94%)
WARCIndexerCommanc.checkSubmission#solrSendBatch(#=2, time=219.48ms, avg=0.01#/ms 109.74ms/#, 0.94%)
WARCIndexer#content_types(#=122, time=20864.77ms, avg=0.01#/ms 171.02ms/#, 89.13%) top 20 sort=time
WARCIndexer#content_type_served=application/pdf(#=10, time=12700.82ms, avg=0.00#/ms 1270.08ms/#, 54.26%)
WARCIndexer#content_type_served=image/jpeg(#=21, time=4710.19ms, avg=0.00#/ms 224.29ms/#, 20.12%)
WARCIndexer#content_type_served=image/gif(#=13, time=1512.06ms, avg=0.01#/ms 116.31ms/#, 6.46%)
WARCIndexer#content_type_served=text/html(#=56, time=1127.03ms, avg=0.05#/ms 20.13ms/#, 4.81%)
WARCIndexer#content_type_served=text/plain(#=8, time=436.13ms, avg=0.02#/ms 54.52ms/#, 1.86%)
WARCIndexer#content_type_served=text/css(#=9, time=145.92ms, avg=0.06#/ms 16.21ms/#, 0.62%)
WARCIndexer#content_type_served=image/png(#=1, time=133.21ms, avg=0.01#/ms 133.21ms/#, 0.57%)
WARCIndexer#content_type_served=text/xml(#=1, time=80.98ms, avg=0.01#/ms 80.98ms/#, 0.35%)
WARCIndexer#content_type_served=text/javascript(#=1, time=9.54ms, avg=0.10#/ms 9.54ms/#, 0.04%)
WARCIndexer#content_type_served=image/vnd.microsoft.icon(#=1, time=7.06ms, avg=0.14#/ms 7.06ms/#, 0.03%)
WARCIndexer#content_type_served=image/x-icon(#=1, time=1.65ms, avg=0.60#/ms 1.65ms/#, 0.01%)
Parsing Archive File [3/5]:spec/fixtures/warcs/etig-2014_08_13.warc.gz
2017-10-05 17:24:01 INFO Instrument:249 - Performance statistics
WARCIndexerCommand.main#total(#=0, time=0.00ms, avg=0.00#/ms 0.00ms/#, 0.00%)
WARCIndexerCommand.parseWarcFiles#startup(#=1, time=2119.54ms, avg=0.00#/ms 2119.54ms/#, 4.87%)
WARCIndexerCommand.commit#success(#=3, time=81.57ms, avg=0.04#/ms 27.19ms/#, 0.19%)
WARCIndexerCommand.parseWarcFiles#fullarcprocess(#=3, time=41435.85ms, avg=0.00#/ms 13811.95ms/#, 95.11%)
WARCIndexerCommand.parseWarcFiles#solrdocCreation(#=1227, time=40822.27ms, avg=0.03#/ms 33.27ms/#, 93.70%)
SolrRecord.removeControlCharacters#total(#=38433, time=145.13ms, avg=264.82#/ms 0.00ms/#, 0.33%)
SolrRecord.sanitiseUTF8(#=38433, time=44.74ms, avg=859.09#/ms 0.00ms/#, 0.10%)
WARCIndexer.extract#total(#=327, time=40607.89ms, avg=0.01#/ms 124.18ms/#, 93.21%)
WARCIndexer.extract#archeaders(#=390, time=355.16ms, avg=1.10#/ms 0.91ms/#, 0.82%)
WARCIndexer.extract#hashstreamwrap(#=327, time=250.56ms, avg=1.31#/ms 0.77ms/#, 0.58%)
WARCIndexer.extract#analyzetikainput(#=327, time=39281.74ms, avg=0.01#/ms 120.13ms/#, 90.16%)
WARCPayloadAnalyzers.analyze#total(#=327, time=39280.65ms, avg=0.01#/ms 120.12ms/#, 90.16%)
WARCPayloadAnalyzers.analyze#tikasolrextract(#=327, time=25749.83ms, avg=0.01#/ms 78.75ms/#, 59.10%)
TikaExtractor.extract#detect(#=327, time=471.30ms, avg=0.69#/ms 1.44ms/#, 1.08%)
TikaExtractor.extract#parse(#=324, time=25155.93ms, avg=0.01#/ms 77.64ms/#, 57.74%)
TikaExtractor.extract#extract(#=324, time=97.23ms, avg=3.33#/ms 0.30ms/#, 0.22%)
WARCPayloadAnalyzers.analyze#firstbytes(#=327, time=7.67ms, avg=42.62#/ms 0.02ms/#, 0.02%)
WARCPayloadAnalyzers.analyze#droid(#=327, time=3409.95ms, avg=0.10#/ms 10.43ms/#, 7.83%) top 5 sort=avgtime
WARCPayloadAnalyzers.analyze#droid_type=application/pdf; version=1.3(#=10, time=2519.16ms, avg=0.00#/ms 251.92ms/#, 5.78%)
WARCPayloadAnalyzers.analyze#droid_type=text/html(#=1, time=6.65ms, avg=0.15#/ms 6.65ms/#, 0.02%)
WARCPayloadAnalyzers.analyze#droid_type=text/html; version=5(#=25, time=137.66ms, avg=0.18#/ms 5.51ms/#, 0.32%)
WARCPayloadAnalyzers.analyze#droid_type=image/vnd.microsoft.icon(#=4, time=20.93ms, avg=0.19#/ms 5.23ms/#, 0.05%)
WARCPayloadAnalyzers.analyze#droid_type=image/jpeg; version=1.02(#=23, time=93.62ms, avg=0.25#/ms 4.07ms/#, 0.21%)
HTMLAnalyzer.analyze#total(#=189, time=699.17ms, avg=0.27#/ms 3.70ms/#, 1.60%)
HTMLAnalyzer.analyze#parser(#=189, time=414.73ms, avg=0.46#/ms 2.19ms/#, 0.95%)
HtmlFeatureParser.parse#jsoupparse(#=189, time=285.82ms, avg=0.66#/ms 1.51ms/#, 0.66%)
HtmlFeatureParser.parse#featureextract(#=189, time=72.64ms, avg=2.60#/ms 0.38ms/#, 0.17%)
ImageAnalyzer.analyze#facesanddominant(#=9, time=1491.97ms, avg=0.01#/ms 165.77ms/#, 3.42%)
PDFAnalyzer.analyze(#=10, time=7911.28ms, avg=0.00#/ms 791.13ms/#, 18.16%)
XMLAnalyzer.analyze(#=4, time=8.42ms, avg=0.48#/ms 2.10ms/#, 0.02%)
TextAnalyzers#total(#=327, time=578.61ms, avg=0.57#/ms 1.77ms/#, 1.33%)
LanguageAnalyzer#total(#=228, time=462.17ms, avg=0.49#/ms 2.03ms/#, 1.06%)
PostcodeAnalyzer(#=228, time=19.55ms, avg=11.66#/ms 0.09ms/#, 0.04%)
FuzzyHashAnalyzer(#=228, time=95.15ms, avg=2.40#/ms 0.42ms/#, 0.22%)
WARCIndexerCommand.parseWarcFiles#docdelivery(#=327, time=480.04ms, avg=0.68#/ms 1.47ms/#, 1.10%)
WARCIndexerCommanc.checkSubmission#solrSendBatch(#=6, time=477.56ms, avg=0.01#/ms 79.59ms/#, 1.10%)
WARCIndexer#content_types(#=327, time=40610.08ms, avg=0.01#/ms 124.19ms/#, 93.21%) top 20 sort=time
WARCIndexer#content_type_served=image/jpeg(#=34, time=18883.32ms, avg=0.00#/ms 555.39ms/#, 43.34%)
WARCIndexer#content_type_served=application/pdf(#=10, time=12700.82ms, avg=0.00#/ms 1270.08ms/#, 29.15%)
WARCIndexer#content_type_served=image/gif(#=29, time=2950.86ms, avg=0.01#/ms 101.75ms/#, 6.77%)
WARCIndexer#content_type_served=text/html(#=132, time=2904.28ms, avg=0.05#/ms 22.00ms/#, 6.67%)
WARCIndexer#content_type_served=image/png(#=12, time=1725.54ms, avg=0.01#/ms 143.80ms/#, 3.96%)
WARCIndexer#content_type_served=text/plain(#=47, time=627.03ms, avg=0.07#/ms 13.34ms/#, 1.44%)
WARCIndexer#content_type_served=text/xml(#=36, time=399.84ms, avg=0.09#/ms 11.11ms/#, 0.92%)
WARCIndexer#content_type_served=text/css(#=15, time=193.23ms, avg=0.08#/ms 12.88ms/#, 0.44%)
WARCIndexer#content_type_served=image/vnd.microsoft.icon(#=3, time=150.12ms, avg=0.02#/ms 50.04ms/#, 0.34%)
WARCIndexer#content_type_served=application/atom+xml(#=1, time=26.69ms, avg=0.04#/ms 26.69ms/#, 0.06%)
WARCIndexer#content_type_served=text/javascript(#=2, time=17.33ms, avg=0.12#/ms 8.66ms/#, 0.04%)
WARCIndexer#content_type_served=image/x-icon(#=3, time=16.18ms, avg=0.19#/ms 5.39ms/#, 0.04%)
WARCIndexer#content_type_served=application/x-javascript(#=2, time=9.63ms, avg=0.21#/ms 4.82ms/#, 0.02%)
WARCIndexer#content_type_served=application/x-shockwave-flash(#=1, time=4.83ms, avg=0.21#/ms 4.83ms/#, 0.01%)
Parsing Archive File [4/5]:spec/fixtures/warcs/library_research_roadmap-2014_11_28.warc.gz
2017-10-05 17:25:27 INFO Instrument:249 - Performance statistics
WARCIndexerCommand.main#total(#=0, time=0.00ms, avg=0.00#/ms 0.00ms/#, 0.00%)
WARCIndexerCommand.parseWarcFiles#startup(#=1, time=2119.54ms, avg=0.00#/ms 2119.54ms/#, 1.64%)
WARCIndexerCommand.commit#success(#=4, time=219.48ms, avg=0.02#/ms 54.87ms/#, 0.17%)
WARCIndexerCommand.parseWarcFiles#fullarcprocess(#=4, time=127494.96ms, avg=0.00#/ms 31873.74ms/#, 98.35%)
WARCIndexerCommand.parseWarcFiles#solrdocCreation(#=2774, time=126562.27ms, avg=0.02#/ms 45.62ms/#, 97.63%)
SolrRecord.removeControlCharacters#total(#=79298, time=213.40ms, avg=371.60#/ms 0.00ms/#, 0.16%)
SolrRecord.sanitiseUTF8(#=79298, time=62.32ms, avg=1272.34#/ms 0.00ms/#, 0.05%)
WARCIndexer.extract#total(#=834, time=126176.50ms, avg=0.01#/ms 151.29ms/#, 97.34%)
WARCIndexer.extract#archeaders(#=905, time=458.20ms, avg=1.98#/ms 0.51ms/#, 0.35%)
WARCIndexer.extract#hashstreamwrap(#=834, time=325.18ms, avg=2.56#/ms 0.39ms/#, 0.25%)
WARCIndexer.extract#analyzetikainput(#=834, time=124492.27ms, avg=0.01#/ms 149.27ms/#, 96.04%)
WARCPayloadAnalyzers.analyze#total(#=834, time=124490.21ms, avg=0.01#/ms 149.27ms/#, 96.04%)
WARCPayloadAnalyzers.analyze#tikasolrextract(#=834, time=105110.56ms, avg=0.01#/ms 126.03ms/#, 81.09%)
TikaExtractor.extract#detect(#=834, time=800.81ms, avg=1.04#/ms 0.96ms/#, 0.62%)
TikaExtractor.extract#parse(#=830, time=104134.99ms, avg=0.01#/ms 125.46ms/#, 80.33%)
TikaExtractor.extract#extract(#=830, time=122.83ms, avg=6.76#/ms 0.15ms/#, 0.09%)
WARCPayloadAnalyzers.analyze#firstbytes(#=834, time=11.84ms, avg=70.41#/ms 0.01ms/#, 0.01%)
WARCPayloadAnalyzers.analyze#droid(#=834, time=4558.91ms, avg=0.18#/ms 5.47ms/#, 3.52%) top 5 sort=avgtime
WARCPayloadAnalyzers.analyze#droid_type=application/pdf; version=1.3(#=10, time=2519.16ms, avg=0.00#/ms 251.92ms/#, 1.94%)
WARCPayloadAnalyzers.analyze#droid_type=application/x-puid-fmt-682; name="Thumbs DB file"; version=XP(#=3, time=76.48ms, avg=0.04#/ms 25.49ms/#, 0.06%)
WARCPayloadAnalyzers.analyze#droid_type=image/vnd.adobe.photoshop(#=1, time=8.50ms, avg=0.12#/ms 8.50ms/#, 0.01%)
WARCPayloadAnalyzers.analyze#droid_type=application/x-puid-x-fmt-234; name="Paint Shop Pro Image"; version=5.0(#=1, time=6.59ms, avg=0.15#/ms 6.59ms/#, 0.01%)
WARCPayloadAnalyzers.analyze#droid_type=text/html; version=5(#=25, time=137.66ms, avg=0.18#/ms 5.51ms/#, 0.11%)
HTMLAnalyzer.analyze#total(#=331, time=847.20ms, avg=0.39#/ms 2.56ms/#, 0.65%)
HTMLAnalyzer.analyze#parser(#=331, time=502.98ms, avg=0.66#/ms 1.52ms/#, 0.39%)
HtmlFeatureParser.parse#jsoupparse(#=331, time=320.59ms, avg=1.03#/ms 0.97ms/#, 0.25%)
HtmlFeatureParser.parse#featureextract(#=331, time=106.78ms, avg=3.10#/ms 0.32ms/#, 0.08%)
ImageAnalyzer.analyze#facesanddominant(#=36, time=6036.88ms, avg=0.01#/ms 167.69ms/#, 4.66%)
PDFAnalyzer.analyze(#=10, time=7911.28ms, avg=0.00#/ms 791.13ms/#, 6.10%)
XMLAnalyzer.analyze(#=4, time=8.42ms, avg=0.48#/ms 2.10ms/#, 0.01%)
TextAnalyzers#total(#=834, time=668.60ms, avg=1.25#/ms 0.80ms/#, 0.52%)
LanguageAnalyzer#total(#=387, time=524.84ms, avg=0.74#/ms 1.36ms/#, 0.40%)
PostcodeAnalyzer(#=387, time=23.99ms, avg=16.13#/ms 0.06ms/#, 0.02%)
FuzzyHashAnalyzer(#=387, time=117.07ms, avg=3.31#/ms 0.30ms/#, 0.09%)
WARCIndexerCommand.parseWarcFiles#docdelivery(#=834, time=637.14ms, avg=1.31#/ms 0.76ms/#, 0.49%)
WARCIndexerCommanc.checkSubmission#solrSendBatch(#=16, time=631.91ms, avg=0.03#/ms 39.49ms/#, 0.49%)
WARCIndexer#content_types(#=834, time=126180.51ms, avg=0.01#/ms 151.30ms/#, 97.34%) top 20 sort=time
WARCIndexer#content_type_served=image/png(#=38, time=38129.61ms, avg=0.00#/ms 1003.41ms/#, 29.41%)
WARCIndexer#content_type_served=image/jpeg(#=65, time=35024.03ms, avg=0.00#/ms 538.83ms/#, 27.02%)
WARCIndexer#content_type_served=image/gif(#=340, time=34797.18ms, avg=0.01#/ms 102.34ms/#, 26.84%)
WARCIndexer#content_type_served=application/pdf(#=10, time=12700.82ms, avg=0.00#/ms 1270.08ms/#, 9.80%)
WARCIndexer#content_type_served=text/html(#=260, time=3918.95ms, avg=0.07#/ms 15.07ms/#, 3.02%)
WARCIndexer#content_type_served=text/plain(#=54, time=773.47ms, avg=0.07#/ms 14.32ms/#, 0.60%)
WARCIndexer#content_type_served=text/xml(#=36, time=399.84ms, avg=0.09#/ms 11.11ms/#, 0.31%)
WARCIndexer#content_type_served=text/css(#=17, time=201.74ms, avg=0.08#/ms 11.87ms/#, 0.16%)
WARCIndexer#content_type_served=image/vnd.microsoft.icon(#=3, time=150.12ms, avg=0.02#/ms 50.04ms/#, 0.12%)
WARCIndexer#content_type_served=application/atom+xml(#=1, time=26.69ms, avg=0.04#/ms 26.69ms/#, 0.02%)
WARCIndexer#content_type_served=application/x-javascript(#=4, time=19.00ms, avg=0.21#/ms 4.75ms/#, 0.01%)
WARCIndexer#content_type_served=text/javascript(#=2, time=17.33ms, avg=0.12#/ms 8.66ms/#, 0.01%)
WARCIndexer#content_type_served=image/x-icon(#=3, time=16.18ms, avg=0.19#/ms 5.39ms/#, 0.01%)
WARCIndexer#content_type_served=application/x-shockwave-flash(#=1, time=4.83ms, avg=0.21#/ms 4.83ms/#, 0.00%)
Parsing Archive File [5/5]:spec/fixtures/warcs/test.warc.gz
WARC Indexer Finished in 129.779 seconds.
2017-10-05 17:25:27 INFO Instrument:249 - Performance statistics
WARCIndexerCommand.main#total(#=1, time=129782.70ms, avg=0.00#/ms 129782.70ms/#, 100.00%)
WARCIndexerCommand.parseWarcFiles#startup(#=1, time=2119.54ms, avg=0.00#/ms 2119.54ms/#, 1.63%)
WARCIndexerCommand.commit#success(#=6, time=347.08ms, avg=0.02#/ms 57.85ms/#, 0.27%)
WARCIndexerCommand.parseWarcFiles#fullarcprocess(#=5, time=127567.59ms, avg=0.00#/ms 25513.52ms/#, 98.29%)
WARCIndexerCommand.parseWarcFiles#solrdocCreation(#=2780, time=126568.94ms, avg=0.02#/ms 45.53ms/#, 97.52%)
SolrRecord.removeControlCharacters#total(#=79426, time=213.56ms, avg=371.91#/ms 0.00ms/#, 0.16%)
SolrRecord.sanitiseUTF8(#=79426, time=62.37ms, avg=1273.55#/ms 0.00ms/#, 0.05%)
WARCIndexer.extract#total(#=835, time=126182.51ms, avg=0.01#/ms 151.12ms/#, 97.23%)
WARCIndexer.extract#archeaders(#=906, time=458.35ms, avg=1.98#/ms 0.51ms/#, 0.35%)
WARCIndexer.extract#hashstreamwrap(#=835, time=325.24ms, avg=2.57#/ms 0.39ms/#, 0.25%)
WARCIndexer.extract#analyzetikainput(#=835, time=124497.57ms, avg=0.01#/ms 149.10ms/#, 95.93%)
WARCPayloadAnalyzers.analyze#total(#=835, time=124495.51ms, avg=0.01#/ms 149.10ms/#, 95.93%)
WARCPayloadAnalyzers.analyze#tikasolrextract(#=835, time=105113.68ms, avg=0.01#/ms 125.88ms/#, 80.99%)
TikaExtractor.extract#detect(#=835, time=802.16ms, avg=1.04#/ms 0.96ms/#, 0.62%)
TikaExtractor.extract#parse(#=831, time=104136.66ms, avg=0.01#/ms 125.31ms/#, 80.24%)
TikaExtractor.extract#extract(#=831, time=122.90ms, avg=6.76#/ms 0.15ms/#, 0.09%)
WARCPayloadAnalyzers.analyze#firstbytes(#=835, time=11.85ms, avg=70.46#/ms 0.01ms/#, 0.01%)
WARCPayloadAnalyzers.analyze#droid(#=835, time=4560.54ms, avg=0.18#/ms 5.46ms/#, 3.51%) top 5 sort=avgtime
WARCPayloadAnalyzers.analyze#droid_type=application/pdf; version=1.3(#=10, time=2519.16ms, avg=0.00#/ms 251.92ms/#, 1.94%)
WARCPayloadAnalyzers.analyze#droid_type=application/x-puid-fmt-682; name="Thumbs DB file"; version=XP(#=3, time=76.48ms, avg=0.04#/ms 25.49ms/#, 0.06%)
WARCPayloadAnalyzers.analyze#droid_type=image/vnd.adobe.photoshop(#=1, time=8.50ms, avg=0.12#/ms 8.50ms/#, 0.01%)
WARCPayloadAnalyzers.analyze#droid_type=application/x-puid-x-fmt-234; name="Paint Shop Pro Image"; version=5.0(#=1, time=6.59ms, avg=0.15#/ms 6.59ms/#, 0.01%)
WARCPayloadAnalyzers.analyze#droid_type=text/html; version=5(#=26, time=139.28ms, avg=0.19#/ms 5.36ms/#, 0.11%)
HTMLAnalyzer.analyze#total(#=332, time=847.74ms, avg=0.39#/ms 2.55ms/#, 0.65%)
HTMLAnalyzer.analyze#parser(#=332, time=503.30ms, avg=0.66#/ms 1.52ms/#, 0.39%)
HtmlFeatureParser.parse#jsoupparse(#=332, time=320.74ms, avg=1.04#/ms 0.97ms/#, 0.25%)
HtmlFeatureParser.parse#featureextract(#=332, time=106.84ms, avg=3.11#/ms 0.32ms/#, 0.08%)
ImageAnalyzer.analyze#facesanddominant(#=36, time=6036.88ms, avg=0.01#/ms 167.69ms/#, 4.65%)
PDFAnalyzer.analyze(#=10, time=7911.28ms, avg=0.00#/ms 791.13ms/#, 6.10%)
XMLAnalyzer.analyze(#=4, time=8.42ms, avg=0.48#/ms 2.10ms/#, 0.01%)
TextAnalyzers#total(#=835, time=668.88ms, avg=1.25#/ms 0.80ms/#, 0.52%)
LanguageAnalyzer#total(#=388, time=525.05ms, avg=0.74#/ms 1.35ms/#, 0.40%)
PostcodeAnalyzer(#=388, time=24.00ms, avg=16.17#/ms 0.06ms/#, 0.02%)
FuzzyHashAnalyzer(#=388, time=117.13ms, avg=3.31#/ms 0.30ms/#, 0.09%)
WARCIndexerCommand.parseWarcFiles#docdelivery(#=835, time=637.15ms, avg=1.31#/ms 0.76ms/#, 0.49%)
WARCIndexerCommanc.checkSubmission#solrSendBatch(#=17, time=648.34ms, avg=0.03#/ms 38.14ms/#, 0.50%)
WARCIndexer#content_types(#=835, time=126186.53ms, avg=0.01#/ms 151.12ms/#, 97.23%) top 20 sort=time
WARCIndexer#content_type_served=image/png(#=38, time=38129.61ms, avg=0.00#/ms 1003.41ms/#, 29.38%)
WARCIndexer#content_type_served=image/jpeg(#=65, time=35024.03ms, avg=0.00#/ms 538.83ms/#, 26.99%)
WARCIndexer#content_type_served=image/gif(#=340, time=34797.18ms, avg=0.01#/ms 102.34ms/#, 26.81%)
WARCIndexer#content_type_served=application/pdf(#=10, time=12700.82ms, avg=0.00#/ms 1270.08ms/#, 9.79%)
WARCIndexer#content_type_served=text/html(#=261, time=3924.97ms, avg=0.07#/ms 15.04ms/#, 3.02%)
WARCIndexer#content_type_served=text/plain(#=54, time=773.47ms, avg=0.07#/ms 14.32ms/#, 0.60%)
WARCIndexer#content_type_served=text/xml(#=36, time=399.84ms, avg=0.09#/ms 11.11ms/#, 0.31%)
WARCIndexer#content_type_served=text/css(#=17, time=201.74ms, avg=0.08#/ms 11.87ms/#, 0.16%)
WARCIndexer#content_type_served=image/vnd.microsoft.icon(#=3, time=150.12ms, avg=0.02#/ms 50.04ms/#, 0.12%)
WARCIndexer#content_type_served=application/atom+xml(#=1, time=26.69ms, avg=0.04#/ms 26.69ms/#, 0.02%)
WARCIndexer#content_type_served=application/x-javascript(#=4, time=19.00ms, avg=0.21#/ms 4.75ms/#, 0.01%)
WARCIndexer#content_type_served=text/javascript(#=2, time=17.33ms, avg=0.12#/ms 8.66ms/#, 0.01%)
WARCIndexer#content_type_served=image/x-icon(#=3, time=16.18ms, avg=0.19#/ms 5.39ms/#, 0.01%)
WARCIndexer#content_type_served=application/x-shockwave-flash(#=1, time=4.83ms, avg=0.21#/ms 4.83ms/#, 0.00%)
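If you save the indexer output to a file, a quick check of the run is to count the "Parsing Archive File" lines and look for the final "WARC Indexer Finished" line. This is a small sketch under that assumption; summarize_log is a hypothetical helper, and the patterns are taken only from the output shown above, so they may need adjusting for other versions of the indexer.

```shell
#!/bin/sh
# summarize_log is a hypothetical helper. It assumes the indexer output was saved to a
# file and contains the "Parsing Archive File" and "WARC Indexer Finished" lines shown above.
summarize_log() {
  log="$1"
  # Number of W/ARCs the indexer started to parse.
  files=$(grep -c '^Parsing Archive File \[' "$log")
  # Total run time in seconds, empty if the final line is missing.
  finished=$(sed -n 's/^WARC Indexer Finished in \([0-9.]*\) seconds\.$/\1/p' "$log")
  printf 'archive files started: %s\n' "$files"
  if [ -n "$finished" ]; then
    printf 'finished in: %s seconds\n' "$finished"
  else
    printf 'no "WARC Indexer Finished" line found; the run may not have completed\n'
  fi
}

# Example with a few lines copied from the output above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Parsing Archive File [4/5]:spec/fixtures/warcs/library_research_roadmap-2014_11_28.warc.gz
Parsing Archive File [5/5]:spec/fixtures/warcs/test.warc.gz
WARC Indexer Finished in 129.779 seconds.
EOF
summarize_log "$sample"
rm -f "$sample"
```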
If you do not have the ability to take advantage of the Hadoop functionality with webarchive-discovery, you can use GNU Parallel or xargs to batch index. Using the example from above, the command would look like this:
parallel (24 jobs):
$ time find /data/401/3490/warcs -iname "*.gz" -type f | parallel --jobs 24 --gnu "java -Xmx5g -Djava.io.tmpdir=/mnt/tmp -jar /home/ubuntu/warc-indexer.jar -d -c /home/ubuntu/warclight.conf -i 'University of Alberta Libraries' -n 'Idle No More' -u '3490' -s http://192.168.32.35:8983/solr/ualberta {} > $(basename {})-24-05.log"
xargs (44 jobs):
$ time (find /tuna1/scratch/nruest/geocities/warcs -iname "*.gz" -type f -print0 | xargs -0 -P 44 -n 1 -I {} bash -c 'java -Xmx2g -Djava.io.tmpdir=/tuna1/scratch/nruest/tmp -jar /home/ruestn/warc-indexer.jar -d -c /home/ruestn/warclight_annotation.conf -s http://192.168.32.35:8983/solr/geocities "{}" > /tuna1/scratch/nruest/logs/$(basename {}).log' );
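Both commands start one JVM per W/ARC and write one log per file, so it is worth previewing which files will be picked up and where each log will go before launching dozens of jobs. The sketch below does only that and never calls the indexer. preview_jobs and the directory names are hypothetical, and it assumes file names without newlines.

```shell
#!/bin/sh
# Hypothetical preview of a batch run: lists each W/ARC that the find pattern above
# would match, together with the log file its job would write. Nothing is indexed.
preview_jobs() {
  warc_dir="$1"; log_dir="$2"
  # Same find filter as the batch commands above; sort makes the listing stable.
  find "$warc_dir" -iname '*.gz' -type f | sort | while IFS= read -r warc; do
    printf '%s -> %s/%s.log\n' "$warc" "$log_dir" "$(basename "$warc")"
  done
}

# Example on a throwaway directory with two empty placeholder files.
demo=$(mktemp -d)
mkdir -p "$demo/warcs"
: > "$demo/warcs/a.warc.gz"
: > "$demo/warcs/b.warc.gz"
preview_jobs "$demo/warcs" /path/to/logs
rm -rf "$demo"
```

Counting the lines of the preview (for example with wc -l) also gives a rough idea of how many jobs the batch will run.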
Depending on your Solr setup, you might want to use the -d (disable-commit) option. Without it, the indexer commits after each WARC, which opens a new searcher for every WARC processed, and that could be many per minute. If you do use this option, make sure to run curl "http://mysolrcloud:8983/solr/update?commit=true&openSearcher=true" at the end of a job to force a commit. For more information, see this webarchive-discovery GitHub issue.
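When batch jobs run with -d, the final commit is easy to forget, so it can be scripted as the last step. This is a minimal sketch: commit_url and force_commit are hypothetical helpers, and the host and update path are the ones from the curl example above, which you should replace with your own Solr URL.

```shell
#!/bin/sh
# Hypothetical helpers for the final commit after a batch run with -d.
# The base URL below is the placeholder from the curl example above.
commit_url() {
  # Strip one trailing slash, then append the update parameters shown above.
  printf '%s/update?commit=true&openSearcher=true\n' "${1%/}"
}

force_commit() {
  # -f returns a non-zero status on HTTP errors; -sS hides progress but still shows errors.
  curl -fsS "$(commit_url "$1")"
}

# Print the URL that would be requested.
commit_url "http://mysolrcloud:8983/solr"

# After the parallel or xargs batch has finished:
# force_commit "http://mysolrcloud:8983/solr"
```

Keeping the request in its own function means the batch script can check its exit status and report a failed commit instead of ending silently.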
This work is primarily supported by the Andrew W. Mellon Foundation. Other financial and in-kind support comes from the Social Sciences and Humanities Research Council, Compute Canada, the Ontario Ministry of Research, Innovation, and Science, York University Libraries, Start Smart Labs, and the Faculty of Arts and David R. Cheriton School of Computer Science at the University of Waterloo.
Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.
This project drew inspiration from Arclight and UKWA's Shine, and we would like to thank their creators and contributors.