This repository has been archived by the owner on Jun 13, 2023. It is now read-only.

Indexing WARCs for Warclight

Nick Ruest edited this page Oct 6, 2020 · 1 revision

Introduction

Now that your Warclight application is up and running, you need to index data into it.

Requirements

You'll need Java 8 to run webarchive-discovery. You can compile it from source (mvn clean install) or use the pre-compiled jar available here.
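Before building or downloading the jar, it can help to confirm which Java is on the PATH. The following is a minimal sketch, not part of webarchive-discovery: is_java8 is a hypothetical helper that relies on Java 8 reporting its version as 1.8.x.

```shell
# Hypothetical helper: succeeds if the given `java -version` output reports Java 8.
# Java 8 identifies itself as version "1.8.x" (for example "1.8.0_292").
is_java8() {
  case "$1" in
    *'version "1.8.'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (java -version prints to standard error, hence 2>&1):
#   is_java8 "$(java -version 2>&1)" || echo "Java 8 not found on PATH" >&2
```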

You'll also need a directory or directories of W/ARCs.

Indexing

You can point webarchive-discovery at a directory of W/ARCs. Let's use the example of the #WomensMarch crawl. The -i, -n, and -u options set the institution, collection_name, and collection_number; -s is the URL of the Solr core to index into.

$ java -jar /path/to/warc-indexer.jar -i "Web Archives for Historical Research" -n "#WomensMarch" -u "54321" -s http://localhost:8983/solr/blacklight-core /path/to/WomensMarch/warcs/*.gz
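If you index the same collection repeatedly, the options can be kept in one place. The sketch below is illustrative only: the variable names, the index_warcs function, and the DRY_RUN switch are assumptions, not part of webarchive-discovery. The values are the ones from the command above.

```shell
# Illustrative settings; each maps to one warc-indexer option.
INDEXER_JAR="/path/to/warc-indexer.jar"
INSTITUTION="Web Archives for Historical Research"     # -i
COLLECTION_NAME="#WomensMarch"                         # -n
COLLECTION_NUMBER="54321"                              # -u
SOLR_URL="http://localhost:8983/solr/blacklight-core"  # -s

# Hypothetical wrapper: index the W/ARC files given as arguments.
# With DRY_RUN=1 it prints the command instead of running it.
index_warcs() {
  # Put the fixed options in front of the file paths.
  set -- java -jar "$INDEXER_JAR" -i "$INSTITUTION" -n "$COLLECTION_NAME" \
    -u "$COLLECTION_NUMBER" -s "$SOLR_URL" "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

Calling index_warcs /path/to/WomensMarch/warcs/*.gz with DRY_RUN=1 set prints the assembled command for inspection (without shell quoting); without DRY_RUN it runs the indexer.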

Note: If you are indexing a large number of W/ARCs and need a temporary directory other than /tmp, you can set it with -Djava.io.tmpdir=/path/to/tmp.
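As a sketch of how that option might be wired into a script, the hypothetical helper below creates the scratch directory if needed, checks that it is writable, and prints the JVM option to pass to java. The helper name and the /mnt/tmp path in the usage comment are assumptions for illustration.

```shell
# Hypothetical helper: print the -Djava.io.tmpdir option for a scratch directory,
# creating the directory first and failing if it cannot be written to.
tmpdir_opt() {
  mkdir -p "$1" || return 1
  if [ ! -w "$1" ]; then
    echo "scratch directory is not writable: $1" >&2
    return 1
  fi
  printf '%s\n' "-Djava.io.tmpdir=$1"
}

# Usage (paths are placeholders):
#   java "$(tmpdir_opt /mnt/tmp)" -jar /path/to/warc-indexer.jar -s http://localhost:8983/solr/blacklight-core /path/to/warcs/*.gz
```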

You can also make use of a configuration file with webarchive-discovery. We have an example available in the repo.

$ java -Djava.io.tmpdir=/tmp -jar .internal_test_gem/tmp/warc-indexer.jar -c warclight_warc-indexer.conf -i "York University Libraries" -n "Test Collection" -u "12345" -s http://localhost:8983/solr/warclight /path/to/warcs/*.gz

The output should look like:

2017-10-05 17:23:18 INFO  WARCIndexer:176 - Extract text = true
2017-10-05 17:23:18 INFO  WARCIndexer:179 - Store text = true
2017-10-05 17:23:18 INFO  WARCIndexer:181 - hashUrlId = false
2017-10-05 17:23:18 INFO  WARCIndexer:224 - Hashing & Caching thresholds are: < 10485760 in memory, < 104857600 on disk.
2017-10-05 17:23:18 INFO  WARCIndexer:227 - Setting up analysers...
2017-10-05 17:23:18 INFO  WARCPayloadAnalysers:80 - first_bytes config: false 32
2017-10-05 17:23:18 INFO  WARCPayloadAnalysers:88 - Image feature extraction = true
2017-10-05 17:23:19 WARN  ImageParser:74 - JBIG2ImageReader not loaded. jbig2 files will be ignored
2017-10-05 17:23:19 INFO  TikaExtractor:118 - Config: MIME exclude list: [x-tar, x-gzip, bz, lz, compress, zip, javascript, css, octet-stream]
2017-10-05 17:23:19 INFO  TikaExtractor:121 - Config: Parser timeout (ms) 300000
2017-10-05 17:23:19 INFO  TikaExtractor:124 - Config: Maximum length of text to extract (characters) 524288
2017-10-05 17:23:19 INFO  TikaExtractor:128 - Config: extractAllMetadata false
2017-10-05 17:23:19 INFO  TikaExtractor:131 - Config: useBoilerpipe false
2017-10-05 17:23:19 INFO  HTMLAnalyser:68 - HTML - Extract resource links false
2017-10-05 17:23:19 INFO  HTMLAnalyser:70 - HTML - Extract host links true
2017-10-05 17:23:19 INFO  HTMLAnalyser:72 - HTML - Extract domain links true
2017-10-05 17:23:19 INFO  HTMLAnalyser:74 - HTML - Extract elements used true
2017-10-05 17:23:19 INFO  HTMLAnalyser:76 - HTML - Extract image links true
2017-10-05 17:23:19 INFO  ImageAnalyser:74 - Image - detect faces = true
2017-10-05 17:23:19 INFO  ImageAnalyser:76 - Image - max size in bytes 1048576
2017-10-05 17:23:19 INFO  ImageAnalyser:79 - Image sample rate 0.1
2017-10-05 17:23:19 INFO  FaceDetectionParser:86 - Face detection enabled.
2017-10-05 17:23:19 INFO  FaceDetectionParser:88 - Dominant colour extraction enabled.
2017-10-05 17:23:20 INFO  LanguageAnalyser:65 - Constructed language analyzer with enabled = true
2017-10-05 17:23:20 INFO  WARCIndexer:252 - Initialisation of WARCIndexer complete.
Parsing Archive File [1/5]:spec/fixtures/warcs/2013-steacie-hackfest-2015_01_13.warc.gz
2017-10-05 17:23:22 INFO  Instrument:249 - Performance statistics
WARCIndexer#content_types(#=29, time=2076.29ms, avg=0.01#/ms 71.60ms/#, 47.99%) top 20 sort=time
  WARCIndexer#content_type_served=image/gif(#=6, time=839.65ms, avg=0.01#/ms 139.94ms/#, 19.40%)
  WARCIndexer#content_type_served=text/html(#=9, time=437.63ms, avg=0.02#/ms 48.63ms/#, 10.11%)
  WARCIndexer#content_type_served=text/plain(#=5, time=411.63ms, avg=0.01#/ms 82.33ms/#, 9.51%)
  WARCIndexer#content_type_served=image/jpeg(#=1, time=192.43ms, avg=0.01#/ms 192.43ms/#, 4.45%)
  WARCIndexer#content_type_served=text/css(#=5, time=105.19ms, avg=0.05#/ms 21.04ms/#, 2.43%)
  WARCIndexer#content_type_served=text/xml(#=1, time=80.98ms, avg=0.01#/ms 80.98ms/#, 1.87%)
  WARCIndexer#content_type_served=image/vnd.microsoft.icon(#=1, time=7.06ms, avg=0.14#/ms 7.06ms/#, 0.16%)
  WARCIndexer#content_type_served=image/x-icon(#=1, time=1.65ms, avg=0.60#/ms 1.65ms/#, 0.04%)
WARCIndexerCommand.main#total(#=0, time=0.00ms, avg=0.00#/ms 0.00ms/#, 0.00%)
  WARCIndexerCommand.parseWarcFiles#startup(#=1, time=2119.54ms, avg=0.00#/ms 2119.54ms/#, 48.97%)
  WARCIndexerCommand.commit#success(#=1, time=71.80ms, avg=0.01#/ms 71.80ms/#, 1.66%)
  WARCIndexerCommand.parseWarcFiles#fullarcprocess(#=1, time=2204.05ms, avg=0.00#/ms 2204.05ms/#, 50.92%)
    WARCIndexerCommand.parseWarcFiles#solrdocCreation(#=104, time=2110.20ms, avg=0.05#/ms 20.29ms/#, 48.75%)
      SolrRecord.removeControlCharacters#total(#=2829, time=23.13ms, avg=122.32#/ms 0.01ms/#, 0.53%)
        SolrRecord.sanitiseUTF8(#=2829, time=10.81ms, avg=261.81#/ms 0.00ms/#, 0.25%)
      WARCIndexer.extract#total(#=29, time=2076.02ms, avg=0.01#/ms 71.59ms/#, 47.95%)
        WARCIndexer.extract#archeaders(#=33, time=250.97ms, avg=0.13#/ms 7.61ms/#, 5.80%)
        WARCIndexer.extract#hashstreamwrap(#=29, time=6.70ms, avg=4.33#/ms 0.23ms/#, 0.15%)
        WARCIndexer.extract#analyzetikainput(#=29, time=1727.17ms, avg=0.02#/ms 59.56ms/#, 39.89%)
          WARCPayloadAnalyzers.analyze#total(#=29, time=1727.05ms, avg=0.02#/ms 59.55ms/#, 39.89%)
            WARCPayloadAnalyzers.analyze#tikasolrextract(#=29, time=1345.46ms, avg=0.02#/ms 46.40ms/#, 31.08%)
              TikaExtractor.extract#detect(#=29, time=78.66ms, avg=0.37#/ms 2.71ms/#, 1.82%)
              TikaExtractor.extract#parse(#=28, time=1248.43ms, avg=0.02#/ms 44.59ms/#, 28.83%)
              TikaExtractor.extract#extract(#=28, time=13.39ms, avg=2.09#/ms 0.48ms/#, 0.31%)
            WARCPayloadAnalyzers.analyze#firstbytes(#=29, time=1.56ms, avg=18.62#/ms 0.05ms/#, 0.04%)
            WARCPayloadAnalyzers.analyze#droid(#=29, time=80.18ms, avg=0.36#/ms 2.76ms/#, 1.85%) top 5 sort=avgtime
              WARCPayloadAnalyzers.analyze#droid_type=image/vnd.microsoft.icon(#=1, time=3.66ms, avg=0.27#/ms 3.66ms/#, 0.08%)
              WARCPayloadAnalyzers.analyze#droid_type=application/xhtml+xml; version=1.0(#=8, time=27.65ms, avg=0.29#/ms 3.46ms/#, 0.64%)
              WARCPayloadAnalyzers.analyze#droid_type=text/html; version=5(#=1, time=3.35ms, avg=0.30#/ms 3.35ms/#, 0.08%)
              WARCPayloadAnalyzers.analyze#droid_type=application/xml; version=1.0(#=1, time=3.23ms, avg=0.31#/ms 3.23ms/#, 0.07%)
              WARCPayloadAnalyzers.analyze#droid_type=application/octet-stream(#=11, time=32.81ms, avg=0.34#/ms 2.98ms/#, 0.76%)
            HTMLAnalyzer.analyze#total(#=19, time=110.89ms, avg=0.17#/ms 5.84ms/#, 2.56%)
              HTMLAnalyzer.analyze#parser(#=19, time=72.93ms, avg=0.26#/ms 3.84ms/#, 1.68%)
                HtmlFeatureParser.parse#jsoupparse(#=19, time=54.13ms, avg=0.35#/ms 2.85ms/#, 1.25%)
                HtmlFeatureParser.parse#featureextract(#=19, time=11.68ms, avg=1.63#/ms 0.61ms/#, 0.27%)
            ImageAnalyzer.analyze#facesanddominant(#=1, time=188.59ms, avg=0.01#/ms 188.59ms/#, 4.36%)
        TextAnalyzers#total(#=29, time=43.04ms, avg=0.67#/ms 1.48ms/#, 0.99%)
          LanguageAnalyzer#total(#=15, time=27.48ms, avg=0.55#/ms 1.83ms/#, 0.63%)
          PostcodeAnalyzer(#=15, time=1.29ms, avg=11.60#/ms 0.09ms/#, 0.03%)
          FuzzyHashAnalyzer(#=15, time=14.12ms, avg=1.06#/ms 0.94ms/#, 0.33%)
    WARCIndexerCommand.parseWarcFiles#docdelivery(#=29, time=0.31ms, avg=93.04#/ms 0.01ms/#, 0.01%)
Parsing Archive File [2/5]:spec/fixtures/warcs/YULEARN-2014_12_10.warc.gz
2017-10-05 17:23:41 INFO  Instrument:249 - Performance statistics
WARCIndexerCommand.main#total(#=0, time=0.00ms, avg=0.00#/ms 0.00ms/#, 0.00%)
  WARCIndexerCommand.parseWarcFiles#startup(#=1, time=2119.54ms, avg=0.00#/ms 2119.54ms/#, 9.06%)
  WARCIndexerCommand.commit#success(#=2, time=75.90ms, avg=0.03#/ms 37.95ms/#, 0.32%)
  WARCIndexerCommand.parseWarcFiles#fullarcprocess(#=2, time=21279.16ms, avg=0.00#/ms 10639.58ms/#, 90.91%)
    WARCIndexerCommand.parseWarcFiles#solrdocCreation(#=425, time=20949.05ms, avg=0.02#/ms 49.29ms/#, 89.50%)
      SolrRecord.removeControlCharacters#total(#=12978, time=58.32ms, avg=222.54#/ms 0.00ms/#, 0.25%)
        SolrRecord.sanitiseUTF8(#=12978, time=22.27ms, avg=582.73#/ms 0.00ms/#, 0.10%)
      WARCIndexer.extract#total(#=122, time=20864.01ms, avg=0.01#/ms 171.02ms/#, 89.14%)
        WARCIndexer.extract#archeaders(#=139, time=286.66ms, avg=0.48#/ms 2.06ms/#, 1.22%)
        WARCIndexer.extract#hashstreamwrap(#=122, time=197.10ms, avg=0.62#/ms 1.62ms/#, 0.84%)
        WARCIndexer.extract#analyzetikainput(#=122, time=20204.82ms, avg=0.01#/ms 165.61ms/#, 86.32%)
          WARCPayloadAnalyzers.analyze#total(#=122, time=20204.36ms, avg=0.01#/ms 165.61ms/#, 86.32%)
            WARCPayloadAnalyzers.analyze#tikasolrextract(#=122, time=8964.22ms, avg=0.01#/ms 73.48ms/#, 38.30%)
              TikaExtractor.extract#detect(#=122, time=248.04ms, avg=0.49#/ms 2.03ms/#, 1.06%)
              TikaExtractor.extract#parse(#=121, time=8664.07ms, avg=0.01#/ms 71.60ms/#, 37.01%)
              TikaExtractor.extract#extract(#=121, time=39.41ms, avg=3.07#/ms 0.33ms/#, 0.17%)
            WARCPayloadAnalyzers.analyze#firstbytes(#=122, time=5.00ms, avg=24.40#/ms 0.04ms/#, 0.02%)
            WARCPayloadAnalyzers.analyze#droid(#=122, time=2854.05ms, avg=0.04#/ms 23.39ms/#, 12.19%) top 5 sort=avgtime
              WARCPayloadAnalyzers.analyze#droid_type=application/pdf; version=1.3(#=10, time=2519.16ms, avg=0.00#/ms 251.92ms/#, 10.76%)
              WARCPayloadAnalyzers.analyze#droid_type=image/jpeg; version=1.02(#=21, time=87.51ms, avg=0.24#/ms 4.17ms/#, 0.37%)
              WARCPayloadAnalyzers.analyze#droid_type=image/vnd.microsoft.icon(#=2, time=6.79ms, avg=0.29#/ms 3.39ms/#, 0.03%)
              WARCPayloadAnalyzers.analyze#droid_type=text/html; version=5(#=1, time=3.35ms, avg=0.30#/ms 3.35ms/#, 0.01%)
              WARCPayloadAnalyzers.analyze#droid_type=application/xml; version=1.0(#=1, time=3.23ms, avg=0.31#/ms 3.23ms/#, 0.01%)
            HTMLAnalyzer.analyze#total(#=72, time=280.18ms, avg=0.26#/ms 3.89ms/#, 1.20%)
              HTMLAnalyzer.analyze#parser(#=72, time=157.62ms, avg=0.46#/ms 2.19ms/#, 0.67%)
                HtmlFeatureParser.parse#jsoupparse(#=72, time=97.52ms, avg=0.74#/ms 1.35ms/#, 0.42%)
                HtmlFeatureParser.parse#featureextract(#=72, time=33.48ms, avg=2.15#/ms 0.47ms/#, 0.14%)
            ImageAnalyzer.analyze#facesanddominant(#=1, time=188.59ms, avg=0.01#/ms 188.59ms/#, 0.81%)
            PDFAnalyzer.analyze(#=10, time=7911.28ms, avg=0.00#/ms 791.13ms/#, 33.80%)
        TextAnalyzers#total(#=122, time=94.43ms, avg=1.29#/ms 0.77ms/#, 0.40%)
          LanguageAnalyzer#total(#=81, time=61.95ms, avg=1.31#/ms 0.76ms/#, 0.26%)
          PostcodeAnalyzer(#=81, time=3.64ms, avg=22.26#/ms 0.04ms/#, 0.02%)
          FuzzyHashAnalyzer(#=81, time=28.23ms, avg=2.87#/ms 0.35ms/#, 0.12%)
    WARCIndexerCommand.parseWarcFiles#docdelivery(#=122, time=220.52ms, avg=0.55#/ms 1.81ms/#, 0.94%)
      WARCIndexerCommanc.checkSubmission#solrSendBatch(#=2, time=219.48ms, avg=0.01#/ms 109.74ms/#, 0.94%)
WARCIndexer#content_types(#=122, time=20864.77ms, avg=0.01#/ms 171.02ms/#, 89.13%) top 20 sort=time
  WARCIndexer#content_type_served=application/pdf(#=10, time=12700.82ms, avg=0.00#/ms 1270.08ms/#, 54.26%)
  WARCIndexer#content_type_served=image/jpeg(#=21, time=4710.19ms, avg=0.00#/ms 224.29ms/#, 20.12%)
  WARCIndexer#content_type_served=image/gif(#=13, time=1512.06ms, avg=0.01#/ms 116.31ms/#, 6.46%)
  WARCIndexer#content_type_served=text/html(#=56, time=1127.03ms, avg=0.05#/ms 20.13ms/#, 4.81%)
  WARCIndexer#content_type_served=text/plain(#=8, time=436.13ms, avg=0.02#/ms 54.52ms/#, 1.86%)
  WARCIndexer#content_type_served=text/css(#=9, time=145.92ms, avg=0.06#/ms 16.21ms/#, 0.62%)
  WARCIndexer#content_type_served=image/png(#=1, time=133.21ms, avg=0.01#/ms 133.21ms/#, 0.57%)
  WARCIndexer#content_type_served=text/xml(#=1, time=80.98ms, avg=0.01#/ms 80.98ms/#, 0.35%)
  WARCIndexer#content_type_served=text/javascript(#=1, time=9.54ms, avg=0.10#/ms 9.54ms/#, 0.04%)
  WARCIndexer#content_type_served=image/vnd.microsoft.icon(#=1, time=7.06ms, avg=0.14#/ms 7.06ms/#, 0.03%)
  WARCIndexer#content_type_served=image/x-icon(#=1, time=1.65ms, avg=0.60#/ms 1.65ms/#, 0.01%)
Parsing Archive File [3/5]:spec/fixtures/warcs/etig-2014_08_13.warc.gz
2017-10-05 17:24:01 INFO  Instrument:249 - Performance statistics
WARCIndexerCommand.main#total(#=0, time=0.00ms, avg=0.00#/ms 0.00ms/#, 0.00%)
  WARCIndexerCommand.parseWarcFiles#startup(#=1, time=2119.54ms, avg=0.00#/ms 2119.54ms/#, 4.87%)
  WARCIndexerCommand.commit#success(#=3, time=81.57ms, avg=0.04#/ms 27.19ms/#, 0.19%)
  WARCIndexerCommand.parseWarcFiles#fullarcprocess(#=3, time=41435.85ms, avg=0.00#/ms 13811.95ms/#, 95.11%)
    WARCIndexerCommand.parseWarcFiles#solrdocCreation(#=1227, time=40822.27ms, avg=0.03#/ms 33.27ms/#, 93.70%)
      SolrRecord.removeControlCharacters#total(#=38433, time=145.13ms, avg=264.82#/ms 0.00ms/#, 0.33%)
        SolrRecord.sanitiseUTF8(#=38433, time=44.74ms, avg=859.09#/ms 0.00ms/#, 0.10%)
      WARCIndexer.extract#total(#=327, time=40607.89ms, avg=0.01#/ms 124.18ms/#, 93.21%)
        WARCIndexer.extract#archeaders(#=390, time=355.16ms, avg=1.10#/ms 0.91ms/#, 0.82%)
        WARCIndexer.extract#hashstreamwrap(#=327, time=250.56ms, avg=1.31#/ms 0.77ms/#, 0.58%)
        WARCIndexer.extract#analyzetikainput(#=327, time=39281.74ms, avg=0.01#/ms 120.13ms/#, 90.16%)
          WARCPayloadAnalyzers.analyze#total(#=327, time=39280.65ms, avg=0.01#/ms 120.12ms/#, 90.16%)
            WARCPayloadAnalyzers.analyze#tikasolrextract(#=327, time=25749.83ms, avg=0.01#/ms 78.75ms/#, 59.10%)
              TikaExtractor.extract#detect(#=327, time=471.30ms, avg=0.69#/ms 1.44ms/#, 1.08%)
              TikaExtractor.extract#parse(#=324, time=25155.93ms, avg=0.01#/ms 77.64ms/#, 57.74%)
              TikaExtractor.extract#extract(#=324, time=97.23ms, avg=3.33#/ms 0.30ms/#, 0.22%)
            WARCPayloadAnalyzers.analyze#firstbytes(#=327, time=7.67ms, avg=42.62#/ms 0.02ms/#, 0.02%)
            WARCPayloadAnalyzers.analyze#droid(#=327, time=3409.95ms, avg=0.10#/ms 10.43ms/#, 7.83%) top 5 sort=avgtime
              WARCPayloadAnalyzers.analyze#droid_type=application/pdf; version=1.3(#=10, time=2519.16ms, avg=0.00#/ms 251.92ms/#, 5.78%)
              WARCPayloadAnalyzers.analyze#droid_type=text/html(#=1, time=6.65ms, avg=0.15#/ms 6.65ms/#, 0.02%)
              WARCPayloadAnalyzers.analyze#droid_type=text/html; version=5(#=25, time=137.66ms, avg=0.18#/ms 5.51ms/#, 0.32%)
              WARCPayloadAnalyzers.analyze#droid_type=image/vnd.microsoft.icon(#=4, time=20.93ms, avg=0.19#/ms 5.23ms/#, 0.05%)
              WARCPayloadAnalyzers.analyze#droid_type=image/jpeg; version=1.02(#=23, time=93.62ms, avg=0.25#/ms 4.07ms/#, 0.21%)
            HTMLAnalyzer.analyze#total(#=189, time=699.17ms, avg=0.27#/ms 3.70ms/#, 1.60%)
              HTMLAnalyzer.analyze#parser(#=189, time=414.73ms, avg=0.46#/ms 2.19ms/#, 0.95%)
                HtmlFeatureParser.parse#jsoupparse(#=189, time=285.82ms, avg=0.66#/ms 1.51ms/#, 0.66%)
                HtmlFeatureParser.parse#featureextract(#=189, time=72.64ms, avg=2.60#/ms 0.38ms/#, 0.17%)
            ImageAnalyzer.analyze#facesanddominant(#=9, time=1491.97ms, avg=0.01#/ms 165.77ms/#, 3.42%)
            PDFAnalyzer.analyze(#=10, time=7911.28ms, avg=0.00#/ms 791.13ms/#, 18.16%)
            XMLAnalyzer.analyze(#=4, time=8.42ms, avg=0.48#/ms 2.10ms/#, 0.02%)
        TextAnalyzers#total(#=327, time=578.61ms, avg=0.57#/ms 1.77ms/#, 1.33%)
          LanguageAnalyzer#total(#=228, time=462.17ms, avg=0.49#/ms 2.03ms/#, 1.06%)
          PostcodeAnalyzer(#=228, time=19.55ms, avg=11.66#/ms 0.09ms/#, 0.04%)
          FuzzyHashAnalyzer(#=228, time=95.15ms, avg=2.40#/ms 0.42ms/#, 0.22%)
    WARCIndexerCommand.parseWarcFiles#docdelivery(#=327, time=480.04ms, avg=0.68#/ms 1.47ms/#, 1.10%)
      WARCIndexerCommanc.checkSubmission#solrSendBatch(#=6, time=477.56ms, avg=0.01#/ms 79.59ms/#, 1.10%)
WARCIndexer#content_types(#=327, time=40610.08ms, avg=0.01#/ms 124.19ms/#, 93.21%) top 20 sort=time
  WARCIndexer#content_type_served=image/jpeg(#=34, time=18883.32ms, avg=0.00#/ms 555.39ms/#, 43.34%)
  WARCIndexer#content_type_served=application/pdf(#=10, time=12700.82ms, avg=0.00#/ms 1270.08ms/#, 29.15%)
  WARCIndexer#content_type_served=image/gif(#=29, time=2950.86ms, avg=0.01#/ms 101.75ms/#, 6.77%)
  WARCIndexer#content_type_served=text/html(#=132, time=2904.28ms, avg=0.05#/ms 22.00ms/#, 6.67%)
  WARCIndexer#content_type_served=image/png(#=12, time=1725.54ms, avg=0.01#/ms 143.80ms/#, 3.96%)
  WARCIndexer#content_type_served=text/plain(#=47, time=627.03ms, avg=0.07#/ms 13.34ms/#, 1.44%)
  WARCIndexer#content_type_served=text/xml(#=36, time=399.84ms, avg=0.09#/ms 11.11ms/#, 0.92%)
  WARCIndexer#content_type_served=text/css(#=15, time=193.23ms, avg=0.08#/ms 12.88ms/#, 0.44%)
  WARCIndexer#content_type_served=image/vnd.microsoft.icon(#=3, time=150.12ms, avg=0.02#/ms 50.04ms/#, 0.34%)
  WARCIndexer#content_type_served=application/atom+xml(#=1, time=26.69ms, avg=0.04#/ms 26.69ms/#, 0.06%)
  WARCIndexer#content_type_served=text/javascript(#=2, time=17.33ms, avg=0.12#/ms 8.66ms/#, 0.04%)
  WARCIndexer#content_type_served=image/x-icon(#=3, time=16.18ms, avg=0.19#/ms 5.39ms/#, 0.04%)
  WARCIndexer#content_type_served=application/x-javascript(#=2, time=9.63ms, avg=0.21#/ms 4.82ms/#, 0.02%)
  WARCIndexer#content_type_served=application/x-shockwave-flash(#=1, time=4.83ms, avg=0.21#/ms 4.83ms/#, 0.01%)
Parsing Archive File [4/5]:spec/fixtures/warcs/library_research_roadmap-2014_11_28.warc.gz
2017-10-05 17:25:27 INFO  Instrument:249 - Performance statistics
WARCIndexerCommand.main#total(#=0, time=0.00ms, avg=0.00#/ms 0.00ms/#, 0.00%)
  WARCIndexerCommand.parseWarcFiles#startup(#=1, time=2119.54ms, avg=0.00#/ms 2119.54ms/#, 1.64%)
  WARCIndexerCommand.commit#success(#=4, time=219.48ms, avg=0.02#/ms 54.87ms/#, 0.17%)
  WARCIndexerCommand.parseWarcFiles#fullarcprocess(#=4, time=127494.96ms, avg=0.00#/ms 31873.74ms/#, 98.35%)
    WARCIndexerCommand.parseWarcFiles#solrdocCreation(#=2774, time=126562.27ms, avg=0.02#/ms 45.62ms/#, 97.63%)
      SolrRecord.removeControlCharacters#total(#=79298, time=213.40ms, avg=371.60#/ms 0.00ms/#, 0.16%)
        SolrRecord.sanitiseUTF8(#=79298, time=62.32ms, avg=1272.34#/ms 0.00ms/#, 0.05%)
      WARCIndexer.extract#total(#=834, time=126176.50ms, avg=0.01#/ms 151.29ms/#, 97.34%)
        WARCIndexer.extract#archeaders(#=905, time=458.20ms, avg=1.98#/ms 0.51ms/#, 0.35%)
        WARCIndexer.extract#hashstreamwrap(#=834, time=325.18ms, avg=2.56#/ms 0.39ms/#, 0.25%)
        WARCIndexer.extract#analyzetikainput(#=834, time=124492.27ms, avg=0.01#/ms 149.27ms/#, 96.04%)
          WARCPayloadAnalyzers.analyze#total(#=834, time=124490.21ms, avg=0.01#/ms 149.27ms/#, 96.04%)
            WARCPayloadAnalyzers.analyze#tikasolrextract(#=834, time=105110.56ms, avg=0.01#/ms 126.03ms/#, 81.09%)
              TikaExtractor.extract#detect(#=834, time=800.81ms, avg=1.04#/ms 0.96ms/#, 0.62%)
              TikaExtractor.extract#parse(#=830, time=104134.99ms, avg=0.01#/ms 125.46ms/#, 80.33%)
              TikaExtractor.extract#extract(#=830, time=122.83ms, avg=6.76#/ms 0.15ms/#, 0.09%)
            WARCPayloadAnalyzers.analyze#firstbytes(#=834, time=11.84ms, avg=70.41#/ms 0.01ms/#, 0.01%)
            WARCPayloadAnalyzers.analyze#droid(#=834, time=4558.91ms, avg=0.18#/ms 5.47ms/#, 3.52%) top 5 sort=avgtime
              WARCPayloadAnalyzers.analyze#droid_type=application/pdf; version=1.3(#=10, time=2519.16ms, avg=0.00#/ms 251.92ms/#, 1.94%)
              WARCPayloadAnalyzers.analyze#droid_type=application/x-puid-fmt-682; name="Thumbs DB file"; version=XP(#=3, time=76.48ms, avg=0.04#/ms 25.49ms/#, 0.06%)
              WARCPayloadAnalyzers.analyze#droid_type=image/vnd.adobe.photoshop(#=1, time=8.50ms, avg=0.12#/ms 8.50ms/#, 0.01%)
              WARCPayloadAnalyzers.analyze#droid_type=application/x-puid-x-fmt-234; name="Paint Shop Pro Image"; version=5.0(#=1, time=6.59ms, avg=0.15#/ms 6.59ms/#, 0.01%)
              WARCPayloadAnalyzers.analyze#droid_type=text/html; version=5(#=25, time=137.66ms, avg=0.18#/ms 5.51ms/#, 0.11%)
            HTMLAnalyzer.analyze#total(#=331, time=847.20ms, avg=0.39#/ms 2.56ms/#, 0.65%)
              HTMLAnalyzer.analyze#parser(#=331, time=502.98ms, avg=0.66#/ms 1.52ms/#, 0.39%)
                HtmlFeatureParser.parse#jsoupparse(#=331, time=320.59ms, avg=1.03#/ms 0.97ms/#, 0.25%)
                HtmlFeatureParser.parse#featureextract(#=331, time=106.78ms, avg=3.10#/ms 0.32ms/#, 0.08%)
            ImageAnalyzer.analyze#facesanddominant(#=36, time=6036.88ms, avg=0.01#/ms 167.69ms/#, 4.66%)
            PDFAnalyzer.analyze(#=10, time=7911.28ms, avg=0.00#/ms 791.13ms/#, 6.10%)
            XMLAnalyzer.analyze(#=4, time=8.42ms, avg=0.48#/ms 2.10ms/#, 0.01%)
        TextAnalyzers#total(#=834, time=668.60ms, avg=1.25#/ms 0.80ms/#, 0.52%)
          LanguageAnalyzer#total(#=387, time=524.84ms, avg=0.74#/ms 1.36ms/#, 0.40%)
          PostcodeAnalyzer(#=387, time=23.99ms, avg=16.13#/ms 0.06ms/#, 0.02%)
          FuzzyHashAnalyzer(#=387, time=117.07ms, avg=3.31#/ms 0.30ms/#, 0.09%)
    WARCIndexerCommand.parseWarcFiles#docdelivery(#=834, time=637.14ms, avg=1.31#/ms 0.76ms/#, 0.49%)
      WARCIndexerCommanc.checkSubmission#solrSendBatch(#=16, time=631.91ms, avg=0.03#/ms 39.49ms/#, 0.49%)
WARCIndexer#content_types(#=834, time=126180.51ms, avg=0.01#/ms 151.30ms/#, 97.34%) top 20 sort=time
  WARCIndexer#content_type_served=image/png(#=38, time=38129.61ms, avg=0.00#/ms 1003.41ms/#, 29.41%)
  WARCIndexer#content_type_served=image/jpeg(#=65, time=35024.03ms, avg=0.00#/ms 538.83ms/#, 27.02%)
  WARCIndexer#content_type_served=image/gif(#=340, time=34797.18ms, avg=0.01#/ms 102.34ms/#, 26.84%)
  WARCIndexer#content_type_served=application/pdf(#=10, time=12700.82ms, avg=0.00#/ms 1270.08ms/#, 9.80%)
  WARCIndexer#content_type_served=text/html(#=260, time=3918.95ms, avg=0.07#/ms 15.07ms/#, 3.02%)
  WARCIndexer#content_type_served=text/plain(#=54, time=773.47ms, avg=0.07#/ms 14.32ms/#, 0.60%)
  WARCIndexer#content_type_served=text/xml(#=36, time=399.84ms, avg=0.09#/ms 11.11ms/#, 0.31%)
  WARCIndexer#content_type_served=text/css(#=17, time=201.74ms, avg=0.08#/ms 11.87ms/#, 0.16%)
  WARCIndexer#content_type_served=image/vnd.microsoft.icon(#=3, time=150.12ms, avg=0.02#/ms 50.04ms/#, 0.12%)
  WARCIndexer#content_type_served=application/atom+xml(#=1, time=26.69ms, avg=0.04#/ms 26.69ms/#, 0.02%)
  WARCIndexer#content_type_served=application/x-javascript(#=4, time=19.00ms, avg=0.21#/ms 4.75ms/#, 0.01%)
  WARCIndexer#content_type_served=text/javascript(#=2, time=17.33ms, avg=0.12#/ms 8.66ms/#, 0.01%)
  WARCIndexer#content_type_served=image/x-icon(#=3, time=16.18ms, avg=0.19#/ms 5.39ms/#, 0.01%)
  WARCIndexer#content_type_served=application/x-shockwave-flash(#=1, time=4.83ms, avg=0.21#/ms 4.83ms/#, 0.00%)
Parsing Archive File [5/5]:spec/fixtures/warcs/test.warc.gz
WARC Indexer Finished in 129.779 seconds.
2017-10-05 17:25:27 INFO  Instrument:249 - Performance statistics
WARCIndexerCommand.main#total(#=1, time=129782.70ms, avg=0.00#/ms 129782.70ms/#, 100.00%)
  WARCIndexerCommand.parseWarcFiles#startup(#=1, time=2119.54ms, avg=0.00#/ms 2119.54ms/#, 1.63%)
  WARCIndexerCommand.commit#success(#=6, time=347.08ms, avg=0.02#/ms 57.85ms/#, 0.27%)
  WARCIndexerCommand.parseWarcFiles#fullarcprocess(#=5, time=127567.59ms, avg=0.00#/ms 25513.52ms/#, 98.29%)
    WARCIndexerCommand.parseWarcFiles#solrdocCreation(#=2780, time=126568.94ms, avg=0.02#/ms 45.53ms/#, 97.52%)
      SolrRecord.removeControlCharacters#total(#=79426, time=213.56ms, avg=371.91#/ms 0.00ms/#, 0.16%)
        SolrRecord.sanitiseUTF8(#=79426, time=62.37ms, avg=1273.55#/ms 0.00ms/#, 0.05%)
      WARCIndexer.extract#total(#=835, time=126182.51ms, avg=0.01#/ms 151.12ms/#, 97.23%)
        WARCIndexer.extract#archeaders(#=906, time=458.35ms, avg=1.98#/ms 0.51ms/#, 0.35%)
        WARCIndexer.extract#hashstreamwrap(#=835, time=325.24ms, avg=2.57#/ms 0.39ms/#, 0.25%)
        WARCIndexer.extract#analyzetikainput(#=835, time=124497.57ms, avg=0.01#/ms 149.10ms/#, 95.93%)
          WARCPayloadAnalyzers.analyze#total(#=835, time=124495.51ms, avg=0.01#/ms 149.10ms/#, 95.93%)
            WARCPayloadAnalyzers.analyze#tikasolrextract(#=835, time=105113.68ms, avg=0.01#/ms 125.88ms/#, 80.99%)
              TikaExtractor.extract#detect(#=835, time=802.16ms, avg=1.04#/ms 0.96ms/#, 0.62%)
              TikaExtractor.extract#parse(#=831, time=104136.66ms, avg=0.01#/ms 125.31ms/#, 80.24%)
              TikaExtractor.extract#extract(#=831, time=122.90ms, avg=6.76#/ms 0.15ms/#, 0.09%)
            WARCPayloadAnalyzers.analyze#firstbytes(#=835, time=11.85ms, avg=70.46#/ms 0.01ms/#, 0.01%)
            WARCPayloadAnalyzers.analyze#droid(#=835, time=4560.54ms, avg=0.18#/ms 5.46ms/#, 3.51%) top 5 sort=avgtime
              WARCPayloadAnalyzers.analyze#droid_type=application/pdf; version=1.3(#=10, time=2519.16ms, avg=0.00#/ms 251.92ms/#, 1.94%)
              WARCPayloadAnalyzers.analyze#droid_type=application/x-puid-fmt-682; name="Thumbs DB file"; version=XP(#=3, time=76.48ms, avg=0.04#/ms 25.49ms/#, 0.06%)
              WARCPayloadAnalyzers.analyze#droid_type=image/vnd.adobe.photoshop(#=1, time=8.50ms, avg=0.12#/ms 8.50ms/#, 0.01%)
              WARCPayloadAnalyzers.analyze#droid_type=application/x-puid-x-fmt-234; name="Paint Shop Pro Image"; version=5.0(#=1, time=6.59ms, avg=0.15#/ms 6.59ms/#, 0.01%)
              WARCPayloadAnalyzers.analyze#droid_type=text/html; version=5(#=26, time=139.28ms, avg=0.19#/ms 5.36ms/#, 0.11%)
            HTMLAnalyzer.analyze#total(#=332, time=847.74ms, avg=0.39#/ms 2.55ms/#, 0.65%)
              HTMLAnalyzer.analyze#parser(#=332, time=503.30ms, avg=0.66#/ms 1.52ms/#, 0.39%)
                HtmlFeatureParser.parse#jsoupparse(#=332, time=320.74ms, avg=1.04#/ms 0.97ms/#, 0.25%)
                HtmlFeatureParser.parse#featureextract(#=332, time=106.84ms, avg=3.11#/ms 0.32ms/#, 0.08%)
            ImageAnalyzer.analyze#facesanddominant(#=36, time=6036.88ms, avg=0.01#/ms 167.69ms/#, 4.65%)
            PDFAnalyzer.analyze(#=10, time=7911.28ms, avg=0.00#/ms 791.13ms/#, 6.10%)
            XMLAnalyzer.analyze(#=4, time=8.42ms, avg=0.48#/ms 2.10ms/#, 0.01%)
        TextAnalyzers#total(#=835, time=668.88ms, avg=1.25#/ms 0.80ms/#, 0.52%)
          LanguageAnalyzer#total(#=388, time=525.05ms, avg=0.74#/ms 1.35ms/#, 0.40%)
          PostcodeAnalyzer(#=388, time=24.00ms, avg=16.17#/ms 0.06ms/#, 0.02%)
          FuzzyHashAnalyzer(#=388, time=117.13ms, avg=3.31#/ms 0.30ms/#, 0.09%)
    WARCIndexerCommand.parseWarcFiles#docdelivery(#=835, time=637.15ms, avg=1.31#/ms 0.76ms/#, 0.49%)
      WARCIndexerCommanc.checkSubmission#solrSendBatch(#=17, time=648.34ms, avg=0.03#/ms 38.14ms/#, 0.50%)
WARCIndexer#content_types(#=835, time=126186.53ms, avg=0.01#/ms 151.12ms/#, 97.23%) top 20 sort=time
  WARCIndexer#content_type_served=image/png(#=38, time=38129.61ms, avg=0.00#/ms 1003.41ms/#, 29.38%)
  WARCIndexer#content_type_served=image/jpeg(#=65, time=35024.03ms, avg=0.00#/ms 538.83ms/#, 26.99%)
  WARCIndexer#content_type_served=image/gif(#=340, time=34797.18ms, avg=0.01#/ms 102.34ms/#, 26.81%)
  WARCIndexer#content_type_served=application/pdf(#=10, time=12700.82ms, avg=0.00#/ms 1270.08ms/#, 9.79%)
  WARCIndexer#content_type_served=text/html(#=261, time=3924.97ms, avg=0.07#/ms 15.04ms/#, 3.02%)
  WARCIndexer#content_type_served=text/plain(#=54, time=773.47ms, avg=0.07#/ms 14.32ms/#, 0.60%)
  WARCIndexer#content_type_served=text/xml(#=36, time=399.84ms, avg=0.09#/ms 11.11ms/#, 0.31%)
  WARCIndexer#content_type_served=text/css(#=17, time=201.74ms, avg=0.08#/ms 11.87ms/#, 0.16%)
  WARCIndexer#content_type_served=image/vnd.microsoft.icon(#=3, time=150.12ms, avg=0.02#/ms 50.04ms/#, 0.12%)
  WARCIndexer#content_type_served=application/atom+xml(#=1, time=26.69ms, avg=0.04#/ms 26.69ms/#, 0.02%)
  WARCIndexer#content_type_served=application/x-javascript(#=4, time=19.00ms, avg=0.21#/ms 4.75ms/#, 0.01%)
  WARCIndexer#content_type_served=text/javascript(#=2, time=17.33ms, avg=0.12#/ms 8.66ms/#, 0.01%)
  WARCIndexer#content_type_served=image/x-icon(#=3, time=16.18ms, avg=0.19#/ms 5.39ms/#, 0.01%)
  WARCIndexer#content_type_served=application/x-shockwave-flash(#=1, time=4.83ms, avg=0.21#/ms 4.83ms/#, 0.00%)
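When the output is redirected to a file, the closing line is a convenient thing to check. A minimal sketch, assuming the log contains a line of the form shown above (WARC Indexer Finished in N seconds.); indexer_elapsed is a hypothetical helper name:

```shell
# Hypothetical helper: print the elapsed seconds reported by the indexer in a log file.
# Prints nothing if the log has no "WARC Indexer Finished" line.
indexer_elapsed() {
  sed -n 's/^WARC Indexer Finished in \([0-9.]*\) seconds\.$/\1/p' "$1" | tail -n 1
}

# Usage (the log path is a placeholder):
#   indexer_elapsed /path/to/indexer.log
```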

Parallel indexing

If you cannot take advantage of the Hadoop functionality in webarchive-discovery, you can use GNU Parallel or xargs to batch index. Using the example from above, the command would look like this:

parallel (24 jobs):

$ time find /data/401/3490/warcs -iname "*.gz" -type f | parallel --jobs 24 --gnu "java -Xmx5g -Djava.io.tmpdir=/mnt/tmp -jar /home/ubuntu/warc-indexer.jar -d -c /home/ubuntu/warclight.conf -i 'University of Alberta Libraries' -n 'Idle No More' -u '3490' -s http://192.168.32.35:8983/solr/ualberta {} > {/}-24-05.log"

xargs (44 jobs):

$ time (find /tuna1/scratch/nruest/geocities/warcs -iname "*.gz" -type f -print0 | xargs -0 -P 44 -n 1 -I {} bash -c 'java -Xmx2g -Djava.io.tmpdir=/tuna1/scratch/nruest/tmp -jar /home/ruestn/warc-indexer.jar -d -c /home/ruestn/warclight_annotation.conf -s http://192.168.32.35:8983/solr/geocities "{}" > /tuna1/scratch/nruest/logs/$(basename {}).log' );
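With one log per WARC, a quick way to spot jobs that died part-way is to list the logs that never reached the closing line. This is a sketch under the assumption that each per-WARC log contains the WARC Indexer Finished line seen in the sample output earlier; unfinished_logs is a hypothetical helper, and grep -L (list files without a match) is available in GNU and BSD grep.

```shell
# Hypothetical helper: list *.log files under a directory that lack the indexer's closing line.
unfinished_logs() {
  find "$1" -type f -name '*.log' -exec grep -L 'WARC Indexer Finished' {} +
}

# Usage, with the log directory from the xargs example:
#   unfinished_logs /tuna1/scratch/nruest/logs
```

Any file it prints corresponds to a WARC that is worth re-running.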

Depending on your Solr setup, you might want to use the -d (disable commit) option. Without it, the indexer commits after each WARC, which opens a new searcher for every WARC processed, and that could be many per minute. If you do use this option, make sure to run curl "http://mysolrcloud:8983/solr/update?commit=true&openSearcher=true" at the end of a job to force a commit. For more information, see this webarchive-discovery GitHub issue.
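The final commit can be scripted so it is not forgotten. A minimal sketch, assuming the update handler sits under the same base URL given to -s (for a single core that would include the core name); commit_url and force_commit are hypothetical helper names:

```shell
# Hypothetical helper: build the update URL that forces a commit and opens a new searcher.
# A trailing slash on the base URL is dropped so the result has no double slash.
commit_url() {
  printf '%s\n' "${1%/}/update?commit=true&openSearcher=true"
}

# Hypothetical helper: send the commit request; --fail makes curl exit non-zero on HTTP errors.
force_commit() {
  curl --fail --silent --show-error "$(commit_url "$1")"
}

# Usage, once every indexing job has finished:
#   force_commit "http://mysolrcloud:8983/solr"
```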