FSCrawler 2.7
release-drafter released this 05 Aug 11:10
The FSCrawler team is pleased to announce the FSCrawler 2.7 release!
FSCrawler
FSCrawler offers a simple way to index binary files into Elasticsearch.
Usage
Download FSCrawler 2.7:
wget https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler-es7/2.7/fscrawler-es7-2.7.zip
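The download URL above follows a regular pattern, so it can be built from a version and an Elasticsearch flavor. The following is a minimal sketch: the helper name `fscrawler_url` is hypothetical, and it assumes other artifacts (for example an `es6` flavor) follow the same naming as the `es7` URL shown above.

```shell
#!/bin/sh
# Hypothetical helper, not part of FSCrawler: builds the Maven Central URL
# for an FSCrawler distribution, mirroring the URL given in these notes.
# $1 = FSCrawler version (e.g. 2.7)
# $2 = Elasticsearch flavor (e.g. es7); other flavors are an assumption
fscrawler_url() {
  version="$1"
  flavor="$2"
  echo "https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler-${flavor}/${version}/fscrawler-${flavor}-${version}.zip"
}

# Usage (needs network access and the unzip tool):
#   wget "$(fscrawler_url 2.7 es7)"
#   unzip fscrawler-es7-2.7.zip
```

Keeping the version and flavor as parameters makes it easier to script upgrades, but check that the artifact actually exists before relying on a generated URL.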
Start FSCrawler with:
bin/fscrawler job_name
FSCrawler reads a local settings file (by default ~/.fscrawler/{job_name}/_settings.json).
If the file does not exist, FSCrawler offers to create your first job.
$ bin/fscrawler job_name
18:28:58,174 WARN [f.p.e.c.f.FsCrawler] job [job_name] does not exist
18:28:58,177 INFO [f.p.e.c.f.FsCrawler] Do you want to create it (Y/N)?
y
18:29:05,711 INFO [f.p.e.c.f.FsCrawler] Settings have been created in [~/.fscrawler/job_name/_settings.json]. Please review and edit before relaunch
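To check from a script whether a job already has settings before launching it, something like the following sketch may help. The function names are hypothetical, and the path assumes the default location described above (`~/.fscrawler/{job_name}/_settings.json`).

```shell
#!/bin/sh
# Hypothetical helpers, not part of FSCrawler.

# Print the default settings file path for the job named in $1.
fscrawler_settings_path() {
  echo "${HOME}/.fscrawler/$1/_settings.json"
}

# Succeed (exit status 0) if the job named in $1 already has a settings file.
fscrawler_job_exists() {
  [ -f "$(fscrawler_settings_path "$1")" ]
}

# Usage:
#   if fscrawler_job_exists job_name; then
#     echo "Settings found at $(fscrawler_settings_path job_name)"
#   else
#     echo "No settings yet; bin/fscrawler job_name will offer to create them"
#   fi
```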
Create a directory named /tmp/es or c:\tmp\es, add some files you want to index in it and start again:
$ bin/fscrawler job_name
18:30:34,330 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
18:30:34,332 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
18:30:34,682 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started for [job_name] for [/tmp/es] every [15m]
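The directory preparation step described above can be sketched as a small shell function. The function name and the sample file name are hypothetical; any files you want indexed can be placed in the directory instead.

```shell
#!/bin/sh
# Hypothetical helper, not part of FSCrawler: creates the directory to crawl
# and drops one small text file into it so the first run has something to index.
# $1 = directory to crawl (these notes use /tmp/es)
prepare_crawl_dir() {
  dir="$1"
  mkdir -p "$dir" || return 1
  printf 'Hello from FSCrawler\n' > "$dir/sample.txt"
}

# Usage:
#   prepare_crawl_dir /tmp/es
#   bin/fscrawler job_name
```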
More details in the documentation.
New features
- #991: Add Workplace Search connector.
- #1203: Add FTP crawler. By helsonxiao.
- #1211: Add file.content_type field on folders.
- #1210: Add file.filename field on folders.
- #1179: Automatically create Custom Sources.
- #1037: Split console logs and actual logs and add a banner :).
- #1036: Make SSL verification configurable. By TommyLike.
- #1035: Log index errors in documents.log.
- #1031: Add an external Log4J2 configuration file.
- #907: Add path_prefix option.
- #820: Generate FSCrawler docker images. By toto1310.
- #776: Report HEAP size at startup.
- #752: Add option to ignore symlinks. By budachst.
- #715: Allow custom index name in the REST API. By kikkauz.
- #698: Add Cross-Origin Resource Sharing (CORS) headers to RestServer. By isaac-ipl.
- #692: Allow running OCR but not on PDF files.
- #673: Add support for YAML configuration.
- #663: Add Patterns table to includes and excludes. By wrathagom.
Fixed Bugs
- #1224: Fix NPE in Console when running with Docker.
- #1217: Check if date is null when formatting it to RFC3339.
- #1204: Split build and deploy phases for Docker images.
- #1201: 2.7 - Docker image broken. By agrantdeakin.
- #1194: Elasticsearch node settings should not be null by default.
- #1193: Corrupt PDF can lead to a StackOverflow.
- #1137: Ignore errors when parsing a 0 byte file.
- #1085: fscrawler.bat added a CD to move to the appropriate directory. By CircuitGuy.
- #1084: InputStream must have > 0 bytes. By yuanzhian.
- #1066: Start fscrawler instead of internal services.
- #1041: Fixed an issue that caused an error when running in a windows environment. By muraken720.
- #1006: Running fscrawler with no argument now lists existing jobs. By janhoy.
- #1005: Fix ENTRYPOINT in Dockerfile to allow variable substitution. By Maijin.
- #994: Using cloud id gives "invalid IPv6 Address". By tdaroly.
- #973: Fix SSH crawling from Windows machine.
- #899: FSCrawler can't index .doc or .docx elements. By LaaKii.
- #895: java.lang.NoSuchMethodError: parsing some Word files. By mwaltersbmc.
- #860: Bug Syntax error in fscrawler file, to init fscrawler. By CarlosRCDev.
- #847: sun.jnu.encoding=UTF-8 added in .bat and .sh both. By shahariaazam.
- #834: FS Crawler freezes when crawling a 0 byte TXT file. By dansfelix.
- #819: Fix Percentage computation.
- #760: Allow passing test parameters to Maven CLI.
- #714: fix release-drafter. By jetersen.
- #701: Change log level and display logs only if filters on content.
- #691: OCR without pdf_ocr. By Newmski.
- #686: Wait for healthy index when creating the index.
- #681: SSH dirs should be seen as dirs and not files.
- #680: trying to index remote files with ssh - files seen as folder. By sblanc0054.
- #660: Fix authentication when sending announcement email.
Main changes
- #1218: Isolate WorkplaceSearchClient and ElasticsearchClient.
- #1213: Switch back to Java 11.
- #1049: Update Dockerfile to use JDK14. By mario-89.
- #1212: Let's use JsonPath.
- #1207: Generate only 2 docker images.
- #1206: Detect when fscrawler runs in foreground and adapt logs.
- #1205: Add logs to the console when running a Docker instance.
- #1172: Move CI from Travis to GitHub actions.
- #872: Add more information to the _simulate API.
- #700: Add dependency convergence checks.
- #695: Exclude the PDFParser from the DefaultParser.
- #694: Display full names when catching parsing errors.
- #693: Move fs.pdf_ocr setting to fs.ocr.pdf_strategy.
- #675: Warn in case of Tika error.
- #1219: Update to Elasticsearch 7.14.0 and 6.8.18.
- #1180: Bump tika.version from 1.26 to 1.27.
Removed
- #978: files lost. By bluebell1990.
Have fun!
-FSCrawler team