JesterJ is a system for loading data into search engines. It can be used as a general-purpose ETL platform, but its primary focus is to make it quick and easy to work with data in a format appropriate for search platforms. Specifically, the atomic unit of data is a "document" object that maps each key to multiple values (like a MultiMap). This differs from traditional ETL systems, which focus on databases and conceptualize the in-flight data as a row with columns (keys), supporting only a single value per row/column.
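As a conceptual illustration of the difference (this sketch uses Guava's Multimap directly, not the JesterJ API, and the field names are made up):

```java
import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.Multimap;

public class MultiValuedDocumentExample {
  public static void main(String[] args) {
    // A search-style "document": one key may hold many values.
    Multimap<String, String> doc = ArrayListMultimap.create();
    doc.put("id", "doc-1");
    doc.put("author", "Alice");
    doc.put("author", "Bob"); // a second value under the same key is fine

    // A row-oriented ETL system would need two rows (or custom aggregation
    // code) to represent the same thing.
    System.out.println(doc.get("author")); // [Alice, Bob]
  }
}
```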
Version 1.0beta3 is the first "stable" version. It should receive only bug fixes (no new features) before being released as 1.0. Prior versions were not really stable enough for general usage.
How much time do you have? As of April 2023 the code has demonstrated stability across millions of documents. Using the (admittedly arbitrary) standard that an ingest taking more than 12 hours is unacceptable, JesterJ is recommended for any corpus up to 30 million documents. Recent observations indicate that a plan with one input and one output can sustain approximately 800 to 1,000 documents per second. This is admittedly not very fast, but the limiting factor was not JesterJ's processing of the documents: performance currently appears to be limited by Cassandra write times (possibly due to an index on one table that perhaps can be eliminated). This also means that increasing the number of destinations, or the number of paths through your plan, will work but will slow things down; the same plan with 10 send-to-Solr steps instead of 1 unfortunately dropped to 250-300 documents per second. 1.0 is the "make it work" release, and future releases will seek to improve performance.
Yes. These types of systems have awesome potential, but they tend to require a LOT of setup time: set up a Spark cluster, set up a ZooKeeper cluster, write custom code to run inside Spark against a Spark RDD. Spark systems are quite good once one invests in them, but many companies can't afford to spend tens of thousands of dollars just to get data flowing. Hadoop and MapReduce systems tend to be very high bandwidth but also very high latency, so they are only appropriate for a search index that doesn't need to show data in a timely fashion. JesterJ is meant to be very easy to start using and to scale well into large production usage. Handling very large data sets or extreme throughput may still require Hadoop- or Spark-based solutions, but the other 95% of projects should be able to use JesterJ.
Solr's data import handler is very good if you simply want to reflect exactly what is in the database in your search engine. However, if you want to combine data from multiple tables, you quickly find yourself writing increasingly complex queries, and any data enrichment (such as geo-locating) has to be performed separately, with the intermediate result stored back in your database (or a secondary database). Certainly this can be done, and if existing database ETL expertise is available in house it may be a reasonable solution, but such systems often devolve into multiple disconnected steps with temporary tables, and in that situation fault tolerance is nearly impossible. With JesterJ your search technologist does not need to be an expert DBA too: they can massage the data in the most common language for search work (Java), and the entire process can be maintained in a single location. As noted above, classic ETL systems become inefficient when they need to handle multi-valued fields, which usually have to be expressed as many rows and aggregated at the last second in custom code; handling several multi-valued fields for the same document can get really complicated.
Yes! Solr is the platform the project maintainer uses most, so this is the best-tested integration.
Unfortunately, not any more :( Support was added initially, but no users are known to be using it, and the Elastic dev team has a tendency to break backwards compatibility. Elasticsearch's old notions of allowable Lucene versions were holding back Solr-related advancements (for which there were users and paying customers). An upgrade was attempted, but too many incompatibilities with prior APIs, plus new dependency-verification code added by the Elastic dev team, made the upgrade too difficult. If you're interested in using JesterJ with Elasticsearch, we've got a ticket you can work on to get it going again here: https://github.com/nsoft/jesterj/issues/124
Possibly. There is a growing number of standard implementations for things like renaming fields, scanning rows in a database, or sending documents to Solr/Elasticsearch, but of course there's always a task you need that we haven't added. The other major goal of JesterJ is to hide the complexity of the infrastructure and make writing custom tasks very simple: just write a Java class that implements the Processor interface (one method that takes a Document and returns an array of Documents) and you're off and running, as sketched below. (More details here: )
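A custom step might look roughly like the following. This is a minimal sketch based only on the description above: the package names, the exact method name, and any additional methods the real Processor interface may require are assumptions, so check the JesterJ javadocs for the actual signatures.

```java
import org.jesterj.ingest.model.Document;   // assumed package; verify in the javadocs
import org.jesterj.ingest.model.Processor;  // assumed package; verify in the javadocs

// A hypothetical processor that tags every document passing through it.
public class AddSourceTagProcessor implements Processor {

  // The single processing method described above: take a Document,
  // return an array of Documents. Returning the same document passes it
  // along to the next step; returning several documents would fan out.
  @Override
  public Document[] processDocument(Document document) {
    // Documents are multi-valued, so adding a value does not clobber any
    // existing values under the same key. The field name is illustrative.
    document.put("ingest_source", "my-custom-step");
    return new Document[] { document };
  }
}
```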
Yes. No software written for this project falls into that category, but the distributed bundles ending in -node.jar contain all dependencies, including Apache Tika and code from BouncyCastle.org. Please see the front page (Home) of this wiki for more details and links to information about the bundled software.