Source Code
The C++ source code is organized in a set of "modules", each in its own folder:
- aho
- ali
- base
- core
- engine
- enginetest
- shell
In the aho folder, text lookup uses the Aho-Corasick algorithm, which is basically a state machine for very fast key matching. The folder also contains 2 project files for each language module, since the modules are designed as sequential models, currently with "lexrep" and "regex" matching. Each language has its own folder here for language-specific data, in 2 subfolders that represent the models (one of them suffixed _regex). Each of these subfolders again has 2 subfolders, ali and lexrep, which hold the state machine data and are in fact inline tables. This data is generated by our language compiler as part of the build process. Do not edit!
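To make the "state machine for very fast key matching" idea concrete, here is a minimal, self-contained Aho-Corasick sketch. It is not the iKnow implementation: the class name `AhoCorasick` and its methods are hypothetical, it builds its tables at run time rather than from generated inline tables, and it assumes non-empty, byte-oriented keys.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Illustrative Aho-Corasick automaton (hypothetical names, not iKnow code).
class AhoCorasick {
    struct Node {
        std::map<char, std::size_t> next;  // trie transitions
        std::size_t fail = 0;              // longest proper suffix that is also a trie path
        std::vector<std::size_t> out;      // indices of keys ending in this state
    };
    std::vector<Node> nodes_;
    std::vector<std::string> keys_;

public:
    AhoCorasick() : nodes_(1) {}  // node 0 is the root

    // Insert one non-empty key into the trie.
    void add(const std::string& key) {
        std::size_t cur = 0;
        for (char c : key) {
            auto it = nodes_[cur].next.find(c);
            if (it != nodes_[cur].next.end()) { cur = it->second; continue; }
            const std::size_t created = nodes_.size();
            nodes_.emplace_back();          // may reallocate, so re-index afterwards
            nodes_[cur].next[c] = created;
            cur = created;
        }
        nodes_[cur].out.push_back(keys_.size());
        keys_.push_back(key);
    }

    // Compute failure links breadth-first; call once after all add() calls.
    void build() {
        std::queue<std::size_t> q;
        for (const auto& kv : nodes_[0].next) q.push(kv.second);  // depth 1: fail stays 0
        while (!q.empty()) {
            const std::size_t u = q.front();
            q.pop();
            for (const auto& kv : nodes_[u].next) {
                const char c = kv.first;
                const std::size_t v = kv.second;
                std::size_t f = nodes_[u].fail;
                while (f != 0 && nodes_[f].next.count(c) == 0) f = nodes_[f].fail;
                auto it = nodes_[f].next.find(c);
                nodes_[v].fail = (it != nodes_[f].next.end()) ? it->second : 0;
                // Keys that end at the failure state also end here.
                const std::vector<std::size_t>& inherited = nodes_[nodes_[v].fail].out;
                nodes_[v].out.insert(nodes_[v].out.end(), inherited.begin(), inherited.end());
                q.push(v);
            }
        }
    }

    // Scan the text once; return (start offset, key) for every match.
    std::vector<std::pair<std::size_t, std::string>> search(const std::string& text) const {
        std::vector<std::pair<std::size_t, std::string>> hits;
        std::size_t state = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            const char c = text[i];
            while (state != 0 && nodes_[state].next.count(c) == 0) state = nodes_[state].fail;
            auto it = nodes_[state].next.find(c);
            if (it != nodes_[state].next.end()) state = it->second;
            for (std::size_t k : nodes_[state].out)
                hits.emplace_back(i + 1 - keys_[k].size(), keys_[k]);
        }
        return hits;
    }
};
```

Typical use is to call `add` for every key, call `build` once, and then call `search` on each input. The text is read in a single pass regardless of how many keys are loaded, which is what makes the approach attractive for lexical lookup.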
Automatic Language Identification (ali) is part of our IRIS NLP functionality. It allows multilingual documents to be indexed, and supports language identification in a document collection. There is currently no API to use it in the standalone version of iKnow, but one could be added, since the source code is present.
The base folder contains 3 generic modules: IkStringAlg.cpp (functionality for our "String" type), IkStringEncoding.cpp (interface to ICU, basically for UTF-8 std::string to UCS-2 "String" conversions), and a pool allocator for performance.
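As an illustration of the kind of conversion described above, the following sketch decodes UTF-8 into 16-bit code units by hand. It is not the ICU-based code in IkStringEncoding.cpp: the function name `utf8_to_ucs2` is hypothetical, `std::u16string` is used as a stand-in for the "String" type (an assumption), and the decoder only covers the Basic Multilingual Plane without rejecting overlong encodings or surrogate values.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UCS-2 code units (BMP only).
// Simplified on purpose; production code should rely on ICU instead.
std::u16string utf8_to_ucs2(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; }  // 0xxxxxxx
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }  // 110xxxxx
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }  // 1110xxxx
        else throw std::invalid_argument("invalid lead byte or code point outside the BMP");
        if (i + len > in.size()) throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (b & 0x3F);  // append 6 payload bits
        }
        out.push_back(static_cast<char16_t>(cp));
        i += len;
    }
    return out;
}
```

Four-byte sequences are rejected rather than split into surrogate pairs, because UCS-2 has no representation for code points above U+FFFF.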
The core folder holds the core modules (although some are obsolete), representing the internal classes in use. The main workhorse is IkIndexProcess.cpp; all indexing starts with:
void IkIndexProcess::Start(IkIndexInput* pInput,
                           IkIndexOutput* pOut,
                           IkIndexDebug<TraceListType>* pDebug,
                           bool bMergeRelations,
                           bool bBinaryMode,
                           bool delimitedSentences,
                           size_t max_concept_cluster_length,
                           IkKnowledgebase* pUdct)
The engine folder is the main module for interfacing with clients. Two subfolders exist: "src" and "language_data". The first has engine.h as the API specification; the second again contains language-specific data (anything but state machine data) that results from language model compilation. Do not edit!
The enginetest folder has an example of how to interface with the engine. Use enginetest.cpp as a working template for writing your own programs.
The shell folder represents the abstraction of a language model, from which every supported language is derived. There used to be 2 versions, "shared memory" and "compiled", but here there is only the "compiled" model. Since "compiled" is derived from "shared memory", both source modules exist.
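The derivation chain described above can be pictured with a small sketch. All class and method names here are hypothetical and chosen only for illustration; the real class names are in the shell sources.

```cpp
#include <string>

// Hypothetical abstraction of a language model.
class LanguageModel {
public:
    virtual ~LanguageModel() = default;
    virtual std::string storage() const = 0;  // where the model data lives
};

// Hypothetical "shared memory" variant, kept because "compiled" builds on it.
class SharedMemoryModel : public LanguageModel {
public:
    std::string storage() const override { return "shared memory"; }
};

// Hypothetical "compiled" variant, derived from the shared memory one.
class CompiledModel : public SharedMemoryModel {
public:
    std::string storage() const override { return "compiled"; }
};
```

Client code that only holds a `LanguageModel` reference would not need to know which variant it is talking to, which is one plausible reason for keeping the abstraction in its own module.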
For Visual Studio 2019 users, 2 files in the modules folder are most important: iKnowEngineTest.sln is the solution file that builds all modules (28 .dll files and 1 executable). Dependencies.props is used by all project files for the ICU reference; edit it if necessary to correctly reflect your environment.
If the latter is set correctly, the solution will build all modules and place all executable code in .\kit\x64\(Debug|Release)\bin.