-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME
45 lines (33 loc) · 2.98 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
SearchInMemory as a reseach about full text/fault-tolerant searches
SearchInMemory is my first impression about how works and how should work full text search engines or fault-tolerant searches (FTS). It begins as a research about how stuff works.
It was also a test for: is it hard to write own full text search engine?
You may consider it as a experiment to uncover how full text search engines works.
I have also tried a BinaryTree implementation of index, but it wasn't so good for me as HashIndex.
But even when I finished a really huge part of code there is still place for improvements(you can use it as a roadmap for your own FTS), like:
- n-gram indexing and searching with levenstein sorting
- wildcard searching
- excluding some phrases from results
- caching generated indexed on some memory based structure on disk
- steeming words to others
- improved and more complex way to do faceting
- a socket connector for searcher from a unix level
- index updating and deleting particular records
- whole phrases in " " signs, like: "billy bob" to match exactly this phrase (need to improve inverted indexes in HashIndex)
- possibilities of import/export data from indexes using formats: json/xml/csv
- tweaking results based on special criteria or queries
Cheers,
Rafal "RaVbaker" Piekarski
Contact:
web: http://about.me/ravbaker
twitter: ravbaker
github: https://github.com/RaVbaker
Great start for your own research:
- http://en.wikipedia.org/wiki/Levenshtein_distance - a minimal knowlegde about comparing similar words
- http://en.wikipedia.org/wiki/Inverted_index - goot start for building indexes - specially full inverted indexes
- http://en.wikipedia.org/wiki/N-gram - N-grams, what it is and why?
- http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html - quite old Google Research department post about n-grams in practise - with a large available dataset.
- http://ngrams.googlelabs.com/ - a practise usage of ngrams with Books Ngram Viewer from Google.
- http://today.java.net/pub/a/today/2005/08/09/didyoumean.html - great article about how Did you mean works with Lucene. Very inspiring post - but mainly about Java.
- http://framework.zend.com/code/filedetails.php?repname=Zend+Framework&path=%2Ftrunk%2Flibrary%2FZend%2FSearch%2FLucene.php - sourcecode from Zend Framework with their PHP implementation of Lucene. Nice source of thougts.
- http://www.ir.uwaterloo.ca/book/ - A book when you think BIG. It's about building your own service for full text search engine scalable almost like Google/Bing. Lots of theory and C/C++ code and algorithms. For very begining I suggest reading an excerpt from chapter 4 - Static inverted indicies - http://www.ir.uwaterloo.ca/book/04-static-inverted-indices.pdf
- Helpful php functions: http://www.php.net/manual/en/function.levenshtein.php, http://www.php.net/manual/en/function.metaphone.php, http://php.net/manual/en/function.soundex.php, http://docs.php.net/manual/en/language.types.array.php :)