Full-Text Search Engine with no external dependencies written in C# for .NET Core.
The aim of this project is to showcase algorithms, data structures and techniques that are used to create full-text search engines.
On Windows:
-
Download and build code. Use the following commands:
dotnet restore dotnet build
-
Open folder with binaries: bin\Debug\netcoreapp2.0
-
Start the following command. Replace DATA_PATH with a path to Datasets folder
run_test.bat DATA_PATH
-
If everything goes well the following messages are printed:
Log from index construction:
dotnet Protsyk.PMS.FullText.ConsoleUtil.dll index --input "F:\Sources\FullTextSearch\Datasets" PMS Full-Text Search (c) Petro Protsyk 2017-2018 F:\Sources\FullTextSearch\Datasets\Simple\TestFile001.txt F:\Sources\FullTextSearch\Datasets\Simple\TestFile002.txt F:\Sources\FullTextSearch\Datasets\Simple\TestFile003.txt Indexed documents: 3, time: 00:00:00.1010004
Dump of the index (for each term in the dictionary - the list of all occurrences):
dotnet Protsyk.PMS.FullText.ConsoleUtil.dll print PMS Full-Text Search (c) Petro Protsyk 2017-2018 2017 -> [1,1,9] algorithms -> [1,1,19] and -> [1,1,20] apple -> [3,1,1] banana -> [3,1,2] build -> [1,1,25] c -> [1,1,16] data -> [1,1,21] demonstrate -> [1,1,18] ...
Search with query WORD(pms):
dotnet Protsyk.PMS.FullText.ConsoleUtil.dll search --query "WORD(pms)" {filename:"TestFile001.txt", size:"180", created:"2018-04-02T10:09:41.4208444+02:00"} {[1,1,1]} {filename:"TestFile002.txt", size:"29", created:"2018-04-02T10:09:41.4248447+02:00"} {[2,1,1]} Documents found: 2, matches: 2, time: 00:00:00.0564721
Lookup in the dictionary using a pattern i.e. all terms matching pattern:
dotnet Protsyk.PMS.FullText.ConsoleUtil.dll lookup --pattern "WILD(pet*)" petro-mariya-sophie Terms found: 1, time: 00:00:00.0704173 dotnet Protsyk.PMS.FullText.ConsoleUtil.dll lookup --pattern "EDIT(projct, 1)" project Terms found: 1, time: 00:00:00.0847931
- WORD(apple) - single word
- WILD(app*) - wildcard pattern
- EDIT(apple, 1) - Levenshtein (edit distance, fuzzy search)
Conjunction operators
- OR - boolean or
- AND - boolean and
- SEQ - sequence of words, phrase
Examples of queries:
- AND(WORD(apple), OR(WILD(a*), EDIT(apple, 1)))
- SEQ(WORD(hello), WORD(world))
- Dictionary of the persistent index is implemented using either:
- Key-value storage for document metadata is based on persistent B-Tree implementation: B-Tree.
- Packed Integers.
- Classic data structures:
- Heap
- Linked List
- LRU Cache
- Approximate term matching is based on Levenshtein automaton.
- Query Language parser is implemented using recursive descent parser technique.
- A method for encoding/decoding occurrences uses:
- Group VarInt encoding.
- Delta encoding.
- VarInt encoding.
- Classic algorithms:
- Binary search
- Heap sort
- Huffman Encoding
-
Information Retrieval: Implementing and Evaluating Search Engines
dotnet publish -c Release --self-contained -r osx.10.13-x64
dotnet publish -c Release --self-contained -r win10-x64