get unused import detection, tidy, checkstyle running
fix test runner
- rewrite the "Reproduce with" to build.py
- remove all the useless stack frames from a test failure -- don't need to see that boilerplate, just the relevant frames (my sources)
expose HNSW/KNN!
finish getting distributed queue approach online
get faster (CSV, chunked json) bulk updates, in documents/sec
hmm what to do about "index gen" and "searcher version" now that I am sharding!? i guess it must become a... list?
rename state -> globalState
partial reindex (just one column)
forbidden APIs
server docs hosted somewhere
mv src/java -> src, src/test -> srctest
make sure if a doc fails to index that we log its fields+values
reindex api?
parallel reader "create new column" api? (TestDemoParallelReader)
rolling restart
task management api?
long running tasks?
- glass sharding
- reserve internal field names (leading _?)
- index aliases
- require _source field? so reindex is always possible ... but tricky because i must link to "the csv header used" if it's a csv source
- get facets working w/ NRT replication in lucene server
- rolling upgrade?
- federated
- reindex api?
- shards
- what about suggesters?
- how to handle index gen?
- deletion of old data
  - hmm: maybe use LogByteSizeMP? and then, i can prune individual segments? (sketch below)
  - force merge closed shards, then close IW?
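
  rough sketch of the LogByteSizeMP idea, assuming append-only time-ordered indexing: LogByteSizeMergePolicy only merges adjacent segments, so each segment covers a contiguous time range and old data can be dropped a whole segment at a time (the size cap here is arbitrary):

      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.index.IndexWriterConfig;
      import org.apache.lucene.index.LogByteSizeMergePolicy;

      static IndexWriterConfig pruneFriendlyConfig() {
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        // unlike the default TieredMergePolicy, this never merges
        // non-adjacent segments, so time locality is preserved
        LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
        mp.setMaxMergeMB(1024.0);  // arbitrary cap: big old segments stay individually droppable
        iwc.setMergePolicy(mp);
        return iwc;
      }
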
- should snapshots really hold settings too?
- curious: spillover shards can be closed as soon as the next spillover happens: powerful
- index subdir should be random uid, not the index name (ES had problems here)
- have something like IMC
- make sure i fully exposed sorted set dv facets
- maybe add hierarchy?
- get IW 1, IW 2, ... working
- clean per-segment caching for facet counts
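
  a minimal sketch of per-segment caching keyed on the core cache key, so entries survive refreshes and die with the segment; computeCounts is a made-up stand-in for the real per-segment facet counting:

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;
      import org.apache.lucene.index.IndexReader;
      import org.apache.lucene.index.LeafReaderContext;

      class FacetCountCache {
        private final Map<IndexReader.CacheKey, int[]> cache = new ConcurrentHashMap<>();

        int[] countsFor(LeafReaderContext ctx) {
          IndexReader.CacheHelper helper = ctx.reader().getCoreCacheHelper();
          if (helper == null) {
            return computeCounts(ctx);  // uncacheable reader: just recompute
          }
          return cache.computeIfAbsent(helper.getKey(), key -> {
            helper.addClosedListener(cache::remove);  // evict with the segment core
            return computeCounts(ctx);
          });
        }

        private int[] computeCounts(LeafReaderContext ctx) {
          return new int[0];  // stand-in for walking the segment and counting facet ords
        }
      }
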
- rolling upgrades
- reindex api
- upgrade index api
- hmm "we don't cache aggregations": can lucene cache facet counts and post-aggreggate?
- live similarity updating?
- accept cbor too
- how to index a binary field when CSV import is via http?
- index aliases?
- reindex api? should i store _source?
- add op to collapse indices together
- make a larger scale distributed search test
- 5 shards :)
- where else to disable nagle's
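
  for reference, the fix wherever nodes open sockets to each other: set TCP_NODELAY so small RPC messages go out immediately instead of waiting to coalesce:

      import java.io.IOException;
      import java.net.Socket;

      static Socket connect(String host, int port) throws IOException {
        Socket s = new Socket(host, port);
        s.setTcpNoDelay(true);  // disable nagle's algorithm
        return s;
      }
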
- distributed search
  - test across N machines
  - distributed term stats
  - separate fetch phase
- get this controlled RT working with replication too
- really i just need an indexGen -> searcherVersion mapping? an incoming query then looks up the searcherVersion covering its indexGen
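
  minimal sketch of that mapping (class and method names are made up): each refresh records the last gen the new searcher covers; a query asks for the first searcher at or past its gen:

      import java.util.Map;
      import java.util.concurrent.ConcurrentSkipListMap;

      class GenMap {
        // searcherVersion keyed by the last indexGen that searcher covers
        private final ConcurrentSkipListMap<Long, Long> byGen = new ConcurrentSkipListMap<>();

        // on each refresh: the new searcher covers everything up to lastGen
        void onRefresh(long lastGen, long searcherVersion) {
          byGen.put(lastGen, searcherVersion);
        }

        // smallest recorded gen >= the query's gen already contains it;
        // null means the caller must wait for the next refresh
        Long searcherVersionFor(long indexGen) {
          Map.Entry<Long, Long> e = byGen.ceilingEntry(indexGen);
          return e == null ? null : e.getValue();
        }
      }
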
- MAKE TEST lucene server: concurrent close index while searching
- node lock, to ensure only one node is running "here"?
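
  sketch reusing Lucene's own Directory locking for this; the "node.lock" name is made up:

      import java.io.IOException;
      import java.nio.file.Paths;
      import org.apache.lucene.store.Directory;
      import org.apache.lucene.store.FSDirectory;
      import org.apache.lucene.store.Lock;
      import org.apache.lucene.store.LockObtainFailedException;

      // acquire once at startup; hold the lock for the life of the process
      static Lock acquireNodeLock(String stateDir) throws IOException {
        Directory dir = FSDirectory.open(Paths.get(stateDir));
        try {
          return dir.obtainLock("node.lock");  // hypothetical lock file name
        } catch (LockObtainFailedException e) {
          throw new IllegalStateException("another node is already running here", e);
        }
      }
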
- geo names
- get copy field working
- add ./build.py beast
- source field, reindexing?
- upgrade process how?
- persistent http connections b/w nodes
- go open issue about the double rounding difference
- no automatic routing of documents to shards: user simply adds to whichever index
  - this is nice because load can be reduced by splitting docs out to the indices that are more lightly loaded right now
  - and we can make new shards any time
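
  client-side sketch of what that routing could look like (Shard and queuedDocs are hypothetical names; assumes a non-empty shard list):

      import java.util.List;

      interface Shard {
        long queuedDocs();  // hypothetical load signal
        void add(byte[] doc);
      }

      // no server routing: just send each doc (or chunk of docs) to
      // whichever shard/index currently has the smallest backlog
      static Shard pickShard(List<Shard> shards) {
        Shard best = shards.get(0);
        for (Shard s : shards) {
          if (s.queuedDocs() < best.queuedDocs()) {
            best = s;
          }
        }
        return best;
      }
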
- node to node security
- the strongly timestamp'd case
  - time based indices
  - append only
  - new index every X time
  - time range filter should only query indices having that time range
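
  sketch of the pruning step, assuming each index records the time window it covers (TimeIndex is a made-up metadata record):

      import java.util.ArrayList;
      import java.util.List;

      record TimeIndex(String name, long minTime, long maxTime) {}

      // a time range filter only needs indices whose window overlaps the query
      static List<TimeIndex> indicesFor(long from, long to, List<TimeIndex> all) {
        List<TimeIndex> hits = new ArrayList<>();
        for (TimeIndex idx : all) {
          if (idx.maxTime() >= from && idx.minTime() <= to) {
            hits.add(idx);
          }
        }
        return hits;
      }
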
- index sorting
- simon's mapping of big index down to 1 segment over time ... oh, N shards down to 1
- searching N indices and merging
- reducing load by changing which node is primary
- ttl?
- es limitations
- rolling upgrade
- does lucene server also have a _field_names? opt in?
- be smart about range searching, if the index or segment's values are outside of the range...
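
  sketch of the per-segment check, assuming the field is indexed as a LongPoint; PointValues exposes each segment's min/max packed values, so a whole segment can be skipped when its values cannot intersect the range:

      import java.io.IOException;
      import org.apache.lucene.document.LongPoint;
      import org.apache.lucene.index.LeafReader;
      import org.apache.lucene.index.PointValues;

      static boolean segmentCanMatch(LeafReader leaf, String field,
                                     long lower, long upper) throws IOException {
        PointValues points = leaf.getPointValues(field);
        if (points == null) {
          return false;  // no values for this field in this segment
        }
        long segMin = LongPoint.decodeDimension(points.getMinPackedValue(), 0);
        long segMax = LongPoint.decodeDimension(points.getMaxPackedValue(), 0);
        return segMax >= lower && segMin <= upper;  // ranges overlap
      }
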
- link to the docs from github, i guess running @ jirasearch?
- improve the docs!
- cold/hot indices
- resharding
- maybe open up allowed "overage backlog" in flushing new segments???
- binary format for adding docs?
- ES distributed problems
  - single node being bad dragging down whole cluster
- ugh: include indexTaxis.py in release!
  - make it "aware" when it's in a release unzip
- hrm: merges fall behind w/ replication
  - should i prioritize merges as fifo?
- settings that take units?
- measure replica NRT latency
- compare json bulk import vs csv
- test nrt replication
- taxis
- why are stored fields there?
- investigate StreamTokenizer?
- is 1 gb buffer really better?
- RUN RAW LUCENE
- test building full index, and then adding replica!
- test index sorting lucene
- replication
- try index sorting
- upgrade jirasearch, switch to SimpleQP
- persistent http/1.1 connections
- diffs vs ES
  - thin(ish) wrapper on Lucene
  - anal checking of settings, request params, etc.
  - ES combines "flush" and "refresh"; uwe's issue about NOT wanting to refresh
  - crazy slow restart time for ES to "recover"
  - sharing field/document instances
  - using addDocuments
  - streaming bulk index API, client is single threaded
  - faster double parsing
  - CSV not json
  - no version map
  - no xlog
  - no periodic fsync
  - no added id field, _type, _version, _source, _all
  - IW can pick single segment to write to disk at a time, but ES writes them all, or refreshes
  - 512 KB chunks go into the per-thread queue, not single docs
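
  sketch of that chunked handoff (buffer size and queue depth are arbitrary here; real code would also cut each chunk on a document boundary before queueing): the network thread slices the bulk stream into ~512 KB chunks and worker threads each parse and index a whole chunk:

      import java.io.IOException;
      import java.io.InputStream;
      import java.util.Arrays;
      import java.util.concurrent.BlockingQueue;

      static final int CHUNK_BYTES = 512 * 1024;

      static void produce(InputStream in, BlockingQueue<byte[]> chunks)
          throws IOException, InterruptedException {
        byte[] buf = new byte[CHUNK_BYTES];
        int len;
        while ((len = in.read(buf)) != -1) {
          chunks.put(Arrays.copyOf(buf, len));  // hand a whole chunk, not single docs
        }
      }
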