Problems about Cluster Data Environment #373
Question 1:
Partitioned indices allow you to create many indices based on one single Cassandra table, to go beyond the Lucene limit of 2^31 documents per index.
Virtual indices allow the Elasticsearch mapping to be stored only once for many partitioned indices.
Yes, there is a performance penalty if you create a lot of small partitioned indices (like Elasticsearch oversharding).
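To make the trade-off concrete, here is a rough sizing sketch: how many partitioned indices are needed so each one stays below the Lucene document limit. The 50 billion figure comes from the question below; everything else is just arithmetic, not an Elassandra API.

```python
# Rough sizing sketch: minimum number of partitioned indices that keeps
# every index below Lucene's 2^31 document limit.
LUCENE_MAX_DOCS = 2**31  # ~2.1 billion documents per Lucene index

def min_partitioned_indices(total_rows: int) -> int:
    """Smallest number of partitioned indices so each stays under the limit."""
    return -(-total_rows // LUCENE_MAX_DOCS)  # ceiling division

# 50 billion rows need at least 24 partitioned indices; creating far
# more than that (e.g. one per day for years of data) risks oversharding.
print(min_partitioned_indices(50_000_000_000))  # -> 24
```

The point of the calculation is that the partitioning granularity should be chosen from the data volume, not made as fine as possible.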
Question 2:
It really depends on your data: 50 billion rows with one column is not the same as 50 billion rows with 200 columns.
A Java 8 JVM should not have more than 30 GB of heap (see https://www.elastic.co/blog/a-heap-of-trouble).
You should test your application on a 3-node datacenter, add more data until the latency/throughput reaches your acceptable limit, then scale horizontally to store your full dataset.
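The test-then-scale advice above can be turned into a back-of-the-envelope node count. The rows-per-node capacity below is a hypothetical input you would measure on your own 3-node test datacenter, not a figure from Elassandra:

```python
# Back-of-the-envelope node count: measure sustainable rows per node on a
# small test datacenter, then scale out horizontally. Inputs are hypothetical.
def nodes_needed(total_rows: int, replication_factor: int,
                 rows_per_node: int) -> int:
    """Nodes required to hold total_rows at the given replication factor."""
    total_stored = total_rows * replication_factor  # every replica is stored
    return -(-total_stored // rows_per_node)        # ceiling division

# Example: 50 billion rows, RF=3, and a measured capacity of
# 2 billion rows per node would call for 75 nodes.
print(nodes_needed(50_000_000_000, 3, 2_000_000_000))  # -> 75
```

A higher measured per-node capacity (wider columns, heavier queries, or more headroom all change it) moves the count up or down, which is why the measurement step comes first.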
… On 21 Sep 2020, at 11:46, Coder-Qian wrote:
Question 1:
We know that 2^31 (2.1 billion) is the Lucene maximum number of documents per index, so by default one Cassandra node can only index 2.1 billion rows. For a single node, if I want to go beyond 2.1 billion, is there a way to do it without virtual indices? I ask because I don't know whether virtual indices have performance issues.
Question 2:
I have 50 billion rows of time-series data. If I use 3 replicas, how many nodes should I set up so that queries return within seconds? How many CPU cores and how much memory are recommended for each node? And how much off-heap memory should I leave on each node? In our production environment, we have 7 servers, each with 256 GB of memory and 32 CPU cores. Can they meet the requirements with Docker?
Please forgive my poor English, and thank you for your advice.
Our production environment has three tables. Each table has about 70 fields and needs to store 20 billion rows, with yyyyMMdd as the partition key. Since each Cassandra node should have at most a 30 GB heap, one server with 256 GB of memory can host 5 Cassandra nodes at a 2:1 ratio of heap to off-heap memory (256 / (30 + 15) ≈ 5.6). We want to size the cluster well when building the production environment, because migrating data after adding nodes to Elassandra is too slow. Thanks everyone for helping me; please don't let this post sink.
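The memory arithmetic in the comment above, made explicit (the 2:1 heap-to-off-heap ratio is the commenter's own assumption, not an Elassandra requirement):

```python
# Nodes per server under the commenter's assumed memory budget:
# 30 GB heap per node plus off-heap at a 2:1 heap-to-off-heap ratio.
HEAP_GB = 30
OFF_HEAP_GB = HEAP_GB // 2  # 2:1 ratio -> 15 GB off-heap per node

def nodes_per_server(server_ram_gb: int) -> int:
    """Whole Cassandra/Elassandra nodes that fit in one server's RAM."""
    return server_ram_gb // (HEAP_GB + OFF_HEAP_GB)

print(nodes_per_server(256))  # 256 / 45 ≈ 5.6, i.e. 5 whole nodes
```

In practice some RAM should also be left for the OS page cache, so packing fewer nodes per server than this ceiling allows is the safer choice.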