Problems about Cluster Data Environment #373
Question 1:
Partitioned indices allow you to create many indices based on one single Cassandra table, to go beyond the Lucene limit of 2^31 documents per index.
Virtual indices allow the Elasticsearch mapping to be stored only once for many partitioned indices.
Yes, there is a performance penalty if you create a lot of small partitioned indices (like Elasticsearch oversharding).
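To make the trade-off concrete, here is a rough sizing sketch: how many partitioned indices are needed so each one stays below the Lucene document limit. The 50 billion figure comes from the question below; everything else is just arithmetic, not an Elassandra API.

```python
# Rough sizing sketch: minimum number of partitioned indices that keeps
# every index below Lucene's 2^31 document limit.
LUCENE_MAX_DOCS = 2**31  # ~2.1 billion documents per Lucene index

def min_partitioned_indices(total_rows: int) -> int:
    """Smallest number of partitioned indices so each stays under the limit."""
    return -(-total_rows // LUCENE_MAX_DOCS)  # ceiling division

# 50 billion rows need at least 24 partitioned indices; creating far
# more than that (e.g. one per day for years of data) risks oversharding.
print(min_partitioned_indices(50_000_000_000))  # -> 24
```

The point of the calculation is that the partitioning granularity should be chosen from the data volume, not made as fine as possible.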
Question 2:
It really depends on your data: 50 billion rows with one column is not the same as 50 billion rows with 200 columns.
A Java 8 JVM should not have more than 30 GB of heap (see https://www.elastic.co/blog/a-heap-of-trouble).
You should test your application on a 3-node datacenter, add more data until the latency/throughput reaches your acceptable limit, then scale horizontally to store your full dataset.
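The test-then-scale advice above can be turned into a back-of-the-envelope node count. The rows-per-node capacity below is a hypothetical input you would measure on your own 3-node test datacenter, not a figure from Elassandra:

```python
# Back-of-the-envelope node count: measure sustainable rows per node on a
# small test datacenter, then scale out horizontally. Inputs are hypothetical.
def nodes_needed(total_rows: int, replication_factor: int,
                 rows_per_node: int) -> int:
    """Nodes required to hold total_rows at the given replication factor."""
    total_stored = total_rows * replication_factor  # every replica is stored
    return -(-total_stored // rows_per_node)        # ceiling division

# Example: 50 billion rows, RF=3, and a measured capacity of
# 2 billion rows per node would call for 75 nodes.
print(nodes_needed(50_000_000_000, 3, 2_000_000_000))  # -> 75
```

A higher measured per-node capacity (wider columns, heavier queries, or more headroom all change it) moves the count up or down, which is why the measurement step comes first.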
… On 21 Sep 2020, at 11:46, Coder-Qian wrote:
Question 1:
We know that 2^31 (2.1 billion) is the Lucene maximum number of documents per index, so by default one Cassandra node can only index 2.1 billion rows. For a single node, if I want to go beyond 2.1 billion, is there a way to do it without virtual indices? I ask because I don't know whether virtual indices have performance issues.
Question 2:
I have 50 billion rows of time-series data. If I use 3 replicas, how many nodes should I set up so that queries return within seconds? How many CPU cores and how much memory are recommended for each node? And how much off-heap memory should I leave on each node? In our production environment, we have 7 servers, each with 256 GB of memory and 32 CPU cores. Can they meet the requirements with Docker?
Please forgive my poor English, and thank you for your advice.
Our production environment has three tables. Each table has about 70 fields and needs to store 20 billion rows, with yyyyMMdd as the partition key. Since each Cassandra node should have at most a 30 GB heap, one server with 256 GB of memory can host 5 Cassandra nodes at a 2:1 ratio of heap to off-heap memory (256 / (30 + 15) ≈ 5.6). We want to size the cluster well when building the production environment, because migrating data after adding nodes to Elassandra is too slow. Thanks everyone for helping me; please don't let this post sink.
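The memory arithmetic in the comment above, made explicit (the 2:1 heap-to-off-heap ratio is the commenter's own assumption, not an Elassandra requirement):

```python
# Nodes per server under the commenter's assumed memory budget:
# 30 GB heap per node plus off-heap at a 2:1 heap-to-off-heap ratio.
HEAP_GB = 30
OFF_HEAP_GB = HEAP_GB // 2  # 2:1 ratio -> 15 GB off-heap per node

def nodes_per_server(server_ram_gb: int) -> int:
    """Whole Cassandra/Elassandra nodes that fit in one server's RAM."""
    return server_ram_gb // (HEAP_GB + OFF_HEAP_GB)

print(nodes_per_server(256))  # 256 / 45 ≈ 5.6, i.e. 5 whole nodes
```

In practice some RAM should also be left for the OS page cache, so packing fewer nodes per server than this ceiling allows is the safer choice.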