setting chunk size of chunk_corpus function #56

doncat99 · 2024-09-13T15:29:27Z

def chunk_corpus(corpus: list, chunk_size: int = 64) -> list:
    """
    Chunk the corpus into smaller parts. Run the following command to download the required nltk data:
    python -c "import nltk; nltk.download('punkt')"

    @param corpus: the formatted corpus, see README.md
    @param chunk_size: the size of each chunk, i.e., the number of words in each chunk
    @return: chunked corpus, a list
    """

The default chunk_size is 64. Is that the best practice? I tried 150: the entity count was the same as with 64, but 10% more relationships were extracted.
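To make the effect of chunk_size concrete, here is a minimal sketch of word-count chunking. It is not the repository's chunk_corpus implementation: `chunk_words` is a hypothetical helper, and it splits on whitespace instead of using the nltk tokenizer mentioned in the docstring, so real word counts may differ slightly.

```python
def chunk_words(text: str, chunk_size: int = 64) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words.

    Hypothetical helper for illustration only; words are taken to be
    whitespace-separated tokens, not nltk tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    # Step through the word list in strides of chunk_size.
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# A synthetic 300-word document, chunked at the two sizes compared above.
sample = " ".join(f"w{i}" for i in range(300))
for size in (64, 150):
    chunks = chunk_words(sample, size)
    print(size, len(chunks), [len(c.split()) for c in chunks])
```

With larger chunks, more words share a chunk, so two entities mentioned far apart are more likely to appear in the same chunk. That may be one reason the relationship count changes while the entity count does not, though I have not verified this against the extraction code.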


doncat99 changed the title ~~chunk~~ setting chunk size of chunk_corpus function Sep 13, 2024
