[Question]: build_nodes_from_splits looks incorrect #17348

sgondala · 2024-12-22T05:10:17Z

Question Validation

I have searched both the documentation and discord for an answer.

Question

Context: build_nodes_from_splits in llama-index-core/llama_index/core/node_parser/node_utils.py

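Each node's embedding is initialized to the document's embedding. This looks incorrect: a node holds only one chunk of the document's text, so the document-level vector does not describe it.

We should probably initialize it with None and compute per-node embeddings later as needed.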

def build_nodes_from_splits(
    text_splits: List[str],
    document: BaseNode,
    ref_doc: Optional[BaseNode] = None,
    id_func: Optional[IdFuncCallable] = None,
) -> List[TextNode]:
    """Build nodes from splits."""
    ref_doc = ref_doc or document
    id_func = id_func or default_id_func
    nodes: List[TextNode] = []
    """Calling as_related_node_info() on a document recomputes the hash for the whole text and metadata"""
    """It is not that bad, when creating relationships between the nodes, but is terrible when adding a relationship"""
    """between the node and a document, hence we create the relationship only once here and pass it to the nodes"""
    relationships = {NodeRelationship.SOURCE: ref_doc.as_related_node_info()}
    for i, text_chunk in enumerate(text_splits):
        logger.debug(f"> Adding chunk: {truncate_text(text_chunk, 50)}")

        if isinstance(document, ImageDocument):
            image_node = ImageNode(
                id_=id_func(i, document),
                text=text_chunk,
                embedding=document.embedding,
                image=document.image,
                image_path=document.image_path,
                image_url=document.image_url,
                excluded_embed_metadata_keys=document.excluded_embed_metadata_keys,
                excluded_llm_metadata_keys=document.excluded_llm_metadata_keys,
                metadata_seperator=document.metadata_separator,
                metadata_template=document.metadata_template,
                text_template=document.text_template,
                relationships=relationships,
            )
            nodes.append(image_node)  # type: ignore
        elif isinstance(document, Document):
            node = TextNode(
                id_=id_func(i, document),
                text=text_chunk,
                embedding=document.embedding,
                excluded_embed_metadata_keys=document.excluded_embed_metadata_keys,
                excluded_llm_metadata_keys=document.excluded_llm_metadata_keys,
                metadata_seperator=document.metadata_separator,
                metadata_template=document.metadata_template,
                text_template=document.text_template,
                relationships=relationships,
            )
            nodes.append(node)
        elif isinstance(document, TextNode):
            node = TextNode(
                id_=id_func(i, document),
                text=text_chunk,
                embedding=document.embedding,
                excluded_embed_metadata_keys=document.excluded_embed_metadata_keys,
                excluded_llm_metadata_keys=document.excluded_llm_metadata_keys,
                metadata_seperator=document.metadata_seperator,
                metadata_template=document.metadata_template,
                text_template=document.text_template,
                relationships=relationships,
            )
            nodes.append(node)
        else:
            raise ValueError(f"Unknown document type: {type(document)}")

    return nodes
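
For illustration, the difference between the current behaviour and the proposed one can be sketched without llama-index at all. `Doc`, `Chunk`, `build_chunks`, and the `inherit_embedding` flag below are hypothetical stand-ins made up for this sketch; they are not part of the library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a document; not a llama-index class."""
    text: str
    embedding: Optional[List[float]] = None


@dataclass
class Chunk:
    """Hypothetical stand-in for a text node; not a llama-index class."""
    text: str
    embedding: Optional[List[float]] = None


def build_chunks(splits: List[str], doc: Doc, inherit_embedding: bool) -> List[Chunk]:
    # inherit_embedding=True mirrors the behaviour described in the question:
    # every chunk receives the document's vector.
    # inherit_embedding=False mirrors the proposal: start with None and let a
    # later step compute a per-chunk embedding.
    emb = doc.embedding if inherit_embedding else None
    return [Chunk(text=s, embedding=emb) for s in splits]


doc = Doc(text="alpha beta", embedding=[0.1, 0.2])
inherited = build_chunks(["alpha", "beta"], doc, inherit_embedding=True)
proposed = build_chunks(["alpha", "beta"], doc, inherit_embedding=False)
```

With `inherit_embedding=True`, both chunks carry the same vector even though their text differs; with `False`, each chunk is left unembedded until something computes a vector for it.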

dosubot · 2024-12-22T05:12:05Z

Hey there, @sgondala! I'm here to help you out with any bugs, questions, or contributions. Let's tackle this together!

The build_nodes_from_splits function initializes each node's embedding to the document's embedding in order to carry the original document's context over to its nodes. The embedding represents the semantic content of the document, so copying it gives every node that document-level semantic information. Initializing with None would drop this context, which might not be desirable for the intended use of these nodes. The current implementation is therefore intentional and should not be changed to None without a specific reason that aligns with the application's requirements [1].
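
One way to see what copying the document embedding implies for retrieval is a toy similarity check. This is only a sketch with made-up two-dimensional vectors and a hand-written `cosine` helper; it does not call llama-index and makes no claim about how a particular vector store scores nodes.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Assumed values: one document vector shared by every chunk.
doc_embedding = [0.6, 0.8]
chunk_embeddings: Dict[str, List[float]] = {
    "chunk_0": doc_embedding,
    "chunk_1": doc_embedding,
    "chunk_2": doc_embedding,
}

query = [1.0, 0.0]
scores = {name: cosine(query, emb) for name, emb in chunk_embeddings.items()}
```

Because every chunk holds the same vector, every entry in `scores` is identical, so a similarity ranking over these chunks cannot tell them apart until per-chunk embeddings are computed.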


sgondala added the "question" label (Further information is requested) · Dec 22, 2024
