Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question]: build_nodes_from_splits looks incorrect #17348

Open
1 task done
sgondala opened this issue Dec 22, 2024 · 1 comment
Open
1 task done

[Question]: build_nodes_from_splits looks incorrect #17348

sgondala opened this issue Dec 22, 2024 · 1 comment
Labels
question Further information is requested

Comments

@sgondala
Copy link

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

Context: build_nodes_from_splits in llama-index-core/llama_index/core/node_parser/node_utils.py

Each node embedding is initialized to the value of document embedding. This is incorrect.

We should probably initialize it with None and later add embeddings as needed.

def build_nodes_from_splits(
    text_splits: List[str],
    document: BaseNode,
    ref_doc: Optional[BaseNode] = None,
    id_func: Optional[IdFuncCallable] = None,
) -> List[TextNode]:
    """Build nodes from splits."""
    ref_doc = ref_doc or document
    id_func = id_func or default_id_func
    nodes: List[TextNode] = []
    """Calling as_related_node_info() on a document recomputes the hash for the whole text and metadata"""
    """It is not that bad, when creating relationships between the nodes, but is terrible when adding a relationship"""
    """between the node and a document, hence we create the relationship only once here and pass it to the nodes"""
    relationships = {NodeRelationship.SOURCE: ref_doc.as_related_node_info()}
    for i, text_chunk in enumerate(text_splits):
        logger.debug(f"> Adding chunk: {truncate_text(text_chunk, 50)}")

        if isinstance(document, ImageDocument):
            image_node = ImageNode(
                id_=id_func(i, document),
                text=text_chunk,
                embedding=document.embedding,
                image=document.image,
                image_path=document.image_path,
                image_url=document.image_url,
                excluded_embed_metadata_keys=document.excluded_embed_metadata_keys,
                excluded_llm_metadata_keys=document.excluded_llm_metadata_keys,
                metadata_seperator=document.metadata_separator,
                metadata_template=document.metadata_template,
                text_template=document.text_template,
                relationships=relationships,
            )
            nodes.append(image_node)  # type: ignore
        elif isinstance(document, Document):
            node = TextNode(
                id_=id_func(i, document),
                text=text_chunk,
                embedding=document.embedding,
                excluded_embed_metadata_keys=document.excluded_embed_metadata_keys,
                excluded_llm_metadata_keys=document.excluded_llm_metadata_keys,
                metadata_seperator=document.metadata_separator,
                metadata_template=document.metadata_template,
                text_template=document.text_template,
                relationships=relationships,
            )
            nodes.append(node)
        elif isinstance(document, TextNode):
            node = TextNode(
                id_=id_func(i, document),
                text=text_chunk,
                embedding=document.embedding,
                excluded_embed_metadata_keys=document.excluded_embed_metadata_keys,
                excluded_llm_metadata_keys=document.excluded_llm_metadata_keys,
                metadata_seperator=document.metadata_seperator,
                metadata_template=document.metadata_template,
                text_template=document.text_template,
                relationships=relationships,
            )
            nodes.append(node)
        else:
            raise ValueError(f"Unknown document type: {type(document)}")

    return nodes
@sgondala sgondala added the question Further information is requested label Dec 22, 2024
Copy link

dosubot bot commented Dec 22, 2024

Hey there, @sgondala! I'm here to help you out with any bugs, questions, or contributions. Let's tackle this together!

The build_nodes_from_splits function initializes each node embedding to the value of the document embedding to maintain consistency and context from the original document. The embedding represents the semantic content of the document, and initializing each node with this embedding ensures that the nodes retain the semantic information of the original document. Initializing with None would mean losing this semantic context, which might not be desirable for the intended use of these nodes. Therefore, the current implementation is intentional and should not be changed to None without a specific reason that aligns with the application's requirements [1].

To continue talking to Dosu, mention @dosu.


Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

1 participant