
sparse_embeddings are not stored in Vertex AI index when using add_texts_with_embeddings #720

Closed
personabb opened this issue Jan 29, 2025 · 1 comment

Comments

@personabb
Contributor

personabb commented Jan 29, 2025

Hello maintainers,
I noticed that when I call vector_store.add_texts_with_embeddings() with the sparse_embeddings parameter, the sparse embeddings are not actually stored in the Vertex AI index.

I followed the LangChain tutorial (link here), and used the following code snippet:

vector_store.add_texts_with_embeddings(
    texts=texts,
    embeddings=embeddings,
    sparse_embeddings=sparse_embeddings,
    ids=ids,
    metadatas=metadatas,
)
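For context on the shapes involved: each sparse embedding passed here follows the values/dimensions representation that Vertex AI hybrid search uses. A minimal sketch of preparing one (the helper name `to_sparse_embedding` and the example token weights are mine, not a library API):

```python
# Sketch: converting a {dimension_index: weight} mapping into the
# values/dimensions shape used for Vertex AI sparse embeddings.
# `to_sparse_embedding` is illustrative, not part of the library.

def to_sparse_embedding(weights: dict[int, float]) -> dict:
    dims = sorted(weights)
    return {
        "values": [weights[d] for d in dims],
        "dimensions": dims,
    }

sparse_embeddings = [to_sparse_embedding({3: 0.5, 17: 1.2})]
print(sparse_embeddings[0])
# {'values': [0.5, 1.2], 'dimensions': [3, 17]}
```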

However, the sparse_embeddings do not appear in the resulting Vertex AI index. After investigating the source code, I found that in the data_points_to_batch_update_records function (in libs/vertexai/langchain_google_vertexai/vectorstores/_utils.py), the sparse embeddings are seemingly dropped. Specifically:

for data_point in data_points:
    record = {
        "id": data_point.datapoint_id,
        "embedding": [component for component in data_point.feature_vector],
        "restricts": [
            {
                "namespace": restrict.namespace,
                "allow": [item for item in restrict.allow_list],
            }
            for restrict in data_point.restricts
        ],
        "numeric_restricts": [
            {"namespace": restrict.namespace, "value_float": restrict.value_float}
            for restrict in data_point.numeric_restricts
        ],
    }

    records.append(record)

It appears that the sparse embedding data is never included in the record dictionary here, even when the data point carries a sparse_embedding.
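The drop can be reproduced in isolation by running the same record-building loop against a mock data point (SimpleNamespace stands in for the real protobuf datapoint type):

```python
# Minimal reproduction of the record-building loop from _utils.py,
# using a mock data point instead of the real Vertex AI datapoint type.
from types import SimpleNamespace

data_point = SimpleNamespace(
    datapoint_id="doc-1",
    feature_vector=[0.1, 0.2],
    restricts=[],
    numeric_restricts=[],
    sparse_embedding=SimpleNamespace(values=[0.5], dimensions=[3]),
)

records = []
for dp in [data_point]:
    record = {
        "id": dp.datapoint_id,
        "embedding": [component for component in dp.feature_vector],
        "restricts": [
            {"namespace": r.namespace, "allow": [item for item in r.allow_list]}
            for r in dp.restricts
        ],
        "numeric_restricts": [
            {"namespace": r.namespace, "value_float": r.value_float}
            for r in dp.numeric_restricts
        ],
    }
    records.append(record)

# The sparse embedding attached to the mock never reaches the record:
print("sparse_embedding" in records[0])  # False
```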

Expected Behavior

sparse_embeddings should be stored in the Vertex AI index so that hybrid search can function as intended.

Actual Behavior

sparse_embeddings are not stored in the index, preventing hybrid search from utilizing them.

Proposed Fix

I tested a local fix by modifying the code to include sparse_embeddings data, as shown below:

(data_points_to_batch_update_records function (in libs/vertexai/langchain_google_vertexai/vectorstores/_utils.py))

for data_point in data_points:
    record = {
        "id": data_point.datapoint_id,
        "embedding": [component for component in data_point.feature_vector],
        "restricts": [
            {
                "namespace": restrict.namespace,
                "allow": [item for item in restrict.allow_list],
            }
            for restrict in data_point.restricts
        ],
        "numeric_restricts": [
            {"namespace": restrict.namespace, "value_float": restrict.value_float}
            for restrict in data_point.numeric_restricts
        ],
    }

    if hasattr(data_point, "sparse_embedding") and data_point.sparse_embedding is not None:
        record["sparse_embedding"] = {
            "values": [value for value in data_point.sparse_embedding.values],
            "dimensions": [dim for dim in data_point.sparse_embedding.dimensions],
        }

    records.append(record)
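On a mock data point, the patched branch produces the record shape one would expect (again using SimpleNamespace in place of the real datapoint type; only the fields relevant to the fix are mocked):

```python
# Sketch: exercising just the added sparse_embedding branch on a mock.
from types import SimpleNamespace

data_point = SimpleNamespace(
    datapoint_id="doc-1",
    sparse_embedding=SimpleNamespace(values=[0.5, 1.2], dimensions=[3, 17]),
)

record = {"id": data_point.datapoint_id}
if hasattr(data_point, "sparse_embedding") and data_point.sparse_embedding is not None:
    record["sparse_embedding"] = {
        "values": [value for value in data_point.sparse_embedding.values],
        "dimensions": [dim for dim in data_point.sparse_embedding.dimensions],
    }

print(record["sparse_embedding"])
# {'values': [0.5, 1.2], 'dimensions': [3, 17]}
```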

After applying the above fix, I ran the following sample code snippet from the tutorial:

vector_store.similarity_search_by_vector_with_score(
    embedding=embedding,
    sparse_embedding=sparse_embedding,
    k=5,
    rrf_ranking_alpha=0.7,  # 0.7 weight to dense and 0.3 weight to sparse
)

As a result, I confirmed that the sparse embeddings are now retrieved as expected, and the hybrid search functionality works properly.
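For intuition about rrf_ranking_alpha=0.7: it blends the dense and sparse rankings roughly like a weighted reciprocal-rank fusion. A toy sketch of the idea (this illustrates the concept only; it is not the exact formula Vertex AI applies internally):

```python
# Toy weighted reciprocal-rank fusion: alpha weights the dense ranking,
# (1 - alpha) the sparse ranking. Ranks are 1-based; k is the usual
# RRF smoothing constant. Illustrative only, not the Vertex AI internals.

def hybrid_score(dense_rank: int, sparse_rank: int,
                 alpha: float = 0.7, k: int = 60) -> float:
    return alpha / (k + dense_rank) + (1 - alpha) / (k + sparse_rank)

# With alpha=0.7, a document ranked 1st dense / 10th sparse beats one
# ranked 10th dense / 1st sparse, since dense carries more weight:
print(hybrid_score(1, 10) > hybrid_score(10, 1))  # True
```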

Request for Guidance

I am a beginner in open-source contributions, so I am unsure if this change might have a broader impact on other parts of the library. I would greatly appreciate any feedback from the maintainers or other contributors regarding:

  1. Whether this fix is valid or if it breaks anything else.
  2. If there is a preferred or more robust approach to including sparse_embeddings in the index.

Thank you in advance for your time and guidance. If you think this is acceptable, I would be happy to open a PR with this proposed fix.

@lkuligin
Collaborator

the fix looks good to me, feel free to send a PR, please!

personabb pushed a commit to personabb/langchain-google that referenced this issue Jan 30, 2025