
sparse_embeddings are not stored in Vertex AI index when using add_texts_with_embeddings #720

Closed
personabb opened this issue Jan 29, 2025 · 1 comment

Comments

@personabb
Contributor

personabb commented Jan 29, 2025

Hello maintainers,
I noticed that when I call vector_store.add_texts_with_embeddings() with the sparse_embeddings parameter, the sparse embeddings are not actually stored in the Vertex AI index.

I followed the LangChain tutorial (link here), and used the following code snippet:

vector_store.add_texts_with_embeddings(
    texts=texts,
    embeddings=embeddings,
    sparse_embeddings=sparse_embeddings,
    ids=ids,
    metadatas=metadatas,
)
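For context on the shapes involved: each sparse embedding passed here follows the values/dimensions representation that Vertex AI hybrid search uses. A minimal sketch of preparing one (the helper name `to_sparse_embedding` and the example token weights are mine, not a library API):

```python
# Sketch: converting a {dimension_index: weight} mapping into the
# values/dimensions shape used for Vertex AI sparse embeddings.
# `to_sparse_embedding` is illustrative, not part of the library.

def to_sparse_embedding(weights: dict[int, float]) -> dict:
    dims = sorted(weights)
    return {
        "values": [weights[d] for d in dims],
        "dimensions": dims,
    }

sparse_embeddings = [to_sparse_embedding({3: 0.5, 17: 1.2})]
print(sparse_embeddings[0])
# {'values': [0.5, 1.2], 'dimensions': [3, 17]}
```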

However, the sparse_embeddings do not appear in the resulting Vertex AI index. After investigating the source code, I found that in the data_points_to_batch_update_records function (in libs/vertexai/langchain_google_vertexai/vectorstores/_utils.py), the sparse embeddings are seemingly dropped. Specifically:

for data_point in data_points:
    record = {
        "id": data_point.datapoint_id,
        "embedding": [component for component in data_point.feature_vector],
        "restricts": [
            {
                "namespace": restrict.namespace,
                "allow": [item for item in restrict.allow_list],
            }
            for restrict in data_point.restricts
        ],
        "numeric_restricts": [
            {"namespace": restrict.namespace, "value_float": restrict.value_float}
            for restrict in data_point.numeric_restricts
        ],
    }

    records.append(record)

It appears that the sparse embedding data is never included in the record dictionary here, even when the data point carries a sparse_embedding.
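The drop can be reproduced in isolation by running the same record-building loop against a mock data point (SimpleNamespace stands in for the real protobuf datapoint type):

```python
# Minimal reproduction of the record-building loop from _utils.py,
# using a mock data point instead of the real Vertex AI datapoint type.
from types import SimpleNamespace

data_point = SimpleNamespace(
    datapoint_id="doc-1",
    feature_vector=[0.1, 0.2],
    restricts=[],
    numeric_restricts=[],
    sparse_embedding=SimpleNamespace(values=[0.5], dimensions=[3]),
)

records = []
for dp in [data_point]:
    record = {
        "id": dp.datapoint_id,
        "embedding": [component for component in dp.feature_vector],
        "restricts": [
            {"namespace": r.namespace, "allow": [item for item in r.allow_list]}
            for r in dp.restricts
        ],
        "numeric_restricts": [
            {"namespace": r.namespace, "value_float": r.value_float}
            for r in dp.numeric_restricts
        ],
    }
    records.append(record)

# The sparse embedding attached to the mock never reaches the record:
print("sparse_embedding" in records[0])  # False
```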

Expected Behavior

sparse_embeddings should be stored in the Vertex AI index so that hybrid search can function as intended.

Actual Behavior

sparse_embeddings are not stored in the index, preventing hybrid search from utilizing them.

Proposed Fix

I tested a local fix by modifying the code to include sparse_embeddings data, as shown below:

(data_points_to_batch_update_records function (in libs/vertexai/langchain_google_vertexai/vectorstores/_utils.py))

for data_point in data_points:
    record = {
        "id": data_point.datapoint_id,
        "embedding": [component for component in data_point.feature_vector],
        "restricts": [
            {
                "namespace": restrict.namespace,
                "allow": [item for item in restrict.allow_list],
            }
            for restrict in data_point.restricts
        ],
        "numeric_restricts": [
            {"namespace": restrict.namespace, "value_float": restrict.value_float}
            for restrict in data_point.numeric_restricts
        ],
    }

    if hasattr(data_point, "sparse_embedding") and data_point.sparse_embedding is not None:
        record["sparse_embedding"] = {
            "values": [value for value in data_point.sparse_embedding.values],
            "dimensions": [dim for dim in data_point.sparse_embedding.dimensions],
        }

    records.append(record)
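On a mock data point, the patched branch produces the record shape one would expect (again using SimpleNamespace in place of the real datapoint type; only the fields relevant to the fix are mocked):

```python
# Sketch: exercising just the added sparse_embedding branch on a mock.
from types import SimpleNamespace

data_point = SimpleNamespace(
    datapoint_id="doc-1",
    sparse_embedding=SimpleNamespace(values=[0.5, 1.2], dimensions=[3, 17]),
)

record = {"id": data_point.datapoint_id}
if hasattr(data_point, "sparse_embedding") and data_point.sparse_embedding is not None:
    record["sparse_embedding"] = {
        "values": [value for value in data_point.sparse_embedding.values],
        "dimensions": [dim for dim in data_point.sparse_embedding.dimensions],
    }

print(record["sparse_embedding"])
# {'values': [0.5, 1.2], 'dimensions': [3, 17]}
```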

After applying the above fix, I ran the following sample code snippet from the tutorial:

vector_store.similarity_search_by_vector_with_score(
    embedding=embedding,
    sparse_embedding=sparse_embedding,
    k=5,
    rrf_ranking_alpha=0.7,  # 0.7 weight to dense and 0.3 weight to sparse
)

As a result, I confirmed that the sparse embeddings are now retrieved as expected, and the hybrid search functionality works properly.
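For intuition about rrf_ranking_alpha=0.7: it blends the dense and sparse rankings roughly like a weighted reciprocal-rank fusion. A toy sketch of the idea (this illustrates the concept only; it is not the exact formula Vertex AI applies internally):

```python
# Toy weighted reciprocal-rank fusion: alpha weights the dense ranking,
# (1 - alpha) the sparse ranking. Ranks are 1-based; k is the usual
# RRF smoothing constant. Illustrative only, not the Vertex AI internals.

def hybrid_score(dense_rank: int, sparse_rank: int,
                 alpha: float = 0.7, k: int = 60) -> float:
    return alpha / (k + dense_rank) + (1 - alpha) / (k + sparse_rank)

# With alpha=0.7, a document ranked 1st dense / 10th sparse beats one
# ranked 10th dense / 1st sparse, since dense carries more weight:
print(hybrid_score(1, 10) > hybrid_score(10, 1))  # True
```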

Request for Guidance

I am a beginner in open-source contributions, so I am unsure if this change might have a broader impact on other parts of the library. I would greatly appreciate any feedback from the maintainers or other contributors regarding:

  1. Whether this fix is valid or if it breaks anything else.
  2. If there is a preferred or more robust approach to including sparse_embeddings in the index.

Thank you in advance for your time and guidance. If you think this is acceptable, I would be happy to open a PR with this proposed fix.

@lkuligin
Collaborator

the fix looks good to me, feel free to send a PR, please!

personabb pushed a commit to personabb/langchain-google that referenced this issue Jan 30, 2025