
Support embedding similarity search for GraphRAG #2200

Open · wants to merge 8 commits into base: main
Conversation

@Jant1L commented Dec 13, 2024

Fix #2196

Description

GraphRAG now supports embedding similarity search. Keyword search remains available alongside it.

How to use

Set SIMILAR_SEARCH_ENABLED=True in .env (the default is False).
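For reference, a minimal .env snippet (the variable name comes from this PR; the surrounding file is illustrative):

# Enable vector similarity search for GraphRAG (defaults to False)
SIMILAR_SEARCH_ENABLED=True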

@@ -145,6 +149,7 @@ def upsert_entities(self, entities: Iterator[Vertex]) -> None:
    "_document_id": "0",
    "_chunk_id": "0",
    "_community_id": "0",
    "_embedding": entity.get_prop("embedding"),


Why does the get_prop here still use embedding instead of _embedding?

Author: When building the memory graph, the vector's key is embedding. The vector field in the schema was later renamed to _embedding to avoid overly long payloads when fetching data back from TuGraph: the underscore-prefixed name can be filtered by the white_list, preventing the returned data from getting too large. Chunks are handled the same way.
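A minimal sketch of the renaming described above, with the dict shape assumed from the diff excerpts in this PR:

# The in-memory graph stores the vector under "embedding"; on write to
# TuGraph the property becomes "_embedding", which the white_list then
# filters out when data is read back.
entity_props = {
    "id": self._escape_quotes(entity.vid),
    "name": self._escape_quotes(entity.name),
    "_embedding": entity.get_prop("embedding"),  # key renamed on write
}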

create_vector_index_query = (
    f"CALL db.addVertexVectorIndex("
    f'"{GraphElemType.ENTITY.value}", "_embedding", '
    "{dimension: 512})"
)


Should the vector dimension be a configuration option?
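One possible shape for such an option, a sketch only, reusing the Field(...) pattern that appears elsewhere in this PR (the name embedding_dimension is hypothetical):

embedding_dimension: int = Field(
    default=512,
    description="Dimension of the _embedding vector index",
)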


@@ -189,6 +203,7 @@ def upsert_chunks(self, chunks: Iterator[Union[Vertex, ParagraphChunk]]) -> None
    "id": self._escape_quotes(chunk.vid),
    "name": self._escape_quotes(chunk.name),
    "content": self._escape_quotes(chunk.get_prop("content")),
    "_embedding": chunk.get_prop("embedding"),


Why does the get_prop here still use embedding instead of _embedding?

Author: Same as the handling for entities.

@@ -42,6 +42,10 @@ def __init__(self, graph_store: TuGraphStore):
    # Create the graph
    self.create_graph(self.graph_store.get_config().name)

    # Flags ensuring each vector index is created only once
    self._chunk_vector_index = False
    self._entity_vector_index = False


Are upsert_vector and upsert_chunk called repeatedly? Why is this variable needed?

Author: Both upserts are called repeatedly, so this flag is used to make sure the index-creation statement is not executed more than once.
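A sketch of the guard being described (the conn.run executor call is an assumption; the flag name comes from the diff):

# Create the entity vector index at most once across repeated upserts.
if not self._entity_vector_index:
    self.graph_store.conn.run(create_vector_index_query)  # assumed executor API
    self._entity_vector_index = True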

similarity_search = (
    f"CALL db.vertexVectorKnnSearch("
    f"'{GraphElemType.ENTITY.value}','_embedding', {vector}, "
    "{top_k:2, hnsw_ef_search:10})"
)


Should top_k be a configuration option? It is hard-coded to 2 here.


What does the hnsw_ef_search parameter specify?

Author: I think a large top_k could hurt retrieval quality; in the original design one keyword maps to one header, whereas here one vector effectively maps to two headers. ef_search is a parameter that HNSW search requires; index construction takes parameters as well, but we use the defaults, so they are not written out here.
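If top_k were made configurable as suggested, the query might be built like this (a sketch; the top_k variable is a hypothetical config value):

similarity_search = (
    f"CALL db.vertexVectorKnnSearch("
    f"'{GraphElemType.ENTITY.value}','_embedding', {vector}, "
    f"{{top_k:{top_k}, hnsw_ef_search:10}})"  # top_k from config instead of a hard-coded 2
)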

@@ -560,10 +586,28 @@ def explore(
    rel = f"<-[r:{GraphElemType.RELATION.value}*{depth_string}]-"
else:
    rel = f"-[r:{GraphElemType.RELATION.value}*{depth_string}]-"

if all(isinstance(item, str) for item in subs):


Does determining whether subs is List[str] or List[List[float]] require traversing the entire list?

Author: With the original approach that did not inspect the elements, the two cases could not be told apart, since both are plain list types, so we have to look inside the list here.
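A sketch of the distinction under discussion: at runtime both variants are plain lists, so the element type has to be inspected. Assuming a homogeneous list, checking only the first element avoids a full traversal:

# subs is either List[str] (keywords) or List[List[float]] (vectors).
if subs and isinstance(subs[0], str):
    ...  # keyword search path
else:
    ...  # vector similarity search path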

if similar_search_enabled:
    keywords: List[List[float]] = []
    vector = await self._garph_embedder.embed(text)
    keywords.append(vector)


This variable should not be called keywords anymore, right?


Or rename keywords?

@SonglinLyu left a comment

Some of the comments below need follow-up confirmation.

@@ -348,8 +371,13 @@ async def asimilar_search_with_scores(

if document_graph_enabled:
    keywords_for_document_graph = keywords
    for vertex in subgraph.vertices():
        keywords_for_document_graph.append(vertex.name)
if similar_search_enabled:


When both triplet_graph and document_graph are enabled, the document-graph query should probably not use vectors but keep the original method of querying directly from vertex.name; vectors should be used only when document_graph alone is enabled.
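A sketch of the suggested control flow (assumed code, not the PR's; _garph_embedder is the attribute name used in this PR):

if document_graph_enabled:
    if triplet_graph_enabled:
        # Reuse the entity names found by the triplet search.
        subs = [vertex.name for vertex in subgraph.vertices()]
    elif similar_search_enabled:
        # Fall back to vectors only when the document graph is the sole path.
        subs = [await self._garph_embedder.embed(text)]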

@Appointat (Contributor) left a comment

Good job, and thanks for your PR. I have reviewed most of it and left some comments. Please fix them, thanks!

Comment on lines +38 to +41
resp = dashscope.TextEmbedding.call(
    model = dashscope.TextEmbedding.Models.text_embedding_v3,
    input = text,
    dimension = 512)

It is better not to hard-code the embedding model; that is not in line with extensibility.

The solution should be to encapsulate embedding_fn (an LLM embedding model, not dashscope's model) instead of calling the package directly. Refer to embedding_fn: Optional[Embeddings] = Field( ... from BuiltinKnowledgeGraphConfig.
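A sketch of the suggested encapsulation (the import path and the aembed_query method on Embeddings are assumptions; the point is injecting the model rather than importing dashscope):

from typing import List

from dbgpt.core import Embeddings  # assumed import path
from dbgpt.rag.transformer.base import EmbedderBase


class Text2Vector(EmbedderBase):
    """Embeds text via an injected embedding function."""

    def __init__(self, embedding_fn: Embeddings):
        self._embedding_fn = embedding_fn

    async def _embed(self, text: str) -> List[float]:
        # aembed_query is assumed to exist on the Embeddings interface.
        return await self._embedding_fn.aembed_query(text)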

Comment on lines +25 to +28
async def batch_embed(
    self,
    texts: List[str],
) -> List[List[float]]:

Add a parameter called batch_size, and don't forget to update the config variable consistently.
Refer to: _triplet_extraction_batch_size.

Comment on lines +31 to +33
for text in texts:
    vector = await self._embed(text)
    results.extend(vector)

Handle the tasks in batch/in parallel (note: asyncio).

Also, this should be append rather than extend.
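A sketch combining the two points, concurrency via asyncio and per-text append semantics (batch_size handling is omitted here):

import asyncio
from typing import List


async def batch_embed(self, texts: List[str]) -> List[List[float]]:
    # One embedding task per text, awaited concurrently; gather preserves
    # input order, so the i-th vector corresponds to texts[i].
    vectors = await asyncio.gather(*(self._embed(text) for text in texts))
    return list(vectors)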


from dbgpt.rag.transformer.base import EmbedderBase

logger = logging.getLogger(__name__)


- class Text2Vector(EmbedderBase):
+ class Text2Vector(EmbedderBase, ABC):

Is it an abstract class?

if vertex.get_prop("vertex_type") == GraphElemType.CHUNK.value:
    text = vertex.get_prop("content")
    vector = await self._embed(text)
    vertex.set_prop("embedding", vector)

Use _embedding, since it is a prop not visible to the users (it's implicit).

Comment on lines +82 to 86
similar_search_enabled: bool = Field(
    default=False,
    description="Enable the similarity search",
)


Move the code to after document_graph_enabled.

Suggested change
- similar_search_enabled: bool = Field(
-     default=False,
-     description="Enable the similarity search",
- )
+ similarity_search_enabled: bool = Field(
+     default=False,
+     description="Enable the similarity search",
+ )

Comment on lines +147 to +151
self._similar_search_enabled = (
    os.environ["SIMILAR_SEARCH_ENABLED"].lower() == "true"
    if "SIMILAR_SEARCH_ENABLED" in os.environ
    else config.similar_search_enabled
)

Use similarity, not similar.

@@ -244,6 +256,8 @@ async def _aload_triplet_graph(self, chunks: List[Chunk]) -> None:
    if not graphs_list:
        raise ValueError("No graphs extracted from the chunks")

    graphs_list = await self._garph_embedder.batch_embed(graphs_list)
@Appointat (Contributor) commented Dec 16, 2024

Using batch_size may be better, and _garph_embedder has a spelling error.

similar_search_enabled = self._similar_search_enabled

if similar_search_enabled:
    keywords: List[List[float]] = []

It is a list of vectors, which cannot be called keywords; use subs.

    keywords_for_document_graph.append(vector)
else:
    for vertex in subgraph.vertices():
        keywords_for_document_graph.append(vertex.name)

Note: I have not reviewed yet.

@SonglinLyu commented Dec 20, 2024

Change #2196 in the description to Fix #2196.

    keywords_for_document_graph.append(vertex.name)
if similar_search_enabled:
    for vertex in subgraph.vertices():
        vector = await self._garph_embedder.embed(vertex.name)
Collaborator: In the document graph used for tracing source chunks, only entity names can be used; the embeddings of entity names cannot be used.

@@ -363,6 +391,7 @@ async def asimilar_search_with_scores(
    limit=self._knowledge_graph_chunk_search_top_size,
    search_scope="document_graph",
)

Collaborator: We need an additional vector search over chunk.content, and to trace back all upstream paths of those chunks to obtain a chunk-based document graph, then merge it with the document graph produced by the triplet subgraph search.
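A sketch of what the additional chunk search might look like, mirroring the entity KNN query quoted earlier (the GraphElemType.CHUNK usage and the top_k value are assumptions):

chunk_similarity_search = (
    f"CALL db.vertexVectorKnnSearch("
    f"'{GraphElemType.CHUNK.value}','_embedding', {vector}, "
    "{top_k:5, hnsw_ef_search:10})"  # hypothetical top_k
)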

async def batch_embed(
    self,
    texts: List[str],
) -> List[List[float]]:
Collaborator: Returning List[] may be better.

async def batch_embed(
    self,
    graphs_list: List[List[Graph]],
) -> List[List[Graph]]:
Collaborator: Use:
async def batch_embed(
    self,
    graphs_list: List[Graph],
) -> List[Graph]:

Successfully merging this pull request may close these issues:

[Feature][GraphRAG] Enhance GraphRAG search by vector index