From d8e04e89a0cd31b2469f18386e583da25b7d5223 Mon Sep 17 00:00:00 2001
From: glemaitre
Date: Fri, 19 Apr 2024 14:44:42 +0000
Subject: [PATCH] [ci skip] iter 323fea613ca74e081ec1db6bbda0a77af7cff9ca
---
.../user_guide/information_retrieval.rst.txt | 6 +--
.../user_guide/large_language_model.rst.txt | 35 ++++++++----
objects.inv | Bin 1444 -> 1453 bytes
references/index.html | 4 +-
searchindex.js | 2 +-
user_guide/index.html | 11 ++--
user_guide/information_retrieval.html | 16 +++---
user_guide/large_language_model.html | 51 +++++++++---------
user_guide/text_scraping.html | 2 +-
9 files changed, 72 insertions(+), 55 deletions(-)
diff --git a/_sources/user_guide/information_retrieval.rst.txt b/_sources/user_guide/information_retrieval.rst.txt
index ff9c173..a9adeb6 100644
--- a/_sources/user_guide/information_retrieval.rst.txt
+++ b/_sources/user_guide/information_retrieval.rst.txt
@@ -44,15 +44,15 @@ approximate nearest neighbor algorithm, namely `FAISS
As embedding, we provide a :class:`~ragger_duck.embedding.SentenceTransformer` that
downloads any pre-trained sentence transformer from HuggingFace.
-Reranker: merging lexical and semantic retrievers
-=================================================
+Reranker: merging lexical and semantic retriever results
+========================================================
If we use both lexical and semantic retrievers, we need to merge the results of both
retrievers. :class:`~ragger_duck.retrieval.RetrieverReranker` makes such reranking by
using a cross-encoder model. In our case, the cross-encoder model is trained on Microsoft
Bing query-document pairs and is available on HuggingFace.
-API of retrivers and Reranker
+API of retrievers and reranker
==============================
All retrievers and the reranker adhere to the same API with a `fit` and a `query` method.
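For illustration, here is a minimal sketch of what the semantic retriever described
above does under the hood, using the ``sentence-transformers`` and ``faiss`` libraries
directly rather than the ragger_duck API; the model name and the example documents are
placeholders::

    # Not the ragger_duck implementation: a minimal illustration of embedding
    # documents and indexing them with FAISS for semantic retrieval.
    import faiss
    from sentence_transformers import SentenceTransformer

    documents = [
        "RandomForestClassifier fits several decision trees on sub-samples.",
        "StandardScaler standardizes features by removing the mean.",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any pre-trained model from HuggingFace
    embeddings = model.encode(documents).astype("float32")

    # IndexFlatL2 is an exact index; FAISS also offers approximate indexes
    # for larger corpora.
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)

    query_embedding = model.encode(["How does a random forest work?"]).astype("float32")
    _, top_k = index.search(query_embedding, 1)
    print(documents[top_k[0][0]])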
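The reranking step itself can be sketched in the same spirit with the ``CrossEncoder``
class of ``sentence-transformers`` (again, not the ragger_duck API); the model name is
one example of a cross-encoder trained on MS MARCO (Bing) query-document pairs available
on HuggingFace::

    from sentence_transformers import CrossEncoder

    query = "How does a random forest work?"
    # Candidate documents returned by the lexical and semantic retrievers.
    candidates = [
        "RandomForestClassifier fits several decision trees on sub-samples.",
        "StandardScaler standardizes features by removing the mean.",
    ]

    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = cross_encoder.predict([(query, doc) for doc in candidates])

    # Keep the candidates ordered by relevance score, most relevant first.
    reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]

In ragger_duck, this scoring sits behind the same interface as the retrievers: `fit`
builds the index while `query` returns the top-k documents for a given query.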
diff --git a/_sources/user_guide/large_language_model.rst.txt b/_sources/user_guide/large_language_model.rst.txt
index 93ac911..42f8ee5 100644
--- a/_sources/user_guide/large_language_model.rst.txt
+++ b/_sources/user_guide/large_language_model.rst.txt
@@ -1,13 +1,30 @@
.. _large_language_model:
-=========
-Prompting
-=========
+====================
+Large Language Model
+====================
-Prompting for API documentation
-===============================
+In the RAG framework, the Large Language Model (LLM) is the cherry on top. It is in
+charge of generating the answer to the query based on the context retrieved.
-:class:`~ragger_duck.prompt.BasicPromptingStrategy` implements a prompting
-strategy to answer documentation questions. We get context by reranking the
-search from a lexical and semantic retrievers. Once the context is retrieved,
-we request a Large Language Model (LLM) to answer the question.
+An important part of using the LLM is the prompt that triggers the generation. In this
+POC, we did not attempt to optimize the prompt because we did not have the data at hand
+for a proper evaluation.
+
+:class:`~ragger_duck.prompt.BasicPromptingStrategy` interfaces the LLM with
+the context found by the retrievers. For prototyping purposes, we also allow the
+retrievers to be bypassed. The prompt provided to the LLM is the following::
+
+ prompt = (
+ "[INST] You are a scikit-learn expert that should be able to answer"
+ " machine-learning question.\n\nAnswer to the query below using the"
+ " additional provided content. The additional content is composed of"
+ " the HTML link to the source and the extracted contextual"
+ " information.\n\nBe succinct.\n\n"
+ "Make sure to use backticks whenever you refer to class, function, "
+ "method, or name that contains underscores.\n\n"
+ f"query: {query}\n\n{context_query} [/INST]."
+ )
+
+When bypassing the retrievers, we do not provide any context, and the part of the
+prompt referring to the additional content is dropped.
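To make the template concrete, the snippet below is a small, self-contained sketch of
how the placeholders could be filled in before the string is handed to the LLM; the
exact formatting of ``context_query`` and the variable names are illustrative
assumptions rather than the actual ragger_duck code::

    # Illustrative only: the documentation states that the additional content
    # holds the HTML link to the source and the extracted contextual
    # information, but its exact formatting here is an assumption.
    query = "How do I choose the number of trees in a random forest?"
    retrieved = [
        (
            "https://scikit-learn.org/stable/modules/ensemble.html",
            "The main parameters to adjust are n_estimators and max_features.",
        ),
    ]

    # When the retrievers are bypassed, nothing is retrieved and this part of
    # the prompt is left out entirely.
    context_query = "\n\n".join(
        f"source: {source}\ncontent: {content}" for source, content in retrieved
    )

    prompt = (
        "[INST] You are a scikit-learn expert that should be able to answer"
        " machine-learning question.\n\nAnswer to the query below using the"
        " additional provided content. The additional content is composed of"
        " the HTML link to the source and the extracted contextual"
        " information.\n\nBe succinct.\n\n"
        "Make sure to use backticks whenever you refer to class, function, "
        "method, or name that contains underscores.\n\n"
        f"query: {query}\n\n{context_query} [/INST]."
    )
    print(prompt)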
diff --git a/objects.inv b/objects.inv
index c8866c6058cea3a290a295f5f72a29a7d42acfa1..d03a069b4a3ce0747a3e194d6f090508ed074975 100644
GIT binary patch