* why do I need to re-initialise the retrievers after unpickling the knowledge?
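  (likely because the retriever holds a handle, such as a model or an index, that pickle cannot serialize; a minimal sketch of the usual __getstate__/__setstate__ fix, where Retriever is a hypothetical stand-in for that member:)

    import pickle

    class Retriever:
        # hypothetical stand-in for a member that pickle cannot serialize
        def __init__(self, facts):
            self.index = {fact: position for position, fact in enumerate(facts)}

    class Knowledge:
        def __init__(self, facts):
            self.facts = facts
            self.retriever = Retriever(facts)

        def __getstate__(self):
            state = self.__dict__.copy()
            del state["retriever"]  # drop the non-picklable member
            return state

        def __setstate__(self, state):
            self.__dict__.update(state)
            self.retriever = Retriever(self.facts)  # rebuild on unpickle

    knowledge = pickle.loads(pickle.dumps(Knowledge(["fact one", "fact two"])))
    assert "fact one" in knowledge.retriever.index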
/* knowledge cache does not cache the rules or facts
* multiple knowledge bases, one for internal facts and one for each indexed path
* perhaps a way to structure the prompt using <> tags. The memory items need to be distinct.
* use poetry
/* why is the cache not working? The system re-loads the knowledge every time
/* dependabot!!!
/* update readme with index.
/* interruptible speech
* upload to hetzner and make it work for some retrieval tasks
* develop more rules + use-cases for voice and other
/* add control over which llm to use from the frontend
/ - add list of models in the backend
/* add quantization of llm to wafl_llm config
/* write docs about it on wafl
/* add option so use llama.cpp from wafl_llm
/* add option to have None as a model setting in wafl_llm
/* add pdf to indexing
* add json to indexing
/* add metadata to indexing items
/make the backend run with ollama as well (no: too slow)
/None of the knowledge is loaded from the web interface. Why?
/- you have just changed the load_knowledge function to make it async.
wafl:
/- create indices
- allow files/folders to be indexed (modify rules.yaml and then re-index)
- add keywords in retrieval from tfidf (see the sketch after this list)
- silence output when someone speaks
- multiple models in wafl-llm, with selection from frontend
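  (a minimal sketch of the tf-idf keyword idea above, assuming scikit-learn is available; the helper name and parameters are illustrative:)

    from sklearn.feature_extraction.text import TfidfVectorizer

    def top_keywords(documents, document_index, num_keywords=5):
        # rank the terms of one document by their tf-idf weight in the corpus
        vectorizer = TfidfVectorizer(stop_words="english")
        weights = vectorizer.fit_transform(documents)
        row = weights[document_index].toarray().ravel()
        terms = vectorizer.get_feature_names_out()
        return [terms[i] for i in row.argsort()[::-1][:num_keywords]]

    print(top_keywords(["the cat sat on the mat", "dogs chase cats in the park"], 0))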
training:
- retrain phi3
- add tokens <execute> and <remember> to the training data
- add some prior conversation to the training data, taken from other examples
- add more unused rules in the prompt
* after a rule is deleted, you should also prune the conversation above.
The system can get confused if the conversation becomes too long;
re-train the system with prior conversations before calling the rule
* substitute utterances in base_interface with the conversation class
* add config file for model names
- llm model name
- whisper model name
/* add version name
/* let user decide port for frontend
/* update docs about port
/* push new version
/* update pypi with wafl and wafl-llm
/* clean code for llm eval and make it public
/* update huggingface readme
* read overleaf paper
* on wafl_llm make it so only some LLMs are supported
* change speaker model with newer one
1) train on more steps
a) try 3 epochs, save each
b) use lr=1e-6
c) use batch_size=4
d) do not use 4 bit original model, use 16 bit (on the GPU)
2) evaluate result
3) Upload to hf
4) create a test set of 50 elements for paper. Find a way to test it. repeat from 1)
5) refactor code
6) maybe change voice model
7) write paper
### TODO
* script to add wrong <tags> when none are needed
On the to_modify set:
* sometimes the user answers yes (after "do you confirm?") and the dialogue does not have "user: yes"
On the accepted set:
* CHANGE <|USER|>\n into user: (some of the elements are in the wrong format)
* Perhaps change <memory>function()</memory> into <memory|retrieve><execute>function()</execute></memory> (the memory should store the results of the function)
* Create a first paragraph with the summary of the conversation: the conversation must always be grounded in the summary (USE LLM TO CREATE THE SUMMARY)
* The LLM wrote text after </execute|run|running> hallucinating the result of the execution. Think about how to deal with that.
* all the rules that say "two level of retrieval" should have the trigger rewritten to something more specific
* change "bot" into "assistant" some of the time
* some sentences are between [] and should be removed
* put the items in <memory> so far in the conversation summary. If it is a function then you need to simulate the relevant output using the LLM
* sometimes at the end of the conversation the bot says "Process finished with exit code 0". Erase this
* add ability to index files and files in entire folders
* if the bot uses a function to retrieve information, you should add <memory>. This is symmetrical to <memory> with a function call when necessary.
* some tags like <output> should end the training item text
* todo User -> user, or at least be internally consistent
* find a way to use HuggingFaceH4/ultrachat_200k as a starting point for each item
- each item should be easy to copy into a csv.
- Separate the items with special tokens/lines
* Create a dataset with about 500 elements
- use the huggingface chat dataset as a starting point for
- themes
- conversation guide in prompt
- use LLM to create corresponding python code
* retriever in create_prompt
* change num_replicas back to 10 in remote_llm_connector
/* create actions from command line
/* add condition of when to stop to the actions
Actions:
#### Find way to delete cache in remote llm connector
#### Put colors in action output (and dummy interface)
#### Add green for when an expectation is matched
#### write docs about actions
#### push new version to main
* Perhaps the expectation pattern could be built into the rules themselves
/* BUG: the prior memory leaks even when re-loading the interface!!!
* clean single_file_knowledge: it still divides facts, questions, and incomplete items for rule retrieval.
Use just one retriever and threshold for all
/* push docker image to docker hub
/* update all to the vast.ai
/* write new docs
/* new version on github!
/* make it easy to run the llm on the server (something more than docker perhaps)?
/* re-train the whisper model using the distilled version
/* make rules reloadable
/* nicer UI?
/ * New icons
/ * darker left bar
/* update tests
/* lots of duplicates in facts! Avoid that
/ * use timestamp for facts (or an index in terms of conversation item)
/ * select only most n recent timestamps
/ * do not add facts that are already in the list (before cluster_facts)
/* redeploy locally and on the server
/* new version on github
/* add rules for
/ shopping lists
trains and music
* add yaml like in the github issue
* test testcases work (only local entailer)
/* update wafl init: It should create the project the modern way.
* use deci-lm
* make sure system works with audio too
* aggregate rules into a tree using a rule builder (like in the old system)
* perhaps one use-case is for diary entries: what is my diary for next week requires today's date first
/* I am not sure the system is cancelling code that has been executed. Check the whole pipeline of prior_functions
/* when an import throws an exception, add import <module> to the code and try again
/ * if the import does not exist, return the code as is without <execute> substitution
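  (a minimal sketch of the retry-on-missing-import idea; the helper name and retry limit are illustrative:)

    import importlib
    import re

    def execute_with_auto_import(code, max_retries=5):
        # retry exec(), importing each module the NameError complains about;
        # if the module cannot be imported, give up and re-raise
        namespace = {}
        for _ in range(max_retries):
            try:
                exec(code, namespace)
                return namespace
            except NameError as exception:
                match = re.search(r"name '(\w+)' is not defined", str(exception))
                if not match:
                    raise
                name = match.group(1)
                try:
                    namespace[name] = importlib.import_module(name)
                except ModuleNotFoundError:
                    raise exception
        raise RuntimeError("too many missing imports")

    print(execute_with_auto_import("value = math.sqrt(2)")["value"])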
/* don't use replicas, use a beam decoder where <execute> and <remember> are pushed upwards. (this means no sampling - perhaps there is a better way)
/ * do it in the local llm connector first
/ * use sequence_bias in generate() together with epsilon_cutoff (see the sketch after this block)
/ (for example if the <execute> token is not likely its prob should not be increased)
DOESN'T WORK: It needs to use beam search, but I want to keep sampling with temperature
/ * ALTERNATIVELY increase the number of replicas to 6?
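  (a minimal sketch of the sequence_bias idea in huggingface transformers; the model, bias value, and cutoff are illustrative, and the exact sequence_bias format varies across transformers versions:)

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # push a marker string upwards: map its token-id tuple to a positive logit bias
    marker_ids = tuple(tokenizer("<execute>", add_special_tokens=False).input_ids)
    inputs = tokenizer("The bot replies:", return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=True,
        temperature=0.7,
        epsilon_cutoff=3e-4,              # drop tokens below this probability
        sequence_bias={marker_ids: 5.0},  # make the marker more likely
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))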
/* quantize the model to 4 bits
TOO SLOW on 3090
/* merge remote and local llm connector. Both should derive from the same class with common functions
/**** make it so the computer does not repeat! reset conversation when the bot repeats itself
/* only one rule at the time!!
/ * if a rule is executed, it is then consumed
* bug: the system kept executing "The bot predicts:"
**** what to do with conversational collapse?
- the system just repeats the last utterance
- how much is 2+2, what is the real name of bon jovi, how tall is mt everest
- the collapse is due to <execute>NUMBER</execute> returning unknown (execute becomes more likely after one prior <execute>)
- the system is also more likely to return unknown after one unknown. Select the answer that has no unknowns?
* solve math expression execute (imports do not work in eval, and exec needs a print on stdout)
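  (a minimal sketch of one way around it; no sandboxing, purely illustrative:)

    import contextlib
    import io
    import math

    def run_math(code):
        # eval() handles bare expressions; statements need exec() plus a print
        try:
            return str(eval(code, {"math": math, "sqrt": math.sqrt}))
        except SyntaxError:
            buffer = io.StringIO()
            with contextlib.redirect_stdout(buffer):
                exec(code, {"math": math})
            return buffer.getvalue().strip()

    print(run_math("sqrt(2) + 2"))          # expression: eval path
    print(run_math("x = 2 + 2\nprint(x)"))  # statement: exec path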
* add errors when loading config file (add log to stderr)
* add a memory that the execute command was called/not called.
* no more than one rule (two rules it already gets confused)
* better ui
* better documentation
* better tests
# the dimension of the sentence embedding model (384) should be in config
* multi-step interactions for function execution are still hard.
- perhaps the rules need to stay in the prior discourse for longer
- the same rule appears after the first execution, therefore the bot thinks it has already executed it
- user: follow rules...
user: compute stuff
bot: answer
user: compute stuff
bot: answer
user: follow rules... (this was at the beginning in the prior discourse)
user: do it again
maybe you need a tag for user: follow rule. Maybe a superuser tag that is removed from the output (but stays in the interface)?
#### keep prior rules for a couple of turns
#### log execute in green and memory in blue
#### keep the temp low, 0.2 (otherwise it doesn't follow multi-point rules well)
/* Put the answer of <execute> in the facts and re-compute the answer
/* fine-tune the llm to follow the rules
/ - create dataset of about 50 examples
/ - fine tune only last layer
This does not work
------------------------------------------------
* use perplexity in arbiter to determine what to do
- use a perplexity budget?
* the answer filter is too fickle
- wrong transcriptions
- code is transcribed as prior text
- It needs examples!! => CREATE A LIST OF EXAMPLES TO BE RETRIEVED FOR THE FILTER
- examples: code -> code
- "this is what i came up with:" -> same
* add commands to navigate the conversational tree
- go back (?)
- "I am asking you that question"
/* why is <|EOS|> added in the code generated by asking: "write a function in python"
- the error was in wafl_llm, every # was replaced with <|EOS|> (legacy from MPT)
/* write a function in C++ does not trigger the rule about writing in a language different from python
- this is because the rules cannot specify when not to be activated: the entailer blocks the retrieval
write a function in c++ does not entail write a function that is not in python
* write a function in python does not work
- the system stops at what is the name of the function, there is no reply after that
- this is because the system fails at "what is the goal of the function"
* IMPLEMENT AN ASK USER, ASK BOT, GENERATE, VERIFY, otherwise it's just a string assignment
- can you code does not retrieve anything
- the system creates a new rule, even if a good rule already exists
/* what is the weather like/what about tomorrow -> every new query gets a reply about the weather
/* items are not updated in the web interface:
/ - new utterances by the bot are not added to the list
/ - they only appear after the user has typed something
/ - should you wait to update list?
/ - should you yield all conversations in the conversations events and then say them all at once?
/* add a filter dialogue answerer on top of everything (top-answerer) !!!!
/ - add filter ability to all interfaces
/ - add filter to web interface when it runs from command line
------------------------------------------------
* remember: entailment is related to mutual information.
* If the system generates a rule that has a question implying the trigger -> the answer to that question is unknown
* make it answer: who is the mayor of london, who was the first james bond
* find a way to re-load the knowledge bases from the answer bridges while the system is running
* make it so the system does not repeat questions that are in the trigger
* slim down the corpus task_creator.csv:
- do not repeat instructions for every item, just use instructions at the beginning.
* solve issue about intermediate item taking the value of the prior textarea instead of "typing..."
* implement notebook style web interface
/ * will you need to remove user: bot: from utterances?
* change the interface so you can navigate it
* allow for web components to write output
* test with output from matplotlib
/* use temperature in generate: the system tends to repeat itself and is terrible.
/* avoid newline in textarea after pressing enter
/* sometimes (when I say "nothing/no/...") the conversation stops.
there is no answer and all the next replies are the queries themselves
/* why does the weather not work?
/* make system faster to load locally. Why does it load functions.py 4 times? why the long wait?
/* add a small wafl_home to the init, with folders and everything
/* make stand-alone connectors (no need for server-side)
/* modify config for all connectors
/ * check remote connectors work!
/ * create note about local speaker for python 3.11
* rewrite documentation !!!!
* talk about how the system improvises rules
* explain arbiter pipeline
* update config description (possibly with stand-alone connection) !!!!
* add init-default (with modified wafl_home) and init (as empty)
* push to main!
* push wafl-llm to main
* and docker
* add import from files in rules space
/* split bullet lists into lists for mapping onto a query
/ * add something for merging as well
/* If the inference fail, should you try to improvise?
/* make it so it is possible to navigate among levels of tasks
/ * e.g.: what is the name of the list? that's not what i meant
/ * INTERRUPTIONS: nevermind, nothing, no one, not what i meant
/ * questions: answering a question with a question should trigger a new resolution tree
-> resolution: before answer_is_informative() I added an entailment check on whether the answer makes sense or the user doesn't want to answer
/* add #using ../
/* why does sqrt throw exception when eval(code)?
/* make all tests pass
/* REMOVE computer FROM UTTERANCES!! (after activation)
/* modify wafl_llm to have a config file for the models to use/download
/* add a standard variable for the dialogue history to be used in rules space
/* only files declared with #using can be used in rules
* retrievers for in-context learning
/* all answerers and extractors
/* create adversarial examples for task extractor
* make this work:
- Do I need an umbrella?
- Tell me if I need an umbrella
The issue is that the relevant rule is not retrieved. The entailer gives ~0 between the user says "do I need an umbrella" :- the user wants to know the weather
Find a good way to entail what you need.
- Issues were:
- Retriever is not very good. There are a lot of other rules scooped up when asking for a task
- Interruption rule must not be used for tasks
- Text-generation tasks are understood as questions, therefore no text generation
* Maybe you need to add text_generation task right before searching the answer in rules
/* add feedback from frontend
* Feedback texts should be used in-context for the dialogue and task extractor
* Use retriever to create dynamic prompts
* make the text disappear in the input after you write
* input should be frozen when the bot is thinking/speaking
* add a check for rules that are too generic.
- For example, "the user wants to add something"
- The rule priority should be weighed down according to
how easy it is to trigger it from different utterances.
* better web UI
* add a way to change the rules on the fly
* create a battery of tests for questions that interrupt a question
from the bot
/* Make tests pass
/* RETRIEVAL for prompts
/* Add RETRIEVE as a command
* add all the rules you have to the retrievable examples
* only first level for #using
* flexible config for faster answers:
* make it so the task extractor can be skipped
* same for task creator
* allow for facts to be checked by the llm directly
* create command line instructions "wafl list all the files"
* why does "[computer] computer what should I buy" not trigger a rule in wafl_home?
* if utterance would trigger rule but task is unknown, then trigger the rule
* interruptions should always be called. Make it so it is impossible to forget to call them
* give the system the ability to create rules to solve tasks
* use <internal_thoughts></internal_thoughts> in prompt for dialogue
/* add interaction on lists (actions for each item in the list)
/* allow code creation from task creator
/* Does it make sense to have the task_extractor work only when the user issues a command?
/ - Use LLM only after entailment with "The user asks to do something"
/* prompt generation only if it is task
/* otherwise = should return the result of a rule that is being executed
/* implement rules in case the task has not immediate trigger
/* is the sound wave sent to wafl-llm whisper_handler correct? is it corrupted
/ - remember that the sound wave was corrupted the other way round
/ - you might want to use base64 encoding for the sending part as well
/ - save sound file from whisper_handler.py and listen to it
* why is it so slow??
/- would .generate(do_sample=False) accelerate? in llm_handler.py
/ - play around with early stopping and max_length. Use eos_token_id properly
- would shorter prompt accelerate? maybe you can retrieve the examples that are most relevant
/ - USE past_key_values as argument in generate (see the sketch after this list).
/ This argument is returned when use_cache=True.
/ Save the first past_key_values and then pass it as argument in generate.
-> It does not do much for a single query
/ - try to go line-by-line in the code to see where the problem is
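  (a minimal sketch of the past_key_values idea; the exact generate() plumbing varies across transformers versions, so treat it as illustrative:)

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # encode the fixed prompt prefix once and keep its key/value cache
    prefix = tokenizer("Instructions: answer briefly.\n", return_tensors="pt")
    with torch.no_grad():
        cache = model(**prefix, use_cache=True).past_key_values

    # reuse the cache so the prefix is not re-encoded for every query
    query = tokenizer("user: what time is it?\nbot:", return_tensors="pt")
    full_ids = torch.cat([prefix.input_ids, query.input_ids], dim=-1)
    output = model.generate(full_ids, past_key_values=cache, max_new_tokens=10)
    print(tokenizer.decode(output[0][full_ids.shape[-1]:]))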
/* thank you does not close the conversation because of the entailer in inference_answerer
/* refactor sentence-transformer to backend
* policy should only be about finding the correct rule to apply.
* no y/n stuff, only choose btw candidate rules and none of the above.
* erase y/n policy, only rule policy remains
/* refactor whisper to backend
/* refactor speaker
/* better frontend (smaller facts and choices)
/* Computer what is your name
/ -> Correct remember
/ -> answer is "I don't know". why?!
/ explanation: question -> task from question -> answer from task
/ solution: there should be a way to use the chitchat answerer to answer questions
* make it so one can change rules on the fly (reload when changed)
* add definitions to arbiter (what is tea...?)
* add the ability to ask questions to activate a trigger rule.
* Make chit chat work!
* should you take it out of arbiter? Maybe it should be the last item in ListAnswerer
in conversational_events
* if answer is unknown in item = function() then that line should be considered False
* add main.wafl
* it should divide into:
* the user wants to do something
* the user asks for information
* anything else
* add a way for the answerer to access gpt information
* it should say "I believe ..."
* the rules should be chosen by gptj according to the prior conversation
(possibly from bw_inference _look_for_answer_in_rules)
* add docker compose for running server + interface
* add a flask interface for api
* create errors within parser
* dependency does not exist
* . is allowed in dependency, / is not
* Clean conversation summary
* [computer] is in the summary
* Sorry? is in the summary
* repetitions are in the summary
* sentences said by the bot are treated the same as if they were said by the user
* maybe questions can be asked at a summary level? Not just sentence by sentence.
* rules should be added and REMOVED!
/* all sentences that are said when the interface is deactivated should not be appended to the conversation
/* Write choices, tasks and remember in web interface
/* erase conversation after deactivation in voice interface
/* refactor everything to have a chat log with all the choices, retrieved facts
in the same prompt for chitchat
/* All answerers should have access to the same list of utterances, choices, and facts.
The only difference should be the final line of prompt.
/* MAKE TEST PASS
/* ADD CACHE TO ALL CONNECTORS (it cannot be done easily for async functions)
/* 1024 is limit in wafl_llm (deepspeed) change this to 2048
/* make sure 2048 is the limit when using gptj
/* only last three conversation items in task extractor
/* Complete the task extractor + policy guidance on the whole of the answerer/inference tree
/ * Create test for different policies
/ * create test for "I didn't mean that"
/ * You need to log the rules that have been chosen.
This needs to be part of the policy decision.
Dialogue alone is not enough.
/ * create test for "do it again"
* FIND WAY TO EXTRACT TASK FROM DISCOURSE in arbiter_answerer
/* Selected_answer() returns unknown if all answers are None
/* add task recognition to choices and make tests pass
/* make alarm work on wafl_home.
/ * Make async work on wafl
/ * Do not forget to reinstate "what time is it" rule
/* Try to use only GPTJ. Answer questions with dialogue and story.
Give few examples where the answer is unknown.
/* volume threshold should be sampled continuously
/* change {"%%"} into something more typeable (possibly "% %")
/* add torchserve handler as a wafl init
/* Add GENERATE
/ * Use it to get "1" from "one minute"
/* time needs to be pre-processed to trigger rules (5 past seven -> 7,05)
/* add event loop here to test_scheduler
/* create way to add rules
This and next week:
/* rules in folder
/* distill gpt2 from gpt-JT onto the CoQA dataset
-> not done, using gpt-jt directly
First week of the year:
/* Make test conversations work
* Debugging with picture as output
-> not doing it now
Second week of the year:
* User-defined events
/* Scheduler
Third week of the year:
* Refactoring
* Write up docs
Fourth week:
* Write demo paper
* Demo paper should include website with code editor (and connection to GPU)
/* rules and functions should be in a folder.
/ * Think about ability to install
/* add ability to create rules through text "->"
* main conversation loop should be scheduler
* Add functions that can trigger rules
* InferenceAnswerer can be broken down into simpler answerer
* within backward inference there is the need for an answerer (when tasks are interrupted)
* Do lists as hard-coded
* y/n questions are never searched in working memory. Is this the right behavior?
/* test_testcases blocks the tests. RESOLVE THIS!!
* fine tune entailer to the tasks in this systems (the bot says: "", the user asks ""...)
* functions use inference from depth_level = 1. This can induce infinite recursion
* if query is not question and answer is False then the system should say "I cannot do it because"
* After because there should be the answer to the bot asking itself why
* The way to do it is to have a narrator connected to the logs
* you need better readable logs
* you need a way to translate the logs into a coherent text
* THIS WILL ADD INTROSPECTION!!
* move all thresholds to variables.py
* rules are sorted by retriever but not by entailer!! Do that
* add error detection in parser.
* for example ( without a closing ). same for {
/* find way to connect to local ngrok from github
/* entailment should be :- instead of <-
/* take entailer and qa out of the __init__ in entailer.py and qa.py
/* implement functions within each dependency
/* remove Batches output in prediction
/* fact retrieval should work in python space as well
/* y/n questions should only accept yes or no (and loop if there is something else)
/* Why does remember not work??!!!
/* separate items added to list with "and"
/* The same answer cannot belong to more than one question in the same task! (it's an approx but needed)
/* Create answering class on top of the conversation
/ * create arbiter class
/ * This should solve the test_executables failing tests
/ * is the dialogue part really needed? if not, you can use a simple qa system
/* Use entailer for common sense? Creak sense does not work very well
/* Make infinite recursion impossible (set max limit or check for repetitions)
/* Do the conversational memory (start with test_working_memory.py)
/ * done but you need to refine the interaction with the narrator class (events are split manually in qa.py)
/ * RUN TESTS!!!
/ * START WITH test_conversation (many tests are failing)
/ * The issue is in "the user says" (line 85 in qa.py)
/ - The system should be able to understand whether the user or the bot is speaking
/ - possibly if the question is from the user (like in working memory) add "user says"
/ - what would you add when the fact comes from the knowledge base?
/ - should you change the hypothesis "when -> says?" (lines 91 and 101)
/
/ * USE LOGGER IN CONVERSATION() TO SPEED UP DEBUGGING!
/* numbers in speaker should be translated to English
/* remove computer as first word
/* Confidence in listener results
/* lists should filter the items
/* a list cannot contain itself, you should check the name
/* Yes/No questions should be more flexible
/* voice thresholds should be in config (write test about them)
/* ADD VERSION NUMBER WHEN STARTING UP
/* check tests.test_working_memory.TestWorkingMemory.test_working_memory_works_for_yes_questions
/* computer name triggers a "faulty" sound
/* functions.py needs a more clever way of handling hidden arguments
- should all functions have a hidden arg?
- make it possible to have more files
* use different voice model (this one from hf? )
* speech to text
- use your own beam decoder + n-gram to improve quality
- use newer model?
- filter out filler sounds
* Unify conversation/utils.py "the user says/asks" and the presupposition replacement "to the bot" in entailer.py
* Detect who is talking to whom. Some rules can only be activated by the user speaking to the bot
- Use get_sequence_probability_given_prompt()
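  (a sketch of what get_sequence_probability_given_prompt() might look like; this is a guess at its shape, scoring a continuation under the model in log space:)

    import torch
    import torch.nn.functional as F
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def sequence_log_probability_given_prompt(prompt, continuation):
        # sum log P(token | everything before it) over the continuation only;
        # note: prompt + continuation may tokenize differently at the boundary
        prompt_length = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
        with torch.no_grad():
            log_probs = F.log_softmax(model(full_ids).logits, dim=-1)
        total = 0.0
        for position in range(prompt_length, full_ids.shape[1]):
            token_id = full_ids[0, position]
            total += log_probs[0, position - 1, token_id].item()
        return total

    # a higher score suggests the utterance is addressed to the bot
    print(sequence_log_probability_given_prompt("user: computer, ", "what time is it?"))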
* add in knowledge facts about "the user". You don't need to remember everything that is said
* Is there a need for a "Main Task" ? One that oversees everything?
* Python hooks need to be in a class, like with Tests
* rms threshold should be average of background noise
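  (a minimal sketch; the chunk size and safety margin are illustrative:)

    import numpy as np

    def rms(samples):
        # root-mean-square energy of one audio chunk
        return float(np.sqrt(np.mean(np.square(samples.astype(np.float64)))))

    def threshold_from_background(background_chunks, margin=1.5):
        # average RMS over chunks assumed to be silence, scaled by a margin
        return margin * float(np.mean([rms(chunk) for chunk in background_chunks]))

    background = [np.random.randn(1600) * 0.01 for _ in range(10)]  # fake silence
    print(threshold_from_background(background))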
* add delete last item
* Change the speaker voice
* Create unit tests for conversation activation/deactivation
* The answer to the question can be found in the conversation from the bot.
The bot can speak to itself and then answer the user
* train your own retriever using the conversations in dailydialogue? MAYBE NOT
* If the query is a question YOU NEED A QA RETRIEVER. take MULTI_QA instead of MSMARCO
- also if the rule.effect is a question
* If a function calls another one within functions.py then there is an argument missing! inference is not there in the code
* yes/no questions should *never* trigger an interruption.
It's not just yes/no answers, the deal is with the question!
* Select by Levenshtein distance before text_retriever (lev_retriever?)
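  (a minimal sketch of the edit-distance pre-filter; the names are illustrative:)

    def levenshtein(first, second):
        # classic dynamic-programming edit distance
        previous_row = list(range(len(second) + 1))
        for i, char_a in enumerate(first, start=1):
            current_row = [i]
            for j, char_b in enumerate(second, start=1):
                current_row.append(min(
                    previous_row[j] + 1,                       # deletion
                    current_row[j - 1] + 1,                    # insertion
                    previous_row[j - 1] + (char_a != char_b),  # substitution
                ))
            previous_row = current_row
        return previous_row[-1]

    print(levenshtein("add bananas", "add banana"))  # 1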
* train bart for qa/facts
* train your own retriever
* Why is everything interpreted as "hello" or "hi"? You need a better retriever
- Change retriever with the one you liked
* allow {variable} to be interpreted as code/call for another task (at least add tests)
* allow some type of introspection
* finish config
- add hotwords to config.json
* upload to github, with tests and logs
* validate the user code (is REMEMBR spelled correctly?)
* should yes/no filter be in retriever instead of knowledge?
* Create interfaces for google voice, other apis
* A goal oriented bot would scan all the rules to find how to obtain the goal.
The user can be simulated using a generative model for dialogue.
* for yes/no or limited choice questions there should not be ambiguity. The machine should match the closest item and
if there is no item close enough ask again.
* New voice! It's ridiculous to have to put up with a memory leak from picotts
- Use fairseq voices
* why is Alberto not recognized as a name by the retriever? need a better retriever than MSMARCO distill
* Working memory should really be working knowledge
* refactor BackwardInference
!* make it so if the user does not know the answer, one can continue inference?
- Or should you try to do the inference first??
!* Implement a standard sign for code. Should it be '''> ?
!* Implement FORGET (the whole Fact should disappear from Knowledge)
* Do not allow arbitrary input (at least for voice)
* Working memory is unnecessarily complicated. It can just contain the story and some method to automatically fill it.
!* Why did it say "no" on "can you please add bananas to the shopping list?"
* Better QA for yes/no question (maybe add SNLI to qa?)
* Add math expressions to fact-checker (some, any, every)
* USER REQUEST: Multiple items to the shopping list in one go: apples AND bananas AND vegetables
* create tests for voice
* Parser should allow for empty lines within rules
* Implement Server with HTML page (docker-compose up)
* Refactor code and clean up
* Investigate interplay btw substitutions and already_matched
- Maybe one can avoid having the same answer for the same question at the same depth (and same rule)
* Say "This is true" or "this is false" if a statement matches (My name is Alberto -> True)
* working memory only within the same level of rules with same activation
GET WORKING MEMORY FOR FACTS WORKING WITH SHOPPING LIST!!
/* Add working memory for python-space
- Maybe exclude WM answer if it is the same as the prior answer
/* Use entailment to make generated qa more accurate
/- Upload qa system to huggingface and pip
/-**** Use conversation_qa -> refactor qa.py