Attributes/tags search #36

Open
baggiponte opened this issue Oct 26, 2021 · 13 comments
Labels
enhancement New feature or request

Comments

@baggiponte

baggiponte commented Oct 26, 2021

The extension only allows (fuzzy) searching by title; it would be interesting to see tag/field/attribute search, like in Gmail or GitHub issues:

  • date:YYYY-MM-DD to filter papers of a certain period
  • author:Surname,Name

And so on. Also, it could be interesting if there could be a way to disable fuzzy searching (say, by typing \ at the start of a line).
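A minimal sketch of how such a query could be split into field filters and free text. Everything here is hypothetical: `IParsedQuery`, `parseQuery` and the set of recognised fields are illustrative names, not part of the extension.

```typescript
// Hypothetical types and names; none of them exist in the extension.
interface IParsedQuery {
  /** Field filters such as author:Surname,Name or date:YYYY-MM-DD */
  fields: Map<string, string[]>;
  /** Remaining free text, to be matched as before */
  text: string;
  /** True when the query starts with a backslash (fuzzy matching disabled) */
  exact: boolean;
}

// Assumed list of supported fields.
const KNOWN_FIELDS = ['author', 'date', 'tag'];

function parseQuery(raw: string): IParsedQuery {
  let query = raw.trim();
  const exact = query.charAt(0) === '\\';
  if (exact) {
    query = query.slice(1).trim();
  }
  const fields = new Map<string, string[]>();
  const textParts: string[] = [];
  for (const token of query.split(/\s+/).filter(t => t.length > 0)) {
    const separator = token.indexOf(':');
    const name = separator > 0 ? token.slice(0, separator).toLowerCase() : '';
    const value = separator > 0 ? token.slice(separator + 1) : '';
    if (KNOWN_FIELDS.indexOf(name) !== -1 && value.length > 0) {
      const values = fields.get(name) ?? [];
      values.push(value);
      fields.set(name, values);
    } else {
      // unknown prefixes (for example a URL) stay part of the free text
      textParts.push(token);
    }
  }
  return { fields, text: textParts.join(' '), exact };
}
```

With this sketch, `author:Surname,Name some title` yields one `author` filter and the free text `some title`; values containing spaces would need quoting rules that are not covered here.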

I don't understand how difficult this would be to implement, as I have not properly understood how the extension works, if it's similar to BetterBibTex (which I don't really know a lot about) and/or if it dumps the citations to a json/sqlite which then is queried. I guess this just sends a request to Zotero via the API?

@baggiponte baggiponte added the enhancement New feature or request label Oct 26, 2021
@krassowski
Owner

Good ideas, thank you! Could you contrast that with how other citation tools implement the advanced search for citation insertion/paper exploration?

I have not properly understood how the extension works, if it's similar to BetterBibTex (which I don't really know a lot about) and/or if it dumps the citations to a json/sqlite which then is queried. I guess this just sends a request to Zotero via the API?

The entire collection gets downloaded locally, stored in IStateDB, and synced when needed (or requested). This is handled by ZoteroClient, which implements the IReferenceProvider interface (so we can have other providers in the future too); roughly these lines are relevant:

export class ZoteroClient implements IReferenceProvider {
  id = 'zotero';
  name = 'Zotero';
  icon = zoteroIcon;
  private _serverURL: string | null = null;
  private _key: string | null = null;
  private _user: IUser | null = null;
  /**
   * Version number from API representing the library version,
   * as returned in `Last-Modified-Version` of response header
   * for multi-item requests.
   *
   * Note: responses for single-item requests will have item versions rather
   * than global library versions, please do not write those onto this variable.
   *
   * https://www.zotero.org/support/dev/web_api/v3/syncing#version_numbers
   */
  lastModifiedLibraryVersion: string | null = null;
  citableItems: Map<string, ICitableData>;
  isReady: Promise<any>;
  /**
   * If the API requests us to backoff we should wait given number of seconds before making a subsequent request.
   *
   * This promise will resolve once the backoff time passed.
   *
   * https://www.zotero.org/support/dev/web_api/v3/basics#rate_limiting
   */
  protected backoffPassed: Promise<void>;
  /**
   * The Zotero Web API version that we support
   *
   * https://www.zotero.org/support/dev/web_api/v3/basics#api_versioning
   */
  protected apiVersion = '3';
  /**
   * Bump this version if changing the structure/type of data stored
   * in the StateDB if the change would invalidate the existing data
   * (e.g. CSL version updates); this should make updates safe.
   *
   * Do not bump the version if extra information is stored; instead
   * prefer checking if it is present (conditional action).
   */
  private persistentCacheVersion = '0..';
  progress: Signal<ZoteroClient, IProgress>;

  constructor(
    protected settings: ISettings,
    protected trans: TranslationBundle,
    protected state: IStateDB | null
  ) {
    this.progress = new Signal(this);
    this.citableItems = new Map();
    settings.changed.connect(this.updateSettings, this);
    const initialPreparations: Promise<any>[] = [this.updateSettings(settings)];
    if (state) {
      initialPreparations.push(this.restoreStateFromCache(state));
    }
    this.isReady = Promise.all(initialPreparations);
    // no backoff to start with
    this.backoffPassed = new Promise(accept => {
      accept();
    });
  }

  private async restoreStateFromCache(state: IStateDB) {
    return new Promise<void>(accept => {
      state
        .fetch(PLUGIN_ID)
        .then(JSONResult => {
          if (!JSONResult) {
            console.log(
              'No previous state found for Zotero in the StateDB (it is normal on first run)'
            );
          } else {
            const result = JSONResult as IZoteroPersistentCacheState;
            if (result.apiVersion && result.apiVersion !== this.apiVersion) {
              // do not restore from cache if Zotero API version changed
              return;
            }
            if (
              result.persistentCacheVersion &&
              result.persistentCacheVersion !== this.persistentCacheVersion
            ) {
              // do not restore from cache if we changed the structure of cache
              return;
            }
            // restore from cache
            this.lastModifiedLibraryVersion = result.lastModifiedLibraryVersion;
            if (result.citableItems) {
              this.citableItems = new Map([
                ...Object.entries(result.citableItems)
              ]);
              console.log(
                `Restored ${this.citableItems.size} citable items from cache`
              );
            }
            this.updateCacheState();
          }
        })
        .catch(console.warn)
        // always resolve this one (if cache is not present or corrupted we can always fetch from the server)
        .finally(() => accept());
    });
  }

  private async fetch(
    endpoint: string,
    args: Record<string, string> = {},
    isMultiObjectRequest = false,
    forceUpdate = false
  ) {
    if (!this._key) {
      const userKey = await getAccessKeyDialog(this.trans);
      if (userKey.value) {
        this._key = userKey.value;
        this.settings.set('key', this._key).catch(console.warn);
      } else {
        return;
      }
    }
    const requestHeaders: IZoteroRequestHeaders = {
      'Zotero-API-Key': this._key,
      'Zotero-API-Version': this.apiVersion
    };
    if (
      !forceUpdate &&
      isMultiObjectRequest &&
      this.lastModifiedLibraryVersion
    ) {
      requestHeaders['If-Modified-Since-Version'] =
        this.lastModifiedLibraryVersion;
    }
    // wait until the backoff time passed;
    await this.backoffPassed;
    return fetch(
      this._serverURL + '/' + endpoint + '?' + new URLSearchParams(args),
      {
        method: 'GET',
        headers: requestHeaders as any
      }
    ).then(response => {
      this.processResponseHeaders(response.headers, isMultiObjectRequest);
      return response;
    });
  }
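
The conditional-request part of `fetch` can be isolated into pure functions, which may make it easier to reason about (and test). This is a sketch, not the extension's code: `buildRequestHeaders`, `backoffSeconds` and `HeaderMap` are hypothetical names, and `processResponseHeaders` is not shown in this thread, so the backoff handling below is only an assumption based on the `Backoff` and `Retry-After` headers described in the Zotero rate-limiting documentation linked above.

```typescript
// Hypothetical helper types and functions, for illustration only.
type HeaderMap = { [name: string]: string | undefined };

function buildRequestHeaders(
  key: string,
  apiVersion: string,
  lastModifiedLibraryVersion: string | null,
  isMultiObjectRequest: boolean,
  forceUpdate: boolean
): HeaderMap {
  const headers: HeaderMap = {
    'Zotero-API-Key': key,
    'Zotero-API-Version': apiVersion
  };
  // Only multi-object requests carry library versions, so only those
  // can be made conditional on the cached library version.
  if (!forceUpdate && isMultiObjectRequest && lastModifiedLibraryVersion) {
    headers['If-Modified-Since-Version'] = lastModifiedLibraryVersion;
  }
  return headers;
}

/**
 * Number of seconds to wait before the next request; 0 if the response
 * did not ask for any delay. Note: this plain object lookup is
 * case-sensitive, unlike a real `Headers` instance.
 */
function backoffSeconds(responseHeaders: HeaderMap): number {
  const raw = responseHeaders['Backoff'] ?? responseHeaders['Retry-After'];
  const seconds = raw === undefined ? NaN : parseInt(raw, 10);
  return seconds > 0 ? seconds : 0;
}
```

When the conditional header is sent and the library is unchanged, the Zotero Web API answers with `304 Not Modified`, so the locally cached items can be reused without downloading the collection again.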

It speaks CSL JSON as defined in https://raw.githubusercontent.com/citation-style-language/schema/master/schemas/input/csl-citation.json and https://raw.githubusercontent.com/citation-style-language/schema/master/schemas/input/csl-data which means that parsing dates is... challenging. I think there is some normalization to make it more palatable elsewhere in the codebase.

The JSON is then filtered and sorted in various selectors which implement Selector.IModel interface (the same approach is used for bibliography styles):

export namespace Selector {
  export interface IModel<O, M> {
    match(option: O, query: string): M;
    filter(option: IOption<O, M>): boolean;
    sort(a: IOption<O, M>, b: IOption<O, M>): number;
    initialOptions?(options: O[]): O[];
  }

O stands for option, M for Match.
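
To illustrate how the three methods cooperate, here is a small self-contained sketch. The `IOption` shape (`data` plus a nullable `match`) is inferred from the snippets in this thread and may differ from the real definition; `IStyle`, `IStyleMatch`, `substringModel` and `select` are hypothetical names made up for the example.

```typescript
// Shapes inferred from the snippets in this thread; the real ones may differ.
interface IOption<O, M> {
  data: O;
  match: M | null;
}

interface IModel<O, M> {
  match(option: O, query: string): M;
  filter(option: IOption<O, M>): boolean;
  sort(a: IOption<O, M>, b: IOption<O, M>): number;
  initialOptions?(options: O[]): O[];
}

// Hypothetical option and match types.
interface IStyle {
  id: string;
  title: string;
}

interface IStyleMatch {
  /** Index of the query in the title, or -1 if absent */
  position: number;
}

const substringModel: IModel<IStyle, IStyleMatch> = {
  match(option, query) {
    return {
      position: option.title.toLowerCase().indexOf(query.toLowerCase())
    };
  },
  filter(option) {
    return option.match !== null && option.match.position !== -1;
  },
  sort(a, b) {
    // earlier matches rank first
    return (a.match?.position ?? 0) - (b.match?.position ?? 0);
  }
};

// Assumed driver: match every option, drop non-matches, order the rest.
function select<O, M>(model: IModel<O, M>, options: O[], query: string): O[] {
  return options
    .map(data => ({ data, match: model.match(data, query) }))
    .filter(option => model.filter(option))
    .sort((a, b) => model.sort(a, b))
    .map(option => option.data);
}
```

A field-aware search could follow the same pattern: `match` would record per-field results, `filter` would reject options that fail any requested field, and `sort` would stay unchanged.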

The default model currently does simple filtering based on title, year, and authors, and sorts on those three, using the number of citations in the current document to break ties:

export const citationOptionModel = {
  filter(option: IOption<ICitationOption, ICitationOptionMatch>): boolean {
    return (
      option.match !== null &&
      [option.match.title, option.match.year, option.match.creators].filter(
        v => v !== null
      ).length !== 0
    );
  },
  sort(
    a: IOption<ICitationOption, ICitationOptionMatch>,
    b: IOption<ICitationOption, ICitationOptionMatch>
  ): number {
    if (a.match === null || b.match === null) {
      return 0;
    }
    const titleResult =
      InfinityIfMissing(a.match.title?.score) -
      InfinityIfMissing(b.match.title?.score);
    const creatorsResult =
      (a.match.creators
        ? Math.min(...a.match.creators.map(c => InfinityIfMissing(c?.score)))
        : Infinity) -
      (b.match.creators
        ? Math.min(...b.match.creators.map(c => InfinityIfMissing(c?.score)))
        : Infinity);
    const yearResult =
      InfinityIfMissing(a.match.year?.absoluteDifference) -
      InfinityIfMissing(b.match.year?.absoluteDifference);
    const citationsResult =
      InfinityIfMissing(b.data.citationsInDocument.length) -
      InfinityIfMissing(a.data.citationsInDocument.length);
    return creatorsResult || titleResult || yearResult || citationsResult;
  },
  match(option: ICitationOption, query: string): ICitationOptionMatch {
    query = query.toLowerCase();
    const publication = option.publication;
    const titleMatch = StringExt.matchSumOfSquares(
      (publication.title || '').toLowerCase(),
      query
    );
    const regex = /\b((?:19|20)\d{2})\b/g;
    const queryYear = query.match(regex);
    let yearMatch: IYearMatch | null = null;
    if (queryYear) {
      yearMatch = {
        absoluteDifference: Math.abs(
          (publication.date?.getFullYear
            ? publication.date?.getFullYear()
            : 0) - parseInt(queryYear[0], 10)
        )
      };
    }
    return {
      title: titleMatch,
      year: yearMatch,
      creators: publication.author
        ? publication.author.map(creator => {
            return StringExt.matchSumOfSquares(
              formatAuthor(creator).toLowerCase(),
              query
            );
          })
        : null
    };
  },
  initialOptions(options: ICitationOption[]): ICitationOption[] {
    const optionsCitedInDocument = options.filter(
      option => option.citationsInDocument.length > 0
    );
    if (!optionsCitedInDocument.length) {
      return options;
    }
    return optionsCitedInDocument.sort(
      (a, b) => b.citationsInDocument.length - a.citationsInDocument.length
    );
  }
};

This needs writing some unit tests.
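
As a possible starting point for such tests, the year-matching part of `match` can be copied into a standalone function and checked with plain assertions. This is only a sketch: `matchYear` and `check` are hypothetical helpers, and the publication year is passed in as a number (or `null`) instead of a `Date`, which is a simplification of the code above.

```typescript
interface IYearMatch {
  absoluteDifference: number;
}

// Standalone copy of the year logic from `match`; a missing publication
// year is treated as 0, mirroring the fallback in the code above.
function matchYear(
  publicationYear: number | null,
  query: string
): IYearMatch | null {
  const queryYear = query.toLowerCase().match(/\b((?:19|20)\d{2})\b/g);
  if (!queryYear) {
    return null;
  }
  return {
    absoluteDifference: Math.abs(
      (publicationYear ?? 0) - parseInt(queryYear[0], 10)
    )
  };
}

// Framework-agnostic assertion helper (hypothetical).
function check(condition: boolean, message: string): void {
  if (!condition) {
    throw new Error(message);
  }
}

check(matchYear(2021, 'attention 2017')?.absoluteDifference === 4, 'close year');
check(matchYear(2021, 'no year here') === null, 'query without a year');
check(matchYear(null, 'survey 2020')?.absoluteDifference === 2020, 'missing date');
```

The last check documents a consequence of the `0` fallback: a publication without a date ends up very far from any queried year, so it sorts last on the year criterion.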

@baggiponte
Author

Hi Mike, sorry for the late reply - I will investigate the other reference managers after the 15th (cob). Thank you for the explanation - this is really fascinating! Is the IStateDB a database format for Jupyter Notebooks? Perhaps @retorquere can tell us something more about betterbibtex?

Also, could the references inserted into a notebook be dumped to a .bib file? This would make the citation manager perfect to use in combination with jupyter book!

@retorquere

I'm pretty sure I can, what do you want to know?

@baggiponte
Author

baggiponte commented Nov 7, 2021

Hi @retorquere, thank you for the prompt answer! Full disclosure: I am a newbie in reference managers - I use this jupyterlab-citation-manager, which currently supports Zotero, and some time ago I played a bit with {rbbt}.

Here's the thing: jupyterlab-citation-manager is sick, but as of now we can only look up references by their title and with fuzzy search. I do not recall if with bbt you can also lookup by other tags, say author, and I was wondering how/if you implemented that. Also, @krassowski has underlined how non-trivial it might be to parse dates with CSL JSON: did you have to deal with it?

As a side note, I was also curious to know how you store data: as @krassowski explained above, jupyterlab-citation-manager dumps the whole Zotero collection to a local IStateDB. I ask this because of another, unrelated thing - which deserves another issue/feature request on its own, but I wanted to wait before opening another one. I was wondering if bbt supported the option of dumping all citations in a local file, like references.bib. The jupyter-book projects supports building a bibliography from a local .bib file and jupyterlab-citation-manager could go hand in hand with it; however, as far as I understand, the bibliography can only be appended at the end of a Notebook.

Thank you!

EDIT: there's also this interesting thread on gitter and should/might be related with #15 and #8 I guess

@retorquere

Here's the thing: jupyterlab-citation-manager is sick, but as of now we can only look up references by their title and with fuzzy search. I do not recall if with bbt you can also lookup by other tags, say author, and I was wondering how/if you implemented that.

It's probably using BBT's JSON-RPC search endpoint, and that passes the work to Zotero quicksearch, which should search on all fields & tags. I'm not sure what differentiates search on "all fields and tags" and "everything" on Zotero, but I'd guess that "everything" includes attachment content.

Also, @krassowski has underlined how non-trivial it might be to parse dates with CSL JSON: did you have to deal with it?

I don't do CSL-JSON date parsing, but I do produce CSL-JSON dates, and they appear to me to be very well-defined structured objects - there really isn't anything to parse in CSL dates AFAICT. Do you have a sample of a hard-to-parse CSL date, @krassowski?

Parsing free-form dates into CSL is another matter. The BBT date parser is a few hundred lines of code on top of two pretty large EDTF-parsing libraries.

As a side note, I was also curious to know how you store data: as @krassowski explained above, jupyterlab-citation-manager dumps the whole Zotero collection to a local IStateDB.

I have an sqlite db in the Zotero data directory for most BBT data, and a bunch of JSON files for the caches. These can only be read when Zotero is not running: Zotero locks sqlite databases while it is running, and BBT reads-and-deletes the caches at startup so that an error that prevents saving the cache cannot lead to stale caches being read on the next startup; it's better to start with an empty cache (which is a self-repairing situation) than a stale cache. The caches are written back out when Zotero shuts down.
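
The read-and-delete idea can be sketched as follows. This is not BBT's code: `ICacheStorage`, `MemoryStorage` and `ReadAndDeleteCache` are hypothetical names, and an in-memory map stands in for the JSON files on disk so that the example is self-contained.

```typescript
// Hypothetical storage abstraction standing in for cache files on disk.
interface ICacheStorage {
  read(name: string): string | undefined;
  write(name: string, content: string): void;
  remove(name: string): void;
}

class MemoryStorage implements ICacheStorage {
  private files = new Map<string, string>();
  read(name: string): string | undefined {
    return this.files.get(name);
  }
  write(name: string, content: string): void {
    this.files.set(name, content);
  }
  remove(name: string): void {
    this.files.delete(name);
  }
}

class ReadAndDeleteCache<T> {
  private entries: { [key: string]: T } = {};

  constructor(private storage: ICacheStorage, private name: string) {}

  /** Load the cache at startup and delete the stored copy straight away. */
  load(): void {
    const content = this.storage.read(this.name);
    // After this point a crash leaves no cache behind, never a stale one.
    this.storage.remove(this.name);
    if (content === undefined) {
      return;
    }
    try {
      this.entries = JSON.parse(content);
    } catch (error) {
      // corrupted cache: start empty, it will be rebuilt on demand
      this.entries = {};
    }
  }

  get(key: string): T | undefined {
    return this.entries[key];
  }

  set(key: string, value: T): void {
    this.entries[key] = value;
  }

  /** Write the cache back; meant to run on a clean shutdown only. */
  save(): void {
    this.storage.write(this.name, JSON.stringify(this.entries));
  }
}
```

If `save()` never runs (for example after a crash), the next `load()` finds nothing and starts from an empty cache, which is the self-repairing situation described above.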

I ask this because of another, unrelated thing - which deserves another issue/feature request on its own, but I wanted to wait before opening another one. I was wondering if bbt supported the option of dumping all citations in a local file, like references.bib.

Several ways in fact:

  • You can of course manually export to bibtex
  • You can set up an auto-export which will keep the exported bib file in sync with the source library/collection you used to create it
  • You can download a library/collection from a web endpoint the BBT makes available ("pull export")
  • There is a JSON-RPC endpoint that a program can call to do one-off exports or set up an autoexport

EDIT: there's also this interesting thread on gitter and should/might be related with #15 and #8 I guess

I don't know what the topic under discussion is there.

@baggiponte
Author

Thank you for your prompt and exhaustive reply!

I have an sqlite db in the Zotero data directory for most BBT data, and a bunch of JSON files for the caches. These can only be read when Zotero is not running: Zotero locks sqlite databases while it is running, and BBT reads-and-deletes the caches at startup so that an error that prevents saving the cache cannot lead to stale caches being read on the next startup; it's better to start with an empty cache (which is a self-repairing situation) than a stale cache. The caches are written back out when Zotero shuts down.

That I remember, I had a look inside my ~/Zotero and found the sqlite files. I guess the choice to use IStateDB depends on Jupyter.

Several ways in fact:

  • You can of course manually export to bibtex

From Zotero, right?

  • You can set up an auto-export which will keep the exported bib file in sync with the source library/collection you used to create it
  • You can download a library/collection from a web endpoint the BBT makes available ("pull export")
  • There is a JSON-RPC endpoint that a program can call to do one-off exports or set up an autoexport

Do these pull-export the whole collection or just the files inside the article/publication?

I don't know what the topic under discussion is there.

Oops, sorry: this is more related to exporting to MyST markdown formats. I am drifting off-topic, so I guess I should move this discussion somewhere else. In the meantime, thank you for the answers; let's wait and see if Mike has something to comment on.

@krassowski
Owner

krassowski commented Nov 7, 2021

I don't do CSL-JSON date parsing, but I do produce CSL-JSON dates, and they appear to me to be very well-defined structured objects - there really isn't anything to parse in CSL dates AFAICT. Do you have a sample of a hard-to-parse CSL date, @krassowski?

Parsing free-form dates into CSL is another matter. The BBT date parser is a few hundred lines of code on top of two pretty large EDTF-parsing libraries.

EDTF strings are valid entries of date variables in the CSL-JSON schema, as is the "structured" form, which may take anything between one and three parts that may be strings or numbers and whose meaning is not very well documented. Then you have the extra fields like circa, season, etc., and multiple date fields (some records have a publication date, a creation date, etc.). One of those is mandated by CSL if I recall correctly, but it often only contains the year part while all the details are in the other fields, and how they are populated appears random to me after looking at a large sample of records from Zotero.
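
To make the variety concrete, here is a hedged sketch of extracting just the year from such a value. `ICslDate` is a simplified, hypothetical view of a CSL-JSON date variable (`date-parts`, `raw`, `literal`, `season`, `circa`), and `extractYear` and `yearFromText` are illustrative helpers; the sketch does not implement EDTF and ignores negative or non-four-digit years.

```typescript
// Simplified, hypothetical view of a CSL-JSON date variable.
interface ICslDate {
  'date-parts'?: (string | number)[][];
  raw?: string;
  literal?: string;
  season?: string | number;
  circa?: string | number | boolean;
}

// Fallback: take the first standalone four-digit number in the text.
function yearFromText(text: string): number | null {
  const found = text.match(/\b(\d{4})\b/);
  return found ? parseInt(found[1], 10) : null;
}

function extractYear(date: ICslDate | string | undefined): number | null {
  if (date === undefined) {
    return null;
  }
  if (typeof date === 'string') {
    // string form (for example an EDTF string): best-effort only
    return yearFromText(date);
  }
  // structured form: the first part of the first date is the year,
  // stored either as a number or as a string
  const first = date['date-parts']?.[0]?.[0];
  if (first !== undefined) {
    const year = typeof first === 'number' ? first : parseInt(first, 10);
    if (!isNaN(year)) {
      return year;
    }
  }
  return yearFromText(date.raw ?? date.literal ?? '');
}
```

A `date:` filter would need more than the year, but even this narrow case has to handle three different representations, which is the difficulty described above.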

@krassowski
Owner

The jupyter-book projects supports building a bibliography from a local .bib file and jupyterlab-citation-manager could go hand in hand with it; however, as far as I understand, the bibliography can only be appended at the end of a Notebook.

This is tracked in the other issues you mentioned - let's keep this issue focused on the search capabilities ;)

@krassowski
Owner

Thank you for the explanation - this is really fascinating! Is the IStateDB a database format for Jupyter Notebooks?

No, it is more of an in-browser cache for JupyterLab; it is not codified into Jupyter at all. And it might change at any point.

@baggiponte
Author

It's probably using BBT's JSON-RPC search endpoint, and that passes the work to Zotero quicksearch, which should search on all fields & tags. I'm not sure what differentiates search on "all fields and tags" and "everything" on Zotero, but I'd guess that "everything" includes attachment content.

I can't figure out the UI for this: do you just write something like author:'Author1 Author2' and then Zotero quick search searches inside of tags? What about BBT, does this translate in a SELECT * WHERE author IN ('author1', 'author2')?

No, it is more of an in-browser cache for JupyterLab; it is not codified into Jupyter at all. And it might change at any point.

But you basically query from it, right? So entries have fields and the matter here is just to find a UI query-style consistent with what other citation managers offer, am I correct?

@krassowski
Owner

krassowski commented Nov 7, 2021

Thank you @retorquere for your generous advice and for explaining how BBT works. From that, I gather it relies on a local Zotero installation and access to its local API to pass on the search tasks; @baggiponte, as also discussed in #37, this is really not a feasible path for this extension for several reasons:

  • JupyterHub and other setups will often have no access to the local Zotero installation as it is on a different computer
  • even on the same computer, Zotero does not allow easy access to its API from the browser because of the security implications; the block is implemented at the CORS level and can only be circumvented by:
    • developing a browser extension (as the Zotero Google Docs integration does), which is a tremendous amount of work and gives a subpar UX (now the user has to have the Jupyter server extension, Jupyter frontend extension AND browser extension installed AND give it access to the contents of the websites they open - and we don't know if they will use Jupyter on localhost or say hpc.myuni.ac.uk so we cannot even limit the access request to a single domain!)
    • developing a proxy in the Jupyter server extension, which I discussed in local Zotero access (no API) #37; this remains a possibility for an alternative implementation of our IReferenceProvider interface (but it will only work for a subset of users and likely confuse newcomers)
  • by design, this extension is intended to interface with multiple citation providers, subject to API availability, so it should not rely on any Zotero-specific features; we could create an ISearchProvider interface to allow using the Zotero Web API to search references but this should be optional and the core functionality has to operate on standard CSL-JSON records directly
  • touching the Zotero database on disk directly is not a maintainable implementation for me (it is not a public API in the first place AFAIK) and this extension should not do that so we can in the future install it as an isolated package (say flatpak) with restricted or no access to the disk (which should really be the norm now)
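
A rough sketch of what the optional search provider idea could look like. Only the name `ISearchProvider` comes from this thread; its shape, `ICslItem`, `LocalSearchProvider` and `searchWithFallback` are all hypothetical. The point is that a provider-specific search stays optional while a local search over CSL-JSON records is always available.

```typescript
// Hypothetical, minimal subset of a CSL-JSON item.
interface ICslItem {
  id: string;
  title?: string;
  author?: { family?: string; given?: string }[];
}

// Hypothetical shape; the thread only proposes the interface name.
interface ISearchProvider {
  id: string;
  /** Resolves to the ids of the matching items */
  search(query: string): Promise<string[]>;
}

/** Core functionality: searches the locally stored CSL-JSON records. */
class LocalSearchProvider implements ISearchProvider {
  id = 'local';

  constructor(private items: Map<string, ICslItem>) {}

  async search(query: string): Promise<string[]> {
    const needle = query.toLowerCase();
    const ids: string[] = [];
    this.items.forEach((item, id) => {
      const names = (item.author ?? []).map(
        a => `${a.family ?? ''} ${a.given ?? ''}`
      );
      const haystack = [item.title ?? '', ...names].join(' ').toLowerCase();
      if (haystack.indexOf(needle) !== -1) {
        ids.push(id);
      }
    });
    return ids;
  }
}

/** Use the optional provider when present; fall back to local search. */
async function searchWithFallback(
  preferred: ISearchProvider | null,
  fallback: ISearchProvider,
  query: string
): Promise<string[]> {
  if (preferred) {
    try {
      return await preferred.search(query);
    } catch (error) {
      console.warn(`Search provider ${preferred.id} failed`, error);
    }
  }
  return fallback.search(query);
}
```

A provider backed by a web API would implement the same interface; if it is unavailable or fails, the local provider still answers the query.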

@krassowski
Owner

No, it is more of an in-browser cache for JupyterLab; it is not codified into Jupyter at all. And it might change at any point.

But you basically query from it, right? So entries have fields and the matter here is just to find a UI query-style consistent with what other citation managers offer, am I correct?

Yes, the entries follow csl-citation.json schema (which should be read together with csl-data.json schema).

@retorquere

EDTF strings are valid entries of date variables in CSL-JSON schema

Oh yeah that's complex so I don't bother doing it myself, I outsource that to a library.

as is the "structured" form which may take anything between one and three parts which may be strings or numbers and for which the meaning is not very well documented; then you have the extra fields like circa, season, etc and multiple date fields (some records publication date, creation date, etc.;

I find these not too hard to process, but TBH I don't support all possible combinations. What I can sensibly output is constrained by the target format (bibtex and biblatex) and since biblatex supports edtf, I just forward whatever is deemed (by said library) to be valid EDTF.

one of those is mandated by CSL if I recall correctly but the thing is it often only contains the year part and all the details are in the other fields and how they are populated appears random to me after looking at a large sample of records from Zotero).

I don't use the Zotero date parser; BBT's date parser differs significantly from Zotero's.

  • You can of course manually export to bibtex

From Zotero, right?

Correct. BBT is only available in the Zotero client.

Do these pull-export the whole collection or just the files inside the article/publication?

By files, do you mean attachments? There's not yet a JSON-RPC endpoint for that. You can pull down bibtex or biblatex from the endpoint.

I can't figure out the UI for this: do you just write something like author:'Author1 Author2' and then Zotero quick search searches inside of tags? What about BBT, does this translate in a SELECT * WHERE author IN ('author1', 'author2')?

I don't translate it at all; I just pass the text on to the same code that handles the quick search above the item list in Zotero, and return the results.

Thank you @retorquere for your generous advice and explaining how BBT works. From that, I gather it relies on a local Zotero installation and access to its local API to pass on the search tasks;

correct

* JupyterHub and other setups will often have no access to the local Zotero installation as it is on a different computer

There are ways around that, but they're not convenient. I have a branch where I work on a BBT that doesn't need the client, but I have no ETA on that beyond "not soon".

* even on the same computer Zotero does not allow to access its API from the browser easily

correct.

* touching the Zotero database on disk directly is not a maintainable implementation for me (it is not a public API in the first place AFAIK) 

Only pain lies that way. You most certainly never want to write to the DB directly.

and this extension should not do that so we can in the future install it as an isolated package (say flatpak) with restricted or no access to the disk (which should really be the norm now)

Not a great fan of Flatpak et al., but I see the appeal.
