
Add method ListURLs to list all URLs known in the frontier with their next fetch date #93

Merged: 6 commits merged into crawler-commons:master on Sep 18, 2024

Conversation

@klockla (Collaborator) commented Aug 8, 2024

Added a ListURLs method to the API and client to list all URLs known in the frontier with their next fetch date (RocksDB and MemoryFrontier implementations only).

@jnioche (Collaborator) commented Aug 27, 2024

I see that this contains the changes from #92; it would be better to make it independent if possible (I appreciate that this has draft status).

@klockla force-pushed the listurl_github branch 3 times, most recently from 50ece65 to b314367, on September 4, 2024 13:59
@klockla (Collaborator, Author) commented Sep 4, 2024

I rebased and updated this PR; I think it's ready to be reviewed now.

@klockla marked this pull request as ready for review on September 4, 2024 14:03
@jnioche (Collaborator) commented Sep 4, 2024

I had misread what this does: I thought it would return all the URLs within a queue, which would be useful for debugging. Instead, this streams the entire content of the frontier in pages of, say, 100 URLs. What do you want to achieve with this? Backups? Debugging?
Should we consider an export endpoint instead? If so, what format would it use to represent all the data, including the metadata? Should the service implementations be responsible for writing the data and communicating the location of the file to the client?

@klockla (Collaborator, Author) commented Sep 5, 2024

Its main purpose was debugging, but we will probably also use it in our UI to let the user browse the frontier. I didn't consider it an export/backup feature (which would also mean we'd need an import feature).

@jnioche (Collaborator) commented Sep 5, 2024

> Its main purpose was debugging, but we will probably also use it in our UI to let the user browse the frontier. I didn't consider it an export/backup feature (which would also mean we'd need an import feature).

Thanks for the explanation @klockla. In a sense PutURLs is the import feature, but we could have one where a file is made available to the server locally and it ingests it as a batch. This would be quicker than streaming from the client; I think @michaeldinzinger did something similar with the OpenSearch backend he uses at OpenWebSearch.

Going back to our primary topic: would it be OK to list the URLs per queue only? From a client's perspective, you can list the queues separately and get the URLs for each one; this would be equivalent to paging in a sense.

Doing so would make more sense to me, as in most cases you'd want to debug per queue. What do you think?

@klockla (Collaborator, Author) commented Sep 5, 2024

What do you think of the following:

We keep the pagination and add the possibility to restrict to a given queue (if none is specified, we go over all of them).

Parameters would look something like:

```proto
message ListUrlParams {
  // position of the first result in the list; defaults to 0
  uint32 start = 1;
  // max number of values; defaults to 100
  uint32 size = 2;
  // ID for the queue
  string key = 3;
  // crawl ID
  string crawlID = 4;
  // only for the current local instance
  bool local = 5;
}
```
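For illustration, a client-side call using these params might look like the sketch below. The generated class and package names follow URL Frontier's existing gRPC conventions, but the stub wiring, port, and queue value are assumptions, not code from this PR:

```java
import crawlercommons.urlfrontier.URLFrontierGrpc;
import crawlercommons.urlfrontier.Urlfrontier.ListUrlParams;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class ListURLsExample {
    public static void main(String[] args) {
        // assumes a frontier instance listening locally
        ManagedChannel channel =
                ManagedChannelBuilder.forAddress("localhost", 7071).usePlaintext().build();
        URLFrontierGrpc.URLFrontierBlockingStub stub = URLFrontierGrpc.newBlockingStub(channel);

        ListUrlParams params =
                ListUrlParams.newBuilder()
                        .setStart(0)
                        .setSize(100)
                        .setKey("example.com") // optional: restrict to a single queue
                        .setCrawlID("DEFAULT")
                        .build();

        // server-streaming call: the blocking stub returns an Iterator<URLItem>
        stub.listURLs(params)
                .forEachRemaining(item -> System.out.println(item.getKnown().getInfo().getUrl()));

        channel.shutdown();
    }
}
```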

@jnioche (Collaborator) commented Sep 5, 2024

> What do you think of the following:
>
> We keep the pagination and add the possibility to restrict to a given queue (if none is specified, we go over all of them).
>
> Parameters would look something like:
>
> ```proto
> message ListUrlParams {
>   // position of the first result in the list; defaults to 0
>   uint32 start = 1;
>   // max number of values; defaults to 100
>   uint32 size = 2;
>   // ID for the queue
>   string key = 3;
>   // crawl ID
>   string crawlID = 4;
>   // only for the current local instance
>   bool local = 5;
> }
> ```

Yes, I think that would work.

@klockla (Collaborator, Author) commented Sep 6, 2024

Updated the PR to include the queue parameter.

@jnioche (Collaborator) left a comment

Looks great, but given how similar the code is for the memory and RocksDB implementations, should we have the logic in the abstract superclass instead? This is what we do when we retrieve URLs. What do you think?

Review thread on API/urlfrontier.proto (resolved)
fetchDate = String.valueOf(item.getKnown().getRefetchableFromDate());
}

outstream.println(item.getKnown().getInfo().getUrl() + ";" + fetchDate);
jnioche (Collaborator) commented:

It would be good to be able to see any metadata associated with the URL.
IIRC it is possible to de/serialize to JSON pretty easily; see the PutURL clients. It would be good to have a way of specifying the output format, either JSON or character-separated.
Maybe we could have a utility class to share the logic of reading from / writing to the various formats? This could be done later.
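As a sketch of the JSON idea: protobuf's JsonFormat (from protobuf-java-util) can print a message with its metadata included, while the char-separated form remains a fallback. The class below is hypothetical; only the getters come from the snippet above:

```java
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.util.JsonFormat;
import crawlercommons.urlfrontier.Urlfrontier.URLItem;

public class URLItemFormatter {
    /** Renders the whole URLItem as JSON, metadata map included. */
    public static String toJson(URLItem item) throws InvalidProtocolBufferException {
        return JsonFormat.printer().print(item);
    }

    /** Char-separated output: URL and next fetch date only, no metadata. */
    public static String toCharSeparated(URLItem item) {
        return item.getKnown().getInfo().getUrl()
                + ";"
                + item.getKnown().getRefetchableFromDate();
    }
}
```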

klockla (Collaborator, Author) commented:

> It would be good to be able to see any metadata associated with the URL.

I see one problem though: whether it is the MemoryFrontier or RocksDB, we lose the metadata once a URL is completed.

jnioche (Collaborator) commented:

> It would be good to be able to see any metadata associated with the URL.
>
> I see one problem though: whether it is the MemoryFrontier or RocksDB, we lose the metadata once a URL is completed.

Ah, maybe something to fix separately. I suppose we could still display the metadata where possible.

klockla (Collaborator, Author) commented:

Refactored to have the (small) common logic in AbstractFrontierService.
Added an option to print the output in JSON format.

@jnioche (Collaborator) left a comment

That's great, thanks. Please see comments.

names = {"-c", "--crawlID"},
defaultValue = "DEFAULT",
paramLabel = "STRING",
description = "crawl to get the queues for")
jnioche (Collaborator) commented:
Mention what the default value is in the description?
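A sketch of one way to do that: picocli can interpolate the default into the help text via ${DEFAULT-VALUE}, so the description cannot drift from the annotation. The field name below is illustrative:

```java
import picocli.CommandLine.Option;

public class ListURLs {
    @Option(
            names = {"-c", "--crawlID"},
            defaultValue = "DEFAULT",
            paramLabel = "STRING",
            // ${DEFAULT-VALUE} is expanded by picocli when rendering usage help
            description = "crawl to get the queues for (default: ${DEFAULT-VALUE})")
    private String crawl;
}
```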

* @return
*/
public static URLItem buildURLItem(
URLItem.Builder builder, KnownURLItem.Builder kbuilder, URLInfo info, long refetch) {
jnioche (Collaborator) commented:
This is a nice way of simplifying the code in the subclasses, and calling clear() is also a safe way of making sure no data is carried over from a previous entry.
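For reference, a sketch of what the helper plausibly does given the signature in the diff; only the signature comes from the PR, the body below is an assumption:

```java
import crawlercommons.urlfrontier.Urlfrontier.KnownURLItem;
import crawlercommons.urlfrontier.Urlfrontier.URLInfo;
import crawlercommons.urlfrontier.Urlfrontier.URLItem;

public final class URLItems {
    /** Builds a URLItem from reusable builders, wiping previous state first. */
    public static URLItem buildURLItem(
            URLItem.Builder builder, KnownURLItem.Builder kbuilder, URLInfo info, long refetch) {
        // clear() guarantees nothing is carried over from the previous entry
        builder.clear();
        kbuilder.clear();
        kbuilder.setInfo(info);
        kbuilder.setRefetchableFromDate(refetch);
        return builder.setKnown(kbuilder.build()).build();
    }
}
```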

if (key != null && !key.isEmpty() && !e.getKey().getQueue().equals(key)) {
continue;
}
Iterator<URLItem> iter = urlIterator(e, start, maxURLs);
jnioche (Collaborator) commented:
I see that maxURLs applies within a queue; I thought it was global. That's fine, but let's make this more explicit in the .proto and the client, and perhaps add a maxNumberQueues param? For instance, I have 1M queues in my test, and calling getURLs takes forever.
The same seems to apply to start: it applies within a queue, which makes sense, but the doc should make that clear.

jnioche (Collaborator) commented:

Alternatively, we could go back to my original suggestion of returning the list only for a specific queue, in which case the pagination as it is would be fine.

@klockla (Collaborator, Author) commented Sep 16, 2024

I have rebased the PR following your merge and updated it so that pagination is global over all queues.
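A minimal sketch of what global pagination over the queues could look like; the type and helper names are borrowed from the review snippets above and may differ from the actual implementation in AbstractFrontierService:

```java
public void listURLs(ListUrlParams params, StreamObserver<URLItem> responseObserver) {
    String key = params.getKey();
    long start = params.getStart();
    long maxURLs = params.getSize();
    long skipped = 0;
    long sent = 0;

    for (Map.Entry<QueueWithinCrawl, QueueInterface> e : queues.entrySet()) {
        // restrict to the requested queue, if one was specified
        if (key != null && !key.isEmpty() && !e.getKey().getQueue().equals(key)) {
            continue;
        }
        Iterator<URLItem> iter = urlIterator(e);
        while (iter.hasNext()) {
            URLItem item = iter.next();
            if (skipped < start) { // offset counted globally, not per queue
                skipped++;
                continue;
            }
            responseObserver.onNext(item);
            if (++sent >= maxURLs) { // cap counted globally, not per queue
                responseObserver.onCompleted();
                return;
            }
        }
    }
    responseObserver.onCompleted();
}
```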

@jnioche (Collaborator) left a comment

Looks good; see comment.

@jnioche added this to the 2.3 milestone on Sep 17, 2024
@jnioche added the enhancement (New feature or request), API, and Client labels on Sep 17, 2024
@jnioche (Collaborator) left a comment

Could we reuse these iterators in other parts of the code?
(These are my last comments on this issue - I promise)

responseObserver.onCompleted();
}

protected abstract Iterator<URLItem> urlIterator(
jnioche (Collaborator) commented:

This is only used once, I think. Can't we simply have the one below, passing 0 and Integer.MAX_VALUE as the values when called?
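Sketched out, the suggestion reduces this to a single abstract method, with callers passing neutral bounds where they want everything (type names borrowed from the snippet above):

```java
protected abstract Iterator<URLItem> urlIterator(
        Map.Entry<QueueWithinCrawl, QueueInterface> qentry, long start, long max);

// at a call site that needs every URL in a queue:
// Iterator<URLItem> all = urlIterator(entry, 0L, Integer.MAX_VALUE);
```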

klockla (Collaborator, Author) commented:

Yes, I removed the redundant iterator constructors

@klockla (Collaborator, Author) commented Sep 18, 2024

> Could we reuse these iterators in other parts of the code? (These are my last comments on this issue - I promise)

I didn't see any direct opportunity for reuse; maybe in the future...

@jnioche (Collaborator) left a comment

Thanks @klockla

@jnioche merged commit 247b201 into crawler-commons:master on Sep 18, 2024
2 checks passed
@klockla deleted the listurl_github branch on September 18, 2024 16:34
@klockla restored the listurl_github branch on September 18, 2024 16:34
@klockla deleted the listurl_github branch on September 18, 2024 16:35
@jnioche modified the milestone from 2.3 to 2.4.0 on Sep 20, 2024