Add method ListURLs to list all URLs known in the frontier with their next fetch date #93
Conversation
I see that this contains the changes from #92; it would be better to make it independent if possible (I appreciate that this has draft status).
Force-pushed from 50ece65 to b314367.
I rebased and updated this PR; I think it's ready to be reviewed now.
I had misread what this does; I thought it would return all the URLs within a queue, which would be useful for debugging.
Its main purpose was for debugging, but we will also probably use it in our UI to let the user browse the frontier. I didn't consider it an export/backup feature (which would mean we would also need an import feature).
Thanks for the explanation @klockla. In a sense, PutURLs is the import feature, but we could have one where a file is made available to the server locally and it could ingest it as a batch. This would be quicker than streaming from the client; I think @michaeldinzinger did something similar with the OpenSearch backend he uses at OpenWebSearch. Going back to our primary topic, would it be OK to list the URLs per queue only? From a client's perspective you can list the queues separately and get the URLs for each one. This would be equivalent to paging in a sense. Doing so would make more sense to me, as in most cases you'd want to debug per queue. What do you think?
What do you think of the following: we keep the pagination and add the possibility to restrict to a given queue (if none is specified, we go over all of them). Parameters would look something like: message ListUrlParams {
Yes, that would work I think.
Updated the PR to include the queue parameter.
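For illustration only, here is a rough client-side Java sketch of how such a request might be filled in; the field names (start, size, key, crawlID) and the class name are assumptions taken from the proposal above, not necessarily the final API:

    import crawlercommons.urlfrontier.Urlfrontier.ListUrlParams;

    public class ListUrlParamsSketch {
        // Hypothetical illustration of the proposed parameters; field names are assumptions.
        static ListUrlParams buildParams(String queue) {
            ListUrlParams.Builder builder =
                    ListUrlParams.newBuilder()
                            .setStart(0) // pagination offset
                            .setSize(100) // maximum number of URLs to return
                            .setCrawlID("DEFAULT");
            if (queue != null && !queue.isEmpty()) {
                // Restrict the listing to a single queue; list everything otherwise.
                builder.setKey(queue);
            }
            return builder.build();
        }
    }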
Looks great but given how similar the code is for the memory and rocks implementations, should we have the logic in the abstract super class instead? This is what we do when we retrieve URLs. What do you think?
fetchDate = String.valueOf(item.getKnown().getRefetchableFromDate());
}

outstream.println(item.getKnown().getInfo().getUrl() + ";" + fetchDate);
It would be good to be able to see any metadata associated with the URL.
IIRC it is possible to (de)serialize to JSON pretty easily; see the PutURL clients. It would be good to have a way of specifying the output format: JSON or char-separated.
Maybe we could have a utility class to share the logic of reading from / writing to the various formats? This could be done later.
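As a minimal sketch of that idea, assuming the protobuf-java-util dependency used by the PutURL clients is available (the helper class name and the char-separated layout below are illustrative, not the actual code):

    import com.google.protobuf.InvalidProtocolBufferException;
    import com.google.protobuf.util.JsonFormat;
    import crawlercommons.urlfrontier.Urlfrontier.URLItem;

    public class URLItemFormatter {
        // Hypothetical helper: render a URLItem either as JSON or as a char-separated line.
        static String format(URLItem item, boolean json) throws InvalidProtocolBufferException {
            if (json) {
                // JsonFormat handles nested messages, so any metadata would be included too.
                return JsonFormat.printer().print(item);
            }
            return item.getKnown().getInfo().getUrl()
                    + ";"
                    + item.getKnown().getRefetchableFromDate();
        }
    }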
I see one problem though: whether it is in MemoryFrontier or in RocksDB, we lose the metadata once a URL is completed.
Ah, maybe something to fix separately. I suppose we could still display the metadata where possible.
Refactored to have (small) common logic in AbstractFrontierService.
Added the option to print output in JSON format.
That's great, thanks. Please see comments.
names = {"-c", "--crawlID"}, | ||
defaultValue = "DEFAULT", | ||
paramLabel = "STRING", | ||
description = "crawl to get the queues for") |
Could we say what the default value is in the description?
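If the option is declared with picocli, as the annotation style suggests, one way to do that without hard-coding the value is the built-in ${DEFAULT-VALUE} variable; a sketch (the surrounding class and field names are hypothetical):

    import picocli.CommandLine.Option;

    class ListURLsOptionsSketch {
        @Option(
                names = {"-c", "--crawlID"},
                defaultValue = "DEFAULT",
                paramLabel = "STRING",
                // picocli expands ${DEFAULT-VALUE} when rendering the usage help.
                description = "crawl to get the queues for (default: ${DEFAULT-VALUE})")
        String crawl;
    }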
 * @return
 */
public static URLItem buildURLItem(
        URLItem.Builder builder, KnownURLItem.Builder kbuilder, URLInfo info, long refetch) {
This is a nice way of simplifying the code in the subclasses. Calling clear() is also a safe way of making sure no data is carried over from a previous entry.
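For reference, the body of such a helper is presumably along these lines; this is a sketch rather than the actual code, with setter names mirroring the getters used elsewhere in the diff:

    import crawlercommons.urlfrontier.Urlfrontier.KnownURLItem;
    import crawlercommons.urlfrontier.Urlfrontier.URLInfo;
    import crawlercommons.urlfrontier.Urlfrontier.URLItem;

    public class URLItemBuilderSketch {
        // Sketch: reuse the builders across calls, clearing them first so that no state
        // leaks from a previous entry, then wrap the known URL and its next fetch date.
        public static URLItem buildURLItem(
                URLItem.Builder builder, KnownURLItem.Builder kbuilder, URLInfo info, long refetch) {
            builder.clear();
            kbuilder.clear();
            kbuilder.setInfo(info);
            kbuilder.setRefetchableFromDate(refetch);
            return builder.setKnown(kbuilder.build()).build();
        }
    }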
if (key != null && !key.isEmpty() && !e.getKey().getQueue().equals(key)) {
    continue;
}
Iterator<URLItem> iter = urlIterator(e, start, maxURLs);
I see that maxURLs applies within a queue; I thought it was global.
That's fine, but let's make this more explicit in the .proto and the client, and perhaps add a maxNumberQueues param? For instance, I have 1M queues in my test and calling getURLs takes forever.
The same seems to apply to start.
It is within a queue, which makes sense, but the doc should make that clear.
Alternatively, we could go back to my original suggestion of returning the list only for a specific queue, in which case the pagination as it is would be fine.
Added JSON output option for ListURLs
Signed-off-by: Laurent Klock <[email protected]>
Force-pushed from 4c33643 to 095433d.
I have rebased the PR following your merge and updated it so that pagination is global over all queues.
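A simplified sketch of what global pagination over the queues implies, with the shared loop in AbstractFrontierService and the backend supplying the per-queue iterator; names such as getQueues() and the parameter accessors are assumptions, not the exact code:

    // Sketch only: the start/size counters span all queues instead of being reset per queue.
    public void listURLs(ListUrlParams params, StreamObserver<URLItem> responseObserver) {
        String key = params.getKey();
        long start = params.getStart();
        long maxURLs = params.getSize();
        long pos = 0; // absolute position across all queues
        long sent = 0; // number of URLs streamed so far
        for (Entry<QueueWithinCrawl, QueueInterface> e : getQueues().entrySet()) {
            if (sent >= maxURLs) {
                break;
            }
            // Optional restriction to a single queue.
            if (key != null && !key.isEmpty() && !e.getKey().getQueue().equals(key)) {
                continue;
            }
            // The backend (memory, RocksDB, ...) provides the per-queue iterator;
            // the global counters above take care of the paging.
            Iterator<URLItem> iter = urlIterator(e, 0, Integer.MAX_VALUE);
            while (iter.hasNext() && sent < maxURLs) {
                URLItem item = iter.next();
                if (pos++ >= start) {
                    responseObserver.onNext(item);
                    sent++;
                }
            }
        }
        responseObserver.onCompleted();
    }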
Looks good - see comment
service/src/main/java/crawlercommons/urlfrontier/service/AbstractFrontierService.java
Could we reuse these iterators in other parts of the code?
(These are my last comments on this issue - I promise)
    responseObserver.onCompleted();
}

protected abstract Iterator<URLItem> urlIterator(
This is only used once, I think. Can't we simply have the one below, with 0 and Integer.MAX_VALUE as values when called?
Yes, I removed the redundant iterator constructors.
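In other words, something along these lines: a single abstract per-queue iterator, with callers that need the whole queue passing 0 and Integer.MAX_VALUE (a sketch; the exact generic types and helper name are assumptions):

    // Sketch: one abstract per-queue iterator is enough.
    protected abstract Iterator<URLItem> urlIterator(
            Entry<QueueWithinCrawl, QueueInterface> qentry, long start, int max);

    // Hypothetical call site: iterate over everything in the queue.
    protected Iterator<URLItem> allURLs(Entry<QueueWithinCrawl, QueueInterface> qentry) {
        return urlIterator(qentry, 0, Integer.MAX_VALUE);
    }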
I didn't see any direct opportunity for reuse; maybe in the future...
Thanks @klockla
Added method ListURLs in API and client to list all URLs known in the frontier with their next fetch date
(RocksDB & MemoryFrontier implementations only)
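A minimal client-side usage sketch, assuming the generated gRPC blocking stub exposes the new call as listURLs and the server listens on localhost:7071; the method and parameter names here are assumptions based on the discussion above:

    import crawlercommons.urlfrontier.URLFrontierGrpc;
    import crawlercommons.urlfrontier.Urlfrontier.ListUrlParams;
    import crawlercommons.urlfrontier.Urlfrontier.URLItem;
    import io.grpc.ManagedChannel;
    import io.grpc.ManagedChannelBuilder;

    public class ListURLsExample {
        public static void main(String[] args) {
            ManagedChannel channel =
                    ManagedChannelBuilder.forAddress("localhost", 7071).usePlaintext().build();
            URLFrontierGrpc.URLFrontierBlockingStub stub = URLFrontierGrpc.newBlockingStub(channel);

            // Assumed parameter names; see the ListUrlParams discussion above.
            ListUrlParams params =
                    ListUrlParams.newBuilder().setStart(0).setSize(100).setCrawlID("DEFAULT").build();

            // The server streams back URLItem messages with their next fetch date.
            stub.listURLs(params)
                    .forEachRemaining(
                            item ->
                                    System.out.println(
                                            item.getKnown().getInfo().getUrl()
                                                    + ";"
                                                    + item.getKnown().getRefetchableFromDate()));
            channel.shutdown();
        }
    }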