This repository has been archived by the owner on Dec 14, 2023. It is now read-only.
Fix solr collections #842
Open
thepsalmist wants to merge 18 commits into mediacloud:master from thepsalmist:fix-solr-collections
Commits
815042a (philbudne) Make no_dedup_sentences the default for extract-and-vector
c77dd18 (philbudne) common/Dockerfile: skip jieba.cache creation; makes empty root owned …
f37b7f9 (philbudne) solr-base/Dockerfile: try cloning mediacloud config as mediacloud64
e1be9a4 (philbudne) common/Dockerfile: reenable jieba.cache creation & chown it
1868296 (philbudne) apps/common/src/python/mediawords/solr/request.py: add/use SOLR_COLLE…
4ab3729 (philbudne) apps/import-solr-data/src/perl/MediaWords/Solr/Dump.pm: speedups for …
791248a (philbudne) add apps/solr-base/src/solr/aliases.json with "mediacloud2" solr alias
c4fb3d6 (philbudne) apps/common/src/requirements.txt: force MarkupSafe==2.0.1
5845a61 (philbudne) solr-zookeeper: preload aliases.json into zookeeper
40391ed (philbudne) apps/postgresql-server/bin/apply_migrations.sh: increase PGCTL_START_…
f395116 (philbudne) postgresql-pgbouncer/conf/pgbounder.init:
76be844 (philbudne) pgbouncer.ini: use postgresql server ip
bf74554 (philbudne) apps/webapp-api/src/perl/MediaWords/Controller/Api/V2/Timespans.pm
6654fbb (thepsalmist) fix solr query on multiple collections
e6bc74d (thepsalmist) refactor merge solr function
bc3c626 (thepsalmist) fix solr response schema
e8bb5a8 (thepsalmist) update solr merge function
cb168c9 (thepsalmist) fix solr merge buckets duplicat date values
@@ -4,6 +4,7 @@

 import abc
 import time
+import json
 from typing import Union, Optional
 from urllib.parse import urlencode

@@ -24,6 +25,10 @@
 __QUERY_HTTP_TIMEOUT = 15 * 60
 """Timeout of a single HTTP query."""

+# Testing alias!!
+SOLR_COLLECTION = 'mediacloud2'
+MEDIACLOUD_32 = 'mediacloud'
+MEDIACLOUD_64 = 'mediacloud64'

 class _AbstractSolrRequestException(Exception, metaclass=abc.ABCMeta):
     """Abstract .solr.request exception."""

@@ -59,7 +64,7 @@ def __wait_for_solr_to_start(config: Optional[CommonConfig]) -> None:
     """Wait for Solr to start and collections to become available, if needed."""

     # search for an empty or rare term here because searching for *:* sometimes causes a timeout for some reason
-    sample_select_url = f"{config.solr_url()}/mediacloud/select?q=BOGUSQUERYTHATRETURNSNOTHINGNADA&rows=1&wt=json"
+    sample_select_url = f"{config.solr_url()}/{SOLR_COLLECTION}/select?q=BOGUSQUERYTHATRETURNSNOTHINGNADA&rows=1&wt=json"

     connected = False

@@ -152,6 +157,81 @@ def __solr_error_message_from_response(response: Response) -> str:
     return error_message


+def merge_responses(mc_32_bit_collection: dict,mc_64_bit_collection: dict):
+    """
+    Merge solr responses from each of the collections to one
+
+    :param dict1: Response from mediacloud32 collection.
+    :param dict2: Response from mediacloud64 collection.
+
+    """
+    new_response = {}
+
+    new_response.update(mc_32_bit_collection.get("responseHeader", {}))
+
+    mc_32_bit_response = mc_32_bit_collection.get("response", {})
+    mc_64_bit_response = mc_64_bit_collection.get("response", {})
+
+    num_found = mc_32_bit_response.get("numFound", 0) + mc_64_bit_response.get("numFound", 0)
+    start_index = mc_32_bit_response.get("start", 0) + mc_64_bit_response.get("start", 0)
+
+    docs = []
+
+    docs.extend(mc_32_bit_response.get("docs", []))
+    docs.extend(mc_64_bit_response.get("docs", []))
+
+    new_response.update({
+        "response": {
+            "numFound": num_found,
+            "start": start_index,
+            "docs": docs,
+        }
+    })
+
+    # facets
+    if "facets" in mc_32_bit_collection or "facets" in mc_64_bit_collection:
+        mc_32_bit_facets = mc_32_bit_response.get("facets", {})
+        mc_64_bit_facets = mc_64_bit_response.get("facets", {})
+
+        count = mc_32_bit_facets.get("count", 0) + mc_64_bit_facets.get("count", 0)
+        x = mc_32_bit_facets.get("x", 0) + mc_64_bit_facets.get("x", 0)
+
+        categories = {}
+
+        if "categories" in mc_32_bit_facets or "categories" in mc_64_bit_facets:
+            buckets = []
+            mc_32_buckets = mc_32_bit_facets.get("categories", {}).get("buckets", [])
+            mc_64_buckets = mc_64_bit_facets.get("categories", {}).get("buckets", [])
+            merged = {}
+            for item in mc_32_buckets + mc_64_buckets:
+                val = item['val']
+                if val in merged:
+                    merged[val]['count'] += item['count']
+                    merged[val]['x'] += item['x']
+                else:
+                    merged[val] = item.copy()
+
+            merged = list(merged.values())
+            buckets.extend(merged)
+            categories.update({"buckets":buckets})
+
+            new_response.update({
+                "facets": {
+                    "count": count,
+                    "categories": categories
+                }
+            })
+        else:
+            new_response.update({
+                "facets": {
+                    "count": count,
+                    "x": x
+                }
+            })
+
+    return new_response
+
+
 def solr_request(path: str,
                  params: SolrParams = None,
                  content: Union[str, SolrParams] = None,

@@ -191,10 +271,8 @@ def solr_request(path: str,
     if not params:
         params = {}

-    abs_uri = furl(f"{solr_url}/mediacloud/{path}")
-    abs_uri = abs_uri.set(params)
-    abs_url = str(abs_uri)
+    collections = [MEDIACLOUD_32, MEDIACLOUD_64]

     ua = UserAgent()
     ua.set_timeout(__QUERY_HTTP_TIMEOUT)
     ua.set_max_size(None)

@@ -219,21 +297,38 @@ def solr_request(path: str,

         content_encoded = content.encode('utf-8', errors='replace')

-        request = Request(method='POST', url=abs_url)
-        request.set_header(name='Content-Type', value=content_type)
-        request.set_header(name='Content-Length', value=str(len(content_encoded)))
-        request.set_content(content_encoded)
+        results = []
+        for collection in collections:
+            abs_uri = furl(f"{solr_url}/{collection}/{path}")
+            abs_uri = abs_uri.set(params)
+            abs_url = str(abs_uri)
+            request = Request(method='POST', url=abs_url)
+            request.set_header(name='Content-Type', value=content_type)
Review comment: From this, I think we can combine all the headers while creating the
+            request.set_header(name='Content-Length', value=str(len(content_encoded)))
+            request.set_content(content_encoded)
+            results.append(request)

     else:

         request = Request(method='GET', url=abs_url)
+        log.debug(f"Sending Solr request: {request}")

+    responses = []
+    if len(results) > 1:
+        for r in results:
+            response = ua.request(r)
+            if response.is_success():
+                responses.append(response.decoded_content())
+            else:
+                error_message = __solr_error_message_from_response(response=response)
+                raise McSolrRequestQueryErrorException(f"Error fetching Solr response: {error_message}")
+
+        response = merge_responses(json.loads(responses[0]),json.loads(responses[1]))
+        return json.dumps(response)

-    log.debug(f"Sending Solr request: {request}")
-
-    response = ua.request(request)
-
-    if not response.is_success():
-        error_message = __solr_error_message_from_response(response=response)
-        raise McSolrRequestQueryErrorException(f"Error fetching Solr response: {error_message}")
+    else:
+        response = ua.request(request)
+        if not response.is_success():
+            error_message = __solr_error_message_from_response(response=response)
+            raise McSolrRequestQueryErrorException(f"Error fetching Solr response: {error_message}")

-    return response.decoded_content()
+        return response.decoded_content()
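The facet-merging step of `merge_responses` can be exercised on its own. The following is only a minimal sketch, not the PR's code: `merge_buckets` and `merge_facets` are hypothetical helper names, and the bucket keys (`val`, `count`, `x`) are taken from the diff above. One assumption differs from the diff: the diff tests for `"facets"` on the top-level response but then reads it from the nested `"response"` object, whereas this sketch reads it from the top level, which is where Solr's JSON Facet API normally places it.

```python
from typing import Dict, List


def merge_buckets(first: List[dict], second: List[dict]) -> List[dict]:
    """Merge two facet bucket lists, summing 'count' and 'x' for equal 'val'."""
    merged: Dict[object, dict] = {}
    for bucket in first + second:
        val = bucket["val"]
        if val in merged:
            # Same bucket value seen in both collections: add the numbers up.
            merged[val]["count"] += bucket.get("count", 0)
            merged[val]["x"] = merged[val].get("x", 0) + bucket.get("x", 0)
        else:
            # Shallow copy so the input responses are left untouched.
            merged[val] = dict(bucket)
    return list(merged.values())


def merge_facets(resp_32: dict, resp_64: dict) -> dict:
    """Merge the top-level 'facets' objects of two decoded Solr responses."""
    facets_32 = resp_32.get("facets", {})
    facets_64 = resp_64.get("facets", {})

    merged = {"count": facets_32.get("count", 0) + facets_64.get("count", 0)}

    if "categories" in facets_32 or "categories" in facets_64:
        merged["categories"] = {
            "buckets": merge_buckets(
                facets_32.get("categories", {}).get("buckets", []),
                facets_64.get("categories", {}).get("buckets", []),
            )
        }
    else:
        merged["x"] = facets_32.get("x", 0) + facets_64.get("x", 0)

    return merged
```

Keying the merge on `val` is what prevents the same date bucket from appearing twice when both collections return it.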
@@ -0,0 +1 @@
+{"collection":{"mediacloud2":"mediacloud64,mediacloud"}}
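The hard-coded `collections` list in `solr_request` could in principle be derived from this alias file instead. A minimal sketch, assuming the alias value is a comma-separated list of collection names as the file suggests; `collections_for_alias` is a hypothetical helper and not part of the PR:

```python
import json
from typing import List


def collections_for_alias(aliases_json: str, alias: str) -> List[str]:
    """Return the collection names an alias points to, in file order."""
    aliases = json.loads(aliases_json).get("collection", {})
    # An unknown alias yields an empty list rather than raising.
    return [name.strip() for name in aliases.get(alias, "").split(",") if name.strip()]


ALIASES_JSON = '{"collection":{"mediacloud2":"mediacloud64,mediacloud"}}'
collections = collections_for_alias(ALIASES_JSON, "mediacloud2")
```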
Review comment: We can drop the abs_url = str(abs_uri) unless the abs_url is used somewhere else.
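One way to act on this comment is to build each collection's URL in a single expression, with no intermediate `abs_url` variable. The sketch below is illustrative only: it uses the standard library's `urlencode` (already imported in the file) rather than `furl`, and `collection_urls` and the example Solr address are hypothetical names, not part of the PR.

```python
from typing import List, Optional
from urllib.parse import urlencode


def collection_urls(solr_url: str,
                    collections: List[str],
                    path: str,
                    params: Optional[dict] = None) -> List[str]:
    """Build one absolute request URL per Solr collection."""
    query = urlencode(params or {}, doseq=True)
    urls = []
    for collection in collections:
        url = f"{solr_url.rstrip('/')}/{collection}/{path.lstrip('/')}"
        # Only append the query string when there are parameters to send.
        urls.append(f"{url}?{query}" if query else url)
    return urls


# Hypothetical Solr address, for illustration only.
example_urls = collection_urls(
    "http://localhost:8983/solr",
    ["mediacloud", "mediacloud64"],
    "select",
    {"q": "solr", "rows": 1},
)
```

Building the URLs up front would also let the GET branch iterate over the same list, since the diff as shown removes the only definition of `abs_url` that branch relied on.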