It would be great to have DuckDuckGo implemented within the icrawler framework. I created my own script, based on other code (attribution provided below). My code does not conform to the icrawler framework style. It does nothing more than search for images on DDG and return URLs. I’ve looked through the icrawler framework, and I’m not proficient enough to implement it in this style. If you like, I could put something together as a pull request that would provide a minimally viable DDG engine within the framework. Alternatively, I am posting the code here in case someone else wants to implement it themselves: #1

Closed
bibbu994 opened this issue Sep 6, 2020 · 0 comments

bibbu994 commented Sep 6, 2020

It would be great to have DuckDuckGo implemented within the icrawler framework. I created my own script, based on other code (attribution provided below). My code does not conform to the icrawler framework style. It does nothing more than search for images on DDG and return URLs. I’ve looked through the icrawler framework, and I’m not proficient enough to implement it in this style. If you like, I could put something together as a pull request that would provide a minimally viable DDG engine within the framework. Alternatively, I am posting the code here in case someone else wants to implement it themselves:


### image_search_ddg.py
### C. Bryan Daniels
### 9/1/2020
### Adapted from https://github.com/deepanprabhu/duckduckgo-images-api
###

import json
import re
import sys

import requests

headers = {'authority':'duckduckgo.com','accept':'application/json,text/javascript,*/*; q=0.01','sec-fetch-dest':'empty',
        'x-requested-with':'XMLHttpRequest',
        'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/80.0.3987.163 Safari/537.36',
        'sec-fetch-site':'same-origin','sec-fetch-mode':'cors','referer':'https://duckduckgo.com/','accept-language':'en-US,en;q=0.9'}

def image_search_ddg(keywords, max_n=100):
    """Search DuckDuckGo for 'keywords' and print the unique URLs of up to 'max_n' images."""
    url = 'https://duckduckgo.com/'
    # The first request is only needed to obtain the 'vqd' token used by the image endpoint.
    res = requests.post(url, data={'q': keywords})
    searchObj = re.search(r'vqd=([\d-]+)&', res.text)
    if not searchObj:
        print('Token parsing failed!')
        return
    params = (('l','us-en'),('o','json'),('q',keywords),('vqd',searchObj.group(1)),('f',',,,'),('p','1'),('v7exp','a'))
    requestUrl = url + 'i.js'
    urls = []
    while True:
        try:
            res = requests.get(requestUrl, headers=headers, params=params)
            data = json.loads(res.text)
        except (requests.RequestException, ValueError):
            # Stop on a network or JSON error instead of retrying forever.
            return print_uniq(urls)
        for obj in data.get('results', []):
            urls.append(obj['image'])
            max_n = max_n - 1
            if max_n < 1:
                return print_uniq(urls)
        # Follow the pagination link until the results run out.
        if 'next' not in data:
            return print_uniq(urls)
        requestUrl = url + data['next']

def print_uniq(urls):
    for url in set(urls):
        print(url)

if __name__ == "__main__":
    if len(sys.argv) == 2:
        image_search_ddg(sys.argv[1])
    elif len(sys.argv) == 3:
        image_search_ddg(sys.argv[1], int(sys.argv[2]))
    else:
        print("usage: python image_search_ddg.py keywords [max_n]")
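
The script above prints URLs from a set, so their original order is lost, and its token parsing can only be exercised against the live site. Below is a minimal sketch of how those two steps could be pulled out into small helpers that return values and can be tried offline. The names extract_vqd and unique_in_order are hypothetical (they are not in the script above and are not part of icrawler), and the sample string is made up for illustration rather than copied from a real DuckDuckGo response.

```python
import re


def extract_vqd(html):
    """Return the vqd token found in 'html', or None if no token is present."""
    # Same pattern as the script above: digits and dashes up to the next '&'.
    match = re.search(r'vqd=([\d-]+)&', html)
    return match.group(1) if match else None


def unique_in_order(urls):
    """Drop duplicate URLs while keeping the order in which they first appeared."""
    # dict preserves insertion order, so this keeps the first occurrence of each URL.
    return list(dict.fromkeys(urls))


if __name__ == "__main__":
    # Made-up sample input, for illustration only.
    sample_html = "nrj('/d.js?q=cats&vqd=3-1234567890-987&l=us-en');"
    print(extract_vqd(sample_html))
    print(unique_in_order(['a.jpg', 'b.jpg', 'a.jpg', 'c.jpg']))
```

Returning values instead of printing would also make the search easier to wrap in a framework component later, since a caller can decide what to do with the URLs.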

Originally posted by @prairie-guy in hellock/icrawler#82

@bibbu994 bibbu994 self-assigned this Sep 6, 2020