Duck Duck Go Search #82

Open
prairie-guy opened this issue Sep 2, 2020 · 0 comments

@prairie-guy · Contributor

It would be great to have Duck Duck Go implemented within the icrawler framework. I wrote my own script, based on other code (attribution provided below). It does not conform to the icrawler framework style; it does nothing more than search for images on DDG and print their URLs. I've looked through the icrawler framework, and I'm not proficient enough to implement it in that style. If you like, I could put together a pull request that provides a minimally viable DDG engine within the framework. Alternatively, I'm posting the code here in case someone else wants to implement it themselves:


### image_search_ddg.py
### C. Bryan Daniels
### 9/1/2020
### Adapted from https://github.com/deepanprabhu/duckduckgo-images-api
###

import requests, re, json, time, sys

headers = {'authority':'duckduckgo.com','accept':'application/json,text/javascript,*/*; q=0.01','sec-fetch-dest':'empty',
        'x-requested-with':'XMLHttpRequest',
        'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/80.0.3987.163 Safari/537.36',
        'sec-fetch-site':'same-origin','sec-fetch-mode':'cors','referer':'https://duckduckgo.com/','accept-language':'en-US,en;q=0.9'}

def image_search_ddg(keywords,max_n=100):
    """Search DuckDuckGo for 'keywords' and print the unique URLs of up to 'max_n' images"""
    url = 'https://duckduckgo.com/'
    params = {'q':keywords}
    res = requests.post(url,data=params)
    # The search page embeds a 'vqd' token that the image endpoint requires
    searchObj = re.search(r'vqd=([\d-]+)\&',res.text)
    if not searchObj: print('Token Parsing Failed !'); return
    params = (('l','us-en'),('o','json'),('q',keywords),('vqd',searchObj.group(1)),('f',',,,'),('p','1'),('v7exp','a'))
    requestUrl = url + 'i.js'
    urls = []
    retries = 5  # give up after repeated failures instead of looping forever
    while True:
        try:
            res = requests.get(requestUrl,headers=headers,params=params)
            data = json.loads(res.text)
        except (requests.RequestException, ValueError):
            # Network error or non-JSON response: wait briefly, then retry
            retries -= 1
            if retries < 1: return print_uniq(urls)
            time.sleep(1)
            continue
        for obj in data.get('results', []):
            urls.append(obj['image'])
            max_n = max_n - 1
            if max_n < 1: return print_uniq(urls)
        # Follow the pagination link until no further page is offered
        if 'next' not in data: return print_uniq(urls)
        requestUrl = url + data['next']

def print_uniq(urls):
    for url in set(urls):
        print(url)

if __name__ == "__main__":
    if len(sys.argv) == 2: image_search_ddg(sys.argv[1])
    elif len(sys.argv) == 3: image_search_ddg(sys.argv[1],int(sys.argv[2]))
    else: print("usage: python image_search_ddg.py keywords [max_n]")
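
For anyone porting this into icrawler, it may help to separate the parsing from the network calls so that the URLs are returned rather than printed. The following is only a rough sketch under assumptions: `extract_vqd` and `collect_image_urls` are hypothetical helper names, not part of icrawler or of the script above, and the sample HTML and JSON are made-up stand-ins for what DuckDuckGo returns. It reuses the same `vqd` regular expression and the same `results` / `image` keys as the script.

```python
import json
import re


def extract_vqd(html):
    """Return the vqd token embedded in a search page, or None if absent."""
    match = re.search(r'vqd=([\d-]+)&', html)
    return match.group(1) if match else None


def collect_image_urls(pages, max_n=100):
    """Collect up to max_n unique image URLs, in first-seen order.

    'pages' is any iterable of JSON response bodies (strings), each shaped
    like the i.js responses the script above reads: {'results': [{'image': ...}]}.
    """
    seen = set()
    urls = []
    for body in pages:
        data = json.loads(body)
        for obj in data.get('results', []):
            image = obj.get('image')
            if image and image not in seen:
                seen.add(image)
                urls.append(image)
                if len(urls) >= max_n:
                    return urls
    return urls


# Made-up sample data, for illustration only (not real DuckDuckGo output)
sample_html = '<script>nrj("/d.js?q=cats&vqd=3-1234567890-987654321&l=us-en")</script>'
sample_pages = [
    json.dumps({'results': [{'image': 'https://example.com/a.jpg'},
                            {'image': 'https://example.com/b.jpg'}],
                'next': 'i.js?s=100'}),
    json.dumps({'results': [{'image': 'https://example.com/a.jpg'},
                            {'image': 'https://example.com/c.jpg'}]}),
]

token = extract_vqd(sample_html)
found = collect_image_urls(sample_pages, max_n=3)
```

Unlike the script, which counts down `max_n` before removing duplicates and can therefore print fewer than `max_n` URLs, this sketch counts only unique URLs toward the limit and keeps them in the order they were found.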
