Butler is a web-based Know Your Customer (KYC) application that helps fill the slots of an entity profile using human-in-the-loop feedback and a simple search query that can reach the open and dark web as well as enterprise search repositories.
It can use SRI's Lighthouse Search backend for free-text search and information correlation.
The primary use case is to help analysts who start with a small piece of information, such as a phone number or user handle, and need to build a complete profile of the entity or persona. Analysts often rely on Google search and a spreadsheet, and few tools aggregate and analyze search results so that the relevant profile information is captured. Entity resolution is also often ambiguous, so Butler clusters pages in an attempt to group similar pages and information together.
At a high level, Butler depends on four software projects:
- CoreNLP Server (used for entity and information extraction)
- Elasticsearch (used as the application database)
- Butler Server (scraping, analytic, and data processing component)
- Butler UI (user interface)
This installation works on Linux and OS X and requires Docker and Git on your machine. It has been tested with Docker version 17.09.0.
For basic use, configure Docker with at least 2 CPUs and 8 GB of memory.
This installation runs each of the four software components listed above in a separate Docker container.
- CoreNLP Server runs on port 9000.
- Elasticsearch runs on port 9200.
- Butler Server runs on port 5000.
- Butler UI runs on port 3000.
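Once the containers are started, the port list above can be probed with a short script as a quick sanity check. This is a minimal sketch and not part of the Butler repository: it assumes the containers publish their ports on localhost, and the names `SERVICES`, `is_port_open`, and `check_services` are hypothetical. An open port only shows that something is listening, not that the service behind it is healthy.

```python
import socket

# Ports taken from the list above; the host is an assumption (local Docker install).
SERVICES = {
    "CoreNLP Server": 9000,
    "Elasticsearch": 9200,
    "Butler Server": 5000,
    "Butler UI": 3000,
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_services(host="localhost", probe=is_port_open):
    """Map each service name to whether its port accepts connections."""
    return {name: probe(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'listening' if up else 'not reachable'}")
```

The `probe` parameter exists only so the check can be exercised without live containers; in normal use the default TCP probe is enough.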
# Go get the project!
git clone https://github.com/jgawrilo/butler_install.git
# Move into the project directory!
cd butler_install
# Install the full app!
./install.sh
# Start the app (all containers) for use/testing!
./start.sh
# Head to http://localhost:3000 in your browser. Use the application! See 'Testing' below if you don't know what to do. Go ahead! Try it out!
# Stop the app (all containers) when you're done!
./stop.sh
- Head to http://localhost:3000 and make sure the Butler start screen appears. When it does, start a project called 'justin'.
- Type 'justin gawrilow' in the search bar and hit enter. This starts mining results.
- The search might take a few minutes to complete. Please keep in mind the tool goes to the open (and possibly dark) web and pulls results on the fly, taking screenshots, parsing HTML and trying to fill out a profile.
- After some time you should see a few results come back. Click on the clusters in the treemap or legend to check out the pages.
- You can also check out the profile by clicking 'Profile' in the upper right corner.
- To get more results, click on the 'More' button in the upper left. Again, after some time you'll see even more pages associated with 'justin gawrilow'.
- To start another project, click the button in the upper right and then click 'Close' or just close the browser. You can always go back to your old project or start a new one.
- For more information on how to use more features of the tool, please see the User Manual below.
You can speed up Butler's search and scraping by installing gg on separate servers and then adding those endpoints to the butler_server config.json.
This distributes the search across those servers and limits the number of calls any one server receives.
For example (replace the address with that of your own server):
"search_boxes":["http://40.167.321.126:7777/get_urls"]
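Editing the file by hand works, but a small helper can avoid duplicate entries. The sketch below assumes config.json is a JSON object whose `search_boxes` key holds a list of URLs, as in the fragment above. The function names `add_search_box` and `update_config_file` and the host `search-box.example` are hypothetical. The port and `/get_urls` path are copied from the example, and the location of config.json in your checkout is left for you to supply.

```python
import json


def add_search_box(config, endpoint):
    """Return a copy of config with endpoint appended to "search_boxes" (no duplicates)."""
    updated = dict(config)
    boxes = list(updated.get("search_boxes", []))
    if endpoint not in boxes:
        boxes.append(endpoint)
    updated["search_boxes"] = boxes
    return updated


def update_config_file(path, endpoint):
    """Read a JSON config file, add the endpoint, and write the result back."""
    with open(path) as f:
        config = json.load(f)
    with open(path, "w") as f:
        json.dump(add_search_box(config, endpoint), f, indent=2)


if __name__ == "__main__":
    # In-memory demonstration with a placeholder host; no file is touched.
    demo = add_search_box({"search_boxes": []}, "http://search-box.example:7777/get_urls")
    print(json.dumps(demo, indent=2))
```

To apply the change to a real installation, call `update_config_file` with the path to the butler_server config.json, then restart the containers so the server rereads its configuration (assuming it loads the file at startup).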
Please download this brief for more details about the application: Butler Cheat Sheet
Licensed under Apache-2.0. Developed under the DARPA Memex program.