Efficiently reducing sampling bias in citizen science programs using principal-agent games, such as Avicaching in eBird, a bird-observational dataset.
Authors: Anmol Kabra, Yexiang Xue, Carla P. Gomes.
The publication is available at anmolkabra.com/docs/avicaching-compass19.pdf (doi: 10.1145/3314344.3332495), and this work is licensed CC-BY-4.0.
If you find this work useful, please cite it as:
Anmol Kabra, Yexiang Xue, and Carla P. Gomes. 2019. GPU-accelerated Principal-Agent Game for Scalable Citizen Science.
In ACM SIGCAS Conference on Computing and Sustainable Societies (COMPASS) (COMPASS ’19), July 3–5, 2019, Accra, Ghana.
ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3314344.3332495
Citizen science programs have been instrumental in boosting sustainability projects, large-scale scientific discovery, and crowdsourced experimentation. Nevertheless, these programs witness challenges in submissions' quality, such as sampling bias resulting from citizens' preferences to complete some tasks over others. The sampling bias frequently manifests itself in the program's dataset as spatially clustered submissions, which reduce the efficacy of the dataset for subsequent scientific studies. To address the spatial clustering problem, programs use reward schemes obtained from game-theoretical models to incentivize citizens to perform tasks that are more meaningful from a scientific point of view. Herein we propose a GPU-accelerated approach for the Avicaching game, which was recently introduced by the eBird citizen science program to incentivize birdwatchers to collect bird data from under-sampled locations. Avicaching is a Principal-Agent game, in which the principal corresponds to the citizen science program (eBird) and the agents to the birdwatchers or citizen scientists. Previous approaches for solving the Avicaching game used approximations based on mixed-integer programming and knapsack algorithms combined with learning algorithms, using standard CPU hardware. Following the recent advances in scalable deep learning and parallel computation on Graphics Processing Units (GPUs), we propose a novel approach to solve the Avicaching game, which takes advantage of neural networks and parallelism for large-scale games. We demonstrate that our approach better captures agents' behavior, which allows better learning and more effective incentive distribution in a real-world bird observation dataset. Our approach also allows for massive speedups using GPUs.
As Avicaching is representative of games that are aimed at reducing spatial clustering in citizen science programs, our scalable reformulation for Avicaching enables citizen science programs to tackle sampling bias and improve submission quality on a large scale.
The project was tested on Ubuntu 16.04 (64-bit), though we expect it to work on any 64-bit Linux OS.
Clone the repository and install the conda environment avicaching from the environment.yml file:

    conda env create -f environment.yml

You can change the name of the conda environment by modifying the first line of the environment.yml file.
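If you prefer to script the rename rather than edit the file by hand, the following is a minimal sketch in Python. It assumes the first line of environment.yml is the `name:` entry, as described above; the helper name `rename_conda_env` is hypothetical and not part of this repository.

```python
from pathlib import Path


def rename_conda_env(yml_path, new_name):
    """Rewrite the first line of a conda environment file to `name: <new_name>`.

    Assumption: the first line of the file is the `name:` entry.
    The function refuses to touch the file otherwise.
    """
    path = Path(yml_path)
    lines = path.read_text().splitlines()
    if not lines or not lines[0].startswith("name:"):
        raise ValueError(f"{path} does not start with a 'name:' line")
    lines[0] = f"name: {new_name}"
    # Write the file back with the remaining lines unchanged.
    path.write_text("\n".join(lines) + "\n")
    return lines[0]
```

For example, `rename_conda_env("environment.yml", "my_env")` (where `my_env` is a placeholder name) would change the environment name before you run `conda env create`.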
We provide synthetic datasets for setup purposes and running scalability experiments. Please email [email protected] if you need access to the original eBird data or other files used for our experiments.
- The outputs of the scripts require this directory structure:
  - stats/
    - find_weights/
      - logs/
      - map_plots/
      - plots/
      - weights/
    - find_rewards/
      - logs/
      - plots/
      - test_rewards_results/

  You can create this structure with:

      for dir in logs map_plots plots weights; do mkdir -p "stats/find_weights/$dir/"; done
      for dir in logs plots test_rewards_results; do mkdir -p "stats/find_rewards/$dir"; done
- Running the nn_avicaching_find_weights.py file will run the identification problem models. You must specify the number of layers in the model with the flag --layers k.
- Running the nn_avicaching_find_rewards.py file with the location of the weights file from the identification problem models (specified with --weights-file filename) will run the pricing problem models. The script automatically sets the number of layers to the one used in the identification problem model.
- All other flags in both scripts are optional and default to basic settings.
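The steps in the list above can be sketched in Python as follows. This is only an illustration: the helper names `create_stats_dirs` and `build_commands` are hypothetical, the interpreter is assumed to be available as `python`, and any weights file name you pass is your own. Only the script names, the `--layers` and `--weights-file` flags, and the directory layout come from this README.

```python
from pathlib import Path

# Sub-directories expected under stats/, taken from the layout listed above.
STATS_LAYOUT = {
    "find_weights": ["logs", "map_plots", "plots", "weights"],
    "find_rewards": ["logs", "plots", "test_rewards_results"],
}


def create_stats_dirs(root="."):
    """Ensure stats/<problem>/<subdir>/ exists under `root`; return the paths."""
    paths = []
    for problem, subdirs in STATS_LAYOUT.items():
        for sub in subdirs:
            path = Path(root) / "stats" / problem / sub
            path.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
            paths.append(path)
    return paths


def build_commands(layers, weights_file):
    """Return argv lists for the identification run, then the pricing run.

    `layers` is passed to --layers; `weights_file` is passed to --weights-file.
    All other flags are left at the scripts' defaults.
    """
    find_weights = ["python", "nn_avicaching_find_weights.py",
                    "--layers", str(layers)]
    find_rewards = ["python", "nn_avicaching_find_rewards.py",
                    "--weights-file", str(weights_file)]
    return find_weights, find_rewards
```

Calling `create_stats_dirs()` from the repository root and then passing each list from `build_commands(...)` to `subprocess.run(cmd, check=True)` would run the two stages in order. The pricing run needs the weights file written by the identification run, so the second command should start only after the first has finished.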