diff --git a/papers/vldb-2020/sections/abstract.tex b/papers/vldb-2020/sections/abstract.tex
index d25d521..10b12f2 100644
--- a/papers/vldb-2020/sections/abstract.tex
+++ b/papers/vldb-2020/sections/abstract.tex
@@ -2,15 +2,15 @@
 Collaborative data science platforms, such as Google Colaboratory and Kaggle, affected the way users solve machine learning tasks.
 Instead of solving a task in isolation, users write their machine learning workloads and execute them on these platforms and share the workloads with others.
 This enables other users to learn from, modify, and make improvements to the existing workloads.
-However, this collaborative platforms suffers from two inefficiencies.
+However, collaborative platforms suffer from two inefficiencies.
 First, storing all the artifacts, such as raw datasets, generated features, and models with their hyperparameters, requires massive amounts of storage.
-As a result, only some of the artifacts such as scripts, and models are stored and users must re-execute the scripts and operations to reconstruct the desired artifact.
+As a result, only some of the artifacts, such as raw data and machine learning models, are stored, and users must re-execute the scripts and operations to reconstruct the desired artifact.
 \hl{Second, even if all the artifacts are stored, manually finding desired artifacts is a time-consuming process.}
 % Tilmann: this is only a conclusion of the first problem, better say finding the artifacts is slow or similar... *Behrouz: I wanted to use the word manually (or something equal) to contrast between the current 'manual' querying or reading the scripts vs our way of reuse or warmstarting that doesn't require user's intervention.
 The contributions of this paper are two-fold.
 First, we utilize a graph to store the artifacts and operations of machine learning workloads as vertices and edges, respectively.
 %TODO Maybe we should rephrease "with high expected rates of future reuse" since we are also looking into quality
-Since storing all the artifacts is not feasible, we propose two algorithms for selecting the artifacts with likelihood of future reuse.
+Since storing all the artifacts is not feasible, we propose two algorithms for selecting the artifacts with a high likelihood of future reuse.
 We then store the selected artifacts in memory for quick access, a process which we call artifact materialization.
 The algorithms consider several metrics, such as access frequency, size of the artifact, and quality of machine learning models to decide what artifacts to materialize.
 Second, using the graph, we propose three optimizations, namely reuse, model warmstarting, and fast hyperparameter tuning, to speed up the execution of the future workloads and increase the efficiency of collaborative data science platforms.
diff --git a/papers/vldb-2020/sections/background.tex b/papers/vldb-2020/sections/background.tex
index a795569..faf7663 100644
--- a/papers/vldb-2020/sections/background.tex
+++ b/papers/vldb-2020/sections/background.tex
@@ -9,8 +9,8 @@ \subsection{Motivating Example}\label{subsec-motivational-example}
 Users can also make their workloads publicly available to other users.
 As a result, many Kaggle users work together to find high-quality solutions.
-Kaggle utilizes docker containers to provide isolated computational environments called Kaggle kernels.
-Kaggle groups kernels by competition. Figure \ref{example-use-case} shows the infrastructure of Kaggle.
+Kaggle utilizes Docker containers to provide isolated computational environments called Kaggle kernels and groups these kernels by competition.
+Figure \ref{example-use-case} shows the infrastructure of Kaggle.
 Each kernel has limited CPU, GPU, disk space, and memory (i.e., 4 CPU cores, 17 GB of RAM, 5 GB of disk space, and a maximum of 9 hours of execution time. GPU kernels have 2 CPU cores and 13 GB of RAM\footnote{https://www.kaggle.com/docs/kernels}).
 In busy times, this results in users to be placed in queues (especially for GPU-enabled machines) until resources become available.
diff --git a/papers/vldb-2020/sections/introduction.tex b/papers/vldb-2020/sections/introduction.tex
index e0a35d2..e8c1c98 100644
--- a/papers/vldb-2020/sections/introduction.tex
+++ b/papers/vldb-2020/sections/introduction.tex
@@ -23,7 +23,7 @@ \section{Introduction} \label{sec-introduction}
 As a result, many data scientists execute their machine learning workload without checking if they can reuse part of the workload from the existing ones.
 % S
-We propose a solution for efficiently storing the artifacts inside a graph, where vertices are the artifacts and edges are the operations connecting the artifacts and automatically optimize new workloads using the graph.
+We propose a solution for efficiently storing the artifacts inside a graph, which we refer to as the \textit{experiment graph}, where vertices are the artifacts and edges are the operations connecting the artifacts, and for automatically optimizing new workloads using the graph.
 We propose two algorithms to select promising artifacts and store them in memory for quick access, a process which we refer to as the artifact materialization.
 The first algorithm utilizes two different types of metrics for materializing the artifacts, i.e., general and machine learning specific metrics.
 The general metrics include the size and access frequency of the artifacts and the run-time of the operations.
@@ -31,7 +31,7 @@ \section{Introduction} \label{sec-introduction}
 Since many of the artifacts have overlapping data columns, we perform column deduplication and remove duplicated columns before storing them.
 The second algorithm is storage-aware, i.e., it takes into account the deduplication information before deciding on what artifacts to materialize.
-Using the \textit{experiment graph}, we automatically extract information to optimize the process of design and execution of future machine learning workloads.
+Using the experiment graph, we automatically extract information to optimize the design and execution of future machine learning workloads.
 Specifically, we provide three optimizations, namely, reuse, model warmstarting, \hldel{and fast hyperparameter tuning}.
 In reuse, we look for opportunities to reuse an existing materialized artifact to avoid data reprocessing.
 Reuse decreases the data processing time especially during the initial exploratory data analysis phase where many data scientists perform similar data transformation, aggregation, and summarization operations on the data.
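
Note on the introduction hunk above: the experiment graph can be pictured as a property graph whose vertices are artifacts and whose edges are operations, annotated with the metrics the text mentions (artifact size, access frequency, operation run-time, model quality). The sketch below is only an illustrative Python reading of that description; `Artifact`, `Operation`, `ExperimentGraph`, and all field names are hypothetical and are not the system's actual API.

```python
# Minimal, hypothetical sketch of an experiment graph: artifacts are vertices,
# operations are edges. Field names mirror the metrics mentioned in the text.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Artifact:
    artifact_id: str
    size_mb: float                          # storage footprint, used when materializing
    access_frequency: int = 0               # how often workloads read this artifact
    model_quality: Optional[float] = None   # e.g., validation score, if the artifact is a model


@dataclass
class Operation:
    source_id: str
    target_id: str
    name: str                               # e.g., "impute", "one_hot_encode", "fit_model"
    runtime_sec: float                      # cost of recomputing the target from the source


@dataclass
class ExperimentGraph:
    artifacts: Dict[str, Artifact] = field(default_factory=dict)
    operations: List[Operation] = field(default_factory=list)

    def add_artifact(self, artifact: Artifact) -> None:
        self.artifacts[artifact.artifact_id] = artifact

    def add_operation(self, op: Operation) -> None:
        # Both endpoints must already be registered as vertices.
        assert op.source_id in self.artifacts and op.target_id in self.artifacts
        self.operations.append(op)


# Toy workload: raw dataset -> engineered features -> trained model.
g = ExperimentGraph()
g.add_artifact(Artifact("raw_train_csv", size_mb=500.0, access_frequency=12))
g.add_artifact(Artifact("features_v1", size_mb=150.0, access_frequency=7))
g.add_artifact(Artifact("model_v1", size_mb=2.0, access_frequency=3, model_quality=0.87))
g.add_operation(Operation("raw_train_csv", "features_v1", "feature_engineering", runtime_sec=240.0))
g.add_operation(Operation("features_v1", "model_v1", "fit_gradient_boosting", runtime_sec=900.0))
```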
diff --git a/papers/vldb-2020/sections/materialization.tex b/papers/vldb-2020/sections/materialization.tex
index 856f574..f1eb93f 100644
--- a/papers/vldb-2020/sections/materialization.tex
+++ b/papers/vldb-2020/sections/materialization.tex
@@ -76,8 +76,9 @@ \subsection{Materialization Problem Formulation}\label{subsec-materialization-pr
 \subsection{ML-Based Greedy Algorithm}\label{subsec-ml-based-materialization}
 We propose a greedy heuristic-based algorithm for materializing the artifacts in the experiment graph which aims to minimize the weighted recreation cost function and maximize the estimated quality function.
-Every task $T$ in the experiment graph has storage budget and runs a separate instance of the materialization algorithm.
-For example, in Figure \ref{improved-use-case}, there is a materializer component and the competitions A, B, and C each have a dedicated storage budget.
+Every task $T$ is associated with an experiment graph.
+Each task also has a storage budget and runs a separate instance of the materialization algorithm.
+For example, in Figure \ref{improved-use-case}, there are three separate experiment graphs for the competitions A, B, and C, each having a dedicated storage budget and running a separate materializer component.
 \begin{algorithm}[h]
 \caption{Artifacts-Materialization}\label{algorithm-materialization}
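
The Artifacts-Materialization algorithm itself is not included in this hunk, so the following is only a hedged sketch of a greedy, budget-constrained selection that is consistent with the surrounding description (reward high recreation cost and frequent access, reward model quality, normalize by size, respect a per-task storage budget). The utility function, the quality weight, and all identifiers are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical greedy materialization pass for one task's experiment graph.
# Each artifact is scored by the recreation cost it saves, weighted by access
# frequency, plus a bonus for model quality, normalized by its storage size.
from typing import Dict, Set


def greedy_materialize(
    recreation_cost: Dict[str, float],   # seconds to recompute each artifact from its inputs
    access_frequency: Dict[str, int],    # observed reads per artifact
    size_mb: Dict[str, float],           # storage footprint (e.g., after column deduplication)
    model_quality: Dict[str, float],     # 0.0 for non-model artifacts
    budget_mb: float,                    # per-task storage budget
) -> Set[str]:
    def utility(artifact: str) -> float:
        saved = recreation_cost[artifact] * access_frequency[artifact]
        return (saved + 100.0 * model_quality[artifact]) / max(size_mb[artifact], 1e-6)

    materialized: Set[str] = set()
    remaining = budget_mb
    # Greedily pick the highest-utility artifacts that still fit in the remaining budget.
    for artifact in sorted(recreation_cost, key=utility, reverse=True):
        if size_mb[artifact] <= remaining:
            materialized.add(artifact)
            remaining -= size_mb[artifact]
    return materialized


# Toy usage with three artifacts and a 200 MB budget for the task.
print(greedy_materialize(
    recreation_cost={"features_v1": 240.0, "model_v1": 900.0, "raw_train_csv": 0.0},
    access_frequency={"features_v1": 7, "model_v1": 3, "raw_train_csv": 12},
    size_mb={"features_v1": 150.0, "model_v1": 2.0, "raw_train_csv": 500.0},
    model_quality={"features_v1": 0.0, "model_v1": 0.87, "raw_train_csv": 0.0},
    budget_mb=200.0,
))
```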