From cc8268556778bd29ce0512d694a7270df6d15292 Mon Sep 17 00:00:00 2001
From: sourav-mazumder
Date: Wed, 22 Nov 2017 10:28:31 -0800
Subject: [PATCH] Some spelling and grammar changes

---
 spark-datasource-rest/README.md | 50 ++++++++++++++++-----------------
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/spark-datasource-rest/README.md b/spark-datasource-rest/README.md
index 2424ae7..6037b3d 100644
--- a/spark-datasource-rest/README.md
+++ b/spark-datasource-rest/README.md
@@ -1,10 +1,10 @@
# Rest Data Source for Apache Spark

-- This is a library for getting data from Rest based micro services/APIs for multiple input data points (in parallel) and collating the results, retuned by the API calls, in a Dataframe of Rows of Structure.
+- This is a library for calling REST based services/APIs for multiple sets of input parameters in parallel and collating the results, returned by the REST service, in a Dataframe.

-- Rest based micro services (for example Address Validation services, Google Search API, Watson Natural Language Proceing API, etc.) typically return result as one single record at a time for a specific value of imput parameters. However, for many Data Science problems same API needs to be called for multiple times for thousands of input Data points in hand (e.g. validating Address for a set of target customers, Getting personality insights from thousands of Tweets, getting Provider's details from NPI registry for a list of potential Doctors, etc.)
+- REST based services (for example Address Validation services, Google Search API, Watson Natural Language Processing API, etc.) typically take only one set of input parameters at a time and return the corresponding record(s). However, for many Data Science problems, the same API needs to be called multiple times to account for a large set of different input parameters (e.g. validating the addresses for a set of target customers; getting personality insights from thousands of Tweets; obtaining a Provider's details from the NPI registry for a list of potential Doctors, etc.)

-- This package supports calling the target micro service API in a distributed way for different data points. Also returns the results in a Structure (in a Dataframe) specific to the API signature without user having to specify the same.
+- This package supports calling the target service API in a distributed way for different input parameter sets. The results are returned in a DataFrame in a Structure specific to the API without the user having to specify this schema.

## Requirements

@@ -32,25 +32,25 @@ $SPARK_HOME/bin/spark-shell --jars spark-datasource-rest_2.11-2.1.0-SNAPSHOT.jar
```

## Features

-This package allows multiple calls in parallel to a target Rest based Microservice for a set of different input data points. The input data points can be passed as a Temporary Spark Table. The column names of the table should be same as the keys of the target API. The rows in the table will have the values for the keys for which the target API has to be called multiple times (one time for one combination of the values of the keys). The result from the multiple calls of the API is returned as a [Spark DataFrames](https://spark.apache.org/docs/1.6.0/sql-programming-guide.html) of Rows of the resultant structure returned by the target API.
+This package allows multiple calls, in parallel, to a target REST based micro service for a set of different input parameters.
These parameters can be passed as a Temporary Spark Table where the column names of the table should be the same as the keys of the target API. Each row in the table, and the corresponding combination of parameter values, will be used to make one API call. The result from the multiple calls to the API is returned as a [Spark DataFrame](https://spark.apache.org/docs/1.6.0/sql-programming-guide.html) of Rows with an output structure matching that of the target API's response.

-For calling the target Rest service this library supports several options:
-* `url`: This is the uri of the target Microservice. You can also provide the common parameters (those don;t vary with the input data) in this url. This is a mandatory parameter.
-* `input`: You need to pass the name of the Temporary Spark Table which contains the input data set. This is a mandatory parameter too.
+This library supports several options for calling the target REST service:
+* `url`: This is the uri of the target micro service. You can also provide the common parameters (those that don't vary with each API call) in this url. This is a mandatory parameter.
+* `input`: You need to pass the name of the Temporary Spark Table which contains the input parameter set. This is a mandatory parameter too.
* `method`: The supported http/https method. Possible types supported right now are `POST`, and `GET`. Default is `POST`
* `userId` : The userId in case the target API needs basic authentication.
* `userPassword` : The password in case the target API needs basic authentication
* `partitions`: Number of partition to be used to increase parallelism. Default is 2.
* `connectionTimeout` : In case the target API needs high time to connect. Default is 1000 (in ms)
* `readTimeout` : In case the target API returns large volume of data which needs more read time. Default is 5000 (in ms)
-* `schemaSamplePcnt` : Percentage of number of records in the input table to be used to infer teh schema. The default is "30" and minimum is 3. Incarese this number in case you are getting error or the schema is not propery inferred.
-* `callStrictlyOnce` : This value is used to ensure if the backend API would be called only once for each input value or not. The default is "N". In that case the back end API may get called for multiple times - once for inferring the schema and then for other operations. If this value is set to "Y" the backend API would be called only once (during infering the schema) for all of the input data points and would be cached. This option is useful when the target APIs are paid service or does not support calls per day/per hour beyond certain number. However, the results would be cached which will increase the memory usage.
+* `schemaSamplePcnt` : Percentage of records in the input table to be used to infer the schema. The default is "30" and the minimum is 3. Increase this number in case you are getting an error or the schema is not properly inferred.
+* `callStrictlyOnce` : This value can be used to ensure the backend API is called only once for each set of input parameters. The default is "N", allowing the backend API to be called multiple times - once for inferring the schema and then again for other operations. If this value is set to "Y" the backend API will be called only once (while inferring the schema) for all of the input parameter sets and the results will be cached. This option is useful when the target API is a paid service or limits the number of calls allowed per day/per hour.
However, caching the results will increase memory usage.

## Typical Structure of the Dataframe returned by Rest Data Source

-The dataframe created by this Rest Data Source will return a set of Rows of same Structure. The Structure internally will contain the input fields with the names same as those passed through the input table. The structure will also contain the output returned by the target Rest API under the field 'output'. Whatever gets returned within 'output' field would be specific to the Rest API being called. But the structure of the same can be easily obtained by printSchema method of Dataframe.
+The dataframe created by this REST Data Source will return a set of Rows of the same Structure. The Structure will contain the input fields that were used for the API call as well as the returned output under a new column named 'output'. Whatever gets returned in 'output' is specific to the target REST API. Its structure can easily be obtained using the printSchema method of the DataFrame.

-Here below goes an example of the structure retutned by Rest Data Source when the target Rest API was [Watson API for Natural Language Understanding (with sentiment as feature)] (https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/).
+Below is an example of the structure returned using REST Data Source with the [Watson API for Natural Language Understanding, with sentiment as a feature](https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/).

```scala

@@ -73,14 +73,14 @@ root
```

-In this example 'q' and 'url' are the input parameters passed for each data point. So the Temporary Spark Table, that used as input table, had two columns - 'q' and 'url'. Each row of that table had different values for 'q' and 'url'. The 'output' field contains the result returned by the Wtson API for natural language understanding.
+In this example, 'q' and 'url' are the input parameters passed to the service; the Temporary Spark Table used as input had two columns - 'q' and 'url'. Each row of that table had different values for 'q' and 'url'. The 'output' field contains the result returned by the Watson API for Natural Language Understanding.

-Sometimes there could be an additional field under root, namely '_corrupt_records'. This field will contain the outputs for the records for which the API returned an error.
+Sometimes there could be an additional field under root, namely '\_corrupt_records'. This field will contain the outputs for the records for which the API returned an error.

## Examples (to try out with Spark Shell)

-The examples below shows how to use this Rest Data Source for SODA api. The examples here get the Socrata dataset using SODA API. The columns in the Socrata dataset is presented by a field in the SODA API. Records are searched for using filters and SoQL queries (https://dev.socrata.com/docs/queries/). We will be using the filters with our API call.
+The examples below show how to use this REST Data Source with the SODA API to retrieve a Socrata dataset. The columns in the Socrata dataset have corresponding fields in the SODA API. Records are searched for using filters and [SoQL queries](https://dev.socrata.com/docs/queries/). The examples below demonstrate how to use the filters.

### Scala API

@@ -89,17 +89,17 @@ The examples below shows how to use this Rest Data Source for SODA api. The exam
// Create the target url string for Soda API for Socrata data source

val sodauri = "https://soda.demo.socrata.com/resource/6yvf-kk3n.json"

-//Say we need to call the API for 3 sets of input data points for different values of 'region' and 'source'. The 'region' and 'source' are two filters supported by the SODA API for Socrata data source
+// Say we need to call the API for 3 sets of input parameters for different values of 'region' and 'source'. The 'region' and 'source' are two filters supported by the SODA API for Socrata data source

val sodainput1 = ("Nevada", "nn")
val sodainput2 = ("Northern California", "pr")
val sodainput3 = ("Virgin Islands region", "pr")

-// Now we create a RDD using these input data points
+// Now we create an RDD using these input parameter values

val sodainputRdd = sc.parallelize(Seq(sodainput1, sodainput2, sodainput3))

-// Now we need to create the dataframe specifying column name for tghe dataframe same as the filter names
+// Next we need to create the DataFrame, specifying column names that match the field names we wish to filter on

val sodainputKey1 = "region"
val sodainputKey2 = "source"

@@ -108,7 +108,7 @@ val sodaDf = sodainputRdd.toDF(sodainputKey1, sodainputKey2)

// And we create a temporary table now using the sodaDf
sodaDf.createOrReplaceTempView("sodainputtbl")

-// Now we create the parameter map to pass to teh Rest Data Source.
+// Now we create the parameter map to pass to the REST Data Source.

val parmg = Map("url" -> sodauri, "input" -> "sodainputtbl", "method" -> "GET", "readTimeout" -> "10000", "connectionTimeout" -> "2000", "partitions" -> "10")

@@ -130,7 +130,7 @@ spark.sql("select source, region, inline(output) from sodastbl").show()

### Python API

-This time we are reading the input data points from a csv file - sodainput.csv. The csv file contains two coloumns - 'region' and 'source'. And it has 3 rows with different values for these 2 columns. We shall call the API for these 3 sets of input data points with different values of 'region' and 'source'. The 'region' and 'source' are two filters supported by the SODA API for Socrata data source.
+This time we are reading the input parameter values from a csv file - sodainput.csv. The csv file contains two columns - 'region' and 'source'. These column names map to two filters supported by the SODA API for the Socrata data source and do not require renaming. We shall call the API 3 times, once for each of the different combinations of 'region' and 'source' values.

The csv file should look like this -

@@ -139,7 +139,7 @@ Nevada,nn
Northern California,pr
Virgin Islands region,pr

-Please ensure that the csv file doen not have any space in between the column names as well as in between the values for those columns in the rows.
+Please ensure that the csv file does not have any spaces between the column names or between the values for those columns in the rows.

```python

@@ -154,7 +154,7 @@ sodainputDf = spark.read.option('header', 'true').csv('/home/biadmin/spark-enabl

sodainputDf.createOrReplaceTempView('sodainputtbl')

-# Now we create the parameter map to pass to the Rest Data Source.
+# Now we create the parameter map to pass to the REST Data Source.
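# If the target API required basic authentication, the optional 'userId' and
# 'userPassword' options described earlier could be added to the map below as well.
# The values shown here are placeholders, not real credentials, e.g.:
# prmsSodaWithAuth = { 'url' : sodauri, 'input' : 'sodainputtbl', 'method' : 'GET',
#                      'userId' : 'myuser', 'userPassword' : 'mypassword' }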
prmsSoda = { 'url' : sodauri, 'input' : 'sodainputtbl', 'method' : 'GET', 'readTimeout' : '10000', 'connectionTimeout' : '2000', 'partitions' : '10'}

@@ -176,7 +176,7 @@ spark.sql("select source, region, inline(output) from sodastbl").show()

### R API

-We shall use the same csv file for the input data as in case of the Pythin example
+We shall use the same csv file for the input parameter values as in the Python example.

```R

@@ -187,7 +187,7 @@ sodauri <- "https://soda.demo.socrata.com/resource/6yvf-kk3n.json"

sodainputDf <- read.df("/home/biadmin/spark-enablement/datasets/sodainput.csv", "csv", header = "true", inferSchema = "true", na.strings = "NA")

-# And we create a temporary table now using the sodainputDf
+# And we create a temporary table now from sodainputDf

createOrReplaceTempView(sodainputDf, "sodainputtbl")

@@ -195,10 +195,10 @@ createOrReplaceTempView(sodainputDf, "sodainputtbl")

sodasDf <- read.df(,"org.apache.dsext.spark.datasource.rest.RestDataSource", "url"=sodauri, "input"="sodainputtbl", "method"="GET")

-# We inspect the structure of the results returned. For Soda data source it would return the result in array.
+# We inspect the structure of the results returned. For the Soda data source the result is returned in an array.

printSchema(sodasDf)

-# Now we are ready to apply SQL or any other processing on teh results
+# Now we are ready to apply SQL or any other processing on the results

createOrReplaceTempView(sodasDf, "sodastbl")

sodas2df <- sql("select source, region, inline(output) from sodastbl")
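# Finally, the returned rows can be inspected like any other SparkR DataFrame;
# head() is standard SparkR and should behave here as usual.
head(sodas2df)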