Commit

changing paths from /master to /main in the docs and notebooks
brifordwylie committed Apr 7, 2022
1 parent 2088124 commit fed88c4
Showing 12 changed files with 44 additions and 44 deletions.
26 changes: 13 additions & 13 deletions README.md
@@ -22,10 +22,10 @@ pip install zat[all] (include pyarrow, yara-python, and tldextract)
### Recent Improvements
- Faster/Smaller Pandas Dataframes for large log files: [Large Dataframes](https://supercowpowers.github.io/zat/large_dataframes.html)
- Better Panda Dataframe to Matrix (ndarray) support: [Dataframe To Matrix](https://supercowpowers.github.io/zat/dataframe_to_matrix.html)
- Scalable conversion from Zeek logs to Parquet: [Zeek to Parquet](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Parquet.ipynb)
- Vastly improved Spark Dataframe Class: [Zeek to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Spark.ipynb)
- Scalable conversion from Zeek logs to Parquet: [Zeek to Parquet](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Parquet.ipynb)
- Vastly improved Spark Dataframe Class: [Zeek to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Spark.ipynb)
- Updated/improved Notebooks: [Analysis Notebooks](#analysis-notebooks)
- Zeek JSON to DataFrame class: [Zeek JSON to DataFrame Example](https://github.com/SuperCowPowers/zat/blob/master/examples/zeek_json_to_pandas.py)
- Zeek JSON to DataFrame class: [Zeek JSON to DataFrame Example](https://github.com/SuperCowPowers/zat/blob/main/examples/zeek_json_to_pandas.py)

### Video Presentation
- [Data Analysis and Machine Learning with Zeek](https://www.youtube.com/watch?v=pG5lU9CLnIU)
@@ -45,16 +45,16 @@ from here to there.

### Analysis Notebooks

- [Zeek to Scikit-Learn](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Scikit_Learn.ipynb)
- [Zeek to Parquet](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Parquet.ipynb)
- [Zeek to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Spark.ipynb)
- [Spark Clustering](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Spark_Clustering.ipynb)
- [Zeek to Kafka](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Kafka.ipynb)
- [Zeek to Kafka to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Kafka_to_Spark.ipynb)
- [Clustering: Picking K (or not)](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Clustering_Picking_K.ipynb)
- [Anomaly Detection Exploration](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Anomaly_Detection.ipynb)
- [Risky Domains Stats and Deployment](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Risky_Domains.ipynb)
- [Zeek to Matplotlib](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Plot.ipynb)
- [Zeek to Scikit-Learn](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Scikit_Learn.ipynb)
- [Zeek to Parquet](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Parquet.ipynb)
- [Zeek to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Spark.ipynb)
- [Spark Clustering](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Spark_Clustering.ipynb)
- [Zeek to Kafka](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Kafka.ipynb)
- [Zeek to Kafka to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Kafka_to_Spark.ipynb)
- [Clustering: Picking K (or not)](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Clustering_Picking_K.ipynb)
- [Anomaly Detection Exploration](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Anomaly_Detection.ipynb)
- [Risky Domains Stats and Deployment](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Risky_Domains.ipynb)
- [Zeek to Matplotlib](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Plot.ipynb)

<img align="right" style="padding: 10px" src="notebooks/images/SCP_med.png" width="120">

4 changes: 2 additions & 2 deletions docs/dataframe_to_matrix.md
@@ -2,7 +2,7 @@

This documents discusses some of the design decisions made when implementing the new DataFrameToMatrix class.

- **Train/Predict Column Order:** The most important aspect of this class is that it must produce consistently ordered output between training and prediction. In particular one-hot encoding for categorical fields must keep an ordered list of categorical values that are captured during training (fit/fit-transform) and then used during prediction (transform). SCP Labs has a great notebook describing this issue in detail [Categorical Encoding Dangers](https://nbviewer.jupyter.org/github/SuperCowPowers/scp-labs/blob/master/notebooks/Categorical_Encoding_Dangers.ipynb)
- **Train/Predict Column Order:** The most important aspect of this class is that it must produce consistently ordered output between training and prediction. In particular one-hot encoding for categorical fields must keep an ordered list of categorical values that are captured during training (fit/fit-transform) and then used during prediction (transform). SCP Labs has a great notebook describing this issue in detail [Categorical Encoding Dangers](https://nbviewer.jupyter.org/github/SuperCowPowers/scp-labs/blob/main/notebooks/Categorical_Encoding_Dangers.ipynb)

- **NaN Handling**: In general Pandas Dataframes are great about handling NaN values in a general and robust way. The same is NOT true of Scikit-Learn (see [Scikit No NaNs](https://stackoverflow.com/questions/30317119/classifiers-in-scikit-learn-that-handle-nan-null) and [Handling Missing Data](https://machinelearningmastery.com/handle-missing-data-python/)). So NaNs must be detected and handled accordingly. Specifically we propose this logic:
- **Categorical NaNs:** The NaNs will become another category value, this simply adds 1 column to the one-hot encoding matrix and provides the handling of NaNs in a meaningful and robust way.
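
The column-order and categorical-NaN points above can be sketched in a few lines of pandas. This is a minimal illustration, not the DataFrameToMatrix implementation: the helper names `fit_categories` and `one_hot`, the `proto` column, and the choice to route values unseen during training into the NaN column are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def fit_categories(train_series):
    """Capture the ordered list of category values seen during training."""
    # pd.Categorical sorts the unique non-NaN values, giving a stable order.
    return list(pd.Categorical(train_series).categories)


def one_hot(series, categories, prefix):
    """One-hot encode using the training-time order plus one trailing NaN column."""
    cat = pd.Categorical(series, categories=categories)
    # Codes are -1 for NaN and for values not in `categories`;
    # both are routed to the extra last column (an assumption of this sketch).
    codes = np.where(cat.codes == -1, len(categories), cat.codes)
    matrix = np.eye(len(categories) + 1, dtype=int)[codes]
    columns = [f"{prefix}_{c}" for c in categories] + [f"{prefix}_nan"]
    return pd.DataFrame(matrix, columns=columns, index=series.index)


train = pd.Series(["tcp", "udp", None, "tcp"], name="proto")
predict = pd.Series(["udp", "icmp"], name="proto")

categories = fit_categories(train)  # captured at fit time
train_matrix = one_hot(train, categories, "proto")
predict_matrix = one_hot(predict, categories, "proto")  # same columns, same order
```

Because the category list is captured once and reused, the prediction matrix has the same columns in the same order as the training matrix even when the prediction data contains fewer, or different, values.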
@@ -14,7 +14,7 @@ This documents discusses some of the design decisions made when implementing the

### References

- [Categorical Encoding Dangers](https://nbviewer.jupyter.org/github/SuperCowPowers/scp-labs/blob/master/notebooks/Categorical_Encoding_Dangers.ipynb)
- [Categorical Encoding Dangers](https://nbviewer.jupyter.org/github/SuperCowPowers/scp-labs/blob/main/notebooks/Categorical_Encoding_Dangers.ipynb)
- [Numpy NDarray](https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html)
- [Scikit-Learn No NaNs](https://stackoverflow.com/questions/30317119/classifiers-in-scikit-learn-that-handle-nan-null)
- [Handling Missing Data](https://machinelearningmastery.com/handle-missing-data-python/)
2 changes: 1 addition & 1 deletion docs/examples.md
@@ -255,7 +255,7 @@ def yara_match(file_path, rules):
### Risky Domains

The example will use the analysis in our [Risky
Domains](https://github.com/SuperCowPowers/zat/blob/master/notebooks/Risky_Domains.ipynb) notebook to flag domains that are 'at risk' and conduct a Virus Total query on those domains. See zat/examples/risky\_dns.py for full code
Domains](https://github.com/SuperCowPowers/zat/blob/main/notebooks/Risky_Domains.ipynb) notebook to flag domains that are 'at risk' and conduct a Virus Total query on those domains. See zat/examples/risky\_dns.py for full code
listing (code simplified below)

```python
20 changes: 10 additions & 10 deletions docs/index.md
@@ -9,16 +9,16 @@

### Analysis Notebooks

- [Zeek to Scikit-Learn](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Scikit_Learn.ipynb)
- [Zeek to Parquet](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Parquet.ipynb)
- [Zeek to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Spark.ipynb)
- [Spark Clustering](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Spark_Clustering.ipynb)
- [Zeek to Kafka](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Kafka.ipynb)
- [Zeek to Kafka to Spark (need updating)](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Kafka_to_Spark.ipynb)
- [Clustering: Picking K (or not)](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Clustering_Picking_K.ipynb)
- [Anomaly Detection Exploration](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Anomaly_Detection.ipynb)
- [Risky Domains Stats and Deployment](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Risky_Domains.ipynb)
- [Zeek to Matplotlib](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Plot.ipynb)
- [Zeek to Scikit-Learn](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Scikit_Learn.ipynb)
- [Zeek to Parquet](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Parquet.ipynb)
- [Zeek to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Spark.ipynb)
- [Spark Clustering](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Spark_Clustering.ipynb)
- [Zeek to Kafka](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Kafka.ipynb)
- [Zeek to Kafka to Spark (need updating)](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Kafka_to_Spark.ipynb)
- [Clustering: Picking K (or not)](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Clustering_Picking_K.ipynb)
- [Anomaly Detection Exploration](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Anomaly_Detection.ipynb)
- [Risky Domains Stats and Deployment](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Risky_Domains.ipynb)
- [Zeek to Matplotlib](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Plot.ipynb)

### Installing on Raspberry Pi!
- [Raspberry Pi Instructions](raspberry_pi.md)
4 changes: 2 additions & 2 deletions docs/large_dataframes.md
@@ -23,7 +23,7 @@ Since conn.log is typically the most voluminous, we're going to use this 2.5 Gig
- <https://data.kitware.com/#item/58ebde398d777f16d095fd0e>

**Test Script:**
We're simply going to use the [zeek\_to\_pandas.py](https://github.com/SuperCowPowers/zat/blob/master/examples/zeek_to_pandas.py) in the examples directory for testing. We'll be using Python 3.7.
We're simply going to use the [zeek\_to\_pandas.py](https://github.com/SuperCowPowers/zat/blob/main/examples/zeek_to_pandas.py) in the examples directory for testing. We'll be using Python 3.7.

```
$ time python zeek_to_pandas.py ~/data/bro/conn.log
@@ -62,7 +62,7 @@ A new PR focused specifically on memory/time improvements for large data frames.
As noted in this issue <https://github.com/SuperCowPowers/zat/issues/23> the baseline construction of a data frame is inefficient, for large data frames this inefficiency plus the time wasted on memory paging/swapping starts to dominate the load time.

**Memory:**
As we've demonstrated in some of our notebooks examples, properly encoding categorical data will provide a significant memory reduction [Categorical Notebook](https://nbviewer.jupyter.org/github/SuperCowPowers/scp-labs/blob/master/notebooks/Categorical_Data_Guide.ipynb).
As we've demonstrated in some of our notebooks examples, properly encoding categorical data will provide a significant memory reduction [Categorical Notebook](https://nbviewer.jupyter.org/github/SuperCowPowers/scp-labs/blob/main/notebooks/Categorical_Data_Guide.ipynb).

**Details:**
The proper conversion of 'time' to datetime and 'interval' to timedelta are taken care of by PR 76. Also the 'ts' field is properly set as the index of the dataframe.
2 changes: 1 addition & 1 deletion examples/tor_and_port_count.py
@@ -13,7 +13,7 @@
# Example to check for potential Tor connections and give a summary of different ports
# used for SSL connections. Please note that your Zeek installation must stamp the
# ssl.log file with the 'issuer' field. More info can be found here:
# https://docs.zeek.org/en/master/script-reference/proto-analyzers.html#zeek-ssl
# https://docs.zeek.org/en/main/script-reference/proto-analyzers.html#zeek-ssl

# Set up the regex search that is used against the issuer field
issuer_regex = re.compile('CN=www.\w+.com')
2 changes: 1 addition & 1 deletion notebooks/Anomaly_Detection.ipynb
@@ -20,7 +20,7 @@
"- PCA: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html\n",
"\n",
"**Related Notebooks**\n",
"- Zeek to Scikit-Learn: https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Scikit_Learn.ipynb\n",
"- Zeek to Scikit-Learn: https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Scikit_Learn.ipynb\n",
"\n",
"**Note:** A previous version of this notebook used a large http log (1 million rows) but we wanted people to be able to run the notebook themselves, so we've changed it to run on the local example http.log."
]
8 changes: 4 additions & 4 deletions notebooks/Spark_Clustering.ipynb
@@ -12,8 +12,8 @@
"<div style=\"float: right; margin: 0px 0px 0px 0px\"><img src=\"images/parquet.png\" width=\"300px\"></div>\n",
"\n",
"### See these related notebooks\n",
"- [Zeek to Parquet](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Parquet.ipynb)\n",
"- [Zeek to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Spark.ipynb)\n",
"- [Zeek to Parquet](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Parquet.ipynb)\n",
"- [Zeek to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Spark.ipynb)\n",
"\n",
"Apache Parquet is a columnar storage format focused on performance. Reading Parquet data is fast and efficient, for this notebook we will specifically be using it for loading data into Spark.\n",
"\n",
@@ -96,7 +96,7 @@
"Here we're loading in a Zeek DNS log with ~1/2 million rows to demonstrate the functionality and do some analysis and clustering on the data. For more information on converting Zeek logs to Parquet files please see our Zeek to Spark notebook:\n",
"\n",
"#### Zeek logs to Parquet Notebook\n",
"- [Zeek to Spark (and Parquet)](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Spark.ipynb)"
"- [Zeek to Spark (and Parquet)](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Spark.ipynb)"
]
},
{
@@ -330,7 +330,7 @@
"- **Assembler:** Combines the encoded categorical data and numerical data into a combined matrix\n",
"\n",
"\n",
"For more information on the details of Categorical Type to One Hot Encoding see our SCP Labs [Encoding Dangers](https://nbviewer.jupyter.org/github/SuperCowPowers/scp-labs/blob/master/notebooks/Categorical_Encoding_Dangers.ipynb) notebook."
"For more information on the details of Categorical Type to One Hot Encoding see our SCP Labs [Encoding Dangers](https://nbviewer.jupyter.org/github/SuperCowPowers/scp-labs/blob/main/notebooks/Categorical_Encoding_Dangers.ipynb) notebook."
]
},
{
12 changes: 6 additions & 6 deletions notebooks/Zeek_to_Kafka.ipynb
@@ -23,8 +23,8 @@
"source": [
"# Part 1: Streaming data pipeline\n",
"To set some context, our long term plan is to build out a streaming data pipeline. This notebook will help you get started on this path. After completing this notebook you can look at the next steps by viewing our notebooks that use Spark on Zeek output.\n",
" - [Zeek to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Spark.ipynb)\n",
" - [Zeek to Kafka to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Kafka_to_Spark.ipynb)\n",
" - [Zeek to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Spark.ipynb)\n",
" - [Zeek to Kafka to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Kafka_to_Spark.ipynb)\n",
"\n",
"So our streaming pipeline looks conceptually like this.\n",
"<div style=\"margin: 20px 20px 20px 20px\"><img src=\"images/pipeline.png\" width=\"750px\"></div>\n",
@@ -50,7 +50,7 @@
"- The Kafka Plugin for Zeek: https://github.com/apache/metron-bro-plugin-kafka\n",
"- A Kafka Broker: https://kafka.apache.org\n",
"\n",
"The weblinks above do a pretty good job of getting you setup with Zeek, Kafka, and the Kafka plugin. If you already have these thing setup then you're good to go. If not take some time and get both up and running. If you're a bit wacky (like me) and want to set these thing up on a Mac you might check out my notes here [Zeek/Kafka Mac Setup](https://github.com/SuperCowPowers/zat/blob/master/docs/zeek_kafka_mac.md)\n",
"The weblinks above do a pretty good job of getting you setup with Zeek, Kafka, and the Kafka plugin. If you already have these thing setup then you're good to go. If not take some time and get both up and running. If you're a bit wacky (like me) and want to set these thing up on a Mac you might check out my notes here [Zeek/Kafka Mac Setup](https://github.com/SuperCowPowers/zat/blob/main/docs/zeek_kafka_mac.md)\n",
"\n",
"## Systems Check\n",
"Okay now that Zeek with the Kafka Plugin is setup, lets do just a bit of testing to make sure it's all AOK before we get into making a Kafka consumer in Python.\n",
@@ -149,7 +149,7 @@
"\n",
"## Now What?\n",
"Okay so now we can actually do something useful with our new streaming data, in this case we're going to use some results from our 'Risky Domains' Notebook that computed a risky set of TLDs.\n",
"- [Risky Domain Stats](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Risky_Domains.ipynb)"
"- [Risky Domain Stats](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Risky_Domains.ipynb)"
]
},
{
@@ -225,8 +225,8 @@
"source": [
"# Part 1: Streaming data pipeline\n",
"Recall that our long term plan is to build out a streaming data pipeline. This notebook has helped you get started on this path. After completing this notebook you can look at the next steps by viewing our notebooks that use Spark on Zeek output.\n",
" - [Zeek to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Spark.ipynb)\n",
" - [Zeek to Kafka to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/master/notebooks/Zeek_to_Kafka_to_Spark.ipynb)\n",
" - [Zeek to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Spark.ipynb)\n",
" - [Zeek to Kafka to Spark](https://nbviewer.jupyter.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Kafka_to_Spark.ipynb)\n",
"\n",
"\n",
"<div style=\"margin: 20px 20px 20px 20px\"><img src=\"images/pipeline.png\" width=\"750px\"></div>\n",