Cleaning data

Simple examples of cleaning text data.

See the source code, output, and comments in the Scala files:

  • Word freq
  • Special chars, identify and clean
  • Word stemmer
  • Natural Language Processing
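
As a rough illustration of the first two topics, here is a minimal sketch of cleaning special characters and counting word frequencies. It is not the code from this repository: the object name `TextCleanSketch` and the methods `clean` and `wordFreq` are hypothetical, and the cleaning rules (strip diacritics, replace punctuation with spaces, lowercase) are assumptions chosen for the example.

```scala
import java.text.Normalizer

// Hypothetical helper, not part of this repository.
object TextCleanSketch {

  // Strip diacritics, replace every character that is not a letter,
  // a digit, or whitespace with a space, then collapse runs of whitespace.
  def clean(text: String): String =
    Normalizer
      .normalize(text, Normalizer.Form.NFD)   // "é" becomes "e" + combining accent
      .replaceAll("\\p{M}", "")               // drop the combining marks
      .replaceAll("[^\\p{L}\\p{Nd}\\s]", " ") // punctuation and symbols become spaces
      .replaceAll("\\s+", " ")
      .trim
      .toLowerCase

  // Count how often each cleaned word occurs.
  def wordFreq(text: String): Map[String, Int] =
    clean(text)
      .split(" ")
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.length }
}
```

For example, `TextCleanSketch.wordFreq("The cat; the CAT!")` counts `the` and `cat` twice each, because punctuation and case are normalized before counting.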

## notebook/regexs.ipynb, regexs.pdf

  • Regexs
  • Stop words
  • Find patterns in tokens
  • Querying PATSTAT outside the SQL relational model
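
The stop-word and token-pattern topics can be sketched in Scala along these lines. This is an illustrative example under stated assumptions, not the notebook's code: `RegexSketch`, the tiny stop-word list, and the "two letters followed by 6 to 8 digits" pattern are all hypothetical.

```scala
// Hypothetical helper, not part of this repository.
object RegexSketch {

  // Assumed, deliberately tiny stop-word list; a real list is much longer.
  val stopWords: Set[String] = Set("the", "a", "an", "of", "and", "in", "for")

  // Drop tokens that are stop words, ignoring case.
  def removeStopWords(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(token => stopWords.contains(token.toLowerCase))

  // Assumed pattern: a two-letter code followed by 6 to 8 digits, e.g. "EP1234567".
  private val PatentLike = "([A-Z]{2})(\\d{6,8})".r

  // Keep only tokens that match the whole pattern, split into (code, number).
  def patentLike(tokens: Seq[String]): Seq[(String, String)] =
    tokens.collect { case PatentLike(code, number) => (code, number) }
}
```

Scala's regex extractor matches the entire token, so a token such as `US12` is skipped rather than partially matched.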

Next time

  • Comparing text, text distance, alignment, disambiguation, Google Refine…
  • Regex vs CFG, web scraping, table stats, validation, data curation workflow

Requirements

How to run a Scala example

$ sbt "runMain application.TextCleanExample"
$ sbt "runMain application.StanfordNLPExample"
$ export dbUrl="jdbc:mysql://example.com/patstat_2015a?user=__USER__&password=__PASSWORD__&useSSL=false"
$ sbt "runMain application.RemoveStopWordsExample $dbUrl"
$ sbt "runMain application.PatentNumbersPatterns $dbUrl"
$ sbt "runMain application.EPFLPatentsProject $dbUrl"

How to run Jupyter with the regexs example

docker run -it --rm -p 8888:8888 -v $PWD/notebook:/home/jovyan/work jupyter/all-spark-notebook start-notebook.sh

Do you have other use cases or questions?

Contact me at [email protected]