Commit a96a3da (parent e386213) — gokhanercan, Jan 11, 2025
# OSimUnr-Generator

## INTRODUCTION
This repository provides tools used to automatically generate new instances of **OSimUnr dataset** ([see the paper of the study](#cite)), which contains *orthographically similar but semantically unrelated* (OSimUnr) word-pairs.

Here are some word-pair examples from the [dataset repository](https://github.com/gokhanercan/OSimUnr):
The code has been tested on the following environments:
- **NLTK Version**: To ensure reproducibility in the English language processing pipeline, please use the exact 3.4.5 version of NLTK. This version is compatible with Python versions 2.7 through 3.8.
- **Python 3.8+**: Not supported due to NLTK compatibility issues. If you do not require the exact WordNet implementation, feel free to fork and test in higher Python versions. In theory, it should work.

## INSTALLATION

1. Clone the repository to a folder:
2. Install required dependencies in the root of the project:
4. You should see an output similar to the one given in the Output section below:
5. The generated dataset files should be located in the '../Resources/Studies/MyStudy/eng' directory relative to your source code.

For example, for the final Q3 dataset for English, check out the file named *'S3a-OrthographicallySimilarButUnrelatedsQ3-nedit-{RandomDatasetID}.csv'*. The overall naming pattern is as follows:
```
S{StageNumber}-SubDatasetNameQ{SimilarityLevel}-{OrthographicAlg}-{RandomDatasetID}.csv
```

This is very similar to the naming conventions of the [released OSimUnr dataset.](https://github.com/gokhanercan/OSimUnr)
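To make the pattern concrete, here is a small illustrative helper that composes a filename from its parts. The function name and argument names are assumptions for illustration only, not part of the actual codebase:

```python
# Illustrative only: compose a dataset filename following the naming pattern above.
def dataset_filename(stage, sub_dataset, sim_level, orthographic_alg, dataset_id):
    return f"S{stage}-{sub_dataset}Q{sim_level}-{orthographic_alg}-{dataset_id}.csv"

# The final Q3 English file mentioned above, with a made-up dataset ID:
print(dataset_filename("3a", "OrthographicallySimilarButUnrelateds", 3, "nedit", 12345))
# -> S3a-OrthographicallySimilarButUnrelatedsQ3-nedit-12345.csv
```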

**Output:**
todo:output here..

## CUSTOMIZATION

You can customize and extend the pipeline based on your needs as follows:

### Set the Initial Parameters and Algorithms

In the `Run.py` file, you can set parameters such as threshold assumptions, POS filters, and pipeline stages:

```python
GenerateDataset(wordPosFilters=[POSTypes.NOUN], minOrthographicSimQ3=0.50, minOrthographicSimQ4=0.75, maxRelatedness=0.25, limitWordCands=500)
```

### Parameters
> **wordPosFilters**: Defines the part-of-speech (POS) tags that the word-pool should use. Default is `[POSTypes.NOUN]`.
> **minOrthographicSimQ3**: Defines the lower limit of the Q3 orthographic space. The upper limit is *minOrthographicSimQ4*. Default is 0.50.
> **minOrthographicSimQ4**: Defines the lower limit of the Q4 orthographic space. The upper limit is 1 by default. Default is 0.75.
> **maxRelatedness**: Sets the threshold that defines the maximum level of 'unrelatedness' of word pairs on a scale of 0 to 1. Default is 0.25.
> **limitWordCands**: The size of the word-pool you want to use. If set, it limits the word-pool by randomly picking words from the `IWordSource`. Default is None.
Use the *resume*, *resumeStage3and4*, *wordpoolPath*, *wordpairsPath*, and *s1Only* parameters if you want to use the Save/Restore/Resume functionality of the pipeline. It is very useful for long-running generations that take days.
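To illustrate how the similarity thresholds above partition the orthographic space, here is a hedged sketch (the function name and the exact boundary handling with `>=` are assumptions, not taken from the codebase):

```python
# Illustrative sketch: how minOrthographicSimQ3/Q4 partition word-pairs by
# orthographic similarity. Boundary handling (>=) is an assumption.
def orthographic_band(sim, min_q3=0.50, min_q4=0.75):
    if sim >= min_q4:
        return "Q4"   # highly similar pairs
    if sim >= min_q3:
        return "Q3"   # moderately similar pairs
    return None       # below the Q3 threshold: not kept
```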

### Change Providers and Settings

The `Generator.py` implementation relies on an abstract provider model called `PipelineProviderBase` to create concrete resources, data entries, and implementations.
The default provider is `EnglishPipelineProvider`, configured as follows:
```python
englishPipeline: PipelineProviderBase = EnglishPipeline(LinguisticContext.BuildEnglishContext(), EditDistance())
```

If you wish to modify the orthographic similarity, for instance, provide any Python implementation of `IWordSimilarity` and inject it into the provider.

Below is a list of factory methods expected from a concrete provider, organized into three groups:

**A. Morphological Resources**
```python
CreateRootDetector()
CreateFastRootDetector()
CreateMorphoLex()
CreateTokenizer()
```

**B. Semantic Resources**
```python
CreateWordNet()
CreateWordSource()
CreateWordSimilarityAlgorithm()
```

**C. Semantic Definitions:** Manually defined list of WordNet concept mappings.
```python
CreateBlacklistedConceptsFilterer()
CreateConceptPairFilterer()
CreateDefinitionBasedRelatednessClassifier()
CreateDerivationallyRelatedClassifier()
```
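For illustration, swapping in a custom orthographic similarity as mentioned above might look like the following sketch. The `Similarity` method name is an assumption about the `IWordSimilarity` interface, and the injection comment mirrors the `EnglishPipeline` line shown earlier:

```python
class BigramJaccardSimilarity:
    """Hypothetical IWordSimilarity: character-bigram Jaccard score in [0, 1]."""

    @staticmethod
    def _bigrams(word):
        return {word[i:i + 2] for i in range(len(word) - 1)}

    def Similarity(self, word1, word2):
        a, b = self._bigrams(word1), self._bigrams(word2)
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

# Assumed injection point, in place of EditDistance():
# EnglishPipeline(LinguisticContext.BuildEnglishContext(), BigramJaccardSimilarity())
```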

If you check out `EnglishPipeline.py`, you'll see a list of manual definitions and mappings introduced to reduce the false positive rates in the final dataset. As an example, here is the list of blacklisted concepts from English WordNet used in `CreateBlacklistedConceptsFilterer`:

The concept list (synset names) is ordered from general to specific:

```
- ill_health.n.01
- disorder.n.01
- pathologic_process.n.01
- plant_part.n.01
- biological_group.n.01
- medical_procedure.n.01
- animal.n.01
- microorganism.n.01
- plant.n.02
- chemical.n.01
- drug.n.01
- body_substance.n.01
- vasoconstrictor.n.01
- symptom.n.01
```
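A toy sketch of how such a blacklist might be applied follows. The real logic lives behind `CreateBlacklistedConceptsFilterer` in `EnglishPipeline.py`; the mini hypernym graph and traversal below are purely illustrative assumptions:

```python
BLACKLIST = {"ill_health.n.01", "animal.n.01"}

# Hypothetical mini hypernym graph (synset -> direct hypernyms), for illustration.
HYPERNYMS = {
    "dog.n.01": ["animal.n.01"],
    "influenza.n.01": ["ill_health.n.01"],
    "animal.n.01": [],
    "ill_health.n.01": [],
    "chair.n.01": ["furniture.n.01"],
    "furniture.n.01": [],
}

def is_blacklisted(synset):
    """Reject a concept if it, or any hypernym ancestor, is blacklisted."""
    stack = [synset]
    while stack:
        current = stack.pop()
        if current in BLACKLIST:
            return True
        stack.extend(HYPERNYMS.get(current, []))
    return False
```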


### Adding a New Language
To add a new language, along with the morphological and semantic provider types required for your language, you need to modify the `LinguisticContext` type specifically for your language code. If the grammar (`IGrammar`) of the language is generic enough, considering aspects such as the alphabet, casing, and accents, you may reuse the `InvariantGrammar` instance. However, if the language has distinct characteristics, please refer to our Turkish implementation (`TRGrammar`) as a model.
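As an example of why a language-specific grammar can be necessary, Turkish casing does not follow the invariant rules: uppercase `I` lowercases to dotless `ı`, and dotted `İ` lowercases to `i`, which Python's default `str.lower()` does not handle. A minimal sketch (not the actual `TRGrammar` code):

```python
class TurkishCasingSketch:
    """Illustrative only: Turkish-aware lowercasing for the i/ı distinction."""

    def lower(self, text):
        # Map the Turkish-specific uppercase letters before generic lowering.
        return text.replace("I", "ı").replace("İ", "i").lower()

# Default Python lowering gives "kirmizi"; Turkish requires "kırmızı".
```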

Below is the list of languages supported by the WordNet bundled with NLTK 3.4.5, which includes 29 languages:
```python
['eng', 'als', 'arb', 'bul', 'cat', 'cmn', 'dan', 'ell', 'eus', 'fas', 'fin', 'fra', 'glg', 'heb', 'hrv', 'ind', 'ita', 'jpn', 'nld', 'nno', 'nob', 'pol', 'por', 'qcn', 'slv', 'spa', 'swe', 'tha', 'zsm']
```

You can retrieve this list by running the following code:

```python
from src.Core.WordNet.NLTKWordNetWrapper import QueryLanguages
QueryLanguages()
```
Note that Turkish is not included in this list. For generating OSimUnr, we utilized our study group's open-source [Java WordNet library](https://github.com/olcaytaner/TurkishWordNet), which adheres to the same `IWordNet` and `IWordNetMeasure` interfaces.


## DEPENDENCIES
This project relies on minimal dependencies (see `requirements.txt` for details). The main dependencies are:

- **NLTK**: Ensure version 3.4.5 is used. This study heavily relies on NLTK's WordNet and other resources. Changing the NLTK version may cause some semantic or morphological assumptions and tests to break.
- **EditDistance and Overlapping Coefficients**: Some implementations are adapted from the [python-string-similarity](https://github.com/gokhanercan/python-string-similarity) package.
- **Cython, Java and C++ Code**: These components are excluded from this repository due to their deployment and configuration complexity.

## RUNNING TESTS

If you want to verify the installation or have made changes to the code, you can execute the unit tests as follows:

Output:

See an example of unit test output [here.](unittests.md)

## CONTRIBUTE AND SUPPORT

Feel free to open issues or pull requests to suggest improvements or report bugs. If you have read the paper and are looking for specific implementations, such as a C++ implementation of FastText models and n-gramming, resources for Turkish morphology, or Cython-optimized versions of the tools, please contact us.

Before submitting a pull request, please ensure that all tests have been run and passed.

### Cite

The paper for this work is currently under peer review. Citation details will be provided here once available.
