Release to pypi with a github CI #44

Merged · 7 commits · Jul 23, 2023
70 changes: 70 additions & 0 deletions .github/workflows/build-wheels.yml
@@ -0,0 +1,70 @@
name: build-wheels

on:
push:
branches:
- master
tags:
- '*'

concurrency:
group: build-wheels-${{ github.ref }}
cancel-in-progress: true

jobs:
build_wheels:
name: Build wheels on ${{ matrix.os }}
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest, windows-latest]
Collaborator:
Is macOS not supported?

Collaborator (Author):
I thought macOS was included in manylinux; I will add macOS, thanks!


steps:
- uses: actions/checkout@v2

# see https://cibuildwheel.readthedocs.io/en/stable/changelog/
# for a list of versions
- name: Build wheels
uses: pypa/[email protected]
env:
CIBW_BEFORE_BUILD: "pip install -U cmake numpy"
CIBW_SKIP: "cp27-* cp35-* cp36-* *-win32 pp* *-musllinux* *-manylinux_i686"
CIBW_BUILD_VERBOSITY: 3

- name: Display wheels
shell: bash
run: |
ls -lh ./wheelhouse/

ls -lh ./wheelhouse/*.whl

- uses: actions/upload-artifact@v2
with:
path: ./wheelhouse/*.whl

- name: Publish wheels to PyPI
env:
TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
run: |
python3 -m pip install --upgrade pip
python3 -m pip install wheel twine setuptools

twine upload ./wheelhouse/*.whl

- name: Build sdist
if: ${{ matrix.os == 'ubuntu-latest' }}
shell: bash
run: |
python3 -m pip install --upgrade build
python3 -m build -s
ls -l dist/*

- name: Publish sdist to PyPI
if: ${{ matrix.os == 'ubuntu-latest' }}
env:
TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
run: |
twine upload dist/fasttextsearch-*.tar.gz
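For context, cibuildwheel documents `CIBW_SKIP` as space-separated, fnmatch-style patterns matched against build identifiers such as `cp310-manylinux_x86_64`. A small Python sketch (the pattern list is copied from the `CIBW_SKIP` line above; the helper name `is_skipped` is ours, not part of cibuildwheel) shows which builds the skip list excludes:

```python
from fnmatch import fnmatch

# Skip patterns copied from the workflow's CIBW_SKIP setting above.
SKIP = "cp27-* cp35-* cp36-* *-win32 pp* *-musllinux* *-manylinux_i686".split()

def is_skipped(build_id: str) -> bool:
    """Return True if a cibuildwheel build identifier matches any skip pattern."""
    return any(fnmatch(build_id, pat) for pat in SKIP)

print(is_skipped("cp36-manylinux_x86_64"))   # True: cp36-* is skipped
print(is_skipped("cp310-manylinux_x86_64"))  # False: this wheel is built
print(is_skipped("pp39-win_amd64"))          # True: PyPy builds are skipped
```

So the matrix effectively builds CPython 3.7+ wheels for 64-bit glibc Linux and 64-bit Windows only.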
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -1,7 +1,7 @@
cmake_minimum_required(VERSION 3.8 FATAL_ERROR)
project(textsearch)

-set(TS_VERSION "0.5")
+set(TS_VERSION "0.6")

set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/lib")
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY "${CMAKE_BINARY_DIR}/lib")
67 changes: 57 additions & 10 deletions examples/libriheavy/README.md
@@ -34,8 +34,9 @@ git lfs install
git clone https://huggingface.co/datasets/pkufool/librilight-text
```

> We provide a shell script `run.sh` to run all the following stages step by step.

-## Prepare manifests
+## Prepare manifests (stage 1 in run.sh)

Note: You need to install [lhotse](https://github.com/lhotse-speech/lhotse) to prepare manifests.

@@ -98,29 +99,75 @@ The cuts look like this (only one line of it):
```


-## Decode the audios
+## Decode the audios (stages 2, 3, 4 in run.sh)

This stage decodes the audio into text with a pre-trained ASR model.
-Firstly split the long audio into smaller pieces (for eaxmple 30 seconds), then decode these pieces of audios to texts, combine them together at last.
+We will first split the long audio into smaller pieces (for example, 30 seconds), decode these pieces to text, and finally combine the results.

-Code is available here: https://github.com/k2-fsa/icefall/pull/980
-You can run the whole pipeline with the script long_file_recog.sh
+### Split

```
./tools/split_into_chunks.py \
--manifest-in path/to/input_manifest \
--manifest-out path/to/output_manifest \
--chunk 30 \
--extra 2 # Extra duration (in seconds) at both sides
```
The input_manifest is the output of the previous stage.
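The interaction of `--chunk` and `--extra` can be sketched in a few lines of Python (an illustrative reimplementation with a hypothetical helper name, not the actual logic of split_into_chunks.py):

```python
def chunk_bounds(duration: float, chunk: float = 30.0, extra: float = 2.0):
    """Cut [0, duration) into chunk-second pieces, padding each piece with up
    to `extra` seconds of context on both sides (clipped at file boundaries)."""
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk, duration)
        bounds.append((max(0.0, start - extra), min(duration, end + extra)))
        start = end
    return bounds

print(chunk_bounds(70.0))
# [(0.0, 32.0), (28.0, 62.0), (58.0, 70.0)]
```

Adjacent chunks overlap by 2 × extra seconds, which gives the recognizer context at chunk edges and lets the merge stage drop the padded regions later.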

### Transcription

```
./tools/recognize.py \
--world-size 4 \
--num-workers 8 \
--manifest-in path/to/input_manifest \
--manifest-out path/to/output_manifest \
--nn-model-filename path/to/jit_script.pt \
--tokens path/to/tokens.txt \
--max-duration 2400 \
--decoding-method greedy_search \
--master 12345
```
The input_manifest is the output of the previous stage.

-**Note:** The whole pipeline includes the stages to prepare raw manifests in the above stage (stage 1 in long_file_recog.sh).
+### Combine

```
./tools/merge_chunks.py \
--manifest-in path/to/input_manifest \
--manifest-out path/to/output_manifest \
--extra 2 # should be the same as in split stage
```
The input_manifest is the output of the previous stage.

It will generate a manifest (including the transcribed text and timestamps).
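Conceptually, merging drops each chunk's extra padding so that overlapping audio is not transcribed twice. A hypothetical sketch (not the actual merge_chunks.py implementation; the data layout here is invented for illustration):

```python
def merge_chunks(chunks, extra=2.0):
    """Each chunk is (padded_start, padded_end, [(word, abs_time), ...]).
    Keep only words whose timestamps fall inside the chunk's core region,
    i.e. outside the `extra` padding (except at the file boundaries)."""
    merged = []
    for i, (start, end, words) in enumerate(chunks):
        core_start = start if i == 0 else start + extra
        core_end = end if i == len(chunks) - 1 else end - extra
        merged.extend(w for w, t in words if core_start <= t < core_end)
    return merged

chunks = [
    (0.0, 32.0, [("hello", 5.0), ("dup", 31.0)]),   # "dup" is in the padding
    (28.0, 62.0, [("dup", 31.0), ("world", 40.0)]), # same word, core region
]
print(merge_chunks(chunks))  # ['hello', 'dup', 'world']
```

With `extra` matching the split stage, the core regions tile the original recording exactly, so each word is emitted by exactly one chunk.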


-## Align the decoded texts to the reference books
+## Align the decoded texts to the reference books (stage 5 in run.sh)

This stage aligns the transcribed texts to their reference books.

-First, you have to install the text_search library (https://github.com/danpovey/text_search),
+First, you have to install the text_search library (https://github.com/k2-fsa/text_search),
then run the following command:

```
-python examples/librilight/matching.py --manifest-in path/to/librilight_cuts_small.jsonl.gz --manifest-out path/to/cuts_small.jsonl.gz --batch-size 50 --num-workers 4
+python examples/librilight/matching.py \
+    --manifest-in path/to/librilight_cuts.jsonl.gz \
+    --manifest-out path/to/librilight_out_cuts.jsonl.gz \
+    --batch-size 50
```

Or the parallel version:

```
python examples/librilight/matching_parallel.py \
--manifest-in path/to/librilight_cuts.jsonl.gz \
--manifest-out path/to/librilight_out_cuts.jsonl.gz \
--batch-size 50 \
--num-workers 5
```

-The manifest-in is the manifests generated in the decode audios stage.
+The manifest-in is the manifest generated in the previous stage.
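To give a feel for what "aligning a noisy transcript against a reference text" means, here is a toy approximate match using Python's standard difflib. The text_search library implements its own, much faster matching, so this is only a conceptual illustration with made-up strings:

```python
import difflib

# A short "reference book" and a noisy "transcript" of part of it.
reference = "it was the best of times it was the worst of times"
transcript = "the best of time it was the worst"

# Find the longest span of the reference that also occurs in the transcript.
matcher = difflib.SequenceMatcher(None, reference, transcript, autojunk=False)
m = matcher.find_longest_match(0, len(reference), 0, len(transcript))
print(repr(reference[m.a : m.a + m.size]))  # ' it was the worst'
```

Scaling this idea to books and hours of audio is exactly why a specialized library is needed; difflib is quadratic and would be far too slow.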
2 changes: 1 addition & 1 deletion pyproject.toml
Expand Up @@ -10,7 +10,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "fasttextsearch"
-version = "0.5"
+version = "0.6"
authors = [
{ name="Next-gen Kaldi development team", email="[email protected]" },
]
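Note that this release bumps the version string in two places: TS_VERSION in CMakeLists.txt and version in pyproject.toml. A hypothetical sanity check (not part of the PR; file contents are inlined here for the sketch) that the two stay in sync:

```python
import re

# Inlined excerpts of the two files touched by this PR.
cmake_text = 'set(TS_VERSION "0.6")'
pyproject_text = 'version = "0.6"'

def extract(pattern: str, text: str) -> str:
    """Pull the quoted version string out of a file excerpt."""
    match = re.search(pattern, text)
    assert match is not None, "version declaration not found"
    return match.group(1)

cmake_ver = extract(r'TS_VERSION\s+"([^"]+)"', cmake_text)
py_ver = extract(r'version\s*=\s*"([^"]+)"', pyproject_text)
print(cmake_ver, py_ver)  # 0.6 0.6
assert cmake_ver == py_ver
```

Automating such a check in CI would prevent a release where the wheel metadata and the native library report different versions.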