# Phonagen
Phonemic word generation tools.
Phonagen provides several tools to build word generators based on the pronunciation and transcriptions of phonemes.
The tools are built around a JSON representation of phonemes and word generators.
## Web interface
The `web` directory contains a sample web interface to generate words from the JSON description included in the `web/data.json` file.
The implementation of generators is located in the script `web/phonagen.js`.
To use it on any webpage:
- include the script on your page (`<script src='phonagen.js'></script>` in the head)
- add a div (or another block element) with the `phonagen` id
- call the `phonagen.load()` function with the JSON file to use as its argument (either in the `onload` handler of the body, or in a script tag placed after the `phonagen` block, e.g. `<script>phonagen.load('data.json')</script>`)
## Python scripts
These are located in the `py-phonagen` directory.
### phonagen.py
The main module containing all the abstractions on which phonagen is based. Imported by the other tools.
### phonology-csv2json.py
Convert a csv file listing the phonemes and their transcriptions into the corresponding JSON phonology representation.
The input csv file should have a header indicating the names of the columns.
A `phoneme` column is mandatory. The `id` and `description` columns are optional. The other columns are treated as different transcriptions of the phonemes.
If no main transcription is provided in the command line, the first column that is not `phoneme`, `id`, or `description` is taken as the main transcription.
The `id` column serves to identify a phoneme, notably in example lists. The `description` column may provide information about a phoneme.
Phonemes can be tagged in the description to guide some generators.
Examples of csv files are present in the `examples` directory.
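The column rules above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the code of `phonology-csv2json.py`: the function name `csv_to_phonology`, the sample data, and the two fallbacks marked as assumptions in the comments are hypothetical.

```python
import csv
import io
import json

RESERVED = ("phoneme", "id", "description")


def csv_to_phonology(csv_text, phonology_id, main_transcription=None):
    """Build a phonology object (as a dict) from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    if "phoneme" not in columns:
        raise ValueError("a 'phoneme' column is mandatory")
    # Every column other than id/description is treated as a transcription.
    transcriptions = [c for c in columns if c not in ("id", "description")]
    if main_transcription is None:
        # First column that is not phoneme, id, or description.
        others = [c for c in columns if c not in RESERVED]
        # Assumption: fall back to 'phoneme' when there is no other column.
        main_transcription = others[0] if others else "phoneme"
    entries = []
    for row in reader:
        entry = {
            # Assumption: reuse the phoneme as id when no id is provided.
            "id": row.get("id") or row["phoneme"],
            "description": row.get("description") or "",
        }
        for name in transcriptions:
            entry[name] = row[name]
        entries.append(entry)
    return {
        "id": phonology_id,
        "description": "",
        "transcriptions": transcriptions,
        "main-transcription": main_transcription,
        "entries": entries,
    }


# Hypothetical sample input.
SAMPLE = """id,phoneme,latin,description
p,p,p,#consonant
a,a,a,#vowel
sh,ʃ,sh,#consonant
"""

phonology = csv_to_phonology(SAMPLE, "sample")
print(json.dumps(phonology, indent=2))
```

Here `latin` becomes the main transcription because it is the first column that is not `phoneme`, `id`, or `description`.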
### generator-list2chain.py
Convert a list of examples into a chain-based generator (Markov chains).
The list of examples can be checked against a phonology, by giving the corresponding JSON file in the arguments.
If a JSON file of the phonology is given, the phonology is included in the output.
The output can be used as the input file of the web interface to generate words.
The file containing the list of examples should be formatted as follows:
- one example per line
- each phoneme is indicated by its corresponding id
- the phoneme ids are separated by spaces
Lists of examples can be found in the `examples` directory (.list files).
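As an illustration of the conversion, the sketch below counts transitions in a list of examples and emits chain objects in the JSON shape described in this README. It is not the code of `generator-list2chain.py`: the function name `examples_to_chains` and the sample examples are hypothetical, and the sketch is limited to a chain of order 1.

```python
from collections import Counter, defaultdict


def examples_to_chains(lines, known_ids=None):
    """Count order-1 Markov transitions in space-separated phoneme ids."""
    counts = defaultdict(Counter)
    for line in lines:
        ids = line.split()
        if not ids:
            continue  # skip blank lines
        if known_ids is not None:
            # Optional check of the examples against a phonology.
            unknown = [i for i in ids if i not in known_ids]
            if unknown:
                raise ValueError(f"unknown phoneme ids: {unknown}")
        previous = ""  # the empty string marks the starting state
        for item in ids + [""]:  # a trailing empty string marks the end
            counts[previous][item] += 1
            previous = item
    return [
        {
            "input": [state],
            "possible-outputs": [
                {"value": value, "occurences": n}
                for value, n in outputs.items()
            ],
        }
        for state, outputs in counts.items()
    ]


# Hypothetical examples, one word per line.
chains = examples_to_chains(["p a", "p a p a", "a p"])
for chain in chains:
    print(chain)
```

The resulting list can be placed in the `chains` field of a generator object whose `type` is `chains` and whose `order` is 1.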
### generator-list2rule.py
Convert a list of examples into a rule-based generator (substitution rules).
The list of examples must be checked against a phonology, by giving the corresponding JSON file in the arguments.
The phonology is included in the output.
The output can be used as the input file of the web interface to generate words.
The examples should follow the same format as for `generator-list2chain.py`.
### generator-rulemaker.py
Generate a rule-based generator from a phonology.
This script generates a new rule-based generator from a given phonology, without any examples.
The generator accepts some parameters, such as the minimum and maximum number of syllables, whether stress is phonemic and, if so, the position of the stressed syllable, or some control over the distribution weights.
The description field of a phoneme can be used to guide the generator with hashtags. Several hashtags can be present in a description. If no hashtag is given in the descriptions, the generator will guess which phoneme is a syllable separator, a consonant, or a vowel from the phonemic transcriptions. IPA notation must be used in that case.
The following hashtags are understood by the generator:
- `#stress`: indicator of a stressed syllable in the phonemic transcription. Usually represented with an apostrophe or the dedicated primary stress symbol (U+02C8).
- `#syllable-break`: indicator of a syllable separation in the phonemic transcription. Usually represented with a dot.
- `#onset`: indicates that a phoneme is present in the onset of syllables (beginning of a syllable). Onsets are usually made of consonants.
- `#nucleus`: indicates that a phoneme is present in the nucleus of syllables (sonorant part of a syllable). Nuclei are usually made of vowels.
- `#coda`: indicates that a phoneme is present in the coda of syllables (ending of a syllable). Codas are usually made of consonants.
- `#consonant`: synonym of `#onset #coda`.
- `#vowel`: synonym of `#nucleus`.
- `#stressed`: indicates that a phoneme is present in stressed syllables. If the hashtag `#unstressed` is not present in the description, the phoneme will only be present in stressed syllables. If both are missing, the generator will behave as if both were present.
- `#unstressed`: indicates that a phoneme is present in unstressed syllables. If the hashtag `#stressed` is not present in the description, the phoneme will only be present in unstressed syllables.
- `#single`: indicates that a phoneme is present in single-syllable words (if the generator can generate them). If the hashtags `#initial`, `#middle`, and `#final` are not present in the description, the phoneme will only be present in single-syllable words.
- `#initial`: indicates that a phoneme is present in the first syllable of words. If the hashtags `#middle` and `#final` are not present in the description, the phoneme will only be present in the first syllable of words. If the three hashtags are absent, the generator behaves as if they were all present. `#initial` implies `#single`.
- `#middle`: indicates that a phoneme is present in the syllables other than the first and the last of words. If the hashtags `#initial` and `#final` are not present in the description, the phoneme will only be present in the middle syllables of words.
- `#final`: indicates that a phoneme is present in the last syllable of words. If the hashtags `#initial` and `#middle` are not present in the description, the phoneme will only be present in the last syllable of words. `#final` implies `#single`.
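The synonym and default rules of the hashtags can be summarised with a small Python sketch. It is one possible reading of the list above, not the code of `generator-rulemaker.py`: the function name `phoneme_roles` is hypothetical, and two points are assumptions of this sketch: marker phonemes (`#stress`, `#syllable-break`) receive no defaults, and the position defaults apply only when `#single` is absent as well.

```python
import re


def phoneme_roles(description):
    """Expand the hashtags of a description into the full set of roles."""
    tags = set(re.findall(r"#([\w-]+)", description))
    # Assumption: markers are not segments, so no defaults are added.
    if tags & {"stress", "syllable-break"}:
        return tags
    # Synonyms.
    if "consonant" in tags:
        tags |= {"onset", "coda"}
    if "vowel" in tags:
        tags.add("nucleus")
    # No stress hashtag at all: behave as if both were present.
    if not tags & {"stressed", "unstressed"}:
        tags |= {"stressed", "unstressed"}
    # #initial and #final imply #single.
    if tags & {"initial", "final"}:
        tags.add("single")
    # Assumption: with no position hashtag at all (including #single),
    # behave as if every position were allowed.
    positions = {"single", "initial", "middle", "final"}
    if not tags & positions:
        tags |= positions
    return tags


print(sorted(phoneme_roles("voiceless bilabial plosive #consonant")))
print(sorted(phoneme_roles("open vowel #vowel #stressed #final")))
print(sorted(phoneme_roles("primary stress #stress")))
```

In this reading, a phoneme tagged only `#vowel #stressed #final` is restricted to the nucleus of stressed syllables, either in the last syllable of a word or in a single-syllable word.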
### phonology-maker.py
Generate a phonology.
This script can be used to generate a phonology without any input. The phonemes present in the phonology are chosen randomly according to some rules.
Combine this script with `generator-rulemaker.py` to make a procedurally generated word generator.
### phonagen-merge.py
Merge several Phonagen JSON files into a single JSON file.
### phonagen-generate.py
Generate words from a generator present in a JSON file. Can output multiple words and their transcriptions.
## JSON Representation
A rather simple and readable example of JSON is provided in `web/data.json`.
The JSON structure used by Phonagen is an object containing two fields `phonologies` and `generators`:
- The `phonologies` field contains a list of phonology objects.
- The `generators` field contains a list of generator objects.
### Phonology
A phonology object contains the fields:
- `id`: a string identifying a phonology
- `description`: a description for the phonology
- `transcriptions`: a list of strings indicating the names of the phoneme transcriptions. The `"phoneme"` string must be present in the list, and indicates the phonemic transcription.
- `main-transcription`: a string present in the `transcriptions` list identifying the main transcription for the web interface. The main transcription is shown larger.
- `entries`: a list of phoneme objects.
The phoneme objects contain the fields:
- `id`: a string identifying the phoneme. Used in the generators.
- `description`: a string describing the phoneme. May be anything; a list of hashtags can be included to help the Python scripts that build generators from phonologies.
- a field containing a string for each transcription of the `transcriptions` list of the phonology.
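To make the structure concrete, here is a small hypothetical phonology written as a Python dictionary (it maps directly to JSON), together with a checker for the constraints stated above. The data and the `check_phonology` helper are illustrative only and are not part of Phonagen; the uniqueness check on ids is an assumption.

```python
# A tiny hypothetical phonology with two transcriptions.
phonology = {
    "id": "sample",
    "description": "A tiny example phonology",
    "transcriptions": ["phoneme", "latin"],
    "main-transcription": "latin",
    "entries": [
        {"id": "p", "description": "#consonant", "phoneme": "p", "latin": "p"},
        {"id": "sh", "description": "#consonant", "phoneme": "ʃ", "latin": "sh"},
        {"id": "a", "description": "#vowel", "phoneme": "a", "latin": "a"},
    ],
}


def check_phonology(phonology):
    """Return a list of problems found against the field descriptions."""
    problems = []
    names = phonology.get("transcriptions", [])
    if "phoneme" not in names:
        problems.append("'phoneme' is missing from transcriptions")
    if phonology.get("main-transcription") not in names:
        problems.append("main-transcription is not a listed transcription")
    seen = set()
    for entry in phonology.get("entries", []):
        pid = entry.get("id")
        # Assumption: ids are unique, since generators refer to them.
        if pid in seen:
            problems.append(f"duplicate phoneme id: {pid!r}")
        seen.add(pid)
        for name in names:
            if not isinstance(entry.get(name), str):
                problems.append(f"phoneme {pid!r} lacks transcription {name!r}")
    return problems


print(check_phonology(phonology))
```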
### Generators
Two kinds of generators are currently implemented in Phonagen:
- a chain-based generator, describing a Markov chain generator
- a rule-based generator, using substitution rules to make words from substructures
Each generator object contains common fields and specific fields that depend on the kind of generator used.
The generic fields are:
- `id`: a string identifying the generator
- `description`: a string describing the generator. The description is used in the web interface for the selector.
- `phonology`: a string indicating which phonology the generator uses. Must correspond to the identifier of a phonology in the list of phonologies.
- `type`: the type of generator. Currently supported: `rules` and `chains`.
For a rule-based generator, the specific fields are:
- `rules`: a list of rule objects. One of the rules must have the id `word`, indicating the starting point of the generator.
A rule object contains the following fields:
- `id`: an identifier of the rule, that can be used in the patterns of other rules.
- `distribution`: a list of objects describing a discrete distribution.
The distribution objects of a rule contain the fields:
- `pattern`: a list of strings representing the sequence of elements of the pattern. The strings must be either identifiers of rules or identifiers of phonemes. Unknown identifiers are ignored when displaying the generated words.
- `occurences`: an integer indicating the weight of the pattern in the distribution.
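The following sketch shows how a rule-based generator object could be expanded into a word. It is an illustration of the structure above, not the implementation found in `web/phonagen.js` or `phonagen.py`: the function names and the sample generator are hypothetical, and the sketch assumes that a rule identifier takes precedence over a phoneme identifier and that the rules are not endlessly recursive.

```python
import random


def expand(rules, rule_id, rng):
    """Recursively expand a rule into a flat list of identifiers."""
    distribution = rules[rule_id]
    patterns = [d["pattern"] for d in distribution]
    weights = [d["occurences"] for d in distribution]
    # Weighted draw of one pattern from the discrete distribution.
    pattern = rng.choices(patterns, weights=weights, k=1)[0]
    result = []
    for item in pattern:
        if item in rules:  # assumption: rule ids win over phoneme ids
            result.extend(expand(rules, item, rng))
        else:
            result.append(item)  # phoneme id, or unknown id left as-is
    return result


def generate_from_rules(generator, rng=None):
    """Generate a list of phoneme ids, starting from the `word` rule."""
    rng = rng or random.Random()
    rules = {rule["id"]: rule["distribution"] for rule in generator["rules"]}
    return expand(rules, "word", rng)


# Hypothetical generator: words of one or two onset + "a" syllables.
generator = {
    "id": "sample-rules",
    "description": "Sample rule-based generator",
    "phonology": "sample",
    "type": "rules",
    "rules": [
        {"id": "word", "distribution": [
            {"pattern": ["syllable"], "occurences": 1},
            {"pattern": ["syllable", "syllable"], "occurences": 2},
        ]},
        {"id": "syllable", "distribution": [
            {"pattern": ["onset", "a"], "occurences": 1},
        ]},
        {"id": "onset", "distribution": [
            {"pattern": ["p"], "occurences": 1},
            {"pattern": ["t"], "occurences": 1},
        ]},
    ],
}

print(generate_from_rules(generator))
```

With these weights, two-syllable words are drawn twice as often as one-syllable words.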
For a chain-based generator, the specific fields are:
- `order`: an integer indicating the order of the Markov chain. It's the number of items needed to compute the next element of the chain.
- `chains`: a list of chain objects describing the transitions of the Markov chain generator.
A chain object contains the following fields:
- `input`: a list of strings (whose size corresponds to the order) indicating an input state of the Markov chain. The strings must be identifiers of phonemes or empty strings. Empty strings are used for starting and ending the generating process. A list of empty strings indicates the starting state. The strings of an input list must be either all non-empty or all empty.
- `possible-outputs`: a list of objects describing a discrete distribution.
The distribution objects of a chain contain the fields:
- `value`: a string indicating the next value of the chain. The string must be an identifier of a phoneme or an empty string. An empty string indicates the end of the generating process.
- `occurences`: an integer indicating the weight of the value in the distribution.
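A chain-based generator can be walked as in the sketch below. It illustrates the structure above and is not the implementation found in `web/phonagen.js` or `phonagen.py`: the function name, the sample generator, and the `max_length` safety limit are hypothetical. The sketch only handles chains of order 1.

```python
import random


def generate_from_chains(generator, rng=None, max_length=50):
    """Walk an order-1 chain from the starting state until an empty output."""
    if generator["order"] != 1:
        raise ValueError("this sketch only handles chains of order 1")
    rng = rng or random.Random()
    table = {
        tuple(chain["input"]): chain["possible-outputs"]
        for chain in generator["chains"]
    }
    state = ("",)  # a list of empty strings is the starting state
    word = []
    while len(word) < max_length:  # hypothetical guard against long walks
        outputs = table.get(state)
        if not outputs:
            break  # no transition known for this state
        values = [o["value"] for o in outputs]
        weights = [o["occurences"] for o in outputs]
        value = rng.choices(values, weights=weights, k=1)[0]
        if value == "":
            break  # an empty string ends the generating process
        word.append(value)
        state = (value,)
    return word


# Hypothetical generator with a single possible path: p, then a, then end.
generator = {
    "id": "sample-chains",
    "description": "Sample chain-based generator",
    "phonology": "sample",
    "type": "chains",
    "order": 1,
    "chains": [
        {"input": [""], "possible-outputs": [{"value": "p", "occurences": 1}]},
        {"input": ["p"], "possible-outputs": [{"value": "a", "occurences": 1}]},
        {"input": ["a"], "possible-outputs": [{"value": "", "occurences": 1}]},
    ],
}

print(generate_from_chains(generator))
```

The returned list holds phoneme ids; looking each id up in the `entries` of the phonology gives the transcriptions to display.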