Update README

2018-06-23 03:34:48 +02:00 · 2018-06-23 03:34:48 +02:00 · 7723e7744f
parent e5a81f8b8f
commit 7723e7744f
1 changed files with 97 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -43,5 +43,101 @@ The file containing the list of examples should be formatted as follow:

 Lists of examples can be found in the `examples` directory (.list files).

+### generator-list2rule.py
+Convert a list of examples into a rule-based generator (substitution rules).
+
+The list of examples must be checked against a phonology, by giving the corresponding JSON file in the arguments.
+The phonology is included in the output.
+The output can be used as the input file of the web interface to generate words.
+
+The examples should follow the same format as for `generator-list2chain.py`.
+
+### generator-rulemaker.py
+Generate a rule-based generator from a phonology.
+
+This script generates a new rule-based generator from a given phonology, without any examples.
+The generator accepts some parameters, such as the minimum and maximum number of syllables, whether stress is phonemic and, if so, the position of the stressed syllable, and some control over the distribution weights.
+
+The description field of a phoneme can be used to guide the generator with hashtags. Several hashtags can be present in a description. If no hashtags are given in the descriptions, the generator will guess which phonemes are syllable separators, consonants, or vowels from the phonemic transcriptions. IPA notation must be used in that case.
+The following hashtags are understood by the generator:
+- `#stress`: marks the indicator of a stressed syllable in the phonemic transcription. Usually represented with an apostrophe or the dedicated primary stress symbol (U+02C8).
+- `#syllable-break`: marks the indicator of syllable separation in the phonemic transcription. Usually represented with a dot.
+- `#onset`: indicates that a phoneme can appear in the onset of syllables (the beginning of a syllable). Onsets are usually made of consonants.
+- `#nucleus`: indicates that a phoneme can appear in the nucleus of syllables (the sonorant part of a syllable). Nuclei are usually made of vowels.
+- `#coda`: indicates that a phoneme can appear in the coda of syllables (the ending of a syllable). Codas are usually made of consonants.
+- `#consonant`: synonym for `#onset #coda`.
+- `#vowel`: synonym for `#nucleus`.
+- `#stressed`: indicates that a phoneme can appear in stressed syllables. If the hashtag `#unstressed` is not present in the description, the phoneme will only appear in stressed syllables. If both are missing, the generator will behave as if both were present.
+- `#unstressed`: indicates that a phoneme can appear in unstressed syllables. If the hashtag `#stressed` is not present in the description, the phoneme will only appear in unstressed syllables.
+- `#single`: indicates that a phoneme can appear in single-syllable words (if the generator can generate them). If the hashtags `#initial`, `#middle`, and `#final` are not present in the description, the phoneme will only appear in single-syllable words.
+- `#initial`: indicates that a phoneme can appear in the first syllable of words. If the hashtags `#middle` and `#final` are not present in the description, the phoneme will only appear in the first syllable of words. If all three hashtags are absent, the generator behaves as if they were all present. `#initial` implies `#single`.
+- `#middle`: indicates that a phoneme can appear in syllables other than the first and the last of a word. If the hashtags `#initial` and `#final` are not present in the description, the phoneme will only appear in the middle syllables of words.
+- `#final`: indicates that a phoneme can appear in the last syllable of words. If the hashtags `#initial` and `#middle` are not present in the description, the phoneme will only appear in the last syllable of words. `#final` implies `#single`.
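The hashtag defaults described in the list can be sketched in Python as follows. This is only an illustration of the stated rules, not the code of `generator-rulemaker.py`: the function name `parse_hashtags` and the constants are hypothetical, and the sketch applies the position and stress defaults to every description, including those of stress and syllable-break markers.

```python
import re

# Hypothetical helper tables, derived from the hashtag list in the README.
SYNONYMS = {"consonant": {"onset", "coda"}, "vowel": {"nucleus"}}
POSITIONS = {"single", "initial", "middle", "final"}
STRESS = {"stressed", "unstressed"}


def parse_hashtags(description):
    """Return the effective set of hashtags (without '#') for a description."""
    tags = set(re.findall(r"#([a-z-]+)", description))
    # Expand synonyms: #consonant -> #onset #coda, #vowel -> #nucleus.
    for name, expansion in SYNONYMS.items():
        if name in tags:
            tags.remove(name)
            tags |= expansion
    # #initial and #final both imply #single.
    if tags & {"initial", "final"}:
        tags.add("single")
    # No position hashtag at all: behave as if every position were allowed.
    if not tags & POSITIONS:
        tags |= POSITIONS
    # Neither #stressed nor #unstressed: behave as if both were present.
    if not tags & STRESS:
        tags |= STRESS
    return tags


print(sorted(parse_hashtags("voiceless stop #consonant #initial")))
```

A description such as `"#nucleus #single #stressed"` keeps exactly those three tags, because each group already has an explicit member and no default is filled in.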
+
+### phonology-maker.py
+Generate a phonology.
+This script can be used to generate a phonology without any input. The phonemes in the phonology are chosen randomly according to a set of rules.
+
+Combine this script with `generator-rulemaker.py` to make a procedurally generated word generator.
+
 ### phonagen-merge.py
-Merge several phonagen JSON files into a single JSON file.
+Merge several Phonagen JSON files into a single JSON file.
+
+### phonagen-generate.py
+Generate words from a generator present in a JSON file. Can output multiple words and their transcriptions.
+
+
+## JSON Representation
+A rather simple and readable example of JSON is provided in `web/data.json`.
+
+The JSON structure used by Phonagen is an object containing two fields, `phonologies` and `generators`:
+- The `phonologies` field contains a list of phonology objects.
+- The `generators` field contains a list of generator objects.
+
+### Phonology
+A phonology object contains the fields:
+- `id`: a string identifying a phonology
+- `description`: a description for the phonology
+- `transcriptions`: a list of strings giving the names of the phoneme transcriptions. The `"phoneme"` string must be present in the list, and indicates the phonemic transcription.
+- `main-transcription`: a string present in the `transcriptions` list identifying the main transcription for the web interface. The main transcription is shown larger.
+- `entries`: a list of phoneme objects.
+
+A phoneme object contains the fields:
+- `id`: a string identifying the phoneme. Used in the generators.
+- `description`: a string describing the phoneme. May be anything; a list of hashtags can be included to help the Python scripts that generate generators from phonologies.
+- one field per transcription named in the `transcriptions` list of the phonology, each containing a string.
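As an illustration of the fields above, the following Python sketch builds a tiny phonology, checks the two constraints stated in the field descriptions, and prints the resulting document as JSON. The phonology content, the `latin` transcription name, and the `check_phonology` helper are all made up for this example; they are not part of Phonagen.

```python
import json

# A made-up two-phoneme phonology with a phonemic and a "latin" transcription.
phonology = {
    "id": "example",
    "description": "A tiny illustrative phonology",
    "transcriptions": ["phoneme", "latin"],
    "main-transcription": "latin",
    "entries": [
        {"id": "p", "description": "#consonant", "phoneme": "p", "latin": "p"},
        {"id": "a", "description": "#vowel", "phoneme": "a", "latin": "a"},
    ],
}


def check_phonology(phonology):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    names = phonology["transcriptions"]
    if "phoneme" not in names:
        problems.append('"phoneme" missing from transcriptions')
    if phonology["main-transcription"] not in names:
        problems.append("main-transcription not in transcriptions")
    # Every entry needs one string field per declared transcription.
    for entry in phonology["entries"]:
        for name in names:
            if not isinstance(entry.get(name), str):
                problems.append(
                    f"entry {entry.get('id')!r} lacks transcription {name!r}"
                )
    return problems


document = {"phonologies": [phonology], "generators": []}
assert check_phonology(phonology) == []
print(json.dumps(document, ensure_ascii=False, indent=2))
```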
+
+### Generators
+Two kinds of generators are currently implemented in Phonagen:
+- a chain-based generator, describing a Markov chain generator
+- a rule-based generator, using substitution rules to make words from substructures
+
+Each generator object contains common fields and specific fields that depend on the kind of generator used.
+The generic fields are:
+- `id`: a string identifying the generator
+- `description`: a string describing the generator. The description is used in the web interface for the selector.
+- `phonology`: a string indicating which phonology the generator uses. Must correspond to the identifier of a phonology in the list of phonologies.
+- `type`: the type of generator. Currently supported: `rules` and `chains`.
+
+For a rule-based generator, the specific fields are:
+- `rules`: a list of rule objects. One of the rules must have the id `word`, indicating the starting point of the generator.
+
+A rule object contains the following fields:
+- `id`: an identifier of the rule, that can be used in the patterns of other rules.
+- `distribution`: a list of objects describing a discrete distribution.
+
+The distribution objects of a rule contain the fields:
+- `pattern`: a list of strings representing the sequence of elements of the pattern. Each string must be either a rule identifier or a phoneme identifier. Unknown identifiers are ignored when displaying the generated words.
+- `occurences`: an integer indicating the weight of the pattern in the distribution.
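The way such rules might be expanded into a word can be sketched in Python. This is a hedged illustration, not Phonagen's implementation: `expand` and `generate_word` are hypothetical names, the example generator is made up, rule identifiers are assumed to take precedence over phoneme identifiers, and unknown identifiers are simply dropped during expansion instead of being filtered at display time.

```python
import random


def expand(rules, rule_id, phoneme_ids, rng):
    """Recursively expand one rule into a list of phoneme identifiers."""
    distribution = rules[rule_id]["distribution"]
    patterns = [item["pattern"] for item in distribution]
    weights = [item["occurences"] for item in distribution]
    # Pick one pattern, weighted by its "occurences" value.
    pattern = rng.choices(patterns, weights=weights, k=1)[0]
    output = []
    for element in pattern:
        if element in rules:
            output.extend(expand(rules, element, phoneme_ids, rng))
        elif element in phoneme_ids:
            output.append(element)
        # Anything else is an unknown identifier and is dropped here.
    return output


def generate_word(generator, phoneme_ids, rng):
    """Expand the rule with id "word", the starting point of the generator."""
    rules = {rule["id"]: rule for rule in generator["rules"]}
    return expand(rules, "word", phoneme_ids, rng)


generator = {
    "id": "example-rules",
    "description": "Illustrative rule-based generator",
    "phonology": "example",
    "type": "rules",
    "rules": [
        {"id": "word", "distribution": [
            {"pattern": ["syllable"], "occurences": 1},
            {"pattern": ["syllable", "syllable"], "occurences": 3},
        ]},
        {"id": "syllable", "distribution": [
            {"pattern": ["p", "a"], "occurences": 2},
            {"pattern": ["a"], "occurences": 1},
        ]},
    ],
}

print("".join(generate_word(generator, {"p", "a"}, random.Random(0))))
```

With these weights, two-syllable words are three times as likely as one-syllable words, and the syllable `pa` is twice as likely as `a`.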
+
+For a chain-based generator, the specific fields are:
+- `order`: an integer indicating the order of the Markov chain. It's the number of items needed to compute the next element of the chain.
+- `chains`: a list of chain objects describing the transitions of the Markov chain generator.
+
+A chain object contains the following fields:
+- `input`: a list of strings (whose length corresponds to the order) indicating an input state of the Markov chain. The strings must be phoneme identifiers or empty strings. Empty strings are used for starting and ending the generating process. A list of empty strings indicates the starting state. The strings of an input list must be either all non-empty or all empty.
+- `possible-outputs`: a list of objects describing a discrete distribution.
+
+The distribution objects of a chain contain the fields:
+- `value`: a string indicating the next value of the chain. The string must be a phoneme identifier or an empty string. An empty string indicates the end of the generating process.
+- `occurences`: an integer indicating the weight of the value in the distribution.
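A walk through such a chain can be sketched in Python. This is a minimal illustration restricted to order 1, since the text does not describe how the state advances from the all-empty starting state for higher orders. The function name `generate_chain_word`, the `max_length` safety limit, and the example generator are assumptions made for this sketch, not Phonagen's code.

```python
import random


def generate_chain_word(generator, rng, max_length=50):
    """Walk an order-1 chain from the starting state until an empty value."""
    if generator["order"] != 1:
        raise ValueError("this sketch only handles order-1 chains")
    # Index the chain objects by their input state.
    table = {
        tuple(chain["input"]): chain["possible-outputs"]
        for chain in generator["chains"]
    }
    state = ("",)  # a list of empty strings is the starting state
    word = []
    while len(word) < max_length:
        outputs = table.get(state)
        if not outputs:
            break  # no transition known for this state
        values = [item["value"] for item in outputs]
        weights = [item["occurences"] for item in outputs]
        value = rng.choices(values, weights=weights, k=1)[0]
        if value == "":
            break  # an empty string ends the generating process
        word.append(value)
        state = (value,)
    return word


chain_generator = {
    "id": "example-chains",
    "description": "Illustrative chain-based generator",
    "phonology": "example",
    "type": "chains",
    "order": 1,
    "chains": [
        {"input": [""], "possible-outputs": [
            {"value": "p", "occurences": 3},
            {"value": "a", "occurences": 1},
        ]},
        {"input": ["p"], "possible-outputs": [
            {"value": "a", "occurences": 1},
        ]},
        {"input": ["a"], "possible-outputs": [
            {"value": "p", "occurences": 1},
            {"value": "", "occurences": 2},
        ]},
    ],
}

print("".join(generate_chain_word(chain_generator, random.Random(0))))
```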