A tool to generate word generators based on phonemes.


Feufochmar fad0967421 Change description of handwritten generators to be more in phase the description given by the generate-example-set.sh script.		2018-07-02 01:04:01 +02:00
examples	Change description of handwritten generators to be more in phase the description given by the generate-example-set.sh script.	2018-07-02 01:04:01 +02:00
py-phonagen	A sample script to generate words	2018-06-23 03:34:13 +02:00
web	Simplify css to minimum for the example.	2018-06-09 23:20:01 +02:00
COPYING	Add licence (MIT)	2018-06-23 19:12:38 +02:00
README.md	Update README	2018-06-23 03:34:48 +02:00
generate-example-set.sh	Add the handwritten rules in the example set.	2018-07-01 20:21:35 +02:00
make-archive.sh	Fix the missing web subdirectory.	2018-06-23 19:15:10 +02:00


Phonagen

Phonemic word generation tools. Phonagen provides several tools for building word generators based on the pronunciation and transcriptions of phonemes.

The tools are built around a JSON representation of phonemes and word generators.

Web interface

The web directory contains a sample web interface to generate words from the JSON description included in the web/data.json file. The implementation of generators is located in the script web/phonagen.js. To use it on any webpage:

include the script on your page (<script src='phonagen.js'></script> in the head)
add a div (or another block element) with the phonagen id
call the phonagen.load() function with the JSON file to use as its argument, either in the onload handler of the body or in a script tag placed after the phonagen block, e.g. <script>phonagen.load('data.json')</script>

Python scripts

These are located in the py-phonagen directory.

phonagen.py

The main module containing all the abstractions on which phonagen is based. Imported by the other tools.

phonology-csv2json.py

Convert a CSV file listing the phonemes and their transcriptions into the corresponding JSON phonology representation.

The input CSV file should have a header row naming the columns. A phoneme column is mandatory. The id and description columns are optional. All other columns are treated as transcriptions of the phonemes. If no main transcription is given on the command line, the first column that is not phoneme, id, or description is taken as the main transcription. The id column serves to identify a phoneme, notably in example lists. The description column may provide information about a phoneme. Phonemes can be tagged in the description to guide some generators.

Examples of csv files are present in the examples directory.
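
The column rules above can be illustrated with a short Python sketch. This is not the code of phonology-csv2json.py: the function name csv_to_phonology, the sample data, and the fallbacks used when the id or description column is missing are assumptions made for the example.

```python
import csv
import io

RESERVED = ("phoneme", "id", "description")

def csv_to_phonology(text, phonology_id, main_transcription=None):
    """Build a phonology dict from CSV text (illustrative only)."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    if "phoneme" not in columns:
        raise ValueError("the 'phoneme' column is mandatory")
    # Every column that is not reserved is treated as a transcription.
    extra = [c for c in columns if c not in RESERVED]
    if main_transcription is None:
        # First non-reserved column; using 'phoneme' when there is none
        # is an assumption of this sketch.
        main_transcription = extra[0] if extra else "phoneme"
    entries = []
    for row in reader:
        entry = {
            # Assumption: without an id column, the phoneme is its own id.
            "id": row.get("id") or row["phoneme"],
            "description": row.get("description") or "",
            "phoneme": row["phoneme"],
        }
        for name in extra:
            entry[name] = row[name]
        entries.append(entry)
    return {
        "id": phonology_id,
        "description": "",
        "transcriptions": ["phoneme"] + extra,
        "main-transcription": main_transcription,
        "entries": entries,
    }

# Hypothetical two-phoneme input.
sample = "id,phoneme,latin,description\nk,k,c,#consonant\na,a,a,#vowel\n"
phonology = csv_to_phonology(sample, "demo")
```

Here the latin column becomes the main transcription because it is the first column that is not phoneme, id, or description.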

generator-list2chain.py

Convert a list of examples into a chain-based generator (Markov chains).

The list of examples can be checked against a phonology, by giving the corresponding JSON file in the arguments. If a JSON file of the phonology is given, the phonology is included in the output. The output can be used as the input file of the web interface to generate words.

The file containing the list of examples should be formatted as follows:

one example per line
each phoneme is indicated by its id
the phoneme ids are separated by spaces

Lists of examples can be found in the examples directory (.list files).
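
As a rough illustration of the conversion, the sketch below parses a list in this format and counts transitions for a chain of order 1, using the field names from the JSON Representation section. It is not the code of generator-list2chain.py: the function names, the skipping of blank lines, and the restriction to order 1 are assumptions of the example.

```python
from collections import Counter, defaultdict

def parse_examples(text, known_ids=None):
    """Split each line into phoneme ids, optionally checking them."""
    examples = []
    for number, line in enumerate(text.splitlines(), start=1):
        ids = line.split()
        if not ids:
            continue  # Assumption: blank lines are skipped.
        if known_ids is not None:
            unknown = [i for i in ids if i not in known_ids]
            if unknown:
                raise ValueError(f"line {number}: unknown ids {unknown}")
        examples.append(ids)
    return examples

def build_order1_chains(examples):
    """Count phoneme-to-phoneme transitions (Markov chain of order 1)."""
    counts = defaultdict(Counter)
    for ids in examples:
        previous = ""  # The empty string marks the starting state.
        for current in ids + [""]:  # A trailing empty string marks the end.
            counts[previous][current] += 1
            previous = current
    return [
        {
            "input": [state],
            "possible-outputs": [
                {"value": value, "occurences": n}
                for value, n in outputs.items()
            ],
        }
        for state, outputs in counts.items()
    ]

# Hypothetical examples; "." stands for a syllable-break phoneme id.
examples = parse_examples("k a . t a\nt a k\n", known_ids={"k", "a", "t", "."})
chains = build_order1_chains(examples)
```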

generator-list2rule.py

Convert a list of examples into a rule-based generator (substitution rules).

The list of examples must be checked against a phonology, by giving the corresponding JSON file in the arguments. The phonology is included in the output. The output can be used as the input file of the web interface to generate words.

The examples should follow the same format as for generator-list2chain.py.

generator-rulemaker.py

Generate a rule-based generator from a phonology.

This script generates a new rule-based generator from a given phonology, without any examples. It accepts some parameters, such as the minimum and maximum number of syllables, whether stress is phonemic and, if so, the position of the stressed syllable, and some control over distribution weights.

The description field of a phoneme can be used to guide the generator with hashtags. Several hashtags can be present in a description. If no hashtags are given in the descriptions, the generator will guess which phoneme is a syllable separator, a consonant, or a vowel from the phonemic transcriptions. IPA notation must be used in that case. The following hashtags are understood by the generator:

#stress: indicator of a stressed syllable in the phonemic transcription. Usually represented with an apostrophe or the dedicated primary stress symbol (U+02C8).
#syllable-break: indicator of syllable separation in the phonemic transcription. Usually represented with a dot.
#onset: indicates that a phoneme is present in the onset of syllables (beginning of a syllable). Onsets are usually made of consonants.
#nucleus: indicates that a phoneme is present in the nucleus of syllables (sonorant part of a syllable). Nuclei are usually made of vowels.
#coda: indicates that a phoneme is present in the coda of syllables (ending of a syllable). Codas are usually made of consonants.
#consonant: synonym for #onset #coda.
#vowel: synonym for #nucleus.
#stressed: indicates that a phoneme is present in stressed syllables. If the hashtag #unstressed is not present in the description, the phoneme will only be present in stressed syllables. If both are missing, the generator will behave as if both were present.
#unstressed: indicates that a phoneme is present in unstressed syllables. If the hashtag #stressed is not present in the description, the phoneme will only be present in unstressed syllables.
#single: indicates that a phoneme is present in single-syllable words (if the generator can generate them). If the hashtags #initial, #middle, and #final are not present in the description, the phoneme will only be present in single-syllable words.
#initial: indicates that a phoneme is present in the first syllable of words. If the hashtags #middle and #final are not present in the description, the phoneme will only be present in the first syllable of words. If the three hashtags are absent, the generator behaves as if they were all present. #initial implies #single.
#middle: indicates that a phoneme is present in the syllables other than the first and the last of words. If the hashtags #initial and #final are not present in the description, the phoneme will only be present in the middle syllables of words.
#final: indicates that a phoneme is present in the last syllable of words. If the hashtags #initial and #middle are not present in the description, the phoneme will only be present in the last syllable of words. #final implies #single.
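
One possible reading of these rules, written as a small Python function, is shown below. It is a sketch rather than the script's own logic: the function name effective_tags is hypothetical, an explicit #single on its own is taken to restrict the phoneme to single-syllable words, and the IPA-based guessing used when no hashtag is given is not covered.

```python
import re

STRESS = {"#stressed", "#unstressed"}
POSITIONS = {"#single", "#initial", "#middle", "#final"}

def effective_tags(description):
    """Expand the hashtags of a description with synonyms and defaults."""
    tags = set(re.findall(r"#[\w-]+", description))
    # Synonyms.
    if "#consonant" in tags:
        tags |= {"#onset", "#coda"}
    if "#vowel" in tags:
        tags.add("#nucleus")
    # Neither stress hashtag given: behave as if both were present.
    if not tags & STRESS:
        tags |= STRESS
    # No position hashtag given: behave as if all were present.
    if not tags & POSITIONS:
        tags |= POSITIONS
    elif tags & {"#initial", "#final"}:
        # #initial and #final each imply #single.
        tags.add("#single")
    return tags

# Hypothetical description of a consonant limited to first syllables.
tags = effective_tags("voiceless velar plosive #consonant #initial")
```

With this description the result contains #onset and #coda (from #consonant), both stress hashtags (none was given), and #initial together with the implied #single, but neither #middle nor #final.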

phonology-maker.py

Generate a phonology. This script can be used to generate a phonology without any input. The phonemes of the phonology are chosen randomly according to some rules.

Combine this script with generator-rulemaker.py to make a procedurally generated word generator.

phonagen-merge.py

Merge several Phonagen JSON files into a single JSON file.

phonagen-generate.py

Generate words from a generator present in a JSON file. Can output multiple words and their transcriptions.

JSON Representation

A rather simple and readable example of JSON is provided in web/data.json.

The JSON structure used by Phonagen is an object containing two fields, phonologies and generators:

The phonologies field contains a list of phonology objects.
The generators field contains a list of generator objects.

Phonology

A phonology object contains the fields:

id: a string identifying a phonology
description: a description for the phonology
transcriptions: a list of strings giving the names of the phoneme transcriptions. The "phoneme" string must be present in the list, and indicates the phonemic transcription.
main-transcription: a string present in the transcriptions list identifying the main transcription for the web interface. The main transcription is shown larger.
entries: a list of phoneme objects.

A phoneme object contains the fields:

id: a string identifying the phoneme. Used in the generators.
description: a string describing the phoneme. May be anything; a list of hashtags can be included to help the Python scripts that build generators from phonologies.
one field per transcription named in the transcriptions list of the phonology, each containing a string.
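
The constraints above can be checked mechanically. The following Python sketch shows a hypothetical phonology object and a small checker; the function name check_phonology and the sample values are invented for the example, and the uniqueness check on phoneme ids is an assumption (the text only says that ids identify phonemes).

```python
def check_phonology(phonology):
    """Return a list of problems found in a phonology object."""
    problems = []
    transcriptions = phonology.get("transcriptions", [])
    if "phoneme" not in transcriptions:
        problems.append('"phoneme" is missing from transcriptions')
    if phonology.get("main-transcription") not in transcriptions:
        problems.append("main-transcription is not listed in transcriptions")
    seen = set()
    for entry in phonology.get("entries", []):
        entry_id = entry.get("id")
        # Assumption: ids must be unique, since they identify phonemes.
        if entry_id in seen:
            problems.append(f"duplicate phoneme id: {entry_id!r}")
        seen.add(entry_id)
        # Each entry needs one string field per declared transcription.
        for name in transcriptions:
            if not isinstance(entry.get(name), str):
                problems.append(
                    f"phoneme {entry_id!r} lacks transcription {name!r}"
                )
    return problems

# Hypothetical two-phoneme phonology.
phonology = {
    "id": "demo",
    "description": "Hypothetical two-phoneme phonology",
    "transcriptions": ["phoneme", "latin"],
    "main-transcription": "latin",
    "entries": [
        {"id": "k", "description": "#consonant", "phoneme": "k", "latin": "c"},
        {"id": "a", "description": "#vowel", "phoneme": "a", "latin": "a"},
    ],
}
problems = check_phonology(phonology)
```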

Generators

Two kinds of generators are currently implemented in Phonagen:

a chain-based generator, describing a Markov chain generator
a rule-based generator, using substitution rules to make words from substructures

Each generator object contains common fields and specific fields that depend on the kind of generator. The common fields are:

id: a string identifying the generator
description: a string describing the generator. The description is used in the web interface for the selector.
phonology: a string indicating which phonology the generator uses. Must correspond to the identifier of a phonology in the list of phonologies.
type: the type of generator. Currently supported: rules and chains.
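
As an illustration of how the common fields relate to the rest of the file, the Python sketch below checks that each generator refers to a defined phonology and uses a supported type. The function name check_generators and the sample document are hypothetical.

```python
def check_generators(document):
    """Return a list of problems in the common fields of the generators."""
    problems = []
    phonology_ids = {p["id"] for p in document.get("phonologies", [])}
    for generator in document.get("generators", []):
        gid = generator.get("id")
        ref = generator.get("phonology")
        if ref not in phonology_ids:
            problems.append(f"generator {gid!r}: phonology {ref!r} is not defined")
        if generator.get("type") not in ("rules", "chains"):
            problems.append(f"generator {gid!r}: unsupported type")
    return problems

# Hypothetical document; specific fields are omitted for brevity.
document = {
    "phonologies": [{"id": "demo"}],
    "generators": [
        {"id": "ok", "description": "Valid reference",
         "phonology": "demo", "type": "chains"},
        {"id": "broken", "description": "Dangling reference",
         "phonology": "missing", "type": "rules"},
    ],
}
problems = check_generators(document)
```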

For a rule-based generator, the specific fields are:

rules: a list of rule objects. One of the rules must have the id word, indicating the starting point of the generator.

A rule object contains the following fields:

id: an identifier of the rule, that can be used in the patterns of other rules.
distribution: a list of objects describing a discrete distribution.

The distribution objects of a rule contain the fields:

pattern: a list of strings representing the sequence of elements of the pattern. Each string must be either a rule identifier or a phoneme identifier. Unknown identifiers are ignored when displaying the generated words.
occurences: an integer indicating the weight of the pattern in the distribution.
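
To make the structure concrete, here is a minimal Python sketch of how such rules could be expanded into a word, starting from the word rule and drawing patterns according to their occurences weights. It is not taken from phonagen.py or phonagen.js: the function names, the recursion limit, the sample generator, and the choice to let a rule identifier take precedence over a phoneme with the same identifier are assumptions.

```python
import random

def expand(rule_id, rules, phoneme_ids, rng, depth=0):
    """Expand a rule into a flat list of phoneme ids."""
    if depth > 50:  # Arbitrary guard against endlessly recursive rules.
        raise RecursionError("rules nest too deeply")
    distribution = rules[rule_id]
    patterns = [item["pattern"] for item in distribution]
    weights = [item["occurences"] for item in distribution]
    pattern = rng.choices(patterns, weights=weights, k=1)[0]
    output = []
    for element in pattern:
        if element in rules:
            output.extend(expand(element, rules, phoneme_ids, rng, depth + 1))
        elif element in phoneme_ids:
            output.append(element)
        # Unknown identifiers are simply dropped in this sketch.
    return output

def generate(generator, phoneme_ids, rng=None):
    """Generate one word (a list of phoneme ids) from a rules generator."""
    if rng is None:
        rng = random.Random()
    rules = {rule["id"]: rule["distribution"] for rule in generator["rules"]}
    return expand("word", rules, phoneme_ids, rng)

# Hypothetical generator: words of one or two (C)V syllables.
generator = {
    "id": "demo-rules", "description": "Hypothetical example",
    "phonology": "demo", "type": "rules",
    "rules": [
        {"id": "word", "distribution": [
            {"pattern": ["syllable"], "occurences": 1},
            {"pattern": ["syllable", "syllable"], "occurences": 2}]},
        {"id": "syllable", "distribution": [
            {"pattern": ["C", "V"], "occurences": 3},
            {"pattern": ["V"], "occurences": 1}]},
        {"id": "C", "distribution": [
            {"pattern": ["k"], "occurences": 1},
            {"pattern": ["t"], "occurences": 1}]},
        {"id": "V", "distribution": [
            {"pattern": ["a"], "occurences": 1},
            {"pattern": ["i"], "occurences": 1}]},
    ],
}
word = generate(generator, {"k", "t", "a", "i"}, random.Random(0))
```

The result is a list of phoneme ids; turning it into text means looking up each id in the phonology and concatenating the chosen transcription.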

For a chain-based generator, the specific fields are:

order: an integer indicating the order of the Markov chain. It's the number of items needed to compute the next element of the chain.
chains: a list of chain objects describing the transitions of the Markov chain generator.

A chain object contains the following fields:

input: a list of strings (whose length corresponds to the order) indicating an input state of the Markov chain. The strings must be phoneme identifiers or empty strings. Empty strings are used for starting and ending the generating process. A list of empty strings indicates the starting state. The strings of an input list must be either all non-empty or all empty.
possible-outputs: a list of objects describing a discrete distribution.

The distribution objects of a chain contain the fields:

value: a string indicating the next value of the chain. The string must be a phoneme identifier or an empty string. An empty string indicates the end of the generating process.
occurences: an integer indicating the weight of the value in the distribution.
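
A walk through such a chain can be sketched in Python as follows. The sketch handles order 1 only, because the description above does not spell out how the state advances from the all-empty starting state when the order is higher. The function name, the max_length guard, and the sample generator are assumptions and are not taken from phonagen.py or phonagen.js.

```python
import random

def generate_from_chain(generator, rng=None, max_length=100):
    """Generate one word (a list of phoneme ids) from an order-1 chain."""
    if generator["order"] != 1:
        raise ValueError("this sketch only handles chains of order 1")
    if rng is None:
        rng = random.Random()
    table = {
        tuple(chain["input"]): chain["possible-outputs"]
        for chain in generator["chains"]
    }
    state = ("",)  # A list of empty strings is the starting state.
    word = []
    while len(word) < max_length:  # Guard against chains that never end.
        outputs = table.get(state)
        if not outputs:
            break  # Assumption: a state without transitions ends the word.
        values = [item["value"] for item in outputs]
        weights = [item["occurences"] for item in outputs]
        value = rng.choices(values, weights=weights, k=1)[0]
        if value == "":
            break  # An empty string ends the generating process.
        word.append(value)
        state = (value,)
    return word

# Hypothetical generator over the phoneme ids k, t and a.
generator = {
    "id": "demo-chains", "description": "Hypothetical example",
    "phonology": "demo", "type": "chains", "order": 1,
    "chains": [
        {"input": [""], "possible-outputs": [
            {"value": "k", "occurences": 1},
            {"value": "t", "occurences": 1}]},
        {"input": ["k"], "possible-outputs": [{"value": "a", "occurences": 2}]},
        {"input": ["t"], "possible-outputs": [{"value": "a", "occurences": 1}]},
        {"input": ["a"], "possible-outputs": [
            {"value": "", "occurences": 2},
            {"value": "k", "occurences": 1},
            {"value": "t", "occurences": 1}]},
    ],
}
word = generate_from_chain(generator, random.Random(0))
```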