Levenshtein edit distance

Levenshtein is a collection of procedures providing various forms of the Levenshtein edit distance calculation.

The Levenshtein edit distance has been used for areas as diverse as soil sample and language dialect analysis. Not just for text strings.

Performs edit distance calculation for byte strings. All return the total edit cost.

**procedure:**(levenshtein-distance/byte SOURCE TARGET)-
Calculates the edit distance from the

`SOURCE`to the`TARGET`. All costs are unitary. **procedure:**(levenshtein-distance/transpose-byte SOURCE TARGET)-
Calculates the edit distance from the

`SOURCE`to the`TARGET`, taking into account the Transpose operation. All costs are unitary.

(require-extension levenshtein-byte)

(require-extension levenshtein-transpose-byte)

Supplies the arithmetic and string operations for the distance algorithms below.

Should you wish to supply your own 'means' please see the procedure-surface egg documentation, "levenshtein-*-surface.scm", and "levenshtein-*-means.scm" in this egg for more information.

(require-extension levenshtein-utf8-means)

Uses procedures from the utf8 egg.

Loads the extensions "utf8" and "utf8-srfi-13".

(require-extension levenshtein-octet-means)

Uses procedures from the srfi-13 unit.

Loads the srfi-13 unit.

(require-extension levenshtein-string-means)

Uses string procedures in the toplevel at load time.

Load either the extensions "utf8" and "utf8-srfi-13", or the srfi-13 unit.

(require-extension levenshtein-numbers-means)

Uses procedures from the numbers egg.

Loads the extension "numbers".

(require-extension levenshtein-gennum-means)

Uses only fixnum and flonum procedures.

(require-extension levenshtein-fixnum-means)

Uses only fixnum procedures.

Performs edit distance calculation for strings.

(require-extension levenshtein-generic-string)

Calculates the edit distance from the `SOURCE` to the `TARGET`.

SOURCE |
A string. |

TARGET |
A string. |

insert-cost: |
A number. The edit cost of an insert. |

delete-cost: |
A number. The edit cost of a delete. |

substitute-cost: |
A number. The edit cost of a substitute. |

=?: |
A procedure; (-> object object boolean). The equality predicate. Probably not useful to override the default predicate. |

limit-cost: |
A number or #f. Limit edit distance calculation to cost. Number type must be compatible with the numeric-means. |

numeric-means: |
A procedure-means. The arithmetic means. |

string-means: |
A procedure-means. The string means. |

(require-extension levenshtein-generic-sequence)

Calculates the edit distance from the `SOURCE` to the `TARGET`.

SOURCE |
A vector, string, or list. |

TARGET |
A vector, string, or list. |

insert-cost: |
A number. The edit cost of an insert. |

delete-cost: |
A number. The edit cost of a delete. |

substitute-cost: |
A number. The edit cost of a substitute. |

get-work-vector: |
A procedure; (-> number vector). Returns a work vector of length 'number', or larger. |

=?: |
A procedure; (-> object object boolean). The equality predicate. Must handle the types of the source & target elements in either argument position! |

limit-cost: |
A number or #f. Limit edit distance calculation to cost. Number type must be compatible with the numeric-means. |

numeric-means: |
A procedure-means. The arithmetic means. |

string-means: |
A procedure-means. The string means. Used when source and/or target are strings. |

Performs edit distance calculation for vectors. Allows definition of new edit operations. Will keep track of edit operations performed.

**procedure:**(levenshtein-distance/vector* SOURCE TARGET [EDIT-OPER ...] [#:=? char=?] [#:operations #f] [#:numeric-means levenshtein-fixnum-means])-
Calculates the edit distance from the source vector

`SOURCE`to the target vector`TARGET`. Returns the total edit cost or (values <total edit cost> <performed operations matrix>).SOURCE A vector.

TARGET A vector.

EDIT-OPER Edit operation definitions to apply. Defaults are the basic Insert, Delete, and Substitute.

=?: A procedure; (-> object object boolean). The equality predicate.

operations: A boolean. Include the matrix of edit operations performed?

numeric-means: A procedure-means. The arithmetic means.

**procedure:**(levenshtein-path-iterator MATRIX)-
Creates an optimal edit distance operation path iterator over the performed operations matrix

`MATRIX`. The matrix is usually the result of an invocation of '(levenshtein-distance/vector* ... operations: #t)'.Each invocation of the iterator will generate a list of the form: ((cost source-index target-index levenshtein-operator) ...). The last invocation will return #f.

(require-extension levenshtein-vector)

(require-extension levenshtein-path-iterator)

Edit operation specification. A set of base operations is predefined, but may be overridden. The base set is identified by the keys Insert, Delete, Substitute, and Transpose. A printer and reader are provided for edit operations.

(require-extension levenshtein-operators)

**record:**levenshtein-operatorkey A symbol. Key for the operation.

name A string. Describes the operation.

cost A number. The cost of the operation.

above A non-negative fixnum. How far back in the source.

left A non-negative fixnum. How far back in the target

**procedure:**(make-levenshtein-operator KEY NAME COST ABOVE LEFT)-
Returns a new edit operator.

**procedure:**(levenshtein-operator? OBJECT)-
Is the

`OBJECT`a levenshtein operator? **procedure:**(clone-levenshtein-operator EDIT-OPERATION [#:key] [#:name] [#:cost] [#:above] [#:left])-
Returns a duplicate of the

`EDIT-OPERATION`, with field values provided by the optional keyword arguments. EDIT-OPERATION may be the key of the already defined edit operation. **procedure:**(levenshtein-operator-ref KEY)-
Get the definition of an edit operation.

**procedure:**(levenshtein-operator-set! EDIT-OPERATION)-
Define an edit operation.

**procedure:**(levenshtein-operator-delete! EDIT-OPERATION)-
Removes the

`EDIT-OPERATION`definition. EDIT-OPERATION may be the key of the already defined edit operation. **procedure:**(levenshtein-operator-reset)-
Restore defined edit operations to the base set.

**procedure:**(levenshtein-operator-equal? A B)-
Are the levenshtein operators

`A`&`B`equal for all fields?

(require-extension levenshtein)

Warning - The numbers and utf8 eggs will be used!

**macro:**(levenshtein-distance/string SOURCE TARGET [EQ INSERT-COST DELETE-COST SUBSTITUTE-COST])-
Calculates the edit distance from the source string

`SOURCE`to the target string`TARGET`. The default equivalence procedure`EQ`is 'char=?'. The default costs`INSERT-COST`,`DELETE-COST`, and`SUBSTITUTE-COST`are unitary. **macro:**(levenshtein-distance/list SOURCE TARGET [EQ INSERT-COST DELETE-COST SUBSTITUTE-COST])-
Calculates the edit distance from the source list

`SOURCE`to the target list`TARGET`. The default equivalence procedure`EQ`is 'equal?'. The default costs`INSERT-COST`,`DELETE-COST`, and`SUBSTITUTE-COST`are unitary. **macro:**(levenshtein-distance/vector SOURCE TARGET [EQ INSERT-COST DELETE-COST SUBSTITUTE-COST])-
Calculates the edit distance from the source vector

`SOURCE`to the target vector`TARGET`. The default equivalence procedure`EQ`is 'equal?'. The default costs`INSERT-COST`,`DELETE-COST`, and`SUBSTITUTE-COST`are unitary. **macro:**(levenshtein-distance/sequence SOURCE TARGET [EQ INSERT-COST DELETE-COST SUBSTITUTE-COST])-
Calculates the edit distance from the source sequence

`SOURCE`to the target sequence`TARGET`. The default equivalence procedure`EQ`is 'char=?'. The default costs`INSERT-COST`,`DELETE-COST`, and`SUBSTITUTE-COST`are unitary. **macro:**(levenshtein-distance/scratch SOURCE TARGET [GET-SCRATCH EQ INSERT-COST DELETE-COST SUBSTITUTE-COST])-
Calculates the edit distance from the source vector

`SOURCE`to the target vector`TARGET`. The default equivalence procedure`EQ`is 'equal?'. The default costs`INSERT-COST`,`DELETE-COST`, and`SUBSTITUTE-COST`are unitary. The default get-scratch procedure`GET-SCRATCH`is make-vector.Few, if any, programs will use this procedure directly. This is like levenshtein-distance/vector, but allows get-scratch to be specified. get-scratch is a procedure of one term, n, that yields a vector of length n or greater, which is used for record-keeping during execution of the Levenshtein algorithm. make-vector can be used for get-scratch, although some programs comparing a large size or quantity of vectors may wish to reuse a record-keeping vector, rather than each time allocating a new one that will need to be garbage-collected.

(use levenshtein-vector levenshtein-path-iterator) (define iter (levenshtein-path-iterator (levenshtein-distance/vector* "YWCQPGK" "LAWYQQKPGKA" operations: #t)) ; ignoring interpreter feedback (define r0 (iter)) (define t0 r0) (define r1 (iter)) (define r2 (iter)) (define r3 (iter)) (define r4 (iter)) (define r5 (iter)) (iter) ; r0 now has #f, since the iterator finishes by returning to the initial caller, which is the ; body of '(define r0 (iter))', thus re-binding r0. However, t0 has the original returned value. ; Please see the 'levenshtein*test.scm' files in the levenshtein egg for more examples.

- 1.7.1 Added 'levenshtein-string-means'. Removed a syntax-case use.
- 1.7.0 Support for toplevel-only utf8 egg. Removed syntax-case dependency.
- 1.602 Changes for syntax-case support of define-inline
- 1.6.1 Missing means source files for syntax-case module
- 1.6 Refactoring
- 1.5 Needs misc-extn > 2.0
- 1.4 Shared code
- 1.3 Major changes
- 1.2 Switched to array-lib
- 1.1 Requirement for srfi-43 was wrong [Thanks to Benedikt Rosenau]
- 1.0 Initial release

Copyright (c) 2005, 2006, Kon Lovett. All rights reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the Software), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED ASIS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.