Levenshtein edit distance
Levenshtein is a collection of procedures providing various forms of the Levenshtein edit distance calculation.
The Levenshtein edit distance has been used for areas as diverse as soil sample and language dialect analysis. Not just for text strings.
Performs edit distance calculation for byte strings. All return the total edit cost.
(require-extension levenshtein-byte)
Calculates the edit distance from the SOURCE to the TARGET. All costs are unitary.
(require-extension levenshtein-transpose-byte)
Calculates the edit distance from the SOURCE to the TARGET, taking into account the Transpose operation. All costs are unitary.
Supplies the arithmetic and string operations for the distance algorithms below.
Should you wish to supply your own 'means' please see the procedure-surface egg documentation, "levenshtein-*-surface.scm", and "levenshtein-*-means.scm" in this egg for more information.
(require-extension levenshtein-utf8-means)
Uses procedures from the utf8 egg.
Loads the extensions "utf8" and "utf8-srfi-13".
(require-extension levenshtein-octet-means)
Uses procedures from the srfi-13 unit.
Loads the srfi-13 unit.
(require-extension levenshtein-string-means)
Uses string procedures in the toplevel at load time.
Load either the extensions "utf8" and "utf8-srfi-13", or the srfi-13 unit.
(require-extension levenshtein-numbers-means)
Uses procedures from the numbers egg.
Loads the extension "numbers".
(require-extension levenshtein-gennum-means)
Uses only fixnum and flonum procedures.
(require-extension levenshtein-fixnum-means)
Uses only fixnum procedures.
Performs edit distance calculation for strings.
(require-extension levenshtein-generic-string)
Calculates the edit distance from the SOURCE to the TARGET.
SOURCE |
A string. |
TARGET |
A string. |
insert-cost: |
A number. The edit cost of an insert. |
delete-cost: |
A number. The edit cost of a delete. |
substitute-cost: |
A number. The edit cost of a substitute. |
=?: |
A procedure; (-> object object boolean). The equality predicate. Probably not useful to override the default predicate. |
limit-cost: |
A number or #f. Limit edit distance calculation to cost. Number type must be compatible with the numeric-means. |
numeric-means: |
A procedure-means. The arithmetic means. |
string-means: |
A procedure-means. The string means. |
(require-extension levenshtein-generic-sequence)
Calculates the edit distance from the SOURCE to the TARGET.
SOURCE |
A vector, string, or list. |
TARGET |
A vector, string, or list. |
insert-cost: |
A number. The edit cost of an insert. |
delete-cost: |
A number. The edit cost of a delete. |
substitute-cost: |
A number. The edit cost of a substitute. |
get-work-vector: |
A procedure; (-> number vector). Returns a work vector of length 'number', or larger. |
=?: |
A procedure; (-> object object boolean). The equality predicate. Must handle the types of the source & target elements in either argument position! |
limit-cost: |
A number or #f. Limit edit distance calculation to cost. Number type must be compatible with the numeric-means. |
numeric-means: |
A procedure-means. The arithmetic means. |
string-means: |
A procedure-means. The string means. Used when source and/or target are strings. |
Performs edit distance calculation for vectors. Allows definition of new edit operations. Will keep track of edit operations performed.
(require-extension levenshtein-vector)
Calculates the edit distance from the source vector SOURCE to the target vector TARGET. Returns the total edit cost or (values <total edit cost> <performed operations matrix>).
SOURCE |
A vector. |
TARGET |
A vector. |
EDIT-OPER |
Edit operation definitions to apply. Defaults are the basic Insert, Delete, and Substitute. |
=?: |
A procedure; (-> object object boolean). The equality predicate. |
operations: |
A boolean. Include the matrix of edit operations performed? |
numeric-means: |
A procedure-means. The arithmetic means. |
(require-extension levenshtein-path-iterator)
Creates an optimal edit distance operation path iterator over the performed operations matrix MATRIX. The matrix is usually the result of an invocation of '(levenshtein-distance/vector* ... operations: #t)'.
Each invocation of the iterator will generate a list of the form: ((cost source-index target-index levenshtein-operator) ...). The last invocation will return #f.
Edit operation specification. A set of base operations is predefined, but may be overridden. The base set is identified by the keys Insert, Delete, Substitute, and Transpose. A printer and reader are provided for edit operations.
(require-extension levenshtein-operators)
key |
A symbol. Key for the operation. |
name |
A string. Describes the operation. |
cost |
A number. The cost of the operation. |
above |
A non-negative fixnum. How far back in the source. |
left |
A non-negative fixnum. How far back in the target |
Returns a new edit operator.
Is the OBJECT a levenshtein operator?
Returns a duplicate of the EDIT-OPERATION, with field values provided by the optional keyword arguments. EDIT-OPERATION may be the key of the already defined edit operation.
Get the definition of an edit operation.
Define an edit operation.
Removes the EDIT-OPERATION definition. EDIT-OPERATION may be the key of the already defined edit operation.
Restore defined edit operations to the base set.
Are the levenshtein operators A & B equal for all fields?
(require-extension levenshtein)
Warning - The numbers and utf8 eggs will be used!
Calculates the edit distance from the source string SOURCE to the target string TARGET. The default equivalence procedure EQ is 'char=?'. The default costs INSERT-COST, DELETE-COST, and SUBSTITUTE-COST are unitary.
Calculates the edit distance from the source list SOURCE to the target list TARGET. The default equivalence procedure EQ is 'equal?'. The default costs INSERT-COST, DELETE-COST, and SUBSTITUTE-COST are unitary.
Calculates the edit distance from the source vector SOURCE to the target vector TARGET. The default equivalence procedure EQ is 'equal?'. The default costs INSERT-COST, DELETE-COST, and SUBSTITUTE-COST are unitary.
Calculates the edit distance from the source sequence SOURCE to the target sequence TARGET. The default equivalence procedure EQ is 'char=?'. The default costs INSERT-COST, DELETE-COST, and SUBSTITUTE-COST are unitary.
Calculates the edit distance from the source vector SOURCE to the target vector TARGET. The default equivalence procedure EQ is 'equal?'. The default costs INSERT-COST, DELETE-COST, and SUBSTITUTE-COST are unitary. The default get-scratch procedure GET-SCRATCH is make-vector.
Few, if any, programs will use this procedure directly. This is like levenshtein-distance/vector, but allows get-scratch to be specified. get-scratch is a procedure of one term, n, that yields a vector of length n or greater, which is used for record-keeping during execution of the Levenshtein algorithm. make-vector can be used for get-scratch, although some programs comparing a large size or quantity of vectors may wish to reuse a record-keeping vector, rather than each time allocating a new one that will need to be garbage-collected.
(use levenshtein-vector levenshtein-path-iterator) (define iter (levenshtein-path-iterator (levenshtein-distance/vector* "YWCQPGK" "LAWYQQKPGKA" operations: #t)) ; ignoring interpreter feedback (define r0 (iter)) (define t0 r0) (define r1 (iter)) (define r2 (iter)) (define r3 (iter)) (define r4 (iter)) (define r5 (iter)) (iter) ; r0 now has #f, since the iterator finishes by returning to the initial caller, which is the ; body of '(define r0 (iter))', thus re-binding r0. However, t0 has the original returned value. ; Please see the 'levenshtein*test.scm' files in the levenshtein egg for more examples.
Copyright (c) 2005, 2006, Kon Lovett. All rights reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the Software), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED ASIS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.