Language Processing Tools on the Internet.
An overview
Date: 20 January, 2000
By: Jan van Gemert
Introduction
This document provides a preliminary survey of the natural language processing tools available on the Internet. Because of the large quantity of information found, it cannot be expected to reveal more than the tip of the iceberg. For each Internet site, the document gives the URL and describes the relevant contents to some extent.
Most interesting are the first three sites (typically the last to be found). They all present information about natural language processing in the form of a repository or some kind of index, and they are not bound to a single company or research center.
The latter two sites were found first and are more specific. They each represent a single company or university, and so are also included in this document. But the first three sites provide a more general entry point.
There are several types of related subjects. Natural language processing is related to parsers, lexicons, grammars, knowledge bases, etc…
The most general pages found:
*Cambridge Computer Science Research Centre
*The NATURAL LANGUAGE SOFTWARE REGISTRY
*AI Education Repository
*ISSCO
*Multext
Cambridge Computer Science Research Centre
http://www.cam.sri.com/
HIGHLIGHT is SRI Cambridge's natural language processing system for information extraction from text.
HIGHLIGHT takes input texts and extracts information to fill slots in a template. A slot is an empty place in a table where some information can be placed. A template is like a form to fill in. It contains a number of slots which are completed automatically by the IE engine as it scans the texts. This method of extracting information by filling in forms ensures that only relevant information is discovered. HIGHLIGHT can be used in any application where large quantities of text would otherwise have to be read by an expert. The accuracy of HIGHLIGHT will never approach that of a skilled human information analyst, because the expert can 'read between the lines' and bring a great deal of knowledge to bear. HIGHLIGHT, on the other hand, can process large volumes of text very quickly, and may discover information that the skilled human analyst would not have time to look at at all.
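The slot-and-template idea described above can be sketched in a few lines of Python. This is only a minimal illustration and not HIGHLIGHT's implementation: the slot names (company, amount, year), the regular expressions, and the function fill_template are all assumptions invented for this example.

```python
import re

# Hypothetical template: slot name -> regular expression with one capture group.
# The slots and patterns are illustrative assumptions, not HIGHLIGHT's own.
TEMPLATE = {
    "company": re.compile(r"\b([A-Z][A-Za-z]+ (?:Ltd|Inc|Corp))\b"),
    "amount": re.compile(r"\$(\d+(?:\.\d+)?) million"),
    "year": re.compile(r"\bin (\d{4})\b"),
}


def fill_template(text):
    """Return a dict with one entry per slot; unfilled slots stay None."""
    filled = {}
    for slot, pattern in TEMPLATE.items():
        match = pattern.search(text)
        filled[slot] = match.group(1) if match else None
    return filled


if __name__ == "__main__":
    sample = "Acme Ltd reported profits of $12.5 million in 1999."
    print(fill_template(sample))
```

Text that matches none of the patterns leaves every slot empty, which mirrors the point that only information relevant to the template is extracted.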
Very large index of all sorts of tools:
http://www.cs.columbia.edu/~radev/u/bin/search-index.cgi?database_name=acl&keywords=tools&max_output=1000
See also, for a directory structure of RESOURCES:
http://www.cs.columbia.edu/~radev/u/db/acl/html/RESOURCES/
Directory listing for RESOURCES:
ARIES Natural Language Tools
Bibliography [DIR: 20 entries] ...
Books [DIR: 40 entries] ...
Corpora [DIR: 68 entries] ...
Courses [DIR: 22 entries] ...
Dictionaries [DIR: 29 entries] ...
Electronic mailing lists [DIR: 13 entries] ...
Journals [DIR: 25 entries] ...
Language and Linguistic Science information sources
Language-specific resources (e.g. German, Italian) [DIR: 8 entries] ...
Linguistic News Usenet News: Mailing Lists: Resources:
Miscellaneous FTP sites [DIR: 4 entries] ...
On-line resources [DIR: 5 entries] ...
Other comprehensive sites [DIR: 35 entries] ...
Papers [DIR: 12 entries] ...
Software on the Internet [DIR: 225 entries] ...
The RELATOR language resources server
Usenet newsgroups [DIR: 6 entries] ...
The NATURAL LANGUAGE SOFTWARE REGISTRY
http://www.dfki.de/lt/registry/general.html
The NATURAL LANGUAGE SOFTWARE REGISTRY (NLSR) is a concise summary of the capabilities and sources of language processing software available to researchers. It comprises academic, commercial and proprietary software with theory, specifications and terms on which it can be acquired clearly indicated.
The Natural Language Software Registry:
http://www.dfki.de/lt/registry/sections.html
Provides links to software and tools for:
Speech Signal Analysis
Morphological Analysis
Syntactic Analysis
Formalisms
Semantic and Pragmatic Analysis
Generation
Knowledge Representation Systems
Multicomponent Systems
NLP-tools
Data Sets
Applications and text processing
AI Education Repository
http://www.cacs.usl.edu/~manaris/ai-education-repository/index.html
Welcome to the Artificial Intelligence Education Repository. This repository is a central registry of (and distribution point for) resources related to Artificial Intelligence (AI) education.
It contains information on AI textbooks; pointers to syllabi, sample programming assignments, and sample written assignments; on-line tutorials on specific AI topics; tools and environments for the classroom or lab (general and specific); papers related to AI pedagogy; and mechanisms for sharing your own AI education resources with the AI community.
Natural Language Processing tools:
http://www.cacs.usl.edu/~manaris/ai-education-repository/nlp-tools.html
Freeware Tools:
ALE (Attribute Logic Engine)
Description: ALE is an environment that integrates phrase structure parsing and constraint logic programming with typed feature structures. It can handle several formalisms including HPSG, PATR-II, DCG grammars, and Prolog, Prolog-II, and LOGIN programs. Sample grammars are provided with the distribution.
Platforms: Platforms with SICStus Prolog, or Quintus Prolog.
Source: The latest version is available from the CMU Artificial Intelligence Repository.
Reference: Additional information is available from the CMU Artificial Intelligence Repository.
Contact: carp@lcl.cmu.edu.
CGPARSER
Description: CGParser is a linear parser of Conceptual Graphs. It was written using the YACC compiler generator utility. The distribution includes examples of various levels of complexity for testing purposes.
Platforms: UNIX.
Source: The latest version is available from the Consortium for Lexical Research.
Reference: Additional information is available from the Consortium for Lexical Research.
Contact: hdp@nmsu.edu.
CHARON
Description: CHARON is an environment for the development and testing of LFG grammars. It integrates parsers, semantic components and the generator, and provides a user-interface for the compilation and the testing of LFG grammars.
Platforms: UNIX.
Source: The latest version is available from ftp.ims.uni-stuttgart.de.
Reference: Additional information is available from ftp.ims.uni-stuttgart.de.
CHAT (Conversational Hypertext Access Technology)
Description: CHAT is a computer program developed by Communications Canada that provides easy access to electronic information. CHAT provides a natural-language interface that allows users to ask English questions and receive answers. (The software can also be adapted to other languages.) The interface is much easier to use than traditional menu or keyword systems, and it is ideally suited for situations where people have little knowledge of computers.
Platforms: PC and UNIX.
Source: The latest version is available for license. Free access is provided via WWW from debra.dgbt.doc.ca.
Reference: Additional information is available from debra.dgbt.doc.ca.
Contact: thom@dgbt.doc.ca.
Conc
Description: Conc is used for producing concordances of texts. It also produces a frequency index for each word in the text. It displays the original text, the concordance, and the index each in synchronized windows.
Platforms: Mac.
Source: The latest version is available from the Consortium for Lexical Research.
Reference: Additional information is available from the Consortium for Lexical Research.
Contact: antworth@am.dallas.sil.org.
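What a concordance and a word-frequency index contain can be illustrated with a short Python sketch. It does not reproduce Conc itself: the function names (tokenize, frequency_index, concordance) and the keyword-in-context layout with a fixed number of context words are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_index(tokens):
    """Map each word to the number of times it occurs."""
    return Counter(tokens)


def concordance(tokens, keyword, width=2):
    """Return every occurrence of keyword with `width` words of context."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


if __name__ == "__main__":
    words = tokenize("The cat sat. The dog saw the cat.")
    print(frequency_index(words).most_common(3))
    for left, key, right in concordance(words, "cat"):
        print(f"{left:>12} [{key}] {right}")
```

Each concordance line keeps the keyword together with its left and right neighbours, so the index entry for a word can be traced back to the places where it occurs in the text.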
ELIZA
Description: This is the classic NLP program by Weizenbaum. It allows for a simple first assignment in NLP. Students are asked to develop a new knowledge base for some domain other than the classic psychoanalyst-patient one.
Platforms: PC, Mac, VAX, UNIX and others.
Source: The latest version is available from the CMU Artificial Intelligence Repository.
Reference: N/A.
Contact: N/A.
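The kind of knowledge base such an assignment asks for can be sketched as an ordered list of pattern-response rules. The Python below is a simplified illustration, not Weizenbaum's program: the travel-agent domain, the RULES list, and the respond function are assumptions, and the pronoun-swapping step of the original ELIZA is left out.

```python
import re

# Hypothetical knowledge base for a travel-agent domain: (pattern, response).
# Rules are tried in order; the last rule is a catch-all.
RULES = [
    (re.compile(r"\bi want to go to (.+)", re.I), "Why do you want to visit {0}?"),
    (re.compile(r"\bi am afraid of (.+)", re.I), "How long have you been afraid of {0}?"),
    (re.compile(r"\b(?:hello|hi)\b", re.I), "Hello. Where would you like to travel?"),
    (re.compile(r".*"), "Please tell me more."),
]


def respond(utterance):
    """Return the response of the first rule whose pattern matches."""
    cleaned = utterance.strip().rstrip(".!?")
    for pattern, response in RULES:
        match = pattern.search(cleaned)
        if match:
            # Captured text from the user's input is echoed back in the reply.
            return response.format(*match.groups())
    return "Please tell me more."


if __name__ == "__main__":
    for line in ["Hi there", "I want to go to Paris.", "It rains"]:
        print(respond(line))
```

Moving to a new domain then only means replacing the entries in RULES; the matching loop stays the same.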
ENGLEX
Description: Englex is a lexicon for morphological analysis of English text. It is intended for use with PC-KIMMO (or programs that use the PC-KIMMO parser, such as KTEXT). Combined with software, it facilitates production of sets of records of the morphological constituents in English texts.
Platforms: PC, Mac, and UNIX.
Source: The latest version is available from the Consortium for Lexical Research.
Reference: Additional information is available from the Consortium for Lexical Research.
Contact: evan@txsil.lonestar.org.
FLEX (Fast Lexical Analyzer Generator)
Description: FLEX is a generator of lexical pattern recognizers. It is an extension to the UNIX LEX lexical analyzer utility.
Platforms: UNIX.
Source: The latest version is available from ftp.ee.lbl.gov.
Reference: Additional information is available from the Consortium for Lexical Research.
Contact: vern@ee.lbl.gov.
FONOL
Description: Fonol is a programming language for experimenting with Transformation-Grammar-style phonological rules. It also incorporates input and output filters/conditions. It is intended for both phonology students and researchers in that it facilitates understanding of phonological rule fundamentals and helps manage large complex bodies of phonological rules.
Platforms: PC (and platforms with Turbo Pascal).
Source: The latest version is available from the Consortium for Lexical Research.
Reference: Additional information is available from the Consortium for Lexical Research.
Contact: brandon@gamma.is.tcu.edu.
Grammar Workbench
Description: The Grammar Workbench is an environment for the development and analysis of grammars. It is geared towards the AGFL (Affix Grammars over a Finite Lattice) formalism.
Platforms: PC, and Sun.
Source: The latest version is available from hades.cs.kun.nl.
Reference: Additional information is available from hades.cs.kun.nl.
Contact: agfl@cs.kun.nl.
KGEN
Description: KGEN is a program for building morphological parsers for NLP systems. It is an auxiliary program for PC-KIMMO.
Platforms: PC.
Source: The latest version is available from the Consortium for Lexical Research.
Reference: Additional information is available from the Consortium for Lexical Research.
Contact: evan@txsil.lonestar.org.
LINK
Description: LINK is a parser for Link Grammar, a context-free formalism for the description of natural language. It also includes a Link Grammar for English.
Platforms: UNIX.
Source: The latest version is available from the Consortium for Lexical Research.
Reference: Additional information is available from the Consortium for Lexical Research.
Contact: sleator@cs.cmu.edu.
Lotec
Description: The Lotec Speech Recognition Package is a simple set of libraries and tools for building single-speaker, small-vocabulary, low-quality continuous speech recognition applications.
Platforms: Sun.
Source: The latest version is available from ftp.sanpo.t.u-tokyo.ac.jp.
Reference: Additional information is available from The Natural Language Software Registry.
Contact: nigel@sanpo.t.u-tokyo.ac.jp.
LT Thistle
Description: LT Thistle is a parameterizable display engine and editor for diagrams, allowing the inclusion of interactive diagrams within Web pages. Originally designed for use with linguistic diagrams, we envisage widespread application within other areas involving the presentation and interpretation of highly structured information. It is available free of charge for non-commercial purposes.
Platforms: Java
Source: http://www.ltg.ed.ac.uk/software/thistle/demos/index.html
Reference: http://www.ltg.ed.ac.uk/software/thistle/index.html
Contact: Jo Calder J.Calder@ed.ac.uk
OGI Speech Tools
Description: The OGI Speech Tools are a set of speech data manipulation tools including an X Windows display tool (Lyre) for displaying data in a time synchronous fashion, a Neural Network training package, a set of C library routines (LIBNSPEECH) for speech data manipulation, a set of sound-file format conversion utilities, and a set of Perl scripts for automating the use of the above tools.
Platforms: UNIX.
Source: N/A.
Reference: N/A.
Contact: tools@cse.ogi.edu.
PC-KIMMO
Description: PC-KIMMO is a popular program among computational linguists, descriptive linguists, and NLP system developers. It generates and/or recognizes words using a two-level model of word structure, i.e., a lexical-level form, and a surface-level form.
Platforms: PC, Mac, UNIX.
Source: The latest version is available from the Consortium for Lexical Research.
Reference: Additional information is available from the Consortium for Lexical Research, and PC-KIMMO: A Two-Level Processor for Morphological Analysis by Evan L. Antworth, published by the Summer Institute of Linguistics (1990).
Contact: evan@txsil.lonestar.org.
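The relation between a lexical-level form and a surface-level form can be illustrated with a toy example. The Python sketch below is a strong simplification and not how PC-KIMMO works internally: real two-level rules are constraints on lexical/surface symbol pairs, whereas this example uses one regular-expression rewrite. The e-insertion rule, the "+" boundary symbol, and the functions generate and recognize are assumptions made for illustration.

```python
import re

# Toy e-insertion rule: a morpheme boundary "+" before a final "s" is
# realised as "e" after a sibilant (s, x, z, ch, sh); otherwise the
# boundary is simply deleted.
SIBILANT_BOUNDARY = re.compile(r"(?<=[sxz])\+(?=s$)|(?<=[cs]h)\+(?=s$)")


def generate(lexical):
    """Map a lexical-level form such as 'fox+s' to its surface-level form."""
    surface = SIBILANT_BOUNDARY.sub("e", lexical)
    return surface.replace("+", "")


def recognize(surface, lexicon):
    """Return the lexical forms built from `lexicon` that yield `surface`."""
    candidates = list(lexicon) + [stem + "+s" for stem in lexicon]
    return [form for form in candidates if generate(form) == surface]


if __name__ == "__main__":
    print(generate("fox+s"))
    print(recognize("cats", ["fox", "cat"]))
```

Generation goes from the lexical level to the surface level, and recognition runs the same rule in the other direction by testing candidate lexical forms.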
SAX (Sequential Analyzer for syntaX and semantics)
Description: SAX is a syntactic analyzer for Definite Clause Grammar. It employs a bottom-up and breadth-first parsing algorithm. Distribution includes a Japanese grammar and some sample Japanese data.
Platforms: Platforms with SICStus Prolog.
Source: The latest version is available from the Consortium for Lexical Research.
Reference: Additional information is available from the Consortium for Lexical Research.
Contact: N/A.
SYNTACTICA
Description: SYNTACTICA is a system for grammar development with a simple graphical user interface. It is intended for use in introductory syntax classes, or introductory linguistics classes with a syntax component.
Platforms: NextStep.
Source: The latest version is available from the Consortium for Lexical Research.
Reference: Additional information is available from the Consortium for Lexical Research.
Contact: rlarson@semlab1.sbs.sunysb.edu.
Commercial Tools:
Alvey Natural Language Tools (ANLT)
Description: The Alvey Natural Language Tools is a set of tools for use in natural language processing research. These include a morphological analyzer, parsers, a grammar and a lexicon. They can be used independently or with a grammar development environment to form a complete system for the morphological, syntactic and semantic analysis of a considerable subset of English.
Platforms: UNIX.
Source: N/A.
Reference: Additional information is available from ftp.cl.cam.ac.uk.
Contact: N/A.
ALEP
Description: The Advanced Language Engineering Platform (ALEP) is a versatile and flexible general purpose NLP platform. It is independent of formalism, incorporates a number of standards such as SGML, ISO character sets, and MOTIF, and comes with a graphical user interface, extensive on-line documentation, and various tools for text handling, linguistic processing, and debugging.
Platforms: Platforms supporting Prolog by BIM 4.0.5, ClauseDB 2.0, GNU Emacs 19.19, OSF/MOTIF 1.2.
Source: Available on tape through contact below.
Reference: Additional information is available from The Natural Language Software Registry.
Contact: Mr. N. K. Simpkins, Cray Systems, ALEP Support, 11b Bvd Joseph II, LUXEMBOURG, L-1840 LUXEMBOURG.
CSRE -- Canadian Speech Research Environment
Description: CSRE is designed to support speech research by providing a powerful, low-cost facility using mass-produced and widely-available hardware. Functions include speech capture, editing, and replay, spectral analysis procedures, 3D displays, parameter extraction/tracking and tools to automate measurement and support data logging. CSRE components include a speech editor, a time-domain analyzer, a spectral analyzer, a formant tracker, a pitch tracker, a speech synthesizer, an acoustic signal synthesizer, and an experiment generator/controller.
Platforms: PC.
Source: N/A.
Reference: Additional information is available from The Natural Language Software Registry.
Contact: Donald G. Jamieson, University of Western Ontario, Hearing Health Care Research Unit, Communicative Disorders, London, Ontario N6G1H1, Canada.
ESPS - Entropic Signal Processing System
Description: ESPS is a set of signal and speech processing utilities. Their functionality includes spectrum analysis, time series manipulation, pattern classification, file manipulation, plotting, speech processing, data I/O and conversion, filter design, and filtering.
Platforms: SUN, SGI, HP 9000/700, or DEC 3100/5000 and ALPHA computers running UNIX.
Source: N/A.
Reference: Additional information is available from The Natural Language Software Registry.
Contact: Ken Nelson, Director of Sales and Marketing, Entropic Research Laboratory, 600 Penn. Ave. S.E., Suite 202, Washington, D.C., USA 20003.
Natural Language (TM)
Description: Natural Language (TM) is an extensible natural language interface to relational SQL databases. It employs a parser, semantic interface, natural language generator, and a deductive system that interprets English questions in the context of the specific applications. Its extension mechanism, Intelligent Connector (ICon), may be used to customize Natural Language to specific applications.
Platforms: MS-Windows, VMS, and UNIX.
Source: N/A.
Reference: Additional information is available from The Natural Language Software Registry.
Contact: Cilla DeVries, Natural Language Inc., Marketing Department, 1125 Atlantic Avenue, Alameda, CA 94501, U.S.A.
NL Builder (TM)
Description: NL Builder (TM) may be used to develop NLP applications or experiment with various linguistic components. It consists of a tokenizer, a dictionary, a morphological analyzer, a parser, a semantic interpreter, a semantic network KRL, lexical acquisition tools, "C" hooks, and a debugger.
Platforms: PC, Mac, Apollo, Sun, VAX, NeXT, and others.
Source: N/A.
Reference: Additional information is available from The Natural Language Software Registry.
Contact: Edwin R. Addison, Synchronetics, Inc., 301 N. Front St., Baltimore, MD 21202, U.S.A.
Parser
Description: Parser is designed to perform grammatical tagging and parsing of English. It consists of a morphological analyzer, a parser, and corresponding generators. This system comes with a large collection of CFPSG rules and English lexicon entries. It can handle arbitrarily large texts with typical accuracy exceeding 98%. It provides both a command-line interface and a graphical user interface. It may generate a phrase-structure parse tree for each input sentence.
Platforms: PC.
Source: N/A.
Reference: Additional information is available from The Natural Language Software Registry.
Contact: Dr. Mike Oakes, Prospero Software Ltd, 190 Castelnau, London SW13 9DH, England.
ISSCO
http://issco-www.unige.ch/
ISSCO is a research laboratory attached to the University of Geneva, and conducts basic and applied research in computational linguistics (CL), and artificial intelligence (AI).
Tools:
LHIP V2.0 (a left-head-corner island parser compiler)
LHIP is a compiler which turns an extended form of Prolog DCG-like grammars into island parsers in Prolog. The program has been tested under SICStus 0.6 but should run under any compatible Prolog. This directory also contains a LaTeX version of a paper which describes the concept behind LHIP.
ELU Environnement Linguistique d'Unification
ELU (Environnement Linguistique d'Unification) is an enhanced PATR-II style environment for linguistic development written and developed at ISSCO. As its name indicates, ELU is based on unification and its purpose is the development of computational linguistic applications in general. It provides a declarative environment which allows linguists to write grammars that can be used both for parsing and for generation. In addition to these two standard functions of a linguistic development environment, ELU also supports a transfer component. Together, these three components allow the development of a system which can analyze a text in one language and generate its translation in another language, making ELU particularly suitable for experimenting with machine translation.
Dictionaries, downloadable or paid:
http://issco-www.unige.ch/resources/Linguistics/payants-angl.html
Britannica Online
Encyclopaedia Britannica; free trial for 7 days.
Dictionary to Download
List of dictionaries to download.
Larousse Dictionary
Paid, with the possibility of a free trial.
Freeware Dictionary for Windows to download.
1. Wordlist French (160k)
2. Wordlist German (436k)
3. Wordlist Italian (153k)
4. Wordlist Spanish (121k)
5. Wordlist Portuguese (37k)
WordWeb thesaurus/dictionary
Monolingual English dictionary to download.
Information Diccionario butt
English-Spanish dictionary to download, for Windows 95
The Internet Project Dictionary:
Dictionary to be downloaded; list; English -> French, Spanish, German, Italian, Portuguese.
Projects
http://issco-www.unige.ch/projects/index.html#ewgs
Research Themes
Grammars and Formalisms for Natural Language Processing
Semantics and Pragmatics of Language
Language Corpora
Evaluation of Language Systems
European Projects
DicoPro - On-Line Dictionary Consultation For Language Professionals On Intranet
Diagnostic and Evaluation Tools for Natural Language Applications (DiET)
EAGLES I Evaluation Working Group Final Report
EAGLES II
Test Suites for Natural Language Processing (TSNLP)
Multilingual Text and Tools (MULTEXT)
Multilingual Corpora for Cooperation (MLCC)
A Testbed Study of Evaluation Methodologies: Authoring Aids (TEMAA)
TRANSTERM
Grammars which are Reusable to Automatically Analyze Languages (GRAAL)
Projects of the Swiss National Science Foundation
Definition and Exploitation for Sublanguage Description for MT in a finite domain, part I and II
Free text medical document retrieval by using terminological resources and statistical linguistics
Projects of the French Speaking Community
Observatoire Suisse des Industries de la Langue
Alignment of Bi- and Multilingual Corpora
Other Projects
Environnement Linguistique d'Unification (ELU)
Semantic Modelling
Dico - A network based dictionary consultation tool
Spoken Language Translator (SLT)
Multext
http://www.lpl.univ-aix.fr/projects/multext/MUL7.html
Multext encompasses a series of projects whose goals are to develop standards and specifications for the encoding and processing of linguistic corpora, and to develop tools, corpora and linguistic resources embodying these standards. Multext is developing tools, corpora, and linguistic resources for a wide variety of languages, including Bambara, Bulgarian, Catalan, Czech, Dutch, English, Estonian, French, German, Hungarian, Italian, Kikongo, Occitan, Romanian, Slovenian, Spanish, Swedish and Swahili. All Multext results are made freely and publicly available for non-commercial, non-military purposes.
Multext is developing a series of tools for accessing and manipulating corpora, including corpora encoded in SGML, and for accomplishing a series of corpus annotation tasks, including token and sentence boundary recognition, morphosyntactic tagging, parallel text alignment, and prosody markup. Annotation results may also be generated in SGML format.
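Two of the annotation tasks named above, sentence boundary recognition and token recognition, can be sketched in Python as follows. This is not the Multext segmenter: the abbreviation list, the whitespace-based splitting, and the functions split_sentences and tokenize are simplifying assumptions chosen for a short example.

```python
import re

# Abbreviations that end in a period without ending a sentence
# (an illustrative, deliberately tiny list).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """Close a sentence at ., ! or ? unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word[-1] in ".!?" and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Split a sentence into word tokens and punctuation tokens."""
    tokens = []
    for word in sentence.split():
        if word.lower() in ABBREVIATIONS:
            tokens.append(word)  # keep the abbreviation and its period together
        else:
            tokens.extend(re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", word))
    return tokens


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Smith arrived. He sat down!"):
        print(tokenize(sentence))
```

The abbreviation check is what keeps a period after a title from being taken as a sentence boundary; a real segmenter would need language-specific resources for this.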
Tools under development:
Multilingual text editor
MtScript - Multilingual text editor
SGML manipulation tools
MtSgmlQL - SGML query language interpreter
SAM tools - SGML API
Text segmentation tools
MtSeg - Text segmenter
Morpho-lexical tools
MtLex - Multext lexical access tools
MtTag - Multext POS disambiguator and related utilities
MtMorph - Multext morphological tools
Multilingual text alignment
MtAlign - Multilingual text aligner
Speech Workbench
MES-SIGNAIX - Speech Signal editor and processing tools
The Multext Prosody Tools (rtf.gz) [82K]
The Multext Prosody Tools: Tutorial (rtf.gz) [380K]
Libraries and utilities
MtStr - Multilingual string library
MtRecode - Character conversion program