Found in Translation

DARPA SEEKS AUTOMATED TECHNOLOGY TO UNDERSTAND AND ANALYZE FOREIGN-LANGUAGE INTELLIGENCE.
Spurred by the military and intelligence communities’ growing need to translate and retrieve pertinent foreign-language intelligence, the Defense Advanced Research Projects Agency (DARPA) has launched a program aimed at improving automated, searchable translations.
The Global Autonomous Language Exploitation (GALE) program is focused on developing, integrating and applying technologies that will analyze and translate huge volumes of speech and text in multiple languages. It will be deployed through three processing engines that will transcribe, translate and distill pertinent information.
The transcription engine will convert audio (speech) into English text, while the translation engine will convert text from other languages into readable English, with annotations regarding language of origin, topics, parts of speech, names and other factors. The distillation engine will search for and integrate information from multiple sources that is relevant to specific queries. It will discard repetition to reduce the volume of information while retaining important content.
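The distillation step described above, keeping what is relevant to a query and discarding repetition, can be illustrated with a minimal sketch. This is not GALE's method: the function names, the word-overlap measure and the 0.8 threshold are all assumptions chosen for illustration, and a real engine would rely on far richer language analysis.

```python
import re


def tokens(text):
    """Lowercase word set used for both matching and overlap checks."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def distill(passages, query, max_overlap=0.8):
    """Keep passages that mention every query word, skipping near-repeats.

    Hypothetical helper: the overlap threshold is an illustrative guess.
    """
    wanted = tokens(query)
    kept = []
    for passage in passages:
        words = tokens(passage)
        if not wanted <= words:
            continue  # not relevant to the query
        if any(jaccard(words, tokens(k)) >= max_overlap for k in kept):
            continue  # repeats something already kept
        kept.append(passage)
    return kept


# Made-up passages standing in for already-translated source material.
passages = [
    "The minister visited Cairo on Monday.",
    "The minister visited Cairo on Monday",
    "Oil prices rose sharply.",
    "On Tuesday the minister met reporters.",
]
result = distill(passages, "minister")
print(result)
```

Here the second passage is dropped as a repeat of the first and the third as irrelevant, which leaves two passages for the analyst to read.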
“GALE’s distillation engine will allow computerized searches that will return just the specific context and sense desired, even when the text was not originally in English,” said DARPA spokeswoman Jan Walker. For example, an analyst will be able to ask the GALE distillation engine to produce a biography of a person, provide information about an organization during a specified period or find statements made by certain people at specified times on designated topics.
Three prime contractor teams are developing GALE technology: BBN Technologies, SRI International and IBM Watson Research Center. While the companies meet weekly to coordinate data collection and system evaluation, each is developing its own technology.
GALE is the successor to two previous DARPA contracts—Translingual Information Detection, Extraction and Summarization (TIDES), which was for text only, and Effective Affordable Reusable Speech-to-Text (EARS), for speech recognition.
However, neither of these contracts included GALE’s current efforts to integrate technologies and to deliver distilled output using advanced search technology.
IBM was also a prime contractor on the TIDES program. “The high-level difference between GALE and TIDES is the focus on distillation of the information in English regardless of the language or medium input source. We will focus the research on advancing the technology to preserve meaning as much as possible,” said Salim Roukos, senior manager for multilingual natural language technologies at IBM Research.
The IBM Research team uses data from United Nations proceedings in six languages (English, Spanish, French, Arabic, Chinese and Russian) to build statistical machine translation systems between any pair of the six parallel texts. “Our goal is to improve the accuracy of the translation process and produce focused answers to a user’s question,” said Roukos.
SEMANTIC MODEL
DARPA awarded BBN a contract with initial funding for one year to develop GALE with its subcontractors. The company’s tasks include speech recognition and speech-to-text transcription, machine translation and the integration of transcription and translation.
There are also three additional tasks, beginning with development of a semantic model for languages’ data structures to pad any bare-bones translations with meaning. “Statistical methods typically do not include semantic models. All they have is the data and the translation but not the meaning, so the semantic model provides that,” said John Makhoul, BBN chief scientist.
The second involves the extraction and distillation of relevant translated language to prevent users from having to spend time sifting through volumes of irrelevant data. The third is an operational system that integrates the various technologies. BBN has deployed a number of such systems for government applications. The machine translation component of these systems is supplied by Language Weaver.
Language Weaver is a spinout of the University of Southern California’s Information Sciences Institute, which worked on the TIDES program to achieve word-to-word and phrase-to-phrase translation alignment. However, relying solely on word and phrase alignment does not produce accurate results. The words and phrases must also be in the right order in a sentence for the translation to make sense.
Under GALE, Language Weaver will move beyond translation alignment into the more sophisticated realms of grammar and syntax to assure that translations are grammatically correct.
Working with the Information Sciences Institute, Language Weaver is running experiments to find the mathematical changes that are needed to achieve further advancements in automated translation technology. “We are using statistical data approaches to learn where grammatical errors are [in translations] and to learn how to fix them automatically,” said Bryce Benjamin, chief executive officer of Language Weaver.
“We are always improving the fundamental algorithms that are used to translate language,” said Daniel Marcu, the company’s chief technology officer.
Serving both as an integrator and a provider of deployable technology, Language Weaver plans to expand on its base product, Statistical Machine Translation Software (SMTS), which was released in 2003 and is now in Version 4.0. The company is supplying translation software for the systems that will be deployed, and BBN is providing the speech summarization and entity extraction portions of the technology, said Benjamin.
BBN also operates the Broadcast Monitoring System, which creates a continuous searchable archive of international TV broadcasts. The real-time audio stream is automatically transcribed by the BBN Audio Indexer and translated into English with technology from Language Weaver. Both the transcript and translation are searchable and synchronized to the video, providing capabilities for retrieval and playback of the video based on its speech content.
NATURAL LANGUAGE
While GALE seeks to explore alternative language approaches to translation to assure a high level of accuracy, the program also requires an end product that is usable by the soldier and analyst in the field.
There is a difference between research-level translation systems developed in the lab, which operate with no memory or speed constraints, and production systems, which need to translate text in real time within the constraints of desktop-like systems. “We have put substantial effort into implementing sophisticated engineering processes in order to ensure that the software can operate within these constraints at little or no loss in translation quality,” said Marcu.
Another subcontractor, Language Computer, focuses on distillation by adding question-answering products to the mix. “Our technology allows users to input questions or problem statements in the natural language form, not in a query form, in order to elicit specific answers,” said Dan Moldovan, president and chief executive officer.
Natural language form is that which is normally used, such as in the question: “Who is the mayor of Concord?” A query would simply supply the words, “mayor, Concord.” Using Language Computer’s Power Answer and Power Tools products, answers to a natural language question are more precise and require less distillation. Queries, by contrast, produce long lists, through which users must take time to search and sort to find the right answer.
In addition, natural language format lets users ask more specific and complicated questions than a simple query allows. For example, a user can ask a natural language question such as: “What countries did the mayor of Concord visit between 1990 and 1995?”
Power Answer uses advanced harvesting tools to access, collect and index large volumes of structured or unstructured information. Power Tools includes three functions: PowerIndex, which integrates content; PowerOntology, which continuously builds domain-specific concepts to expand available information; and Document Manager, which manages structured and unstructured data by keeping track of multiple revisions, data changes and document locations. The module integrates with PowerIndex to provide an overview of content resources and fresh knowledge dissemination.
“Our product is being adapted to GALE by expanding the question format to include problem statements,” said Moldovan. “Our product understands problem statements. For example, if you ask it to describe the relationship between a specific organization and an event, the system extracts all the relevant information—the nuggets, such as meetings, financial transactions, phone conversations, flights and people.”
GALE is supporting new research in language technology in order to distinguish between irrelevant or “noisy” data and important data, to combine speech with language translation and sophisticated text understanding, and to explore the various approaches to search and translation technology.
One school of thought suggests that it is more effective to query and search in the source language and then translate into English than to translate the volumes into English and then query in English. By translating all data into English and then searching in English, it is possible to miss important information that could be found only by searching in the native language before translating.
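The search-first ordering can be sketched in a few lines. Everything in this sketch is an assumption made for illustration: the function name, the simple substring match, the Spanish sample documents and the stub translator, which only records how often it is called. It shows the ordering argument, not any contractor's system.

```python
def triage_then_translate(documents, source_terms, translate):
    """Search in the source language first, then translate only the hits."""
    hits = [doc for doc in documents
            if any(term in doc.lower() for term in source_terms)]
    return [translate(doc) for doc in hits]


calls = []  # records each document handed to the translator


def stub_translate(doc):
    """Placeholder for a real machine translation call (hypothetical)."""
    calls.append(doc)
    return "[EN] " + doc


# Made-up source-language documents.
documents = [
    "El alcalde visitó Francia en 1992.",
    "El precio del petróleo subió ayer.",
    "La reunión del alcalde fue breve.",
]
translated = triage_then_translate(documents, ["alcalde"], stub_translate)
print(len(calls), "of", len(documents), "documents translated")
```

Because the query term is matched against the original text, nothing is lost to a mistranslation before the search, and only the documents that matter incur translation cost.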
DOCUMENT TRIAGE
“Document triage in foreign language is one of the biggest problems facing government today, and the GALE approach attacks the problem in the reverse order. It is better to search and triage first and translate only after the important documents have been identified,” said Carl Hoffman, chief executive officer of Basis Technology.
Basis Technology provides foreign-language conversion for intelligence that focuses on name translation. For instance, the Department of Defense uses the Basis Technology Arabic Name Translator software to extract and translate names from source intelligence.
Because of the many nuances and broad differences between languages, implicit meaning can be ambiguous and translations are not always accurate. In addition, the English spelling of foreign words varies from agency to agency. The way the Library of Congress spells a foreign word can be different from the spellings used by the media or intelligence community.
For example, the media spell Al Qaeda differently from the al-Qa’idah spelling used in the intelligence community. The apostrophe in the name is an English imposition to represent a glottal stop in the pronunciation. And “Qaeda” can mean “base” when it’s a noun or “seated” or “sitting” when it’s an adjective.
There is also a difference between orthographic accuracy and phonetic accuracy in transliteration. How to pronounce a word is different from how to write or how to translate it. So analysts can risk missing important information because of the potential for inconsistent transliteration.
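One common way to reduce the risk of missing a name because of inconsistent transliteration is to compare spellings loosely rather than exactly. The sketch below is an assumption-laden illustration, not Basis Technology's method: the function names and the 0.75 similarity threshold are hypothetical, and it uses only the generic string comparison in Python's standard library.

```python
import re
from difflib import SequenceMatcher


def name_key(name):
    """Lowercase the name and drop spaces, hyphens and apostrophes."""
    return re.sub(r"[^a-z]", "", name.lower())


def same_name(a, b, threshold=0.75):
    """Guess whether two romanized spellings refer to the same name.

    The threshold is an illustrative value, not a tuned one.
    """
    score = SequenceMatcher(None, name_key(a), name_key(b)).ratio()
    return score >= threshold


# \u2019 is the typographic apostrophe used in the second spelling.
print(same_name("Al Qaeda", "al-Qa\u2019idah"))
print(same_name("Al Qaeda", "Hamas"))
```

A loose match of this kind trades precision for recall: it catches spelling variants, but an analyst still has to confirm that a hit is the intended name.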
Take the full name of the number two man in Al Qaeda. Ayman Muhammad Rabi al-Zawahiri is the official intelligence community spelling. Ayman is a proper noun, but it is also an adjective that means “right hand” or “lucky,” while Zawahiri means “phenomenal.” And in Arabic, there is no upper case, lower case or hyphen. Rather, these are impositions by the English translation as a way to make the name more understandable to English speakers.
For instance, “green” in English is a color when it begins in lower case but a name when it begins with upper case. Another example is “Dr.,” which can mean either doctor or Drive, as in the name of a street. The number two man in Al Qaeda is referred to as “the doctor” or “the authorized one,” but the process of translation could either mistranslate it or take too long.
“If I’m searching for ‘the doctor,’ it would catch too many documents and take too long,” said Hoffman.
“One of the other challenges in Arabic is that it is spoken with vowels but written without vowels because the person speaking it understands from context where vowels go,” said Hoffman. “How you insert vowels is important in order to understand the meaning of words accurately.”
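The vowel problem is easy to see in Unicode, where Arabic short vowels are stored as separate combining marks placed on the consonants. The sketch below is illustrative only and the function name is hypothetical. It strips every nonspacing mark, which is broader than Arabic vowel signs alone, to show two differently vowelled words collapsing into the same written form.

```python
import unicodedata


def strip_vowel_marks(text):
    """Remove nonspacing marks, which include the Arabic short-vowel signs."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# kataba, "he wrote": kaf, teh and beh, each carrying a fatha (U+064E).
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
# kutub, "books": the same consonants with damma (U+064F) on the first two.
kutub = "\u0643\u064F\u062A\u064F\u0628"

written = strip_vowel_marks(kataba)
print(written == strip_vowel_marks(kutub))
```

Both words are normally written as the same three consonants, so software has to use context to decide which vowels, and therefore which meaning, the writer intended.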
IBM is in the process of investigating both approaches to the problem—issuing a query in the source language versus in English. While it is less expensive to issue a query in English, it is also less accurate. It might require more computing time to search in the native language, but the results can be more accurate. “It is more effective to do a query in the source language. We have found it experimentally to be better,” said Roukos.
While the research and development of language translation technology is an ongoing effort, there is no question about the need for it. A DoD directive, the Defense Language Program (DLP), issued as an update last year, called for improving the military’s foreign language capabilities, including both trained personnel and automated technology.
In addition, CIA Director Porter Goss has called for his agency to increase the number of officers who are tested and proficient in mission-critical languages by 50 percent. He also has urged the agency to develop and employ information technology tools to assist in the processing and use of information in foreign languages.
DARPA’s GALE program is directly addressing a cross-agency requirement for increased foreign language skills and searchable language-translation technology. ♦






