Text extraction is a process of extracting text from a document. This can be done manually or automatically. Manual text extraction is the process of reading the document and copying the text into a new document. Automatic text extraction is the process of using software to extract the text from the document.
Text extraction can be used for a
variety of purposes. One common purpose is to convert documents into a
machine-readable format. This can be useful for storing documents in a database
or for processing documents with software. Another common purpose is to extract
information from documents. This can be useful for finding specific information
in a document or for summarizing the content of a document.
Large language models (LLMs) are good
for text extraction because they are trained on massive datasets of text and
code, which allows them to learn the relationships between words and phrases.
They can also understand the context of text and generate text, which allows
them to extract information that is not explicitly stated or fill in the gaps
in text that is missing information. The answers from LLMs can also be further
improved through methods like few-shot prompting.
Objective
In this tutorial, you will learn how to
use generative models to extract the information from text by working through
the following examples:
- Google Pixel technical specifications
extraction
- WiFi troubleshooting with constraints
- Respond to inquiries in character
- Converting an ingredients list to JSON format
- Organizing the results of a text extraction
Getting
Started
!pip install google-cloud-aiplatform
--upgrade --user
Colab only: Uncomment the following cell to
restart the kernel or use the button to restart the kernel. For Vertex AI
Workbench you can restart the terminal using the button on top.
Authenticating
your notebook environment
- If you are using Colab to run
this notebook, uncomment the cell below and continue.
- If you are using Vertex AI Workbench,
check out the setup instructions here.
Import
libraries
Colab only: Uncomment the following cell to
initialize the Vertex AI SDK. For Vertex AI Workbench, you don't need to run
this.
from vertexai.language_models import
TextGenerationModel
Import
models
generation_model = TextGenerationModel.from_pretrained("text-bison@001")
Text
Extraction
Google
Pixel technical specifications extraction
In this example, you try to extract the
technical specifications of a Pixel phone from text in JSON format using the
PaLM API.
Extract the technical specifications from the text below in JSON format.
Text: Google Nest WiFi, network speed up to 1200Mpbs, 2.4GHz and 5GHz frequencies, WP3 protocol
JSON: {
"product":"Google Nest WiFi",
"speed":"1200Mpbs",
"frequencies": ["2.4GHz", "5GHz"],
"protocol":"WP3"
}
Text: Google Pixel 7, 5G network, 8GB RAM, Tensor G2 processor, 128GB of storage, Lemongrass
JSON:
"""
print(
generation_model.predict(
prompt, temperature=0.2, max_output_tokens=1024, top_k=40, top_p=0.8
).text
)
WiFi
troubleshooting with constraints
In this example, you ask the generative
model to answer a question about troubleshooting a Google WiFi router based on
the description of the different status lights on the router. The model will
only be able to respond with the text that was provided, which helps to prevent
it from generating potentially harmful or incorrect answers. Here is how you
can do this using the PaLM API.
Respond
to inquiries in character
Now, you instruct the generative model
to pretend to be Klara, a person. You will also tell the model about Klara's
personality traits. Then, you will ask the model to answer a question as Klara
would answer it.
You are Klara.
Klara is an investment manager.
Klara only answers if Klara is sure it is correct.
Klara answers the user question based on the summaries of the pages below.
Klara outputs the Reference ID where Klara found the answer for each sentence in the answer.
Klara also summarizes the part where the information is found.
Summaries of the pages: ['Reference ID 1. UBS wants to be seen as a
global bank with Swiss roots, not just a European bank . New board chairman
Colm Kelleher and CEO Ralph Hamers have held a series of meetings with
influential U.S. fund managers to increase their stakes in the bank . UBS is
one of the most valuable European banks with a price-to-book ratio of 1 .',
'Reference ID 2. Credit Suisse sells 30 percent stake in Swiss asset manager
Energy Infrastructure Partners . EIP specializes in long-term equity
investments for large-scale renewable and energy infrastructure assets . The
transaction may be part of an ongoing effort by the Swiss bank to further close
the capital gap .', 'Reference ID 3. Saudi Arabias Crown Prince Mohammed bin
Salman is preparing to invest in Credit Suisse Group AGs investment bank .
Prince Mohammed would inject about $500 million in the spinoff of CS First Boston
. Other investors could include former Barclays chief Bob Diamonds Atlas
Merchant Capital . Saudi National Bank already has a 9. 9 percent stake in the
troubled Swiss institution .', 'Reference ID 4. US-based Apollo Global
Management is among a group of financial firms in talks with Credit Suisse
about a stake in the revamped investment bank . The Wall Street Journal has
reported that Apollo is also said to be interested in investing in CS First
Boston . The investment banks investment banking division is set to be spun off
into a new unit .', 'Reference ID 5. Geneva-based private bank Pictet has
signed up a new team in its billion-dollar private market investment business .
Edmund Buckley, Nikolaus Hubmann, Sean Howard, Jan Dreesen and Hugo Hickson will
work in the area of direct investments in private equity in the future . The
commitment of Buckleys team can be seen as a coup for Pictet partner Elif AktuÄŸ
.', 'Reference ID 6. UBS wants to be seen as a global bank with Swiss roots,
not just a European bank . New board chairman Colm Kelleher and CEO Ralph
Hamers have held a series of meetings with influential U.S. fund managers to
increase their stakes in the bank . UBS is one of the most valuable European
banks with a price-to-book ratio of 1 .', 'Reference ID 7. US-based Apollo
Global Management is among a group of financial firms in talks with Credit
Suisse about a stake in the revamped investment bank . The Wall Street Journal
has reported that Apollo is also said to be interested in investing in CS First
Boston . The investment banks investment banking division is set to be spun off
into a new unit .', 'Reference ID 8. Luxembourg-based Apex subsidiary European
Depositary Bank (EDB) and securitization specialist Gentwo enter into
partnership . EDB and Gentwo will offer paying agent and banking services to
third-party investors around the world . Gentwo develops platforms for asset
managers, banks, family offices, and venture capitalists .', 'Reference ID 9.
SNB has been pursuing plans to acquire stakes in European and American
financial institutions for some time . Credit Suisse, Julius Baer, Standard
Chartered and the Asian DBS Group have also been mentioned as possible targets
.', 'Reference ID 10. Zuercher Kantonalbank is aiming to expand its private banking
business, including abroad . CEO Urs Baumann sees Credit Suisse as a reliable
partner for the state-owned bank . Baumann will continue to focus on reducing
the banks dependence on the interest differential business in an attempt to
diversify its business .']
Converting
an ingredients list to JSON format
Suppose that you want to itemize
ingredients in recipes to enter into a database, which requires a
well-formatted output like JSON. This can be done using a generative model in
the following way:
prompt = """
Extract the ingredients from the following recipe. Return the ingredients in JSON format with keys: ingredient, quantity, type.
Ingredients:
* 1 onion, chopped
* 2 carrots, chopped
* 2 celery stalks, chopped
* 1 teaspoon ground cumin
* 1/2 teaspoon ground coriander
* 1/4 teaspoon turmeric powder
* 1/4 teaspoon cayenne pepper (optional)
* Salt and pepper to taste
* 1 (15 ounce) can black beans, rinsed and drained
* 1 (15 ounce) can kidney beans, rinsed and drained
* 1 (14.5 ounce) can diced tomatoes, undrained
* 1 (10 ounce) can diced tomatoes with green chilies, undrained
* 4 cups vegetable broth
* 1 cup chopped fresh cilantro
"""
print(
generation_model.predict(
prompt, temperature=0.2, max_output_tokens=1024, top_k=40, top_p=0.8
).text
)
Organizing
the results of a text extraction
In this section, you extract the
information you want from a block of text and organize it in a structured way,
such as separating it by commas. Here you use few-shot prompting to guide the
model to format your outputs to be separated by commas.
prompt = """
Message: Rachel Green (Jennifer Aniston), a sheltered but friendly
woman, flees her wedding day and wealthy yet unfulfilling life and finds
childhood friend Monica Geller (Courteney Cox), a tightly wound but caring
chef.
Rachel becomes a waitress at West Village coffee house Central Perk
after she moves into Monica\'s apartment above Central Perk and joins Monica\'s
group of single friends in their mid-20s:
previous roommate Phoebe Buffay (Lisa Kudrow), an odd masseuse and
musician; neighbor Joey Tribbiani (Matt LeBlanc), a dim-witted yet loyal
struggling actor; Joey\'s roommate Chandler Bing (Matthew Perry),
a sarcastic, self-deprecating data processor; and Monica\'s older brother and Chandler\'s college roommate Ross Geller (David Schwimmer), a sweet-natured but insecure paleontologist.
Extract the characters and the actors who played them from above
message:
Rachel Green - Jennifer Aniston, Monica Geller - Courteney Cox, Phoebe Buffay - Lisa Kudrow, Joey Tribbiani - Matt LeBlanc, Chandler Bing - Matthew Perry, Ross Geller - David Schwimmer
Message: Games such as chess, poker, Go, and many video games have
always been fertile ground for AI research. Diplomacy is a seven-player game of
negotiation and alliance formation, played on an old map of Europe partitioned
into provinces, where each player controls multiple units (rules of
Diplomacy). In the standard version of the game, called Press Diplomacy, each
turn includes a negotiation phase, after which all players reveal their
chosen moves simultaneously. The heart of Diplomacy is the negotiation
phase, where players try to agree on their next moves. For example, one unit
may support another unit, allowing it to overcome resistance by other units,
as illustrated here: Computational approaches to Diplomacy have been
researched since the 1980s, many of which were explored on a simpler version of
the game called No-Press Diplomacy, where strategic communication between
players is not allowed. Researchers have also proposed computer-friendly negotiation protocols, sometimes called \342\200\234Restricted-Press\342\200\235.
Extract the deinition of Diplomacy:
A seven-player game of negotiation and alliance formation
Message: Back in 2016, when we weren\'t using simulation and were using
a small lab-configuration of industrial robots to learn how to grasp small
objects like toys, keys and everyday household items, it took the equivalent of
four months for one robot to learn how to perform a simple grasp with a
75%% success rate. Today, a single robot learns how to perform a complex task
such as opening doors with a 90%% success rate with less than a day
of real-world learning. Even more excitingly, we\'ve shown that we can
build on the algorithms and learnings from door opening and apply them to a new
task: straightening up chairs in our cafes. This progress gives us
hope that our moonshot for building general purpose learning robots might just be possible.
Extract the success rates of the robots in 2016 and today, respectively:
75%, 90%
Message: CapitalG was founded a decade ago to empower entrepreneurs with
Alphabet and Google\'s unparalleled expertise in growth.
We are privileged to share the lessons learned from helping to scale
Google, Stripe, Airbnb, CrowdStrike, Databricks, and Zscaler with the next wave
of generational tech companies-perhaps including yours.
Alphabet is our sole LP and provides patient, long-term capital. As an
independent growth fund, our priorities align with our entrepreneurs. CapitalG
companies have achieved product-market fit and are ready to scale. We maintain
a small, concentrated portfolio so every company receives substantial capital
and hands-on support.
"""
print(
generation_model.predict(
prompt, temperature=0.2, max_output_tokens=256, top_k=1, top_p=0.8
).text
)
As you can see in the output above, based on the few-shot prompt, you should see the names of companies funded by CapitalG.
0 Comments