If this Machine Learning thing never works out, you can still make passive income by mass-selling these on Amazon. Thank us later (disclaimer: this is a joke).
Meet FableForge, an AI Picture Book Generator powered by OpenAI, LangChain, Stable Diffusion, & Deep Lake
Imagine a world where children's picture books are created on demand by children themselves, from a single prompt. With each generated image, the text and prompt pairs are stored for further finetuning, so that if the child likes the story, the model can be tuned to fit one human's imagination perfectly.
This is the grand vision of FableForge.
FableForge is an open-source app that generates children's picture books from a single prompt. First, GPT-3.5/4 is instructed to write a short children's book. Then, using the new function calling feature OpenAI just announced, the text from each book page is transformed into a prompt for Stable Diffusion. These prompts are sent to Replicate, corresponding images are generated, and all the elements are combined into a complete picture book. The matching images and prompts are stored in a Deep Lake vector database, which makes it easy to store and visualize multimodal data (image and text pairs). Beyond that, the generated data can be streamed to machine learning frameworks in real time during training to finetune our generative AI model. While the latter is beyond the scope of this example, we'd love to cover how it all works together.
But first…
What Did and Didn’t Work while Building FableForge?
Before we look at the exact solution we eventually decided on, let’s take a glance at the approaches that didn’t work and what we learned from them:
Didn’t Work: Instructing GPT-4 To Generate Stable Diffusion Prompts
Initially, it seemed like it might be possible to send the LLM the text of our book and tell it to generate a prompt for each page. However, this didn’t work for a few reasons:
Stable Diffusion was released in 2022: While it might seem like Stable Diffusion is already "old news", to GPT-3.5 and GPT-4 it's in the future - both models have training cutoffs that predate its release. Ask GPT-4 "What is Stable Diffusion?" and it simply doesn't know what you mean.
Teaching the LLM how to prompt is difficult: It's possible to instruct the LLM to generate prompts without the LLM knowing what Stable Diffusion is; giving it an exact prompt format to follow yields decent results. Unfortunately, the LLM often injects plot details or non-visual content into the prompts, no matter how often you tell it not to. These details skew the relevance of the prompts and negatively impact the quality of the generated images.
What Did Work: Function Calling Capabilities
What is OpenAI Function Calling?
On June 13th, OpenAI announced a huge update to the chat completions API - function calling! This means we can provide the chat model with a function, and the chat model will output a JSON object according to that function's parameters.
Now, the chat models can translate natural language input into a structured format suitable for external tools, APIs, or database queries. The chat models are designed to detect when a function needs to be called based on the user's input and can then respond with JSON that conforms to the described function's signature.
In essence, function calling is a way to bridge the gap between unstructured language input and structured, actionable output that other systems, tools, or services can use.
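To make the mechanics concrete, here's a minimal, self-contained sketch of the raw API flow, using the pre-1.0 openai Python package that was current when the feature launched. The toy get_weather schema is our own illustration, not part of FableForge:

import json
import openai  # assumes openai<1.0; set openai.api_key before calling

# A toy function schema: the model fills in the arguments; nothing is executed.
weather_function = [{
    'name': 'get_weather',
    'description': 'Get the weather for a city.',
    'parameters': {
        'type': 'object',
        'properties': {
            'city': {'type': 'string', 'description': 'The city name, e.g. Paris'},
        },
        'required': ['city'],
    },
}]

response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo-0613',
    messages=[{'role': 'user', 'content': "What's the weather like in Paris?"}],
    functions=weather_function,
)

# Instead of plain text, the model returns a structured function call.
arguments = json.loads(response['choices'][0]['message']['function_call']['arguments'])
print(arguments['city'])  # -> 'Paris'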
How FableForge Uses Functions
For our Stable Diffusion prompts, we need structured data that strictly adheres to specific rules - a function is perfect for that! Let’s take a look at one of the functions we used:
get_visual_description_function = [{
    'name': 'get_passage_setting',
    'description': 'Generate and describe the visuals of a passage in a book. Visuals only, no characters, plot, or people.',
    'parameters': {
        'type': 'object',
        'properties': {
            'setting': {
                'type': 'string',
                'description': 'The visual setting of the passage, e.g. a green forest in the pacific northwest',
            },
            'time_of_day': {
                'type': 'string',
                'description': 'The time of day of the passage, e.g. nighttime, daytime. If unknown, leave blank.',
            },
            'weather': {
                'type': 'string',
                'description': 'The weather of the passage, e.g. rain. If unknown, leave blank.',
            },
            'key_elements': {
                'type': 'string',
                'description': 'The key visual elements of the passage, e.g. tall trees',
            },
            'specific_details': {
                'type': 'string',
                'description': 'The specific visual details of the passage, e.g. moonlight',
            }
        },
        'required': ['setting', 'time_of_day', 'weather', 'key_elements', 'specific_details']
    }
}]
With this, we can send the chat model a page from our book, the function, and instructions to infer the details from the provided page. In return, we get structured data that we can use to form a great Stable Diffusion prompt!
LangChain and OpenAI Function Calling
When we created FableForge, OpenAI had just announced the new function calling capabilities. Since then, LangChain - the open-source library we use to interact with OpenAI's Large Language Models - has added even better support for using functions. Our implementation of functions using LangChain is as follows:
- Define our function: First, we define our function, as we did above with get_visual_description_function.
- Give the chat model access to our function: Next, we call our chat model, including our function within the functions parameter, like so:
response = self.chat([HumanMessage(content=f'{page}')], functions=get_visual_description_function)
- Parse the JSON object: When the chat model uses our function, it provides the output as a JSON object. To convert the JSON object into a Python dictionary containing the function output, we can do the following:
import json

function_dict = json.loads(response.additional_kwargs['function_call']['arguments'])
In the function we defined earlier, 'setting' was one of the parameters. To access it, we can write:
setting = function_dict['setting']
And we're done! We can follow the same steps for each of the other parameters to extract them.
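Since every parameter is a string, a small comprehension extracts them all at once - a sketch, assuming the keys mirror the function's required list:

keys = ['setting', 'time_of_day', 'weather', 'key_elements', 'specific_details']
visual_details = {key: function_dict.get(key, '') for key in keys}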
Perfecting the Process: Using Deep Lake for Storage and Analysis
The final breakthrough in perfecting FableForge was using Deep Lake to store the generated images and text. With Deep Lake, we could store multiple modalities of data, such as images and text, in the cloud. Its web-based UI made it incredibly straightforward to display, analyze, and optimize the generated images and prompts, improving the quality of our picture book output. For future Stable Diffusion endeavors, we now have a decently sized dataset showing us which prompts work and which don't!
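As a taste of that workflow: loading a dataset and browsing its samples takes only a couple of lines (the dataset path below is a placeholder for your own Deep Lake organization and dataset name):

import deeplake

ds = deeplake.load('hub://<your_org>/fableforge_prompts')  # placeholder path
ds.visualize()  # renders the image/prompt pairs, e.g. inside a notebook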
Building FableForge
FableForge's open-source code is located here.
FableForge consists of four main components:
- The generation of the text and images
- The combining of the text and images to create the book
- Saving the images and prompts to the Deep Lake dataset
- The UI
Let’s take a look at each component individually, starting with the generation of the text and images. Here’s a high-level overview of the architecture:
First Component: AI Book Generation
All code for this component can be found in the api_utils.py file.
- Text Generation: To generate the text for the children’s book, we use LangChain and the ChatOpenAI chat model.
def get_pages(self):
    pages = self.chat([HumanMessage(content=f'{self.book_text_prompt} Topic: {self.input_text}')]).content
    return pages
self.book_text_prompt is a simple prompt instructing the model to generate a children's story. We specify the number of pages inside the prompt and what format the text should come in. The full prompt can be found in the prompts.py file.
- Visual Prompt Generation: To produce the prompts we will use with Stable Diffusion, we use functions, as outlined above. First, we send the whole book to the model:
def get_prompts(self):
    base_atmosphere = self.chat([HumanMessage(content=f'Generate a visual description of the overall lighting/atmosphere of this book using the function.'
                                                      f'{self.book_text}')], functions=get_lighting_and_atmosphere_function)
    summary = self.chat([HumanMessage(content=f'Generate a concise summary of the setting and visual details of the book')]).content
Since we want our book to have a consistent style throughout, we take the contents of base_atmosphere and append it to each individual prompt we generate later on. To further ensure our visuals stay consistent, we generate a concise summary of the visuals of the book. This summary will be sent to the model later on, accompanying each individual page, to generate our Stable Diffusion prompts.
def generate_prompt(page, base_dict):
    prompt = self.chat([HumanMessage(content=f'General book info: {base_dict}. Passage: {page}. Infer details about passage if they are missing, '
                                             f'use function with inferred details as if you were illustrating the passage.')],
                       functions=get_visual_description_function)
This method will be called for each individual page of the book. We send the model the info we just gathered along with a page from the book, and give it access to the get_visual_description_function function. The output will be a JSON object containing all the elements we need to form our prompts!
for i, prompt in enumerate(prompt_list):
    entry = f"{prompt['setting']}, {prompt['time_of_day']}, {prompt['weather']}, {prompt['key_elements']}, {prompt['specific_details']}, " \
            f"{base_dict['lighting']}, {base_dict['mood']}, {base_dict['color_palette']}, in the style of {style}"
Here, we combine everything. Now that we have our prompts, we can send them to Replicate’s Stable Diffusion API and get our images. Once those are downloaded, we can move on to the next step.
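For reference, here's a rough sketch of what a Replicate call can look like with its Python client; the model version hash is a placeholder you'd fill in from Replicate's site, and REPLICATE_API_TOKEN must be set in your environment:

import replicate

# Placeholder version hash; Stable Diffusion on Replicate returns a list of image URLs.
output = replicate.run(
    'stability-ai/stable-diffusion:<version-hash>',
    input={'prompt': entry},
)
image_url = output[0]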
Second Component: Combining Text and Images
Now that we have our text and images, we could open up MS Paint and copy-paste the text onto each corresponding image. That would be one way to do it, but it's tedious and time-consuming; instead, let's do it programmatically. In pdf_gen_utils.py, we turn our ingredients into a proper book in these steps:
- Text Addition and Image Conversion: First, we take each image, resize it, and apply a fading mask to the bottom - a white space for us to place our text (see the sketch after this list). We then add the text to the faded area, convert it into a PDF, and save it.
- Cover Generation: A book needs a cover that follows a different format than the rest of the pages. Instead of a fading mask, we take the cover image and place a white box over a portion for the title to be placed within. The other steps (resizing and saving as PDF) are the same as above.
- PDF Assembly: Once we have completed all the pages, we combine them into a single PDF and delete the files we no longer need.
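To make the first step concrete, here's a minimal PIL sketch of the fade-and-caption idea; the image size, fade depth, and text placement are our own guesses, not FableForge's exact values:

from PIL import Image, ImageDraw

def add_text_panel(image_path, text, out_path, fade_height=200):
    # Fade the bottom of an image to white and draw the page text there.
    img = Image.open(image_path).convert('RGB').resize((768, 768))
    width, height = img.size
    overlay = Image.new('RGB', (width, height), 'white')
    # Gradient mask: fully transparent above, increasingly white toward the bottom edge.
    mask = Image.new('L', (width, height), 0)
    for y in range(height - fade_height, height):
        alpha = int(255 * (y - (height - fade_height)) / fade_height)
        mask.paste(alpha, (0, y, width, y + 1))
    img = Image.composite(overlay, img, mask)
    # Draw the page text inside the faded area (default font for simplicity).
    draw = ImageDraw.Draw(img)
    draw.text((40, height - fade_height + 40), text, fill='black')
    img.save(out_path)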
Third Component: Saving to Deep Lake
Now that we have finalized our picture book, we want to store the images and prompts in Deep Lake. For this, we created a SaveToDeepLake class:
import deeplake

class SaveToDeepLake:
    def __init__(self, buildbook_instance, name=None, dataset_path=None):
        self.dataset_path = dataset_path
        try:
            self.ds = deeplake.load(dataset_path, read_only=False)
            self.loaded = True
        except Exception:
            self.ds = deeplake.empty(dataset_path)
            self.loaded = False

        self.prompt_list = buildbook_instance.sd_prompts_list
        self.images = buildbook_instance.source_files

    def fill_dataset(self):
        if not self.loaded:
            self.ds.create_tensor('prompts', htype='text')
            self.ds.create_tensor('images', htype='image', sample_compression='png')
        for i, prompt in enumerate(self.prompt_list):
            self.ds.append({'prompts': prompt, 'images': deeplake.read(self.images[i])})
When initialized, the class first tries to load a Deep Lake dataset from the provided path. If the dataset doesn’t exist, a new one is created.
If the dataset already existed, we simply append the prompts and images. The images can be easily uploaded using deeplake.read(), as Deep Lake is built to handle multimodal data.
If the dataset is empty, we must first create the tensors to store our data. In this case, we create a tensor 'prompts' for our prompts and 'images' for our images. Our images are in PNG format, so we set sample_compression to 'png'.
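Wiring it together is then just two lines; book stands in for a finished buildbook instance (one with sd_prompts_list and source_files attributes), and the dataset path is a placeholder:

saver = SaveToDeepLake(book, dataset_path='hub://<your_org>/fableforge')  # placeholder path
saver.fill_dataset()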
Once uploaded, we can view the samples in the Deep Lake UI, as shown above.
All code can be found in the deep_lake_utils.py file.
Final Component: Streamlit UI
To create a quick and simple UI, we used Streamlit. The complete code can be found in main.py.
Our UI has three main features:
- Prompt Input: In this text input box, the user specifies the prompt the book will be generated from. This could be anything: a theme, a plot, a time period, and so on.
- Book Generation: Once the user has input their prompt, they can click the Generate button to generate the book. The app will run through all of the steps outlined above until it completes the generation. The user will then have a button to download their finished book.
- Saving to Deep Lake: The user can click the Save to Deep Lake checkbox to save the prompts and images to their Deep Lake vector database. Once the book is generated, this will run in the background, filling the user’s dataset with all their generated prompts and images.
Streamlit is an excellent choice for quick prototyping and smaller projects like FableForge - the entire UI is less than 60 lines of code!
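A stripped-down sketch of that structure, with a hypothetical generate_book function standing in for the pipeline described above:

import streamlit as st

st.title('FableForge')
prompt = st.text_input('What should the book be about?')
save_to_deep_lake = st.checkbox('Save to Deep Lake')

if st.button('Generate') and prompt:
    with st.spinner('Generating your book...'):
        # generate_book is hypothetical shorthand for the full pipeline above.
        pdf_path = generate_book(prompt, save=save_to_deep_lake)
    with open(pdf_path, 'rb') as f:
        st.download_button('Download your book', f, file_name='book.pdf')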
Conclusion: The Future of AI-Generated Picture Books with FableForge & Deep Lake
Developing FableForge was a perfect example of how new AI tools and methodologies can be combined to overcome hurdles. By leveraging the power of LangChain, OpenAI's function calling feature, Stable Diffusion's image generation abilities, and Deep Lake's multimodal dataset storage and analysis capabilities, we created an app that opens up a new frontier in children's picture book creation.
Anyone can create an app like this - we did it, too. What will matter for you in the end, however, is having data as the moat: using the data you gather from your users to finetune models, providing them personal, curated experiences as they immerse themselves in fiction. This is where Deep Lake comes into play. With its 'data lake' features, visualization of multimodal data and streaming capability, Deep Lake enables teams to finetune their LLM performance or train entirely new ML models cost-effectively.