Twitter open-sourced a part of its recommendation algorithm on March 31, 2023; we’re here for it. Read this article if you want to learn how to understand any codebase in minutes using LangChain, LangChain’s Conversational Retriever Chain, Deep Lake, and GPT-4. As a bonus, you’ll also learn how the Twitter recommendation algorithm works and the top 7 tips for trending on Twitter in 2023.
Doing this quickly is only possible thanks to the awesome integration between LangChain and Deep Lake as a vector store. Seriously. Here’s how the “legacy” way of understanding any GitHub Repository worked:
The Legacy Way to Understand Code
- Acquire a broad comprehension of the codebase’s role within the project.
- Read the codebase documentation.
- Develop a dependency map for the codebase to comprehend its organization and interconnections.
- Examine the primary function to grasp the code’s structure.
- Ask a colleague, "wtf is the main function doing?".
- For test-driven development, execute test cases and use breakpoints to decipher the code.
- If test cases exist but are outside test-driven development, review them to comprehend the specifications.
- Shed a few tears.
- Employ a debugger to step through the code if test cases are absent.
- Examine the Git history to reveal the codebase’s evolution and the areas most susceptible to modifications.
- Alter the code and introduce personal test cases to assess the consequences.
- Investigate previous alterations to identify potential impact areas and confirm your assumptions.
- Continually monitor changes made by teammates to remain informed on current advancements.
The New Way to Understand Code Repositories
The new way is just four steps that take less than an hour to build:
- Index the codebase
- Store embeddings and the code in Deep Lake
- Use Conversational Retriever Chain from LangChain
- Ask any questions you’d like!
Now, this doesn’t mean you can skip the steps outlined in the previous section entirely, but we do hope this new approach speeds up your learning along the way. We will delve deeper into this process below, but let’s review the basics first.
LangChain basics
Before moving on to the process and architecture behind code comprehension, let’s first understand the basics.
What is LangChain?
In essence, LangChain is a wrapper for utilizing Large Language Models like GPT-4 and connecting them to many tools (such as vector stores, calculators, or even Zapier). LangChain is especially appealing to developers because it offers a novel way to construct user interfaces. Instead of relying on dragging and dropping or coding, users can state their desired outcome. Broadly speaking, LangChain is enticing to devs as it augments already robust LLMs with memory and context (which comes in handy in tasks such as code understanding). By artificially incorporating “reasoning,” we can tackle more sophisticated tasks with greater precision.
If you want to learn more about LangChain, read the ultimate guide on LangChain. In this example, we build a ChatGPT-like assistant that answers questions about your financial data. If you were to ask a plain LLM about the top-performing quarter of all time for Amazon (perhaps after pasting in text copied from a PDF), it would likely produce a plausible-looking SQL query with fabricated yet real-sounding column names. Using LangChain, however, you can compose a workflow that iterates through the process and arrives at a definitive answer, such as "Q4 2022 was the strongest quarter for Amazon all-time". You can read more about analyzing your financial data with LangChain.
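To give you a feel for the composition model, here is a minimal sketch of chaining a prompt template to a chat model (it assumes your OPENAI_API_KEY is already set; the prompt and variable names are just illustrative):

```python
# A minimal LangChain sketch: pipe a prompt template into a chat model.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

llm = ChatOpenAI(model='gpt-3.5-turbo')
prompt = ChatPromptTemplate.from_template(
    "Explain in one sentence what the function {function_name} does."
)
chain = prompt | llm  # LangChain Expression Language: the prompt's output feeds the model
print(chain.invoke({"function_name": "main"}).content)
```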
What is Deep Lake as a Vector Store in LangChain?
In the LangChain ecosystem, Deep Lake is a serverless, open-source, and multi-modal vector store. Deep Lake not only stores embeddings but also the original data with automatic version control. For these reasons, Deep Lake can be considered one of the best Vector Stores for LangChain (if you ask us, haha!). Deep Lake goes beyond a simple vector store, but we’ll dive into it in another blog post.
What is LangChain Conversational Retriever Chain?
A Conversational Retriever Chain is a retrieval-centric system that interacts with data stored in a VectorStore like Deep Lake. It extracts the most relevant code snippets and details for a given user request using advanced methods like context-sensitive filtering and ranking. The Conversational Retriever Chain is designed to deliver high-quality, relevant results while taking conversation history and context into account.
How to Build a Code Understanding App with LangChain, GPT-4, & Conversational Retriever Chain?
- Index the Codebase: Duplicate the target repository, load all contained files, divide the files, and initiate the indexing procedure. Alternatively, you can bypass this step and use a pre-indexed dataset.
- Store Embeddings and the Code: Code segments are embedded using a code-aware embedding model and saved in the Deep Lake VectorStore.
- Assemble the Retriever: Conversational Retriever Chain searches the VectorStore to find a specific query’s most relevant code segments.
- Build the Conversational Chain: Customize retriever settings and define any user-defined filters as necessary.
- Pose Questions: Create a list of questions about the codebase, then use the Conversational Retrieval Chain to produce context-sensitive responses. The LLM (GPT-4, in this case) should now generate detailed, context-aware answers based on the retrieved code segments and conversation history.
The Twitter Recommendation Algorithm
Ironically, we will use some words to describe the Twitter Algorithm for the general audience. Still, you can skip right to the code part (that will answer even more questions on how the Twitter algorithm works in 2023).
With approximately 500 million Tweets daily, the Twitter recommendation algorithm is instrumental in selecting the top Tweets for your “For You” feed. The Twitter trending algorithm employs intertwined services and jobs to recommend content across different app sections, such as Search, Explore, and Ads. However, we will focus on the home timeline’s “For You” feed.
Twitter Recommendation Pipeline
Twitter’s open-sourced recommendation algorithm works in three main steps:
Candidate Sourcing (fancy speak for data aggregation): the algorithm collects data about your followers, your tweets, and you. The “For You” timeline typically comprises 50% In-Network (people you follow) and 50% Out-of-Network (people you don’t follow) Tweets.
Feature Formation & Ranking: turns the data into key feature buckets: Embedding Space (SimClusters and TwHIN), In-Network (RealGraph and Trust & Safety), and Social Graph (Follower Graph, Engagements); look for our practical example below to discover what each of those is. A neural network trained on Tweet interactions to optimize for positive engagement then produces the final ranking.
Mixing: finally, in the mixing step, the algorithm groups all features into candidate sources and uses a model called Heavy Ranker to predict user actions, applying heuristics and filtering.
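To make the flow concrete, here is a deliberately simplified sketch of those three stages in Python. Every function name below is our own illustrative stand-in, not code from Twitter’s repository:

```python
# A deliberately simplified, hypothetical sketch of the three-stage pipeline.
# All names here are illustrative stand-ins, not Twitter's actual functions.

def fetch_in_network(user):       # ~50% of candidates: people you follow
    return [("tweet_from_followee", 1.0)]

def fetch_out_of_network(user):   # ~50%: discovered via SimClusters, TwHIN, etc.
    return [("tweet_from_stranger", 0.8)]

def heavy_ranker_score(candidate):  # stand-in for the engagement-prediction model
    return candidate[1]

def for_you_timeline(user):
    # 1. Candidate sourcing
    candidates = fetch_in_network(user) + fetch_out_of_network(user)
    # 2. Ranking by predicted positive engagement
    ranked = sorted(candidates, key=heavy_ranker_score, reverse=True)
    # 3. Mixing: heuristics and filtering would be applied here
    return ranked

print(for_you_timeline("you"))
```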
Source Code Understanding with LangChain: Practical Guide
Step 1: Installing required libraries and authenticating with Deep Lake and OpenAI
First, we will install everything we’ll need.
```python
!python3 -m pip install --upgrade langchain deeplake openai tiktoken langchain-openai
```
Next, let’s import the necessary packages, make sure the Activeloop and OpenAI keys are set in the ACTIVELOOP_TOKEN and OPENAI_API_KEY environment variables, and define the OpenAI embeddings. For full Deep Lake documentation, please see the Deep Lake LangChain docs page and the Deep Lake API reference.
You’ll need to authenticate with Deep Lake if you want to create your own dataset and publish it. You can get an API key from the Deep Lake platform here.
```python
import os
import getpass

from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass('Activeloop Token:')
embeddings = OpenAIEmbeddings()
```
Step 2: Indexing the Twitter Algorithm Code Base (Optional)
You can skip this part and jump right into using an already indexed dataset (just like the one in this example). To index the code base, first clone the repository, parse the code, break it into chunks, and apply OpenAI indexing:
```python
!git clone https://github.com/twitter/the-algorithm  # replace with any repository of your choice
```
Next, load all files inside the repository.
```python
import os
from langchain.document_loaders import TextLoader

root_dir = './the-algorithm'
docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        try:
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception:
            pass  # skip binary or non-UTF-8 files that can't be loaded as text
```
Subsequently, divide the loaded files into chunks:
```python
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
```
Perform the indexing process. This takes roughly 4 minutes to calculate embeddings and upload them to Activeloop. Afterward, you can publish the dataset publicly:
```python
username = "davitbun"  # replace with your username from app.activeloop.ai
db = DeepLake(dataset_path=f"hub://{username}/twitter-algorithm", embedding=embeddings)
db.add_documents(texts)
```
If the dataset has already been created, you can load it later without recomputing embeddings, as shown below.
Step 3: Conversational Retriever Chain
First, load the dataset, establish the retriever, and create the Conversational Chain:
```python
db = DeepLake(dataset_path="hub://davitbun/twitter-algorithm", read_only=True, embedding=embeddings)
```
A preview of the dataset would look something like this:
```text
Dataset(path='hub://davitbun/twitter-algorithm', read_only=True, tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor      htype       shape           dtype     compression
 ---------   ---------   -------------   -------   -----------
 embedding    generic    (23152, 1536)   float32      None
 ids          text       (23152, 1)      str          None
 metadata     json       (23152, 1)      str          None
 text         text       (23152, 1)      str          None
```
```python
retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['fetch_k'] = 100
retriever.search_kwargs['k'] = 10
```
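As an optional sanity check before wiring up the full chain, you can query the retriever directly (`get_relevant_documents` is LangChain’s standard retriever method; the sample query is just an example):

```python
# Optional: verify the retriever returns sensible matches for a sample query.
docs = retriever.get_relevant_documents("What does favCountParams do?")
print(len(docs), docs[0].metadata['source'])
```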
You can also define custom filtering functions using Deep Lake filters:
```python
def filter(x):
    # exclude chunks that contain vendored Google code
    if 'com.google' in x['text'].data()['value']:
        return False
    # keep only Scala and Python source files, based on the file path
    metadata = x['metadata'].data()['value']
    return 'scala' in metadata['source'] or 'py' in metadata['source']

# Uncomment the following line to apply custom filtering
# retriever.search_kwargs['filter'] = filter
```
Connect to GPT-4 for question answering.
```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model='gpt-3.5-turbo')  # switch to 'gpt-4'
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
```
Step 4: Ask Questions to the Codebase in Natural Language
Define all the juicy questions you want to be answered:
```python
questions = [
    "What does favCountParams do?",
    "is it Likes + Bookmarks, or not clear from the code?",
    "What are the major negative modifiers that lower your linear ranking parameters?",
    "How do you get assigned to SimClusters?",
    "What is needed to migrate from one SimClusters to another SimClusters?",
    "How much do I get boosted within my cluster?",
    "How does Heavy ranker work. what are it’s main inputs?",
    "How can one influence Heavy ranker?",
    "why threads and long tweets do so well on the platform?",
    "Are thread and long tweet creators building a following that reacts to only threads?",
    "Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?",
    "Content meta data and how it impacts virality (e.g. ALT in images).",
    "What are some unexpected fingerprints for spam factors?",
    "Is there any difference between company verified checkmarks and blue verified individual checkmarks?",
]
chat_history = []

for question in questions:
    result = qa.invoke({"question": question, "chat_history": chat_history})
    chat_history.append((question, result['answer']))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")
```
The output is a series of questions and answers. We’ve redacted the model’s answers for brevity; you’re welcome to read through the re-run of the notebook here.
Code Understanding with LangChain: Asking about Variables
Question: What is favCountParams? Is it just Likes and Bookmarks?
Answer: From the provided code, it is not clear if the favorite count metric is determined by the sum of likes and bookmarks. The favorite count is mentioned in the code, but there is no explicit reference to how it is calculated in terms of likes and bookmarks.
Question: What are the major negative modifiers that lower your linear ranking parameters?
Answer: In this code repository, the major negative modifiers lowering the linear ranking parameters are:
scoringData.querySpecificScore: This score adjustment is based on the query-specific information. If its value is negative, it will lower the linear ranking parameters.
scoringData.authorSpecificScore: This score adjustment is based on the author-specific information. If its value is negative, it will also lower the linear ranking parameters. If the author has been blocked, muted, reported for abuse or spam, or unfollowed (over a rolling 50-day period), the algorithm adjusts the score negatively.
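As a back-of-the-envelope illustration (the numbers below are made up, not from the repository), negative query- and author-specific adjustments drag a linear score down additively:

```python
# Illustrative only: made-up numbers showing how negative modifiers lower a linear score.
base_score = 1.0
query_specific_score = -0.2    # weak fit with the query context
author_specific_score = -0.5   # author was muted/blocked/reported or recently unfollowed
final_linear_score = base_score + query_specific_score + author_specific_score
print(final_linear_score)  # 0.3 -- well below the unmodified base score of 1.0
```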
Code Understanding with LangChain: Asking about Functions or Algorithms
Question: How does the Heavy Ranker work? What are its main inputs?
Answer: The Heavy Ranker, a vital machine learning model, ranks and scores candidates within the recommendation algorithm, mainly predicting user engagement with tweets or connections on the platform.
Key Heavy Ranker inputs include:
- Static Features: Features calculated directly from a tweet as it’s created, such as URL presence, cards, quotes, etc. The Index Ingester generates these features and stores them in the index.
- Real-time Features: Per-tweet features that change post-indexing, including social engagements (retweet, favorite, reply counts) and spam signals. The Signal Ingester, part of a Heron topology, processes event streams to gather and compute these features.
- User Table Features: Per-user features obtained from the User Table Updater, processing a stream written by the user service. This input stores sparse real-time user data, later linked to the tweet being scored via author lookup.
- Search Context Features: Features representing the searcher’s context, like UI language, content consumption, and current time (implied). These features are combined with Tweet Data for scoring purposes.
The Heavy Ranker processes these inputs to score and rank candidates based on relevance and user engagement likelihood.
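As a hypothetical sketch (the dictionaries and field names below are ours, not the repository’s), you can picture the four input groups being merged into a single feature set before scoring:

```python
# Hypothetical illustration of how the four Heavy Ranker input groups come together.
static_features   = {"has_url": 1, "has_card": 0}               # set at tweet creation
realtime_features = {"retweets": 12, "favs": 40, "replies": 3}  # updated post-indexing
user_features     = {"author_follower_count": 1500}             # from the User Table
context_features  = {"ui_language": "en", "hour_of_day": 14}    # searcher's context

# The model scores the tweet on the union of all feature groups.
model_input = {**static_features, **realtime_features, **user_features, **context_features}
print(sorted(model_input))
```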
Question: What are some unexpected fingerprints for spam factors?
Answer: Within the given context, a peculiar spam sign is a tweet containing a non-media, non-news link. If a tweet has a link without an image, video, or news URL, it’s seen as a potential spam source, setting a user reputation threshold (tweepCredThreshold) to MIN_TWEEPCRED_WITH_LINK.
Note that this rule may not contain all atypical spam indicators but is based on the particular codebase and logic provided in the context.
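Here is a small sketch of that rule as we read it; the function and the threshold value are hypothetical stand-ins, not the actual code from the repository:

```python
# Hypothetical rendering of the link-spam rule described above.
MIN_TWEEPCRED_WITH_LINK = 65  # hypothetical value; the real constant lives in the repo

def required_reputation(tweet):
    # A link without an image, video, or news URL raises the author's reputation bar.
    bare_link = tweet["has_link"] and not (
        tweet["has_image"] or tweet["has_video"] or tweet["has_news_url"]
    )
    return MIN_TWEEPCRED_WITH_LINK if bare_link else 0

print(required_reputation({"has_link": True, "has_image": False,
                           "has_video": False, "has_news_url": False}))  # 65
```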
Code Understanding with LangChain: Asking about Suggestions Based on the Code
You can also brainstorm, with the source code in mind, about how to accomplish a certain task. Some of the output will be more general (and generic), but the prompting can be tuned so the answers lean heavily on what the model detects in the source code.
Question: Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?
Answer: Growing followers: The primary objective is to expand your audience. Tactics include:
- Sharing top-notch, niche-specific content regularly to gain authority within your SimCluster.
- Interacting with users through replies, retweets, and mentions. Engage with your commenters: recap.engagement.is_replied_reply_engaged_by_author in the Heavy Ranker gives you a 75x boost.
- Utilizing pertinent hashtags and joining popular discussions.
- Partnering with influencers and users with sizable followings.
- Publishing content when your target audience is most active.
Boosting likes and bookmarks per tweet: The goal is to produce content that connects with your existing followers and promotes engagement. Tactics include:
- Creating tweets on trending topics. tweetHasTrendBoost gives your tweet a 1.1x boost, increasing its chances of being shown to people.
- Incorporating eye-catching visuals like images or videos. tweetHasImageUrlBoost or tweetHasVideoUrlBoost will get you a 2x boost for having a video or image in your tweet.
- Posing questions, expressing opinions, or starting conversations for enhanced user engagement.
Top 7 Twitter Recommendation Algorithm Tips: How to Trend on Twitter
Here are some other interesting facts we’ve found from our exploration of the Twitter code base. Perhaps they’ll help you gain a larger Twitter following and even trend on Twitter!
To be more visible on Twitter, you should:
- Aim for more likes and bookmarks as they give your tweet a significant boost.
- Encourage retweets as they give your tweet a 20x boost.
- Include videos or images in your tweets for a 2x boost.
- Avoid posting links or using unrecognized languages to prevent deboosts.
- Create content that appeals to users in your SimClusters to increase relevance.
- Engage in conversations by replying to others and encouraging replies to your tweets.
- Maintain a good reputation by avoiding being classified as a misinformation spreader, blocked, muted, reported for abuse, or unfollowed.
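Because these boosts are multiplicative, they compound. A toy calculation with the multipliers quoted above (and a made-up base score) shows why stacking them matters:

```python
# Toy calculation: the multiplicative boosts above compound on a tweet's base score.
score = 1.0   # made-up base score for illustration
score *= 20   # retweet boost
score *= 2    # image or video in the tweet
score *= 1.1  # tweetHasTrendBoost for a trending topic
print(score)  # 44.0 -- roughly 44x the unboosted score
```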
Concluding Remarks: Analyzing Codebase with LangChain and Deep Lake
In conclusion, the powerful combination of LangChain, Deep Lake, and GPT-4 revolutionizes code comprehension, making it faster and more efficient. Developers can quickly grasp complex codebases like Twitter’s recommendation algorithm using four key steps: indexing the codebase, storing embeddings and code in Deep Lake, using LangChain’s Conversational Retriever Chain, and asking questions in natural language.
Hopefully, this powerful combination of tools enables developers to quickly gain insights into the inner workings of any code repository, eliminating the need for tedious, time-consuming methods. Since the release of this blog post, we’ve seen some great uses of our code to build exciting projects.