Note
This is a take-home assignment completed in ~3 hours.
Given a listing ID, this app finds similar NYC Airbnb listings in other neighborhoods. A FastAPI backend serves the similar listings, and a Streamlit web app provides a user-friendly interface.
Check out the live demo of the Streamlit app
- Search for similar Airbnb listings based on a given listing ID (the API and app both provide ways to retrieve a random ID if you don't have one)
- Filter similar listings by price tolerance and accommodation capacity
- Access through a FastAPI backend or Streamlit interface
- Python 3.10
- Make (for running setup commands)
- Clone the repository:
git clone https://github.com/naingthet/similar-listings.git
cd similar-listings
- Create a virtual environment:
make venv
- Activate the virtual environment:
  - For Windows:
    venv\Scripts\activate
  - For macOS and Linux:
    source venv/bin/activate
- Install the dependencies:
make install
- Set the Pinecone API key (ask Thet for this key):
export PINECONE_API_KEY={PINECONE_API_KEY}
- Start the FastAPI server (main.py):
make start-api
The API will be available at http://localhost:8000.
- Start the Streamlit app (app.py):
make start-app
The Streamlit app will open in your default web browser.
The app uses instruction-tuned embeddings from the hkunlp/instructor-large model to encode custom text representations of key information about Airbnb listings. These embeddings are generated for all listings and uploaded to a Pinecone index for efficient similarity search.
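As a rough illustration of what a "custom text representation" could look like, the sketch below flattens the fields used in this project into a single string and pairs it with an instruction. The helper names (`listing_to_text`, `to_instructor_input`), the text template, and the instruction wording are all assumptions made for this example; the app's actual template may differ.

```python
# Hypothetical sketch: field names follow this README, but the template
# and the instruction string are assumptions, not the app's actual code.
INSTRUCTION = "Represent the Airbnb listing for retrieving similar listings:"


def listing_to_text(listing: dict) -> str:
    """Flatten selected listing fields into one string for embedding."""
    parts = [
        f"Name: {listing['name']}",
        f"Description: {listing['description']}",
        f"Room type: {listing['room_type']}",
        f"Accommodates: {listing['accommodates']}",
        f"Beds: {listing['beds']}",
        f"Bathrooms: {listing['bathrooms']}",
        f"Price: ${listing['price']:.0f} per night",
        f"Superhost: {'yes' if listing['host_is_superhost'] else 'no'}",
        f"Review score: {listing['review_scores_rating']}",
    ]
    return ". ".join(parts)


def to_instructor_input(listing: dict) -> list:
    """Build the [instruction, text] pair for one listing."""
    return [INSTRUCTION, listing_to_text(listing)]


# Made-up listing used only to exercise the helpers.
example = {
    "name": "Sunny loft in Williamsburg",
    "description": "Bright one-bedroom close to the L train.",
    "room_type": "Entire home/apt",
    "accommodates": 2,
    "beds": 1,
    "bathrooms": 1.0,
    "price": 150.0,
    "host_is_superhost": True,
    "review_scores_rating": 4.8,
}
pair = to_instructor_input(example)
```

A list of such pairs is the input shape that the InstructorEmbedding package's `encode` method expects, so each pair could be embedded and the resulting vector uploaded to the Pinecone index along with the listing's metadata.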
When searching for similar listings, the app considers the overall characteristics of the listings, such as the property type, room type, amenities, and location. It also lets users constrain the price and the number of people a listing accommodates to refine the search results. For example, with a price_tol of 0.1, every returned listing is priced within ±10% of the original listing's price.
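The price tolerance arithmetic can be sketched as follows. This is a minimal example, not the app's actual code: `price_bounds` and `build_filter` are hypothetical helpers, the metadata keys `price` and `accommodates` are assumed to match the fields listed in this README, and treating the headcount limit as "at least as many guests" is an assumption. The dictionary uses Pinecone's `$gte` and `$lte` metadata filter operators, although whether the app filters inside Pinecone or after the query is not stated here.

```python
def price_bounds(price: float, price_tol: float) -> tuple[float, float]:
    """Return the (low, high) price range allowed by a fractional tolerance."""
    if price_tol < 0:
        raise ValueError("price_tol must be non-negative")
    delta = price * price_tol
    return price - delta, price + delta


def build_filter(price: float, price_tol: float, accommodates: int) -> dict:
    """Build a Pinecone-style metadata filter (field names are assumptions)."""
    low, high = price_bounds(price, price_tol)
    return {
        "price": {"$gte": low, "$lte": high},
        # Assumption: candidates must fit at least the same number of guests.
        "accommodates": {"$gte": accommodates},
    }


# A $200 listing with price_tol=0.1 allows prices from about $180 to $220.
low, high = price_bounds(200.0, 0.1)
query_filter = build_filter(200.0, 0.1, accommodates=2)
```

A dictionary of this shape could be passed as the `filter` argument of a Pinecone index query so that only listings inside the price range are ranked by similarity.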
The FastAPI backend handles the similarity search requests and returns the top similar listings. The Streamlit web app provides a user-friendly interface for inputting the listing ID and search parameters, displaying the search results, and navigating through the similar listings.
- The app filters for listings that have non-null values for each of the selected attributes to ensure data quality and consistency.
- Listings from the same neighborhood are filtered out.
- The primary assumption is that people are typically driven by price constraints and necessary headcount when searching for similar listings. Therefore, the app uses these criteria to filter down the similar listings.
- After applying the price and headcount filters, the app focuses on surfacing listings that are similar in essence, by considering numerical features such as beds and review_scores_rating as well as textual features such as name and description.
- The app uses instructor embeddings (hkunlp/instructor-large) because they allow encoding multiple fields of different types in the context of the task, while using a relatively small model without the need for fine-tuning or building a complex search and rerank system.
- Instructor embeddings provide a fast and efficient way to encode the relevant information about listings for similarity search.
- An alternative approach could have been to create a composite score based on multiple features, such as multiple sets of embeddings and numerical features for each listing.
- However, this approach would require returning a much larger top-k result set, running multiple encodings, and reranking the results, which can be computationally expensive compared to the implemented approach.
- The fields used in this project are: name, description, host_is_superhost, price, accommodates, room_type, beds, bathrooms, review_scores_rating
- The app could have incorporated many more fields to capture additional aspects of the listings, but it was kept simple for demonstration purposes.
- Different embedding techniques, such as ColBERT (Contextualized Late Interaction over BERT), could have been explored for generating contextual embeddings.
- The app currently relies on a single embedding model and a limited set of attributes for similarity search. Expanding the set of attributes and exploring additional embedding techniques could potentially enhance the quality of similar listing recommendations.
- The app assumes that the selected attributes are sufficient for capturing the essential characteristics of listings. Further analysis and user feedback could help identify additional attributes that are important for similarity search.
- The app focuses on similarity search within the same city. Extending the app to support cross-city or even cross-country similar listing recommendations could be a valuable addition.
- Incorporating user feedback and learning from user interactions could enable personalized and more accurate similar listing suggestions over time.