Building a Treasure Finding Bot (Part 1)

July 19th, 2020

I've always heard the common adage: "one man's trash is another man's treasure". However, anyone who has gone secondhand shopping online knows the usual reality: people think their trash is another man's treasure, and are sure to price it as such.

an absolute steal

I'm a big fan of secondhand shopping: you can get premium clothes for much cheaper, be environmentally friendly, and get the thrill of finding treasures. What I DON'T like about secondhand shopping is how tedious it can be to find those rare gems. Every day you have to slog through new listings of items you're interested in, filter out overpriced trash, and sniff out scammers.

The first time I dipped my toe into secondhand shopping was buying on Grailed. I had heard online about sleek, comfortable, and flattering Reigning Champ hoodies, but college me just couldn't justify the hefty $150 price tag at the time (however, I am all for buying full price from stores that price high so as to support ethical treatment of the environment and workers--shoutout Patagonia). Naturally, I found myself browsing secondhand stores online until, after much hemming and hawing and disapproval from my girlfriend, I purchased a Reigning Champ Tiger Fleece hoodie for $70 from Grailed. I was hooked, and started browsing Grailed every few hours to hunt for steals.

The glorious hoodie. My girlfriend wears it more than me at this point...

Eventually, I realized that the hours sunk into browsing Grailed just weren't worth it for the off-chance of finding a steal; it was pretty much just playing a lottery. I started thinking, though: everything I was doing was repetitive and automatable--visit a website, check new listings, determine worth, move on. It should be possible to have a bot do all these things for me, and send me a notification when it found a listing that was a super good deal. I was a Computer Engineering major, I was a master at coding (cough finding things on Stack Overflow cough), I could do this. After a rough design sketch, I realized that the hardest part would be to "determine worth", which would be impossible without data. Fantastically enough, Grailed had a great feature that showed sold listings on its website. I realized that with all that data, I could compare a new listing with previously sold listings and determine whether the item was over- or underpriced. My first challenge was ahead of me: collecting and storing all of the sold listings on Grailed.

Grabbing the data*

*Also known as producing traffic patterns that probably made some backend engineer at Grailed scratch his head a bit.

At the time, Grailed had this awesome endless scrolling feature; you could visit the page that displayed sold items, and continually scroll to show older and older sold items until there were none left. Idea 1 was hatched: I would capture all of the sold items displayed on the screen, scroll, wait for the next sold items to load, capture them, rinse and repeat. I had previously used Selenium (a framework for controlling web browsers) and Java (I know I'll get some flak for this) for a school project, so I decided to use the same stack for this project. Attempt 1 was coded up and left to run overnight. The following morning, I checked my laptop only to find fewer than 6,500 listings downloaded and saved. How many sold listings did Grailed have at the time? 2.3 million. Assuming I had slept for 10 hours, it would take around 147 days at that rate to collect all of the data. On top of that, it was clear from observing the program that every subsequent scroll and load of older sold listings was taking longer and longer. Clearly, browser performance was degrading as the page grew to tens of thousands of listings--I was going to have to let the program run for an absurd amount of time to collect all of the data. Crap.

Grailed's infinite scroll
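The core of attempt 1 was a loop roughly like the one below. This is a sketch from memory: the URL, the CSS selector, and the sleep time are stand-ins, not Grailed's actual markup.

import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import java.util.List;

public class SoldScraper {
    public static void main(String[] args) throws InterruptedException {
        WebDriver driver = new ChromeDriver();
        driver.get("https://www.grailed.com/sold");  // placeholder for the sold-listings feed
        JavascriptExecutor js = (JavascriptExecutor) driver;
        int seen = 0;
        while (true) {
            // Capture every listing currently rendered on the page.
            List<WebElement> listings = driver.findElements(By.cssSelector(".feed-item"));
            for (int i = seen; i < listings.size(); i++) {
                saveListing(listings.get(i));  // parse fields and persist (omitted)
            }
            if (listings.size() == seen) break;  // nothing new loaded: we've hit the end
            seen = listings.size();
            // Scroll to the bottom to trigger the next batch of older listings.
            js.executeScript("window.scrollTo(0, document.body.scrollHeight);");
            Thread.sleep(2000);  // crude wait for the new batch to render
        }
        driver.quit();
    }

    static void saveListing(WebElement listing) { /* omitted */ }
}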

I killed the program, and started thinking some more. Digging around, I noticed something interesting about how Grailed loaded their pages (Grailed no longer does this; I like to think they stopped because of my weird traffic patterns). When you scrolled and older sold listings were loaded, Grailed would actually change the current page's URL to a new URL. If you opened a new window and visited that new URL, it would take you to a new page with only the older sold listings loaded and displayed, with all of the newer sold listings removed. However, you could still scroll down and load older listings. This gave me an idea: if the browser's struggling performance was due to the sheer number of listings on the page, then I could scroll a constant number of times, grab everything on the page, visit the new URL to reset the number of listings the browser had to handle, scroll another constant number of times, grab everything on the page, rinse and repeat.
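Sketched as a modification of the loop above, attempt 2 looked something like this (the scroll count is arbitrary, and the two helper functions are hypothetical stand-ins for the scraping and termination logic):

final int SCROLLS_PER_REFRESH = 25;   // arbitrary constant, tuned by hand
while (moreListingsRemain(driver)) {  // hypothetical helper
    for (int i = 0; i < SCROLLS_PER_REFRESH; i++) {
        js.executeScript("window.scrollTo(0, document.body.scrollHeight);");
        Thread.sleep(2000);           // wait for the next batch to render
    }
    scrapeVisibleListings(driver);    // hypothetical helper: save what's on screen
    // Grailed rewrote the URL as you scrolled, so reloading the current URL
    // gives a fresh page holding only the older listings, resetting the DOM size.
    driver.get(driver.getCurrentUrl());
}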

Attempt 2 was coded up and left running overnight. The next morning I woke up, only to find 78,000 listings downloaded and counting. Shucks, still too slow. I would have to not use my computer for around 30 days for this to work. I went through and profiled the code, checking how long each portion took to execute. I suspected the loading of new listings in the browser was the bottleneck, and if that were the case, I really couldn't do anything about the speed. However, I discovered something surprising: the vast majority of the time wasn't being spent loading listings, but rather on the Selenium calls grabbing information about sold listings from the page.

Selenium is expensive!

I was using Selenium's selectors to grab the listing information from the page. Every attempt to identify an item property (such as title, brand, price, sell time, or clothing type) used a Selenium selector. Turns out, these were quite slow: each selector call made a network request to the WebDriver's HTTP server asking the browser to perform a lookup, then had to wait for the response. It didn't really matter when used once or twice, but scaled to millions of listings, the execution time blew out of proportion.
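In other words, the scraping code looked something like the snippet below (selectors are again stand-ins), where every findElement and getText is its own round trip:

for (WebElement item : driver.findElements(By.cssSelector(".feed-item"))) {
    // Each of these six calls is a separate HTTP request to the WebDriver
    // server: three findElements and three getTexts per listing.
    String title = item.findElement(By.cssSelector(".listing-title")).getText();
    String brand = item.findElement(By.cssSelector(".listing-designer")).getText();
    String price = item.findElement(By.cssSelector(".listing-price")).getText();
    // Multiply by millions of listings and these round trips dominate runtime.
}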

The next logical conclusion was to get rid of this expensive process of grabbing listing information. I needed some way to remove as many of the expensive network calls as I could while still getting the data I needed. After some thinking, I turned to regex. Instead of repeatedly using Selenium selectors to grab listing information from the page, I would use Selenium only once per page load, to grab the inner HTML of the tag that contained all of the listing information. Then, I would use regex to pull all the listing information I needed out of the raw inner HTML, avoiding the need for all those expensive network calls.
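Sketched out, the idea looks like this. The container selector and the regex are illustrative; the real pattern matched Grailed's actual markup at the time.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One Selenium round trip per page load: pull the feed container's inner HTML...
String html = driver.findElement(By.cssSelector(".feed")).getAttribute("innerHTML");

// ...then extract fields locally with regex, at zero additional network cost.
Pattern listing = Pattern.compile(
    "href=\"(/listings/[^\"]+)\"[\\s\\S]*?" +
    "listing-title\">([^<]+)<[\\s\\S]*?" +
    "listing-price\">\\$([\\d,]+)<");
Matcher m = listing.matcher(html);
while (m.find()) {
    String url = m.group(1), title = m.group(2), price = m.group(3);
    // ... persist the listing (omitted)
}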

Attempt 3 was coded up and run, and the difference in speed was immediately obvious. It took only seconds to process hundreds of listings, reducing the time needed to collect all 2.3 million sold listings to only 6-7 hours. I was jumping for joy.

Actual footage of me jumping for joy

Finally, I had all the data I wanted. Now I could do interesting things like:

Query and find the most expensive thing ever sold on Grailed:

select title, price, brands
from sold_history
where price=(select max(price) from sold_history);
                  title                   | price |     brands     
------------------------------------------+-------+----------------
 AW 2001 Riot! Riot! Riot! Patched Bomber | 47000 | {"Raf Simons"}

Query the most expensive underwear purchased:

with underwear_id as (
  select subcategory_id
  from categories
  where subcategory='Socks & Underwear'
)
select title, price, brands
from sold_history
where subcategory_id = (select subcategory_id from underwear_id)
and price=(
  select max(price)
  from sold_history
  where subcategory_id = (select subcategory_id from underwear_id)
);
                                   title                                     | price |  brands   
-----------------------------------------------------------------------------+-------+-----------
 Versace Baroque Gold Crown Animal Print 100% Silk Boxers size 6 $495 SALE   |   515 | {Versace}

and of course, query the average sell price of Reigning Champ hoodies:

select brands, avg, stddev
from subcategory_data
where brands=ARRAY['Reigning Champ']
and subcategory_id=(
  select subcategory_id
  from categories
  where subcategory='Sweatshirts & Hoodies'
);
       brands       |         avg         |       stddev        
--------------------+---------------------+---------------------
 {"Reigning Champ"} | 70.3193548387096774 | 24.0212700924706417

Looks like my purchase was perfectly average 😅

I still had to build the rest of the tracker, and was faced with a couple of issues. How would I determine what was or wasn't a good deal? No two listings are the same, with varying titles, conditions, item types, yearly styles, etc. In part 2, I detail how I tackled these problems while building the rest of the treasure finding bot.

August 2nd, 2020 update: my friend Rex pointed out that I could have recorded the network traffic when Grailed loaded new items and reverse engineered the actual API, which would have saved even MORE time. The whole collection might have been doable in 10-20 minutes! You learn something new every day.
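For the curious, that approach would look something like the following: watch the browser's network tab while the page loads a new batch of listings, find the request it makes, and replay it directly in code. The endpoint below is a placeholder, not Grailed's real API.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

HttpClient client = HttpClient.newHttpClient();
// Replay the request the page itself makes when it loads another batch.
HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("https://example.com/api/sold?page=2"))  // placeholder endpoint
    .header("Accept", "application/json")
    .build();
HttpResponse<String> response =
    client.send(request, HttpResponse.BodyHandlers.ofString());
// response.body() is structured JSON -- listing data with no browser in the loop.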