100 Days of Code – 15% complete

Remember when you took statistics in high school and thought “When am I ever going to use this crap?” Well, I don’t. I never actually took statistics.

Since Monday, I’ve been working on the following problem from PyBites Challenge 6:
Predict when PyPI will have more than 200,000 packages.

It’s really just a basic prediction problem, which means I’m going to try to use a linear regression model to make a prediction. I’m quickly realizing that statistics are important!

I started this challenge on Monday using just the PyPI simple API. That was pretty easy to use, but it only gave me a list of all packages on PyPI. I started searching for a way to list packages by their first upload date. The PyPI API describes how to list all uploads after a certain date, but that mixes updates to existing packages with the new package uploads I care about. I can get the details for a particular package from the JSON API, but that just gives me all the updates for that one package.
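For reference, here is a rough sketch of pulling the package list out of the simple API. It assumes the index lives at https://pypi.org/simple/ and is served as an HTML page with one link per package; the class and function names are made up for illustration and are not from the challenge.

```python
import urllib.request
from html.parser import HTMLParser

# Assumption: the simple index URL; the index is one big page of <a> links.
SIMPLE_URL = "https://pypi.org/simple/"


class LinkTextParser(HTMLParser):
    """Collect the text inside every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # Only keep non-empty text that appears inside a link.
        if self._in_anchor and data.strip():
            self.names.append(data.strip())


def parse_package_names(html):
    """Return the package names found in a simple-index HTML page."""
    parser = LinkTextParser()
    parser.feed(html)
    return parser.names


def fetch_package_names():
    """Download the simple index and return every package name (one request)."""
    with urllib.request.urlopen(SIMPLE_URL) as resp:
        return parse_package_names(resp.read().decode("utf-8"))
```

Calling fetch_package_names() downloads the whole index in a single request, so it is worth saving the result rather than fetching it on every run.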

I was out of ideas, so I checked to see how PyBites did it (spoiler here.)

Those methods were OK, but they seemed like a lot of work, and it’d take a long time to get a decent prediction. I’m coding for only an hour a day, so I need to keep moving!

I decided to take the full list of packages from the PyPI simple API and select every hundredth package. Then, I ask the JSON API for all the details about each selected package. That currently means 1,585 API requests, and I wait half a second between calls to avoid annoying the PyPI admins. Once I have each response, I find all the upload dates for that package and save only the earliest one to a list. The whole run takes about 13 minutes to complete, which gives me plenty of time to test the next step!
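The sampling loop could look something like the sketch below. It assumes the JSON API lives at https://pypi.org/pypi/NAME/json and that each response has a "releases" mapping whose files carry an "upload_time" string such as "2015-03-01T10:00:00". The function names are hypothetical, and packages that fail to load or have no uploaded files are simply skipped.

```python
import json
import time
import urllib.error
import urllib.request
from datetime import datetime

# Assumption: the JSON API endpoint for a single package.
JSON_URL = "https://pypi.org/pypi/{name}/json"


def sample_every_nth(names, step=100):
    """Keep every `step`-th package name, starting with the first."""
    return names[::step]


def earliest_upload(package_json):
    """Return the earliest upload datetime in a JSON API response, or None."""
    times = [
        datetime.strptime(f["upload_time"], "%Y-%m-%dT%H:%M:%S")
        for files in package_json.get("releases", {}).values()
        for f in files
    ]
    return min(times) if times else None


def fetch_first_upload_dates(names, delay=0.5):
    """Ask the JSON API about each name, pausing `delay` seconds between calls."""
    dates = []
    for name in names:
        try:
            with urllib.request.urlopen(JSON_URL.format(name=name)) as resp:
                first = earliest_upload(json.load(resp))
        except urllib.error.URLError:
            first = None  # skip packages that cannot be fetched
        if first is not None:
            dates.append(first)
        time.sleep(delay)  # be polite to the PyPI servers
    return dates
```

With 1,585 sampled names and a half-second pause, the sleeping alone accounts for roughly 13 minutes, which matches the run time above.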

Next, I’ll create a running total of the number of packages since the earliest package date. This should give me the count of total packages by date, which perfectly fits into a linear regression model. I mean, I think it fits into a linear regression model.
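Here is a minimal sketch of that next step, using plain least squares so it needs nothing beyond the standard library. Two things in it are my assumptions rather than part of the plan above: each sampled package is counted as 100 packages (because only every hundredth one was sampled), and dates are turned into day numbers with toordinal() so they can be used as x values. The function names are hypothetical.

```python
from datetime import date


def cumulative_counts(first_upload_dates, scale=100):
    """Return (date, running_total) pairs in date order.

    Assumption: each sampled package stands in for `scale` real packages.
    """
    return [
        (d, i * scale)
        for i, d in enumerate(sorted(first_upload_dates), start=1)
    ]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_date(points, target):
    """Estimate the date on which the fitted line reaches `target` packages."""
    xs = [d.toordinal() for d, _ in points]  # dates as day numbers
    ys = [total for _, total in points]
    slope, intercept = fit_line(xs, ys)
    return date.fromordinal(round((target - intercept) / slope))
```

Something like predict_date(cumulative_counts(dates), 200_000) would then give a first guess. Whether a straight line is really the right shape for this data is a separate question.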

I never actually took statistics.