We can infer quite a bit from current tech that does many of the same things - regardless of "how it works", the limits on data storage at the end-user level will be the same.
TinEye is an identification-based photo search engine that crawls the web and downloads images. It currently has 1.12 billion images in its database. If only an 8 KB hash of each of those images were delivered to every end user, we would be talking about roughly 8.5 terabytes of data. This MUST be on a central database. (PS - TinEye's 1.12 billion images are nowhere near enough to find sources for any but the most popular images.)
Piximilar is considered state-of-the-art when it comes to searching for similar images. They flat-out say the service is not metadata or keyword dependent. This means image parsing, but not necessarily image storage. I have already shown how even a small database of just parsed hashes takes more storage space than can reasonably be delivered to an end user.
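For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope sketch. The 8 KB per-image fingerprint is my assumption for the sake of argument, not a published TinEye figure, and the names are made up for illustration.

```python
# Back-of-the-envelope check of the storage estimate.
IMAGE_COUNT = 1_120_000_000   # index size quoted in the post
HASH_BYTES = 8 * 1024         # ASSUMED 8 KB fingerprint per image


def database_size_bytes(image_count: int, bytes_per_image: int) -> int:
    """Total bytes needed to ship one fingerprint per image."""
    return image_count * bytes_per_image


total = database_size_bytes(IMAGE_COUNT, HASH_BYTES)
print(f"{total / 1e12:.2f} TB (decimal units)")
print(f"{total / 2**40:.2f} TiB (binary units)")
```

Depending on whether you count in decimal or binary units this comes out somewhere between roughly 8.3 and 9.2 terabytes, so the figure above is in the right ballpark either way.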
So let's say we can be confident this is a client-server solution. (I think the storage numbers involved speak for themselves.) That leaves only two possibilities - the server either stores the original images itself OR the server stores LINKS to said images. If it stores images, it is subject to copyright law in and of itself. If it stores links, a randomization scheme such as I briefly outlined above will kill it.
See? Logic can answer questions even when we don't know details.
EDIT:
Keyword or metadata based options are POSSIBLE, but improbable. The advantage is that they could cut the size of the database down by an order of magnitude or two (85 GB is still too much to give to an end user), but they would require sophisticated tagging of hundreds of millions of images, and would not be as successful at matching as the recognition-based systems currently deployed - they have been tried and have failed so far. If you have a recognition engine that can parse a drawing and know what to look for, going from fingerprint -> keyword -> fingerprint is a needless and error-prone step as well. Better to just do a fuzzy search on fingerprint -> fingerprint.
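To make "fuzzy search on fingerprint -> fingerprint" concrete, here is a toy sketch. It assumes fingerprints are short bit strings compared by Hamming distance (the number of differing bits). That is my assumption purely for illustration; I don't know what the real engines use, and the fingerprints and names below are invented.

```python
from typing import Dict, List, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


def fuzzy_search(query: int, database: Dict[str, int],
                 max_distance: int) -> List[Tuple[str, int]]:
    """Return (name, distance) for every stored fingerprint within
    max_distance bits of the query, closest first."""
    hits = [(name, hamming_distance(query, fp))
            for name, fp in database.items()]
    close = [hit for hit in hits if hit[1] <= max_distance]
    return sorted(close, key=lambda hit: hit[1])


# Hypothetical 8-bit fingerprints; real ones would be far longer.
database = {
    "apple_photo": 0b10110010,
    "pear_photo": 0b10110110,
    "car_photo": 0b01001101,
}
sketch_fingerprint = 0b10110011

matches = fuzzy_search(sketch_fingerprint, database, max_distance=2)
print(matches)
```

Note that no keyword ever enters the picture: the sketch is matched to the photos directly because their fingerprints are close, not because anything was labelled "apple".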
EDIT2:
But let us run with the idea that this is keyword / metadata and not recognition. As I started to say earlier, recognition has worked better so far because you don't need to know X=Apple and Y=Apple, therefore X might look like Y. You only need to know that X and Y share enough similar traits that they are likely the same (or, more importantly, the same enough), without building the huge database of apples, pears, oranges, etc.
Metadata is an unlikely candidate in this case because the system works off a sketch. It would be much easier math to parse a sketch the same way you parse an image and look for those fingerprints than it would be to assign relevant keywords to a sketch _unaided_ and then look for those keywords - not to mention the already discussed monumental task of keywording millions and millions of images.
Then we get to the other flaw of metadata: Pose.
It is not enough to know, when attempting to duplicate a sketch, that it IS a girl; you need to know how the girl is posing. Relying on already available metadata (if you wanted to pick and choose your source material from well-tagged sources such as Flickr) to get around the problem of tagging so many images won't give you pose information.
If we assume an incredible 20 bits of information per image is all it would take for a metadata solution (that's only ~1 million unique values), that's about a DVD's worth of database to ship to the end user. Doable - but that leaves only 1 million unique keys to describe 1 billion unique images.
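The same kind of sanity check works for the 20-bit case. This is a rough sketch; the one-key-per-image layout is my simplifying assumption, and I'm reusing the 1.12 billion image count from above.

```python
# Rough check of the 20-bit metadata scenario.
IMAGE_COUNT = 1_120_000_000   # same index size as before
BITS_PER_IMAGE = 20           # ASSUMED: one 20-bit key per image

unique_keys = 2 ** BITS_PER_IMAGE                 # distinct key values
total_bytes = IMAGE_COUNT * BITS_PER_IMAGE // 8   # packed database size
images_per_key = IMAGE_COUNT / unique_keys        # average collisions

print(f"unique keys:    {unique_keys:,}")
print(f"database size:  {total_bytes / 1e9:.1f} GB")
print(f"images per key: {images_per_key:,.0f} on average")
```

That works out to about 2.8 GB, which fits on a single-layer DVD (4.7 GB), but on average more than a thousand images end up sharing each key, so a key alone cannot pick out one image.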