Find which NCBI RefSeq genomes match or are contained within your sequence data, using Mash MinHash with a Mash sketch database of 54,925 NCBI RefSeq genomes. Usage: refseq_masher [OPTIONS] COMMAND ...
Abstract: Detecting similar data is crucial for optimizing file storage and transmission in HTTP protocols and Content Delivery Networks. Traditional MinHash methods encounter significant efficiency ...
Multimodal Art Projection (M-A-P) researchers have introduced FineFineWeb, a large open-source automatic classification system for fine-grained web data. The project decomposes the deduplicated ...
In the ever-evolving world of large language models (LLMs), pre-training datasets form the backbone of how AI systems comprehend and generate human-like text. LLM360 has recently unveiled TxT360, a ...
The MinHash token filter has four configurable parameters: bucket_count, hash_count, hash_set_size, and with_rotation. Currently both bucket_count and hash_set_size have no effect, because there ...
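For readers unfamiliar with the technique behind these parameters, here is a minimal, generic MinHash sketch in Python. It is not the token filter's implementation and does not model bucket_count, hash_set_size, or with_rotation; the function names, the seeded BLAKE2b hashing, and the word-shingle size are all assumptions made for illustration. The hash_count argument here only plays the general role of "number of hash functions".

```python
import hashlib


def shingles(text, k=3):
    """Split text into a set of k-word shingles (hypothetical helper)."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(item, seed):
    """64-bit hash of item under a given seed (assumed scheme: seeded BLAKE2b)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, hash_count=128):
    """Keep the minimum hash value of the set under each of hash_count hash functions."""
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash(item, seed) for item in items) for seed in range(hash_count)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

A usage example under the same assumptions: compute `minhash_signature(shingles(doc_a))` and `minhash_signature(shingles(doc_b))`, then pass both to `estimate_jaccard`. More hash functions reduce the variance of the estimate at the cost of a larger signature.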
Low-cost clouds can alleviate the compute and storage burden of the genome sequencing data explosion. However, moving personal genome data analysis to the cloud can raise serious privacy concerns.
In this work, we study privacy-preserving trajectory sensing and query when mobile entities (e.g., mobile devices or vehicles) move in an environment of checkpoints (e.g., WiFi or cellular towers). The ...