This dataset comprises just over two years' worth of postings from Bazoš.cz, the largest marketplace in the Czech Republic for second-hand goods and various short-term advertisements.
In this repository, I release:
- Python script used for scraping Bazoš.cz
- Script used for post-processing the scraped data
- Dataset description
- Example of usage
- Links to download the data
The dataset covers the period from 1 May 2022 to 31 July 2024 and contains approximately 29 million short advertisements in Czech, each with a title, price, date, permalink and category.
The main categories are Cars, Children, House and Garden, Electronics, Photography, Music, Books, Mobiles, Motorcycles, Furniture, Clothing, PC, Jobs, Real Estate, Services, Sports, Machinery, Tickets, Animals and Others.
This dataset contains two years of advertisements from Bazoš.cz, the largest Czech online marketplace.
In this repository you will find:
- A script for downloading advertisements from Bazoš.cz
- Scripts for post-processing the downloaded data
- A description of the dataset
- An example of usage
- Download links
The archive covers the period from 1 May 2022 to 31 July 2024 and contains almost 29 million advertisements in Czech. Each advertisement includes a title, price, publication date, link and category. The categories are listed in the table below.
The dataset is split by category, and each category can be downloaded individually.
- The source of the data is the RSS feed of Bazoš.cz, which does not contain all information from the post.
- I removed the "description" column from the dataset because in some cases it contained sensitive information.
- The scraped data has two gaps: one in October 2022 and one in the spring of 2024.
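Because of these gaps, it can be worth checking date coverage before doing any time-series analysis. The following is a minimal sketch, not part of the released scripts. It assumes a category file is a CSV with the column names `title`, `price`, `date`, `permalink` and `category`, and that `date` starts with an ISO `YYYY-MM-DD` date; the real file format and column names may differ, and the sample rows are made up for illustration.

```python
import csv
import io
from datetime import date


def find_date_gaps(rows, date_field="date", min_gap_days=2):
    """Return (last_day_before_gap, first_day_after_gap) pairs.

    Assumes each row is a dict whose `date_field` value starts with
    an ISO date (YYYY-MM-DD). This is an assumption about the files.
    """
    days = sorted({date.fromisoformat(row[date_field][:10]) for row in rows})
    gaps = []
    for earlier, later in zip(days, days[1:]):
        # Consecutive days differ by 1; anything larger skips at least one day.
        if (later - earlier).days >= min_gap_days:
            gaps.append((earlier, later))
    return gaps


# Made-up sample standing in for one downloaded category file.
SAMPLE_CSV = """title,price,date,permalink,category
Ad A,100,2022-10-01,https://example.invalid/a,Books
Ad B,250,2022-10-02,https://example.invalid/b,Books
Ad C,80,2022-10-09,https://example.invalid/c,Books
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
gaps = find_date_gaps(rows)
for before, after in gaps:
    print(f"No ads between {before} and {after}")
```

For a real file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle and adjust `date_field` to the actual column name.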
As far as I know, this is the only publicly available Czech dataset of this type, and I hope it will be useful for research, personal use or other interesting applications. Please contact me if you have any questions or concerns.
The Bazoš.cz terms of service do not prohibit the use of web scrapers. I made sure that my scraping script used as few requests as possible so as not to overload the servers.
The dataset contains no columns with personal information. I removed the "description" column because in some cases it contained sensitive information, and I also removed posts whose titles contained sensitive information.
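To illustrate what low-impact RSS scraping can look like, here is a minimal sketch. It is not the script from this repository. It assumes a standard RSS 2.0 layout (`<item>` elements with `<title>`, `<link>` and `<pubDate>`), which may not match the real feed exactly. The `Throttle` class, the `parse_rss_items` function and the sample feed are hypothetical names and data introduced for this example.

```python
import time
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Extract title, link and pubDate from each <item> of an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "pubDate": item.findtext("pubDate", default=""),
        })
    return items


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval_s has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Made-up feed used instead of a network request.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Sample feed</title>
<item><title>Ad A</title><link>https://example.invalid/a</link>
<pubDate>Sat, 01 Oct 2022 10:00:00 +0200</pubDate></item>
<item><title>Ad B</title><link>https://example.invalid/b</link></item>
</channel></rss>"""

items = parse_rss_items(SAMPLE_FEED)
print(len(items), "items parsed")
```

In a real scraper, one would call `Throttle.wait()` before each HTTP request and pass the response body to `parse_rss_items`. Fetching one feed per category yields many posts per request, which keeps the request count low.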
The code is released under the GNU General Public License.