Django: How to Upload a CSV to the Database

Loading huge amounts of data into a database just got easier

[Image: pixelated character over an orange backdrop. Image by the author.]

Problem Overview and App Configuration

It's often the case that you want to load data into your database from a CSV file. Usually, it's not a problem at all, but there are some cases when performance problems can occur, especially if you want to load a massive amount of data. In this case, "massive" means a CSV file that has 500 MB to 1 GB of data and millions of rows.

In this article, I will focus on a situation where using database utilities to load CSV files (like PostgreSQL COPY) is not possible, because you need to do a transformation in the process.

It is also worth noting here that a data load of this size should always be questioned, and you should try to find more suitable ways to do it. Always check whether you can copy the data directly into the database using database engine utilities like COPY. These kinds of operations will almost always be much more performant than using the ORM and your application code.

Let's say that we have two models: Product and ProductCategory. We get the data from a different department and we have to load it into our system. Our Django models will look like this:
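(The original model definitions are not reproduced here; below is a minimal sketch, with field types and lengths as assumptions based on the CSV columns described later.)

            # models.py - a sketch of the two models; field types and lengths are assumptions
            from django.db import models

            class ProductCategory(models.Model):
                name = models.CharField(max_length=255)
                code = models.CharField(max_length=64)

            class Product(models.Model):
                name = models.CharField(max_length=255)
                code = models.CharField(max_length=64)
                price = models.DecimalField(max_digits=10, decimal_places=2)
                product_category = models.ForeignKey(
                    ProductCategory, on_delete=models.CASCADE, related_name="products"
                )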

The data structure is pretty simple, but it will be enough to show the problems with a massive data load. One thing worth noting here is the relationship between Product and ProductCategory. In this case, we can expect the number of product categories to be several orders of magnitude lower than the number of products. We will use this knowledge later on.

We also need a generator for the CSV files. The CSV file has the following columns:

  • product_name
  • product_code
  • price
  • product_category_name
  • product_category_code
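(The original generator script is not shown here; the sketch below is one way it could look. The file name csv_mock_data_create.py is taken from the invocation further down; the category pool, output file name, and value formats are assumptions.)

            # csv_mock_data_create.py - sketch of a mock-data generator
            import csv
            import random
            import sys

            # A small pool of categories, so many products share the same category
            CATEGORIES = [(f"Category {i}", f"CAT-{i:03d}") for i in range(20)]

            def main(row_count):
                with open("products.csv", "w", newline="") as csv_file:
                    writer = csv.writer(csv_file)
                    # Per the article, the CSV header row is skipped (not written) for now
                    for i in range(row_count):
                        category_name, category_code = random.choice(CATEGORIES)
                        writer.writerow([
                            f"Product {i}",
                            f"PROD-{i}",
                            round(random.uniform(1, 1000), 2),
                            category_name,
                            category_code,
                        ])

            if __name__ == "__main__":
                main(int(sys.argv[1]))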

Using the script above, you can create a CSV file with the data we will need for load testing. You can pass a number when calling the script, and this will be the number of rows in the generated file:

            python3 csv_mock_data_create.py 10000          

The command above will create a file with 10,000 products. Note that the script skips the CSV header for now. I will get back to that later.

Be careful here, as 10 million rows will create a file of around 600 MB.

Now we just need a simple Django management command to load the file. We will not do it via a view because, as we already know, the files are huge. This means that we would need to upload ~500 MB files using a request handler and, as a result, load the files into memory. This is inefficient.

The command has a naive implementation of the data loading and also shows the time needed to process the CSV file:
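(The original command is not reproduced here; the sketch below shows what such a naive implementation could look like. The command name load_csv matches the invocation further down; the app name myapp and other details are assumptions.)

            # myapp/management/commands/load_csv.py - sketch of the naive loader
            import csv
            import time

            from django.core.management.base import BaseCommand

            from myapp.models import Product, ProductCategory

            class Command(BaseCommand):
                help = "Load products from a CSV file"

                def add_arguments(self, parser):
                    parser.add_argument("file_path", type=str)

                def handle(self, *args, **options):
                    start = time.time()
                    with open(options["file_path"]) as csv_file:
                        # Loads the whole file into memory (discussed below)
                        data = list(csv.reader(csv_file, delimiter=","))
                        for row in data[1:]:
                            # One query per row (discussed below)
                            product_category, _ = ProductCategory.objects.get_or_create(
                                name=row[3], code=row[4]
                            )
                            # One INSERT per row (discussed below)
                            Product.objects.create(
                                name=row[0],
                                code=row[1],
                                price=row[2],
                                product_category=product_category,
                            )
                    self.stdout.write(f"Done in {time.time() - start:.6f} seconds")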

            python3 manage.py load_csv /path/to/your/file.csv                      

For 200 products, the code above executed in 0.220191 seconds. For 100,000 products, it took 103.066553 seconds. And it would likely take ten times longer for one million products. Can we make it faster?

1. Do Not Load the Whole File Into Memory

The first thing to note is that the code above loads the whole CSV into memory. Even more interestingly, it's doing it twice. These two lines are really bad:

            data = list(csv.reader(csv_file, delimiter=","))
            for row in data[1:]:
                ...

It's a common mistake to try to skip the header from processing like that. The code iterates from the second element of the list, but csv.reader is an iterator, which means it's memory-efficient. If a programmer forces the list conversion, then the whole CSV file will be loaded into a list and thus into the memory of the process. On instances without enough RAM, that can be an issue. The second copy of the data is made when data[1:] is used in the for loop. So how can we handle it?

            data = csv.reader(csv_file, delimiter=",")
            next(data)
            for row in data:
                ...

Calling next will move the iterator to the next item, and we will be able to skip the CSV header (in most cases, it's not needed for the processing). Also, the memory footprint of the process will be much lower. This change has no big impact on the execution time (it is negligible), but it has a large impact on the memory used by the process.

2. Do Not Make Unnecessary Queries When Iterating

I am talking about this line in particular:

            product_category, _ = ProductCategory.objects.get_or_create(name=row[3], code=row[4])

What we are fetching here is the ProductCategory instance on each iteration, using the category name and code. How can we solve this?

We can load the categories before the for loop and add them only when they don't exist in the database:
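(A sketch of that change, building on the naive command above; caching the categories in a dictionary keyed by name and code is an assumption about how it was done.)

            # Fetch all categories once and cache them, instead of querying per row
            categories = {
                (category.name, category.code): category
                for category in ProductCategory.objects.all()
            }

            for row in data:
                key = (row[3], row[4])
                product_category = categories.get(key)
                if product_category is None:
                    # Create a missing category once and keep it in the cache
                    product_category = ProductCategory.objects.create(name=row[3], code=row[4])
                    categories[key] = product_category
                ...  # create the Product as before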

This change alone decreases the time for 100,000 products by 34 seconds (around 30%). The command executes in 69 seconds after the change.

3. Do Not Save One Element at a Time

When we are creating an instance of Product, we are asking our database to commit the changes in each loop iteration:

            Product.objects.create(
                name=row[0],
                code=row[1],
                price=row[2],
                product_category=product_category,
            )

This is an I/O operation in each loop iteration, and it is costly. Even though a single insert is pretty fast, the problem is that there can be millions of such operations, and we can decrease their number significantly. How? By using Django's bulk_create method:
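(A sketch of the loop body with batched inserts; the 5,000-row batch size follows from the leftover count mentioned in the next paragraph.)

            # Collect Product instances and flush them to the database in batches
            BATCH_SIZE = 5000
            products = []

            for row in data:
                ...  # resolve product_category as above
                products.append(Product(
                    name=row[0],
                    code=row[1],
                    price=row[2],
                    product_category=product_category,
                ))
                if len(products) >= BATCH_SIZE:
                    Product.objects.bulk_create(products)
                    products = []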

And this change has a tremendous effect. For 100,000 products, the command executes in only 3.5 seconds. You need to remember that after the last loop iteration there can still be items in the products list (fewer than 5,000 items in our case). This needs to be handled after the loop:

            if products:
                Product.objects.bulk_create(products)

The three changes we made together allowed us to increase the performance of the command by more than 96%. Code matters. Good code matters even more. The final command looks like this:
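(What follows is a sketch that combines the three changes above; the module path, app name, and exact structure are assumptions rather than the author's original code.)

            # myapp/management/commands/load_csv.py - sketch of the final version
            import csv
            import time

            from django.core.management.base import BaseCommand

            from myapp.models import Product, ProductCategory

            BATCH_SIZE = 5000

            class Command(BaseCommand):
                help = "Load products from a CSV file"

                def add_arguments(self, parser):
                    parser.add_argument("file_path", type=str)

                def handle(self, *args, **options):
                    start = time.time()
                    # Technique 2: cache the categories up front
                    categories = {
                        (category.name, category.code): category
                        for category in ProductCategory.objects.all()
                    }
                    products = []
                    with open(options["file_path"]) as csv_file:
                        # Technique 1: iterate over the file instead of loading it into memory
                        data = csv.reader(csv_file, delimiter=",")
                        next(data)  # skip the first row (the header, if present)
                        for row in data:
                            key = (row[3], row[4])
                            product_category = categories.get(key)
                            if product_category is None:
                                product_category = ProductCategory.objects.create(
                                    name=row[3], code=row[4]
                                )
                                categories[key] = product_category
                            products.append(Product(
                                name=row[0],
                                code=row[1],
                                price=row[2],
                                product_category=product_category,
                            ))
                            # Technique 3: save in batches instead of one row at a time
                            if len(products) >= BATCH_SIZE:
                                Product.objects.bulk_create(products)
                                products = []
                    if products:
                        Product.objects.bulk_create(products)
                    self.stdout.write(f"Done in {time.time() - start:.6f} seconds")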

With the code above, one million products are loaded in 30 seconds!

Pro Tip: Use Multiprocessing

Yet another idea for improving the loading speed of a massive CSV would be to use multiprocessing. I will only present the idea here. In the command above, you could split the one big CSV file into multiple smaller chunks (the best approach would be to use row-index ranges) and put each batch of work in a separate process. If you can use multiple CPUs on your machine, the scaling will be roughly linear (2x CPUs, two times faster; 4x CPUs, four times faster).

Imagine that you have one million rows to process. Then the first process can take rows 0 to 99,999, the second takes rows 100,000 to 199,999, and so on until the last one takes rows 900,000 to 999,999.
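(A rough sketch of the idea using Python's multiprocessing module; the load_rows helper is hypothetical and would contain the same per-row logic as the command above, restricted to its slice of the file.)

            # Split the work by row-index ranges across worker processes
            import csv
            from itertools import islice
            from multiprocessing import Pool

            NUM_WORKERS = 10
            TOTAL_ROWS = 1_000_000
            CHUNK = TOTAL_ROWS // NUM_WORKERS

            def load_rows(args):
                file_path, start, stop = args
                with open(file_path) as csv_file:
                    reader = csv.reader(csv_file, delimiter=",")
                    for row in islice(reader, start, stop):
                        ...  # same per-row logic as in the command above

            if __name__ == "__main__":
                file_path = "products.csv"
                ranges = [(file_path, i * CHUNK, (i + 1) * CHUNK) for i in range(NUM_WORKERS)]
                with Pool(processes=NUM_WORKERS) as pool:
                    pool.map(load_rows, ranges)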

The only downside here is that you need to have ten free CPUs.

Summary

  • You should avoid loading the whole file into memory. Use iterators instead.
  • If you are processing the file line by line, avoid queries to the database in the for loop body.
  • Do not save one element per loop iteration. Use the bulk_create method.

Thanks for reading!


Source: https://betterprogramming.pub/3-techniques-for-importing-large-csv-files-into-a-django-app-2b6e5e47dba0
