3 Techniques for Importing Large CSV Files Into a Django App
Loading huge amounts of data into a database just got easier
Problem Overview and App Configuration
It's often the case that you want to load the data into your database from a CSV file. Usually, it's not a problem at all, but there are some cases when performance problems can occur, especially if you want to load a massive amount of data. In this case, "massive" means a CSV file that has 500MB to 1GB of data and millions of rows.
In this article, I will focus on a situation where using database utilities to load CSV files is not possible (like PostgreSQL COPY), because you need to do a transformation in the process.
It is also worth noting that a data load of this size should always be questioned, and you should try to find more suitable ways to do it. Always check if you can copy the data directly to the database using database engine utilities like COPY. These kinds of operations will almost always be much more performant than using the ORM and your application code.
Let's say that we have two models: Product and ProductCategory. We get the data from a different department of the organization, and we have to load it into our system. Our Django models will look like this:
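A minimal sketch of what these models might look like, based on the fields used later in the article (field types and the `products` app name are assumptions):

```python
# products/models.py (app name assumed) -- minimal models consistent with
# the fields used later in the article; field types are assumptions.
from django.db import models


class ProductCategory(models.Model):
    name = models.CharField(max_length=255)
    code = models.CharField(max_length=255)


class Product(models.Model):
    name = models.CharField(max_length=255)
    code = models.CharField(max_length=255)
    price = models.DecimalField(max_digits=10, decimal_places=2)
    product_category = models.ForeignKey(
        ProductCategory, on_delete=models.CASCADE
    )
```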
The data structure is pretty simple, but it will be enough to show the problems with a massive data load. One thing worth noting here is the relationship between Product and ProductCategory. In this case, we can expect the number of product categories to be several orders of magnitude lower than the number of products. We will use this knowledge later on.
We also need a generator for the CSV files; a sketch of one follows the column list below. The CSV file has the following columns:
- product_name
- product_code
- price
- product_category_name
- product_category_code
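A minimal sketch of what csv_mock_data_create.py might look like; the output file name, the value formats, and the number of distinct categories are assumptions:

```python
# csv_mock_data_create.py -- a sketch of a test-data generator; the output
# file name, value formats, and category count are assumptions.
import csv
import sys
from random import randint

CATEGORIES = 30  # far fewer categories than products, as noted above

rows = int(sys.argv[1]) if len(sys.argv) > 1 else 10_000

with open("products.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file, delimiter=",")
    # columns: product_name, product_code, price,
    # product_category_name, product_category_code
    # (no header row is written -- see the note below)
    for i in range(rows):
        category = randint(1, CATEGORIES)
        writer.writerow([
            f"product {i}",
            f"code-{i}",
            randint(1, 10_000),
            f"category {category}",
            f"cat-code-{category}",
        ])
```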
Using the script above, you can create a CSV file with the data we will need to do the load testing. You can pass a number as an argument, and this will be the number of rows in the generated file:
python3 csv_mock_data_create.py 10000
The command above will create a file with 10,000 products. Note that the script skips the CSV header for now. I will get back to that later.
Be careful here, as 10 million rows will create a file of around 600MB.
Now we just need a simple Django management command to load the file. We will not do it via a view because, as we already know, the files are huge. This means we would need to upload ~500MB files through a request handler and, as a result, load them into memory. This is inefficient.
The command has a naive implementation of the data loading and also reports the time needed to process the CSV file:
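What follows is a sketch of such a naive command, consistent with the snippets discussed below; the `products` app name and the module path are assumptions.

```python
# products/management/commands/load_csv.py -- a sketch of the naive
# implementation; app name and module path are assumptions.
import csv
import time

from django.core.management.base import BaseCommand

from products.models import Product, ProductCategory


class Command(BaseCommand):
    help = "Load products from a CSV file"

    def add_arguments(self, parser):
        parser.add_argument("file_path", type=str)

    def handle(self, *args, **options):
        start = time.time()
        with open(options["file_path"]) as csv_file:
            # loads the whole file into memory -- fixed in technique 1
            data = list(csv.reader(csv_file, delimiter=","))
            for row in data[1:]:
                # a query per row for the category -- fixed in technique 2
                product_category, _ = ProductCategory.objects.get_or_create(
                    name=row[3], code=row[4]
                )
                # one INSERT per row -- fixed in technique 3
                Product.objects.create(
                    name=row[0],
                    code=row[1],
                    price=row[2],
                    product_category=product_category,
                )
        self.stdout.write(f"Processed in {time.time() - start:.6f} seconds")
```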
python3 manage.py load_csv /path/to/your/file.csv
For 200 products, the code above executed in 0.220191 seconds. For 100,000 products, it took 103.066553 seconds. And it would likely take ten times longer for one million products. Can we make it faster?
1. Do Not Load the Whole File Into Memory
The first thing to note is that the code above loads the whole CSV file into memory. Even more interestingly, it does it twice. These two lines are really bad:
data = list(csv.reader(csv_file, delimiter=","))
for row in data[1:]:
    ...
It's a common mistake to try to skip the header from processing like that. The code iterates from the second element of the list, but csv.reader is an iterator, which means it's memory-efficient. If a programmer forces the list conversion, the whole CSV file will be loaded into a list and thus into the memory of the process. On instances without enough RAM, that can be an issue. The second copy of the data is made when data[1:] is used in the for loop. So how can we handle it?
data = csv.reader(csv_file, delimiter=",")
next(data)
for row in data:
    ...
Calling next will move the iterator to the next item, and we will be able to skip the CSV header (in most cases, it's not needed for the processing). Also, the memory footprint of the process will be much lower. This change has a negligible impact on the execution time, but it has a large impact on the memory used by the process.
2. Do Not Make Unnecessary Queries When Iterating
I am talking about this line in particular:
product_category, _ = ProductCategory.objects.get_or_create(name=row[3], code=row[4])
What we are fetching here is the ProductCategory instance on each loop iteration, using the category name and code. How can we solve this?
We can load the categories before the for loop and add new ones only when they don't exist in the database:
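A sketch of this change; using (name, code) tuples as dictionary keys is an assumption:

```python
# Load all existing categories once, before the loop.
categories = {
    (category.name, category.code): category
    for category in ProductCategory.objects.all()
}

for row in data:
    key = (row[3], row[4])
    product_category = categories.get(key)
    if product_category is None:
        # Only hit the database for categories we have not seen yet.
        product_category = ProductCategory.objects.create(
            name=row[3], code=row[4]
        )
        categories[key] = product_category
    ...
```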
This change alone decreases the time for 100,000 products by 34 seconds (around 30%). The command executes in 69 seconds after the change.
3. Do Not Save One Element at a Time
When we are creating an instance of Product, we are asking our database to commit the change in each loop iteration:
Product.objects.create(
    name=row[0],
    code=row[1],
    price=row[2],
    product_category=product_category,
)
This is the I/O operation of each loop iteration, and it must be costly. Even though a single call is pretty fast, the problem is that there can be millions of such operations, and we can decrease their number significantly. How? By using Django's bulk_create method:
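A sketch of this change, batching 5,000 rows at a time (the figure mentioned just below):

```python
products = []
for row in data:
    ...
    # Build the instance in memory instead of saving it immediately.
    products.append(
        Product(
            name=row[0],
            code=row[1],
            price=row[2],
            product_category=product_category,
        )
    )
    # Flush to the database in batches of 5,000 rows.
    if len(products) >= 5000:
        Product.objects.bulk_create(products)
        products = []
```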
And this change has a tremendous effect. For 100,000 products, the command executes in only 3.5 seconds. You need to remember that after the last loop iteration there can still be items left in the products list (fewer than 5,000 items in our case). This needs to be handled after the loop:
if products:
    Product.objects.bulk_create(products)
Those three changes together allowed us to increase the performance of the command by more than 96%. Code matters. Good code matters even more. The final command looks like this:
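Putting the three changes together, the final command might look like this sketch (the app and module names are again assumptions):

```python
# products/management/commands/load_csv.py -- a sketch combining the three
# changes; app name and module path are assumptions.
import csv
import time

from django.core.management.base import BaseCommand

from products.models import Product, ProductCategory

BATCH_SIZE = 5000


class Command(BaseCommand):
    help = "Load products from a CSV file"

    def add_arguments(self, parser):
        parser.add_argument("file_path", type=str)

    def handle(self, *args, **options):
        start = time.time()
        categories = {
            (c.name, c.code): c for c in ProductCategory.objects.all()
        }
        products = []
        with open(options["file_path"]) as csv_file:
            data = csv.reader(csv_file, delimiter=",")
            next(data)  # skip the first row instead of building a list
            for row in data:
                key = (row[3], row[4])
                category = categories.get(key)
                if category is None:
                    category = ProductCategory.objects.create(
                        name=row[3], code=row[4]
                    )
                    categories[key] = category
                products.append(
                    Product(
                        name=row[0],
                        code=row[1],
                        price=row[2],
                        product_category=category,
                    )
                )
                if len(products) >= BATCH_SIZE:
                    Product.objects.bulk_create(products)
                    products = []
        if products:
            Product.objects.bulk_create(products)
        self.stdout.write(f"Processed in {time.time() - start:.6f} seconds")
```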
With the code above, one million products are loaded in 30 seconds!
Pro Tip: Use Multi-Processing
Yet another idea for improving the loading speed of a massive CSV would be to use multi-processing. I will only present the idea here. In the command above, you could split the one big CSV file into multiple smaller batches (the best approach would be to use row index ranges) and put each batch of work into a separate process. If you can use multiple CPUs on your machine, the scaling will be linear (2x CPUs, two times faster; 4x CPUs, four times faster).
Imagine that you have one million rows to process. Then the first process can take rows with the numbers 0–99999, the second takes rows with the numbers 100000–199999, and so on, until the last one takes rows with the numbers 900000–999999.
The only downside here is that you need to have ten free CPUs.
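Here is a minimal sketch of the idea (not from the original article): each worker opens the file, slices out its own row range, and loads it using the techniques above. All names are illustrative.

```python
# A sketch only: split the row range across worker processes.
import csv
import itertools
from multiprocessing import Pool

from django.db import connections

from products.models import Product, ProductCategory  # app name assumed

CHUNK_SIZE = 100_000


def load_chunk(args):
    """Load rows [start, stop) of the CSV file in a worker process."""
    file_path, start, stop = args
    categories = {(c.name, c.code): c for c in ProductCategory.objects.all()}
    products = []
    with open(file_path) as csv_file:
        reader = csv.reader(csv_file, delimiter=",")
        # islice still reads the skipped lines, but avoids splitting the file;
        # shift the ranges by one if your file starts with a header row.
        for row in itertools.islice(reader, start, stop):
            key = (row[3], row[4])
            category = categories.get(key)
            if category is None:
                # get_or_create limits duplicate categories when workers race;
                # a unique constraint on (name, code) would make this safe.
                category, _ = ProductCategory.objects.get_or_create(
                    name=row[3], code=row[4]
                )
                categories[key] = category
            products.append(
                Product(name=row[0], code=row[1], price=row[2],
                        product_category=category)
            )
    Product.objects.bulk_create(products, batch_size=5000)


def load_in_parallel(file_path, total_rows, processes=10):
    # Close inherited connections so every worker opens its own.
    connections.close_all()
    chunks = [
        (file_path, start, min(start + CHUNK_SIZE, total_rows))
        for start in range(0, total_rows, CHUNK_SIZE)
    ]
    with Pool(processes=processes) as pool:
        pool.map(load_chunk, chunks)


# Hypothetical usage: load_in_parallel("products.csv", total_rows=1_000_000)
```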
Summary
- You should avoid loading the whole file into memory. Use iterators instead.
- If you are processing the file line by line, avoid queries to the database in the for loop body.
- Do not save one element per loop iteration. Use the bulk_create method.
Thanks for reading!
Source: https://betterprogramming.pub/3-techniques-for-importing-large-csv-files-into-a-django-app-2b6e5e47dba0