How to import a huge csv file into Cassandra

cassandracsv

I am trying to import a csv file into Cassandra which is very long.
These are food products: ingredients, nutrition, labels… It comes from Open Food Data. List information on food products: ingredients, nutritional information, labels, etc. Most of the data comes from crowdsourcing information. The file this envelope of the open platform of French public data.gouv.fr

The command I tried

I try the following command with all the columns that I was able to collect with a python script:

cqlsh> COPY bouffe(code, url, creator, created_t, created_datetime, last_modified_t, last_modified_datetime, product_name, generic_name, quantity, packaging, packaging_tags, brands, brands_tags, categories, categories_tags, categories_fr, origins, origins_tags, manufacturing_places, manufacturing_places_tags, labels, labels_tags, labels_fr, emb_codes, emb_codes_tags, first_packaging_code_geo, cities, cities_tags, purchase_places, stores, countries, countries_tags, countries_fr, ingredients_text, allergens, allergens_fr, traces, traces_tags, traces_fr, serving_size, no_nutriments, additives_n, additives, additives_tags, additives_fr, ingredients_from_palm_oil_n, ingredients_from_palm_oil, ingredients_from_palm_oil_tags, ingredients_that_may_be_from_palm_oil_n, ingredients_that_may_be_from_palm_oil, ingredients_that_may_be_from_palm_oil_tags, nutrition_grade_uk, nutrition_grade_fr, pnns_groups_1, pnns_groups_2, states, states_tags, states_fr, main_category, main_category_fr, image_url, image_small_url, energy_100g, energy-from-fat_100g, fat_100g, saturated-fat_100g, butyric-acid_100g, caproic-acid_100g, caprylic-acid_100g, capric-acid_100g, lauric-acid_100g, myristic-acid_100g, palmitic-acid_100g, stearic-acid_100g, arachidic-acid_100g, behenic-acid_100g, lignoceric-acid_100g, cerotic-acid_100g, montanic-acid_100g, melissic-acid_100g, monounsaturated-fat_100g, polyunsaturated-fat_100g, omega-3-fat_100g, alpha-linolenic-acid_100g, eicosapentaenoic-acid_100g, docosahexaenoic-acid_100g, omega-6-fat_100g, linoleic-acid_100g, arachidonic-acid_100g, gamma-linolenic-acid_100g, dihomo-gamma-linolenic-acid_100g, omega-9-fat_100g, oleic-acid_100g, elaidic-acid_100g, gondoic-acid_100g, mead-acid_100g, erucic-acid_100g, nervonic-acid_100g, trans-fat_100g, cholesterol_100g, carbohydrates_100g, sugars_100g, sucrose_100g, glucose_100g, fructose_100g, lactose_100g, maltose_100g, maltodextrins_100g, starch_100g, polyols_100g, fiber_100g, proteins_100g, casein_100g, serum-proteins_100g, nucleotides_100g, salt_100g, sodium_100g, alcohol_100g, vitamin-a_100g, beta-carotene_100g, vitamin-d_100g, vitamin-e_100g, vitamin-k_100g, vitamin-c_100g, vitamin-b1_100g, vitamin-b2_100g, vitamin-pp_100g, vitamin-b6_100g, vitamin-b9_100g, folates_100g, vitamin-b12_100g, biotin_100g, pantothenic-acid_100g, silica_100g, bicarbonate_100g, potassium_100g, chloride_100g, calcium_100g, phosphorus_100g, iron_100g, magnesium_100g, zinc_100g, copper_100g, manganese_100g, fluoride_100g, selenium_100g, chromium_100g, molybdenum_100g, iodine_100g, caffeine_100g, taurine_100g, ph_100g, fruits-vegetables-nuts_100g, fruits-vegetables-nuts-estimate_100g, collagen-meat-protein-ratio_100g, cocoa_100g, chlorophyl_100g, carbon-footprint_100g, nutrition-score-fr_100g, nutrition-score-uk_100g, glycemic-index_100g, water-hardness_100g) FROM 'bouffe.csv' WITH HEADER = true;

But it gives me the following error :

...
Failed to import 23 rows: ParseError - Invalid row length 84 should be 163,  given up without retries
Failed to import 47 rows: ParseError - Invalid row length 77 should be 163,  given up without retries
Failed to import 73 rows: ParseError - Invalid row length 52 should be 163,  given up without retries
Failed to import 5000 rows: Error - new-line character seen in unquoted field - do you need to open the file in universal-newline mode?,  given up after 1 attempts
Failed to import 2 rows: ParseError - Invalid row length 32 should be 163,  given up without retries
Failed to import 56 rows: ParseError - Invalid row length 69 should be 163,  given up without retries
Exceeded maximum number of insert errors 1000 Avg. rate:    7467 rows/s
Failed to process 192457 rows; failed rows written to import_k1_bouffe.err
Exceeded maximum number of insert errors 1000
Processed: 185000 rows; Rate:    4855 rows/s; Avg. rate:    7407 rows/s
185000 rows imported from 0 files in 24.977 seconds (0 skipped).

Beforehand, I had created :

create ColumnFamily Bouffe
(Code varchar PRIMARY KEY,
url varchar,
...
)

When I ask cassandra to describe my table I have :

cqlsh:k1> DESCRIBE TABLE bouffe;

CREATE TABLE k1.bouffe (
    code int PRIMARY KEY,
    additives text,
    additives_fr text,
    additives_n text,
    additives_tags text,
    alcohol_100g text,
    allergens text,
    allergens_fr text,
    alpha_linolenic_acid_100g text,
    arachidic_acid_100g text,
    arachidonic_acid_100g text,
    behenic_acid_100g text,
    beta_carotene_100g text,
    bicarbonate_100g text,
    biotin_100g text,
    brands text,
    brands_tags text,
    butyric_acid_100g text,
    caffeine_100g text,
    calcium_100g text,
    capric_acid_100g text,
    caproic_acid_100g text,
    caprylic_acid_100g text,
    carbohydrates_100g text,
    carbon_footprint_100g text,
    casein_100g text,
    categories text,
    categories_fr text,
    categories_tags text,
    cerotic_acid_100g text,
    chloride_100g text,
    chlorophyl_100g text,
    cholesterol_100g text,
    chromium_100g text,
    cities text,
    cities_tags text,
    cocoa_100g text,
    collagen_meat_protein_ratio_100g text,
    copper_100g text,
    countries text,
    countries_fr text,
    countries_tags text,
    created_datetime text,
    created_t text,
    creator text,
    dihomo_gamma_linolenic_acid_100g text,
    docosahexaenoic_acid_100g text,
    eicosapentaenoic_acid_100g text,
    elaidic_acid_100g text,
    emb_codes text,
    emb_codes_tags text,
    energy_100g text,
    energy_from_fat_100g text,
    erucic_acid_100g text,
    fat_100g text,
    fiber_100g text,
    first_packaging_code_geo text,
    fluoride_100g text,
    folates_100g text,
    fructose_100g text,
    fruits_vegetables_nuts_100g text,
    fruits_vegetables_nuts_estimate_100g text,
    gamma_linolenic_acid_100g text,
    generic_name text,
    glucose_100g text,
    glycemic_index_100g text,
    gondoic_acid_100g text,
    image_small_url text,
    image_url text,
    ingredients_from_palm_oil text,
    ingredients_from_palm_oil_n text,
    ingredients_from_palm_oil_tags text,
    ingredients_text text,
    ingredients_that_may_be_from_palm_oil text,
    ingredients_that_may_be_from_palm_oil_n text,
    ingredients_that_may_be_from_palm_oil_tags text,
    iodine_100g text,
    iron_100g text,
    labels text,
    labels_fr text,
    labels_tags text,
    lactose_100g text,
    last_modified_datetime text,
    last_modified_t text,
    lauric_acid_100g text,
    lignoceric_acid_100g text,
    linoleic_acid_100g text,
    magnesium_100g text,
    main_category text,
    main_category_fr text,
    maltodextrins_100g text,
    maltose_100g text,
    manganese_100g text,
    manufacturing_places text,
    manufacturing_places_tags text,
    mead_acid_100g text,
    melissic_acid_100g text,
    molybdenum_100g text,
    monounsaturated_fat_100g text,
    montanic_acid_100g text,
    myristic_acid_100g text,
    nervonic_acid_100g text,
    no_nutriments text,
    nucleotides_100g text,
    nutrition_grade_fr text,
    nutrition_grade_uk text,
    nutrition_score_fr_100g text,
    nutrition_score_uk_100g text,
    oleic_acid_100g text,
    omega_3_fat_100g text,
    omega_6_fat_100g text,
    omega_9_fat_100g text,
    origins text,
    origins_tags text,
    packaging text,
    packaging_tags text,
    palmitic_acid_100g text,
    pantothenic_acid_100g text,
    ph_100g text,
    phosphorus_100g text,
    pnns_groups_1 text,
    pnns_groups_2 text,
    polyols_100g text,
    polyunsaturated_fat_100g text,
    potassium_100g text,
    product_name text,
    proteins_100g text,
    purchase_places text,
    quantity text,
    salt_100g text,
    saturated_fat_100g text,
    selenium_100g text,
    serum_proteins_100g text,
    serving_size text,
    silica_100g text,
    sodium_100g text,
    starch_100g text,
    states text,
    states_fr text,
    states_tags text,
    stearic_acid_100g text,
    stores text,
    sucrose_100g text,
    sugars_100g text,
    taurine_100g text,
    traces text,
    traces_fr text,
    traces_tags text,
    trans_fat_100g text,
    url text,
    vitamin_a_100g text,
    vitamin_b12_100g text,
    vitamin_b1_100g text,
    vitamin_b2_100g text,
    vitamin_b6_100g text,
    vitamin_b9_100g text,
    vitamin_c_100g text,
    vitamin_d_100g text,
    vitamin_e_100g text,
    vitamin_k_100g text,
    vitamin_pp_100g text,
    water_hardness_100g text,
    zinc_100g text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

What does the data looks like

there are columns that are not below a column heading water-hardness_100g:

foto que proviene de librofiice que muestra que hay cosas despues de water-hardness_100g

Thus, how to import a huge csv file into Cassandra ?

The idea I have at the moment is to create a csv file with python in order to fill empty spaces between , with NaN.

Best Answer

Use Cassandra BulkLoader (earlier referred to as SSTableloader). The details of how to execute the load is clearly explained here https://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated