1. This site uses cookies. By continuing to use this site, you are agreeing to our use of cookies. Learn More.

  2. Anuncie Aqui ! Entre em contato fdantas@4each.com.br

[SQL] How do I take a randomized sample without replacement where if an ID is drawn, then all...

Discussão em 'Outras Linguagens' iniciado por Stack, Fevereiro 4, 2025.

  1. Stack

    Stack Membro Participativo

    I am currently trying to solve a problem where I need to get a randomized sample from a data table using SQL in Vertica. I need to get a randomized sample without replacement of 60,000 rows over a span of 3 years (2021 to 2024) from a data table. Each row has an OrderID as well as an Item, Price, and Date column. If an OrderID is in the randomized sample, then all instances of that OrderID from 2021 to 2024 in the data table must also be in the randomized sample. I also need this to be as efficient as possible. Is it possible to do this in SQL?

    The data looks approximately like this:

    OrderID Item Price Date
    1234 Apple 0.99 2024-03-25
    2573 Tomato 0.99 2023-01-16
    1234 Cereal 6.99 2024-03-25
    1234 Milk 7.99 2024-03-25
    8473 Salmon 15.99 2021-11-19
    7392 Plum 0.99 2022-05-02

    Using my example above, If OrderID 1234 is selected for the randomized sample, then I want all instances of OrderID 1234 to appear in the sample along with other OrderIDs, like this:

    Sample

    OrderID Item Price Date
    1234 Apple 0.99 2024-03-25
    1234 Cereal 6.99 2024-03-25
    1234 Milk 7.99 2024-03-25
    7392 Plum 0.99 2022-05-02

    I've looked into various SQL functions for getting a randomized sample with replacement efficiently, but I haven't been able to figure out how to also get it so that if an OrderID is pulled into the sample, then all instances (rows) with that OrderID from those 3 years are also pulled into the sample. This is so far outside my current skill level in SQL that I currently feel very stuck on how to tackle this. Any help would be greatly appreciated!

    I've tried this bit of code because I found something similar elsewhere, but even that isn't working (and to be honest, I'm not even sure if it's in the kind of SQL that Vertica would understand) and is giving me this error:


    SQL ERROR [4856][42601][Vertica]VJDBC ERROR: Syntax error at or near "."

    SELECT t.*
    FROM table t
    WHERE t.OrderID in (SELECT top(60000) t2.OrderID
    FROM (SELECT DISTINCT t2.OrderID
    FROM table t2
    ) t2
    ORDER BY newid()
    );

    Continue reading...

Compartilhe esta Página